Learning realistic human actions from movies

Ivan Laptev*, Marcin Marszałek**, Cordelia Schmid**, Benjamin Rozenfeld***
* IRISA/INRIA Rennes, France ** INRIA Rhône-Alpes, France *** Bar-Ilan University, Israel


Page 1:

Ivan Laptev*, Marcin Marszałek**, Cordelia Schmid**, Benjamin Rozenfeld***

* IRISA/INRIA Rennes, France
** INRIA Rhône-Alpes, France
*** Bar-Ilan University, Israel

Learning realistic human actions from movies

Page 2:

Human actions: Motivation

Action recognition useful for:

• Content-based browsing, e.g. fast-forward to the next goal-scoring scene

• Video recycling, e.g. find "Bush shaking hands with Putin"

• Human-sciences research, e.g. the influence of smoking in movies on adolescent smoking

Huge amount of video is available and growing

Human actions are major events in movies, TV news, personal video …

150,000 uploads every day

Page 3:

What are human actions?

• Actions in current datasets:

• Actions in “the Wild”:

KTH action dataset

Page 4:

Web video search
– Useful for some action classes: kissing, hand shaking
– Very noisy or not useful for the majority of other action classes
– Implies manual selection work
– Examples are frequently non-representative

Web image search
– Useful for learning action context: static scenes and objects
– See also [Li-Jia & Fei-Fei ICCV07]

Google Video, YouTube, MySpaceTV, …

Access to realistic human actions

Page 5:

Actions in movies

• Realistic variation of human actions

• Many classes and many examples per class

Problems:

• Typically only a few class-samples per movie

• Manual annotation is very time consuming

Page 6:

…
1172
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me?
Why'd you keep your marriage a secret?

1173
01:20:20,640 --> 01:20:23,598
It wasn't my secret, Richard.
Victor wanted it that way.

1174
01:20:23,800 --> 01:20:26,189
Not even our closest friends
knew about our marriage.
…

…
RICK

Why weren't you honest with me? Why did you keep your marriage a secret?

Rick sits down with Ilsa.

ILSA

Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage.

Figure: subtitles aligned to the movie script (matched time span 01:20:17 – 01:20:23)

• Scripts are available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com, …

• Subtitles (with time information) are available for most movies

• Time can be transferred to scripts by text alignment

Script alignment [Everingham et al. BMVC06]
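The time-transfer idea can be sketched with Python's standard difflib: each script dialogue line inherits the time span of the subtitle whose text matches it best. This is only an illustration under simplified assumptions (greedy per-line matching, invented function and variable names); the actual alignment of Everingham et al. is more robust.

```python
from difflib import SequenceMatcher

def align_script_to_subtitles(script_lines, subtitles):
    """Give each script dialogue line the time span of its
    best-matching subtitle (greedy matching, illustration only).

    subtitles: list of (start, end, text) tuples.
    Returns one (start, end, line, match_score) per script line.
    """
    aligned = []
    for line in script_lines:
        def score(sub):
            return SequenceMatcher(None, line.lower(), sub[2].lower()).ratio()
        best = max(subtitles, key=score)
        aligned.append((best[0], best[1], line, score(best)))
    return aligned

subs = [
    ("01:20:17,240", "01:20:20,437",
     "Why weren't you honest with me? Why'd you keep your marriage a secret?"),
    ("01:20:20,640", "01:20:23,598",
     "It wasn't my secret, Richard. Victor wanted it that way."),
]
script = ["Why weren't you honest with me? "
          "Why did you keep your marriage a secret?"]
start, end, line, match = align_script_to_subtitles(script, subs)[0]
print(start, "-->", end)  # the near-identical first subtitle wins
```

Script and subtitle wording differ slightly ("Why did you" vs. "Why'd you"), which is exactly why a fuzzy similarity ratio is used instead of exact string matching.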

Page 7:

01:21:50 – 01:21:59
01:22:00 – 01:22:03
01:22:15 – 01:22:17

Script alignment

RICK
All right, I will. Here's looking at you, kid.

ILSA I wish I didn't love you so much.

She snuggles closer to Rick.

CUT TO:

EXT. RICK'S CAFE - NIGHT

Laszlo and Carl make their way through the darkness toward a side entrance of Rick's. They run inside the entryway.

The headlights of a speeding police car sweep toward them.

They flatten themselves against a wall to avoid detection.

The lights move past them.

CARL I think we lost them.

Page 8:

– On the good side:
• Realistic variation of actions: subjects, views, etc.
• Many examples per class, many classes
• No extra overhead for new classes
• Actions, objects, scenes and their combinations
• Character names may be used to resolve "who is doing what?"

– Problems:
• No spatial localization
• Temporal localization may be poor
• Missing actions: e.g. scripts do not always follow the movie
• Annotation is incomplete, not suitable as ground truth for testing action detection
• Large within-class variability of action classes in text

Script-based action annotation

Page 9:

Script alignment: Evaluation

Example of a “visual false positive”

A black car pulls up, two army officers get out.

• Annotate action samples in text

• Do automatic script-to-video alignment

• Check the correspondence of actions in scripts and movies

a: quality of subtitle-script matching

Page 10:

Text-based action retrieval

“… Will gets out of the Chevrolet. …” “… Erin exits her new truck…”

• Large variation of action expressions in text:

GetOutCar action:

Potential false positives:

“…About to sit down, he freezes…”

⇒ Use a supervised text classification approach

Page 11:

Movie actions dataset

Figure: training samples are drawn from 12 movies; test samples come from 20 different movies

• Learn vision-based classifier from automatic training set

• Compare performance to the manual training set

Page 12:

Action Classification: Overview

Bag of space-time features + multi-channel SVM [Schuldt'04, Niebles'06, Zhang'07]

Pipeline: collection of space-time patches → HOG & HOF patch descriptors → histogram of visual words → multi-channel SVM classifier
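The "histogram of visual words" step can be sketched as follows: each local space-time descriptor is assigned to its nearest visual word, and word counts are normalized. This is a minimal toy illustration (2-D descriptors, a hand-picked 3-word vocabulary, invented function name); in the real pipeline the vocabulary is learned by clustering HOG/HOF descriptors.

```python
import numpy as np

def bof_histogram(descriptors, vocabulary):
    """Assign each descriptor to its nearest visual word (Euclidean)
    and return the normalized histogram of word counts."""
    # pairwise distances: (n_descriptors, n_words)
    dists = np.linalg.norm(
        descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)

# toy 3-word vocabulary in a 2-D descriptor space
vocab = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
desc = np.array([[0.1, 0.0], [0.9, 0.1], [0.0, 0.8], [0.1, 0.9]])
h = bof_histogram(desc, vocab)  # word counts [1, 1, 2], normalized
print(h)
```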

Page 13:

Space-Time Features: Detector

Space-time corner detector [Laptev, IJCV 2005]

Dense scale sampling (no explicit scale selection)

Page 14:

Space-Time Features: Descriptor

Histogram of oriented spatial gradients (HOG)
Histogram of optical flow (HOF)

3x3x2x4-bin HOG descriptor
3x3x2x5-bin HOF descriptor

Public code available soon!

Multi-scale space-time patches from corner detector

Page 15:

We use global spatio-temporal grids.

In the spatial domain:
• 1x1 (standard BoF)
• 2x2, o2x2 (50% overlap)
• h3x1 (horizontal), v1x3 (vertical)
• 3x3

In the temporal domain:
• t1 (standard BoF), t2, t3

Spatio-temporal bag-of-features
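One grid channel can be sketched as follows: features are binned into spatio-temporal cells by their normalized coordinates, a bag-of-features histogram is kept per cell, and the per-cell histograms are concatenated. A minimal sketch (function name and toy data invented for illustration), shown for the v1x3 vertical grid:

```python
import numpy as np

def grid_channel_histogram(features, n_words, grid=(1, 3, 1)):
    """Per-cell bag-of-features histograms for one spatio-temporal
    grid channel, concatenated and L1-normalized.

    features: (x, y, t, word) tuples with x, y, t normalized to [0, 1).
    grid: (nx, ny, nt) cells; (1, 3, 1) is the v1x3 vertical grid.
    """
    nx, ny, nt = grid
    hists = np.zeros((nx * ny * nt, n_words))
    for x, y, t, w in features:
        # flat index of the cell this feature falls into
        cell = (int(x * nx) * ny + int(y * ny)) * nt + int(t * nt)
        hists[cell, w] += 1
    flat = hists.ravel()
    return flat / max(flat.sum(), 1.0)

# three features falling in the top, middle and bottom vertical cells
feats = [(0.5, 0.1, 0.2, 0), (0.5, 0.5, 0.5, 1), (0.5, 0.9, 0.8, 1)]
h = grid_channel_histogram(feats, n_words=2)
print(h)  # one count in each of the three cells
```

With grid=(1, 1, 1) this reduces to the standard BoF histogram, which is why 1x1/t1 appears as just another channel in the evaluation.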

Page 16:

We use SVMs with a multi-channel chi-square kernel for classification:

K(H_i, H_j) = exp( − Σ_{c ∈ C} (1/A_c) · D_c(H_i, H_j) )

where channel c is a combination of a detector, a descriptor and a grid, D_c(H_i, H_j) is the chi-square distance between histograms, and A_c is the mean value of the distances between all training samples.

The best set of channels C for a given training set is found with a greedy approach.

Multi-channel chi-square kernel
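The multi-channel chi-square kernel can be sketched directly from its definition (channel names and the A_c normalization values below are invented toy values; in practice A_c is estimated as the mean chi-square distance over all training pairs):

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def multichannel_kernel(Hi, Hj, A):
    """K(Hi, Hj) = exp(-sum_c D_c(Hi, Hj) / A_c).

    Hi, Hj: dicts mapping channel name -> histogram.
    A: dict mapping channel name -> mean chi-square distance
       over all training pairs for that channel.
    """
    s = sum(chi2_distance(Hi[c], Hj[c]) / A[c] for c in Hi)
    return float(np.exp(-s))

# toy two-channel samples
Hi = {"hog_1x1": np.array([0.5, 0.5]), "hof_t2": np.array([1.0, 0.0])}
Hj = {"hog_1x1": np.array([0.5, 0.5]), "hof_t2": np.array([1.0, 0.0])}
A = {"hog_1x1": 0.3, "hof_t2": 0.4}
print(multichannel_kernel(Hi, Hj, A))  # identical samples -> 1.0
```

A precomputed Gram matrix of these kernel values can be handed to any SVM implementation that accepts custom kernels.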

Page 17:

Table: Classification performance of different channels and their combinations

• It is worth trying different grids
• It is beneficial to combine channels

Combining channels

Page 18:

Evaluation of spatio-temporal grids

Figure: Number of occurrences for each channel component within the optimized channel combinations for the KTH action dataset and our manually labeled movie dataset

Page 19:

Comparison to the state-of-the-art

Figure: Sample frames from the KTH action sequences; all six classes (columns) and scenarios (rows) are shown

Page 20:

Comparison to the state-of-the-art

Table: Average class accuracy on the KTH actions dataset

Table: Confusion matrix for the KTH actions

Page 21:

Training noise robustness

Figure: Performance of our video classification approach in the presence of wrong labels

• Up to p = 0.2 the performance decreases insignificantly
• At p = 0.4 the performance decreases by around 10%

Page 22:

Figure: Example results for action classification trained on the automatically annotated data. We show the key frames for test movies with the highest confidence values for true/false pos/neg

Action recognition in real-world videos

Page 23:

• Note the suggestive FPs: hugging or answering the phone
• Note the difficult FNs: getting out of a car or handshaking

Action recognition in real-world videos

Page 24:

Table: Average precision (AP) for each action class of our test set. We compare results for clean (annotated) and automatic training data. We also show results for a random classifier (chance)

Action recognition in real-world videos

Page 25:

Action classification (CVPR 2008)

Test episodes from the movies "The Graduate", "It's a Wonderful Life", and "Indiana Jones and the Last Crusade"

Page 26:

“Buffy the Vampire Slayer” TV series: 7 seasons = 39 DVDs = 144 episodes ≈100 hours of video ≈10M frames

Going large-scale

Figure: Key frames of sample “Buffy” clips.

Page 27:

"Buffy the Vampire Slayer" TV series:
• 15,000 video samples with captions
• captions are parsed with Link Grammar
• verbs are stemmed using Morphy
• semantics can be obtained with WordNet

What actions are available?
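The verb-stemming step can be illustrated with a Morphy-style suffix-detachment sketch. This toy version uses a tiny hand-rolled lexicon and rule list (all invented for illustration); WordNet's real Morphy additionally consults per-part-of-speech exception lists for irregular forms such as "ran" → "run".

```python
# A tiny lexicon and ordered detachment rules (toy values for this
# sketch; real Morphy uses WordNet's lemma index and exception lists)
LEXICON = {"kiss", "shake", "run", "walk", "answer", "get"}
DETACH = [("ies", "y"), ("es", "e"), ("es", ""), ("ed", "e"),
          ("ed", ""), ("ing", "e"), ("ing", ""), ("s", "")]

def stem_verb(word, lexicon=LEXICON):
    """Return the base form of a verb if some detachment rule maps
    it into the lexicon, else the word unchanged."""
    if word in lexicon:
        return word
    for suffix, repl in DETACH:
        if word.endswith(suffix):
            base = word[: -len(suffix)] + repl
            if base in lexicon:
                return base
    return word

print(stem_verb("kisses"), stem_verb("shaking"), stem_verb("walked"))
# kiss shake walk
```

Mapping inflected caption verbs to base forms is what lets occurrences like "kisses", "kissing" and "kissed" all count toward one candidate action class.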

Page 28:

Text-based retrieval

Textual retrieval results in:
• 10–35% false positives
• 5–10% "borderline" samples

Sources of errors, as for Hollywood movies:
• errors in script synchronization
• errors in script parsing
• out-of-view actions

Page 29:

Visual re-ranking

Vision can be used to re-rank the samples and prune the false positives:
• often no FP at 40% recall
• half of the FP at 80% recall

Page 30:

Figure: Example results for our visual modelling procedure. We show the key frames of true/false positives/negatives with the highest confidence values.

Going large-scale

Figure: retrieved samples ordered from low score to high score

Page 31:

Conclusions

Summary:
• Automatic generation of realistic action samples
• New action dataset available: www.irisa.fr/vista/actions
• Transfer of recent bag-of-features experience to videos
• Improved performance on the KTH benchmark
• Promising results for actions in the wild

Future directions:
• Automatic action class discovery
• Internet-scale video search
• Video + Text + Sound + …

Page 32:

Page 33:

Going large-scale

Punch Kick

Fall Walk