Recognizing Action at a Distance
A.A. Efros, A.C. Berg, G. Mori, J. Malik
UC Berkeley
Looking at People
• Far field: the 3-pixel man (blob tracking)
  – vast surveillance literature
• Near field: the 300-pixel man (limb tracking)
  – e.g. Yacoob & Black, Rao & Shah, etc.
Medium-field Recognition
The 30-Pixel Man
Appearance vs. Motion
Jackson Pollock, Number 21 (detail)
Goals
• Recognize human actions at a distance
  – Low resolution, noisy data
  – Moving camera, occlusions
  – Wide range of actions (including non-periodic)
Our Approach
• Motion-based approach
  – Non-parametric; use a large amount of data
  – Classify a novel motion by finding the most similar motion in the training set
• Related Work
  – Periodicity analysis
    • Polana & Nelson; Seitz & Dyer; Bobick et al.; Cutler & Davis; Collins et al.
  – Model-free
    • Temporal Templates [Bobick & Davis]
    • Orientation histograms [Freeman et al.; Zelnik & Irani]
    • Using MoCap data [Zhao & Nevatia; Ramanan & Forsyth]
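The non-parametric scheme above (label a novel motion by its nearest neighbor in the training set) can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the descriptors and labels are stand-ins, and plain normalized correlation stands in for the paper's similarity measure.

```python
import numpy as np

def classify_nearest_neighbor(query_desc, train_descs, train_labels):
    """Label a query motion descriptor with the label of the most
    similar training descriptor (normalized correlation as similarity)."""
    def normalize(v):
        v = v.ravel().astype(float)
        return v / (np.linalg.norm(v) + 1e-8)

    q = normalize(query_desc)
    sims = [np.dot(q, normalize(d)) for d in train_descs]
    return train_labels[int(np.argmax(sims))]

# toy database of two hypothetical descriptors
train = [np.ones((4, 4)), -np.ones((4, 4))]
labels = ["run", "walk"]
print(classify_nearest_neighbor(np.ones((4, 4)) * 0.5, train, labels))  # prints "run"
```

With enough labeled clips, nothing more than this nearest-neighbor lookup is needed at classification time.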
Gathering action data
• Tracking
  – Simple correlation-based tracker
  – User-initialized
Figure-centric Representation
• Stabilized spatio-temporal volume
  – No translation information
  – All motion is caused by the person's limbs
• Good news: indifferent to camera motion
• Bad news: hard!
• A good test of whether actions, not just translation, are being captured
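A correlation tracker like the one mentioned above can be approximated by exhaustive normalized cross-correlation search. This is a minimal sketch under that assumption (the function name and the test pattern are illustrative, not from the paper), recovering a template's position in a frame:

```python
import numpy as np

def track_template(frame, template):
    """Find the template's best match in a frame via normalized
    cross-correlation over all positions (a toy stand-in for a
    user-initialized correlation tracker; exhaustive search)."""
    th, tw = template.shape
    fh, fw = frame.shape
    t = template - template.mean()
    best, best_pos = -np.inf, (0, 0)
    for y in range(fh - th + 1):
        for x in range(fw - tw + 1):
            patch = frame[y:y+th, x:x+tw]
            p = patch - patch.mean()
            denom = np.sqrt((p**2).sum() * (t**2).sum()) + 1e-8
            score = (p * t).sum() / denom
            if score > best:
                best, best_pos = score, (y, x)
    return best_pos

# toy: place a plus-shaped pattern at (5, 7) and recover its position
template = np.array([[0., 1., 0.], [1., 1., 1.], [0., 1., 0.]])
frame = np.zeros((20, 20))
frame[5:8, 7:10] = template
print(track_template(frame, template))  # prints (5, 7)
```

Cropping a fixed window around the tracked position in every frame yields the stabilized, figure-centric volume.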
input sequence
Remembrance of Things Past
• "Explain" a novel motion sequence by matching to previously seen video clips
  – For each frame, match based on some temporal extent
Challenge: how to compare motions?
[Figure: motion analysis of the input sequence, matched against a database of labeled clips: run, walk left, walk right, jog, swing]
How to describe motion?
• Appearance
  – Not preserved across different clothing
• Gradients (spatial, temporal)
  – Same problem (e.g. contrast reversal)
• Edges/Silhouettes
  – Too unreliable
• Optical flow
  – Explicitly encodes motion
  – Least affected by appearance
  – …but too noisy
Spatial Motion Descriptor
Image frame → optical flow F = (F_x, F_y) → half-wave rectified channels F_x+, F_x-, F_y+, F_y- → blurred channels Fb_x+, Fb_x-, Fb_y+, Fb_y-
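The descriptor above can be sketched directly: split the flow into x and y components, half-wave rectify each into a positive and a negative channel, and blur each channel. This assumes the flow field is already computed, and a simple box blur stands in for the blur used in the paper:

```python
import numpy as np

def motion_channels(Fx, Fy, blur_radius=1):
    """Half-wave rectify optical flow into four non-negative
    channels (Fx+, Fx-, Fy+, Fy-) and blur each one; a box
    blur stands in here for the paper's blur."""
    def box_blur(img, r):
        pad = np.pad(img, r, mode="edge")
        out = np.zeros_like(img, dtype=float)
        h, w = img.shape
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                out += pad[r+dy : r+dy+h, r+dx : r+dx+w]
        return out / (2 * r + 1) ** 2

    channels = [np.maximum(Fx, 0), np.maximum(-Fx, 0),
                np.maximum(Fy, 0), np.maximum(-Fy, 0)]
    return [box_blur(c, blur_radius) for c in channels]

# toy flow: rightward motion on the left half, leftward on the right
Fx = np.hstack([np.ones((4, 2)), -np.ones((4, 2))])
Fy = np.zeros((4, 4))
fx_pos, fx_neg, fy_pos, fy_neg = motion_channels(Fx, Fy)
```

Rectification keeps opposite motion directions in separate channels, so blurring suppresses flow noise without cancelling opposing motions against each other.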
Spatio-temporal Motion Descriptor
[Figure: sequences A and B over time t; a frame-to-frame similarity matrix between A and B is convolved with an E×E blurry identity (I) matrix, where E is the temporal extent, to give the motion-to-motion similarity matrix]
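The aggregation step can be sketched as follows: given a frame-to-frame similarity matrix, convolve it with an identity-like kernel of size equal to the temporal extent, so that similarity is summed along temporally aligned diagonals. The exact kernel weights here (a little mass on the neighboring diagonals as the "blur") are an assumption for illustration:

```python
import numpy as np

def motion_similarity(frame_sim, extent=5):
    """Turn a frame-to-frame similarity matrix into a
    motion-to-motion similarity matrix by correlating it with a
    blurred identity kernel of size `extent`, which aggregates
    similarity over runs of temporally aligned frame pairs."""
    E = extent
    # identity kernel with some weight spread to adjacent diagonals
    K = np.eye(E) + 0.5 * (np.eye(E, k=1) + np.eye(E, k=-1))
    K /= K.sum()

    nA, nB = frame_sim.shape
    out = np.full((nA, nB), -np.inf)
    h = E // 2
    for i in range(h, nA - h):
        for j in range(h, nB - h):
            window = frame_sim[i-h : i+h+1, j-h : j+h+1]
            out[i, j] = (window * K).sum()
    return out

# toy: two identical 12-frame sequences give a strong diagonal
S = np.eye(12)
M = motion_similarity(S, extent=5)
```

A high entry M[i, j] then means that the E-frame motion around frame i matches the E-frame motion around frame j, not merely that the two single frames look alike.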
Football Actions: matching
[Figure: input sequence shown alongside the matched database frames]
Football Actions: classification
10 actions; 4500 total frames; 13-frame motion descriptor
Classifying Ballet Actions
16 actions; 24,800 total frames; 51-frame motion descriptor. Men used to classify women and vice versa.
Classifying Tennis Actions
6 actions; 4600 frames; 7-frame motion descriptor. Woman player used for training, man for testing.
Classifying Tennis
• Red bars show classification results
Querying the Database
[Figure: input sequence queried against the database of labeled clips (run, walk left, walk right, jog, swing)]
Action Recognition:
Joint Positions:
2D Skeleton Transfer
• We annotate the database with 2D joint positions
• After matching, transfer the data to the novel sequence
  – Adjust the match for best fit
Input sequence:
Transferred 2D skeletons:
3D Skeleton Transfer
• We populate database with rendered stick figures from 3D Motion Capture data
• Matching as before, we get 3D joint positions (kind of)!
Input sequence:
Transferred 3D skeletons:
“Do as I Do” Motion Synthesis
• Matching two things:
  – Motion similarity across sequences
  – Appearance similarity within the sequence (as in Video Textures)
• Dynamic Programming
input sequence
synthetic sequence
“Do as I Do” Source Motion Source Appearance
Result
3400 Frames
“Do as I Say” Synthesis
• Synthesize given action labels
  – e.g. video game control
[Figure: action labels (run, walk left, swing, walk right, jog) selecting database clips to produce the synthetic sequence]
“Do as I Say”
• Red box shows when constraint is applied
Actor Replacement
SHOW VIDEO
Conclusions
• In the medium field, action is about motion
• What we propose:
  – A way of matching motions at a coarse scale
• What we get out:
  – Action recognition
  – Skeleton transfer
  – Synthesis: “Do as I Do” & “Do as I Say”
• What we learned:
  – A lot to be said for the “little guy”!
Thank You
Smoothness for Synthesis
• W_act is the action similarity between source and target
• W_app is the appearance similarity within target frames
• For every source frame i, find the best target frame ψ_i by maximizing the following cost function:

    Σ_{i=1}^{n} W_act(i, ψ_i) + Σ_{i=2}^{n} W_app(ψ_{i−1}, ψ_i)

  where ψ_i denotes the target frame chosen for source frame i
• Optimize using dynamic programming
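The dynamic program above can be sketched Viterbi-style: score every target frame at each source step, carry the best predecessor under the appearance term, and backtrack. The similarity matrices here are toy stand-ins, and the function name is illustrative:

```python
import numpy as np

def do_as_i_do(W_act, W_app):
    """Choose one target frame per source frame, maximizing the sum of
    action similarity W_act[i, j] plus appearance smoothness
    W_app[j_prev, j] between consecutive picks, via dynamic programming."""
    n, m = W_act.shape                 # n source frames, m target frames
    score = np.zeros((n, m))
    back = np.zeros((n, m), dtype=int)
    score[0] = W_act[0]
    for i in range(1, n):
        # trans[j_prev, j]: score of reaching target j from j_prev
        trans = score[i-1][:, None] + W_app
        back[i] = np.argmax(trans, axis=0)
        score[i] = W_act[i] + trans[back[i], np.arange(m)]
    # backtrack the best path of target frames
    path = [int(np.argmax(score[-1]))]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1]

# toy: identity action similarity plus a bonus for consecutive target frames
W_act = np.eye(3)
W_app = np.eye(3, k=1)   # W_app[j_prev, j] rewards j == j_prev + 1
print(do_as_i_do(W_act, W_app))  # prints [0, 1, 2]
```

The appearance term is what keeps the synthesized sequence from jumping between visually incompatible target frames, exactly as in the smoothness cost above.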
The Database Analogy
Conclusions
• Action is about motion
• Purely motion-based descriptor for actions
• We treat optical flow
  – Not as a measurement of pixel displacement
  – But as a set of noisy features that are carefully smoothed and aggregated
• Can handle very poor, noisy data
Cool Video, Attempt II
Comparing motion descriptors
[Figure: sequences compared over time t; the frame-to-frame similarity matrix is convolved with a blurry identity (I) matrix to give the motion-to-motion similarity matrix]