Recognizing Action at a Distance
A.A. Efros, A.C. Berg, G. Mori, J. Malik
UC Berkeley
Looking at People
• Far field: the 3-pixel man (blob tracking)
  – vast surveillance literature
• Near field: the 300-pixel man (limb tracking)
  – e.g. Yacoob & Black, Rao & Shah, etc.
Medium-field Recognition
The 30-Pixel Man
Appearance vs. Motion
Jackson Pollock, Number 21 (detail)
Goals
• Recognize human actions at a distance
  – Low resolution, noisy data
  – Moving camera, occlusions
  – Wide range of actions (including non-periodic)
Our Approach
• Motion-based approach
  – Non-parametric; use a large amount of data
  – Classify a novel motion by finding the most similar motion in the training set
• Related Work
  – Periodicity analysis
    • Polana & Nelson; Seitz & Dyer; Bobick et al.; Cutler & Davis; Collins et al.
  – Model-free
    • Temporal Templates [Bobick & Davis]
    • Orientation histograms [Freeman et al.; Zelnik & Irani]
    • Using MoCap data [Zhao & Nevatia; Ramanan & Forsyth]
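The non-parametric scheme above (label a novel motion by its nearest neighbor in the training set) can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the descriptors and labels are stand-ins, and plain normalized correlation stands in for the paper's similarity measure.

```python
import numpy as np

def classify_nearest_neighbor(query_desc, train_descs, train_labels):
    """Label a query motion descriptor with the label of the most
    similar training descriptor (normalized correlation as similarity)."""
    def normalize(v):
        v = v.ravel().astype(float)
        return v / (np.linalg.norm(v) + 1e-8)

    q = normalize(query_desc)
    sims = [np.dot(q, normalize(d)) for d in train_descs]
    return train_labels[int(np.argmax(sims))]

# toy database of two hypothetical descriptors
train = [np.ones((4, 4)), -np.ones((4, 4))]
labels = ["run", "walk"]
print(classify_nearest_neighbor(np.ones((4, 4)) * 0.5, train, labels))  # prints "run"
```

With enough labeled clips, nothing more than this nearest-neighbor lookup is needed at classification time.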
Gathering action data
• Tracking
  – Simple correlation-based tracker
  – User-initialized
Figure-centric Representation
• Stabilized spatio-temporal volume
  – No translation information
  – All motion is caused by the person's limbs
• Good news: indifferent to camera motion
• Bad news: hard!
• A good test of whether actions, not just translation, are being captured
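A correlation tracker like the one mentioned above can be approximated by exhaustive normalized cross-correlation search. This is a minimal sketch under that assumption (the function name and the test pattern are illustrative, not from the paper), recovering a template's position in a frame:

```python
import numpy as np

def track_template(frame, template):
    """Find the template's best match in a frame via normalized
    cross-correlation over all positions (a toy stand-in for a
    user-initialized correlation tracker; exhaustive search)."""
    th, tw = template.shape
    fh, fw = frame.shape
    t = template - template.mean()
    best, best_pos = -np.inf, (0, 0)
    for y in range(fh - th + 1):
        for x in range(fw - tw + 1):
            patch = frame[y:y+th, x:x+tw]
            p = patch - patch.mean()
            denom = np.sqrt((p**2).sum() * (t**2).sum()) + 1e-8
            score = (p * t).sum() / denom
            if score > best:
                best, best_pos = score, (y, x)
    return best_pos

# toy: place a plus-shaped pattern at (5, 7) and recover its position
template = np.array([[0., 1., 0.], [1., 1., 1.], [0., 1., 0.]])
frame = np.zeros((20, 20))
frame[5:8, 7:10] = template
print(track_template(frame, template))  # prints (5, 7)
```

Cropping a fixed window around the tracked position in every frame yields the stabilized, figure-centric volume.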
input sequence
Remembrance of Things Past
• "Explain" a novel motion sequence by matching to previously seen video clips
  – For each frame, match based on some temporal extent
Challenge: how to compare motions?
[Figure: motion analysis of the input sequence, matched against a database of labeled clips: run, walk left, walk right, jog, swing]
How to describe motion?
• Appearance
  – Not preserved across different clothing
• Gradients (spatial, temporal)
  – Same problem (e.g. contrast reversal)
• Edges/Silhouettes
  – Too unreliable
• Optical flow
  – Explicitly encodes motion
  – Least affected by appearance
  – …but too noisy
Spatial Motion Descriptor
Image frame → optical flow F = (F_x, F_y) → half-wave rectified channels F_x+, F_x-, F_y+, F_y- → blurred channels Fb_x+, Fb_x-, Fb_y+, Fb_y-
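The descriptor above can be sketched directly: split the flow into x and y components, half-wave rectify each into a positive and a negative channel, and blur each channel. This assumes the flow field is already computed, and a simple box blur stands in for the blur used in the paper:

```python
import numpy as np

def motion_channels(Fx, Fy, blur_radius=1):
    """Half-wave rectify optical flow into four non-negative
    channels (Fx+, Fx-, Fy+, Fy-) and blur each one; a box
    blur stands in here for the paper's blur."""
    def box_blur(img, r):
        pad = np.pad(img, r, mode="edge")
        out = np.zeros_like(img, dtype=float)
        h, w = img.shape
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                out += pad[r+dy : r+dy+h, r+dx : r+dx+w]
        return out / (2 * r + 1) ** 2

    channels = [np.maximum(Fx, 0), np.maximum(-Fx, 0),
                np.maximum(Fy, 0), np.maximum(-Fy, 0)]
    return [box_blur(c, blur_radius) for c in channels]

# toy flow: rightward motion on the left half, leftward on the right
Fx = np.hstack([np.ones((4, 2)), -np.ones((4, 2))])
Fy = np.zeros((4, 4))
fx_pos, fx_neg, fy_pos, fy_neg = motion_channels(Fx, Fy)
```

Rectification keeps opposite motion directions in separate channels, so blurring suppresses flow noise without cancelling opposing motions against each other.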
Spatio-temporal Motion Descriptor
[Figure: sequences A and B over time t; a frame-to-frame similarity matrix between A and B is convolved with an E×E blurry identity (I) matrix, where E is the temporal extent, to give the motion-to-motion similarity matrix]
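The aggregation step can be sketched as follows: given a frame-to-frame similarity matrix, convolve it with an identity-like kernel of size equal to the temporal extent, so that similarity is summed along temporally aligned diagonals. The exact kernel weights here (a little mass on the neighboring diagonals as the "blur") are an assumption for illustration:

```python
import numpy as np

def motion_similarity(frame_sim, extent=5):
    """Turn a frame-to-frame similarity matrix into a
    motion-to-motion similarity matrix by correlating it with a
    blurred identity kernel of size `extent`, which aggregates
    similarity over runs of temporally aligned frame pairs."""
    E = extent
    # identity kernel with some weight spread to adjacent diagonals
    K = np.eye(E) + 0.5 * (np.eye(E, k=1) + np.eye(E, k=-1))
    K /= K.sum()

    nA, nB = frame_sim.shape
    out = np.full((nA, nB), -np.inf)
    h = E // 2
    for i in range(h, nA - h):
        for j in range(h, nB - h):
            window = frame_sim[i-h : i+h+1, j-h : j+h+1]
            out[i, j] = (window * K).sum()
    return out

# toy: two identical 12-frame sequences give a strong diagonal
S = np.eye(12)
M = motion_similarity(S, extent=5)
```

A high entry M[i, j] then means that the E-frame motion around frame i matches the E-frame motion around frame j, not merely that the two single frames look alike.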
Football Actions: matching
[Figure: input sequence shown alongside the matched database frames]
Football Actions: classification
10 actions; 4500 total frames; 13-frame motion descriptor
Classifying Ballet Actions
16 actions; 24,800 total frames; 51-frame motion descriptor. Men used to classify women and vice versa.
Classifying Tennis Actions
6 actions; 4600 frames; 7-frame motion descriptor. Woman player used for training, man for testing.
Classifying Tennis
• Red bars show classification results
Querying the Database
[Figure: input sequence queried against the database of labeled clips (run, walk left, walk right, jog, swing)]
Action Recognition:
Joint Positions:
2D Skeleton Transfer
• We annotate the database with 2D joint positions
• After matching, transfer the data to the novel sequence
  – Adjust the match for best fit
Input sequence:
Transferred 2D skeletons:
3D Skeleton Transfer
• We populate database with rendered stick figures from 3D Motion Capture data
• Matching as before, we get 3D joint positions (kind of)!
Input sequence:
Transferred 3D skeletons:
“Do as I Do” Motion Synthesis
• Matching two things:
  – Motion similarity across sequences
  – Appearance similarity within the sequence (as in Video Textures)
• Dynamic Programming
input sequence
synthetic sequence
“Do as I Do” Source Motion Source Appearance
Result
3400 Frames
“Do as I Say” Synthesis
• Synthesize given action labels
  – e.g. video game control
[Figure: action labels (run, walk left, swing, walk right, jog) selecting database clips to produce the synthetic sequence]
“Do as I Say”
• Red box shows when constraint is applied
Actor Replacement
SHOW VIDEO
Conclusions
• In the medium field, action is about motion
• What we propose:
  – A way of matching motions at a coarse scale
• What we get out:
  – Action recognition
  – Skeleton transfer
  – Synthesis: “Do as I Do” & “Do as I Say”
• What we learned:
  – A lot to be said for the “little guy”!
Thank You
Smoothness for Synthesis
• W_act is the action similarity between source and target
• W_app is the appearance similarity within target frames
• For every source frame i, find the best target frame ψ_i by maximizing the following cost function:

    Σ_{i=1}^{n} W_act(i, ψ_i) + Σ_{i=2}^{n} W_app(ψ_{i−1}, ψ_i)

  where ψ_i denotes the target frame chosen for source frame i
• Optimize using dynamic programming
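The dynamic program above can be sketched Viterbi-style: score every target frame at each source step, carry the best predecessor under the appearance term, and backtrack. The similarity matrices here are toy stand-ins, and the function name is illustrative:

```python
import numpy as np

def do_as_i_do(W_act, W_app):
    """Choose one target frame per source frame, maximizing the sum of
    action similarity W_act[i, j] plus appearance smoothness
    W_app[j_prev, j] between consecutive picks, via dynamic programming."""
    n, m = W_act.shape                 # n source frames, m target frames
    score = np.zeros((n, m))
    back = np.zeros((n, m), dtype=int)
    score[0] = W_act[0]
    for i in range(1, n):
        # trans[j_prev, j]: score of reaching target j from j_prev
        trans = score[i-1][:, None] + W_app
        back[i] = np.argmax(trans, axis=0)
        score[i] = W_act[i] + trans[back[i], np.arange(m)]
    # backtrack the best path of target frames
    path = [int(np.argmax(score[-1]))]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1]

# toy: identity action similarity plus a bonus for consecutive target frames
W_act = np.eye(3)
W_app = np.eye(3, k=1)   # W_app[j_prev, j] rewards j == j_prev + 1
print(do_as_i_do(W_act, W_app))  # prints [0, 1, 2]
```

The appearance term is what keeps the synthesized sequence from jumping between visually incompatible target frames, exactly as in the smoothness cost above.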
The Database Analogy
Conclusions
• Action is about motion
• Purely motion-based descriptor for actions
• We treat optical flow
  – Not as a measurement of pixel displacement
  – But as a set of noisy features that are carefully smoothed and aggregated
• Can handle very poor, noisy data
Cool Video, Attempt II
Comparing motion descriptors
[Figure: sequences compared over time t; the frame-to-frame similarity matrix is convolved with a blurry identity (I) matrix to give the motion-to-motion similarity matrix]