
Recognizing Human Figures and Actions

Greg Mori
Simon Fraser University

• Action recognition
  – Where are the people?
  – What are they doing?

• Applications
  – Image understanding, image retrieval and search
  – HCI
  – Surveillance
  – Computer Graphics

Far field
• 3-pixel man
• Blob tracking

Medium field
• 30-pixel man
• Coarse-level actions

Near field
• 300-pixel man
• Find and track limbs

Outline

• Human figures in motion
  – Action Recognition

• Localizing joint positions
  – Exemplar-based approach
  – Parts-based approach

• Motion Synthesis
  – Novel graphics application

Appearance vs. Motion

Jackson Pollock, Number 21 (detail)

Action Recognition

• Recognize human actions at a distance
  – Low resolution, noisy data
  – Moving camera, occlusions
  – Wide range of actions (including non-periodic)

Our Approach

• Motion-based approach
  – Classify a novel motion by finding the most similar motion from the training set
  – Use large amounts of data (“non-parametric”)

• Related Work
  – Periodicity analysis
    • Polana & Nelson; Seitz & Dyer; Bobick et al.; Cutler & Davis; Collins et al.
  – Model-free
    • Temporal Templates [Bobick & Davis]
    • Orientation histograms [Freeman et al.; Zelnik & Irani]
    • Using MoCap data [Zhao & Nevatia; Ramanan & Forsyth]

Gathering action data

• Tracking
  – Simple correlation-based tracker

Figure-centric Representation

• Stabilized spatio-temporal volume
  – No translation information
  – All motion caused by person’s limbs

• Good news: indifferent to camera motion

• Bad news: hard!

• Good test to see if actions, not just translation, are being captured

input sequence

Remembrance of Things Past

• “Explain” novel motion sequence by matching to previously seen video clips
  – For each frame, match based on some temporal extent

Challenge: how to compare motions?

Database: walk left, swing, walk right, jog

How to describe motion?

• Appearance
  – Not preserved across different clothing

• Gradients (spatial, temporal)
  – Same problem (e.g. contrast reversal)

• Edges
  – Unreliable at this scale

• Optical flow
  – Explicitly encodes motion
  – Least affected by appearance
  – …but too noisy

Spatial Motion Descriptor

Figure: image frame → optical flow → components Fx, Fy → half-wave rectified channels Fx+, Fx−, Fy+, Fy− → blurred Fx+, Fx−, Fy+, Fy−
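The channel construction can be sketched as follows. This is only an illustration, not the authors' code: it assumes the flow components Fx and Fy are already computed as NumPy arrays, the function names are hypothetical, and a repeated [1, 2, 1]/4 filter stands in for whatever blur was actually used.

```python
import numpy as np

def blur(channel, passes=2):
    """Separable [1, 2, 1]/4 smoothing applied `passes` times (assumed stand-in blur).

    Expects an array at least 3 pixels wide and tall.
    """
    kernel = np.array([0.25, 0.5, 0.25])
    out = channel.astype(float)
    for _ in range(passes):
        # Smooth along rows, then along columns.
        out = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, out)
        out = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, out)
    return out

def motion_channels(fx, fy, passes=2):
    """Split the flow into four non-negative channels, then blur each one."""
    rectified = [
        np.maximum(fx, 0.0),   # Fx+
        np.maximum(-fx, 0.0),  # Fx-
        np.maximum(fy, 0.0),   # Fy+
        np.maximum(-fy, 0.0),  # Fy-
    ]
    return np.stack([blur(c, passes) for c in rectified])
```

Rectifying before blurring keeps opposite motions from cancelling each other when the noisy flow is smoothed.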

Spatio-temporal Motion Descriptor

Figure: Sequence A vs. Sequence B, temporal window w. Panels: frame-to-frame similarity matrix; I matrix; blurry I; motion-to-motion similarity matrix.
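A minimal sketch of going from frame-to-frame to motion-to-motion similarity, assuming each frame is already summarized by a flattened descriptor vector. The function names are hypothetical, and the sketch uses a plain (unblurred) identity weighting over the temporal window w, which is a simplification of the "blurry I" variant.

```python
import numpy as np

def frame_similarity(desc_a, desc_b):
    """Frame-to-frame similarity as inner products of per-frame descriptors.

    desc_a has shape (Ta, D) and desc_b has shape (Tb, D).
    """
    return desc_a @ desc_b.T

def motion_similarity(frame_sim, w):
    """Sum frame similarities along the diagonal over a window of w frames."""
    half = w // 2
    ta, tb = frame_sim.shape
    out = np.zeros_like(frame_sim, dtype=float)
    for t in range(-half, half + 1):
        # Keep only pairs (i, j) whose shifted pair (i + t, j + t) is in bounds.
        i0, i1 = max(0, -t), min(ta, ta - t)
        j0, j1 = max(0, -t), min(tb, tb - t)
        if i1 <= i0 or j1 <= j0:
            continue
        out[i0:i1, j0:j1] += frame_sim[i0 + t:i1 + t, j0 + t:j1 + t]
    return out
```

Summing along the diagonal rewards pairs of frames whose temporal neighbours also match, so a single lucky frame match counts for less than a sustained one.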

Soccer

• Real actions, moving camera, poor video

• 8 classes of actions

• 4500 frames of labeled data

• 1-nearest-neighbor classifier
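The 1-nearest-neighbor step can be sketched as below, assuming a motion-to-motion similarity matrix with one row per novel frame and one column per labeled database frame. The function name and the input layout are assumptions made for illustration.

```python
def classify_frames(motion_sim, database_labels):
    """Label each novel frame (row) with the label of its most similar database frame (column)."""
    predictions = []
    for row in motion_sim:
        # Index of the database frame with the highest similarity to this novel frame.
        best = max(range(len(row)), key=lambda j: row[j])
        predictions.append(database_labels[best])
    return predictions

# Toy example: two novel frames, three labeled database frames.
labels = ["walk left", "swing", "jog"]
similarity = [[0.1, 0.9, 0.3],
              [0.8, 0.2, 0.1]]
print(classify_frames(similarity, labels))
```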

Classifying Ballet Actions

16 actions; 24800 total frames; 51-frame motion descriptor. Men used to classify women and vice versa.

Classifying Tennis Actions

6 actions; 4600 frames; 7-frame motion descriptor. Woman player used as training, man as testing.

Classifying Tennis

• Red bars show classification results

Outline

Human Figures in Still Images

• Detection of humans is possible for stereotypical poses
  – Standing
  – Walking
  – (Viola et al., Poggio et al.)

• But we want to do more
  – Wider variety of poses
  – Localize joint positions

Problem

Shape Matching For Finding People

Database of Exemplars

Shape Contexts

• Deformable template approach
  – Shapes represented as a collection of edge points

• Two stages
  – Fast pruning
    • Quick tests to construct a shortlist of candidate objects
    • Database of known objects could be large
  – Detailed matching
    • Perform computationally expensive comparisons on only the few shapes in the shortlist

• Publications
  – Mori et al., CVPR 2001
  – Mori and Malik, CVPR 2003
    • Featured in New York Times Science section
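For readers unfamiliar with the descriptor, here is a rough sketch of a shape context: a log-polar histogram of where the other edge points lie relative to one reference point. The bin counts, radius limits, and function names are assumptions chosen for illustration, and distances are normalized by the mean distance from the reference point, which is a simplification.

```python
import math

def shape_context(points, index, n_r=5, n_theta=12, r_min=0.125, r_max=2.0):
    """Log-polar histogram of all other points relative to points[index]."""
    px, py = points[index]
    others = [(x - px, y - py) for k, (x, y) in enumerate(points) if k != index]
    dists = [math.hypot(dx, dy) for dx, dy in others]
    hist = [[0] * n_theta for _ in range(n_r)]
    mean_dist = sum(dists) / len(dists) if dists else 0.0
    if mean_dist == 0:
        return hist
    log_lo, log_hi = math.log(r_min), math.log(r_max)
    for (dx, dy), d in zip(others, dists):
        r = d / mean_dist  # scale-normalized radius
        if r <= 0 or r >= r_max:
            continue  # skip coincident points and points beyond the outer ring
        r_bin = int((math.log(max(r, r_min)) - log_lo) / (log_hi - log_lo) * n_r)
        angle = math.atan2(dy, dx) % (2 * math.pi)
        t_bin = int(angle / (2 * math.pi) * n_theta) % n_theta
        hist[r_bin][t_bin] += 1
    return hist

def chi2_cost(h1, h2):
    """Chi-square distance between two histograms (0 means identical)."""
    a = [v for row in h1 for v in row]
    b = [v for row in h2 for v in row]
    return 0.5 * sum((x - y) ** 2 / (x + y) for x, y in zip(a, b) if x + y > 0)
```

A cheap comparison like `chi2_cost` on a handful of points fits the fast-pruning stage; the detailed matching stage would compare many more points per shape.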

Results: Tracking by Repeated Finding


Multiple Exemplars

• Parts-based approach
  – Use a combination of keypoints or limbs from different exemplars
  – Reduces the number of exemplars needed

• Compute a matching cost for each limb from every exemplar

• Compute pairwise “consistency” costs for neighbouring limbs

• Use dynamic programming to find best K configurations
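The dynamic programming step might look like the sketch below. It makes two simplifying assumptions that are not from the slides: the limbs are treated as a simple chain (each limb is only checked against the previous one), and only the single best configuration is returned rather than the best K. All names are hypothetical.

```python
def best_configuration(match_cost, consistency):
    """Pick one exemplar per limb, minimizing matching plus consistency cost.

    match_cost[l][e] is the cost of taking limb l from exemplar e.
    consistency(l, e_prev, e) is the cost of pairing limb l-1 from e_prev
    with limb l from e.
    Returns (total_cost, list of exemplar indices, one per limb).
    """
    n_limbs, n_ex = len(match_cost), len(match_cost[0])
    cost = list(match_cost[0])
    back = []
    for l in range(1, n_limbs):
        new_cost, pointers = [], []
        for e in range(n_ex):
            # Cheapest way to reach (limb l, exemplar e) from any previous exemplar.
            prev = min(range(n_ex), key=lambda p: cost[p] + consistency(l, p, e))
            new_cost.append(cost[prev] + consistency(l, prev, e) + match_cost[l][e])
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)
    e = min(range(n_ex), key=lambda k: cost[k])
    total, path = cost[e], [e]
    for pointers in reversed(back):
        e = pointers[e]
        path.append(e)
    return total, path[::-1]

# Toy example: 3 limbs, 2 exemplars, switching exemplars costs 2.
costs = [[1, 5], [6, 1], [1, 5]]
switch = lambda l, p, e: 0 if p == e else 2
print(best_configuration(costs, switch))
```

In the toy example, borrowing the middle limb from the second exemplar is worth paying the two switching penalties.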

Combining Exemplars

Finding People (II): Parts-based Approach

• Bottom-up

• Segmentation as preprocessing

• Detect half-limbs and torsos

• Assemble partial configurations
  – Prune using global constraints

• Extend partial configurations to full human figures

Segmentation for Recognition

• Window-scanning (e.g. face detection)
  – O(N M S)

• Segmentation
  – Support masks for computation of features
  – Efficiency
  – Scalability
  – 600K pixels → 300 superpixels, 50 segments
  – O(N) + O(log(M))

Figure labels: SUPERPIXELS, SEGMENTS

Limb/Torso Detectors

• Learn limb and torso detectors from hand-labeled data

• Cues:
  – Contour
    • Average edge strength on boundary
  – Shape
    • Similarity to rectangle
  – Shading
    • x,y gradients, blurred
  – Focus
    • Ratio of high to low frequency energies

Assembling Partial Configurations

• Combinatorial search over sets of limbs and torsos
  – Configurations of 3 half-limbs plus a torso

• Prune using global constraints
  – Proximity
  – Relative widths
  – Maximum lengths
  – Symmetry in colour

• Complete half-limbs
  – 2 or 3-limbed people

• Sort partial configurations
  – Use limb, torso, and segmentation scores

• Extend final limbs of best configurations

Results

Rank 3

Outline

“Do as I Do” Motion Synthesis

• Matching two things:
  – Motion similarity across sequences
  – Appearance similarity within sequence

• Dynamic Programming

input sequence

synthetic sequence

Smoothness for Synthesis

• First term: similarity between input and target frames

• Second term: appearance similarity within target frames

• For input frames {i}, find the best target frames by maximizing a cost function that combines these two terms

• Optimize using dynamic programming:
  – N frames in input sequence
  – M target frames in database
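A hedged sketch of such a dynamic program is given below. Since the exact cost function is not reproduced here, the sketch assumes a simple form: the sum of input-to-target similarities plus a weight `alpha` times the appearance similarity between consecutive chosen target frames. The names `act_sim`, `smooth_sim`, and `alpha` are hypothetical. As written it takes O(N M²) time.

```python
def synthesize(act_sim, smooth_sim, alpha=1.0):
    """Choose one target frame per input frame, maximizing the assumed cost.

    act_sim[i][j]: similarity between input frame i and target frame j (N x M).
    smooth_sim[j][k]: appearance similarity when target frame k follows j (M x M).
    """
    n, m = len(act_sim), len(act_sim[0])
    score = list(act_sim[0])
    back = []
    for i in range(1, n):
        new_score, pointers = [], []
        for j in range(m):
            # Best predecessor for showing target frame j at input frame i.
            prev = max(range(m), key=lambda k: score[k] + alpha * smooth_sim[k][j])
            new_score.append(score[prev] + alpha * smooth_sim[prev][j] + act_sim[i][j])
            pointers.append(prev)
        score = new_score
        back.append(pointers)
    j = max(range(m), key=lambda k: score[k])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return path[::-1]

# Toy example: 3 input frames, 2 target frames.
act = [[3, 0], [0, 1], [0, 1]]
smooth = [[1, 0], [0, 1]]
print(synthesize(act, smooth, alpha=0.0))  # motion similarity only
print(synthesize(act, smooth, alpha=5.0))  # smoothness dominates
```

Raising `alpha` trades fidelity to the input motion for smoother-looking output.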

“Do as I Do” Synthesis

Figure labels: Target Frames, Input Sequence

Result

3400 Frames

“Do as I Say” Synthesis

• Synthesize given action labels
  – e.g. video game control

Action labels: run, walk left, swing, walk right, jog

synthetic sequence

Database: walk left, swing, walk right, jog

“Do as I Say”

• Red box shows when constraint is applied

Frame 9½

Putting It All Together

• Can we do a better job of splicing clips together?

Frame 9 Frame 10

YES… if we can find the joints!

Morphed Transitions

8 Transitions

Morphed Transitions

3 Transitions

Actor Replacement

• Rendering new character into existing footage

• Algorithm
  – Track original character
  – Find matches from new character
  – Erase original character
  – Render in new character

• Need to worry about occlusions

Show the impressive video

Future Directions

• Much remains to be done!

• Action Recognition
  – Using joint positions, shape: the “morpho-kinetics” of action recognition
  – Better models of activities

• Detecting and localizing figures
  – Combining top-down exemplar methods with bottom-up segmentation methods
  – Exploiting temporal cues

Acknowledgements

• References
  – Mori, Belongie, and Malik, “Shape Contexts Enable Efficient Retrieval of Similar Shapes”, CVPR 2001
  – Mori and Malik, “Estimating Human Body Configurations using Shape Context Matching”, ECCV 2002
  – Efros, Berg, Mori, and Malik, “Recognizing Action at a Distance”, ICCV 2003
  – Mori and Malik, “Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA”, CVPR 2003
  – Mori, Ren, Efros, and Malik, “Recovering Human Body Configurations: Combining Segmentation and Recognition”, CVPR 2004

• Thank you!
