Older Approaches Appearance Methods Shape & Motion Activity Detection Seminar Sivan Edri

Upload: julius-cunningham

Post on 12-Jan-2016


Page 1: Activity Detection Seminar Sivan Edri.  This capability of the human vision system argues for recognition of movement directly from the motion itself,

Older Approaches
Appearance Methods
Shape & Motion

Activity Detection Seminar
Sivan Edri

Page 2:

Shape

Page 3:

This capability of the human vision system argues for recognition of movement directly from the motion itself, as opposed to first reconstructing a three-dimensional model of a person and then recognizing the motion of the model.

Motivation

Page 4:

First, I will present the construction of a binary motion-energy image (MEI), which represents where motion has occurred in an image sequence.

Next, we generate a motion-history image (MHI), a scalar-valued image where intensity is a function of the recency of motion: it captures how the motion is moving.

Page 5:

Taken together, the MEI and MHI can be considered as a two-component version of a temporal template, a vector-valued image where each component of each pixel is some function of the motion at that pixel location.

These templates are matched against the stored models of views of known movements.

Page 6:

Motion-Energy Image (MEI)

Example of someone sitting. Top row contains key frames. The bottom row is cumulative motion images starting from Frame 0.

Page 7:

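Let $I(x, y, t)$ be an image sequence and let $D(x, y, t)$ be a binary image sequence indicating regions of motion. For many applications, image differencing is adequate to generate $D$.

Then, the binary MEI $E_\tau(x, y, t)$ is defined as

$$E_\tau(x, y, t) = \bigcup_{i=0}^{\tau-1} D(x, y, t - i)$$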

Motion-Energy Image (MEI)
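A minimal sketch of the MEI construction, assuming the MEI is the pixelwise union (logical OR) of the last tau binary motion images, as in the Bobick and Davis paper listed in the references. The function name `motion_energy_image`, the nested-list image layout, and the toy frames are illustrative assumptions, not part of the slides.

```python
def motion_energy_image(D, t, tau):
    """Pixelwise OR of the binary motion images D[t - tau + 1] .. D[t].

    D is a list of frames; each frame is a list of rows of 0/1 values.
    """
    if tau < 1 or tau > t + 1:
        raise ValueError("tau must satisfy 1 <= tau <= t + 1")
    rows, cols = len(D[t]), len(D[t][0])
    mei = [[0] * cols for _ in range(rows)]
    for i in range(tau):
        frame = D[t - i]
        for y in range(rows):
            for x in range(cols):
                if frame[y][x]:
                    mei[y][x] = 1
    return mei


# Hypothetical 2x3 binary motion frames: a blob moving left to right.
D = [
    [[1, 0, 0], [0, 0, 0]],
    [[0, 1, 0], [0, 0, 0]],
    [[0, 0, 1], [0, 0, 0]],
]
mei_all = motion_energy_image(D, t=2, tau=3)       # covers all three frames
mei_last_two = motion_energy_image(D, t=2, tau=2)  # covers the last two frames
```

A longer window tau accumulates a larger swept region, which is why the choice of tau matters when movements are performed at different speeds.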

Page 8:

Motion-Energy Image (MEI)

MEIs of the sitting movement over a 90° range of viewing angles. The smooth change implies that only a coarse sampling of viewing direction is necessary to recognize the movement from all angles.

Page 9:

Motion-History Image (MHI)

Page 10:

To represent how (as opposed to where) the motion is moving, we form a motion-history image (MHI). In an MHI $H_\tau$, pixel intensity is a function of the temporal history of motion at that point.

The result is a scalar-valued image where more recently moving pixels are brighter.

Motion-History Image (MHI)

Page 11:
Page 12:

Note that the MEI can be generated by thresholding the MHI above zero.

Given this, one might ask: why not use the MHI alone for recognition?

Motion-History Image (MHI)

Page 13:

The computation is recursive. The MHI at time t is computed from the MHI at time t-1 and the current motion image $D(x, y, t)$, and the current MEI is computed by thresholding the MHI. The recursive definition implies that no history of the previous images or their motion fields needs to be stored or manipulated, making the computation both fast and space efficient.

Method Pros
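The recursion can be sketched as follows. This assumes the simple replacement-and-decay rule from the Bobick and Davis paper (a pixel is set to tau where motion is currently detected, otherwise it decays by 1 down to 0); the function names and the toy frames are illustrative.

```python
def update_mhi(prev_mhi, motion, tau):
    """One recursive MHI step: only the previous MHI and the current motion image are needed."""
    return [
        [tau if m else max(0, h - 1) for h, m in zip(h_row, m_row)]
        for h_row, m_row in zip(prev_mhi, motion)
    ]


def mei_from_mhi(mhi):
    """The MEI is the MHI thresholded above zero."""
    return [[1 if h > 0 else 0 for h in row] for row in mhi]


# Hypothetical 1x3 binary motion frames: a blob moving left to right.
frames = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]]
mhi = [[0, 0, 0]]
for motion in frames:
    mhi = update_mhi(mhi, motion, tau=3)
mei = mei_from_mhi(mhi)
```

After the three frames the most recently moving pixel holds the largest value, so the intensity ramp encodes the left-to-right direction that the MEI alone cannot show.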

Page 14:

There is no consideration of optic flow, the direction of image motion.

Note the relation between the construction of the MHI and direction of motion. Consider the waving example where the arms fan upwards.

Method Cons

Page 15:

Testing on Aerobics Data: One Camera

Page 16:

Testing on Aerobics Data: One Camera

Page 17:

Testing on Aerobics Data: One Camera

Page 18:

To evaluate the power of the temporal template representation, video sequences of 18 aerobic exercises were recorded, each performed several times by an experienced aerobics instructor.

Seven views of the movement, from -90° to +90° in 30° increments in the horizontal plane, were recorded.

Testing on Aerobics Data: One Camera

Page 19:

The only preprocessing done on the data was to reduce the image resolution to 320 x 240 from the captured 640 x 480.

This step had the effect of not only reducing the data set size, but also of providing some limited blurring, which enhances the stability of the global statistics.

Testing on Aerobics Data: One Camera

Page 20:

The Mahalanobis distance is a measure of the distance between a point P and a distribution D.

It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D.

This distance is zero if P is at the mean of D, and grows as P moves away from the mean.

Mahalanobis Distance

Page 21:

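The Mahalanobis distance of an observation $x = (x_1, x_2, \ldots, x_N)^T$ from a group of observations with mean $\mu = (\mu_1, \mu_2, \ldots, \mu_N)^T$ and covariance matrix $S$ is defined as:

$$D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$$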

Mahalanobis Distance
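As a small illustration of the definition, here is a dependency-free sketch for the two-dimensional case, where the inverse of the covariance matrix can be written in closed form. The function name `mahalanobis_2d` and the example covariance matrices are assumptions chosen for illustration.

```python
import math


def mahalanobis_2d(x, mu, S):
    """sqrt((x - mu)^T S^-1 (x - mu)) for a 2x2 covariance matrix S."""
    (a, b), (c, d) = S
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    q = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.sqrt(q)


identity = [[1.0, 0.0], [0.0, 1.0]]   # S = I: reduces to Euclidean distance
stretched = [[4.0, 0.0], [0.0, 1.0]]  # variance 4 along the first axis

d_identity = mahalanobis_2d([2.0, 0.0], [0.0, 0.0], identity)
d_stretched = mahalanobis_2d([2.0, 0.0], [0.0, 0.0], stretched)
```

The same point is two units from the mean when S = I but only one unit away when the distribution is spread out along that axis, since the distance is measured in standard deviations.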

Page 22:

Mahalanobis Distance

[Figure: two panels, S = I and S ≠ I, each centered at the mean µ. Panel labels: "P(x) decreases" (S = I); "P(x) decreases fast" and "P(x) decreases slow" (S ≠ I).]

Page 23:

Intuitively, for one random variable with mean $\mu$ and standard deviation $\sigma$, the Mahalanobis distance is computed as:

$$D_M(x) = \frac{|x - \mu|}{\sigma}$$

Let's say we have the following samples: 1, 1, 9, 9

What is the mean? What is the variance? What is the standard deviation?

Let's compute the Mahalanobis distance of sample 9:

Mahalanobis Distance Intuitive Example
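The questions on this slide can be worked through in a few lines. This sketch assumes the population variance (dividing by the number of samples); with the sample variance (dividing by n - 1) the numbers would differ.

```python
import math

samples = [1, 1, 9, 9]

mean = sum(samples) / len(samples)                               # 5.0
variance = sum((s - mean) ** 2 for s in samples) / len(samples)  # 16.0
std = math.sqrt(variance)                                        # 4.0

# One-dimensional Mahalanobis distance of the sample 9: |x - mean| / std
distance = abs(9 - mean) / std                                   # 1.0
```

So the sample 9 lies exactly one standard deviation from the mean.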

Page 24:

Collect training examples of each movement from a variety of viewing angles.

Compute statistical descriptions of the MEIs and MHIs using moment-based features. Our choice is the 7 Hu moments.

To recognize an input movement, a Mahalanobis distance is calculated between the moment description of the input and each of the known movements.

The Method
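To give a feel for the moment-based features, the sketch below computes only the first two of the seven Hu invariants from normalized central moments and checks that they do not change when the shape is translated. It is a reduced illustration, not the full seven-moment descriptor used in the method; the function name and the toy images are assumptions.

```python
def hu_first_two(img):
    """Return the first two Hu invariants of a 2D intensity grid (list of rows)."""
    pixels = [(x, y, v) for y, row in enumerate(img) for x, v in enumerate(row)]
    m00 = sum(v for _, _, v in pixels)
    if m00 == 0:
        raise ValueError("image has no mass")
    xc = sum(x * v for x, _, v in pixels) / m00  # centroid x
    yc = sum(y * v for _, y, v in pixels) / m00  # centroid y

    def eta(p, q):
        # Central moment mu_pq, normalized for scale.
        mu = sum((x - xc) ** p * (y - yc) ** q * v for x, y, v in pixels)
        return mu / m00 ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2


# A small L-shaped blob and the same blob shifted by one pixel in x and y.
shape = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
shifted = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0]]

features_a = hu_first_two(shape)
features_b = hu_first_two(shifted)
```

Because the features are unchanged by translation, the same movement performed at a different image position yields the same description, which is what makes the Mahalanobis comparison against stored models meaningful.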

Page 25:
Page 26:
Page 27:

Confusion Example

An example of MHIs with similar statistics. (a) Test input of move 13 at 30°. (b) Closest match, which is move 6 at 0°. (c) Correct match.

Page 28:

Example: Squatting vs. Sitting

Page 29:

For this experiment, two cameras are used, placed such that they have orthogonal views of the subject.

The recognition system now finds the minimum sum of Mahalanobis distances between the two input templates and two stored views of a movement that have the correct angular difference between them, in this case 90°.

Combining Multiple Views
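The two-camera rule can be sketched as a search over stored view pairs that are 90° apart. The dictionaries of precomputed Mahalanobis distances, the movement labels, and the function name are hypothetical and only illustrate the selection step.

```python
def best_movement_two_views(dist_cam1, dist_cam2, angle_gap=90):
    """Pick the (movement, angle) whose two stored views minimize the summed distance.

    dist_camK maps (movement, view_angle) to the Mahalanobis distance between
    camera K's input template and that stored view.
    """
    best = None
    for (move, angle), d1 in dist_cam1.items():
        d2 = dist_cam2.get((move, angle + angle_gap))
        if d2 is None:
            continue  # no stored view with the required angular difference
        total = d1 + d2
        if best is None or total < best[0]:
            best = (total, move, angle)
    return best


# Hypothetical distances: camera 1 alone slightly prefers "squat".
dist_cam1 = {("sit", 0): 1.0, ("squat", 0): 0.8, ("sit", -90): 3.0, ("squat", -90): 2.5}
dist_cam2 = {("sit", 90): 0.5, ("squat", 90): 2.0, ("sit", 0): 4.0, ("squat", 0): 3.0}

best = best_movement_two_views(dist_cam1, dist_cam2)
```

In this made-up example the second, orthogonal view resolves the ambiguity: the summed distance favors "sit" even though the first camera on its own would have chosen "squat".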

Page 30:
Page 31:

During the training phase, we measure the minimum and maximum duration that a movement may take, $T_{\min}$ and $T_{\max}$.

If the test motions are performed at varying speeds, we need to choose the right T for the computation of the MEI and the MHI.

The Algorithm

Page 32:

At each time step, a new MHI $H_T$ is computed setting $T = T_{\max}$, where $T_{\max}$ is the longest time window we want the system to consider.

We choose $\Delta T = (T_{\max} - T_{\min}) / (n - 1)$, where n is the number of temporal integration windows to be considered.

The Algorithm

Page 33:

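A simple thresholding of MHI values (values that do not exceed $\Delta T$ are set to zero) generates $H_{T - \Delta T}$ from $H_T$:

$$H_{T - \Delta T}(x, y, t) = \begin{cases} H_T(x, y, t) - \Delta T & \text{if } H_T(x, y, t) > \Delta T \\ 0 & \text{otherwise} \end{cases}$$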

The Algorithm
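A sketch of this step, assuming the rule from the Bobick and Davis paper: subtract ΔT from every MHI value greater than ΔT and set the rest to zero. The values T = 20 and ΔT = 5 and the three pixel values are chosen for illustration; the function names are assumptions.

```python
def shrink_window(mhi, delta):
    """Derive the MHI for window T - delta from the MHI for window T."""
    return [[h - delta if h > delta else 0 for h in row] for row in mhi]


def normalize(mhi, window):
    """Scale an MHI by 1 / window so that its values lie in [0, 1]."""
    return [[h / window for h in row] for row in mhi]


T, delta_T = 20, 5
h_T = [[20, 4, 10]]                    # MHI computed with the longest window
h_shorter = shrink_window(h_T, delta_T)  # MHI for the window T - delta_T = 15
scaled = normalize(h_shorter, T - delta_T)
```

No frames have to be revisited: each shorter-window MHI is obtained from the longer one with a single pass over the pixels.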

Page 34:

[Figure: thresholding example with T = 20, ∆T = 5, T - ∆T = 15. MHI values 20, 4, 10 in $H_T$ become 15, 0, 5 in $H_{T - \Delta T}$.]

Page 35:

To compute the shape moments, we scale $H_T$ by $1/T$. This scale factor causes all the MHIs to range from 0 to 1 and provides invariance with respect to the speed of the movement. Iterating, we compute all n MHIs. Thresholding the MHIs yields the corresponding MEIs.

The Algorithm

Page 36:

Compute the various scaled MHIs and MEIs.

Compute the Hu moments for each image.

Check the Mahalanobis distance of the MEI parameters against the known view/movement pairs.

Any movement found to be within a threshold distance of the input is tested for agreement of the MHI.

If more than one movement is matched, we select the movement with the smallest distance.

The Algorithm

Page 37:

Motion

Page 38:

Motivation

Page 39:

People can easily track individual players and recognize actions such as running, kicking, jumping etc. This is possible in spite of the fact that the resolution is not high – each player might be, say, just 30 pixels tall.

How do we develop computer programs that can replicate this impressive human ability?

Motivation

Page 40:

Data flow for the algorithm. Starting with a stabilized figure-centric motion sequence, we compute the spatio-temporal motion descriptor centered at each frame. The descriptors are then matched to a database of pre-classified actions using the k-nearest-neighbor framework. The retrieved matches can be used to obtain the correct classification label, as well as other associated information.

Page 41:

Optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer (an eye or a camera) and the scene.

Optical Flow

Page 42:

https://www.youtube.com/watch?v=JlLkkom6tWw

Optical Flow video

Page 43:

Optical Flow

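Constant Brightness Assumption - 2D Case:

$$I(x + u, y + v, t + \delta t) = I(x, y, t)$$

Take the first-order Taylor series expansion of I:

$$I(x + u, y + v, t + \delta t) \approx I(x, y, t) + \frac{dI}{dx} u + \frac{dI}{dy} v + \frac{dI}{dt} \delta t$$

Using the brightness assumption (with $\delta t = 1$, one frame):

$$0 = I_t + I_x u + I_y v$$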

*Taken from optical flow presentation by Hagit Hel-Or

Page 44:
Page 45:

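$$0 = I_t + I_x u + I_y v$$

Optical Flow Equation - Intuition

$$-I_t = I_x u$$

The change in value $I_t$ at a pixel P is dependent on:

The distance moved ($u$).

[Figure: 1D intensity profiles $I(x, t)$ and $I(x, t + \delta t)$ along $x$, marking the displacement $u$, the temporal change $I_t$, and the slope $I_x = \frac{dI}{dx}$.]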

*Taken from optical flow presentation by Hagit Hel-Or

Page 46:

$$I_x u + I_y v = -I_t$$

Optical Flow Equation

$$\nabla I \cdot [u, v] = -I_t$$

Only the component of the flow in the gradient direction can be determined.

The component of the flow parallel to an edge is unknown.

*Taken from optical flow presentation by Hagit Hel-Or

Page 47:

$$I_x u + I_y v = -I_t$$

Optical Flow Equation

Shoot! One equation, two velocity unknowns (u,v)

Solving for u,v:

*Taken from optical flow presentation by Hagit Hel-Or

Page 48:

Spatial coherence

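Impose additional constraints:
◦ Assume the pixel’s neighbors have the same (u,v)

For the N pixels $p_1, p_2, \ldots, p_N$ in the neighborhood:

$$I_x(p_i) u + I_y(p_i) v = -I_t(p_i), \quad i = 1, \ldots, N$$

$$\begin{bmatrix} I_x(p_1) & I_y(p_1) \\ I_x(p_2) & I_y(p_2) \\ \vdots & \vdots \\ I_x(p_N) & I_y(p_N) \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = -\begin{bmatrix} I_t(p_1) \\ I_t(p_2) \\ \vdots \\ I_t(p_N) \end{bmatrix}$$

$$A_{N \times 2} \, x_{2 \times 1} = b_{N \times 1}$$

$$A x = b$$

$$A^T A x = A^T b$$

$$x = (A^T A)^{-1} A^T b$$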

*Taken from optical flow presentation by Hagit Hel-Or

Page 49:

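Lucas-Kanade Scheme

Equivalent to solving least squares:

$$(A^T A) x = A^T b$$

$$A^T A = \begin{bmatrix} \sum I_x I_x & \sum I_x I_y \\ \sum I_x I_y & \sum I_y I_y \end{bmatrix}, \quad A^T b = -\begin{bmatrix} \sum I_x I_t \\ \sum I_y I_t \end{bmatrix}$$

• The summations are over all pixels in the K x K window
• This technique was first proposed by Lucas & Kanade (1981)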

*Taken from optical flow presentation by Hagit Hel-Or
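A minimal sketch of the least-squares solve for a single window, assuming the image derivatives at the window's pixels are already available as flat lists. The function name and the synthetic gradients are assumptions; the 2x2 system is solved in closed form rather than with a linear algebra library.

```python
def lucas_kanade_window(Ix, Iy, It):
    """Solve (A^T A) [u, v]^T = A^T b with rows of A = (Ix, Iy) and b = -It."""
    sxx = sum(a * a for a in Ix)
    sxy = sum(a * b for a, b in zip(Ix, Iy))
    syy = sum(b * b for b in Iy)
    sxt = sum(a * c for a, c in zip(Ix, It))
    syt = sum(b * c for b, c in zip(Iy, It))
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("A^T A is (nearly) singular: flow is not determined")
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v


# Synthetic window: derivatives consistent with a true flow of (1.0, -0.5),
# generated from It = -(Ix * u + Iy * v).
Ix = [1.0, 0.0, 2.0, 1.0]
Iy = [0.0, 1.0, 1.0, -1.0]
It = [-1.0, 0.5, -1.5, -1.5]

u, v = lucas_kanade_window(Ix, Iy, It)
```

With noise-free derivatives the solve recovers the flow exactly; on real images the result is the least-squares compromise over the window.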

Page 50:

When can we solve LK Eq?

Optimal (u, v) satisfies the Lucas-Kanade equation

• $A^T A$ should be invertible
• The eigenvalues of $A^T A$ should not be too small (noise)
• $A^T A$ should be well-conditioned: $\lambda_1 / \lambda_2$ should not be too large ($\lambda_1$ = larger eigenvalue)

*Taken from optical flow presentation by Hagit Hel-Or
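These conditions can be checked from the two eigenvalues of the 2x2 matrix, which have a closed form. The thresholds `min_eig` and `max_ratio` below are arbitrary illustrative values, not taken from the slides, and the gradient lists are synthetic.

```python
import math


def ata_eigenvalues(Ix, Iy):
    """Eigenvalues (larger, smaller) of A^T A built from window gradients."""
    sxx = sum(a * a for a in Ix)
    sxy = sum(a * b for a, b in zip(Ix, Iy))
    syy = sum(b * b for b in Iy)
    half_trace = (sxx + syy) / 2
    det = sxx * syy - sxy * sxy
    disc = math.sqrt(max(0.0, half_trace ** 2 - det))
    return half_trace + disc, half_trace - disc


def is_solvable(Ix, Iy, min_eig=1e-2, max_ratio=100.0):
    """True if both eigenvalues are large enough and their ratio is moderate."""
    l1, l2 = ata_eigenvalues(Ix, Iy)
    return l2 > min_eig and l1 / l2 < max_ratio


flat = is_solvable([0.0] * 4, [0.0] * 4)                          # no gradient
edge = is_solvable([1.0] * 4, [0.0] * 4)                          # one direction only
corner = is_solvable([1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0])  # two directions
```

Only the corner-like window passes, which matches the aperture problem: along an edge, flow parallel to the edge cannot be recovered.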

Page 51:

Hessian Matrix

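$I_x = 0$, $I_y = 0$: $M = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$, non-invertible

$I_x = 0$, $I_y = k$: $M = \begin{bmatrix} 0 & 0 \\ 0 & k^2 \end{bmatrix}$, non-invertible

$I_x = k$, $I_y = 0$: $M = \begin{bmatrix} k^2 & 0 \\ 0 & 0 \end{bmatrix}$, non-invertible

$I_x = k_1$, $I_y = k_2$ ($k_1$, $k_2$ correlated): $R M = \begin{bmatrix} k^2 & 0 \\ 0 & 0 \end{bmatrix}$ ($R$ = rotation), non-invertible

$I_x = k_1$, $I_y = k_2$ ($k_1 * k_2 = 0$): $M = \begin{bmatrix} k_1^2 & 0 \\ 0 & k_2^2 \end{bmatrix}$, invertible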

Page 52:

The Aperture Problem

Different motions – classified as similar

source: Ran Eshel *Taken from optical flow presentation by Hagit Hel-Or

Page 53:

The algorithm starts by computing a figure-centric spatio-temporal volume for each person. Such a representation can be obtained by tracking the human figure and then constructing a window in each frame centered at the figure.

Measuring Motion Similarity

Page 54:

Track each player and recover a stabilized spatiotemporal volume, which is the only data used by the algorithm.

Page 55:

Finding similarity between different motions requires both spatial and temporal information. This leads to the notion of the spatio-temporal motion descriptor, an aggregate set of features sampled in space and time, that describe the motion over a local time period.

Measuring Motion Similarity

Page 56:

The features are based on pixel-wise optical flow as the most natural technique for capturing motion independent of appearance.

We think of the spatial arrangement of optical flow vectors as a template that is to be matched in a robust way.

Measuring Motion Similarity

Page 57:

Given a stabilized figure-centric sequence, we first compute optical flow at each frame using the Lucas-Kanade algorithm.

Computing Motion Descriptors

(a) original image (b) optical flow $F_{x,y}$

Page 58:

Computing Motion Descriptors

(c) Separating the x and y components of optical flow vectors (d) Half-wave rectification of each component to produce 4 separate channels (e) Final blurry motion channels
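Steps (c) and (d) can be sketched as follows: each flow component is split into its positive and negative parts, giving four non-negative channels. The blurring of step (e) is omitted here, and the function name and the tiny flow field are assumptions for illustration.

```python
def half_wave_channels(Fx, Fy):
    """Split flow components into four non-negative channels: Fx+, Fx-, Fy+, Fy-."""
    def positive(img):
        return [[max(v, 0.0) for v in row] for row in img]

    def negative(img):
        return [[max(-v, 0.0) for v in row] for row in img]

    return positive(Fx), negative(Fx), positive(Fy), negative(Fy)


# Hypothetical 1x2 flow field.
Fx = [[1.5, -2.0]]
Fy = [[0.0, 0.5]]
fx_pos, fx_neg, fy_pos, fy_neg = half_wave_channels(Fx, Fy)

# The original component is recovered as the difference of its two channels.
fx_rebuilt = [[p - n for p, n in zip(pr, nr)] for pr, nr in zip(fx_pos, fx_neg)]
```

Keeping the signs in separate channels means that a later blur does not let opposite motions cancel each other out.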

Page 59:

If the four motion channels for frame i of sequence A are $a_1^i, a_2^i, a_3^i, a_4^i$, and similarly $b_1^j, b_2^j, b_3^j, b_4^j$ for frame j of sequence B, then the similarity between motion descriptors centered at frames i and j is:

$$S(i, j) = \sum_{t \in T} \sum_{c=1}^{4} \sum_{x, y \in I} a_c^{i+t}(x, y) \, b_c^{j+t}(x, y)$$

where T and I are the temporal and spatial extents of the motion descriptor respectively.

Computing Motion Descriptors
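A sketch of the similarity computation, assuming it is the sum over the temporal window, the four channels, and all pixels of the product of corresponding channel values, as in the Efros et al. paper listed in the references. The `sequence[frame][channel][row][col]` layout, the function name, and the toy sequences are assumptions.

```python
def frame_similarity(A, B, i, j, half_T):
    """Correlate motion channels of A around frame i with those of B around frame j."""
    if i - half_T < 0 or j - half_T < 0 or i + half_T >= len(A) or j + half_T >= len(B):
        raise ValueError("temporal window does not fit inside the sequences")
    total = 0.0
    for t in range(-half_T, half_T + 1):
        for chan_a, chan_b in zip(A[i + t], B[j + t]):
            for row_a, row_b in zip(chan_a, chan_b):
                total += sum(p * q for p, q in zip(row_a, row_b))
    return total


# Toy 1x2 frames with four channels each (Fx+, Fx-, Fy+, Fy-).
moving_right_up = [[[1.0, 0.0]], [[0.0, 0.0]], [[0.0, 2.0]], [[0.0, 0.0]]]
moving_left_down = [[[0.0, 0.0]], [[1.0, 0.0]], [[0.0, 0.0]], [[0.0, 2.0]]]
A = [moving_right_up] * 3
B = [moving_right_up] * 3
C = [moving_left_down] * 3

same = frame_similarity(A, B, i=1, j=1, half_T=1)
opposite = frame_similarity(A, C, i=1, j=1, half_T=1)
```

Identical motion gives a high score, while opposite motion falls into different rectified channels and contributes nothing.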

Page 60:

To compare two sequences A and B, the similarity computation will need to be done for every frame of A and B.

Computing Motion Descriptors

Page 61:

Ballet: choreographed actions, stationary camera.

Clips of motions were digitized from an instructional video for ballet showing professional dancers, two men and two women, performing mostly standard ballet moves. The motion descriptors were computed with 51 frames of temporal extent.

Classifying Actions - Results

Page 62:

(a) Ballet dataset (24800 frames). Video of the male dancers was used to classify the video of the female dancers and vice versa. Classification used 5-nearest-neighbors. The main diagonal shows the fraction of frames correctly classified for each class and is as follows: [.94 .97 .88 .88 .97 .91 1 .74 .92 .82 .99 .62 .71 .76 .92 .96].
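The 5-nearest-neighbor step can be sketched as a majority vote over the database frames with the highest similarity to the query frame. The similarity scores and action labels below are made up for illustration, and ties are broken by whichever label the counter returns first.

```python
from collections import Counter


def knn_label(similarities, labels, k=5):
    """Majority label among the k database frames most similar to the query."""
    ranked = sorted(range(len(similarities)), key=lambda n: similarities[n], reverse=True)
    votes = Counter(labels[n] for n in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical similarity of one query frame to seven labeled database frames.
similarities = [0.9, 0.8, 0.75, 0.7, 0.6, 0.2, 0.1]
labels = ["jump", "jump", "spin", "jump", "spin", "spin", "spin"]

predicted = knn_label(similarities, labels, k=5)
```

Although "spin" is the more common label in the database, the five closest frames vote 3 to 2 for "jump".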

Page 63:

Tennis: real actions, stationary camera.

For this experiment, footage of two amateur tennis players outdoors was shot. Each player was videotaped on different days in different locations with slightly different camera positions. Motion descriptors were computed with 7 frames of temporal extent.

Classifying Actions - Results

Page 64:

(b) Tennis dataset. The video was sub-sampled by a factor of four, rendering the figures approximately 50 pixels tall. Actions were hand-labeled with six labels. Video of the female tennis player (4610 frames) was used to classify the video of the male player (1805 frames). Classification used 5-nearest-neighbors. The main diagonal is: [.46 .64 .7 .76 .88 .42].

Page 65:

The visual quality of the motion descriptor matching suggests that the method could be used in graphics for action synthesis, creating a novel video sequence of an actor by assembling frames of existing footage.

The ultimate goal would be to collect a large database of, say, Charlie Chaplin footage and then be able to “direct” him in a new movie.

Action Synthesis

Page 66:

Given a “target” actor database T and a “driver” actor sequence D, the goal is to create a synthetic sequence S that contains the actor from T performing actions described by D.

In practice, the synthesized motion sequence S must satisfy two criteria:
◦ The actions in S must match the actions in the “driver” sequence D.
◦ The “target” actor must appear natural when performing the sequence S.

“Do as I Do” Synthesis

Page 67:

“Do as I Do” Action Synthesis. The top row is a sequence of a “driver” actor, the bottom row is the synthesized sequence of the “target” actor (one of the authors) performing the action of the “driver”.

Page 68:

We can also synthesize a novel “target” actor sequence by simply issuing commands, or action labels, instead of using the “driver” actor.

For example, one can imagine a video game where pressing the control buttons will make the real-life actor on the screen move in the appropriate way.

“Do as I Say” Synthesis

Page 69:
Page 70:

Figure Correction

We use the power of our data to correct imperfections in each individual sample. The input frames (top row) are automatically corrected to produce cleaned up figures (bottom row).

Page 71:

The Recognition of Human Movement Using Temporal Templates. Aaron F. Bobick, Member, IEEE Computer Society, and James W. Davis, Member, IEEE Computer Society.

Recognizing Action at a Distance. Alexei A. Efros, Alexander C. Berg, Greg Mori, Jitendra Malik. Computer Science Division, UC Berkeley, Berkeley, CA 94720, USA.

http://en.wikipedia.org/wiki/Mahalanobis_distance

http://en.wikipedia.org/wiki/Optical_flow

Optical flow presentation by Hagit Hel-Or.

References