
Older Approaches

Appearance Methods

Shape & Motion

Activity Detection Seminar
Sivan Edri

This capability of the human vision system argues for recognition of movement directly from the motion itself, as opposed to first reconstructing a three-dimensional model of a person and then recognizing the motion of the model.

Motivation

First, I will present the construction of a binary motion-energy image (MEI) which represents where motion has occurred in an image sequence – where there is motion.

Next, we generate a motion-history image (MHI) which is a scalar-valued image where intensity is a function of recency of motion – how the motion is moving.

Taken together, the MEI and MHI can be considered as a two component version of a temporal template, a vector-valued image where each component of each pixel is some function of the motion at that pixel location.

These templates are matched against the stored models of views of known movements.

Motion-Energy Image (MEI)

Example of someone sitting. Top row contains key frames. The bottom row is cumulative motion images starting from Frame 0.

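Let $I(x, y, t)$ be an image sequence and let $D(x, y, t)$ be a binary image sequence indicating regions of motion. For many applications image differencing is adequate to generate $D$.

Then, the binary MEI $E_T(x, y, t)$ is defined

$$E_T(x, y, t) = \bigcup_{i=0}^{T-1} D(x, y, t - i)$$

where the duration $T$ sets the temporal extent of the movement.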



MEIs of sitting movement over a 90° range of viewing angles. The smooth change implies that only a coarse sampling of viewing direction is necessary to recognize the movement from all angles.

Motion-History Image (MHI)

To represent how (as opposed to where) the image motion is moving, we form a motion-history image (MHI). In an MHI $H_T$, pixel intensity is a function of the temporal history of motion at that point.

The result is a scalar-valued image where more recently moving pixels are brighter.
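One simple way to get this behaviour, and the one used in the Bobick and Davis paper listed in the references, is a replacement-and-decay operator. A sketch, assuming $T$ denotes the temporal extent of the template and $D$ is the binary motion image:

```latex
H_T(x, y, t) =
\begin{cases}
T & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\; H_T(x, y, t-1) - 1\bigr) & \text{otherwise}
\end{cases}
```

A pixel that is moving now gets the maximum value $T$; a pixel that has stopped moving fades by one unit per frame until it reaches zero.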


Note that the MEI can be generated by thresholding the MHI above zero.

Given this, one might ask: why not use the MHI alone for recognition?


The computation is recursive. The MHI at time t is computed from the MHI at time t-1 and the current motion image $D(x, y, t)$, and the current MEI is computed by thresholding the MHI. The recursive definition implies that no history of the previous images or their motion fields needs to be stored or manipulated, making the computation both fast and space efficient.
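The recursion can be sketched in a few lines of NumPy. This is a minimal illustration, not the seminar's implementation: the function names, the differencing threshold of 25 and the tiny 1 x 5 test strip are all assumptions made for the example.

```python
import numpy as np

def motion_mask(prev_frame, frame, threshold=25):
    """Binary motion image D by image differencing (threshold is a hypothetical value)."""
    diff = np.abs(frame.astype(np.int32) - prev_frame.astype(np.int32))
    return diff > threshold

def update_mhi(prev_mhi, mask, T):
    """One recursive step: moving pixels are set to T, all others decay by 1 down to 0."""
    decayed = np.maximum(prev_mhi - 1, 0)
    return np.where(mask, T, decayed)

def mei_from_mhi(mhi):
    """The MEI is the MHI thresholded above zero."""
    return mhi > 0

# Toy sequence: one bright pixel moving left to right across a 1 x 5 strip.
frames = [np.zeros((1, 5), dtype=np.uint8) for _ in range(4)]
for i, f in enumerate(frames):
    f[0, i] = 255

T = 3
mhi = np.zeros((1, 5), dtype=np.int32)
for prev, cur in zip(frames, frames[1:]):
    # Only the previous MHI and the current motion image are needed.
    mhi = update_mhi(mhi, motion_mask(prev, cur), T)
mei = mei_from_mhi(mhi)
```

Only `mhi` is carried from frame to frame, which is the space saving the slide refers to. The older positions in the strip end up with smaller values than the recent ones.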

Method Pros

There is no consideration of optic flow, the direction of image motion.

Note the relation between the construction of the MHI and direction of motion. Consider the waving example where the arms fan upwards.

Method Cons

Testing on Aerobics Data: One Camera

To evaluate the power of the temporal template representation, 18 video sequences of aerobic exercises were recorded, performed several times by an experienced aerobics instructor.

Seven views of the movement, from -90° to +90° in 30° increments in the horizontal plane, were recorded.


The only preprocessing done on the data was to reduce the image resolution to 320 x 240 from the captured 640 x 480.

This step had the effect of not only reducing the data set size, but also of providing some limited blurring, which enhances the stability of the global statistics.


The Mahalanobis distance is a measure of the distance between a point P and a distribution D.

It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D.

This distance is zero if P is at the mean of D, and grows as P moves away from the mean.

Mahalanobis Distance

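The Mahalanobis distance of an observation $x = (x_1, x_2, \ldots, x_N)^T$ from a group of observations with mean $\mu = (\mu_1, \mu_2, \ldots, \mu_N)^T$ and covariance matrix S is defined as:

$$D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$$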
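A minimal NumPy sketch of this definition may help. The function name `mahalanobis` and the sample points are assumptions for illustration; the code solves a linear system rather than inverting S explicitly.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Distance of observation x from a distribution with the given mean and covariance S."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    # Solve S z = diff instead of forming the inverse of S.
    z = np.linalg.solve(np.asarray(cov, dtype=float), diff)
    return float(np.sqrt(diff @ z))

# With S = I the distance is the ordinary Euclidean distance.
d_identity = mahalanobis([3.0, 4.0], [0.0, 0.0], np.eye(2))
# With variances 9 and 16 each axis is measured in its own standard deviations.
d_scaled = mahalanobis([3.0, 4.0], [0.0, 0.0], np.diag([9.0, 16.0]))
```

In the second call the point lies exactly one standard deviation from the mean along each axis, so the distance is the square root of 2 rather than 5.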



[Figure: P(x) around the mean µ for S = I and for S ≠ I. For S ≠ I, P(x) decreases fast in one direction and slowly in another.]

Intuitively, for one random variable with mean $\mu$ and standard deviation $\sigma$, the Mahalanobis distance is computed:

$$D(x) = \frac{|x - \mu|}{\sigma}$$

Let's say we have the following samples: 1, 1, 9, 9

What is the mean? What is the variance? What is the standard deviation? Let's compute the Mahalanobis distance of sample 9:
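The questions above can be worked through directly. This sketch assumes the population variance (dividing by N, NumPy's default); using the sample variance (dividing by N - 1) would give slightly different numbers.

```python
import numpy as np

samples = np.array([1.0, 1.0, 9.0, 9.0])

mean = samples.mean()        # (1 + 1 + 9 + 9) / 4 = 5
variance = samples.var()     # population variance: (16 + 16 + 16 + 16) / 4 = 16
std = samples.std()          # sqrt(16) = 4

# One-dimensional Mahalanobis distance: how many standard deviations from the mean.
distance = abs(9.0 - mean) / std   # |9 - 5| / 4 = 1
```

So sample 9 lies exactly one standard deviation from the mean, and by symmetry so does sample 1.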

Mahalanobis Distance Intuitive Example

Collect training examples of each movement from a variety of viewing angles.

Compute statistical descriptions of the MEIs & MHIs using moment-based features.

Our choice is 7 Hu moments.

To recognize an input movement, a Mahalanobis distance is calculated between the moment description of the input and each of the known movements.
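For reference, the seven Hu moments can be computed from normalized central moments in plain NumPy. This is a sketch for illustration; the function name `hu_moments` and the test blob are assumptions, and the input is assumed to be a non-empty grayscale or binary image such as a scaled MHI or an MEI.

```python
import numpy as np

def hu_moments(img):
    """Seven Hu moment invariants of a 2-D image with non-zero total intensity."""
    img = np.asarray(img, dtype=float)
    ys, xs = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    # Coordinates relative to the centroid give translation invariance.
    dx = xs - (xs * img).sum() / m00
    dy = ys - (ys * img).sum() / m00

    def eta(p, q):
        # Normalized central moment: the normalization gives scale invariance.
        return (dx ** p * dy ** q * img).sum() / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03
    c, d = n30 - 3 * n12, 3 * n21 - n03
    return np.array([
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2),
        (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b,
        d * a * (a ** 2 - 3 * b ** 2) - c * b * (3 * a ** 2 - b ** 2),
    ])

# An L-shaped blob and a shifted copy of it give the same feature vector.
blob = np.zeros((20, 20))
blob[3:9, 4:7] = 1.0
blob[7:9, 4:12] = 1.0
shifted = np.roll(blob, shift=(5, 6), axis=(0, 1))
features = hu_moments(blob)
features_shifted = hu_moments(shifted)
```

The resulting 7-vector per image is what the Mahalanobis distance would be computed on, using the mean and covariance of the training examples for each view and movement.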

The Method

Confusion Example

An example of MHIs with similar statistics. (a) Test input of move 13 at 30°. (b) Closest match, which is move 6 at 0°. (c) Correct match.

Example: Squatting vs. Sitting

For this experiment, two cameras are used, placed such that they have orthogonal views of the subject.

The recognition system now finds the minimum sum of Mahalanobis distances between the two input templates and two stored views of a movement that have the correct angular difference between them, in this case 90°.
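The two-camera rule can be sketched as a search over stored view pairs. Everything here is hypothetical: the layout of `models` (movement, then viewing angle, then mean and covariance), the toy 2-D features and the function names are assumptions made for the illustration.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(np.asarray(cov, dtype=float), diff)))

def match_two_views(feat_a, feat_b, models, offset=90):
    """Return (summed distance, movement, angle of camera A) for the best stored pair.

    models: {movement: {angle_in_degrees: (mean, covariance)}}
    Only pairs of stored views that are `offset` degrees apart are considered.
    """
    best = None
    for movement, views in models.items():
        for angle, (mean_a, cov_a) in views.items():
            if angle + offset not in views:
                continue
            mean_b, cov_b = views[angle + offset]
            total = mahalanobis(feat_a, mean_a, cov_a) + mahalanobis(feat_b, mean_b, cov_b)
            if best is None or total < best[0]:
                best = (total, movement, angle)
    return best

# Toy models: the two movements look alike at 0 degrees but differ at 90 degrees.
eye = np.eye(2)
models = {
    "sit":   {0: ([0.0, 0.0], eye), 90: ([1.0, 1.0], eye)},
    "squat": {0: ([0.0, 0.2], eye), 90: ([5.0, 5.0], eye)},
}
best = match_two_views([0.0, 0.1], [1.0, 1.0], models)
```

In this toy setup the first camera alone cannot separate the two movements (both are 0.1 away), while the summed distance clearly favours one of them.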

Combining Multiple Views

During the training phase, we measure the minimum and maximum duration that a movement may take, $T_{min}$ and $T_{max}$.

If the test motions are performed at varying speeds, we need to choose the right T for the computation of the MEI and the MHI.

The Algorithm

At each time step, a new MHI $H_T(x, y, t)$ is computed setting $T = T_{max}$, where $T_{max}$ is the longest time window we want the system to consider.

We choose $\Delta T = (T_{max} - T_{min}) / (n - 1)$, where n is the number of temporal integration windows to be considered.

The Algorithm

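A simple thresholding of MHI values generates $H_{T - \Delta T}$ from $H_T$: values of at most $\Delta T$ are set to zero and all other values are reduced by $\Delta T$:

$$H_{T - \Delta T}(x, y, t) = \begin{cases} H_T(x, y, t) - \Delta T & \text{if } H_T(x, y, t) > \Delta T \\ 0 & \text{otherwise} \end{cases}$$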

The Algorithm

Example: with $T = 20$ and $\Delta T = 5$ (so $T - \Delta T = 15$), the $H_T$ values 20, 4, 10 become the $H_{T - \Delta T}$ values 15, 0, 5.

To compute the shape moments, we scale $H_T$ by $1/T$. This scale factor causes all the MHIs to range from 0 to 1 and provides invariance with respect to the speed of the movement. Iterating, we compute all n MHIs; thresholding the MHIs yields the corresponding MEIs.
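The iteration over time windows can be sketched as follows. The function names are assumptions, and the sketch requires n of at least 2 so that the step between windows is defined.

```python
import numpy as np

def shorter_window_mhi(mhi, delta_t):
    """MHI for a window shorter by delta_t: subtract delta_t where possible, else zero."""
    return np.where(mhi > delta_t, mhi - delta_t, 0)

def scaled_templates(mhi_max, t_max, t_min, n):
    """List of (T, MHI scaled to [0, 1], MEI) for n windows from t_max down to t_min."""
    delta_t = (t_max - t_min) / (n - 1)
    templates = []
    mhi, t = mhi_max.astype(float), float(t_max)
    for _ in range(n):
        # Scaling by 1/T makes the templates comparable across movement speeds.
        templates.append((t, mhi / t, mhi > 0))
        mhi = shorter_window_mhi(mhi, delta_t)
        t -= delta_t
    return templates

# Three pixels with MHI values 20, 4 and 10, computed with the longest window T = 20.
mhi_20 = np.array([[20, 4, 10]])
mhi_15 = shorter_window_mhi(mhi_20, 5)
templates = scaled_templates(mhi_20, t_max=20, t_min=10, n=3)
```

Each shorter window is derived from the previous one, so the MHI only has to be accumulated once per time step, for the longest window.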

The Algorithm

• Compute the various scaled MHIs and MEIs.
• Compute the Hu moments for each image.
• Check the Mahalanobis distance of the MEI parameters against the known view/movement pairs.

Any movement found to be within a threshold distance of the input is tested for agreement of the MHI. If more than one movement is matched, we select the movement with the smallest distance.
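The decision rule can be written as a small function. This is a sketch under assumptions: the distances are taken as already computed, the two thresholds are hypothetical parameters, and ties between several agreeing movements are broken by the smallest MHI distance, which is one reading of "smallest distance".

```python
def recognize(mei_dists, mhi_dists, mei_thresh, mhi_thresh):
    """Pick a movement label, or None if nothing matches.

    mei_dists, mhi_dists: {label: Mahalanobis distance of the input to that model}
    """
    # Step 1: keep movements whose MEI description is close enough to the input.
    candidates = [label for label, d in mei_dists.items() if d <= mei_thresh]
    # Step 2: require the MHI description to agree as well.
    agreed = [label for label in candidates if mhi_dists[label] <= mhi_thresh]
    if not agreed:
        return None
    # Step 3: several matches, so take the closest one (assumed: by MHI distance).
    return min(agreed, key=lambda label: mhi_dists[label])

mei_dists = {"sit": 1.0, "squat": 1.5, "wave": 9.0}
mhi_dists = {"sit": 2.0, "squat": 0.8, "wave": 0.1}
label = recognize(mei_dists, mhi_dists, mei_thresh=2.0, mhi_thresh=2.5)
no_match = recognize(mei_dists, mhi_dists, mei_thresh=0.5, mhi_thresh=0.5)
```

Note that "wave" has the smallest MHI distance but is never considered, because its MEI fails the first test.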

The Algorithm

Motion

Motivation

People can easily track individual players and recognize actions such as running, kicking, jumping etc. This is possible in spite of the fact that the resolution is not high – each player might be, say, just 30 pixels tall.

How do we develop computer programs that can replicate this impressive human ability?

Motivation

Data flow for the algorithm. Starting with a stabilized figure-centric motion sequence, we compute the spatio-temporal motion descriptor centered at each frame. The descriptors are then matched to a database of pre-classified actions using the k-nearest-neighbor framework. The retrieved matches can be used to obtain the correct classification label, as well as other associated information.

Optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer (an eye or a camera) and the scene.

Optical Flow

https://www.youtube.com/watch?v=JlLkkom6tWw

Optical Flow video



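Optical Flow

Constant Brightness Assumption - 2D Case:

$$I(x, y, t) = I(x + u,\; y + v,\; t + \Delta t)$$

Take the Taylor series expansion of I:

$$I(x + u,\; y + v,\; t + \Delta t) \approx I(x, y, t) + \frac{\partial I}{\partial x} u + \frac{\partial I}{\partial y} v + \frac{\partial I}{\partial t} \Delta t$$

Using the brightness assumption, with $\Delta t$ equal to one frame:

$$I_x u + I_y v + I_t = 0$$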

*Taken from optical flow presentation by Hagit Hel-Or

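$$I_x u + I_y v + I_t = 0$$

[Figure: 1-D intensity profiles $I(x, t)$ and $I(x, t + \Delta t)$, displaced by $u$, with temporal difference $I_t$.]

Optical Flow Equation - Intuition

$$I_t = -I_x u$$

The change in value $I_t$ at a pixel P is dependent on:

The distance moved ($u$).

The spatial derivative $I_x = \partial I / \partial x$.

Optical Flow Equation

$$I_x u + I_y v = -I_t$$

$$\nabla I \cdot [u, v] = -I_t$$

Only the component of the flow in the gradient direction can be determined.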

The component of the flow parallel to an edge is unknown.


Optical Flow Equation

$$I_x u + I_y v = -I_t$$

Shoot! One equation, two velocity unknowns (u,v)

Solving for u,v:


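Spatial coherence

Impose additional constraints:
◦ Assume the pixel’s neighbors have the same (u,v)

Each of the N pixels $p_1, p_2, \ldots, p_N$ in the window contributes one equation $I_x(p_i)\,u + I_y(p_i)\,v = -I_t(p_i)$:

$$\underbrace{\begin{bmatrix} I_x(p_1) & I_y(p_1) \\ I_x(p_2) & I_y(p_2) \\ \vdots & \vdots \\ I_x(p_N) & I_y(p_N) \end{bmatrix}}_{A \;(N \times 2)} \underbrace{\begin{bmatrix} u \\ v \end{bmatrix}}_{x \;(2 \times 1)} = \underbrace{-\begin{bmatrix} I_t(p_1) \\ I_t(p_2) \\ \vdots \\ I_t(p_N) \end{bmatrix}}_{b \;(N \times 1)}$$

$$A x = b$$

$$A^T A x = A^T b$$

$$x = (A^T A)^{-1} A^T b$$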

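Lucas-Kanade Scheme

Equivalent to solving least squares:

$$(A^T A)\, x = A^T b$$

$$\begin{bmatrix} \sum I_x I_x & \sum I_x I_y \\ \sum I_x I_y & \sum I_y I_y \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = -\begin{bmatrix} \sum I_x I_t \\ \sum I_y I_t \end{bmatrix}$$

• The summations are over all pixels in the K x K window
• This technique was first proposed by Lucas & Kanade (1981)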
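For a single window, the least-squares solve can be sketched in NumPy. The function name and the synthetic derivative patches are assumptions for illustration; in practice $I_x$, $I_y$ and $I_t$ would come from finite differences on two consecutive frames.

```python
import numpy as np

def lucas_kanade_window(Ix, Iy, It):
    """Least-squares flow (u, v) for one window, given derivative patches of equal shape."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # N x 2, one row per pixel
    b = -It.ravel()                                   # N
    # Normal equations: (A^T A) x = A^T b
    return np.linalg.solve(A.T @ A, A.T @ b)

# Synthetic 5 x 5 window whose temporal derivative is consistent with a known flow.
rng = np.random.default_rng(0)
Ix = rng.normal(size=(5, 5))
Iy = rng.normal(size=(5, 5))
true_u, true_v = 0.5, -0.25
It = -(Ix * true_u + Iy * true_v)   # brightness constancy: Ix*u + Iy*v + It = 0

flow = lucas_kanade_window(Ix, Iy, It)
```

Because the synthetic data satisfies the constraint exactly, the recovered flow equals the true one up to rounding; with real images the solution is only the best fit over the window.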


When can we solve LK Eq?

Optimal (u, v) satisfies Lucas-Kanade equation

• $A^T A$ should be invertible
• The eigenvalues of $A^T A$ should not be too small (noise)
• $A^T A$ should be well-conditioned: $\lambda_1 / \lambda_2$ should not be too large ($\lambda_1$ = larger eigenvalue)
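These conditions translate into a quick eigenvalue test on the 2 x 2 matrix. A sketch, where the two thresholds `min_eig` and `max_ratio` are hypothetical values chosen only for the example:

```python
import numpy as np

def lk_is_solvable(Ix, Iy, min_eig=1e-2, max_ratio=100.0):
    """True if A^T A has no tiny eigenvalue and is reasonably well-conditioned."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    lam_small, lam_large = np.linalg.eigvalsh(A.T @ A)
    return bool(lam_small > min_eig and lam_large / lam_small < max_ratio)

flat = lk_is_solvable(np.zeros((4, 4)), np.zeros((4, 4)))       # no gradient at all
edge = lk_is_solvable(np.ones((4, 4)), np.zeros((4, 4)))        # gradient in x only
textured = lk_is_solvable(np.tile([1.0, 0.0], (4, 2)),          # gradients in both
                          np.tile([0.0, 1.0], (4, 2)))          # directions
```

A flat patch and a straight edge both fail the test, which is the aperture problem in matrix form; only the patch with gradients in two directions passes.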


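Hessian Matrix

$$M = \begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix}$$

• $I_x = 0$, $I_y = 0$: $M = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$, non-invertible
• $I_x = 0$, $I_y = k$: $M = \begin{bmatrix} 0 & 0 \\ 0 & k^2 \end{bmatrix}$, non-invertible
• $I_x = k$, $I_y = 0$: $M = \begin{bmatrix} k^2 & 0 \\ 0 & 0 \end{bmatrix}$, non-invertible
• $I_x = k_1$, $I_y = k_2$ ($k_1$, $k_2$ correlated): $M$ is a rotation ($R$ = rotation) of $\begin{bmatrix} k^2 & 0 \\ 0 & 0 \end{bmatrix}$, non-invertible
• $I_x = k_1$, $I_y = k_2$ ($k_1 * k_2 = 0$): $M = \begin{bmatrix} k_1^2 & 0 \\ 0 & k_2^2 \end{bmatrix}$, invertible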

The Aperture Problem

Different motions – classified as similar

source: Ran Eshel *Taken from optical flow presentation by Hagit Hel-Or

The algorithm starts by computing a figure-centric spatio-temporal volume for each person. Such a representation can be obtained by tracking the human figure and then constructing a window in each frame centered at the figure.

Measuring Motion Similarity

Track each player and recover a stabilized spatiotemporal volume, which is the only data used by the algorithm.

Finding similarity between different motions requires both spatial and temporal information. This leads to the notion of the spatio-temporal motion descriptor, an aggregate set of features sampled in space and time, that describe the motion over a local time period.


The features are based on pixel-wise optical flow as the most natural technique for capturing motion independent of appearance.

We think of the spatial arrangement of optical flow vectors as a template that is to be matched in a robust way.


Given a stabilized figure-centric sequence, we first compute optical flow at each frame using the Lucas-Kanade algorithm.

Computing Motion Descriptors

(a) original image (b) optical flow $F_{x,y}$


(c) Separating the x and y components of optical flow vectors (d) Half-wave rectification of each component to produce 4 separate channels (e) Final blurry motion channels
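Steps (c) to (e) can be sketched in NumPy. This is an illustration under assumptions: the blur is a small separable Gaussian written by hand, `sigma` is a hypothetical setting, and each image side is assumed to be at least as long as the blur kernel (7 taps for `sigma=1.0`).

```python
import numpy as np

def gaussian_kernel(sigma):
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def blur(img, sigma=1.0):
    """Separable Gaussian blur: filter every row, then every column."""
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def motion_channels(fx, fy, sigma=1.0):
    """Half-wave rectify the flow into 4 non-negative channels, then blur each one."""
    rectified = [np.maximum(fx, 0), np.maximum(-fx, 0),
                 np.maximum(fy, 0), np.maximum(-fy, 0)]
    return np.stack([blur(c, sigma) for c in rectified])

# Toy flow field: one pixel moving right, one moving left, no vertical motion.
fx = np.zeros((9, 9))
fx[4, 4] = 2.0
fx[2, 2] = -1.0
fy = np.zeros((9, 9))
channels = motion_channels(fx, fy)
```

Rectifying before blurring keeps opposite motions in separate channels, so a rightward and a leftward vector that are close together do not cancel each other out when smoothed.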

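If the four motion channels for frame i of sequence A are $a_1^i$, $a_2^i$, $a_3^i$, $a_4^i$, and similarly $b_1^j$, $b_2^j$, $b_3^j$, $b_4^j$ for frame j of sequence B, then the similarity between motion descriptors centered at frames i and j is:

$$S(i, j) = \sum_{t \in T} \sum_{c=1}^{4} \sum_{x, y \in I} a_c^{i+t}(x, y)\, b_c^{j+t}(x, y)$$

where T and I are the temporal and spatial extents of the motion descriptor respectively.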


To compare two sequences A and B, the similarity computation will need to be done for every frame of A and B.
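One way to organize this computation is to build a frame-to-frame similarity matrix first and then sum it along diagonals over the temporal extent. The sketch below assumes the blurred channels are stored as arrays of shape (frames, 4, height, width); the function names and the truncation of the window at the sequence borders are choices made for the example.

```python
import numpy as np

def frame_similarity(A, B):
    """Similarity of every frame of A to every frame of B, summed over channels and pixels."""
    return A.reshape(len(A), -1) @ B.reshape(len(B), -1).T

def motion_similarity(A, B, half_T):
    """Add up frame similarities for temporal offsets t in [-half_T, half_T]."""
    s_ff = frame_similarity(A, B)
    n_a, n_b = s_ff.shape
    s = np.zeros_like(s_ff)
    for t in range(-half_T, half_T + 1):
        # Keep only (i, j) for which frames i + t and j + t exist in both sequences.
        i0, i1 = max(0, -t), min(n_a, n_a - t)
        j0, j1 = max(0, -t), min(n_b, n_b - t)
        s[i0:i1, j0:j1] += s_ff[i0 + t:i1 + t, j0 + t:j1 + t]
    return s

# Toy data: B is sequence A with its first two frames dropped.
rng = np.random.default_rng(1)
A = rng.random((6, 4, 8, 8))
B = A[2:].copy()
S = motion_similarity(A, B, half_T=1)
best_for_frame_3 = int(np.argmax(S[3]))
```

Frame 3 of A is frame 1 of B in this toy data, and the descriptor similarity is largest there. Each row of `S` can then be passed to a nearest-neighbor search over the labeled frames.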


Ballet: choreographed actions, stationary camera.

Clips of motions were digitized from an instructional video for ballet showing professional dancers, two men and two women, performing mostly standard ballet moves. The motion descriptors were computed with 51 frames of temporal extent.

Classifying Actions - Results

(a) Ballet dataset (24800 frames). Video of the male dancers was used to classify the video of the female dancers and vice versa. Classification used 5-nearest-neighbors. The main diagonal shows the fraction of frames correctly classified for each class and is as follows: [.94 .97 .88 .88 .97 .91 1 .74 .92 .82 .99 .62 .71 .76 .92 .96].

Tennis: real actions, stationary camera.

For this experiment, footage of two amateur tennis players outdoors was shot. Each player was videotaped on different days in different locations with slightly different camera positions. Motion descriptors were computed with 7 frames of temporal extent.

Classifying Actions - Results

(b) Tennis dataset. The video was sub-sampled by a factor of four, rendering the figures approximately 50 pixels tall. Actions were hand-labeled with six labels. Video of the female tennis player (4610 frames) was used to classify the video of the male player (1805 frames). Classification used 5-nearest-neighbors. The main diagonal is: [.46 .64 .7 .76 .88 .42].

The visual quality of the motion descriptor matching suggests that the method could be used in graphics for action synthesis, creating a novel video sequence of an actor by assembling frames of existing footage.

The ultimate goal would be to collect a large database of, say, Charlie Chaplin footage and then be able to “direct” him in a new movie.

Action Synthesis

Given a “target” actor database T, and a “driver” actor sequence D, the goal is to create a synthetic sequence S, that contains the actor from T performing actions described by D.

In practice, the synthesized motion sequence S must satisfy two criteria:
◦ The actions in S must match the actions in the “driver” sequence D.
◦ The “target” actor must appear natural when performing the sequence S.

“Do as I Do” Synthesis

“Do as I Do” Action Synthesis. The top row is a sequence of a “driver” actor, the bottom row is the synthesized sequence of the “target” actor (one of the authors) performing the action of the “driver”.

We can also synthesize a novel “target” actor sequence by simply issuing commands, or action labels, instead of using the “driver” actor.

For example, one can imagine a video game where pressing the control buttons will make the real-life actor on the screen move in the appropriate way.

“Do as I Say” Synthesis

Figure Correction

We use the power of our data to correct imperfections in each individual sample. The input frames (top row) are automatically corrected to produce cleaned up figures (bottom row).

References

The Recognition of Human Movement Using Temporal Templates. Aaron F. Bobick, Member, IEEE Computer Society, and James W. Davis, Member, IEEE Computer Society.

Recognizing Action at a Distance. Alexei A. Efros, Alexander C. Berg, Greg Mori, Jitendra Malik. Computer Science Division, UC Berkeley, Berkeley, CA 94720, USA.

http://en.wikipedia.org/wiki/Mahalanobis_distance

http://en.wikipedia.org/wiki/Optical_flow

Optical flow presentation by Hagit Hel-Or.
