action recognition (thesis presentation)

Human action recognition using spatio-temporal features

Nikhil Sawant (2007MCS2899)

Guide: Dr. K.K. Biswas


Human activity recognition

[Figure: Pose Estimation, Action Recognition, Action Classification and Activity Recognition, arranged along an axis labeled "Longer Time Scale"]

Courtesy: Y. Ke, Fathi and Mori, Bobick and Davis, Schuldt et al., Leibe et al., Vaswani et al.

Why use action recognition?

• Video surveillance
• Interactive environment
• Video classification & indexing
• Movie search
• Assisted care
• Sports annotation

• Action recognition against a stable background
• Action classification
• Event detection
• Scale-invariant action recognition
• Resistant to changes in view up to a certain degree


• Action recognition in cluttered background
• Action detection invariant to speed


Existing Approaches

• Tracking interest points

• Flow based Approaches

• Shape based Approaches

Tracking interest points

Images courtesy: P. Correa

Tracking 5 crucial points, i.e. the head, 2 hands and 2 feet. These are mostly present at the local maxima on the plot of geodesic distance.

• Use of moving light displays (MLDs) by Johansson in 1973
– Not feasible, as additional constraints are added

• Use of silhouette and geodesic distance by P. Correa
– It is difficult to track all the crucial points all the time
– Occlusion creates problems in tracking
– Complex actions involving occlusion of body parts are difficult to track
– Results depend on the quality of the silhouette

Flow based approaches

• Action recognition is done by making use of the flow generated by motion
– Use of optical flows
– Spatio-temporal features
– Spatio-temporal regularity based features

Shape based Approaches

• Blank et al. showed that an action can be described as a space-time shape
– Use of the Poisson equation for features
– Local space-time saliency
– Action dynamics
– Shape structure and orientation

Images courtesy: M. Blank

Our Approach

• Flow based features + shape based features
• Spatio-temporal features
• Viola-Jones type rectangular features
• Adaboost

• Steps:
– Target localization
– Background subtraction
– Local oriented histogram
– Formation of descriptor
– Use of Adaboost for learning


Optical flow and motion features

Target Localization

• The possible search space is the xyt cube
• The action needs to be localized in space and time
• Target localization helps reduce the search space
• Background subtraction
• ROI marked

[Figure panels: Original video; Silhouette; Original video with ROI marked]
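The localization steps above can be sketched roughly as follows. This is only an illustration, assuming grayscale frames stored as nested lists, a known static background frame and a fixed difference threshold; the function names and the threshold value are hypothetical and not taken from the thesis.

```python
def silhouette(frame, background, threshold=30):
    """Binary foreground mask: 1 where the frame differs from the background."""
    return [
        [1 if abs(p - b) > threshold else 0 for p, b in zip(frow, brow)]
        for frow, brow in zip(frame, background)
    ]


def bounding_box(mask):
    """Smallest rectangle (x_min, y_min, x_max, y_max) holding all foreground pixels."""
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None  # no foreground found
    return min(xs), min(ys), max(xs), max(ys)


# Toy 5 x 6 frame: a bright 2 x 3 "actor" on a dark, static background.
background = [[10] * 6 for _ in range(5)]
frame = [row[:] for row in background]
for y in (1, 2):
    for x in (2, 3, 4):
        frame[y][x] = 200

roi = bounding_box(silhouette(frame, background))
print(roi)  # (2, 1, 4, 2)
```

Restricting all later processing to this box is what shrinks the xyt search space.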

Motion estimation

• We make use of optical flow for motion estimation
• Optical flow is the pattern of relative motion between the object (or object feature points) and the viewer/camera
• Optical flow is used in motion-compensated encoding, object segmentation, etc.
• We use the Lucas–Kanade two-frame differential method
• The OpenCV implementation is used
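The idea behind the Lucas–Kanade two-frame method can be illustrated with a small pure-Python sketch. It is not the OpenCV implementation used in the work: it estimates a single flow vector for one window by least squares, and the synthetic test pattern, window size and function name are assumptions made for the example.

```python
import math


def lucas_kanade_window(img1, img2):
    """Estimate one (vx, vy) flow vector for a whole window by least squares."""
    h, w = len(img1), len(img1[0])
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Central-difference spatial gradients and the temporal difference.
            ix = (img1[y][x + 1] - img1[y][x - 1]) / 2.0
            iy = (img1[y + 1][x] - img1[y - 1][x]) / 2.0
            it = img2[y][x] - img1[y][x]
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-9:
        return 0.0, 0.0  # gradients too uniform to resolve the motion
    # Solve the 2 x 2 normal equations [sxx sxy; sxy syy] v = -[sxt; syt].
    vx = (-syy * sxt + sxy * syt) / det
    vy = (sxy * sxt - sxx * syt) / det
    return vx, vy


def pattern(x, y):
    """Smooth synthetic texture used only for this demonstration."""
    return math.sin(0.2 * x) + math.cos(0.15 * y)


size = 48
dx, dy = 0.5, 0.3  # true sub-pixel shift between the two frames
frame1 = [[pattern(x, y) for x in range(size)] for y in range(size)]
frame2 = [[pattern(x - dx, y - dy) for x in range(size)] for y in range(size)]

vx, vy = lucas_kanade_window(frame1, frame2)
print(round(vx, 2), round(vy, 2))  # close to the true shift (0.5, 0.3)
```

The method assumes small displacements between frames, which is why the example uses a sub-pixel shift.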

Noise removal

• Presence of noisy optical flows
• Noise removal by averaging
• Optical flows with magnitude > C * Omean are ignored, where C is a constant in [1.5, 2] and Omean is the mean of the optical flow within the ROI

[Figure panels: Noisy optical flows; After noise removal]
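The thresholding rule above might look like this in code. The sketch assumes that Omean is the mean flow magnitude over the ROI and that flows are given as (vx, vy) pairs; the function name is hypothetical.

```python
import math


def remove_noisy_flows(flows, c=1.5):
    """Drop flow vectors whose magnitude exceeds c times the mean magnitude."""
    if not flows:
        return []
    magnitudes = [math.hypot(vx, vy) for vx, vy in flows]
    mean_mag = sum(magnitudes) / len(magnitudes)  # Omean (assumed: mean magnitude)
    return [f for f, m in zip(flows, magnitudes) if m <= c * mean_mag]


# Four unit-length flows and one outlier of magnitude 10.
flows = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, -1.0), (6.0, 8.0)]
kept = remove_noisy_flows(flows, c=1.5)
print(kept)  # the outlier (6.0, 8.0) is removed
```

Here the mean magnitude is 2.8, so with C = 1.5 any flow longer than 4.2 px is discarded.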


Organizing optical flow

• Local oriented Histogram

• Weighted averaging

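Organizing optical flow (Local oriented histogram)

• We fix an XDIV x YDIV grid around the ROI

• The optical flow $O_n(u, v)$ is assigned to box $b_{ij}$ if $x_i < u < x_{i+1}$ and $y_j < v < y_{j+1}$

• The effective flow of box $b_{ij}$ is the average of the flows assigned to it:

$$O_{b_{ij}} = \frac{\sum O_n(u, v)}{\sum 1} \quad \text{such that } x_i < u < x_{i+1},\; y_j < v < y_{j+1}$$

for all $i < \text{XDIV}$ and $j < \text{YDIV}$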
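A minimal sketch of this per-box averaging is given below. It assumes flows are supplied as (u, v, ox, oy) tuples and uses half-open box boundaries so that every pixel falls into exactly one box, a small deviation from the strict inequalities on the slide; the function name is hypothetical.

```python
def grid_average_flow(flows, roi, xdiv, ydiv):
    """Average the flow vectors falling in each box of an xdiv x ydiv grid.

    flows: list of (u, v, ox, oy), a flow (ox, oy) observed at pixel (u, v).
    roi:   (x_min, y_min, x_max, y_max) bounding box.
    Returns grid[j][i] = (mean ox, mean oy) for box b_ij; empty boxes give (0.0, 0.0).
    """
    x_min, y_min, x_max, y_max = roi
    box_w = (x_max - x_min) / xdiv
    box_h = (y_max - y_min) / ydiv
    sums = [[[0.0, 0.0, 0] for _ in range(xdiv)] for _ in range(ydiv)]
    for u, v, ox, oy in flows:
        if not (x_min <= u < x_max and y_min <= v < y_max):
            continue  # flow lies outside the ROI
        i = min(int((u - x_min) / box_w), xdiv - 1)
        j = min(int((v - y_min) / box_h), ydiv - 1)
        cell = sums[j][i]
        cell[0] += ox
        cell[1] += oy
        cell[2] += 1
    return [
        [(sx / n, sy / n) if n else (0.0, 0.0) for sx, sy, n in row]
        for row in sums
    ]


flows = [(0.5, 0.5, 1.0, 0.0), (1.5, 1.0, 3.0, 2.0), (3.0, 3.0, -1.0, 0.5)]
grid = grid_average_flow(flows, roi=(0, 0, 4, 4), xdiv=2, ydiv=2)
print(grid[0][0], grid[1][1])  # (2.0, 1.0) (-1.0, 0.5)
```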

Organizing optical flow (Local oriented histogram)

• Membership of the optical flows should be inversely proportional to their distance from the centre







Organizing optical flow (Weighted averaging)

• Oj = (O1, O2, …, Om)

such that for all i ∈ {1, …, N}


Organizing optical flows

Formation of motion descriptor

• Optical flow is represented in xy component form

• The effective optical flow from each box is written in a single row as the vector [Oex00, Oey00, Oex10, Oey10, …]

• Vectors for each action are stored for every training subject

• Adaboost is used to learn the patterns
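The row layout described above can be sketched as a simple flattening of the grid. The example assumes that the two index digits are the box column i followed by the box row j, so that boxes are visited left to right within each grid row; the function name is hypothetical.

```python
def motion_descriptor(grid):
    """Flatten grid[j][i] = (ox, oy) into [Oex00, Oey00, Oex10, Oey10, ...]."""
    descriptor = []
    for row in grid:          # j: box row index
        for ox, oy in row:    # i: box column index
            descriptor.extend((ox, oy))
    return descriptor


grid = [[(2.0, 1.0), (0.0, 0.0)],
        [(0.0, 0.0), (-1.0, 0.5)]]
print(motion_descriptor(grid))  # [2.0, 1.0, 0.0, 0.0, 0.0, 0.0, -1.0, 0.5]
```

One such vector per frame, labeled with its action, would form a training example for the learner.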

Learning with Adaboost

[Figure labels: Strong classifier; Weak classifier; Weight]


Classification Example (taken from Antonio Torralba @MIT)

Weak learners from the family of lines

h => p(error) = 0.5, it is at chance

Each data point has a class label, yt = +1 or -1, and a weight, wt = 1

Classification Example

This one seems to be the best

This is a ‘weak classifier’: it performs slightly better than chance.

Classification Example

We set a new problem for which the previous weak classifier performs at chance again

We update the weights: wt ← wt exp{-yt Ht}

Each data point has a class label, yt = +1 or -1

[Figures: this reweighting step is shown over four consecutive slides]

Classification Example

The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers.

[Figure labels: f1, f2]
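The boosting loop walked through in this example can be sketched as a small discrete AdaBoost with threshold classifiers ("decision stumps") as weak learners. This is a generic textbook version, not the configuration used in the thesis; the toy data and all function names are assumptions.

```python
import math


def train_stump(X, y, w):
    """Best axis-aligned threshold classifier under sample weights w."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for sign in (1, -1):
                preds = [sign if x[f] >= thr else -sign for x in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best


def adaboost(X, y, rounds=3):
    """Discrete AdaBoost: returns a list of (alpha, feature, threshold, sign)."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, f, thr, sign = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, f, thr, sign))
        # Reweight: misclassified points gain weight, correct ones lose it.
        w = [wi * math.exp(-alpha * yi * (sign if x[f] >= thr else -sign))
             for wi, x, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, x):
    """Sign of the weighted vote of all weak classifiers."""
    score = sum(a * (s if x[f] >= t else -s) for a, f, t, s in model)
    return 1 if score >= 0 else -1


# 1-D toy data that no single threshold can separate.
X = [[v] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost(X, y, rounds=3)
print([predict(model, x) for x in X] == y)  # True
```

Each round picks the stump with the lowest weighted error, then rescales the weights so that the chosen stump is back at chance on the reweighted data, which is the step the slides repeat.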

Our Dataset

• Video resolution 320 x 240
• Stable background

Walking        8    34
Running        8    20
Flying         5    25
Waving         5    25
Pick up        6    24
Stand up       6    48
Sitting down   6    24

Our Dataset (Tennis actions)

• Small tennis dataset

Forehand   3   11
Backhand   3   10
Service    2    9

Training and Testing Dataset

• Training and testing data are mutually exclusive
• Training and testing subjects are mutually exclusive
• Frames used for training and testing:

Action         Training   Testing
Walking        1184       1710
Running        183        335
Flying         182        373
Waving         198        317
Pick up        111        160
Stand up       128        187
Sitting down   230        282


Classification result (framewise)

• Overall Error : 12.21 %

Walking Running Flying Waving Pick up Sit down Stand up Error

Walking 1644 46 0 17 1 2 3.86%

Running 35 295 3 2 11.94%

Flying 1 2 349 11 9 1 6.43%

Waving 11 8 269 29 15.14%

Pick up 8 7 1 120 23 1 25%

Sit down 1 1 26 179 14.97%

Stand up 23 282 8.15%
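The error column can be checked from the number of correctly classified frames and the number of test frames per class. The short check below uses the Walking and Running figures from the slides; it also shows that 12.21% coincides with the unweighted mean of the seven per-class errors listed. The helper name is hypothetical.

```python
def class_error(correct, total):
    """Percentage of test frames of one class that were misclassified."""
    return 100.0 * (total - correct) / total


# (correctly classified frames, test frames) for two classes, read from the slides.
walking = class_error(1644, 1710)
running = class_error(295, 335)
print(round(walking, 2), round(running, 2))  # 3.86 11.94

# Unweighted mean of the seven per-class errors listed in the table.
per_class = [3.86, 11.94, 6.43, 15.14, 25.0, 14.97, 8.15]
overall = sum(per_class) / len(per_class)
print(round(overall, 2))  # 12.21
```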


Classification results (clipwise)

• Overall error : 6.94%

Walking Running Waving1 waving2 bending Sit-down Stand-up Error

Walking 10 0.0%

Running 10 0.0%

Waving1 9 1 10.0%

waving2 10 0.0%

bending 9 1 10.0%

Sit-down 10 0.0%

Stand-up 1 9 10.0%


Action classification


Classification results(Tennis events)

• Overall Error : 19.17% (per frame)

Forehand Backhand Service Error

Forehand 54 7 11 21.95%

Backhand 11 53 10.75%

Service 8 49 14.04%

Event Detection

• Confusion at the junction of two actions

• Use of prediction logic

[Figure: timeline around the current frame f, showing the previous n frames (f-n, …, f-1) and the next n frames (f+1, …, f+n)]
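The slides do not spell out the prediction rule. One plausible reading, shown here purely as an assumption, is a majority vote over the per-frame labels of the previous n and next n frames, which suppresses isolated misclassifications near the junction of two actions. The function name and labels are hypothetical.

```python
from collections import Counter


def smooth_labels(labels, n=2):
    """Replace each frame label by the majority label in frames f-n .. f+n."""
    smoothed = []
    for f in range(len(labels)):
        window = labels[max(0, f - n): f + n + 1]
        smoothed.append(Counter(window).most_common(1)[0][0])
    return smoothed


raw = ["walk", "walk", "run", "walk", "walk", "walk",
       "run", "run", "walk", "run", "run", "run"]
smoothed = smooth_labels(raw, n=2)
print(smoothed)  # seven "walk" labels followed by five "run" labels
```

After smoothing, the sequence contains a single transition instead of several spurious ones.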

Event Detection

[Figure panels: Without prediction logic; With prediction logic]

Bend    9    9
Jack    9    9
Jump    9    9
Pjump   9    9
Run     9   10
Side    9    9
Skip    9   10
Walk    9   10
Wave1   9    9
Wave2   9    9

Standard Dataset (Weizmann Dataset)

[Figure: sample frames for the actions Walk, Side, Skip, Wave1, Wave2, Jack, Jump and Pjump]

Confusion matrix (framewise)

Bend Jack Jump Pjump Run Side Skip Walk Wave1 Wave2

Bend 271 1 1 20 3 30 11

Jack 18 368 8 48 3 2 3 9 16

Jump 9 3 157 8 2 26 19 7

Pjump 36 26 237 22 6

Run 4 2 5 158 3 50 6 1 2

Side 11 9 77 1 1 84 3 58 2 1

Skip 3 9 76 43 5 109 24 1 7

Walk 2 5 16 2 13 5 395

Wave1 47 2 12 238 27

Wave2 30 6 1 4 1 55 269

• Overall Error : 29.17% (per frame)

Weizmann dataset

• Smaller resolution (180 x 144); previously 320 x 240

• Weaker motion vectors compared to the previous experiments
– Weizmann: magnitude 0 – 1.75 px
– Earlier experiments: magnitude 0 – 5.5 px

• Background frames were not available, so the poor-quality silhouettes supplied with the dataset were used

Use of motion vectors (MV) + Shape Info (SI)

• MV alone are not enough
• The shape of the person also gives information about the action
• Shape info: the number of foreground pixels in each box

• Error : 23.45%
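Counting foreground pixels per box can be sketched as follows, assuming the silhouette has already been cropped to the ROI and is stored as a binary nested list; the function name and the integer box assignment are illustrative choices.

```python
def shape_info(mask, xdiv, ydiv):
    """Count foreground pixels of a binary ROI mask in each box of an xdiv x ydiv grid."""
    h, w = len(mask), len(mask[0])
    counts = [[0] * xdiv for _ in range(ydiv)]
    for y, row in enumerate(mask):
        for x, v in enumerate(row):
            if v:
                counts[y * ydiv // h][x * xdiv // w] += 1
    return counts


mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
]
counts = shape_info(mask, xdiv=2, ydiv=2)
print(counts)  # [[2, 2], [3, 3]]
```

These counts would be appended to the motion descriptor of the same frame.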

Use of MV + Differential SI

• We calculate differential shape info
• We make use of Viola-Jones rectangular features
• Rectangular features are used at grid level rather than pixel level

• Error : 19.69%
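The exact rectangular features are not listed on the slide. As one possible interpretation, the sketch below forms two-rectangle features in the Viola-Jones sense where each rectangle is a single grid box, i.e. differences of foreground counts between horizontally and vertically adjacent boxes. Both the choice of features and the function name are assumptions.

```python
def differential_shape_info(counts):
    """Differences of foreground counts between adjacent grid boxes.

    counts[j][i] is the number of foreground pixels in box (i, j).
    Returns horizontal differences followed by vertical differences.
    """
    ydiv, xdiv = len(counts), len(counts[0])
    horizontal = [counts[j][i + 1] - counts[j][i]
                  for j in range(ydiv) for i in range(xdiv - 1)]
    vertical = [counts[j + 1][i] - counts[j][i]
                for j in range(ydiv - 1) for i in range(xdiv)]
    return horizontal + vertical


counts = [[2, 2], [3, 3]]
diffs = differential_shape_info(counts)
print(diffs)  # [0, 0, 1, 1]
```

Working on box counts instead of pixels keeps the number of candidate features small.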


confusion matrix (framewise)

Bend Jack Jump Pjump Run Side Skip Walk Wave1 Wave2

Bend 326 7 2 2

Jack 6 418 39 1 3 8

Jump 18 1 189 1 5 4 13

Pjump 11 55 243 6 1 11

Run 2 2 173 2 45 7

Side 8 30 11 1 152 12 33

Skip 1 20 32 83 4 121 13 1 2

Walk 1 1 2 1 1 432

Wave1 43 1 10 10 232 30

Wave2 13 25 328


Spatio-temporal features



Spatio-temporal descriptor

• Volume descriptor in row form
– [Frame1 | Frame2 | Frame3 | Frame4 | Frame5 | …]

• Motion and differential shape information for the volume

• Error : 8.472% (per frame)
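Building the row-form volume descriptor amounts to concatenating consecutive per-frame descriptors. In the sketch below, tlen is the number of frames per volume and tspan is taken to be the step between the frames used; that reading of the parameters, like the function name, is an assumption.

```python
def volume_descriptors(frame_descriptors, tlen, tspan=1):
    """Concatenate tlen frame descriptors, taken every tspan frames, into one row per volume."""
    span = (tlen - 1) * tspan + 1  # number of frames covered by one volume
    rows = []
    for start in range(len(frame_descriptors) - span + 1):
        row = []
        for k in range(tlen):
            row.extend(frame_descriptors[start + k * tspan])
        rows.append(row)
    return rows


frames = [[1, 2], [3, 4], [5, 6], [7, 8]]
volumes = volume_descriptors(frames, tlen=3)
print(volumes)  # [[1, 2, 3, 4, 5, 6], [3, 4, 5, 6, 7, 8]]
```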

Event classification (clipwise)

bend Jack Jump Pjump Run Side Skip Walk Wave1 Wave2 Error

bend 9 0.0%

Jack 9 0.0%

Jump 9 0.0%

Pjump 9 0.0%

Run 9 1 10.0%

Side 9 0.0%

Skip 10 0.0%

Walk 10 0.0%

Wave1 8 1 11.1%

Wave2 9 0.0%

• Error : 2.15%, better than the 12.7% error rate reported by T. Goodhart, “Action recognition using spatio-temporal regularity based feature”, 2008


Action recognition in cluttered background

Cluttered environment

• Background is not stable
• The actor might be occluded
• Slight change in camera location (panning)
• Scale variation
• Speed variation

• Training is done without background subtraction
• The start and end of the action are marked manually in the training videos
• The bounding box around the actor is also marked
• No shape information is added to the training data
• Training is done with a noisy background
• Currently bending and drinking actions are supported


Training data drinking


Training data bending

Template length

• Bending
– Average no. of frames for action: 55
– Variation: 40 – 110
– TLEN: 45 frames

• Drinking
– Average no. of frames for action: 50
– Variation: 35 – 70
– TLEN: 40 frames

Single template formation

• The length of the template is kept constant
• Some of the frames are eliminated

• One action, one template
• Adds robustness to the training
• Speed variation during training is tackled

[Figure: a 15-frame sequence (1 to 15) is reduced to 12 frames by keeping frames 1 3 4 5 6 8 9 10 11 13 14 15, which are then renumbered 1 to 12]
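Reducing a sequence to a fixed template length can be sketched as below. The slides do not say how the eliminated frames are chosen; this example assumes evenly spaced sampling that always keeps the first and last frame, so it may drop different frames than the figure. The function name is hypothetical.

```python
def fixed_length_template(frames, tlen):
    """Keep tlen evenly spaced frames (first and last included) from a longer sequence."""
    n = len(frames)
    if tlen >= n:
        return list(frames)
    if tlen == 1:
        return [frames[0]]
    indices = [round(k * (n - 1) / (tlen - 1)) for k in range(tlen)]
    return [frames[i] for i in indices]


sequence = list(range(1, 16))  # frames 1..15
template = fixed_length_template(sequence, 12)
print(len(template), template[0], template[-1])  # 12 1 15
```

A slowly performed action and a quickly performed one then yield templates of the same length, which is how speed variation in the training data is absorbed.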

Optical flow and Adaboost

• We have constant-length sequences
• Optical flows are calculated
• No shape information is used: the background is cluttered, so background subtraction is not possible
• A spatio-temporal template is formed with TSPAN = 1 and TLEN = length of sequence = constant
• Templates are learned with Adaboost

• An action cuboid is formed with a specific height, width and length
• The cuboid is moved over each and every valid starting location in the video
• A spatio-temporal template is formed for each cuboid location and tested with Adaboost
• An appropriate entry is made in the confidence matrix
• Height, width and length are updated for scale and speed invariance

[Figure: action cuboid; labels: Width, Length]
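The exhaustive scan over valid starting locations can be sketched as a generator. Shapes are given as (width, height, length) with length measured in frames; the step parameter and the function name are assumptions added for illustration.

```python
def cuboid_locations(video_shape, cuboid_shape, step=1):
    """Yield every (x, y, t) start at which the cuboid fits entirely inside the video."""
    width, height, length = video_shape
    cw, ch, cl = cuboid_shape
    for t in range(0, length - cl + 1, step):
        for y in range(0, height - ch + 1, step):
            for x in range(0, width - cw + 1, step):
                yield x, y, t


locations = list(cuboid_locations((8, 6, 10), (4, 3, 5)))
print(len(locations))  # 120
```

The count is (8-4+1) * (6-3+1) * (10-5+1) = 120, which also gives the size of the confidence matrix for one cuboid size. Repeating the scan with other cuboid sizes covers scale and speed changes.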

Confidence matrix

• The confidence matrix is a 3D matrix
• It has an entry for each and every valid location of the cuboid in the video
• It contains the confidence values given by Adaboost over the various iterations
• We expect true positives to be surrounded by a dense fog of large confidence values
• Averaging is done to reduce the effect of false positives
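The effect of averaging can be illustrated with a small box filter over a 3D matrix. The neighbourhood radius, the toy confidence values and the function names are assumptions; the slides do not specify the averaging scheme.

```python
def smooth_confidence(conf, radius=1):
    """Average each entry of a 3D confidence matrix with its neighbours."""
    T, H, W = len(conf), len(conf[0]), len(conf[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                total, count = 0.0, 0
                for tt in range(max(0, t - radius), min(T, t + radius + 1)):
                    for yy in range(max(0, y - radius), min(H, y + radius + 1)):
                        for xx in range(max(0, x - radius), min(W, x + radius + 1)):
                            total += conf[tt][yy][xx]
                            count += 1
                out[t][y][x] = total / count
    return out


def best_location(conf):
    """(t, y, x) of the largest entry."""
    return max(
        ((t, y, x) for t in range(len(conf))
         for y in range(len(conf[0])) for x in range(len(conf[0][0]))),
        key=lambda p: conf[p[0]][p[1]][p[2]],
    )


# 5 x 5 x 5 matrix: a dense cluster of moderate values around (2, 2, 2)
# and one isolated, larger false positive at (0, 0, 0).
conf = [[[0.0] * 5 for _ in range(5)] for _ in range(5)]
for t in (1, 2, 3):
    for y in (1, 2, 3):
        for x in (1, 2, 3):
            conf[t][y][x] = 3.0
conf[0][0][0] = 5.0

print(best_location(conf))                     # (0, 0, 0)
print(best_location(smooth_confidence(conf)))  # (2, 2, 2)
```

Before averaging, the isolated spike wins; after averaging, the centre of the dense cluster has the highest value.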


Key References

• Y. Ke, R. Sukthankar, and M. Hebert, “Spatio-temporal Shape and Flow Correlation for Action Recognition”, in Proc. Visual Surveillance Workshop, 2007.
• P. Viola and M. Jones, “Robust real-time face detection”, in ICCV, volume 20(11), pages 1254-1259, 2001.
• M. Lucena, J.M. Fuertes, and N. P. la Blanca, “Using Optical Flow for Tracking”, Progress in Pattern Recognition, Speech and Image Analysis, volume 2905/2003.
• Y. Ke, R. Sukthankar, and M. Hebert, “Event detection in crowded videos”, in ICCV, 2007.
• F. Niu and M. Abdel-Mottaleb, “View-Invariant Human Activity Recognition Based on Shape and Motion Features”, in Proc. of the IEEE Sixth International Symposium on Multimedia Software Engineering, pp. 546-556, 2004.
• D. M. Gavrila, “The visual analysis of human movement: A survey”, Computer Vision and Image Understanding, 73:82–98, 1999.
• D. M. Gavrila, “A Bayesian, exemplar-based approach to hierarchical shape matching”, IEEE Trans. Pattern Anal. Mach. Intell., 29(8):1408–1421, 2007.
• K. Gaitanis, P. Correa, and B. Macq, “Human Action Recognition using silhouette based feature extraction and Dynamic Bayesian Networks”.
• M. Ahmad and S. Lee, “Human action recognition using shape and CLG-motion flow from multi-view image sequences”, 7th IEEE International Conference on Automatic Face and Gesture Recognition, April 2006.
• I. Haritaoglu, D. Harwood, and L. Davis, “W4: real-time surveillance of people and their activities”, IEEE Transactions on Pattern Analysis and Machine Intelligence 22, pp. 809–830, Aug 2000.
• I. Haritaoglu, D. Harwood, and L. S. Davis, “W4: Who? When? Where? What? A Real-time System for Detecting and Tracking People”, Proc. Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, IEEE Computer Society Press, Los Alamitos, Calif., 1998, pp. 222-227.
• P. Correa, J. Czyz, T. Umeda, F. Marqués, X. Marichal, and B. Macq, “Silhouette-based probabilistic 2D human motion estimation for real time application”, in ICIP 2005.
• Y. Ke, R. Sukthankar, and M. Hebert, “Efficient visual event detection using volumetric features”, in ICCV 2005.