
Page 1: Human Action Recognition

Hong Lin

Univ. of Texas at San Antonio

Page 2: Outline

Research Background

Method

Experiment

Current work

Page 3: Research Background

Human action recognition:

– automatically analyzes ongoing activities in an unknown video;

– essential for visual surveillance, human-computer interaction, video retrieval, etc.

Two categories of methods:

– Single-view: must cope with high variation in appearance and shape, and with potential occlusions;

– Multi-view: faces difficulties in discovering correlations among the multiple views.

Page 4: Research Background

Roughly, we divide activity recognition techniques under a single view into two categories:

– Model-based methods [1][2] rely on human body tracking or pose estimation in order to model the dynamics of individual body parts for action recognition.

– Appearance-based methods [3][4] employ appearance features for action recognition:
1. global space-time shape templates;
2. local spatiotemporal interest points.

[1] C. Fanti, L. Zelnik-Manor, and P. Perona, "Hybrid models for human motion recognition," in Proc. IEEE CVPR, pp. 1166–1173, Jun. 2005.
[2] A. Yilmaz, "Recognizing human actions in videos acquired by uncalibrated moving cameras," in Proc. IEEE ICCV, pp. 150–157, Oct. 2005.
[3] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 12, pp. 2247–2253, Dec. 2007.
[4] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: A local SVM approach," in Proc. ICPR, pp. 32–36, Aug. 2004.

Page 5: Research Background: Single-view

Local space-time feature-based methods:

– Advantages:

• capture local salient characteristics of appearance and motion;

• robust to spatiotemporal shifts and scales, background clutter, and multiple motions.

– Framework of local space-time feature-based methods:

Local space-time feature extraction:

1. Detector: select spatio-temporal interest points in a video by maximizing specific saliency functions.

2. Descriptor: capture shape and motion in the neighborhoods of the selected points using image measurements.

Page 6: Research Background: Single-view

References:
[5] A. Klaser, M. Marszalek, and C. Schmid, "A spatio-temporal descriptor based on 3D-gradients," in Proc. BMVC, 2008.
[6] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid, "Evaluation of local spatio-temporal features for action recognition," in Proc. BMVC, 2009.

BoW+SVM framework:

• Bag-of-Words (BoW) [5]:

Space-time interest point features are quantized into visual words;

A video is then represented as the frequency histogram over the visual words.

• SVM classification for modeling and recognition.

• Wang et al. [6] gave a comprehensive evaluation of the popular local feature detectors and descriptors within the standard BoW+SVM framework.
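
To make the pipeline concrete, here is a minimal sketch in Python using scikit-learn. It is an illustration, not the authors' code: local feature extraction (Harris3D detection, HoG/HoF description) is stubbed out with random arrays, and the chi-squared kernel is an assumption based on common practice for BoW histograms.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

# Stub: pretend each video yielded a (n_points, 162)-dim set of local
# descriptors (162 = combined HoG/HoF dimensionality around Harris3D points).
rng = np.random.default_rng(0)
train_feats = [np.abs(rng.normal(size=(50, 162))) for _ in range(20)]
train_labels = rng.integers(0, 6, size=20)       # 6 action classes, as in KTH

# 1) Vocabulary: quantize all local descriptors into k visual words.
k = 100                                          # codebook size (100-D BoW)
codebook = KMeans(n_clusters=k, n_init=4, random_state=0)
codebook.fit(np.vstack(train_feats))

def bow_histogram(descriptors):
    """One video -> normalized frequency histogram over the k visual words."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()

X_train = np.array([bow_histogram(f) for f in train_feats])

# 2) SVM on the histograms, with a chi-squared kernel (common for BoW).
K_train = chi2_kernel(X_train)
clf = SVC(kernel="precomputed").fit(K_train, train_labels)

# Test time: K_test = chi2_kernel(X_test, X_train); clf.predict(K_test)
```

The codebook size k here corresponds to the BoW dimensionality discussed later in the experiments (100-D vs. 4000-D).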

Page 7: Method

Motivation: exploit information about the structure of the human body.

Page 8: Method

Partwise BoW + Graph-based Multi-task Learning

– Partwise BoW representation: discovers the information about human body structure.

– Multi-task learning: discovers the latent correlations among part-wise visual features.

Single-task classification

Part-induced multi-task classification

Page 9: Method: Partwise BoW + Graph-based MTL

Partwise bag-of-words (PBoW) representation:

– Local space-time feature extraction: Harris3D detector, HoG/HoF descriptors;

– Body part localization: part model, skeleton information;

– PBoW generation, with 7 components:

Level 0: limb-wise BoW, head-wise BoW, leg-wise BoW, foot-wise BoW;

Level 1: upper-body-wise BoW, lower-body-wise BoW;

Level 2: full-body-wise BoW.

(Figure: part model and skeleton.)
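
A minimal sketch of how the 7 PBoW components could be assembled once each local feature has a visual-word index and a body-part assignment. The Level-1 grouping below (which Level-0 parts form the upper and lower body) is an assumption for illustration; the transcript does not spell it out.

```python
import numpy as np

K = 100  # per-part codebook size, matching the 100-D BoW in the experiments

# Hypothetical input: for each local space-time feature in a video we assume
# we already have its visual-word index and the body part it was localized to.
PARTS = ["limb", "head", "leg", "foot"]              # Level 0
LEVEL1 = {"upper_body": ["limb", "head"],             # assumed grouping
          "lower_body": ["leg", "foot"]}              # (not given in the slides)

def part_hist(word_ids):
    """Normalized word-frequency histogram over K visual words."""
    h = np.bincount(np.asarray(word_ids, dtype=int), minlength=K).astype(float)
    return h / max(h.sum(), 1.0)

def pbow(features):
    """features: list of (word_id, part_name) pairs for one video.
    Returns the 7 PBoW components, each a K-dim histogram."""
    by_part = {p: [w for w, q in features if q == p] for p in PARTS}
    comp = {p: part_hist(ws) for p, ws in by_part.items()}        # Level 0: 4 parts
    for name, group in LEVEL1.items():                            # Level 1: 2 halves
        comp[name] = part_hist(sum((by_part[p] for p in group), []))
    comp["full_body"] = part_hist([w for w, _ in features])       # Level 2: whole body
    return comp

# Example: video = [(3, "head"), (17, "leg"), (3, "head")]; pbow(video)["head"][3] == 1.0
```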

Page 10: Method: Partwise BoW + Graph-based MTL

Graph-based Multi-task Learning (GMTL)

– Objective: convert individual BoW-based single-task learning into joint multi-task learning over the multiple components of the PBoW.

– Formulation: encode the reasonable latent relatedness between part-wise features.
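
The formulation itself appeared as an equation on the slide and is not preserved in this transcript. A typical graph-regularized MTL objective consistent with the description above (an assumed reconstruction, not necessarily the authors' exact formulation) is:

```latex
\min_{W}\ \sum_{t=1}^{T}\sum_{i=1}^{n}
\ell\!\left(y_i,\ \mathbf{w}_t^{\top}\mathbf{x}_i^{(t)}\right)
\ +\ \lambda_1 \sum_{t=1}^{T}\lVert \mathbf{w}_t\rVert_2^2
\ +\ \lambda_2 \sum_{(s,t)\in E}\lVert \mathbf{w}_s-\mathbf{w}_t\rVert_2^2
```

where each task t is one of the T = 7 PBoW components, x_i^{(t)} is video i's histogram for that component, and E is the edge set of the body-structure graph; the last term pulls the classifiers of connected parts toward each other, encoding their relatedness.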

Page 11: Experiment Results

Evaluation on KTH

– KTH dataset:

• 6 kinds of actions (walking, jogging, running, boxing, hand waving, hand clapping);

• each of the 6 actions was performed four times by 25 subjects in 4 different scenarios;

• all videos were taken with a static camera at a 25 fps frame rate; the sequences were downsampled to a spatial resolution of 160x120 pixels and have an average length of four seconds.

Page 12: Experiment Results

Evaluation on KTH

– Baseline: BoW+SVM. We implemented the standard BoW+SVM framework (χ² kernel) on KTH.

The best accuracy, 91.0%, was obtained with a 4000-D codebook; the worst results were obtained with the 100-D codebook.

Reference: [7] An-An Liu, Yuting Su, Hong Lin, et al., "Single/Multi-view Human Action Recognition via Regularized Multi-Task Learning," Neurocomputing, 2013 (under review).


Page 13: Experiment Results

Evaluation on MV-TJU

– Further, we implemented the BoW+SVM framework (χ² kernel) with the individual part-wise BoW features.

The results show that we can achieve competitive performance (89.0%) with only a 100-D part-wise BoW, against the best result (91.0%) obtained by the 4000-D feature in the standard BoW+SVM framework.

Reference: [7] An-An Liu, Yuting Su, Hong Lin, et al., "Single/Multi-view Human Action Recognition via Regularized Multi-Task Learning," Neurocomputing, 2013 (under review).


Page 14: Experiment Results

Evaluation on KTH

– Performance by PBoW (100-D) + GMTL

Based on the human body structure, we implemented seven kinds of graph structures to formulate the 3-level part-wise BoW features into one multi-task learning problem, hoping to encode reasonable latent relatedness between part-wise features; a toy solver sketch follows.
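
The seven graph structures themselves (only R6 is named in the transcript) are not specified here, so the tree-shaped graph in the sketch below is an assumption for illustration. The solver is a generic gradient-descent minimizer of the graph-regularized objective shown earlier, using a squared loss; it is not the authors' implementation.

```python
import numpy as np

# Tasks = the 7 PBoW components; edges couple related parts. This tree is an
# illustrative assumption; the slides' seven structures (e.g., R6) may differ.
TASKS = ["limb", "head", "leg", "foot", "upper_body", "lower_body", "full_body"]
EDGES = [("limb", "upper_body"), ("head", "upper_body"),
         ("leg", "lower_body"), ("foot", "lower_body"),
         ("upper_body", "full_body"), ("lower_body", "full_body")]

def fit_gmtl(X, y, lam1=0.1, lam2=1.0, lr=0.01, iters=500):
    """X: dict task -> (n, d) PBoW feature matrix; y: (n,) labels in {-1, +1}.
    Gradient descent on: sum_t squared loss + lam1 * ||w_t||^2
                         + lam2 * sum_{(s,u) in EDGES} ||w_s - w_u||^2."""
    n, d = next(iter(X.values())).shape
    W = {t: np.zeros(d) for t in TASKS}
    for _ in range(iters):
        grads = {t: X[t].T @ (X[t] @ W[t] - y) / n + lam1 * W[t] for t in TASKS}
        for s, u in EDGES:                    # graph penalty couples the tasks
            grads[s] += lam2 * (W[s] - W[u])
            grads[u] += lam2 * (W[u] - W[s])
        for t in TASKS:
            W[t] -= lr * grads[t]
    return W

# Prediction could average the per-task scores: sign(mean over t of X_t @ W[t]).
```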

Page 15: Experiment Results

Evaluation on MV-TJU

– Performance by PBoW (100-D) + GMTL

Analysis:

1. The graph penalty can further facilitate common-knowledge discovery by MTL.

2. The overall accuracy of MTL with graph structure R6 is promising.

3. The R6 structure is important for effective relatedness transfer.

Page 16

Thank you!

Hong Lin