School of Electronic Information Engineering, Tianjin University

Human Action Recognition by Learning Bases of Action Attributes and Parts

Jia pingping

Outline:

1. Action Classification in Still Images
2. Intuition: Action Attributes and Parts
3. Algorithm: Learning Bases of Attributes and Parts
4. Experiments: PASCAL & Stanford 40 Actions
5. Conclusion

Action Classification in Still Images

Pipeline: low-level feature → high-level representation → action label (e.g. Riding bike).

The high-level representation is built from:

- Semantic concepts – Attributes (e.g. riding a bike, sitting on a bike seat, wearing a helmet, pedaling the pedals, …)
- Objects (parts)
- Human poses (parts)
- Contexts of attributes & parts

Benefits: this representation incorporates human knowledge, gives more understanding of image content, and yields a more discriminative classifier.





Action Attributes and Parts

Attributes: semantic descriptions of human actions. Each attribute is predicted by a discriminative classifier, e.g. an SVM that separates images that have the attribute ("Riding bike") from images that do not ("Not riding bike").


Parts-Objects and Parts-Poselets: each object and each poselet is found by a pre-trained detector.

a: the image feature vector, built from the outputs of attribute classification, object detection, and poselet detection.
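A minimal sketch of assembling a, assuming each attribute classifier, object detector, and poselet detector contributes exactly one confidence score per image and that the scores are simply concatenated (the slides do not specify the layout). The sizes follow the PASCAL setup given in the experiments (14 attributes, 27 objects, 150 poselets); the scores are random placeholders and build_feature_vector is a hypothetical helper.

```python
import numpy as np

def build_feature_vector(attr_scores, obj_scores, poselet_scores):
    """Concatenate attribute, object, and poselet confidences into one vector a.

    Assumption: one confidence score per attribute classifier, object detector,
    and poselet detector; the ordering (attributes, objects, poselets) is arbitrary.
    """
    return np.concatenate([
        np.asarray(attr_scores, dtype=float),
        np.asarray(obj_scores, dtype=float),
        np.asarray(poselet_scores, dtype=float),
    ])

# Placeholder scores sized like the PASCAL setup (14 attributes, 27 objects, 150 poselets).
rng = np.random.default_rng(0)
a = build_feature_vector(rng.random(14), rng.random(27), rng.random(150))
print(a.shape)  # (191,)
```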


Action bases Φ: the feature vector is approximated by a combination of action bases, a ≈ Φw, where w is the vector of bases coefficients. The coefficients w can then be fed to an SVM to recognize the action (e.g. Riding bike).
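A toy illustration of a ≈ Φw with made-up numbers: each column of Φ is one action basis, and a sparse w selects only the few bases needed to explain the image. The matrix and coefficients below are hypothetical.

```python
import numpy as np

# Hypothetical toy example: a 5-dimensional feature vector and M = 3 action bases.
Phi = np.array([
    [1.0, 0.0, 0.5],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.5],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])                               # each column is one action basis
w = np.array([0.8, 0.0, 0.3])    # sparse bases coefficients: the second basis is unused

a_hat = Phi @ w                  # approximation of a as a weighted sum of bases
active = np.flatnonzero(w)       # indices of the bases that explain the image
print(a_hat, active)
```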



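Bases of Attributes & Parts: Training

a ≈ Φw

• Input: a_1, …, a_N
• Output: action bases Φ = [Φ_1, …, Φ_M] and sparse coefficients W = [w_1, …, w_N]
• Jointly estimate Φ and W:

$$\min_{\Phi, W} \sum_{i=1}^{N} \left( \frac{1}{2} \lVert a_i - \Phi w_i \rVert_2^2 + \lambda \lVert w_i \rVert_1 \right) \quad \text{s.t.} \quad \lVert \Phi_j \rVert_1 + \frac{\gamma}{2} \lVert \Phi_j \rVert_2^2 \le 1, \ \forall j$$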
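The slide does not say how Φ and W are estimated jointly. One standard approach for this kind of objective is alternating minimization: sparse-code W with Φ fixed (here with ISTA), then take a projected gradient step on Φ with W fixed. The sketch below assumes that approach and, for simplicity, replaces the elastic-net constraint on each basis with ‖Φ_j‖_2 ≤ 1; learn_bases, objective, and the iteration counts are hypothetical, and λ = 0.1 is taken from the experimental setup.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||x||_1 (element-wise shrinkage toward zero)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def objective(A, Phi, W, lam):
    """sum_i 0.5 * ||a_i - Phi w_i||_2^2 + lam * ||w_i||_1 over all columns."""
    return 0.5 * np.sum((A - Phi @ W) ** 2) + lam * np.sum(np.abs(W))

def learn_bases(A, M, lam=0.1, n_outer=30, n_inner=50, seed=0):
    """Alternating minimization over W (ISTA) and Phi (projected gradient).

    A: (D, N) matrix whose columns are the training vectors a_1, ..., a_N.
    Simplification (assumption): each basis is constrained to ||Phi_j||_2 <= 1
    rather than the elastic-net constraint on the slide.
    """
    rng = np.random.default_rng(seed)
    D, N = A.shape
    Phi = rng.standard_normal((D, M))
    Phi /= np.linalg.norm(Phi, axis=0, keepdims=True)  # start with unit-norm bases
    W = np.zeros((M, N))
    for _ in range(n_outer):
        # W-step: ISTA on all coefficient vectors at once, Phi fixed.
        L = max(np.linalg.norm(Phi, 2) ** 2, 1e-12)  # Lipschitz constant of the gradient
        for _ in range(n_inner):
            W = soft_threshold(W - Phi.T @ (Phi @ W - A) / L, lam / L)
        # Phi-step: one gradient step on the squared error, W fixed ...
        L_phi = max(np.linalg.norm(W, 2) ** 2, 1e-12)
        Phi = Phi - (Phi @ W - A) @ W.T / L_phi
        # ... followed by projecting every column back onto the unit l2 ball.
        Phi /= np.maximum(np.linalg.norm(Phi, axis=0, keepdims=True), 1.0)
    return Phi, W

# Toy usage: 30-dimensional nonnegative "a" vectors for 100 images, 10 bases.
A = np.abs(np.random.default_rng(1).standard_normal((30, 100)))
Phi, W = learn_bases(A, M=10, lam=0.1)
print(objective(A, Phi, W, 0.1), objective(A, Phi, np.zeros_like(W), 0.1))
```

Both steps use a step size of 1/L, so neither increases the objective; the learned W is typically much sparser than a dense least-squares fit.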

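Bases of Attributes & Parts: Testing

a ≈ Φw

• Input: a, together with the learned action bases Φ = [Φ_1, …, Φ_M]
• Output: sparse w
• Estimate w:

$$\min_{w} \frac{1}{2} \lVert a - \Phi w \rVert_2^2 + \lambda \lVert w \rVert_1$$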
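With Φ fixed, the testing problem is a lasso. A minimal sketch solving it with ISTA (iterative soft-thresholding), which is one common solver and not necessarily the one used in the original work; sparse_code is a hypothetical helper name.

```python
import numpy as np

def sparse_code(a, Phi, lam=0.1, n_iter=200):
    """ISTA for min_w 0.5 * ||a - Phi w||_2^2 + lam * ||w||_1 with Phi fixed."""
    L = max(np.linalg.norm(Phi, 2) ** 2, 1e-12)  # step 1/L guarantees descent
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        z = w - Phi.T @ (Phi @ w - a) / L                     # gradient step on the squared error
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding (prox of the l1 term)
    return w

# With orthonormal bases (Phi = I) the solution is soft-thresholding of a itself:
# entries with |a_j| <= lam become exactly zero, which is where the sparsity comes from.
w = sparse_code(np.array([1.0, 0.05, -0.5, 0.0]), np.eye(4), lam=0.1)
print(w)
```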

Experiments: PASCAL & Stanford 40 Actions


1. PASCAL Action Dataset

http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2008/


• Contains 9 classes, with 21,738 images in total;

• 50% of each class is randomly selected for training/validation, and the remaining images are used for testing;

• 14 attributes, 27 objects, 150 poselets;

• The numbers of action bases are set to 400 and 600, respectively, and the λ and γ values are set to 0.1 and 0.15.
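A sketch of the per-class random split described above, written as a hypothetical helper that accepts either a fraction of each class (50% here) or a fixed count per class (as in the Stanford 40 Actions setup). The random seed and rounding rule are assumptions.

```python
import numpy as np

def split_per_class(labels, train_fraction=None, train_count=None, seed=0):
    """Randomly split sample indices class by class (hypothetical helper).

    Pass exactly one of train_fraction (e.g. 0.5 for the PASCAL setup) or
    train_count (e.g. 100 for the Stanford 40 Actions setup).
    """
    if (train_fraction is None) == (train_count is None):
        raise ValueError("set exactly one of train_fraction or train_count")
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))  # shuffle this class only
        if train_count is not None:
            k = train_count
        else:
            k = int(round(train_fraction * len(idx)))
        train_idx.extend(idx[:k].tolist())
        test_idx.extend(idx[k:].tolist())
    return np.array(sorted(train_idx), dtype=int), np.array(sorted(test_idx), dtype=int)

# Toy usage: 9 classes with 10 samples each, half of every class for training.
labels = np.repeat(np.arange(9), 10)
train, test = split_per_class(labels, train_fraction=0.5)
print(len(train), len(test))  # 45 45
```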

Classification Result

Figure: average precision for the 9 PASCAL action classes (Phoning, Playing instrument, Reading, Riding bike, Riding horse, Running, Taking photo, Using computer, Walking). Compared: our method using "a", our method using "w", POSELETS (Maji et al., 2011), SURREY_MK, and UCLEAR_DOSP. A further panel shows the 400 action bases over attributes, objects, and poselets.
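The results are reported as average precision (AP). A minimal sketch of non-interpolated AP for one class, computed as the mean precision at the rank of each positive image; the PASCAL VOC development kit defines its own AP computation, so this is an approximation for illustration and average_precision is a hypothetical helper.

```python
import numpy as np

def average_precision(scores, labels):
    """Non-interpolated AP: mean precision at the rank of each positive example.

    scores: classifier confidences for one action class.
    labels: 1 for images of that class, 0 otherwise.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(-scores, kind="stable")  # rank by decreasing confidence
    hits = labels[order] == 1
    if not hits.any():
        return 0.0
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].mean())

# Positives at ranks 1 and 3: AP = (1/1 + 2/3) / 2 = 5/6.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(round(ap, 4))  # 0.8333
```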


Control Experiment

Figure: classification using "a" vs. using "w" with different feature types (A: attribute, O: object, P: poselet).

2. Stanford 40 Actions

Classes: Applauding, Blowing bubbles, Brushing teeth, Calling, Cleaning floor, Climbing wall, Cooking, Cutting trees, Cutting vegetables, Drinking, Feeding horse, Fishing, Fixing bike, Gardening, Holding umbrella, Jumping, Playing guitar, Playing violin, Pouring liquid, Pushing cart, Reading, Repairing car, Riding bike, Riding horse, Rowing, Running, Shooting arrow, Smoking cigarette, Taking photo, Texting message, Throwing frisbee, Using computer, Using microscope, Using telescope, Walking dog, Washing dishes, Watching television, Waving hands, Writing on board, Writing on paper.

http://vision.stanford.edu/Datasets/40actions.html


• Contains 40 diverse daily human actions;

• 180 to 300 images per class, 9,532 real-world images in total;

• All images are obtained from Google, Bing, and Flickr;

• Large variations in human pose, appearance, and background clutter.


Result:

• 100 images in each class are randomly selected for training, and the remaining images are used for testing.

• 45 attributes, 81 objects, 150 poselets. The numbers of action bases are set to 400 and 600, respectively, and the λ and γ values are set to 0.1 and 0.15.

• Our method is compared with the Locality-constrained Linear Coding (LLC, Wang et al., CVPR 2010) baseline.

Figure: per-class average precision on the 40 Stanford 40 Actions classes, LLC vs. our method.

Control Experiment

Figure: classification on Stanford 40 Actions using "a" vs. using "w" with different feature types (A: attribute, O: object, P: poselet).

Conclusion


• Partwise Bag-of-Words (PBoW) Representation: local features → body part localization → PBoW generation, producing a head-wise BoW, a limb-wise BoW, a leg-wise BoW, and a foot-wise BoW.
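A sketch of PBoW generation, assuming each local feature has already been assigned to one of the four parts by the body part localizer and that a visual codebook (e.g. from k-means) is available; both inputs are hypothetical here, as are the hard nearest-word assignment and the L1 normalization of each histogram.

```python
import numpy as np

PARTS = ("head", "limb", "leg", "foot")

def partwise_bow(descriptors, part_ids, codebook):
    """Return one L1-normalized bag-of-words histogram per body part.

    descriptors: (n, d) local features of one action sample.
    part_ids: length-n indices into PARTS, i.e. the part each local feature
        was assigned to by the body part localizer (assumed given).
    codebook: (k, d) visual words, e.g. k-means centers (assumed given).
    """
    descriptors = np.asarray(descriptors, dtype=float)
    part_ids = np.asarray(part_ids)
    codebook = np.asarray(codebook, dtype=float)
    # Hard-assign every local feature to its nearest visual word (squared Euclidean).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    k = codebook.shape[0]
    bows = {}
    for p, name in enumerate(PARTS):
        h = np.bincount(words[part_ids == p], minlength=k).astype(float)
        bows[name] = h / h.sum() if h.sum() > 0 else h  # empty part -> all-zero histogram
    return bows

# Toy usage: 2 visual words, 4 local features on the head, limb, and foot.
bows = partwise_bow(
    descriptors=[[0.1, 0.0], [0.9, 1.0], [1.0, 1.0], [0.0, 0.2]],
    part_ids=[0, 0, 1, 3],
    codebook=[[0.0, 0.0], [1.0, 1.0]],
)
print(bows)
```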

• Local Action Attribute Method: 1. Label the action samples according to different parts. For each part, we define a new set of low-level semantic labels and use them to re-label the training action samples:

  Head: static, vertical move, horizontal move

  Limb: static, swing, …

  Leg: static, …

  Foot: static, …

• Local Action Attribute Method: 2. For each part, train a set of attribute classifiers according to the set of semantic labels defined for that part.

• Local Action Attribute Method: 3. For each action sample, map its low-level representation (head-wise, limb-wise, leg-wise, and foot-wise BoW) to a mid-level representation through the part-wise attribute classifiers, then combine the four parts to build a new histogram representation of the sample.

• Local Action Attribute Method: 4. Based on these local action attributes, we thus construct a new descriptor of action samples, which can be used for classification (e.g. with an SVM or K-NN trained on the training set and evaluated on the testing set).
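A sketch of steps 3 and 4, assuming one linear classifier per part whose rows score that part's semantic labels, concatenation as the way the four parts are combined, and 1-nearest-neighbor as the final classifier (the slides also mention SVM). All names, weights, and action labels are hypothetical.

```python
import numpy as np

def attribute_descriptor(part_bows, part_classifiers):
    """Map part-wise BoW histograms to one vector of local attribute scores.

    part_classifiers: part name -> (W, b), a hypothetical linear classifier
        with one row of W per semantic label defined for that part
        (e.g. head: static, vertical move, horizontal move).
    The per-part scores are concatenated in sorted part order (assumption).
    """
    scores = []
    for part in sorted(part_classifiers):
        W, b = part_classifiers[part]
        scores.append(W @ np.asarray(part_bows[part], dtype=float) + b)
    return np.concatenate(scores)

def nearest_neighbor(train_X, train_y, x):
    """1-NN classification of a descriptor x (Euclidean distance)."""
    dists = np.linalg.norm(np.asarray(train_X) - x, axis=1)
    return train_y[int(dists.argmin())]

# Toy usage with two parts and 2 visual words per part.
classifiers = {
    "head": (np.eye(3, 2), np.zeros(3)),     # 3 head labels
    "limb": (np.ones((2, 2)), np.zeros(2)),  # 2 limb labels
}
x = attribute_descriptor({"head": [0.5, 0.5], "limb": [1.0, 0.0]}, classifiers)
train_X = np.array([[0.5, 0.5, 0.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0, 0.0]])
train_y = ["walking", "standing"]
print(x, nearest_neighbor(train_X, train_y, x))
```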


Thank you
