

Human Action Recognition

A project report

submitted in partial fulfillment of the

requirements for the Degree of

Master of Technology

in

Computational Science

by

Rajendra Kumar

SUPERCOMPUTER EDUCATION AND RESEARCH CENTER

INDIAN INSTITUTE OF SCIENCE

Bangalore - 560012

July 2012


Dedicated to all teachers, family and friends


Abstract

Human Action Recognition

Human action recognition is an important topic in computer vision research and applications. The goal of action recognition is the automated analysis of ongoing events from video data. A reliable system capable of recognizing various human actions has many important applications, including surveillance systems, health-care systems, and a variety of systems that involve interactions between persons and electronic devices, such as human-computer interfaces.

In this project, the problem of human action recognition from video sequences is addressed. Human Action Recognition (HAR) is analysed for both depth and RGB video sequences. For depth-based HAR, we propose two methods: the first uses local features and the second uses global features. In both methods, an l1-minimization framework is employed for classification. Experiments were performed on the Video Analytics Lab (VAL) dataset. For RGB-based HAR, Latent Dirichlet Allocation (LDA) was used with Space-Time Interest Point (STIP) feature descriptors. STIP effectively captures the local structure in the spatio-temporal dimensions of a video sequence. Each video sequence is represented as a 'bag of visual words'. Experiments were performed on the WEIZMANN, KTH and Video Analytics Lab (VAL) databases in two scenarios: one in which the number of topics for all actions was constant and manually chosen, and another in which the number of topics for each action varied with the action category.

Keywords: Human Action Recognition, l1-minimization, Latent Dirichlet Allocation, Space-Time Interest Points, Kinect depth sensor.



Acknowledgements

I am extremely lucky to have worked with my advisor Dr. R. Venkatesh Babu, and I would like to thank him for his guidance, encouragement and invaluable inputs throughout the project. I am grateful to him for inspiring me to learn and for being very supportive.

I am extremely thankful to Sreekanth, Priti, Naresh, Avinash and Sovan for their valuable time, encouragement and support.

I thank Bhuvnesh and his team, who collected the human action depth database.

I also want to express my deepest gratitude to all my friends who have helped me in many ways. I am highly obliged to my parents, brothers and sisters, who have been extremely understanding and supportive of my studies.

Last but not least, I thank God.



Contents

Abstract
Acknowledgements
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Outline of thesis
2 Related Work
3 Methodology
  3.1 Depth Based HAR
    3.1.1 Local Approach
      3.1.1.1 Average with Overlap and Slice
      3.1.1.2 Most Distinct Frames and Slice
    3.1.2 Global Approach
      3.1.2.1 Average of Difference with Overlap
  3.2 RGB Based HAR
4 Experiments and Results
  4.1 Experiment Setup
  4.2 Results
    4.2.1 Depth Based HAR
      4.2.1.1 Average with overlap and slice
      4.2.1.2 Most distinct frames and slice
      4.2.1.3 Average of difference with overlap
    4.2.2 RGB Based HAR
      4.2.2.1 Weizmann
5 Conclusion and Future Work

Appendices
A SPArse Modeling Software
B Space Time Interest Points (STIP)
C Latent Dirichlet Allocation

Bibliography


List of Figures

2.1 RGB and Depth Frame
3.1 Our Approaches
3.2 Slicing
3.3 Average with overlap
3.4 Average with overlap and slice: Training
3.5 Recognition of a feature vector into a class action
3.6 Most distinct frames and slice: Training
3.7 Average of difference with overlap: Training
3.8 RGB based HAR: Training
3.9 RGB based HAR: Testing
4.1 Confusion matrix corresponding to 73.9% recognition
4.2 Confusion matrix corresponding to 77.8% recognition
4.3 Confusion matrix corresponding to 87.2% recognition
4.4 Weizmann after removing the 7th action and 5th subject (confusion matrices), k = 10 and k = 15
4.5 KTH, variable topics (confusion matrix), subjects per action = 15
4.6 VAL, fixed topics (confusion matrix), "Boxing" and 5th subject removed
C.1 The generative LDA Process
C.2 Representation of the LDA model


List of Tables

4.1 Confused Actions
4.2 Recognition in % for different parameter values
4.3 Average Recognition in %
4.4 Recognition in % for different parameter values
4.5 Average Recognition in %
4.6 Confused Actions
4.7 Recognition in % for different parameter values
4.8 Confused Actions
4.9 Average Recognition in %
4.10 Confused Actions
4.11 Recognition in % for different parameter values
4.12 Recognition in % for different parameter values
4.13 Recognition in % for different parameter values
4.14 Confused Actions
4.15 Recognition in % for different parameter values
4.16 Confused Actions
4.17 Recognition in percentage for KTH
4.18 Recognition in % for different parameter values
4.19 Confused Actions
4.20 Recognition in percentage for VAL


Chapter 1

Introduction

In recent years, Human Action Recognition (HAR) has evoked considerable interest in various research areas due to its potential use in proactive computing. A reliable system capable of recognizing various human actions has many important applications such as automated surveillance systems, human-computer interaction, smart-home health-care systems and controller-free gaming systems. The problem of human action recognition from video sequences was addressed in this project. The aim is to develop an algorithm which can recognize low-level actions such as Bending, Bowling, Boxing, Jogging, Jumping and Kicking from input video sequences. HAR was analysed for both depth and RGB video sequences.

In depth-based HAR, two methods were proposed: one uses local features and the other uses global features. In both methods, an l1-minimization framework was employed for classification, using the SPAMS software [1].

Action recognition algorithms which use grey-scale images are sensitive to illumination changes, but the depth information (in depth frames) obtained using Kinect sensors is independent of these variations. The performance of these methods was tested on the data collected at the Video Analytics Lab (VAL database). A 'leave-one-out' strategy was used for performance evaluation. The results indicate that the global-approach features give better recognition results than the local approach. Parameters such as the number of frames to be averaged and the number of frames to be overlapped need to be set during the training and testing phases.

NOTE: The performance of the algorithm depends on the parameter values chosen; these values are documented with the corresponding results.




For RGB-based HAR, Latent Dirichlet Allocation (LDA) was used to analyse the performance. Space-Time Interest Point (STIP) [2][3] feature descriptors of the videos were used in LDA. STIP effectively captures the local structure in the spatio-temporal dimensions of a video sequence. Each video sequence is represented as a 'bag of visual words', where the visual words are the key STIP features, obtained by clustering all STIP features of the training videos of all actions using K-means clustering. Models for each action were obtained with the LDA latent topic model, which learns the spatio-temporal distribution of visual words.

The most important cue for recognizing human low-level actions is these space-time interest points, which detect local structures in space-time where the image values have significant local variations in both dimensions.

Experiments were done on the WEIZMANN, KTH and VAL databases in two scenarios: one in which the number of topics for all actions was constant and manually chosen, and a second in which the number of topics for each action was automatically chosen depending on the human action category.

Databases:

• Weizmann
• KTH
• VAL Database

Training and testing framework: leave-one-out; the average of the results is the final result.

Challenges: discrimination between similar actions such as Jumping-Sitting, Jogging-Walking and Boxing-Stretching.

Keywords: Depth frame, HAR, l1-minimization, SPAMS, LDA, STIP, Kinect, Depth sensor.

1.1 Motivation

As mentioned earlier, Human Action Recognition (HAR) has evoked considerable interest in various research areas and applications due to its potential use in proactive computing. Proactive computing is technology that proactively anticipates people's needs in situations such as health care or life care and takes appropriate actions on their behalf. A system capable of recognizing various human actions has many important applications such as automated surveillance systems, human-computer interaction, smart-home health-care systems and controller-free gaming systems. Thus HAR is a very fertile domain with many promising applications, and it draws the attention of many researchers, institutions and commercial companies.

1.2 Aim

Our aim is to analyse this problem and develop a robust HAR algorithm using l1-minimization and the Latent Dirichlet Allocation (LDA) framework, which can recognize low-level actions such as Bending, Bowling, Boxing, Jogging and Kicking from input video sequences.

1.3 Outline of thesis

The thesis is organized as follows:

1. Chapter 1: Introduction introduces the context and motivation of the research presented in this thesis.

2. Chapter 2: Related Work describes related work and our contribution.

3. Chapter 3: Methodology explains the different approaches used in detail and how HAR is achieved.

4. Chapter 4: Experiments and Results describes the datasets used for HAR, briefly presents the bag-of-visual-words model used for evaluating our algorithm, and describes the parameter values and scenarios in which the experiments were performed.

5. Chapter 5: Conclusion and Future Work concludes the thesis along with future plans.


Chapter 2

Related Work

Several approaches for human action recognition have been proposed; a survey on HAR can be found in [4]. A variety of approaches use features which describe the motion and/or shape of the entire human body to perform human action recognition.

Efros et al. [5] recognize the actions of small-scale figures using features derived from blurred optical flow estimates. Blank et al. [6] represent an action by considering the shape carved by its silhouette in time; local shape descriptors based on the Poisson equation are computed and then aggregated into a global descriptor by computing moments.

Another group of methods uses features derived from small-scale patches, usually computed at a set of interest points. Schuldt et al. [7] compute local space-time features at locations selected in a scale-space representation; these features are used in an SVM classification scheme.

Traditional approaches for motion analysis mainly involve the computation of optical flow (Barron et al., 1994) [8] or feature tracking [9][10] (Smith and Brady, 1995; Blake and Isard, 1998). Although very effective for many tasks, both of these techniques have limitations. Optical flow approaches mostly capture first-order motion and may fail when the direction of motion changes suddenly. Feature trackers often assume a constant appearance of image patches over time and may hence fail when the appearance changes, for example when two objects in the image merge or split. Model-based solutions for this problem have been presented by Black and Jepson (1998) [11].




Figure 2.1: RGB and Depth Frame

Image structures in videos are not restricted to constant velocity and/or constant appearance over time. On the contrary, many interesting events in videos are characterized by strong variations of the data along both the spatial and the temporal dimensions.

In the spatial domain, points with a significant local variation of image intensities have been extensively investigated in the past (Forstner and Gulch, 1987 [12]; Harris and Stephens, 1988 [13]; Lindeberg, 1998 [14]; Schmid et al., 2000 [15]). Such image points are frequently referred to as interest points and are attractive due to their high information content and relative stability with respect to perspective transformations of the data.

In this report, Human Action Recognition (HAR) for both depth sequences and RGB video sequences is analysed. In depth-based HAR the depth information (Figure 2.1) of the video sequence is used, whereas in RGB-based HAR the notion of interest points extended into the spatio-temporal domain (proposed by Ivan Laptev, 2004) [16] is used for a compact representation of video data as well as for the interpretation of spatio-temporal events. Latent Dirichlet Allocation (LDA) is then used for modelling the human action.


Chapter 3

Methodology

We propose two different approaches (Figure 3.1) for human action recognition: one based on depth information and the other based on STIP features.

3.1 Depth Based HAR:

In this approach two methods are proposed: one uses local features whereas the other uses global features. In both methods, the l1-minimization framework is employed for classification. Performance was evaluated on the Video Analytics Lab (VAL) database. The dataset has M = 9 subjects, each performing N = 11 actions, viz. Bending, Bowling, Boxing, Jogging, Jumping, Kicking, Sitting, Stretching, Swimming, Walking and Waving.

Figure 3.1: Our Approaches

Figure 3.2: Slicing

Figure 3.3: Average with overlap

Before going into the details of each method, some terms need to be explained.

1) Slicing: A process which takes a single depth frame and produces p binary frames by chopping the depth range of the input frame into p equal sub-ranges. In Figure 3.2 the input frame has a depth range from 0 to 60 and the output is three frames covering the ranges 0-20, 21-40 and 41-60. Non-zero values are replaced with 1 to obtain binary images.
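As an illustration, a minimal Python sketch of this slicing step (NumPy only; the treatment of zero depth as background and the binning details are assumptions, not taken from the original implementation):

    import numpy as np

    def slice_depth_frame(frame, p):
        """Split one depth frame into p binary frames, one per equal
        depth sub-range. Zero depth is treated as background."""
        valid = frame > 0
        edges = np.linspace(frame[valid].min(), frame[valid].max(), p + 1)
        # bin index of every pixel; clip so the maximum depth falls in the last bin
        bins = np.clip(np.digitize(frame, edges) - 1, 0, p - 1)
        return [((bins == i) & valid).astype(np.uint8) for i in range(p)]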


Figure 3.4: Average with overlap and slice : Training

2) Average with overlap: This is best explained with an example. Suppose we have 5 depth frames for a single action, labelled a, b, c, d, e, and we want to average N = 3 frames with an overlap of n = 2. Then resultant frame 1 is the average of a, b and c; similarly, resultant frames 2 and 3 are the averages of b, c, d and of c, d, e respectively. Refer to Figure 3.3.
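A minimal sketch of this windowed averaging (Python/NumPy; the function and variable names are illustrative):

    import numpy as np

    def average_with_overlap(frames, N, n):
        """Average windows of N frames; consecutive windows share n frames."""
        step = N - n                      # window start advances by N - n (N > n)
        frames = np.asarray(frames, dtype=np.float64)
        return [frames[s:s + N].mean(axis=0)
                for s in range(0, len(frames) - N + 1, step)]

    # 5 frames, N = 3, n = 2 -> 3 resultant frames, as in the example above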

3.1.1 Local Approach:

In this approach local features are used for recognition. For training and testing, the framework used is leave-one-out (i.e. training with M − 1 subjects and testing with the remaining subject), and the average of the results is the final result. The parameters corresponding to the number of frames to be averaged, the overlap and the number of distinct frames are tunable.

There are two schemes; refer to Figure 3.1.


3.1.1.1 Average with Overlap and Slice

In this approach features are extracted from local grid blocks of human-figure-centric frames.

Training:

For training, x depth frames were averaged with an overlap of y frames, and the resultant frames were sliced.

In our scenario there are (M − 1) = 8 training subjects per action. For a particular action, let there be n resultant depth frames in total, each of which is sliced into p binary frames, giving n*p binary slice frames per action.

For each of the p binary slices of a resultant depth frame:

• Chop the frame into a finite number of small blocks, i.e. convert the frame into a grid.
• For each grid block, compute the number of 1's it contains divided by the grid-block size. This numerical value is the local feature for that particular grid block.
• Form a column vector (the local feature vector) of the numerical values obtained in the second step.

Thus p column vectors are obtained; the concatenation of these p column vectors into a single column vector is the feature vector corresponding to a resultant depth frame. The local feature vectors for all training actions (in our case N = 11) are obtained similarly; refer to Figure 3.4. A sketch of the grid-feature computation follows below.
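A minimal sketch of the per-slice grid feature (Python/NumPy; the 4x4 grid matches the grid size used in the experiments, but the helper name is illustrative):

    import numpy as np

    def grid_features(binary_slices, grid=(4, 4)):
        """Concatenate per-block occupancy ratios of all p binary slices
        into one feature vector for a resultant depth frame."""
        feats = []
        for sl in binary_slices:
            h, w = sl.shape
            bh, bw = h // grid[0], w // grid[1]
            for i in range(grid[0]):
                for j in range(grid[1]):
                    block = sl[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
                    feats.append(block.sum() / block.size)  # fraction of 1's
        return np.array(feats)  # length = p * grid[0] * grid[1]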

Dictionary Building: Put all the feature vectors of an action together as columns and normalize the dictionary. The normalized dictionary then looks like

D = [A_{11} \dots A_{1n}, A_{21} \dots A_{2n}, \dots, A_{N1} \dots A_{Nn}]    (3.1)

of size (no. of grid blocks * p) x (N*n), where

X_i = [A_{i1} A_{i2} \dots A_{in}]    (3.2)

is the set of feature vectors for an action A_i, and

n = total resultant depth frames for an action class
p = number of slices per frame
N = number of actions in the dataset

Testing

For testing each action video of the test subject, the user may provide values for the parameters corresponding to the number of frames to be averaged and overlapped. These values can differ from those chosen during training.

Assume n' resultant depth frames were obtained; the number of binary slices per resultant depth frame must be p. After completing the averaging and slicing process we get

X_i = [A_{i1} A_{i2} \dots A_{in'}]    (3.3)

of size (no. of grid blocks * p) x n', the set of feature vectors for an action A_i.

To classify X_i we solve a lasso problem. Let

X = [x_1, x_2, \dots, x_n]    (3.4)

of dimension m x n be a response matrix and D of dimension m x p be a matrix of predictors; then the lasso problem is commonly written as

\min_{\alpha \in \mathbb{R}^p} \|\alpha\|_1 \quad \text{s.t.} \quad \|x - D\alpha\|_2^2 \le \lambda    (3.5)

i.e. find the \alpha corresponding to a column x of the input response matrix X which solves this minimization problem. For solving this problem, the function mexLasso available in the open-source SPAMS [1] software was used.

Function mexLasso

This is a fast implementation of the LARS algorithm for solving the lasso. It is optimized for solving a large number of small or medium-sized decomposition problems. It first computes the Gram matrix

G = D^T D    (3.6)

and then performs a Cholesky-based Orthogonal Matching Pursuit (OMP) of the input signals in parallel.


Figure 3.5: Recognition of a feature vector into a class action

mexLasso takes X and D as inputs and, depending on the input parameters, returns a matrix of coefficients

A = [\alpha_1, \alpha_2, \dots, \alpha_n]    (3.7)

of dimension p x n such that for every column x of X, the corresponding column \alpha of A is the solution of the above l1-minimization (lasso) problem.

In our case we obtain A of dimension (N*n) x n', corresponding to our test matrix X of dimension (no. of grid blocks * p) x n' and dictionary D of dimension (no. of grid blocks * p) x (N*n).

Recognition:

The columns of the test matrix are the feature vectors of the test action, so for each feature vector of a test action there is a corresponding coefficient column vector in the matrix A, of size (N*n) x 1.

Divide the N*n coefficients into N equal-sized groups; then either find the sum of the coefficients of each group and pick the group with the maximum sum, or choose the group with the highest peak, as the recognized action label. Refer to Figure 3.5, where the vector is classified into action class A1.

Thus for a particular test action we get n' recognitions corresponding to the n' columns of the test matrix X_i. We report the result as the percentage of accuracy, i.e. how many out of n' are recognized correctly.
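A hedged end-to-end sketch of this classification step in Python. It uses scikit-learn's Lasso, which solves the penalized form of (3.5) rather than the constrained form used by mexLasso, so it is a stand-in for the original SPAMS call; all names and the value of lam are illustrative:

    import numpy as np
    from sklearn.linear_model import Lasso

    def classify_columns(X_test, D, N, lam=0.1):
        """Sparse-code each test feature vector over dictionary D
        (columns grouped by action) and vote by per-group coefficient sums."""
        n_atoms = D.shape[1]
        group = n_atoms // N                      # atoms per action class
        labels = []
        for x in X_test.T:
            alpha = Lasso(alpha=lam, max_iter=5000).fit(D, x).coef_
            sums = [alpha[g * group:(g + 1) * group].sum() for g in range(N)]
            labels.append(int(np.argmax(sums)))  # action with largest group sum
        return labels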


Figure 3.6: Most distinct frames and slice : Training

3.1.1.2 Most distinct frames and slice

Training: In this method, collect the x most distinct frames from the total frames of a video sequence of an action (in our case the total is the last 48 of the 58 depth frames, leaving out the first 10).

• Pick the 1st frame as the first distinct frame, put it in the set X (initially empty) and remove it from the set Y (initially containing all 48 depth frames).
• Pick from the set Y the frame which is most distinct from all depth frames in the set X (in our case the distance measure is the l2-norm of the difference frame).

Repeat the second step until X contains x frames, do this for all videos of a particular action, and collect the results in X. Thus X, the set of (M-1)*x = n most distinct depth frames for a particular action, is obtained. After that, slice each frame in the set X as described in Figure 3.6 and build a normalized dictionary matrix D as


D = [A_{11} \dots A_{1n}, A_{21} \dots A_{2n}, \dots, A_{N1} \dots A_{Nn}]    (3.8)

of size (no. of grid blocks * p) x (N*n), where p is the number of slices per distinct depth frame,

X_i = [A_{i1} A_{i2} \dots A_{in}]    (3.9)

is the set of feature vectors of an action A_i, and

n = total most distinct depth frames for an action class
p = number of slices per frame
N = number of actions
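A minimal sketch of the greedy most-distinct-frame selection described above (Python/NumPy; reading "most distinct from all frames in X" as maximizing the minimum l2 distance to the selected set is an assumption):

    import numpy as np

    def most_distinct_frames(frames, x):
        """Greedily pick x frames, each maximizing its minimum
        l2 distance to the frames already selected."""
        Y = [np.asarray(f, dtype=np.float64).ravel() for f in frames]
        X = [Y.pop(0)]                       # step 1: seed with the first frame
        while len(X) < x and Y:
            # distance of each remaining candidate to the selected set
            dists = [min(np.linalg.norm(y - s) for s in X) for y in Y]
            X.append(Y.pop(int(np.argmax(dists))))
        return X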

Testing

• For the given test action, find the x1 most distinct frames (x1 may differ from x).
• Slice them and form the grid features.
• The classification procedure is the same as described in method 3.1.1.1.

3.1.2 Global Approach:

3.1.2.1 Average of difference with overlap

In this approach features are extracted from the complete human-figure-centric frame.

Training:

Find the difference frames from the consecutive frames of an action (in our case 48 depth frames). Make sure to avoid subtracting from or to a zero depth value, because this produces large values caused by slight oscillations of the body, which are not actually part of the human action. On these difference frames, average frames with overlap to obtain x resultant frames per video. Since in our case there are 8 training subjects per action, the total number of resultant frames for a particular action is 8*x = n. Reshape these n resultant depth frames into column vectors as described in Figure 3.7, then build the dictionary D and normalize it:


Figure 3.7: Average of difference with overlap : Training

D = [A_{11} \dots A_{1n}, A_{21} \dots A_{2n}, \dots, A_{N1} \dots A_{Nn}]    (3.10)

of size (frame size) x (N*n), where

X_i = [A_{i1} A_{i2} \dots A_{in}]    (3.11)

is the set of feature vectors for an action A_i, and

n = total resultant difference frames for an action class
N = number of actions
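A minimal sketch of this global feature construction, assuming zero depth marks background pixels (Python/NumPy; the parameter names and the rule of zeroing invalid differences are illustrative):

    import numpy as np

    def difference_features(frames, x_win, y_overlap):
        """Consecutive-frame differences that ignore pixels where either
        frame is zero, then windowed averaging and flattening."""
        frames = np.asarray(frames, dtype=np.float64)
        valid = (frames[1:] > 0) & (frames[:-1] > 0)   # skip zero-depth pixels
        diffs = np.where(valid, frames[1:] - frames[:-1], 0.0)
        step = x_win - y_overlap                       # assumes x_win > y_overlap
        feats = [diffs[s:s + x_win].mean(axis=0).ravel()
                 for s in range(0, len(diffs) - x_win + 1, step)]
        return np.stack(feats, axis=1)   # one column vector per resultant frame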

Testing:

• For a given test action, find the resultant frames (say x', which may differ from x) by averaging, say, x1 frames with an overlap of y1.
• Reshape each of the frames into a column vector.
• The classification procedure is the same as described in method 3.1.1.1 or method 3.1.1.2.

3.2 RGB Based HAR:

This approach uses Space-Time Interest Points [2][3], which detect local structure in space-time where the image or frame values have significant local variations in both dimensions. This is an important cue for recognizing human low-level actions. Each video sequence is represented as a bag of visual words, where the visual words are the key STIP features, obtained by clustering all STIP features of the training videos of all actions with a clustering algorithm. Latent Dirichlet Allocation (LDA) was used for action model formation. Performance was evaluated on the Weizmann, KTH and VAL databases. For training and testing, the framework used is leave-one-out.

Training:

The training of a classifier to distinguish among the different action classes consists of two layers, as described in Figure 3.8:

1) Clustering to obtain the visual words (key descriptors)

• Obtain all STIP features from the videos of the training subjects' actions.
• Arrange the STIP feature descriptors of a training-subject video sequence in ascending order according to the temporal information (optional).
• Concatenate these arranged STIP feature descriptors into a thin matrix.
• Use a clustering algorithm to group them into different clusters.

The cluster centres (centroids) are the visual words or key features whose combination represents an action. A sketch of this step follows below.
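A hedged sketch of the visual-word step using scikit-learn's KMeans as the clustering algorithm (the report used K-means with K = 500 and three replicates; the STIP loading itself is left abstract):

    import numpy as np
    from sklearn.cluster import KMeans

    def build_visual_words(stip_descriptors, K=500):
        """Cluster all training STIP descriptors; centroids are the visual words."""
        data = np.vstack(stip_descriptors)             # thin matrix: one row per STIP
        km = KMeans(n_clusters=K, n_init=3).fit(data)  # 3 replicates, best inertia kept
        return km.cluster_centers_                     # (K, descriptor_dim)

    def to_words(descriptors, vocab):
        """Assign each descriptor to its nearest visual word (min l2 distance)."""
        d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
        return d2.argmin(axis=1)                       # word index per descriptor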

2) Action modelling by the LDA topic model

Figure 3.8: RGB based HAR : Training

The Latent Dirichlet Allocation topic model, which learns the spatio-temporal distribution of visual words, was used for action modelling.

To obtain topic models for an action, for each training subject of the action:

• Compute all STIP features from the video sequence.
• Arrange them in ascending order according to the temporal information.

Bag-of-words representation:

• Replace each STIP feature descriptor with one of the key STIP feature descriptors based on the minimum l2-norm distance.
• Compute the frequency of words in this bag-of-words representation of the video sequence (in word:frequency format, written to a text file as a single line).

Do the above steps for all training-subject videos to obtain a text file with a number of rows equal to the number of training subjects. Then:

• Run LDA on the text file to obtain the action model (in terms of the beta matrix).


The columns of beta are the topic models for that particular action.
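A minimal sketch of writing the word:frequency training file consumed by the LDA package (its SVMlight-like text format is described in Appendix C; this writer is an assumption consistent with that description):

    from collections import Counter

    def write_bow_line(word_ids, fh):
        """One training video -> one 'word:frequency' line in the LDA text file."""
        counts = Counter(int(w) for w in word_ids)
        fh.write(" ".join(f"{w}:{c}" for w, c in sorted(counts.items())) + "\n")

    # usage: one line per training-subject video of the action
    # with open("bend.train", "w") as fh:
    #     for vid_words in all_training_word_ids:
    #         write_bow_line(vid_words, fh)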

Figure 3.9: RGB based HAR : Testing

Testing:

Testing of each action of a test subject was done on the basis of the STIP descriptors in some range.

• Compute all STIP features of the video sequence.
• Arrange them in ascending order according to the temporal information.

Bag-of-words representation:

• Take all STIP descriptors in some range and map each STIP feature descriptor to one of the key STIP feature descriptors based on the minimum l2-norm distance.
• Compute the frequency of words in this bag-of-words representation to obtain a test column vector of size equal to the number of clusters (containing only frequencies).

Recognition:

• Compute the distances from the test vector to all the topic models of each action.
• Find the index of the minimum of these distances; it corresponds to the action class to which the test video is classified.

Refer to Figure 3.9.
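A minimal sketch of this nearest-topic-model rule (Python/NumPy; normalizing the frequency vector and comparing it to the beta columns by l2 distance is an assumption, since the report does not pin down the exact distance):

    import numpy as np

    def recognize(test_vec, action_betas):
        """action_betas: list (one per action) of beta matrices whose columns
        are that action's topic models. Returns the index of the best action."""
        v = test_vec / max(test_vec.sum(), 1)          # frequencies -> distribution
        best = [min(np.linalg.norm(v - beta[:, t]) for t in range(beta.shape[1]))
                for beta in action_betas]              # closest topic per action
        return int(np.argmin(best))                    # action at minimum distance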


Chapter 4

Experiments and Results

4.1 Experiment Setup

In this section we first describe the datasets used for action recognition. We then briefly present the bag-of-visual-words model used for evaluating our algorithm, and finally we describe the parameter values and scenarios in which the experiments were performed.

The depth-based HAR experiments were performed on the VAL database, and the RGB-based HAR experiments on the Weizmann, KTH and VAL datasets.

KTH dataset: The KTH dataset consists of N = 6 human action classes: walking, jogging, running, boxing, handwaving and handclapping. Each action is performed several times by M = 25 subjects. The sequences were recorded in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. The background is homogeneous and static in most sequences. In total, the data consist of 2391 video samples. The experiment setup was leave-one-out: only one of the four instances of a subject was left out as the test set, and the average over all subjects' test-set results was taken as the final performance measure.

Weizmann dataset: The Weizmann human action dataset contains N = 10 action classes: bend, jack, jump, pjump, run, side, skip, walk, wave1, wave2. Each action is performed by M = 9 subjects. The dataset contains 93 low-resolution (180 x 144 pixels) video sequences. The experiment setup was leave-one-out; the average of all test subjects' results was the final classification result. Our method achieved approximately 98.30% average classification when we removed the 7th action (skip), which was mostly confused with jack and pjump. The fifth subject was also removed from all actions to improve performance.

Video Analytics Lab (VAL) dataset: The VAL dataset was collected using the Kinect depth sensor and contains eleven action classes for depth-based HAR: bending, bowling, boxing, jogging, jumping, kicking, sitting, stretching, swimming, walking and waving, while for RGB-based HAR there are only ten action classes: bending, bowling, boxing, jogging, kicking, sitting, stretching, swimming, walking and waving. Each action is performed by nine subjects. The dataset was collected in an indoor lab scenario. It contains depth frame sequences along with RGB frame sequences; the depth frame sequences were used for depth-based HAR, whereas the RGB frame sequences were used for RGB-based HAR. Our RGB-based HAR approach achieved approximately 95.56% average classification.

Bag of visual words:

To evaluate the performance of our RGB-based HAR, where STIP features were used, we used a standard bag-of-words approach. We clustered all STIP features of the actions into different groups using a clustering algorithm, and the centroids were taken as the key features, or visual words. For a given video sequence, we compute its STIP features and map each STIP feature to one of the key features using the l2-norm distance. This representation of a video sequence in visual words is the so-called bag of visual words.

Parameters:

In RGB-based HAR, performance was evaluated in two scenarios. In one, the number of topics for all actions is constant and manually chosen (in our case 8). In the second scenario, the number of topics for each action is set automatically depending on the human action category. The idea behind the variable number of topics is that every action has a different number of atomic action labels: simply cluster all the STIP features of an action into k clusters and find the particular k for which the within-cluster sum of point-to-centroid distances is relatively minimal. For finding the key descriptors or visual words from the STIP features of all actions, we clustered them into K = 500 clusters. To increase precision, the K-means algorithm was replicated three times and the result with the lowest within-cluster sum was kept. The distance measure used for K-means was the squared Euclidean distance.
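A hedged sketch of the per-action topic-count selection (scikit-learn KMeans; reading "relatively minimum" as an elbow pick over candidate k values is an assumption, as is the candidate list):

    import numpy as np
    from sklearn.cluster import KMeans

    def pick_num_topics(action_stips, candidates=(5, 8, 10, 15, 20)):
        """Elbow heuristic: inertia always falls as k grows, so pick the k
        after which the improvement drops off most sharply."""
        inertias = np.array([KMeans(n_clusters=k, n_init=3).fit(action_stips).inertia_
                             for k in candidates])
        drops = -np.diff(inertias)            # improvement gained by each k step
        # elbow = step with the biggest fall-off in improvement
        return candidates[int(np.argmax(drops[:-1] - drops[1:])) + 1]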

Figure 4.1: Confusion matrix corresponding to 73.9% recognition

4.2 Results

4.2.1 Depth Based HAR:

Experiments were performed on the VAL dataset using only the depth frames, not the RGB frames. Each action folder has 58 depth images, and in our experiments we used the last 48 of the 58 depth frames, leaving out the first 10 because they were, with high probability, almost identical. These depth frames of each action were preprocessed by the following steps (a sketch follows the list):

1) Normalization
2) Tight bounding box
3) Resizing
4) Reshaping into a vector
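A hedged sketch of these four preprocessing steps (Python; the target size 64x64 and the max-normalization scheme are assumptions, not from the report; any image-resize routine works in place of skimage):

    import numpy as np
    from skimage.transform import resize

    def preprocess(depth_frame, out_size=(64, 64)):
        """Normalize, crop to a tight bounding box around the subject,
        resize, and flatten to a column vector."""
        f = depth_frame.astype(np.float64)
        f = f / f.max() if f.max() > 0 else f          # 1) normalization
        rows = np.flatnonzero(f.sum(axis=1))           # 2) tight bounding box
        cols = np.flatnonzero(f.sum(axis=0))
        f = f[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
        f = resize(f, out_size, anti_aliasing=True)    # 3) resizing
        return f.reshape(-1, 1)                        # 4) reshaping into a vector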

Table 4.1: Confused Actions

Actions                  Confused %
Bowling and Waving           9.5
Jogging and Sitting         18.5
Jogging and Walking         18.5
Waving and Bowling          20.4


4.2.1.1 Average with overlap and slice:

Experiments were performed with various parameter values; refer to Table 4.2, and for the confusion matrix corresponding to 73.9% average classification refer to Figure 4.1. The most confused actions, which degrade the average percentage of classification, are shown in Table 4.1. The average percentage of classification per action is shown in Table 4.3.

Table 4.2: Recognition in % for different parameter values

Frames (train)  Overlap (train)  Frames (test)  Overlap (test)  Slice  Grid size  Avg. % Recog.
16              4                30             25              5      4x4        72.4
30              25               30             25              5      4x4        73.9

Table 4.3: Average Recognition in %

Action      %      Action      %
Bending     94.4   Bowling     81.5
Boxing      55.6   Jogging     25.9
Jumping     55.6   Kicking     81.5
Sitting     100    Stretching  72.2
Swimming    94.4   Walking     72.2
Waving      79.6   -           -

4.2.1.2 Most distinct frames and slice:

Experiments were performed with various parameter values; refer to Table 4.4, and for the confusion matrix corresponding to 77.8% average classification refer to Figure 4.2. The most confused actions, which degrade the average percentage of classification, are shown in Table 4.6. The average percentage of classification per action is shown in Table 4.5.

Table 4.4: Recognition in % for different parameter values

Distinct frames  Slice  Grid size  Avg. % Recog.
15               6      4x4        77.8
30               5      4x4        72.9

Figure 4.2: Confusion matrix corresponding to 77.8% recognition

Table 4.5: Average Recognition in %

Action      %      Action      %
Bending     94.8   Bowling     77.7
Boxing      82.2   Jogging     53.3
Jumping     61.5   Kicking     91.5
Sitting     79.3   Stretching  80.0
Swimming    68.1   Walking     82.9
Waving      84.4   -           -

Table 4.6: Confused Actions

Actions                  Confused %
Jogging and Walking         22.9
Jumping and Jogging         20.0
Jogging and Sitting         18.5
Swimming and Bowling        17.0

4.2.1.3 Average of difference with overlap:

Experiments were performed with various parameter values; refer to Table 4.7, and for the confusion matrix corresponding to 87.2% average classification refer to Figure 4.3. The most confused actions degrade the average percentage of classification. The confused actions for the experiment with 87.2% average recognition are shown in Table 4.8, and the average percentage of classification per action is shown in Table 4.9. The confused actions for the experiment with 72.2% average recognition are shown in Table 4.10.


Table 4.7: Recognition in % for different parameter values

Removed        Test frames  Overlap  Avg. % Recog.
None           20           10       72.2
None           30           20       80.1
5th (Jumping)  30           20       87.2

Figure 4.3: Confusion matrix corresponding to 87.2% recognition

Table 4.8: Confused Actions

Actions                  Confused %
Jogging and Sitting         16.7
Jogging and Kicking         11.1
Boxing and Swimming         11.1
Walking and Sitting         11.1

Table 4.9: Average Recognition in %

Action      %      Action      %
Bending     88.8   Bowling     100
Boxing      66.7   Jogging     72.2
Kicking     94.4   Sitting     88.9
Stretching  88.9   Swimming    100
Walking     77.8   Waving      94.4

4.2.2 RGB Based HAR:

Experiments were performed on three datasets: Weizmann, KTH and the Video Analytics Lab (VAL) dataset.


Table 4.10: Confused Actions

Actions                  Confused %
Bowling and Waving          11.1
Jogging and Jumping         25.0
Jogging and Walking         22.9
Jumping and Sitting         33.3

By performing various experiments with different values of K, the K-means distance measure and the number of topics in training, we found that the performance was good with K = 500 (number of clusters), the squared Euclidean distance measure and 8 topics (in the fixed-topic scenario).

4.2.2.1 Weizmann:

kparam: the k parameter of the Harris function, corresponding to the sensitivity factor, generally in the range (0, 0.25). The smaller the value of k, the more likely the algorithm is to detect sharp corners.

thresh: the intensity-comparison threshold, used to omit weak points. The larger the value of thresh, the more weak points are omitted.

Experiment 1: Some settings were adjusted to achieve a better recognition rate. Experiments were performed in the fixed-topics-per-action-class scenario; refer to Table 4.11.

Table 4.11: Recognition in % for different parameter values

Removed      Cluster  Topics  Test descr.  Overlap  Avg. % classif.
None         500      8       12           6        88.95
5th subject  500      8       12           6        89.56
"Skip"       500      8       12           6        92.64
None         500      8       16           8        90.46
5th subject  500      8       16           8        91.30
"Skip"       500      8       16           8        93.28

Experiment 2: We changed kparam's value from 0.00050 to 0.00010 to obtain more STIP feature descriptors. Experiments were performed in the fixed-topics-per-action-class scenario; refer to Table 4.12.


Table 4.12: Recognition in % for different parameter values

Removed      Cluster  Topics  Test descr.  Overlap  Avg. % classif.
None         500      5       16           8        88.64
5th subject  500      5       16           8        94.65
None         500      8       16           8        92.47
5th subject  500      8       16           8        96.13

Figure 4.4: Weizmann after removing the 7th action and 5th subject (confusion matrices), k = 10 and k = 15

Experiment 3: Here we 1) changed kparam's value from 0.00050 (default) to 0.00010 to obtain more STIP feature descriptors, and 2) took a variable number of topics for each action class based on k = 10 and k = 15. Refer to Table 4.13 for the experiments, and to Figure 4.4 for the confusion matrices corresponding to 98.33% and 96.99% respectively. The confused actions corresponding to 96.99% average recognition are shown in Table 4.14.

Table 4.13: Recognition in % for different parameter values

Removed            Cluster  Test descr.  Overlap  k   Avg. % classif.
None               500      16           8        10  90.24
5th subject        500      16           8        10  94.38
5th sub. & "Skip"  500      16           8        10  98.33
5th subject        500      16           8        15  94.18
5th sub. & "Skip"  500      16           8        15  96.99

Table 4.14: Confused Actions

Actions                Confused %
Jump and Jack             12.5
Wave1 and Jack            25.0
Jogging and Walking        8.3
Run and Walk               6.2


Figure 4.5: KTH, variable topics (confusion matrix), subjects per action = 15

KTH:

Here we 1) changed kparam's value from 0.00050 (default) to 0.00080 and thresh's value from 1.000e-009 (default) to 1.000e-008, and 2) used frames 50 to 200, except for the 1st, 2nd and 3rd instances of the 9th subject, for which frames 201 to 351 were used (since there were no STIP features in the 50-200 frame range). This captures fewer STIP feature descriptors, making computation easier and faster. We took all actions and 5, 9 or 15 out of the 25 subjects, with all four instances.

Experiment: We evaluated our method in two scenarios: one with a fixed, manually selected number of topics (8), and another in which the topics were selected automatically. Refer to Table 4.15 for the experiments, and to Figure 4.5 for the confusion matrix corresponding to 82.84%. The confused actions are shown in Table 4.16.

Table 4.15: Recognition in % for different parameter values

Sub. taken  Clusters  Topics    Test descr.  Overlap  k   Avg. % classif.
5           500       8         16           8        10  71.92
5           500       variable  16           8        10  80.40
9           500       8         16           8        10  76.92
9           500       variable  16           8        10  77.10
15          500       8         16           8        10  80.36
15          500       variable  16           8        10  82.84


Table 4.16: Confused Actions

Actions                   Confused %
Jogging and Running          27.4
Handclap and Handwave        51.5
Handwave and Boxing          42.0
Running and Jogging          24.6

Table 4.17: Recognition in percentage for KTH

Boxing  Handclap  Handwave  Jog    Run    Walk
90.37   91.73     92.68     62.57  71.69  88.02

VAL:

Experiment: We evaluated our method in two scenarios: one with a fixed, manually selected number of topics (8), and another in which the topics were selected automatically. Refer to Table 4.18 for the experiments, and to Figure 4.6 for the confusion matrix corresponding to 95.56%. The confused actions are shown in Table 4.19.

Table 4.18: Recognition in % for different parameter values

Removed               Clusters  Topics    Test descr.  Overlap  k   Avg. % classif.
None                  500       8         20           10       10  91.55
None                  500       variable  20           10       10  89.35
"Boxing"              500       8         20           10       10  93.93
"Boxing"              500       variable  20           10       10  92.59
5th sub. & "Boxing"   500       8         20           10       10  95.56

Table 4.19: Confused Actions

Actions                    Confused %
Bowl, Kick, Sit - Walk         4.3
Sitting and Bending            5.7
Handwave and Boxing           42.0
Waving and Stretching         10.8

Table 4.20: Recognition in percentage for VAL

Bend  Bowling  Jog    Kick   Sit    Stretch  Swim  Walk   Wave
100   95.83    97.22  94.95  91.74  98.21    100   95.77  86.29


Figure 4.6: VAL, fixed topics (confusion matrix), "Boxing" and 5th subject removed


Chapter 5

Conclusion and Future Work

In this report we have analysed the performance of human action recognition based on depth and RGB video sequences. In depth-based HAR we took advantage of the depth information in the feature description because of its insensitivity to illumination changes; the method based on global features performs better than the one based on local features. In RGB-based HAR we used Space-Time Interest Points [2][3] with a latent topic model for classification. STIP effectively captures the local structure in the spatio-temporal dimensions of the video sequence, and each video sequence was represented as a 'bag of visual words'. The results show that our approaches perform satisfactorily.

Exploring new features for HAR in the sparse (l1) framework and exploring the possibility of combining depth and grey-scale videos for HAR are my future plans.



Appendix A

SPArse Modeling Software

SPAMS (SPArse Modeling Software) is an open-source optimization toolbox under the GPLv3 licence. It implements algorithms for solving various machine learning and signal processing problems involving sparse regularizations.

The function mexLasso was used in our project to solve the l1-minimization problem for the classification of human actions.

mexLasso: This is a fast implementation of the LARS algorithm [8] (a variant for solving the lasso) for solving the Lasso or Elastic-Net. Given a matrix of signals X = [x_1, \dots, x_n] in R^{m x n} and a dictionary D in R^{m x p}, depending on the input parameters the algorithm returns a matrix of coefficients A = [\alpha_1, \dots, \alpha_n] in R^{p x n} such that for every column x of X, the corresponding column of A is the solution of

\min_{\alpha \in \mathbb{R}^p} \|\alpha\|_1 \quad \text{s.t.} \quad \|x - D\alpha\|_2^2 \le \lambda    (A.1)

For efficiency reasons, the method first computes the covariance matrix D^T D, then for each signal it computes D^T x and performs the decomposition with a Cholesky-based algorithm. The implementation also has an option to add positivity constraints on the solutions. When the solution is very sparse and the problem size is reasonable, this approach can be very efficient. Moreover, it gives the solution with exact precision, and its performance does not depend on the correlation of the dictionary elements, except when the solution is not unique (the algorithm breaks in this case). Note that mexLasso can return the whole regularization path of the first signal x_1 and can handle the matrix D implicitly if the quantities D^T D and D^T x are passed as arguments; see below:


Usage: [A [path]]=mexLasso(X,D,param); or [A [path]]=mexLasso(X,Q,q,param);

Name: mexLasso

Description: mexLasso is an efficient implementation of the homotopy-LARS algorithm for solving the Lasso.

See http://www.di.ens.fr/willow/spams/index.html for more information.


Appendix B

Space Time Interest Points (STIP)

In the spatial domain, points with a significant local variation of image intensities are frequently denoted as 'interest points' and are attractive due to their high information content. These interest-point detectors are used in various applications such as image indexing [15], stereo matching [17][18], optical flow estimation and tracking [19], and recognition [20][21].

Laptev and Lindeberg [3] extended this idea of interest points into the spatio-temporal domain and illustrated how the resulting space-time features often correspond to interesting events in video data. The idea of detecting spatio-temporal interest points builds upon the Harris and Forstner interest-point operators [12][13], which capture large variations in both the spatial and temporal dimensions.

We used the code for STIP computation from the following site:

STIP implementation v1.0 (18-06-2008)
http://www.irisa.fr/vista/Equipe/People/Laptev/download/stip-1.0.zip

It was developed in 2006-2008 jointly at INRIA Rennes (http://www.irisa.fr/vista) and IDIAP (www.idiap.ch) under the supervision of Ivan Laptev and Barbara Caputo.

General: The code detects Space-Time Interest Points (STIPs) and computes the corresponding local space-time descriptors. The currently implemented detector resembles the extended space-time Harris detector described in [Laptev IJCV05]. The code does not implement scale selection; instead it selects scales that roughly correspond to the size of the detected events in space and to their duration in time, and detects points for a set of multiple combinations of spatial and temporal scales. This simplification appears to produce similar (or better) results in applications (e.g. action recognition) while resulting in a considerable speed-up and close-to-video-rate run time.

The currently implemented descriptor types are HOG (Histograms of Oriented Gradients) and HOF (Histograms of Optical Flow), computed on a 3D video patch in the neighbourhood of each detected STIP. The patch is partitioned into a grid with 3x3x2 spatio-temporal blocks; 4-bin HOG descriptors and 5-bin HOF descriptors are then computed for all blocks and concatenated into 72-element and 90-element descriptors respectively.
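The descriptor lengths follow directly from the block layout, as this small check illustrates (Python; purely arithmetic, no assumptions beyond the text above):

    blocks = 3 * 3 * 2        # 3x3x2 spatio-temporal grid = 18 blocks
    hog_dim = blocks * 4      # 4 orientation bins per block -> 72
    hof_dim = blocks * 5      # 5 flow bins per block        -> 90
    assert (hog_dim, hof_dim) == (72, 90)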


Appendix C

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. It is a powerful learning algorithm for automatically and jointly clustering words into 'topics' and documents into mixtures of topics. A topic model is, roughly, a hierarchical Bayesian model that associates with each document a probability distribution over 'topics', which are in turn distributions over words.

Figure C.1: The generative LDA Process

Figure C.2: Representation of the LDA model
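For reference, this is the textbook formulation of the generative process sketched in Figures C.1 and C.2, for a document of N words (\alpha is the Dirichlet prior and \beta the topic-word matrix):

    \theta \sim \mathrm{Dirichlet}(\alpha)              % per-document topic mixture
    \text{for each word position } i = 1, \dots, N:
    \quad z_i \sim \mathrm{Multinomial}(\theta)         % topic assignment
    \quad w_i \sim \mathrm{Multinomial}(\beta_{z_i})    % word drawn from topic z_i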


We have used the freely available code for Latent Dirichlet Allocation, developed by Daichi Mochihashi (Computational Linguistics Laboratory, NAIST, Japan / ATR Spoken Language Translation Research Laboratories, Kyoto, Japan).

Overview

lda is a Latent Dirichlet Allocation (Blei et al., 2001) package written both in MATLAB and C (command line interface). This package provides only the standard variational Bayes estimation that was first proposed, but has a simple textual data format that is almost the same as that of SVMlight or TinySVM. This package can be used as an aid to understanding LDA, or simply as a regularized alternative to PLSI, which has a severe overfitting problem due to its maximum likelihood structure. Advanced users who wish to benefit from the latest results may consider using npbayes or MPCA, though they have non-trivial data structures.

Requirements

C version:

* ANSI C compiler.

Systems below are confirmed to compile.

- Linux 2.4.20, Redhat 9, gcc 3.2.2

- Linux 2.6.5, Fedora core release 2, gcc 3.3.3

- FreeBSD 4.8-STABLE, gcc 2.95.4 (GNU make)

- SunOS 5.8, gcc 2.95.3 (GNU make)

MATLAB version:

* A MATLAB environment. The Statistics Toolbox may be needed for the psi() function (but in case it is not installed, consider using Minka's Lightspeed MATLAB toolbox).

* Octave is not supported.

Install

C version:

1. Take a glance at the Makefile and type make.

2. The C version is not intended for those who are not familiar with C. The Makefile and source files are very simple, so you can modify them as needed if the code does not compile. (If severe problems are found, please contact the author.)


MATLAB version:

Simply add the directory where you have unpacked the *.m files to the MATLAB path. For example:

1) cd /work
2) tar xvfz lda-0.1-matlab.tar.gz
3) cd lda-0.1-matlab/
4) matlab
   addpath /work/lda-0.1-matlab

Download

* C version: lda-0.1.tar.gz
* MATLAB version: lda-0.1-matlab.tar.gz

Performance

* The C version runs about 8 times or more faster than the MATLAB version (even though the MATLAB code is fully vectorized).

* However, the MATLAB version lives entirely within the MATLAB environment, so it is easy to investigate and manipulate the parameters (especially graphically, using plot or surf). Moreover, the MATLAB code is simple and easy to understand.

* To estimate the parameters of a 50-class LDA decomposition of the standard Cranfield collection (1397 documents, 5177 unique terms),
- the C version took 1 minute 32 seconds,
- the MATLAB version took 38 minutes 55 seconds, on a Xeon 2.8GHz.

* It runs efficiently in low memory: in the experiment above, it used only 6.8MB (C) and 29MB (MATLAB) of memory.

Getting Started

This package contains a sample data file train, which was compiled from the first 100 documents of the Cranfield collection. Each feature id corresponds to the respective line of the file train.lex; that is, feature 20 means the word "accuracy", feature 21 means "accurate", and so on. After compilation, you can test the package using the train data as follows.


C version:

lda -N 20 train model

MATLAB version:

matlab
[alpha, beta] = ldamain('train', 20);

The C version creates two files, model.alpha and model.beta; the MATLAB version creates a 1x20-dimensional vector alpha and a 1324x20-dimensional matrix beta. Parameters of the resulting models are explained in the sections below.

Data Format

The data format is common to both the C and MATLAB versions and is almost the same as the widely-used SVMlight format, except that there are no labels, since LDA is an unsupervised method.

A data file is an ASCII text file where each line represents a document. (NB: "document" is simply a synonym for a group of data, so you can interpret it as you like.) A typical data file is as follows:

1:1 2:4 5:2

1:2 3:3 5:1 6:1 7:1

2:4 5:1 7:1

* Each line can be a maximum of 65535 bytes (about 820 lines of 80-column text) by default. For a standard document this value is sufficient, but if you wish to increase this limit, modify BUFSIZE in feature.c as you like.

* Each line consists of <feature_id>:<count> pairs. Here, feature_id is an integer starting from 1 (the same as SVMlight); count can be an integer or a real number, and must be positive.

* <feature_id>:<count> pairs are separated by (possibly multiple) white spaces. The program is coded to work even if there are empty lines, but it is preferable that there are no such unnecessary lines.

* For a complete specification, please refer to SVMlight's page.
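As a small worked example (a sketch, not part of the lda package), the code below writes the three-document sample above from a hypothetical document-term count matrix counts, where counts(d,v) is the frequency of feature v in document d:

% A minimal sketch: write a D x V count matrix in the format above.
counts = [1 4 0 0 2 0 0;              % reproduces the sample data file
          2 0 3 0 1 1 1;
          0 4 0 0 1 0 1];
fid = fopen('train', 'w');
for d = 1:size(counts, 1)
    for v = find(counts(d, :))        % feature ids start from 1
        fprintf(fid, '%d:%g ', v, counts(d, v));
    end
    fprintf(fid, '\n');
end
fclose(fid);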

Command Line Syntax


C version

lda is typically invoked simply as:

lda -N 100 train model

train is a data file in the format described above, and model is the basename of the output files of model parameters. Specifically, lda writes two outputs, model.alpha and model.beta, which represent alpha and beta in the LDA model described in (Blei et al., 2001); that is, alpha is the parameter of the prior Dirichlet distribution over the latent classes, and beta is the set of class unigrams for each latent class. -N 100 gives the number of latent classes to assume in the data. For the standard LDA model, this is the only parameter that must be provided in advance. In this case, 100 latent classes are assumed.

Besides, there are several rarely-used options:

lda -h
lda, a Latent Dirichlet Allocation package.
Copyright (c) 2004 Daichi Mochihashi, All rights reserved.
usage: lda -N classes [-I emmax -D demmax -E epsilon] train model

-I emmax
Maximum number of iterations of the outer VB-EM algorithm, which exits early when converged. (default 100)

-D demmax
Maximum number of iterations of the inner VB-EM algorithm for each document, which exits early when converged. (default 20)

-E epsilon
A threshold that determines the overall convergence of the estimation. It is a lower threshold on the relative increase in the total data likelihood. (default 0.0001)

-h
Displays help.

MATLAB version

First, you must load a data file into a MATLAB data structure:

matlab
d = fmatrix('train');


Then run the function lda to estimate the parameters. The second argument is the number of latent classes that you assume (in the example below, 20).

help lda
Latent Dirichlet Allocation, standard model.
Copyright (c) 2004 Daichi Mochihashi, all rights reserved.
[alpha,beta] = lda(d,k,[emmax,demmax])
d      : data of documents
k      : # of classes to assume
emmax  : # of maximum VB-EM iterations (default 100)
demmax : # of maximum VB-EM iterations for a document (default 20)

[alpha,beta] = lda(d,20);

Two optional parameters, emmax and demmax, can be fed into lda; they have the same meaning as in the C version. If you find loading text data into a MATLAB structure in advance troublesome, there is a wrapper function ldamain that works exactly the same as the C version:

[alpha,beta] = ldamain('train.dat',20);

Output Format

MATLAB version

In the example above, alpha is an N-dimensional row vector of alpha for the corresponding latent topics, and beta is a [V,N]-dimensional matrix of beta, where beta(v,n) = p(v|n) (n = 1 .. N, v = 1 .. V; V is the size of the lexicon). You can save them to file using the standard MATLAB function save, for example as:

[alpha,beta] = ldamain('train.dat',20);
number of latent classes = 20
number of documents = 100
number of words = 1324
iteration 26/100.. likelihood = 339.167
ETA: 0:01:03 (1 sec/step)
converged.
save('alpha.dat', 'alpha', '-ascii');
save('beta.dat', 'beta', '-ascii');

C version


If you invoke lda as follows,

lda -N 100 train model

two files, model.alpha and model.beta, are created. These two files have exactly the same format as those saved from MATLAB: model.alpha is a space-separated N-dimensional vector of alpha, and model.beta is a space-separated V x N matrix of beta. These parameters can be loaded into MATLAB in the standard way:

beta = load('model.beta');

And you can then manipulate these parameters within MATLAB.
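Since beta(v,n) = p(v|n), a natural next step is to inspect the most probable features of each latent class. The sketch below is an illustration, assuming (as in the Getting Started section) that train.lex lists one word per line in feature-id order:

% A minimal sketch: print the ten most probable words of each class.
beta  = load('model.beta');                % V x N, columns sum to 1
words = textread('train.lex', '%s');       % vocabulary, one word per line
for n = 1:size(beta, 2)
    [~, idx] = sort(beta(:, n), 'descend');
    fprintf('class %d:', n);
    fprintf(' %s', words{idx(1:10)});
    fprintf('\n');
end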


Bibliography

[1] “SPArse Modeling Software (SPAMS),” http://www.di.ens.fr/willow/spams/index.html.

[2] “Space Time Interest Points (STIP),” http://www.di.ens.fr/~laptev/interestpoints.html.

[3] I. Laptev and T. Lindeberg, “Space-time interest points,” in Proceedings of the Ninth IEEE International Conference on Computer Vision, vol. 1, pp. 432–439, 2003.

[4] T. B. Moeslund, A. Hilton, and V. Krüger, “A survey of advances in vision-based human motion capture and analysis,” Computer Vision and Image Understanding, vol. 104, no. 2-3, pp. 90–126, 2006.

[5] A. Efros, A. Berg, G. Mori, and J. Malik, “Recognizing action at a distance,”

in Proc. 9th Int. Conf. Computer Vision, vol. 2, pp. 726–733, 2003.

[6] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, “Actions as space-time shapes,” in Proc. IEEE International Conference on Computer Vision, pp. 1395–1402, 2005.

[7] C. Schüldt, I. Laptev, and B. Caputo, “Recognizing human actions: a local SVM approach,” in Proc. International Conference on Pattern Recognition, 2004.

[8] J. L. Barron, D. J. Fleet, and S. S. Beauchemin, “Performance of optical flow techniques,” International Journal of Computer Vision, vol. 12, no. 1, pp. 43–77, 1994.

[9] S. M. Smith and J. M. Brady, “SUSAN - a new approach to low level image processing,” International Journal of Computer Vision, vol. 23, no. 1, pp. 45–78, 1997.

[10] M. Isard and A. Blake, “ICondensation: Unifying low-level and high-level tracking in a stochastic framework,” in ECCV (1) (H. Burkhardt and B. Neumann, eds.), vol. 1406 of Lecture Notes in Computer Science, pp. 893–908, Springer, 1998.

[11] M. J. Black and A. D. Jepson, “Recognizing temporal trajectories using the

condensation algorithm,” in FG, pp. 16–21, IEEE Computer Society, 1998.

[12] W. Förstner and E. Gülch, “A fast operator for detection and precise location of distinct points, corners and centres of circular features,” in Proc. ISPRS Intercommission Workshop, Interlaken, pp. 281–305, 1987.

[13] C. Harris and M. Stephens, “A combined corner and edge detector,” in Proc. Alvey Vision Conference, Manchester, UK, pp. 147–151, 1988.

[14] T. Lindeberg, “Feature detection with automatic scale selection,” International

Journal of Computer Vision, vol. 30, no. 2, pp. 79–116, 1998.

[15] C. Schmid and R. Mohr, “Local grayvalue invariants for image retrieval,” IEEE

Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 5,

pp. 530–535, 1997.

[16] I. Laptev and T. Lindeberg, “Local descriptors for spatio-temporal recognition,” in SCVMA (W. J. MacLean, ed.), vol. 3667 of Lecture Notes in Computer Science, pp. 91–103, Springer, 2004.

[17] T. Tuytelaars and L. Van Gool, “Wide baseline stereo matching based on local, affinely invariant regions,” in Proc. British Machine Vision Conference, pp. 412–425, 2000.

[18] K. Mikolajczyk and C. Schmid, “An affine invariant interest point detector,” in Proc. European Conference on Computer Vision, pp. 128–142, 2002.

[19] D. Tell and S. Carlsson, “Combining appearance and topology for wide baseline matching,” in Proc. European Conference on Computer Vision, pp. 68–81, 2002.

[20] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1150–1157, 1999.

[21] D. Hall, J. L. Crowley, and V. Colin de Verdière, “View invariant object recognition using coloured receptive fields,” Machine Graphics and Vision, pp. 1–12, 2000.