Discriminant Bag of Words based Representation for Human Action Recognition

Alexandros Iosifidis, Anastasios Tefas and Ioannis Pitas

Department of Informatics, Aristotle University of Thessaloniki, Box 451, 54124 Thessaloniki, Greece

{aiosif,tefas,pitas}@aiia.csd.auth.gr

Abstract

In this paper we propose a novel framework for human action recognition based on the Bag of Words (BoWs) action representation, which unifies discriminative codebook generation and discriminant subspace learning. The proposed framework is able to, naturally, incorporate several (linear or non-linear) discrimination criteria for discriminant BoWs-based action representation. An iterative optimization scheme is proposed for sequential discriminant BoWs-based action representation and codebook adaptation, based on action discrimination in a reduced dimensionality feature space where action classes are better discriminated. Experiments on five publicly available data sets aiming at different application scenarios demonstrate that the proposed unified approach increases the codebook discriminative ability, providing enhanced action classification performance.

Keywords: Bag of Words, Discriminant Learning, Codebook Learning

1. Introduction

Human action recognition from videos has been intensively studied in the last two decades due to its importance in a wide range of applications, like human-computer interaction (HCI), content-based video retrieval and augmented reality, to name a few. It is still an active research field due to its difficulty, which is mainly caused by the lack of a formal description of actions. Action execution style variations and changes in human body sizes among individuals, as well as different camera observation angles, are some of the reasons that lead to high intra-class and, possibly, small inter-class variations of action classes. Recently, several action descriptors aiming at action recognition in unconstrained environments have been proposed, including local sparse and dense space-time features ([23, 45, 10, 20, 34, 35]). Such descriptors capture information appearing in video frame locations that either correspond to video frame interest points which are tracked during action execution, or that are subject to abrupt intensity value variations and, thus, contain information regarding motion speed and/or acceleration, which is of interest for the description of actions. These local video frame descriptors are calculated on the color (grayscale) video frames and, thus, video frame segmentation is not required.

After describing actions, videos depicting actions, called action videos hereafter, are usually represented by fixed size vectors. Several feature coding approaches have been proposed in order to determine compact (vectorial) representations ([14]), including Sparse Coding ([42]), Fisher Vectors ([28]), Local Tangent coding ([47]) and Salience-based coding ([13]). Perhaps the most well studied and successful approach for action representation is based on the Bag of Words (BoWs) model ([4]). According to this model, each action video is represented by a vector obtained by applying (hard or soft) quantization on the features describing it, using a set of feature prototypes forming the so-called codebook. This codebook is determined by clustering the features describing the training action videos. The BoWs-based action representation has been combined with several classifiers, like Support Vector Machines, Artificial Neural Networks and Discriminant Analysis based classification schemes, providing high action classification performance on publicly available data sets aiming at different application scenarios. However, due to the fact that the calculation of the adopted codebook is based on an unsupervised process, the discriminative ability of the BoWs-based action representation is limited.

In order to increase the quality of the adopted codebook, codebook adaptation processes have been proposed which adopt a generative approach. That is, the initial codebook generated by clustering the features describing the training videos is adapted so as to reduce the reconstruction error of the resulting video representation ([37]). However, since this generative adaptation process does not take into account the class labels that are available for the training action videos, the discriminative ability of the optimized codebook is not necessarily increased. In order to increase the discriminative ability of the adopted codebook, researchers have begun to introduce discriminative codebook learning processes ([40, 29]). However, since the codebook calculation process is, still, disconnected from the adopted classification scheme, the obtained codebook may not be the one that is best suited for the task under consideration, i.e., the classification of actions in our case.

A method aiming at simultaneously learning both a discriminative codebook and a classifier is proposed in ([43]) for image classification. This method consists of two iteratively repeated steps. The first one involves representing the training images by a set of class-specific histograms of visual words at the bit level and training multiple binary classifiers, one for each image category, by using the obtained histograms. Based on the performance of each classifier, the set of training histograms is updated in the second step. While this approach has led to increased image classification performance, its extension to other classification tasks, e.g., action recognition, is not straightforward. Another approach has been proposed in ([25]), where a two-class linear SVM-based codebook adaptation scheme is formulated. The adoption of a two-class formulation generates the drawback that C(C − 1)/2 two-class codebooks have to be learned (C being the number of classes) and used in the test phase, along with an appropriate fusion strategy. In addition, such an approach is not able to exploit inter-class correlation information appearing in multi-class problems, which may facilitate class discrimination.

In this paper, we build on the BoWs-based action video representation by introducing discriminative criteria in the codebook learning process. The proposed approach integrates codebook learning and action class discrimination in a multi-class optimization process, in order to produce a discriminant BoWs-based action video representation. Two processing steps are iteratively repeated to this end. The first one involves the calculation of the BoWs-based representation of the training action videos using a codebook of representative features, and the learning of an optimal mapping of the obtained BoWs-based action video representations to a discriminant feature space where action classes are better discriminated. In the second step, based on an action class discrimination criterion in the obtained feature space, the adopted codebook is adapted in order to increase action class discrimination. In order to classify a new, unknown, action video, it is represented by employing the optimized codebook, and the obtained BoWs-based action video representation is mapped to the discriminant feature space determined in the training phase. In this discriminant space, classification can be performed by employing several classifiers, like K-Nearest Neighbors (K-NN), Support Vector Machines (SVM) or Artificial Neural Networks (ANN). Here it should be noted that the proposed approach does not aim at increasing the representation power of the adopted codebook. Instead, it aims at increasing its discrimination power for the classification task under consideration, i.e., the discrimination of the action classes involved in the classification problem at hand.


The rest of the paper is structured as follows. The proposed approach for integrated discriminant codebook and discriminant BoWs-based action representation learning is described in Section 2. Experiments conducted on publicly available data sets aiming at different application scenarios are illustrated in Section 3. Finally, conclusions are drawn in Section 4.

2. Discriminant Codebook Learning for BoWs-based Action Representation

In this section we describe in detail the proposed integrated optimization scheme for discriminant BoWs-based action representation. Let $\mathcal{U}$ be a video database containing $N_T$ action videos accompanied by action class labels $l_i$, $i = 1, \ldots, N_T$, belonging to an action class set $\mathcal{A} = \{\alpha\}_{\alpha=1}^{C}$. Let us assume that each action video $i$ is described by $N_i$ feature vectors $\mathbf{p}_{ij} \in \mathbb{R}^D$, $i = 1, \ldots, N_T$, $j = 1, \ldots, N_i$, which are normalized in order to have unit $l_2$ norm. We employ the feature vectors $\mathbf{p}_{ij}$ and the action class labels $l_i$ in order to represent each action video $i$ by two discriminant feature vectors $\mathbf{s}_i \in \mathbb{R}^K$ and $\mathbf{z}_i \in \mathbb{R}^d$, $d < K$, lying in the feature space determined by the adopted codebook and in the discriminant space, respectively.

2.1. Standard BoWs-based action representation

Let us denote by $\mathbf{V} \in \mathbb{R}^{D \times K}$ a codebook formed by codebook vectors $\mathbf{v}_k \in \mathbb{R}^D$, $k = 1, \ldots, K$. This codebook is calculated by clustering the feature vectors $\mathbf{p}_{ij}$, $i = 1, \ldots, N_T$, $j = 1, \ldots, N_i$, without exploiting the available labeling information for the training action videos. Several clustering techniques can be employed to this end. K-Means has been widely adopted for its simplicity and fast operation. The codebook vectors $\mathbf{v}_k$ are, usually, determined to be the cluster mean vectors. After determining the codebook $\mathbf{V}$, the standard BoWs-based action representation of action video $i$ is obtained by applying hard or soft vector quantization on the feature vectors $\mathbf{p}_{ij}$, $j = 1, \ldots, N_i$. In the first case, the BoWs-based representation of action video $i$ is a histogram of features, calculated by assigning each feature vector $\mathbf{p}_{ij}$ to the cluster of the closest codebook vector $\mathbf{v}_k$. In the second case, a distance function, usually the Euclidean one, is used in order to determine $N_i$ distance vectors, each denoting the similarity of feature vector $\mathbf{p}_{ij}$ to all the codebook vectors $\mathbf{v}_k$, and the BoWs-based representation of action video $i$ is determined to be the mean normalized distance vector ([15]).
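As a concrete illustration (not part of the original method description), the following Python sketch builds an unsupervised codebook with K-Means and computes a hard-assignment BoWs histogram for a single video; it assumes numpy and scikit-learn, and that each video is given as an array holding one descriptor per row.

import numpy as np
from sklearn.cluster import KMeans

def learn_codebook(train_descriptors, K=256, seed=0):
    """Cluster all training descriptors (stacked into one array) into K codebook vectors."""
    kmeans = KMeans(n_clusters=K, n_init=10, random_state=seed)
    kmeans.fit(np.vstack(train_descriptors))
    return kmeans.cluster_centers_          # V: array of shape (K, D)

def hard_bow(video_descriptors, codebook):
    """Standard BoWs: histogram of nearest-codeword assignments, l1-normalized."""
    dists = np.linalg.norm(video_descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)      # nearest codebook vector per descriptor
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()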

2.2. Discriminant BoWs-based action representation

The proposed discriminant BoWs-based representation exploits a generalization of the Euclidean distance, i.e.:

$$ d_{ijk} = \|\mathbf{v}_k - \mathbf{p}_{ij}\|_2^{-g}. \quad (1) $$

The use of a parameter value $g = 1.0$ leads to a BoWs-based representation based on soft vector quantization, while a parameter value $g \gg 1.0$ leads to a BoWs-based representation based on hard vector quantization. By using the above distance function, each feature vector $\mathbf{p}_{ij}$ is mapped to the so-called membership vector $\mathbf{u}_{ij} \in \mathbb{R}^K$, encoding the similarity of $\mathbf{p}_{ij}$ to all the codebook vectors $\mathbf{v}_k$. Membership vectors $\mathbf{u}_{ij} \in \mathbb{R}^K$ are obtained by normalizing the distance vectors $\mathbf{d}_{ij} = [d_{ij1} \ldots d_{ijK}]^T$ in order to have unit $l_1$ norm, i.e.:

$$ \mathbf{u}_{ij} = \frac{\mathbf{d}_{ij}}{\|\mathbf{d}_{ij}\|_1}. \quad (2) $$

The BoWs-based representation of action video $i$ is obtained by calculating the mean membership vector:

$$ \mathbf{q}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} \mathbf{u}_{ij}. \quad (3) $$

Finally, the mean membership vectors $\mathbf{q}_i$ are normalized in order to produce the so-called action vectors:

$$ \mathbf{s}_i = \frac{\mathbf{q}_i}{\|\mathbf{q}_i\|_2}. \quad (4) $$

The adopted similarity function, as well as the use of the mean similarity to the codebook vectors for action video representation, allow for better diffusion of the similarity along the codebook vectors. This is more effective especially for small feature sets, where the resulting standard BoWs-based representations are rather sparse, as is the case of STIP-based extracted features.
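For clarity, a minimal numpy sketch of Eqs. (1)-(4) is given below; the small constant eps is our own addition to avoid division by zero for descriptors that coincide with a codebook vector, and the default value of g is arbitrary.

import numpy as np

def action_vector(video_descriptors, codebook, g=5.0, eps=1e-12):
    """Discriminant BoWs representation of one video, following Eqs. (1)-(4)."""
    # Eq. (1): d_ijk = ||v_k - p_ij||_2^{-g}
    dists = np.linalg.norm(video_descriptors[:, None, :] - codebook[None, :, :], axis=2)
    d = (dists + eps) ** (-g)
    # Eq. (2): membership vectors with unit l1 norm
    u = d / d.sum(axis=1, keepdims=True)
    # Eq. (3): mean membership vector
    q = u.mean(axis=0)
    # Eq. (4): action vector with unit l2 norm
    return q / np.linalg.norm(q)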

After calculating the action vectors representing all the action videos, they are normalized in order to have zero mean and unit standard deviation, resulting in the normalized action vectors $\mathbf{x}_i \in \mathbb{R}^K$. In order to map the normalized action vectors $\mathbf{x}_i$ to a new feature space in which action classes are better discriminated, an optimal linear transformation $\mathbf{W}^*$ is obtained by solving the trace ratio optimization problem:

$$ \mathbf{W}^* = \arg\min_{\mathbf{W}} \frac{\mathrm{trace}\{\mathbf{W}^T \mathbf{A} \mathbf{W}\}}{\mathrm{trace}\{\mathbf{W}^T \mathbf{B} \mathbf{W}\}}, \quad (5) $$

where $\mathbf{A}$, $\mathbf{B}$ are matrices encoding properties of interest of the training normalized action vectors $\mathbf{x}_i$, $i = 1, \ldots, N_T$. That is, they are functions of the training normalized action vectors, i.e., $\mathbf{A}(\mathbf{x}_i)$, $\mathbf{B}(\mathbf{x}_i)$. Finally, the discriminant action vectors $\mathbf{z}_i$, $i = 1, \ldots, N_T$, are obtained by:

$$ \mathbf{z}_i = \mathbf{W}^{*T} \mathbf{x}_i. \quad (6) $$

The optimization problem in (5) is usually approximated by solving the ratio trace optimization problem $\mathbf{A}\mathbf{w} = \lambda \mathbf{B}\mathbf{w}$, $\lambda \neq 0$ ([36]), which can be solved by performing eigenanalysis of the matrix $\mathbf{B}^{-1}\mathbf{A}$ in the case where $\mathbf{B}$ is invertible, or of $\mathbf{A}^{-1}\mathbf{B}$ in the case where $\mathbf{A}$ is invertible. In the case where both $\mathbf{A}$, $\mathbf{B}$ are singular, i.e., when $N_T < K$, the strictly diagonally dominant criterion for nonsingular matrices is exploited and a regularized version $\mathbf{A} = \mathbf{A} + r\mathbf{I}$ is employed ([26]), where $r$ is a small positive value. However, as has been shown in ([36, 17]), the original trace ratio problem can be solved directly through the equivalent trace difference optimization problem:

$$ \mathbf{W}^* = \arg\max_{\mathbf{W}^T\mathbf{W}=\mathbf{I}} \mathrm{Tr}\left[ \mathbf{W}^T (\mathbf{B} - \lambda^* \mathbf{A}) \mathbf{W} \right], \quad (7) $$

where $\lambda^* \geq 0$ is the trace ratio $\lambda^* = \frac{\mathrm{Tr}(\mathbf{W}^T \mathbf{B} \mathbf{W})}{\mathrm{Tr}(\mathbf{W}^T \mathbf{A} \mathbf{W})}$. The trace ratio value $\lambda^*$ can be calculated by using an efficient algorithm based on the Newton-Raphson method ([17]). After determining $\lambda^*$, the trace difference optimization problem is solved by performing eigenanalysis of the matrix $\mathbf{S} = \mathbf{B} - \lambda^* \mathbf{A}$. The optimal transformation matrix $\mathbf{W}^*$ is formed by the eigenvectors corresponding to the nonzero eigenvalues of $\mathbf{S}$.
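The following sketch illustrates this step. It is a simple fixed-point variant of the iterative trace ratio solvers discussed in ([36, 17]), alternating between updating lambda as the current trace ratio and recomputing W from the leading eigenvectors of B - lambda*A; it is not the exact Newton-Raphson algorithm of ([17]).

import numpy as np

def trace_ratio(A, B, d, n_iter=50, tol=1e-8):
    """Maximize Tr(W^T B W) / Tr(W^T A W) over W with W^T W = I (cf. Eqs. (5)-(7))."""
    K = A.shape[0]
    W = np.eye(K)[:, :d]                     # initial orthonormal projection
    lam = 0.0
    for _ in range(n_iter):
        # eigenvectors of the symmetric matrix S = B - lambda*A (Eq. (7))
        eigvals, eigvecs = np.linalg.eigh(B - lam * A)
        W = eigvecs[:, np.argsort(eigvals)[::-1][:d]]    # d largest eigenvalues
        new_lam = np.trace(W.T @ B @ W) / np.trace(W.T @ A @ W)
        if abs(new_lam - lam) < tol:
            break
        lam = new_lam
    return W, lam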

By following the above described procedure for discriminant action vector calculation, several discriminant spaces can be obtained for action representation. As has been shown in ([41]), a wide range of (linear and non-linear) discriminant learning techniques can be obtained by solving the trace ratio optimization problem (5) and employing the graph embedding framework, including Principal Component Analysis (PCA) ([6]), Linear Discriminant Analysis (LDA) ([6]), ISOMAP ([33]), Locally Linear Embedding (LLE) ([3]) and Laplacian Eigenmaps (LE) ([1]).

After determining the above described discriminant action video representation, a codebook adaptation process is performed in order to increase the codebook discriminative ability, based on action class discrimination in the obtained discriminant space. The procedure followed to this end is described in the following.

2.3. Codebook Adaptation

By observing that the normalized action vectors $\mathbf{x}_i$ are functions of the adopted codebook $\mathbf{V}$, as described in subsection 2.2, it can be seen that the optimization problem (5) is a function of both the transformation matrix $\mathbf{W}$ and the codebook $\mathbf{V}$. Based on this observation, we propose to minimize the trace ratio criterion with respect to both $\mathbf{W}$ and $\mathbf{V}$, in order to simultaneously increase the codebook discriminative ability and obtain the optimal transformation matrix for action class discrimination:

$$ \mathcal{J}(\mathbf{W}, \mathbf{V}) = \frac{\mathrm{trace}\{\mathbf{W}^T \mathbf{A}(\mathbf{V}) \mathbf{W}\}}{\mathrm{trace}\{\mathbf{W}^T \mathbf{B}(\mathbf{V}) \mathbf{W}\}}. \quad (8) $$

Since the minimization of (8), for a given $\mathbf{V}$, can be readily computed by solving the trace ratio problem ([17]), we propose an iterative optimization scheme consisting of two steps. In the first step, for a given codebook, the training normalized action vectors $\mathbf{x}_i$ are employed in order to determine the optimal projection matrix $\mathbf{W}^*_t$ by solving the trace ratio problem (8). In the second step, the codebook vectors $\mathbf{v}_{k,t}$ are adapted, in the direction of the negative gradient of (8), by using the obtained $\mathbf{W}^*_t$. Here, we have introduced the index $t$ denoting the iteration of the iterative optimization process. The adaptation of $\mathbf{v}_{k,t}$ is performed by following the gradient of $\mathcal{J}$ with respect to $\mathbf{v}_{k,t}$:

$$ \mathbf{v}_{k,t+1} = \mathbf{v}_{k,t} - \eta \frac{\partial \mathcal{J}_t}{\partial \mathbf{v}_{k,t}}, \quad (9) $$

$$ \frac{\partial \mathcal{J}_t}{\partial \mathbf{v}_{k,t}} = \frac{\partial \mathcal{J}_t}{\partial x_{ik,t}} \frac{\partial x_{ik,t}}{\partial q_{ik,t}} \frac{\partial q_{ik,t}}{\partial d_{ijk,t}} \frac{\partial d_{ijk,t}}{\partial \mathbf{v}_{k,t}}, \quad (10) $$

where $\eta$ is an update rate parameter. In order to avoid scaling issues, the codebook vectors of both the initial and the updated codebooks, $\mathbf{v}_{k,0}$ and $\mathbf{v}_{k,t}$ respectively, are normalized to have unit $l_2$ norm.

The two above described optimization steps are performed until $(\mathcal{J}_t - \mathcal{J}_{t+1})/\mathcal{J}_t < \epsilon$, where $\epsilon$ is a small positive value (equal to $\epsilon = 10^{-6}$ in our experiments). Since LDA is the most widely adopted discriminant learning technique, due to its effectiveness in many classification problems, we provide the update rule obtained by employing the LDA discrimination criterion in the following. The derivation of codebook adaptation processes exploiting different discrimination criteria is straightforward.

LDA determines an optimal discriminant space for data projection, in which classes are better discriminated. The adopted criterion is the ratio of the within-class scatter to the between-class scatter in the projection space. LDA solves the optimization problem (8) by using the matrices:

$$ \mathbf{A}_t = \sum_{\alpha=1}^{C} \sum_{i=1}^{N_T} b_i^{\alpha} (\mathbf{x}_{i,t} - \bar{\mathbf{x}}_t^{\alpha}) (\mathbf{x}_{i,t} - \bar{\mathbf{x}}_t^{\alpha})^T, \quad (11) $$

$$ \mathbf{B}_t = \sum_{\alpha=1}^{C} (\bar{\mathbf{x}}_t^{\alpha} - \bar{\mathbf{x}}_t) (\bar{\mathbf{x}}_t^{\alpha} - \bar{\mathbf{x}}_t)^T, \quad (12) $$

where $\mathbf{A}_t$, $\mathbf{B}_t$ are the within-class and between-class scatter matrices obtained by using $\mathbf{V}_t$, respectively. $b_i^{\alpha}$ is an index denoting whether the normalized action vector $\mathbf{x}_{i,t}$ belongs to action class $\alpha$, i.e., $b_i^{\alpha} = 1$ if $l_i = \alpha$ and $b_i^{\alpha} = 0$ otherwise. $\bar{\mathbf{x}}_t^{\alpha}$ is the mean vector of action class $\alpha$ and $\bar{\mathbf{x}}_t$ is the mean normalized action vector of the entire training set.
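A direct numpy implementation of Eqs. (11) and (12) could look as follows (a sketch, assuming X stores one normalized action vector per row and labels is a numpy array of class indices):

import numpy as np

def lda_scatter_matrices(X, labels):
    """Within-class (A) and between-class (B) scatter matrices of Eqs. (11)-(12)."""
    overall_mean = X.mean(axis=0)
    K = X.shape[1]
    A = np.zeros((K, K))
    B = np.zeros((K, K))
    for alpha in np.unique(labels):
        Xa = X[labels == alpha]                      # vectors of class alpha
        class_mean = Xa.mean(axis=0)
        centered = Xa - class_mean
        A += centered.T @ centered                   # Eq. (11), within-class scatter
        diff = (class_mean - overall_mean)[:, None]
        B += diff @ diff.T                           # Eq. (12), between-class scatter
    return A, B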

By using (11) and (12), the gradient (10) is given by:

$$ \frac{\partial \mathcal{J}_t}{\partial \mathbf{v}_{k,t}} = \left( a \tilde{\mathbf{W}}_{t(i,:)} (\mathbf{x}_{i,t} - \bar{\mathbf{x}}_t^{\alpha}) - c \tilde{\mathbf{W}}_{t(i,:)} \bar{\mathbf{x}}_t^{\alpha} \right) \cdot \left( \frac{1}{s_{k,t}} - \frac{s_{ik,t} - \bar{s}_{k,t}}{s_{k,t}^{3}} \right) \left( \frac{1}{\|\mathbf{q}_{i,t}\|_2} - \frac{q_{ik,t}^2}{\|\mathbf{q}_{i,t}\|_2^{3}} \right) \cdot \frac{N_T - 1}{N_T N_i} \left( \frac{1}{\|\mathbf{d}_{ij,t}\|_1} - \frac{d_{ijk,t}}{\|\mathbf{d}_{ij,t}\|_1^{2}} \right) \cdot \left( -g \right) \|\mathbf{v}_{k,t} - \mathbf{p}_{ij}\|_2^{-(g+2)} (\mathbf{v}_{k,t} - \mathbf{p}_{ij}), \quad (13) $$

where $a = \frac{2 b_i^{\alpha}}{\mathrm{trace}(\mathbf{W}_t^T \mathbf{B}_t \mathbf{W}_t)}$, $c = \frac{2 b_i^{\alpha} \, \mathrm{trace}(\mathbf{W}_t^T \mathbf{A}_t \mathbf{W}_t)}{\mathrm{trace}(\mathbf{W}_t^T \mathbf{B}_t \mathbf{W}_t)^2}$, $\tilde{\mathbf{W}}_{t(i,:)}$ is the $i$-th row of the matrix $\tilde{\mathbf{W}}_t = \mathbf{W}_t \mathbf{W}_t^T$, and $\bar{s}_{k,t}$, $s_{k,t}$ are the mean and standard deviation of the training action vectors in dimension $k$, respectively.

The update rate parameter value $\eta$ in (9) can either be set to a fixed value, e.g., $\eta = 0.01$, or be determined dynamically. In order to accelerate the codebook adaptation process and (possibly) to avoid convergence to local minima, in our experiments we have employed a (dynamic) line search strategy. That is, in each iteration of the codebook adaptation process, the trace ratio criterion (8) is evaluated after applying (9) with $\eta_0 = 0.1$. In the case where $\mathcal{J}_{t+1} < \mathcal{J}_t$, the criterion is re-evaluated using a codebook update parameter value $\eta_n = 2\eta_{n-1}$; this process is repeated until $\mathcal{J}_{t+1} > \mathcal{J}_t$, and the update parameter value providing the largest decrease of $\mathcal{J}$ is employed for codebook adaptation. In the case where the use of $\eta_0 = 0.1$ yields $\mathcal{J}_{t+1} > \mathcal{J}_t$, the criterion is re-evaluated using $\eta_n = \eta_{n-1}/2$; this process is repeated until $\mathcal{J}_{t+1} < \mathcal{J}_t$, and the update parameter value providing a decrease of $\mathcal{J}$ is employed for codebook adaptation.
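The following sketch illustrates this line search strategy for a single codebook update. The arguments grad (the gradient of Eq. (13), or a numerical estimate of it) and evaluate_J (a function returning the criterion (8) for a candidate codebook, i.e., recomputing the action vectors and solving the trace ratio problem) are assumed to be supplied by the caller.

import numpy as np

def line_search_step(V, grad, J_current, evaluate_J, eta0=0.1, max_tries=10):
    """Dynamic step-size selection for the codebook update of Eq. (9)."""
    def step(eta):
        V_new = V - eta * grad
        V_new = V_new / np.linalg.norm(V_new, axis=1, keepdims=True)  # keep unit l2 norm
        return V_new, evaluate_J(V_new)

    eta = eta0
    V_best, J_best = step(eta)
    if J_best < J_current:
        # criterion decreased: try progressively larger steps while they keep improving
        for _ in range(max_tries):
            V_try, J_try = step(eta * 2)
            if J_try >= J_best:
                break
            eta, V_best, J_best = eta * 2, V_try, J_try
    else:
        # criterion increased: shrink the step until a decrease is found
        for _ in range(max_tries):
            eta = eta / 2
            V_best, J_best = step(eta)
            if J_best < J_current:
                break
        else:
            return V, J_current, 0.0        # no improving step found; keep the codebook
    return V_best, J_best, eta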

2.4. Action Recognition (Test Phase)

Let us denote by $\mathbf{V}_{opt}$, $\mathbf{W}^*_{opt}$ the codebook and the corresponding projection matrix obtained by applying the above described optimization process employing the feature vectors $\mathbf{p}_{ij}$ describing the training action videos and the corresponding action class labels. Let $\mathbf{p}_{tj} \in \mathbb{R}^D$, $j = 1, \ldots, N_t$, be the feature vectors describing a test action video. $\mathbf{p}_{tj}$ are employed in order to calculate the corresponding normalized action vector $\mathbf{x}_t \in \mathbb{R}^K$ using $\mathbf{V}_{opt}$. $\mathbf{x}_t$ can either be classified in this space, or be mapped to the discriminant space determined in the training phase by applying $\mathbf{z}_t = \mathbf{W}^{*T}_{opt} \mathbf{x}_t$. In either case, action classification is performed by employing any, linear or non-linear, classifier, like K-NN, SVM and ANNs.
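A minimal sketch of the test phase is shown below, using a k-NN classifier as one of the options mentioned above; action_vector refers to the earlier sketch of Eqs. (1)-(4), while mean, std and train_Z are assumed to come from the training phase (the z-score normalization statistics and the discriminant vectors of the training videos).

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def classify_test_video(test_descriptors, V_opt, W_opt, train_Z, train_labels,
                        mean, std, k=5):
    """Test phase: normalized action vector with the optimized codebook,
    projection to the discriminant space, and classification (here with k-NN)."""
    s = action_vector(test_descriptors, V_opt)        # Eqs. (1)-(4) with V_opt
    x = (s - mean) / std                              # normalization learned on training data
    z = W_opt.T @ x                                   # Eq. (6): discriminant action vector
    knn = KNeighborsClassifier(n_neighbors=k).fit(train_Z, train_labels)
    return knn.predict(z[None, :])[0]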

3. Experimental Evaluation

In this section we present experiments conducted in order to evaluate the proposed discriminant codebook learning technique and the obtained discriminant BoWs-based action representation. In all the experiments we have employed the Harris3D detector ([21]), followed by HOG/HOF descriptor ([35]) calculation, for action video description. We should note here that the proposed optimization framework for discriminant BoWs-based representation can be combined with any descriptor proposed for BoWs-based representation. The optimal values of the parameters $K$ and $g$ have been determined by applying grid search using values $50 < K < 500$ and $g = [1, 2, 5, 10, 20]$, respectively. In order to limit the complexity, we cluster a subset of 100k randomly selected HOG/HOF descriptors for the initial codebook calculation. To increase the precision of the initial codebook, we initialize K-Means 10 times and keep the codebook providing the smallest error. In the test phase, classification is performed by employing a Single-hidden Layer Feedforward Neural Network trained by applying the recently proposed Extreme Learning Machine algorithm ([12]).
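As an illustration of this parameter selection step, the sketch below performs a grid search over K and g; the evaluate callback is a hypothetical function returning a validation accuracy for a given codebook and g value, and KMeans(n_init=10) performs the ten random initializations, keeping the clustering with the smallest error (inertia).

import numpy as np
from sklearn.cluster import KMeans

def select_parameters(train_videos, train_labels, evaluate,
                      K_values=(100, 200, 300, 400, 500), g_values=(1, 2, 5, 10, 20)):
    """Grid search over the codebook size K and the quantization parameter g."""
    rng = np.random.default_rng(0)
    descriptors = np.vstack(train_videos)
    if len(descriptors) > 100_000:                    # cluster a random 100k subset, as in the paper
        descriptors = descriptors[rng.choice(len(descriptors), 100_000, replace=False)]
    best = (None, None, -np.inf)
    for K in K_values:
        codebook = KMeans(n_clusters=K, n_init=10).fit(descriptors).cluster_centers_
        for g in g_values:
            acc = evaluate(codebook, g, train_videos, train_labels)
            if acc > best[2]:
                best = (K, g, acc)
    return best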

We have used five publicly available data sets aiming at different application scenarios, from single-view everyday actions, actions appearing in movies, complex ballet movements, multi-view everyday actions and, even, facial expressions. A comprehensive description of the data sets and information concerning the experimental setup used in each data set are provided in the following subsections. In all the experiments we compare the performance of the standard BoWs-based action video representation (BoWs), i.e., the BoWs-based action video representation adopting an unsupervised codebook, and the proposed discriminant BoWs-based action representation (DBoWs), i.e., the discriminant BoWs-based action representation obtained by applying the proposed supervised codebook learning optimization scheme. Furthermore, we provide comparison results of the proposed discriminant BoWs-based action video representation, adopting the above mentioned descriptor-classifier combination, with some recently proposed state-of-the-art action recognition methods evaluating their performance on the adopted action recognition data sets.

3.1. The KTH Data Set

The KTH data set ([32]) consists of 600 videos depicting twenty-five persons, each performing six everyday actions: 'walking', 'jogging', 'running', 'boxing', 'hand waving' and 'hand clapping'. Four different scenarios have been recorded: (s1) outdoors, (s2) outdoors with scale variation, (s3) outdoors with different clothes and (s4) indoors. The persons are free to change motion speed and direction between different action realizations. Example video frames are illustrated in Figure 1. The most widely adopted experimental protocol in this data set is based on a split (16 training and 9 test persons) that has been used in ([32]). In order to limit the complexity of the proposed codebook adaptation process, each action video has been represented by $N_k = 500$ vectors obtained by clustering the HOG/HOF descriptors calculated on the detected STIP video locations. This choice clearly accelerates the training process of the proposed method, since codebook adaptation is performed by using only $N_k$ vectors. However, such an approach may decrease action recognition performance. In our preliminary experiments we have not witnessed major performance drops by adopting this approach, since the vectors obtained by applying clustering on the HOG/HOF descriptors of a video can be considered to be representative of the video descriptors.

3.2. The Hollywood2 Data Set

The Hollywood2 data set ([27]) consists of 1707 videos collected from 69 Hollywood movies. The actions appearing in the data set are: 'answering phone', 'driving car', 'eating', 'fighting', 'getting out of car', 'hand shaking', 'hugging', 'kissing', 'running', 'sitting down', 'sitting up' and 'standing up'. Example video frames are illustrated in Figure 2. The most widely adopted experimental protocol in this data set is based on a split (823 training and 884 test videos) that has been used in ([27]). In order to limit the complexity of the proposed codebook adaptation process, each action video has been represented by $N_k = 1000$ vectors obtained by clustering the HOG/HOF descriptors calculated on the detected STIP video locations.

Figure 1: Example video frames of the KTH data set depicting instances of all the six actions from all the four experimental scenarios.

Figure 2: Example video frames of the Hollywood2 data set depicting instances of all the twelve actions.

3.3. The Ballet Data Set

The Ballet data set ([5]) consists of 44 real video sequences depicting three actors performing eight ballet movements and has been collected from an instructional ballet DVD. The actions appearing in the data set are: 'left-to-right hand opening', 'right-to-left hand opening', 'standing hand opening', 'leg swinging', 'jumping', 'turning', 'hopping' and 'standing still'. Example video frames are illustrated in Figure 3. The Leave-One-Video-Out cross-validation experimental protocol is, typically, used for this data set.

Figure 3: Video frames of the Ballet data set depicting instances of all the eight actions.

3.4. The i3DPost Data Set

The i3DPost data set ([9]) consists of 512 videos depicting eight persons from eight observation angles, each performing eight actions: 'walk', 'run', 'jump in place', 'jump forward', 'bend', 'fall down', 'sit on a chair' and 'wave one hand'. Example video frames depicting a person walking from all the eight cameras used in the data set are illustrated in Figure 4. The Leave-One-Person-Out cross-validation experimental protocol is, typically, used for this data set. In order to fuse the information coming from all available cameras, the action videos depicting the same test action instance from different viewing angles have been classified independently and the obtained action class labels were fused by using a simple majority voting fusion scheme, similar to ([15]).
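A minimal sketch of this fusion scheme, assuming the per-view class labels have already been obtained:

from collections import Counter

def fuse_views(per_view_predictions):
    """Multi-view fusion: classify each camera view independently and
    keep the action label predicted most often (simple majority voting)."""
    votes = Counter(per_view_predictions)
    return votes.most_common(1)[0][0]

# e.g., fuse_views(['walk', 'walk', 'run', 'walk']) returns 'walk'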

3.5. The Facial Expression Data Set

The facial expression data set ([5]) consists of 192 videos depicting two persons expressing six different emotions under two lighting conditions. The expressions appearing in the data set are: 'anger', 'disgust', 'fear', 'joy', 'sadness' and 'surprise'. The persons always start with a neutral expression, show the emotion and return to neutral. Example video frames depicting all the expressions appearing in the data set are illustrated in Figure 5. Four experimental protocols have been used for this data set: 'Same person & lighting' (SpSl), 'Same person, Different lighting' (SpDl), 'Different person, Same lighting' (DpSl) and 'Different person & lighting' (DpDl).


Figure 4: Example video frames of the i3DPost data set depicting a person walking from eightviewing angles.

Figure 5: Video frames of the facial expression data set depicting the emotion apex of all the sixemotions.


3.6. Experimental Results

Tables 1 - 6 illustrate the performance obtained by using the proposed discriminant BoWs-based action video representation on the KTH, Hollywood2, Ballet, i3DPost and the facial expression data sets, respectively. As can be seen in these Tables, the adoption of a supervised learning process on the codebook adaptation enhances performance, when compared to the standard BoWs-based action video representation. Table 1 illustrates the classification rates obtained by using different values of $N_k$ on the KTH data set. As can be seen, a value of $N_k = 500$ provides satisfactory performance. In Tables 2 - 6 we also compare the performance of the proposed action video recognition approach with that of some state-of-the-art methods recently proposed in the literature.

In the KTH data set, the use of the BoWs-based action video representation led to a classification rate equal to 88.89%. By adopting the proposed DBoWs-based action video representation, an increased action classification rate, equal to 92.13%, has been obtained. As can be seen in Table 2, the proposed method outperforms other state-of-the-art methods employing low-level action video representations. In this Table, we also provide the performance of some state-of-the-art methods evaluated on the KTH data set that exploit mid-level and high-level action video representations. It can be seen that the proposed approach, exploiting a low-level action video representation, provides performance comparable with that of the two methods exploiting a mid-level action video representation. In addition, its performance is comparable with that of one method exploiting a high-level action video representation, while it is inferior to that of the remaining three methods exploiting high-level representations. However, the calculation of high-level representations is computationally demanding, compared to the calculation of low-level ones, and thus a comparison between the two approaches in terms of only the obtained action recognition performance is not fair.

Table 1: Performance on KTH for different values of Nk.

N_k      200      300      400      500      600
BoWs     85.19%   86.11%   87.5%    88.89%   88.89%
DBoWs    87.96%   89.35%   90.74%   92.13%   92.13%

Table 2: Classification rates on the KTH data set.

         Representation   Performance
[46]     low-level        82%
[19]     low-level        84.3%
[44]     low-level        87.3%
[18]     low-level        90.57%
[30]     low-level        91.1%
[22]     low-level        91.8%
[7]      mid-level        90.5%
[39]     mid-level        92.4%
[48]     high-level       93.25%
[24]     high-level       93.9%
[8]      high-level       94.5%
[20]     high-level       94.5%
[31]     high-level       98.9%
[16]     high-level       99.54%
BoWs     low-level        88.89%
DBoWs    low-level        92.13%

In the Hollywood2 data set, the use of the BoWs-based action video representation led to a performance equal to 41.5%. By adopting the proposed DBoWs-based action video representation, a performance equal to 45.8% has been obtained. As can be seen in Table 3, the methods evaluated on this data set can be roughly divided based on the employed action video description. On the one hand, methods employing densely sampled descriptors for action video representation, i.e., Cuboids, Dense and Regions, have been shown to outperform the ones employing descriptors calculated on STIPs, i.e., Harris3D and Hessian. On the other hand, it can be seen that the adoption of a higher number of descriptors, each describing a different action property, enhances action classification performance.

As can be seen in Table 3, the best approach in this data set is that of ([34]), which is based on an action video representation exploiting information appearing on the trajectories of densely sampled video frame interest points, described by calculating multiple (five) descriptors, i.e., HOG, HOF, MBHx, MBHy and Traj, along these trajectories. However, this approach has the following two disadvantages: 1) it is very computationally demanding and 2) it requires multiple descriptors (as well as a good descriptor combination scheme) in order to achieve better performance, when compared to action video representations exploiting STIP-based information. It can be seen in Table 3 that in the cases where the Dense Trajectory-based and the Cuboid-based video descriptions exploit HOG/HOF descriptors, their performance is comparable with that of the proposed method exploiting STIP-based visual information. Taking into account that STIP-based video representations are much faster to compute than ones exploiting densely-sampled visual information, we can see that a comparison between the two approaches, in terms of only the obtained performance, is not fair.

In ([34]), the adopted action video representation exploits the BoW model for each of the five descriptor types (i.e., HOG, HOF, MBHx, MBHy and Traj) and combines the obtained five (single-channel) action video representations in the classification process by using a multi-channel kernel Support Vector Machine approach. Thus, the action video representation of ([34]) employs multiple (five) codebooks (each calculated by applying K-Means clustering on the feature vectors calculated for the training videos corresponding to one descriptor type). From this, it can be seen that an extension of the proposed approach for unified codebook adaptation to the case where multiple descriptor types are exploited would probably enhance the performance of such methods. Overall, it can be seen that the proposed DBoWs action representation provides performance comparable with that of other state-of-the-art methods employing action video representations evaluated on STIPs.

Table 3: Classification rates on the Hollywood2 data set.

         Representation                    Performance
[27]     Harris3D+HOG+HOF                  32.4%
[27]     Harris3D+HOG+HOF+SIFT             32.6%
[27]     Harris3D+HOG+HOF+SIFT+Scene       35.5%
[35]     Harris3D+HOG/HOF                  45.2%
[35]     Hessian+HOG/HOF                   46%
[35]     Cuboids+HOG/HOF                   46.2%
[24]     Cuboids+ISA                       53.3%
[34]     Dense+HOG                         41.5%
[34]     Dense+HOF                         50.8%
[35]     Dense+HOG/HOF                     47.4%
[34]     Dense+HOG+HOF+MBH+Traj            58.3%
[2]      Regions+HOG+HOF+OF                41.34%
BoWs     Harris3D+HOG/HOF                  41.5%
DBoWs    Harris3D+HOG/HOF                  45.8%

In the Ballet data set, the use of the BoWs-based action video representation led to an action classification rate equal to 86.3%. The use of the proposed DBoWs-based action video representation increased the action classification rate to 91.1%, which is comparable to the performance of the two competing methods presented in Table 4.

Table 4: Classification rates on the Ballet data set.

[10]      [38]      BoWs      DBoWs
91.1%     91.3%     86.3%     91.1%

In the i3DPost data set, the use of the BoWs-based action representation led to an action classification rate equal to 95.31%, while the adoption of the proposed DBoWs-based action video representation led to an action classification rate equal to 98.44%, equal to that of ([11]). We should note here that the method in ([11]) employs a computationally expensive 4D optical flow-based action video representation and, thus, its operation is slower compared with the proposed one in the test phase.

Table 5: Classification rates on the i3DPost data set.

[15]      [11]      BoWs      DBoWs
94.87%    98.44%    95.31%    98.44%

Finally, in the facial expression data set, it can be seen that the proposed DBoWs-based action representation clearly outperforms the BoWs-based action representation, providing up to a 5% increase in classification performance. The proposed method clearly outperforms the method in ([5]), while it provides classification rates comparable with those in ([10]). However, we should note that the method in ([10]) involves an optimization process during testing and, thus, its operation is slower compared to the BoWs-based action classification approach.

Table 6: Classification rates on the facial expression data set for different experimental protocols.

         SpSl       SpDl       DpSl       DpDl
[5]      97.9%      89.6%      75%        69.8%
[10]     100%       93.7%      91.7%      72.9%
BoWs     98.96%     87.5%      78.65%     73.44%
DBoWs    100%       92.71%     83.33%     77.6%

4. Conclusions

In this paper we proposed a novel framework for human action recognition unifying discriminative codebook generation and discriminant subspace learning. An iterative optimization scheme has been proposed for sequential discriminant BoWs-based action video representation calculation and codebook adaptation based on action class discrimination. The proposed framework is able to, naturally, incorporate several (linear or non-linear) discrimination criteria for discriminant BoWs-based action video representation. Experiments conducted by employing the LDA criterion on five publicly available data sets aiming at different application scenarios demonstrate that the proposed unified approach increases the codebook discriminative ability, providing enhanced performance.

Acknowledgment

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 316564 (IMPART).

[1] Belkin, M., Niyogi, P., 2001. Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems 14, 585–591.

[2] Bilen, H., Namboodiri, V., Van Gool, L., 2011. Action recognition: a region based approach. Applications of Computer Vision.

[3] Roweis, S., Saul, L., 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326.

[4] Csurka, G., Bray, C., Dance, C., Fan, L., 2004. Visual categorization with bags of keypoints. European Conference on Computer Vision.

[5] Dollar, P., Rabaud, V., Cottrell, G., Belongie, S., 2005. Behavior recognition via sparse spatio-temporal features. International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[6] Duda, R., Hart, P., Stork, D., 2000. Pattern Classification, 2nd ed. Wiley-Interscience.


[7] Fathi, A., Mori, G., 2008. Action recognition by learning mid-level motion features. Computer Vision and Pattern Recognition.

[8] Gilbert, A., Illingworth, J., Bowden, R., 2009. Fast realistic multi-action recognition using mined dense spatio-temporal features. Computer Vision and Pattern Recognition.

[9] Gkalelis, N., Kim, H., Hilton, A., Nikolaidis, N., Pitas, I., 2009. The i3DPost multi-view and 3D human action/interaction database. Conference on Visual Media Production, 159–168.

[10] Guha, T., Ward, R., 2011. Learning sparse representations for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 1576–1588.

[11] Holte, M., Chakraborty, B., Gonzalez, J., Moeslund, T., 2012. A local 3D motion descriptor for multi-view human action recognition from 4D spatio-temporal interest points. IEEE Journal of Selected Topics in Signal Processing 6, 553–565.

[12] Huang, G., Zhou, H., Ding, X., Zhang, R., 2012. Extreme learning machine for regression and multiclass classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 42, 513–529.

[13] Huang, Y., Huang, Y., Tan, T., 2011. Salient coding for image classification. Computer Vision and Pattern Recognition.

[14] Huang, Y., Wu, Z., Wang, L., Tan, T., 2014. Feature coding in image classification: A comprehensive study. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 493–506.

[15] Iosifidis, A., Tefas, A., Pitas, I., 2012. View-invariant action recognition based on artificial neural networks. IEEE Transactions on Neural Networks and Learning Systems 23, 412–425.

[16] Iosifidis, A., Tefas, A., Pitas, I., 2014. Regularized extreme learning machine for multi-view semi-supervised action recognition. Neurocomputing, DOI: 10.1016/j.neucom.2014.05.036.

[17] Jia, Y., Nie, F., Zhang, C., 2009. Trace ratio problem revisited. IEEE Transactions on Neural Networks 20, 729–735.


[18] Kaaniche, M., Bremond, F., 2010. Gesture recognition by learning local motion signatures. Computer Vision and Pattern Recognition.

[19] Klaser, A., Marszalek, M., Schmid, C., 2009. A spatio-temporal descriptor based on 3D-gradients. British Machine Vision Conference, 995–1004.

[20] Kovashka, A., Grauman, K., 2010. Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. Computer Vision and Pattern Recognition.

[21] Laptev, I., Lindeberg, T., 2003. Space-time interest points. International Conference on Computer Vision, 432–439.

[22] Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B., 2008. Learning realistic human actions from movies. Computer Vision and Pattern Recognition.

[23] Laptev, I., 2005. On space-time interest points. International Journal of Computer Vision 64, 107–123.

[24] Le, Q., Zou, W., Yeung, S., Ng, A., 2011. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. Computer Vision and Pattern Recognition.

[25] Lian, X., Li, Z., Lu, B., Zhang, L., 2010. Max-margin dictionary learning for multiclass image categorization. European Conference on Computer Vision.

[26] Lu, J., Plataniotis, K., Venetsanopoulos, A., 2003. Face recognition using LDA-based algorithms. IEEE Transactions on Neural Networks 14, 195–200.

[27] Marszalek, M., Laptev, I., Schmid, C., 2009. Actions in context. Computer Vision and Pattern Recognition.

[28] Perronnin, F., Dance, C., 2007. Fisher kernels on visual vocabularies for image categorization. Computer Vision and Pattern Recognition.

[29] Perronnin, F., Dance, C., Csurka, G., Bressan, M., 2005. Adapted vocabularies for generic visual categorization. European Conference on Computer Vision.


[30] Ryoo, M., Aggarwal, J., 2009. Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. International Conference on Computer Vision.

[31] Sadanand, S., Corso, J., 2012. Action bank: A high-level representation of activity in video. Computer Vision and Pattern Recognition.

[32] Schuldt, C., Laptev, I., Caputo, B., 2004. Recognizing human actions: A local SVM approach. International Conference on Pattern Recognition, 32–36.

[33] Tenenbaum, J., Silva, V., Langford, J., 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323.

[34] Wang, H., Klaser, A., Schmid, C., Liu, C., 2011. Action recognition by dense trajectories. Computer Vision and Pattern Recognition.

[35] Wang, H., Ullah, M., Klaser, A., Laptev, I., Schmid, C., 2009. Evaluation of local spatio-temporal features for action recognition. British Machine Vision Conference, 1–11.

[36] Wang, H., Yan, S., Xu, D., Tang, X., Huang, T., 2007. Trace ratio vs. ratio trace for dimensionality reduction. Computer Vision and Pattern Recognition.

[37] Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y., 2010. Locality-constrained linear coding for image classification. Computer Vision and Pattern Recognition.

[38] Wang, Y., Mori, G., 2009. Human action recognition by semilatent topic models. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 1762–1774.

[39] Weinland, D., Ozuysal, M., Fua, P., 2010. Making action recognition robust to occlusions and viewpoint changes. European Conference on Computer Vision Workshops.

[40] Winn, J., Criminisi, A., Minka, T., 2005. Object categorization by learned universal visual dictionary. International Conference on Computer Vision.


[41] Yan, S., Xu, D., Zhang, B., Zhang, H., Yang, Q., Lin, S., 2007. Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 40–51.

[42] Yang, J., Yu, K., Gong, Y., Huang, T., 2009. Linear spatial pyramid matching using sparse coding for image classification. Computer Vision and Pattern Recognition.

[43] Yang, L., Jin, R., Sukthankar, R., Jurie, F., 2008. Unifying discriminative visual codebook generation with classifier training for object category recognition. Computer Vision and Pattern Recognition.

[44] Yang, W., Wang, Y., Mori, G., 2010. Recognizing human actions from still images with latent poses. Computer Vision and Pattern Recognition.

[45] Yeffet, L., Wolf, L., 2009. Local trinary patterns for human action recognition. International Conference on Computer Vision, 492–497.

[46] Yin, J., Meng, Y., 2010. Human activity recognition in video using a hierarchical probabilistic latent model. Computer Vision and Pattern Recognition.

[47] Yu, K., Zhang, T., 2010. Improving local coordinate coding using local tangents. International Conference on Machine Learning.

[48] Zhou, G., Wang, Q., 2012. Atomic action features: A new feature for action recognition. European Conference on Computer Vision Workshops.
