
Machine Vision and Applications
DOI 10.1007/s00138-009-0233-8

SPECIAL ISSUE PAPER

Recognition of human actions using texture descriptors

Vili Kellokumpu · Guoying Zhao · Matti Pietikäinen

Received: 1 August 2008 / Revised: 9 July 2009 / Accepted: 4 November 2009
© Springer-Verlag 2009

Abstract Human motion can be seen as a type of texture pattern. In this paper, we adopt the ideas of spatiotemporal analysis and the use of local features for motion description. Two methods are proposed. The first one uses temporal templates to capture movement dynamics and then uses texture features to characterize the observed movements. We then extend this idea into a spatiotemporal space and describe human movements with dynamic texture features. Following recent trends in computer vision, the method is designed to work with image data rather than silhouettes. The proposed methods are computationally simple and suitable for various applications. We verify the performance of our methods on the popular Weizmann and KTH datasets, achieving high accuracy.

Keywords Action recognition · Local binary pattern · Dynamic textures · Temporal templates · Hidden Markov models

1 Introduction

Human action recognition has become an important research topic in computer vision in recent years. It has gained a lot of attention because of its important application domains like video indexing, surveillance, human computer interaction, sport video analysis, intelligent environments, etc. All these application domains have their own demands, but in general, algorithms must be able to detect and recognize various actions in real time. Also, as people look different and move differently, the designed algorithms must be able to handle variations in performing actions and handle various kinds of environments.

V. Kellokumpu (✉) · G. Zhao · M. Pietikäinen
Machine Vision Group, University of Oulu, P.O. Box 4500, Oulu, Finland
e-mail: [email protected]
URL: http://www.ee.oulu.fi/mvg

G. Zhao
e-mail: [email protected]

M. Pietikäinen
e-mail: [email protected]

Many approaches for human activity recognition have been proposed in the literature [5,17]. Recently, there has been a lot of attention towards analyzing human motions in spatiotemporal space instead of analyzing each frame of the data separately.

The first steps to spatiotemporal analysis were taken by Bobick and Davis [2]. They used motion energy images (MEI) and motion history images (MHI) as temporal templates to recognize aerobics movements. Matching was done using seven Hu moments. A 3D extension of the temporal templates was proposed by Weinland et al. [30]. They used multiple cameras to build motion history volumes and performed action classification using Fourier analysis in cylindrical coordinates. Related 3D approaches have been used by Blank et al. [1] and Yilmaz and Shah [31], who utilized time as the third dimension and built space–time volumes in (x, y, t) space. Space–time volumes were matched using features from Poisson equations and geometric surface properties, respectively. Ke et al. [8] built a cascade of filters based on volumetric features to detect and recognize human actions. Shechtman and Irani [26] used a correlation-based method in 3D, whereas Kobyashi and Otsu [15] used Cubic Higher-order Local Autocorrelation to describe human movements.

Interest point-based methods that have been quite popular in object recognition have also found their way to action recognition. Laptev and Lindeberg [16] extended the Harris detector into space–time interest points and detected local structures that have significant local variation in both space and time. The representation was later applied to human action recognition using a Support Vector Machine (SVM) [24]. Dollár et al. [4] described interest points with cuboids, whereas Niebles and Fei-Fei [18] used a collection of spatial and spatial-temporal features extracted at static and dynamic interest points.

In this paper, we adopt the ideas of spatiotemporal analysis and local features. Local features have been used to characterize textures, and we propose two texture-based methods for human action recognition. The first method uses temporal templates to capture the dynamics of human movement and describes the templates with texture features. The second method takes this approach a step further and uses dynamic texture descriptors extracted from the spatiotemporal space. These methods inherently describe both appearance and motion. Preliminary results of this work were reported in [10,12].

Prior work with temporal templates, motion history volumes, and space–time volumes is based on modeling the action as a whole. The choice of the appropriate action duration parameter is crucial because segmentation errors will lead to disastrous classification. Interestingly, Schindler and van Gool [23] recently pointed out that action recognition can be done robustly using short action snippets of just a few frames. Instead of modeling the actions with one template, we model the actions as a sequence of templates. Furthermore, as the local properties of the templates capture the essential information about human movements, we propose to use texture features for describing the templates. The local binary pattern (LBP) gives a description of local texture patterns and has been successfully used in various applications, ranging from texture classification and segmentation to face recognition, image retrieval, and surface inspection. LBP features are fast to compute, so they have been found to be suitable for real-time applications.

We build on this idea further and extend the use of histograms of local features into a spatiotemporal space. Furthermore, following recent trends in computer vision, we propose a method that is designed to work with image data rather than silhouettes. The method is based on using a dynamic texture descriptor, Local Binary Patterns from Three Orthogonal Planes (LBP-TOP), to represent human movements. The LBP-TOP features have successfully been used for facial expression recognition [32] and visual speech recognition [33]. Niyogi and Adelson [20] proposed the use of xt slices for detecting contours of walking people. We also propose a method for background subtraction that uses the LBP-TOP features (the same features we use for human motion description), making the combined approach computationally simple.

The rest of the paper is organized as follows. Section 2 introduces the texture descriptors. Section 3 describes the static texture and temporal template-based method for motion description. Section 4 describes the application of dynamic textures to background subtraction and action recognition. Section 5 briefly describes the Hidden Markov Models (HMMs) that are used for modeling the texture feature histograms. We show experimental results in Sect. 6 and conclude in Sect. 7.

2 Texture descriptors: LBP and LBP-TOP

The LBP operator [21] describes a local texture pattern with a binary code, which is obtained by thresholding a neighborhood of pixels with the gray value of its center pixel. The original LBP operator represented a 3 × 3 pixel neighborhood with a binary number and is illustrated in Fig. 1a. Different neighborhoods can also be used. In texture analysis the sampling is done on P equally spaced sampling points on a circular neighborhood with radius R. This is illustrated in Fig. 1b. An image texture can be described with a histogram of the LBP binary codes. LBP is invariant to monotonic gray-scale changes and it is computationally very simple, which makes it attractive for many kinds of applications.
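As a concrete illustration of the operator described above, the following NumPy sketch computes an LBP histogram with circular sampling and bilinear interpolation; the function name and the exact sampling convention are our own choices rather than taken from the paper.

```python
import numpy as np

def lbp_histogram(image, P=8, R=1.0):
    """Basic circular LBP: threshold P sampling points on a circle of radius R
    against the center pixel, build a P-bit code, and histogram the codes."""
    img = np.asarray(image, dtype=np.float64)
    h, w = img.shape
    angles = 2.0 * np.pi * np.arange(P) / P
    # (dy, dx) offsets of the sampling points; one common convention.
    offsets = [(-R * np.sin(a), R * np.cos(a)) for a in angles]

    def bilinear(y, x):
        y0, x0 = int(np.floor(y)), int(np.floor(x))
        y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
        fy, fx = y - y0, x - x0
        return ((1 - fy) * (1 - fx) * img[y0, x0] + (1 - fy) * fx * img[y0, x1]
                + fy * (1 - fx) * img[y1, x0] + fy * fx * img[y1, x1])

    hist = np.zeros(2 ** P)
    r = int(np.ceil(R))
    for y in range(r, h - r):
        for x in range(r, w - r):
            gc = img[y, x]
            code = 0
            for p, (dy, dx) in enumerate(offsets):
                code |= int(bilinear(y + dy, x + dx) >= gc) << p  # s(gp - gc) 2^p
            hist[code] += 1
    return hist / max(hist.sum(), 1.0)  # normalized LBP histogram of the image
```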

The LBP operator was extended to a dynamic texture operator by Zhao and Pietikäinen [32], who proposed to form their dynamic LBP description from three orthogonal planes (LBP-TOP) of a space–time volume. Figure 2 shows the spatiotemporal volume of a person walking from left to right. It also illustrates the resulting xt and yt planes from a single row and column of the volume, as well as the first and last xy planes, which are the frames themselves. The LBP-TOP description is formed by calculating the LBP features from the planes and concatenating the histograms. Intuitively, it can be thought that the xt and yt planes encode the vertical and horizontal motion patterns, respectively.

Fig. 1 a Illustration of the basic LBP operator; in the example the 3 × 3 neighborhood is thresholded against its center value, giving the binary code (10100011)₂ = 163. b Circular neighborhood with eight sampling points and a radius of two. If a sampling point is not in the center of a pixel, the value at that point is bilinearly interpolated from the nearest pixels

Fig. 2 Illustration of a person walking and the corresponding xt and yt planes from a single row and column. The frames correspond to the xy planes

The original LBP operator was based on a circular sampling pattern, but different radii and neighborhoods can also be used. Zhao and Pietikäinen proposed to use elliptic sampling for the xt and yt planes:

\mathrm{LBP}(x_c, y_c, t_c) = \sum_{p=0}^{P_{\mathrm{plane}}-1} s(g_p - g_c)\, 2^p,
\qquad
s(x) =
\begin{cases}
1, & x \ge 0 \\
0, & x < 0
\end{cases}
\qquad (1)

where g_c is the gray value of the center pixel (x_c, y_c, t_c) and g_p are the gray values at the P_plane sampling points; P_plane can be different on each plane. The gray values g_p are taken from the sampling points (x_c − R_x sin(2πp/P_xt), y_c, t_c − R_t cos(2πp/P_xt)) on the xt plane and similarly (x_c, y_c − R_y sin(2πp/P_yt), t_c − R_t cos(2πp/P_yt)) on the yt plane, where R_d is the radius of the ellipse in the direction of axis d (x, y or t). As the xy plane encodes only the appearance, i.e., both axes have the same meaning, circular sampling is suitable there. The values g_p for points that do not fall on pixels are estimated using bilinear interpolation. The length of the feature histogram for LBP-TOP is 2^P_xy + 2^P_xt + 2^P_yt when all three planes are considered.
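As an illustration of the temporal-plane sampling just described, the sketch below computes the xt- and yt-plane LBP-TOP histograms for a gray-level volume; it uses nearest-neighbor rounding instead of bilinear interpolation to stay short, and the function name and defaults (taken from the parameter values quoted in Sect. 6) are our own assumptions.

```python
import numpy as np

def lbp_top_xt_yt(volume, P_xt=8, P_yt=8, Rx=1, Ry=2, Rt=1):
    """LBP-TOP features restricted to the temporal planes (xt and yt), with the
    elliptic sampling of Eq. (1). `volume` is a (T, H, W) gray-level array."""
    vol = np.asarray(volume, dtype=np.float64)
    T, H, W = vol.shape
    hist_xt = np.zeros(2 ** P_xt)
    hist_yt = np.zeros(2 ** P_yt)

    def plane_code(samples, gc):
        code = 0
        for p, gp in enumerate(samples):
            code |= int(gp >= gc) << p
        return code

    for t in range(Rt, T - Rt):
        for y in range(Ry, H - Ry):
            for x in range(Rx, W - Rx):
                gc = vol[t, y, x]
                # xt plane: vary x and t, keep y fixed
                xt = [vol[int(round(t - Rt * np.cos(2 * np.pi * p / P_xt))),
                          y,
                          int(round(x - Rx * np.sin(2 * np.pi * p / P_xt)))]
                      for p in range(P_xt)]
                # yt plane: vary y and t, keep x fixed
                yt = [vol[int(round(t - Rt * np.cos(2 * np.pi * p / P_yt))),
                          int(round(y - Ry * np.sin(2 * np.pi * p / P_yt))),
                          x]
                      for p in range(P_yt)]
                hist_xt[plane_code(xt, gc)] += 1
                hist_yt[plane_code(yt, gc)] += 1
    # Concatenated temporal-plane description of the volume (xy plane omitted)
    return np.concatenate([hist_xt, hist_yt])
```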

In the following sections we propose two methods for human motion description. The first one uses the basic LBP texture descriptor, while the other utilizes the LBP-TOP features. With the LBP-TOP features we only consider the usage of the temporal planes, namely the xt and yt planes. The reason for this is the variability in the appearance of humans and of different environments. The xy plane contains a lot of useful appearance information, but it should be noted that the temporal planes also encode some of the low-level appearance information.

3 Static texture-based description of movements

Temporal templates, MEI and MHI, were introduced to describe motion information from images [2]. MHI is a grayscale image that describes how the motion has occurred by showing more recent movements with brighter values. MEI, on the other hand, is a binary image that describes where the motion has occurred.

Our static texture-based method uses the temporal templates as a preprocessing stage to build a representation of motion. Motion is then characterized with a texture-based description by extracting LBP histograms from the templates.

Silhouettes are used as input for the system. The MHI can be calculated from the silhouette representation

\mathrm{MHI}_\tau(x, y, t) =
\begin{cases}
\tau, & \text{if } D(x, y, t) = 1 \\
\max\bigl(0,\ \mathrm{MHI}_\tau(x, y, t-1) - 1\bigr), & \text{otherwise}
\end{cases}
\qquad (2)

where D is the absolute value of the silhouette difference between frames t and t − 1, i.e., |S(t) − S(t − 1)|. The MEI, on the other hand, can be calculated directly from the silhouettes S:

\mathrm{MEI}_\tau(x, y, t) = \bigcup_{i=0}^{\tau} S(x, y, t - i)
\qquad (3)

Silhouettes describe the moving object in the scene, but as such, they do not describe any motion. Therefore, the difference of silhouettes is used as our representation when constructing the MHI. Furthermore, this representation allows us to use online weighting on different subareas of the MHI, as described later in this section. Silhouettes are used for the MEI calculation to obtain a better overall description of the human pose. Figure 3 illustrates the templates.
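A minimal sketch of the two template updates defined by Eqs. (2) and (3), assuming binary 0/1 silhouette arrays (the helper names are ours):

```python
import numpy as np

def update_mhi(mhi, silhouette, prev_silhouette, tau):
    """One MHI update step, Eq. (2): D is the absolute silhouette difference
    |S(t) - S(t-1)|; pixels with motion are set to tau, others decay by one."""
    D = np.abs(silhouette.astype(np.int16) - prev_silhouette.astype(np.int16))
    return np.where(D == 1, tau, np.maximum(mhi - 1, 0))

def compute_mei(silhouettes, t, tau):
    """MEI, Eq. (3): union of the silhouettes S(t - i) for i = 0..tau."""
    stack = np.stack([silhouettes[t - i] for i in range(tau + 1)])
    return (stack.max(axis=0) > 0).astype(np.uint8)
```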

When MHI and MEI are used to represent an action as a whole, setting the duration parameter τ is critical. This is not always easy, as the duration of different actions, as well as of different instances of the same action, can vary a lot. The problem with the temporal template representation is that actions that occupy the same space at different times cannot be modeled properly, as the observed features will overlap and new observations will erase old ones. We propose to solve this problem by fixing τ to give a short-term motion representation and modeling the actions as a sequence of templates. The sequential modeling is then done with HMMs.

LBP histograms are used to characterize both MHI and MEI. This gives us a new texture-based descriptor that describes human movements on two levels. The LBP codes from the MHI encode information about the direction of motion, whereas the MEI-based LBP codes describe the combination of overall pose and shape of motion.

Fig. 3 Illustration of the MHI (left) and MEI (right) in a case where a person is raising both hands

It should be noted that some areas of the MHI and MEI contain more information than others when texture is considered. The MHI represents motion in gray-level changes, which means that the outer edges of the MHI may be misleading. In these areas there is no useful motion information, and so the non-moving pixels having zero value should not be included in the calculation of the LBP codes. Therefore, the calculation of LBP features is restricted to the non-monotonous area within the MHI template. For the MEI the case is just the opposite; the only meaningful information is obtained around the boundaries.

The LBP histogram of an image only contains information about the statistics of local spatial structures and therefore loses the information about the global structure of motion. To preserve the rough structure of motion, the MHI and MEI are divided into subregions and each subregion is described with an LBP histogram. In our approach the division is done through the centroid of the silhouette into four regions. This division roughly separates the limbs in most cases and preserves the essential information about the movements. The division may not be optimal but works well in practice. The resolution of the description could be increased by choosing a finer division scheme, though this also increases the number of subregion histograms.

Furthermore, for many common movements some of the MHI subregions provide more information than the others, as they contain most of the motion. Therefore, we perform spatial enhancement to focus on the more important areas of the MHI by assigning different weights to different subregions. Online weighting is used instead of prior weights, and the weights are assigned based on the relative amount of motion a subregion contains: the weights are given as the ratio of the area of nonzero pixels that the MHI subregion contains to the area of nonzero pixels in the whole image. An example of how the weights are assigned is illustrated in Fig. 4. As the MEI describes the pose of the person and all parts have similar importance, all subregion histograms for the MEI are given equal weights.
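Putting the last few paragraphs together, a sketch of the per-frame descriptor under stated assumptions: `lbp_hist(img, mask)` is a hypothetical helper returning an LBP histogram computed only over the masked pixels, and `centroid` is the silhouette centroid used for the four-way split.

```python
import numpy as np

def frame_descriptor(mhi, mei, centroid, lbp_hist):
    """Weighted, concatenated subregion LBP histograms of the MHI and MEI."""
    cy, cx = centroid
    h, w = mhi.shape
    regions = [(slice(0, cy), slice(0, cx)), (slice(0, cy), slice(cx, w)),
               (slice(cy, h), slice(0, cx)), (slice(cy, h), slice(cx, w))]

    parts = []
    total_motion = np.count_nonzero(mhi) or 1
    for rows, cols in regions:
        sub_mhi, sub_mei = mhi[rows, cols], mei[rows, cols]
        # MHI: only the non-zero (moving) area carries motion information, and
        # the subregion is weighted by its share of the moving pixels.
        w_mhi = np.count_nonzero(sub_mhi) / total_motion
        parts.append(w_mhi * lbp_hist(sub_mhi, mask=sub_mhi > 0))
        # MEI: all subregions are given equal weight.
        parts.append(lbp_hist(sub_mei, mask=np.ones_like(sub_mei, dtype=bool)))

    hist = np.concatenate(parts)
    return hist / max(hist.sum(), 1e-12)  # normalize: histogram sums to one
```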

Finally, to form a description of a frame, all the MHI- and MEI-based LBP subregion histograms are concatenated into one global histogram and normalized so that the sum of the histogram equals one. Figure 4 illustrates the MHI and MEI, their division into subregions, and the formation of the LBP histograms. The sequential development of the features is modeled with HMMs, which are introduced in Sect. 5.

Fig. 4 Illustration of the formation of the feature histogram. The person is raising both hands in this example, which means that the top two subregions of the MHI contain most of the motion and are given bigger weights

4 Dynamic texture method for motion description

In this section we introduce a novel approach for human action description. Instead of using a silhouette-dependent method like MHI to incorporate time into the description, the dynamic texture features capture the dynamics straight from image data. In the following subsections we show how the LBP-TOP features can be used to perform background subtraction to locate a bounding volume of the human in xyt space, and how the same features can be used for describing human movements.

4.1 Human detection with background subtraction

Background subtraction is used as the first stage of processing in many approaches to human action recognition. Therefore, it can have a huge effect on the performance of such a system if the location and shape of the person need to be accurately determined. Also, from an application point of view, the high computational cost and memory demands of many background subtraction methods may limit their use in systems requiring processing at video rate. We tackle the problem of computation cost by using the same features for both human detection and action recognition. Furthermore, our dynamic texture approach is designed so that an accurate silhouette is not needed and a bounding box is sufficient.

Usually, background subtraction is done by modeling the pixel colors and intensities [13,27]. A different kind of approach was presented by Heikkilä and Pietikäinen [6], who introduced a region-based method that uses LBP features from a local neighborhood. Unlike their work, we do not consider the image plane itself, but instead the temporal planes xt and yt. In this work we want to show the descriptive power of the dynamic texture features and therefore focus only on the temporal planes. We do not claim that the temporal planes alone are better than xy for background subtraction; instead, we want to show that they are suitable for the task. The temporal planes are quite interesting from a background subtraction point of view as they can encode both motion and some low-level structural information.

In our method, the background model consists of a codebook C for every local neighborhood. When the pixel values of a neighborhood are close to one another, the thresholding operation in the LBP feature extraction can be vulnerable to noise. Heikkilä and Pietikäinen proposed to add a bias to the center pixel, i.e., the term s(g_p − g_c) in Eq. (1) is replaced with the term s(g_p − g_c + a). We also adopt their idea and propose to use neighborhood-specific bias values. Thus, the background model consists of the codebook C and the bias a for each pixel on both temporal planes.

Overlapping volumes of duration Δt = 2R_t + 1 are used as input to the method, i.e., for the current frame there is a buffer of R_t frames before and after the frame from which the spatiotemporal features are calculated. A pixel in the current frame is determined to belong to an object if the observed LBP code of a pixel neighborhood of the input volume does not match the codes in the corresponding codebook. The results from the xt and yt planes can be combined using the logical and operator. With this method, we can extract the bounding volume (the 3D equivalent of a bounding box in 2D) of a human in each space–time volume and then use the volumes for action recognition as described in the next subsection.
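A sketch of the foreground test described above, under the assumption that per-pixel codebooks have already been learned and that the LBP codes of the current frame have been computed on both temporal planes (all names here are ours):

```python
import numpy as np

def foreground_mask(codes_xt, codes_yt, codebook_xt, codebook_yt):
    """Pixel-wise foreground test. `codes_xt[y, x]` and `codes_yt[y, x]` are the
    LBP codes observed on the two temporal planes for the current frame, and
    `codebook_xt[y][x]` / `codebook_yt[y][x]` are the sets of background codes
    learned for that pixel. A pixel is foreground only if its code is unknown on
    the xt plane AND on the yt plane (the logical and combination)."""
    h, w = codes_xt.shape
    fg = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            fg[y, x] = (codes_xt[y, x] not in codebook_xt[y][x] and
                        codes_yt[y, x] not in codebook_yt[y][x])
    return fg

def bounding_box(fg):
    """Bounding box of the detected foreground; stacking these boxes over the
    Δt frames of a volume gives the bounding volume used for recognition."""
    ys, xs = np.nonzero(fg)
    if ys.size == 0:
        return None
    return ys.min(), ys.max(), xs.min(), xs.max()
```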

For the current application, the background model is not adaptive, and we must learn the model by observing an empty scene. We present preliminary experimental results in Sect. 6. The proposed method can be extended to be adaptive to changes in the background; that topic is under investigation but outside the scope of this paper.

4.2 Action description

The dynamic LBP-TOP features introduced in Sect. 2 are used for human action description: the input for action recognition is the set of LBP-TOP features extracted from the detected human xyt volumes. Notice that when the proposed background subtraction is used, the LBP-TOP features have already been computed, and the action description can be obtained very efficiently. However, a couple of points need addressing to enhance the performance of the features.

As we do not use silhouette data but images from a bounding volume that contains the human, our input also contains some background information as well as the static human body parts. These regions do not contain any useful motion information and only describe the structure of the scene and the clothing of the person.

Considering the images illustrated in Fig. 5, static parts of the images produce stripe-like patterns on the xt and yt planes (for comparison, see the motion patterns in Fig. 2). We observe that certain LBP codes represent these stripes. On the xt and yt planes, symmetrical LBP kernels (kernels with an even number of sampling points) have sampling point pairs that lie on the same spatial location but at different times (there are three such pairs in the eight-point kernel, as can be seen in Fig. 5). The common property of these codes is that both bits of each pair are the same. If the bits of a pair differ for an observed code, in the ideal case (no noise) this must be because of motion in the scene. As we wish to obtain a motion description, the stripe pattern codes (SPC(P)) that contain no motion information are removed from the histogram. The stripe patterns are always the same for a given LBP kernel and only their relative appearance frequency depends on the scene structure. Cutting off these bins reduces the histogram length for an eight-point neighborhood to 240 bins instead of 256, but more importantly, it also improves the motion description.

Fig. 5 Illustration of LBP patterns that represent no motion. The two images on the top illustrate a static scene containing no motion and a resulting xt plane. Note the resulting stripe-like pattern on the xt plane. The binary patterns illustrated below are the codes that describe the stripe-like patterns that do not contain any motion. Consider nearest neighbor interpolation for the simplicity of the illustration

Fig. 6 Illustration of the formation of the feature histogram from a bounding volume

Similarly to the static LBP representation, dynamic LBP histograms calculated over the whole bounding volume encode only the local properties of the movements without any information about their spatial or temporal locations. For this reason we divide the detected bounding volume through its center point into four regions and form a global feature histogram by concatenating the subvolume histograms. Using the subvolume representation we encode the motion on three different levels: pixel level (single bins in a subvolume histogram), region level (a whole subvolume histogram), and global level (the concatenated subvolume histograms).

The subvolume division and the formation of our feature histogram are illustrated in Fig. 6. All the subvolume histograms are concatenated and the resulting histogram is normalized by setting its sum equal to one. This is our representation of a bounding volume. The final histogram length in our representation is nroSubVolumes × (2^P_xt − SPC(P_xt) + 2^P_yt − SPC(P_yt)). For example, P_xt = P_yt = 8 with four subvolumes gives 4 × (240 + 240) = 1920 features. Finally, the sequential development of the features is modeled using HMMs.
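The following sketch ties the pieces together for one bounding volume, reusing the earlier lbp_top_xt_yt sketch and assuming the stripe-pattern code lists SPC(P) for the chosen kernel are available as precomputed index arrays (both assumptions are ours, not the paper's):

```python
import numpy as np

def volume_descriptor(volume, lbp_top_xt_yt, stripe_codes_xt, stripe_codes_yt,
                      P_xt=8, P_yt=8):
    """Bounding-volume feature histogram: split the volume through its center
    into four subvolumes, compute xt/yt LBP-TOP histograms for each, drop the
    stripe-pattern bins, concatenate, and normalize."""
    T, H, W = volume.shape
    cy, cx = H // 2, W // 2
    subvolumes = [volume[:, :cy, :cx], volume[:, :cy, cx:],
                  volume[:, cy:, :cx], volume[:, cy:, cx:]]

    keep_xt = np.setdiff1d(np.arange(2 ** P_xt), stripe_codes_xt)
    keep_yt = np.setdiff1d(np.arange(2 ** P_yt), stripe_codes_yt)

    parts = []
    for sub in subvolumes:
        hist = lbp_top_xt_yt(sub, P_xt=P_xt, P_yt=P_yt)
        hist_xt, hist_yt = hist[:2 ** P_xt], hist[2 ** P_xt:]
        parts.append(hist_xt[keep_xt])    # 2^P_xt - SPC(P_xt) bins
        parts.append(hist_yt[keep_yt])    # 2^P_yt - SPC(P_yt) bins

    hist = np.concatenate(parts)          # e.g. 4 x (240 + 240) = 1920 bins
    return hist / max(hist.sum(), 1e-12)  # normalize: histogram sums to one
```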

5 Modeling temporal information with hidden Markov models

As described earlier, we describe short-term human movements using histograms of local features extracted from a space–time representation. We then model the temporal development of our histograms with HMMs. Our models are briefly described next (see the tutorial [22] for more details on HMMs). In our approach, an HMM that has N states Q = {q_1, q_2, ..., q_N} is defined by the triplet λ = (A, π, H), where A is the N × N state transition matrix, π is the initial state distribution vector, and H is the set of output histograms.

The probability of observing an LBP histogram h_obs is the texture similarity between the observation and the model histograms. Histogram intersection was chosen as the similarity measure as it satisfies the probabilistic constraints. Furthermore, we introduce a penalty factor n that allows us to adjust the tightness of our models. Thus, the probability of observing h_obs in state i at time t is given as

P(h_{\mathrm{obs}} \mid s_t = q_i) = \Bigl[\sum \min(h_{\mathrm{obs}}, h_i)\Bigr]^{n},
\qquad (4)

where s_t is the state at time step t, and h_i is the observation histogram in state i. The summation is done over the histogram bins. It can be seen that histograms with a larger deviation from the model are penalized more with a larger n.

Figure 7 illustrates how the features are calculated as time evolves, together with a simple circular HMM. Different model topologies can be used: circular models are suitable for modeling repetitive movements like walking and running, whereas left-to-right models are suitable for movements like bending, for example.

HMMs can be used for activity classification by training an HMM for each action class. A new, unknown observed feature sequence H_obs = {h_obs,1, h_obs,2, ..., h_obs,T} is classified as belonging to the class of the model that maximizes P(H_obs | λ), i.e., the probability of observing H_obs from the model λ.
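The paper does not spell out the inference procedure; the sketch below evaluates P(H_obs | λ) with the standard forward algorithm, plugging in Eq. (4) as the emission probability (n = 7 as used in Sect. 6), and classifies by picking the best-scoring model. All names are ours.

```python
import numpy as np

def observation_prob(h_obs, h_state, n=7):
    """Eq. (4): histogram intersection between the observed and the state
    histogram, raised to the penalty factor n."""
    return np.sum(np.minimum(h_obs, h_state)) ** n

def sequence_likelihood(H_obs, A, pi, H_states, n=7):
    """P(H_obs | lambda) via the forward algorithm, with Eq. (4) as emission.
    A is the N x N transition matrix, pi the initial state distribution and
    H_states the per-state output histograms. No scaling is applied, so very
    long sequences may need a log-domain version in practice."""
    N = len(H_states)
    alpha = pi * np.array([observation_prob(H_obs[0], H_states[i], n) for i in range(N)])
    for h in H_obs[1:]:
        emit = np.array([observation_prob(h, H_states[i], n) for i in range(N)])
        alpha = (alpha @ A) * emit
    return alpha.sum()

def classify(H_obs, models, n=7):
    """Pick the action class whose HMM maximizes P(H_obs | lambda).
    `models` maps a class label to its (A, pi, H_states) triplet."""
    return max(models, key=lambda c: sequence_likelihood(H_obs, *models[c], n=n))
```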

6 Experiments

We demonstrate the performance of our method by experimenting with the Weizmann [1] and KTH [24] datasets. The datasets have become popular benchmark databases [1,3,7,9,14,18,19,25,28,29], so we can directly compare our results to others reported in the literature. We also perform an online recognition experiment using the data from Kellokumpu et al. [11]. In the following subsections, we show experimental results on human detection with background subtraction, feature analysis, human action classification, and online action recognition.

Fig. 7 Illustration of the temporal development of the feature histograms and a simple three-state circular HMM

6.1 Human detection with background subtraction

To get our background model, we need to learn the codebook and bias for each pixel on the two spatiotemporal planes. We train our model with frames where there is no subject in the space–time volumes.

First, Gaussian filtering is applied to the images as preprocessing. Then, the codebook C is calculated for each pixel with all bias values between a_min and a_max. The codebook (and the corresponding a) with the smallest number of codes is chosen to model the local neighborhood around each pixel. If the codebook size is the same with multiple bias values, we choose the one with the smallest bias. It would be desirable to have a small bias and a small codebook to make the method sensitive to changes in the view. However, the codebook size decreases as the bias increases and increases when the bias decreases. Our training method attempts to find the saturation of the codebook size with the smallest bias. In our experiments we used a_min = 3, a_max = 8, and eight-point neighborhoods with radii R_x = 1, R_y = 2, and R_t = 1, which means Δt = 3. Figure 8 gives examples of the performance.
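One possible reading of this per-pixel training rule is sketched below; `background_codes(a)` is a hypothetical helper returning the LBP codes observed at the pixel over the empty training frames when bias a is used in the thresholding.

```python
def train_pixel_model(background_codes, a_min=3, a_max=8):
    """Choose the per-pixel bias and codebook: try every bias in [a_min, a_max],
    keep the smallest codebook, and break ties in favor of the smallest bias."""
    best = None
    for a in range(a_min, a_max + 1):
        codebook = set(background_codes(a))
        # Strictly smaller codebooks win, so for equal sizes the smallest bias
        # (tried first) is kept.
        if best is None or len(codebook) < len(best[1]):
            best = (a, codebook)
    bias, codebook = best
    return bias, codebook
```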

It should be noted that the learning method is preliminary and not optimal. As mentioned earlier, further development of this part of the method is outside the scope of this paper and under current work. Nevertheless, we can see that the spatiotemporal features are suitable for the task even with this rather simple background modeling.

6.2 Feature analysis and classification

The Weizmann dataset consists of ten different activities performed by nine different persons. Figure 9 illustrates the activities.

First, we want to make sure that our features capture the characteristics of movements without a sophisticated modeling method, and that the histogram intersection we chose as the similarity measure can be used to compare instances of movements. We calculated the static and dynamic texture-based features for the dataset; for the static texture method we used the silhouettes given in the database. The duration of the actions in the database varies between 28 and 146 frames, so for each movement we summed the histograms over the duration of the activity and normalized the histogram sum to one. This representation of the movements does not contain any sequential information, and the only temporal information is the short-term dynamics defined by τ and R_t.

Fig. 8 Illustration of the human detection performance. The last image on the right illustrates the bias for yt plane feature calculation for the scene viewed next to it. The binary result images illustrate the center pixels of the input volume that have been determined not to belong to the background, and the detected bounding volume

Fig. 9 Illustration of the movement classes in the Weizmann database. Starting from the upper left corner the movements are: Bending, Jumping jack, Jumping, Jumping in place ('Pjump'), Gallop sideways ('Side'), Running, Skipping, Walking, Wave one hand ('Wave1'), and Wave two hands ('Wave2')

The result of the feature analysis is illustrated in Fig. 10a and b, where each row and column represents the similarity (histogram intersection) of one sample to all other samples. Darker colors indicate larger similarity in the illustration. As mentioned, histogram intersection was used as the similarity measure and the penalty factor n was set to one. The parameters used for the illustration are τ = 4, P = 4, and R = 1 for static texture and R_t = 1, R_x = 1, R_y = 2, and P_xt = P_yt = 8 for the dynamic texture-based method. It can clearly be seen that even without the sequential information the features form clusters and the different movement classes are somewhat separable.

We then performed the action classification experiments on the Weizmann dataset using HMM modeling. The HMM topologies were set to circular for all models, and we achieved a classification accuracy of 98.9% using both the static and the dynamic texture-based methods. The parameters are the same as used above. We used HMMs with seven states and set n = 7, though both of these values can be changed without too much of an effect on the performance. Figure 10c and d shows the confusion matrices of the HMM classification.

It should be noted that the methods use different input data: the static texture-based method uses silhouettes, whereas the dynamic texture method uses image data. Therefore, we also ran the experiment with the dynamic texture-based method, but with silhouettes as input. Using the same parameters as above, a 96.6% classification accuracy was achieved. This result is slightly lower than the result achieved with the static texture method. It is interesting that the dynamic texture-based method works better when images are used as input. We believe this is because the texture statistics extracted from image data are more reliable than the statistics taken from silhouettes, where we get interesting features only around the edges.

We also tried SVM (RBF kernel) classification with the histograms that were used earlier for the feature analysis. Classification accuracies of 95.6 and 97.8% were achieved using dynamic (both on image and silhouette data) and static texture, respectively. The confusion matrices are shown in Fig. 10e and f. Interestingly, there is not much difference in the results between SVM and HMM modeling, even though the input of the SVM does not contain the sequential information. This shows that our histograms of local features capture much of the movement dynamics. We also tested our methods with different values of τ and R_t and found that the methods are robust over a range of values. In general, we found that setting the parameters so that the dynamics are captured from the duration of a few frames usually works well in practice.

Results achieved by others on this database are summarized in Table 1. Of the image-based approaches, Boiman and Irani [3] report the best overall recognition result, but their test set does not include the Skipping class. It is easy to see from the confusion matrix in Fig. 10 that this extra class causes the only mistake we make in the test set when using HMM and the image-based dynamic texture features. We also ran the test without the skipping class, and in this case we were able to classify all of the movements correctly. To our knowledge, our methods give the best results on the database when image data are used as input and are also very competitive against approaches that are based on silhouette data. As can be seen in Table 1, in general, silhouette-based methods work better, but new and effective image-based methods are emerging. As the results show, image-based approaches have the potential to outperform the silhouette-based methods in the future.

The KTH dataset consists of 25 people performing six actions in four different conditions. The actions in the database are Boxing, Handwaving, Handclapping, Jogging, Running, and Walking. Each person performs these actions under four different conditions: outdoors, outdoors with scale variations, outdoors with clothing variation, and indoors. Figure 11 illustrates the variance in the data.

The videos in the database are recorded with a hand-held camera and contain zooming effects, so using simple background subtraction methods does not work. Instead, we manually marked the human bounding box in a few frames for every movement and interpolated the bounding box for the other frames. For the Jogging, Running, and Walking movements, where the person is not always in the field of view, the bounding boxes were marked so that at the beginning and end of the movements the person was still partially out of view. It should be noted that the extraction of bounding boxes could possibly also be done automatically using a tracker or human detector. Since it is not possible to get silhouettes, we used frame differencing to build the MHI description for the static texture-based method.

We tried to use HMM modeling, but did not get satisfactory results. This is due to the huge variation in the data, which our simple HMMs with the intersection similarity measure cannot capture. Instead, we applied nearest neighbor and SVM classification.


Fig. 10 Results of the experiments on the Weizmann dataset. a Feature analysis of static texture. b Feature analysis of dynamic texture. c Confusion matrix for static texture with HMM modeling. d Confusion matrix for dynamic texture with HMM modeling. e Confusion matrix for static texture with SVM modeling. f Confusion matrix for dynamic texture with SVM modeling. In a and b darker colors indicate larger similarity (histogram intersection)

We performed the experiment using the leave-one-out method on the database. We used the center point of the bounding box for the spatial division into subregions (subvolumes for the dynamic texture). We then extracted the static texture (R = 3, P = 8, τ = 4) and dynamic LBP-TOP (R_t = 3, R_x = 3, R_y = 3, P_xt = P_yt = 8 and a constant bias a = 5) features from the whole duration of the action (which varies from 13 to 362 frames), concatenated the histograms, and normalized the result over the duration of the action. A larger radius was chosen in both space and time to better capture the movement dynamics, as these classification methods do not consider sequential information.

Using nearest neighbor classification, we achieved accuracies of 85.9 and 89.8% for the static and dynamic texture methods, respectively. Similarly, using an SVM with an RBF kernel we achieved accuracies of 90.8 and 93.8%. Only the MHI-based features were used for the static texture-based method; in the absence of silhouettes, the MEI-based features lowered the results considerably. We notice that although both of the proposed methods performed equally well on the easier Weizmann dataset, the more advanced dynamic texture-based method outperforms the more naïve static texture-based method in this more challenging experiment. Figure 12 shows the classification results. It can be seen that most of the misclassifications are between jogging and running. This confusion is quite natural, as the boundary between the two movements depends on the subject performing the action, and some of the examples in the database would surely be misclassified even by a human observer.

Table 1 Results reported in the literature for the Weizmann database

Reference                          Input        Act     Seq      Result (%)
Dynamic texture method (HMM)       Image data   10 (9)  90 (81)  98.9 (100)
Dynamic texture method (SVM)       Image data   10      90       95.6
Scovanner et al. 2007 [25]         Image data   10      92       82.6
Boiman and Irani 2006 [3]          Image data   9       81       97.5
Niebles et al. 2007 [18]           Image data   9       83       72.8
Static texture method (HMM)        Silhouettes  10 (9)  90 (81)  98.9 (98.7)
Static texture method (SVM)        Silhouettes  10      90       97.8
Dynamic texture method (HMM)       Silhouettes  10      90       96.6
Dynamic texture method (SVM)       Silhouettes  10      90       95.6
Wang and Suter 2007 [28]           Silhouettes  10      90       97.8
Ikizler and Duygulu 2007 [7]       Silhouettes  9       81       100.0

The columns represent the reference, input data type, number of activity classes (Act), number of sequences (Seq), and finally the classification result.

Fig. 11 Illustration of the six action classes in the KTH dataset and the different capturing conditions s1–s4

Fig. 12 Confusion matrices for the KTH dataset: a dynamic texture-based method, b static texture-based method

Table 2 Comparison of results achieved on the KTH dataset

Reference                       Result (%)
Kim et al. (2007) [14]          95.3
Dynamic texture method          93.8
Wong et al. (2007) [29]         91.6
Static texture method (MHI)     90.8
Niebles et al. (2008) [19]      81.5
Ke et al. (2007) [9]            80.9

The recent results reported in the literature [9,14,19,29] for this database vary between 80 and 95%. Table 2 summarizes the results achieved by others with a similar leave-one-out test setup. It can be seen that our result is among the best.

6.3 Online action recognition

The action classification experiments shown above consist of the task of labeling an unknown, temporally segmented (beginning and end known) sequence with the most probable class. In an online action recognition experiment one must also consider the possibility of not observing any action. Therefore, a method must be able to perform reliable segmentation and recognition simultaneously.

To further illustrate the discriminativity of our features and the usability of the HMM modeling, we performed the online action recognition experiment used by Kellokumpu et al. [11]. The goal in this test is to recognize human movements from continuous video data, and the decision must be made online in each frame.

The data consist of video sequences of five persons performing 15 different actions. The actions in the database are as follows: Raising one hand, Waving one hand, Lowering one hand, Raising both hands, Waving both hands, Lowering both hands, Bending down, Getting up, Raising foot, Lowering foot, Sitting down, Standing up, Squatting, Up from squat, Jumping jack. Each person performed the activities in a different order without intentional pauses between actions. The actions are illustrated using their MHIs in Fig. 13.

The data are in silhouette format, so we used the MHI- and MEI-based texture features in this experiment. We adopted the same windowing approach for temporal segmentation that was used by Kellokumpu et al. This resembles the exhaustive search used by Bobick and Davis [2]. We consider a detection to be correct if an action is correctly recognized and the beginning and end of the action are determined to be within a couple of frames of the ground truth data (even human observers would segment the movements slightly differently). Furthermore, we can also get more than one detection of one movement in adjacent frames, but in these cases the action can be determined to be the same by comparing the beginning and end segmentations. These cases are thus treated as one detection.

Kellokumpu et al. used an SVM to classify the human pose in each frame and modeled the actions as sequences of postures. Their system was invariant to the handedness of performing actions. As the database contains movements such as Raise one hand performed with both hands, we mirrored the training data for such movements and trained two models. Thus, instead of trying to recognize 15 actions, we try to recognize 24.

Fig. 13 Illustration of the action classes in the online recognition database. Starting from the top left the activities are Raising one hand, Waving one hand, Lowering one hand, Raising both hands, Waving both hands, Lowering both hands, Bending down, Getting up, Raising foot, Lowering foot, Sitting down, Standing up, Squatting, Up from squat, Jumping jack. Note that the MHI from the whole duration of the activity is shown to clarify the movements

Table 3 Confusion matrix of the online recognition experiment. The rows represent the detections and the columns represent the ground truth for the detection

The original training data are corrupt and not available, so we used the leave-one-out method on the test videos. The results (with τ = 4, n = 1, P = 8, and R = 2) for the tests are shown in Table 3. We do not include the no action and no detection labels in the confusion matrix, as it would be dominated by the 4,000 frames in which there is no action. The number of actions in the database was 101 and our method recognized 96 correctly with 106 detections. This gives the

\text{Recognition rate} = \frac{\text{Correct detections}}{\text{Actions in db}} = \frac{96}{101} \approx 95\% \qquad (6)

and

\text{Accuracy} = \frac{\text{Correct detections}}{\text{All detections}} = \frac{96}{106} \approx 91\% \qquad (7)

A recognition rate of 90% and a detection accuracy of 83% were reported by Kellokumpu et al. Our result is better, though the comparison is only indicative as the training data are different.

It can be noticed from the confusion matrix in Table 3 that many of the false alarms occur when the ground truth is waving one hand or both. Most of these false alarms actually come from one subject whose range of motion was much larger than that of the others. This results in false detections of raising and lowering hand(s). Based on the training samples, one could argue that these detections could be interpreted as correct as well, since the hand movement during waving hand(s) was quite similar to repeated raising and lowering hand(s) motions.

7 Conclusions

In this paper we adopted the ideas of spatiotemporal analysis and the use of local features for human movement description, and we proposed two novel texture-based methods for human action recognition that naturally combine motion and appearance cues. The first uses temporal templates to capture movement dynamics and then uses texture features to characterize the observed movements. We then extended this idea into a spatiotemporal space and described human movements with dynamic texture features. Following recent trends in computer vision, the latter method is designed to work with image data rather than silhouettes.

By using local properties of human motion, our methods are robust against variations in performing actions, as the high accuracy in the experiments on the Weizmann and KTH datasets shows. The proposed methods are computationally simple and suitable for various applications. We also showed that our method can be used for online recognition with very few errors.

We also notice in our experiments that the dynamic texture-based method works better when images are used as input instead of silhouettes. Furthermore, although the performance on the easy Weizmann dataset is similar using the dynamic and static texture-based methods, the image-based dynamic texture method outperforms the simpler static texture-based method on the more challenging KTH dataset. We conclude that image-based approaches have the potential to outperform the silhouette-based methods that have dominated the action recognition field in the past.

As future work, we plan to develop the background subtraction part further. Also, as the xy plane contains much useful information, we intend to examine how the data from the xy plane could be efficiently fused into the description. We also plan to extend and apply the method to other motion recognition problems.

Acknowledgments The research was sponsored by the Graduate School in Electronics, Telecommunication and Automation (GETA) and the Academy of Finland.

References

1. Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. In: Proceedings of the ICCV, pp. 1395–1402 (2005)
2. Bobick, A., Davis, J.: The recognition of human movement using temporal templates. PAMI 23(3), 257–267 (2001)
3. Boiman, O., Irani, M.: Similarity by composition. In: Proceedings of the Neural Information Processing Systems (NIPS) (2006)
4. Dollár, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: VS-PETS Workshop (2005)
5. Gavrila, D.M.: The visual analysis of human movement: a survey. CVIU 73(3), 82–98 (1999)
6. Heikkilä, M., Pietikäinen, M.: A texture-based method for modeling the background and detecting moving objects. PAMI 28(4), 657–662 (2006)
7. Ikizler, N., Duygulu, P.: Human action recognition using distribution of oriented rectangular patches. In: ICCV Workshop on Human Motion Understanding, Modeling, Capture and Animation (2007)
8. Ke, Y., Sukthankar, R., Hebert, M.: Efficient visual event detection using volumetric features. In: Proceedings of the ICCV, pp. 165–173 (2005)
9. Ke, Y., Sukthankar, R., Hebert, M.: Spatio-temporal shape and flow correlation for action recognition. In: Proceedings of the CVPR, 8 pp (2007)
10. Kellokumpu, V., Zhao, G., Pietikäinen, M.: Human activity recognition using a dynamic texture based method. In: Proceedings of the BMVC, 10 pp (2008)
11. Kellokumpu, V., Pietikäinen, M., Heikkilä, J.: Human activity recognition using sequences of postures. In: Proceedings of the MVA, pp. 570–573 (2005)
12. Kellokumpu, V., Zhao, G., Pietikäinen, M.: Texture based description of movements for activity analysis. In: Proceedings of the VISAPP (2008)
13. Kim, K., Chalidabhongse, T.H., Harwood, D., Davis, L.: Background modeling and subtraction by codebook construction. Proc. ICIP 5, 3061–3064 (2004)
14. Kim, T., Wong, S., Cipolla, R.: Tensor canonical correlation analysis for action classification. In: Proceedings of the CVPR, 8 pp (2007)
15. Kobyashi, T., Otsu, N.: Action and simultaneous multiple-person identification using cubic higher-order auto-correlation. Proc. ICPR 4, 741–744 (2004)
16. Laptev, I., Lindeberg, T.: Space-time interest points. Proc. ICCV 1, 432–439 (2003)
17. Moeslund, T.B., Hilton, A., Krüger, V.: A survey of advances in vision-based human motion capture and analysis. CVIU 104(2–3), 90–126 (2006)
18. Niebles, J.C., Fei-Fei, L.: A hierarchical model of shape and appearance for human action classification. In: Proceedings of the CVPR, 8 pp (2007)
19. Niebles, J., Wang, H., Fei-Fei, L.: Unsupervised learning of human action categories using spatial-temporal words. IJCV 79(3), 299–318 (2008)
20. Niyogi, S.A., Adelson, E.H.: Analyzing and recognizing walking figures in XYT. In: Proceedings of the CVPR, pp. 469–474 (1994)
21. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. PAMI 24(7), 971–987 (2002)
22. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–285 (1989)
23. Schindler, K., van Gool, L.: Action snippets: how many frames does human action recognition require? In: Proceedings of the CVPR, 8 pp (2008)
24. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: Proceedings of the ICPR, pp. 32–36 (2004)
25. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional SIFT descriptor and its application to action recognition. In: Proceedings of the ACM Multimedia, pp. 357–360 (2007)
26. Shechtman, E., Irani, M.: Space-time behavior based correlation. Proc. CVPR 1, 405–412 (2005)
27. Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time tracking. Proc. CVPR 2, 246–252 (1999)
28. Wang, L., Suter, D.: Recognizing human activities from silhouettes: motion subspace and factorial discriminative graphical model. In: Proceedings of the CVPR, 8 pp (2007)
29. Wong, S., Kim, T., Cipolla, R.: Learning motion categories using both semantic and structural information. In: Proceedings of the CVPR, 6 pp (2007)
30. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using motion history volumes. CVIU 104(2–3), 249–257 (2006)
31. Yilmaz, A., Shah, M.: Action sketch: a novel action representation. Proc. CVPR 1, 984–989 (2005)
32. Zhao, G., Pietikäinen, M.: Dynamic texture recognition using local binary patterns with an application to facial expressions. PAMI 29(6), 915–928 (2007)
33. Zhao, G., Barnard, M., Pietikäinen, M.: Lipreading with local spatiotemporal descriptors. IEEE Trans. Multimed. 11(7), 1254–1265 (2009)
