research article a human activity recognition system using skeleton...

15
Research Article A Human Activity Recognition System Using Skeleton Data from RGBD Sensors Enea Cippitelli, Samuele Gasparrini, Ennio Gambi, and Susanna Spinsante Dipartimento di Ingegneria dell’Informazione, Universit` a Politecnica delle Marche, 60131 Ancona, Italy Correspondence should be addressed to Enea Cippitelli; [email protected] Received 3 December 2015; Revised 3 February 2016; Accepted 18 February 2016 Academic Editor: Robertas Damaˇ seviˇ cius Copyright © 2016 Enea Cippitelli et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. e aim of Active and Assisted Living is to develop tools to promote the ageing in place of elderly people, and human activity recognition algorithms can help to monitor aged people in home environments. Different types of sensors can be used to address this task and the RGBD sensors, especially the ones used for gaming, are cost-effective and provide much information about the environment. is work aims to propose an activity recognition algorithm exploiting skeleton data extracted by RGBD sensors. e system is based on the extraction of key poses to compose a feature vector, and a multiclass Support Vector Machine to perform classification. Computation and association of key poses are carried out using a clustering algorithm, without the need of a learning algorithm. e proposed approach is evaluated on five publicly available datasets for activity recognition, showing promising results especially when applied for the recognition of AAL related actions. Finally, the current applicability of this solution in AAL scenarios and the future improvements needed are discussed. 1. Introduction People ageing is one of the main problems in modern and developed society and Active and Assisted Living (AAL) tools may allow to reduce social costs by helping older people to age at home. In the last years, several tools have been proposed to improve quality of life of elderly people, from the remote control of health conditions to the improvement of safety [1]. Human activity recognition (HAR) is a hot research topic since it may enable different applications, from the most commercial (gaming or Human Computer Interaction) to the most assistive ones. In this area, HAR can be applied, for example, to detect dangerous events or to monitor people living alone. is task can be accomplished using different sensors, mainly represented by wearable sensors or vision- based devices [2], even if there is a growing number of researchers working on radio-based solutions [3], or others who fuse data captured from wearable and ambient sensors [4, 5]. e availability in the market of RGBD sensors fostered the development of promising approaches to build reliable and cost-effective solutions [6]. Using depth sensors, like Microsoſt Kinect or other similar devices, it is possible to design activity recognition systems exploiting depth maps, which are a good source of information because they are not affected by environment light variations, can provide body shape, and simplify the problem of human detection and segmentation [7]. Furthermore, the availability of skeleton joints extracted from the depth frames allows having a com- pact representation of the human body that can be used in many applications [8]. HAR may have strong privacy-related implications, and privacy is a key factor in AAL. RGBD sensors are much more privacy preserving than traditional video cameras: thanks to the easy computation of the human silhouette, it is possible to achieve an even higher level of privacy by using only the skeleton to represent a person [9]. In this work, a human action recognition algorithm exploiting the skeleton provided by Microsoſt Kinect is proposed, and its application in AAL scenarios is discussed. e algorithm starts from a skeleton model and computes posture features. en, a clustering algorithm selects the key poses, which are the most informative postures, and a vector containing the activity features is composed. Finally, a multiclass Support Vector Machine (SVM) is exploited to obtain different activities. e proposed algorithm has been tested on five publicly available datasets and it out- performs the state-of-the-art results obtained from two of Hindawi Publishing Corporation Computational Intelligence and Neuroscience Volume 2016, Article ID 4351435, 14 pages http://dx.doi.org/10.1155/2016/4351435

Upload: others

Post on 31-May-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Research Article A Human Activity Recognition System Using Skeleton …downloads.hindawi.com/journals/cin/2016/4351435.pdf · 2019-07-30 · Kinect skeleton is motivated by the fact

Research ArticleA Human Activity Recognition System Using SkeletonData from RGBD Sensors

Enea Cippitelli Samuele Gasparrini Ennio Gambi and Susanna Spinsante

Dipartimento di Ingegneria dellrsquoInformazione Universita Politecnica delle Marche 60131 Ancona Italy

Correspondence should be addressed to Enea Cippitelli ecippitelliunivpmit

Received 3 December 2015 Revised 3 February 2016 Accepted 18 February 2016

Academic Editor Robertas Damasevicius

Copyright copy 2016 Enea Cippitelli et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

The aim of Active and Assisted Living is to develop tools to promote the ageing in place of elderly people and human activityrecognition algorithms can help to monitor aged people in home environments Different types of sensors can be used to addressthis task and the RGBD sensors especially the ones used for gaming are cost-effective and provide much information about theenvironmentThis work aims to propose an activity recognition algorithm exploiting skeleton data extracted by RGBD sensorsThesystem is based on the extraction of key poses to compose a feature vector and a multiclass Support Vector Machine to performclassification Computation and association of key poses are carried out using a clustering algorithm without the need of a learningalgorithmThe proposed approach is evaluated on five publicly available datasets for activity recognition showing promising resultsespecially when applied for the recognition ofAAL related actions Finally the current applicability of this solution inAAL scenariosand the future improvements needed are discussed

1 Introduction

People ageing is one of the main problems in modern anddeveloped society andActive andAssisted Living (AAL) toolsmay allow to reduce social costs by helping older people to ageat home In the last years several tools have been proposedto improve quality of life of elderly people from the remotecontrol of health conditions to the improvement of safety [1]

Human activity recognition (HAR) is a hot research topicsince it may enable different applications from the mostcommercial (gaming orHumanComputer Interaction) to themost assistive ones In this area HAR can be applied forexample to detect dangerous events or to monitor peopleliving alone This task can be accomplished using differentsensors mainly represented by wearable sensors or vision-based devices [2] even if there is a growing number ofresearchers working on radio-based solutions [3] or otherswho fuse data captured from wearable and ambient sensors[4 5]The availability in themarket of RGBD sensors fosteredthe development of promising approaches to build reliableand cost-effective solutions [6] Using depth sensors likeMicrosoft Kinect or other similar devices it is possible todesign activity recognition systems exploiting depth maps

which are a good source of information because they are notaffected by environment light variations can provide bodyshape and simplify the problem of human detection andsegmentation [7] Furthermore the availability of skeletonjoints extracted from the depth frames allows having a com-pact representation of the human body that can be used inmany applications [8] HAR may have strong privacy-relatedimplications and privacy is a key factor in AAL RGBDsensors are much more privacy preserving than traditionalvideo cameras thanks to the easy computation of the humansilhouette it is possible to achieve an even higher level ofprivacy by using only the skeleton to represent a person [9]

In this work a human action recognition algorithmexploiting the skeleton provided by Microsoft Kinect isproposed and its application in AAL scenarios is discussedThe algorithm starts from a skeleton model and computesposture features Then a clustering algorithm selects thekey poses which are the most informative postures and avector containing the activity features is composed Finallya multiclass Support Vector Machine (SVM) is exploitedto obtain different activities The proposed algorithm hasbeen tested on five publicly available datasets and it out-performs the state-of-the-art results obtained from two of

Hindawi Publishing CorporationComputational Intelligence and NeuroscienceVolume 2016 Article ID 4351435 14 pageshttpdxdoiorg10115520164351435

2 Computational Intelligence and Neuroscience

them showing interesting performances in the evaluation ofactivities specifically related to AAL

The paper is organized as follows Section 2 reviewsrelated works in human activity recognition using RGBDsensors Section 3 describes the proposed algorithm whereasthe experimental results are shown in Section 4 Section 5discusses the applicability of this system to the AAL scenarioand finally Section 6 provides concluding remarks

2 Related Works

In the last years many solutions for human activity recog-nition have been proposed some of them aimed to extractfeatures from depth data such as [10] where the main ideais to evaluate spatiotemporal depth subvolume descriptorsA group of hypersurface normals (polynormal) containinggeometry and local motion information is extracted fromdepth sequences The polynormals are then aggregated toconstitute the final representation of the depth map calledSuper Normal Vector (SNV)This representation can includealso skeleton joint trajectories improving the recognitionresults when peoplemove a lot in a sequence of depth framesDepth images can be seen as sequence features modeledtemporally as subspaces lying on the Grassmann manifold[11] This representation starting from the orientation of thenormal vector at every surface point describes the geometricappearance and the dynamic of human body without usingjoint position Other works proposed holistic descriptors theHON4D descriptor [12] which is based on the orientationsof normal surfaces in 4D and HOPC descriptor [13] whichis able to represent the geometric characteristics of a sequenceof 3D points

Other works exploit both depth and skeleton data forexample the 35D representation combines the skeleton jointinformation with features extracted from depth images inthe region surrounding each node of interest [14] The fea-tures are extracted using an extended Independent SubspaceAnalysis (ISA) algorithm by applying it only to local regionof joints instead of the entire video thus improving thetraining efficiency The depth information makes it easy toextract the human silhouette which can be concatenatedwithnormalized skeleton features to improve the recognition rate[15] Depth and skeleton features can be combined at differentlevels of the activity recognition algorithm Althloothi et al[16] proposed amethod where the data are fused at the kernellevel instead of the feature level using the Multiple KernelLearning (MKL) technique On the other hand fusion at thefeature level of spatiotemporal features and skeleton jointsis performed in [17] In such a work several spatiotemporalinterest point detectors such as Harris 3D ESURF [18] andHOG3D [19] have been fused using regression forests withthe skeleton joint features consisting of posture movementand offset information

Skeleton joints extracted from depth frames can be com-bined also with RGB data Luo et al [20] proposed a humanaction recognition framework where the pairwise relativepositions of joints and Center-Symmetric Motion LocalTernary Pattern (CS-Mltp) features from RGB are fused bothat feature level and at classifier level Spatiotemporal Interest

Points (STIP) are typically used in activity recognition wheredata are represented by RGB frames This approach canbe also extended to depth and skeleton data combiningthe features with random forests [21] The results are verygood but depth estimation noise and background may havean impact on interest point detection so the depth STIPfeatures have to be constrained using skeleton joint positionsor RGB videos Instead of using spatiotemporal featuresanother approach for human activity recognition relies ongraph-based methods for sequential modeling of RGB dataThis concept can be extended to depth information and anapproach based on coupled Hidden Conditional RandomFields (cHCRF) model where visual feature sequences areextracted from RGB and depth data has been proposed[22] The main advantage of this approach is the capabilityto preserve the dynamics of individual sequences even ifthe complementary information from RGB and depth areshared

Some previous works simply rely on Kinect skeleton datasuch as the proposed algorithm Maybe they are simplerapproaches but in many cases they can achieve performancevery close to the algorithms exploiting multimodal data andsometimes they also perform better than those solutionsDevanne et al [23] proposed representing human actionsby spatiotemporal motion trajectories in a 60-dimensionalspace since they considered 20 joints each of them with3 coordinates Then an elastic metric which means ametric invariant to speed and time of the action withina Riemannian shape space is employed to represent thedistance between two curves Finally the action recognitionproblem can be seen as a classification in the Riemannianspace using a 119896-Nearest-Neighbor (119896-NN) classifier Otherskeleton representations have been proposed The APJ3Drepresentation [24] is constituted starting by a subset of 15skeleton joints from which the relative positions and localspherical angles are computed After a selection of key-postures the action is partitioned using a reviewed FourierTemporal Pyramid [25] and the classification is made byrandom forests Another joint representation is calledHOJ3D[26] where the 3D space is partitioned into 119899 bins and thejoints are associated with each bin using a Gaussian weightfunction Then a discrete Hidden Markov Model (HMM) isemployed to model the temporal evolution of the posturesattained using a clustering algorithm A human action canbe characterized also by a combination of static posturefeatures representing the actual frame consecutive motionfeatures computed using the actual and the previous framesand overall dynamics features which consider the actual andthe initial frames [27] Taha et al [28] also exploit jointsspherical coordinates to represent the skeleton and a frame-work composed of a multiclass SVM and a discrete HMMto recognize activities constituted by many actions Otherapproaches exploit a double machine learning algorithm toclassify actions for example Gaglio et al [29] consider amulticlass SVM to estimate the postures and a discrete HMMto model an activity as a sequence of postures Also in[30] human actions are considered as a sequence of bodyposes over time and skeletal data are processed to obtaininvariant pose representations given by 8 pairs of angles

Computational Intelligence and Neuroscience 3

Then the recognition is realized using the representation inthe dissimilarity space where different feature trajectoriesmaintain discriminant information and have a fixed-lengthrepresentation Ding et al [31] proposed a Spatiotempo-ral Feature Chain (STFC) to represent the human actionsby trajectories of joint positions Before using the STFCmodel a graph is used to erase periodic sequences makingthe solution more robust to noise and periodic sequencemisalignment Slama et al [32] exploited the geometricstructure of the Grassmann manifold for action analysisIn fact considering the problem as a sequence matchingtask this manifold allows considering an action sequence asa point on its space and provides tools to make statisticalanalysis Considering that the relative geometry betweenbody parts is more meaningful than their absolute locationsrotations and translations required to perform rigid bodytransformations can be represented as points in a SpecialEuclidean SE(3) group Each skeleton can be represented asa point in the Lie group SE(3) times SE(3) times sdot sdot sdot times SE(3) and ahuman action can be modeled as a curve in this Lie group[33] The same skeleton feature is also used in [34] whereManifold Functional PCA (mfPCA) is employed to reducefeature dimensionality Some works developed techniques toautomatically select the most informative joints aiming atincreasing the recognition accuracy and reducing the noiseeffect on the skeleton estimation [35 36]

The work presented in this paper is based on the conceptthat an action can be seen as a sequence of informativepostures which are known as ldquokey posesrdquoThis idea has beenintroduced in [37] and used in other subsequent proposals[15 36 38] While in previous works a fixed set of key poses(119870) is extracted for each action of the dataset considering allthe training sequences in this work the clustering algorithmwhich selects the most informative postures is executed foreach sequence thus selecting a different set of 119870 poseswhich constitutes the feature vector This procedure avoidsthe application of a learning algorithm which has to find thenearest neighbor key pose for each frame constituting thesequence

3 Activity Recognition Algorithm

The proposed algorithm for activity recognition starts fromskeleton joints and computes a vector of features for eachactivity Then a multiclass machine learning algorithmwhere each class represents a different activity is exploited forclassification purpose Four main steps constitute the wholealgorithm they are represented in Figure 1 and are discussedin the following

(1) Posture Features Extraction The coordinates of theskeleton joints are used to evaluate the feature vectorswhich represent human postures

(2) Postures Selection The most important postures foreach activity are selected

(3) Activity Features Computation A feature vector rep-resenting the whole activity is created and used forclassification

(4) Classification The classification is realized using amulticlass SVM implemented with the ldquoone-versus-onerdquo approach

First there is the need to extract the features fromskeleton data representing the input to the algorithm Thejoint extraction algorithm proposed in [8] is included in theKinect libraries and is ready to use The choice to considerKinect skeleton is motivated by the fact that it is easy to havea compact representation of the human body Starting fromskeleton data many features which are able to represent ahuman pose and consequently a human action have beenproposed in the literature In [6] the authors found thatmanyfeatures can be computed from skeleton joints The simplestfeatures can be extracted by joint locations or from theirdistances considering spatial information Other featuresmay involve joints orientation or motion considering spatialand temporal data More complex features may be initiallybased on the estimation of a plane considering some jointsThen by measuring the distances between this plane andother joints the features can be extracted

The proposed algorithm exploits spatial features com-puted from 3D skeleton coordinates without including thetime information in the computation in order to make thesystem independent of the speed of movement The featureextraction method has been introduced in [36] and it is hereadopted with small differences For each skeleton frame aposture feature vector is computed Each joint is representedby J119894 a three-dimensional vector in the coordinate space of

Kinect The person can be found at any place within thecoverage area of Kinect and the coordinates of the same jointmay assume different values It is necessary to compensatethis effect by using a proper features computation algorithmA straightforward solution is to compensate the position ofthe skeleton by centering the coordinate space in one skeletonjoint Considering a skeleton composed of 119875 joints J

0being

the coordinates of the torso joint and J2being the coordinates

of the neck joint the 119894th joint feature d119894is the distance vector

between J119894and J0 normalized to the distance between J

2and

J0

d119894=

J119894minus J0

1003817100381710038171003817J2 minus J01003817100381710038171003817

119894 = 1 2 119875 minus 1 (1)

This feature is invariant to the position of the skeleton withinthe coverage area ofKinect furthermore the invariance of thefeature to the build of the person is obtained by normalizationwith respect to the distance between the neck and torso jointsThese features may be seen as a set of distance vectors whichconnect each joint to the joint of the torso A posture featurevector f is created for each skeleton frame

f = [d1 d2 d3 d

119875minus1] (2)

A set of 119873 feature vectors is computed having an activityconstituted by119873 frames

The second phase concerns the human postures selectionwith the aim of reducing the complexity and increasinggenerality by representing the activity by means of only asubset of poses without using all the frames A clustering

4 Computational Intelligence and Neuroscience

1

23

4

Test Test

Train

LA

(A L)

SVM1

SVM2

SVM3C1

C2

C3

CK

SVMM(Mminus1)2

Classification

Activity features computation Postures selection

Posture features extraction

A = [C1C2C3 CK]

s1 = [J0 J1 J2 J3 JPminus1]s2 = [J0 J1 J2 J3 JPminus1]s3 = [J0 J1 J2 J3 JPminus1]s4 = [J0 J1 J2 J3 JPminus1]

sN = [J0 J1 J2 J3 JPminus1]

f1 = [d1 d2 d3 dPminus1]f2 = [d1 d2 d3 dPminus1]f3 = [d1 d2 d3 dPminus1]f4 = [d1 d2 d3 dPminus1]

fN = [d1 d2 d3 dPminus1]f = [d1 d2 d3 dPminus1]

Ji = ith joint

di =Ji minus J0J2 minus J0

i = 1 2 P minus 1

Figure 1 The general scheme of the activity recognition algorithm is composed of 4 steps In the first step the posture feature vectors arecomputed for each skeleton frame then the postures are selected and an activity features vector is created Finally amulticlass SVM is exploitedfor classification

algorithm is used to process 119873 feature vectors constitutingthe activity by grouping them into 119870 clusters The well-known 119896-means clustering algorithm based on the squaredEuclidean distance as a metric can be used to group togetherthe frames representing similar postures Considering anactivity composed of119873 feature vectors [f

1 f2 f3 f

119873] the

119896-means algorithm gives as outputs 119873 clusters ID (one foreach feature vector) and 119870 vectors [C

1C2C3 C

119870] that

represent the centroids of each cluster The feature vectorsare partitioned into clusters 119878

1 1198782 119878

119870so as to satisfy the

condition expressed by

argminS

119870

sum

119895=1

sum

f119894isin119878119895

10038171003817100381710038171003817f119894minus C119895

10038171003817100381710038171003817

2

(3)

The 119870 centroids can be seen as the main postures whichare the most important feature vectors Unlike classicalapproaches based on key poses where the most informativepostures are evaluated by considering all the sequencesof each activity in the proposed solution the clusteringalgorithm is executed for each sequence This avoids theapplication of a learning algorithm required to associate eachframe to the closest key pose and allows to have a morecompact representation of the activity

The third phase is related to the computation of a featurevector which models the whole activity starting from the119870 centroid vectors computed by the clustering algorithmIn more detail [C

1C2C3 C

119870] vectors are sorted con-

sidering the order in which the clusterrsquos elements occurduring the activity The activity features vector is composedof concatenating the sorted centroid vectors For example

considering an activity featured by 119873 = 10 and 119870 = 4after running the 119896-means algorithm one of the possibleoutputs could be the following sequence of cluster IDs[2 2 2 3 3 1 1 4 4 4] meaning that the first three posturevectors belong to cluster 2 the fourth and the fifth areassociated with cluster 3 and so on In this case the activityfeatures vector is A = [C

2C3C1C4] A feature activity

vector has a dimension of 3119870(119875 minus 1) that can be handledwithout using dimensionality reduction algorithms such asPCA if 119870 is small

The classification step aims to associate each featureactivity vector to the correct activity Many machine learningalgorithms may be applied to fulfil this task among them aSVM Considering a number of 119897 training vectors x

119894isin 119877119899 and

a vector of labels 119910 isin 119877119897 where 119910119894isin minus1 1 a binary SVM

can be formulated as follows [39]

minw119887120585

1

2w119879w + 119862

119897

sum

119894=1

120585119894

subject to 119910119894(w119879120601 (x

119894) + 119887) ge 1 minus 120585

119894

120585119894ge 0 119894 = 1 119897

(4)

where

w119879120601 (x) + 119887 = 0 (5)

is the optimal hyperplane that allows separation betweenclasses in the feature space119862 is a constant and 120585

119894are nonneg-

ative variables which consider training errorsThe function 120601allows transforming between the features space and an higher

Computational Intelligence and Neuroscience 5

12

30456

78

910

1112

131415

16 17

18 19

(a)

12

30456

78

910

1112

131415

16 17

18 19

(b)

12

30456

78

910

1112

131415

1617

18 19

(c)

12

30456

78

910

1112

131415

16 17

18 19(d)

Figure 2 Subsets of joints considered in the evaluation of the proposed algorithmThewhole skeleton is represented by 20 joints the selectedones are depicted as green circles while the discarded joints are represented by red squares Subsets of 7 (a) 11 (b) and 15 (c) joints and thewhole set of 20 joints (d) are selected

dimensional space where the data are separable Consideringtwo training vectors x

119894and x

119895 the kernel function can be

defined as

119870(x119894 x119895) = 120601 (x

119894)119879120601 (x119895) (6)

In this work the Radial Basis Function (RBF) kernel has beenused where

119870(x119894 x119895) = 119890minus120574x119894minusx1198952 120574 gt 0 (7)

It follows that 119862 and 120574 are the parameters that have to beestimated prior using the SVM

The idea herein exploited is to use a multiclass SVMwhere each class represents an activity of the dataset In orderto extend the role from a binary to amulticlass classifier someapproaches have been proposed in the literature In [40] theauthors compared many methods and found that the ldquoone-against-onerdquo is one of the most suitable for practical use Itis implemented in LIBSVM [41] and it is the approach usedin this workThe ldquoone-against-onerdquo approach is based on theconstruction of several binary SVM classifiers inmore detaila number of119872(119872minus1)2 binary SVMs are necessary in a119872-classes dataset This happens because each SVM is trained todistinguish between 2 classes and the final decision is takenexploiting a voting strategy among all the binary classifiersDuring the training phase the activity feature vectors aregiven as inputs to the multiclass SVM together with the label119871 of the action In the test phase the activity label is obtainedfrom the classifier

4 Experimental Results

Thealgorithmperformance is evaluated on five publicly avail-able datasets In order to perform an objective comparisonto previous works the reference test procedures have beenconsidered for each dataset The performance indicators areevaluated using four different subsets of joints shown inFigure 2 going from a minimum of 7 up to a maximum of20 joints Finally in order to evaluate the performance inAAL scenarios some suitable activities related to AAL areselected from the datasets and the recognition accuracies areevaluated

41 Datasets Five different 3D datasets have been consideredin this work each of them including a different set of activitiesand gestures

KARD dataset [29] is composed of 18 activities that canbe divided into 10 gestures (horizontal arm wave high armwave two-hand wave high throw draw X draw tick forwardkick side kick bend and hand clap) and eight actions (catchcap toss paper take umbrellawalk phone call drink sit downand stand up) This dataset has been captured in a controlledenvironment that is an office with a static background anda Kinect device placed at a distance of 2-3m from the subjectSome objects were present in the area useful to perform someof the actions a desk with a phone a coat rack and a wastebin The activities have been performed by 10 young people(nine males and one female) aged from 20 to 30 years andfrom 150 to 185 cm tall Each person repeated each activity3 times creating a number of 540 sequences The dataset iscomposed of RGB and depth frames captured at a rate of30 fps with a 640 times 480 resolution In addition 15 joints ofthe skeleton in world and screen coordinates are providedThe skeleton has been captured using OpenNI libraries [42]

The Cornell Activity Dataset (CAD-60) [43] is madeby 12 different activities typical of indoor environmentsThe activities are rinsing mouth brushing teeth wearingcontact lens talking on the phone drinking water openingpill container cooking-chopping cooking-stirring talking oncouch relaxing on couch writing on whiteboard and workingon computer and are performed in 5 different environmentsbathroom bedroom kitchen living room and office All theactivities are performed by 4 different people two males andtwo females one of which is left-handed No instructionswere given to actors about how to perform the activitiesthe authors simply ensured that the skeleton was correctlydetected by Kinect The dataset is composed of RGB depthand skeleton data with 15 joints available

The UTKinect dataset [26] is composed of 10 differentsubjects (9males and 1 female) performing 10 activities twiceThe following activities are part of the dataset walk sit downstand up pick up carry throw push pull wave and claphands A number of 199 sequences are available because onesequence is not labeled and the length of sample actions

6 Computational Intelligence and Neuroscience

ranges from 5 to 120 frames The dataset provides 640 times 480RGB frames and 320 times 240 depth frames together with 20skeleton joints captured using Kinect forWindows SDKBetaVersion with a final frame rate of about 15 fps

The Florence3D dataset [44] includes 9 different activi-ties wave drink from a bottle answer phone clap tight lacesit down stand up read watch and bow These activities wereperformed by 10 different subjects for 2 or 3 times resultingin a total number of 215 sequences This is a challengingdataset since the same action is performed with both handsand because of the presence of very similar actions such asdrink from a bottle and answer phone The activities wererecorded in different environments and only RGB videos and15 skeleton joints are available

Finally the MSR Action3D [45] represents one of themost used datasets for HAR It includes 20 activities per-formed by 10 subjects 2 or 3 times In total 567 sequencesof depth (320 times 240) and skeleton frames are providedbut 10 of them have to be discarded because the skeletonsare either missing or affected by too many errors Thefollowing activities are included in the dataset high armwavehorizontal arm wave hammer hand catch forward punchhigh throw draw X draw tick draw circle hand clap two-hand wave side boxing bend forward kick side kick joggingtennis swing tennis serve golf swing and pickup and throwThe dataset has been collected using a structured-light depthcamera at 15 fps RGB data are not available

42 Tests and Results The proposed algorithm has beentested over the datasets detailed above following the recom-mendations provided in each reference paper in order toensure a fair comparison to previous works Following thiscomparison a more AAL oriented evaluation has been con-ducted which consists of considering only suitable actionsfor each dataset excluding the gestures Another type ofevaluation regards the subset of skeleton joints that has to beincluded in the feature computation

421 KARD Dataset Gaglio et al [29] collected the KARDdataset and proposed some evaluation experiments on itThey considered three different experiments and two modal-ities of dataset splitting The experiments are as follows

(i) Experiment A one-third of the data is considered fortraining and the rest for testing

(ii) Experiment B two-thirds of the data is considered fortraining and the rest for testing

(iii) Experiment C half of the data is considered fortraining and the rest for testing

The activities constituting the dataset are split in the followinggroups

(i) Gestures and Actions(ii) Activity Set 1 Activity Set 2 and Activity Set 3 as

listed in Table 1 Activity Set 1 is the simplest onesince it is composed of quite different activities whilethe other two sets include more similar actions andgestures

Table 1 Activity sets grouping different and similar activities fromKARD dataset

Activity Set 1 Activity Set 2 Activity Set 3Horizontal arm wave High arm wave Draw tickTwo-hand wave Side kick DrinkBend Catch cap Sit downPhone call Draw tick Phone callStand up Hand clap Take umbrellaForward kick Forward kick Toss paperDraw X Bend High throwWalk Sit down Horizontal arm wave

Each experiment has been repeated 10 times randomlysplitting training and testing data Finally the ldquonew-personrdquoscenario is also performed that is a leave-one-actor-outsetting consisting of training the system on nine of the tenpeople of the dataset and testing on the tenth In the ldquonew-personrdquo test no recommendation is provided about how tosplit the dataset so we assumed that the whole dataset of 18activities is considered The only parameter that can be setin the proposed algorithm is the number of clusters 119870 anddifferent subsets of skeleton joints are considered Since only15 skeleton joints are available the fourth group of 20 joints(Figure 2(d)) cannot be considered

For each test conducted on KARD dataset the sequenceof clusters 119870 = [3 5 10 15 20 25 30 35] has been consid-ered The results concerning Activity Set tests are reported inTable 2 it is shown that the proposed algorithm outperformsthe originally proposed one in almost all the tests The 119870parameter which gives the maximum accuracy is quite highsince it is 30 or 35 in most of the cases It means that forthe KARD dataset it is better to have a significant numberof clusters representing each activity This is possible alsobecause the number of frames that constitutes each activitygoes from aminimumof 42 in a sequence of hand clap gestureto a maximum of 310 for a walk sequence Experiment Ain Activity Set 1 shows the highest difference between theminimum and the maximum accuracy when varying thenumber of clusters For example considering 119875 = 7 themaximum accuracy (982) is shown in Table 2 and it isobtained with 119870 = 35 The minimum accuracy is 914 andit is obtained with 119870 = 5 The difference is quite high (68)but this gap is reduced by considering more training data Infact Experiment B shows a difference of 05 in Activity Set1 11 in Activity Set 2 and 26 in Activity Set 3

Considering the number of selected joints the observa-tion of Tables 2 and 3 lets us conclude that not all the jointsare necessary to achieve good recognition results In moredetail from Table 2 it can be noticed that the Activity Set 1and Activity Set 2 which are the simplest ones have goodrecognition results using a subset composed of 119875 = 7 jointsThe Activity Set 3 composed of more similar activities isbetter recognized with 119875 = 11 joints In any case it is notnecessary to consider all the skeleton joints provided fromKARD dataset

The results obtained with the ldquonew-personrdquo scenario areshown in Table 4 The best result is obtained with 119870 = 30

Computational Intelligence and Neuroscience 7

Table 2 Accuracy () of the proposed algorithm compared to the other using KARD dataset with different Activity Sets and for differentexperiments

Activity Set 1 Activity Set 2 Activity Set 3A B C A B C A B C

Gaglio et al [29] 951 991 930 899 949 901 842 895 817

Proposed (119875 = 7) 982 984 981 997 100 997 902 950 913

Proposed (119875 = 11) 980 990 977 998 100 996 916 958 933Proposed (119875 = 15) 975 988 976 995 100 996 91 951 932

090 003 007097 003

100100

003 093 003007 007 083 003007 007 077 010

003 093 003100

100100

100100

100090 010

003 003 007 087100

100

Hor

arm

wav

e

Hig

h ar

m w

ave

Two-

hand

wav

e

Catch

cap

Hig

h th

row

Dra

w X

Dra

w tic

k

Toss

pape

r

Forw

kick

Side

kick

Take

um

br

Bend

Han

d cla

p

Wal

k

Phon

e cal

l

Drin

k

Sit d

own

Stan

d up

Hor arm waveHigh arm wave

Two-hand waveCatch cap

High throwDraw X

Draw tickToss paperForw kick

Side kickTake umbr

BendHand clap

WalkPhone call

DrinkSit downStand up

Figure 3 Confusion matrix of the ldquonew-personrdquo test on the whole KARD dataset

Table 3 Accuracy () of the proposed algorithm compared tothe other using KARD dataset with dataset split in Gestures andActions for different experiments

Gestures ActionsA B C A B C

Gaglio et al [29] 865 930 867 925 950 901

Proposed (119875 = 7) 899 935 925 991 996 994Proposed (119875 = 11) 899 959 937 990 999 991Proposed (119875 = 15) 874 936 928 987 995 993

Table 4 Precision () and recall () of the proposed algorithmcompared to the other using the whole KARD dataset and ldquonew-personrdquo setting

Algorithm Precision RecallGaglio et al [29] 848 845

Proposed (119875 = 7) 940 937

Proposed (119875 = 11) 951 950Proposed (119875 = 15) 950 948

using a number of 119875 = 11 joints The overall precision andrecall are about 10higher than the previous approachwhichuses the KARD dataset In this condition the confusion

matrix obtained is shown in Figure 3 It can be noticed thatthe actions are distinguished very well only phone call anddrink show a recognition accuracy equal or lower than 90and sometimes they are mixed with each other The mostcritical activities are the draw X and draw tick gestures whichare quite similar

From the AAL point of view only some activities arerelevant Table 3 shows that the eight actions constituting theKARD dataset are recognized with an accuracy greater than98 even if there are some similar actions such as sit downand stand up or phone call and drink Considering only theActions subset the lower recognition accuracy is 949and itis obtained with119870 = 5 and119875 = 7 It means that the algorithmis able to reach a high recognition rate even if the featurevector is limited to 90 elements

422 CAD-60 Dataset TheCAD-60 dataset is a challengingdataset consisting of 12 activities performed by 4 people in5 different environments The dataset is usually evaluatedby splitting the activities according to the environmentthe global performance of the algorithm is given by theaverage precision and recall among all the environments Twodifferent settings were experimented for CAD-60 in [43]Theformer is defined ldquonew-personrdquo and the latter is the so-calledldquohave-seenrdquo ldquoNew-personrdquo setting has been considered in all

8 Computational Intelligence and Neuroscience

the works using CAD-60 so it is the one selected also in thiswork

The most challenging element of the dataset is thepresence of a left-handed actor In order to increase the per-formance which are particularly affected by this unbalancingin the ldquonew-personrdquo test mirrored copies of each action arecreated as suggested in [43] For each actor a left-handed anda right-handed version of each action are made availableThedummyversion of the activity has been obtained bymirroringthe skeleton with respect to the virtual sagittal plane that cutsthe person in a half The proposed algorithm is evaluatedusing three different sets of joints from 119875 = 7 to 119875 = 15and the sequence of clusters 119870 = [3 5 10 15 20 25 30 35]as in the KARD dataset

The best results are obtained with the configurations 119875 =11 and 119870 = 25 and the performance in terms of precisionand recall for each activity is shown in Table 5 Very goodresults are given in office environment where the averageprecision and recall are 964 and 958 respectively Infact the activities of this environment are quite different onlytalking on phone and drinking water are similar On the otherhand the living room environment includes talking on couchand relaxing on couch in addition to talking on phone anddrinking water and it is the most challenging case since theaverage precision and recall are 91 and 906

The proposed algorithm is compared to other worksusing the same ldquonew-personrdquo setting and the results areshown in Table 6 which tells that the 119875 = 11 configurationoutperforms the state-of-the-art results in terms of precisionand it is only 1 lower in terms of recall Shan and Akella[46] achieve very good results using a multiclass SVMscheme with a linear kernel However they train and testmirrored actions separately and then merge the results whencomputing average precision and recall Our approach simplyconsiders two copies of the same action given as input to themulticlass SVM and retrieves the classification results

The reduced number of joints does not affect too muchthe average performance of the algorithm that reaches aprecision of 927 and a recall of 915 with 119875 = 7Using all the available jonts on the other hand brings toa more substantial reduction of the performance showing879 and 867 for precision and recall respectively with119875 = 15 The best results for the proposed algorithm werealways obtained with a high number of clusters (25 or30) The reduction of this number affects the performancefor example considering the 119875 = 11 subset the worstperformance is obtained with 119870 = 5 and with a precision of866 and a recall of 860

From theAALpoint of view this dataset is composed onlyof actions not gestures so the dataset does not have to beseparated to evaluate the performance in a scenario which isclose to AAL

423 UTKinect Dataset The UTKinect dataset is composedof 10 activities performed twice by 10 subjects and theevaluation setting proposed in [26] is the leave-one-out-cross-validation (LOOCV) which means that the systemis trained on all the sequences except one and that one isused for testing Each trainingtesting procedure is repeated

Table 5 Precision () and recall () of the proposed algorithm inthe different environments of CAD-60 with 119875 = 11 and 119870 = 25

Location Activity ldquoNew-personrdquoPrecision Recall

Bathroom

Brushing teeth 889 100

Rinsing mouth 923 100

Wearing contact lens 100 792

Average 937 931

Bedroom

Talking on phone 917 917

Drinking water 913 875

Opening pill container 960 100

Average 930 931

Kitchen

Cooking-chopping 857 100

Cooking-stirring 100 791

Drinking water 960 100

Opening pill container 100 100

Average 954 948

Living room

Talking on phone 875 875

Drinking water 875 875

Talking on couch 889 100

Relaxing on couch 100 875

Average 910 906

Office

Talking on phone 100 875

Writing on whiteboard 100 958

Drinking water 857 100

Working on computer 100 100

Average 964 958Global average 939 935

Table 6 Global precision () and recall () of the proposedalgorithm for CAD-60 dataset and ldquonew-personrdquo setting withdifferent subsets of joints compared to other works

Algorithm Precision RecallSung et al [43] 679 555

Gaglio et al [29] 773 767

Proposed (119875 = 15) 879 867

Faria et al [47] 911 919

Parisi et al [48] 919 902

Proposed (119875 = 7) 927 915

Shan and Akella [46] 938 945Proposed (119875 = 11) 939 935

20 times to reduce the random effect of 119896-means For thisdataset all the different subsets of joints shown in Figure 2are considered since the skeleton is captured usingMicrosoftSDK which provides 20 joints The considered sequence ofclusters is only 119870 = [3 4 5] because the minimum numberof frames constituting an action sequence is 5

The results compared with previous works are shownin Table 7 The best results for the proposed algorithm areobtained with 119875 = 7 which is the smallest set of jointsconsidered The result corresponds to a number of 119870 = 4clusters but the difference with the other clusters is very low

Computational Intelligence and Neuroscience 9

093 002 005

094 006

099 001

100

002 098W

alk

Sit d

own

Stan

d up

Pick

up

Carr

y

Walk

Sit down

Stand up

Pick up

Carry

(a)

088 004 008

095 005

100

003 097

010 090

Wal

k

Sit d

own

Stan

d up

Pick

up

Carr

y

Walk

Sit down

Stand up

Pick up

Carry

(b)

Figure 4 Confusion matrices of the UTKinect dataset with only AAL related activities (a) Best accuracy confusion matrix obtained with119875 = 7 (b) Worst accuracy confusion matrix obtained with 119875 = 15

Table 7 Global accuracy () of the proposed algorithm forUTKinect dataset and LOOCV setting with different subsets ofjoints compared to other works

Algorithm AccuracyXia et al [26] 909

Theodorakopoulos et al [30] 9095

Ding et al [31] 915

Zhu et al [17] 919

Jiang et al [35] 919

Gan and Chen [24] 920

Liu et al [22] 920

Proposed (119875 = 15) 931

Proposed (119875 = 11) 942

Proposed (119875 = 19) 943

Anirudh et al [34] 949

Proposed (119875 = 7) 951

Vemulapalli et al [33] 971

only 06 with 119870 = 5 that provided the worst result Theselection of different sets of joints from 119875 = 7 to 119875 = 15changes the accuracy only by a 2 from 931 to 951In this dataset the main limitation to the performance isgiven by the reduced number of frames that constitute somesequences and limits the number of clusters representing theactions Vemulapalli et al [33] reach the highest accuracybut the approach is much more complex after modelingskeleton joints in a Lie group the processing scheme includesDynamic TimeWarping to perform temporal alignments anda specific representation called Fourier Temporal Pyramidbefore classification with one-versus-all multiclass SVM

Contextualizing the UTKinect dataset to AAL involvesthe consideration of a subset of activities which includes onlyactions and discards gestures In more detail the following 5activities have been selectedwalk sit down stand up pick upand carry In this condition the highest accuracy (967) isstill given by119875 = 7 with 3 clusters and the lower one (941)is represented by 119875 = 15 again with 3 clustersThe confusion

Table 8 Global accuracy () of the proposed algorithm forFlorence3D dataset and ldquonew-personrdquo setting with different subsetsof joints compared to other works

Algorithm AccuracySeidenari et al [44] 820

Proposed (119875 = 7) 821

Proposed (119875 = 15) 847

Proposed (119875 = 11) 861

Anirudh et al [34] 897

Vemulapalli et al [33] 909

Taha et al [28] 962

matrix for these two configurations are shown in Figure 4where the main difference is the reduced misclassificationbetween the activities walk and carry that are very similarto each other

424 Florence3D Dataset The Florence3D dataset is com-posed of 9 activities and 10 people performing themmultipletimes resulting in a total number of 215 activities Theproposed setting to evaluate this dataset is the leave-one-actor-out which is equivalent to the ldquonew-personrdquo settingpreviously described The minimum number of frames is 8so the following sequence of clusters has been considered119870 = [3 4 5 6 7 8] A number of 15 joints are available forthe skeleton so only the first three schemes are included inthe tests

Table 8 shows the results obtained with different subsetsof joints where the best result for the proposed algorithmis given by 119875 = 11 with 6 clusters The choice of clustersdoes not significantly affect the performance because themaximum reduction is 2 In this dataset the proposedapproach does not achieve state-of-the-art accuracy (962)given by Taha et al [28] However all the algorithmsthat overcome the proposed one exhibit greater complexitybecause they consider Lie group representations [33 34] orseveral machine learning algorithms such as SVM combined

10 Computational Intelligence and Neuroscience

(a)

(b)

(c)

Figure 5 Sequences of frames representing the drink from a bottle (a) answer phone (b) and read watch (c) activities from Florence3Ddataset

081 019

014 086

100

005 005 090

005 095

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(a)

076 019 005

032 064 005

004 092 004

005 010

005010005

085

080

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(b)

Figure 6 Confusion matrices of the Florence3D dataset with only AAL related activities (a) Best accuracy confusion matrix obtained with119875 = 11 (b) Worst accuracy confusion matrix obtained with 119875 = 7

to HMM to recognize activities composed of atomic actions[28]

If different subsets of joints are considered the perfor-mance decreases In particular considering only 7 joints and3 clusters it is possible to reach a maximum accuracy of821 which is very similar to the one obtained by Seidenariet al [44] who collected the dataset By including all theavailable joints (15) in the algorithm processing it is possibleto achieve an accuracy which varies between 840 and847

The Florence3D dataset is challenging because of tworeasons

(i) High interclass similarity some actions are verysimilar to each other for example drink from abottle answer phone and read watch As can beseen in Figure 5 all of them consist of an uprising

arm movement to the mouth hear or head Thisfact affects the performance because it is difficult toclassify actions consisting of very similar skeletonmovements

(ii) High intraclass variability the same action is per-formed in different ways by the same subject forexample using left right or both hands indifferently

Limiting the analysis to AAL related activities only thefollowing ones can be selected drink answer phone tight lacesit down and stand up The algorithm reaches the highestaccuracy (908) using the ldquonew-personrdquo setting with 119875 =11 joints and 119870 = 3 The confusion matrix obtained in thiscondition is shown in Figure 6(a) On the other hand theworst accuracy is 798 and it is obtained with 119875 = 7 and119870 = 4 In Figure 6(b) the confusion matrix for this scenario

Computational Intelligence and Neuroscience 11

081 008

004 056 022 011007

012 077 012

012

004 004 081 012

100

081 004 015

013 087

004 007019 070

a02

a03

a05

a06

a10

a13

a18

a20

a02

a03

a05

a06

a10

a13

a18

a20

(a)

056 015 007 015 007

007

007 017 013

016 044 008 004 004 024

011 056 004

003

022

010 070 017

063

100

010 087 003

007 093

a01

a04

a07

a08

a09

a11

a12

a14

a01

a04

a07

a08

a09

a11

a12

a14

(b)

088 012

100

005 095

097 003

003 097

093 007

003 003 083 010

004 011 085

a06

a14

a15

a16

a17

a18

a19

a20

a06

a14

a15

a16

a17

a18

a19

a20

(c)

Figure 7 Confusion matrices of the MSR Action3D dataset obtained with 119875 = 7 and the ldquonew-personrdquo test (a) AS1 (b) AS2 and (c) AS3

is shown In both situations the main problem is given by thesimilarity of drink and answer phone activities

425 MSR Action3D Dataset The MSR Action3D datasetis composed of 20 activities which are mainly gestures Itsevaluation is included for the sake of completeness in thecomparison with other activity recognition algorithms butthe dataset does not contain AAL related activities Thereis a big confusion in the literature about the validationtests to be used for MSR Action3D Padilla-Lopez et al[49] summarized all the validation methods for this datasetand recommend using all the possible combinations of 5-5 subjects splitting or using LOAO The former procedureconsists of considering 252 combinations of 5 subjects fortraining and the remaining 5 for testing while the latter is theleave-one-actor-out equivalent to the ldquonew-personrdquo schemepreviously introduced This dataset is quite challenging dueto the presence of similar and complex gestures hence theevaluation is performed by considering three subsets of 8gestures each (AS1 AS2 and AS3 in Table 9) as suggestedin [45] Since the minimum number of frames constitutingone sequence is 13 the following set of clusters has beenconsidered 119870 = [3 5 8 10 13] All the combinations of 711 15 and 20 joints shown in Figure 2 are included in theexperimental tests

Considering the ldquonew-personrdquo scheme the proposedalgorithm tested separately on the three subsets of MSRAction3D reaches an average accuracy of 812 with 119875 = 7and 119870 = 10 The confusion matrices for the three subsetsare shown in Figure 7 where it is possible to notice thatthe algorithm struggles in the recognition of the AS2 subset(Figure 7(b)) mainly represented by drawing gestures Betterresults are obtained processing the subset AS3 (Figure 7(c))where lower recognition rates are shown for complex ges-tures golf swing and pickup and throw Table 10 shows theresults obtained including also other joints which are slightlyworse A comparison with previous works validated usingthe ldquonew-personrdquo test is shown in Table 11 The proposedalgorithm achieves results comparable with [50] whichexploits an approach based on skeleton data Chaaraoui etal [15 36] exploits more complex algorithms considering

Table 9 Three subsets of gestures fromMSR Action3D dataset

AS1 AS2 AS3[11988602] Horizontalarm wave [11988601] High arm wave [11988606] High throw

[11988603] Hammer [11988604] Hand catch [11988614] Forward kick[11988605] Forwardpunch [11988607] Draw X [11988615] Side kick

[11988606] High throw [11988608] Draw tick [11988616] Jogging[11988610] Hand clap [11988609] Draw circle [11988617] Tennis swing[11988613] Bend [11988611] Two-hand wave [11988618] Tennis serve[11988618] Tennis serve [11988612] Side boxing [11988619] Golf swing[11988620] Pickup andthrow [11988614] Forward kick [11988620] Pickup and throw

Table 10 Accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting with different subsetsof joints and different subsets of activities

Algorithm AS1 AS2 AS3 AvgProposed (119875 = 15) 785 688 928 800

Proposed (119875 = 20) 790 702 919 804

Proposed (119875 = 11) 776 737 914 809

Proposed (119875 = 7) 795 719 923 812

the fusion of skeleton and depth data or the evolutionaryselection of the best set of joints

5 Discussion

The proposed algorithm despite its simplicity is able toachieve and sometimes to overcome state-of-the-art per-formance when applied to publicly available datasets Inparticular it is able to outperform some complex algorithmsexploiting more than one classifier in KARD and CAD-60datasets

Limiting the analysis to AAL related activities only thealgorithm achieves interesting results in all the datasetsThe group of 8 activities labeled as Actions in the KARD

12 Computational Intelligence and Neuroscience

Table 11 Average accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting compared to otherworks

Algorithm AccuracyCeliktutan et al (2015) [51] 729

Azary and Savakis (2013) [52] 785

Proposed (119875 = 7) 812

Azary and Savakis (2012) [50] 839

Chaaraoui et al (2013) [15] 906

Chaaraoui et al (2014) [36] 935

dataset is recognized with an accuracy greater than 987in the considered experiments and the group includes twosimilar activities such as phone call and drink The CAD-60dataset contains only actions so all the activities consideredwithin the proper location have been included in the eval-uation resulting in global precision and recall of 939 and935 respectively In the UTKinect dataset the performanceimproves considering only the AAL activities and the samehappens also in the Florence3D having the highest accuracyclose to 91 even if some actions are very similar

The MSR Action3D is a challenging dataset mainlycomprising gestures for human-computer interaction andnot actions Many gestures especially the ones included inAS2 subset are very similar to each other In order to improvethe recognition accuracy more complex features should beconsidered including not only jointsrsquo relative positions butalso their velocity Many sequences contain noisy skeletondata another approach to improving recognition accuracycould be the development of a method to discard noisyskeleton or to include depth-based features

However the AAL scenario raises some problems whichhave to be addressed First of all the algorithm exploits amul-ticlass SVM which is a good classifier but it does not makeeasy to understand if an activity belongs or not to any of thetraining classes In fact in a real scenario it is possible to havea sequence of frames that does not represent any activity ofthe training set in this case the SVM outputs the most likelyclass anyway even if it does not make sense Other machinelearning algorithms such as HMMs distinguish amongmultiple classes using the maximum posterior probabilityGaglio et al [29] proposed using a threshold on the outputprobability to detect unknown actions Usually SVMs do notprovide output probabilities but there exist some methodsto extend SVM implementations and make them able toprovide also this information [53 54] These techniquescan be investigated to understand their applicability in theproposed scenario

Another problem is segmentation of actions In manydatasets the actions are represented as segmented sequencesof frames but in real applications the algorithm has to handlea continuous stream of frames and to segment actions byitself Some solutions for segmentation have been proposedbut most of them are based on thresholds on movementswhich can be highly data-dependent [46] Also this aspect hasto be further investigated to have a systemwhich is effectivelyapplicable in a real AAL scenario

6 Conclusions

In this work a simple yet effective activity recognitionalgorithm has been proposed It is based on skeleton dataextracted from an RGBD sensor and it creates a featurevector representing the whole activity It is able to overcomestate-of-the-art results in two publicly available datasets theKARD and CAD-60 outperforming more complex algo-rithms in many conditions The algorithm has been testedalso over other more challenging datasets the UTKinect andFlorence3D where it is outperformed only by algorithmsexploiting temporal alignment techniques or a combinationof several machine learning methods The MSR Action3Dis a complex dataset and the proposed algorithm strugglesin the recognition of subsets constituted by similar gesturesHowever this dataset does not contain any activity of interestfor AAL scenarios

Future works will concern the application of the activityrecognition algorithm to AAL scenarios by consideringaction segmentation and the detection of unknown activities

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to acknowledge the contributionof the COST Action IC1303 AAPELE (Architectures Algo-rithms and Platforms for Enhanced Living Environments)

References

[1] P Rashidi and A Mihailidis ldquoA survey on ambient-assistedliving tools for older adultsrdquo IEEE Journal of Biomedical andHealth Informatics vol 17 no 3 pp 579ndash590 2013

[2] O C Ann and L B Theng ldquoHuman activity recognition areviewrdquo in Proceedings of the IEEE International Conference onControl System Computing and Engineering (ICCSCE rsquo14) pp389ndash393 IEEE Batu Ferringhi Malaysia November 2014

[3] S Wang and G Zhou ldquoA review on radio based activityrecognitionrdquo Digital Communications and Networks vol 1 no1 pp 20ndash29 2015

[4] L Atallah B Lo R Ali R King and G-Z Yang ldquoReal-timeactivity classificationusing ambient andwearable sensorsrdquo IEEETransactions on Information Technology in Biomedicine vol 13no 6 pp 1031ndash1039 2009

[5] D De P Bharti S K Das and S Chellappan ldquoMultimodalwearable sensing for fine-grained activity recognition in health-carerdquo IEEE Internet Computing vol 19 no 5 pp 26ndash35 2015

[6] J K Aggarwal and L Xia ldquoHuman activity recognition from 3Ddata a reviewrdquo Pattern Recognition Letters vol 48 pp 70ndash802014

[7] S Gasparrini E Cippitelli E Gambi S Spinsante and FFlorez-Revuelta ldquoPerformance analysis of self-organising neu-ral networks tracking algorithms for intake monitoring usingkinectrdquo in Proceedings of the 1st IET International Conferenceon Technologies for Active and Assisted Living (TechAAL rsquo15)Kingston UK November 2015

Computational Intelligence and Neuroscience 13

[8] J Shotton A Fitzgibbon M Cook et al ldquoReal-time humanpose recognition in parts from single depth imagesrdquo in Pro-ceedings of the IEEE Conference on Computer Vision and PatternRecognition (CVPR rsquo11) pp 1297ndash1304 IEEE Providence RdUSA June 2011

[9] J R Padilla-Lopez A A Chaaraoui F Gu and F Florez-Revuelta ldquoVisual privacy by context proposal and evaluationof a level-based visualisation schemerdquo Sensors vol 15 no 6 pp12959ndash12982 2015

[10] X Yang and Y Tian ldquoSuper normal vector for activity recog-nition using depth sequencesrdquo in Proceedings of the 27th IEEEConference on Computer Vision and Pattern Recognition (CVPRrsquo14) pp 804ndash811 IEEE Columbus Ohio USA June 2014

[11] R Slama H Wannous and M Daoudi ldquoGrassmannian rep-resentation of motion depth for 3D human gesture and actionrecognitionrdquo inProceedings of the 22nd International Conferenceon Pattern Recognition (ICPR rsquo14) pp 3499ndash3504 IEEE Stock-holm Sweden August 2014

[12] O Oreifej and Z Liu ldquoHON4D histogram of oriented 4Dnormals for activity recognition from depth sequencesrdquo inProceedings of the 26th IEEEConference onComputer Vision andPattern Recognition (CVPR rsquo13) pp 716ndash723 IEEE June 2013

[13] H Rahmani AMahmood D QHuynh andAMian ldquoHOPChistogram of oriented principal components of 3D pointcloudsfor action recognitionrdquo in Computer VisionmdashECCV 2014 DFleet T Pajdla B Schiele and T Tuytelaars Eds vol 8690 ofLecture Notes in Computer Science pp 742ndash757 Springer 2014

[14] G Chen D Clarke M Giuliani A Gaschler and A KnollldquoCombining unsupervised learning and discrimination for 3Daction recognitionrdquo Signal Processing vol 110 pp 67ndash81 2015

[15] A A Chaaraoui J R Padilla-Lopez and F Florez-RevueltaldquoFusion of skeletal and silhouette-based features for humanaction recognition with RGB-D devicesrdquo in Proceedings ofthe 14th IEEE International Conference on Computer VisionWorkshops (ICCVW rsquo13) pp 91ndash97 IEEE Sydney AustraliaDecember 2013

[16] S Althloothi M H Mahoor X Zhang and R M VoylesldquoHuman activity recognition using multi-features and multiplekernel learningrdquo Pattern Recognition vol 47 no 5 pp 1800ndash1812 2014

[17] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal featuresand joints for 3D action recognitionrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo13) pp 486ndash491 IEEE June 2013

[18] G Willems T Tuytelaars and L Van Gool ldquoAn efficient denseand scale-invariant spatio-temporal interest point detectorrdquo inComputer VisionmdashECCV 2008 D Forsyth P Torr and AZisserman Eds vol 5303 of Lecture Notes in Computer Sciencepp 650ndash663 Springer Berlin Germany 2008

[19] A Klaser M Marszałek and C Schmid ldquoA spatio-temporaldescriptor based on 3D-gradientsrdquo in Proceedings of the 19thBritish Machine Vision Conference (BMVC rsquo08) pp 995ndash1004September 2008

[20] J LuoWWang and H Qi ldquoSpatio-temporal feature extractionand representation for RGB-D human action recognitionrdquoPattern Recognition Letters vol 50 pp 139ndash148 2014

[21] Y Zhu W Chen and G Guo ldquoEvaluating spatiotemporalinterest point features for depth-based action recognitionrdquoImage and Vision Computing vol 32 no 8 pp 453ndash464 2014

[22] A-A Liu W-Z Nie Y-T Su L Ma T Hao and Z-X YangldquoCoupled hidden conditional random fields for RGB-D humanaction recognitionrdquo Signal Processing vol 112 pp 74ndash82 2015

[23] M Devanne H Wannous S Berretti P Pala M Daoudiand A Del Bimbo ldquoSpace-time pose representation for 3Dhuman action recognitionrdquo inNewTrends in ImageAnalysis andProcessingmdashICIAP 2013 vol 8158 of Lecture Notes in ComputerScience pp 456ndash464 Springer Berlin Germany 2013

[24] L Gan and F Chen ldquoHuman action recognition using APJ3Dand random forestsrdquo Journal of Software vol 8 no 9 pp 2238ndash2245 2013

[25] J Wang Z Liu Y Wu and J Yuan ldquoMining actionlet ensemblefor action recognition with depth camerasrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo12) pp 1290ndash1297 Providence RI USA June 2012

[26] L Xia C-C Chen and J K Aggarwal ldquoView invariant humanaction recognition using histograms of 3D jointsrdquo inProceedingsof the IEEE Computer Society Conference on Computer Visionand Pattern Recognition Workshops (CVPRW rsquo12) pp 20ndash27Providence RI June 2012

[27] X Yang and Y Tian ldquoEffective 3D action recognition usingEigen Jointsrdquo Journal of Visual Communication and ImageRepresentation vol 25 no 1 pp 2ndash11 2014

[28] A Taha H H Zayed M E Khalifa and E-S M El-HorbatyldquoHuman activity recognition for surveillance applicationsrdquo inProceedings of the 7th International Conference on InformationTechnology pp 577ndash586 2015

[29] S Gaglio G Lo Re and M Morana ldquoHuman activity recog-nition process using 3-D posture datardquo IEEE Transactions onHuman-Machine Systems vol 45 no 5 pp 586ndash597 2015

[30] I Theodorakopoulos D Kastaniotis G Economou and SFotopoulos ldquoPose-based human action recognition via sparserepresentation in dissimilarity spacerdquo Journal of Visual Commu-nication and Image Representation vol 25 no 1 pp 12ndash23 2014

[31] WDing K Liu F Cheng and J Zhang ldquoSTFC spatio-temporalfeature chain for skeleton-based human action recognitionrdquoJournal of Visual Communication and Image Representation vol26 pp 329ndash337 2015

[32] R Slama HWannous M Daoudi and A Srivastava ldquoAccurate3D action recognition using learning on the Grassmann mani-foldrdquo Pattern Recognition vol 48 no 2 pp 556ndash567 2015

[33] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 27th IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR rsquo14) pp 588ndash595Columbus Ohio USA June 2014

[34] R Anirudh P Turaga J Su and A Srivastava ldquoElastic func-tional coding of human actions From vector-fields to latentvariablesrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR rsquo15) pp 3147ndash3155Boston Mass USA June 2015

[35] M Jiang J Kong G Bebis and H Huo ldquoInformative jointsbased human action recognition using skeleton contextsrdquo SignalProcessing Image Communication vol 33 pp 29ndash40 2015

[36] A A Chaaraoui J R Padilla-Lopez P Climent-Perez andF Florez-Revuelta ldquoEvolutionary joint selection to improvehuman action recognitionwith RGB-Ddevicesrdquo Expert Systemswith Applications vol 41 no 3 pp 786ndash794 2014

[37] S Baysal M C Kurt and P Duygulu ldquoRecognizing humanactions using key posesrdquo in Proceedings of the 20th InternationalConference on Pattern Recognition (ICPR rsquo10) pp 1727ndash1730Istanbul Turkey August 2010

[38] A A Chaaraoui P Climent-Perez and F Florez-RevueltaldquoSilhouette-based human action recognition using sequences of

14 Computational Intelligence and Neuroscience

key posesrdquo Pattern Recognition Letters vol 34 no 15 pp 1799ndash1807 2013

[39] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[40] C-W Hsu and C-J Lin ldquoA comparison of methods for mul-ticlass support vector machinesrdquo IEEE Transactions on NeuralNetworks vol 13 no 2 pp 415ndash425 2002

[41] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[42] OpenNI 2015 httpstructureioopenni[43] J Sung C Ponce B Selman and A Saxena ldquoUnstructured

human activity detection fromRGBD imagesrdquo in Proceedings ofthe IEEE International Conference on Robotics and Automation(ICRA rsquo12) pp 842ndash849 Saint Paul Minn USA May 2012

[44] L Seidenari V Varano S Berretti A Del Bimbo and P PalaldquoRecognizing actions from depth cameras as weakly alignedmulti-part Bag-of-Posesrdquo in Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition Workshops(CVPRW rsquo13) pp 479ndash485 Portland Ore USA June 2013

[45] W Li Z Zhang and Z Liu ldquoAction recognition based ona bag of 3D pointsrdquo in Proceedings of the IEEE ComputerSociety Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo10) pp 9ndash14 San Francisco Calif USAJune 2010

[46] J Shan and S Akella ldquo3D human action segmentation andrecognition using pose kinetic energyrdquo in Proceedings of theIEEE Workshop on Advanced Robotics and its Social Impacts(ARSO rsquo14) pp 69ndash75 Evanston Ill USA September 2014

[47] D R Faria C Premebida and U Nunes ldquoA probabilisticapproach for human everyday activities recognition using bodymotion from RGB-D imagesrdquo in Proceedings of the 23rd IEEEInternational Symposium on Robot and Human InteractiveCommunication (RO-MAN rsquo14) pp 732ndash737 Edinburgh UKAugust 2014

[48] G I Parisi C Weber and S Wermter ldquoSelf-organizing neuralintegration of pose-motion features for human action recogni-tionrdquo Frontiers in Neurorobotics vol 9 no 3 2015

[49] J R Padilla-Lopez A A Chaaraoui and F Florez-Revuelta ldquoAdiscussion on the validation tests employed to compare humanaction recognition methods using the MSR Action3D datasetrdquohttparxivorgabs14077390

[50] S Azary and A Savakis ldquo3D Action classification using sparsespatio-temporal feature representationsrdquo in Advances in VisualComputing G Bebis R Boyle B Parvin et al Eds vol 7432 ofLecture Notes in Computer Science pp 166ndash175 Springer BerlinGermany 2012

[51] O Celiktutan C Wolf B Sankur and E Lombardi ldquoFast exacthyper-graph matching with dynamic programming for spatio-temporal datardquo Journal ofMathematical Imaging andVision vol51 no 1 pp 1ndash21 2015

[52] S Azary and A Savakis ldquoGrassmannian sparse representationsand motion depth surfaces for 3D action recognitionrdquo inProceedings of the IEEE Conference on Computer Vision andPattern Recognition Workshops (CVPRW rsquo13) pp 492ndash499Portland Ore USA June 2013

[53] J C Platt ldquoProbabilistic outputs for support vector machinesand comparisons to regularized likelihood methodsrdquo inAdvances in Large Margin Classifiers pp 61ndash74 MIT PressCambridge Mass USA 1999

[54] H-T Lin C-J Lin and R C Weng ldquoA note on Plattrsquosprobabilistic outputs for support vector machinesrdquo MachineLearning vol 68 no 3 pp 267ndash276 2007

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 2: Research Article A Human Activity Recognition System Using Skeleton …downloads.hindawi.com/journals/cin/2016/4351435.pdf · 2019-07-30 · Kinect skeleton is motivated by the fact

2 Computational Intelligence and Neuroscience

them showing interesting performances in the evaluation ofactivities specifically related to AAL

The paper is organized as follows Section 2 reviewsrelated works in human activity recognition using RGBDsensors Section 3 describes the proposed algorithm whereasthe experimental results are shown in Section 4 Section 5discusses the applicability of this system to the AAL scenarioand finally Section 6 provides concluding remarks

2 Related Works

In the last years many solutions for human activity recog-nition have been proposed some of them aimed to extractfeatures from depth data such as [10] where the main ideais to evaluate spatiotemporal depth subvolume descriptorsA group of hypersurface normals (polynormal) containinggeometry and local motion information is extracted fromdepth sequences The polynormals are then aggregated toconstitute the final representation of the depth map calledSuper Normal Vector (SNV)This representation can includealso skeleton joint trajectories improving the recognitionresults when peoplemove a lot in a sequence of depth framesDepth images can be seen as sequence features modeledtemporally as subspaces lying on the Grassmann manifold[11] This representation starting from the orientation of thenormal vector at every surface point describes the geometricappearance and the dynamic of human body without usingjoint position Other works proposed holistic descriptors theHON4D descriptor [12] which is based on the orientationsof normal surfaces in 4D and HOPC descriptor [13] whichis able to represent the geometric characteristics of a sequenceof 3D points

Other works exploit both depth and skeleton data forexample the 35D representation combines the skeleton jointinformation with features extracted from depth images inthe region surrounding each node of interest [14] The fea-tures are extracted using an extended Independent SubspaceAnalysis (ISA) algorithm by applying it only to local regionof joints instead of the entire video thus improving thetraining efficiency The depth information makes it easy toextract the human silhouette which can be concatenatedwithnormalized skeleton features to improve the recognition rate[15] Depth and skeleton features can be combined at differentlevels of the activity recognition algorithm Althloothi et al[16] proposed amethod where the data are fused at the kernellevel instead of the feature level using the Multiple KernelLearning (MKL) technique On the other hand fusion at thefeature level of spatiotemporal features and skeleton jointsis performed in [17] In such a work several spatiotemporalinterest point detectors such as Harris 3D ESURF [18] andHOG3D [19] have been fused using regression forests withthe skeleton joint features consisting of posture movementand offset information

Skeleton joints extracted from depth frames can be com-bined also with RGB data Luo et al [20] proposed a humanaction recognition framework where the pairwise relativepositions of joints and Center-Symmetric Motion LocalTernary Pattern (CS-Mltp) features from RGB are fused bothat feature level and at classifier level Spatiotemporal Interest

Points (STIP) are typically used in activity recognition wheredata are represented by RGB frames This approach canbe also extended to depth and skeleton data combiningthe features with random forests [21] The results are verygood but depth estimation noise and background may havean impact on interest point detection so the depth STIPfeatures have to be constrained using skeleton joint positionsor RGB videos Instead of using spatiotemporal featuresanother approach for human activity recognition relies ongraph-based methods for sequential modeling of RGB dataThis concept can be extended to depth information and anapproach based on coupled Hidden Conditional RandomFields (cHCRF) model where visual feature sequences areextracted from RGB and depth data has been proposed[22] The main advantage of this approach is the capabilityto preserve the dynamics of individual sequences even ifthe complementary information from RGB and depth areshared

Some previous works simply rely on Kinect skeleton datasuch as the proposed algorithm Maybe they are simplerapproaches but in many cases they can achieve performancevery close to the algorithms exploiting multimodal data andsometimes they also perform better than those solutionsDevanne et al [23] proposed representing human actionsby spatiotemporal motion trajectories in a 60-dimensionalspace since they considered 20 joints each of them with3 coordinates Then an elastic metric which means ametric invariant to speed and time of the action withina Riemannian shape space is employed to represent thedistance between two curves Finally the action recognitionproblem can be seen as a classification in the Riemannianspace using a 119896-Nearest-Neighbor (119896-NN) classifier Otherskeleton representations have been proposed The APJ3Drepresentation [24] is constituted starting by a subset of 15skeleton joints from which the relative positions and localspherical angles are computed After a selection of key-postures the action is partitioned using a reviewed FourierTemporal Pyramid [25] and the classification is made byrandom forests Another joint representation is calledHOJ3D[26] where the 3D space is partitioned into 119899 bins and thejoints are associated with each bin using a Gaussian weightfunction Then a discrete Hidden Markov Model (HMM) isemployed to model the temporal evolution of the posturesattained using a clustering algorithm A human action canbe characterized also by a combination of static posturefeatures representing the actual frame consecutive motionfeatures computed using the actual and the previous framesand overall dynamics features which consider the actual andthe initial frames [27] Taha et al [28] also exploit jointsspherical coordinates to represent the skeleton and a frame-work composed of a multiclass SVM and a discrete HMMto recognize activities constituted by many actions Otherapproaches exploit a double machine learning algorithm toclassify actions for example Gaglio et al [29] consider amulticlass SVM to estimate the postures and a discrete HMMto model an activity as a sequence of postures Also in[30] human actions are considered as a sequence of bodyposes over time and skeletal data are processed to obtaininvariant pose representations given by 8 pairs of angles

Computational Intelligence and Neuroscience 3

Then the recognition is realized using the representation inthe dissimilarity space where different feature trajectoriesmaintain discriminant information and have a fixed-lengthrepresentation Ding et al [31] proposed a Spatiotempo-ral Feature Chain (STFC) to represent the human actionsby trajectories of joint positions Before using the STFCmodel a graph is used to erase periodic sequences makingthe solution more robust to noise and periodic sequencemisalignment Slama et al [32] exploited the geometricstructure of the Grassmann manifold for action analysisIn fact considering the problem as a sequence matchingtask this manifold allows considering an action sequence asa point on its space and provides tools to make statisticalanalysis Considering that the relative geometry betweenbody parts is more meaningful than their absolute locationsrotations and translations required to perform rigid bodytransformations can be represented as points in a SpecialEuclidean SE(3) group Each skeleton can be represented asa point in the Lie group SE(3) times SE(3) times sdot sdot sdot times SE(3) and ahuman action can be modeled as a curve in this Lie group[33] The same skeleton feature is also used in [34] whereManifold Functional PCA (mfPCA) is employed to reducefeature dimensionality Some works developed techniques toautomatically select the most informative joints aiming atincreasing the recognition accuracy and reducing the noiseeffect on the skeleton estimation [35 36]

The work presented in this paper is based on the conceptthat an action can be seen as a sequence of informativepostures which are known as ldquokey posesrdquoThis idea has beenintroduced in [37] and used in other subsequent proposals[15 36 38] While in previous works a fixed set of key poses(119870) is extracted for each action of the dataset considering allthe training sequences in this work the clustering algorithmwhich selects the most informative postures is executed foreach sequence thus selecting a different set of 119870 poseswhich constitutes the feature vector This procedure avoidsthe application of a learning algorithm which has to find thenearest neighbor key pose for each frame constituting thesequence

3 Activity Recognition Algorithm

The proposed algorithm for activity recognition starts fromskeleton joints and computes a vector of features for eachactivity Then a multiclass machine learning algorithmwhere each class represents a different activity is exploited forclassification purpose Four main steps constitute the wholealgorithm they are represented in Figure 1 and are discussedin the following

(1) Posture Features Extraction The coordinates of theskeleton joints are used to evaluate the feature vectorswhich represent human postures

(2) Postures Selection The most important postures foreach activity are selected

(3) Activity Features Computation A feature vector rep-resenting the whole activity is created and used forclassification

(4) Classification The classification is realized using amulticlass SVM implemented with the ldquoone-versus-onerdquo approach

First there is the need to extract the features fromskeleton data representing the input to the algorithm Thejoint extraction algorithm proposed in [8] is included in theKinect libraries and is ready to use The choice to considerKinect skeleton is motivated by the fact that it is easy to havea compact representation of the human body Starting fromskeleton data many features which are able to represent ahuman pose and consequently a human action have beenproposed in the literature In [6] the authors found thatmanyfeatures can be computed from skeleton joints The simplestfeatures can be extracted by joint locations or from theirdistances considering spatial information Other featuresmay involve joints orientation or motion considering spatialand temporal data More complex features may be initiallybased on the estimation of a plane considering some jointsThen by measuring the distances between this plane andother joints the features can be extracted

The proposed algorithm exploits spatial features com-puted from 3D skeleton coordinates without including thetime information in the computation in order to make thesystem independent of the speed of movement The featureextraction method has been introduced in [36] and it is hereadopted with small differences For each skeleton frame aposture feature vector is computed Each joint is representedby J119894 a three-dimensional vector in the coordinate space of

Kinect The person can be found at any place within thecoverage area of Kinect and the coordinates of the same jointmay assume different values It is necessary to compensatethis effect by using a proper features computation algorithmA straightforward solution is to compensate the position ofthe skeleton by centering the coordinate space in one skeletonjoint Considering a skeleton composed of 119875 joints J

0being

the coordinates of the torso joint and J2being the coordinates

of the neck joint the 119894th joint feature d119894is the distance vector

between J119894and J0 normalized to the distance between J

2and

J0

d119894=

J119894minus J0

1003817100381710038171003817J2 minus J01003817100381710038171003817

119894 = 1 2 119875 minus 1 (1)

This feature is invariant to the position of the skeleton withinthe coverage area ofKinect furthermore the invariance of thefeature to the build of the person is obtained by normalizationwith respect to the distance between the neck and torso jointsThese features may be seen as a set of distance vectors whichconnect each joint to the joint of the torso A posture featurevector f is created for each skeleton frame

f = [d1 d2 d3 d

119875minus1] (2)

A set of 119873 feature vectors is computed having an activityconstituted by119873 frames

The second phase concerns the human postures selectionwith the aim of reducing the complexity and increasinggenerality by representing the activity by means of only asubset of poses without using all the frames A clustering

4 Computational Intelligence and Neuroscience

1

23

4

Test Test

Train

LA

(A L)

SVM1

SVM2

SVM3C1

C2

C3

CK

SVMM(Mminus1)2

Classification

Activity features computation Postures selection

Posture features extraction

A = [C1C2C3 CK]

s1 = [J0 J1 J2 J3 JPminus1]s2 = [J0 J1 J2 J3 JPminus1]s3 = [J0 J1 J2 J3 JPminus1]s4 = [J0 J1 J2 J3 JPminus1]

sN = [J0 J1 J2 J3 JPminus1]

f1 = [d1 d2 d3 dPminus1]f2 = [d1 d2 d3 dPminus1]f3 = [d1 d2 d3 dPminus1]f4 = [d1 d2 d3 dPminus1]

fN = [d1 d2 d3 dPminus1]f = [d1 d2 d3 dPminus1]

Ji = ith joint

di =Ji minus J0J2 minus J0

i = 1 2 P minus 1

Figure 1 The general scheme of the activity recognition algorithm is composed of 4 steps In the first step the posture feature vectors arecomputed for each skeleton frame then the postures are selected and an activity features vector is created Finally amulticlass SVM is exploitedfor classification

algorithm is used to process 119873 feature vectors constitutingthe activity by grouping them into 119870 clusters The well-known 119896-means clustering algorithm based on the squaredEuclidean distance as a metric can be used to group togetherthe frames representing similar postures Considering anactivity composed of119873 feature vectors [f

1 f2 f3 f

119873] the

119896-means algorithm gives as outputs 119873 clusters ID (one foreach feature vector) and 119870 vectors [C

1C2C3 C

119870] that

represent the centroids of each cluster The feature vectorsare partitioned into clusters 119878

1 1198782 119878

119870so as to satisfy the

condition expressed by

argminS

119870

sum

119895=1

sum

f119894isin119878119895

10038171003817100381710038171003817f119894minus C119895

10038171003817100381710038171003817

2

(3)

The 119870 centroids can be seen as the main postures whichare the most important feature vectors Unlike classicalapproaches based on key poses where the most informativepostures are evaluated by considering all the sequencesof each activity in the proposed solution the clusteringalgorithm is executed for each sequence This avoids theapplication of a learning algorithm required to associate eachframe to the closest key pose and allows to have a morecompact representation of the activity

The third phase is related to the computation of a featurevector which models the whole activity starting from the119870 centroid vectors computed by the clustering algorithmIn more detail [C

1C2C3 C

119870] vectors are sorted con-

sidering the order in which the clusterrsquos elements occurduring the activity The activity features vector is composedof concatenating the sorted centroid vectors For example

considering an activity featured by 119873 = 10 and 119870 = 4after running the 119896-means algorithm one of the possibleoutputs could be the following sequence of cluster IDs[2 2 2 3 3 1 1 4 4 4] meaning that the first three posturevectors belong to cluster 2 the fourth and the fifth areassociated with cluster 3 and so on In this case the activityfeatures vector is A = [C

2C3C1C4] A feature activity

vector has a dimension of 3119870(119875 minus 1) that can be handledwithout using dimensionality reduction algorithms such asPCA if 119870 is small

The classification step aims to associate each featureactivity vector to the correct activity Many machine learningalgorithms may be applied to fulfil this task among them aSVM Considering a number of 119897 training vectors x

119894isin 119877119899 and

a vector of labels 119910 isin 119877119897 where 119910119894isin minus1 1 a binary SVM

can be formulated as follows [39]

minw119887120585

1

2w119879w + 119862

119897

sum

119894=1

120585119894

subject to 119910119894(w119879120601 (x

119894) + 119887) ge 1 minus 120585

119894

120585119894ge 0 119894 = 1 119897

(4)

where

w119879120601 (x) + 119887 = 0 (5)

is the optimal hyperplane that allows separation betweenclasses in the feature space119862 is a constant and 120585

119894are nonneg-

ative variables which consider training errorsThe function 120601allows transforming between the features space and an higher

Computational Intelligence and Neuroscience 5

12

30456

78

910

1112

131415

16 17

18 19

(a)

12

30456

78

910

1112

131415

16 17

18 19

(b)

12

30456

78

910

1112

131415

1617

18 19

(c)

12

30456

78

910

1112

131415

16 17

18 19(d)

Figure 2 Subsets of joints considered in the evaluation of the proposed algorithmThewhole skeleton is represented by 20 joints the selectedones are depicted as green circles while the discarded joints are represented by red squares Subsets of 7 (a) 11 (b) and 15 (c) joints and thewhole set of 20 joints (d) are selected

dimensional space where the data are separable Consideringtwo training vectors x

119894and x

119895 the kernel function can be

defined as

119870(x119894 x119895) = 120601 (x

119894)119879120601 (x119895) (6)

In this work the Radial Basis Function (RBF) kernel has beenused where

119870(x119894 x119895) = 119890minus120574x119894minusx1198952 120574 gt 0 (7)

It follows that 119862 and 120574 are the parameters that have to beestimated prior using the SVM

The idea herein exploited is to use a multiclass SVMwhere each class represents an activity of the dataset In orderto extend the role from a binary to amulticlass classifier someapproaches have been proposed in the literature In [40] theauthors compared many methods and found that the ldquoone-against-onerdquo is one of the most suitable for practical use Itis implemented in LIBSVM [41] and it is the approach usedin this workThe ldquoone-against-onerdquo approach is based on theconstruction of several binary SVM classifiers inmore detaila number of119872(119872minus1)2 binary SVMs are necessary in a119872-classes dataset This happens because each SVM is trained todistinguish between 2 classes and the final decision is takenexploiting a voting strategy among all the binary classifiersDuring the training phase the activity feature vectors aregiven as inputs to the multiclass SVM together with the label119871 of the action In the test phase the activity label is obtainedfrom the classifier

4 Experimental Results

Thealgorithmperformance is evaluated on five publicly avail-able datasets In order to perform an objective comparisonto previous works the reference test procedures have beenconsidered for each dataset The performance indicators areevaluated using four different subsets of joints shown inFigure 2 going from a minimum of 7 up to a maximum of20 joints Finally in order to evaluate the performance inAAL scenarios some suitable activities related to AAL areselected from the datasets and the recognition accuracies areevaluated

41 Datasets Five different 3D datasets have been consideredin this work each of them including a different set of activitiesand gestures

KARD dataset [29] is composed of 18 activities that canbe divided into 10 gestures (horizontal arm wave high armwave two-hand wave high throw draw X draw tick forwardkick side kick bend and hand clap) and eight actions (catchcap toss paper take umbrellawalk phone call drink sit downand stand up) This dataset has been captured in a controlledenvironment that is an office with a static background anda Kinect device placed at a distance of 2-3m from the subjectSome objects were present in the area useful to perform someof the actions a desk with a phone a coat rack and a wastebin The activities have been performed by 10 young people(nine males and one female) aged from 20 to 30 years andfrom 150 to 185 cm tall Each person repeated each activity3 times creating a number of 540 sequences The dataset iscomposed of RGB and depth frames captured at a rate of30 fps with a 640 times 480 resolution In addition 15 joints ofthe skeleton in world and screen coordinates are providedThe skeleton has been captured using OpenNI libraries [42]

The Cornell Activity Dataset (CAD-60) [43] is madeby 12 different activities typical of indoor environmentsThe activities are rinsing mouth brushing teeth wearingcontact lens talking on the phone drinking water openingpill container cooking-chopping cooking-stirring talking oncouch relaxing on couch writing on whiteboard and workingon computer and are performed in 5 different environmentsbathroom bedroom kitchen living room and office All theactivities are performed by 4 different people two males andtwo females one of which is left-handed No instructionswere given to actors about how to perform the activitiesthe authors simply ensured that the skeleton was correctlydetected by Kinect The dataset is composed of RGB depthand skeleton data with 15 joints available

The UTKinect dataset [26] is composed of 10 differentsubjects (9males and 1 female) performing 10 activities twiceThe following activities are part of the dataset walk sit downstand up pick up carry throw push pull wave and claphands A number of 199 sequences are available because onesequence is not labeled and the length of sample actions

6 Computational Intelligence and Neuroscience

ranges from 5 to 120 frames The dataset provides 640 times 480RGB frames and 320 times 240 depth frames together with 20skeleton joints captured using Kinect forWindows SDKBetaVersion with a final frame rate of about 15 fps

The Florence3D dataset [44] includes 9 different activi-ties wave drink from a bottle answer phone clap tight lacesit down stand up read watch and bow These activities wereperformed by 10 different subjects for 2 or 3 times resultingin a total number of 215 sequences This is a challengingdataset since the same action is performed with both handsand because of the presence of very similar actions such asdrink from a bottle and answer phone The activities wererecorded in different environments and only RGB videos and15 skeleton joints are available

Finally the MSR Action3D [45] represents one of themost used datasets for HAR It includes 20 activities per-formed by 10 subjects 2 or 3 times In total 567 sequencesof depth (320 times 240) and skeleton frames are providedbut 10 of them have to be discarded because the skeletonsare either missing or affected by too many errors Thefollowing activities are included in the dataset high armwavehorizontal arm wave hammer hand catch forward punchhigh throw draw X draw tick draw circle hand clap two-hand wave side boxing bend forward kick side kick joggingtennis swing tennis serve golf swing and pickup and throwThe dataset has been collected using a structured-light depthcamera at 15 fps RGB data are not available

42 Tests and Results The proposed algorithm has beentested over the datasets detailed above following the recom-mendations provided in each reference paper in order toensure a fair comparison to previous works Following thiscomparison a more AAL oriented evaluation has been con-ducted which consists of considering only suitable actionsfor each dataset excluding the gestures Another type ofevaluation regards the subset of skeleton joints that has to beincluded in the feature computation

421 KARD Dataset Gaglio et al [29] collected the KARDdataset and proposed some evaluation experiments on itThey considered three different experiments and two modal-ities of dataset splitting The experiments are as follows

(i) Experiment A one-third of the data is considered fortraining and the rest for testing

(ii) Experiment B two-thirds of the data is considered fortraining and the rest for testing

(iii) Experiment C half of the data is considered fortraining and the rest for testing

The activities constituting the dataset are split in the followinggroups

(i) Gestures and Actions(ii) Activity Set 1 Activity Set 2 and Activity Set 3 as

listed in Table 1 Activity Set 1 is the simplest onesince it is composed of quite different activities whilethe other two sets include more similar actions andgestures

Table 1 Activity sets grouping different and similar activities fromKARD dataset

Activity Set 1 Activity Set 2 Activity Set 3Horizontal arm wave High arm wave Draw tickTwo-hand wave Side kick DrinkBend Catch cap Sit downPhone call Draw tick Phone callStand up Hand clap Take umbrellaForward kick Forward kick Toss paperDraw X Bend High throwWalk Sit down Horizontal arm wave

Each experiment has been repeated 10 times randomlysplitting training and testing data Finally the ldquonew-personrdquoscenario is also performed that is a leave-one-actor-outsetting consisting of training the system on nine of the tenpeople of the dataset and testing on the tenth In the ldquonew-personrdquo test no recommendation is provided about how tosplit the dataset so we assumed that the whole dataset of 18activities is considered The only parameter that can be setin the proposed algorithm is the number of clusters 119870 anddifferent subsets of skeleton joints are considered Since only15 skeleton joints are available the fourth group of 20 joints(Figure 2(d)) cannot be considered

For each test conducted on KARD dataset the sequenceof clusters 119870 = [3 5 10 15 20 25 30 35] has been consid-ered The results concerning Activity Set tests are reported inTable 2 it is shown that the proposed algorithm outperformsthe originally proposed one in almost all the tests The 119870parameter which gives the maximum accuracy is quite highsince it is 30 or 35 in most of the cases It means that forthe KARD dataset it is better to have a significant numberof clusters representing each activity This is possible alsobecause the number of frames that constitutes each activitygoes from aminimumof 42 in a sequence of hand clap gestureto a maximum of 310 for a walk sequence Experiment Ain Activity Set 1 shows the highest difference between theminimum and the maximum accuracy when varying thenumber of clusters For example considering 119875 = 7 themaximum accuracy (982) is shown in Table 2 and it isobtained with 119870 = 35 The minimum accuracy is 914 andit is obtained with 119870 = 5 The difference is quite high (68)but this gap is reduced by considering more training data Infact Experiment B shows a difference of 05 in Activity Set1 11 in Activity Set 2 and 26 in Activity Set 3

Considering the number of selected joints the observa-tion of Tables 2 and 3 lets us conclude that not all the jointsare necessary to achieve good recognition results In moredetail from Table 2 it can be noticed that the Activity Set 1and Activity Set 2 which are the simplest ones have goodrecognition results using a subset composed of 119875 = 7 jointsThe Activity Set 3 composed of more similar activities isbetter recognized with 119875 = 11 joints In any case it is notnecessary to consider all the skeleton joints provided fromKARD dataset

The results obtained with the ldquonew-personrdquo scenario areshown in Table 4 The best result is obtained with 119870 = 30

Computational Intelligence and Neuroscience 7

Table 2 Accuracy () of the proposed algorithm compared to the other using KARD dataset with different Activity Sets and for differentexperiments

Activity Set 1 Activity Set 2 Activity Set 3A B C A B C A B C

Gaglio et al [29] 951 991 930 899 949 901 842 895 817

Proposed (119875 = 7) 982 984 981 997 100 997 902 950 913

Proposed (119875 = 11) 980 990 977 998 100 996 916 958 933Proposed (119875 = 15) 975 988 976 995 100 996 91 951 932

090 003 007097 003

100100

003 093 003007 007 083 003007 007 077 010

003 093 003100

100100

100100

100090 010

003 003 007 087100

100

Hor

arm

wav

e

Hig

h ar

m w

ave

Two-

hand

wav

e

Catch

cap

Hig

h th

row

Dra

w X

Dra

w tic

k

Toss

pape

r

Forw

kick

Side

kick

Take

um

br

Bend

Han

d cla

p

Wal

k

Phon

e cal

l

Drin

k

Sit d

own

Stan

d up

Hor arm waveHigh arm wave

Two-hand waveCatch cap

High throwDraw X

Draw tickToss paperForw kick

Side kickTake umbr

BendHand clap

WalkPhone call

DrinkSit downStand up

Figure 3 Confusion matrix of the ldquonew-personrdquo test on the whole KARD dataset

Table 3 Accuracy () of the proposed algorithm compared tothe other using KARD dataset with dataset split in Gestures andActions for different experiments

Gestures ActionsA B C A B C

Gaglio et al [29] 865 930 867 925 950 901

Proposed (119875 = 7) 899 935 925 991 996 994Proposed (119875 = 11) 899 959 937 990 999 991Proposed (119875 = 15) 874 936 928 987 995 993

Table 4 Precision () and recall () of the proposed algorithmcompared to the other using the whole KARD dataset and ldquonew-personrdquo setting

Algorithm Precision RecallGaglio et al [29] 848 845

Proposed (119875 = 7) 940 937

Proposed (119875 = 11) 951 950Proposed (119875 = 15) 950 948

using a number of 119875 = 11 joints The overall precision andrecall are about 10higher than the previous approachwhichuses the KARD dataset In this condition the confusion

matrix obtained is shown in Figure 3 It can be noticed thatthe actions are distinguished very well only phone call anddrink show a recognition accuracy equal or lower than 90and sometimes they are mixed with each other The mostcritical activities are the draw X and draw tick gestures whichare quite similar

From the AAL point of view only some activities arerelevant Table 3 shows that the eight actions constituting theKARD dataset are recognized with an accuracy greater than98 even if there are some similar actions such as sit downand stand up or phone call and drink Considering only theActions subset the lower recognition accuracy is 949and itis obtained with119870 = 5 and119875 = 7 It means that the algorithmis able to reach a high recognition rate even if the featurevector is limited to 90 elements

422 CAD-60 Dataset TheCAD-60 dataset is a challengingdataset consisting of 12 activities performed by 4 people in5 different environments The dataset is usually evaluatedby splitting the activities according to the environmentthe global performance of the algorithm is given by theaverage precision and recall among all the environments Twodifferent settings were experimented for CAD-60 in [43]Theformer is defined ldquonew-personrdquo and the latter is the so-calledldquohave-seenrdquo ldquoNew-personrdquo setting has been considered in all

8 Computational Intelligence and Neuroscience

the works using CAD-60 so it is the one selected also in thiswork

The most challenging element of the dataset is thepresence of a left-handed actor In order to increase the per-formance which are particularly affected by this unbalancingin the ldquonew-personrdquo test mirrored copies of each action arecreated as suggested in [43] For each actor a left-handed anda right-handed version of each action are made availableThedummyversion of the activity has been obtained bymirroringthe skeleton with respect to the virtual sagittal plane that cutsthe person in a half The proposed algorithm is evaluatedusing three different sets of joints from 119875 = 7 to 119875 = 15and the sequence of clusters 119870 = [3 5 10 15 20 25 30 35]as in the KARD dataset

The best results are obtained with the configurations 119875 =11 and 119870 = 25 and the performance in terms of precisionand recall for each activity is shown in Table 5 Very goodresults are given in office environment where the averageprecision and recall are 964 and 958 respectively Infact the activities of this environment are quite different onlytalking on phone and drinking water are similar On the otherhand the living room environment includes talking on couchand relaxing on couch in addition to talking on phone anddrinking water and it is the most challenging case since theaverage precision and recall are 91 and 906

The proposed algorithm is compared to other worksusing the same ldquonew-personrdquo setting and the results areshown in Table 6 which tells that the 119875 = 11 configurationoutperforms the state-of-the-art results in terms of precisionand it is only 1 lower in terms of recall Shan and Akella[46] achieve very good results using a multiclass SVMscheme with a linear kernel However they train and testmirrored actions separately and then merge the results whencomputing average precision and recall Our approach simplyconsiders two copies of the same action given as input to themulticlass SVM and retrieves the classification results

The reduced number of joints does not affect too muchthe average performance of the algorithm that reaches aprecision of 927 and a recall of 915 with 119875 = 7Using all the available jonts on the other hand brings toa more substantial reduction of the performance showing879 and 867 for precision and recall respectively with119875 = 15 The best results for the proposed algorithm werealways obtained with a high number of clusters (25 or30) The reduction of this number affects the performancefor example considering the 119875 = 11 subset the worstperformance is obtained with 119870 = 5 and with a precision of866 and a recall of 860

From theAALpoint of view this dataset is composed onlyof actions not gestures so the dataset does not have to beseparated to evaluate the performance in a scenario which isclose to AAL

423 UTKinect Dataset The UTKinect dataset is composedof 10 activities performed twice by 10 subjects and theevaluation setting proposed in [26] is the leave-one-out-cross-validation (LOOCV) which means that the systemis trained on all the sequences except one and that one isused for testing Each trainingtesting procedure is repeated

Table 5 Precision () and recall () of the proposed algorithm inthe different environments of CAD-60 with 119875 = 11 and 119870 = 25

Location Activity ldquoNew-personrdquoPrecision Recall

Bathroom

Brushing teeth 889 100

Rinsing mouth 923 100

Wearing contact lens 100 792

Average 937 931

Bedroom

Talking on phone 917 917

Drinking water 913 875

Opening pill container 960 100

Average 930 931

Kitchen

Cooking-chopping 857 100

Cooking-stirring 100 791

Drinking water 960 100

Opening pill container 100 100

Average 954 948

Living room

Talking on phone 875 875

Drinking water 875 875

Talking on couch 889 100

Relaxing on couch 100 875

Average 910 906

Office

Talking on phone 100 875

Writing on whiteboard 100 958

Drinking water 857 100

Working on computer 100 100

Average 964 958Global average 939 935

Table 6 Global precision () and recall () of the proposedalgorithm for CAD-60 dataset and ldquonew-personrdquo setting withdifferent subsets of joints compared to other works

Algorithm Precision RecallSung et al [43] 679 555

Gaglio et al [29] 773 767

Proposed (119875 = 15) 879 867

Faria et al [47] 911 919

Parisi et al [48] 919 902

Proposed (119875 = 7) 927 915

Shan and Akella [46] 938 945Proposed (119875 = 11) 939 935

20 times to reduce the random effect of 119896-means For thisdataset all the different subsets of joints shown in Figure 2are considered since the skeleton is captured usingMicrosoftSDK which provides 20 joints The considered sequence ofclusters is only 119870 = [3 4 5] because the minimum numberof frames constituting an action sequence is 5

The results compared with previous works are shownin Table 7 The best results for the proposed algorithm areobtained with 119875 = 7 which is the smallest set of jointsconsidered The result corresponds to a number of 119870 = 4clusters but the difference with the other clusters is very low

Computational Intelligence and Neuroscience 9

093 002 005

094 006

099 001

100

002 098W

alk

Sit d

own

Stan

d up

Pick

up

Carr

y

Walk

Sit down

Stand up

Pick up

Carry

(a)

088 004 008

095 005

100

003 097

010 090

Wal

k

Sit d

own

Stan

d up

Pick

up

Carr

y

Walk

Sit down

Stand up

Pick up

Carry

(b)

Figure 4 Confusion matrices of the UTKinect dataset with only AAL related activities (a) Best accuracy confusion matrix obtained with119875 = 7 (b) Worst accuracy confusion matrix obtained with 119875 = 15

Table 7 Global accuracy () of the proposed algorithm forUTKinect dataset and LOOCV setting with different subsets ofjoints compared to other works

Algorithm AccuracyXia et al [26] 909

Theodorakopoulos et al [30] 9095

Ding et al [31] 915

Zhu et al [17] 919

Jiang et al [35] 919

Gan and Chen [24] 920

Liu et al [22] 920

Proposed (119875 = 15) 931

Proposed (119875 = 11) 942

Proposed (119875 = 19) 943

Anirudh et al [34] 949

Proposed (119875 = 7) 951

Vemulapalli et al [33] 971

only 06 with 119870 = 5 that provided the worst result Theselection of different sets of joints from 119875 = 7 to 119875 = 15changes the accuracy only by a 2 from 931 to 951In this dataset the main limitation to the performance isgiven by the reduced number of frames that constitute somesequences and limits the number of clusters representing theactions Vemulapalli et al [33] reach the highest accuracybut the approach is much more complex after modelingskeleton joints in a Lie group the processing scheme includesDynamic TimeWarping to perform temporal alignments anda specific representation called Fourier Temporal Pyramidbefore classification with one-versus-all multiclass SVM

Contextualizing the UTKinect dataset to AAL involvesthe consideration of a subset of activities which includes onlyactions and discards gestures In more detail the following 5activities have been selectedwalk sit down stand up pick upand carry In this condition the highest accuracy (967) isstill given by119875 = 7 with 3 clusters and the lower one (941)is represented by 119875 = 15 again with 3 clustersThe confusion

Table 8 Global accuracy () of the proposed algorithm forFlorence3D dataset and ldquonew-personrdquo setting with different subsetsof joints compared to other works

Algorithm AccuracySeidenari et al [44] 820

Proposed (119875 = 7) 821

Proposed (119875 = 15) 847

Proposed (119875 = 11) 861

Anirudh et al [34] 897

Vemulapalli et al [33] 909

Taha et al [28] 962

matrix for these two configurations are shown in Figure 4where the main difference is the reduced misclassificationbetween the activities walk and carry that are very similarto each other

424 Florence3D Dataset The Florence3D dataset is com-posed of 9 activities and 10 people performing themmultipletimes resulting in a total number of 215 activities Theproposed setting to evaluate this dataset is the leave-one-actor-out which is equivalent to the ldquonew-personrdquo settingpreviously described The minimum number of frames is 8so the following sequence of clusters has been considered119870 = [3 4 5 6 7 8] A number of 15 joints are available forthe skeleton so only the first three schemes are included inthe tests

Table 8 shows the results obtained with different subsetsof joints where the best result for the proposed algorithmis given by 119875 = 11 with 6 clusters The choice of clustersdoes not significantly affect the performance because themaximum reduction is 2 In this dataset the proposedapproach does not achieve state-of-the-art accuracy (962)given by Taha et al [28] However all the algorithmsthat overcome the proposed one exhibit greater complexitybecause they consider Lie group representations [33 34] orseveral machine learning algorithms such as SVM combined

10 Computational Intelligence and Neuroscience

(a)

(b)

(c)

Figure 5 Sequences of frames representing the drink from a bottle (a) answer phone (b) and read watch (c) activities from Florence3Ddataset

081 019

014 086

100

005 005 090

005 095

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(a)

076 019 005

032 064 005

004 092 004

005 010

005010005

085

080

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(b)

Figure 6 Confusion matrices of the Florence3D dataset with only AAL related activities (a) Best accuracy confusion matrix obtained with119875 = 11 (b) Worst accuracy confusion matrix obtained with 119875 = 7

to HMM to recognize activities composed of atomic actions[28]

If different subsets of joints are considered the perfor-mance decreases In particular considering only 7 joints and3 clusters it is possible to reach a maximum accuracy of821 which is very similar to the one obtained by Seidenariet al [44] who collected the dataset By including all theavailable joints (15) in the algorithm processing it is possibleto achieve an accuracy which varies between 840 and847

The Florence3D dataset is challenging because of tworeasons

(i) High interclass similarity some actions are verysimilar to each other for example drink from abottle answer phone and read watch As can beseen in Figure 5 all of them consist of an uprising

arm movement to the mouth hear or head Thisfact affects the performance because it is difficult toclassify actions consisting of very similar skeletonmovements

(ii) High intraclass variability the same action is per-formed in different ways by the same subject forexample using left right or both hands indifferently

Limiting the analysis to AAL related activities only thefollowing ones can be selected drink answer phone tight lacesit down and stand up The algorithm reaches the highestaccuracy (908) using the ldquonew-personrdquo setting with 119875 =11 joints and 119870 = 3 The confusion matrix obtained in thiscondition is shown in Figure 6(a) On the other hand theworst accuracy is 798 and it is obtained with 119875 = 7 and119870 = 4 In Figure 6(b) the confusion matrix for this scenario

Computational Intelligence and Neuroscience 11

081 008

004 056 022 011007

012 077 012

012

004 004 081 012

100

081 004 015

013 087

004 007019 070

a02

a03

a05

a06

a10

a13

a18

a20

a02

a03

a05

a06

a10

a13

a18

a20

(a)

056 015 007 015 007

007

007 017 013

016 044 008 004 004 024

011 056 004

003

022

010 070 017

063

100

010 087 003

007 093

a01

a04

a07

a08

a09

a11

a12

a14

a01

a04

a07

a08

a09

a11

a12

a14

(b)

088 012

100

005 095

097 003

003 097

093 007

003 003 083 010

004 011 085

a06

a14

a15

a16

a17

a18

a19

a20

a06

a14

a15

a16

a17

a18

a19

a20

(c)

Figure 7 Confusion matrices of the MSR Action3D dataset obtained with 119875 = 7 and the ldquonew-personrdquo test (a) AS1 (b) AS2 and (c) AS3

is shown In both situations the main problem is given by thesimilarity of drink and answer phone activities

425 MSR Action3D Dataset The MSR Action3D datasetis composed of 20 activities which are mainly gestures Itsevaluation is included for the sake of completeness in thecomparison with other activity recognition algorithms butthe dataset does not contain AAL related activities Thereis a big confusion in the literature about the validationtests to be used for MSR Action3D Padilla-Lopez et al[49] summarized all the validation methods for this datasetand recommend using all the possible combinations of 5-5 subjects splitting or using LOAO The former procedureconsists of considering 252 combinations of 5 subjects fortraining and the remaining 5 for testing while the latter is theleave-one-actor-out equivalent to the ldquonew-personrdquo schemepreviously introduced This dataset is quite challenging dueto the presence of similar and complex gestures hence theevaluation is performed by considering three subsets of 8gestures each (AS1 AS2 and AS3 in Table 9) as suggestedin [45] Since the minimum number of frames constitutingone sequence is 13 the following set of clusters has beenconsidered 119870 = [3 5 8 10 13] All the combinations of 711 15 and 20 joints shown in Figure 2 are included in theexperimental tests

Considering the ldquonew-personrdquo scheme the proposedalgorithm tested separately on the three subsets of MSRAction3D reaches an average accuracy of 812 with 119875 = 7and 119870 = 10 The confusion matrices for the three subsetsare shown in Figure 7 where it is possible to notice thatthe algorithm struggles in the recognition of the AS2 subset(Figure 7(b)) mainly represented by drawing gestures Betterresults are obtained processing the subset AS3 (Figure 7(c))where lower recognition rates are shown for complex ges-tures golf swing and pickup and throw Table 10 shows theresults obtained including also other joints which are slightlyworse A comparison with previous works validated usingthe ldquonew-personrdquo test is shown in Table 11 The proposedalgorithm achieves results comparable with [50] whichexploits an approach based on skeleton data Chaaraoui etal [15 36] exploits more complex algorithms considering

Table 9 Three subsets of gestures fromMSR Action3D dataset

AS1 AS2 AS3[11988602] Horizontalarm wave [11988601] High arm wave [11988606] High throw

[11988603] Hammer [11988604] Hand catch [11988614] Forward kick[11988605] Forwardpunch [11988607] Draw X [11988615] Side kick

[11988606] High throw [11988608] Draw tick [11988616] Jogging[11988610] Hand clap [11988609] Draw circle [11988617] Tennis swing[11988613] Bend [11988611] Two-hand wave [11988618] Tennis serve[11988618] Tennis serve [11988612] Side boxing [11988619] Golf swing[11988620] Pickup andthrow [11988614] Forward kick [11988620] Pickup and throw

Table 10 Accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting with different subsetsof joints and different subsets of activities

Algorithm AS1 AS2 AS3 AvgProposed (119875 = 15) 785 688 928 800

Proposed (119875 = 20) 790 702 919 804

Proposed (119875 = 11) 776 737 914 809

Proposed (119875 = 7) 795 719 923 812

the fusion of skeleton and depth data or the evolutionaryselection of the best set of joints

5 Discussion

The proposed algorithm despite its simplicity is able toachieve and sometimes to overcome state-of-the-art per-formance when applied to publicly available datasets Inparticular it is able to outperform some complex algorithmsexploiting more than one classifier in KARD and CAD-60datasets

Limiting the analysis to AAL related activities only thealgorithm achieves interesting results in all the datasetsThe group of 8 activities labeled as Actions in the KARD

12 Computational Intelligence and Neuroscience

Table 11 Average accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting compared to otherworks

Algorithm AccuracyCeliktutan et al (2015) [51] 729

Azary and Savakis (2013) [52] 785

Proposed (119875 = 7) 812

Azary and Savakis (2012) [50] 839

Chaaraoui et al (2013) [15] 906

Chaaraoui et al (2014) [36] 935

dataset is recognized with an accuracy greater than 987in the considered experiments and the group includes twosimilar activities such as phone call and drink The CAD-60dataset contains only actions so all the activities consideredwithin the proper location have been included in the eval-uation resulting in global precision and recall of 939 and935 respectively In the UTKinect dataset the performanceimproves considering only the AAL activities and the samehappens also in the Florence3D having the highest accuracyclose to 91 even if some actions are very similar

The MSR Action3D is a challenging dataset mainlycomprising gestures for human-computer interaction andnot actions Many gestures especially the ones included inAS2 subset are very similar to each other In order to improvethe recognition accuracy more complex features should beconsidered including not only jointsrsquo relative positions butalso their velocity Many sequences contain noisy skeletondata another approach to improving recognition accuracycould be the development of a method to discard noisyskeleton or to include depth-based features

However the AAL scenario raises some problems whichhave to be addressed First of all the algorithm exploits amul-ticlass SVM which is a good classifier but it does not makeeasy to understand if an activity belongs or not to any of thetraining classes In fact in a real scenario it is possible to havea sequence of frames that does not represent any activity ofthe training set in this case the SVM outputs the most likelyclass anyway even if it does not make sense Other machinelearning algorithms such as HMMs distinguish amongmultiple classes using the maximum posterior probabilityGaglio et al [29] proposed using a threshold on the outputprobability to detect unknown actions Usually SVMs do notprovide output probabilities but there exist some methodsto extend SVM implementations and make them able toprovide also this information [53 54] These techniquescan be investigated to understand their applicability in theproposed scenario

Another problem is segmentation of actions In manydatasets the actions are represented as segmented sequencesof frames but in real applications the algorithm has to handlea continuous stream of frames and to segment actions byitself Some solutions for segmentation have been proposedbut most of them are based on thresholds on movementswhich can be highly data-dependent [46] Also this aspect hasto be further investigated to have a systemwhich is effectivelyapplicable in a real AAL scenario

6 Conclusions

In this work a simple yet effective activity recognitionalgorithm has been proposed It is based on skeleton dataextracted from an RGBD sensor and it creates a featurevector representing the whole activity It is able to overcomestate-of-the-art results in two publicly available datasets theKARD and CAD-60 outperforming more complex algo-rithms in many conditions The algorithm has been testedalso over other more challenging datasets the UTKinect andFlorence3D where it is outperformed only by algorithmsexploiting temporal alignment techniques or a combinationof several machine learning methods The MSR Action3Dis a complex dataset and the proposed algorithm strugglesin the recognition of subsets constituted by similar gesturesHowever this dataset does not contain any activity of interestfor AAL scenarios

Future works will concern the application of the activityrecognition algorithm to AAL scenarios by consideringaction segmentation and the detection of unknown activities

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to acknowledge the contributionof the COST Action IC1303 AAPELE (Architectures Algo-rithms and Platforms for Enhanced Living Environments)

References

[1] P Rashidi and A Mihailidis ldquoA survey on ambient-assistedliving tools for older adultsrdquo IEEE Journal of Biomedical andHealth Informatics vol 17 no 3 pp 579ndash590 2013

[2] O C Ann and L B Theng ldquoHuman activity recognition areviewrdquo in Proceedings of the IEEE International Conference onControl System Computing and Engineering (ICCSCE rsquo14) pp389ndash393 IEEE Batu Ferringhi Malaysia November 2014

[3] S Wang and G Zhou ldquoA review on radio based activityrecognitionrdquo Digital Communications and Networks vol 1 no1 pp 20ndash29 2015

[4] L Atallah B Lo R Ali R King and G-Z Yang ldquoReal-timeactivity classificationusing ambient andwearable sensorsrdquo IEEETransactions on Information Technology in Biomedicine vol 13no 6 pp 1031ndash1039 2009

[5] D De P Bharti S K Das and S Chellappan ldquoMultimodalwearable sensing for fine-grained activity recognition in health-carerdquo IEEE Internet Computing vol 19 no 5 pp 26ndash35 2015

[6] J K Aggarwal and L Xia ldquoHuman activity recognition from 3Ddata a reviewrdquo Pattern Recognition Letters vol 48 pp 70ndash802014

[7] S Gasparrini E Cippitelli E Gambi S Spinsante and FFlorez-Revuelta ldquoPerformance analysis of self-organising neu-ral networks tracking algorithms for intake monitoring usingkinectrdquo in Proceedings of the 1st IET International Conferenceon Technologies for Active and Assisted Living (TechAAL rsquo15)Kingston UK November 2015

Computational Intelligence and Neuroscience 13

[8] J Shotton A Fitzgibbon M Cook et al ldquoReal-time humanpose recognition in parts from single depth imagesrdquo in Pro-ceedings of the IEEE Conference on Computer Vision and PatternRecognition (CVPR rsquo11) pp 1297ndash1304 IEEE Providence RdUSA June 2011

[9] J R Padilla-Lopez A A Chaaraoui F Gu and F Florez-Revuelta ldquoVisual privacy by context proposal and evaluationof a level-based visualisation schemerdquo Sensors vol 15 no 6 pp12959ndash12982 2015

[10] X Yang and Y Tian ldquoSuper normal vector for activity recog-nition using depth sequencesrdquo in Proceedings of the 27th IEEEConference on Computer Vision and Pattern Recognition (CVPRrsquo14) pp 804ndash811 IEEE Columbus Ohio USA June 2014

[11] R Slama H Wannous and M Daoudi ldquoGrassmannian rep-resentation of motion depth for 3D human gesture and actionrecognitionrdquo inProceedings of the 22nd International Conferenceon Pattern Recognition (ICPR rsquo14) pp 3499ndash3504 IEEE Stock-holm Sweden August 2014

[12] O Oreifej and Z Liu ldquoHON4D histogram of oriented 4Dnormals for activity recognition from depth sequencesrdquo inProceedings of the 26th IEEEConference onComputer Vision andPattern Recognition (CVPR rsquo13) pp 716ndash723 IEEE June 2013

[13] H Rahmani AMahmood D QHuynh andAMian ldquoHOPChistogram of oriented principal components of 3D pointcloudsfor action recognitionrdquo in Computer VisionmdashECCV 2014 DFleet T Pajdla B Schiele and T Tuytelaars Eds vol 8690 ofLecture Notes in Computer Science pp 742ndash757 Springer 2014

[14] G Chen D Clarke M Giuliani A Gaschler and A KnollldquoCombining unsupervised learning and discrimination for 3Daction recognitionrdquo Signal Processing vol 110 pp 67ndash81 2015

[15] A A Chaaraoui J R Padilla-Lopez and F Florez-RevueltaldquoFusion of skeletal and silhouette-based features for humanaction recognition with RGB-D devicesrdquo in Proceedings ofthe 14th IEEE International Conference on Computer VisionWorkshops (ICCVW rsquo13) pp 91ndash97 IEEE Sydney AustraliaDecember 2013

[16] S Althloothi M H Mahoor X Zhang and R M VoylesldquoHuman activity recognition using multi-features and multiplekernel learningrdquo Pattern Recognition vol 47 no 5 pp 1800ndash1812 2014

[17] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal featuresand joints for 3D action recognitionrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo13) pp 486ndash491 IEEE June 2013

[18] G Willems T Tuytelaars and L Van Gool ldquoAn efficient denseand scale-invariant spatio-temporal interest point detectorrdquo inComputer VisionmdashECCV 2008 D Forsyth P Torr and AZisserman Eds vol 5303 of Lecture Notes in Computer Sciencepp 650ndash663 Springer Berlin Germany 2008

[19] A Klaser M Marszałek and C Schmid ldquoA spatio-temporaldescriptor based on 3D-gradientsrdquo in Proceedings of the 19thBritish Machine Vision Conference (BMVC rsquo08) pp 995ndash1004September 2008

[20] J LuoWWang and H Qi ldquoSpatio-temporal feature extractionand representation for RGB-D human action recognitionrdquoPattern Recognition Letters vol 50 pp 139ndash148 2014

[21] Y Zhu W Chen and G Guo ldquoEvaluating spatiotemporalinterest point features for depth-based action recognitionrdquoImage and Vision Computing vol 32 no 8 pp 453ndash464 2014

[22] A-A Liu W-Z Nie Y-T Su L Ma T Hao and Z-X YangldquoCoupled hidden conditional random fields for RGB-D humanaction recognitionrdquo Signal Processing vol 112 pp 74ndash82 2015

[23] M Devanne H Wannous S Berretti P Pala M Daoudiand A Del Bimbo ldquoSpace-time pose representation for 3Dhuman action recognitionrdquo inNewTrends in ImageAnalysis andProcessingmdashICIAP 2013 vol 8158 of Lecture Notes in ComputerScience pp 456ndash464 Springer Berlin Germany 2013

[24] L Gan and F Chen ldquoHuman action recognition using APJ3Dand random forestsrdquo Journal of Software vol 8 no 9 pp 2238ndash2245 2013

[25] J Wang Z Liu Y Wu and J Yuan ldquoMining actionlet ensemblefor action recognition with depth camerasrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo12) pp 1290ndash1297 Providence RI USA June 2012

[26] L Xia C-C Chen and J K Aggarwal ldquoView invariant humanaction recognition using histograms of 3D jointsrdquo inProceedingsof the IEEE Computer Society Conference on Computer Visionand Pattern Recognition Workshops (CVPRW rsquo12) pp 20ndash27Providence RI June 2012

[27] X Yang and Y Tian ldquoEffective 3D action recognition usingEigen Jointsrdquo Journal of Visual Communication and ImageRepresentation vol 25 no 1 pp 2ndash11 2014

[28] A Taha H H Zayed M E Khalifa and E-S M El-HorbatyldquoHuman activity recognition for surveillance applicationsrdquo inProceedings of the 7th International Conference on InformationTechnology pp 577ndash586 2015

[29] S Gaglio G Lo Re and M Morana ldquoHuman activity recog-nition process using 3-D posture datardquo IEEE Transactions onHuman-Machine Systems vol 45 no 5 pp 586ndash597 2015

[30] I Theodorakopoulos D Kastaniotis G Economou and SFotopoulos ldquoPose-based human action recognition via sparserepresentation in dissimilarity spacerdquo Journal of Visual Commu-nication and Image Representation vol 25 no 1 pp 12ndash23 2014

[31] WDing K Liu F Cheng and J Zhang ldquoSTFC spatio-temporalfeature chain for skeleton-based human action recognitionrdquoJournal of Visual Communication and Image Representation vol26 pp 329ndash337 2015

[32] R Slama HWannous M Daoudi and A Srivastava ldquoAccurate3D action recognition using learning on the Grassmann mani-foldrdquo Pattern Recognition vol 48 no 2 pp 556ndash567 2015

[33] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 27th IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR rsquo14) pp 588ndash595Columbus Ohio USA June 2014

[34] R Anirudh P Turaga J Su and A Srivastava ldquoElastic func-tional coding of human actions From vector-fields to latentvariablesrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR rsquo15) pp 3147ndash3155Boston Mass USA June 2015

[35] M Jiang J Kong G Bebis and H Huo ldquoInformative jointsbased human action recognition using skeleton contextsrdquo SignalProcessing Image Communication vol 33 pp 29ndash40 2015

[36] A A Chaaraoui J R Padilla-Lopez P Climent-Perez andF Florez-Revuelta ldquoEvolutionary joint selection to improvehuman action recognitionwith RGB-Ddevicesrdquo Expert Systemswith Applications vol 41 no 3 pp 786ndash794 2014

[37] S Baysal M C Kurt and P Duygulu ldquoRecognizing humanactions using key posesrdquo in Proceedings of the 20th InternationalConference on Pattern Recognition (ICPR rsquo10) pp 1727ndash1730Istanbul Turkey August 2010

[38] A A Chaaraoui P Climent-Perez and F Florez-RevueltaldquoSilhouette-based human action recognition using sequences of

14 Computational Intelligence and Neuroscience

key posesrdquo Pattern Recognition Letters vol 34 no 15 pp 1799ndash1807 2013

[39] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[40] C-W Hsu and C-J Lin ldquoA comparison of methods for mul-ticlass support vector machinesrdquo IEEE Transactions on NeuralNetworks vol 13 no 2 pp 415ndash425 2002

[41] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[42] OpenNI 2015 httpstructureioopenni[43] J Sung C Ponce B Selman and A Saxena ldquoUnstructured

human activity detection fromRGBD imagesrdquo in Proceedings ofthe IEEE International Conference on Robotics and Automation(ICRA rsquo12) pp 842ndash849 Saint Paul Minn USA May 2012

[44] L Seidenari V Varano S Berretti A Del Bimbo and P PalaldquoRecognizing actions from depth cameras as weakly alignedmulti-part Bag-of-Posesrdquo in Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition Workshops(CVPRW rsquo13) pp 479ndash485 Portland Ore USA June 2013

[45] W Li Z Zhang and Z Liu ldquoAction recognition based ona bag of 3D pointsrdquo in Proceedings of the IEEE ComputerSociety Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo10) pp 9ndash14 San Francisco Calif USAJune 2010

[46] J Shan and S Akella ldquo3D human action segmentation andrecognition using pose kinetic energyrdquo in Proceedings of theIEEE Workshop on Advanced Robotics and its Social Impacts(ARSO rsquo14) pp 69ndash75 Evanston Ill USA September 2014

[47] D R Faria C Premebida and U Nunes ldquoA probabilisticapproach for human everyday activities recognition using bodymotion from RGB-D imagesrdquo in Proceedings of the 23rd IEEEInternational Symposium on Robot and Human InteractiveCommunication (RO-MAN rsquo14) pp 732ndash737 Edinburgh UKAugust 2014

[48] G I Parisi C Weber and S Wermter ldquoSelf-organizing neuralintegration of pose-motion features for human action recogni-tionrdquo Frontiers in Neurorobotics vol 9 no 3 2015

[49] J R Padilla-Lopez A A Chaaraoui and F Florez-Revuelta ldquoAdiscussion on the validation tests employed to compare humanaction recognition methods using the MSR Action3D datasetrdquohttparxivorgabs14077390

[50] S Azary and A Savakis ldquo3D Action classification using sparsespatio-temporal feature representationsrdquo in Advances in VisualComputing G Bebis R Boyle B Parvin et al Eds vol 7432 ofLecture Notes in Computer Science pp 166ndash175 Springer BerlinGermany 2012

[51] O Celiktutan C Wolf B Sankur and E Lombardi ldquoFast exacthyper-graph matching with dynamic programming for spatio-temporal datardquo Journal ofMathematical Imaging andVision vol51 no 1 pp 1ndash21 2015

[52] S Azary and A Savakis ldquoGrassmannian sparse representationsand motion depth surfaces for 3D action recognitionrdquo inProceedings of the IEEE Conference on Computer Vision andPattern Recognition Workshops (CVPRW rsquo13) pp 492ndash499Portland Ore USA June 2013

[53] J C Platt ldquoProbabilistic outputs for support vector machinesand comparisons to regularized likelihood methodsrdquo inAdvances in Large Margin Classifiers pp 61ndash74 MIT PressCambridge Mass USA 1999

[54] H-T Lin C-J Lin and R C Weng ldquoA note on Plattrsquosprobabilistic outputs for support vector machinesrdquo MachineLearning vol 68 no 3 pp 267ndash276 2007

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 3: Research Article A Human Activity Recognition System Using Skeleton …downloads.hindawi.com/journals/cin/2016/4351435.pdf · 2019-07-30 · Kinect skeleton is motivated by the fact

Computational Intelligence and Neuroscience 3

Then the recognition is realized using the representation inthe dissimilarity space where different feature trajectoriesmaintain discriminant information and have a fixed-lengthrepresentation Ding et al [31] proposed a Spatiotempo-ral Feature Chain (STFC) to represent the human actionsby trajectories of joint positions Before using the STFCmodel a graph is used to erase periodic sequences makingthe solution more robust to noise and periodic sequencemisalignment Slama et al [32] exploited the geometricstructure of the Grassmann manifold for action analysisIn fact considering the problem as a sequence matchingtask this manifold allows considering an action sequence asa point on its space and provides tools to make statisticalanalysis Considering that the relative geometry betweenbody parts is more meaningful than their absolute locationsrotations and translations required to perform rigid bodytransformations can be represented as points in a SpecialEuclidean SE(3) group Each skeleton can be represented asa point in the Lie group SE(3) times SE(3) times sdot sdot sdot times SE(3) and ahuman action can be modeled as a curve in this Lie group[33] The same skeleton feature is also used in [34] whereManifold Functional PCA (mfPCA) is employed to reducefeature dimensionality Some works developed techniques toautomatically select the most informative joints aiming atincreasing the recognition accuracy and reducing the noiseeffect on the skeleton estimation [35 36]

The work presented in this paper is based on the conceptthat an action can be seen as a sequence of informativepostures which are known as ldquokey posesrdquoThis idea has beenintroduced in [37] and used in other subsequent proposals[15 36 38] While in previous works a fixed set of key poses(119870) is extracted for each action of the dataset considering allthe training sequences in this work the clustering algorithmwhich selects the most informative postures is executed foreach sequence thus selecting a different set of 119870 poseswhich constitutes the feature vector This procedure avoidsthe application of a learning algorithm which has to find thenearest neighbor key pose for each frame constituting thesequence

3 Activity Recognition Algorithm

The proposed algorithm for activity recognition starts fromskeleton joints and computes a vector of features for eachactivity Then a multiclass machine learning algorithmwhere each class represents a different activity is exploited forclassification purpose Four main steps constitute the wholealgorithm they are represented in Figure 1 and are discussedin the following

(1) Posture Features Extraction The coordinates of theskeleton joints are used to evaluate the feature vectorswhich represent human postures

(2) Postures Selection The most important postures foreach activity are selected

(3) Activity Features Computation A feature vector rep-resenting the whole activity is created and used forclassification

(4) Classification The classification is realized using amulticlass SVM implemented with the ldquoone-versus-onerdquo approach

First there is the need to extract the features fromskeleton data representing the input to the algorithm Thejoint extraction algorithm proposed in [8] is included in theKinect libraries and is ready to use The choice to considerKinect skeleton is motivated by the fact that it is easy to havea compact representation of the human body Starting fromskeleton data many features which are able to represent ahuman pose and consequently a human action have beenproposed in the literature In [6] the authors found thatmanyfeatures can be computed from skeleton joints The simplestfeatures can be extracted by joint locations or from theirdistances considering spatial information Other featuresmay involve joints orientation or motion considering spatialand temporal data More complex features may be initiallybased on the estimation of a plane considering some jointsThen by measuring the distances between this plane andother joints the features can be extracted

The proposed algorithm exploits spatial features com-puted from 3D skeleton coordinates without including thetime information in the computation in order to make thesystem independent of the speed of movement The featureextraction method has been introduced in [36] and it is hereadopted with small differences For each skeleton frame aposture feature vector is computed Each joint is representedby J119894 a three-dimensional vector in the coordinate space of

Kinect The person can be found at any place within thecoverage area of Kinect and the coordinates of the same jointmay assume different values It is necessary to compensatethis effect by using a proper features computation algorithmA straightforward solution is to compensate the position ofthe skeleton by centering the coordinate space in one skeletonjoint Considering a skeleton composed of 119875 joints J

0being

the coordinates of the torso joint and J2being the coordinates

of the neck joint the 119894th joint feature d119894is the distance vector

between J119894and J0 normalized to the distance between J

2and

J0

d119894=

J119894minus J0

1003817100381710038171003817J2 minus J01003817100381710038171003817

119894 = 1 2 119875 minus 1 (1)

This feature is invariant to the position of the skeleton withinthe coverage area ofKinect furthermore the invariance of thefeature to the build of the person is obtained by normalizationwith respect to the distance between the neck and torso jointsThese features may be seen as a set of distance vectors whichconnect each joint to the joint of the torso A posture featurevector f is created for each skeleton frame

f = [d1 d2 d3 d

119875minus1] (2)

A set of 119873 feature vectors is computed having an activityconstituted by119873 frames

The second phase concerns the human postures selectionwith the aim of reducing the complexity and increasinggenerality by representing the activity by means of only asubset of poses without using all the frames A clustering

4 Computational Intelligence and Neuroscience

1

23

4

Test Test

Train

LA

(A L)

SVM1

SVM2

SVM3C1

C2

C3

CK

SVMM(Mminus1)2

Classification

Activity features computation Postures selection

Posture features extraction

A = [C1C2C3 CK]

s1 = [J0 J1 J2 J3 JPminus1]s2 = [J0 J1 J2 J3 JPminus1]s3 = [J0 J1 J2 J3 JPminus1]s4 = [J0 J1 J2 J3 JPminus1]

sN = [J0 J1 J2 J3 JPminus1]

f1 = [d1 d2 d3 dPminus1]f2 = [d1 d2 d3 dPminus1]f3 = [d1 d2 d3 dPminus1]f4 = [d1 d2 d3 dPminus1]

fN = [d1 d2 d3 dPminus1]f = [d1 d2 d3 dPminus1]

Ji = ith joint

di =Ji minus J0J2 minus J0

i = 1 2 P minus 1

Figure 1 The general scheme of the activity recognition algorithm is composed of 4 steps In the first step the posture feature vectors arecomputed for each skeleton frame then the postures are selected and an activity features vector is created Finally amulticlass SVM is exploitedfor classification

algorithm is used to process 119873 feature vectors constitutingthe activity by grouping them into 119870 clusters The well-known 119896-means clustering algorithm based on the squaredEuclidean distance as a metric can be used to group togetherthe frames representing similar postures Considering anactivity composed of119873 feature vectors [f

1 f2 f3 f

119873] the

119896-means algorithm gives as outputs 119873 clusters ID (one foreach feature vector) and 119870 vectors [C

1C2C3 C

119870] that

represent the centroids of each cluster The feature vectorsare partitioned into clusters 119878

1 1198782 119878

119870so as to satisfy the

condition expressed by

argminS

119870

sum

119895=1

sum

f119894isin119878119895

10038171003817100381710038171003817f119894minus C119895

10038171003817100381710038171003817

2

(3)

The 119870 centroids can be seen as the main postures whichare the most important feature vectors Unlike classicalapproaches based on key poses where the most informativepostures are evaluated by considering all the sequencesof each activity in the proposed solution the clusteringalgorithm is executed for each sequence This avoids theapplication of a learning algorithm required to associate eachframe to the closest key pose and allows to have a morecompact representation of the activity

The third phase is related to the computation of a featurevector which models the whole activity starting from the119870 centroid vectors computed by the clustering algorithmIn more detail [C

1C2C3 C

119870] vectors are sorted con-

sidering the order in which the clusterrsquos elements occurduring the activity The activity features vector is composedof concatenating the sorted centroid vectors For example

considering an activity featured by 119873 = 10 and 119870 = 4after running the 119896-means algorithm one of the possibleoutputs could be the following sequence of cluster IDs[2 2 2 3 3 1 1 4 4 4] meaning that the first three posturevectors belong to cluster 2 the fourth and the fifth areassociated with cluster 3 and so on In this case the activityfeatures vector is A = [C

2C3C1C4] A feature activity

vector has a dimension of 3119870(119875 minus 1) that can be handledwithout using dimensionality reduction algorithms such asPCA if 119870 is small

The classification step aims to associate each featureactivity vector to the correct activity Many machine learningalgorithms may be applied to fulfil this task among them aSVM Considering a number of 119897 training vectors x

119894isin 119877119899 and

a vector of labels 119910 isin 119877119897 where 119910119894isin minus1 1 a binary SVM

can be formulated as follows [39]

minw119887120585

1

2w119879w + 119862

119897

sum

119894=1

120585119894

subject to 119910119894(w119879120601 (x

119894) + 119887) ge 1 minus 120585

119894

120585119894ge 0 119894 = 1 119897

(4)

where

w119879120601 (x) + 119887 = 0 (5)

is the optimal hyperplane that allows separation betweenclasses in the feature space119862 is a constant and 120585

119894are nonneg-

ative variables which consider training errorsThe function 120601allows transforming between the features space and an higher

Computational Intelligence and Neuroscience 5

12

30456

78

910

1112

131415

16 17

18 19

(a)

12

30456

78

910

1112

131415

16 17

18 19

(b)

12

30456

78

910

1112

131415

1617

18 19

(c)

12

30456

78

910

1112

131415

16 17

18 19(d)

Figure 2 Subsets of joints considered in the evaluation of the proposed algorithmThewhole skeleton is represented by 20 joints the selectedones are depicted as green circles while the discarded joints are represented by red squares Subsets of 7 (a) 11 (b) and 15 (c) joints and thewhole set of 20 joints (d) are selected

dimensional space where the data are separable Consideringtwo training vectors x

119894and x

119895 the kernel function can be

defined as

119870(x119894 x119895) = 120601 (x

119894)119879120601 (x119895) (6)

In this work the Radial Basis Function (RBF) kernel has beenused where

119870(x119894 x119895) = 119890minus120574x119894minusx1198952 120574 gt 0 (7)

It follows that 119862 and 120574 are the parameters that have to beestimated prior using the SVM

The idea herein exploited is to use a multiclass SVMwhere each class represents an activity of the dataset In orderto extend the role from a binary to amulticlass classifier someapproaches have been proposed in the literature In [40] theauthors compared many methods and found that the ldquoone-against-onerdquo is one of the most suitable for practical use Itis implemented in LIBSVM [41] and it is the approach usedin this workThe ldquoone-against-onerdquo approach is based on theconstruction of several binary SVM classifiers inmore detaila number of119872(119872minus1)2 binary SVMs are necessary in a119872-classes dataset This happens because each SVM is trained todistinguish between 2 classes and the final decision is takenexploiting a voting strategy among all the binary classifiersDuring the training phase the activity feature vectors aregiven as inputs to the multiclass SVM together with the label119871 of the action In the test phase the activity label is obtainedfrom the classifier

4 Experimental Results

Thealgorithmperformance is evaluated on five publicly avail-able datasets In order to perform an objective comparisonto previous works the reference test procedures have beenconsidered for each dataset The performance indicators areevaluated using four different subsets of joints shown inFigure 2 going from a minimum of 7 up to a maximum of20 joints Finally in order to evaluate the performance inAAL scenarios some suitable activities related to AAL areselected from the datasets and the recognition accuracies areevaluated

41 Datasets Five different 3D datasets have been consideredin this work each of them including a different set of activitiesand gestures

KARD dataset [29] is composed of 18 activities that canbe divided into 10 gestures (horizontal arm wave high armwave two-hand wave high throw draw X draw tick forwardkick side kick bend and hand clap) and eight actions (catchcap toss paper take umbrellawalk phone call drink sit downand stand up) This dataset has been captured in a controlledenvironment that is an office with a static background anda Kinect device placed at a distance of 2-3m from the subjectSome objects were present in the area useful to perform someof the actions a desk with a phone a coat rack and a wastebin The activities have been performed by 10 young people(nine males and one female) aged from 20 to 30 years andfrom 150 to 185 cm tall Each person repeated each activity3 times creating a number of 540 sequences The dataset iscomposed of RGB and depth frames captured at a rate of30 fps with a 640 times 480 resolution In addition 15 joints ofthe skeleton in world and screen coordinates are providedThe skeleton has been captured using OpenNI libraries [42]

The Cornell Activity Dataset (CAD-60) [43] is madeby 12 different activities typical of indoor environmentsThe activities are rinsing mouth brushing teeth wearingcontact lens talking on the phone drinking water openingpill container cooking-chopping cooking-stirring talking oncouch relaxing on couch writing on whiteboard and workingon computer and are performed in 5 different environmentsbathroom bedroom kitchen living room and office All theactivities are performed by 4 different people two males andtwo females one of which is left-handed No instructionswere given to actors about how to perform the activitiesthe authors simply ensured that the skeleton was correctlydetected by Kinect The dataset is composed of RGB depthand skeleton data with 15 joints available

The UTKinect dataset [26] is composed of 10 differentsubjects (9males and 1 female) performing 10 activities twiceThe following activities are part of the dataset walk sit downstand up pick up carry throw push pull wave and claphands A number of 199 sequences are available because onesequence is not labeled and the length of sample actions

6 Computational Intelligence and Neuroscience

ranges from 5 to 120 frames The dataset provides 640 times 480RGB frames and 320 times 240 depth frames together with 20skeleton joints captured using Kinect forWindows SDKBetaVersion with a final frame rate of about 15 fps

The Florence3D dataset [44] includes 9 different activi-ties wave drink from a bottle answer phone clap tight lacesit down stand up read watch and bow These activities wereperformed by 10 different subjects for 2 or 3 times resultingin a total number of 215 sequences This is a challengingdataset since the same action is performed with both handsand because of the presence of very similar actions such asdrink from a bottle and answer phone The activities wererecorded in different environments and only RGB videos and15 skeleton joints are available

Finally the MSR Action3D [45] represents one of themost used datasets for HAR It includes 20 activities per-formed by 10 subjects 2 or 3 times In total 567 sequencesof depth (320 times 240) and skeleton frames are providedbut 10 of them have to be discarded because the skeletonsare either missing or affected by too many errors Thefollowing activities are included in the dataset high armwavehorizontal arm wave hammer hand catch forward punchhigh throw draw X draw tick draw circle hand clap two-hand wave side boxing bend forward kick side kick joggingtennis swing tennis serve golf swing and pickup and throwThe dataset has been collected using a structured-light depthcamera at 15 fps RGB data are not available

42 Tests and Results The proposed algorithm has beentested over the datasets detailed above following the recom-mendations provided in each reference paper in order toensure a fair comparison to previous works Following thiscomparison a more AAL oriented evaluation has been con-ducted which consists of considering only suitable actionsfor each dataset excluding the gestures Another type ofevaluation regards the subset of skeleton joints that has to beincluded in the feature computation

421 KARD Dataset Gaglio et al [29] collected the KARDdataset and proposed some evaluation experiments on itThey considered three different experiments and two modal-ities of dataset splitting The experiments are as follows

(i) Experiment A one-third of the data is considered fortraining and the rest for testing

(ii) Experiment B two-thirds of the data is considered fortraining and the rest for testing

(iii) Experiment C half of the data is considered fortraining and the rest for testing

The activities constituting the dataset are split in the followinggroups

(i) Gestures and Actions(ii) Activity Set 1 Activity Set 2 and Activity Set 3 as

listed in Table 1 Activity Set 1 is the simplest onesince it is composed of quite different activities whilethe other two sets include more similar actions andgestures

Table 1 Activity sets grouping different and similar activities fromKARD dataset

Activity Set 1 Activity Set 2 Activity Set 3Horizontal arm wave High arm wave Draw tickTwo-hand wave Side kick DrinkBend Catch cap Sit downPhone call Draw tick Phone callStand up Hand clap Take umbrellaForward kick Forward kick Toss paperDraw X Bend High throwWalk Sit down Horizontal arm wave

Each experiment has been repeated 10 times randomlysplitting training and testing data Finally the ldquonew-personrdquoscenario is also performed that is a leave-one-actor-outsetting consisting of training the system on nine of the tenpeople of the dataset and testing on the tenth In the ldquonew-personrdquo test no recommendation is provided about how tosplit the dataset so we assumed that the whole dataset of 18activities is considered The only parameter that can be setin the proposed algorithm is the number of clusters 119870 anddifferent subsets of skeleton joints are considered Since only15 skeleton joints are available the fourth group of 20 joints(Figure 2(d)) cannot be considered

For each test conducted on KARD dataset the sequenceof clusters 119870 = [3 5 10 15 20 25 30 35] has been consid-ered The results concerning Activity Set tests are reported inTable 2 it is shown that the proposed algorithm outperformsthe originally proposed one in almost all the tests The 119870parameter which gives the maximum accuracy is quite highsince it is 30 or 35 in most of the cases It means that forthe KARD dataset it is better to have a significant numberof clusters representing each activity This is possible alsobecause the number of frames that constitutes each activitygoes from aminimumof 42 in a sequence of hand clap gestureto a maximum of 310 for a walk sequence Experiment Ain Activity Set 1 shows the highest difference between theminimum and the maximum accuracy when varying thenumber of clusters For example considering 119875 = 7 themaximum accuracy (982) is shown in Table 2 and it isobtained with 119870 = 35 The minimum accuracy is 914 andit is obtained with 119870 = 5 The difference is quite high (68)but this gap is reduced by considering more training data Infact Experiment B shows a difference of 05 in Activity Set1 11 in Activity Set 2 and 26 in Activity Set 3

Considering the number of selected joints the observa-tion of Tables 2 and 3 lets us conclude that not all the jointsare necessary to achieve good recognition results In moredetail from Table 2 it can be noticed that the Activity Set 1and Activity Set 2 which are the simplest ones have goodrecognition results using a subset composed of 119875 = 7 jointsThe Activity Set 3 composed of more similar activities isbetter recognized with 119875 = 11 joints In any case it is notnecessary to consider all the skeleton joints provided fromKARD dataset

The results obtained with the ldquonew-personrdquo scenario areshown in Table 4 The best result is obtained with 119870 = 30

Computational Intelligence and Neuroscience 7

Table 2 Accuracy () of the proposed algorithm compared to the other using KARD dataset with different Activity Sets and for differentexperiments

Activity Set 1 Activity Set 2 Activity Set 3A B C A B C A B C

Gaglio et al [29] 951 991 930 899 949 901 842 895 817

Proposed (119875 = 7) 982 984 981 997 100 997 902 950 913

Proposed (119875 = 11) 980 990 977 998 100 996 916 958 933Proposed (119875 = 15) 975 988 976 995 100 996 91 951 932

090 003 007097 003

100100

003 093 003007 007 083 003007 007 077 010

003 093 003100

100100

100100

100090 010

003 003 007 087100

100

Hor

arm

wav

e

Hig

h ar

m w

ave

Two-

hand

wav

e

Catch

cap

Hig

h th

row

Dra

w X

Dra

w tic

k

Toss

pape

r

Forw

kick

Side

kick

Take

um

br

Bend

Han

d cla

p

Wal

k

Phon

e cal

l

Drin

k

Sit d

own

Stan

d up

Hor arm waveHigh arm wave

Two-hand waveCatch cap

High throwDraw X

Draw tickToss paperForw kick

Side kickTake umbr

BendHand clap

WalkPhone call

DrinkSit downStand up

Figure 3 Confusion matrix of the ldquonew-personrdquo test on the whole KARD dataset

Table 3 Accuracy () of the proposed algorithm compared tothe other using KARD dataset with dataset split in Gestures andActions for different experiments

Gestures ActionsA B C A B C

Gaglio et al [29] 865 930 867 925 950 901

Proposed (119875 = 7) 899 935 925 991 996 994Proposed (119875 = 11) 899 959 937 990 999 991Proposed (119875 = 15) 874 936 928 987 995 993

Table 4 Precision () and recall () of the proposed algorithmcompared to the other using the whole KARD dataset and ldquonew-personrdquo setting

Algorithm Precision RecallGaglio et al [29] 848 845

Proposed (119875 = 7) 940 937

Proposed (119875 = 11) 951 950Proposed (119875 = 15) 950 948

using a number of 119875 = 11 joints The overall precision andrecall are about 10higher than the previous approachwhichuses the KARD dataset In this condition the confusion

matrix obtained is shown in Figure 3 It can be noticed thatthe actions are distinguished very well only phone call anddrink show a recognition accuracy equal or lower than 90and sometimes they are mixed with each other The mostcritical activities are the draw X and draw tick gestures whichare quite similar

From the AAL point of view only some activities arerelevant Table 3 shows that the eight actions constituting theKARD dataset are recognized with an accuracy greater than98 even if there are some similar actions such as sit downand stand up or phone call and drink Considering only theActions subset the lower recognition accuracy is 949and itis obtained with119870 = 5 and119875 = 7 It means that the algorithmis able to reach a high recognition rate even if the featurevector is limited to 90 elements

422 CAD-60 Dataset TheCAD-60 dataset is a challengingdataset consisting of 12 activities performed by 4 people in5 different environments The dataset is usually evaluatedby splitting the activities according to the environmentthe global performance of the algorithm is given by theaverage precision and recall among all the environments Twodifferent settings were experimented for CAD-60 in [43]Theformer is defined ldquonew-personrdquo and the latter is the so-calledldquohave-seenrdquo ldquoNew-personrdquo setting has been considered in all

8 Computational Intelligence and Neuroscience

the works using CAD-60 so it is the one selected also in thiswork

The most challenging element of the dataset is thepresence of a left-handed actor In order to increase the per-formance which are particularly affected by this unbalancingin the ldquonew-personrdquo test mirrored copies of each action arecreated as suggested in [43] For each actor a left-handed anda right-handed version of each action are made availableThedummyversion of the activity has been obtained bymirroringthe skeleton with respect to the virtual sagittal plane that cutsthe person in a half The proposed algorithm is evaluatedusing three different sets of joints from 119875 = 7 to 119875 = 15and the sequence of clusters 119870 = [3 5 10 15 20 25 30 35]as in the KARD dataset

The best results are obtained with the configurations 119875 =11 and 119870 = 25 and the performance in terms of precisionand recall for each activity is shown in Table 5 Very goodresults are given in office environment where the averageprecision and recall are 964 and 958 respectively Infact the activities of this environment are quite different onlytalking on phone and drinking water are similar On the otherhand the living room environment includes talking on couchand relaxing on couch in addition to talking on phone anddrinking water and it is the most challenging case since theaverage precision and recall are 91 and 906

The proposed algorithm is compared to other worksusing the same ldquonew-personrdquo setting and the results areshown in Table 6 which tells that the 119875 = 11 configurationoutperforms the state-of-the-art results in terms of precisionand it is only 1 lower in terms of recall Shan and Akella[46] achieve very good results using a multiclass SVMscheme with a linear kernel However they train and testmirrored actions separately and then merge the results whencomputing average precision and recall Our approach simplyconsiders two copies of the same action given as input to themulticlass SVM and retrieves the classification results

The reduced number of joints does not affect too muchthe average performance of the algorithm that reaches aprecision of 927 and a recall of 915 with 119875 = 7Using all the available jonts on the other hand brings toa more substantial reduction of the performance showing879 and 867 for precision and recall respectively with119875 = 15 The best results for the proposed algorithm werealways obtained with a high number of clusters (25 or30) The reduction of this number affects the performancefor example considering the 119875 = 11 subset the worstperformance is obtained with 119870 = 5 and with a precision of866 and a recall of 860

From theAALpoint of view this dataset is composed onlyof actions not gestures so the dataset does not have to beseparated to evaluate the performance in a scenario which isclose to AAL

423 UTKinect Dataset The UTKinect dataset is composedof 10 activities performed twice by 10 subjects and theevaluation setting proposed in [26] is the leave-one-out-cross-validation (LOOCV) which means that the systemis trained on all the sequences except one and that one isused for testing Each trainingtesting procedure is repeated

Table 5 Precision () and recall () of the proposed algorithm inthe different environments of CAD-60 with 119875 = 11 and 119870 = 25

Location Activity ldquoNew-personrdquoPrecision Recall

Bathroom

Brushing teeth 889 100

Rinsing mouth 923 100

Wearing contact lens 100 792

Average 937 931

Bedroom

Talking on phone 917 917

Drinking water 913 875

Opening pill container 960 100

Average 930 931

Kitchen

Cooking-chopping 857 100

Cooking-stirring 100 791

Drinking water 960 100

Opening pill container 100 100

Average 954 948

Living room

Talking on phone 875 875

Drinking water 875 875

Talking on couch 889 100

Relaxing on couch 100 875

Average 910 906

Office

Talking on phone 100 875

Writing on whiteboard 100 958

Drinking water 857 100

Working on computer 100 100

Average 964 958Global average 939 935

Table 6 Global precision () and recall () of the proposedalgorithm for CAD-60 dataset and ldquonew-personrdquo setting withdifferent subsets of joints compared to other works

Algorithm Precision RecallSung et al [43] 679 555

Gaglio et al [29] 773 767

Proposed (119875 = 15) 879 867

Faria et al [47] 911 919

Parisi et al [48] 919 902

Proposed (119875 = 7) 927 915

Shan and Akella [46] 938 945Proposed (119875 = 11) 939 935

20 times to reduce the random effect of 119896-means For thisdataset all the different subsets of joints shown in Figure 2are considered since the skeleton is captured usingMicrosoftSDK which provides 20 joints The considered sequence ofclusters is only 119870 = [3 4 5] because the minimum numberof frames constituting an action sequence is 5

The results compared with previous works are shownin Table 7 The best results for the proposed algorithm areobtained with 119875 = 7 which is the smallest set of jointsconsidered The result corresponds to a number of 119870 = 4clusters but the difference with the other clusters is very low

Computational Intelligence and Neuroscience 9

093 002 005

094 006

099 001

100

002 098W

alk

Sit d

own

Stan

d up

Pick

up

Carr

y

Walk

Sit down

Stand up

Pick up

Carry

(a)

088 004 008

095 005

100

003 097

010 090

Wal

k

Sit d

own

Stan

d up

Pick

up

Carr

y

Walk

Sit down

Stand up

Pick up

Carry

(b)

Figure 4 Confusion matrices of the UTKinect dataset with only AAL related activities (a) Best accuracy confusion matrix obtained with119875 = 7 (b) Worst accuracy confusion matrix obtained with 119875 = 15

Table 7 Global accuracy () of the proposed algorithm forUTKinect dataset and LOOCV setting with different subsets ofjoints compared to other works

Algorithm AccuracyXia et al [26] 909

Theodorakopoulos et al [30] 9095

Ding et al [31] 915

Zhu et al [17] 919

Jiang et al [35] 919

Gan and Chen [24] 920

Liu et al [22] 920

Proposed (119875 = 15) 931

Proposed (119875 = 11) 942

Proposed (119875 = 19) 943

Anirudh et al [34] 949

Proposed (119875 = 7) 951

Vemulapalli et al [33] 971

only 06 with 119870 = 5 that provided the worst result Theselection of different sets of joints from 119875 = 7 to 119875 = 15changes the accuracy only by a 2 from 931 to 951In this dataset the main limitation to the performance isgiven by the reduced number of frames that constitute somesequences and limits the number of clusters representing theactions Vemulapalli et al [33] reach the highest accuracybut the approach is much more complex after modelingskeleton joints in a Lie group the processing scheme includesDynamic TimeWarping to perform temporal alignments anda specific representation called Fourier Temporal Pyramidbefore classification with one-versus-all multiclass SVM

Contextualizing the UTKinect dataset to AAL involvesthe consideration of a subset of activities which includes onlyactions and discards gestures In more detail the following 5activities have been selectedwalk sit down stand up pick upand carry In this condition the highest accuracy (967) isstill given by119875 = 7 with 3 clusters and the lower one (941)is represented by 119875 = 15 again with 3 clustersThe confusion

Table 8 Global accuracy () of the proposed algorithm forFlorence3D dataset and ldquonew-personrdquo setting with different subsetsof joints compared to other works

Algorithm AccuracySeidenari et al [44] 820

Proposed (119875 = 7) 821

Proposed (119875 = 15) 847

Proposed (119875 = 11) 861

Anirudh et al [34] 897

Vemulapalli et al [33] 909

Taha et al [28] 962

matrix for these two configurations are shown in Figure 4where the main difference is the reduced misclassificationbetween the activities walk and carry that are very similarto each other

424 Florence3D Dataset The Florence3D dataset is com-posed of 9 activities and 10 people performing themmultipletimes resulting in a total number of 215 activities Theproposed setting to evaluate this dataset is the leave-one-actor-out which is equivalent to the ldquonew-personrdquo settingpreviously described The minimum number of frames is 8so the following sequence of clusters has been considered119870 = [3 4 5 6 7 8] A number of 15 joints are available forthe skeleton so only the first three schemes are included inthe tests

Table 8 shows the results obtained with different subsetsof joints where the best result for the proposed algorithmis given by 119875 = 11 with 6 clusters The choice of clustersdoes not significantly affect the performance because themaximum reduction is 2 In this dataset the proposedapproach does not achieve state-of-the-art accuracy (962)given by Taha et al [28] However all the algorithmsthat overcome the proposed one exhibit greater complexitybecause they consider Lie group representations [33 34] orseveral machine learning algorithms such as SVM combined

10 Computational Intelligence and Neuroscience

(a)

(b)

(c)

Figure 5 Sequences of frames representing the drink from a bottle (a) answer phone (b) and read watch (c) activities from Florence3Ddataset

081 019

014 086

100

005 005 090

005 095

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(a)

076 019 005

032 064 005

004 092 004

005 010

005010005

085

080

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(b)

Figure 6 Confusion matrices of the Florence3D dataset with only AAL related activities (a) Best accuracy confusion matrix obtained with119875 = 11 (b) Worst accuracy confusion matrix obtained with 119875 = 7

to HMM to recognize activities composed of atomic actions[28]

If different subsets of joints are considered the perfor-mance decreases In particular considering only 7 joints and3 clusters it is possible to reach a maximum accuracy of821 which is very similar to the one obtained by Seidenariet al [44] who collected the dataset By including all theavailable joints (15) in the algorithm processing it is possibleto achieve an accuracy which varies between 840 and847

The Florence3D dataset is challenging because of tworeasons

(i) High interclass similarity some actions are verysimilar to each other for example drink from abottle answer phone and read watch As can beseen in Figure 5 all of them consist of an uprising

arm movement to the mouth hear or head Thisfact affects the performance because it is difficult toclassify actions consisting of very similar skeletonmovements

(ii) High intraclass variability the same action is per-formed in different ways by the same subject forexample using left right or both hands indifferently

Limiting the analysis to AAL related activities only thefollowing ones can be selected drink answer phone tight lacesit down and stand up The algorithm reaches the highestaccuracy (908) using the ldquonew-personrdquo setting with 119875 =11 joints and 119870 = 3 The confusion matrix obtained in thiscondition is shown in Figure 6(a) On the other hand theworst accuracy is 798 and it is obtained with 119875 = 7 and119870 = 4 In Figure 6(b) the confusion matrix for this scenario

Computational Intelligence and Neuroscience 11

081 008

004 056 022 011007

012 077 012

012

004 004 081 012

100

081 004 015

013 087

004 007019 070

a02

a03

a05

a06

a10

a13

a18

a20

a02

a03

a05

a06

a10

a13

a18

a20

(a)

056 015 007 015 007

007

007 017 013

016 044 008 004 004 024

011 056 004

003

022

010 070 017

063

100

010 087 003

007 093

a01

a04

a07

a08

a09

a11

a12

a14

a01

a04

a07

a08

a09

a11

a12

a14

(b)

088 012

100

005 095

097 003

003 097

093 007

003 003 083 010

004 011 085

a06

a14

a15

a16

a17

a18

a19

a20

a06

a14

a15

a16

a17

a18

a19

a20

(c)

Figure 7 Confusion matrices of the MSR Action3D dataset obtained with 119875 = 7 and the ldquonew-personrdquo test (a) AS1 (b) AS2 and (c) AS3

is shown In both situations the main problem is given by thesimilarity of drink and answer phone activities

425 MSR Action3D Dataset The MSR Action3D datasetis composed of 20 activities which are mainly gestures Itsevaluation is included for the sake of completeness in thecomparison with other activity recognition algorithms butthe dataset does not contain AAL related activities Thereis a big confusion in the literature about the validationtests to be used for MSR Action3D Padilla-Lopez et al[49] summarized all the validation methods for this datasetand recommend using all the possible combinations of 5-5 subjects splitting or using LOAO The former procedureconsists of considering 252 combinations of 5 subjects fortraining and the remaining 5 for testing while the latter is theleave-one-actor-out equivalent to the ldquonew-personrdquo schemepreviously introduced This dataset is quite challenging dueto the presence of similar and complex gestures hence theevaluation is performed by considering three subsets of 8gestures each (AS1 AS2 and AS3 in Table 9) as suggestedin [45] Since the minimum number of frames constitutingone sequence is 13 the following set of clusters has beenconsidered 119870 = [3 5 8 10 13] All the combinations of 711 15 and 20 joints shown in Figure 2 are included in theexperimental tests

Considering the ldquonew-personrdquo scheme the proposedalgorithm tested separately on the three subsets of MSRAction3D reaches an average accuracy of 812 with 119875 = 7and 119870 = 10 The confusion matrices for the three subsetsare shown in Figure 7 where it is possible to notice thatthe algorithm struggles in the recognition of the AS2 subset(Figure 7(b)) mainly represented by drawing gestures Betterresults are obtained processing the subset AS3 (Figure 7(c))where lower recognition rates are shown for complex ges-tures golf swing and pickup and throw Table 10 shows theresults obtained including also other joints which are slightlyworse A comparison with previous works validated usingthe ldquonew-personrdquo test is shown in Table 11 The proposedalgorithm achieves results comparable with [50] whichexploits an approach based on skeleton data Chaaraoui etal [15 36] exploits more complex algorithms considering

Table 9 Three subsets of gestures fromMSR Action3D dataset

AS1 AS2 AS3[11988602] Horizontalarm wave [11988601] High arm wave [11988606] High throw

[11988603] Hammer [11988604] Hand catch [11988614] Forward kick[11988605] Forwardpunch [11988607] Draw X [11988615] Side kick

[11988606] High throw [11988608] Draw tick [11988616] Jogging[11988610] Hand clap [11988609] Draw circle [11988617] Tennis swing[11988613] Bend [11988611] Two-hand wave [11988618] Tennis serve[11988618] Tennis serve [11988612] Side boxing [11988619] Golf swing[11988620] Pickup andthrow [11988614] Forward kick [11988620] Pickup and throw

Table 10 Accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting with different subsetsof joints and different subsets of activities

Algorithm AS1 AS2 AS3 AvgProposed (119875 = 15) 785 688 928 800

Proposed (119875 = 20) 790 702 919 804

Proposed (119875 = 11) 776 737 914 809

Proposed (119875 = 7) 795 719 923 812

the fusion of skeleton and depth data or the evolutionaryselection of the best set of joints

5 Discussion

The proposed algorithm despite its simplicity is able toachieve and sometimes to overcome state-of-the-art per-formance when applied to publicly available datasets Inparticular it is able to outperform some complex algorithmsexploiting more than one classifier in KARD and CAD-60datasets

Limiting the analysis to AAL related activities only thealgorithm achieves interesting results in all the datasetsThe group of 8 activities labeled as Actions in the KARD

12 Computational Intelligence and Neuroscience

Table 11 Average accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting compared to otherworks

Algorithm AccuracyCeliktutan et al (2015) [51] 729

Azary and Savakis (2013) [52] 785

Proposed (119875 = 7) 812

Azary and Savakis (2012) [50] 839

Chaaraoui et al (2013) [15] 906

Chaaraoui et al (2014) [36] 935

dataset is recognized with an accuracy greater than 987in the considered experiments and the group includes twosimilar activities such as phone call and drink The CAD-60dataset contains only actions so all the activities consideredwithin the proper location have been included in the eval-uation resulting in global precision and recall of 939 and935 respectively In the UTKinect dataset the performanceimproves considering only the AAL activities and the samehappens also in the Florence3D having the highest accuracyclose to 91 even if some actions are very similar

The MSR Action3D is a challenging dataset mainlycomprising gestures for human-computer interaction andnot actions Many gestures especially the ones included inAS2 subset are very similar to each other In order to improvethe recognition accuracy more complex features should beconsidered including not only jointsrsquo relative positions butalso their velocity Many sequences contain noisy skeletondata another approach to improving recognition accuracycould be the development of a method to discard noisyskeleton or to include depth-based features

However the AAL scenario raises some problems whichhave to be addressed First of all the algorithm exploits amul-ticlass SVM which is a good classifier but it does not makeeasy to understand if an activity belongs or not to any of thetraining classes In fact in a real scenario it is possible to havea sequence of frames that does not represent any activity ofthe training set in this case the SVM outputs the most likelyclass anyway even if it does not make sense Other machinelearning algorithms such as HMMs distinguish amongmultiple classes using the maximum posterior probabilityGaglio et al [29] proposed using a threshold on the outputprobability to detect unknown actions Usually SVMs do notprovide output probabilities but there exist some methodsto extend SVM implementations and make them able toprovide also this information [53 54] These techniquescan be investigated to understand their applicability in theproposed scenario

Another problem is segmentation of actions In manydatasets the actions are represented as segmented sequencesof frames but in real applications the algorithm has to handlea continuous stream of frames and to segment actions byitself Some solutions for segmentation have been proposedbut most of them are based on thresholds on movementswhich can be highly data-dependent [46] Also this aspect hasto be further investigated to have a systemwhich is effectivelyapplicable in a real AAL scenario

6 Conclusions

In this work a simple yet effective activity recognitionalgorithm has been proposed It is based on skeleton dataextracted from an RGBD sensor and it creates a featurevector representing the whole activity It is able to overcomestate-of-the-art results in two publicly available datasets theKARD and CAD-60 outperforming more complex algo-rithms in many conditions The algorithm has been testedalso over other more challenging datasets the UTKinect andFlorence3D where it is outperformed only by algorithmsexploiting temporal alignment techniques or a combinationof several machine learning methods The MSR Action3Dis a complex dataset and the proposed algorithm strugglesin the recognition of subsets constituted by similar gesturesHowever this dataset does not contain any activity of interestfor AAL scenarios

Future works will concern the application of the activityrecognition algorithm to AAL scenarios by consideringaction segmentation and the detection of unknown activities

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to acknowledge the contributionof the COST Action IC1303 AAPELE (Architectures Algo-rithms and Platforms for Enhanced Living Environments)

References

[1] P Rashidi and A Mihailidis ldquoA survey on ambient-assistedliving tools for older adultsrdquo IEEE Journal of Biomedical andHealth Informatics vol 17 no 3 pp 579ndash590 2013

[2] O C Ann and L B Theng ldquoHuman activity recognition areviewrdquo in Proceedings of the IEEE International Conference onControl System Computing and Engineering (ICCSCE rsquo14) pp389ndash393 IEEE Batu Ferringhi Malaysia November 2014

[3] S Wang and G Zhou ldquoA review on radio based activityrecognitionrdquo Digital Communications and Networks vol 1 no1 pp 20ndash29 2015

[4] L Atallah B Lo R Ali R King and G-Z Yang ldquoReal-timeactivity classificationusing ambient andwearable sensorsrdquo IEEETransactions on Information Technology in Biomedicine vol 13no 6 pp 1031ndash1039 2009

[5] D De P Bharti S K Das and S Chellappan ldquoMultimodalwearable sensing for fine-grained activity recognition in health-carerdquo IEEE Internet Computing vol 19 no 5 pp 26ndash35 2015

[6] J K Aggarwal and L Xia ldquoHuman activity recognition from 3Ddata a reviewrdquo Pattern Recognition Letters vol 48 pp 70ndash802014

[7] S Gasparrini E Cippitelli E Gambi S Spinsante and FFlorez-Revuelta ldquoPerformance analysis of self-organising neu-ral networks tracking algorithms for intake monitoring usingkinectrdquo in Proceedings of the 1st IET International Conferenceon Technologies for Active and Assisted Living (TechAAL rsquo15)Kingston UK November 2015

Computational Intelligence and Neuroscience 13

[8] J Shotton A Fitzgibbon M Cook et al ldquoReal-time humanpose recognition in parts from single depth imagesrdquo in Pro-ceedings of the IEEE Conference on Computer Vision and PatternRecognition (CVPR rsquo11) pp 1297ndash1304 IEEE Providence RdUSA June 2011

[9] J R Padilla-Lopez A A Chaaraoui F Gu and F Florez-Revuelta ldquoVisual privacy by context proposal and evaluationof a level-based visualisation schemerdquo Sensors vol 15 no 6 pp12959ndash12982 2015

[10] X Yang and Y Tian ldquoSuper normal vector for activity recog-nition using depth sequencesrdquo in Proceedings of the 27th IEEEConference on Computer Vision and Pattern Recognition (CVPRrsquo14) pp 804ndash811 IEEE Columbus Ohio USA June 2014

[11] R Slama H Wannous and M Daoudi ldquoGrassmannian rep-resentation of motion depth for 3D human gesture and actionrecognitionrdquo inProceedings of the 22nd International Conferenceon Pattern Recognition (ICPR rsquo14) pp 3499ndash3504 IEEE Stock-holm Sweden August 2014

[12] O Oreifej and Z Liu ldquoHON4D histogram of oriented 4Dnormals for activity recognition from depth sequencesrdquo inProceedings of the 26th IEEEConference onComputer Vision andPattern Recognition (CVPR rsquo13) pp 716ndash723 IEEE June 2013

[13] H Rahmani AMahmood D QHuynh andAMian ldquoHOPChistogram of oriented principal components of 3D pointcloudsfor action recognitionrdquo in Computer VisionmdashECCV 2014 DFleet T Pajdla B Schiele and T Tuytelaars Eds vol 8690 ofLecture Notes in Computer Science pp 742ndash757 Springer 2014

[14] G Chen D Clarke M Giuliani A Gaschler and A KnollldquoCombining unsupervised learning and discrimination for 3Daction recognitionrdquo Signal Processing vol 110 pp 67ndash81 2015

[15] A A Chaaraoui J R Padilla-Lopez and F Florez-RevueltaldquoFusion of skeletal and silhouette-based features for humanaction recognition with RGB-D devicesrdquo in Proceedings ofthe 14th IEEE International Conference on Computer VisionWorkshops (ICCVW rsquo13) pp 91ndash97 IEEE Sydney AustraliaDecember 2013

[16] S Althloothi M H Mahoor X Zhang and R M VoylesldquoHuman activity recognition using multi-features and multiplekernel learningrdquo Pattern Recognition vol 47 no 5 pp 1800ndash1812 2014

[17] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal featuresand joints for 3D action recognitionrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo13) pp 486ndash491 IEEE June 2013

[18] G Willems T Tuytelaars and L Van Gool ldquoAn efficient denseand scale-invariant spatio-temporal interest point detectorrdquo inComputer VisionmdashECCV 2008 D Forsyth P Torr and AZisserman Eds vol 5303 of Lecture Notes in Computer Sciencepp 650ndash663 Springer Berlin Germany 2008

[19] A Klaser M Marszałek and C Schmid ldquoA spatio-temporaldescriptor based on 3D-gradientsrdquo in Proceedings of the 19thBritish Machine Vision Conference (BMVC rsquo08) pp 995ndash1004September 2008

[20] J LuoWWang and H Qi ldquoSpatio-temporal feature extractionand representation for RGB-D human action recognitionrdquoPattern Recognition Letters vol 50 pp 139ndash148 2014

[21] Y Zhu W Chen and G Guo ldquoEvaluating spatiotemporalinterest point features for depth-based action recognitionrdquoImage and Vision Computing vol 32 no 8 pp 453ndash464 2014

[22] A-A Liu W-Z Nie Y-T Su L Ma T Hao and Z-X YangldquoCoupled hidden conditional random fields for RGB-D humanaction recognitionrdquo Signal Processing vol 112 pp 74ndash82 2015

[23] M Devanne H Wannous S Berretti P Pala M Daoudiand A Del Bimbo ldquoSpace-time pose representation for 3Dhuman action recognitionrdquo inNewTrends in ImageAnalysis andProcessingmdashICIAP 2013 vol 8158 of Lecture Notes in ComputerScience pp 456ndash464 Springer Berlin Germany 2013

[24] L Gan and F Chen ldquoHuman action recognition using APJ3Dand random forestsrdquo Journal of Software vol 8 no 9 pp 2238ndash2245 2013

[25] J Wang Z Liu Y Wu and J Yuan ldquoMining actionlet ensemblefor action recognition with depth camerasrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo12) pp 1290ndash1297 Providence RI USA June 2012

[26] L Xia C-C Chen and J K Aggarwal ldquoView invariant humanaction recognition using histograms of 3D jointsrdquo inProceedingsof the IEEE Computer Society Conference on Computer Visionand Pattern Recognition Workshops (CVPRW rsquo12) pp 20ndash27Providence RI June 2012

[27] X Yang and Y Tian ldquoEffective 3D action recognition usingEigen Jointsrdquo Journal of Visual Communication and ImageRepresentation vol 25 no 1 pp 2ndash11 2014

[28] A Taha H H Zayed M E Khalifa and E-S M El-HorbatyldquoHuman activity recognition for surveillance applicationsrdquo inProceedings of the 7th International Conference on InformationTechnology pp 577ndash586 2015

[29] S Gaglio G Lo Re and M Morana ldquoHuman activity recog-nition process using 3-D posture datardquo IEEE Transactions onHuman-Machine Systems vol 45 no 5 pp 586ndash597 2015

[30] I Theodorakopoulos D Kastaniotis G Economou and SFotopoulos ldquoPose-based human action recognition via sparserepresentation in dissimilarity spacerdquo Journal of Visual Commu-nication and Image Representation vol 25 no 1 pp 12ndash23 2014

[31] WDing K Liu F Cheng and J Zhang ldquoSTFC spatio-temporalfeature chain for skeleton-based human action recognitionrdquoJournal of Visual Communication and Image Representation vol26 pp 329ndash337 2015

[32] R Slama HWannous M Daoudi and A Srivastava ldquoAccurate3D action recognition using learning on the Grassmann mani-foldrdquo Pattern Recognition vol 48 no 2 pp 556ndash567 2015

[33] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 27th IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR rsquo14) pp 588ndash595Columbus Ohio USA June 2014

[34] R Anirudh P Turaga J Su and A Srivastava ldquoElastic func-tional coding of human actions From vector-fields to latentvariablesrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR rsquo15) pp 3147ndash3155Boston Mass USA June 2015

[35] M Jiang J Kong G Bebis and H Huo ldquoInformative jointsbased human action recognition using skeleton contextsrdquo SignalProcessing Image Communication vol 33 pp 29ndash40 2015

[36] A A Chaaraoui J R Padilla-Lopez P Climent-Perez andF Florez-Revuelta ldquoEvolutionary joint selection to improvehuman action recognitionwith RGB-Ddevicesrdquo Expert Systemswith Applications vol 41 no 3 pp 786ndash794 2014

[37] S Baysal M C Kurt and P Duygulu ldquoRecognizing humanactions using key posesrdquo in Proceedings of the 20th InternationalConference on Pattern Recognition (ICPR rsquo10) pp 1727ndash1730Istanbul Turkey August 2010

[38] A A Chaaraoui P Climent-Perez and F Florez-RevueltaldquoSilhouette-based human action recognition using sequences of

14 Computational Intelligence and Neuroscience

key posesrdquo Pattern Recognition Letters vol 34 no 15 pp 1799ndash1807 2013

[39] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[40] C-W Hsu and C-J Lin ldquoA comparison of methods for mul-ticlass support vector machinesrdquo IEEE Transactions on NeuralNetworks vol 13 no 2 pp 415ndash425 2002

[41] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[42] OpenNI 2015 httpstructureioopenni[43] J Sung C Ponce B Selman and A Saxena ldquoUnstructured

human activity detection fromRGBD imagesrdquo in Proceedings ofthe IEEE International Conference on Robotics and Automation(ICRA rsquo12) pp 842ndash849 Saint Paul Minn USA May 2012

[44] L Seidenari V Varano S Berretti A Del Bimbo and P PalaldquoRecognizing actions from depth cameras as weakly alignedmulti-part Bag-of-Posesrdquo in Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition Workshops(CVPRW rsquo13) pp 479ndash485 Portland Ore USA June 2013

[45] W Li Z Zhang and Z Liu ldquoAction recognition based ona bag of 3D pointsrdquo in Proceedings of the IEEE ComputerSociety Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo10) pp 9ndash14 San Francisco Calif USAJune 2010

[46] J Shan and S Akella ldquo3D human action segmentation andrecognition using pose kinetic energyrdquo in Proceedings of theIEEE Workshop on Advanced Robotics and its Social Impacts(ARSO rsquo14) pp 69ndash75 Evanston Ill USA September 2014

[47] D R Faria C Premebida and U Nunes ldquoA probabilisticapproach for human everyday activities recognition using bodymotion from RGB-D imagesrdquo in Proceedings of the 23rd IEEEInternational Symposium on Robot and Human InteractiveCommunication (RO-MAN rsquo14) pp 732ndash737 Edinburgh UKAugust 2014

[48] G I Parisi C Weber and S Wermter ldquoSelf-organizing neuralintegration of pose-motion features for human action recogni-tionrdquo Frontiers in Neurorobotics vol 9 no 3 2015

[49] J R Padilla-Lopez A A Chaaraoui and F Florez-Revuelta ldquoAdiscussion on the validation tests employed to compare humanaction recognition methods using the MSR Action3D datasetrdquohttparxivorgabs14077390

[50] S Azary and A Savakis ldquo3D Action classification using sparsespatio-temporal feature representationsrdquo in Advances in VisualComputing G Bebis R Boyle B Parvin et al Eds vol 7432 ofLecture Notes in Computer Science pp 166ndash175 Springer BerlinGermany 2012

[51] O Celiktutan C Wolf B Sankur and E Lombardi ldquoFast exacthyper-graph matching with dynamic programming for spatio-temporal datardquo Journal ofMathematical Imaging andVision vol51 no 1 pp 1ndash21 2015

[52] S Azary and A Savakis ldquoGrassmannian sparse representationsand motion depth surfaces for 3D action recognitionrdquo inProceedings of the IEEE Conference on Computer Vision andPattern Recognition Workshops (CVPRW rsquo13) pp 492ndash499Portland Ore USA June 2013

[53] J C Platt ldquoProbabilistic outputs for support vector machinesand comparisons to regularized likelihood methodsrdquo inAdvances in Large Margin Classifiers pp 61ndash74 MIT PressCambridge Mass USA 1999

[54] H-T Lin C-J Lin and R C Weng ldquoA note on Plattrsquosprobabilistic outputs for support vector machinesrdquo MachineLearning vol 68 no 3 pp 267ndash276 2007

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 4: Research Article A Human Activity Recognition System Using Skeleton …downloads.hindawi.com/journals/cin/2016/4351435.pdf · 2019-07-30 · Kinect skeleton is motivated by the fact

4 Computational Intelligence and Neuroscience

1

23

4

Test Test

Train

LA

(A L)

SVM1

SVM2

SVM3C1

C2

C3

CK

SVMM(Mminus1)2

Classification

Activity features computation Postures selection

Posture features extraction

A = [C1C2C3 CK]

s1 = [J0 J1 J2 J3 JPminus1]s2 = [J0 J1 J2 J3 JPminus1]s3 = [J0 J1 J2 J3 JPminus1]s4 = [J0 J1 J2 J3 JPminus1]

sN = [J0 J1 J2 J3 JPminus1]

f1 = [d1 d2 d3 dPminus1]f2 = [d1 d2 d3 dPminus1]f3 = [d1 d2 d3 dPminus1]f4 = [d1 d2 d3 dPminus1]

fN = [d1 d2 d3 dPminus1]f = [d1 d2 d3 dPminus1]

Ji = ith joint

di =Ji minus J0J2 minus J0

i = 1 2 P minus 1

Figure 1 The general scheme of the activity recognition algorithm is composed of 4 steps In the first step the posture feature vectors arecomputed for each skeleton frame then the postures are selected and an activity features vector is created Finally amulticlass SVM is exploitedfor classification

algorithm is used to process 119873 feature vectors constitutingthe activity by grouping them into 119870 clusters The well-known 119896-means clustering algorithm based on the squaredEuclidean distance as a metric can be used to group togetherthe frames representing similar postures Considering anactivity composed of119873 feature vectors [f

1 f2 f3 f

119873] the

119896-means algorithm gives as outputs 119873 clusters ID (one foreach feature vector) and 119870 vectors [C

1C2C3 C

119870] that

represent the centroids of each cluster The feature vectorsare partitioned into clusters 119878

1 1198782 119878

119870so as to satisfy the

condition expressed by

argminS

119870

sum

119895=1

sum

f119894isin119878119895

10038171003817100381710038171003817f119894minus C119895

10038171003817100381710038171003817

2

(3)

The 119870 centroids can be seen as the main postures whichare the most important feature vectors Unlike classicalapproaches based on key poses where the most informativepostures are evaluated by considering all the sequencesof each activity in the proposed solution the clusteringalgorithm is executed for each sequence This avoids theapplication of a learning algorithm required to associate eachframe to the closest key pose and allows to have a morecompact representation of the activity

The third phase is related to the computation of a featurevector which models the whole activity starting from the119870 centroid vectors computed by the clustering algorithmIn more detail [C

1C2C3 C

119870] vectors are sorted con-

sidering the order in which the clusterrsquos elements occurduring the activity The activity features vector is composedof concatenating the sorted centroid vectors For example

considering an activity featured by 119873 = 10 and 119870 = 4after running the 119896-means algorithm one of the possibleoutputs could be the following sequence of cluster IDs[2 2 2 3 3 1 1 4 4 4] meaning that the first three posturevectors belong to cluster 2 the fourth and the fifth areassociated with cluster 3 and so on In this case the activityfeatures vector is A = [C

2C3C1C4] A feature activity

vector has a dimension of 3119870(119875 minus 1) that can be handledwithout using dimensionality reduction algorithms such asPCA if 119870 is small

The classification step aims to associate each featureactivity vector to the correct activity Many machine learningalgorithms may be applied to fulfil this task among them aSVM Considering a number of 119897 training vectors x

119894isin 119877119899 and

a vector of labels 119910 isin 119877119897 where 119910119894isin minus1 1 a binary SVM

can be formulated as follows [39]

minw119887120585

1

2w119879w + 119862

119897

sum

119894=1

120585119894

subject to 119910119894(w119879120601 (x

119894) + 119887) ge 1 minus 120585

119894

120585119894ge 0 119894 = 1 119897

(4)

where

w119879120601 (x) + 119887 = 0 (5)

is the optimal hyperplane that allows separation betweenclasses in the feature space119862 is a constant and 120585

119894are nonneg-

ative variables which consider training errorsThe function 120601allows transforming between the features space and an higher

Computational Intelligence and Neuroscience 5

12

30456

78

910

1112

131415

16 17

18 19

(a)

12

30456

78

910

1112

131415

16 17

18 19

(b)

12

30456

78

910

1112

131415

1617

18 19

(c)

12

30456

78

910

1112

131415

16 17

18 19(d)

Figure 2 Subsets of joints considered in the evaluation of the proposed algorithmThewhole skeleton is represented by 20 joints the selectedones are depicted as green circles while the discarded joints are represented by red squares Subsets of 7 (a) 11 (b) and 15 (c) joints and thewhole set of 20 joints (d) are selected

dimensional space where the data are separable Consideringtwo training vectors x

119894and x

119895 the kernel function can be

defined as

119870(x119894 x119895) = 120601 (x

119894)119879120601 (x119895) (6)

In this work the Radial Basis Function (RBF) kernel has beenused where

119870(x119894 x119895) = 119890minus120574x119894minusx1198952 120574 gt 0 (7)

It follows that 119862 and 120574 are the parameters that have to beestimated prior using the SVM

The idea herein exploited is to use a multiclass SVMwhere each class represents an activity of the dataset In orderto extend the role from a binary to amulticlass classifier someapproaches have been proposed in the literature In [40] theauthors compared many methods and found that the ldquoone-against-onerdquo is one of the most suitable for practical use Itis implemented in LIBSVM [41] and it is the approach usedin this workThe ldquoone-against-onerdquo approach is based on theconstruction of several binary SVM classifiers inmore detaila number of119872(119872minus1)2 binary SVMs are necessary in a119872-classes dataset This happens because each SVM is trained todistinguish between 2 classes and the final decision is takenexploiting a voting strategy among all the binary classifiersDuring the training phase the activity feature vectors aregiven as inputs to the multiclass SVM together with the label119871 of the action In the test phase the activity label is obtainedfrom the classifier

4 Experimental Results

Thealgorithmperformance is evaluated on five publicly avail-able datasets In order to perform an objective comparisonto previous works the reference test procedures have beenconsidered for each dataset The performance indicators areevaluated using four different subsets of joints shown inFigure 2 going from a minimum of 7 up to a maximum of20 joints Finally in order to evaluate the performance inAAL scenarios some suitable activities related to AAL areselected from the datasets and the recognition accuracies areevaluated

41 Datasets Five different 3D datasets have been consideredin this work each of them including a different set of activitiesand gestures

KARD dataset [29] is composed of 18 activities that canbe divided into 10 gestures (horizontal arm wave high armwave two-hand wave high throw draw X draw tick forwardkick side kick bend and hand clap) and eight actions (catchcap toss paper take umbrellawalk phone call drink sit downand stand up) This dataset has been captured in a controlledenvironment that is an office with a static background anda Kinect device placed at a distance of 2-3m from the subjectSome objects were present in the area useful to perform someof the actions a desk with a phone a coat rack and a wastebin The activities have been performed by 10 young people(nine males and one female) aged from 20 to 30 years andfrom 150 to 185 cm tall Each person repeated each activity3 times creating a number of 540 sequences The dataset iscomposed of RGB and depth frames captured at a rate of30 fps with a 640 times 480 resolution In addition 15 joints ofthe skeleton in world and screen coordinates are providedThe skeleton has been captured using OpenNI libraries [42]

The Cornell Activity Dataset (CAD-60) [43] is madeby 12 different activities typical of indoor environmentsThe activities are rinsing mouth brushing teeth wearingcontact lens talking on the phone drinking water openingpill container cooking-chopping cooking-stirring talking oncouch relaxing on couch writing on whiteboard and workingon computer and are performed in 5 different environmentsbathroom bedroom kitchen living room and office All theactivities are performed by 4 different people two males andtwo females one of which is left-handed No instructionswere given to actors about how to perform the activitiesthe authors simply ensured that the skeleton was correctlydetected by Kinect The dataset is composed of RGB depthand skeleton data with 15 joints available

The UTKinect dataset [26] is composed of 10 differentsubjects (9males and 1 female) performing 10 activities twiceThe following activities are part of the dataset walk sit downstand up pick up carry throw push pull wave and claphands A number of 199 sequences are available because onesequence is not labeled and the length of sample actions

6 Computational Intelligence and Neuroscience

ranges from 5 to 120 frames The dataset provides 640 times 480RGB frames and 320 times 240 depth frames together with 20skeleton joints captured using Kinect forWindows SDKBetaVersion with a final frame rate of about 15 fps

The Florence3D dataset [44] includes 9 different activi-ties wave drink from a bottle answer phone clap tight lacesit down stand up read watch and bow These activities wereperformed by 10 different subjects for 2 or 3 times resultingin a total number of 215 sequences This is a challengingdataset since the same action is performed with both handsand because of the presence of very similar actions such asdrink from a bottle and answer phone The activities wererecorded in different environments and only RGB videos and15 skeleton joints are available

Finally the MSR Action3D [45] represents one of themost used datasets for HAR It includes 20 activities per-formed by 10 subjects 2 or 3 times In total 567 sequencesof depth (320 times 240) and skeleton frames are providedbut 10 of them have to be discarded because the skeletonsare either missing or affected by too many errors Thefollowing activities are included in the dataset high armwavehorizontal arm wave hammer hand catch forward punchhigh throw draw X draw tick draw circle hand clap two-hand wave side boxing bend forward kick side kick joggingtennis swing tennis serve golf swing and pickup and throwThe dataset has been collected using a structured-light depthcamera at 15 fps RGB data are not available

42 Tests and Results The proposed algorithm has beentested over the datasets detailed above following the recom-mendations provided in each reference paper in order toensure a fair comparison to previous works Following thiscomparison a more AAL oriented evaluation has been con-ducted which consists of considering only suitable actionsfor each dataset excluding the gestures Another type ofevaluation regards the subset of skeleton joints that has to beincluded in the feature computation

421 KARD Dataset Gaglio et al [29] collected the KARDdataset and proposed some evaluation experiments on itThey considered three different experiments and two modal-ities of dataset splitting The experiments are as follows

(i) Experiment A one-third of the data is considered fortraining and the rest for testing

(ii) Experiment B two-thirds of the data is considered fortraining and the rest for testing

(iii) Experiment C half of the data is considered fortraining and the rest for testing

The activities constituting the dataset are split in the followinggroups

(i) Gestures and Actions(ii) Activity Set 1 Activity Set 2 and Activity Set 3 as

listed in Table 1 Activity Set 1 is the simplest onesince it is composed of quite different activities whilethe other two sets include more similar actions andgestures

Table 1 Activity sets grouping different and similar activities fromKARD dataset

Activity Set 1 Activity Set 2 Activity Set 3Horizontal arm wave High arm wave Draw tickTwo-hand wave Side kick DrinkBend Catch cap Sit downPhone call Draw tick Phone callStand up Hand clap Take umbrellaForward kick Forward kick Toss paperDraw X Bend High throwWalk Sit down Horizontal arm wave

Each experiment has been repeated 10 times randomlysplitting training and testing data Finally the ldquonew-personrdquoscenario is also performed that is a leave-one-actor-outsetting consisting of training the system on nine of the tenpeople of the dataset and testing on the tenth In the ldquonew-personrdquo test no recommendation is provided about how tosplit the dataset so we assumed that the whole dataset of 18activities is considered The only parameter that can be setin the proposed algorithm is the number of clusters 119870 anddifferent subsets of skeleton joints are considered Since only15 skeleton joints are available the fourth group of 20 joints(Figure 2(d)) cannot be considered

For each test conducted on KARD dataset the sequenceof clusters 119870 = [3 5 10 15 20 25 30 35] has been consid-ered The results concerning Activity Set tests are reported inTable 2 it is shown that the proposed algorithm outperformsthe originally proposed one in almost all the tests The 119870parameter which gives the maximum accuracy is quite highsince it is 30 or 35 in most of the cases It means that forthe KARD dataset it is better to have a significant numberof clusters representing each activity This is possible alsobecause the number of frames that constitutes each activitygoes from aminimumof 42 in a sequence of hand clap gestureto a maximum of 310 for a walk sequence Experiment Ain Activity Set 1 shows the highest difference between theminimum and the maximum accuracy when varying thenumber of clusters For example considering 119875 = 7 themaximum accuracy (982) is shown in Table 2 and it isobtained with 119870 = 35 The minimum accuracy is 914 andit is obtained with 119870 = 5 The difference is quite high (68)but this gap is reduced by considering more training data Infact Experiment B shows a difference of 05 in Activity Set1 11 in Activity Set 2 and 26 in Activity Set 3

Considering the number of selected joints the observa-tion of Tables 2 and 3 lets us conclude that not all the jointsare necessary to achieve good recognition results In moredetail from Table 2 it can be noticed that the Activity Set 1and Activity Set 2 which are the simplest ones have goodrecognition results using a subset composed of 119875 = 7 jointsThe Activity Set 3 composed of more similar activities isbetter recognized with 119875 = 11 joints In any case it is notnecessary to consider all the skeleton joints provided fromKARD dataset

The results obtained with the ldquonew-personrdquo scenario areshown in Table 4 The best result is obtained with 119870 = 30

Computational Intelligence and Neuroscience 7

Table 2 Accuracy () of the proposed algorithm compared to the other using KARD dataset with different Activity Sets and for differentexperiments

Activity Set 1 Activity Set 2 Activity Set 3A B C A B C A B C

Gaglio et al [29] 951 991 930 899 949 901 842 895 817

Proposed (119875 = 7) 982 984 981 997 100 997 902 950 913

Proposed (119875 = 11) 980 990 977 998 100 996 916 958 933Proposed (119875 = 15) 975 988 976 995 100 996 91 951 932

090 003 007097 003

100100

003 093 003007 007 083 003007 007 077 010

003 093 003100

100100

100100

100090 010

003 003 007 087100

100

Hor

arm

wav

e

Hig

h ar

m w

ave

Two-

hand

wav

e

Catch

cap

Hig

h th

row

Dra

w X

Dra

w tic

k

Toss

pape

r

Forw

kick

Side

kick

Take

um

br

Bend

Han

d cla

p

Wal

k

Phon

e cal

l

Drin

k

Sit d

own

Stan

d up

Hor arm waveHigh arm wave

Two-hand waveCatch cap

High throwDraw X

Draw tickToss paperForw kick

Side kickTake umbr

BendHand clap

WalkPhone call

DrinkSit downStand up

Figure 3 Confusion matrix of the ldquonew-personrdquo test on the whole KARD dataset

Table 3 Accuracy () of the proposed algorithm compared tothe other using KARD dataset with dataset split in Gestures andActions for different experiments

Gestures ActionsA B C A B C

Gaglio et al [29] 865 930 867 925 950 901

Proposed (119875 = 7) 899 935 925 991 996 994Proposed (119875 = 11) 899 959 937 990 999 991Proposed (119875 = 15) 874 936 928 987 995 993

Table 4 Precision () and recall () of the proposed algorithmcompared to the other using the whole KARD dataset and ldquonew-personrdquo setting

Algorithm Precision RecallGaglio et al [29] 848 845

Proposed (119875 = 7) 940 937

Proposed (119875 = 11) 951 950Proposed (119875 = 15) 950 948

using a number of 119875 = 11 joints The overall precision andrecall are about 10higher than the previous approachwhichuses the KARD dataset In this condition the confusion

matrix obtained is shown in Figure 3 It can be noticed thatthe actions are distinguished very well only phone call anddrink show a recognition accuracy equal or lower than 90and sometimes they are mixed with each other The mostcritical activities are the draw X and draw tick gestures whichare quite similar

From the AAL point of view only some activities arerelevant Table 3 shows that the eight actions constituting theKARD dataset are recognized with an accuracy greater than98 even if there are some similar actions such as sit downand stand up or phone call and drink Considering only theActions subset the lower recognition accuracy is 949and itis obtained with119870 = 5 and119875 = 7 It means that the algorithmis able to reach a high recognition rate even if the featurevector is limited to 90 elements

422 CAD-60 Dataset TheCAD-60 dataset is a challengingdataset consisting of 12 activities performed by 4 people in5 different environments The dataset is usually evaluatedby splitting the activities according to the environmentthe global performance of the algorithm is given by theaverage precision and recall among all the environments Twodifferent settings were experimented for CAD-60 in [43]Theformer is defined ldquonew-personrdquo and the latter is the so-calledldquohave-seenrdquo ldquoNew-personrdquo setting has been considered in all

8 Computational Intelligence and Neuroscience

the works using CAD-60 so it is the one selected also in thiswork

The most challenging element of the dataset is thepresence of a left-handed actor In order to increase the per-formance which are particularly affected by this unbalancingin the ldquonew-personrdquo test mirrored copies of each action arecreated as suggested in [43] For each actor a left-handed anda right-handed version of each action are made availableThedummyversion of the activity has been obtained bymirroringthe skeleton with respect to the virtual sagittal plane that cutsthe person in a half The proposed algorithm is evaluatedusing three different sets of joints from 119875 = 7 to 119875 = 15and the sequence of clusters 119870 = [3 5 10 15 20 25 30 35]as in the KARD dataset

The best results are obtained with the configurations 119875 =11 and 119870 = 25 and the performance in terms of precisionand recall for each activity is shown in Table 5 Very goodresults are given in office environment where the averageprecision and recall are 964 and 958 respectively Infact the activities of this environment are quite different onlytalking on phone and drinking water are similar On the otherhand the living room environment includes talking on couchand relaxing on couch in addition to talking on phone anddrinking water and it is the most challenging case since theaverage precision and recall are 91 and 906

The proposed algorithm is compared to other worksusing the same ldquonew-personrdquo setting and the results areshown in Table 6 which tells that the 119875 = 11 configurationoutperforms the state-of-the-art results in terms of precisionand it is only 1 lower in terms of recall Shan and Akella[46] achieve very good results using a multiclass SVMscheme with a linear kernel However they train and testmirrored actions separately and then merge the results whencomputing average precision and recall Our approach simplyconsiders two copies of the same action given as input to themulticlass SVM and retrieves the classification results

The reduced number of joints does not affect too muchthe average performance of the algorithm that reaches aprecision of 927 and a recall of 915 with 119875 = 7Using all the available jonts on the other hand brings toa more substantial reduction of the performance showing879 and 867 for precision and recall respectively with119875 = 15 The best results for the proposed algorithm werealways obtained with a high number of clusters (25 or30) The reduction of this number affects the performancefor example considering the 119875 = 11 subset the worstperformance is obtained with 119870 = 5 and with a precision of866 and a recall of 860

From theAALpoint of view this dataset is composed onlyof actions not gestures so the dataset does not have to beseparated to evaluate the performance in a scenario which isclose to AAL

423 UTKinect Dataset The UTKinect dataset is composedof 10 activities performed twice by 10 subjects and theevaluation setting proposed in [26] is the leave-one-out-cross-validation (LOOCV) which means that the systemis trained on all the sequences except one and that one isused for testing Each trainingtesting procedure is repeated

Table 5 Precision () and recall () of the proposed algorithm inthe different environments of CAD-60 with 119875 = 11 and 119870 = 25

Location Activity ldquoNew-personrdquoPrecision Recall

Bathroom

Brushing teeth 889 100

Rinsing mouth 923 100

Wearing contact lens 100 792

Average 937 931

Bedroom

Talking on phone 917 917

Drinking water 913 875

Opening pill container 960 100

Average 930 931

Kitchen

Cooking-chopping 857 100

Cooking-stirring 100 791

Drinking water 960 100

Opening pill container 100 100

Average 954 948

Living room

Talking on phone 875 875

Drinking water 875 875

Talking on couch 889 100

Relaxing on couch 100 875

Average 910 906

Office

Talking on phone 100 875

Writing on whiteboard 100 958

Drinking water 857 100

Working on computer 100 100

Average 964 958Global average 939 935

Table 6 Global precision () and recall () of the proposedalgorithm for CAD-60 dataset and ldquonew-personrdquo setting withdifferent subsets of joints compared to other works

Algorithm Precision RecallSung et al [43] 679 555

Gaglio et al [29] 773 767

Proposed (119875 = 15) 879 867

Faria et al [47] 911 919

Parisi et al [48] 919 902

Proposed (119875 = 7) 927 915

Shan and Akella [46] 938 945Proposed (119875 = 11) 939 935

20 times to reduce the random effect of 119896-means For thisdataset all the different subsets of joints shown in Figure 2are considered since the skeleton is captured usingMicrosoftSDK which provides 20 joints The considered sequence ofclusters is only 119870 = [3 4 5] because the minimum numberof frames constituting an action sequence is 5

The results compared with previous works are shownin Table 7 The best results for the proposed algorithm areobtained with 119875 = 7 which is the smallest set of jointsconsidered The result corresponds to a number of 119870 = 4clusters but the difference with the other clusters is very low

Computational Intelligence and Neuroscience 9

093 002 005

094 006

099 001

100

002 098W

alk

Sit d

own

Stan

d up

Pick

up

Carr

y

Walk

Sit down

Stand up

Pick up

Carry

(a)

088 004 008

095 005

100

003 097

010 090

Wal

k

Sit d

own

Stan

d up

Pick

up

Carr

y

Walk

Sit down

Stand up

Pick up

Carry

(b)

Figure 4 Confusion matrices of the UTKinect dataset with only AAL related activities (a) Best accuracy confusion matrix obtained with119875 = 7 (b) Worst accuracy confusion matrix obtained with 119875 = 15

Table 7 Global accuracy () of the proposed algorithm forUTKinect dataset and LOOCV setting with different subsets ofjoints compared to other works

Algorithm AccuracyXia et al [26] 909

Theodorakopoulos et al [30] 9095

Ding et al [31] 915

Zhu et al [17] 919

Jiang et al [35] 919

Gan and Chen [24] 920

Liu et al [22] 920

Proposed (119875 = 15) 931

Proposed (119875 = 11) 942

Proposed (119875 = 19) 943

Anirudh et al [34] 949

Proposed (119875 = 7) 951

Vemulapalli et al [33] 971

only 06 with 119870 = 5 that provided the worst result Theselection of different sets of joints from 119875 = 7 to 119875 = 15changes the accuracy only by a 2 from 931 to 951In this dataset the main limitation to the performance isgiven by the reduced number of frames that constitute somesequences and limits the number of clusters representing theactions Vemulapalli et al [33] reach the highest accuracybut the approach is much more complex after modelingskeleton joints in a Lie group the processing scheme includesDynamic TimeWarping to perform temporal alignments anda specific representation called Fourier Temporal Pyramidbefore classification with one-versus-all multiclass SVM

Contextualizing the UTKinect dataset to AAL involvesthe consideration of a subset of activities which includes onlyactions and discards gestures In more detail the following 5activities have been selectedwalk sit down stand up pick upand carry In this condition the highest accuracy (967) isstill given by119875 = 7 with 3 clusters and the lower one (941)is represented by 119875 = 15 again with 3 clustersThe confusion

Table 8 Global accuracy () of the proposed algorithm forFlorence3D dataset and ldquonew-personrdquo setting with different subsetsof joints compared to other works

Algorithm AccuracySeidenari et al [44] 820

Proposed (119875 = 7) 821

Proposed (119875 = 15) 847

Proposed (119875 = 11) 861

Anirudh et al [34] 897

Vemulapalli et al [33] 909

Taha et al [28] 962

matrix for these two configurations are shown in Figure 4where the main difference is the reduced misclassificationbetween the activities walk and carry that are very similarto each other

424 Florence3D Dataset The Florence3D dataset is com-posed of 9 activities and 10 people performing themmultipletimes resulting in a total number of 215 activities Theproposed setting to evaluate this dataset is the leave-one-actor-out which is equivalent to the ldquonew-personrdquo settingpreviously described The minimum number of frames is 8so the following sequence of clusters has been considered119870 = [3 4 5 6 7 8] A number of 15 joints are available forthe skeleton so only the first three schemes are included inthe tests

Table 8 shows the results obtained with different subsetsof joints where the best result for the proposed algorithmis given by 119875 = 11 with 6 clusters The choice of clustersdoes not significantly affect the performance because themaximum reduction is 2 In this dataset the proposedapproach does not achieve state-of-the-art accuracy (962)given by Taha et al [28] However all the algorithmsthat overcome the proposed one exhibit greater complexitybecause they consider Lie group representations [33 34] orseveral machine learning algorithms such as SVM combined

10 Computational Intelligence and Neuroscience

(a)

(b)

(c)

Figure 5 Sequences of frames representing the drink from a bottle (a) answer phone (b) and read watch (c) activities from Florence3Ddataset

081 019

014 086

100

005 005 090

005 095

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(a)

076 019 005

032 064 005

004 092 004

005 010

005010005

085

080

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(b)

Figure 6 Confusion matrices of the Florence3D dataset with only AAL related activities (a) Best accuracy confusion matrix obtained with119875 = 11 (b) Worst accuracy confusion matrix obtained with 119875 = 7

to HMM to recognize activities composed of atomic actions[28]

If different subsets of joints are considered the perfor-mance decreases In particular considering only 7 joints and3 clusters it is possible to reach a maximum accuracy of821 which is very similar to the one obtained by Seidenariet al [44] who collected the dataset By including all theavailable joints (15) in the algorithm processing it is possibleto achieve an accuracy which varies between 840 and847

The Florence3D dataset is challenging because of tworeasons

(i) High interclass similarity some actions are verysimilar to each other for example drink from abottle answer phone and read watch As can beseen in Figure 5 all of them consist of an uprising

arm movement to the mouth hear or head Thisfact affects the performance because it is difficult toclassify actions consisting of very similar skeletonmovements

(ii) High intraclass variability the same action is per-formed in different ways by the same subject forexample using left right or both hands indifferently

Limiting the analysis to AAL related activities only thefollowing ones can be selected drink answer phone tight lacesit down and stand up The algorithm reaches the highestaccuracy (908) using the ldquonew-personrdquo setting with 119875 =11 joints and 119870 = 3 The confusion matrix obtained in thiscondition is shown in Figure 6(a) On the other hand theworst accuracy is 798 and it is obtained with 119875 = 7 and119870 = 4 In Figure 6(b) the confusion matrix for this scenario

Computational Intelligence and Neuroscience 11

081 008

004 056 022 011007

012 077 012

012

004 004 081 012

100

081 004 015

013 087

004 007019 070

a02

a03

a05

a06

a10

a13

a18

a20

a02

a03

a05

a06

a10

a13

a18

a20

(a)

056 015 007 015 007

007

007 017 013

016 044 008 004 004 024

011 056 004

003

022

010 070 017

063

100

010 087 003

007 093

a01

a04

a07

a08

a09

a11

a12

a14

a01

a04

a07

a08

a09

a11

a12

a14

(b)

088 012

100

005 095

097 003

003 097

093 007

003 003 083 010

004 011 085

a06

a14

a15

a16

a17

a18

a19

a20

a06

a14

a15

a16

a17

a18

a19

a20

(c)

Figure 7 Confusion matrices of the MSR Action3D dataset obtained with 119875 = 7 and the ldquonew-personrdquo test (a) AS1 (b) AS2 and (c) AS3

is shown In both situations the main problem is given by thesimilarity of drink and answer phone activities

425 MSR Action3D Dataset The MSR Action3D datasetis composed of 20 activities which are mainly gestures Itsevaluation is included for the sake of completeness in thecomparison with other activity recognition algorithms butthe dataset does not contain AAL related activities Thereis a big confusion in the literature about the validationtests to be used for MSR Action3D Padilla-Lopez et al[49] summarized all the validation methods for this datasetand recommend using all the possible combinations of 5-5 subjects splitting or using LOAO The former procedureconsists of considering 252 combinations of 5 subjects fortraining and the remaining 5 for testing while the latter is theleave-one-actor-out equivalent to the ldquonew-personrdquo schemepreviously introduced This dataset is quite challenging dueto the presence of similar and complex gestures hence theevaluation is performed by considering three subsets of 8gestures each (AS1 AS2 and AS3 in Table 9) as suggestedin [45] Since the minimum number of frames constitutingone sequence is 13 the following set of clusters has beenconsidered 119870 = [3 5 8 10 13] All the combinations of 711 15 and 20 joints shown in Figure 2 are included in theexperimental tests

Considering the ldquonew-personrdquo scheme the proposedalgorithm tested separately on the three subsets of MSRAction3D reaches an average accuracy of 812 with 119875 = 7and 119870 = 10 The confusion matrices for the three subsetsare shown in Figure 7 where it is possible to notice thatthe algorithm struggles in the recognition of the AS2 subset(Figure 7(b)) mainly represented by drawing gestures Betterresults are obtained processing the subset AS3 (Figure 7(c))where lower recognition rates are shown for complex ges-tures golf swing and pickup and throw Table 10 shows theresults obtained including also other joints which are slightlyworse A comparison with previous works validated usingthe ldquonew-personrdquo test is shown in Table 11 The proposedalgorithm achieves results comparable with [50] whichexploits an approach based on skeleton data Chaaraoui etal [15 36] exploits more complex algorithms considering

Table 9 Three subsets of gestures fromMSR Action3D dataset

AS1 AS2 AS3[11988602] Horizontalarm wave [11988601] High arm wave [11988606] High throw

[11988603] Hammer [11988604] Hand catch [11988614] Forward kick[11988605] Forwardpunch [11988607] Draw X [11988615] Side kick

[11988606] High throw [11988608] Draw tick [11988616] Jogging[11988610] Hand clap [11988609] Draw circle [11988617] Tennis swing[11988613] Bend [11988611] Two-hand wave [11988618] Tennis serve[11988618] Tennis serve [11988612] Side boxing [11988619] Golf swing[11988620] Pickup andthrow [11988614] Forward kick [11988620] Pickup and throw

Table 10 Accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting with different subsetsof joints and different subsets of activities

Algorithm AS1 AS2 AS3 AvgProposed (119875 = 15) 785 688 928 800

Proposed (119875 = 20) 790 702 919 804

Proposed (119875 = 11) 776 737 914 809

Proposed (119875 = 7) 795 719 923 812

the fusion of skeleton and depth data or the evolutionaryselection of the best set of joints

5 Discussion

The proposed algorithm despite its simplicity is able toachieve and sometimes to overcome state-of-the-art per-formance when applied to publicly available datasets Inparticular it is able to outperform some complex algorithmsexploiting more than one classifier in KARD and CAD-60datasets

Limiting the analysis to AAL related activities only thealgorithm achieves interesting results in all the datasetsThe group of 8 activities labeled as Actions in the KARD

12 Computational Intelligence and Neuroscience

Table 11 Average accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting compared to otherworks

Algorithm AccuracyCeliktutan et al (2015) [51] 729

Azary and Savakis (2013) [52] 785

Proposed (119875 = 7) 812

Azary and Savakis (2012) [50] 839

Chaaraoui et al (2013) [15] 906

Chaaraoui et al (2014) [36] 935

dataset is recognized with an accuracy greater than 987in the considered experiments and the group includes twosimilar activities such as phone call and drink The CAD-60dataset contains only actions so all the activities consideredwithin the proper location have been included in the eval-uation resulting in global precision and recall of 939 and935 respectively In the UTKinect dataset the performanceimproves considering only the AAL activities and the samehappens also in the Florence3D having the highest accuracyclose to 91 even if some actions are very similar

The MSR Action3D is a challenging dataset mainlycomprising gestures for human-computer interaction andnot actions Many gestures especially the ones included inAS2 subset are very similar to each other In order to improvethe recognition accuracy more complex features should beconsidered including not only jointsrsquo relative positions butalso their velocity Many sequences contain noisy skeletondata another approach to improving recognition accuracycould be the development of a method to discard noisyskeleton or to include depth-based features

However the AAL scenario raises some problems whichhave to be addressed First of all the algorithm exploits amul-ticlass SVM which is a good classifier but it does not makeeasy to understand if an activity belongs or not to any of thetraining classes In fact in a real scenario it is possible to havea sequence of frames that does not represent any activity ofthe training set in this case the SVM outputs the most likelyclass anyway even if it does not make sense Other machinelearning algorithms such as HMMs distinguish amongmultiple classes using the maximum posterior probabilityGaglio et al [29] proposed using a threshold on the outputprobability to detect unknown actions Usually SVMs do notprovide output probabilities but there exist some methodsto extend SVM implementations and make them able toprovide also this information [53 54] These techniquescan be investigated to understand their applicability in theproposed scenario

Another problem is segmentation of actions In manydatasets the actions are represented as segmented sequencesof frames but in real applications the algorithm has to handlea continuous stream of frames and to segment actions byitself Some solutions for segmentation have been proposedbut most of them are based on thresholds on movementswhich can be highly data-dependent [46] Also this aspect hasto be further investigated to have a systemwhich is effectivelyapplicable in a real AAL scenario

6 Conclusions

In this work a simple yet effective activity recognitionalgorithm has been proposed It is based on skeleton dataextracted from an RGBD sensor and it creates a featurevector representing the whole activity It is able to overcomestate-of-the-art results in two publicly available datasets theKARD and CAD-60 outperforming more complex algo-rithms in many conditions The algorithm has been testedalso over other more challenging datasets the UTKinect andFlorence3D where it is outperformed only by algorithmsexploiting temporal alignment techniques or a combinationof several machine learning methods The MSR Action3Dis a complex dataset and the proposed algorithm strugglesin the recognition of subsets constituted by similar gesturesHowever this dataset does not contain any activity of interestfor AAL scenarios

Future works will concern the application of the activityrecognition algorithm to AAL scenarios by consideringaction segmentation and the detection of unknown activities

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to acknowledge the contributionof the COST Action IC1303 AAPELE (Architectures Algo-rithms and Platforms for Enhanced Living Environments)

References

[1] P Rashidi and A Mihailidis ldquoA survey on ambient-assistedliving tools for older adultsrdquo IEEE Journal of Biomedical andHealth Informatics vol 17 no 3 pp 579ndash590 2013

[2] O C Ann and L B Theng ldquoHuman activity recognition areviewrdquo in Proceedings of the IEEE International Conference onControl System Computing and Engineering (ICCSCE rsquo14) pp389ndash393 IEEE Batu Ferringhi Malaysia November 2014

[3] S Wang and G Zhou ldquoA review on radio based activityrecognitionrdquo Digital Communications and Networks vol 1 no1 pp 20ndash29 2015

[4] L Atallah B Lo R Ali R King and G-Z Yang ldquoReal-timeactivity classificationusing ambient andwearable sensorsrdquo IEEETransactions on Information Technology in Biomedicine vol 13no 6 pp 1031ndash1039 2009

[5] D De P Bharti S K Das and S Chellappan ldquoMultimodalwearable sensing for fine-grained activity recognition in health-carerdquo IEEE Internet Computing vol 19 no 5 pp 26ndash35 2015

[6] J K Aggarwal and L Xia ldquoHuman activity recognition from 3Ddata a reviewrdquo Pattern Recognition Letters vol 48 pp 70ndash802014

[7] S Gasparrini E Cippitelli E Gambi S Spinsante and FFlorez-Revuelta ldquoPerformance analysis of self-organising neu-ral networks tracking algorithms for intake monitoring usingkinectrdquo in Proceedings of the 1st IET International Conferenceon Technologies for Active and Assisted Living (TechAAL rsquo15)Kingston UK November 2015

Computational Intelligence and Neuroscience 13

[8] J Shotton A Fitzgibbon M Cook et al ldquoReal-time humanpose recognition in parts from single depth imagesrdquo in Pro-ceedings of the IEEE Conference on Computer Vision and PatternRecognition (CVPR rsquo11) pp 1297ndash1304 IEEE Providence RdUSA June 2011

[9] J R Padilla-Lopez A A Chaaraoui F Gu and F Florez-Revuelta ldquoVisual privacy by context proposal and evaluationof a level-based visualisation schemerdquo Sensors vol 15 no 6 pp12959ndash12982 2015

[10] X Yang and Y Tian ldquoSuper normal vector for activity recog-nition using depth sequencesrdquo in Proceedings of the 27th IEEEConference on Computer Vision and Pattern Recognition (CVPRrsquo14) pp 804ndash811 IEEE Columbus Ohio USA June 2014

[11] R Slama H Wannous and M Daoudi ldquoGrassmannian rep-resentation of motion depth for 3D human gesture and actionrecognitionrdquo inProceedings of the 22nd International Conferenceon Pattern Recognition (ICPR rsquo14) pp 3499ndash3504 IEEE Stock-holm Sweden August 2014

[12] O Oreifej and Z Liu ldquoHON4D histogram of oriented 4Dnormals for activity recognition from depth sequencesrdquo inProceedings of the 26th IEEEConference onComputer Vision andPattern Recognition (CVPR rsquo13) pp 716ndash723 IEEE June 2013

[13] H Rahmani AMahmood D QHuynh andAMian ldquoHOPChistogram of oriented principal components of 3D pointcloudsfor action recognitionrdquo in Computer VisionmdashECCV 2014 DFleet T Pajdla B Schiele and T Tuytelaars Eds vol 8690 ofLecture Notes in Computer Science pp 742ndash757 Springer 2014

[14] G Chen D Clarke M Giuliani A Gaschler and A KnollldquoCombining unsupervised learning and discrimination for 3Daction recognitionrdquo Signal Processing vol 110 pp 67ndash81 2015

[15] A A Chaaraoui J R Padilla-Lopez and F Florez-RevueltaldquoFusion of skeletal and silhouette-based features for humanaction recognition with RGB-D devicesrdquo in Proceedings ofthe 14th IEEE International Conference on Computer VisionWorkshops (ICCVW rsquo13) pp 91ndash97 IEEE Sydney AustraliaDecember 2013

[16] S Althloothi M H Mahoor X Zhang and R M VoylesldquoHuman activity recognition using multi-features and multiplekernel learningrdquo Pattern Recognition vol 47 no 5 pp 1800ndash1812 2014

[17] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal featuresand joints for 3D action recognitionrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo13) pp 486ndash491 IEEE June 2013

[18] G Willems T Tuytelaars and L Van Gool ldquoAn efficient denseand scale-invariant spatio-temporal interest point detectorrdquo inComputer VisionmdashECCV 2008 D Forsyth P Torr and AZisserman Eds vol 5303 of Lecture Notes in Computer Sciencepp 650ndash663 Springer Berlin Germany 2008

[19] A Klaser M Marszałek and C Schmid ldquoA spatio-temporaldescriptor based on 3D-gradientsrdquo in Proceedings of the 19thBritish Machine Vision Conference (BMVC rsquo08) pp 995ndash1004September 2008

[20] J LuoWWang and H Qi ldquoSpatio-temporal feature extractionand representation for RGB-D human action recognitionrdquoPattern Recognition Letters vol 50 pp 139ndash148 2014

[21] Y Zhu W Chen and G Guo ldquoEvaluating spatiotemporalinterest point features for depth-based action recognitionrdquoImage and Vision Computing vol 32 no 8 pp 453ndash464 2014

[22] A-A Liu W-Z Nie Y-T Su L Ma T Hao and Z-X YangldquoCoupled hidden conditional random fields for RGB-D humanaction recognitionrdquo Signal Processing vol 112 pp 74ndash82 2015

[23] M Devanne H Wannous S Berretti P Pala M Daoudiand A Del Bimbo ldquoSpace-time pose representation for 3Dhuman action recognitionrdquo inNewTrends in ImageAnalysis andProcessingmdashICIAP 2013 vol 8158 of Lecture Notes in ComputerScience pp 456ndash464 Springer Berlin Germany 2013

[24] L Gan and F Chen ldquoHuman action recognition using APJ3Dand random forestsrdquo Journal of Software vol 8 no 9 pp 2238ndash2245 2013

[25] J Wang Z Liu Y Wu and J Yuan ldquoMining actionlet ensemblefor action recognition with depth camerasrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo12) pp 1290ndash1297 Providence RI USA June 2012

[26] L Xia C-C Chen and J K Aggarwal ldquoView invariant humanaction recognition using histograms of 3D jointsrdquo inProceedingsof the IEEE Computer Society Conference on Computer Visionand Pattern Recognition Workshops (CVPRW rsquo12) pp 20ndash27Providence RI June 2012

[27] X Yang and Y Tian ldquoEffective 3D action recognition usingEigen Jointsrdquo Journal of Visual Communication and ImageRepresentation vol 25 no 1 pp 2ndash11 2014

[28] A Taha H H Zayed M E Khalifa and E-S M El-HorbatyldquoHuman activity recognition for surveillance applicationsrdquo inProceedings of the 7th International Conference on InformationTechnology pp 577ndash586 2015

[29] S Gaglio G Lo Re and M Morana ldquoHuman activity recog-nition process using 3-D posture datardquo IEEE Transactions onHuman-Machine Systems vol 45 no 5 pp 586ndash597 2015

[30] I Theodorakopoulos D Kastaniotis G Economou and SFotopoulos ldquoPose-based human action recognition via sparserepresentation in dissimilarity spacerdquo Journal of Visual Commu-nication and Image Representation vol 25 no 1 pp 12ndash23 2014

[31] WDing K Liu F Cheng and J Zhang ldquoSTFC spatio-temporalfeature chain for skeleton-based human action recognitionrdquoJournal of Visual Communication and Image Representation vol26 pp 329ndash337 2015

[32] R Slama HWannous M Daoudi and A Srivastava ldquoAccurate3D action recognition using learning on the Grassmann mani-foldrdquo Pattern Recognition vol 48 no 2 pp 556ndash567 2015

[33] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 27th IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR rsquo14) pp 588ndash595Columbus Ohio USA June 2014

[34] R Anirudh P Turaga J Su and A Srivastava ldquoElastic func-tional coding of human actions From vector-fields to latentvariablesrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR rsquo15) pp 3147ndash3155Boston Mass USA June 2015

[35] M Jiang J Kong G Bebis and H Huo ldquoInformative jointsbased human action recognition using skeleton contextsrdquo SignalProcessing Image Communication vol 33 pp 29ndash40 2015

[36] A A Chaaraoui J R Padilla-Lopez P Climent-Perez andF Florez-Revuelta ldquoEvolutionary joint selection to improvehuman action recognitionwith RGB-Ddevicesrdquo Expert Systemswith Applications vol 41 no 3 pp 786ndash794 2014

[37] S Baysal M C Kurt and P Duygulu ldquoRecognizing humanactions using key posesrdquo in Proceedings of the 20th InternationalConference on Pattern Recognition (ICPR rsquo10) pp 1727ndash1730Istanbul Turkey August 2010

[38] A A Chaaraoui P Climent-Perez and F Florez-RevueltaldquoSilhouette-based human action recognition using sequences of

14 Computational Intelligence and Neuroscience

key posesrdquo Pattern Recognition Letters vol 34 no 15 pp 1799ndash1807 2013

[39] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[40] C-W Hsu and C-J Lin ldquoA comparison of methods for mul-ticlass support vector machinesrdquo IEEE Transactions on NeuralNetworks vol 13 no 2 pp 415ndash425 2002

[41] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[42] OpenNI 2015 httpstructureioopenni[43] J Sung C Ponce B Selman and A Saxena ldquoUnstructured

human activity detection fromRGBD imagesrdquo in Proceedings ofthe IEEE International Conference on Robotics and Automation(ICRA rsquo12) pp 842ndash849 Saint Paul Minn USA May 2012

[44] L Seidenari V Varano S Berretti A Del Bimbo and P PalaldquoRecognizing actions from depth cameras as weakly alignedmulti-part Bag-of-Posesrdquo in Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition Workshops(CVPRW rsquo13) pp 479ndash485 Portland Ore USA June 2013

[45] W Li Z Zhang and Z Liu ldquoAction recognition based ona bag of 3D pointsrdquo in Proceedings of the IEEE ComputerSociety Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo10) pp 9ndash14 San Francisco Calif USAJune 2010

[46] J Shan and S Akella ldquo3D human action segmentation andrecognition using pose kinetic energyrdquo in Proceedings of theIEEE Workshop on Advanced Robotics and its Social Impacts(ARSO rsquo14) pp 69ndash75 Evanston Ill USA September 2014

[47] D R Faria C Premebida and U Nunes ldquoA probabilisticapproach for human everyday activities recognition using bodymotion from RGB-D imagesrdquo in Proceedings of the 23rd IEEEInternational Symposium on Robot and Human InteractiveCommunication (RO-MAN rsquo14) pp 732ndash737 Edinburgh UKAugust 2014

[48] G I Parisi C Weber and S Wermter ldquoSelf-organizing neuralintegration of pose-motion features for human action recogni-tionrdquo Frontiers in Neurorobotics vol 9 no 3 2015

[49] J R Padilla-Lopez A A Chaaraoui and F Florez-Revuelta ldquoAdiscussion on the validation tests employed to compare humanaction recognition methods using the MSR Action3D datasetrdquohttparxivorgabs14077390

[50] S Azary and A Savakis ldquo3D Action classification using sparsespatio-temporal feature representationsrdquo in Advances in VisualComputing G Bebis R Boyle B Parvin et al Eds vol 7432 ofLecture Notes in Computer Science pp 166ndash175 Springer BerlinGermany 2012

[51] O Celiktutan C Wolf B Sankur and E Lombardi ldquoFast exacthyper-graph matching with dynamic programming for spatio-temporal datardquo Journal ofMathematical Imaging andVision vol51 no 1 pp 1ndash21 2015

[52] S Azary and A Savakis ldquoGrassmannian sparse representationsand motion depth surfaces for 3D action recognitionrdquo inProceedings of the IEEE Conference on Computer Vision andPattern Recognition Workshops (CVPRW rsquo13) pp 492ndash499Portland Ore USA June 2013

[53] J C Platt ldquoProbabilistic outputs for support vector machinesand comparisons to regularized likelihood methodsrdquo inAdvances in Large Margin Classifiers pp 61ndash74 MIT PressCambridge Mass USA 1999

[54] H-T Lin C-J Lin and R C Weng ldquoA note on Plattrsquosprobabilistic outputs for support vector machinesrdquo MachineLearning vol 68 no 3 pp 267ndash276 2007

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 5: Research Article A Human Activity Recognition System Using Skeleton …downloads.hindawi.com/journals/cin/2016/4351435.pdf · 2019-07-30 · Kinect skeleton is motivated by the fact

Computational Intelligence and Neuroscience 5

12

30456

78

910

1112

131415

16 17

18 19

(a)

12

30456

78

910

1112

131415

16 17

18 19

(b)

12

30456

78

910

1112

131415

1617

18 19

(c)

12

30456

78

910

1112

131415

16 17

18 19(d)

Figure 2 Subsets of joints considered in the evaluation of the proposed algorithmThewhole skeleton is represented by 20 joints the selectedones are depicted as green circles while the discarded joints are represented by red squares Subsets of 7 (a) 11 (b) and 15 (c) joints and thewhole set of 20 joints (d) are selected

dimensional space where the data are separable Consideringtwo training vectors x

119894and x

119895 the kernel function can be

defined as

119870(x119894 x119895) = 120601 (x

119894)119879120601 (x119895) (6)

In this work the Radial Basis Function (RBF) kernel has beenused where

119870(x119894 x119895) = 119890minus120574x119894minusx1198952 120574 gt 0 (7)

It follows that 119862 and 120574 are the parameters that have to beestimated prior using the SVM

The idea herein exploited is to use a multiclass SVMwhere each class represents an activity of the dataset In orderto extend the role from a binary to amulticlass classifier someapproaches have been proposed in the literature In [40] theauthors compared many methods and found that the ldquoone-against-onerdquo is one of the most suitable for practical use Itis implemented in LIBSVM [41] and it is the approach usedin this workThe ldquoone-against-onerdquo approach is based on theconstruction of several binary SVM classifiers inmore detaila number of119872(119872minus1)2 binary SVMs are necessary in a119872-classes dataset This happens because each SVM is trained todistinguish between 2 classes and the final decision is takenexploiting a voting strategy among all the binary classifiersDuring the training phase the activity feature vectors aregiven as inputs to the multiclass SVM together with the label119871 of the action In the test phase the activity label is obtainedfrom the classifier

4 Experimental Results

Thealgorithmperformance is evaluated on five publicly avail-able datasets In order to perform an objective comparisonto previous works the reference test procedures have beenconsidered for each dataset The performance indicators areevaluated using four different subsets of joints shown inFigure 2 going from a minimum of 7 up to a maximum of20 joints Finally in order to evaluate the performance inAAL scenarios some suitable activities related to AAL areselected from the datasets and the recognition accuracies areevaluated

41 Datasets Five different 3D datasets have been consideredin this work each of them including a different set of activitiesand gestures

KARD dataset [29] is composed of 18 activities that canbe divided into 10 gestures (horizontal arm wave high armwave two-hand wave high throw draw X draw tick forwardkick side kick bend and hand clap) and eight actions (catchcap toss paper take umbrellawalk phone call drink sit downand stand up) This dataset has been captured in a controlledenvironment that is an office with a static background anda Kinect device placed at a distance of 2-3m from the subjectSome objects were present in the area useful to perform someof the actions a desk with a phone a coat rack and a wastebin The activities have been performed by 10 young people(nine males and one female) aged from 20 to 30 years andfrom 150 to 185 cm tall Each person repeated each activity3 times creating a number of 540 sequences The dataset iscomposed of RGB and depth frames captured at a rate of30 fps with a 640 times 480 resolution In addition 15 joints ofthe skeleton in world and screen coordinates are providedThe skeleton has been captured using OpenNI libraries [42]

The Cornell Activity Dataset (CAD-60) [43] is madeby 12 different activities typical of indoor environmentsThe activities are rinsing mouth brushing teeth wearingcontact lens talking on the phone drinking water openingpill container cooking-chopping cooking-stirring talking oncouch relaxing on couch writing on whiteboard and workingon computer and are performed in 5 different environmentsbathroom bedroom kitchen living room and office All theactivities are performed by 4 different people two males andtwo females one of which is left-handed No instructionswere given to actors about how to perform the activitiesthe authors simply ensured that the skeleton was correctlydetected by Kinect The dataset is composed of RGB depthand skeleton data with 15 joints available

The UTKinect dataset [26] is composed of 10 differentsubjects (9males and 1 female) performing 10 activities twiceThe following activities are part of the dataset walk sit downstand up pick up carry throw push pull wave and claphands A number of 199 sequences are available because onesequence is not labeled and the length of sample actions

6 Computational Intelligence and Neuroscience

ranges from 5 to 120 frames The dataset provides 640 times 480RGB frames and 320 times 240 depth frames together with 20skeleton joints captured using Kinect forWindows SDKBetaVersion with a final frame rate of about 15 fps

The Florence3D dataset [44] includes 9 different activi-ties wave drink from a bottle answer phone clap tight lacesit down stand up read watch and bow These activities wereperformed by 10 different subjects for 2 or 3 times resultingin a total number of 215 sequences This is a challengingdataset since the same action is performed with both handsand because of the presence of very similar actions such asdrink from a bottle and answer phone The activities wererecorded in different environments and only RGB videos and15 skeleton joints are available

Finally the MSR Action3D [45] represents one of themost used datasets for HAR It includes 20 activities per-formed by 10 subjects 2 or 3 times In total 567 sequencesof depth (320 times 240) and skeleton frames are providedbut 10 of them have to be discarded because the skeletonsare either missing or affected by too many errors Thefollowing activities are included in the dataset high armwavehorizontal arm wave hammer hand catch forward punchhigh throw draw X draw tick draw circle hand clap two-hand wave side boxing bend forward kick side kick joggingtennis swing tennis serve golf swing and pickup and throwThe dataset has been collected using a structured-light depthcamera at 15 fps RGB data are not available

42 Tests and Results The proposed algorithm has beentested over the datasets detailed above following the recom-mendations provided in each reference paper in order toensure a fair comparison to previous works Following thiscomparison a more AAL oriented evaluation has been con-ducted which consists of considering only suitable actionsfor each dataset excluding the gestures Another type ofevaluation regards the subset of skeleton joints that has to beincluded in the feature computation

421 KARD Dataset Gaglio et al [29] collected the KARDdataset and proposed some evaluation experiments on itThey considered three different experiments and two modal-ities of dataset splitting The experiments are as follows

(i) Experiment A one-third of the data is considered fortraining and the rest for testing

(ii) Experiment B two-thirds of the data is considered fortraining and the rest for testing

(iii) Experiment C half of the data is considered fortraining and the rest for testing

The activities constituting the dataset are split in the followinggroups

(i) Gestures and Actions(ii) Activity Set 1 Activity Set 2 and Activity Set 3 as

listed in Table 1 Activity Set 1 is the simplest onesince it is composed of quite different activities whilethe other two sets include more similar actions andgestures

Table 1 Activity sets grouping different and similar activities fromKARD dataset

Activity Set 1 Activity Set 2 Activity Set 3Horizontal arm wave High arm wave Draw tickTwo-hand wave Side kick DrinkBend Catch cap Sit downPhone call Draw tick Phone callStand up Hand clap Take umbrellaForward kick Forward kick Toss paperDraw X Bend High throwWalk Sit down Horizontal arm wave

Each experiment has been repeated 10 times randomlysplitting training and testing data Finally the ldquonew-personrdquoscenario is also performed that is a leave-one-actor-outsetting consisting of training the system on nine of the tenpeople of the dataset and testing on the tenth In the ldquonew-personrdquo test no recommendation is provided about how tosplit the dataset so we assumed that the whole dataset of 18activities is considered The only parameter that can be setin the proposed algorithm is the number of clusters 119870 anddifferent subsets of skeleton joints are considered Since only15 skeleton joints are available the fourth group of 20 joints(Figure 2(d)) cannot be considered

For each test conducted on KARD dataset the sequenceof clusters 119870 = [3 5 10 15 20 25 30 35] has been consid-ered The results concerning Activity Set tests are reported inTable 2 it is shown that the proposed algorithm outperformsthe originally proposed one in almost all the tests The 119870parameter which gives the maximum accuracy is quite highsince it is 30 or 35 in most of the cases It means that forthe KARD dataset it is better to have a significant numberof clusters representing each activity This is possible alsobecause the number of frames that constitutes each activitygoes from aminimumof 42 in a sequence of hand clap gestureto a maximum of 310 for a walk sequence Experiment Ain Activity Set 1 shows the highest difference between theminimum and the maximum accuracy when varying thenumber of clusters For example considering 119875 = 7 themaximum accuracy (982) is shown in Table 2 and it isobtained with 119870 = 35 The minimum accuracy is 914 andit is obtained with 119870 = 5 The difference is quite high (68)but this gap is reduced by considering more training data Infact Experiment B shows a difference of 05 in Activity Set1 11 in Activity Set 2 and 26 in Activity Set 3

Considering the number of selected joints the observa-tion of Tables 2 and 3 lets us conclude that not all the jointsare necessary to achieve good recognition results In moredetail from Table 2 it can be noticed that the Activity Set 1and Activity Set 2 which are the simplest ones have goodrecognition results using a subset composed of 119875 = 7 jointsThe Activity Set 3 composed of more similar activities isbetter recognized with 119875 = 11 joints In any case it is notnecessary to consider all the skeleton joints provided fromKARD dataset

The results obtained with the ldquonew-personrdquo scenario areshown in Table 4 The best result is obtained with 119870 = 30

Computational Intelligence and Neuroscience 7

Table 2 Accuracy () of the proposed algorithm compared to the other using KARD dataset with different Activity Sets and for differentexperiments

Activity Set 1 Activity Set 2 Activity Set 3A B C A B C A B C

Gaglio et al [29] 951 991 930 899 949 901 842 895 817

Proposed (119875 = 7) 982 984 981 997 100 997 902 950 913

Proposed (119875 = 11) 980 990 977 998 100 996 916 958 933Proposed (119875 = 15) 975 988 976 995 100 996 91 951 932

090 003 007097 003

100100

003 093 003007 007 083 003007 007 077 010

003 093 003100

100100

100100

100090 010

003 003 007 087100

100

Hor

arm

wav

e

Hig

h ar

m w

ave

Two-

hand

wav

e

Catch

cap

Hig

h th

row

Dra

w X

Dra

w tic

k

Toss

pape

r

Forw

kick

Side

kick

Take

um

br

Bend

Han

d cla

p

Wal

k

Phon

e cal

l

Drin

k

Sit d

own

Stan

d up

Hor arm waveHigh arm wave

Two-hand waveCatch cap

High throwDraw X

Draw tickToss paperForw kick

Side kickTake umbr

BendHand clap

WalkPhone call

DrinkSit downStand up

Figure 3 Confusion matrix of the ldquonew-personrdquo test on the whole KARD dataset

Table 3 Accuracy () of the proposed algorithm compared tothe other using KARD dataset with dataset split in Gestures andActions for different experiments

Gestures ActionsA B C A B C

Gaglio et al [29] 865 930 867 925 950 901

Proposed (119875 = 7) 899 935 925 991 996 994Proposed (119875 = 11) 899 959 937 990 999 991Proposed (119875 = 15) 874 936 928 987 995 993

Table 4 Precision () and recall () of the proposed algorithmcompared to the other using the whole KARD dataset and ldquonew-personrdquo setting

Algorithm Precision RecallGaglio et al [29] 848 845

Proposed (119875 = 7) 940 937

Proposed (119875 = 11) 951 950Proposed (119875 = 15) 950 948

using a number of 119875 = 11 joints The overall precision andrecall are about 10higher than the previous approachwhichuses the KARD dataset In this condition the confusion

matrix obtained is shown in Figure 3 It can be noticed thatthe actions are distinguished very well only phone call anddrink show a recognition accuracy equal or lower than 90and sometimes they are mixed with each other The mostcritical activities are the draw X and draw tick gestures whichare quite similar

From the AAL point of view only some activities arerelevant Table 3 shows that the eight actions constituting theKARD dataset are recognized with an accuracy greater than98 even if there are some similar actions such as sit downand stand up or phone call and drink Considering only theActions subset the lower recognition accuracy is 949and itis obtained with119870 = 5 and119875 = 7 It means that the algorithmis able to reach a high recognition rate even if the featurevector is limited to 90 elements

422 CAD-60 Dataset TheCAD-60 dataset is a challengingdataset consisting of 12 activities performed by 4 people in5 different environments The dataset is usually evaluatedby splitting the activities according to the environmentthe global performance of the algorithm is given by theaverage precision and recall among all the environments Twodifferent settings were experimented for CAD-60 in [43]Theformer is defined ldquonew-personrdquo and the latter is the so-calledldquohave-seenrdquo ldquoNew-personrdquo setting has been considered in all

8 Computational Intelligence and Neuroscience

the works using CAD-60 so it is the one selected also in thiswork

The most challenging element of the dataset is thepresence of a left-handed actor In order to increase the per-formance which are particularly affected by this unbalancingin the ldquonew-personrdquo test mirrored copies of each action arecreated as suggested in [43] For each actor a left-handed anda right-handed version of each action are made availableThedummyversion of the activity has been obtained bymirroringthe skeleton with respect to the virtual sagittal plane that cutsthe person in a half The proposed algorithm is evaluatedusing three different sets of joints from 119875 = 7 to 119875 = 15and the sequence of clusters 119870 = [3 5 10 15 20 25 30 35]as in the KARD dataset

The best results are obtained with the configurations 119875 =11 and 119870 = 25 and the performance in terms of precisionand recall for each activity is shown in Table 5 Very goodresults are given in office environment where the averageprecision and recall are 964 and 958 respectively Infact the activities of this environment are quite different onlytalking on phone and drinking water are similar On the otherhand the living room environment includes talking on couchand relaxing on couch in addition to talking on phone anddrinking water and it is the most challenging case since theaverage precision and recall are 91 and 906

The proposed algorithm is compared to other worksusing the same ldquonew-personrdquo setting and the results areshown in Table 6 which tells that the 119875 = 11 configurationoutperforms the state-of-the-art results in terms of precisionand it is only 1 lower in terms of recall Shan and Akella[46] achieve very good results using a multiclass SVMscheme with a linear kernel However they train and testmirrored actions separately and then merge the results whencomputing average precision and recall Our approach simplyconsiders two copies of the same action given as input to themulticlass SVM and retrieves the classification results

The reduced number of joints does not affect too muchthe average performance of the algorithm that reaches aprecision of 927 and a recall of 915 with 119875 = 7Using all the available jonts on the other hand brings toa more substantial reduction of the performance showing879 and 867 for precision and recall respectively with119875 = 15 The best results for the proposed algorithm werealways obtained with a high number of clusters (25 or30) The reduction of this number affects the performancefor example considering the 119875 = 11 subset the worstperformance is obtained with 119870 = 5 and with a precision of866 and a recall of 860

From theAALpoint of view this dataset is composed onlyof actions not gestures so the dataset does not have to beseparated to evaluate the performance in a scenario which isclose to AAL

423 UTKinect Dataset The UTKinect dataset is composedof 10 activities performed twice by 10 subjects and theevaluation setting proposed in [26] is the leave-one-out-cross-validation (LOOCV) which means that the systemis trained on all the sequences except one and that one isused for testing Each trainingtesting procedure is repeated

Table 5 Precision () and recall () of the proposed algorithm inthe different environments of CAD-60 with 119875 = 11 and 119870 = 25

Location Activity ldquoNew-personrdquoPrecision Recall

Bathroom

Brushing teeth 889 100

Rinsing mouth 923 100

Wearing contact lens 100 792

Average 937 931

Bedroom

Talking on phone 917 917

Drinking water 913 875

Opening pill container 960 100

Average 930 931

Kitchen

Cooking-chopping 857 100

Cooking-stirring 100 791

Drinking water 960 100

Opening pill container 100 100

Average 954 948

Living room

Talking on phone 875 875

Drinking water 875 875

Talking on couch 889 100

Relaxing on couch 100 875

Average 910 906

Office

Talking on phone 100 875

Writing on whiteboard 100 958

Drinking water 857 100

Working on computer 100 100

Average 964 958Global average 939 935

Table 6 Global precision () and recall () of the proposedalgorithm for CAD-60 dataset and ldquonew-personrdquo setting withdifferent subsets of joints compared to other works

Algorithm Precision RecallSung et al [43] 679 555

Gaglio et al [29] 773 767

Proposed (119875 = 15) 879 867

Faria et al [47] 911 919

Parisi et al [48] 919 902

Proposed (119875 = 7) 927 915

Shan and Akella [46] 938 945Proposed (119875 = 11) 939 935

20 times to reduce the random effect of 119896-means For thisdataset all the different subsets of joints shown in Figure 2are considered since the skeleton is captured usingMicrosoftSDK which provides 20 joints The considered sequence ofclusters is only 119870 = [3 4 5] because the minimum numberof frames constituting an action sequence is 5

The results compared with previous works are shownin Table 7 The best results for the proposed algorithm areobtained with 119875 = 7 which is the smallest set of jointsconsidered The result corresponds to a number of 119870 = 4clusters but the difference with the other clusters is very low

Computational Intelligence and Neuroscience 9

093 002 005

094 006

099 001

100

002 098W

alk

Sit d

own

Stan

d up

Pick

up

Carr

y

Walk

Sit down

Stand up

Pick up

Carry

(a)

088 004 008

095 005

100

003 097

010 090

Wal

k

Sit d

own

Stan

d up

Pick

up

Carr

y

Walk

Sit down

Stand up

Pick up

Carry

(b)

Figure 4 Confusion matrices of the UTKinect dataset with only AAL related activities (a) Best accuracy confusion matrix obtained with119875 = 7 (b) Worst accuracy confusion matrix obtained with 119875 = 15

Table 7 Global accuracy () of the proposed algorithm forUTKinect dataset and LOOCV setting with different subsets ofjoints compared to other works

Algorithm AccuracyXia et al [26] 909

Theodorakopoulos et al [30] 9095

Ding et al [31] 915

Zhu et al [17] 919

Jiang et al [35] 919

Gan and Chen [24] 920

Liu et al [22] 920

Proposed (119875 = 15) 931

Proposed (119875 = 11) 942

Proposed (119875 = 19) 943

Anirudh et al [34] 949

Proposed (119875 = 7) 951

Vemulapalli et al [33] 971

only 06 with 119870 = 5 that provided the worst result Theselection of different sets of joints from 119875 = 7 to 119875 = 15changes the accuracy only by a 2 from 931 to 951In this dataset the main limitation to the performance isgiven by the reduced number of frames that constitute somesequences and limits the number of clusters representing theactions Vemulapalli et al [33] reach the highest accuracybut the approach is much more complex after modelingskeleton joints in a Lie group the processing scheme includesDynamic TimeWarping to perform temporal alignments anda specific representation called Fourier Temporal Pyramidbefore classification with one-versus-all multiclass SVM

Contextualizing the UTKinect dataset to AAL involvesthe consideration of a subset of activities which includes onlyactions and discards gestures In more detail the following 5activities have been selectedwalk sit down stand up pick upand carry In this condition the highest accuracy (967) isstill given by119875 = 7 with 3 clusters and the lower one (941)is represented by 119875 = 15 again with 3 clustersThe confusion

Table 8 Global accuracy () of the proposed algorithm forFlorence3D dataset and ldquonew-personrdquo setting with different subsetsof joints compared to other works

Algorithm AccuracySeidenari et al [44] 820

Proposed (119875 = 7) 821

Proposed (119875 = 15) 847

Proposed (119875 = 11) 861

Anirudh et al [34] 897

Vemulapalli et al [33] 909

Taha et al [28] 962

matrix for these two configurations are shown in Figure 4where the main difference is the reduced misclassificationbetween the activities walk and carry that are very similarto each other

424 Florence3D Dataset The Florence3D dataset is com-posed of 9 activities and 10 people performing themmultipletimes resulting in a total number of 215 activities Theproposed setting to evaluate this dataset is the leave-one-actor-out which is equivalent to the ldquonew-personrdquo settingpreviously described The minimum number of frames is 8so the following sequence of clusters has been considered119870 = [3 4 5 6 7 8] A number of 15 joints are available forthe skeleton so only the first three schemes are included inthe tests

Table 8 shows the results obtained with different subsetsof joints where the best result for the proposed algorithmis given by 119875 = 11 with 6 clusters The choice of clustersdoes not significantly affect the performance because themaximum reduction is 2 In this dataset the proposedapproach does not achieve state-of-the-art accuracy (962)given by Taha et al [28] However all the algorithmsthat overcome the proposed one exhibit greater complexitybecause they consider Lie group representations [33 34] orseveral machine learning algorithms such as SVM combined

10 Computational Intelligence and Neuroscience

(a)

(b)

(c)

Figure 5 Sequences of frames representing the drink from a bottle (a) answer phone (b) and read watch (c) activities from Florence3Ddataset

081 019

014 086

100

005 005 090

005 095

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(a)

076 019 005

032 064 005

004 092 004

005 010

005010005

085

080

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(b)

Figure 6 Confusion matrices of the Florence3D dataset with only AAL related activities (a) Best accuracy confusion matrix obtained with119875 = 11 (b) Worst accuracy confusion matrix obtained with 119875 = 7

to HMM to recognize activities composed of atomic actions[28]

If different subsets of joints are considered the perfor-mance decreases In particular considering only 7 joints and3 clusters it is possible to reach a maximum accuracy of821 which is very similar to the one obtained by Seidenariet al [44] who collected the dataset By including all theavailable joints (15) in the algorithm processing it is possibleto achieve an accuracy which varies between 840 and847

The Florence3D dataset is challenging because of tworeasons

(i) High interclass similarity some actions are verysimilar to each other for example drink from abottle answer phone and read watch As can beseen in Figure 5 all of them consist of an uprising

arm movement to the mouth hear or head Thisfact affects the performance because it is difficult toclassify actions consisting of very similar skeletonmovements

(ii) High intraclass variability the same action is per-formed in different ways by the same subject forexample using left right or both hands indifferently

Limiting the analysis to AAL related activities only thefollowing ones can be selected drink answer phone tight lacesit down and stand up The algorithm reaches the highestaccuracy (908) using the ldquonew-personrdquo setting with 119875 =11 joints and 119870 = 3 The confusion matrix obtained in thiscondition is shown in Figure 6(a) On the other hand theworst accuracy is 798 and it is obtained with 119875 = 7 and119870 = 4 In Figure 6(b) the confusion matrix for this scenario

Computational Intelligence and Neuroscience 11

081 008

004 056 022 011007

012 077 012

012

004 004 081 012

100

081 004 015

013 087

004 007019 070

a02

a03

a05

a06

a10

a13

a18

a20

a02

a03

a05

a06

a10

a13

a18

a20

(a)

056 015 007 015 007

007

007 017 013

016 044 008 004 004 024

011 056 004

003

022

010 070 017

063

100

010 087 003

007 093

a01

a04

a07

a08

a09

a11

a12

a14

a01

a04

a07

a08

a09

a11

a12

a14

(b)

088 012

100

005 095

097 003

003 097

093 007

003 003 083 010

004 011 085

a06

a14

a15

a16

a17

a18

a19

a20

a06

a14

a15

a16

a17

a18

a19

a20

(c)

Figure 7 Confusion matrices of the MSR Action3D dataset obtained with 119875 = 7 and the ldquonew-personrdquo test (a) AS1 (b) AS2 and (c) AS3

is shown In both situations the main problem is given by thesimilarity of drink and answer phone activities

425 MSR Action3D Dataset The MSR Action3D datasetis composed of 20 activities which are mainly gestures Itsevaluation is included for the sake of completeness in thecomparison with other activity recognition algorithms butthe dataset does not contain AAL related activities Thereis a big confusion in the literature about the validationtests to be used for MSR Action3D Padilla-Lopez et al[49] summarized all the validation methods for this datasetand recommend using all the possible combinations of 5-5 subjects splitting or using LOAO The former procedureconsists of considering 252 combinations of 5 subjects fortraining and the remaining 5 for testing while the latter is theleave-one-actor-out equivalent to the ldquonew-personrdquo schemepreviously introduced This dataset is quite challenging dueto the presence of similar and complex gestures hence theevaluation is performed by considering three subsets of 8gestures each (AS1 AS2 and AS3 in Table 9) as suggestedin [45] Since the minimum number of frames constitutingone sequence is 13 the following set of clusters has beenconsidered 119870 = [3 5 8 10 13] All the combinations of 711 15 and 20 joints shown in Figure 2 are included in theexperimental tests

Considering the ldquonew-personrdquo scheme the proposedalgorithm tested separately on the three subsets of MSRAction3D reaches an average accuracy of 812 with 119875 = 7and 119870 = 10 The confusion matrices for the three subsetsare shown in Figure 7 where it is possible to notice thatthe algorithm struggles in the recognition of the AS2 subset(Figure 7(b)) mainly represented by drawing gestures Betterresults are obtained processing the subset AS3 (Figure 7(c))where lower recognition rates are shown for complex ges-tures golf swing and pickup and throw Table 10 shows theresults obtained including also other joints which are slightlyworse A comparison with previous works validated usingthe ldquonew-personrdquo test is shown in Table 11 The proposedalgorithm achieves results comparable with [50] whichexploits an approach based on skeleton data Chaaraoui etal [15 36] exploits more complex algorithms considering

Table 9 Three subsets of gestures fromMSR Action3D dataset

AS1 AS2 AS3[11988602] Horizontalarm wave [11988601] High arm wave [11988606] High throw

[11988603] Hammer [11988604] Hand catch [11988614] Forward kick[11988605] Forwardpunch [11988607] Draw X [11988615] Side kick

[11988606] High throw [11988608] Draw tick [11988616] Jogging[11988610] Hand clap [11988609] Draw circle [11988617] Tennis swing[11988613] Bend [11988611] Two-hand wave [11988618] Tennis serve[11988618] Tennis serve [11988612] Side boxing [11988619] Golf swing[11988620] Pickup andthrow [11988614] Forward kick [11988620] Pickup and throw

Table 10 Accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting with different subsetsof joints and different subsets of activities

Algorithm AS1 AS2 AS3 AvgProposed (119875 = 15) 785 688 928 800

Proposed (119875 = 20) 790 702 919 804

Proposed (119875 = 11) 776 737 914 809

Proposed (119875 = 7) 795 719 923 812

the fusion of skeleton and depth data or the evolutionaryselection of the best set of joints

5 Discussion

The proposed algorithm despite its simplicity is able toachieve and sometimes to overcome state-of-the-art per-formance when applied to publicly available datasets Inparticular it is able to outperform some complex algorithmsexploiting more than one classifier in KARD and CAD-60datasets

Limiting the analysis to AAL related activities only thealgorithm achieves interesting results in all the datasetsThe group of 8 activities labeled as Actions in the KARD

12 Computational Intelligence and Neuroscience

Table 11 Average accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting compared to otherworks

Algorithm AccuracyCeliktutan et al (2015) [51] 729

Azary and Savakis (2013) [52] 785

Proposed (119875 = 7) 812

Azary and Savakis (2012) [50] 839

Chaaraoui et al (2013) [15] 906

Chaaraoui et al (2014) [36] 935

dataset is recognized with an accuracy greater than 987in the considered experiments and the group includes twosimilar activities such as phone call and drink The CAD-60dataset contains only actions so all the activities consideredwithin the proper location have been included in the eval-uation resulting in global precision and recall of 939 and935 respectively In the UTKinect dataset the performanceimproves considering only the AAL activities and the samehappens also in the Florence3D having the highest accuracyclose to 91 even if some actions are very similar

The MSR Action3D is a challenging dataset mainlycomprising gestures for human-computer interaction andnot actions Many gestures especially the ones included inAS2 subset are very similar to each other In order to improvethe recognition accuracy more complex features should beconsidered including not only jointsrsquo relative positions butalso their velocity Many sequences contain noisy skeletondata another approach to improving recognition accuracycould be the development of a method to discard noisyskeleton or to include depth-based features

However the AAL scenario raises some problems whichhave to be addressed First of all the algorithm exploits amul-ticlass SVM which is a good classifier but it does not makeeasy to understand if an activity belongs or not to any of thetraining classes In fact in a real scenario it is possible to havea sequence of frames that does not represent any activity ofthe training set in this case the SVM outputs the most likelyclass anyway even if it does not make sense Other machinelearning algorithms such as HMMs distinguish amongmultiple classes using the maximum posterior probabilityGaglio et al [29] proposed using a threshold on the outputprobability to detect unknown actions Usually SVMs do notprovide output probabilities but there exist some methodsto extend SVM implementations and make them able toprovide also this information [53 54] These techniquescan be investigated to understand their applicability in theproposed scenario

Another problem is segmentation of actions In manydatasets the actions are represented as segmented sequencesof frames but in real applications the algorithm has to handlea continuous stream of frames and to segment actions byitself Some solutions for segmentation have been proposedbut most of them are based on thresholds on movementswhich can be highly data-dependent [46] Also this aspect hasto be further investigated to have a systemwhich is effectivelyapplicable in a real AAL scenario

6 Conclusions

In this work a simple yet effective activity recognitionalgorithm has been proposed It is based on skeleton dataextracted from an RGBD sensor and it creates a featurevector representing the whole activity It is able to overcomestate-of-the-art results in two publicly available datasets theKARD and CAD-60 outperforming more complex algo-rithms in many conditions The algorithm has been testedalso over other more challenging datasets the UTKinect andFlorence3D where it is outperformed only by algorithmsexploiting temporal alignment techniques or a combinationof several machine learning methods The MSR Action3Dis a complex dataset and the proposed algorithm strugglesin the recognition of subsets constituted by similar gesturesHowever this dataset does not contain any activity of interestfor AAL scenarios

Future works will concern the application of the activityrecognition algorithm to AAL scenarios by consideringaction segmentation and the detection of unknown activities

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to acknowledge the contributionof the COST Action IC1303 AAPELE (Architectures Algo-rithms and Platforms for Enhanced Living Environments)

References

[1] P Rashidi and A Mihailidis ldquoA survey on ambient-assistedliving tools for older adultsrdquo IEEE Journal of Biomedical andHealth Informatics vol 17 no 3 pp 579ndash590 2013

[2] O C Ann and L B Theng ldquoHuman activity recognition areviewrdquo in Proceedings of the IEEE International Conference onControl System Computing and Engineering (ICCSCE rsquo14) pp389ndash393 IEEE Batu Ferringhi Malaysia November 2014

[3] S Wang and G Zhou ldquoA review on radio based activityrecognitionrdquo Digital Communications and Networks vol 1 no1 pp 20ndash29 2015

[4] L Atallah B Lo R Ali R King and G-Z Yang ldquoReal-timeactivity classificationusing ambient andwearable sensorsrdquo IEEETransactions on Information Technology in Biomedicine vol 13no 6 pp 1031ndash1039 2009

[5] D De P Bharti S K Das and S Chellappan ldquoMultimodalwearable sensing for fine-grained activity recognition in health-carerdquo IEEE Internet Computing vol 19 no 5 pp 26ndash35 2015

[6] J K Aggarwal and L Xia ldquoHuman activity recognition from 3Ddata a reviewrdquo Pattern Recognition Letters vol 48 pp 70ndash802014

[7] S Gasparrini E Cippitelli E Gambi S Spinsante and FFlorez-Revuelta ldquoPerformance analysis of self-organising neu-ral networks tracking algorithms for intake monitoring usingkinectrdquo in Proceedings of the 1st IET International Conferenceon Technologies for Active and Assisted Living (TechAAL rsquo15)Kingston UK November 2015

Computational Intelligence and Neuroscience 13

[8] J Shotton A Fitzgibbon M Cook et al ldquoReal-time humanpose recognition in parts from single depth imagesrdquo in Pro-ceedings of the IEEE Conference on Computer Vision and PatternRecognition (CVPR rsquo11) pp 1297ndash1304 IEEE Providence RdUSA June 2011

[9] J R Padilla-Lopez A A Chaaraoui F Gu and F Florez-Revuelta ldquoVisual privacy by context proposal and evaluationof a level-based visualisation schemerdquo Sensors vol 15 no 6 pp12959ndash12982 2015

[10] X Yang and Y Tian ldquoSuper normal vector for activity recog-nition using depth sequencesrdquo in Proceedings of the 27th IEEEConference on Computer Vision and Pattern Recognition (CVPRrsquo14) pp 804ndash811 IEEE Columbus Ohio USA June 2014

[11] R Slama H Wannous and M Daoudi ldquoGrassmannian rep-resentation of motion depth for 3D human gesture and actionrecognitionrdquo inProceedings of the 22nd International Conferenceon Pattern Recognition (ICPR rsquo14) pp 3499ndash3504 IEEE Stock-holm Sweden August 2014

[12] O Oreifej and Z Liu ldquoHON4D histogram of oriented 4Dnormals for activity recognition from depth sequencesrdquo inProceedings of the 26th IEEEConference onComputer Vision andPattern Recognition (CVPR rsquo13) pp 716ndash723 IEEE June 2013

[13] H Rahmani AMahmood D QHuynh andAMian ldquoHOPChistogram of oriented principal components of 3D pointcloudsfor action recognitionrdquo in Computer VisionmdashECCV 2014 DFleet T Pajdla B Schiele and T Tuytelaars Eds vol 8690 ofLecture Notes in Computer Science pp 742ndash757 Springer 2014

[14] G Chen D Clarke M Giuliani A Gaschler and A KnollldquoCombining unsupervised learning and discrimination for 3Daction recognitionrdquo Signal Processing vol 110 pp 67ndash81 2015

[15] A A Chaaraoui J R Padilla-Lopez and F Florez-RevueltaldquoFusion of skeletal and silhouette-based features for humanaction recognition with RGB-D devicesrdquo in Proceedings ofthe 14th IEEE International Conference on Computer VisionWorkshops (ICCVW rsquo13) pp 91ndash97 IEEE Sydney AustraliaDecember 2013

[16] S Althloothi M H Mahoor X Zhang and R M VoylesldquoHuman activity recognition using multi-features and multiplekernel learningrdquo Pattern Recognition vol 47 no 5 pp 1800ndash1812 2014

[17] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal featuresand joints for 3D action recognitionrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo13) pp 486ndash491 IEEE June 2013

[18] G Willems T Tuytelaars and L Van Gool ldquoAn efficient denseand scale-invariant spatio-temporal interest point detectorrdquo inComputer VisionmdashECCV 2008 D Forsyth P Torr and AZisserman Eds vol 5303 of Lecture Notes in Computer Sciencepp 650ndash663 Springer Berlin Germany 2008

[19] A Klaser M Marszałek and C Schmid ldquoA spatio-temporaldescriptor based on 3D-gradientsrdquo in Proceedings of the 19thBritish Machine Vision Conference (BMVC rsquo08) pp 995ndash1004September 2008

[20] J LuoWWang and H Qi ldquoSpatio-temporal feature extractionand representation for RGB-D human action recognitionrdquoPattern Recognition Letters vol 50 pp 139ndash148 2014

[21] Y Zhu W Chen and G Guo ldquoEvaluating spatiotemporalinterest point features for depth-based action recognitionrdquoImage and Vision Computing vol 32 no 8 pp 453ndash464 2014

[22] A-A Liu W-Z Nie Y-T Su L Ma T Hao and Z-X YangldquoCoupled hidden conditional random fields for RGB-D humanaction recognitionrdquo Signal Processing vol 112 pp 74ndash82 2015

[23] M Devanne H Wannous S Berretti P Pala M Daoudiand A Del Bimbo ldquoSpace-time pose representation for 3Dhuman action recognitionrdquo inNewTrends in ImageAnalysis andProcessingmdashICIAP 2013 vol 8158 of Lecture Notes in ComputerScience pp 456ndash464 Springer Berlin Germany 2013

[24] L Gan and F Chen ldquoHuman action recognition using APJ3Dand random forestsrdquo Journal of Software vol 8 no 9 pp 2238ndash2245 2013

[25] J Wang Z Liu Y Wu and J Yuan ldquoMining actionlet ensemblefor action recognition with depth camerasrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo12) pp 1290ndash1297 Providence RI USA June 2012

[26] L Xia C-C Chen and J K Aggarwal ldquoView invariant humanaction recognition using histograms of 3D jointsrdquo inProceedingsof the IEEE Computer Society Conference on Computer Visionand Pattern Recognition Workshops (CVPRW rsquo12) pp 20ndash27Providence RI June 2012

[27] X Yang and Y Tian ldquoEffective 3D action recognition usingEigen Jointsrdquo Journal of Visual Communication and ImageRepresentation vol 25 no 1 pp 2ndash11 2014

[28] A Taha H H Zayed M E Khalifa and E-S M El-HorbatyldquoHuman activity recognition for surveillance applicationsrdquo inProceedings of the 7th International Conference on InformationTechnology pp 577ndash586 2015

[29] S Gaglio G Lo Re and M Morana ldquoHuman activity recog-nition process using 3-D posture datardquo IEEE Transactions onHuman-Machine Systems vol 45 no 5 pp 586ndash597 2015

[30] I Theodorakopoulos D Kastaniotis G Economou and SFotopoulos ldquoPose-based human action recognition via sparserepresentation in dissimilarity spacerdquo Journal of Visual Commu-nication and Image Representation vol 25 no 1 pp 12ndash23 2014

[31] WDing K Liu F Cheng and J Zhang ldquoSTFC spatio-temporalfeature chain for skeleton-based human action recognitionrdquoJournal of Visual Communication and Image Representation vol26 pp 329ndash337 2015

[32] R Slama HWannous M Daoudi and A Srivastava ldquoAccurate3D action recognition using learning on the Grassmann mani-foldrdquo Pattern Recognition vol 48 no 2 pp 556ndash567 2015

[33] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 27th IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR rsquo14) pp 588ndash595Columbus Ohio USA June 2014

[34] R Anirudh P Turaga J Su and A Srivastava ldquoElastic func-tional coding of human actions From vector-fields to latentvariablesrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR rsquo15) pp 3147ndash3155Boston Mass USA June 2015

[35] M Jiang J Kong G Bebis and H Huo ldquoInformative jointsbased human action recognition using skeleton contextsrdquo SignalProcessing Image Communication vol 33 pp 29ndash40 2015

[36] A A Chaaraoui J R Padilla-Lopez P Climent-Perez andF Florez-Revuelta ldquoEvolutionary joint selection to improvehuman action recognitionwith RGB-Ddevicesrdquo Expert Systemswith Applications vol 41 no 3 pp 786ndash794 2014

[37] S Baysal M C Kurt and P Duygulu ldquoRecognizing humanactions using key posesrdquo in Proceedings of the 20th InternationalConference on Pattern Recognition (ICPR rsquo10) pp 1727ndash1730Istanbul Turkey August 2010

[38] A A Chaaraoui P Climent-Perez and F Florez-RevueltaldquoSilhouette-based human action recognition using sequences of

14 Computational Intelligence and Neuroscience

key posesrdquo Pattern Recognition Letters vol 34 no 15 pp 1799ndash1807 2013

[39] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[40] C-W Hsu and C-J Lin ldquoA comparison of methods for mul-ticlass support vector machinesrdquo IEEE Transactions on NeuralNetworks vol 13 no 2 pp 415ndash425 2002

[41] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[42] OpenNI 2015 httpstructureioopenni[43] J Sung C Ponce B Selman and A Saxena ldquoUnstructured

human activity detection fromRGBD imagesrdquo in Proceedings ofthe IEEE International Conference on Robotics and Automation(ICRA rsquo12) pp 842ndash849 Saint Paul Minn USA May 2012

[44] L Seidenari V Varano S Berretti A Del Bimbo and P PalaldquoRecognizing actions from depth cameras as weakly alignedmulti-part Bag-of-Posesrdquo in Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition Workshops(CVPRW rsquo13) pp 479ndash485 Portland Ore USA June 2013

[45] W Li Z Zhang and Z Liu ldquoAction recognition based ona bag of 3D pointsrdquo in Proceedings of the IEEE ComputerSociety Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo10) pp 9ndash14 San Francisco Calif USAJune 2010

[46] J Shan and S Akella ldquo3D human action segmentation andrecognition using pose kinetic energyrdquo in Proceedings of theIEEE Workshop on Advanced Robotics and its Social Impacts(ARSO rsquo14) pp 69ndash75 Evanston Ill USA September 2014

[47] D R Faria C Premebida and U Nunes ldquoA probabilisticapproach for human everyday activities recognition using bodymotion from RGB-D imagesrdquo in Proceedings of the 23rd IEEEInternational Symposium on Robot and Human InteractiveCommunication (RO-MAN rsquo14) pp 732ndash737 Edinburgh UKAugust 2014

[48] G I Parisi C Weber and S Wermter ldquoSelf-organizing neuralintegration of pose-motion features for human action recogni-tionrdquo Frontiers in Neurorobotics vol 9 no 3 2015

[49] J R Padilla-Lopez A A Chaaraoui and F Florez-Revuelta ldquoAdiscussion on the validation tests employed to compare humanaction recognition methods using the MSR Action3D datasetrdquohttparxivorgabs14077390

[50] S Azary and A Savakis ldquo3D Action classification using sparsespatio-temporal feature representationsrdquo in Advances in VisualComputing G Bebis R Boyle B Parvin et al Eds vol 7432 ofLecture Notes in Computer Science pp 166ndash175 Springer BerlinGermany 2012

[51] O Celiktutan C Wolf B Sankur and E Lombardi ldquoFast exacthyper-graph matching with dynamic programming for spatio-temporal datardquo Journal ofMathematical Imaging andVision vol51 no 1 pp 1ndash21 2015

[52] S Azary and A Savakis ldquoGrassmannian sparse representationsand motion depth surfaces for 3D action recognitionrdquo inProceedings of the IEEE Conference on Computer Vision andPattern Recognition Workshops (CVPRW rsquo13) pp 492ndash499Portland Ore USA June 2013

[53] J C Platt ldquoProbabilistic outputs for support vector machinesand comparisons to regularized likelihood methodsrdquo inAdvances in Large Margin Classifiers pp 61ndash74 MIT PressCambridge Mass USA 1999

[54] H-T Lin C-J Lin and R C Weng ldquoA note on Plattrsquosprobabilistic outputs for support vector machinesrdquo MachineLearning vol 68 no 3 pp 267ndash276 2007

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 6: Research Article A Human Activity Recognition System Using Skeleton …downloads.hindawi.com/journals/cin/2016/4351435.pdf · 2019-07-30 · Kinect skeleton is motivated by the fact

6 Computational Intelligence and Neuroscience

ranges from 5 to 120 frames The dataset provides 640 times 480RGB frames and 320 times 240 depth frames together with 20skeleton joints captured using Kinect forWindows SDKBetaVersion with a final frame rate of about 15 fps

The Florence3D dataset [44] includes 9 different activi-ties wave drink from a bottle answer phone clap tight lacesit down stand up read watch and bow These activities wereperformed by 10 different subjects for 2 or 3 times resultingin a total number of 215 sequences This is a challengingdataset since the same action is performed with both handsand because of the presence of very similar actions such asdrink from a bottle and answer phone The activities wererecorded in different environments and only RGB videos and15 skeleton joints are available

Finally the MSR Action3D [45] represents one of themost used datasets for HAR It includes 20 activities per-formed by 10 subjects 2 or 3 times In total 567 sequencesof depth (320 times 240) and skeleton frames are providedbut 10 of them have to be discarded because the skeletonsare either missing or affected by too many errors Thefollowing activities are included in the dataset high armwavehorizontal arm wave hammer hand catch forward punchhigh throw draw X draw tick draw circle hand clap two-hand wave side boxing bend forward kick side kick joggingtennis swing tennis serve golf swing and pickup and throwThe dataset has been collected using a structured-light depthcamera at 15 fps RGB data are not available

42 Tests and Results The proposed algorithm has beentested over the datasets detailed above following the recom-mendations provided in each reference paper in order toensure a fair comparison to previous works Following thiscomparison a more AAL oriented evaluation has been con-ducted which consists of considering only suitable actionsfor each dataset excluding the gestures Another type ofevaluation regards the subset of skeleton joints that has to beincluded in the feature computation

421 KARD Dataset Gaglio et al [29] collected the KARDdataset and proposed some evaluation experiments on itThey considered three different experiments and two modal-ities of dataset splitting The experiments are as follows

(i) Experiment A one-third of the data is considered fortraining and the rest for testing

(ii) Experiment B two-thirds of the data is considered fortraining and the rest for testing

(iii) Experiment C half of the data is considered fortraining and the rest for testing

The activities constituting the dataset are split in the followinggroups

(i) Gestures and Actions(ii) Activity Set 1 Activity Set 2 and Activity Set 3 as

listed in Table 1 Activity Set 1 is the simplest onesince it is composed of quite different activities whilethe other two sets include more similar actions andgestures

Table 1 Activity sets grouping different and similar activities fromKARD dataset

Activity Set 1 Activity Set 2 Activity Set 3Horizontal arm wave High arm wave Draw tickTwo-hand wave Side kick DrinkBend Catch cap Sit downPhone call Draw tick Phone callStand up Hand clap Take umbrellaForward kick Forward kick Toss paperDraw X Bend High throwWalk Sit down Horizontal arm wave

Each experiment has been repeated 10 times randomlysplitting training and testing data Finally the ldquonew-personrdquoscenario is also performed that is a leave-one-actor-outsetting consisting of training the system on nine of the tenpeople of the dataset and testing on the tenth In the ldquonew-personrdquo test no recommendation is provided about how tosplit the dataset so we assumed that the whole dataset of 18activities is considered The only parameter that can be setin the proposed algorithm is the number of clusters 119870 anddifferent subsets of skeleton joints are considered Since only15 skeleton joints are available the fourth group of 20 joints(Figure 2(d)) cannot be considered

For each test conducted on KARD dataset the sequenceof clusters 119870 = [3 5 10 15 20 25 30 35] has been consid-ered The results concerning Activity Set tests are reported inTable 2 it is shown that the proposed algorithm outperformsthe originally proposed one in almost all the tests The 119870parameter which gives the maximum accuracy is quite highsince it is 30 or 35 in most of the cases It means that forthe KARD dataset it is better to have a significant numberof clusters representing each activity This is possible alsobecause the number of frames that constitutes each activitygoes from aminimumof 42 in a sequence of hand clap gestureto a maximum of 310 for a walk sequence Experiment Ain Activity Set 1 shows the highest difference between theminimum and the maximum accuracy when varying thenumber of clusters For example considering 119875 = 7 themaximum accuracy (982) is shown in Table 2 and it isobtained with 119870 = 35 The minimum accuracy is 914 andit is obtained with 119870 = 5 The difference is quite high (68)but this gap is reduced by considering more training data Infact Experiment B shows a difference of 05 in Activity Set1 11 in Activity Set 2 and 26 in Activity Set 3

Considering the number of selected joints the observa-tion of Tables 2 and 3 lets us conclude that not all the jointsare necessary to achieve good recognition results In moredetail from Table 2 it can be noticed that the Activity Set 1and Activity Set 2 which are the simplest ones have goodrecognition results using a subset composed of 119875 = 7 jointsThe Activity Set 3 composed of more similar activities isbetter recognized with 119875 = 11 joints In any case it is notnecessary to consider all the skeleton joints provided fromKARD dataset

The results obtained with the ldquonew-personrdquo scenario areshown in Table 4 The best result is obtained with 119870 = 30

Computational Intelligence and Neuroscience 7

Table 2 Accuracy () of the proposed algorithm compared to the other using KARD dataset with different Activity Sets and for differentexperiments

Activity Set 1 Activity Set 2 Activity Set 3A B C A B C A B C

Gaglio et al [29] 951 991 930 899 949 901 842 895 817

Proposed (119875 = 7) 982 984 981 997 100 997 902 950 913

Proposed (119875 = 11) 980 990 977 998 100 996 916 958 933Proposed (119875 = 15) 975 988 976 995 100 996 91 951 932

090 003 007097 003

100100

003 093 003007 007 083 003007 007 077 010

003 093 003100

100100

100100

100090 010

003 003 007 087100

100

Hor

arm

wav

e

Hig

h ar

m w

ave

Two-

hand

wav

e

Catch

cap

Hig

h th

row

Dra

w X

Dra

w tic

k

Toss

pape

r

Forw

kick

Side

kick

Take

um

br

Bend

Han

d cla

p

Wal

k

Phon

e cal

l

Drin

k

Sit d

own

Stan

d up

Hor arm waveHigh arm wave

Two-hand waveCatch cap

High throwDraw X

Draw tickToss paperForw kick

Side kickTake umbr

BendHand clap

WalkPhone call

DrinkSit downStand up

Figure 3 Confusion matrix of the ldquonew-personrdquo test on the whole KARD dataset

Table 3 Accuracy () of the proposed algorithm compared tothe other using KARD dataset with dataset split in Gestures andActions for different experiments

Gestures ActionsA B C A B C

Gaglio et al [29] 865 930 867 925 950 901

Proposed (119875 = 7) 899 935 925 991 996 994Proposed (119875 = 11) 899 959 937 990 999 991Proposed (119875 = 15) 874 936 928 987 995 993

Table 4 Precision () and recall () of the proposed algorithmcompared to the other using the whole KARD dataset and ldquonew-personrdquo setting

Algorithm Precision RecallGaglio et al [29] 848 845

Proposed (119875 = 7) 940 937

Proposed (119875 = 11) 951 950Proposed (119875 = 15) 950 948

using a number of 119875 = 11 joints The overall precision andrecall are about 10higher than the previous approachwhichuses the KARD dataset In this condition the confusion

matrix obtained is shown in Figure 3 It can be noticed thatthe actions are distinguished very well only phone call anddrink show a recognition accuracy equal or lower than 90and sometimes they are mixed with each other The mostcritical activities are the draw X and draw tick gestures whichare quite similar

From the AAL point of view only some activities arerelevant Table 3 shows that the eight actions constituting theKARD dataset are recognized with an accuracy greater than98 even if there are some similar actions such as sit downand stand up or phone call and drink Considering only theActions subset the lower recognition accuracy is 949and itis obtained with119870 = 5 and119875 = 7 It means that the algorithmis able to reach a high recognition rate even if the featurevector is limited to 90 elements

422 CAD-60 Dataset TheCAD-60 dataset is a challengingdataset consisting of 12 activities performed by 4 people in5 different environments The dataset is usually evaluatedby splitting the activities according to the environmentthe global performance of the algorithm is given by theaverage precision and recall among all the environments Twodifferent settings were experimented for CAD-60 in [43]Theformer is defined ldquonew-personrdquo and the latter is the so-calledldquohave-seenrdquo ldquoNew-personrdquo setting has been considered in all

8 Computational Intelligence and Neuroscience

the works using CAD-60 so it is the one selected also in thiswork

The most challenging element of the dataset is thepresence of a left-handed actor In order to increase the per-formance which are particularly affected by this unbalancingin the ldquonew-personrdquo test mirrored copies of each action arecreated as suggested in [43] For each actor a left-handed anda right-handed version of each action are made availableThedummyversion of the activity has been obtained bymirroringthe skeleton with respect to the virtual sagittal plane that cutsthe person in a half The proposed algorithm is evaluatedusing three different sets of joints from 119875 = 7 to 119875 = 15and the sequence of clusters 119870 = [3 5 10 15 20 25 30 35]as in the KARD dataset

The best results are obtained with the configurations 119875 =11 and 119870 = 25 and the performance in terms of precisionand recall for each activity is shown in Table 5 Very goodresults are given in office environment where the averageprecision and recall are 964 and 958 respectively Infact the activities of this environment are quite different onlytalking on phone and drinking water are similar On the otherhand the living room environment includes talking on couchand relaxing on couch in addition to talking on phone anddrinking water and it is the most challenging case since theaverage precision and recall are 91 and 906

The proposed algorithm is compared to other worksusing the same ldquonew-personrdquo setting and the results areshown in Table 6 which tells that the 119875 = 11 configurationoutperforms the state-of-the-art results in terms of precisionand it is only 1 lower in terms of recall Shan and Akella[46] achieve very good results using a multiclass SVMscheme with a linear kernel However they train and testmirrored actions separately and then merge the results whencomputing average precision and recall Our approach simplyconsiders two copies of the same action given as input to themulticlass SVM and retrieves the classification results

The reduced number of joints does not affect too muchthe average performance of the algorithm that reaches aprecision of 927 and a recall of 915 with 119875 = 7Using all the available jonts on the other hand brings toa more substantial reduction of the performance showing879 and 867 for precision and recall respectively with119875 = 15 The best results for the proposed algorithm werealways obtained with a high number of clusters (25 or30) The reduction of this number affects the performancefor example considering the 119875 = 11 subset the worstperformance is obtained with 119870 = 5 and with a precision of866 and a recall of 860

From theAALpoint of view this dataset is composed onlyof actions not gestures so the dataset does not have to beseparated to evaluate the performance in a scenario which isclose to AAL

423 UTKinect Dataset The UTKinect dataset is composedof 10 activities performed twice by 10 subjects and theevaluation setting proposed in [26] is the leave-one-out-cross-validation (LOOCV) which means that the systemis trained on all the sequences except one and that one isused for testing Each trainingtesting procedure is repeated

Table 5 Precision () and recall () of the proposed algorithm inthe different environments of CAD-60 with 119875 = 11 and 119870 = 25

Location Activity ldquoNew-personrdquoPrecision Recall

Bathroom

Brushing teeth 889 100

Rinsing mouth 923 100

Wearing contact lens 100 792

Average 937 931

Bedroom

Talking on phone 917 917

Drinking water 913 875

Opening pill container 960 100

Average 930 931

Kitchen

Cooking-chopping 857 100

Cooking-stirring 100 791

Drinking water 960 100

Opening pill container 100 100

Average 954 948

Living room

Talking on phone 875 875

Drinking water 875 875

Talking on couch 889 100

Relaxing on couch 100 875

Average 910 906

Office

Talking on phone 100 875

Writing on whiteboard 100 958

Drinking water 857 100

Working on computer 100 100

Average 964 958Global average 939 935

Table 6 Global precision () and recall () of the proposedalgorithm for CAD-60 dataset and ldquonew-personrdquo setting withdifferent subsets of joints compared to other works

Algorithm Precision RecallSung et al [43] 679 555

Gaglio et al [29] 773 767

Proposed (119875 = 15) 879 867

Faria et al [47] 911 919

Parisi et al [48] 919 902

Proposed (119875 = 7) 927 915

Shan and Akella [46] 938 945Proposed (119875 = 11) 939 935

20 times to reduce the random effect of 119896-means For thisdataset all the different subsets of joints shown in Figure 2are considered since the skeleton is captured usingMicrosoftSDK which provides 20 joints The considered sequence ofclusters is only 119870 = [3 4 5] because the minimum numberof frames constituting an action sequence is 5

The results compared with previous works are shownin Table 7 The best results for the proposed algorithm areobtained with 119875 = 7 which is the smallest set of jointsconsidered The result corresponds to a number of 119870 = 4clusters but the difference with the other clusters is very low

Computational Intelligence and Neuroscience 9

093 002 005

094 006

099 001

100

002 098W

alk

Sit d

own

Stan

d up

Pick

up

Carr

y

Walk

Sit down

Stand up

Pick up

Carry

(a)

088 004 008

095 005

100

003 097

010 090

Wal

k

Sit d

own

Stan

d up

Pick

up

Carr

y

Walk

Sit down

Stand up

Pick up

Carry

(b)

Figure 4 Confusion matrices of the UTKinect dataset with only AAL related activities (a) Best accuracy confusion matrix obtained with119875 = 7 (b) Worst accuracy confusion matrix obtained with 119875 = 15

Table 7 Global accuracy () of the proposed algorithm forUTKinect dataset and LOOCV setting with different subsets ofjoints compared to other works

Algorithm AccuracyXia et al [26] 909

Theodorakopoulos et al [30] 9095

Ding et al [31] 915

Zhu et al [17] 919

Jiang et al [35] 919

Gan and Chen [24] 920

Liu et al [22] 920

Proposed (119875 = 15) 931

Proposed (119875 = 11) 942

Proposed (119875 = 19) 943

Anirudh et al [34] 949

Proposed (119875 = 7) 951

Vemulapalli et al [33] 971

only 06 with 119870 = 5 that provided the worst result Theselection of different sets of joints from 119875 = 7 to 119875 = 15changes the accuracy only by a 2 from 931 to 951In this dataset the main limitation to the performance isgiven by the reduced number of frames that constitute somesequences and limits the number of clusters representing theactions Vemulapalli et al [33] reach the highest accuracybut the approach is much more complex after modelingskeleton joints in a Lie group the processing scheme includesDynamic TimeWarping to perform temporal alignments anda specific representation called Fourier Temporal Pyramidbefore classification with one-versus-all multiclass SVM

Contextualizing the UTKinect dataset to AAL involvesthe consideration of a subset of activities which includes onlyactions and discards gestures In more detail the following 5activities have been selectedwalk sit down stand up pick upand carry In this condition the highest accuracy (967) isstill given by119875 = 7 with 3 clusters and the lower one (941)is represented by 119875 = 15 again with 3 clustersThe confusion

Table 8 Global accuracy () of the proposed algorithm forFlorence3D dataset and ldquonew-personrdquo setting with different subsetsof joints compared to other works

Algorithm AccuracySeidenari et al [44] 820

Proposed (119875 = 7) 821

Proposed (119875 = 15) 847

Proposed (119875 = 11) 861

Anirudh et al [34] 897

Vemulapalli et al [33] 909

Taha et al [28] 962

matrix for these two configurations are shown in Figure 4where the main difference is the reduced misclassificationbetween the activities walk and carry that are very similarto each other

424 Florence3D Dataset The Florence3D dataset is com-posed of 9 activities and 10 people performing themmultipletimes resulting in a total number of 215 activities Theproposed setting to evaluate this dataset is the leave-one-actor-out which is equivalent to the ldquonew-personrdquo settingpreviously described The minimum number of frames is 8so the following sequence of clusters has been considered119870 = [3 4 5 6 7 8] A number of 15 joints are available forthe skeleton so only the first three schemes are included inthe tests

Table 8 shows the results obtained with different subsetsof joints where the best result for the proposed algorithmis given by 119875 = 11 with 6 clusters The choice of clustersdoes not significantly affect the performance because themaximum reduction is 2 In this dataset the proposedapproach does not achieve state-of-the-art accuracy (962)given by Taha et al [28] However all the algorithmsthat overcome the proposed one exhibit greater complexitybecause they consider Lie group representations [33 34] orseveral machine learning algorithms such as SVM combined

10 Computational Intelligence and Neuroscience

(a)

(b)

(c)

Figure 5 Sequences of frames representing the drink from a bottle (a) answer phone (b) and read watch (c) activities from Florence3Ddataset

081 019

014 086

100

005 005 090

005 095

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(a)

076 019 005

032 064 005

004 092 004

005 010

005010005

085

080

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(b)

Figure 6 Confusion matrices of the Florence3D dataset with only AAL related activities (a) Best accuracy confusion matrix obtained with119875 = 11 (b) Worst accuracy confusion matrix obtained with 119875 = 7

to HMM to recognize activities composed of atomic actions[28]

If different subsets of joints are considered the perfor-mance decreases In particular considering only 7 joints and3 clusters it is possible to reach a maximum accuracy of821 which is very similar to the one obtained by Seidenariet al [44] who collected the dataset By including all theavailable joints (15) in the algorithm processing it is possibleto achieve an accuracy which varies between 840 and847

The Florence3D dataset is challenging because of tworeasons

(i) High interclass similarity some actions are verysimilar to each other for example drink from abottle answer phone and read watch As can beseen in Figure 5 all of them consist of an uprising

arm movement to the mouth hear or head Thisfact affects the performance because it is difficult toclassify actions consisting of very similar skeletonmovements

(ii) High intraclass variability the same action is per-formed in different ways by the same subject forexample using left right or both hands indifferently

Limiting the analysis to AAL related activities only thefollowing ones can be selected drink answer phone tight lacesit down and stand up The algorithm reaches the highestaccuracy (908) using the ldquonew-personrdquo setting with 119875 =11 joints and 119870 = 3 The confusion matrix obtained in thiscondition is shown in Figure 6(a) On the other hand theworst accuracy is 798 and it is obtained with 119875 = 7 and119870 = 4 In Figure 6(b) the confusion matrix for this scenario

Computational Intelligence and Neuroscience 11

081 008

004 056 022 011007

012 077 012

012

004 004 081 012

100

081 004 015

013 087

004 007019 070

a02

a03

a05

a06

a10

a13

a18

a20

a02

a03

a05

a06

a10

a13

a18

a20

(a)

056 015 007 015 007

007

007 017 013

016 044 008 004 004 024

011 056 004

003

022

010 070 017

063

100

010 087 003

007 093

a01

a04

a07

a08

a09

a11

a12

a14

a01

a04

a07

a08

a09

a11

a12

a14

(b)

088 012

100

005 095

097 003

003 097

093 007

003 003 083 010

004 011 085

a06

a14

a15

a16

a17

a18

a19

a20

a06

a14

a15

a16

a17

a18

a19

a20

(c)

Figure 7 Confusion matrices of the MSR Action3D dataset obtained with 119875 = 7 and the ldquonew-personrdquo test (a) AS1 (b) AS2 and (c) AS3

is shown In both situations the main problem is given by thesimilarity of drink and answer phone activities

425 MSR Action3D Dataset The MSR Action3D datasetis composed of 20 activities which are mainly gestures Itsevaluation is included for the sake of completeness in thecomparison with other activity recognition algorithms butthe dataset does not contain AAL related activities Thereis a big confusion in the literature about the validationtests to be used for MSR Action3D Padilla-Lopez et al[49] summarized all the validation methods for this datasetand recommend using all the possible combinations of 5-5 subjects splitting or using LOAO The former procedureconsists of considering 252 combinations of 5 subjects fortraining and the remaining 5 for testing while the latter is theleave-one-actor-out equivalent to the ldquonew-personrdquo schemepreviously introduced This dataset is quite challenging dueto the presence of similar and complex gestures hence theevaluation is performed by considering three subsets of 8gestures each (AS1 AS2 and AS3 in Table 9) as suggestedin [45] Since the minimum number of frames constitutingone sequence is 13 the following set of clusters has beenconsidered 119870 = [3 5 8 10 13] All the combinations of 711 15 and 20 joints shown in Figure 2 are included in theexperimental tests

Considering the ldquonew-personrdquo scheme the proposedalgorithm tested separately on the three subsets of MSRAction3D reaches an average accuracy of 812 with 119875 = 7and 119870 = 10 The confusion matrices for the three subsetsare shown in Figure 7 where it is possible to notice thatthe algorithm struggles in the recognition of the AS2 subset(Figure 7(b)) mainly represented by drawing gestures Betterresults are obtained processing the subset AS3 (Figure 7(c))where lower recognition rates are shown for complex ges-tures golf swing and pickup and throw Table 10 shows theresults obtained including also other joints which are slightlyworse A comparison with previous works validated usingthe ldquonew-personrdquo test is shown in Table 11 The proposedalgorithm achieves results comparable with [50] whichexploits an approach based on skeleton data Chaaraoui etal [15 36] exploits more complex algorithms considering

Table 9 Three subsets of gestures fromMSR Action3D dataset

AS1 AS2 AS3[11988602] Horizontalarm wave [11988601] High arm wave [11988606] High throw

[11988603] Hammer [11988604] Hand catch [11988614] Forward kick[11988605] Forwardpunch [11988607] Draw X [11988615] Side kick

[11988606] High throw [11988608] Draw tick [11988616] Jogging[11988610] Hand clap [11988609] Draw circle [11988617] Tennis swing[11988613] Bend [11988611] Two-hand wave [11988618] Tennis serve[11988618] Tennis serve [11988612] Side boxing [11988619] Golf swing[11988620] Pickup andthrow [11988614] Forward kick [11988620] Pickup and throw

Table 10 Accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting with different subsetsof joints and different subsets of activities

Algorithm AS1 AS2 AS3 AvgProposed (119875 = 15) 785 688 928 800

Proposed (119875 = 20) 790 702 919 804

Proposed (119875 = 11) 776 737 914 809

Proposed (119875 = 7) 795 719 923 812

the fusion of skeleton and depth data or the evolutionaryselection of the best set of joints

5 Discussion

The proposed algorithm despite its simplicity is able toachieve and sometimes to overcome state-of-the-art per-formance when applied to publicly available datasets Inparticular it is able to outperform some complex algorithmsexploiting more than one classifier in KARD and CAD-60datasets

Limiting the analysis to AAL related activities only thealgorithm achieves interesting results in all the datasetsThe group of 8 activities labeled as Actions in the KARD

12 Computational Intelligence and Neuroscience

Table 11 Average accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting compared to otherworks

Algorithm AccuracyCeliktutan et al (2015) [51] 729

Azary and Savakis (2013) [52] 785

Proposed (119875 = 7) 812

Azary and Savakis (2012) [50] 839

Chaaraoui et al (2013) [15] 906

Chaaraoui et al (2014) [36] 935

dataset is recognized with an accuracy greater than 987in the considered experiments and the group includes twosimilar activities such as phone call and drink The CAD-60dataset contains only actions so all the activities consideredwithin the proper location have been included in the eval-uation resulting in global precision and recall of 939 and935 respectively In the UTKinect dataset the performanceimproves considering only the AAL activities and the samehappens also in the Florence3D having the highest accuracyclose to 91 even if some actions are very similar

The MSR Action3D is a challenging dataset mainlycomprising gestures for human-computer interaction andnot actions Many gestures especially the ones included inAS2 subset are very similar to each other In order to improvethe recognition accuracy more complex features should beconsidered including not only jointsrsquo relative positions butalso their velocity Many sequences contain noisy skeletondata another approach to improving recognition accuracycould be the development of a method to discard noisyskeleton or to include depth-based features

However the AAL scenario raises some problems whichhave to be addressed First of all the algorithm exploits amul-ticlass SVM which is a good classifier but it does not makeeasy to understand if an activity belongs or not to any of thetraining classes In fact in a real scenario it is possible to havea sequence of frames that does not represent any activity ofthe training set in this case the SVM outputs the most likelyclass anyway even if it does not make sense Other machinelearning algorithms such as HMMs distinguish amongmultiple classes using the maximum posterior probabilityGaglio et al [29] proposed using a threshold on the outputprobability to detect unknown actions Usually SVMs do notprovide output probabilities but there exist some methodsto extend SVM implementations and make them able toprovide also this information [53 54] These techniquescan be investigated to understand their applicability in theproposed scenario

Another problem is segmentation of actions In manydatasets the actions are represented as segmented sequencesof frames but in real applications the algorithm has to handlea continuous stream of frames and to segment actions byitself Some solutions for segmentation have been proposedbut most of them are based on thresholds on movementswhich can be highly data-dependent [46] Also this aspect hasto be further investigated to have a systemwhich is effectivelyapplicable in a real AAL scenario

6 Conclusions

In this work a simple yet effective activity recognitionalgorithm has been proposed It is based on skeleton dataextracted from an RGBD sensor and it creates a featurevector representing the whole activity It is able to overcomestate-of-the-art results in two publicly available datasets theKARD and CAD-60 outperforming more complex algo-rithms in many conditions The algorithm has been testedalso over other more challenging datasets the UTKinect andFlorence3D where it is outperformed only by algorithmsexploiting temporal alignment techniques or a combinationof several machine learning methods The MSR Action3Dis a complex dataset and the proposed algorithm strugglesin the recognition of subsets constituted by similar gesturesHowever this dataset does not contain any activity of interestfor AAL scenarios

Future works will concern the application of the activityrecognition algorithm to AAL scenarios by consideringaction segmentation and the detection of unknown activities

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to acknowledge the contributionof the COST Action IC1303 AAPELE (Architectures Algo-rithms and Platforms for Enhanced Living Environments)

References

[1] P Rashidi and A Mihailidis ldquoA survey on ambient-assistedliving tools for older adultsrdquo IEEE Journal of Biomedical andHealth Informatics vol 17 no 3 pp 579ndash590 2013

[2] O C Ann and L B Theng ldquoHuman activity recognition areviewrdquo in Proceedings of the IEEE International Conference onControl System Computing and Engineering (ICCSCE rsquo14) pp389ndash393 IEEE Batu Ferringhi Malaysia November 2014

[3] S Wang and G Zhou ldquoA review on radio based activityrecognitionrdquo Digital Communications and Networks vol 1 no1 pp 20ndash29 2015

[4] L Atallah B Lo R Ali R King and G-Z Yang ldquoReal-timeactivity classificationusing ambient andwearable sensorsrdquo IEEETransactions on Information Technology in Biomedicine vol 13no 6 pp 1031ndash1039 2009

[5] D De P Bharti S K Das and S Chellappan ldquoMultimodalwearable sensing for fine-grained activity recognition in health-carerdquo IEEE Internet Computing vol 19 no 5 pp 26ndash35 2015

[6] J K Aggarwal and L Xia ldquoHuman activity recognition from 3Ddata a reviewrdquo Pattern Recognition Letters vol 48 pp 70ndash802014

[7] S Gasparrini E Cippitelli E Gambi S Spinsante and FFlorez-Revuelta ldquoPerformance analysis of self-organising neu-ral networks tracking algorithms for intake monitoring usingkinectrdquo in Proceedings of the 1st IET International Conferenceon Technologies for Active and Assisted Living (TechAAL rsquo15)Kingston UK November 2015

Computational Intelligence and Neuroscience 13

[8] J Shotton A Fitzgibbon M Cook et al ldquoReal-time humanpose recognition in parts from single depth imagesrdquo in Pro-ceedings of the IEEE Conference on Computer Vision and PatternRecognition (CVPR rsquo11) pp 1297ndash1304 IEEE Providence RdUSA June 2011

[9] J R Padilla-Lopez A A Chaaraoui F Gu and F Florez-Revuelta ldquoVisual privacy by context proposal and evaluationof a level-based visualisation schemerdquo Sensors vol 15 no 6 pp12959ndash12982 2015

[10] X Yang and Y Tian ldquoSuper normal vector for activity recog-nition using depth sequencesrdquo in Proceedings of the 27th IEEEConference on Computer Vision and Pattern Recognition (CVPRrsquo14) pp 804ndash811 IEEE Columbus Ohio USA June 2014

[11] R Slama H Wannous and M Daoudi ldquoGrassmannian rep-resentation of motion depth for 3D human gesture and actionrecognitionrdquo inProceedings of the 22nd International Conferenceon Pattern Recognition (ICPR rsquo14) pp 3499ndash3504 IEEE Stock-holm Sweden August 2014

[12] O Oreifej and Z Liu ldquoHON4D histogram of oriented 4Dnormals for activity recognition from depth sequencesrdquo inProceedings of the 26th IEEEConference onComputer Vision andPattern Recognition (CVPR rsquo13) pp 716ndash723 IEEE June 2013

[13] H Rahmani AMahmood D QHuynh andAMian ldquoHOPChistogram of oriented principal components of 3D pointcloudsfor action recognitionrdquo in Computer VisionmdashECCV 2014 DFleet T Pajdla B Schiele and T Tuytelaars Eds vol 8690 ofLecture Notes in Computer Science pp 742ndash757 Springer 2014

[14] G Chen D Clarke M Giuliani A Gaschler and A KnollldquoCombining unsupervised learning and discrimination for 3Daction recognitionrdquo Signal Processing vol 110 pp 67ndash81 2015

[15] A A Chaaraoui J R Padilla-Lopez and F Florez-RevueltaldquoFusion of skeletal and silhouette-based features for humanaction recognition with RGB-D devicesrdquo in Proceedings ofthe 14th IEEE International Conference on Computer VisionWorkshops (ICCVW rsquo13) pp 91ndash97 IEEE Sydney AustraliaDecember 2013

[16] S Althloothi M H Mahoor X Zhang and R M VoylesldquoHuman activity recognition using multi-features and multiplekernel learningrdquo Pattern Recognition vol 47 no 5 pp 1800ndash1812 2014

[17] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal featuresand joints for 3D action recognitionrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo13) pp 486ndash491 IEEE June 2013

[18] G Willems T Tuytelaars and L Van Gool ldquoAn efficient denseand scale-invariant spatio-temporal interest point detectorrdquo inComputer VisionmdashECCV 2008 D Forsyth P Torr and AZisserman Eds vol 5303 of Lecture Notes in Computer Sciencepp 650ndash663 Springer Berlin Germany 2008

[19] A Klaser M Marszałek and C Schmid ldquoA spatio-temporaldescriptor based on 3D-gradientsrdquo in Proceedings of the 19thBritish Machine Vision Conference (BMVC rsquo08) pp 995ndash1004September 2008

[20] J LuoWWang and H Qi ldquoSpatio-temporal feature extractionand representation for RGB-D human action recognitionrdquoPattern Recognition Letters vol 50 pp 139ndash148 2014

[21] Y Zhu W Chen and G Guo ldquoEvaluating spatiotemporalinterest point features for depth-based action recognitionrdquoImage and Vision Computing vol 32 no 8 pp 453ndash464 2014

[22] A-A Liu W-Z Nie Y-T Su L Ma T Hao and Z-X YangldquoCoupled hidden conditional random fields for RGB-D humanaction recognitionrdquo Signal Processing vol 112 pp 74ndash82 2015

[23] M Devanne H Wannous S Berretti P Pala M Daoudiand A Del Bimbo ldquoSpace-time pose representation for 3Dhuman action recognitionrdquo inNewTrends in ImageAnalysis andProcessingmdashICIAP 2013 vol 8158 of Lecture Notes in ComputerScience pp 456ndash464 Springer Berlin Germany 2013

[24] L Gan and F Chen ldquoHuman action recognition using APJ3Dand random forestsrdquo Journal of Software vol 8 no 9 pp 2238ndash2245 2013

[25] J Wang Z Liu Y Wu and J Yuan ldquoMining actionlet ensemblefor action recognition with depth camerasrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo12) pp 1290ndash1297 Providence RI USA June 2012

[26] L Xia C-C Chen and J K Aggarwal ldquoView invariant humanaction recognition using histograms of 3D jointsrdquo inProceedingsof the IEEE Computer Society Conference on Computer Visionand Pattern Recognition Workshops (CVPRW rsquo12) pp 20ndash27Providence RI June 2012

[27] X Yang and Y Tian ldquoEffective 3D action recognition usingEigen Jointsrdquo Journal of Visual Communication and ImageRepresentation vol 25 no 1 pp 2ndash11 2014

[28] A Taha H H Zayed M E Khalifa and E-S M El-HorbatyldquoHuman activity recognition for surveillance applicationsrdquo inProceedings of the 7th International Conference on InformationTechnology pp 577ndash586 2015

[29] S Gaglio G Lo Re and M Morana ldquoHuman activity recog-nition process using 3-D posture datardquo IEEE Transactions onHuman-Machine Systems vol 45 no 5 pp 586ndash597 2015

[30] I Theodorakopoulos D Kastaniotis G Economou and SFotopoulos ldquoPose-based human action recognition via sparserepresentation in dissimilarity spacerdquo Journal of Visual Commu-nication and Image Representation vol 25 no 1 pp 12ndash23 2014

[31] WDing K Liu F Cheng and J Zhang ldquoSTFC spatio-temporalfeature chain for skeleton-based human action recognitionrdquoJournal of Visual Communication and Image Representation vol26 pp 329ndash337 2015

[32] R Slama HWannous M Daoudi and A Srivastava ldquoAccurate3D action recognition using learning on the Grassmann mani-foldrdquo Pattern Recognition vol 48 no 2 pp 556ndash567 2015

[33] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 27th IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR rsquo14) pp 588ndash595Columbus Ohio USA June 2014

[34] R Anirudh P Turaga J Su and A Srivastava ldquoElastic func-tional coding of human actions From vector-fields to latentvariablesrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR rsquo15) pp 3147ndash3155Boston Mass USA June 2015

[35] M Jiang J Kong G Bebis and H Huo ldquoInformative jointsbased human action recognition using skeleton contextsrdquo SignalProcessing Image Communication vol 33 pp 29ndash40 2015

[36] A A Chaaraoui J R Padilla-Lopez P Climent-Perez andF Florez-Revuelta ldquoEvolutionary joint selection to improvehuman action recognitionwith RGB-Ddevicesrdquo Expert Systemswith Applications vol 41 no 3 pp 786ndash794 2014

[37] S Baysal M C Kurt and P Duygulu ldquoRecognizing humanactions using key posesrdquo in Proceedings of the 20th InternationalConference on Pattern Recognition (ICPR rsquo10) pp 1727ndash1730Istanbul Turkey August 2010

[38] A A Chaaraoui P Climent-Perez and F Florez-RevueltaldquoSilhouette-based human action recognition using sequences of

14 Computational Intelligence and Neuroscience

key posesrdquo Pattern Recognition Letters vol 34 no 15 pp 1799ndash1807 2013

[39] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[40] C-W Hsu and C-J Lin ldquoA comparison of methods for mul-ticlass support vector machinesrdquo IEEE Transactions on NeuralNetworks vol 13 no 2 pp 415ndash425 2002

[41] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[42] OpenNI 2015 httpstructureioopenni[43] J Sung C Ponce B Selman and A Saxena ldquoUnstructured

human activity detection fromRGBD imagesrdquo in Proceedings ofthe IEEE International Conference on Robotics and Automation(ICRA rsquo12) pp 842ndash849 Saint Paul Minn USA May 2012

[44] L Seidenari V Varano S Berretti A Del Bimbo and P PalaldquoRecognizing actions from depth cameras as weakly alignedmulti-part Bag-of-Posesrdquo in Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition Workshops(CVPRW rsquo13) pp 479ndash485 Portland Ore USA June 2013

[45] W Li Z Zhang and Z Liu ldquoAction recognition based ona bag of 3D pointsrdquo in Proceedings of the IEEE ComputerSociety Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo10) pp 9ndash14 San Francisco Calif USAJune 2010

[46] J Shan and S Akella ldquo3D human action segmentation andrecognition using pose kinetic energyrdquo in Proceedings of theIEEE Workshop on Advanced Robotics and its Social Impacts(ARSO rsquo14) pp 69ndash75 Evanston Ill USA September 2014

[47] D R Faria C Premebida and U Nunes ldquoA probabilisticapproach for human everyday activities recognition using bodymotion from RGB-D imagesrdquo in Proceedings of the 23rd IEEEInternational Symposium on Robot and Human InteractiveCommunication (RO-MAN rsquo14) pp 732ndash737 Edinburgh UKAugust 2014

[48] G I Parisi C Weber and S Wermter ldquoSelf-organizing neuralintegration of pose-motion features for human action recogni-tionrdquo Frontiers in Neurorobotics vol 9 no 3 2015

[49] J R Padilla-Lopez A A Chaaraoui and F Florez-Revuelta ldquoAdiscussion on the validation tests employed to compare humanaction recognition methods using the MSR Action3D datasetrdquohttparxivorgabs14077390

[50] S Azary and A Savakis ldquo3D Action classification using sparsespatio-temporal feature representationsrdquo in Advances in VisualComputing G Bebis R Boyle B Parvin et al Eds vol 7432 ofLecture Notes in Computer Science pp 166ndash175 Springer BerlinGermany 2012

[51] O Celiktutan C Wolf B Sankur and E Lombardi ldquoFast exacthyper-graph matching with dynamic programming for spatio-temporal datardquo Journal ofMathematical Imaging andVision vol51 no 1 pp 1ndash21 2015

[52] S Azary and A Savakis ldquoGrassmannian sparse representationsand motion depth surfaces for 3D action recognitionrdquo inProceedings of the IEEE Conference on Computer Vision andPattern Recognition Workshops (CVPRW rsquo13) pp 492ndash499Portland Ore USA June 2013

[53] J C Platt ldquoProbabilistic outputs for support vector machinesand comparisons to regularized likelihood methodsrdquo inAdvances in Large Margin Classifiers pp 61ndash74 MIT PressCambridge Mass USA 1999

[54] H-T Lin C-J Lin and R C Weng ldquoA note on Plattrsquosprobabilistic outputs for support vector machinesrdquo MachineLearning vol 68 no 3 pp 267ndash276 2007

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 7: Research Article A Human Activity Recognition System Using Skeleton …downloads.hindawi.com/journals/cin/2016/4351435.pdf · 2019-07-30 · Kinect skeleton is motivated by the fact

Computational Intelligence and Neuroscience 7

Table 2 Accuracy () of the proposed algorithm compared to the other using KARD dataset with different Activity Sets and for differentexperiments

Activity Set 1 Activity Set 2 Activity Set 3A B C A B C A B C

Gaglio et al [29] 951 991 930 899 949 901 842 895 817

Proposed (119875 = 7) 982 984 981 997 100 997 902 950 913

Proposed (119875 = 11) 980 990 977 998 100 996 916 958 933Proposed (119875 = 15) 975 988 976 995 100 996 91 951 932

090 003 007097 003

100100

003 093 003007 007 083 003007 007 077 010

003 093 003100

100100

100100

100090 010

003 003 007 087100

100

Hor

arm

wav

e

Hig

h ar

m w

ave

Two-

hand

wav

e

Catch

cap

Hig

h th

row

Dra

w X

Dra

w tic

k

Toss

pape

r

Forw

kick

Side

kick

Take

um

br

Bend

Han

d cla

p

Wal

k

Phon

e cal

l

Drin

k

Sit d

own

Stan

d up

Hor arm waveHigh arm wave

Two-hand waveCatch cap

High throwDraw X

Draw tickToss paperForw kick

Side kickTake umbr

BendHand clap

WalkPhone call

DrinkSit downStand up

Figure 3 Confusion matrix of the ldquonew-personrdquo test on the whole KARD dataset

Table 3 Accuracy () of the proposed algorithm compared tothe other using KARD dataset with dataset split in Gestures andActions for different experiments

Gestures ActionsA B C A B C

Gaglio et al [29] 865 930 867 925 950 901

Proposed (119875 = 7) 899 935 925 991 996 994Proposed (119875 = 11) 899 959 937 990 999 991Proposed (119875 = 15) 874 936 928 987 995 993

Table 4 Precision () and recall () of the proposed algorithmcompared to the other using the whole KARD dataset and ldquonew-personrdquo setting

Algorithm Precision RecallGaglio et al [29] 848 845

Proposed (119875 = 7) 940 937

Proposed (119875 = 11) 951 950Proposed (119875 = 15) 950 948

using a number of 119875 = 11 joints The overall precision andrecall are about 10higher than the previous approachwhichuses the KARD dataset In this condition the confusion

matrix obtained is shown in Figure 3 It can be noticed thatthe actions are distinguished very well only phone call anddrink show a recognition accuracy equal or lower than 90and sometimes they are mixed with each other The mostcritical activities are the draw X and draw tick gestures whichare quite similar

From the AAL point of view only some activities arerelevant Table 3 shows that the eight actions constituting theKARD dataset are recognized with an accuracy greater than98 even if there are some similar actions such as sit downand stand up or phone call and drink Considering only theActions subset the lower recognition accuracy is 949and itis obtained with119870 = 5 and119875 = 7 It means that the algorithmis able to reach a high recognition rate even if the featurevector is limited to 90 elements

422 CAD-60 Dataset TheCAD-60 dataset is a challengingdataset consisting of 12 activities performed by 4 people in5 different environments The dataset is usually evaluatedby splitting the activities according to the environmentthe global performance of the algorithm is given by theaverage precision and recall among all the environments Twodifferent settings were experimented for CAD-60 in [43]Theformer is defined ldquonew-personrdquo and the latter is the so-calledldquohave-seenrdquo ldquoNew-personrdquo setting has been considered in all

8 Computational Intelligence and Neuroscience

the works using CAD-60 so it is the one selected also in thiswork

The most challenging element of the dataset is thepresence of a left-handed actor In order to increase the per-formance which are particularly affected by this unbalancingin the ldquonew-personrdquo test mirrored copies of each action arecreated as suggested in [43] For each actor a left-handed anda right-handed version of each action are made availableThedummyversion of the activity has been obtained bymirroringthe skeleton with respect to the virtual sagittal plane that cutsthe person in a half The proposed algorithm is evaluatedusing three different sets of joints from 119875 = 7 to 119875 = 15and the sequence of clusters 119870 = [3 5 10 15 20 25 30 35]as in the KARD dataset

The best results are obtained with the configurations 119875 =11 and 119870 = 25 and the performance in terms of precisionand recall for each activity is shown in Table 5 Very goodresults are given in office environment where the averageprecision and recall are 964 and 958 respectively Infact the activities of this environment are quite different onlytalking on phone and drinking water are similar On the otherhand the living room environment includes talking on couchand relaxing on couch in addition to talking on phone anddrinking water and it is the most challenging case since theaverage precision and recall are 91 and 906

The proposed algorithm is compared to other worksusing the same ldquonew-personrdquo setting and the results areshown in Table 6 which tells that the 119875 = 11 configurationoutperforms the state-of-the-art results in terms of precisionand it is only 1 lower in terms of recall Shan and Akella[46] achieve very good results using a multiclass SVMscheme with a linear kernel However they train and testmirrored actions separately and then merge the results whencomputing average precision and recall Our approach simplyconsiders two copies of the same action given as input to themulticlass SVM and retrieves the classification results

The reduced number of joints does not affect too muchthe average performance of the algorithm that reaches aprecision of 927 and a recall of 915 with 119875 = 7Using all the available jonts on the other hand brings toa more substantial reduction of the performance showing879 and 867 for precision and recall respectively with119875 = 15 The best results for the proposed algorithm werealways obtained with a high number of clusters (25 or30) The reduction of this number affects the performancefor example considering the 119875 = 11 subset the worstperformance is obtained with 119870 = 5 and with a precision of866 and a recall of 860

From theAALpoint of view this dataset is composed onlyof actions not gestures so the dataset does not have to beseparated to evaluate the performance in a scenario which isclose to AAL

423 UTKinect Dataset The UTKinect dataset is composedof 10 activities performed twice by 10 subjects and theevaluation setting proposed in [26] is the leave-one-out-cross-validation (LOOCV) which means that the systemis trained on all the sequences except one and that one isused for testing Each trainingtesting procedure is repeated

Table 5 Precision () and recall () of the proposed algorithm inthe different environments of CAD-60 with 119875 = 11 and 119870 = 25

Location Activity ldquoNew-personrdquoPrecision Recall

Bathroom

Brushing teeth 889 100

Rinsing mouth 923 100

Wearing contact lens 100 792

Average 937 931

Bedroom

Talking on phone 917 917

Drinking water 913 875

Opening pill container 960 100

Average 930 931

Kitchen

Cooking-chopping 857 100

Cooking-stirring 100 791

Drinking water 960 100

Opening pill container 100 100

Average 954 948

Living room

Talking on phone 875 875

Drinking water 875 875

Talking on couch 889 100

Relaxing on couch 100 875

Average 910 906

Office

Talking on phone 100 875

Writing on whiteboard 100 958

Drinking water 857 100

Working on computer 100 100

Average 964 958Global average 939 935

Table 6 Global precision () and recall () of the proposedalgorithm for CAD-60 dataset and ldquonew-personrdquo setting withdifferent subsets of joints compared to other works

Algorithm Precision RecallSung et al [43] 679 555

Gaglio et al [29] 773 767

Proposed (119875 = 15) 879 867

Faria et al [47] 911 919

Parisi et al [48] 919 902

Proposed (119875 = 7) 927 915

Shan and Akella [46] 938 945Proposed (119875 = 11) 939 935

20 times to reduce the random effect of 119896-means For thisdataset all the different subsets of joints shown in Figure 2are considered since the skeleton is captured usingMicrosoftSDK which provides 20 joints The considered sequence ofclusters is only 119870 = [3 4 5] because the minimum numberof frames constituting an action sequence is 5

The results compared with previous works are shownin Table 7 The best results for the proposed algorithm areobtained with 119875 = 7 which is the smallest set of jointsconsidered The result corresponds to a number of 119870 = 4clusters but the difference with the other clusters is very low

Computational Intelligence and Neuroscience 9

093 002 005

094 006

099 001

100

002 098W

alk

Sit d

own

Stan

d up

Pick

up

Carr

y

Walk

Sit down

Stand up

Pick up

Carry

(a)

088 004 008

095 005

100

003 097

010 090

Wal

k

Sit d

own

Stan

d up

Pick

up

Carr

y

Walk

Sit down

Stand up

Pick up

Carry

(b)

Figure 4 Confusion matrices of the UTKinect dataset with only AAL related activities (a) Best accuracy confusion matrix obtained with119875 = 7 (b) Worst accuracy confusion matrix obtained with 119875 = 15

Table 7 Global accuracy () of the proposed algorithm forUTKinect dataset and LOOCV setting with different subsets ofjoints compared to other works

Algorithm AccuracyXia et al [26] 909

Theodorakopoulos et al [30] 9095

Ding et al [31] 915

Zhu et al [17] 919

Jiang et al [35] 919

Gan and Chen [24] 920

Liu et al [22] 920

Proposed (119875 = 15) 931

Proposed (119875 = 11) 942

Proposed (119875 = 19) 943

Anirudh et al [34] 949

Proposed (119875 = 7) 951

Vemulapalli et al [33] 971

only 06 with 119870 = 5 that provided the worst result Theselection of different sets of joints from 119875 = 7 to 119875 = 15changes the accuracy only by a 2 from 931 to 951In this dataset the main limitation to the performance isgiven by the reduced number of frames that constitute somesequences and limits the number of clusters representing theactions Vemulapalli et al [33] reach the highest accuracybut the approach is much more complex after modelingskeleton joints in a Lie group the processing scheme includesDynamic TimeWarping to perform temporal alignments anda specific representation called Fourier Temporal Pyramidbefore classification with one-versus-all multiclass SVM

Contextualizing the UTKinect dataset to AAL involvesthe consideration of a subset of activities which includes onlyactions and discards gestures In more detail the following 5activities have been selectedwalk sit down stand up pick upand carry In this condition the highest accuracy (967) isstill given by119875 = 7 with 3 clusters and the lower one (941)is represented by 119875 = 15 again with 3 clustersThe confusion

Table 8 Global accuracy () of the proposed algorithm forFlorence3D dataset and ldquonew-personrdquo setting with different subsetsof joints compared to other works

Algorithm AccuracySeidenari et al [44] 820

Proposed (119875 = 7) 821

Proposed (119875 = 15) 847

Proposed (119875 = 11) 861

Anirudh et al [34] 897

Vemulapalli et al [33] 909

Taha et al [28] 962

matrix for these two configurations are shown in Figure 4where the main difference is the reduced misclassificationbetween the activities walk and carry that are very similarto each other

424 Florence3D Dataset The Florence3D dataset is com-posed of 9 activities and 10 people performing themmultipletimes resulting in a total number of 215 activities Theproposed setting to evaluate this dataset is the leave-one-actor-out which is equivalent to the ldquonew-personrdquo settingpreviously described The minimum number of frames is 8so the following sequence of clusters has been considered119870 = [3 4 5 6 7 8] A number of 15 joints are available forthe skeleton so only the first three schemes are included inthe tests

Table 8 shows the results obtained with different subsetsof joints where the best result for the proposed algorithmis given by 119875 = 11 with 6 clusters The choice of clustersdoes not significantly affect the performance because themaximum reduction is 2 In this dataset the proposedapproach does not achieve state-of-the-art accuracy (962)given by Taha et al [28] However all the algorithmsthat overcome the proposed one exhibit greater complexitybecause they consider Lie group representations [33 34] orseveral machine learning algorithms such as SVM combined

10 Computational Intelligence and Neuroscience

(a)

(b)

(c)

Figure 5 Sequences of frames representing the drink from a bottle (a) answer phone (b) and read watch (c) activities from Florence3Ddataset

081 019

014 086

100

005 005 090

005 095

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(a)

076 019 005

032 064 005

004 092 004

005 010

005010005

085

080

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(b)

Figure 6 Confusion matrices of the Florence3D dataset with only AAL related activities (a) Best accuracy confusion matrix obtained with119875 = 11 (b) Worst accuracy confusion matrix obtained with 119875 = 7

to HMM to recognize activities composed of atomic actions[28]

If different subsets of joints are considered the perfor-mance decreases In particular considering only 7 joints and3 clusters it is possible to reach a maximum accuracy of821 which is very similar to the one obtained by Seidenariet al [44] who collected the dataset By including all theavailable joints (15) in the algorithm processing it is possibleto achieve an accuracy which varies between 840 and847

The Florence3D dataset is challenging because of tworeasons

(i) High interclass similarity some actions are verysimilar to each other for example drink from abottle answer phone and read watch As can beseen in Figure 5 all of them consist of an uprising

arm movement to the mouth hear or head Thisfact affects the performance because it is difficult toclassify actions consisting of very similar skeletonmovements

(ii) High intraclass variability the same action is per-formed in different ways by the same subject forexample using left right or both hands indifferently

Limiting the analysis to AAL related activities only thefollowing ones can be selected drink answer phone tight lacesit down and stand up The algorithm reaches the highestaccuracy (908) using the ldquonew-personrdquo setting with 119875 =11 joints and 119870 = 3 The confusion matrix obtained in thiscondition is shown in Figure 6(a) On the other hand theworst accuracy is 798 and it is obtained with 119875 = 7 and119870 = 4 In Figure 6(b) the confusion matrix for this scenario

Computational Intelligence and Neuroscience 11

081 008

004 056 022 011007

012 077 012

012

004 004 081 012

100

081 004 015

013 087

004 007019 070

a02

a03

a05

a06

a10

a13

a18

a20

a02

a03

a05

a06

a10

a13

a18

a20

(a)

056 015 007 015 007

007

007 017 013

016 044 008 004 004 024

011 056 004

003

022

010 070 017

063

100

010 087 003

007 093

a01

a04

a07

a08

a09

a11

a12

a14

a01

a04

a07

a08

a09

a11

a12

a14

(b)

088 012

100

005 095

097 003

003 097

093 007

003 003 083 010

004 011 085

a06

a14

a15

a16

a17

a18

a19

a20

a06

a14

a15

a16

a17

a18

a19

a20

(c)

Figure 7 Confusion matrices of the MSR Action3D dataset obtained with 119875 = 7 and the ldquonew-personrdquo test (a) AS1 (b) AS2 and (c) AS3

is shown In both situations the main problem is given by thesimilarity of drink and answer phone activities

425 MSR Action3D Dataset The MSR Action3D datasetis composed of 20 activities which are mainly gestures Itsevaluation is included for the sake of completeness in thecomparison with other activity recognition algorithms butthe dataset does not contain AAL related activities Thereis a big confusion in the literature about the validationtests to be used for MSR Action3D Padilla-Lopez et al[49] summarized all the validation methods for this datasetand recommend using all the possible combinations of 5-5 subjects splitting or using LOAO The former procedureconsists of considering 252 combinations of 5 subjects fortraining and the remaining 5 for testing while the latter is theleave-one-actor-out equivalent to the ldquonew-personrdquo schemepreviously introduced This dataset is quite challenging dueto the presence of similar and complex gestures hence theevaluation is performed by considering three subsets of 8gestures each (AS1 AS2 and AS3 in Table 9) as suggestedin [45] Since the minimum number of frames constitutingone sequence is 13 the following set of clusters has beenconsidered 119870 = [3 5 8 10 13] All the combinations of 711 15 and 20 joints shown in Figure 2 are included in theexperimental tests

Considering the ldquonew-personrdquo scheme the proposedalgorithm tested separately on the three subsets of MSRAction3D reaches an average accuracy of 812 with 119875 = 7and 119870 = 10 The confusion matrices for the three subsetsare shown in Figure 7 where it is possible to notice thatthe algorithm struggles in the recognition of the AS2 subset(Figure 7(b)) mainly represented by drawing gestures Betterresults are obtained processing the subset AS3 (Figure 7(c))where lower recognition rates are shown for complex ges-tures golf swing and pickup and throw Table 10 shows theresults obtained including also other joints which are slightlyworse A comparison with previous works validated usingthe ldquonew-personrdquo test is shown in Table 11 The proposedalgorithm achieves results comparable with [50] whichexploits an approach based on skeleton data Chaaraoui etal [15 36] exploits more complex algorithms considering

Table 9 Three subsets of gestures fromMSR Action3D dataset

AS1 AS2 AS3[11988602] Horizontalarm wave [11988601] High arm wave [11988606] High throw

[11988603] Hammer [11988604] Hand catch [11988614] Forward kick[11988605] Forwardpunch [11988607] Draw X [11988615] Side kick

[11988606] High throw [11988608] Draw tick [11988616] Jogging[11988610] Hand clap [11988609] Draw circle [11988617] Tennis swing[11988613] Bend [11988611] Two-hand wave [11988618] Tennis serve[11988618] Tennis serve [11988612] Side boxing [11988619] Golf swing[11988620] Pickup andthrow [11988614] Forward kick [11988620] Pickup and throw

Table 10 Accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting with different subsetsof joints and different subsets of activities

Algorithm AS1 AS2 AS3 AvgProposed (119875 = 15) 785 688 928 800

Proposed (119875 = 20) 790 702 919 804

Proposed (119875 = 11) 776 737 914 809

Proposed (119875 = 7) 795 719 923 812

the fusion of skeleton and depth data or the evolutionaryselection of the best set of joints

5 Discussion

The proposed algorithm despite its simplicity is able toachieve and sometimes to overcome state-of-the-art per-formance when applied to publicly available datasets Inparticular it is able to outperform some complex algorithmsexploiting more than one classifier in KARD and CAD-60datasets

Limiting the analysis to AAL related activities only thealgorithm achieves interesting results in all the datasetsThe group of 8 activities labeled as Actions in the KARD

12 Computational Intelligence and Neuroscience

Table 11 Average accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting compared to otherworks

Algorithm AccuracyCeliktutan et al (2015) [51] 729

Azary and Savakis (2013) [52] 785

Proposed (119875 = 7) 812

Azary and Savakis (2012) [50] 839

Chaaraoui et al (2013) [15] 906

Chaaraoui et al (2014) [36] 935

dataset is recognized with an accuracy greater than 987in the considered experiments and the group includes twosimilar activities such as phone call and drink The CAD-60dataset contains only actions so all the activities consideredwithin the proper location have been included in the eval-uation resulting in global precision and recall of 939 and935 respectively In the UTKinect dataset the performanceimproves considering only the AAL activities and the samehappens also in the Florence3D having the highest accuracyclose to 91 even if some actions are very similar

The MSR Action3D is a challenging dataset mainlycomprising gestures for human-computer interaction andnot actions Many gestures especially the ones included inAS2 subset are very similar to each other In order to improvethe recognition accuracy more complex features should beconsidered including not only jointsrsquo relative positions butalso their velocity Many sequences contain noisy skeletondata another approach to improving recognition accuracycould be the development of a method to discard noisyskeleton or to include depth-based features

However the AAL scenario raises some problems whichhave to be addressed First of all the algorithm exploits amul-ticlass SVM which is a good classifier but it does not makeeasy to understand if an activity belongs or not to any of thetraining classes In fact in a real scenario it is possible to havea sequence of frames that does not represent any activity ofthe training set in this case the SVM outputs the most likelyclass anyway even if it does not make sense Other machinelearning algorithms such as HMMs distinguish amongmultiple classes using the maximum posterior probabilityGaglio et al [29] proposed using a threshold on the outputprobability to detect unknown actions Usually SVMs do notprovide output probabilities but there exist some methodsto extend SVM implementations and make them able toprovide also this information [53 54] These techniquescan be investigated to understand their applicability in theproposed scenario

Another problem is segmentation of actions In manydatasets the actions are represented as segmented sequencesof frames but in real applications the algorithm has to handlea continuous stream of frames and to segment actions byitself Some solutions for segmentation have been proposedbut most of them are based on thresholds on movementswhich can be highly data-dependent [46] Also this aspect hasto be further investigated to have a systemwhich is effectivelyapplicable in a real AAL scenario

6 Conclusions

In this work a simple yet effective activity recognitionalgorithm has been proposed It is based on skeleton dataextracted from an RGBD sensor and it creates a featurevector representing the whole activity It is able to overcomestate-of-the-art results in two publicly available datasets theKARD and CAD-60 outperforming more complex algo-rithms in many conditions The algorithm has been testedalso over other more challenging datasets the UTKinect andFlorence3D where it is outperformed only by algorithmsexploiting temporal alignment techniques or a combinationof several machine learning methods The MSR Action3Dis a complex dataset and the proposed algorithm strugglesin the recognition of subsets constituted by similar gesturesHowever this dataset does not contain any activity of interestfor AAL scenarios

Future works will concern the application of the activityrecognition algorithm to AAL scenarios by consideringaction segmentation and the detection of unknown activities

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to acknowledge the contributionof the COST Action IC1303 AAPELE (Architectures Algo-rithms and Platforms for Enhanced Living Environments)

References

[1] P Rashidi and A Mihailidis ldquoA survey on ambient-assistedliving tools for older adultsrdquo IEEE Journal of Biomedical andHealth Informatics vol 17 no 3 pp 579ndash590 2013

[2] O C Ann and L B Theng ldquoHuman activity recognition areviewrdquo in Proceedings of the IEEE International Conference onControl System Computing and Engineering (ICCSCE rsquo14) pp389ndash393 IEEE Batu Ferringhi Malaysia November 2014

[3] S Wang and G Zhou ldquoA review on radio based activityrecognitionrdquo Digital Communications and Networks vol 1 no1 pp 20ndash29 2015

[4] L Atallah B Lo R Ali R King and G-Z Yang ldquoReal-timeactivity classificationusing ambient andwearable sensorsrdquo IEEETransactions on Information Technology in Biomedicine vol 13no 6 pp 1031ndash1039 2009

[5] D De P Bharti S K Das and S Chellappan ldquoMultimodalwearable sensing for fine-grained activity recognition in health-carerdquo IEEE Internet Computing vol 19 no 5 pp 26ndash35 2015

[6] J K Aggarwal and L Xia ldquoHuman activity recognition from 3Ddata a reviewrdquo Pattern Recognition Letters vol 48 pp 70ndash802014

[7] S Gasparrini E Cippitelli E Gambi S Spinsante and FFlorez-Revuelta ldquoPerformance analysis of self-organising neu-ral networks tracking algorithms for intake monitoring usingkinectrdquo in Proceedings of the 1st IET International Conferenceon Technologies for Active and Assisted Living (TechAAL rsquo15)Kingston UK November 2015

Computational Intelligence and Neuroscience 13

[8] J Shotton A Fitzgibbon M Cook et al ldquoReal-time humanpose recognition in parts from single depth imagesrdquo in Pro-ceedings of the IEEE Conference on Computer Vision and PatternRecognition (CVPR rsquo11) pp 1297ndash1304 IEEE Providence RdUSA June 2011

[9] J R Padilla-Lopez A A Chaaraoui F Gu and F Florez-Revuelta ldquoVisual privacy by context proposal and evaluationof a level-based visualisation schemerdquo Sensors vol 15 no 6 pp12959ndash12982 2015

[10] X Yang and Y Tian ldquoSuper normal vector for activity recog-nition using depth sequencesrdquo in Proceedings of the 27th IEEEConference on Computer Vision and Pattern Recognition (CVPRrsquo14) pp 804ndash811 IEEE Columbus Ohio USA June 2014

[11] R Slama H Wannous and M Daoudi ldquoGrassmannian rep-resentation of motion depth for 3D human gesture and actionrecognitionrdquo inProceedings of the 22nd International Conferenceon Pattern Recognition (ICPR rsquo14) pp 3499ndash3504 IEEE Stock-holm Sweden August 2014

[12] O Oreifej and Z Liu ldquoHON4D histogram of oriented 4Dnormals for activity recognition from depth sequencesrdquo inProceedings of the 26th IEEEConference onComputer Vision andPattern Recognition (CVPR rsquo13) pp 716ndash723 IEEE June 2013

[13] H Rahmani AMahmood D QHuynh andAMian ldquoHOPChistogram of oriented principal components of 3D pointcloudsfor action recognitionrdquo in Computer VisionmdashECCV 2014 DFleet T Pajdla B Schiele and T Tuytelaars Eds vol 8690 ofLecture Notes in Computer Science pp 742ndash757 Springer 2014

[14] G Chen D Clarke M Giuliani A Gaschler and A KnollldquoCombining unsupervised learning and discrimination for 3Daction recognitionrdquo Signal Processing vol 110 pp 67ndash81 2015

[15] A A Chaaraoui J R Padilla-Lopez and F Florez-RevueltaldquoFusion of skeletal and silhouette-based features for humanaction recognition with RGB-D devicesrdquo in Proceedings ofthe 14th IEEE International Conference on Computer VisionWorkshops (ICCVW rsquo13) pp 91ndash97 IEEE Sydney AustraliaDecember 2013

[16] S Althloothi M H Mahoor X Zhang and R M VoylesldquoHuman activity recognition using multi-features and multiplekernel learningrdquo Pattern Recognition vol 47 no 5 pp 1800ndash1812 2014

[17] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal featuresand joints for 3D action recognitionrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo13) pp 486ndash491 IEEE June 2013

[18] G Willems T Tuytelaars and L Van Gool ldquoAn efficient denseand scale-invariant spatio-temporal interest point detectorrdquo inComputer VisionmdashECCV 2008 D Forsyth P Torr and AZisserman Eds vol 5303 of Lecture Notes in Computer Sciencepp 650ndash663 Springer Berlin Germany 2008

[19] A Klaser M Marszałek and C Schmid ldquoA spatio-temporaldescriptor based on 3D-gradientsrdquo in Proceedings of the 19thBritish Machine Vision Conference (BMVC rsquo08) pp 995ndash1004September 2008

[20] J LuoWWang and H Qi ldquoSpatio-temporal feature extractionand representation for RGB-D human action recognitionrdquoPattern Recognition Letters vol 50 pp 139ndash148 2014

[21] Y Zhu W Chen and G Guo ldquoEvaluating spatiotemporalinterest point features for depth-based action recognitionrdquoImage and Vision Computing vol 32 no 8 pp 453ndash464 2014

[22] A-A Liu W-Z Nie Y-T Su L Ma T Hao and Z-X YangldquoCoupled hidden conditional random fields for RGB-D humanaction recognitionrdquo Signal Processing vol 112 pp 74ndash82 2015

[23] M Devanne H Wannous S Berretti P Pala M Daoudiand A Del Bimbo ldquoSpace-time pose representation for 3Dhuman action recognitionrdquo inNewTrends in ImageAnalysis andProcessingmdashICIAP 2013 vol 8158 of Lecture Notes in ComputerScience pp 456ndash464 Springer Berlin Germany 2013

[24] L Gan and F Chen ldquoHuman action recognition using APJ3Dand random forestsrdquo Journal of Software vol 8 no 9 pp 2238ndash2245 2013

[25] J Wang Z Liu Y Wu and J Yuan ldquoMining actionlet ensemblefor action recognition with depth camerasrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo12) pp 1290ndash1297 Providence RI USA June 2012

[26] L Xia C-C Chen and J K Aggarwal ldquoView invariant humanaction recognition using histograms of 3D jointsrdquo inProceedingsof the IEEE Computer Society Conference on Computer Visionand Pattern Recognition Workshops (CVPRW rsquo12) pp 20ndash27Providence RI June 2012

[27] X Yang and Y Tian ldquoEffective 3D action recognition usingEigen Jointsrdquo Journal of Visual Communication and ImageRepresentation vol 25 no 1 pp 2ndash11 2014

[28] A Taha H H Zayed M E Khalifa and E-S M El-HorbatyldquoHuman activity recognition for surveillance applicationsrdquo inProceedings of the 7th International Conference on InformationTechnology pp 577ndash586 2015

[29] S Gaglio G Lo Re and M Morana ldquoHuman activity recog-nition process using 3-D posture datardquo IEEE Transactions onHuman-Machine Systems vol 45 no 5 pp 586ndash597 2015

[30] I Theodorakopoulos D Kastaniotis G Economou and SFotopoulos ldquoPose-based human action recognition via sparserepresentation in dissimilarity spacerdquo Journal of Visual Commu-nication and Image Representation vol 25 no 1 pp 12ndash23 2014

[31] WDing K Liu F Cheng and J Zhang ldquoSTFC spatio-temporalfeature chain for skeleton-based human action recognitionrdquoJournal of Visual Communication and Image Representation vol26 pp 329ndash337 2015

[32] R Slama HWannous M Daoudi and A Srivastava ldquoAccurate3D action recognition using learning on the Grassmann mani-foldrdquo Pattern Recognition vol 48 no 2 pp 556ndash567 2015

[33] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 27th IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR rsquo14) pp 588ndash595Columbus Ohio USA June 2014

[34] R Anirudh P Turaga J Su and A Srivastava ldquoElastic func-tional coding of human actions From vector-fields to latentvariablesrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR rsquo15) pp 3147ndash3155Boston Mass USA June 2015

[35] M Jiang J Kong G Bebis and H Huo ldquoInformative jointsbased human action recognition using skeleton contextsrdquo SignalProcessing Image Communication vol 33 pp 29ndash40 2015

[36] A A Chaaraoui J R Padilla-Lopez P Climent-Perez andF Florez-Revuelta ldquoEvolutionary joint selection to improvehuman action recognitionwith RGB-Ddevicesrdquo Expert Systemswith Applications vol 41 no 3 pp 786ndash794 2014

[37] S Baysal M C Kurt and P Duygulu ldquoRecognizing humanactions using key posesrdquo in Proceedings of the 20th InternationalConference on Pattern Recognition (ICPR rsquo10) pp 1727ndash1730Istanbul Turkey August 2010

[38] A A Chaaraoui P Climent-Perez and F Florez-RevueltaldquoSilhouette-based human action recognition using sequences of

14 Computational Intelligence and Neuroscience

key posesrdquo Pattern Recognition Letters vol 34 no 15 pp 1799ndash1807 2013

[39] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[40] C-W Hsu and C-J Lin ldquoA comparison of methods for mul-ticlass support vector machinesrdquo IEEE Transactions on NeuralNetworks vol 13 no 2 pp 415ndash425 2002

[41] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[42] OpenNI 2015 httpstructureioopenni[43] J Sung C Ponce B Selman and A Saxena ldquoUnstructured

human activity detection fromRGBD imagesrdquo in Proceedings ofthe IEEE International Conference on Robotics and Automation(ICRA rsquo12) pp 842ndash849 Saint Paul Minn USA May 2012

[44] L Seidenari V Varano S Berretti A Del Bimbo and P PalaldquoRecognizing actions from depth cameras as weakly alignedmulti-part Bag-of-Posesrdquo in Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition Workshops(CVPRW rsquo13) pp 479ndash485 Portland Ore USA June 2013

[45] W Li Z Zhang and Z Liu ldquoAction recognition based ona bag of 3D pointsrdquo in Proceedings of the IEEE ComputerSociety Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo10) pp 9ndash14 San Francisco Calif USAJune 2010

[46] J Shan and S Akella ldquo3D human action segmentation andrecognition using pose kinetic energyrdquo in Proceedings of theIEEE Workshop on Advanced Robotics and its Social Impacts(ARSO rsquo14) pp 69ndash75 Evanston Ill USA September 2014

[47] D R Faria C Premebida and U Nunes ldquoA probabilisticapproach for human everyday activities recognition using bodymotion from RGB-D imagesrdquo in Proceedings of the 23rd IEEEInternational Symposium on Robot and Human InteractiveCommunication (RO-MAN rsquo14) pp 732ndash737 Edinburgh UKAugust 2014

[48] G I Parisi C Weber and S Wermter ldquoSelf-organizing neuralintegration of pose-motion features for human action recogni-tionrdquo Frontiers in Neurorobotics vol 9 no 3 2015

[49] J R Padilla-Lopez A A Chaaraoui and F Florez-Revuelta ldquoAdiscussion on the validation tests employed to compare humanaction recognition methods using the MSR Action3D datasetrdquohttparxivorgabs14077390

[50] S Azary and A Savakis ldquo3D Action classification using sparsespatio-temporal feature representationsrdquo in Advances in VisualComputing G Bebis R Boyle B Parvin et al Eds vol 7432 ofLecture Notes in Computer Science pp 166ndash175 Springer BerlinGermany 2012

[51] O Celiktutan C Wolf B Sankur and E Lombardi ldquoFast exacthyper-graph matching with dynamic programming for spatio-temporal datardquo Journal ofMathematical Imaging andVision vol51 no 1 pp 1ndash21 2015

[52] S Azary and A Savakis ldquoGrassmannian sparse representationsand motion depth surfaces for 3D action recognitionrdquo inProceedings of the IEEE Conference on Computer Vision andPattern Recognition Workshops (CVPRW rsquo13) pp 492ndash499Portland Ore USA June 2013

[53] J C Platt ldquoProbabilistic outputs for support vector machinesand comparisons to regularized likelihood methodsrdquo inAdvances in Large Margin Classifiers pp 61ndash74 MIT PressCambridge Mass USA 1999

[54] H-T Lin C-J Lin and R C Weng ldquoA note on Plattrsquosprobabilistic outputs for support vector machinesrdquo MachineLearning vol 68 no 3 pp 267ndash276 2007

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 8: Research Article A Human Activity Recognition System Using Skeleton …downloads.hindawi.com/journals/cin/2016/4351435.pdf · 2019-07-30 · Kinect skeleton is motivated by the fact

8 Computational Intelligence and Neuroscience

the works using CAD-60 so it is the one selected also in thiswork

The most challenging element of the dataset is thepresence of a left-handed actor In order to increase the per-formance which are particularly affected by this unbalancingin the ldquonew-personrdquo test mirrored copies of each action arecreated as suggested in [43] For each actor a left-handed anda right-handed version of each action are made availableThedummyversion of the activity has been obtained bymirroringthe skeleton with respect to the virtual sagittal plane that cutsthe person in a half The proposed algorithm is evaluatedusing three different sets of joints from 119875 = 7 to 119875 = 15and the sequence of clusters 119870 = [3 5 10 15 20 25 30 35]as in the KARD dataset

The best results are obtained with the configurations 119875 =11 and 119870 = 25 and the performance in terms of precisionand recall for each activity is shown in Table 5 Very goodresults are given in office environment where the averageprecision and recall are 964 and 958 respectively Infact the activities of this environment are quite different onlytalking on phone and drinking water are similar On the otherhand the living room environment includes talking on couchand relaxing on couch in addition to talking on phone anddrinking water and it is the most challenging case since theaverage precision and recall are 91 and 906

The proposed algorithm is compared to other worksusing the same ldquonew-personrdquo setting and the results areshown in Table 6 which tells that the 119875 = 11 configurationoutperforms the state-of-the-art results in terms of precisionand it is only 1 lower in terms of recall Shan and Akella[46] achieve very good results using a multiclass SVMscheme with a linear kernel However they train and testmirrored actions separately and then merge the results whencomputing average precision and recall Our approach simplyconsiders two copies of the same action given as input to themulticlass SVM and retrieves the classification results

The reduced number of joints does not affect too muchthe average performance of the algorithm that reaches aprecision of 927 and a recall of 915 with 119875 = 7Using all the available jonts on the other hand brings toa more substantial reduction of the performance showing879 and 867 for precision and recall respectively with119875 = 15 The best results for the proposed algorithm werealways obtained with a high number of clusters (25 or30) The reduction of this number affects the performancefor example considering the 119875 = 11 subset the worstperformance is obtained with 119870 = 5 and with a precision of866 and a recall of 860

From theAALpoint of view this dataset is composed onlyof actions not gestures so the dataset does not have to beseparated to evaluate the performance in a scenario which isclose to AAL

423 UTKinect Dataset The UTKinect dataset is composedof 10 activities performed twice by 10 subjects and theevaluation setting proposed in [26] is the leave-one-out-cross-validation (LOOCV) which means that the systemis trained on all the sequences except one and that one isused for testing Each trainingtesting procedure is repeated

Table 5 Precision () and recall () of the proposed algorithm inthe different environments of CAD-60 with 119875 = 11 and 119870 = 25

Location Activity ldquoNew-personrdquoPrecision Recall

Bathroom

Brushing teeth 889 100

Rinsing mouth 923 100

Wearing contact lens 100 792

Average 937 931

Bedroom

Talking on phone 917 917

Drinking water 913 875

Opening pill container 960 100

Average 930 931

Kitchen

Cooking-chopping 857 100

Cooking-stirring 100 791

Drinking water 960 100

Opening pill container 100 100

Average 954 948

Living room

Talking on phone 875 875

Drinking water 875 875

Talking on couch 889 100

Relaxing on couch 100 875

Average 910 906

Office

Talking on phone 100 875

Writing on whiteboard 100 958

Drinking water 857 100

Working on computer 100 100

Average 964 958Global average 939 935

Table 6 Global precision () and recall () of the proposedalgorithm for CAD-60 dataset and ldquonew-personrdquo setting withdifferent subsets of joints compared to other works

Algorithm Precision RecallSung et al [43] 679 555

Gaglio et al [29] 773 767

Proposed (119875 = 15) 879 867

Faria et al [47] 911 919

Parisi et al [48] 919 902

Proposed (119875 = 7) 927 915

Shan and Akella [46] 938 945Proposed (119875 = 11) 939 935

20 times to reduce the random effect of 119896-means For thisdataset all the different subsets of joints shown in Figure 2are considered since the skeleton is captured usingMicrosoftSDK which provides 20 joints The considered sequence ofclusters is only 119870 = [3 4 5] because the minimum numberof frames constituting an action sequence is 5

The results compared with previous works are shownin Table 7 The best results for the proposed algorithm areobtained with 119875 = 7 which is the smallest set of jointsconsidered The result corresponds to a number of 119870 = 4clusters but the difference with the other clusters is very low

Computational Intelligence and Neuroscience 9

093 002 005

094 006

099 001

100

002 098W

alk

Sit d

own

Stan

d up

Pick

up

Carr

y

Walk

Sit down

Stand up

Pick up

Carry

(a)

088 004 008

095 005

100

003 097

010 090

Wal

k

Sit d

own

Stan

d up

Pick

up

Carr

y

Walk

Sit down

Stand up

Pick up

Carry

(b)

Figure 4 Confusion matrices of the UTKinect dataset with only AAL related activities (a) Best accuracy confusion matrix obtained with119875 = 7 (b) Worst accuracy confusion matrix obtained with 119875 = 15

Table 7 Global accuracy () of the proposed algorithm forUTKinect dataset and LOOCV setting with different subsets ofjoints compared to other works

Algorithm AccuracyXia et al [26] 909

Theodorakopoulos et al [30] 9095

Ding et al [31] 915

Zhu et al [17] 919

Jiang et al [35] 919

Gan and Chen [24] 920

Liu et al [22] 920

Proposed (119875 = 15) 931

Proposed (119875 = 11) 942

Proposed (119875 = 19) 943

Anirudh et al [34] 949

Proposed (119875 = 7) 951

Vemulapalli et al [33] 971

only 06 with 119870 = 5 that provided the worst result Theselection of different sets of joints from 119875 = 7 to 119875 = 15changes the accuracy only by a 2 from 931 to 951In this dataset the main limitation to the performance isgiven by the reduced number of frames that constitute somesequences and limits the number of clusters representing theactions Vemulapalli et al [33] reach the highest accuracybut the approach is much more complex after modelingskeleton joints in a Lie group the processing scheme includesDynamic TimeWarping to perform temporal alignments anda specific representation called Fourier Temporal Pyramidbefore classification with one-versus-all multiclass SVM

Contextualizing the UTKinect dataset to AAL involvesthe consideration of a subset of activities which includes onlyactions and discards gestures In more detail the following 5activities have been selectedwalk sit down stand up pick upand carry In this condition the highest accuracy (967) isstill given by119875 = 7 with 3 clusters and the lower one (941)is represented by 119875 = 15 again with 3 clustersThe confusion

Table 8 Global accuracy () of the proposed algorithm forFlorence3D dataset and ldquonew-personrdquo setting with different subsetsof joints compared to other works

Algorithm AccuracySeidenari et al [44] 820

Proposed (119875 = 7) 821

Proposed (119875 = 15) 847

Proposed (119875 = 11) 861

Anirudh et al [34] 897

Vemulapalli et al [33] 909

Taha et al [28] 962

matrix for these two configurations are shown in Figure 4where the main difference is the reduced misclassificationbetween the activities walk and carry that are very similarto each other

424 Florence3D Dataset The Florence3D dataset is com-posed of 9 activities and 10 people performing themmultipletimes resulting in a total number of 215 activities Theproposed setting to evaluate this dataset is the leave-one-actor-out which is equivalent to the ldquonew-personrdquo settingpreviously described The minimum number of frames is 8so the following sequence of clusters has been considered119870 = [3 4 5 6 7 8] A number of 15 joints are available forthe skeleton so only the first three schemes are included inthe tests

Table 8 shows the results obtained with different subsetsof joints where the best result for the proposed algorithmis given by 119875 = 11 with 6 clusters The choice of clustersdoes not significantly affect the performance because themaximum reduction is 2 In this dataset the proposedapproach does not achieve state-of-the-art accuracy (962)given by Taha et al [28] However all the algorithmsthat overcome the proposed one exhibit greater complexitybecause they consider Lie group representations [33 34] orseveral machine learning algorithms such as SVM combined

10 Computational Intelligence and Neuroscience

(a)

(b)

(c)

Figure 5 Sequences of frames representing the drink from a bottle (a) answer phone (b) and read watch (c) activities from Florence3Ddataset

081 019

014 086

100

005 005 090

005 095

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(a)

076 019 005

032 064 005

004 092 004

005 010

005010005

085

080

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(b)

Figure 6 Confusion matrices of the Florence3D dataset with only AAL related activities (a) Best accuracy confusion matrix obtained with119875 = 11 (b) Worst accuracy confusion matrix obtained with 119875 = 7

to HMM to recognize activities composed of atomic actions[28]

If different subsets of joints are considered the perfor-mance decreases In particular considering only 7 joints and3 clusters it is possible to reach a maximum accuracy of821 which is very similar to the one obtained by Seidenariet al [44] who collected the dataset By including all theavailable joints (15) in the algorithm processing it is possibleto achieve an accuracy which varies between 840 and847

The Florence3D dataset is challenging because of tworeasons

(i) High interclass similarity some actions are verysimilar to each other for example drink from abottle answer phone and read watch As can beseen in Figure 5 all of them consist of an uprising

arm movement to the mouth hear or head Thisfact affects the performance because it is difficult toclassify actions consisting of very similar skeletonmovements

(ii) High intraclass variability the same action is per-formed in different ways by the same subject forexample using left right or both hands indifferently

Limiting the analysis to AAL related activities only thefollowing ones can be selected drink answer phone tight lacesit down and stand up The algorithm reaches the highestaccuracy (908) using the ldquonew-personrdquo setting with 119875 =11 joints and 119870 = 3 The confusion matrix obtained in thiscondition is shown in Figure 6(a) On the other hand theworst accuracy is 798 and it is obtained with 119875 = 7 and119870 = 4 In Figure 6(b) the confusion matrix for this scenario

Computational Intelligence and Neuroscience 11

081 008

004 056 022 011007

012 077 012

012

004 004 081 012

100

081 004 015

013 087

004 007019 070

a02

a03

a05

a06

a10

a13

a18

a20

a02

a03

a05

a06

a10

a13

a18

a20

(a)

056 015 007 015 007

007

007 017 013

016 044 008 004 004 024

011 056 004

003

022

010 070 017

063

100

010 087 003

007 093

a01

a04

a07

a08

a09

a11

a12

a14

a01

a04

a07

a08

a09

a11

a12

a14

(b)

088 012

100

005 095

097 003

003 097

093 007

003 003 083 010

004 011 085

a06

a14

a15

a16

a17

a18

a19

a20

a06

a14

a15

a16

a17

a18

a19

a20

(c)

Figure 7 Confusion matrices of the MSR Action3D dataset obtained with 119875 = 7 and the ldquonew-personrdquo test (a) AS1 (b) AS2 and (c) AS3

is shown In both situations the main problem is given by thesimilarity of drink and answer phone activities

425 MSR Action3D Dataset The MSR Action3D datasetis composed of 20 activities which are mainly gestures Itsevaluation is included for the sake of completeness in thecomparison with other activity recognition algorithms butthe dataset does not contain AAL related activities Thereis a big confusion in the literature about the validationtests to be used for MSR Action3D Padilla-Lopez et al[49] summarized all the validation methods for this datasetand recommend using all the possible combinations of 5-5 subjects splitting or using LOAO The former procedureconsists of considering 252 combinations of 5 subjects fortraining and the remaining 5 for testing while the latter is theleave-one-actor-out equivalent to the ldquonew-personrdquo schemepreviously introduced This dataset is quite challenging dueto the presence of similar and complex gestures hence theevaluation is performed by considering three subsets of 8gestures each (AS1 AS2 and AS3 in Table 9) as suggestedin [45] Since the minimum number of frames constitutingone sequence is 13 the following set of clusters has beenconsidered 119870 = [3 5 8 10 13] All the combinations of 711 15 and 20 joints shown in Figure 2 are included in theexperimental tests

Considering the ldquonew-personrdquo scheme the proposedalgorithm tested separately on the three subsets of MSRAction3D reaches an average accuracy of 812 with 119875 = 7and 119870 = 10 The confusion matrices for the three subsetsare shown in Figure 7 where it is possible to notice thatthe algorithm struggles in the recognition of the AS2 subset(Figure 7(b)) mainly represented by drawing gestures Betterresults are obtained processing the subset AS3 (Figure 7(c))where lower recognition rates are shown for complex ges-tures golf swing and pickup and throw Table 10 shows theresults obtained including also other joints which are slightlyworse A comparison with previous works validated usingthe ldquonew-personrdquo test is shown in Table 11 The proposedalgorithm achieves results comparable with [50] whichexploits an approach based on skeleton data Chaaraoui etal [15 36] exploits more complex algorithms considering

Table 9 Three subsets of gestures fromMSR Action3D dataset

AS1 AS2 AS3[11988602] Horizontalarm wave [11988601] High arm wave [11988606] High throw

[11988603] Hammer [11988604] Hand catch [11988614] Forward kick[11988605] Forwardpunch [11988607] Draw X [11988615] Side kick

[11988606] High throw [11988608] Draw tick [11988616] Jogging[11988610] Hand clap [11988609] Draw circle [11988617] Tennis swing[11988613] Bend [11988611] Two-hand wave [11988618] Tennis serve[11988618] Tennis serve [11988612] Side boxing [11988619] Golf swing[11988620] Pickup andthrow [11988614] Forward kick [11988620] Pickup and throw

Table 10 Accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting with different subsetsof joints and different subsets of activities

Algorithm AS1 AS2 AS3 AvgProposed (119875 = 15) 785 688 928 800

Proposed (119875 = 20) 790 702 919 804

Proposed (119875 = 11) 776 737 914 809

Proposed (119875 = 7) 795 719 923 812

the fusion of skeleton and depth data or the evolutionaryselection of the best set of joints

5 Discussion

The proposed algorithm despite its simplicity is able toachieve and sometimes to overcome state-of-the-art per-formance when applied to publicly available datasets Inparticular it is able to outperform some complex algorithmsexploiting more than one classifier in KARD and CAD-60datasets

Limiting the analysis to AAL related activities only thealgorithm achieves interesting results in all the datasetsThe group of 8 activities labeled as Actions in the KARD

12 Computational Intelligence and Neuroscience

Table 11 Average accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting compared to otherworks

Algorithm AccuracyCeliktutan et al (2015) [51] 729

Azary and Savakis (2013) [52] 785

Proposed (119875 = 7) 812

Azary and Savakis (2012) [50] 839

Chaaraoui et al (2013) [15] 906

Chaaraoui et al (2014) [36] 935

dataset is recognized with an accuracy greater than 987in the considered experiments and the group includes twosimilar activities such as phone call and drink The CAD-60dataset contains only actions so all the activities consideredwithin the proper location have been included in the eval-uation resulting in global precision and recall of 939 and935 respectively In the UTKinect dataset the performanceimproves considering only the AAL activities and the samehappens also in the Florence3D having the highest accuracyclose to 91 even if some actions are very similar

The MSR Action3D is a challenging dataset mainlycomprising gestures for human-computer interaction andnot actions Many gestures especially the ones included inAS2 subset are very similar to each other In order to improvethe recognition accuracy more complex features should beconsidered including not only jointsrsquo relative positions butalso their velocity Many sequences contain noisy skeletondata another approach to improving recognition accuracycould be the development of a method to discard noisyskeleton or to include depth-based features

However the AAL scenario raises some problems whichhave to be addressed First of all the algorithm exploits amul-ticlass SVM which is a good classifier but it does not makeeasy to understand if an activity belongs or not to any of thetraining classes In fact in a real scenario it is possible to havea sequence of frames that does not represent any activity ofthe training set in this case the SVM outputs the most likelyclass anyway even if it does not make sense Other machinelearning algorithms such as HMMs distinguish amongmultiple classes using the maximum posterior probabilityGaglio et al [29] proposed using a threshold on the outputprobability to detect unknown actions Usually SVMs do notprovide output probabilities but there exist some methodsto extend SVM implementations and make them able toprovide also this information [53 54] These techniquescan be investigated to understand their applicability in theproposed scenario

Another problem is segmentation of actions In manydatasets the actions are represented as segmented sequencesof frames but in real applications the algorithm has to handlea continuous stream of frames and to segment actions byitself Some solutions for segmentation have been proposedbut most of them are based on thresholds on movementswhich can be highly data-dependent [46] Also this aspect hasto be further investigated to have a systemwhich is effectivelyapplicable in a real AAL scenario

6 Conclusions

In this work a simple yet effective activity recognitionalgorithm has been proposed It is based on skeleton dataextracted from an RGBD sensor and it creates a featurevector representing the whole activity It is able to overcomestate-of-the-art results in two publicly available datasets theKARD and CAD-60 outperforming more complex algo-rithms in many conditions The algorithm has been testedalso over other more challenging datasets the UTKinect andFlorence3D where it is outperformed only by algorithmsexploiting temporal alignment techniques or a combinationof several machine learning methods The MSR Action3Dis a complex dataset and the proposed algorithm strugglesin the recognition of subsets constituted by similar gesturesHowever this dataset does not contain any activity of interestfor AAL scenarios

Future works will concern the application of the activityrecognition algorithm to AAL scenarios by consideringaction segmentation and the detection of unknown activities

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to acknowledge the contributionof the COST Action IC1303 AAPELE (Architectures Algo-rithms and Platforms for Enhanced Living Environments)

References

[1] P Rashidi and A Mihailidis ldquoA survey on ambient-assistedliving tools for older adultsrdquo IEEE Journal of Biomedical andHealth Informatics vol 17 no 3 pp 579ndash590 2013

[2] O C Ann and L B Theng ldquoHuman activity recognition areviewrdquo in Proceedings of the IEEE International Conference onControl System Computing and Engineering (ICCSCE rsquo14) pp389ndash393 IEEE Batu Ferringhi Malaysia November 2014

[3] S Wang and G Zhou ldquoA review on radio based activityrecognitionrdquo Digital Communications and Networks vol 1 no1 pp 20ndash29 2015

[4] L Atallah B Lo R Ali R King and G-Z Yang ldquoReal-timeactivity classificationusing ambient andwearable sensorsrdquo IEEETransactions on Information Technology in Biomedicine vol 13no 6 pp 1031ndash1039 2009

[5] D De P Bharti S K Das and S Chellappan ldquoMultimodalwearable sensing for fine-grained activity recognition in health-carerdquo IEEE Internet Computing vol 19 no 5 pp 26ndash35 2015

[6] J K Aggarwal and L Xia ldquoHuman activity recognition from 3Ddata a reviewrdquo Pattern Recognition Letters vol 48 pp 70ndash802014

[7] S Gasparrini E Cippitelli E Gambi S Spinsante and FFlorez-Revuelta ldquoPerformance analysis of self-organising neu-ral networks tracking algorithms for intake monitoring usingkinectrdquo in Proceedings of the 1st IET International Conferenceon Technologies for Active and Assisted Living (TechAAL rsquo15)Kingston UK November 2015

Computational Intelligence and Neuroscience 13

[8] J Shotton A Fitzgibbon M Cook et al ldquoReal-time humanpose recognition in parts from single depth imagesrdquo in Pro-ceedings of the IEEE Conference on Computer Vision and PatternRecognition (CVPR rsquo11) pp 1297ndash1304 IEEE Providence RdUSA June 2011

[9] J R Padilla-Lopez A A Chaaraoui F Gu and F Florez-Revuelta ldquoVisual privacy by context proposal and evaluationof a level-based visualisation schemerdquo Sensors vol 15 no 6 pp12959ndash12982 2015

[10] X Yang and Y Tian ldquoSuper normal vector for activity recog-nition using depth sequencesrdquo in Proceedings of the 27th IEEEConference on Computer Vision and Pattern Recognition (CVPRrsquo14) pp 804ndash811 IEEE Columbus Ohio USA June 2014

[11] R Slama H Wannous and M Daoudi ldquoGrassmannian rep-resentation of motion depth for 3D human gesture and actionrecognitionrdquo inProceedings of the 22nd International Conferenceon Pattern Recognition (ICPR rsquo14) pp 3499ndash3504 IEEE Stock-holm Sweden August 2014

[12] O Oreifej and Z Liu ldquoHON4D histogram of oriented 4Dnormals for activity recognition from depth sequencesrdquo inProceedings of the 26th IEEEConference onComputer Vision andPattern Recognition (CVPR rsquo13) pp 716ndash723 IEEE June 2013

[13] H Rahmani AMahmood D QHuynh andAMian ldquoHOPChistogram of oriented principal components of 3D pointcloudsfor action recognitionrdquo in Computer VisionmdashECCV 2014 DFleet T Pajdla B Schiele and T Tuytelaars Eds vol 8690 ofLecture Notes in Computer Science pp 742ndash757 Springer 2014

[14] G Chen D Clarke M Giuliani A Gaschler and A KnollldquoCombining unsupervised learning and discrimination for 3Daction recognitionrdquo Signal Processing vol 110 pp 67ndash81 2015

[15] A A Chaaraoui J R Padilla-Lopez and F Florez-RevueltaldquoFusion of skeletal and silhouette-based features for humanaction recognition with RGB-D devicesrdquo in Proceedings ofthe 14th IEEE International Conference on Computer VisionWorkshops (ICCVW rsquo13) pp 91ndash97 IEEE Sydney AustraliaDecember 2013

[16] S Althloothi M H Mahoor X Zhang and R M VoylesldquoHuman activity recognition using multi-features and multiplekernel learningrdquo Pattern Recognition vol 47 no 5 pp 1800ndash1812 2014

[17] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal featuresand joints for 3D action recognitionrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo13) pp 486ndash491 IEEE June 2013

[18] G Willems T Tuytelaars and L Van Gool ldquoAn efficient denseand scale-invariant spatio-temporal interest point detectorrdquo inComputer VisionmdashECCV 2008 D Forsyth P Torr and AZisserman Eds vol 5303 of Lecture Notes in Computer Sciencepp 650ndash663 Springer Berlin Germany 2008

[19] A Klaser M Marszałek and C Schmid ldquoA spatio-temporaldescriptor based on 3D-gradientsrdquo in Proceedings of the 19thBritish Machine Vision Conference (BMVC rsquo08) pp 995ndash1004September 2008

[20] J LuoWWang and H Qi ldquoSpatio-temporal feature extractionand representation for RGB-D human action recognitionrdquoPattern Recognition Letters vol 50 pp 139ndash148 2014

[21] Y Zhu W Chen and G Guo ldquoEvaluating spatiotemporalinterest point features for depth-based action recognitionrdquoImage and Vision Computing vol 32 no 8 pp 453ndash464 2014

[22] A-A Liu W-Z Nie Y-T Su L Ma T Hao and Z-X YangldquoCoupled hidden conditional random fields for RGB-D humanaction recognitionrdquo Signal Processing vol 112 pp 74ndash82 2015

[23] M Devanne H Wannous S Berretti P Pala M Daoudiand A Del Bimbo ldquoSpace-time pose representation for 3Dhuman action recognitionrdquo inNewTrends in ImageAnalysis andProcessingmdashICIAP 2013 vol 8158 of Lecture Notes in ComputerScience pp 456ndash464 Springer Berlin Germany 2013

[24] L Gan and F Chen ldquoHuman action recognition using APJ3Dand random forestsrdquo Journal of Software vol 8 no 9 pp 2238ndash2245 2013

[25] J Wang Z Liu Y Wu and J Yuan ldquoMining actionlet ensemblefor action recognition with depth camerasrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo12) pp 1290ndash1297 Providence RI USA June 2012

[26] L Xia C-C Chen and J K Aggarwal ldquoView invariant humanaction recognition using histograms of 3D jointsrdquo inProceedingsof the IEEE Computer Society Conference on Computer Visionand Pattern Recognition Workshops (CVPRW rsquo12) pp 20ndash27Providence RI June 2012

[27] X Yang and Y Tian ldquoEffective 3D action recognition usingEigen Jointsrdquo Journal of Visual Communication and ImageRepresentation vol 25 no 1 pp 2ndash11 2014

[28] A Taha H H Zayed M E Khalifa and E-S M El-HorbatyldquoHuman activity recognition for surveillance applicationsrdquo inProceedings of the 7th International Conference on InformationTechnology pp 577ndash586 2015

[29] S Gaglio G Lo Re and M Morana ldquoHuman activity recog-nition process using 3-D posture datardquo IEEE Transactions onHuman-Machine Systems vol 45 no 5 pp 586ndash597 2015

[30] I Theodorakopoulos D Kastaniotis G Economou and SFotopoulos ldquoPose-based human action recognition via sparserepresentation in dissimilarity spacerdquo Journal of Visual Commu-nication and Image Representation vol 25 no 1 pp 12ndash23 2014

[31] WDing K Liu F Cheng and J Zhang ldquoSTFC spatio-temporalfeature chain for skeleton-based human action recognitionrdquoJournal of Visual Communication and Image Representation vol26 pp 329ndash337 2015

[32] R Slama HWannous M Daoudi and A Srivastava ldquoAccurate3D action recognition using learning on the Grassmann mani-foldrdquo Pattern Recognition vol 48 no 2 pp 556ndash567 2015

[33] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 27th IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR rsquo14) pp 588ndash595Columbus Ohio USA June 2014

[34] R Anirudh P Turaga J Su and A Srivastava ldquoElastic func-tional coding of human actions From vector-fields to latentvariablesrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR rsquo15) pp 3147ndash3155Boston Mass USA June 2015

[35] M Jiang J Kong G Bebis and H Huo ldquoInformative jointsbased human action recognition using skeleton contextsrdquo SignalProcessing Image Communication vol 33 pp 29ndash40 2015

[36] A A Chaaraoui J R Padilla-Lopez P Climent-Perez andF Florez-Revuelta ldquoEvolutionary joint selection to improvehuman action recognitionwith RGB-Ddevicesrdquo Expert Systemswith Applications vol 41 no 3 pp 786ndash794 2014

[37] S Baysal M C Kurt and P Duygulu ldquoRecognizing humanactions using key posesrdquo in Proceedings of the 20th InternationalConference on Pattern Recognition (ICPR rsquo10) pp 1727ndash1730Istanbul Turkey August 2010

[38] A A Chaaraoui P Climent-Perez and F Florez-RevueltaldquoSilhouette-based human action recognition using sequences of

14 Computational Intelligence and Neuroscience

key posesrdquo Pattern Recognition Letters vol 34 no 15 pp 1799ndash1807 2013

[39] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[40] C-W Hsu and C-J Lin ldquoA comparison of methods for mul-ticlass support vector machinesrdquo IEEE Transactions on NeuralNetworks vol 13 no 2 pp 415ndash425 2002

[41] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[42] OpenNI 2015 httpstructureioopenni[43] J Sung C Ponce B Selman and A Saxena ldquoUnstructured

human activity detection fromRGBD imagesrdquo in Proceedings ofthe IEEE International Conference on Robotics and Automation(ICRA rsquo12) pp 842ndash849 Saint Paul Minn USA May 2012

[44] L Seidenari V Varano S Berretti A Del Bimbo and P PalaldquoRecognizing actions from depth cameras as weakly alignedmulti-part Bag-of-Posesrdquo in Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition Workshops(CVPRW rsquo13) pp 479ndash485 Portland Ore USA June 2013

[45] W Li Z Zhang and Z Liu ldquoAction recognition based ona bag of 3D pointsrdquo in Proceedings of the IEEE ComputerSociety Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo10) pp 9ndash14 San Francisco Calif USAJune 2010

[46] J Shan and S Akella ldquo3D human action segmentation andrecognition using pose kinetic energyrdquo in Proceedings of theIEEE Workshop on Advanced Robotics and its Social Impacts(ARSO rsquo14) pp 69ndash75 Evanston Ill USA September 2014

[47] D R Faria C Premebida and U Nunes ldquoA probabilisticapproach for human everyday activities recognition using bodymotion from RGB-D imagesrdquo in Proceedings of the 23rd IEEEInternational Symposium on Robot and Human InteractiveCommunication (RO-MAN rsquo14) pp 732ndash737 Edinburgh UKAugust 2014

[48] G I Parisi C Weber and S Wermter ldquoSelf-organizing neuralintegration of pose-motion features for human action recogni-tionrdquo Frontiers in Neurorobotics vol 9 no 3 2015

[49] J R Padilla-Lopez A A Chaaraoui and F Florez-Revuelta ldquoAdiscussion on the validation tests employed to compare humanaction recognition methods using the MSR Action3D datasetrdquohttparxivorgabs14077390

[50] S Azary and A Savakis ldquo3D Action classification using sparsespatio-temporal feature representationsrdquo in Advances in VisualComputing G Bebis R Boyle B Parvin et al Eds vol 7432 ofLecture Notes in Computer Science pp 166ndash175 Springer BerlinGermany 2012

[51] O Celiktutan C Wolf B Sankur and E Lombardi ldquoFast exacthyper-graph matching with dynamic programming for spatio-temporal datardquo Journal ofMathematical Imaging andVision vol51 no 1 pp 1ndash21 2015

[52] S Azary and A Savakis ldquoGrassmannian sparse representationsand motion depth surfaces for 3D action recognitionrdquo inProceedings of the IEEE Conference on Computer Vision andPattern Recognition Workshops (CVPRW rsquo13) pp 492ndash499Portland Ore USA June 2013

[53] J C Platt ldquoProbabilistic outputs for support vector machinesand comparisons to regularized likelihood methodsrdquo inAdvances in Large Margin Classifiers pp 61ndash74 MIT PressCambridge Mass USA 1999

[54] H-T Lin C-J Lin and R C Weng ldquoA note on Plattrsquosprobabilistic outputs for support vector machinesrdquo MachineLearning vol 68 no 3 pp 267ndash276 2007

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 9: Research Article A Human Activity Recognition System Using Skeleton …downloads.hindawi.com/journals/cin/2016/4351435.pdf · 2019-07-30 · Kinect skeleton is motivated by the fact

Computational Intelligence and Neuroscience 9

093 002 005

094 006

099 001

100

002 098W

alk

Sit d

own

Stan

d up

Pick

up

Carr

y

Walk

Sit down

Stand up

Pick up

Carry

(a)

088 004 008

095 005

100

003 097

010 090

Wal

k

Sit d

own

Stan

d up

Pick

up

Carr

y

Walk

Sit down

Stand up

Pick up

Carry

(b)

Figure 4 Confusion matrices of the UTKinect dataset with only AAL related activities (a) Best accuracy confusion matrix obtained with119875 = 7 (b) Worst accuracy confusion matrix obtained with 119875 = 15

Table 7 Global accuracy () of the proposed algorithm forUTKinect dataset and LOOCV setting with different subsets ofjoints compared to other works

Algorithm AccuracyXia et al [26] 909

Theodorakopoulos et al [30] 9095

Ding et al [31] 915

Zhu et al [17] 919

Jiang et al [35] 919

Gan and Chen [24] 920

Liu et al [22] 920

Proposed (119875 = 15) 931

Proposed (119875 = 11) 942

Proposed (119875 = 19) 943

Anirudh et al [34] 949

Proposed (119875 = 7) 951

Vemulapalli et al [33] 971

only 06 with 119870 = 5 that provided the worst result Theselection of different sets of joints from 119875 = 7 to 119875 = 15changes the accuracy only by a 2 from 931 to 951In this dataset the main limitation to the performance isgiven by the reduced number of frames that constitute somesequences and limits the number of clusters representing theactions Vemulapalli et al [33] reach the highest accuracybut the approach is much more complex after modelingskeleton joints in a Lie group the processing scheme includesDynamic TimeWarping to perform temporal alignments anda specific representation called Fourier Temporal Pyramidbefore classification with one-versus-all multiclass SVM

Contextualizing the UTKinect dataset to AAL involvesthe consideration of a subset of activities which includes onlyactions and discards gestures In more detail the following 5activities have been selectedwalk sit down stand up pick upand carry In this condition the highest accuracy (967) isstill given by119875 = 7 with 3 clusters and the lower one (941)is represented by 119875 = 15 again with 3 clustersThe confusion

Table 8 Global accuracy () of the proposed algorithm forFlorence3D dataset and ldquonew-personrdquo setting with different subsetsof joints compared to other works

Algorithm AccuracySeidenari et al [44] 820

Proposed (119875 = 7) 821

Proposed (119875 = 15) 847

Proposed (119875 = 11) 861

Anirudh et al [34] 897

Vemulapalli et al [33] 909

Taha et al [28] 962

matrix for these two configurations are shown in Figure 4where the main difference is the reduced misclassificationbetween the activities walk and carry that are very similarto each other

424 Florence3D Dataset The Florence3D dataset is com-posed of 9 activities and 10 people performing themmultipletimes resulting in a total number of 215 activities Theproposed setting to evaluate this dataset is the leave-one-actor-out which is equivalent to the ldquonew-personrdquo settingpreviously described The minimum number of frames is 8so the following sequence of clusters has been considered119870 = [3 4 5 6 7 8] A number of 15 joints are available forthe skeleton so only the first three schemes are included inthe tests

Table 8 shows the results obtained with different subsetsof joints where the best result for the proposed algorithmis given by 119875 = 11 with 6 clusters The choice of clustersdoes not significantly affect the performance because themaximum reduction is 2 In this dataset the proposedapproach does not achieve state-of-the-art accuracy (962)given by Taha et al [28] However all the algorithmsthat overcome the proposed one exhibit greater complexitybecause they consider Lie group representations [33 34] orseveral machine learning algorithms such as SVM combined

10 Computational Intelligence and Neuroscience

(a)

(b)

(c)

Figure 5 Sequences of frames representing the drink from a bottle (a) answer phone (b) and read watch (c) activities from Florence3Ddataset

081 019

014 086

100

005 005 090

005 095

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(a)

076 019 005

032 064 005

004 092 004

005 010

005010005

085

080

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(b)

Figure 6 Confusion matrices of the Florence3D dataset with only AAL related activities (a) Best accuracy confusion matrix obtained with119875 = 11 (b) Worst accuracy confusion matrix obtained with 119875 = 7

to HMM to recognize activities composed of atomic actions[28]

If different subsets of joints are considered the perfor-mance decreases In particular considering only 7 joints and3 clusters it is possible to reach a maximum accuracy of821 which is very similar to the one obtained by Seidenariet al [44] who collected the dataset By including all theavailable joints (15) in the algorithm processing it is possibleto achieve an accuracy which varies between 840 and847

The Florence3D dataset is challenging because of tworeasons

(i) High interclass similarity some actions are verysimilar to each other for example drink from abottle answer phone and read watch As can beseen in Figure 5 all of them consist of an uprising

arm movement to the mouth hear or head Thisfact affects the performance because it is difficult toclassify actions consisting of very similar skeletonmovements

(ii) High intraclass variability the same action is per-formed in different ways by the same subject forexample using left right or both hands indifferently

Limiting the analysis to AAL related activities only thefollowing ones can be selected drink answer phone tight lacesit down and stand up The algorithm reaches the highestaccuracy (908) using the ldquonew-personrdquo setting with 119875 =11 joints and 119870 = 3 The confusion matrix obtained in thiscondition is shown in Figure 6(a) On the other hand theworst accuracy is 798 and it is obtained with 119875 = 7 and119870 = 4 In Figure 6(b) the confusion matrix for this scenario

Computational Intelligence and Neuroscience 11

081 008

004 056 022 011007

012 077 012

012

004 004 081 012

100

081 004 015

013 087

004 007019 070

a02

a03

a05

a06

a10

a13

a18

a20

a02

a03

a05

a06

a10

a13

a18

a20

(a)

056 015 007 015 007

007

007 017 013

016 044 008 004 004 024

011 056 004

003

022

010 070 017

063

100

010 087 003

007 093

a01

a04

a07

a08

a09

a11

a12

a14

a01

a04

a07

a08

a09

a11

a12

a14

(b)

088 012

100

005 095

097 003

003 097

093 007

003 003 083 010

004 011 085

a06

a14

a15

a16

a17

a18

a19

a20

a06

a14

a15

a16

a17

a18

a19

a20

(c)

Figure 7 Confusion matrices of the MSR Action3D dataset obtained with 119875 = 7 and the ldquonew-personrdquo test (a) AS1 (b) AS2 and (c) AS3

is shown In both situations the main problem is given by thesimilarity of drink and answer phone activities

425 MSR Action3D Dataset The MSR Action3D datasetis composed of 20 activities which are mainly gestures Itsevaluation is included for the sake of completeness in thecomparison with other activity recognition algorithms butthe dataset does not contain AAL related activities Thereis a big confusion in the literature about the validationtests to be used for MSR Action3D Padilla-Lopez et al[49] summarized all the validation methods for this datasetand recommend using all the possible combinations of 5-5 subjects splitting or using LOAO The former procedureconsists of considering 252 combinations of 5 subjects fortraining and the remaining 5 for testing while the latter is theleave-one-actor-out equivalent to the ldquonew-personrdquo schemepreviously introduced This dataset is quite challenging dueto the presence of similar and complex gestures hence theevaluation is performed by considering three subsets of 8gestures each (AS1 AS2 and AS3 in Table 9) as suggestedin [45] Since the minimum number of frames constitutingone sequence is 13 the following set of clusters has beenconsidered 119870 = [3 5 8 10 13] All the combinations of 711 15 and 20 joints shown in Figure 2 are included in theexperimental tests

Considering the ldquonew-personrdquo scheme the proposedalgorithm tested separately on the three subsets of MSRAction3D reaches an average accuracy of 812 with 119875 = 7and 119870 = 10 The confusion matrices for the three subsetsare shown in Figure 7 where it is possible to notice thatthe algorithm struggles in the recognition of the AS2 subset(Figure 7(b)) mainly represented by drawing gestures Betterresults are obtained processing the subset AS3 (Figure 7(c))where lower recognition rates are shown for complex ges-tures golf swing and pickup and throw Table 10 shows theresults obtained including also other joints which are slightlyworse A comparison with previous works validated usingthe ldquonew-personrdquo test is shown in Table 11 The proposedalgorithm achieves results comparable with [50] whichexploits an approach based on skeleton data Chaaraoui etal [15 36] exploits more complex algorithms considering

Table 9 Three subsets of gestures fromMSR Action3D dataset

AS1 AS2 AS3[11988602] Horizontalarm wave [11988601] High arm wave [11988606] High throw

[11988603] Hammer [11988604] Hand catch [11988614] Forward kick[11988605] Forwardpunch [11988607] Draw X [11988615] Side kick

[11988606] High throw [11988608] Draw tick [11988616] Jogging[11988610] Hand clap [11988609] Draw circle [11988617] Tennis swing[11988613] Bend [11988611] Two-hand wave [11988618] Tennis serve[11988618] Tennis serve [11988612] Side boxing [11988619] Golf swing[11988620] Pickup andthrow [11988614] Forward kick [11988620] Pickup and throw

Table 10 Accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting with different subsetsof joints and different subsets of activities

Algorithm AS1 AS2 AS3 AvgProposed (119875 = 15) 785 688 928 800

Proposed (119875 = 20) 790 702 919 804

Proposed (119875 = 11) 776 737 914 809

Proposed (119875 = 7) 795 719 923 812

the fusion of skeleton and depth data or the evolutionaryselection of the best set of joints

5 Discussion

The proposed algorithm despite its simplicity is able toachieve and sometimes to overcome state-of-the-art per-formance when applied to publicly available datasets Inparticular it is able to outperform some complex algorithmsexploiting more than one classifier in KARD and CAD-60datasets

Limiting the analysis to AAL related activities only thealgorithm achieves interesting results in all the datasetsThe group of 8 activities labeled as Actions in the KARD

12 Computational Intelligence and Neuroscience

Table 11 Average accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting compared to otherworks

Algorithm AccuracyCeliktutan et al (2015) [51] 729

Azary and Savakis (2013) [52] 785

Proposed (119875 = 7) 812

Azary and Savakis (2012) [50] 839

Chaaraoui et al (2013) [15] 906

Chaaraoui et al (2014) [36] 935

dataset is recognized with an accuracy greater than 987in the considered experiments and the group includes twosimilar activities such as phone call and drink The CAD-60dataset contains only actions so all the activities consideredwithin the proper location have been included in the eval-uation resulting in global precision and recall of 939 and935 respectively In the UTKinect dataset the performanceimproves considering only the AAL activities and the samehappens also in the Florence3D having the highest accuracyclose to 91 even if some actions are very similar

The MSR Action3D is a challenging dataset mainlycomprising gestures for human-computer interaction andnot actions Many gestures especially the ones included inAS2 subset are very similar to each other In order to improvethe recognition accuracy more complex features should beconsidered including not only jointsrsquo relative positions butalso their velocity Many sequences contain noisy skeletondata another approach to improving recognition accuracycould be the development of a method to discard noisyskeleton or to include depth-based features

However the AAL scenario raises some problems whichhave to be addressed First of all the algorithm exploits amul-ticlass SVM which is a good classifier but it does not makeeasy to understand if an activity belongs or not to any of thetraining classes In fact in a real scenario it is possible to havea sequence of frames that does not represent any activity ofthe training set in this case the SVM outputs the most likelyclass anyway even if it does not make sense Other machinelearning algorithms such as HMMs distinguish amongmultiple classes using the maximum posterior probabilityGaglio et al [29] proposed using a threshold on the outputprobability to detect unknown actions Usually SVMs do notprovide output probabilities but there exist some methodsto extend SVM implementations and make them able toprovide also this information [53 54] These techniquescan be investigated to understand their applicability in theproposed scenario

Another problem is segmentation of actions In manydatasets the actions are represented as segmented sequencesof frames but in real applications the algorithm has to handlea continuous stream of frames and to segment actions byitself Some solutions for segmentation have been proposedbut most of them are based on thresholds on movementswhich can be highly data-dependent [46] Also this aspect hasto be further investigated to have a systemwhich is effectivelyapplicable in a real AAL scenario

6 Conclusions

In this work a simple yet effective activity recognitionalgorithm has been proposed It is based on skeleton dataextracted from an RGBD sensor and it creates a featurevector representing the whole activity It is able to overcomestate-of-the-art results in two publicly available datasets theKARD and CAD-60 outperforming more complex algo-rithms in many conditions The algorithm has been testedalso over other more challenging datasets the UTKinect andFlorence3D where it is outperformed only by algorithmsexploiting temporal alignment techniques or a combinationof several machine learning methods The MSR Action3Dis a complex dataset and the proposed algorithm strugglesin the recognition of subsets constituted by similar gesturesHowever this dataset does not contain any activity of interestfor AAL scenarios

Future works will concern the application of the activityrecognition algorithm to AAL scenarios by consideringaction segmentation and the detection of unknown activities

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to acknowledge the contributionof the COST Action IC1303 AAPELE (Architectures Algo-rithms and Platforms for Enhanced Living Environments)

References

[1] P Rashidi and A Mihailidis ldquoA survey on ambient-assistedliving tools for older adultsrdquo IEEE Journal of Biomedical andHealth Informatics vol 17 no 3 pp 579ndash590 2013

[2] O C Ann and L B Theng ldquoHuman activity recognition areviewrdquo in Proceedings of the IEEE International Conference onControl System Computing and Engineering (ICCSCE rsquo14) pp389ndash393 IEEE Batu Ferringhi Malaysia November 2014

[3] S Wang and G Zhou ldquoA review on radio based activityrecognitionrdquo Digital Communications and Networks vol 1 no1 pp 20ndash29 2015

[4] L Atallah B Lo R Ali R King and G-Z Yang ldquoReal-timeactivity classificationusing ambient andwearable sensorsrdquo IEEETransactions on Information Technology in Biomedicine vol 13no 6 pp 1031ndash1039 2009

[5] D De P Bharti S K Das and S Chellappan ldquoMultimodalwearable sensing for fine-grained activity recognition in health-carerdquo IEEE Internet Computing vol 19 no 5 pp 26ndash35 2015

[6] J K Aggarwal and L Xia ldquoHuman activity recognition from 3Ddata a reviewrdquo Pattern Recognition Letters vol 48 pp 70ndash802014

[7] S Gasparrini E Cippitelli E Gambi S Spinsante and FFlorez-Revuelta ldquoPerformance analysis of self-organising neu-ral networks tracking algorithms for intake monitoring usingkinectrdquo in Proceedings of the 1st IET International Conferenceon Technologies for Active and Assisted Living (TechAAL rsquo15)Kingston UK November 2015

Computational Intelligence and Neuroscience 13

[8] J Shotton A Fitzgibbon M Cook et al ldquoReal-time humanpose recognition in parts from single depth imagesrdquo in Pro-ceedings of the IEEE Conference on Computer Vision and PatternRecognition (CVPR rsquo11) pp 1297ndash1304 IEEE Providence RdUSA June 2011

[9] J R Padilla-Lopez A A Chaaraoui F Gu and F Florez-Revuelta ldquoVisual privacy by context proposal and evaluationof a level-based visualisation schemerdquo Sensors vol 15 no 6 pp12959ndash12982 2015

[10] X Yang and Y Tian ldquoSuper normal vector for activity recog-nition using depth sequencesrdquo in Proceedings of the 27th IEEEConference on Computer Vision and Pattern Recognition (CVPRrsquo14) pp 804ndash811 IEEE Columbus Ohio USA June 2014

[11] R Slama H Wannous and M Daoudi ldquoGrassmannian rep-resentation of motion depth for 3D human gesture and actionrecognitionrdquo inProceedings of the 22nd International Conferenceon Pattern Recognition (ICPR rsquo14) pp 3499ndash3504 IEEE Stock-holm Sweden August 2014

[12] O Oreifej and Z Liu ldquoHON4D histogram of oriented 4Dnormals for activity recognition from depth sequencesrdquo inProceedings of the 26th IEEEConference onComputer Vision andPattern Recognition (CVPR rsquo13) pp 716ndash723 IEEE June 2013

[13] H Rahmani AMahmood D QHuynh andAMian ldquoHOPChistogram of oriented principal components of 3D pointcloudsfor action recognitionrdquo in Computer VisionmdashECCV 2014 DFleet T Pajdla B Schiele and T Tuytelaars Eds vol 8690 ofLecture Notes in Computer Science pp 742ndash757 Springer 2014

[14] G Chen D Clarke M Giuliani A Gaschler and A KnollldquoCombining unsupervised learning and discrimination for 3Daction recognitionrdquo Signal Processing vol 110 pp 67ndash81 2015

[15] A A Chaaraoui J R Padilla-Lopez and F Florez-RevueltaldquoFusion of skeletal and silhouette-based features for humanaction recognition with RGB-D devicesrdquo in Proceedings ofthe 14th IEEE International Conference on Computer VisionWorkshops (ICCVW rsquo13) pp 91ndash97 IEEE Sydney AustraliaDecember 2013

[16] S Althloothi M H Mahoor X Zhang and R M VoylesldquoHuman activity recognition using multi-features and multiplekernel learningrdquo Pattern Recognition vol 47 no 5 pp 1800ndash1812 2014

[17] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal featuresand joints for 3D action recognitionrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo13) pp 486ndash491 IEEE June 2013

[18] G Willems T Tuytelaars and L Van Gool ldquoAn efficient denseand scale-invariant spatio-temporal interest point detectorrdquo inComputer VisionmdashECCV 2008 D Forsyth P Torr and AZisserman Eds vol 5303 of Lecture Notes in Computer Sciencepp 650ndash663 Springer Berlin Germany 2008

[19] A Klaser M Marszałek and C Schmid ldquoA spatio-temporaldescriptor based on 3D-gradientsrdquo in Proceedings of the 19thBritish Machine Vision Conference (BMVC rsquo08) pp 995ndash1004September 2008

[20] J LuoWWang and H Qi ldquoSpatio-temporal feature extractionand representation for RGB-D human action recognitionrdquoPattern Recognition Letters vol 50 pp 139ndash148 2014

[21] Y Zhu W Chen and G Guo ldquoEvaluating spatiotemporalinterest point features for depth-based action recognitionrdquoImage and Vision Computing vol 32 no 8 pp 453ndash464 2014

[22] A-A Liu W-Z Nie Y-T Su L Ma T Hao and Z-X YangldquoCoupled hidden conditional random fields for RGB-D humanaction recognitionrdquo Signal Processing vol 112 pp 74ndash82 2015

[23] M Devanne H Wannous S Berretti P Pala M Daoudiand A Del Bimbo ldquoSpace-time pose representation for 3Dhuman action recognitionrdquo inNewTrends in ImageAnalysis andProcessingmdashICIAP 2013 vol 8158 of Lecture Notes in ComputerScience pp 456ndash464 Springer Berlin Germany 2013

[24] L Gan and F Chen ldquoHuman action recognition using APJ3Dand random forestsrdquo Journal of Software vol 8 no 9 pp 2238ndash2245 2013

[25] J Wang Z Liu Y Wu and J Yuan ldquoMining actionlet ensemblefor action recognition with depth camerasrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo12) pp 1290ndash1297 Providence RI USA June 2012

[26] L Xia C-C Chen and J K Aggarwal ldquoView invariant humanaction recognition using histograms of 3D jointsrdquo inProceedingsof the IEEE Computer Society Conference on Computer Visionand Pattern Recognition Workshops (CVPRW rsquo12) pp 20ndash27Providence RI June 2012

[27] X Yang and Y Tian ldquoEffective 3D action recognition usingEigen Jointsrdquo Journal of Visual Communication and ImageRepresentation vol 25 no 1 pp 2ndash11 2014

[28] A Taha H H Zayed M E Khalifa and E-S M El-HorbatyldquoHuman activity recognition for surveillance applicationsrdquo inProceedings of the 7th International Conference on InformationTechnology pp 577ndash586 2015

[29] S Gaglio G Lo Re and M Morana ldquoHuman activity recog-nition process using 3-D posture datardquo IEEE Transactions onHuman-Machine Systems vol 45 no 5 pp 586ndash597 2015

[30] I Theodorakopoulos D Kastaniotis G Economou and SFotopoulos ldquoPose-based human action recognition via sparserepresentation in dissimilarity spacerdquo Journal of Visual Commu-nication and Image Representation vol 25 no 1 pp 12ndash23 2014

[31] WDing K Liu F Cheng and J Zhang ldquoSTFC spatio-temporalfeature chain for skeleton-based human action recognitionrdquoJournal of Visual Communication and Image Representation vol26 pp 329ndash337 2015

[32] R Slama HWannous M Daoudi and A Srivastava ldquoAccurate3D action recognition using learning on the Grassmann mani-foldrdquo Pattern Recognition vol 48 no 2 pp 556ndash567 2015

[33] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 27th IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR rsquo14) pp 588ndash595Columbus Ohio USA June 2014

[34] R Anirudh P Turaga J Su and A Srivastava ldquoElastic func-tional coding of human actions From vector-fields to latentvariablesrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR rsquo15) pp 3147ndash3155Boston Mass USA June 2015

[35] M Jiang J Kong G Bebis and H Huo ldquoInformative jointsbased human action recognition using skeleton contextsrdquo SignalProcessing Image Communication vol 33 pp 29ndash40 2015

[36] A A Chaaraoui J R Padilla-Lopez P Climent-Perez andF Florez-Revuelta ldquoEvolutionary joint selection to improvehuman action recognitionwith RGB-Ddevicesrdquo Expert Systemswith Applications vol 41 no 3 pp 786ndash794 2014

[37] S Baysal M C Kurt and P Duygulu ldquoRecognizing humanactions using key posesrdquo in Proceedings of the 20th InternationalConference on Pattern Recognition (ICPR rsquo10) pp 1727ndash1730Istanbul Turkey August 2010

[38] A A Chaaraoui P Climent-Perez and F Florez-RevueltaldquoSilhouette-based human action recognition using sequences of

14 Computational Intelligence and Neuroscience

key posesrdquo Pattern Recognition Letters vol 34 no 15 pp 1799ndash1807 2013

[39] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[40] C-W Hsu and C-J Lin ldquoA comparison of methods for mul-ticlass support vector machinesrdquo IEEE Transactions on NeuralNetworks vol 13 no 2 pp 415ndash425 2002

[41] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[42] OpenNI 2015 httpstructureioopenni[43] J Sung C Ponce B Selman and A Saxena ldquoUnstructured

human activity detection fromRGBD imagesrdquo in Proceedings ofthe IEEE International Conference on Robotics and Automation(ICRA rsquo12) pp 842ndash849 Saint Paul Minn USA May 2012

[44] L Seidenari V Varano S Berretti A Del Bimbo and P PalaldquoRecognizing actions from depth cameras as weakly alignedmulti-part Bag-of-Posesrdquo in Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition Workshops(CVPRW rsquo13) pp 479ndash485 Portland Ore USA June 2013

[45] W Li Z Zhang and Z Liu ldquoAction recognition based ona bag of 3D pointsrdquo in Proceedings of the IEEE ComputerSociety Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo10) pp 9ndash14 San Francisco Calif USAJune 2010

[46] J Shan and S Akella ldquo3D human action segmentation andrecognition using pose kinetic energyrdquo in Proceedings of theIEEE Workshop on Advanced Robotics and its Social Impacts(ARSO rsquo14) pp 69ndash75 Evanston Ill USA September 2014

[47] D R Faria C Premebida and U Nunes ldquoA probabilisticapproach for human everyday activities recognition using bodymotion from RGB-D imagesrdquo in Proceedings of the 23rd IEEEInternational Symposium on Robot and Human InteractiveCommunication (RO-MAN rsquo14) pp 732ndash737 Edinburgh UKAugust 2014

[48] G I Parisi C Weber and S Wermter ldquoSelf-organizing neuralintegration of pose-motion features for human action recogni-tionrdquo Frontiers in Neurorobotics vol 9 no 3 2015

[49] J R Padilla-Lopez A A Chaaraoui and F Florez-Revuelta ldquoAdiscussion on the validation tests employed to compare humanaction recognition methods using the MSR Action3D datasetrdquohttparxivorgabs14077390

[50] S Azary and A Savakis ldquo3D Action classification using sparsespatio-temporal feature representationsrdquo in Advances in VisualComputing G Bebis R Boyle B Parvin et al Eds vol 7432 ofLecture Notes in Computer Science pp 166ndash175 Springer BerlinGermany 2012

[51] O Celiktutan C Wolf B Sankur and E Lombardi ldquoFast exacthyper-graph matching with dynamic programming for spatio-temporal datardquo Journal ofMathematical Imaging andVision vol51 no 1 pp 1ndash21 2015

[52] S Azary and A Savakis ldquoGrassmannian sparse representationsand motion depth surfaces for 3D action recognitionrdquo inProceedings of the IEEE Conference on Computer Vision andPattern Recognition Workshops (CVPRW rsquo13) pp 492ndash499Portland Ore USA June 2013

[53] J C Platt ldquoProbabilistic outputs for support vector machinesand comparisons to regularized likelihood methodsrdquo inAdvances in Large Margin Classifiers pp 61ndash74 MIT PressCambridge Mass USA 1999

[54] H-T Lin C-J Lin and R C Weng ldquoA note on Plattrsquosprobabilistic outputs for support vector machinesrdquo MachineLearning vol 68 no 3 pp 267ndash276 2007

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 10: Research Article A Human Activity Recognition System Using Skeleton …downloads.hindawi.com/journals/cin/2016/4351435.pdf · 2019-07-30 · Kinect skeleton is motivated by the fact

10 Computational Intelligence and Neuroscience

(a)

(b)

(c)

Figure 5 Sequences of frames representing the drink from a bottle (a) answer phone (b) and read watch (c) activities from Florence3Ddataset

081 019

014 086

100

005 005 090

005 095

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(a)

076 019 005

032 064 005

004 092 004

005 010

005010005

085

080

Drin

k

Answ

er p

hone

Tigh

t lac

e

Sit d

own

Stan

d up

Drink

Answer phone

Tight lace

Sit down

Stand up

(b)

Figure 6 Confusion matrices of the Florence3D dataset with only AAL related activities (a) Best accuracy confusion matrix obtained with119875 = 11 (b) Worst accuracy confusion matrix obtained with 119875 = 7

to HMM to recognize activities composed of atomic actions[28]

If different subsets of joints are considered the perfor-mance decreases In particular considering only 7 joints and3 clusters it is possible to reach a maximum accuracy of821 which is very similar to the one obtained by Seidenariet al [44] who collected the dataset By including all theavailable joints (15) in the algorithm processing it is possibleto achieve an accuracy which varies between 840 and847

The Florence3D dataset is challenging because of tworeasons

(i) High interclass similarity some actions are verysimilar to each other for example drink from abottle answer phone and read watch As can beseen in Figure 5 all of them consist of an uprising

arm movement to the mouth hear or head Thisfact affects the performance because it is difficult toclassify actions consisting of very similar skeletonmovements

(ii) High intraclass variability the same action is per-formed in different ways by the same subject forexample using left right or both hands indifferently

Limiting the analysis to AAL related activities only thefollowing ones can be selected drink answer phone tight lacesit down and stand up The algorithm reaches the highestaccuracy (908) using the ldquonew-personrdquo setting with 119875 =11 joints and 119870 = 3 The confusion matrix obtained in thiscondition is shown in Figure 6(a) On the other hand theworst accuracy is 798 and it is obtained with 119875 = 7 and119870 = 4 In Figure 6(b) the confusion matrix for this scenario

Computational Intelligence and Neuroscience 11

081 008

004 056 022 011007

012 077 012

012

004 004 081 012

100

081 004 015

013 087

004 007019 070

a02

a03

a05

a06

a10

a13

a18

a20

a02

a03

a05

a06

a10

a13

a18

a20

(a)

056 015 007 015 007

007

007 017 013

016 044 008 004 004 024

011 056 004

003

022

010 070 017

063

100

010 087 003

007 093

a01

a04

a07

a08

a09

a11

a12

a14

a01

a04

a07

a08

a09

a11

a12

a14

(b)

088 012

100

005 095

097 003

003 097

093 007

003 003 083 010

004 011 085

a06

a14

a15

a16

a17

a18

a19

a20

a06

a14

a15

a16

a17

a18

a19

a20

(c)

Figure 7 Confusion matrices of the MSR Action3D dataset obtained with 119875 = 7 and the ldquonew-personrdquo test (a) AS1 (b) AS2 and (c) AS3

is shown In both situations the main problem is given by thesimilarity of drink and answer phone activities

425 MSR Action3D Dataset The MSR Action3D datasetis composed of 20 activities which are mainly gestures Itsevaluation is included for the sake of completeness in thecomparison with other activity recognition algorithms butthe dataset does not contain AAL related activities Thereis a big confusion in the literature about the validationtests to be used for MSR Action3D Padilla-Lopez et al[49] summarized all the validation methods for this datasetand recommend using all the possible combinations of 5-5 subjects splitting or using LOAO The former procedureconsists of considering 252 combinations of 5 subjects fortraining and the remaining 5 for testing while the latter is theleave-one-actor-out equivalent to the ldquonew-personrdquo schemepreviously introduced This dataset is quite challenging dueto the presence of similar and complex gestures hence theevaluation is performed by considering three subsets of 8gestures each (AS1 AS2 and AS3 in Table 9) as suggestedin [45] Since the minimum number of frames constitutingone sequence is 13 the following set of clusters has beenconsidered 119870 = [3 5 8 10 13] All the combinations of 711 15 and 20 joints shown in Figure 2 are included in theexperimental tests

Considering the ldquonew-personrdquo scheme the proposedalgorithm tested separately on the three subsets of MSRAction3D reaches an average accuracy of 812 with 119875 = 7and 119870 = 10 The confusion matrices for the three subsetsare shown in Figure 7 where it is possible to notice thatthe algorithm struggles in the recognition of the AS2 subset(Figure 7(b)) mainly represented by drawing gestures Betterresults are obtained processing the subset AS3 (Figure 7(c))where lower recognition rates are shown for complex ges-tures golf swing and pickup and throw Table 10 shows theresults obtained including also other joints which are slightlyworse A comparison with previous works validated usingthe ldquonew-personrdquo test is shown in Table 11 The proposedalgorithm achieves results comparable with [50] whichexploits an approach based on skeleton data Chaaraoui etal [15 36] exploits more complex algorithms considering

Table 9 Three subsets of gestures fromMSR Action3D dataset

AS1 AS2 AS3[11988602] Horizontalarm wave [11988601] High arm wave [11988606] High throw

[11988603] Hammer [11988604] Hand catch [11988614] Forward kick[11988605] Forwardpunch [11988607] Draw X [11988615] Side kick

[11988606] High throw [11988608] Draw tick [11988616] Jogging[11988610] Hand clap [11988609] Draw circle [11988617] Tennis swing[11988613] Bend [11988611] Two-hand wave [11988618] Tennis serve[11988618] Tennis serve [11988612] Side boxing [11988619] Golf swing[11988620] Pickup andthrow [11988614] Forward kick [11988620] Pickup and throw

Table 10 Accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting with different subsetsof joints and different subsets of activities

Algorithm AS1 AS2 AS3 AvgProposed (119875 = 15) 785 688 928 800

Proposed (119875 = 20) 790 702 919 804

Proposed (119875 = 11) 776 737 914 809

Proposed (119875 = 7) 795 719 923 812

the fusion of skeleton and depth data or the evolutionaryselection of the best set of joints

5 Discussion

The proposed algorithm despite its simplicity is able toachieve and sometimes to overcome state-of-the-art per-formance when applied to publicly available datasets Inparticular it is able to outperform some complex algorithmsexploiting more than one classifier in KARD and CAD-60datasets

Limiting the analysis to AAL related activities only thealgorithm achieves interesting results in all the datasetsThe group of 8 activities labeled as Actions in the KARD

12 Computational Intelligence and Neuroscience

Table 11 Average accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting compared to otherworks

Algorithm AccuracyCeliktutan et al (2015) [51] 729

Azary and Savakis (2013) [52] 785

Proposed (119875 = 7) 812

Azary and Savakis (2012) [50] 839

Chaaraoui et al (2013) [15] 906

Chaaraoui et al (2014) [36] 935

dataset is recognized with an accuracy greater than 987in the considered experiments and the group includes twosimilar activities such as phone call and drink The CAD-60dataset contains only actions so all the activities consideredwithin the proper location have been included in the eval-uation resulting in global precision and recall of 939 and935 respectively In the UTKinect dataset the performanceimproves considering only the AAL activities and the samehappens also in the Florence3D having the highest accuracyclose to 91 even if some actions are very similar

The MSR Action3D is a challenging dataset mainlycomprising gestures for human-computer interaction andnot actions Many gestures especially the ones included inAS2 subset are very similar to each other In order to improvethe recognition accuracy more complex features should beconsidered including not only jointsrsquo relative positions butalso their velocity Many sequences contain noisy skeletondata another approach to improving recognition accuracycould be the development of a method to discard noisyskeleton or to include depth-based features

However the AAL scenario raises some problems whichhave to be addressed First of all the algorithm exploits amul-ticlass SVM which is a good classifier but it does not makeeasy to understand if an activity belongs or not to any of thetraining classes In fact in a real scenario it is possible to havea sequence of frames that does not represent any activity ofthe training set in this case the SVM outputs the most likelyclass anyway even if it does not make sense Other machinelearning algorithms such as HMMs distinguish amongmultiple classes using the maximum posterior probabilityGaglio et al [29] proposed using a threshold on the outputprobability to detect unknown actions Usually SVMs do notprovide output probabilities but there exist some methodsto extend SVM implementations and make them able toprovide also this information [53 54] These techniquescan be investigated to understand their applicability in theproposed scenario

Another problem is segmentation of actions In manydatasets the actions are represented as segmented sequencesof frames but in real applications the algorithm has to handlea continuous stream of frames and to segment actions byitself Some solutions for segmentation have been proposedbut most of them are based on thresholds on movementswhich can be highly data-dependent [46] Also this aspect hasto be further investigated to have a systemwhich is effectivelyapplicable in a real AAL scenario

6 Conclusions

In this work a simple yet effective activity recognitionalgorithm has been proposed It is based on skeleton dataextracted from an RGBD sensor and it creates a featurevector representing the whole activity It is able to overcomestate-of-the-art results in two publicly available datasets theKARD and CAD-60 outperforming more complex algo-rithms in many conditions The algorithm has been testedalso over other more challenging datasets the UTKinect andFlorence3D where it is outperformed only by algorithmsexploiting temporal alignment techniques or a combinationof several machine learning methods The MSR Action3Dis a complex dataset and the proposed algorithm strugglesin the recognition of subsets constituted by similar gesturesHowever this dataset does not contain any activity of interestfor AAL scenarios

Future works will concern the application of the activityrecognition algorithm to AAL scenarios by consideringaction segmentation and the detection of unknown activities

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to acknowledge the contributionof the COST Action IC1303 AAPELE (Architectures Algo-rithms and Platforms for Enhanced Living Environments)

References

[1] P Rashidi and A Mihailidis ldquoA survey on ambient-assistedliving tools for older adultsrdquo IEEE Journal of Biomedical andHealth Informatics vol 17 no 3 pp 579ndash590 2013

[2] O C Ann and L B Theng ldquoHuman activity recognition areviewrdquo in Proceedings of the IEEE International Conference onControl System Computing and Engineering (ICCSCE rsquo14) pp389ndash393 IEEE Batu Ferringhi Malaysia November 2014

[3] S Wang and G Zhou ldquoA review on radio based activityrecognitionrdquo Digital Communications and Networks vol 1 no1 pp 20ndash29 2015

[4] L Atallah B Lo R Ali R King and G-Z Yang ldquoReal-timeactivity classificationusing ambient andwearable sensorsrdquo IEEETransactions on Information Technology in Biomedicine vol 13no 6 pp 1031ndash1039 2009

[5] D De P Bharti S K Das and S Chellappan ldquoMultimodalwearable sensing for fine-grained activity recognition in health-carerdquo IEEE Internet Computing vol 19 no 5 pp 26ndash35 2015

[6] J K Aggarwal and L Xia ldquoHuman activity recognition from 3Ddata a reviewrdquo Pattern Recognition Letters vol 48 pp 70ndash802014

[7] S Gasparrini E Cippitelli E Gambi S Spinsante and FFlorez-Revuelta ldquoPerformance analysis of self-organising neu-ral networks tracking algorithms for intake monitoring usingkinectrdquo in Proceedings of the 1st IET International Conferenceon Technologies for Active and Assisted Living (TechAAL rsquo15)Kingston UK November 2015

Computational Intelligence and Neuroscience 13

[8] J Shotton A Fitzgibbon M Cook et al ldquoReal-time humanpose recognition in parts from single depth imagesrdquo in Pro-ceedings of the IEEE Conference on Computer Vision and PatternRecognition (CVPR rsquo11) pp 1297ndash1304 IEEE Providence RdUSA June 2011

[9] J R Padilla-Lopez A A Chaaraoui F Gu and F Florez-Revuelta ldquoVisual privacy by context proposal and evaluationof a level-based visualisation schemerdquo Sensors vol 15 no 6 pp12959ndash12982 2015

[10] X Yang and Y Tian ldquoSuper normal vector for activity recog-nition using depth sequencesrdquo in Proceedings of the 27th IEEEConference on Computer Vision and Pattern Recognition (CVPRrsquo14) pp 804ndash811 IEEE Columbus Ohio USA June 2014

[11] R Slama H Wannous and M Daoudi ldquoGrassmannian rep-resentation of motion depth for 3D human gesture and actionrecognitionrdquo inProceedings of the 22nd International Conferenceon Pattern Recognition (ICPR rsquo14) pp 3499ndash3504 IEEE Stock-holm Sweden August 2014

[12] O Oreifej and Z Liu ldquoHON4D histogram of oriented 4Dnormals for activity recognition from depth sequencesrdquo inProceedings of the 26th IEEEConference onComputer Vision andPattern Recognition (CVPR rsquo13) pp 716ndash723 IEEE June 2013

[13] H Rahmani AMahmood D QHuynh andAMian ldquoHOPChistogram of oriented principal components of 3D pointcloudsfor action recognitionrdquo in Computer VisionmdashECCV 2014 DFleet T Pajdla B Schiele and T Tuytelaars Eds vol 8690 ofLecture Notes in Computer Science pp 742ndash757 Springer 2014

[14] G Chen D Clarke M Giuliani A Gaschler and A KnollldquoCombining unsupervised learning and discrimination for 3Daction recognitionrdquo Signal Processing vol 110 pp 67ndash81 2015

[15] A A Chaaraoui J R Padilla-Lopez and F Florez-RevueltaldquoFusion of skeletal and silhouette-based features for humanaction recognition with RGB-D devicesrdquo in Proceedings ofthe 14th IEEE International Conference on Computer VisionWorkshops (ICCVW rsquo13) pp 91ndash97 IEEE Sydney AustraliaDecember 2013

[16] S Althloothi M H Mahoor X Zhang and R M VoylesldquoHuman activity recognition using multi-features and multiplekernel learningrdquo Pattern Recognition vol 47 no 5 pp 1800ndash1812 2014

[17] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal featuresand joints for 3D action recognitionrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo13) pp 486ndash491 IEEE June 2013

[18] G Willems T Tuytelaars and L Van Gool ldquoAn efficient denseand scale-invariant spatio-temporal interest point detectorrdquo inComputer VisionmdashECCV 2008 D Forsyth P Torr and AZisserman Eds vol 5303 of Lecture Notes in Computer Sciencepp 650ndash663 Springer Berlin Germany 2008

[19] A Klaser M Marszałek and C Schmid ldquoA spatio-temporaldescriptor based on 3D-gradientsrdquo in Proceedings of the 19thBritish Machine Vision Conference (BMVC rsquo08) pp 995ndash1004September 2008

[20] J LuoWWang and H Qi ldquoSpatio-temporal feature extractionand representation for RGB-D human action recognitionrdquoPattern Recognition Letters vol 50 pp 139ndash148 2014

[21] Y Zhu W Chen and G Guo ldquoEvaluating spatiotemporalinterest point features for depth-based action recognitionrdquoImage and Vision Computing vol 32 no 8 pp 453ndash464 2014

[22] A-A Liu W-Z Nie Y-T Su L Ma T Hao and Z-X YangldquoCoupled hidden conditional random fields for RGB-D humanaction recognitionrdquo Signal Processing vol 112 pp 74ndash82 2015

[23] M Devanne H Wannous S Berretti P Pala M Daoudiand A Del Bimbo ldquoSpace-time pose representation for 3Dhuman action recognitionrdquo inNewTrends in ImageAnalysis andProcessingmdashICIAP 2013 vol 8158 of Lecture Notes in ComputerScience pp 456ndash464 Springer Berlin Germany 2013

[24] L Gan and F Chen ldquoHuman action recognition using APJ3Dand random forestsrdquo Journal of Software vol 8 no 9 pp 2238ndash2245 2013

[25] J Wang Z Liu Y Wu and J Yuan ldquoMining actionlet ensemblefor action recognition with depth camerasrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo12) pp 1290ndash1297 Providence RI USA June 2012

[26] L Xia C-C Chen and J K Aggarwal ldquoView invariant humanaction recognition using histograms of 3D jointsrdquo inProceedingsof the IEEE Computer Society Conference on Computer Visionand Pattern Recognition Workshops (CVPRW rsquo12) pp 20ndash27Providence RI June 2012

[27] X Yang and Y Tian ldquoEffective 3D action recognition usingEigen Jointsrdquo Journal of Visual Communication and ImageRepresentation vol 25 no 1 pp 2ndash11 2014

[28] A Taha H H Zayed M E Khalifa and E-S M El-HorbatyldquoHuman activity recognition for surveillance applicationsrdquo inProceedings of the 7th International Conference on InformationTechnology pp 577ndash586 2015

[29] S Gaglio G Lo Re and M Morana ldquoHuman activity recog-nition process using 3-D posture datardquo IEEE Transactions onHuman-Machine Systems vol 45 no 5 pp 586ndash597 2015

[30] I Theodorakopoulos D Kastaniotis G Economou and SFotopoulos ldquoPose-based human action recognition via sparserepresentation in dissimilarity spacerdquo Journal of Visual Commu-nication and Image Representation vol 25 no 1 pp 12ndash23 2014

[31] WDing K Liu F Cheng and J Zhang ldquoSTFC spatio-temporalfeature chain for skeleton-based human action recognitionrdquoJournal of Visual Communication and Image Representation vol26 pp 329ndash337 2015

[32] R Slama HWannous M Daoudi and A Srivastava ldquoAccurate3D action recognition using learning on the Grassmann mani-foldrdquo Pattern Recognition vol 48 no 2 pp 556ndash567 2015

[33] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 27th IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR rsquo14) pp 588ndash595Columbus Ohio USA June 2014

[34] R Anirudh P Turaga J Su and A Srivastava ldquoElastic func-tional coding of human actions From vector-fields to latentvariablesrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR rsquo15) pp 3147ndash3155Boston Mass USA June 2015

[35] M Jiang J Kong G Bebis and H Huo ldquoInformative jointsbased human action recognition using skeleton contextsrdquo SignalProcessing Image Communication vol 33 pp 29ndash40 2015

[36] A A Chaaraoui J R Padilla-Lopez P Climent-Perez andF Florez-Revuelta ldquoEvolutionary joint selection to improvehuman action recognitionwith RGB-Ddevicesrdquo Expert Systemswith Applications vol 41 no 3 pp 786ndash794 2014

[37] S Baysal M C Kurt and P Duygulu ldquoRecognizing humanactions using key posesrdquo in Proceedings of the 20th InternationalConference on Pattern Recognition (ICPR rsquo10) pp 1727ndash1730Istanbul Turkey August 2010

[38] A A Chaaraoui P Climent-Perez and F Florez-RevueltaldquoSilhouette-based human action recognition using sequences of

14 Computational Intelligence and Neuroscience

key posesrdquo Pattern Recognition Letters vol 34 no 15 pp 1799ndash1807 2013

[39] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[40] C-W Hsu and C-J Lin ldquoA comparison of methods for mul-ticlass support vector machinesrdquo IEEE Transactions on NeuralNetworks vol 13 no 2 pp 415ndash425 2002

[41] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[42] OpenNI 2015 httpstructureioopenni[43] J Sung C Ponce B Selman and A Saxena ldquoUnstructured

human activity detection fromRGBD imagesrdquo in Proceedings ofthe IEEE International Conference on Robotics and Automation(ICRA rsquo12) pp 842ndash849 Saint Paul Minn USA May 2012

[44] L Seidenari V Varano S Berretti A Del Bimbo and P PalaldquoRecognizing actions from depth cameras as weakly alignedmulti-part Bag-of-Posesrdquo in Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition Workshops(CVPRW rsquo13) pp 479ndash485 Portland Ore USA June 2013

[45] W Li Z Zhang and Z Liu ldquoAction recognition based ona bag of 3D pointsrdquo in Proceedings of the IEEE ComputerSociety Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo10) pp 9ndash14 San Francisco Calif USAJune 2010

[46] J Shan and S Akella ldquo3D human action segmentation andrecognition using pose kinetic energyrdquo in Proceedings of theIEEE Workshop on Advanced Robotics and its Social Impacts(ARSO rsquo14) pp 69ndash75 Evanston Ill USA September 2014

[47] D R Faria C Premebida and U Nunes ldquoA probabilisticapproach for human everyday activities recognition using bodymotion from RGB-D imagesrdquo in Proceedings of the 23rd IEEEInternational Symposium on Robot and Human InteractiveCommunication (RO-MAN rsquo14) pp 732ndash737 Edinburgh UKAugust 2014

[48] G I Parisi C Weber and S Wermter ldquoSelf-organizing neuralintegration of pose-motion features for human action recogni-tionrdquo Frontiers in Neurorobotics vol 9 no 3 2015

[49] J R Padilla-Lopez A A Chaaraoui and F Florez-Revuelta ldquoAdiscussion on the validation tests employed to compare humanaction recognition methods using the MSR Action3D datasetrdquohttparxivorgabs14077390

[50] S Azary and A Savakis ldquo3D Action classification using sparsespatio-temporal feature representationsrdquo in Advances in VisualComputing G Bebis R Boyle B Parvin et al Eds vol 7432 ofLecture Notes in Computer Science pp 166ndash175 Springer BerlinGermany 2012

[51] O Celiktutan C Wolf B Sankur and E Lombardi ldquoFast exacthyper-graph matching with dynamic programming for spatio-temporal datardquo Journal ofMathematical Imaging andVision vol51 no 1 pp 1ndash21 2015

[52] S Azary and A Savakis ldquoGrassmannian sparse representationsand motion depth surfaces for 3D action recognitionrdquo inProceedings of the IEEE Conference on Computer Vision andPattern Recognition Workshops (CVPRW rsquo13) pp 492ndash499Portland Ore USA June 2013

[53] J C Platt ldquoProbabilistic outputs for support vector machinesand comparisons to regularized likelihood methodsrdquo inAdvances in Large Margin Classifiers pp 61ndash74 MIT PressCambridge Mass USA 1999

[54] H-T Lin C-J Lin and R C Weng ldquoA note on Plattrsquosprobabilistic outputs for support vector machinesrdquo MachineLearning vol 68 no 3 pp 267ndash276 2007

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 11: Research Article A Human Activity Recognition System Using Skeleton …downloads.hindawi.com/journals/cin/2016/4351435.pdf · 2019-07-30 · Kinect skeleton is motivated by the fact

Computational Intelligence and Neuroscience 11

081 008

004 056 022 011007

012 077 012

012

004 004 081 012

100

081 004 015

013 087

004 007019 070

a02

a03

a05

a06

a10

a13

a18

a20

a02

a03

a05

a06

a10

a13

a18

a20

(a)

056 015 007 015 007

007

007 017 013

016 044 008 004 004 024

011 056 004

003

022

010 070 017

063

100

010 087 003

007 093

a01

a04

a07

a08

a09

a11

a12

a14

a01

a04

a07

a08

a09

a11

a12

a14

(b)

088 012

100

005 095

097 003

003 097

093 007

003 003 083 010

004 011 085

a06

a14

a15

a16

a17

a18

a19

a20

a06

a14

a15

a16

a17

a18

a19

a20

(c)

Figure 7 Confusion matrices of the MSR Action3D dataset obtained with 119875 = 7 and the ldquonew-personrdquo test (a) AS1 (b) AS2 and (c) AS3

is shown In both situations the main problem is given by thesimilarity of drink and answer phone activities

425 MSR Action3D Dataset The MSR Action3D datasetis composed of 20 activities which are mainly gestures Itsevaluation is included for the sake of completeness in thecomparison with other activity recognition algorithms butthe dataset does not contain AAL related activities Thereis a big confusion in the literature about the validationtests to be used for MSR Action3D Padilla-Lopez et al[49] summarized all the validation methods for this datasetand recommend using all the possible combinations of 5-5 subjects splitting or using LOAO The former procedureconsists of considering 252 combinations of 5 subjects fortraining and the remaining 5 for testing while the latter is theleave-one-actor-out equivalent to the ldquonew-personrdquo schemepreviously introduced This dataset is quite challenging dueto the presence of similar and complex gestures hence theevaluation is performed by considering three subsets of 8gestures each (AS1 AS2 and AS3 in Table 9) as suggestedin [45] Since the minimum number of frames constitutingone sequence is 13 the following set of clusters has beenconsidered 119870 = [3 5 8 10 13] All the combinations of 711 15 and 20 joints shown in Figure 2 are included in theexperimental tests

Considering the ldquonew-personrdquo scheme the proposedalgorithm tested separately on the three subsets of MSRAction3D reaches an average accuracy of 812 with 119875 = 7and 119870 = 10 The confusion matrices for the three subsetsare shown in Figure 7 where it is possible to notice thatthe algorithm struggles in the recognition of the AS2 subset(Figure 7(b)) mainly represented by drawing gestures Betterresults are obtained processing the subset AS3 (Figure 7(c))where lower recognition rates are shown for complex ges-tures golf swing and pickup and throw Table 10 shows theresults obtained including also other joints which are slightlyworse A comparison with previous works validated usingthe ldquonew-personrdquo test is shown in Table 11 The proposedalgorithm achieves results comparable with [50] whichexploits an approach based on skeleton data Chaaraoui etal [15 36] exploits more complex algorithms considering

Table 9 Three subsets of gestures fromMSR Action3D dataset

AS1 AS2 AS3[11988602] Horizontalarm wave [11988601] High arm wave [11988606] High throw

[11988603] Hammer [11988604] Hand catch [11988614] Forward kick[11988605] Forwardpunch [11988607] Draw X [11988615] Side kick

[11988606] High throw [11988608] Draw tick [11988616] Jogging[11988610] Hand clap [11988609] Draw circle [11988617] Tennis swing[11988613] Bend [11988611] Two-hand wave [11988618] Tennis serve[11988618] Tennis serve [11988612] Side boxing [11988619] Golf swing[11988620] Pickup andthrow [11988614] Forward kick [11988620] Pickup and throw

Table 10 Accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting with different subsetsof joints and different subsets of activities

Algorithm AS1 AS2 AS3 AvgProposed (119875 = 15) 785 688 928 800

Proposed (119875 = 20) 790 702 919 804

Proposed (119875 = 11) 776 737 914 809

Proposed (119875 = 7) 795 719 923 812

the fusion of skeleton and depth data or the evolutionaryselection of the best set of joints

5 Discussion

The proposed algorithm despite its simplicity is able toachieve and sometimes to overcome state-of-the-art per-formance when applied to publicly available datasets Inparticular it is able to outperform some complex algorithmsexploiting more than one classifier in KARD and CAD-60datasets

Limiting the analysis to AAL related activities only thealgorithm achieves interesting results in all the datasetsThe group of 8 activities labeled as Actions in the KARD

12 Computational Intelligence and Neuroscience

Table 11 Average accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting compared to otherworks

Algorithm AccuracyCeliktutan et al (2015) [51] 729

Azary and Savakis (2013) [52] 785

Proposed (119875 = 7) 812

Azary and Savakis (2012) [50] 839

Chaaraoui et al (2013) [15] 906

Chaaraoui et al (2014) [36] 935

dataset is recognized with an accuracy greater than 987in the considered experiments and the group includes twosimilar activities such as phone call and drink The CAD-60dataset contains only actions so all the activities consideredwithin the proper location have been included in the eval-uation resulting in global precision and recall of 939 and935 respectively In the UTKinect dataset the performanceimproves considering only the AAL activities and the samehappens also in the Florence3D having the highest accuracyclose to 91 even if some actions are very similar

The MSR Action3D is a challenging dataset mainlycomprising gestures for human-computer interaction andnot actions Many gestures especially the ones included inAS2 subset are very similar to each other In order to improvethe recognition accuracy more complex features should beconsidered including not only jointsrsquo relative positions butalso their velocity Many sequences contain noisy skeletondata another approach to improving recognition accuracycould be the development of a method to discard noisyskeleton or to include depth-based features

However the AAL scenario raises some problems whichhave to be addressed First of all the algorithm exploits amul-ticlass SVM which is a good classifier but it does not makeeasy to understand if an activity belongs or not to any of thetraining classes In fact in a real scenario it is possible to havea sequence of frames that does not represent any activity ofthe training set in this case the SVM outputs the most likelyclass anyway even if it does not make sense Other machinelearning algorithms such as HMMs distinguish amongmultiple classes using the maximum posterior probabilityGaglio et al [29] proposed using a threshold on the outputprobability to detect unknown actions Usually SVMs do notprovide output probabilities but there exist some methodsto extend SVM implementations and make them able toprovide also this information [53 54] These techniquescan be investigated to understand their applicability in theproposed scenario

Another problem is segmentation of actions In manydatasets the actions are represented as segmented sequencesof frames but in real applications the algorithm has to handlea continuous stream of frames and to segment actions byitself Some solutions for segmentation have been proposedbut most of them are based on thresholds on movementswhich can be highly data-dependent [46] Also this aspect hasto be further investigated to have a systemwhich is effectivelyapplicable in a real AAL scenario

6 Conclusions

In this work a simple yet effective activity recognitionalgorithm has been proposed It is based on skeleton dataextracted from an RGBD sensor and it creates a featurevector representing the whole activity It is able to overcomestate-of-the-art results in two publicly available datasets theKARD and CAD-60 outperforming more complex algo-rithms in many conditions The algorithm has been testedalso over other more challenging datasets the UTKinect andFlorence3D where it is outperformed only by algorithmsexploiting temporal alignment techniques or a combinationof several machine learning methods The MSR Action3Dis a complex dataset and the proposed algorithm strugglesin the recognition of subsets constituted by similar gesturesHowever this dataset does not contain any activity of interestfor AAL scenarios

Future works will concern the application of the activityrecognition algorithm to AAL scenarios by consideringaction segmentation and the detection of unknown activities

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to acknowledge the contributionof the COST Action IC1303 AAPELE (Architectures Algo-rithms and Platforms for Enhanced Living Environments)

References

[1] P Rashidi and A Mihailidis ldquoA survey on ambient-assistedliving tools for older adultsrdquo IEEE Journal of Biomedical andHealth Informatics vol 17 no 3 pp 579ndash590 2013

[2] O C Ann and L B Theng ldquoHuman activity recognition areviewrdquo in Proceedings of the IEEE International Conference onControl System Computing and Engineering (ICCSCE rsquo14) pp389ndash393 IEEE Batu Ferringhi Malaysia November 2014

[3] S Wang and G Zhou ldquoA review on radio based activityrecognitionrdquo Digital Communications and Networks vol 1 no1 pp 20ndash29 2015

[4] L Atallah B Lo R Ali R King and G-Z Yang ldquoReal-timeactivity classificationusing ambient andwearable sensorsrdquo IEEETransactions on Information Technology in Biomedicine vol 13no 6 pp 1031ndash1039 2009

[5] D De P Bharti S K Das and S Chellappan ldquoMultimodalwearable sensing for fine-grained activity recognition in health-carerdquo IEEE Internet Computing vol 19 no 5 pp 26ndash35 2015

[6] J K Aggarwal and L Xia ldquoHuman activity recognition from 3Ddata a reviewrdquo Pattern Recognition Letters vol 48 pp 70ndash802014

[7] S Gasparrini E Cippitelli E Gambi S Spinsante and FFlorez-Revuelta ldquoPerformance analysis of self-organising neu-ral networks tracking algorithms for intake monitoring usingkinectrdquo in Proceedings of the 1st IET International Conferenceon Technologies for Active and Assisted Living (TechAAL rsquo15)Kingston UK November 2015

Computational Intelligence and Neuroscience 13

[8] J Shotton A Fitzgibbon M Cook et al ldquoReal-time humanpose recognition in parts from single depth imagesrdquo in Pro-ceedings of the IEEE Conference on Computer Vision and PatternRecognition (CVPR rsquo11) pp 1297ndash1304 IEEE Providence RdUSA June 2011

[9] J R Padilla-Lopez A A Chaaraoui F Gu and F Florez-Revuelta ldquoVisual privacy by context proposal and evaluationof a level-based visualisation schemerdquo Sensors vol 15 no 6 pp12959ndash12982 2015

[10] X Yang and Y Tian ldquoSuper normal vector for activity recog-nition using depth sequencesrdquo in Proceedings of the 27th IEEEConference on Computer Vision and Pattern Recognition (CVPRrsquo14) pp 804ndash811 IEEE Columbus Ohio USA June 2014

[11] R Slama H Wannous and M Daoudi ldquoGrassmannian rep-resentation of motion depth for 3D human gesture and actionrecognitionrdquo inProceedings of the 22nd International Conferenceon Pattern Recognition (ICPR rsquo14) pp 3499ndash3504 IEEE Stock-holm Sweden August 2014

[12] O Oreifej and Z Liu ldquoHON4D histogram of oriented 4Dnormals for activity recognition from depth sequencesrdquo inProceedings of the 26th IEEEConference onComputer Vision andPattern Recognition (CVPR rsquo13) pp 716ndash723 IEEE June 2013

[13] H Rahmani AMahmood D QHuynh andAMian ldquoHOPChistogram of oriented principal components of 3D pointcloudsfor action recognitionrdquo in Computer VisionmdashECCV 2014 DFleet T Pajdla B Schiele and T Tuytelaars Eds vol 8690 ofLecture Notes in Computer Science pp 742ndash757 Springer 2014

[14] G Chen D Clarke M Giuliani A Gaschler and A KnollldquoCombining unsupervised learning and discrimination for 3Daction recognitionrdquo Signal Processing vol 110 pp 67ndash81 2015

[15] A A Chaaraoui J R Padilla-Lopez and F Florez-RevueltaldquoFusion of skeletal and silhouette-based features for humanaction recognition with RGB-D devicesrdquo in Proceedings ofthe 14th IEEE International Conference on Computer VisionWorkshops (ICCVW rsquo13) pp 91ndash97 IEEE Sydney AustraliaDecember 2013

[16] S Althloothi M H Mahoor X Zhang and R M VoylesldquoHuman activity recognition using multi-features and multiplekernel learningrdquo Pattern Recognition vol 47 no 5 pp 1800ndash1812 2014

[17] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal featuresand joints for 3D action recognitionrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo13) pp 486ndash491 IEEE June 2013

[18] G Willems T Tuytelaars and L Van Gool ldquoAn efficient denseand scale-invariant spatio-temporal interest point detectorrdquo inComputer VisionmdashECCV 2008 D Forsyth P Torr and AZisserman Eds vol 5303 of Lecture Notes in Computer Sciencepp 650ndash663 Springer Berlin Germany 2008

[19] A Klaser M Marszałek and C Schmid ldquoA spatio-temporaldescriptor based on 3D-gradientsrdquo in Proceedings of the 19thBritish Machine Vision Conference (BMVC rsquo08) pp 995ndash1004September 2008

[20] J LuoWWang and H Qi ldquoSpatio-temporal feature extractionand representation for RGB-D human action recognitionrdquoPattern Recognition Letters vol 50 pp 139ndash148 2014

[21] Y Zhu W Chen and G Guo ldquoEvaluating spatiotemporalinterest point features for depth-based action recognitionrdquoImage and Vision Computing vol 32 no 8 pp 453ndash464 2014

[22] A-A Liu W-Z Nie Y-T Su L Ma T Hao and Z-X YangldquoCoupled hidden conditional random fields for RGB-D humanaction recognitionrdquo Signal Processing vol 112 pp 74ndash82 2015

[23] M Devanne H Wannous S Berretti P Pala M Daoudiand A Del Bimbo ldquoSpace-time pose representation for 3Dhuman action recognitionrdquo inNewTrends in ImageAnalysis andProcessingmdashICIAP 2013 vol 8158 of Lecture Notes in ComputerScience pp 456ndash464 Springer Berlin Germany 2013

[24] L Gan and F Chen ldquoHuman action recognition using APJ3Dand random forestsrdquo Journal of Software vol 8 no 9 pp 2238ndash2245 2013

[25] J Wang Z Liu Y Wu and J Yuan ldquoMining actionlet ensemblefor action recognition with depth camerasrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo12) pp 1290ndash1297 Providence RI USA June 2012

[26] L Xia C-C Chen and J K Aggarwal ldquoView invariant humanaction recognition using histograms of 3D jointsrdquo inProceedingsof the IEEE Computer Society Conference on Computer Visionand Pattern Recognition Workshops (CVPRW rsquo12) pp 20ndash27Providence RI June 2012

[27] X Yang and Y Tian ldquoEffective 3D action recognition usingEigen Jointsrdquo Journal of Visual Communication and ImageRepresentation vol 25 no 1 pp 2ndash11 2014

[28] A Taha H H Zayed M E Khalifa and E-S M El-HorbatyldquoHuman activity recognition for surveillance applicationsrdquo inProceedings of the 7th International Conference on InformationTechnology pp 577ndash586 2015

[29] S Gaglio G Lo Re and M Morana ldquoHuman activity recog-nition process using 3-D posture datardquo IEEE Transactions onHuman-Machine Systems vol 45 no 5 pp 586ndash597 2015

[30] I Theodorakopoulos D Kastaniotis G Economou and SFotopoulos ldquoPose-based human action recognition via sparserepresentation in dissimilarity spacerdquo Journal of Visual Commu-nication and Image Representation vol 25 no 1 pp 12ndash23 2014

[31] WDing K Liu F Cheng and J Zhang ldquoSTFC spatio-temporalfeature chain for skeleton-based human action recognitionrdquoJournal of Visual Communication and Image Representation vol26 pp 329ndash337 2015

[32] R Slama HWannous M Daoudi and A Srivastava ldquoAccurate3D action recognition using learning on the Grassmann mani-foldrdquo Pattern Recognition vol 48 no 2 pp 556ndash567 2015

[33] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 27th IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR rsquo14) pp 588ndash595Columbus Ohio USA June 2014

[34] R Anirudh P Turaga J Su and A Srivastava ldquoElastic func-tional coding of human actions From vector-fields to latentvariablesrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR rsquo15) pp 3147ndash3155Boston Mass USA June 2015

[35] M Jiang J Kong G Bebis and H Huo ldquoInformative jointsbased human action recognition using skeleton contextsrdquo SignalProcessing Image Communication vol 33 pp 29ndash40 2015

[36] A A Chaaraoui J R Padilla-Lopez P Climent-Perez andF Florez-Revuelta ldquoEvolutionary joint selection to improvehuman action recognitionwith RGB-Ddevicesrdquo Expert Systemswith Applications vol 41 no 3 pp 786ndash794 2014

[37] S Baysal M C Kurt and P Duygulu ldquoRecognizing humanactions using key posesrdquo in Proceedings of the 20th InternationalConference on Pattern Recognition (ICPR rsquo10) pp 1727ndash1730Istanbul Turkey August 2010

[38] A A Chaaraoui P Climent-Perez and F Florez-RevueltaldquoSilhouette-based human action recognition using sequences of

14 Computational Intelligence and Neuroscience

key posesrdquo Pattern Recognition Letters vol 34 no 15 pp 1799ndash1807 2013

[39] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[40] C-W Hsu and C-J Lin ldquoA comparison of methods for mul-ticlass support vector machinesrdquo IEEE Transactions on NeuralNetworks vol 13 no 2 pp 415ndash425 2002

[41] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[42] OpenNI 2015 httpstructureioopenni[43] J Sung C Ponce B Selman and A Saxena ldquoUnstructured

human activity detection fromRGBD imagesrdquo in Proceedings ofthe IEEE International Conference on Robotics and Automation(ICRA rsquo12) pp 842ndash849 Saint Paul Minn USA May 2012

[44] L Seidenari V Varano S Berretti A Del Bimbo and P PalaldquoRecognizing actions from depth cameras as weakly alignedmulti-part Bag-of-Posesrdquo in Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition Workshops(CVPRW rsquo13) pp 479ndash485 Portland Ore USA June 2013

[45] W Li Z Zhang and Z Liu ldquoAction recognition based ona bag of 3D pointsrdquo in Proceedings of the IEEE ComputerSociety Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo10) pp 9ndash14 San Francisco Calif USAJune 2010

[46] J Shan and S Akella ldquo3D human action segmentation andrecognition using pose kinetic energyrdquo in Proceedings of theIEEE Workshop on Advanced Robotics and its Social Impacts(ARSO rsquo14) pp 69ndash75 Evanston Ill USA September 2014

[47] D R Faria C Premebida and U Nunes ldquoA probabilisticapproach for human everyday activities recognition using bodymotion from RGB-D imagesrdquo in Proceedings of the 23rd IEEEInternational Symposium on Robot and Human InteractiveCommunication (RO-MAN rsquo14) pp 732ndash737 Edinburgh UKAugust 2014

[48] G I Parisi C Weber and S Wermter ldquoSelf-organizing neuralintegration of pose-motion features for human action recogni-tionrdquo Frontiers in Neurorobotics vol 9 no 3 2015

[49] J R Padilla-Lopez A A Chaaraoui and F Florez-Revuelta ldquoAdiscussion on the validation tests employed to compare humanaction recognition methods using the MSR Action3D datasetrdquohttparxivorgabs14077390

[50] S Azary and A Savakis ldquo3D Action classification using sparsespatio-temporal feature representationsrdquo in Advances in VisualComputing G Bebis R Boyle B Parvin et al Eds vol 7432 ofLecture Notes in Computer Science pp 166ndash175 Springer BerlinGermany 2012

[51] O Celiktutan C Wolf B Sankur and E Lombardi ldquoFast exacthyper-graph matching with dynamic programming for spatio-temporal datardquo Journal ofMathematical Imaging andVision vol51 no 1 pp 1ndash21 2015

[52] S Azary and A Savakis ldquoGrassmannian sparse representationsand motion depth surfaces for 3D action recognitionrdquo inProceedings of the IEEE Conference on Computer Vision andPattern Recognition Workshops (CVPRW rsquo13) pp 492ndash499Portland Ore USA June 2013

[53] J C Platt ldquoProbabilistic outputs for support vector machinesand comparisons to regularized likelihood methodsrdquo inAdvances in Large Margin Classifiers pp 61ndash74 MIT PressCambridge Mass USA 1999

[54] H-T Lin C-J Lin and R C Weng ldquoA note on Plattrsquosprobabilistic outputs for support vector machinesrdquo MachineLearning vol 68 no 3 pp 267ndash276 2007

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 12: Research Article A Human Activity Recognition System Using Skeleton …downloads.hindawi.com/journals/cin/2016/4351435.pdf · 2019-07-30 · Kinect skeleton is motivated by the fact

12 Computational Intelligence and Neuroscience

Table 11 Average accuracy () of the proposed algorithm for MSRAction3D dataset and ldquonew-personrdquo setting compared to otherworks

Algorithm AccuracyCeliktutan et al (2015) [51] 729

Azary and Savakis (2013) [52] 785

Proposed (119875 = 7) 812

Azary and Savakis (2012) [50] 839

Chaaraoui et al (2013) [15] 906

Chaaraoui et al (2014) [36] 935

dataset is recognized with an accuracy greater than 987in the considered experiments and the group includes twosimilar activities such as phone call and drink The CAD-60dataset contains only actions so all the activities consideredwithin the proper location have been included in the eval-uation resulting in global precision and recall of 939 and935 respectively In the UTKinect dataset the performanceimproves considering only the AAL activities and the samehappens also in the Florence3D having the highest accuracyclose to 91 even if some actions are very similar

The MSR Action3D is a challenging dataset mainlycomprising gestures for human-computer interaction andnot actions Many gestures especially the ones included inAS2 subset are very similar to each other In order to improvethe recognition accuracy more complex features should beconsidered including not only jointsrsquo relative positions butalso their velocity Many sequences contain noisy skeletondata another approach to improving recognition accuracycould be the development of a method to discard noisyskeleton or to include depth-based features

However the AAL scenario raises some problems whichhave to be addressed First of all the algorithm exploits amul-ticlass SVM which is a good classifier but it does not makeeasy to understand if an activity belongs or not to any of thetraining classes In fact in a real scenario it is possible to havea sequence of frames that does not represent any activity ofthe training set in this case the SVM outputs the most likelyclass anyway even if it does not make sense Other machinelearning algorithms such as HMMs distinguish amongmultiple classes using the maximum posterior probabilityGaglio et al [29] proposed using a threshold on the outputprobability to detect unknown actions Usually SVMs do notprovide output probabilities but there exist some methodsto extend SVM implementations and make them able toprovide also this information [53 54] These techniquescan be investigated to understand their applicability in theproposed scenario

Another problem is segmentation of actions In manydatasets the actions are represented as segmented sequencesof frames but in real applications the algorithm has to handlea continuous stream of frames and to segment actions byitself Some solutions for segmentation have been proposedbut most of them are based on thresholds on movementswhich can be highly data-dependent [46] Also this aspect hasto be further investigated to have a systemwhich is effectivelyapplicable in a real AAL scenario

6 Conclusions

In this work a simple yet effective activity recognitionalgorithm has been proposed It is based on skeleton dataextracted from an RGBD sensor and it creates a featurevector representing the whole activity It is able to overcomestate-of-the-art results in two publicly available datasets theKARD and CAD-60 outperforming more complex algo-rithms in many conditions The algorithm has been testedalso over other more challenging datasets the UTKinect andFlorence3D where it is outperformed only by algorithmsexploiting temporal alignment techniques or a combinationof several machine learning methods The MSR Action3Dis a complex dataset and the proposed algorithm strugglesin the recognition of subsets constituted by similar gesturesHowever this dataset does not contain any activity of interestfor AAL scenarios

Future works will concern the application of the activityrecognition algorithm to AAL scenarios by consideringaction segmentation and the detection of unknown activities

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors would like to acknowledge the contributionof the COST Action IC1303 AAPELE (Architectures Algo-rithms and Platforms for Enhanced Living Environments)

References

[1] P Rashidi and A Mihailidis ldquoA survey on ambient-assistedliving tools for older adultsrdquo IEEE Journal of Biomedical andHealth Informatics vol 17 no 3 pp 579ndash590 2013

[2] O C Ann and L B Theng ldquoHuman activity recognition areviewrdquo in Proceedings of the IEEE International Conference onControl System Computing and Engineering (ICCSCE rsquo14) pp389ndash393 IEEE Batu Ferringhi Malaysia November 2014

[3] S Wang and G Zhou ldquoA review on radio based activityrecognitionrdquo Digital Communications and Networks vol 1 no1 pp 20ndash29 2015

[4] L Atallah B Lo R Ali R King and G-Z Yang ldquoReal-timeactivity classificationusing ambient andwearable sensorsrdquo IEEETransactions on Information Technology in Biomedicine vol 13no 6 pp 1031ndash1039 2009

[5] D De P Bharti S K Das and S Chellappan ldquoMultimodalwearable sensing for fine-grained activity recognition in health-carerdquo IEEE Internet Computing vol 19 no 5 pp 26ndash35 2015

[6] J K Aggarwal and L Xia ldquoHuman activity recognition from 3Ddata a reviewrdquo Pattern Recognition Letters vol 48 pp 70ndash802014

[7] S Gasparrini E Cippitelli E Gambi S Spinsante and FFlorez-Revuelta ldquoPerformance analysis of self-organising neu-ral networks tracking algorithms for intake monitoring usingkinectrdquo in Proceedings of the 1st IET International Conferenceon Technologies for Active and Assisted Living (TechAAL rsquo15)Kingston UK November 2015

Computational Intelligence and Neuroscience 13

[8] J Shotton A Fitzgibbon M Cook et al ldquoReal-time humanpose recognition in parts from single depth imagesrdquo in Pro-ceedings of the IEEE Conference on Computer Vision and PatternRecognition (CVPR rsquo11) pp 1297ndash1304 IEEE Providence RdUSA June 2011

[9] J R Padilla-Lopez A A Chaaraoui F Gu and F Florez-Revuelta ldquoVisual privacy by context proposal and evaluationof a level-based visualisation schemerdquo Sensors vol 15 no 6 pp12959ndash12982 2015

[10] X Yang and Y Tian ldquoSuper normal vector for activity recog-nition using depth sequencesrdquo in Proceedings of the 27th IEEEConference on Computer Vision and Pattern Recognition (CVPRrsquo14) pp 804ndash811 IEEE Columbus Ohio USA June 2014

[11] R Slama H Wannous and M Daoudi ldquoGrassmannian rep-resentation of motion depth for 3D human gesture and actionrecognitionrdquo inProceedings of the 22nd International Conferenceon Pattern Recognition (ICPR rsquo14) pp 3499ndash3504 IEEE Stock-holm Sweden August 2014

[12] O Oreifej and Z Liu ldquoHON4D histogram of oriented 4Dnormals for activity recognition from depth sequencesrdquo inProceedings of the 26th IEEEConference onComputer Vision andPattern Recognition (CVPR rsquo13) pp 716ndash723 IEEE June 2013

[13] H Rahmani AMahmood D QHuynh andAMian ldquoHOPChistogram of oriented principal components of 3D pointcloudsfor action recognitionrdquo in Computer VisionmdashECCV 2014 DFleet T Pajdla B Schiele and T Tuytelaars Eds vol 8690 ofLecture Notes in Computer Science pp 742ndash757 Springer 2014

[14] G Chen D Clarke M Giuliani A Gaschler and A KnollldquoCombining unsupervised learning and discrimination for 3Daction recognitionrdquo Signal Processing vol 110 pp 67ndash81 2015

[15] A A Chaaraoui J R Padilla-Lopez and F Florez-RevueltaldquoFusion of skeletal and silhouette-based features for humanaction recognition with RGB-D devicesrdquo in Proceedings ofthe 14th IEEE International Conference on Computer VisionWorkshops (ICCVW rsquo13) pp 91ndash97 IEEE Sydney AustraliaDecember 2013

[16] S Althloothi M H Mahoor X Zhang and R M VoylesldquoHuman activity recognition using multi-features and multiplekernel learningrdquo Pattern Recognition vol 47 no 5 pp 1800ndash1812 2014

[17] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal featuresand joints for 3D action recognitionrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo13) pp 486ndash491 IEEE June 2013

[18] G Willems T Tuytelaars and L Van Gool ldquoAn efficient denseand scale-invariant spatio-temporal interest point detectorrdquo inComputer VisionmdashECCV 2008 D Forsyth P Torr and AZisserman Eds vol 5303 of Lecture Notes in Computer Sciencepp 650ndash663 Springer Berlin Germany 2008

[19] A Klaser M Marszałek and C Schmid ldquoA spatio-temporaldescriptor based on 3D-gradientsrdquo in Proceedings of the 19thBritish Machine Vision Conference (BMVC rsquo08) pp 995ndash1004September 2008

[20] J LuoWWang and H Qi ldquoSpatio-temporal feature extractionand representation for RGB-D human action recognitionrdquoPattern Recognition Letters vol 50 pp 139ndash148 2014

[21] Y Zhu W Chen and G Guo ldquoEvaluating spatiotemporalinterest point features for depth-based action recognitionrdquoImage and Vision Computing vol 32 no 8 pp 453ndash464 2014

[22] A-A Liu W-Z Nie Y-T Su L Ma T Hao and Z-X YangldquoCoupled hidden conditional random fields for RGB-D humanaction recognitionrdquo Signal Processing vol 112 pp 74ndash82 2015

[23] M Devanne H Wannous S Berretti P Pala M Daoudiand A Del Bimbo ldquoSpace-time pose representation for 3Dhuman action recognitionrdquo inNewTrends in ImageAnalysis andProcessingmdashICIAP 2013 vol 8158 of Lecture Notes in ComputerScience pp 456ndash464 Springer Berlin Germany 2013

[24] L Gan and F Chen ldquoHuman action recognition using APJ3Dand random forestsrdquo Journal of Software vol 8 no 9 pp 2238ndash2245 2013

[25] J Wang Z Liu Y Wu and J Yuan ldquoMining actionlet ensemblefor action recognition with depth camerasrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo12) pp 1290ndash1297 Providence RI USA June 2012

[26] L Xia C-C Chen and J K Aggarwal ldquoView invariant humanaction recognition using histograms of 3D jointsrdquo inProceedingsof the IEEE Computer Society Conference on Computer Visionand Pattern Recognition Workshops (CVPRW rsquo12) pp 20ndash27Providence RI June 2012

[27] X Yang and Y Tian ldquoEffective 3D action recognition usingEigen Jointsrdquo Journal of Visual Communication and ImageRepresentation vol 25 no 1 pp 2ndash11 2014

[28] A Taha H H Zayed M E Khalifa and E-S M El-HorbatyldquoHuman activity recognition for surveillance applicationsrdquo inProceedings of the 7th International Conference on InformationTechnology pp 577ndash586 2015

[29] S Gaglio G Lo Re and M Morana ldquoHuman activity recog-nition process using 3-D posture datardquo IEEE Transactions onHuman-Machine Systems vol 45 no 5 pp 586ndash597 2015

[30] I Theodorakopoulos D Kastaniotis G Economou and SFotopoulos ldquoPose-based human action recognition via sparserepresentation in dissimilarity spacerdquo Journal of Visual Commu-nication and Image Representation vol 25 no 1 pp 12ndash23 2014

[31] WDing K Liu F Cheng and J Zhang ldquoSTFC spatio-temporalfeature chain for skeleton-based human action recognitionrdquoJournal of Visual Communication and Image Representation vol26 pp 329ndash337 2015

[32] R Slama HWannous M Daoudi and A Srivastava ldquoAccurate3D action recognition using learning on the Grassmann mani-foldrdquo Pattern Recognition vol 48 no 2 pp 556ndash567 2015

[33] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 27th IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR rsquo14) pp 588ndash595Columbus Ohio USA June 2014

[34] R Anirudh P Turaga J Su and A Srivastava ldquoElastic func-tional coding of human actions From vector-fields to latentvariablesrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR rsquo15) pp 3147ndash3155Boston Mass USA June 2015

[35] M Jiang J Kong G Bebis and H Huo ldquoInformative jointsbased human action recognition using skeleton contextsrdquo SignalProcessing Image Communication vol 33 pp 29ndash40 2015

[36] A A Chaaraoui J R Padilla-Lopez P Climent-Perez andF Florez-Revuelta ldquoEvolutionary joint selection to improvehuman action recognitionwith RGB-Ddevicesrdquo Expert Systemswith Applications vol 41 no 3 pp 786ndash794 2014

[37] S Baysal M C Kurt and P Duygulu ldquoRecognizing humanactions using key posesrdquo in Proceedings of the 20th InternationalConference on Pattern Recognition (ICPR rsquo10) pp 1727ndash1730Istanbul Turkey August 2010

[38] A A Chaaraoui P Climent-Perez and F Florez-RevueltaldquoSilhouette-based human action recognition using sequences of

14 Computational Intelligence and Neuroscience

key posesrdquo Pattern Recognition Letters vol 34 no 15 pp 1799ndash1807 2013

[39] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[40] C-W Hsu and C-J Lin ldquoA comparison of methods for mul-ticlass support vector machinesrdquo IEEE Transactions on NeuralNetworks vol 13 no 2 pp 415ndash425 2002

[41] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[42] OpenNI 2015 httpstructureioopenni[43] J Sung C Ponce B Selman and A Saxena ldquoUnstructured

human activity detection fromRGBD imagesrdquo in Proceedings ofthe IEEE International Conference on Robotics and Automation(ICRA rsquo12) pp 842ndash849 Saint Paul Minn USA May 2012

[44] L Seidenari V Varano S Berretti A Del Bimbo and P PalaldquoRecognizing actions from depth cameras as weakly alignedmulti-part Bag-of-Posesrdquo in Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition Workshops(CVPRW rsquo13) pp 479ndash485 Portland Ore USA June 2013

[45] W Li Z Zhang and Z Liu ldquoAction recognition based ona bag of 3D pointsrdquo in Proceedings of the IEEE ComputerSociety Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo10) pp 9ndash14 San Francisco Calif USAJune 2010

[46] J Shan and S Akella ldquo3D human action segmentation andrecognition using pose kinetic energyrdquo in Proceedings of theIEEE Workshop on Advanced Robotics and its Social Impacts(ARSO rsquo14) pp 69ndash75 Evanston Ill USA September 2014

[47] D R Faria C Premebida and U Nunes ldquoA probabilisticapproach for human everyday activities recognition using bodymotion from RGB-D imagesrdquo in Proceedings of the 23rd IEEEInternational Symposium on Robot and Human InteractiveCommunication (RO-MAN rsquo14) pp 732ndash737 Edinburgh UKAugust 2014

[48] G I Parisi C Weber and S Wermter ldquoSelf-organizing neuralintegration of pose-motion features for human action recogni-tionrdquo Frontiers in Neurorobotics vol 9 no 3 2015

[49] J R Padilla-Lopez A A Chaaraoui and F Florez-Revuelta ldquoAdiscussion on the validation tests employed to compare humanaction recognition methods using the MSR Action3D datasetrdquohttparxivorgabs14077390

[50] S Azary and A Savakis ldquo3D Action classification using sparsespatio-temporal feature representationsrdquo in Advances in VisualComputing G Bebis R Boyle B Parvin et al Eds vol 7432 ofLecture Notes in Computer Science pp 166ndash175 Springer BerlinGermany 2012

[51] O Celiktutan C Wolf B Sankur and E Lombardi ldquoFast exacthyper-graph matching with dynamic programming for spatio-temporal datardquo Journal ofMathematical Imaging andVision vol51 no 1 pp 1ndash21 2015

[52] S Azary and A Savakis ldquoGrassmannian sparse representationsand motion depth surfaces for 3D action recognitionrdquo inProceedings of the IEEE Conference on Computer Vision andPattern Recognition Workshops (CVPRW rsquo13) pp 492ndash499Portland Ore USA June 2013

[53] J C Platt ldquoProbabilistic outputs for support vector machinesand comparisons to regularized likelihood methodsrdquo inAdvances in Large Margin Classifiers pp 61ndash74 MIT PressCambridge Mass USA 1999

[54] H-T Lin C-J Lin and R C Weng ldquoA note on Plattrsquosprobabilistic outputs for support vector machinesrdquo MachineLearning vol 68 no 3 pp 267ndash276 2007

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 13: Research Article A Human Activity Recognition System Using Skeleton …downloads.hindawi.com/journals/cin/2016/4351435.pdf · 2019-07-30 · Kinect skeleton is motivated by the fact

Computational Intelligence and Neuroscience 13

[8] J Shotton A Fitzgibbon M Cook et al ldquoReal-time humanpose recognition in parts from single depth imagesrdquo in Pro-ceedings of the IEEE Conference on Computer Vision and PatternRecognition (CVPR rsquo11) pp 1297ndash1304 IEEE Providence RdUSA June 2011

[9] J R Padilla-Lopez A A Chaaraoui F Gu and F Florez-Revuelta ldquoVisual privacy by context proposal and evaluationof a level-based visualisation schemerdquo Sensors vol 15 no 6 pp12959ndash12982 2015

[10] X Yang and Y Tian ldquoSuper normal vector for activity recog-nition using depth sequencesrdquo in Proceedings of the 27th IEEEConference on Computer Vision and Pattern Recognition (CVPRrsquo14) pp 804ndash811 IEEE Columbus Ohio USA June 2014

[11] R Slama H Wannous and M Daoudi ldquoGrassmannian rep-resentation of motion depth for 3D human gesture and actionrecognitionrdquo inProceedings of the 22nd International Conferenceon Pattern Recognition (ICPR rsquo14) pp 3499ndash3504 IEEE Stock-holm Sweden August 2014

[12] O Oreifej and Z Liu ldquoHON4D histogram of oriented 4Dnormals for activity recognition from depth sequencesrdquo inProceedings of the 26th IEEEConference onComputer Vision andPattern Recognition (CVPR rsquo13) pp 716ndash723 IEEE June 2013

[13] H Rahmani AMahmood D QHuynh andAMian ldquoHOPChistogram of oriented principal components of 3D pointcloudsfor action recognitionrdquo in Computer VisionmdashECCV 2014 DFleet T Pajdla B Schiele and T Tuytelaars Eds vol 8690 ofLecture Notes in Computer Science pp 742ndash757 Springer 2014

[14] G Chen D Clarke M Giuliani A Gaschler and A KnollldquoCombining unsupervised learning and discrimination for 3Daction recognitionrdquo Signal Processing vol 110 pp 67ndash81 2015

[15] A A Chaaraoui J R Padilla-Lopez and F Florez-RevueltaldquoFusion of skeletal and silhouette-based features for humanaction recognition with RGB-D devicesrdquo in Proceedings ofthe 14th IEEE International Conference on Computer VisionWorkshops (ICCVW rsquo13) pp 91ndash97 IEEE Sydney AustraliaDecember 2013

[16] S Althloothi M H Mahoor X Zhang and R M VoylesldquoHuman activity recognition using multi-features and multiplekernel learningrdquo Pattern Recognition vol 47 no 5 pp 1800ndash1812 2014

[17] Y Zhu W Chen and G Guo ldquoFusing spatiotemporal featuresand joints for 3D action recognitionrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo13) pp 486ndash491 IEEE June 2013

[18] G Willems T Tuytelaars and L Van Gool ldquoAn efficient denseand scale-invariant spatio-temporal interest point detectorrdquo inComputer VisionmdashECCV 2008 D Forsyth P Torr and AZisserman Eds vol 5303 of Lecture Notes in Computer Sciencepp 650ndash663 Springer Berlin Germany 2008

[19] A Klaser M Marszałek and C Schmid ldquoA spatio-temporaldescriptor based on 3D-gradientsrdquo in Proceedings of the 19thBritish Machine Vision Conference (BMVC rsquo08) pp 995ndash1004September 2008

[20] J LuoWWang and H Qi ldquoSpatio-temporal feature extractionand representation for RGB-D human action recognitionrdquoPattern Recognition Letters vol 50 pp 139ndash148 2014

[21] Y Zhu W Chen and G Guo ldquoEvaluating spatiotemporalinterest point features for depth-based action recognitionrdquoImage and Vision Computing vol 32 no 8 pp 453ndash464 2014

[22] A-A Liu W-Z Nie Y-T Su L Ma T Hao and Z-X YangldquoCoupled hidden conditional random fields for RGB-D humanaction recognitionrdquo Signal Processing vol 112 pp 74ndash82 2015

[23] M Devanne H Wannous S Berretti P Pala M Daoudiand A Del Bimbo ldquoSpace-time pose representation for 3Dhuman action recognitionrdquo inNewTrends in ImageAnalysis andProcessingmdashICIAP 2013 vol 8158 of Lecture Notes in ComputerScience pp 456ndash464 Springer Berlin Germany 2013

[24] L Gan and F Chen ldquoHuman action recognition using APJ3Dand random forestsrdquo Journal of Software vol 8 no 9 pp 2238ndash2245 2013

[25] J Wang Z Liu Y Wu and J Yuan ldquoMining actionlet ensemblefor action recognition with depth camerasrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo12) pp 1290ndash1297 Providence RI USA June 2012

[26] L Xia C-C Chen and J K Aggarwal ldquoView invariant humanaction recognition using histograms of 3D jointsrdquo inProceedingsof the IEEE Computer Society Conference on Computer Visionand Pattern Recognition Workshops (CVPRW rsquo12) pp 20ndash27Providence RI June 2012

[27] X Yang and Y Tian ldquoEffective 3D action recognition usingEigen Jointsrdquo Journal of Visual Communication and ImageRepresentation vol 25 no 1 pp 2ndash11 2014

[28] A Taha H H Zayed M E Khalifa and E-S M El-HorbatyldquoHuman activity recognition for surveillance applicationsrdquo inProceedings of the 7th International Conference on InformationTechnology pp 577ndash586 2015

[29] S Gaglio G Lo Re and M Morana ldquoHuman activity recog-nition process using 3-D posture datardquo IEEE Transactions onHuman-Machine Systems vol 45 no 5 pp 586ndash597 2015

[30] I Theodorakopoulos D Kastaniotis G Economou and SFotopoulos ldquoPose-based human action recognition via sparserepresentation in dissimilarity spacerdquo Journal of Visual Commu-nication and Image Representation vol 25 no 1 pp 12ndash23 2014

[31] WDing K Liu F Cheng and J Zhang ldquoSTFC spatio-temporalfeature chain for skeleton-based human action recognitionrdquoJournal of Visual Communication and Image Representation vol26 pp 329ndash337 2015

[32] R Slama HWannous M Daoudi and A Srivastava ldquoAccurate3D action recognition using learning on the Grassmann mani-foldrdquo Pattern Recognition vol 48 no 2 pp 556ndash567 2015

[33] R Vemulapalli F Arrate and R Chellappa ldquoHuman actionrecognition by representing 3D skeletons as points in a liegrouprdquo in Proceedings of the 27th IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR rsquo14) pp 588ndash595Columbus Ohio USA June 2014

[34] R Anirudh P Turaga J Su and A Srivastava ldquoElastic func-tional coding of human actions From vector-fields to latentvariablesrdquo in Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition (CVPR rsquo15) pp 3147ndash3155Boston Mass USA June 2015

[35] M Jiang J Kong G Bebis and H Huo ldquoInformative jointsbased human action recognition using skeleton contextsrdquo SignalProcessing Image Communication vol 33 pp 29ndash40 2015

[36] A A Chaaraoui J R Padilla-Lopez P Climent-Perez andF Florez-Revuelta ldquoEvolutionary joint selection to improvehuman action recognitionwith RGB-Ddevicesrdquo Expert Systemswith Applications vol 41 no 3 pp 786ndash794 2014

[37] S Baysal M C Kurt and P Duygulu ldquoRecognizing humanactions using key posesrdquo in Proceedings of the 20th InternationalConference on Pattern Recognition (ICPR rsquo10) pp 1727ndash1730Istanbul Turkey August 2010

[38] A A Chaaraoui P Climent-Perez and F Florez-RevueltaldquoSilhouette-based human action recognition using sequences of

14 Computational Intelligence and Neuroscience

key posesrdquo Pattern Recognition Letters vol 34 no 15 pp 1799ndash1807 2013

[39] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[40] C-W Hsu and C-J Lin ldquoA comparison of methods for mul-ticlass support vector machinesrdquo IEEE Transactions on NeuralNetworks vol 13 no 2 pp 415ndash425 2002

[41] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[42] OpenNI 2015 httpstructureioopenni[43] J Sung C Ponce B Selman and A Saxena ldquoUnstructured

human activity detection fromRGBD imagesrdquo in Proceedings ofthe IEEE International Conference on Robotics and Automation(ICRA rsquo12) pp 842ndash849 Saint Paul Minn USA May 2012

[44] L Seidenari V Varano S Berretti A Del Bimbo and P PalaldquoRecognizing actions from depth cameras as weakly alignedmulti-part Bag-of-Posesrdquo in Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition Workshops(CVPRW rsquo13) pp 479ndash485 Portland Ore USA June 2013

[45] W Li Z Zhang and Z Liu ldquoAction recognition based ona bag of 3D pointsrdquo in Proceedings of the IEEE ComputerSociety Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo10) pp 9ndash14 San Francisco Calif USAJune 2010

[46] J Shan and S Akella ldquo3D human action segmentation andrecognition using pose kinetic energyrdquo in Proceedings of theIEEE Workshop on Advanced Robotics and its Social Impacts(ARSO rsquo14) pp 69ndash75 Evanston Ill USA September 2014

[47] D R Faria C Premebida and U Nunes ldquoA probabilisticapproach for human everyday activities recognition using bodymotion from RGB-D imagesrdquo in Proceedings of the 23rd IEEEInternational Symposium on Robot and Human InteractiveCommunication (RO-MAN rsquo14) pp 732ndash737 Edinburgh UKAugust 2014

[48] G I Parisi C Weber and S Wermter ldquoSelf-organizing neuralintegration of pose-motion features for human action recogni-tionrdquo Frontiers in Neurorobotics vol 9 no 3 2015

[49] J R Padilla-Lopez A A Chaaraoui and F Florez-Revuelta ldquoAdiscussion on the validation tests employed to compare humanaction recognition methods using the MSR Action3D datasetrdquohttparxivorgabs14077390

[50] S Azary and A Savakis ldquo3D Action classification using sparsespatio-temporal feature representationsrdquo in Advances in VisualComputing G Bebis R Boyle B Parvin et al Eds vol 7432 ofLecture Notes in Computer Science pp 166ndash175 Springer BerlinGermany 2012

[51] O Celiktutan C Wolf B Sankur and E Lombardi ldquoFast exacthyper-graph matching with dynamic programming for spatio-temporal datardquo Journal ofMathematical Imaging andVision vol51 no 1 pp 1ndash21 2015

[52] S Azary and A Savakis ldquoGrassmannian sparse representationsand motion depth surfaces for 3D action recognitionrdquo inProceedings of the IEEE Conference on Computer Vision andPattern Recognition Workshops (CVPRW rsquo13) pp 492ndash499Portland Ore USA June 2013

[53] J C Platt ldquoProbabilistic outputs for support vector machinesand comparisons to regularized likelihood methodsrdquo inAdvances in Large Margin Classifiers pp 61ndash74 MIT PressCambridge Mass USA 1999

[54] H-T Lin C-J Lin and R C Weng ldquoA note on Plattrsquosprobabilistic outputs for support vector machinesrdquo MachineLearning vol 68 no 3 pp 267ndash276 2007

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 14: Research Article A Human Activity Recognition System Using Skeleton …downloads.hindawi.com/journals/cin/2016/4351435.pdf · 2019-07-30 · Kinect skeleton is motivated by the fact

14 Computational Intelligence and Neuroscience

key posesrdquo Pattern Recognition Letters vol 34 no 15 pp 1799ndash1807 2013

[39] C Cortes and V Vapnik ldquoSupport-vector networksrdquo MachineLearning vol 20 no 3 pp 273ndash297 1995

[40] C-W Hsu and C-J Lin ldquoA comparison of methods for mul-ticlass support vector machinesrdquo IEEE Transactions on NeuralNetworks vol 13 no 2 pp 415ndash425 2002

[41] C-C Chang and C-J Lin ldquoLIBSVM a library for supportvector machinesrdquo ACM Transactions on Intelligent Systems andTechnology vol 2 no 3 article 27 2011

[42] OpenNI 2015 httpstructureioopenni[43] J Sung C Ponce B Selman and A Saxena ldquoUnstructured

human activity detection fromRGBD imagesrdquo in Proceedings ofthe IEEE International Conference on Robotics and Automation(ICRA rsquo12) pp 842ndash849 Saint Paul Minn USA May 2012

[44] L Seidenari V Varano S Berretti A Del Bimbo and P PalaldquoRecognizing actions from depth cameras as weakly alignedmulti-part Bag-of-Posesrdquo in Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition Workshops(CVPRW rsquo13) pp 479ndash485 Portland Ore USA June 2013

[45] W Li Z Zhang and Z Liu ldquoAction recognition based ona bag of 3D pointsrdquo in Proceedings of the IEEE ComputerSociety Conference on Computer Vision and Pattern RecognitionWorkshops (CVPRW rsquo10) pp 9ndash14 San Francisco Calif USAJune 2010

[46] J Shan and S Akella ldquo3D human action segmentation andrecognition using pose kinetic energyrdquo in Proceedings of theIEEE Workshop on Advanced Robotics and its Social Impacts(ARSO rsquo14) pp 69ndash75 Evanston Ill USA September 2014

[47] D R Faria C Premebida and U Nunes ldquoA probabilisticapproach for human everyday activities recognition using bodymotion from RGB-D imagesrdquo in Proceedings of the 23rd IEEEInternational Symposium on Robot and Human InteractiveCommunication (RO-MAN rsquo14) pp 732ndash737 Edinburgh UKAugust 2014

[48] G I Parisi C Weber and S Wermter ldquoSelf-organizing neuralintegration of pose-motion features for human action recogni-tionrdquo Frontiers in Neurorobotics vol 9 no 3 2015

[49] J R Padilla-Lopez A A Chaaraoui and F Florez-Revuelta ldquoAdiscussion on the validation tests employed to compare humanaction recognition methods using the MSR Action3D datasetrdquohttparxivorgabs14077390

[50] S Azary and A Savakis ldquo3D Action classification using sparsespatio-temporal feature representationsrdquo in Advances in VisualComputing G Bebis R Boyle B Parvin et al Eds vol 7432 ofLecture Notes in Computer Science pp 166ndash175 Springer BerlinGermany 2012

[51] O Celiktutan C Wolf B Sankur and E Lombardi ldquoFast exacthyper-graph matching with dynamic programming for spatio-temporal datardquo Journal ofMathematical Imaging andVision vol51 no 1 pp 1ndash21 2015

[52] S Azary and A Savakis ldquoGrassmannian sparse representationsand motion depth surfaces for 3D action recognitionrdquo inProceedings of the IEEE Conference on Computer Vision andPattern Recognition Workshops (CVPRW rsquo13) pp 492ndash499Portland Ore USA June 2013

[53] J C Platt ldquoProbabilistic outputs for support vector machinesand comparisons to regularized likelihood methodsrdquo inAdvances in Large Margin Classifiers pp 61ndash74 MIT PressCambridge Mass USA 1999

[54] H-T Lin C-J Lin and R C Weng ldquoA note on Plattrsquosprobabilistic outputs for support vector machinesrdquo MachineLearning vol 68 no 3 pp 267ndash276 2007

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 15: Research Article A Human Activity Recognition System Using Skeleton …downloads.hindawi.com/journals/cin/2016/4351435.pdf · 2019-07-30 · Kinect skeleton is motivated by the fact

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014