
DRO: Deakin Research Online, Deakin University's Research Repository
Deakin University CRICOS Provider Code: 00113B

Body joints regression using deep convolutional neural networks

Citation of the final article: Abobakr, Ahmed, Hossny, Mohammed and Nahavandi, Saeid 2017, Body joints regression using deep convolutional neural networks, in SMC 2016: IEEE International Conference on Systems, Man and Cybernetics, IEEE, Piscataway, N.J., pp. 3281-3287.

Published in its final form at https://doi.org/10.1109/SMC.2016.7844740.

This is the accepted manuscript.

© 2016, IEEE

Downloaded from DRO: http://hdl.handle.net/10536/DRO/DU:30094569


Body Joints Regression Using Deep Convolutional Neural Networks

Ahmed Abobakr, Mohammed Hossny and Saeid Nahavandi

Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Australia

Abstract—Human pose estimation is a well-known computer vision problem that receives intensive research interest. The reason for such interest is the wide range of applications that the successful estimation of human pose offers. Articulated pose estimation involves real time acquisition, analysis, processing and understanding of high dimensional visual information. Ensemble learning methods operating on hand-engineered features have been commonly used for addressing this task. Deep learning exploits representation learning methods to learn multiple levels of representations from raw input data, alleviating the need for hand-crafted features. Deep convolutional neural networks are achieving the state of the art in visual object recognition, localisation and detection. In this paper, the pose estimation task is formulated as an offset joint regression problem. The 3D joint positions are accurately detected from a single raw depth image using a deep convolutional neural network model. The presented method relies on a state-of-the-art data generation pipeline to generate a large, realistic, and highly varied synthetic set of training images. Analysis and experimental results demonstrate the generalisation performance and the successful real-time application of the proposed method.

Index Terms—Kinect, RGB-D sensors, deep learning, ConvNet and pose estimation.

I. INTRODUCTION

Human posture estimation is a key step towards achieving better human behaviour understanding and machine interaction. It is a well-known computer vision problem that receives intensive research interest. The reason for such interest is the wide range of applications that the successful estimation of the human pose offers. These applications include, but are not limited to, gaming, fall detection for elderly people living alone and health care more generally, video surveillance, human computer interaction, and robot learning. Human pose estimation is the process of inferring the 2D or 3D positions of human body joints from images or videos. The generic pose estimation pipeline includes real time acquisition, analysis, processing and understanding of high dimensional visual information. Many factors affect the accurate estimation of human postures. The first is the efficient utilisation of the acquired high dimensional visual information via a compact, discriminative, and hierarchical representation. The second is the ability of the learning model to obtain a generalisable mapping function based on these representations. In addition, large variations of human poses, occlusions, invisible body parts, noisy data sources, and environmental effects such as illumination and clutter further complicate this task.

Fig. 1: Method overview. An input depth image is transformed into RGB colour space and fed into a deep ConvNet model to estimate the 2D body joint positions. The green and red skeletons are the ground truth and predicted skeletons, respectively.

Marker-less human pose estimation approaches can be classified into two main categories: body parts methods and regression methods [2]. The parts-based approaches follow the notion of the pictorial structure model [3], [4], in which an object is represented by a collection of parts with connections between certain pairs of parts. The current state-of-the-art performance for the human pose estimation problem, in terms of accuracy and real time applicability, was reported by Shotton et al. [5] and Buys et al. [6]. They incorporated depth imaging technologies and performed joint localisation via semantic body part segmentation, also known as pixel-wise classification. In this method, a random decision forest (RDF) [7] model is trained on dense representations obtained using the hand-crafted depth comparison feature extractor (DCF) [5].

The main advantage of the parts-based methods is the ability to model strong pose articulations [8]. However, these methods have limitations. First, body parts based methods rely on local detectors; hence, they are prone to losing track of unobservable body parts due to self-occlusions or cluttered environments, resulting in unrealistic skeletons. Secondly, the space of human poses grows exponentially with the number of joints that could be modelled [5]; therefore, it is infeasible to model all possible interactions between body parts [8]. To address these limitations, regression based methods, which infer the human pose through a holistic view, have become necessary [8]–[10].


Fig. 2: Fine-tuned AlexNet [1] model for offset joint regression. The input is a colorised depth image (dark blue 'near', red 'far'), rescaled to 227 x 227 x 3 dimensionality, and fed into the network to produce a 28-dimensional real-valued pose vector of the key joint coordinates. At each layer, we visualise the parameter and activation shapes. For instance, the parameters of CONV-1 are 96 kernels of size 11 x 11, applied with stride s = 4 and no padding, p = 0; this results in 96 feature maps of 55 x 55 dimensionality.
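As a worked check of the shapes quoted in the caption, the spatial size of each feature map follows from the standard convolution output-size relation; for CONV-1, with input width $W = 227$, kernel size $K = 11$, padding $P = 0$ and stride $S = 4$:

$$W_{\text{out}} = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1 = \left\lfloor \frac{227 - 11 + 0}{4} \right\rfloor + 1 = 55.$$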

In this paper, we follow the holistic posture estimation approach. The problem is formulated as an offset joint regression from an input raw depth image, and tackled using a deep learning architecture. Deep learning methods attempt to learn multiple levels of representations from raw input data by composing simple but non-linear modules, each of which transforms the representation at one level into a representation at a higher, slightly more abstract level [11]. The key aspect of deep learning models is that these levels of representations are trainable, hence alleviating the need for hand-crafted features [11]. They are behind the breakthroughs in visual object recognition [1], [12], and localisation and detection [13], [14]. In particular, the convolutional neural network (ConvNet) [15] is considered the most successful deep learning architecture. The ConvNet is an extension of the traditional multilayer perceptron that is well suited to processing data composed of multiple arrays in different modalities, such as images and speech signals. Recently, Toshev et al. [8] incorporated a deep ConvNet regressor for the holistic pose estimation problem from RGB images and achieved state-of-the-art performance. However, human pose estimation from depth images is more robust than estimation from 2D images [2].

Traditional RGB approaches are plagued by the difficulty of separating subjects from backgrounds. Thus, they either suffer from poor performance, need uniform backgrounds, or need stationary cameras with static backgrounds [6]. Therefore, a growing trend is to use depth cameras, as they provide much richer geometrical information and facilitate preprocessing tasks such as background subtraction and object delineation [16], [17].

In this work, we fine-tune the popular AlexNet [1] object recognition model to work as an offset joint regressor for the articulated human pose estimation problem. The proposed method, as shown in Fig. 1, takes a raw depth frame as input and is trained in a supervised mode to produce body joint coordinates. The advantages of this formulation are threefold. First, ConvNets are end-to-end trainable models that can automatically learn hierarchical feature representations. Second, deeper models can learn more abstract concepts from the raw input; hence, the body joints regressor is able to capture the full context of the input. Third, the regressor benefits from AlexNet's well-trained low-level feature extractors. The main difficulty with deep models is the need for large amounts of training data to achieve the desired generalisation performance. Therefore, we extend the implementation of the state-of-the-art synthetic data generation pipeline [6] to enable generating virtually infinite amounts of high quality depth images, each with the respective joint positions.

The rest of this paper is organised as follows: Section II presents the synthetic data generation and preprocessing pipeline. Section III describes the proposed deep ConvNet training method. Experiments and results are discussed in Section IV. Finally, the conclusion and future enhancements are highlighted in Section V.

II. DATA GENERATION

Training deep learning architectures is an optimisation problem with respect to millions of parameters. Thus, in a supervised setting, it requires large amounts of labelled training data to achieve an acceptable generalisation performance and control the effect of overfitting. For computer vision problems, and human pose estimation in particular, collecting training data by capturing people's movements in front of an RGBD sensor is an expensive process in terms of time and effort. In addition, acquiring high quality ground truth labels such as joint positions is a challenge. The use of synthetic training datasets has been proven efficient for generalisation [5], [6]. To that end, we extend the state-of-the-art data generation pipeline proposed in [6] to synthesise training data. In this section, the building blocks of the data generation and preprocessing pipeline, shown in Fig. 3, are described.

A. Motion Mapping

Typically, motion capture (MoCap) data are mapped onto virtual human models with different anthropometric measures, creating virtually infinite amounts of training data. MoCap data is a variation of articulated human poses recorded from real humans using marker-based motion capture systems. A wide collection of human behaviours has been captured and made available to the research community by the CMU Graphics Lab [18]. After excluding irrelevant motion sequences such as acrobatics, and maintaining a Euclidean distance threshold between consecutive poses, we end up with around 350K poses. During the mapping process, body depth pixels are grouped and assigned to the respective body parts identified by body joints. Body depth pixels are animated by subsequent transformations applied to the body joints.

Fig. 3: Data generation and preparation pipeline. Motion frames with a high degree of dissimilarity are mapped onto a 3D human model. The rendering process is parameterised with the camera view angle and distance, and produces depth and label image pairs. Depth maps are further processed by shifting into the range (0, 255), rescaling to 227 x 227, and colorisation. The generated label maps are not sufficient for the joint localisation stage; therefore, we label the synthesised depth image using a trained RDF model from [6]. Both label maps are fused together and fed into the skeleton extraction module to obtain the ground truth pose vector for the current depth image.
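The paper does not include code for this dissimilarity filtering; the following numpy sketch shows one plausible reading, in which a MoCap frame is kept only if it is sufficiently far from the last retained pose. The function name and the threshold value are illustrative, not taken from the paper.

```python
import numpy as np

def subsample_poses(poses, threshold):
    """Keep a pose only if its Euclidean distance to the last
    retained pose is at least `threshold`, discarding the
    near-duplicate frames typical of MoCap sequences."""
    kept = [poses[0]]
    for pose in poses[1:]:
        if np.linalg.norm(pose - kept[-1]) >= threshold:
            kept.append(pose)
    return np.stack(kept)

# poses: array of shape (num_frames, num_joints * 3); the threshold
# is a free parameter, tuned here (hypothetically) so that roughly
# 350K sufficiently distinct poses remain.
```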

B. Depth Rendering and Labelling

Animated human models are rendered using a parameterised virtual depth camera. For each model and camera parameter setting, such as viewing angle and distance, we can generate 350K fully labelled training images. The pixels of the ground truth images are labelled, each with the respective body part. Body joint positions are localised by averaging body part pixels. However, not all joints are defined, in particular the elbow and knee joints: the label images are based on the skeleton structure of the CMU dataset, which defines body parts in terms of bones. Buys et al. [6] succeeded in adjusting the skeleton model of the CMU dataset, adding body part definitions for the elbow and knee joints, and training an RDF model on the new body representation. While their trained model has been released, the adjustment procedure itself has not been published. To overcome this issue, we utilise their publicly shared RDF model for localising the elbow and knee joints. The evaluated label image is fused with the synthetic one and passed to the skeleton extraction module to produce the final ground truth joints vector. The fusion stage is further depicted in Fig. 4.
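As a concrete illustration of the joint localisation step above (averaging the pixels of each body part), a minimal sketch follows; the part-id mapping is hypothetical and depends on how the label images encode parts.

```python
import numpy as np

def joints_from_label_map(label_map, part_ids):
    """Estimate a 2D joint position for each body part by
    averaging the image coordinates of its labelled pixels."""
    joints = {}
    for name, pid in part_ids.items():
        ys, xs = np.nonzero(label_map == pid)
        if xs.size:  # the part may be occluded or absent
            joints[name] = (xs.mean(), ys.mean())
    return joints
```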

C. Depth Encoding and Preprocessing

Learning powerful representations from raw input data is a key aspect of deep learning models, and has seen great success for RGB modalities. Recently, many approaches [14], [19]–[21] investigated the ability of these models to effectively learn features from depth modalities. Couprie et al. [19] combined the depth data as an additional input channel to the ConvNet model of Farabet et al. [20], forming an RGB-D input modality, and achieved good performance on the semantic segmentation task. Gupta et al. [14] proposed the HHA encoding of depth information. In their approach, each depth pixel is represented using three channels: horizontal disparity, height above ground, and the angle of the pixel's surface normal with the gravity direction. Empirical results demonstrated better performance using the HHA encoding compared to raw depth data. However, their approach is computationally expensive [21], and is reminiscent of the traditional feature extraction paradigm, which incorporates much prior knowledge about the input. Eitel et al. [21] reported better performance compared to the results in [14] by colorising the depth data.

In this work, similar to [21], depth pixels are represented using three RGB colour channels. The values of the colour components vary according to the distance from the depth camera, and provide a more powerful input signal to the ConvNet. Initially, input depth images are cropped and rescaled to 227 x 227 to suit the input dimensionality of AlexNet. Then, depth measurements are shifted, to provide depth invariance, and normalised to the range (0, 255). Finally, a colour map is applied to produce the final RGB colorised depth image, as shown in Fig. 3 (top right).
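A minimal OpenCV sketch of this preprocessing is shown below. The exact crop, shift and normalisation details are our assumptions; the jet colour map matches the one mentioned for the real images in Section IV-D.

```python
import cv2
import numpy as np

def colorise_depth(depth, size=227):
    """Rescale a raw depth map, shift and normalise it to [0, 255],
    then spread the depth over three colour channels (cf. Fig. 3)."""
    d = cv2.resize(depth.astype(np.float32), (size, size))
    d -= d[d > 0].min()                       # shift: nearest valid depth -> 0
    d = np.clip(255.0 * d / d.max(), 0, 255)  # normalise to (0, 255)
    return cv2.applyColorMap(d.astype(np.uint8), cv2.COLORMAP_JET)
```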

III. CONVOLUTIONAL NEURAL NETWORKS

Fig. 4: Label fusion strategy. On the left, the generated label map based on the skeleton model accompanying the CMU motion sequences. As depicted, it does not model joints such as elbows and knees. Therefore, using the trained RDF model of [6], we obtain the label map in the middle, which has body parts representing the joints of interest. It has the limitation of not representing the shoulders. Both label maps are merged to obtain the resulting pose vector.

The ConvNet model is a composition of convolutional, pooling, and optionally fully connected layers. These layers are stacked and followed by a classification or regression module, which is commonly an MLP. The inputs and outputs of each layer are called feature maps. Units of a feature map are simply neurons, where each neuron is connected through a set of trainable weights to a local patch in the previous layer, called the local receptive field of the unit. All units of a feature map share the same set of weights, also called a kernel. Therefore, by convolution, each feature map represents a particular feature extracted at different locations in the input using a particular kernel. The resulting feature maps are passed through a nonlinear transfer function, for instance, the rectified linear unit (ReLU). The pooling (sub-sampling) layer operates independently on each feature map. It performs an average or maximum operation on local patches of the input, merging semantically similar features. This results in a reduced-resolution output feature map that is robust to small variations of the input. The ConvNet model is end-to-end trainable using a general-purpose training algorithm such as stochastic gradient descent (SGD).

We formulate the pose estimation task as a regression problem. Given a labelled training set $D = \{(x_i, t_i)\}_{i=1}^{N}$ of $N$ training samples, where $x_i \in \mathbb{R}^d$ is a colorised depth image and $t_i = \big( (x^{(1)}, y^{(1)}), \ldots, (x^{(k)}, y^{(k)}) \big)^T \in \mathbb{R}^{2k}$ is the ground truth pose vector, the target is to learn a general mapping function that associates a previously unseen depth image with its correct continuous pose vector.

We use the pre-trained AlexNet [1] model provided with the Caffe deep learning framework [22]. The model consists of 7 trainable layers: five convolutional (CONV) layers and two fully connected (FC) layers. These 7 layers are interleaved with a nonlinear transformation (ReLU). The first two CONV layers are followed by a max pooling (P) layer and local response normalisation (LRN), while the fifth CONV layer is followed by max pooling only. Dropout layers were introduced to control the effect of overfitting by setting a ratio of activations to zero, forcing the network to learn redundant representations for the same input. Since the model was originally designed for the visual object recognition challenge of ImageNet [23], it has a softmax classification head that discriminates between 1000 categories. The training objective function was to minimise the average cross entropy loss over the training samples. Precisely, the model is described as { RGB input - (CONV - P - LRN) * 2 - (CONV) * 3 - P - (FC) * 2 - 1000-output softmax classifier }. This model has a total of 60 million parameters; for more information, the reader is referred to [1].

For the purpose of regression, we replace the classification head with a trainable linear regression module to predict the pose vector, as depicted in Fig. 2. The input is an RGB image of size 227 x 227, and the output is a real-valued pose vector of 28 numbers representing the concatenated (x, y) coordinates of 14 key joints. The training objective is also changed to minimising the Euclidean (L2) loss between the estimated and the ground truth pose vectors. Estimating a set of parameters $\theta$ that minimises the global L2 distance can be formulated as

$$\arg\min_{\theta} \; \frac{1}{2N} \sum_{i=1}^{N} \big\| f(x_i; \theta) - t_i \big\|_2^2 \qquad (1)$$

where $N$ is the number of training samples, $f(\cdot)$ represents a model with trainable parameters $\theta$, and $t_i$ is the ground truth pose vector.
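The paper implements this in Caffe; purely as an illustrative, non-authoritative sketch, the same head replacement and L2 objective can be written in a few lines of PyTorch (torchvision's AlexNet differs slightly from the original Caffe model, e.g. it omits LRN):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained AlexNet and swap the 1000-way softmax
# head for a linear regression layer predicting 28 values
# (14 joints x 2 coordinates).
net = models.alexnet(pretrained=True)
net.classifier[-1] = nn.Linear(4096, 28)

# Mean squared error over the pose vector, proportional to the
# objective in Eq. (1) up to a constant factor.
criterion = nn.MSELoss()

x = torch.randn(8, 3, 227, 227)  # batch of colorised depth images
t = torch.randn(8, 28)           # ground truth pose vectors
loss = criterion(net(x), t)
```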

Using the synthetic data generation and preprocessing pipeline described in Section II, we randomly sampled a total of 500K labelled images covering three frontal view angles (+45, 0, -45). The dataset is split into 300K images for training, 100K for validation, and 100K left for testing.

It may seem unreasonable to fine-tune a pre-trained model instead of training from scratch, given the ability to generate large amounts of training data. However, from our experiments, and in line with the empirical results of [14], we found that it is always better to start training from well-initialised network parameters than from a random weight initialisation.

The suitability of fine-tuning, also known as transfer learning, depends on the degree of similarity between the data used for training the original model and the data for the task at hand. Similar to AlexNet, our data is in the form of an RGB modality. Therefore, we can make use of the powerful low-level feature extractors that AlexNet provides, for instance, the first two CONV layers. However, the deeper layers of AlexNet look for generic concepts from the ImageNet [23] dataset which are not relevant to our task. We overcome this issue by using a layer-wise learning rate policy to control the speed of learning for each layer. In particular, a low base learning rate of 0.0001 is used for the first three CONV layers and 0.0005 for the remaining layers, while the regression module on top has the highest learning rate of 0.001. All learning rates are decreased by a factor of 10 every 20K iterations. We use the SGD optimisation procedure with a mini-batch size of 256 and a momentum of 0.9. The number of trainable parameters is approximately 57 million. We fine-tuned for 100K iterations, which takes about three days on an NVIDIA Titan X GPU.
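Continuing the PyTorch sketch above (again an assumption-laden translation of the Caffe setup, with the slicing based on torchvision's AlexNet layout rather than the authors' model), the layer-wise learning rates and step decay could look like:

```python
import torch

# The first three CONV blocks of torchvision's AlexNet live in
# net.features[:8]; the remaining pretrained layers and the new
# regression head get progressively higher learning rates.
groups = [
    {"params": net.features[:8].parameters(), "lr": 1e-4},
    {"params": list(net.features[8:].parameters())
             + list(net.classifier[:-1].parameters()), "lr": 5e-4},
    {"params": net.classifier[-1].parameters(), "lr": 1e-3},
]
optimizer = torch.optim.SGD(groups, lr=5e-4, momentum=0.9)

# Decrease every learning rate by a factor of 10 every 20K iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20000, gamma=0.1)
```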

Page 6: Body joints regression using deep convolutional neural

TABLE I: Percentage of Correct Parts (PCP) vs. Training Iterations

Model   Upper Arm   Lower Arm   Upper Leg   Lower Leg   Avg
10K       0.75        0.25        0.92        0.85      0.69
30K       0.76        0.28        0.93        0.85      0.71
60K       0.76        0.27        0.93        0.86      0.71

PCP accuracy computed on a synthetic test set of 100K images at a PCP threshold of 0.5. The model reaches its maximum learning capacity after 30K iterations.

IV. EXPERIMENTAL RESULTS

We have fine-tuned the popular AlexNet deep ConvNet model for the task of human pose estimation. The input to this method is a colorised depth image of dimensionality 227 x 227 x 3 and the output is a real-valued pose vector of joint coordinates. The proposed model is trained on 300K images for 100K iterations and achieves a final validation loss of 0.03 on a validation set of 100K images. In this section, we evaluate the generalisation performance of the proposed method. First, the test datasets and evaluation criteria are described. Second, quantitative results are reported on the synthetic test set. Third, since we do not currently have high quality ground truth for the real test dataset, we show some inferences as qualitative results.

A. Test Datasets

We use both synthetic and real test data to assess the performance of the proposed method. The synthetic test dataset contains 100K images covering the three frontal view angles. The real test dataset contains 10K images of a single subject captured using the Kinect camera. None of the test images are included during training.

B. Evaluation Criteria

The two most widely accepted evaluation metrics for the human pose estimation problem are the percentage of correctly estimated body parts (PCP) [24] and the percentage of detected joints (PDJ) [8]. The PCP measures the detection rate of limbs. A limb is considered detected if the distance between its inferred endpoints and the ground truth endpoints is within a fraction of the limb length. This fraction is known as the PCP threshold and is commonly set to a value between 0.1 and 0.5. The PCP metric has the drawback of penalising shorter limbs, such as lower arms, which are usually harder to detect [8].

PDJ overcomes this limitation by measuring the detection rate of body joints based on the same distance threshold. Typically, a joint is considered detected if the distance between the predicted and the ground truth joint position is within a certain ratio of the torso diameter. We define the torso diameter as the Euclidean distance between the right shoulder and the left hip joints.
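For concreteness, a numpy sketch of both metrics as defined above follows; the "both endpoints" PCP variant used here is one common reading of the definition, and the joint indices are placeholders.

```python
import numpy as np

def pdj(pred, gt, r_shoulder=2, l_hip=9, ratio=0.5):
    """PDJ: fraction of joints whose prediction error is within
    `ratio` of the torso diameter (right shoulder to left hip).
    pred, gt: (num_joints, 2) arrays of (x, y) coordinates."""
    torso = np.linalg.norm(gt[r_shoulder] - gt[l_hip])
    errors = np.linalg.norm(pred - gt, axis=1)
    return float((errors <= ratio * torso).mean())

def pcp_limb(pred, gt, a, b, thresh=0.5):
    """PCP for the limb with endpoint joints a and b: detected if
    both endpoint errors are within `thresh` of the limb length."""
    limb_len = np.linalg.norm(gt[a] - gt[b])
    return (np.linalg.norm(pred[a] - gt[a]) <= thresh * limb_len and
            np.linalg.norm(pred[b] - gt[b]) <= thresh * limb_len)
```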

Fig. 5: Percentage of Correct Parts (PCP) at varying thresholds.

Fig. 6: Percentage of detected joints (PDJ) on the synthetic test dataset, at a 0.5 localisation threshold.

C. Quantitative Results

Table I demonstrates the effect of the number of training iterations on the PCP scores. We report PCP results on the synthetic test set for the upper and lower arm and the upper and lower leg at a PCP threshold of 0.5. The detection rate increases with the number of training iterations. However, the learning process saturates after 30K training iterations, approximately 25 epochs. Possible reasons for this saturation are either that the model has reached its maximum learning capacity, or that there are no additional discriminative features to capture from the training data. For the rest of the experiments, we use the 30K-iteration model.

The presented results also demonstrate the drawback of the PCP, as it heavily penalises the lower arms, which are very challenging to detect. This difficulty shows in the overall results, as the lower arm limbs have the worst results by a large margin. Finding a solution to improve the detection accuracy, for hands especially, is a work in progress. On the other hand, the model shows strong capabilities in localising the core body limbs, achieving a PCP accuracy of 0.93 for the upper leg. Fig. 5 shows the detection rate of limbs at different PCP thresholds.

The PDJ metric is also evaluated on the synthetic test dataset. Fig. 6 reports the detection rates of the 14 modelled body joints using a PDJ localisation threshold of 0.5. We also examine the detection rate of joints against a range of threshold values, shown in Fig. 7.

Fig. 7: Percentage of Detected Joints (PDJ) at varying thresholds.

These results show that the proposed method has high detection rates for core body joints such as the head, neck, shoulders, hips and knees. However, we still have difficulties in detecting the rapidly changing, small body part joints such as hands and feet. This leaves room for further improvements in both the encoding method, so that it strengthens the representation of these parts, and the low-level feature extractors of the deep learning model.

D. Qualitative Results

Fig. 8 shows example inferences of the proposed method on the real test dataset. Note the high localisation accuracy for core body joints such as the shoulders, hips, knees and head across different poses and depths in the scene. Further, the anthropometric measures of the test subject are different from those of the virtual model used during the data synthesis process. None of these real images were included during the training phase.

To obtain these results, first, we performed background modelling using the mode of a set of empty frames (with no subjects present) as a calibration step. Second, the modelled background is subtracted from the input depth frames. Third, post-processing operations (erosion, hole filling and blob analysis) are applied to improve the quality of the foreground frames. Finally, a jet colour map is applied to the resulting images.
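An OpenCV sketch of one plausible implementation of these steps is below; the tolerance, the kernel sizes and the largest-blob heuristic are our assumptions, not values from the paper.

```python
import cv2
import numpy as np
from scipy import stats

def extract_foreground(depth, empty_frames, tol=50):
    """Subtract a per-pixel mode background, then clean the mask
    with erosion, hole filling (closing) and blob analysis."""
    background = stats.mode(np.stack(empty_frames), axis=0).mode.squeeze()
    mask = (np.abs(depth.astype(np.int32)
                   - background.astype(np.int32)) > tol).astype(np.uint8)
    mask = cv2.erode(mask, np.ones((3, 3), np.uint8))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.ones((7, 7), np.uint8))
    # blob analysis: keep only the largest connected component
    n, labels, comp_stats, _ = cv2.connectedComponentsWithStats(mask)
    if n > 1:
        largest = 1 + np.argmax(comp_stats[1:, cv2.CC_STAT_AREA])
        mask = (labels == largest).astype(np.uint8)
    return depth * mask
```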

Although the results seem promising, the model did not generalise well to unseen camera view angles. Hence, it requires modelling all the remaining angles during the training phase, which is feasible. Most importantly, as noted during our experiments, deep learning models are very sensitive to noise. This problem is commonly tackled by augmenting different noise models during the training phase. In this work, we alternatively applied the previously mentioned preprocessing operations to ensure noise-free input to the model. A third approach is to extend the synthetic data generation pipeline presented in Section II to simulate different noise models, which is technically possible. However, modelling all existing noise is intractable; there are different noise models for different input devices, environments and materials. This limitation leaves room for investigating how current deep learning models look through the input data.

V. CONCLUSION

In this paper, we followed the holistic pose estimation approach. A deep ConvNet offset joint regression model is trained to estimate the articulated human pose from an input raw depth image. We extended the state-of-the-art data generation pipeline to allow training on virtually infinite amounts of synthetic training data. We also presented a computationally efficient colorisation technique that provides a more powerful input signal to the network by distributing the depth information over the three RGB channels while incorporating minimal prior knowledge. We evaluated the model using the PCP and PDJ performance metrics. The results demonstrate the possibility of transferring knowledge from a generic deep learning model trained for visual classification tasks to perform joint localisation, via training on synthetic data. In addition, the proposed method was able to generalise from synthetic training images to real images. Future work includes investigating different encoding techniques and architectural designs to further improve the detection rate of the lower limbs. Further, simplifying the data generation pipeline by bypassing the RDF labelling stage, to allow generating high quality ground truth joint positions, is a work in progress.

ACKNOWLEDGEMENT

This research was fully supported by the Institute for Intelligent Systems Research and Innovation (IISRI) at Deakin University. The motion files used in this research were obtained from mocap.cs.cmu.edu. The database was created with funding from NSF EIA-0196217.


Fig. 8: Example inferences of the proposed method on real test images. Note the high localisation accuracy for core body joints such as the head, neck, shoulders, hips and knees across different poses. These results are also consistent with the quantitative scores in terms of the lower accuracy for hand and foot positions. None of these real images were included during the training phase.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[2] S. Li, Z.-Q. Liu, and A. B. Chan, "Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network," International Journal of Computer Vision, vol. 113, no. 1, pp. 19–36, 2015.

[3] M. A. Fischler and R. A. Elschlager, "The representation and matching of pictorial structures," IEEE Transactions on Computers, no. 1, pp. 67–92, 1973.

[4] P. F. Felzenszwalb and D. P. Huttenlocher, "Pictorial structures for object recognition," International Journal of Computer Vision, vol. 61, no. 1, pp. 55–79, 2005.

[5] J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman et al., "Efficient human pose estimation from single depth images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2821–2840, 2013.

[6] K. Buys, C. Cagniart, A. Baksheev, T. De Laet, J. De Schutter, and C. Pantofaru, "An adaptable system for RGB-D based human body detection and pose estimation," Journal of Visual Communication and Image Representation, pp. 39–52, 2014.

[7] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[8] A. Toshev and C. Szegedy, "DeepPose: Human pose estimation via deep neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1653–1660.

[9] G. Mori and J. Malik, "Estimating human body configurations using shape context matching," in Computer Vision – ECCV 2002. Springer, 2002, pp. 666–680.

[10] G. Shakhnarovich, P. Viola, and T. Darrell, "Fast pose estimation with parameter-sensitive hashing," in Proceedings of the Ninth IEEE International Conference on Computer Vision. IEEE, 2003, pp. 750–757.

[11] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.

[13] C. Szegedy, A. Toshev, and D. Erhan, "Deep neural networks for object detection," in Advances in Neural Information Processing Systems, 2013, pp. 2553–2561.

[14] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik, "Learning rich features from RGB-D images for object detection and segmentation," in Computer Vision – ECCV 2014. Springer, 2014, pp. 345–360.

[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[16] H. Haggag, M. Hossny, D. Filippidis, D. Creighton, S. Nahavandi, and V. Puri, "Measuring depth accuracy in RGBD cameras," in 2013 7th International Conference on Signal Processing and Communication Systems (ICSPCS). IEEE, 2013, pp. 1–7.

[17] H. Haggag, M. Hossny, S. Haggag, S. Nahavandi, and D. Creighton, "Safety applications using Kinect technology," in 2014 IEEE International Conference on Systems, Man and Cybernetics (SMC). IEEE, 2014, pp. 2164–2169.

[18] "CMU Graphics Lab Motion Capture Database." [Online]. Available: http://mocap.cs.cmu.edu

[19] C. Couprie, C. Farabet, L. Najman, and Y. LeCun, "Indoor semantic segmentation using depth information," arXiv preprint arXiv:1301.3572, 2013.

[20] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Learning hierarchical features for scene labeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1915–1929, 2013.

[21] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard, "Multimodal deep learning for robust RGB-D object recognition," in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 681–687.

[22] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.

[23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.

[24] M. Eichner, M. Marin-Jimenez, A. Zisserman, and V. Ferrari, "2D articulated human pose estimation and retrieval in (almost) unconstrained still images," International Journal of Computer Vision, vol. 99, no. 2, pp. 190–214, 2012.