Lie-X: Depth Image Based Articulated Object Pose Estimation, Tracking, and Action Recognition on Lie Groups

Chi Xu¹, Lakshmi Narasimhan Govindarajan¹, Yu Zhang¹, and Li Cheng∗¹

¹Bioinformatics Institute, A*STAR, Singapore

Abstract

Pose estimation, tracking, and action recognition of articulated objects from depth images are important and challenging problems, which are normally considered separately. In this paper, a unified paradigm based on Lie group theory is proposed, which enables us to collectively address these related problems. Our approach is also applicable to a wide range of articulated objects. Empirically it is evaluated on lab animals including mouse and fish, as well as on human hand. On these applications, it is shown to deliver competitive results compared to the state-of-the-art, and to non-trivial baselines including convolutional neural network and regression forest methods. Moreover, new sets of annotated depth data of articulated objects are created which, together with our code, are made publicly available.

1 Introduction

With 3D cameras becoming increasingly ubiquitous in recent years, there has been growing interest in utilizing depth images for key problems involving articulated objects (e.g. human full-body and hand) such as pose estimation [38, 45, 52, 31, 43, 42, 54], tracking [7, 32, 4, 36, 19], and action recognition [11, 25, 48, 37]. On the other hand, although these problems are closely related, most existing research efforts targeting them are based on diverse and possibly disconnected principles. Moreover, existing algorithms typically focus on a single type of articulated object, such as the human full-body or the human hand. This leads us to consider in this paper a principled approach to address these related problems across object categories, in a consistent and sensible manner.

∗Corresponding author with email address: [email protected]

Our approach makes the following contributions: (1) A unified Lie group-based paradigm is proposed to address the problems of pose estimation, tracking, and action recognition of articulated objects from depth images. As illustrated in Fig. 1, a 3D pose of an articulated object corresponds to a point in the underlying pose manifold, a long-time track of its 3D poses amounts to a long curve in the same manifold, whilst an action is represented as a certain curve segment. Therefore, given a depth image input, pose estimation corresponds to inferring the optimal point in the manifold; action recognition amounts to classifying a curve segment in the same manifold as a particular action type; meanwhile, for the tracking problem, Brownian motion on Lie groups is employed as the generator to produce pose candidates as particles. This paradigm is applicable to a diverse range of articulated objects, and for this reason it is referred to as Lie-X. (2) Learning-based techniques are incorporated in place of the traditional Jacobian matrices for solving the incurred inverse kinematics problem, namely: presented with visual discrepancies of the current results, how to improve the skeletal estimation. More specifically, an iterative sequential learning pipeline is proposed: multiple initial poses are engaged simultaneously to account for possible


location, orientation, and size variations, with each producing its corresponding estimated pose. They then pass through a learned scoring metric to deliver the final estimated pose. Note the purpose of the learned metric in our approach is to mimic the behavior of the practical evaluation metric. (3) Empirically our approach has demonstrated competitive performance on fish, mouse, and human hand from different imaging modalities, where it is also specifically referred to as e.g. Lie-fish, Lie-mouse, and Lie-hand, respectively. The runtime speed of our pose estimation system is more than realtime: it executes at around 83-267 FPS (frames per second) on a desktop computer without resorting to GPUs. Moreover, new sets of annotated depth images and videos of articulated objects are created. It is worth noting that the depth imaging devices considered in our empirical context are also diverse, including structured illumination and light field technologies, among others. These datasets and our code are to be made publicly available in support of open-source research activities.¹

2 Related Work

The recent introduction of commodity depth cameras has led to significant progress in analyzing articulated objects, especially the human full-body and hand. In terms of pose estimation, Microsoft Kinect is already widely used in practice at the scale of the human full-body, while it remains a research topic at the human hand scale [45, 30, 52, 31, 43, 42, 54], partly due to the dexterous nature of hand articulations. [45] is among the first to develop a dedicated convolutional neural net (CNN) method for hand pose estimation, which is followed by [30]. [31] also utilizes deep learning in a synthesizing-estimation feedback loop. [54] further considers incorporating geometry information in hand modelling by embedding a non-linear generative process within a deep learning framework. [52] studies and evaluates a theoretically motivated random forest method for hand pose estimation.

¹Our datasets, code, and detailed information pertaining to the project can be found at a dedicated project webpage: http://web.bii.a-star.edu.sg/~xuchi/Lie-X.html.

Figure 1: A cartoon illustration of our main idea: An articulated object can be considered as a point in a certain manifold. Its 3D pose, a long track of its 3D motion, and its action sequences (color-coded herein) correspond to a point, a curve, and curve segments in the underlying manifold, respectively.

A hierarchical sampling optimization procedure is adopted by [43] to minimize the error-induced energy functions, where a similar energy function is optimized via efficient gradient-based techniques in [42] for personalizing hand shapes to individual users. [39] instead casts hand pose estimation as a matrix completion problem with deeply learned features. Meanwhile, various tracking methods have been developed for the full-body [19] and the hand [32, 4, 36]. A particle swarm optimization (PSO) scheme is utilized in [32] to recover temporal hand poses by stochastically seeking solutions to the induced minimization problem. A hybrid method is adopted in [36] that further combines PSO with the widely-used iterated closest point technique. [4] considers learning salient points on fingers, for which an objective function is introduced to jointly take into account edge, flow, and collision cues. [19] describes a tracking-by-detection [3] type method based on a 3D volumetric representation. 3D action recognition has also drawn a great amount of attention lately [29, 25, 48, 37]. For example, [25] tackles action recognition using variants of recurrent neural nets. [37] considers a mapping to a


view-invariant high-level space by CNNs and a Fourier temporal pyramid. Moreover, the work of [29] discusses a method to jointly train models for human full-body pose estimation and action recognition using spatial-temporal and-or graphs. On the other hand, these become much harder problems when a color camera is used instead of a depth camera, such as in [1], where pose estimation is formulated as a regression problem that is subsequently addressed by relevance vector machines and support vector machines. Now, let us look at the other two articulated objects to be described in this paper, i.e. fish and mouse. They are relatively simple in nature but are less studied. The existing literature [7, 11, 12] is mostly 2D-based, and the focus is mainly on pose estimation. [49] is a very recent work in analyzing group-level behavior of lab mice that relies on a simplified straight-line representation of a mouse skeleton. We would also like to point out that there are research efforts across object categories: [12] estimates poses of zebrafish, lab mouse, and human face; meanwhile there are also works that deal with more than one problem, such as [29]. They have achieved very promising results as discussed previously. Our work may be considered as a renewed attempt to address these related problems and to work with a broad range of articulated objects under one unified principle. For a more detailed overview of related works, interested readers may consult the recent surveys [34, 10, 33, 5].

Lie groups [35] have previously been used in [46] for detection and tracking of relatively rigid objects in 2D; however, this requires expensive image warping operations. [40] reviews in particular the recent development of applying shape manifold based approaches in tracking and action recognition. Their application to articulated objects is relatively sparse. A Lie algebraic representation is considered in [27] for human full-body pose estimation based on multiple cameras or motion capture data. Rather than resorting to the traditional Jacobian matrices as in [27], learning based modules are employed in our approach to tackle the incurred inverse kinematics problem. [47, 48] also extract Lie algebra based features for action recognition. Instead of focusing on a specific problem and object, here we attempt to provide a unified approach.

Part-based models have long been considered in the vision community, such as the pictorial structures [13], the flexible mixtures-of-parts [53], the poselet model [6], and the deep learning model [44], among others. The most related works are probably [12, 41], where the idea of group action has been utilized. Moreover, multiple types of objects are also evaluated in [12], which focuses on the 2D pose estimation problem, while [41] is dedicated to 3D hand pose estimation. On the other hand, our approach aims to address these three related problems altogether in 3D, and we explicitly advocate the usage of Lie group theory. Note that the concept of pose-indexed features has been coined and employed in [14, 2]. In addition, learning based optimization has been considered in e.g. [50], although in very different contexts. Finally, the idea of learning the internal evaluation metric is conceptually related to the recent learning-to-rank approaches [9] in the information retrieval community for constructing ranking models.

3 Notations and Mathematical Background

The skeletal representation is in essence based on the group of rigid transformations of the 3D Euclidean space R³, a Lie group that is usually referred to as the special Euclidean group SE(3). In what follows, we provide an account of the related mathematical concepts that will be utilized in our paper.

An articulated object, such as a human hand, a mouse or a fish, is characterized in our paper by a skeletal model in the form of a kinematic tree that contains one or multiple kinematic chains. As illustrated in Fig. 2, a fish or a mouse skeleton possesses one kinematic chain, while a human hand contains a kinematic tree structure of multiple chains. Note that only the main spine is considered herein for the mouse model. The skeletal model is represented in the form of J_o joints interconnected by a set of bones or segments of fixed lengths. Empirical evidence has suggested that it is usually sufficient to use such fixed skeletal models with proper scaling when working with pose estimation of


articulated objects in depth images [41]. The pose of this object can thus be defined as a set of skeletal joint locations. Furthermore, we define the home position of an articulated object as a set of default joint locations. Taking the mouse model depicted in the middle panel of Fig. 2 as an example, its home position could be a top-view, upward-facing mouse with the full body straight-up and the bottom joint at the coordinate origin. Note this bottom joint carries the 6 degrees of freedom (DoF) of the entire object, and is also referred to as the base joint. The pose can then also be interchangeably referred to as the sequence of SE(3) transformations or group actions applied to the home position, Θ = {θ_1, . . . , θ_{J_o}}. The estimated pose is denoted as Θ̂ = {θ̂_1, . . . , θ̂_{J_o}} to better differentiate it from the ground-truth pose. Here θ could be either ξ or ξ̂ (to be discussed later) when there is no confusion in the context. To simplify the notation, we assume a kinematic chain contains J joints. Clearly J = J_o for the fish and mouse models, while J < J_o for the human hand or human full-body, when focusing on one of the chains. A depth image is not only a 2D image but also a set of 3D points (i.e. a 3D point cloud) describing the surface profile of the object under a particular view. Ideally the estimated pose (the set of predicted joint locations) is expected to align nicely with the 3D point cloud of the object in the observed depth image.

Before proceeding further with the proposed approach, let us pause for a moment to give a concise review of the involved mathematical background. Motivated readers may refer to [28, 18, 23] for further details.

Lie Groups A Lie group G is a group as well as a smooth manifold such that the group operations (g, h) ↦ gh and g ↦ g⁻¹ are smooth for all g, h ∈ G. For example, the rotational group SO(3) is identified as the set of 3 × 3 orthonormal matrices {R ∈ R^{3×3} : R Rᵀ = I₃, det(R) = 1}, with Rᵀ denoting the transpose, det(·) the determinant, and I₃ the 3 × 3 identity matrix. Another example is SE(3), which is defined as the set of rotational and translational transformations of the form g(x) = Rx + t, with R ∈ SO(3) and t ∈ R³. In other words, g is the 4 × 4 matrix of the form

g = \begin{pmatrix} R & t \\ 0^T & 1 \end{pmatrix},   (1)

where 0 = (0, 0, 0)ᵀ. Note the identity element of SE(3) is the 4 × 4 identity matrix I₄. Both I₃ and I₄ will simply be denoted as I if there is no confusion in the context. Now, given a reference frame, a rigid-body transformation between two consecutive joints x and x′ in a kinematic chain can be represented as

\begin{pmatrix} x' \\ 1 \end{pmatrix} \leftarrow g \begin{pmatrix} x \\ 1 \end{pmatrix}.

Moreover, the product of multiple SE(3) groups (i.e. a kinematic chain) is still a Lie group. In other words, as tree-structured skeletal models are considered in general for articulated objects, each of the induced kinematic chains forms a Lie group.
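For readers who prefer a computational view, the following short Python sketch (an illustration of Eq. (1) only, not the released Lie-X code; it assumes nothing beyond numpy) assembles an SE(3) element from a rotation and a translation and applies it to a 3D point in homogeneous coordinates.

```python
import numpy as np

def se3_matrix(R, t):
    """Assemble the 4x4 homogeneous matrix of Eq. (1) from R in SO(3) and t in R^3."""
    g = np.eye(4)
    g[:3, :3] = R
    g[:3, 3] = t
    return g

def apply_se3(g, x):
    """Apply the rigid transformation g to a 3D point x, i.e. compute R x + t."""
    xh = np.append(x, 1.0)            # homogeneous coordinates (x; 1)
    return (g @ xh)[:3]

# Example: rotate a point by 90 degrees about the z-axis, then translate.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
g = se3_matrix(Rz, np.array([0.1, 0.0, 0.2]))
print(apply_se3(g, np.array([1.0, 0.0, 0.0])))   # -> approximately [0.1, 1.0, 0.2]
```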

Lie Algebras and Exponential Map The tangent plane of the Lie group SO(3) or SE(3) at the identity I is known as its Lie algebra, so(3) := T_I SO(3) or se(3) := T_I SE(3), respectively. An arbitrary element of so(3) admits a skew-symmetric matrix representation parameterized by a three dimensional vector ω = (ω₁, ω₂, ω₃)ᵀ ∈ R³ as

\hat{\omega} = \begin{pmatrix} 0 & -\omega_3 & \omega_2 \\ \omega_3 & 0 & -\omega_1 \\ -\omega_2 & \omega_1 & 0 \end{pmatrix}.   (2)

In other words, the DoF of a full SO(3) is three. Note that a rotational matrix can alternatively be represented by the Euler angle decomposition [28]. A bijective mapping ∨ : so(3) → R³ and its inverse mapping ∧ : R³ → so(3) are defined as ω̂∨ = ω and ω∧ = ω̂, respectively. Let ν ∈ R³; an element of se(3) can then be identified as

\hat{\xi} = \begin{pmatrix} \hat{\omega} & \nu \\ 0^T & 0 \end{pmatrix}.   (3)

With a slight abuse of notation, there similarly exist the bijective maps ξ̂∨ = ξ and ξ∧ = ξ̂, with ξ = (ωᵀ, νᵀ)ᵀ. Now a tangent vector ξ ∈ R⁶ (or its matrix form ξ̂ ∈ R^{4×4}) is represented as


Figure 2: A display of the three different articulated objects considered in our paper, which are (from left to right) fish, mouse, and human hand, respectively. Each of our skeletal models is an approximation of the underlying anatomy, presented as a side-by-side pair. Note end-effectors have zero degrees of freedom (DoF). See text for details.

\hat{\xi} = \sum_{i=1}^{6} \xi^i \partial_i, with ξ^i indexing the i-th component of ξ. Here ∂₁ = (1, 0, . . . , 0)ᵀ, . . ., ∂₆ = (0, . . . , 0, 1)ᵀ, or, in their respective matrix forms,

\partial_1 = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \quad
\partial_2 = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ -1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \quad
\partial_3 = \begin{pmatrix} 0 & -1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix},

\partial_4 = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \quad
\partial_5 = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \quad
\partial_6 = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}.

The exponential map Exp : se(3) → SE(3) in our context is simply the familiar matrix exponential, Exp_I(ξ̂) = e^{ξ̂} = I + ξ̂ + (1/2) ξ̂² + . . ., for any ξ̂ ∈ se(3). From Rodrigues's formula, it can be further simplified as

e^{\hat{\xi}} = \begin{pmatrix} e^{\hat{\omega}} & A\nu \\ 0^T & 1 \end{pmatrix},   (4)

with

A = I + \frac{\hat{\omega}}{\|\omega\|^2}\bigl(1 - \cos\|\omega\|\bigr) + \frac{\hat{\omega}^2}{\|\omega\|^3}\bigl(\|\omega\| - \sin\|\omega\|\bigr),   (5)

where ‖·‖ is the vector norm. It is known from the screw theory of robotics [28] that every rigid motion is a screw motion that can be realized as the exponential of a twist (i.e. an infinitesimal generator) ξ̂, with its components ω and ν corresponding to the angular velocity and the translational velocity of the segment (i.e. bone) around its joint, respectively.
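As a concrete illustration of Eqs. (4)-(5), the following Python sketch (our own illustrative code, assuming only numpy; the paper itself prescribes no implementation) maps a twist ξ = (ωᵀ, νᵀ)ᵀ ∈ R⁶ to an SE(3) matrix via Rodrigues's formula.

```python
import numpy as np

def skew(w):
    """The hat operator: map w in R^3 to the skew-symmetric matrix of Eq. (2)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_se3(xi):
    """Exponential map of a twist xi = (w, v) in R^6 to SE(3), following Eqs. (4)-(5)."""
    w, v = xi[:3], xi[3:]
    theta = np.linalg.norm(w)
    W = skew(w)
    if theta < 1e-10:                      # near-zero rotation: e^{w^} = I, A = I
        R, A = np.eye(3), np.eye(3)
    else:
        R = (np.eye(3) + np.sin(theta) / theta * W
             + (1.0 - np.cos(theta)) / theta**2 * W @ W)      # Rodrigues for SO(3)
        A = (np.eye(3) + (1.0 - np.cos(theta)) / theta**2 * W
             + (theta - np.sin(theta)) / theta**3 * W @ W)    # Eq. (5)
    g = np.eye(4)
    g[:3, :3] = R
    g[:3, 3] = A @ v
    return g
```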

Product of Exponentials and Adjoint Representation Consider a partial kinematic chain involving j joints, with j ∈ {1, 2, · · · , J}; it becomes the full chain when j = J. With a slight abuse of notation, let g_{Θ_{1:j}} be the Lie group action on the partial kinematic chain, and g_{θ_j}, or simply g_j, be the group action on the j-th joint. Its forward kinematics can be naturally represented by the product of exponentials formula, g_{Θ_{1:j}} = e^{ξ̂_1} e^{ξ̂_2} · · · e^{ξ̂_j}. Therefore, for an end-effector with home configuration (or home pose) (x; 1)ᵀ, its new configuration is described by

\begin{pmatrix} x' \\ 1 \end{pmatrix} = g_{\Theta_{1:j}} \begin{pmatrix} x \\ 1 \end{pmatrix} = e^{\hat{\xi}_1} e^{\hat{\xi}_2} \cdots e^{\hat{\xi}_j} \begin{pmatrix} x \\ 1 \end{pmatrix}.

As discussed in [28], this formula can be regarded as a series


of transformations from the body coordinate b (for the local joint) of each joint to the spatial coordinate s (for the global kinematic chain). Let us focus on a joint j and denote by ξ̂^{(b)} and ξ̂^{(s)} the twists of this joint in the body and spatial coordinates, respectively. Assume the transformation of this joint to the spatial coordinate is g_{Θ_{1:j−1}} = e^{ξ̂_1} e^{ξ̂_2} · · · e^{ξ̂_{j−1}}. The two twists can be related by the adjoint representation

\hat{\xi}^{(s)} = \mathrm{Ad}_{g_{\Theta_{1:j-1}}}\bigl(\hat{\xi}^{(b)}\bigr) := g_{\Theta_{1:j-1}}\, \hat{\xi}^{(b)}\, g_{\Theta_{1:j-1}}^{-1},

which is obtained from

e^{\hat{\xi}^{(s)}} = g_{\Theta_{1:j-1}}\, e^{\hat{\xi}^{(b)}}\, g_{\Theta_{1:j-1}}^{-1} = e^{\, g_{\Theta_{1:j-1}} \hat{\xi}^{(b)} g_{\Theta_{1:j-1}}^{-1}}

by repeatedly applying the identity g e^{ξ̂} g^{−1} = e^{g ξ̂ g^{−1}} for g ∈ SE(3) and ξ̂ ∈ se(3).

Geodesics Given two configurations g₁ and g₂, the geodesic curve between them is

g(t) = \begin{pmatrix} R(t) & t(t) \\ 0^T & 1 \end{pmatrix},

with R(t) = R₁ e^{Ω₀ t}, t(t) = (t₂ − t₁) t + t₁, t ∈ [0, 1], and Ω₀ = Log_I(R₁^{−1} R₂). Here the logarithm map Log_I, or its simplified notation log, can be regarded as the inverse of the exponential map.

Brownian Motion on Manifolds We refer interested readers to [18] for a more rigorous account of Brownian motion and stochastic differential geometry, as they are quite involved. Here it is sufficient to know that Brownian motion can be regarded as a generalization of Gaussian random variables to manifolds, where the increments are independent and Gaussian distributed, and the generator of Brownian motion is the Laplace-Beltrami operator. In what follows we focus more on the computational aspect [26]. Let t ∈ R denote a continuous variable, and δ > 0 a small step size. Let ξ_k = (ξ_k^1, · · · , ξ_k^6)ᵀ, for k = 0, 1, · · ·, denote random vectors sampled from the normal distribution N(0, C), with C ∈ R^{6×6} a covariance matrix. Then a left-invariant Brownian motion with starting point g(0) ∈ SE(3) can be approximated by

g\bigl((k+1)\delta\bigr) = g(k\delta)\, e^{\sqrt{\delta}\,\sum_{i=1}^{6} \xi_k^i \partial_i}.   (6)

In addition, these sampled points can be interpolated by geodesics to form a continuous sample path. In other words, for t ∈ (kδ, (k+1)δ) we have

g(t) = g(k\delta)\, e^{\frac{t - k\delta}{\sqrt{\delta}}\,\sum_{i=1}^{6} \xi_k^i \partial_i}.   (7)
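A minimal sketch of the discretized Brownian motion of Eq. (6) is given below (illustrative Python assuming numpy and scipy; the step size, covariance, and number of steps are arbitrary example choices, not values used in the paper).

```python
import numpy as np
from scipy.linalg import expm

def xi_hat(xi):
    """Map a 6-vector xi = (w1, w2, w3, v1, v2, v3) to its 4x4 matrix form, Eq. (3)."""
    w, v = xi[:3], xi[3:]
    M = np.zeros((4, 4))
    M[:3, :3] = np.array([[0.0, -w[2], w[1]],
                          [w[2], 0.0, -w[0]],
                          [-w[1], w[0], 0.0]])
    M[:3, 3] = v
    return M

def brownian_path(g0, C, delta, n_steps, rng):
    """Discretized left-invariant Brownian motion on SE(3), following Eq. (6)."""
    g = g0.copy()
    path = [g0.copy()]
    for _ in range(n_steps):
        xi = rng.multivariate_normal(np.zeros(6), C)   # xi_k ~ N(0, C) in R^6
        g = g @ expm(np.sqrt(delta) * xi_hat(xi))      # right-multiply by the increment
        path.append(g.copy())
    return path

# Example usage with an illustrative covariance.
rng = np.random.default_rng(0)
C = np.diag([0.01, 0.01, 0.01, 0.05, 0.05, 0.05])
path = brownian_path(np.eye(4), C, delta=0.1, n_steps=100, rng=rng)
```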

4 Our Approach

In what follows we describe the proposed Lie-X approach for pose estimation, tracking, and action recognition of various articulated objects.

Preprocessing & Initial Poses For simplicity we assume that there exists one and only one articulated object in an input depth image or patch. A simple preprocessing step is employed in our approach to extract the individual foreground object of interest; this corresponds to the point cloud of the object of interest extracted from the image. We then estimate the initial 3D location of the base joint as follows: the 2D location of the base joint is set to the center of the point cloud, while its depth value is the average depth of the point cloud. Initial poses are obtained by first setting these poses to the home pose of the underlying articulated object, i.e. the bones of the object are straight-up for the three empirical applications. For each of the initial poses, the initial orientation of the object is generated by perturbing the in-plane orientation of the above-mentioned base joint with a draw from a uniform distribution over (−π, π). To account for size variations, the bone lengths of each initial pose estimate are also scaled by a scalar that is uniformly distributed in the range [0.9, 1.1].
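This preprocessing step can be summarized by the following illustrative Python snippet (our own sketch; the function name make_initial_poses is hypothetical and the pose representation is simplified to the base-joint location, an in-plane orientation, and a global scale).

```python
import numpy as np

def make_initial_poses(points, K, rng):
    """Generate K perturbed initial poses from the foreground point cloud (N x 3 array).

    Each pose here is simply a dict with the base-joint location, an in-plane
    orientation drawn from (-pi, pi), and a bone-length scale drawn from [0.9, 1.1].
    """
    base = np.empty(3)
    base[:2] = points[:, :2].mean(axis=0)   # 2D location: center of the point cloud
    base[2] = points[:, 2].mean()           # depth: average depth of the point cloud
    poses = []
    for _ in range(K):
        poses.append({
            "base_joint": base.copy(),
            "yaw": rng.uniform(-np.pi, np.pi),   # perturbed in-plane orientation
            "scale": rng.uniform(0.9, 1.1),      # bone-length scaling
        })
    return poses
```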

Skeletal Models After preprocessing, an initial estimated pose is provided for an input depth image. The objects of interest are represented here in terms of kinematic chains. Without loss of generality, in this paper we focus on fish and mouse, which both have one chain, as well as the human hand, which possesses multiple chains, as depicted in the respective panels of Fig. 2. Our fish and mouse models contain 21 and 5 joints (including the end-effectors) along the main spine, respectively, while our hand model has 23 joints. Their corresponding DoFs are 25, 12, and


26, respectively. Overall our models are designed as proper approximations following the respective articulated objects' anatomies. The base joint is fixed at the coordinate origin and always carries six DoFs describing the 3D location and orientation of the entire object; one-DoF joints are applied to the remaining fish joints, characterizing the yaw of the fish bones; two DoFs are used for the remaining mouse joints to account for both yaw and pitch; similarly, in our human hand model, two DoFs are used for each root joint of a finger chain, while one DoF is used for the remaining joints. In all three models, zero DoFs are associated with the end-effectors, as each of them is entirely determined by the preceding joints of the chain. Note that although simplified, the mouse model includes the most essential components (the joints of the spine) at a reasonable resolution for our study. Our human hand model follows that of the existing literature (e.g. [31, 52]) that works with the widely-used NYU hand depth image benchmark [45].

4.1 Pose Estimation

Fig. 3 provides a visual mouse example that illustrates the execution pipeline of our pose estimation procedure at test run. This is also presented more formally in Algorithm 1, while the corresponding training process is explained in Algorithm 2. Note that inside both the training and testing processes, an internal evaluation metric or scoring function is used, which is also learned from data. In what follows we explain each of the components in detail.

Given a set of n_t training images, define

\overline{\Delta\theta_j} := \frac{1}{n_t} \sum_{i \in \{1, \cdots, n_t\}} \Delta\theta_j^{(i)}   (8)

as the mean deviation over the training images for the j-th joint of the set of J joints. The deviation Δθ_j characterizes the amount of change between the estimated pose and the ground-truth pose, with its concrete form to be described later. A global error function can be defined over a set of examples that evaluates the sum of differences from the mean deviation, for example of the form

\sum_{j \in \{1, \cdots, J\}} \bigl\| \Delta\theta_j - \overline{\Delta\theta_j} \bigr\|_2^2,   (9)

with ‖·‖₂ being the standard vector norm in Euclidean space. Presented with the above visual discrepancies of the current results, our aim here is to improve the skeletal estimation results. Traditionally, Jacobian matrices are employed for solving the incurred inverse kinematics problem. Here we instead advocate the usage of an iterative learning pipeline, as displayed in Fig. 3 with a mouse example. At test run, it behaves as follows. Assume for each of the J joints there are C rounds or iterations. As presented in Algorithm 1, given a test image and an initial pose estimate, for each joint j ∈ {1, · · · , J} following the kinematic chain of length J from the base joint, at the current round c ∈ {1, · · · , C} the current pose of the joint will be corrected by the Lie group action e^{r^{(c)}_j}, with the twist r^{(c)}_j being the output of a local regressor R^{(c)}_j. In other words, denoting the short-hand notations

g^{(C)}_{\Theta_{1:j-1}} := g^{(1:C)}_{\theta_1} g^{(1:C)}_{\theta_2} \cdots g^{(1:C)}_{\theta_{j-1}}, \qquad
e^{\hat{\xi}^{(1:c-1)}_j} := e^{\hat{\xi}^{(1)}_j} e^{\hat{\xi}^{(2)}_j} \cdots e^{\hat{\xi}^{(c-1)}_j}, \qquad
g^{(c-1)}_{\Theta_{1:j}} := g^{(C)}_{\Theta_{1:j-1}}\, e^{\hat{\xi}^{(1:c-1)}_j},

the j-th joint spatial coordinate can be updated by the following left group action:

g^{(c)}_{\Theta_{1:j}} = g^{(c-1)}_{\Theta_{1:j}}\, e^{r^{(c)}_j},   (10)

with e^{r^{(c)}_j} being the latest group element used to further correct the spatial location of the j-th joint at round c. It is worth emphasizing that this process requires learning the set of local regressors {R^{(c)}_j}, where the output of each regressor, r^{(c)}_j, is dedicated to a particular round c and joint j. Each of these local models is learned from local features, i.e. the pose-indexed depth features to be described later. In a sense, this endows our system with the ability to memorize local gradient updating rules from similar training patterns. This essentially forms the key ingredient that allows the removal of the commonly used Jacobian matrices for error propagation in our approach. Moreover, at test run, multiple initial poses are generated for each input image. They will then pass


[Figure 3 pipeline stages: preprocessed input depth image; initial pose generator producing K_t initial poses; per-joint, per-round corrections (joints 1-4, iterations 0 to C); learned metric; rendered top-view and side-view output.]

Figure 3: An illustrative example of our pose estimation pipeline in Algorithm 1. In this example, a top-view depth image is used as the sole input. After a brief preprocessing step and obtaining an initial pose, an iterative process is executed over each joint j and every round c to produce the output estimate. For demonstration purposes, we also present a 3D virtual mouse fitted with the predicted skeletal model, with triangle meshing and skin texture mapping, which is then rendered in its top view as well as its side view. Note the limbs of this virtual mouse are pre-fixed to a default configuration.

through our learned inverse kinematics regressors and produce corresponding candidate poses. These output poses are then screened by our learned metric, to be discussed later, and the optimal one is picked as the final estimated pose.

At the training stage, a set of K initial poses of the input image is obtained in the same manner as at the testing stage. The aforementioned local regressors are then learned as follows. Each example of the training dataset consists of an instance, namely a pair of poses including the estimated pose and the ground-truth pose, (θ̂^{(c−1)}_j, θ_j), as well as its label, namely the deviation of the estimate θ̂_j from the ground-truth θ_j:

\Delta\theta_j = \log\Bigl( \bigl( g^{(1:c-1)}_{\Theta_{1:j}} \bigr)^{-1} g_{\Theta_{1:j}} \Bigr).   (11)

For the first joint j = 1 (the base joint in the kinematic chain) and at the first round c = 1, the label of an example is the amount of change from the first joint of the initial pose to that of the ground-truth. Then, at any round c, its corresponding initial pose is obtained by executing the current partial kinematic model until the immediately preceding round c − 1. Similarly, for the second joint at round c, the initial pose in each of the training examples is attained by executing the current partial model from the base joint up to round c − 1 of the current joint, and its label is then the amount of change from the current joint of the aforementioned initial pose to the second joint of the ground-truth. In this way, the training examples are prepared separately over joints and then across rounds, until the very last joint J and round C. Algorithm 2 presents the procedure of learning the set of regressors {R^{(c)}_j}, with each regressor R^{(c)}_j of round c and joint j being learned from its local context to make its prediction, r^{(c)}_j. Without loss of generality, the random forest method [8] is engaged here as the learning engine.

Note that our hand skeletal model contains five kinematic chains, all of which share the hand base joint as the root of the tree. For each chain, the sub-chain resulting from the exclusion of the base joint is independent of the other sub-chains given that the root is set. This motivates the following procedure: at test run, the base joint is first worked out, following the process described above. After this is done, Algorithm 1 is executed for each of the five sub-chains separately.


Algorithm 1 Pose Estimation: Testing Stage

Input: An unseen depth image
Output: Estimated skeletal joint locations and a prediction of its evaluation score
Preprocess to obtain K_t initial poses by random perturbation of the home pose centered at the object point cloud.
for k = 1 : K_t do
  for j = 1 : J do
    for c = 1 : C do
      Twist prediction by applying the learned local regressor R^{(c)}_j to obtain r^{(c)}_j.
      Update the prediction of the current joint spatial coordinate by applying the corresponding left group action of Eq. (10).
    end for
  end for
end for
Pick the best out of the K_t candidates by applying the learned metric.

Algorithm 2 Pose Estimation: Training Stage

Input: The set of training examples. For each example i, obtain K initial poses by random perturbations from the base system estimate.
Output: A series of learned regressors {R^{(c)}_j : j = 1, · · · , J; c = 1, · · · , C}.
for j = 1 : J do
  for c = 1 : C do
    Given the context, learn a local regressor R^{(c)}_j.
    Update the prediction of the current joint spatial coordinate by R^{(c)}_j using Eq. (10).
  end for
  Prepare the training set of the (j+1)-th joint spatial coordinate by applying {R^{(c)}_{j'}}_{j'=1, c=1}^{j, C}, the partial set of local regressors learned so far.
end for
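To make the control flow of the testing stage concrete, the following Python sketch walks K_t initial poses through the per-joint, per-round corrections of Algorithm 1 and keeps the candidate that the learned metric scores best. This is an illustration only: regressors[j][c] plays the role of R_j^(c), and the callables extract_features, apply_left_action, and predict_error are hypothetical stand-ins for the pose-indexed features, the left group action of Eq. (10), and the learned metric R_m.

```python
import numpy as np

def estimate_pose(depth_image, initial_poses, regressors,
                  extract_features, apply_left_action, predict_error):
    """Sketch of Algorithm 1 (testing stage); see the lead-in for the assumed helpers."""
    candidates = []
    for pose in initial_poses:                                  # k = 1..K_t initial poses
        for j, rounds in enumerate(regressors):                 # joints along the chain
            for reg in rounds:                                  # rounds c = 1..C
                feats = extract_features(depth_image, pose, j)  # pose-indexed features
                twist = reg.predict([feats])[0]                 # predicted twist r_j^(c)
                pose = apply_left_action(pose, j, twist)        # left group action, Eq. (10)
        candidates.append(pose)
    # The learned internal metric mimics the average joint error; smaller is better.
    errors = [predict_error(depth_image, p) for p in candidates]
    return candidates[int(np.argmin(errors))]
```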

Learning the Internal Evaluation Metric Since multiple pose hypotheses are present in our approach, it remains to decide which one from the candidate pool we should choose as the final pose estimate. Traditionally this is dealt with either by mode seeking or by taking the empirical average, as in e.g. Hough voting methods [17, 24, 15] or random forests [8], respectively; it could also be carried out by simply matching against a small set of carefully crafted templates such as the distance transform or DOT [16]. Instead we propose to learn a surrogate scoring function that is consistent with the real evaluation metric employed during practical quantitative analysis. This learned scoring function then becomes a built-in module of our approach to select the pose hypothesis with the least error.

More concretely, the widely used criterion of average joint error [51] is adopted as the evaluation metric for our scoring function to mimic. During the training stage, a set of n_m training examples is generated, where a training example consists of an instance and a label: a training instance contains an input depth image, its ground-truth pose (i.e. skeletal joint locations), and an estimated pose given as a set of corresponding joint locations after random perturbations from the ground-truth; its label is the average joint error between the estimated and the ground-truth poses. Therefore a second type of regressor, R_m, is utilized here and learns to predict the error at test run. Namely, given an unseen depth image and an estimated pose, our regressor produces a real-valued score mimicking the average joint error as if the ground-truth were known.
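One possible way to realize the scoring regressor R_m with an off-the-shelf random forest is sketched below (illustrative Python using scikit-learn as a stand-in for the paper's own forest implementation; the feature extraction and the perturbation scheme are left abstract and the pose object with a .joints attribute is an assumption of this sketch).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_internal_metric(examples, extract_features, n_trees=20, max_depth=15):
    """Train R_m to predict the average joint error of a perturbed pose.

    `examples` is assumed to be a list of (depth_image, perturbed_pose, gt_joints)
    triples, where perturbed_pose.joints holds the perturbed joint locations.
    """
    X, y = [], []
    for image, pose, gt_joints in examples:
        X.append(extract_features(image, pose))
        # Label: average joint error between the perturbed and ground-truth joints.
        y.append(np.mean(np.linalg.norm(pose.joints - gt_joints, axis=1)))
    metric = RandomForestRegressor(n_estimators=n_trees, max_depth=max_depth)
    metric.fit(np.asarray(X), np.asarray(y))
    return metric
```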

4.2 Tracking

Particle filters such as [20] have long been regarded as a powerful means for tracking, and are also considered in our context to address the tracking problem. To strike a favorable balance between efficiency and effectiveness, we consider a probabilistic particle filter based approach only for the base joint, where particles are formed by Brownian motion based sampling in the pose manifold; meanwhile, the parameters of the remaining joints are obtained by invoking the same inference machinery as in our deterministic pose estimation algorithm. This design is also motivated by our empirical observation that often the object


poses are also well-estimated when the prediction of the base joint is in close vicinity of the true value. That is, according to our observation the first joint is crucial in pose estimation: if ξ₁ is wrongly predicted, the estimation results of the remaining joints can be seriously damaged; when our prediction of ξ₁ is accurate, the follow-up joint estimates are also accurate. Algorithm 3 presents the main procedure for our tracking task, which is also visually illustrated in Fig. 4 and described in detail in the following paragraphs.

Following the particle filter paradigm [20] we consider a discretized time step t, and use x to denote a latent random variable and y its observation. Here the state of the tracked object (i.e. the estimated pose Θ at time t) is denoted as x_t and its history as x_{1:t} = (x_1, · · · , x_t). Similarly, the current observation is denoted as y_t, and its history as y_{1:t} = (y_1, · · · , y_t). The underlying first-order temporal Markov chain induces the conditional independence property, which by definition gives p(x_t | x_{1:t−1}) = p(x_t | x_{t−1}). Following the typical factorization of this state-space dynamic model, we have

p(y_{1:t-1}, x_t \mid x_{1:t-1}) = p(x_t \mid x_{1:t-1})\, p(y_{1:t-1} \mid x_{1:t-1}) = p(x_t \mid x_{t-1}) \prod_{i=1}^{t-1} p(y_i \mid x_i),

with p(y_{1:t−1} | x_{1:t−1}) = ∏_{i=1}^{t−1} p(y_i | x_i). We also need the posterior probability p(x_t | y_{1:t}) for filtering purposes, which in our context is defined as p(x_t | y_{1:t}) ∝ p(y_t | x_t) p(x_t | y_{1:t−1}), with p(x_t | y_{1:t−1}) = ∫ p(x_t | x_{t−1}) p(x_{t−1} | y_{1:t−1}) dx_{t−1}. In other words, it is evaluated by considering the posterior p(x_{t−1} | y_{1:t−1}) from the previous time step in a recursive manner.

The realization of the particle filter paradigm in our context involves the three-step probabilistic inference process of selection-propagation-measurement, which serves as the one time-step update rule of the particle filter, and is also described in Algorithm 3. Specifically, the process at the current time-step t corresponds to an execution of the selection-propagation-measurement triplet of steps. The output of the previous time-step contains a set of K_r weighted particles

S_{t-1} := \bigl\{ (s^{(i)}_{t-1}, \pi^{(i)}_{t-1}) \bigr\}_{i=1}^{K_r}.

Here each particle s^{(i)}_{t−1} corresponds to a particular realization of the set of tangent vector parameters Θ^{(i)} that uniquely determines a pose, where each of the vectors is attached to a joint following the underlying kinematic chain. The particle s^{(i)}_{t−1} is also associated with its weight π^{(i)}_{t−1} ∈ [0, 1]. Collectively this set of weighted particles is thus regarded as an approximation to the posterior distribution p(x_{t−1} | y_{1:t−1}). The selection step operates by uniform sampling from the cumulative distribution function (CDF) of p(x_{t−1} | y_{1:t−1}) to produce a new set of K_r particles with equal weights. It is followed by the propagation step, where the manifold-based Brownian motion sampling of Eq. (7) is employed to realize p(x_t | x_{t−1}), i.e. to obtain the new state based on a discretized Brownian motion deviation from the previous time-step. Note that this Brownian motion sampling is carried out only on the base joint, while the remaining joints are obtained by directly executing the same inference machinery of Eq. (10) as in our pose estimation algorithm. The sample set now constitutes an approximation to the predictive distribution p(x_t | y_{1:t−1}). The measurement step

finally provides an updated weight π^{(i)}_t for each particle s^{(i)}_t as follows. Let m^{(i)}_t be the predicted error value of the i-th particle s^{(i)}_t, obtained by applying our learned metric. The weight is then evaluated as

\pi^{(i)}_t = \frac{1}{\sqrt{2\pi}\,\sigma}\; e^{-\frac{\bigl(m^{(i)}_t\bigr)^2}{2\sigma^2}}.   (12)

After obtaining all the K_r weights, each weight π^{(i)}_t is further normalized as

\pi^{(i)}_t \leftarrow \frac{\pi^{(i)}_t}{\sum_{i'=1}^{K_r} \pi^{(i')}_t}.   (13)

The updated sample set now collectively approximates the corresponding posterior distribution p(x_t | y_{1:t}) at time t.

The set of weighted particles allows us to represent the entire distribution instead of a point estimate as


Figure 4: A visual illustration of the one time-step update process (select, propagate, measure) of the particle filter paradigm considered in our tracking task.

what we have done during the pose estimation task. The final pose estimate x*_t (i.e. Θ at time t) is produced by weighted averaging over this set of particles,

x^*_t \leftarrow \sum_{i=1}^{K_r} \pi^{(i)}_t\, s^{(i)}_t.   (14)
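One time-step of the select-propagate-measure update (Algorithm 3) can be sketched as follows (illustrative Python assuming numpy; propagate_particle and predict_error are hypothetical callables standing in for the Brownian-motion propagation of Eq. (7) plus the per-joint updates of Eq. (10), and for the learned metric, respectively; particles are assumed to be numpy arrays of stacked tangent-vector parameters so that the weighted average of Eq. (14) is well defined).

```python
import numpy as np

def track_step(particles, weights, depth_image,
               propagate_particle, predict_error, sigma, rng):
    """One select-propagate-measure update, returning the pose estimate and new particle set."""
    K = len(particles)
    # (1) Select: resample K particles according to the current weights (CDF sampling).
    idx = rng.choice(K, size=K, p=weights)
    selected = [particles[i] for i in idx]
    # (2) Propagate: Brownian motion on the base joint, Eq. (7), then Eq. (10) per joint.
    propagated = [propagate_particle(p, rng) for p in selected]
    # (3) Measure: weight each particle by a Gaussian of its predicted error, Eq. (12).
    errors = np.array([predict_error(depth_image, p) for p in propagated])
    new_weights = np.exp(-errors**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)
    new_weights /= new_weights.sum()                                   # Eq. (13)
    # (4) Estimate: weighted average of the particle parameters, Eq. (14).
    pose_estimate = sum(w * p for w, p in zip(new_weights, propagated))
    return pose_estimate, propagated, new_weights
```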

4.3 Action Recognition

Our approach can further be employed to address the problem of action recognition. The key insight is that an action instance (i.e. a pose sequence) corresponds to a curve segment in the manifold, whereas the set of all instances of a particular action type corresponds to a group of curves. Therefore, the task of action recognition can be cast as separating different groups of action curves. This motivates us to consider a third type of learned predictor, R_a. Here dedicated features are extracted, as described next, and the output is a prediction of the action category.

Action Recognition Features As the length of action sequences may vary, they are first normalized to the same length (in practice 32 frames) using linear interpolation. Local features for each frame of a sequence can be obtained based on the tangent vectors

Algorithm 3 Tracking at time-step t

Input: S_{t−1}
Output: x*_t, S_t
(1) Select: calculate the normalized cumulative probabilities.
for i = 1 : K_r do
  Sample a particle s′^{(i)}_t uniformly from the CDF of p(x_{t−1} | y_{1:t−1}).
end for
(2) Propagate:
for i = 1 : K_r do
  Obtain s^{(i)}_t from s′^{(i)}_t using the transition probability p(x_t | x_{t−1}), which is realized by the manifold-based Brownian motion sampling of Eq. (7) of the tangent vector for the base joint, ξ₁, followed by directly executing Eq. (10) for each of the remaining joints following the kinematic chain.
end for
(3) Measure:
for i = 1 : K_r do
  Evaluate π^{(i)}_t by Eq. (12).
end for
Normalize π^{(i)}_t by Eq. (13). Now S_t = {(s^{(i)}_t, π^{(i)}_t)}_{i=1}^{K_r} is ready.
(4) Estimate the pose:
The estimated pose x*_t is finally obtained by Eq. (14).

(Lie algebras) of the estimated joints in the manifold. Each temporal sequence is further split into 4 equal-length segments, where the frames in a segment collectively contain the set of tangent vectors as local features. Moreover, a temporal pyramid structural representation is utilized, in a sense similar to that of spatial pyramid matching [22], where features are extracted using the hierarchical scales {4, 2, 1}: 4 corresponds to the 4 segments introduced previously, and the rest correspond to the coarser scales built over it layer by layer. In total this leads to 7 temporal segments (or sets of variable sizes) over these 3 scale spaces. For each such segment, the mean and


standard deviation of its underlying pose representation (in terms of the Lie algebras described below) are then used as features. Now let us look at the details of these pose representations defined on single frames, which can be decomposed into joints following the aforementioned kinematic chain structure. For the base joint, we use the Lie algebra of the transformation from the current frame to the next frame. For the rest of the joints, we use the Lie algebra of the transformation from the previous joint to the current joint and that of the transformation from the current frame to the next frame. Besides, we also use the 3D location and orientation of the first joint (i.e. the base joint) as features. In particular, for fish-related actions, rather than using the Lie algebras of all 20 joints, we emphasize robust estimation by considering a compact feature representation: the first component or sub-vector of the feature vector corresponds to the Lie algebra of the base joint; the second and the third components are sub-vectors of the same length obtained by averaging over the sets of Lie algebras from the second to the tenth joint, and from the eleventh to the last joint, respectively. Overall, a 252-dimensional feature vector is thus constructed to fully characterize an action sequence.
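The temporal-pyramid pooling described above can be summarized by the following illustrative Python sketch (our own simplification: per_frame_features is a hypothetical array of per-frame Lie-algebra descriptors, and the resampling and pooling follow the description in the text rather than any released implementation).

```python
import numpy as np

def temporal_pyramid_features(per_frame_features, target_len=32):
    """Pool per-frame features over a {4, 2, 1} temporal pyramid (7 segments in total).

    per_frame_features: array of shape (T, D), one D-dimensional Lie-algebra-based
    descriptor per frame. The sequence is first resampled to target_len frames by
    linear interpolation, then the mean and standard deviation are taken over every
    pyramid segment and concatenated.
    """
    T, D = per_frame_features.shape
    src = np.linspace(0.0, 1.0, T)
    dst = np.linspace(0.0, 1.0, target_len)
    resampled = np.stack([np.interp(dst, src, per_frame_features[:, d])
                          for d in range(D)], axis=1)
    pooled = []
    for n_segments in (4, 2, 1):                       # pyramid scales
        for seg in np.array_split(resampled, n_segments, axis=0):
            pooled.append(seg.mean(axis=0))
            pooled.append(seg.std(axis=0))
    return np.concatenate(pooled)                      # 7 segments x 2 x D values
```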

4.4 Random Forests, Pose-Indexed Depth Features, and Binary Tests

There are three types of learned predictors (namely the set of local regressors {R^{(c)}_j}, the learned internal metric R_m, and the action recognition predictor R_a) considered in our approach. In general any reasonable learning method can be used to realize these three types of predictors. In practice the random forest method [8, 15] is engaged for these learning tasks, so it is worthwhile to describe its details here.

For action recognition, a unique set of action features is used, as stated previously. In what follows, we thus focus on the description of our pose-indexed depth features, which are used in the first two types of regressors, {R^{(c)}_j} and R_m. Our feature representation can be regarded as an extension of the popular depth features discussed in [38, 51], by incorporating the idea of pose-indexed features [14, 2] to model 3D objects. We start by focusing on a joint j with its current 3D location x ∈ R³, where a 3D offset u can be obtained by random sampling from the home pose. Let g_{Θ_{1:j}}(u) denote the Lie group left action of the current object pose Θ_{1:j} applied to u. The 3D location of the offset is then naturally x + g_{Θ_{1:j}}(u), and its projection onto the 2D image plane under the current camera view is denoted as u = Proj(x + g_{Θ_{1:j}}(u)). Similarly we can obtain another random offset v. For a 2D pixel location x = Proj(x) ∈ R² of an image patch I containing the object of interest, its depth value is denoted as d_I(x). We are now ready to construct a feature φ_{I,(u,v)}(x), or its short-hand notation φ, by considering the two 2D offset positions u, v relative to x:

\phi_{I,(u,v)}(x) = d_I(x + u) - d_I(x + v).   (15)

Due to the visibility constraint, we are only able to obtain the depth values of the projected 2D locations u and v from the object surface; thus Proj is a surjective map. Nevertheless, this serves our intention of sampling random features well. Following [8], a binary test is defined as a pair of elements (φ, ε), with φ being the feature function and ε a real-valued threshold. When an instance with pixel location x passes through a split node of our binary trees, it is sent to the left branch if φ(x) > ε, and to the right branch otherwise.
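For illustration, a minimal Python version of the depth-difference feature of Eq. (15) and the associated binary test might look as follows (a sketch under the assumption that the caller has already computed the pose-indexed 2D offsets u and v, that the depth image is indexed as depth[row, col], and that the pixel coordinates are given as (col, row); none of these helper names come from the paper).

```python
import numpy as np

def depth_at(depth, p):
    """Depth value d_I at a 2D pixel location p = (col, row), clipped to the image."""
    c = int(np.clip(round(p[0]), 0, depth.shape[1] - 1))
    r = int(np.clip(round(p[1]), 0, depth.shape[0] - 1))
    return depth[r, c]

def pose_indexed_feature(depth, x2d, u2d, v2d):
    """Eq. (15): difference of depth values at two pose-indexed 2D offsets from x."""
    return depth_at(depth, x2d + u2d) - depth_at(depth, x2d + v2d)

def binary_test(depth, x2d, u2d, v2d, threshold):
    """Split decision at a tree node: True sends the sample to the left branch."""
    return pose_indexed_feature(depth, x2d, u2d, v2d) > threshold
```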

Our random forest predictors are constructed based on these features and binary tests for the split nodes. Similar to existing regression forests in the literature, e.g. [38], at a split node we randomly select a relatively small set of m distinct features Φ := {φ_i}_{i=1}^{m} as candidate features. For every candidate feature, a set of candidate thresholds Λ is uniformly selected over the range defined by the empirical set of training examples in the node. The best test (φ*, ε*) is chosen from these features and accompanying thresholds by maximizing the following gain function. This procedure is then repeated until there are L levels in the tree or once the node contains fewer than l_n training examples. More specifically, the above-mentioned split test is obtained by

(\phi^*, \epsilon^*) = \arg\max_{\phi \in \Phi,\, \epsilon \in \Lambda} I(\phi, \epsilon),


[Figure 5 annotations: (a) light-field camera (Raytrix) above a fish tank, 12.7 cm, 9.5 cm, 30.1 cm; (b) depth camera (Primesense) above a 44 cm × 44 cm open-field cage, 85 cm, average mouse length 105 mm (65 pixels).]

Figure 5: The capture setups used in constructing our (a) fish and (b) mouse depth image datasets, respectively.

where the gain I(φ, ε) is defined as

I(\phi, \epsilon) = E(S) - \left( \frac{|S_l|}{|S|} E(S_l) + \frac{|S_r|}{|S|} E(S_r) \right).   (16)

Here |·| denotes the cardinality of a set, and S denotes the set of training examples arriving at the current node, which is further split into two subsets S_l and S_r according to the test (φ, ε). Define \overline{\Delta\theta_j} := \frac{1}{|S|} \sum_{i \in S} \Delta\theta_j^{(i)}, the mean deviation of the set for the j-th joint, and accordingly for S_l and S_r. The function E is defined over a set of examples and evaluates the sum of differences from the mean deviation:

E(S) = \sum_{i \in S} \bigl\| \Delta\theta_j^{(i)} - \overline{\Delta\theta_j} \bigr\|_2^2.   (17)
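The node-splitting criterion of Eqs. (16)-(17) amounts to choosing the test that most reduces the within-node spread of the deviations around their mean; a compact Python sketch is given below (illustrative only, assuming the deviations Δθ_j of the examples in a node are stored as rows of a numpy array, consistent with the reconstruction of Eq. (17) above).

```python
import numpy as np

def node_energy(deviations):
    """Eq. (17): sum of squared differences from the mean deviation of the set."""
    if len(deviations) == 0:
        return 0.0
    mean_dev = deviations.mean(axis=0)
    return float(np.sum(np.linalg.norm(deviations - mean_dev, axis=1) ** 2))

def split_gain(deviations, feature_values, threshold):
    """Eq. (16): parent energy minus the size-weighted energies of the two children."""
    left = deviations[feature_values > threshold]
    right = deviations[feature_values <= threshold]
    n = len(deviations)
    return node_energy(deviations) - (len(left) / n * node_energy(left)
                                      + len(right) / n * node_energy(right))
```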

In the final decision stage, for the first two regression modules, mean-shift mode seeking in the Hough voting space is used, as e.g. in [15], while for the third module (action recognition) the classical random forest strategy [8] is used to pick the category with the largest count from the averaged histogram.

5 Empirical Evaluations

Empirically, our Lie-X approach is examined on three different articulated objects: fish, mouse, and human hand.

Performance Evaluation Metric Our performance evaluation metric is based on the commonly-used average joint error, computed as the averaged Euclidean distance in 3D space over all the joints. Formally, let v_g and v_e be the ground-truth and estimated joint locations, respectively. The joint error of the pose estimate v_e is defined as e = (1/m) Σ_i ‖v_{g_i} − v_{e_i}‖, where ‖·‖ is the Euclidean norm in 3D space and m is the number of joints. When dealing with test images, let k = 1, . . . , n_tst index the test images, with corresponding joint errors e_1, · · · , e_{n_tst}. The average joint error is then defined as (1/n_tst) Σ_k e_k.
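This metric is straightforward to compute; a small Python sketch (assuming numpy arrays of shape (num_joints, 3) for each pose) is given for completeness.

```python
import numpy as np

def joint_error(gt_joints, est_joints):
    """Average 3D Euclidean distance over all joints for one pose estimate."""
    return float(np.mean(np.linalg.norm(gt_joints - est_joints, axis=1)))

def average_joint_error(gt_list, est_list):
    """Mean of the per-image joint errors over a set of test images."""
    return float(np.mean([joint_error(g, e) for g, e in zip(gt_list, est_list)]))
```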

Internal Parameters Throughout the experiments, a fixed set of values is always used for the internal


parameters of our approach, unless otherwise stated, as follows. For the first type of regressors (namely the set of local regressors {R^{(c)}_j}), the number of trees is fixed to (3, 10, 10) and the tree depth to (24, 24, 24) for the triplet of articulated objects (fish, mouse, hand), respectively. For the second type (the learned internal metric R_m), the number of trees is (20, 20, 20) and the tree depth is (15, 15, 20) for (fish, mouse, hand), respectively. For the third type (the action recognition predictor R_a), the number of trees in the forest is 50 and the tree depth is 20. The number of features is m = 8,000, and the maximum number of examples in a leaf node is l_n = 5. The number of rounds at each joint is set to C = 7, 3, and 3 for fish, mouse, and hand, respectively. The local image patch sizes considered in our approach for fish, mouse, and hand are normalized to 25×25, 100×100, and 100×100 mm², respectively. These patches are used as input to the local random forest regressors in our approach to estimate the spatial coordinate of the next joint based on the current joint, following the kinematic chain. One important parameter is K_t, the number of initial poses in pose estimation. In practice, after factoring in efficiency considerations, K_t is set to 40, 40, and 20 for pose estimation of fish, mouse, and hand, respectively, throughout the experiments. Similarly, the number of initial poses for tracking is set to K_r = 200. For learning the internal evaluation metric, the number of candidates is set to 8. Namely, given a training dataset of size n_t, the number of training examples for pose estimation becomes K_t × n_t, while the number of training examples for learning the internal metric is n_m = 8 × n_t.

5.1 Datasets

To examine the applicability of our approach to diverse articulated objects, we demonstrate in this paper its empirical implementation for fish, mouse, and human hand, respectively, where three distinct real-life datasets are employed. In particular, here we introduce our home-grown 3D image datasets of zebrafish and lab mouse that are dedicated to the related problems of pose estimation, tracking, and

Figure 6: Following the evaluation protocol of [45, 30, 31, 54], for the NYU hand depth image dataset only a subset of 14 joints out of the total 23 hand skeletal joints (marked as in use versus not used) is considered during performance evaluation for hand pose estimation.

action recognition. The images have been captured and annotated by experts to provide the articulated skeleton information describing the pose of the subject. The popular NYU hand depth image dataset [45] is also considered here. More details of the datasets are discussed next. It is worth noting that different imaging modalities are utilized across the three datasets: light-field depth images are used for fish, while structured illumination depth cameras are employed for the mouse and human hand objects. Regardless, our approach is demonstrated to work well across these diverse imaging modalities.

Our Fish Dataset Depth images are acquired with a top-mounted Raytrix R5 light-field camera at a frame rate of 50 FPS and a resolution of 1,024 × 1,024, as displayed in Fig. 5(a). The depth images are obtained from the raw plenoptic images by utilizing the Raytrix on-board SDK. In total, 7 different adult zebrafish of different genders and sizes are engaged in our study. From the captured images, 2,972 images containing distinct poses are annotated. The training dataset of n_t = 95,104 images is then formed by augmenting each fish object in these images with 31 additional transformations, where each transformation comes with a random scaling within [0.9, 1.1]


[Figure 7 panels: rows correspond to the internal parameters (number of trees, tree depth, number of initial poses K_t, number of iterations C, number of trees for learning the internal metric); columns correspond to fish, mouse, and hand; vertical axes show the average joint error (mm).]

Figure 7: Sensitivity analysis of our Lie-X approach w.r.t. internal parameters for pose estimation tasks: in each of the five rows, the average joint error is plotted as a function of the respective internal parameter, displayed in three columns for fish, mouse, and hand, respectively. In each panel, a red dot indicates the specific parameter value empirically employed in our approach.


Figure 8: Cumulative error distribution curves for pose estimation of (a) fish, (b) mouse, and (c) hand, respectively, comparing RF, CNN, and our Lie-X approach (Lie-Fish, Lie-Mouse, Lie-Hand), with the results of Tompson et al. [45], Oberweger et al. [30], Oberweger et al. [31], and Zhou et al. [54] additionally shown for the hand. The horizontal axis displays the distance in mm of the estimated poses from the ground truths. The vertical axis presents the fraction of examples whose estimated poses possess average joint errors within the current distance range.

and with a random in-plane rotation within (−π, π). The test dataset for the pose estimation problem contains ntst = 1,820 distinct fish images.
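To make the 31-fold augmentation protocol concrete, the following is a minimal sketch (in Python, using NumPy and OpenCV) of how one such set of random transforms could be generated for a single depth image; the choice of an affine warp with nearest-neighbor interpolation is our own assumption, and in practice the same transform must also be applied to the joint annotations of the fish object.

import numpy as np
import cv2

def augment_depth_image(depth, num_transforms=31, rng=None):
    # Generate augmented copies of a single-object depth image. Each copy
    # applies a random in-plane rotation in (-pi, pi) and a random isotropic
    # scaling in [0.9, 1.1] about the image center, following the protocol
    # described in the text.
    if rng is None:
        rng = np.random.default_rng()
    h, w = depth.shape
    center = (w / 2.0, h / 2.0)
    augmented = []
    for _ in range(num_transforms):
        angle_deg = np.degrees(rng.uniform(-np.pi, np.pi))
        scale = rng.uniform(0.9, 1.1)
        # 2x3 affine matrix combining rotation and scaling about the center.
        M = cv2.getRotationMatrix2D(center, angle_deg, scale)
        warped = cv2.warpAffine(depth, M, (w, h),
                                flags=cv2.INTER_NEAREST,  # preserve raw depth values
                                borderValue=0)
        augmented.append(warped)
    return augmented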

In addition to single-frame based pose annotations, we also record, annotate, and make available a fish action dataset. [21] provides a comprehensive catalogue of zebrafish actions, from which a subset of 9 action classes is considered in this paper; they are listed below and illustrated in Fig. 19:

Scoot: Moves along a straight line.

J-turn: Fine reorientation during which the body slightly curves (30°–60°), with a characteristic bend at the tail.

C-turn: Fish body curves to form a C-shape en route to a near 180° turn.

R-turn: Involves a routine angular turn of greater than 60°.

Surface: Moves up towards the water surface.

Dive: Moves towards the tank bottom.

Zigzag: Contains erratic movements with multiple darts in various directions.

Thrash: Consists of forceful swimming against the side or bottom walls of the tank.

Freeze: Refers to complete cessation of movement.

Our fish action dataset contains 426 training sequences and 173 testing sequences from 7 different fish over the aforementioned 9 categories. The length of each fish action sequence varies from 7 frames to 135 frames.

Our Mouse Dataset Mouse depth images are collected using a top-mount Primesense Carmine depth camera at a frame rate of 30 FPS and with a resolution of 640 × 480. Fig. 5(b) presents our dedicated imaging set-up. Two different lab mice are engaged in our study. We select 3,253 images containing distinct poses and depth noise patterns, and augment them with additional transformations following the same protocol as above, which gives rise to a training dataset containing nt = 104,096 images. The testing dataset for the pose estimation problem contains ntst = 4,125 distinct depth images. For the tracking problem, the test set consists of two sequences of length 511 and 300 frames, respectively.

The NYU Hand Dataset We also evaluate our approach on the benchmark NYU hand depth image dataset [45]².


Comparison methods on articulated objects      Fish     Mouse
RF                                             1.28     12.24
CNN                                            0.79      9.17
Lie-X (w/o multiple initial poses)             3.28     13.27
Lie-X (w/o learned metric)                     1.71      9.82
Lie-X                                          0.68      6.64

Table 1: Quantitative evaluation of competing methods on the pose estimation problem for fish and mouse, respectively. Performance is measured in terms of average joint error (mm).
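For reference, the evaluation measure reported in Table 1 (and used to draw the cumulative curves of Fig. 8) can be sketched as follows; the array shapes are our own assumptions, with predicted and ground-truth poses given as per-frame 3D joint coordinates in millimeters.

import numpy as np

def average_joint_error(pred, gt):
    # Mean Euclidean distance over joints, per frame (in mm).
    # pred, gt: arrays of shape (num_frames, num_joints, 3).
    return np.linalg.norm(pred - gt, axis=-1).mean(axis=-1)

def cumulative_error_curve(per_frame_error, thresholds_mm):
    # Percentage of frames whose average joint error is within each threshold,
    # i.e. one point of the cumulative error distribution curve per threshold.
    err = np.asarray(per_frame_error)[:, None]   # (frames, 1)
    thr = np.asarray(thresholds_mm)[None, :]     # (1, thresholds)
    return 100.0 * (err <= thr).mean(axis=0)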

The NYU dataset contains nt = 72,757 depth images for training and ntst = 8,252 frames for testing. All images are depth images captured by a Microsoft Kinect using the structured illumination technique, which is the same technique as the Primesense camera used for our mouse dataset. Depth images in the training set are from a single user, while images in the test set are from two users. While a ground-truth hand label contains 36 annotated joints, only 14 of these joints are considered in many existing efforts using this dataset, such as [45, 30, 31, 54], which we follow in our experiments. This is also presented in Fig. 6: important hand joints are included in this subset of 14 joints, such as all the finger tips and the hand base.

5.2 Pose Estimation of Fish, Mouse, and Human Hand

In this subsection, we focus on the problem of pose estimation for articulated objects such as fish, mouse, and hand. To make a fair comparison with existing methods, we specifically implement two non-trivial baseline methods, namely a regression forest (RF) and a convolutional neural network (CNN). The RF method is a re-implementation of the classical regression method used by Microsoft Kinect [38], with the only difference being that our RF implementation explicitly utilizes a skeletal model, instead of estimating joint locations without skeletal constraints as in [38]. Two separate regression forests, F1 and F2, are trained for this purpose.

² The NYU dataset is publicly available at http://cims.nyu.edu/~tompson/NYU_Hand_Pose_Dataset.htm.

Figure 16: Illustrating the convergence process on the same mouse example presented in Fig. 14. It starts from an initial pose candidate and proceeds to the final pose estimation result, which is the top-left one among the list of all 40 output candidates.

Here F1 is used to estimate the 3D location and in-plane orientation of the subject, followed by F2, which produces a set of 3D pose candidates. The numbers of trees are set to 7 and 12 for F1 and F2, respectively. In both cases, the maximum tree depth is fixed to L = 20. The standard depth-invariant two-point offset features of [38] are also used. The CNN method is obtained as follows: the pre-trained AlexNet CNN model from ImageNet is engaged as the initial model. To tailor the training data for our CNN, objects of interest are cropped from the training depth images according to their bounding boxes. The depth values in each patch are rescaled to the range of 0 to 255. Each object patch is replicated three times to form an RGB image, which is then resized to an input instance of size 224 × 224. This, together with its corresponding annotation, constitutes a training example. Our CNN model is finally obtained by using the MatConvNet package to train on these examples for 50 epochs.
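As a sketch of the CNN input preparation just described (crop by bounding box, rescale depth values to [0, 255], replicate to three channels, and resize to 224 × 224), one possible implementation is given below; the bounding-box format and the min-max rescaling are assumptions made for illustration only.

import numpy as np
import cv2

def depth_patch_to_cnn_input(depth, bbox, out_size=224):
    # bbox is assumed to be (x, y, width, height) of the object of interest.
    x, y, w, h = bbox
    patch = depth[y:y + h, x:x + w].astype(np.float32)

    # Rescale the depth values of the patch into the 0-255 range.
    lo, hi = patch.min(), patch.max()
    patch = (patch - lo) / max(hi - lo, 1e-6) * 255.0

    # Resize and replicate the single channel three times to mimic an RGB input.
    patch = cv2.resize(patch, (out_size, out_size), interpolation=cv2.INTER_LINEAR)
    return np.stack([patch, patch, patch], axis=-1).astype(np.uint8)  # H x W x 3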

Sensitivity Analysis of the Internal Parameters As our approach contains a number of internal parameters, it is of interest to systematically investigate their influence on the final performance of our system. Here we consider five influential parameters: the number of initial poses Kt, the number of rounds C, the number and depth of trees in our first type of regressors (i.e. the local regressors R_j^(c)), as well as the number of trees used in our learned metric component R_m.


Comparison methods on hand pose estimation     Average joint error (mm)
RF                                             24.81
CNN                                            18.82
Tompson et al. [45]                            21.00
Oberweger et al. [30]                          20.00
Oberweger et al. [31]                          16.50
Zhou et al. [54]                               16.90
Lie-X (w/o multiple initial poses)             20.50
Lie-X (w/o learned metric)                     16.72
Lie-X                                          14.51

Table 2: Quantitative evaluation of competing methods on the benchmark NYU dataset [45] for the hand pose estimation task. Performance is measured in terms of average joint error (mm).

Fig. 7 displays the performance of Lie-X with respect to each of these five parameters row by row, while each of the three columns presents the respective results for fish, mouse, and hand. Each of the fifteen panels in this five-by-three matrix is obtained by varying one parameter of interest while keeping the other parameters fixed to their default values. In general, our system behaves in a rather stable manner w.r.t. changes of the internal parameters over a wide range of values. Moreover, in each of the panels, a red dot is placed to indicate the specific parameter value empirically employed by our approach in this paper. It is worth noting that the choice of these internal parameter values represents a compromise between performance and efficiency.

Comparison with Baselines and the State-of-the-art Methods To evaluate the performance of the proposed approach, a series of experiments are conducted on the aforementioned datasets for the fish, mouse, and hand pose estimation tasks. Table 1 presents a comparison of Lie-X w.r.t. the two non-trivial baseline methods (i.e. RF and CNN) on the fish and mouse pose estimation tasks. Overall, our approach clearly outperforms the others by a significant margin, while CNN achieves better results than RF. Moreover, the error distributions of these comparison methods are also presented in Fig. 8(a-b), where our approach clearly outperforms the baselines most of the time. The superior performance of our Lie-X approach is also demonstrated in Fig. 9, which provides visual comparisons of pose estimation results for six fish and six mouse examples, respectively, over the three competing methods. From these visual examples, it is observed that the estimated poses from our Lie-X approach tend to be more faithfully aligned with the ground truth when compared against the two baseline methods.

Our approach is also validated on the NYU hand depth benchmark, as displayed in Table 2. Overall, our CNN baseline result is on par with the standard deep learning results of e.g. [30], which also utilizes an AlexNet-like CNN. This helps to establish that our baselines are consistent in terms of performance with what has been reported in the literature; the same baselines are also used for pose estimation on the fish and mouse objects. Moreover, the results of the state-of-the-art methods are also directly compared here, including Tompson et al. [45], Oberweger et al. [31], and Zhou et al. [54]. It is worth pointing out that the test error of our approach is 14.51 mm in terms of average joint error. This is by far the best result on the hand pose estimation task to our knowledge, improving over the best state-of-the-art result of 16.50 mm of [31] by almost 2 mm. More detailed quantitative information is revealed through the error distributions of the comparison methods in Fig. 8(c), where our approach clearly outperforms the baselines and the state-of-the-art methods. Similarly, visual


Comparison methods        Uses GPU    Frames per second (FPS)
Tompson et al. [45]       yes         30
Oberweger et al. [30]     yes         5,000
Oberweger et al. [31]     yes         400
Zhou et al. [54]          yes         125
Lie-X                     no          123

Table 3: Runtime speed comparison with state-of-the-art methods for the hand pose estimation task. Note our Lie-X results are obtained using the CPU only, while the other methods all utilize GPUs.

comparison results are provided in Fig. 10, Fig. 11, and Fig. 12, where our approach is again shown to clearly outperform the other methods. More specifically, Fig. 10 and Fig. 11 present the visual results of all competing methods on the same four exemplar hand images. Due to access limitations, we are only able to present side-view results for our approach and the baseline methods of RF and CNN. Fig. 12 provides additional visual results comparing our approach to the state-of-the-art methods on ten more hand images.

To reveal the inner workings of our approach, a visual example is provided in Fig. 13, Fig. 14, and Fig. 15 for pose estimation of fish, mouse, and hand, respectively. It is evident that a diverse set of pose candidates is obtained, covering distinct pose locations, orientations, and sizes. This is made possible by the adoption of multiple initial poses. Moreover, the prediction scores and associated orderings from our learned metric module in general closely resemble those of the empirical evaluation metric. In addition, Fig. 16 presents several intermediate pose estimation results from different joints and rounds on an exemplar mouse image, when executing the pose estimation pipeline as illustrated in Fig. 3. It is observed that each step of the iterative process usually helps in converging toward the final pose estimate.

With vs. Without Multiple Initial Poses As presented in Table 1 for the fish and mouse objects and Table 2 for the hand objects, empirically we observe that the presence of multiple initial poses always improves the pose estimation performance. As presented in Figs. 13, 14, and 15, executing our pose estimation process from multiple distinct initial poses results in unique pose estimates, each of which can be regarded as a locally optimal result. This is due to the highly non-convex nature of our problem, a well-known fact for systems of rigid bodies in general. These visual examples also demonstrate the importance of having multiple initial poses to avoid getting trapped in locally optimal points that are far from the ground-truth point.

With vs. Without the Learned Metric To examine the usefulness of our learned internal metric, a special variant of our approach without this component is engaged here, referred to as Lie-X w/o learned metric. Provided with multiple output pose candidates, this variant differs from our full-fledged approach by averaging over them for each of the joints in 3D Euclidean space, instead of scoring them with the learned metric to pick the best estimate. Empirical experiments such as those presented in Table 1 for the fish and mouse objects and Table 2 for the hand objects suggest a noticeable performance degradation without the learned metric. Clearly the learned internal metric does facilitate selecting, from a global viewpoint, the final estimate from the pool of locally optimal candidates obtained with multiple initial poses. It has also been demonstrated in Figs. 13, 14, and 15 that in our context a max operation (i.e. with the learned metric) may well outperform an average operation (i.e. w/o the learned metric). In particular, visually our learned internal metric is capable of producing predicted error scores that are nicely aligned with the true average joint error when we have access to the ground truth.
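The difference between the full approach and the w/o-learned-metric variant can be summarized by the following sketch; the scoring interface of the learned metric regressor is an assumption made purely for illustration.

import numpy as np

def select_final_pose(candidates, metric_regressor=None):
    # candidates: array of shape (K, num_joints, 3), the K locally optimal
    # pose estimates obtained from the K distinct initial poses.
    # metric_regressor: hypothetical object whose predict() returns one
    # predicted error score per candidate (lower is better).
    candidates = np.asarray(candidates)
    if metric_regressor is not None:
        # Full approach: pick the candidate with the best predicted score.
        scores = metric_regressor.predict(candidates)
        return candidates[int(np.argmin(scores))]
    # Ablated variant: joint-wise average of all candidates in 3D Euclidean space.
    return candidates.mean(axis=0)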


Computational Efficiency All experiments discussed in this paper are performed on a desktop PC with an Intel i7-960 CPU and 24 GB of memory. At this moment, our CPU implementation achieves average run-time speeds of 83 FPS, 267 FPS, and 123 FPS for the fish, mouse, and hand tasks, respectively. Table 3 summarizes the run-time speed comparison with state-of-the-art hand pose estimators on the NYU hand dataset [45]. Our result of 123 FPS is obtained with CPU access only; nevertheless it is still comparable with most of these recent methods, which are based on GPUs. Note the empirical runtime speed of our approach could be further improved by exploiting the computing power of modern GPUs. Meanwhile, an exceptionally high-speed method is developed by Oberweger et al. [30], which is made possible by the usage of very shallow neural nets. This however comes with degraded performance, as shown in Table 2, with a significant average joint error increase of 5.49 mm when compared to our approach.

Common Pose Estimation Errors of Our Approach Although our Lie-X approach performs relatively well in practical pose estimation settings, it inevitably makes mistakes in practical situations. These common errors include the following: orientation flips, displacement along the z-direction, and sub-optimal shape fitting. A visual illustration of these common errors made by our approach is provided in Fig. 17. As can be observed, usually our Lie-Fish results are best aligned with the ground truths, the mistakes of Lie-Mouse are more noticeable, while the visual displacements of our Lie-Hand results from the ground truths are most significant. This is to be expected, as the corresponding complexity levels of the three tasks vary from relatively simple (i.e. kinematic chains) to complex (i.e. kinematic trees).

5.3 Tracking of Mouse

Our Lie-X approach is also examined on the tracking task using the mouse tracking dataset described beforehand. Compared with our single-frame based pose estimation of Alg. 1, it is of interest to examine how much we can gain from our tracking algorithm of Alg. 3 when temporal information is available. Empirically, our mouse tracker is shown to produce an improved performance of 7.19 mm over the 8.42 mm result from our pose estimator on single frames. This can also be observed from the bottom row of Fig. 18, where visual comparisons are provided at several time frames. By exploiting temporal information, the results of our tracker are shown to contain fewer dramatic mistakes than those of pose estimation. This is again evidenced quantitatively in Fig. 18, which presents a frame-by-frame average joint error comparison of tracking vs. pose estimation on a test sequence. Clearly there exist a number of very noisy predictions from pose estimation, whereas our tracking results are in general much less noisy. Overall, the tracking results outperform pose estimation with a noticeable gap of at least 1 mm. Note the tracking results in some frames are slightly inferior to those of the pose estimation counterpart, which we attribute to the utilization of averaging operations in our tracker.

5.4 Action Recognition of Fish

To demonstrate the application of our approach to action recognition tasks, in what follows we conduct experiments on the aforementioned fish action dataset. In addition to the proposed tangent vector (i.e. Lie algebra) based feature representation, as a comparison we also consider a joint-position based feature representation. Here the main body of the feature representation follows exactly the tangent vector representation, including e.g. the adoption of a temporal pyramid of 4, 2, 1, with the only change being that the corresponding 3D joint positions are employed instead of tangent vectors. This finally leads to an 888-dim feature vector representation.
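A rough sketch of such a temporal pyramid representation is given below; averaging the per-frame descriptors within each segment before concatenation is our own assumption, as the aggregation step is not spelled out in the text.

import numpy as np

def temporal_pyramid_feature(per_frame_features, levels=(4, 2, 1)):
    # per_frame_features: array of shape (T, D), one descriptor per frame
    # (e.g. flattened tangent vectors or 3D joint positions).
    # levels: number of equal-length segments at each pyramid level (4-2-1 here).
    # Sequences are assumed to have at least max(levels) frames.
    feats = np.asarray(per_frame_features, dtype=np.float64)
    T = feats.shape[0]
    parts = []
    for num_segments in levels:
        # Split the T frames into num_segments roughly equal chunks and
        # average the descriptors within each chunk.
        for chunk in np.array_split(np.arange(T), num_segments):
            parts.append(feats[chunk].mean(axis=0))
    return np.concatenate(parts)  # fixed-length vector regardless of T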

Fig. 19 displays our fish dataset, which contains nine unique action categories. The standard evaluation metric of average classification accuracy is considered in this context. Empirically, the comparison method utilizing joint position features achieves a performance of 79.19%, which is significantly outperformed by our approach based on tangent vector features with an average accuracy of 91.33%. Fig. 20


provides further information on category-wise errors in the form of confusion matrices. It is observed that the joint-position based method tends to confuse the subset of actions consisting of scoot, J-turn, C-turn, and R-turn, which are indeed more challenging to separate due to their inherent similarities. Nonetheless, the performance on this subset is dramatically improved in our approach with tangent vector based features. We hypothesize that by following the natural tangent vector representation, our approach gains the discriminative power to separate these otherwise troublesome action categories.

6 Conclusion and Future Work

A unified Lie group approach is proposed for the related key problems of pose estimation, tracking, and action recognition of diverse articulated objects from depth images. Empirically, our approach is evaluated on human hand, fish, and mouse datasets with very competitive performance. For future work, we plan to work with more diverse articulated objects such as the human full body and wild animals, as well as their interactions with other articulated objects and background objects.

Acknowledgment

The project is partially supported by A*STAR JCO grants 1431AFG120 and 15302FG149. Mouse and fish images were acquired with the help of Zoe Bichler, James Stewart, Suresh Jesuthasan, and Adam Claridge-Chang. Zilong Wang helped with the annotation of the mouse data, while Wei Gao and Ashwin Nanjappa helped with implementing the mouse baseline method.

References

[1] Agarwal, A., Triggs, B.: Recovering 3D human pose from monocular images. IEEE Trans. PAMI 28(1) (2006)

[2] Ali, K., Fleuret, F., Hasler, D., Fua, P.: Joint pose estimator and feature learning for object detection. In: ICCV (2009)

[3] Andriluka, M., Roth, S., Schiele, B.: People-tracking-by-detection and people-detection-by-tracking. In: CVPR (2008)

[4] Ballan, L., Taneja, A., Gall, J., Gool, L.V., Pollefeys, M.: Motion capture of hands in action using discriminative salient points. In: ECCV (2012)

[5] Barsoum, E.: Articulated hand pose estimation review. arXiv pp. 1–50 (2016)

[6] Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3D human pose annotations. In: ICCV (2009)

[7] Branson, K., Belongie, S.: Tracking multiple mouse contours (without too many samples). In: CVPR (2005)

[8] Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)

[9] Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.: Learning to rank using gradient descent. In: ICML (2005)

[10] Chen, L., Wei, H., Ferryman, J.: A survey on model based approaches for 2D and 3D visual human pose recovery. PRL 34(15), 1995–2006 (2013)

[11] Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: IEEE Workshop on PETS (2005)

[12] Dollar, P., Welinder, P., Perona, P.: Cascaded pose regression. In: CVPR (2010)

[13] Felzenszwalb, P., Huttenlocher, D.: Pictorial structures for object recognition. International Journal of Computer Vision 61(1), 55–79 (2005)

[14] Fleuret, F., Geman, D.: Stationary features and cat detection. JMLR 9, 2549–2578 (2008)

[15] Gall, J., Yao, A., Razavi, N., van Gool, L., Lempitsky, V.: Hough forests for object detection, tracking, and action recognition. IEEE Trans. PAMI 33(11), 2188–2202 (2011)

[16] Hinterstoisser, S., Lepetit, V., Ilic, S., Fua, P., Navab, N.: Dominant orientation templates for real-time detection of textureless objects. In: CVPR (2010)

[17] Hough, P.: Machine analysis of bubble chamber pictures. In: Proc. Int. Conf. High Energy Accelerators and Instrumentation (1959)

[18] Hsu, E.P.: Stochastic Analysis on Manifolds. AMS Press (2002)


[19] Huang, C., Allain, B., Franco, J., Navab, N., Boyer, E.: Volumetric 3D tracking by detection. In: CVPR (2016)

[20] Isard, M., Blake, A.: Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision 29(1), 5–28 (1998)

[21] Kalueff, A., Gebhardt, M., Stewart, A., Cachat, J., Brimmer, M., Chawla, J., Craddock, C., Kyzar, E., Roth, A., Landsman, S., Gaikwad, S., Robinson, K., Baatrup, E., Tierney, K., Shamchuk, A., Norton, W., Miller, N., Nicolson, T., Braubach, O., Gilman, C., Pittman, J., Rosemberg, D., Gerlai, R., Echevarria, D., Lamb, E., Neuhauss, S., Weng, W., Bally-Cuif, L., Schneider, H.: Towards a comprehensive catalog of zebrafish behavior 1.0 and beyond. Zebrafish 10(1), 70–86 (2013)

[22] Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR (2006)

[23] Lee, J.: Introduction to Smooth Manifolds. Springer (2003)

[24] Leibe, B., Leonardis, A., Schiele, B.: Combined object categorization and segmentation with an implicit shape model. In: ECCV Workshop on Statistical Learning in Computer Vision, pp. 17–32 (2004)

[25] Mahasseni, B., Todorovic, S.: Regularizing long short term memory with 3D human-skeleton sequences for action recognition. In: CVPR (2016)

[26] Manton, J.: A primer on stochastic differential geometry for signal processing. IEEE J. Sel. Topics Signal Processing 7(4), 681–99 (2013)

[27] Mikic, I., Trivedi, M.M., Hunter, E., Cosman, P.C.: Human body model acquisition and tracking using voxel data. International Journal of Computer Vision 53(3), 199–223 (2003)

[28] Murray, R., Sastry, S., Li, Z.: A Mathematical Introduction to Robotic Manipulation. CRC Press (1994)

[29] Nie, X., Xiong, C., Zhu, S.: Joint action recognition and pose estimation from video. In: CVPR (2015)

[30] Oberweger, M., Wohlhart, P., Lepetit, V.: Hands deep in deep learning for hand pose estimation. In: Computer Vision Winter Workshop (2015)

[31] Oberweger, M., Wohlhart, P., Lepetit, V.: Training a feedback loop for hand pose estimation. In: ICCV (2015)

[32] Oikonomidis, N., Argyros, A.: Efficient model-based 3D tracking of hand articulations using Kinect. In: BMVC (2011)

[33] Perez-Sala, X., Escalera, S., Angulo, C., Gonzalez, J.: A survey of human motion analysis using depth imagery. Sensors 14, 4189–4210 (2014)

[34] Poppe, R.: Vision-based human motion analysis: An overview. Comput. Vis. Image Underst. 108(1–2), 4–18 (2007)

[35] Procesi, C.: Lie groups: An approach through invariants and representations. Springer (2007)

[36] Qian, C., Sun, X., Wei, Y., Tang, X., Sun, J.: Realtime and robust hand tracking from depth. In: CVPR (2014)

[37] Rahmani, H., Mian, A.: 3D action recognition from novel viewpoints. In: CVPR (2016)

[38] Shotton, J., Girshick, R., Fitzgibbon, A., Sharp, T., Cook, M., Finocchio, M., Moore, R., Kohli, P., Criminisi, A., Kipman, A., Blake, A.: Efficient human pose estimation from single depth images. IEEE Trans. PAMI 35(12), 2821–40 (2013)

[39] Sinha, A., Choi, C., Ramani, K.: DeepHand: Robust hand pose estimation by completing a matrix imputed with deep features. In: CVPR (2016)

[40] Srivastava, A., Turaga, P., Kurtek, S.: On advances in differential-geometric approaches for 2D and 3D shape analyses and activity recognition. Image Vision Comput. 30(6–7), 398–416 (2012)

[41] Sun, X., Wei, Y., Liang, S., Tang, X., Sun, J.: Cascaded hand pose regression. In: CVPR (2015)

[42] Tan, D., Cashman, T., Taylor, J., Fitzgibbon, A., Tarlow, D., Khamis, S., Izadi, S., Shotton, J.: Fits like a glove: Rapid and reliable hand shape personalization. In: CVPR (2016)

[43] Tang, D., Taylor, J., Kohli, P., Keskin, C., Kim, T., Shotton, J.: Opening the black box: Hierarchical sampling optimization for estimating human hand pose. In: ICCV (2015)

[44] Tompson, J., Jain, A., LeCun, Y., Bregler, C.: Joint training of a convolutional network and a graphical model for human pose estimation. In: NIPS (2014)

[45] Tompson, J., Stein, M., Lecun, Y., Perlin, K.: Real-time continuous pose recovery of human hands using convolutional networks. SIGGRAPH (2014)


[46] Tuzel, O., Porikli, F., Meer, P.: Learning on Lie groups for invariant detection and tracking. In: CVPR (2008)

[47] Vemulapalli, R., Arrate, F., Chellappa, R.: Human action recognition by representing 3D skeletons as points in a Lie group. In: CVPR (2014)

[48] Vemulapalli, R., Chellappa, R.: Rolling rotations for recognizing human actions from 3D skeletal data. In: CVPR (2016)

[49] Wiltschko, A., Johnson, M., Iurilli, G., Peterson, R., Katon, J., Pashkovski, S., Abraira, V., Adams, R., Datta, S.: Mapping sub-second structure in mouse behavior. Neuron 88(6), 1121–35 (2015)

[50] Xiong, X., Torre, F.: Supervised descent method and its applications to face alignment. In: CVPR (2013)

[51] Xu, C., Cheng, L.: Efficient hand pose estimation from a single depth image. In: ICCV (2013)

[52] Xu, C., Nanjappa, A., Zhang, X., Cheng, L.: Estimate hand poses efficiently from single depth images. International Journal of Computer Vision pp. 1–25 (2015)

[53] Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. In: CVPR (2011)

[54] Zhou, X., Wan, Q., Zhang, W., Xue, X., Wei, Y.: Model-based deep hand pose estimation. In: IJCAI (2016)


Figure 9: Visual comparison of fish and mouse examples. Here pose estimates of RF, CNN, as well as our Lie-X approach are compared together with the respective human-annotated ground truths. Panels (a)–(f) present six fish examples, followed by panels (g)–(l) for six exemplar mouse results. In each of the twelve panels, the top row displays the full top view together with one or two zoom-in visual examinations, while the bottom row also provides a side view. Best viewed in color.


Figure 10: Visual comparison of hand examples. Here pose estimates of RF, CNN, as well as our Lie-X approach are compared together with the respective human-annotated ground truths. In each of the four panels, the top row displays the full top view together with four zoom-in visual examinations, while the bottom row displays side views of the respective methods. Best viewed in color.


Figure 11: Visual comparison of Lie-X and the state-of-the-art methods (ground truth, Lie-X, Oberweger et al. [30], Oberweger et al. [31], and Zhou et al. [54]) on the same four input hand images presented in Fig. 10. In each of the panels, the corresponding example is presented with four zoom-in visual examinations. Best viewed in color.


Figure 12: Visual comparison of Lie-X results and the state-of-the-art methods on ten additional hand examples. Each column presents an example, while each row displays results from a particular competing method (from top to bottom: ground truth, Lie-X, Oberweger et al. [30], Oberweger et al. [31], and Zhou et al. [54]). Best viewed in color.


Figure 13: An example of fish pose estimation that visually illustrates the inner working of the learned internal metric in our approach, applied onto the Kt = 40 pose candidates of the input depth image. Here green colored numbers correspond to the scores (lower is better) and ranking results obtained by applying the learned metric, while red colored numbers denote the corresponding actual evaluation scores and ranking results obtained by engaging the empirical evaluation metric of average joint error when we have access to the ground-truth annotations. See text for details.


Figure 14: An example of mouse pose estimation that visually illustrates the inner working of the learned internal metric in our approach, applied onto the Kt = 40 pose candidates. Here green colored numbers refer to the scores (lower is better) and ranking results obtained by applying the learned metric, while red colored numbers are the corresponding actual evaluation scores and ranking results obtained by engaging the empirical evaluation metric of average joint error when we have access to the ground-truth annotations. See text for details.


Figure 15: An example of hand pose estimation that visually illustrates the inner working of the learned internal metric in our approach, applied onto the Kt = 20 pose candidates. Here green colored numbers present the scores (lower is better) and ranking results obtained by applying the learned metric, while red colored numbers denote the corresponding actual evaluation scores and ranking results obtained by engaging the empirical evaluation metric of average joint error when we have access to the ground-truth annotations. See text for details.


Figure 17: Visual examples of common pose estimation errors made by our Lie-X approach, including orientation flips, displacement along the z-direction, and sub-optimal shape fits. Columns (a), (b), and (c) present an exemplar depth image of fish, mouse, and hand, respectively, while the first and second rows display its top and side views (ground truth vs. Lie-X).


Figure 18: Pose estimation vs. tracking: a comparison of the average joint error (mm) over the frames of a mouse test sequence when employing our pose estimation (Alg. 1) vs. tracking (Alg. 3) modules. The horizontal dotted lines in green and blue are the respective mean errors of the pose estimation and tracking results. Visual comparisons of the ground truth, pose estimation result, and tracking result at various time frames are presented in the bottom row.


Figure 19: Key frames from the nine distinct fish action categories considered in our experiments: (a) Scoot, (b) J-turn, (c) C-turn, (d) R-turn, (e) Surface, (f) Dive, (g) Zigzag, (h) Thrash, and (i) Freeze. The colored dots display the trajectory of the fish motions, where blue and red mark the start and end of the action, respectively. Note that for a better illustration of the distinct fish action categories, (e) and (f) present a side view of the surface and dive actions, while a top view is adopted for the remaining action types.


Figure 20: Action recognition confusion matrices on our fish action dataset: (a) is for joint position based features, while (b) is for tangent vector based features. Their overall performance in terms of average classification accuracy is 79.19% for (a) and 91.33% for (b), respectively.
