

MULTIPLE MODEL-FREE OBJECT TRACKING WITH AN RGB-D CAMERA

Master's thesis (Masterarbeit) submitted by

Shile Li

born on 06.04.1990, residing at:
Helene-Mayer-Ring 7, 80809 München
Tel.: 0176 91450017

Lehrstuhl für Steuerungs- und Regelungstechnik
Technische Universität München
Univ.-Prof. Dr.-Ing./Univ. Tokio Martin Buss

Supervisors: Prof. Ph.D. Dongheui Lee, Ph.D. Seongyong Koo

Start: 06.10.2014
Intermediate report: 06.02.2015
Submission: 06.04.2015


Abstract

This thesis presents a novel point-cloud descriptor for robust and real-time tracking of multiple objects without any prior object knowledge. Following the framework of incremental model-free multiple object tracking from previous work, the 6 DoF pose of each object is first estimated from the input point-cloud data, the data is then segmented according to the estimated objects, and the incremental model of each object is updated from the segmented point clouds. Here, we propose the Joint Color-Spatial Descriptor (JCSD) to make the pose hypothesis evaluation against the point-cloud scene more robust in the first, pose estimation step. The method outperforms widely used point-to-point comparison methods, especially in partially occluded scenes, which frequently occur in dynamic object manipulation. By means of the robust descriptor, we achieve an unsupervised multiple object segmentation accuracy higher than 99%. Model-free multiple object tracking is implemented using a particle filter with JCSD as the likelihood function. The likelihood function is implemented on the GPU, thus facilitating real-time tracking of multiple objects.


Contents

1 Introduction
 1.1 Motivation and Challenge
 1.2 Contribution
 1.3 Outline

2 Background
 2.1 Point-cloud Representation
 2.2 Tracking with Bayesian Filtering Method
  2.2.1 Optimal Bayesian Estimation
  2.2.2 Particle Filter
 2.3 Hypothesis Evaluation
  2.3.1 Global Method
  2.3.2 Local Method

3 Method
 3.1 Model-Free Object Tracking System
 3.2 Joint Color-Spatial Descriptor
  3.2.1 Smoothed Color Ranging
  3.2.2 Spatial Representation of an Object
  3.2.3 Joint Color-Spatial Representation
 3.3 Hypothesis Evaluation with JCSD
 3.4 Point-cloud Segmentation and Model Update
 3.5 GPU Implementation

4 Experiment and Result
 4.1 Tracking accuracy
 4.2 Segmentation accuracy
 4.3 Computation time

5 Conclusion

List of Figures

Bibliography


Chapter 1

Introduction

Object tracking is one of the core functions that enables autonomous robots to perform intelligent applications such as learning from human demonstrations [LN10], especially for grasping [PKAW13][SLP11] and manipulation [KHRF11]. From sequences of visual sensor data, a robot's object tracking module identifies the target object of interest and keeps track of the target's pose relative to the sensor coordinate system. Object tracking using 2D video has been investigated for the past three decades [YJS06][YSZ+11]. With the emergence of inexpensive consumer RGB-D cameras such as the Microsoft Kinect, interest in object tracking using 3D data is growing. Compared to a monocular camera, an RGB-D camera provides additional 3D geometric information about the environment. This opens up many possibilities for developing new tracking algorithms; nevertheless, the additional dimension of sensor data also increases the complexity of the tracking problem and affects real-time performance. In this master's thesis, we investigate the object tracking problem using point-cloud data from an RGB-D camera.

1.1 Motivation and Challenge

Object tracking is a challenging problem; its difficulty stems from two main aspects: the availability of target object information and the uncertainty of the measurement data. In most previous works, a robot requires full knowledge of the object model, which must be learnt offline before the tracking process begins [CC13][PW15][BC12] (Fig. 1.1).

However, in many cases, when a robot faces a new environment, prior information about the tracking targets is not available [PKAW13][KLK13][KLK14]. Furthermore, in a dynamic environment objects can easily be occluded by obstacles or by parts of the robot itself from the viewpoint of the camera mounted on the robot. These difficulties call for a model-free and robust tracking method, one of which was proposed in [KLK14]. Assuming that only partial knowledge (such as the visible part of the object from a specific viewpoint) of the target object is provided, the tracking process using point-cloud data has


(a) Model-based 3D tracking from [RC11] (b) Model-based 3D tracking from [CC13]

Figure 1.1: Model-based 3D tracking examples; in these cases, full object knowledge is provided before tracking starts.

(a) Model-free 3D tracking from [PKAW13] (b) Model-free 3D tracking from [KLK13]

Figure 1.2: Model-free 3D tracking examples; in these cases, the object knowledge must be updated during tracking.

three main steps: pose tracking, point-cloud segmentation, and model update (Fig. 1.3).

Pose tracking estimates the target object's 6 DoF pose in the newly arrived point-cloud data. Point-cloud segmentation then groups the data according to the estimated objects' poses (as an example, Fig. 1.4 shows the segmentation result for four individual objects that are in contact with each other). The model update changes the current object knowledge (object model) so that it is adapted to the new observation based on the estimated pose. These steps form a closed-loop system in which each step needs the result of another step as input. In this closed-loop system, inaccurate estimates from any step can accumulate during tracking and lead to tracking failure. It is therefore important to enhance accuracy and robustness in all steps to obtain better tracking performance.


Figure 1.3: Closed-loop tracking system with three main steps: pose tracking, point-cloud segmentation and model update, driven by the sensor data.

Pose tracking

Pose tracking is the most important of the steps mentioned above, because it is the first step in the closed-loop system: inaccurate pose tracking affects the point-cloud segmentation. Inaccurate point-cloud segmentation falsely labels points from the background or from other objects as the target object; consequently, updating the model with this noisy segmentation data results in inaccurate object model knowledge.

Pose tracking with recursive Bayesian estimation is the most commonly used approach [YJS06][YSZ+11], because Bayesian estimation can smooth the noisy pose trajectory caused by noisy measurements. In a recursive Bayesian system, the target's pose is represented as a hidden state, and for each new frame the pose estimation applies two steps: prediction and update. First, the target's pose is predicted using the prior probability from the dynamic motion model and previous pose estimates. Then the predicted pose is updated with the new measurement data.

Among the different Bayesian filtering methods, the particle filter [GSS93a] is popular for tracking problems [NKMVG03][BC12][CC13], because it makes no assumption of linear object motion or Gaussian measurement noise, and object movement in a human environment is usually non-linear. In a particle filter framework, the prior probability of the target's pose is represented by multiple particles. Each particle consists of a pose hypothesis and a corresponding weight. The particle weight is determined by a


(a) Input point cloud (b) Object segmentation; the four objects are indicated in different colors

Figure 1.4: An example of a multiple object segmentation result in the case of four individual objects that are in contact with each other.

hypothesis evaluation function that estimates how close the hypothesis pose is to the given observation data. The object's current pose is then estimated as the weighted average of the particle states. With point-cloud data, the hypothesis evaluation is performed by comparing a set of measurement points in the neighborhood of the pose hypothesis (the hypothesis data) with the current knowledge of the object. Using a proper evaluation function is crucial, because it directly influences the pose tracking accuracy. However, designing a robust and efficient hypothesis evaluation method is challenging, since the object knowledge cannot be fully specified and the sensor data are often noisy and partially missing.

Some previous approaches [CC13][BC12][PW15] use point-to-point (pixel-to-pixel) correspondences for likelihood estimation. The correspondences are determined either by nearest-neighbour search or by selecting the points projected onto the same image coordinate. They estimate a likelihood for each point pair and sum the results over all pairs. The point-to-point correspondence method works in non-occlusion cases, but when partial occlusion occurs, the likelihood evaluation is affected and drifts from the true likelihood, resulting in inaccurate pose estimation. This thesis therefore aims to devise a novel pose hypothesis evaluation method that improves both tracking accuracy and segmentation robustness.

Object model representation

How should the current knowledge of the target object be represented? The ideal way to preserve as much object information as possible is to use the full set of points. However, the computational complexity of processing unorganized point-cloud data is usually higher than that of a 2D image.


Therefore Koo et al. use a parametric model in [KLK13] to simplify the object, where the object is represented by a Gaussian Mixture Model (GMM). Papon et al. use their supervoxel concept to describe an object. While these object models reduce the computational complexity, a simplified object model blurs the boundary shape of the object, which causes false segmentation in object contact cases (e.g. in Fig. 1.2b, a part of the hand area is included in the Gaussian component of the white box). In this thesis, we aim at precise point-cloud segmentation; we therefore use the point cloud itself as the object model and reduce the computational complexity to achieve real-time performance.

Model update

Model update is necessary in a highly dynamic scenario. Because the visible part of an object can change during tracking, previously unseen parts of the object can appear in new sensor data. The object model should adapt to these parts to maintain stable tracking. As new sensor data arrives, points from the frame are associated with different objects based on the estimated object poses (point-cloud segmentation, Fig. 1.4). Some approaches update the object model using all new segmentation data, i.e. all points (pixels) of the new sensor data labelled as that object, in order to quickly adapt the model to the new data [KLK13][PKAW13][KLK14] (Fig. 1.2b). However, using all segmentation data decreases the stability of the object model against pose tracking errors, which can associate parts of other objects or of the background with the object model and further degrade tracking in the next time steps (Fig. 1.3).

1.2 Contribution

In this thesis, in order to perform hypothesis evaluation, we propose the Joint Color-Spatial Descriptor (JCSD), which represents the probability density of a measurement point in the joint color-spatial space.

• Compared to widely used point-to-point correspondence methods [CC13][PW15], where the likelihood is estimated by summing the likelihood of each point pair (in terms of color coherence, spatial distance, etc.), pose hypothesis evaluation with JCSD performs better, especially in partially occluded scenes, which frequently occur in dynamic object manipulation. Moreover, it is computationally efficient enough to rebuild the descriptor at each model update step, thus enabling real-time performance.

• A new model update scheme is also proposed that smoothly adds newly segmented point-cloud data into the object model, maintaining its adaptivity to the new sensor data and, at the same time, its robustness against segmentation errors. Compared to [PKAW13][KLK13][KLK14], which use simplified object models (Gaussian Mixture Models or supervoxels), the precise object boundary shape is retained, and thus higher object segmentation accuracy is achieved.


• The proposed method was implemented on the GPU using particle filtering [GSS93a][DDFG01] with JCSD as the likelihood function. The large number of hypotheses required by particle filtering can be evaluated in real time while tracking multiple objects at the same time. Moreover, the experimental results show that a segmentation accuracy higher than 99% was achieved.

1.3 Outline

This thesis is organized as follows. Background theory is provided in Chapter 2. The proposed multiple model-free tracking system with the novel feature descriptor JCSD is described in Chapter 3. The performance of our approach is shown with several experiments in Chapter 4. Finally, a conclusion is given in Chapter 5.


Chapter 2

Background

In this chapter we provide the background theory for our tracking method. First, the point-cloud representation is introduced, because this is the type of data used throughout the entire tracking process. Next, tracking based on Bayesian filtering is presented, with an emphasis on the particle filtering method used in our approach. Finally, we compare some global and local feature-based hypothesis evaluation methods from the literature.

2.1 Point-cloud Representation

The Kinect, the depth sensing device used in our proposed framework, calibrates the acquired point-cloud data with the 2D color image. The raw data we acquire is therefore a colored point cloud, where each point contains both 3D geometric and color information (Fig. 2.1). The point cloud is a widely used 3D data representation; it consists of a set of points in a certain 3D coordinate system (the camera coordinate system for the raw data). These points typically represent the outer surfaces of the objects in the real scene that are visible from the camera's viewpoint. If we define one point as $\mathbf{p}_i$:

$$\mathbf{p}_i = \begin{pmatrix} x_i \\ y_i \\ z_i \end{pmatrix}.$$

The point cloud $\mathbf{P}$ can then be represented as a matrix:

$$\mathbf{P} = (\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3, \dots, \mathbf{p}_n) = \begin{pmatrix} x_1 & x_2 & x_3 & \dots & x_n \\ y_1 & y_2 & y_3 & \dots & y_n \\ z_1 & z_2 & z_3 & \dots & z_n \end{pmatrix},$$

where $n$ is the number of points in $\mathbf{P}$. Usually, the acquired point-cloud data also contains color information in RGB color space; the point $\mathbf{p}_i$ can then be rewritten with increased dimensionality as $\mathbf{p}_i = (x_i, y_i, z_i, r_i, g_i, b_i)^T$. However, in this thesis we only use the former representation of a point; the color information is extracted separately in the feature estimation stage.
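As a concrete illustration (our own sketch, not code from the thesis), a colored point cloud in this representation can be held as two aligned NumPy arrays:

```python
import numpy as np

# A point cloud P with n points: a 3 x n matrix of positions and,
# kept separately as in the thesis, a 3 x n matrix of RGB colors in [0, 1].
n = 5
P = np.random.rand(3, n)        # rows: x, y, z
colors = np.random.rand(3, n)   # rows: r, g, b

# Accessing the i-th point p_i = (x_i, y_i, z_i)^T:
i = 2
p_i = P[:, i]
print(p_i.shape)  # (3,)
```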


(a) View from the camera position. (b) Side view.

Figure 2.1: An example colored point-cloud of a hand touching a book.

3D Transformation

A common operation on a point cloud is the 3D transformation. A 3D transformation is a combination of a rotation $R \in \mathbb{R}^{3\times3}$ and a translation $\mathbf{t} \in \mathbb{R}^{3\times1}$. The point $\mathbf{p} = (p_x, p_y, p_z)^T$ after the transformation is $\mathbf{p}' = (p'_x, p'_y, p'_z)^T$:

$$\mathbf{p}' = R \cdot \mathbf{p} + \mathbf{t}. \tag{2.1}$$

If we represent the 3-dimensional position of $\mathbf{p}$ by the 4-dimensional vector $(p_x, p_y, p_z, 1)^T$, the transformation can be expressed with a single transformation matrix $T \in \mathbb{R}^{4\times4}$ as:

$$\begin{pmatrix} p'_x \\ p'_y \\ p'_z \\ 1 \end{pmatrix} = T \cdot \begin{pmatrix} p_x \\ p_y \\ p_z \\ 1 \end{pmatrix},$$

where $T$ combines $R$ and $\mathbf{t}$:

$$T = \begin{pmatrix} R_{1,1} & R_{1,2} & R_{1,3} & t_1 \\ R_{2,1} & R_{2,2} & R_{2,3} & t_2 \\ R_{3,1} & R_{3,2} & R_{3,3} & t_3 \\ 0 & 0 & 0 & 1 \end{pmatrix}. \tag{2.2}$$

This representation is computationally convenient because successive transformations can be combined into a single matrix by multiplication, and we can simply use the inverse of $T$ to transform $\mathbf{p}'$ back to $\mathbf{p}$:

$$\mathbf{p} = T^{-1} \cdot \mathbf{p}'.$$
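The following sketch is our own illustration of Eqs. (2.1)-(2.2), not code from the thesis: it builds a homogeneous transformation matrix T from R and t, applies it to a 3 x n point matrix, and recovers the original points with the inverse.

```python
import numpy as np

def make_transform(R, t):
    """Build the 4x4 homogeneous matrix T from a 3x3 rotation R and a translation t (Eq. 2.2)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def transform_cloud(T, P):
    """Apply p' = R p + t to every column of the 3xn matrix P via homogeneous coordinates."""
    P_h = np.vstack([P, np.ones((1, P.shape[1]))])  # append the homogeneous 1 to each point
    return (T @ P_h)[:3, :]

# Example: rotate 90 degrees about z and translate by (1, 0, 0).
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])
t = np.array([1.0, 0.0, 0.0])
T = make_transform(R, t)

P = np.random.rand(3, 10)
P_prime = transform_cloud(T, P)
P_back = transform_cloud(np.linalg.inv(T), P_prime)  # p = T^-1 p'
assert np.allclose(P, P_back)
```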


K-d Tree Search

In most 3D point-cloud applications, the nearest neighbour of a point in a source cloud needs to be retrieved from a target cloud (source and target cloud can be the same). Compared to 2D image based algorithms, nearest-neighbour search is more complex for an unorganized point cloud: the points are simply listed in a matrix, without any information about the points around them. For a point cloud with n points, the nearest-neighbour search for one point requires n−1 operations with brute-force search, resulting in O(n) complexity. The common way to improve this is a k-d tree search [Ben75]. A k-d tree of a point cloud is a binary tree in which every node represents a point (Fig. 2.2). Each non-leaf node has a splitting axis and can be thought of as splitting the point cloud into two parts with the plane that passes through the point and is orthogonal to that axis. For example, if a node splits along the x-dimension, the left subtree of the node contains points with a lower x-value than the split node, and the right subtree contains points with a higher x-value.

Figure 2.2: An example of k-d tree. a) Partition of a 128×128 2-dimensional space usingk-d tree. b) The resulting tree structure. 2

When retrieving nearest points with a k-d tree, a large portion of the search space can be excluded quickly, because branches of the tree that cannot contain points closer to the query point than the current best can be eliminated. With k-d tree search, the average computational complexity of a nearest-neighbour search is O(log n), which is much better than brute-force search.

2This image is from the website http://algoviz.org/OpenDSA/dev/OpenDSA/Modules/KDtree.odsa.
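As an illustration of the k-d tree speed-up (our own sketch, not from the thesis), SciPy's cKDTree can stand in for the k-d tree of [Ben75]; the random clouds are placeholders.

```python
import numpy as np
from scipy.spatial import cKDTree

target = np.random.rand(10000, 3)   # target cloud with n points
source = np.random.rand(100, 3)     # query points

tree = cKDTree(target)                   # build once
dists, idxs = tree.query(source, k=1)    # average O(log n) per query

# Brute-force comparison for the first query point: O(n)
brute_idx = np.argmin(np.linalg.norm(target - source[0], axis=1))
assert brute_idx == idxs[0]
```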


2.2 Tracking with Bayesian Filtering Method

Object tracking aims to estimate the pose trajectory of moving objects from a sequence of input data. If we estimate the object pose for each frame individually, the trajectory will be noisy due to measurement noise and pose estimation error. However, if we also use prior information obtained from previous frames, the noisy trajectory can be smoothed using Bayesian filtering.

2.2.1 Optimal Bayesian Estimation

In a Bayesian framework, the object's 6 DoF pose is defined as the hidden state; the state vector at time step $t$ is defined as $X_t = (x, y, z, roll, pitch, yaw)$. At time step $t$, we can observe the object's state through an observation variable $Z_t$, obtained by another object pose estimation method using single-frame data. The Bayesian approach aims to estimate the current object state $X_t$ using the observations $\{Z_i\}_{i=1}^{t}$. The object state estimation problem then becomes the estimation of the posterior probability density function (pdf) $p(X_t \mid \{Z_i\}_{i=1}^{t})$. This Bayesian model is illustrated in Fig. 2.3, where the random variables are shown as nodes and the probabilistic relationships as arrows. The transition between two consecutive states is modelled as a possibly non-linear function:

$$X_t = f_t(X_{t-1}, v_{t-1}), \tag{2.3}$$

and the observation model for Zt is defined as:

$$Z_t = h_t(X_t, u_{t-1}), \tag{2.4}$$

where $v_{t-1}$ and $u_{t-1}$ are i.i.d. noise vectors.

Figure 2.3: Graphical model for Bayesian tracking method

The estimation of the posterior pdf $p(X_t \mid \{Z_i\}_{i=1}^{t})$ is performed recursively in two steps: a prediction step using prior information and an update step using the current measurement data. The predicted pdf at time step $t$, $p(X_t \mid \{Z_i\}_{i=1}^{t-1})$, is estimated via the Chapman-Kolmogorov equation:

$$p(X_t \mid \{Z_i\}_{i=1}^{t-1}) = \int p(X_t, X_{t-1} \mid \{Z_i\}_{i=1}^{t-1})\, dX_{t-1} = \int p(X_t \mid X_{t-1})\, p(X_{t-1} \mid \{Z_i\}_{i=1}^{t-1})\, dX_{t-1}, \tag{2.5}$$


where $p(X_{t-1} \mid \{Z_i\}_{i=1}^{t-1})$ is the posterior pdf at time step $t-1$ and $p(X_t \mid X_{t-1})$ is the state transition probability (Eq. 2.3).

After the prediction step, the state pdf is updated using the new observation $Z_t$ to correct the uncertainty of the prediction. Using Bayes' rule, the update step is:

$$p(X_t \mid \{Z_i\}_{i=1}^{t}) = \frac{p(Z_t \mid X_t)\, p(X_t \mid \{Z_i\}_{i=1}^{t-1})}{p(Z_t \mid \{Z_i\}_{i=1}^{t-1})}, \tag{2.6}$$

where $p(Z_t \mid X_t)$ is the likelihood of $X_t$ obtained from Eq. 2.4. The denominator of Eq. 2.6 is:

$$p(Z_t \mid \{Z_i\}_{i=1}^{t-1}) = \int p(Z_t \mid X_t)\, p(X_t \mid \{Z_i\}_{i=1}^{t-1})\, dX_t, \tag{2.7}$$

which is a constant normalization factor. With the posterior pdf $p(X_t \mid \{Z_i\}_{i=1}^{t})$, an optimal estimate of the object state $X_t$ can be obtained with the Maximum A Posteriori estimator:

$$\hat{X}_t = \arg\max_{X_t} \, p(X_t \mid \{Z_i\}_{i=1}^{t}). \tag{2.8}$$

The Minimum Mean Square Error estimator would be:

$$\hat{X}_t = \mathbb{E}\big[X_t \mid \{Z_i\}_{i=1}^{t}\big] = \int X_t \, p(X_t \mid \{Z_i\}_{i=1}^{t}) \, dX_t. \tag{2.9}$$

However, computing these two estimators with optimal Bayesian filtering is in most cases infeasible: an analytical solution does not exist because of the non-linear observation function (Eq. 2.4) and the non-Gaussian noise in the dynamic models. Particle filtering provides a suboptimal, approximate solution. In the next subsection, we describe object tracking using the particle filter.

2.2.2 Particle Filter

Because an analytical solution to the optimal Bayesian filtering problem is hard to obtain (Eq. 2.9), a suboptimal approximate solution is needed. In contrast to other approximate methods, such as the Extended Kalman Filter (EKF) and the Unscented Kalman Filter [Che03], the particle filter is better suited to non-linear system dynamics and non-Gaussian noise, which is the case for tracking in a human environment.

The particle filter recursively approximates the posterior pdf $p(X_t \mid \{Z_i\}_{i=1}^{t})$ using a finite number of possible system states [AMGC02]. The $N$ state samples $\{X_t^{(i)}\}_{i=1}^{N}$ have corresponding weights $\{\pi_t^{(i)}\}_{i=1}^{N}$, and each sample-weight pair forms a particle:

$$S_t^{(i)} = \langle X_t^{(i)}, \pi_t^{(i)} \rangle.$$


The weight reflects the likelihood of its corresponding sample state:

$$\pi_t^{(i)} \propto p(Z_t \mid X_t^{(i)}). \tag{2.10}$$

The weights are normalized such that $\sum_{i=1}^{N} \pi_t^{(i)} = 1$; the state estimate is then

$$\hat{X}_t = \sum_{i=1}^{N} \pi_t^{(i)} X_t^{(i)}. \tag{2.11}$$

In this thesis, we use the Sampling Importance Resampling (SIR) particle filter [GSS93b] to select particles at each time step. The whole SIR particle filter is illustrated in Fig. 2.4 for a 1-dimensional state estimation example. The pink area represents the true pdf and each circle represents a particle, where the circle's position is its state and its size its weight. In the following, we describe 6 DoF pose tracking with the SIR filter step by step.

Figure 2.4: A graphical representation of the Sampling Importance Resampling (SIR) particle filter for 1-dimensional state estimation (initialization, evaluation, estimation, resampling, prediction).

Initialization At the first time step, the particles are randomly sampled in the state space. For our tracking problem, we usually obtain an initial object pose $X_{init}$ from another object


detection module; we then sample a set of particle states $\{X_{init}^{(i)}\}_{i=1}^{N}$ around the initial state with equal particle weights:

$$X_{init}^{(i)} \sim \mathcal{N}(X_{init}, C_{init}), \qquad \pi_{init}^{(i)} = \frac{1}{N}, \tag{2.12}$$

where $C_{init}$ is a covariance matrix that determines the spread of the initially sampled particle states. We assume that the dimensions of the 6 DoF pose are independent, so $C_{init}$ is a diagonal matrix:

$$C_{init} = \mathrm{diag}\big(\sigma_x^{(init),2}, \sigma_y^{(init),2}, \sigma_z^{(init),2}, \sigma_{roll}^{(init),2}, \sigma_{pitch}^{(init),2}, \sigma_{yaw}^{(init),2}\big).$$

Evaluation At time step $t$, we have a set of particle states $\{X_t^{(i)}\}_{i=1}^{N}$ sampled near the region of the true state. We then evaluate the likelihood of each particle state, $p(Z_t \mid X_t^{(i)})$, using an application-specific evaluation function, which reflects the true pdf (pink area in Fig. 2.4). In this thesis, we propose the Joint Color-Spatial Descriptor to evaluate $p(Z_t \mid X_t^{(i)})$ (Sec. 3.2, Sec. 3.3). The particles' weights are then estimated using Eq. 2.10 and normalized such that $\sum_{i=1}^{N} \pi_t^{(i)} = 1$.

Estimation The object's pose state is then estimated using Eq. 2.11 by averaging the particle states based on their weights, which approximates Eq. 2.9. In Fig. 2.4 the state estimate $\hat{X}_t$ is shown as the red circle.

Resampling With the SIR particle filter, a new set of particle states $\{X_{t+1}^{(i)}\}_{i=1}^{N}$ is generated for the next time step. The new set of particles is generated based on the posterior pdf $p(X_{t+1} \mid X_t, Z_t)$. This is achieved by selecting old particle states and diffusing the selected states. The detailed procedure is described in Algorithm 1.

The diffusion covariance $C_t$ is calculated from the velocity of the object pose, where the velocity at time step $t$, $\Delta X_t$, is

$$\Delta X_t = \hat{X}_t - \hat{X}_{t-1}. \tag{2.13}$$

The diffusion covariance is then determined as

$$C_t = \alpha \cdot \Delta X_t \Delta X_t^T + \beta \cdot C_{init}, \tag{2.14}$$

where $\alpha$ and $\beta$ are two tuning parameters that remain constant during tracking; they balance the influence of the estimated velocity and the initial covariance. In this way, particle state dimensions with a large velocity are diffused with a larger variance in the resampling process, because the uncertainty in these dimensions is larger.


Algorithm 1 Resample a new set of particles
Input: $\{X_t^{(i)}\}_{i=1}^{N}$ and $\{\pi_t^{(i)}\}_{i=1}^{N}$
Output: $\{X_{t+1}^{(i)}\}_{i=1}^{N}$
- Initialize the cumulative distribution function (CDF): $c_0 = 0$.
for $i = 1 : N$ do
 - Build the CDF: $c_i = c_{i-1} + \pi_t^{(i)}$.
end for
for $i = 1 : N$ do
 - Generate a random variable $u_i \sim \mathcal{U}[0,1)$.
 - Find the corresponding particle index $j$ by binary search such that $c_{j-1} \le u_i < c_j$.
 - Sample a new particle state using $X_t^{(j)}$ and the diffusion covariance matrix at time step $t$: $X_{t+1}^{(i)} \sim \mathcal{N}(X_t^{(j)}, C_t)$.
 - Reset the new particle's weight: $\pi_{t+1}^{(i)} = \frac{1}{N}$.
end for
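A minimal NumPy sketch of Algorithm 1, including the diffusion covariance of Eq. (2.14); this is our own illustration, and the parameter values (alpha, beta, C_init) are arbitrary placeholders rather than values from the thesis.

```python
import numpy as np

def diffusion_cov(dX, C_init, alpha=0.5, beta=0.5):
    """C_t = alpha * dX dX^T + beta * C_init (Eq. 2.14); alpha, beta are placeholders."""
    return alpha * np.outer(dX, dX) + beta * C_init

def resample(states, weights, C_t, rng=None):
    """SIR resampling: draw new 6-DoF particle states around selected old ones (Algorithm 1)."""
    rng = rng or np.random.default_rng()
    N = states.shape[0]
    cdf = np.cumsum(weights)                          # build the CDF c_i
    u = rng.random(N)                                 # u_i ~ U[0, 1)
    j = np.searchsorted(cdf, u, side="right")         # binary search: c_{j-1} <= u_i < c_j
    j = np.minimum(j, N - 1)                          # guard against floating-point round-off
    new_states = np.array([rng.multivariate_normal(states[k], C_t) for k in j])
    new_weights = np.full(N, 1.0 / N)                 # reset weights to 1/N
    return new_states, new_weights

# Toy usage with made-up numbers
N = 100
states = np.random.randn(N, 6)
weights = np.full(N, 1.0 / N)
C_init = np.diag([0.01] * 3 + [0.005] * 3)
dX = np.array([0.02, 0.0, 0.0, 0.0, 0.0, 0.01])       # estimated velocity (Eq. 2.13)
new_states, new_weights = resample(states, weights, diffusion_cov(dX, C_init))
```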

Prediction Using the velocity estimated in Eq. 2.13, we can also predict the pdf for the next time step. In Fig. 2.4, the pdf is predicted to move to the right, and the resampled particles are shifted according to the predicted direction. Specifically, in our tracking algorithm we predict the resampled particle states as

$$X_{t+1}^{(i)} \leftarrow X_{t+1}^{(i)} + \Delta X_t. \tag{2.15}$$

However, this prediction is noisy because natural movement is not well predicted by a linear system model; therefore, in this thesis we apply the prediction to only 50% of the particles.

The parts described above constitute the Sampling Importance Resampling (SIR) particle filter. They are performed recursively as each new frame of visual data arrives, providing an estimate of the object pose at every time step.

2.3 Hypothesis Evaluation

Hypothesis evaluation is a core part of a tracking algorithm based on particle filtering. Given the hypothesis object pose state $X_t^{(i)}$ and the current measurement data $Z_t$ (the visual information from the camera), hypothesis evaluation estimates a similarity score $s_i \propto p(Z_t \mid X_t^{(i)})$, from which the particle's weight is calculated as $\pi_i \propto p(Z_t \mid X_t^{(i)}) \propto s_i$. From Eq. 2.11 we can see that hypothesis evaluation is directly relevant to the accuracy of the object pose estimation.


Assume that we have full or partial knowledge of the object model. For the hypothesis evaluation of $X_t^{(i)}$, we place the object model into the current video data according to its state $X_t^{(i)}$ (for 2D video, the object state can be the image-coordinate position and the object size in pixels; for our approach with point-cloud video, the object state is the 6 DoF object pose). We define the measurement data involved with the hypothesis state $X_t^{(i)}$ as the hypothesis data. For 2D video, the hypothesis data could be the image pixels inside a bounding shape whose center is the hypothesis state's position and whose size is the hypothesis state's size. For point-cloud video, the hypothesis data could be the points in the vicinity of the placed object model. The score value $s_i$ evaluates the degree of consistency between the placed object model and the hypothesis data. Various evaluation functions have been proposed for both 2D and RGB-D video; these methods can be roughly categorized into global and local comparison methods.

2.3.1 Global Method

Given the object model and the hypothesis data, global methods usually estimate one global feature for the object model and one for the hypothesis data, and evaluate the hypothesis by comparing these two features.

Usually, the global feature descriptor builds a color histogram as a statistical model of the object appearance [NKMVG03][CRM03]. Nummiaro et al. [NKMVG03] use a bounding ellipse to model the object's position and size in 2D video, where $\mathbf{y}$ is the ellipse center position and $H_x$ and $H_y$ are the half axis lengths. The function $h(\mathbf{x}_i)$ assigns pixel $\mathbf{x}_i$ a histogram bin index. They use a color histogram with $8 \times 8 \times 4 = 256$ bins in HSV color space because of its illumination-invariance property. The value of each histogram bin $\{p_{\mathbf{y}}^{(u)}\}_{u=1,\dots,256}$ is calculated as:

$$p_{\mathbf{y}}^{(u)} = f \sum_{i=1}^{I} k\!\left(\frac{\|\mathbf{y} - \mathbf{x}_i\|}{a}\right) \delta\big[h(\mathbf{x}_i) - u\big], \tag{2.16}$$

where the kernel function $k(r)$ assigns more weight to pixels closer to the center, $a$ is a scaling constant, $\delta$ is the Kronecker delta function, and $f$ is a normalization factor that ensures $\sum_{u=1}^{256} p_{\mathbf{y}}^{(u)} = 1$.

In the tracking method, they compute the color histograms $p$ and $q$ for the object model and the hypothesis data, respectively, and evaluate their similarity using:

$$\rho[p, q] = \sum_{u=1}^{256} \sqrt{p^{(u)} q^{(u)}}, \tag{2.17}$$

where a larger $\rho$ indicates higher similarity, and the hypothesis data matches the model perfectly if $\rho[p,q] = 1$.
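A minimal sketch of this global comparison (our own illustration, not code from [NKMVG03]): it builds normalized 8x8x4 HSV histograms for two pixel sets and compares them with Eq. (2.17); for brevity it uses hard binning and omits the kernel weighting of Eq. (2.16).

```python
import numpy as np

def hsv_histogram(hsv_pixels, bins=(8, 8, 4)):
    """Normalized 8x8x4 = 256-bin histogram of an (n, 3) array of HSV pixels in [0, 1]."""
    hist, _ = np.histogramdd(hsv_pixels, bins=bins, range=[(0, 1)] * 3)
    return (hist / hist.sum()).ravel()

def bhattacharyya(p, q):
    """rho[p, q] = sum_u sqrt(p_u * q_u); equals 1 for identical distributions (Eq. 2.17)."""
    return np.sum(np.sqrt(p * q))

# Toy usage with random HSV pixel sets
model_hsv = np.random.rand(500, 3)
hypo_hsv = np.random.rand(400, 3)
rho = bhattacharyya(hsv_histogram(model_hsv), hsv_histogram(hypo_hsv))
print(rho)  # close to 1 for similar color distributions
```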


Besides color histograms, other statistical models have also been used for comparing the object model and the hypothesis data. In [PA10], Papadourakis et al. also use an ellipse model and propose a GMM to represent the colors in YUV color space. Although color histograms and GMMs represent the color distribution robustly, local differences cannot be reflected in the comparison result with this form of description. Imagine that a different object with the same color distribution but a completely different texture appears (e.g. a cola can and a red apple): a hypothesis around the false object could yield a high similarity despite the obvious texture difference.

2.3.2 Local Method

Local methods typically perform comparisons between points (pixels) and sum up the local comparison results. Given the placed object model and the hypothesis data, each point (pixel) of the hypothesis data is assigned to a point of the object model. The point correspondences are determined using a nearest-neighbour criterion [PW15], or points that project onto the same image coordinate are grouped as correspondences [BC12][CC13]. A local similarity score is estimated for each point pair, and these local scores are then summed into a global score value. The local similarity score is estimated by evaluating color consistency, geometric distance, normal direction consistency, etc.

A recent particle filter tracking approach [CC13] uses the following method to estimate the local similarity of a point pair. Given a model point $\mathbf{m} \in \mathbb{R}^3$ and the corresponding hypothesis data point $\mathbf{z} \in \mathbb{R}^3$, the operator $n(\mathbf{p}) \in \mathbb{R}^3$ gives the normal direction of a point and the operator $c(\mathbf{p}) \in \mathbb{R}^3$ gives its color. The similarity between the two points is given as:

$$s(\mathbf{m},\mathbf{z}) = \exp(-\lambda_e \, d_e(\mathbf{m},\mathbf{z})) \cdot \exp(-\lambda_n \, d_n(n(\mathbf{m}),n(\mathbf{z}))) \cdot \exp(-\lambda_c \, d_c(c(\mathbf{m}),c(\mathbf{z}))), \tag{2.18}$$

where $d_e$, $d_n$ and $d_c$ are the Euclidean, normal and color distances:

$$d_e = \|\mathbf{m}-\mathbf{z}\|, \tag{2.19}$$
$$d_n = \|\cos^{-1}(n(\mathbf{m}) \cdot n(\mathbf{z}))\|, \tag{2.20}$$
$$d_c = \|c(\mathbf{m}) - c(\mathbf{z})\|. \tag{2.21}$$

$\lambda_e$, $\lambda_n$ and $\lambda_c$ are tuning parameters that decide how sensitive the similarity is to each distance.
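A minimal sketch of Eqs. (2.18)-(2.21) for a single point pair (our own illustration; the lambda values are arbitrary placeholders, not tuned values from [CC13]):

```python
import numpy as np

def pair_similarity(m, z, n_m, n_z, c_m, c_z, lam_e=10.0, lam_n=1.0, lam_c=5.0):
    """s(m, z) per Eq. (2.18): product of exponentials of Euclidean, normal and color distances."""
    d_e = np.linalg.norm(m - z)                          # Eq. (2.19)
    cos_angle = np.clip(np.dot(n_m, n_z), -1.0, 1.0)
    d_n = np.abs(np.arccos(cos_angle))                   # Eq. (2.20)
    d_c = np.linalg.norm(c_m - c_z)                      # Eq. (2.21)
    return np.exp(-lam_e * d_e) * np.exp(-lam_n * d_n) * np.exp(-lam_c * d_c)

# Toy usage with unit normals and RGB colors in [0, 1]
m, z = np.array([0.0, 0.0, 0.5]), np.array([0.01, 0.0, 0.5])
n_m, n_z = np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.1, 0.995])
n_z /= np.linalg.norm(n_z)
c_m, c_z = np.array([0.9, 0.1, 0.1]), np.array([0.85, 0.15, 0.1])
print(pair_similarity(m, z, n_m, n_z, c_m, c_z))
```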

This kind of comparison reflects local variation well and is sensitive to local differences. Furthermore, the gradient of the likelihood in the hypothesis state space is steeper than for a global method: a subtle change of the hypothesis state causes a larger likelihood difference than with a global method.


The sensitivity to hypothesis state changes is advantageous in terms of discriminative power but detrimental in terms of robustness against noise and occlusion. Especially in the case of partial occlusion, which happens frequently in a human-robot interaction environment, the occluded local elements (pixels, points) of the object model yield low similarity scores for a correct hypothesis state even though their positions are correct, because they are matched to wrong corresponding points, which may belong to the occluder. Conversely, for some false particle state, the points of the object model may be matched to wrong points that yield a higher similarity.

The goal of this thesis is to design a novel feature descriptor that combines the global and local methods. The feature descriptor should have the robust behaviour of global methods and should be capable of describing local variation efficiently. See Chapter 3 for the details of our feature descriptor, the Joint Color-Spatial Descriptor (JCSD).


Chapter 3

Method

As mentioned in the introduction, pose tracking, segmentation and model update are the three important parts of a model-free tracking system. In dynamic human-object interaction scenarios, the motion is usually non-linear and non-Gaussian; therefore we use particle filtering to estimate each object's 6 DoF pose at every time step. In a particle filtering framework, multiple pose hypotheses must be evaluated. In this chapter, we propose a novel feature descriptor for hypothesis evaluation, the Joint Color-Spatial Descriptor (JCSD), which encodes an object's shape and color in a joint color-spatial space. Next, we describe the point-cloud segmentation and model update methods used in our tracking system. Particle filtering requires a large number of hypotheses to approximate the pose probability distribution, and the resulting computation load affects the real-time requirement. However, the evaluation of each pose hypothesis is independent of the others, so we can parallelize the hypothesis evaluation on the GPU. At the end of this chapter, we present some details of the GPU implementation.

3.1 Model-Free Object Tracking System

In this section, we first briefly review the unsupervised object individuation method proposed in [KLK14], on which our system is built. Next, we discuss the benefits and shortcomings of [KLK14] and provide an overview of our proposed approach.

Unsupervised Object Individuation

How can a robot distinguish different objects in a point-cloud sequence without training data or learnt object models? This problem is easy when different objects are not in contact with each other. In the non-contact case, the point cloud can be clustered into different objects using Euclidean clustering [RC11] (Fig. 3.1), as sketched below. But in a dynamic scenario, objects interact: they can come into contact with each other, and occlusion can split a single object into two separate parts (Fig. 3.2).
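A minimal sketch of Euclidean clustering for such non-contact cases (our own illustration, not the implementation of [RC11] or of the thesis): DBSCAN with min_samples=1 reduces to pure distance-threshold clustering, and the 5 cm tolerance is an arbitrary placeholder.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def euclidean_clustering(points, tolerance=0.05):
    """Group points whose mutual distance chains stay below `tolerance` (in metres)."""
    labels = DBSCAN(eps=tolerance, min_samples=1).fit_predict(points)
    return labels  # one integer label per point; each label is one object candidate

# Toy usage: two well-separated blobs become two clusters
cloud = np.vstack([np.random.rand(200, 3) * 0.1,
                   np.random.rand(200, 3) * 0.1 + np.array([1.0, 0.0, 0.0])])
labels = euclidean_clustering(cloud)
print(np.unique(labels))  # [0 1]
```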


(a) Input point cloud (b) Euclidean clustering, where each color represents a different object

Figure 3.1: An example of the non-contact case where four individual objects are clearlyseparated.

Koo et al. proposed in [KLK14] an unsupervised object individuation method to handle these dynamic scenarios. Their basic framework is illustrated in Fig. 3.3: the point cloud is first segmented using Euclidean clustering, and an ambiguity test then detects merge or occlusion cases. If no merge or occlusion occurs, the initial segmentation is used as the set of clear objects and the object indexing step associates each object with the previous object tracks. If merge or occlusion cases are detected by the ambiguity test, the ambiguous clustering result undergoes both a "separate falsely merged objects" and a "merge falsely separated objects" process to handle the contact and occlusion cases, respectively.

Their approach achieves robust and accurate results in non-contact cases, and the computational cost remains low because only Euclidean clustering is required. Nevertheless, in contact cases such as object stacking, the segmentation accuracy decreases due to inaccurate segmentation in the contact area. This is caused by inaccurate pose tracking results and the simplified object model representation. Moreover, in contact cases the computation time increases drastically, because their pose tracking uses a time-consuming iterative numerical optimization method. Apart from the lower performance in contact cases, however, their framework achieves satisfying performance in non-contact cases and offers the flexibility to handle different situations with different strategies. Therefore, in this thesis we apply their method to handle all non-contact cases and switch to our system for contact cases.

System Overview

The proposed model-free tracking system consists of three parts, as shown in Fig. 3.4. When a new point cloud arrives, the object's pose is estimated first (pose


(a) Contact (b) Occlusion

Figure 3.2: Two example cases where simple Euclidean clustering fails to separate different objects. Three objects are represented in different colors.

estimation). Then, a set of points from the current sensor data that belongs to the object is extracted based on the estimated pose (segmentation). Finally, the object model is updated by combining the old model data and the newly extracted measurement points (model update).

In this work, a frame of sensor data, $\mathbf{Z}_t$, and an object model, $\mathbf{M}_t$, are represented as a set of points $\{(\mathbf{p}_1, \mathbf{c}_1), \dots, (\mathbf{p}_n, \mathbf{c}_n)\}$, each of which contains a 3D location, $\mathbf{p}_i = (x, y, z)$, and its corresponding color, $\mathbf{c}_i = (r, g, b)$. For the multiple object case, $\{\mathbf{M}_t^{(k)}\}_{k=1}^{K}$ are the $K$ object models at time step $t$. Let us assume that the initial object models $\{\mathbf{M}_{t_0}^{(k)}\}_{k=1}^{K}$ can be obtained from the non-contact case¹ as shown in Fig. 3.1, where the input point cloud can easily be segmented into individual objects using a Euclidean clustering method [RC11]. Here, $t_0$ is the time step of the last non-contact frame.

When new frame data $\mathbf{Z}_t$ arrives at time step $t$, the pose estimation process is first applied based on a particle filter with the proposed JCSD. Then, the previous object model $\mathbf{M}_{t-1}^{(k)}$ is transformed from the object coordinate system to the world coordinate system based on the estimated pose. Next, point-cloud segmentation is performed based on the transformed object models, which results in a subset $\mathbf{Z}_t^{(k)}$ of the input point-cloud data. The new object model $\mathbf{M}_t^{(k)}$ is then built by fusing the transformed previous model and $\mathbf{Z}_t^{(k)}$. Finally, the new object model $\mathbf{M}_t^{(k)}$ is transformed back to its object coordinate system.

1The transition from non-contact to contact case is detected using algorithms provided by [KLK14].


Figure 3.3: Unsupervised object individuation framework proposed in [KLK14]

Figure 3.4: System overview for model-free object tracking, with three parts: pose estimation (Sec. 3.2, Sec. 3.3), segmentation (Sec. 3.4) and model update (Sec. 3.4). All bold symbols denote point clouds.

Object pose representation All points in the object model $\mathbf{M}$ are described relative to the object coordinate system, which is fixed at the centroid of the initial object model $\mathbf{M}_{t_0}$. At time step $t$, an object's pose state is represented as

$$X_t = (x, y, z, roll, pitch, yaw). \tag{3.1}$$

For clarity, we denote the point cloud obtained by transforming $\mathbf{M}$ with the pose state $X$ as

$$\bar{\mathbf{M}} = X \otimes \mathbf{M}. \tag{3.2}$$

The actual transformation is performed as described in Sec. 2.1.

Particle filter for pose estimation The particle filter is a robust approach for tracking an object pose; it is not limited to linear systems and does not require the noise to be Gaussian [YJS06][YSZ+11]. At time step $t$, the posterior probability of an object's pose is approximated by a set of hypothesis poses $\{X_t^{(i)}\}_{i=1}^{N}$ with corresponding weights $\{\pi_t^{(i)}\}_{i=1}^{N}$. The set of hypotheses $\{X_t^{(i)}\}_{i=1}^{N}$


is sampled using the Sampling Importance Resampling method. Each sample-weight pair is defined as a particle $S_t^{(i)} = (X_t^{(i)}, \pi_t^{(i)})$, where the weight values are normalized such that $\sum_{i=1}^{N} \pi_t^{(i)} = 1$. Each weight value is proportional to its corresponding hypothesis likelihood, which is the probability that the object model $\mathbf{M}_{t-1}$ fits the current measurement $\mathbf{Z}_t$ under the hypothesis pose state $X_t^{(i)}$:

$$\pi_t^{(i)} \propto p(\mathbf{Z}_t \mid X_t^{(i)} \otimes \mathbf{M}_{t-1}). \tag{3.3}$$

The likelihood estimation method is described in Sec. 3.2 and Sec. 3.3. The final estimate of the current pose $\hat{X}_t$ is then the weighted combination of all particle states:

$$\hat{X}_t = \sum_{i=1}^{N} \pi_t^{(i)} X_t^{(i)}. \tag{3.4}$$

3.2 Joint Color-Spatial Descriptor

Hypothesis evaluation is a core part of the particle filter used to estimate an object pose in the point-cloud data. As seen in Eq. 3.3, it represents how likely a pose hypothesis is to be close to the measurement data and decides the weight value of each particle. Thus, the form of the evaluation function determines the robustness and computational efficiency of the tracking.

In order to evaluate an object model against point-cloud data, some previous approaches use local comparison, where point-to-point correspondences are used to evaluate the hypothesis likelihood [PW15][CC13][BC12]. They estimate a likelihood for each point pair and sum the results over all pairs. This is advantageous for tracking accuracy but sensitive to noise and partially missing data. On the other hand, global comparison approaches have been used for object tracking in 2D video streams [NKMVG03][PA10][OTDF+04]. Most global comparison methods estimate the color distributions of the object model and of the hypothesis data and compare these two distributions. Although this is a robust description, the tracking accuracy is lower than with local comparison, because a small change of the object orientation does not affect the color distribution.

How can an object be represented from a point cloud such that the representation is locally efficient and globally robust? First, we select a set of evaluation points equally distributed over the point-cloud space of interest. Each evaluation point represents a local part of the object by means of the proposed Joint Color-Spatial Descriptor (JCSD), computed from its neighbouring measurement points. Then, the object is globally represented by a probability density that is evaluated by interpolating the JCSDs of the neighbouring evaluation points. In this way, local variation of the object can be reflected, while a robust description is retained over the global extent of the object.


In this section, we propose the Joint Color-Spatial Descriptor to combine the advantages of the local and global comparison methods. We first introduce the color feature, Smoothed Color Ranging (SMR) [WCC+13]; we then present the spatial representation of an object without considering color, and finally show how to combine the spatial representation and the color feature into the JCSD.

3.2.1 Smoothed Color Ranging

We want to express an object in a joint color-spatial space, but correlating the two kinds of features increases the dimensionality drastically. Therefore we should keep the color feature dimension as low as possible while retaining its discriminative power. In [WCC+13], the authors designed the smoothed color ranging technique to describe color for an object recognition task; their method uses only 8 bins. We modified this method to fit our framework, so that the $i$th point $\mathbf{p}_i$ has an 8-dimensional feature vector $\mathbf{h}_i \in \mathbb{R}^{8\times1}$. In the following, we describe the SMR color feature.

In [GS97], Gevers et al. pointed out that hue remains, in general, invariant to viewing direction, surface orientation, highlights, illumination direction and illumination intensity, which are the factors causing different color appearances of the same object. Therefore we first convert each point's color value from RGB space to HSV (Hue, Saturation, Value) space. The conversion is:

$$H = \arctan\!\left(\frac{\sqrt{3}\,(G-B)}{(R-G)+(G-B)}\right), \qquad S = \begin{cases} 1 - \dfrac{\min(R,G,B)}{\max(R,G,B)} & \text{if } \max(R,G,B) > 0 \\ 0 & \text{if } \max(R,G,B) = 0 \end{cases}, \qquad V = \max(R,G,B), \tag{3.5}$$

Figure 3.5: a) Hue-Saturation space; b) Saturation-Value space (chromatic, achromatic and fuzzy regions).


where $R, G, B \in [0,1]$, $H \in [0, 360]$ and $S, V \in [0,1]$. For any hue value, when either the saturation $S$ or the value $V$ is near 0, the color is gray-scale and can be considered achromatic. For $S = 0$, as one goes from 0 to 1 on the V-axis, the color changes from black to white. On the other hand, for constant $V$, as we go from 0 to 1 on the S-axis, the color changes from gray to the pure color representation of the hue (Fig. 3.5a). Gray-scale colors have very unstable hue values; if we only used hue and ignored saturation and value, the resulting color statistics could be inaccurate. Therefore all three elements of the HSV color space are used in the smoothed ranging process. Exploiting the properties of HSV space, the ranging process is presented below.

Chromatic vs. Achromatic We use all HSV values for the ranging process, for both points with a distinct color (red, yellow, blue, etc.) and gray-scale points (black, white and gray) (Fig. 3.5b). We therefore divide the feature space into two components: a chromatic component, reserved for points with distinct colors, and an achromatic component, reserved for gray-scale points. But as illustrated in Fig. 3.5b, there is a fuzzy area that is ambiguous between the chromatic and achromatic regions. Borrowing an idea from [SQP02], each point is ranged into both the chromatic and the achromatic component with corresponding weights: $w_C(S,V)$ for the chromatic component and $w_G(S,V)$ for the achromatic component. The two weights are given as:

$$w_C = S^{\,r \cdot (1/V)^{r_1}}, \qquad w_G = 1 - w_C(S,V), \tag{3.6}$$

where $r, r_1 \in [0,1]$. From observing the function plot for different values of $r$ and $r_1$, we found that with $r = 0.14$ and $r_1 = 0.9$ the function represents the colorfulness of the SV-space adequately. The resulting chromatic weight is illustrated in Fig. 3.6, where dark red represents a high chromatic weight and dark blue a low chromatic weight.

Chromatic Component We use six hue values to represent the hue dimension, dubbed $R_1 = 0°$ (red), $R_2 = 60°$ (yellow), $R_3 = 120°$ (green), $R_4 = 180°$ (cyan), $R_5 = 240°$ (blue) and $R_6 = 300°$ (purple). Each representative corresponds to a bin in the feature histogram. For each point, only $H$ (hue) is considered to determine into which bins the point is distributed. In a normal histogram, hard quantization is used to decide into which bin a hue value falls. But the result is unstable if the hue is near a quantization threshold; in that case a small shift of hue changes the color statistics significantly. Instead, we use a soft quantization scheme: each point $\mathbf{p}$ is distributed into two neighbouring histogram bins, with bin indices $\langle n, n' \rangle$:

$$n = \left\lceil \frac{H}{60} \right\rceil, \qquad n' = \begin{cases} n+1 & \text{if } n < 6 \\ 1 & \text{if } n = 6 \end{cases}, \tag{3.7}$$


Figure 3.6: Chromatic weight in Saturation-Value space; red represents a high value and blue a low value.

and these two bins are given different weights $\langle w_{Hn}, w_{Hn'} \rangle$ based on the distance between $\mathbf{p}$'s hue value and the hue representatives $R_n$ and $R_{n'}$ (Fig. 3.5a):

$$w_{Hn} = \frac{R_{n'} - H}{60}, \qquad w_{Hn'} = 1 - w_{Hn}. \tag{3.8}$$

Recall that the $i$th point's color feature is the vector $\mathbf{h}_i \in \mathbb{R}^8$, initialized as a zero vector; we now write the distribution over the chromatic part into $\mathbf{h}_i$:

$$h_i(n) = w_C \cdot w_{Hn}, \qquad h_i(n') = w_C \cdot w_{Hn'}. \tag{3.9}$$

Achromatic Component The achromatic component is divided into two ranges, using bins $\langle 7, 8 \rangle$. In the achromatic component, we use $\mathbf{p}$'s V-value of HSV to represent the gray-scale intensity and use hard quantization to decide into which histogram bin $\mathbf{p}$ is distributed:

$$h_i(7) = \begin{cases} w_G & \text{if } V < 0.5 \\ 0 & \text{if } V \ge 0.5 \end{cases}, \qquad h_i(8) = \begin{cases} 0 & \text{if } V < 0.5 \\ w_G & \text{if } V \ge 0.5 \end{cases}. \tag{3.10}$$

In summary, based on its HSV value, each point $\mathbf{p}$ is ranged into three histogram regions $\langle n, n', 7|8 \rangle$ with different weights, reflected in the resulting distribution vector $\mathbf{h}_i$.
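A minimal sketch of the SMR feature for a single RGB point, following Eqs. (3.6)-(3.10); this is our own illustration: it uses Python's colorsys for the RGB-to-HSV conversion instead of Eq. (3.5), and treats fully desaturated or black points as purely achromatic.

```python
import numpy as np
import colorsys

def smr_feature(r, g, b, rr=0.14, r1=0.9):
    """8-bin Smoothed Color Ranging feature of one RGB point (r, g, b in [0, 1]).

    Bins 1-6: red, yellow, green, cyan, blue, purple; bins 7-8: dark / light gray.
    """
    h_frac, s, v = colorsys.rgb_to_hsv(r, g, b)    # colorsys hue is in [0, 1)
    H = h_frac * 360.0
    feat = np.zeros(8)

    # Chromatic vs. achromatic weights (Eq. 3.6); s == 0 or v == 0 is treated as achromatic.
    w_c = s ** (rr * (1.0 / v) ** r1) if v > 0 and s > 0 else 0.0
    w_g = 1.0 - w_c

    # Soft hue quantization into two neighbouring bins (Eqs. 3.7-3.9).
    n = max(int(np.ceil(H / 60.0)), 1)             # bin of the lower hue representative
    n_next = n + 1 if n < 6 else 1                 # wrap purple -> red
    rep_next = n * 60.0                            # next representative (360 plays the role of 0)
    w_hn = (rep_next - H) / 60.0
    feat[n - 1] += w_c * w_hn
    feat[n_next - 1] += w_c * (1.0 - w_hn)

    # Achromatic split by value (Eq. 3.10).
    feat[6 if v < 0.5 else 7] += w_g
    return feat

print(smr_feature(0.9, 0.1, 0.1))   # weight concentrated in the red bin
print(smr_feature(0.5, 0.5, 0.5))   # weight concentrated in the light-gray bin
```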


3.2.2 Spatial Representation of an Object

The spatial representation of an object is expressed as a spatial probability density at multiple evaluation points. The spatial density at each evaluation point is obtained with the Kernel Density Estimation (KDE) method.

Kernel Density Estimation (KDE) is a fundamental statistical technique for estimating a probability density function (pdf). KDE is a nonparametric method similar to histogram density estimation but with improved properties: compared to a histogram estimate, the probability density estimated by KDE is smooth in the probability space and closer to the true pdf.

We denote an arbitrary point cloud with $n$ points by $\mathbf{P}$. To build the spatial representation of $\mathbf{P}$, first a set of equally sized 3D grid cells with grid length $r$ is created inside a bounding box around $\mathbf{P}$, as shown in Fig. 3.7. Then all grid corners $\{\mathbf{s}_i\}_{i=1}^{m}$ are selected as evaluation points, where $m$ is the number of corners. To estimate the spatial density at the $i$th evaluation point $\mathbf{s}_i$, its neighbouring points from $\mathbf{P}$ inside the kernel bandwidth are used, where the kernel bandwidth is set equal to the grid length $r$. Assuming the neighbouring point set of $\mathbf{s}_i$ is $\mathbf{P}^{(i)}$ with $n_i$ points, the spatial density at $\mathbf{s}_i$ can be obtained by KDE as follows:

$$f_r(\mathbf{s}_i) = \frac{1}{n} \sum_{j=1}^{n_i} K_r(\|\mathbf{s}_i - \mathbf{p}_j^{(i)}\|), \tag{3.11}$$

where $K_r(x)$ is a kernel function that determines the contribution of a point based on its distance. Here, we use a triangular kernel for its computational efficiency:

$$K_r(x) = \frac{\max(r - x, 0)}{r}. \tag{3.12}$$

With the constructed grid, we achieve a lower total computational complexity for the spatial density estimation. For an unorganized point cloud, the neighbour search for the evaluation points would typically require a k-d tree search with $O(m \log n)$ complexity for $m$ evaluation points and $n$ measurement points.

However, the proposed method has linear complexity $O(n)$ without any neighbour search. For each point $\mathbf{p}_i$, the grid cell to which the point belongs is selected first. Assuming that the 3D bounding box is represented by its two corner points $\langle minPt, maxPt \rangle$ and the numbers of grid cells in the x, y, z directions are $\langle dim.x, dim.y, dim.z \rangle$, the grid indices $\langle IND.x, IND.y, IND.z \rangle$ of a point $\mathbf{p}$ are estimated as follows:

$$IND.x = \left\lfloor \frac{p.x - minPt.x}{r} \right\rfloor, \quad IND.y = \left\lfloor \frac{p.y - minPt.y}{r} \right\rfloor, \quad IND.z = \left\lfloor \frac{p.z - minPt.z}{r} \right\rfloor. \tag{3.13}$$


Figure 3.7: A set of grid cells and corresponding evaluation points representing the spatial distribution of an object, illustrated in 2D for simplicity. In the left figure, a bounding box with grid length $r$ covers the point cloud $\mathbf{P}$ (colored points of a book). Evaluation points are located on the grid corners $\{\mathbf{s}_i\}_{i=1}^{m}$. The spatial density at each evaluation point is calculated with a kernel bandwidth of $r$. The right figure shows the influence of a random point $\mathbf{p}_i$ from $\mathbf{P}$ on the neighbouring evaluation points. Here, the point (blue) only affects the four evaluation points (red) because of the kernel bandwidth (red circles). In the 3D case, there are eight neighbouring evaluation points on the corners of a cube.

The global index $INDEX$ of a point with indices $\langle IND.x, IND.y, IND.z \rangle$ is then:

$$INDEX(IND.x, IND.y, IND.z) = IND.x \cdot dim.y \cdot dim.z + IND.y \cdot dim.z + IND.z. \tag{3.14}$$

Then only the spatial densities of the eight evaluation points surrounding $\mathbf{p}_i$ need to be updated, because $\mathbf{p}_i$ contributes only to these eight evaluation points, as shown in Fig. 3.7. The indices of the eight corner points can be computed as:

$$\begin{aligned}
INDEX_1 &= INDEX(IND.x, IND.y, IND.z), & INDEX_2 &= INDEX(IND.x, IND.y, IND.z+1), \\
INDEX_3 &= INDEX(IND.x, IND.y+1, IND.z), & INDEX_4 &= INDEX(IND.x, IND.y+1, IND.z+1), \\
INDEX_5 &= INDEX(IND.x+1, IND.y, IND.z), & INDEX_6 &= INDEX(IND.x+1, IND.y, IND.z+1), \\
INDEX_7 &= INDEX(IND.x+1, IND.y+1, IND.z), & INDEX_8 &= INDEX(IND.x+1, IND.y+1, IND.z+1).
\end{aligned} \tag{3.15}$$


Therefore, for each of the $n$ points of $\mathbf{P}$, eight update operations are needed, which results in $O(8 \cdot n) = O(n)$ in total. The spatial density estimation for all evaluation points is summarized in Algorithm 2.

Algorithm 2 Calculate the spatial density at the evaluation points
Input: $\{\mathbf{s}_i\}_{i=1}^{m}$ and $\mathbf{P} = \{\mathbf{p}_k\}_{k=1}^{n}$
Output: $\{f_r(\mathbf{s}_i)\}_{i=1}^{m}$
- $\{f_r(\mathbf{s}_i)\}_{i=1}^{m} \leftarrow 0$
for $k = 1 : n$ do
 - Determine the grid index of $\mathbf{p}_k$. (Eq. 3.14)
 - Determine the indices $\{k_l\}_{l=1}^{8}$ of the 8 evaluation points for $\mathbf{p}_k$ based on its grid index. (Eq. 3.15)
 - Update each relevant evaluation point of $\mathbf{p}_k$:
 for $l = 1 : 8$ do
  - $f_r(\mathbf{s}_{k_l}) \leftarrow f_r(\mathbf{s}_{k_l}) + \frac{1}{n} \cdot K_r(\|\mathbf{s}_{k_l} - \mathbf{p}_k\|)$
 end for
end for

3.2.3 Joint Color-Spatial Representation

In order to represent the appearance of an object, the color distribution at each evaluation point should also be considered. The color feature should not only be low-dimensional for the sake of computation and memory, but should also retain its discriminative property. Additionally, in a dynamic environment the object pose and illumination intensity can change; these aspects should also be considered in the color feature selection. Wang et al. proposed the Smoothed Color Ranging (SMR) method to describe color with 8 representative colors (red, yellow, green, cyan, blue, purple, light gray, dark gray) [WCC+13]. In their work, SMR in the HSV (Hue, Saturation, Value) color space was shown to be invariant against illumination changes.

Using the SMR color feature, the spatial density is calculated eight times, once for each color range. To describe its color distribution, an 8-dimensional feature vector $\mathbf{h}$ with at most three non-zero elements and $\|\mathbf{h}\|_1 = 1$ is estimated for each point. For each evaluation point $\mathbf{s}_i$, eight different spatial densities $\{f_r^{(c)}(\mathbf{s}_i)\}_{c=1}^{8}$ are evaluated to correlate spatial and color data. For the $c$th color range, to estimate the density $f_r^{(c)}(\mathbf{s}_i)$ at evaluation point $\mathbf{s}_i$, Equation (3.11) is updated to:

$$f_r^{(c)}(\mathbf{s}_i) = \frac{1}{n} \sum_{j=1}^{n_i} h_j^{(i)}(c) \cdot K_r(\|\mathbf{s}_i - \mathbf{p}_j^{(i)}\|), \tag{3.16}$$

where $h_j^{(i)}(c)$ is the weight of point $\mathbf{p}_j^{(i)}$ in the $c$th color range.


As a result, the JCSD for P with m evaluation points can be described as a matrix J_P ∈ R^{8×m}, whose jth column contains the jth evaluation point's densities in the eight color ranges:

J_P = \begin{pmatrix} f_r^{(1)}(s_1) & f_r^{(1)}(s_2) & \cdots & f_r^{(1)}(s_m) \\ f_r^{(2)}(s_1) & f_r^{(2)}(s_2) & \cdots & f_r^{(2)}(s_m) \\ \vdots & \vdots & \ddots & \vdots \\ f_r^{(8)}(s_1) & f_r^{(8)}(s_2) & \cdots & f_r^{(8)}(s_m) \end{pmatrix}.        (3.17)

The estimation of the JCSD feature matrix J_P for a point-cloud P = {p_k}_{k=1}^n is summarized in Algorithm 3.

Algorithm 3 Estimation of the JCSD feature matrix
Input: evaluation points {s_i}_{i=1}^m, input point-cloud P = {p_k}_{k=1}^n and color feature of each point {h_k}_{k=1}^n
Output: J_P ∈ R^{8×m}
  J_P ← 0_{8×m}
  for k = 1 : n do
    Determine the grid index of p_k (Eq. 3.14).
    Determine the indices {k_l}_{l=1}^8 of the 8 evaluation points for p_k based on its grid index (Eq. 3.15).
    Update each relevant evaluation point of p_k:
    for l = 1 : 8 do   (iterate over the 8 relevant evaluation points)
      for c = 1 : 8 do   (iterate over the 8 color ranges)
        J_P(c, k_l) ← J_P(c, k_l) + (1/n) · h_k(c) · K_r(‖s_{k_l} − p_k‖)
      end for
    end for
  end for
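Algorithm 3 differs from Algorithm 2 only in that each point deposits its 8-dimensional SMR color weight h_k instead of a unit mass. A C++ sketch, again reusing the helpers above and with an assumed storage layout (J stored row-major as J[c·m + i]), could look as follows.

#include <vector>
#include <array>
#include <cmath>

// Algorithm 3: JCSD feature matrix J_P (8 x m), stored row-major as J[c * m + i].
std::vector<float> jcsdMatrix(const std::vector<Vec3>& P,
                              const std::vector<std::array<float, 8>>& H,  // SMR color weights h_k
                              const std::vector<Vec3>& evalPts,
                              const Vec3& minPt, float r, const Dim3i& dim) {
    const std::size_t m = evalPts.size();
    std::vector<float> J(8 * m, 0.0f);
    const float invN = 1.0f / static_cast<float>(P.size());
    for (std::size_t k = 0; k < P.size(); ++k) {
        for (int idx : cornerIndices(P[k], minPt, r, dim)) {     // 8 relevant evaluation points
            const Vec3& s = evalPts[idx];
            const float d = std::sqrt((s.x - P[k].x) * (s.x - P[k].x) +
                                      (s.y - P[k].y) * (s.y - P[k].y) +
                                      (s.z - P[k].z) * (s.z - P[k].z));
            const float w = invN * kernelKr(d, r);
            for (int c = 0; c < 8; ++c)                          // 8 color ranges
                J[c * m + idx] += H[k][c] * w;                   // Eq. (3.16)
        }
    }
    return J;
}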

3.3 Hypothesis Evaluation with JCSD

The likelihood estimation measures how well the object model M_{t−1} fits the observation data Z_t under the ith hypothesis state X_t^{(i)}. First, the hypothesis data is extracted from the sensor data: the points near the object's pose estimated at time step t−1, because the object movement between two consecutive frames can be assumed to be limited. With this assumption, the points of Z_t that are far from the object model in its last estimated pose, X_{t−1} ⊗ M_{t−1}, are removed. The remaining point-cloud Z'_t with n points forms the hypothesis data. For the ith particle state X_t^{(i)}, the hypothesis data is transformed into the object coordinate frame using the inverse of X_t^{(i)}, resulting in Z'^{(i)}_t = {z_1^{(i)}, ..., z_n^{(i)}}.

Hypothesis evaluation is performed on each point of the hypothesis data Z'^{(i)}_t by estimating a fitting score that indicates how well the point fits the object model's JCSD, J_{M_{t−1}}. The fitting score for X_t^{(i)} is then obtained by summing the fitting scores of all points. The fitting score s(z_j^{(i)}) of the jth point z_j^{(i)} of Z'^{(i)}_t is calculated based on its color feature h_j^{(i)}. The 8 JCSD values {f^{(c)}(z_j^{(i)})}_{c=1}^{8} of the object model at z_j^{(i)} are evaluated by trilinear interpolation of the eight surrounding evaluation points, which are the corners of the cube that contains z_j^{(i)}:

f^{(c)}(z_j^{(i)}) = \sum_{k=1}^{8} \frac{\prod_{d=1}^{3} \left( \mathbf{1}_3 − |z_j^{(i)} − s_{j_k}| \right)_d}{l_{grid}^3} \cdot J_{M_{t−1}}(c, j_k),        (3.18)

where c indicates the index of the color range and \mathbf{1}_3 is a three-dimensional vector of ones. The fitting score of point z_j^{(i)} is then the correlation between h_j^{(i)} and {f^{(c)}(z_j^{(i)})}_{c=1}^{8}:

s(z_j^{(i)}) = \sum_{c=1}^{8} h_j^{(i)}(c) \cdot f^{(c)}(z_j^{(i)}).        (3.19)

As a result, the fitting score s_i of the ith particle is the sum of the fitting scores of all points of Z'^{(i)}_t:

s_i = \sum_{j=1}^{n} s(z_j^{(i)}).        (3.20)

The estimation of s_i is summarized in Algorithm 4. After the fitting scores of all particle states have been estimated, the likelihood p(Z_t | X_t^{(i)} ⊗ M_{t−1}) is calculated as

p(Z_t | X_t^{(i)} ⊗ M_{t−1}) ∝ \exp\left(−λ \cdot \left(1 − \frac{s_i − s_{min}}{s_{max} − s_{min}}\right)\right),        (3.21)

where s_{min} and s_{max} are the minimal and maximal score values among all hypothesis states, respectively.
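The hypothesis evaluation of Eqs. (3.18)–(3.21) can be sketched in C++ as below, reusing the types and helpers from the JCSD sketches above. The trilinear weights are written in the standard form (l_grid − |Δ_d|)/l_grid per axis, which may differ in normalization details from Eq. (3.18); all names are assumptions, not the thesis implementation.

#include <vector>
#include <array>
#include <cmath>

// Fitting score of one hypothesis (Eqs. 3.18-3.20). Assumes the model JCSD J is stored
// as in jcsdMatrix() above (8 x m, row-major) and that the hypothesis points Z have
// already been transformed into the object coordinate frame.
float fittingScore(const std::vector<Vec3>& Z,                      // hypothesis data Z'_t^(i)
                   const std::vector<std::array<float, 8>>& H,      // color features h_j^(i)
                   const std::vector<float>& J, std::size_t m,      // model JCSD and its width
                   const std::vector<Vec3>& evalPts,
                   const Vec3& minPt, float lGrid, const Dim3i& dim) {
    float s = 0.0f;
    for (std::size_t j = 0; j < Z.size(); ++j) {
        std::array<float, 8> f{};                                   // interpolated densities f^(c)(z_j)
        for (int idx : cornerIndices(Z[j], minPt, lGrid, dim)) {
            const Vec3& sp = evalPts[idx];
            const float w = ((lGrid - std::fabs(Z[j].x - sp.x)) *
                             (lGrid - std::fabs(Z[j].y - sp.y)) *
                             (lGrid - std::fabs(Z[j].z - sp.z))) / (lGrid * lGrid * lGrid);
            for (int c = 0; c < 8; ++c) f[c] += w * J[c * m + idx]; // trilinear lookup, Eq. (3.18)
        }
        for (int c = 0; c < 8; ++c) s += H[j][c] * f[c];            // correlation, Eqs. (3.19)-(3.20)
    }
    return s;
}

// Likelihood of the i-th particle from its fitting score (Eq. 3.21).
inline float likelihood(float s, float sMin, float sMax, float lambda) {
    return std::exp(-lambda * (1.0f - (s - sMin) / (sMax - sMin)));
}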

3.4 Point-cloud Segmentation and Model Update

Point-cloud Segmentation

Assuming that K objects are present, the object segmentation task is to label each point {p_i}_{i=1}^n of Z_t with an object index o_i, where o_i ∈ [1, K] and n is the number of points in Z_t. The label can be determined by calculating the Euclidean distance from each point to each object in its currently estimated pose, as defined in Eq. (3.22) below.


Algorithm 4 Fitting score estimation
Input: JCSD of the object model J_{M_{t−1}}, ith hypothesis data Z'^{(i)}_t = {z_1^{(i)}, ..., z_n^{(i)}} and its corresponding color features {h_1^{(i)}, ..., h_n^{(i)}}
Output: fitting score s_i of Z'^{(i)}_t
  s_i ← 0
  for j = 1 : n do
    Determine the JCSD values at the location of z_j^{(i)}: {f^{(c)}(z_j^{(i)})}_{c=1}^{8} (Eq. 3.18).
    Estimate the fitting score s(z_j^{(i)}) as the correlation between h_j^{(i)} and {f^{(c)}(z_j^{(i)})}_{c=1}^{8} (Eq. 3.19).
    s_i ← s_i + s(z_j^{(i)})
  end for

The Euclidean distance d_i^{(k)} between the ith point p_t^{(i)} and the kth object M_t^{(k)} = X_t ⊗ M_{t−1}^{(k)} is defined as the distance between p_t^{(i)} and the closest point in M_t^{(k)}:

d_i^{(k)} = \min_j ‖p_t^{(i)} − m_{t,j}^{(k)}‖,        (3.22)

where m_{t,j}^{(k)} is the jth point of M_t^{(k)}. This is implemented using the k-d tree search method of [RC11].

The ith point is then labeled with the object that yields the smallest d_i^{(k)}, provided that this distance is smaller than a threshold τ:

o_i = \begin{cases} \arg\min_k d_i^{(k)} & \text{if } \min_k d_i^{(k)} \leq τ \\ \emptyset & \text{if } \min_k d_i^{(k)} > τ \end{cases}        (3.23)

The threshold τ is introduced to avoid false association of irrelevant points (points from other objects or from the background) and is set to twice the sampling distance between points in the sensor data. The subset of the input point-cloud that is associated with the kth object is then denoted as Z_t^{(k)}.
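A brute-force C++ sketch of the point labeling of Eqs. (3.22)–(3.23); the thesis uses the k-d tree search of PCL [RC11] instead of the linear scan shown here, and the container names are assumptions. A label of -1 stands for the empty set (no object).

#include <vector>
#include <cmath>
#include <cstddef>
#include <limits>

// Label each scene point with the index of the closest object model (Eqs. 3.22-3.23).
std::vector<int> segmentPoints(const std::vector<Vec3>& scene,
                               const std::vector<std::vector<Vec3>>& models,  // K models in current pose
                               float tau) {
    std::vector<int> labels(scene.size(), -1);
    for (std::size_t i = 0; i < scene.size(); ++i) {
        float best = std::numeric_limits<float>::max();
        for (std::size_t k = 0; k < models.size(); ++k) {
            for (const Vec3& mpt : models[k]) {                    // d_i^(k): closest model point
                const float dx = scene[i].x - mpt.x, dy = scene[i].y - mpt.y, dz = scene[i].z - mpt.z;
                const float d = std::sqrt(dx * dx + dy * dy + dz * dz);
                if (d < best) { best = d; labels[i] = static_cast<int>(k); }
            }
        }
        if (best > tau) labels[i] = -1;                            // Eq. (3.23): reject far points
    }
    return labels;
}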

Model update

In a highly dynamic environment, the pose of a tracked object can change drastically, and consequently the visible part of the target can also change. To maintain robust tracking, the object model should be adapted to the current circumstances from time to time. The major problem caused by model updates is drift: inaccurate segmentation can lead to an inaccurate object model, so that over time the object model degrades away from the true object appearance and causes tracking failure.


In our model update step, if the newly segmented point-clouds {Z_t^{(k)}}_{k=1}^K simply replaced the object models {M_t^{(k)}}_{k=1}^K, any segmentation errors would lead to inaccurate object model estimates, and these errors would accumulate during tracking due to the feedback loop. To overcome this issue, we propose the following model update scheme, which merges the old object model and the newly segmented data. At time step t, we randomly select α% (α ∈ [1,100]) of the points from the newly segmented data Z_t^{(k)} and (100 − α)% of the points from the previous model to build the new object model M_t^{(k)}. Fig. 3.8 shows examples of the object model with α = 2, an empirically chosen value that balances the trade-off between stability and adaptability of the model update.

The variable α can be viewed as a forgetting factor for the old object knowledge: at each time step, only a small amount of new information is added to the object model. The more the visible part of the target changes, the larger the forgetting factor needs to be to adapt to the new measurements.
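One possible realization of this update scheme is sketched below in C++ (assumed names and containers, reusing the Vec3 type from the earlier sketches): α% of the points of the old model are overwritten by randomly drawn points from the newly segmented data, so the model size stays roughly constant.

#include <vector>
#include <random>
#include <cstddef>

// Merge the old model and the newly segmented data with forgetting factor alpha (in percent).
std::vector<Vec3> updateModel(const std::vector<Vec3>& oldModel,
                              const std::vector<Vec3>& newSegment,
                              float alphaPercent, std::mt19937& rng) {
    if (oldModel.empty())   return newSegment;          // nothing to merge with yet
    if (newSegment.empty()) return oldModel;            // keep the old model unchanged
    std::vector<Vec3> merged = oldModel;
    const std::size_t nNew =
        static_cast<std::size_t>(alphaPercent / 100.0f * merged.size());
    std::uniform_int_distribution<std::size_t> pickOld(0, merged.size() - 1);
    std::uniform_int_distribution<std::size_t> pickNew(0, newSegment.size() - 1);
    for (std::size_t i = 0; i < nNew; ++i)
        merged[pickOld(rng)] = newSegment[pickNew(rng)]; // forget one old point, take one new point
    return merged;
}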

Figure 3.8: Two examples of how the smoothly adapted object model (right column) is updated, compared with the current input point-clouds (left column) and the fixed initial object models (middle column). The example in the upper row shows that an initially occluded but currently visible part (red circle) is incorporated into the proposed object model. The example in the lower row shows that an initially visible but currently occluded part (red circle) is updated as well.

3.5 GPU Implementation

Within the whole pose tracking process, the hypothesis evaluation for each particle is the most computationally costly part, because a large number of particles is needed to approximate the posterior pose probability distribution. Since the hypothesis evaluation of each particle is independent of the other particles, this part can be parallelized.

(a) Thread, block, grid hierarchy in the CUDA framework.
(b) Different memory types in the CUDA framework.

Figure 3.9: CUDA framework structure [Nvi08].

We implement this part on the GPU with Nvidia CUDA programming [Nvi08]. In CUDA terminology, the programming model is divided into a host side and a device side. The device is the graphics card that performs the actual parallel data processing, and the host is the CPU, which controls the device computation. The parallel function that is executed simultaneously on the device is called a kernel function. Each parallel execution instance of a kernel is called a thread, and threads are grouped into blocks (Fig. 3.9). Threads within a block can cooperate through low-latency shared memory, but the shared memory size is limited. If threads need to share large amounts of data, they must use high-latency global memory, which incurs longer communication times.

To evaluate the likelihood of one hypothesis, we need the hypothesis pose X, the JCSD feature matrix J of the current object model and the hypothesis data Z from the current measurement (Fig. 3.10). Each block processes one pose hypothesis with 1024 threads, since 1024 is the maximum number of threads allowed per block. Using the algorithm of Sec. 3.3, each point's fitting score is calculated and the scores are summed to the fitting score of the pose hypothesis. Within one block, each thread handles one portion of the hypothesis data. Assuming that the hypothesis data Z = {z_j}_{j=1}^k consists of k points, the number of points each thread handles is

k_{thread} = \lceil k / 1024 \rceil,


while the last thread may handle fewer points than the others:

k_{lastthread} = k − 1023 · k_{thread}.

Since the hypothesis data typically contains about 2000 points, each thread processes about 2 points serially. In this way we obtain two levels of parallelism: at the first level, multiple blocks evaluate multiple hypotheses; at the second level, each thread within a block processes one portion of the hypothesis data, and its partial fitting scores are accumulated into the fitting score of that block's hypothesis. This utilizes almost the full computational capability of the graphics card.
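A minimal CUDA sketch of this two-level scheme is given below; the kernel, function and buffer names are assumptions, and the per-point fitting score is left as a stub that would perform the trilinear JCSD lookup of Eq. (3.18) and the correlation of Eq. (3.19). One block evaluates one hypothesis, 1024 threads stride over the hypothesis points, and a shared-memory tree reduction forms the block's fitting score (Eq. 3.20).

#include <cuda_runtime.h>

// Placeholder for the per-point fitting score (Eqs. 3.18-3.19); omitted here for brevity.
__device__ float fittingScoreOfPoint(float3 /*z*/, const float* /*h*/, const float* /*modelJCSD*/) {
    return 0.0f;  // stub
}

// One block per pose hypothesis, 1024 threads per block.
// hypPoints : numHyp * numPoints transformed hypothesis points (one set per hypothesis)
// colorH    : 8 SMR weights per hypothesis point, same ordering as hypPoints
// modelJCSD : 8 x m feature matrix of the object model (shared by all hypotheses)
// scores    : one output fitting score per hypothesis
__global__ void evaluateHypotheses(const float3* hypPoints, const float* colorH,
                                   const float* modelJCSD, int numPoints, float* scores) {
    extern __shared__ float partial[];                   // one partial sum per thread
    const int hyp = blockIdx.x;
    const float3* pts = hypPoints + hyp * numPoints;
    const float*  hs  = colorH + 8 * hyp * numPoints;

    float sum = 0.0f;                                    // this thread's portion of Eq. (3.20)
    for (int j = threadIdx.x; j < numPoints; j += blockDim.x)
        sum += fittingScoreOfPoint(pts[j], hs + 8 * j, modelJCSD);

    partial[threadIdx.x] = sum;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {  // tree reduction in shared memory
        if (threadIdx.x < stride) partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) scores[hyp] = partial[0];      // fitting score s_i of this hypothesis
}

// Host-side launch, one block per hypothesis:
// evaluateHypotheses<<<numHyp, 1024, 1024 * sizeof(float)>>>(dPts, dH, dJCSD, numPoints, dScores);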


Figure 3.10: GPU implementation of multiple hypotheses evaluation.


Chapter 4

EXPERIMENTS AND RESULTS

In this chapter, we present a set of experiments that demonstrate the pose tracking accuracy, the segmentation accuracy, and the real-time performance of our approach. As shown in Fig. 4.1, the experiments were performed on three table-top scenarios with dynamic multiple-object interactions:

• Stacking and unstacking two objects: A human hand approaches object 1 and stacks it onto object 2; the hand then leaves the scene for a while, returns, and unstacks object 1 back to its original position.

• Contacting two objects: Two human hands grasp two objects, respectively. Both hands then manipulate the objects such that the objects make contact with each other and move together.

• Rotating an object: A human hand grasps one object and rotates it arbitrarily.

The point-cloud data belonging to the table are first excluded using the plane extraction method of [RC11]; then, to reduce the size of the point-cloud, the remaining data are downsampled with a 5 mm sampling distance using the VoxelGrid filter [RC11]. The grid length used for JCSD was 15 mm in all experiments. All experiments ran on a standard desktop computer (CPU: Intel i5-2500, 3.30 GHz; GPU: Nvidia GeForce GTX 560) with an ASUS Xtion Pro Live RGB-D camera.
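This preprocessing can be sketched with PCL [RC11] as follows; it is a minimal example, the 1 cm RANSAC distance threshold is an assumption, and only the 5 mm leaf size is taken from the text above.

#include <pcl/point_types.h>
#include <pcl/point_cloud.h>
#include <pcl/ModelCoefficients.h>
#include <pcl/PointIndices.h>
#include <pcl/filters/voxel_grid.h>
#include <pcl/filters/extract_indices.h>
#include <pcl/segmentation/sac_segmentation.h>

using Cloud = pcl::PointCloud<pcl::PointXYZRGB>;

// Remove the dominant plane (the table) and downsample to a 5 mm grid.
Cloud::Ptr preprocess(const Cloud::Ptr& input) {
    // RANSAC plane extraction [RC11]; the 1 cm threshold is an assumed value.
    pcl::ModelCoefficients::Ptr coeff(new pcl::ModelCoefficients);
    pcl::PointIndices::Ptr inliers(new pcl::PointIndices);
    pcl::SACSegmentation<pcl::PointXYZRGB> seg;
    seg.setModelType(pcl::SACMODEL_PLANE);
    seg.setMethodType(pcl::SAC_RANSAC);
    seg.setDistanceThreshold(0.01);
    seg.setInputCloud(input);
    seg.segment(*inliers, *coeff);

    Cloud::Ptr noTable(new Cloud);
    pcl::ExtractIndices<pcl::PointXYZRGB> extract;
    extract.setInputCloud(input);
    extract.setIndices(inliers);
    extract.setNegative(true);               // keep everything except the plane
    extract.filter(*noTable);

    // VoxelGrid downsampling with 5 mm leaf size [RC11].
    Cloud::Ptr down(new Cloud);
    pcl::VoxelGrid<pcl::PointXYZRGB> vg;
    vg.setInputCloud(noTable);
    vg.setLeafSize(0.005f, 0.005f, 0.005f);
    vg.filter(*down);
    return down;
}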

4.1 Tracking accuracy

Since we perform model-free object tracking, it is hard to obtain ground truth data for the object and hand poses. Instead of measuring the exact pose error, we therefore use a scene-fitting distance to evaluate the tracking accuracy; it is inspired by [ATDSV12], and in this thesis we use a simplified version. The fitting distance describes how well the object model fits the sensor data based on its estimated pose; it is defined via the distances of the points of the object model in its estimated pose M to the closest points in the current sensor data. However, points of M that are occluded by the current sensor data are excluded, because they would result in a large distance even if the object model were in its true pose.


(a) Stacking and unstacking two objects
(b) Contacting two objects
(c) Rotating an object

Figure 4.1: Object segmentation results. Note that our approach clearly segments the fingers in the stack/unstack and rotation cases.


Fitting Distance. Given the object model in its estimated pose M = {m_i}_{i=1}^n, the current sensor input data Z = {z_j}_{j=1}^k and the camera position c, we first find for each model point m_i a nearest neighbour z_{nn(i)} in the sensor data. Then, for each point pair <m_i, z_{nn(i)}>, a local fitting distance d_i is estimated as:

d_i = \begin{cases} ‖m_i − z_{nn(i)}‖ & \text{if } m_i \text{ is not occluded} \\ 0 & \text{if } m_i \text{ is occluded} \end{cases}        (4.1)

Occlusion is detected based on the camera position c with a ray-casting-like algorithm from [RC11]. We did not use occlusion detection in the actual tracking algorithm because of its heavy computational cost; here, however, it is used only to analyze the pose tracking accuracy.


[Figure 4.2(a): Stacking and unstacking two objects scenario — fitting distance [m] over time [frames], with markers for contact start/end and for object 1 being stacked/unstacked.]
[Figure 4.2(b): Contacting two objects scenario — fitting distance [m] over time [frames], with markers for contact start/end.]

Figure 4.2: Fitting distance comparison: the green line is our approach, the red line is the point-point correspondence approach [CC13]. In both scenarios, our approach outperforms as soon as partial occlusion occurs.

The fitting distance d for the whole object, based on its estimated pose, is then:

d = \sum_{i=1}^{n} d_i.        (4.2)
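A brute-force C++ sketch of the fitting distance of Eqs. (4.1)–(4.2), reusing the Vec3 type from the sketches in Chapter 3. The occlusion test is abstracted into a caller-supplied flag per model point (the thesis uses a ray-casting-like check from [RC11]), and all names are assumptions.

#include <vector>
#include <cmath>
#include <cstddef>
#include <algorithm>
#include <limits>

// Fitting distance of one object (Eqs. 4.1-4.2).
float fittingDistance(const std::vector<Vec3>& model,     // model points in estimated pose
                      const std::vector<bool>& occluded,  // occlusion flag per model point
                      const std::vector<Vec3>& sensor) {  // current sensor data
    float d = 0.0f;
    for (std::size_t i = 0; i < model.size(); ++i) {
        if (occluded[i]) continue;                        // d_i = 0 for occluded points
        float best = std::numeric_limits<float>::max();
        for (const Vec3& z : sensor) {                    // nearest neighbour in the sensor data
            const float dx = model[i].x - z.x, dy = model[i].y - z.y, dz = model[i].z - z.z;
            best = std::min(best, std::sqrt(dx * dx + dy * dy + dz * dz));
        }
        d += best;                                        // accumulate d_i, Eq. (4.2)
    }
    return d;
}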

We implemented the point-point correspondence based hypothesis evaluation from [CC13] and compared it with our method; apart from the hypothesis evaluation function, all other parts use exactly the same procedure. We compare both methods in terms of fitting distance. As shown in Fig. 4.2, our approach outperforms the point-point correspondence approach in both the stack/unstack and the multiple contact scenario, because in these two scenarios partial occlusion of the boxes leads to inaccurate hypothesis likelihoods for the point-point correspondence method and consequently to inaccurate pose estimates.


4.2 Segmentation accuracy

To obtain ground truth segmentation data, we attached AR markers of size 4 cm × 4 cm to the centers of two objects (Fig. 4.3) and manually constructed their corresponding cube-shaped point-clouds. The 6 DoF poses are obtained from the ROS node ar_track_alvar, which detects the AR markers, identifies them and outputs their poses. However, the obtained poses are noisy and jitter, therefore we filter them with a Kalman filter to obtain smoother marker poses. The ground truth segmentation data is then obtained by placing the constructed cube model at the tracked marker pose in the scene data. Points near the constructed cube model are labeled as box, and the remaining points are labeled as hands.

Figure 4.3: Objects used in the experiments, with an AR marker attached to the center.

In this experiment, we set the model update ratio to α = 2% and used 200 particles for the particle filtering. Since the table point-cloud is removed beforehand, every point belongs to some object. The accuracy of the object segmentation is defined as the ratio between the number of correctly associated points and the overall number of points:

Accuracy = \frac{\#\text{correctly associated points}}{\#\text{all points}},        (4.3)

which is equivalent to the segmentation accuracy measure PMOTA (Point-level Multiple Object Tracking Accuracy) presented in [KLK14]. The average object segmentation accuracy over all three scenarios is 99.4%. This outperforms the result of [KLK14], where a similar experimental setup achieved ca. 90% segmentation accuracy. Qualitative segmentation results are shown in Fig. 4.1.


4.3 Computation time

We first implemented our hypothesis evaluation method on the CPU. To evaluate one particle for an object model with ca. 900 measurement points, our CPU implementation requires on average 0.29 ms, while the point-point correspondence approach needs on average 0.9 ms. This is because our approach has linear complexity, whereas the point-point correspondence search requires a k-d tree search, which results in O(m log n) complexity.

We further implemented the hypothesis evaluation on the GPU to parallelize the particle weighting process. We compare the CPU and GPU implementations of the hypothesis evaluation for different particle numbers on an object with ca. 900 points; the results are shown in Table 4.1.

Table 4.1: Particle number vs. particle hypotheses evaluation time

Particle number   CPU time [ms]   GPU time [ms]   Speed-up
10                9.7             0.82            11.8x
20                16.9            0.93            18.2x
30                25.6            0.95            26.9x
40                30.5            0.97            31.4x
50                35              1.13            30.9x
100               67              1.8             37.2x
200               130             2.35            55.3x
500               343             3.8             90.2x
1000              654             6.1             107.2x
1500              1070            8.7             123x
2000              1445            11              131.3x


Figure 4.4: Total computation time per frame [ms] and fitting distance [mm] in relation to the particle number (three objects with ca. 3000 points).


To evaluate the real-time performance of the whole tracking method, the stacking/unstacking case is used, in which three objects are present and the average number of points per frame is 3213. For different particle numbers, we measured the required computation time per frame and the average fitting distance over all frames. As expected, Fig. 4.4 shows that with a larger particle number the computation time grows while the fitting distance decreases. Beyond ca. 100 particles, there is no significant further decrease of the fitting distance while the computation time keeps growing. Therefore, a reasonable particle number is around 100, which yields a computation speed of 21 FPS.


Chapter 5

CONCLUSION

In this thesis, we investigated multiple model-free object tracking with the goal of improving both tracking accuracy and real-time performance compared to [KLK14]. We chose a particle filtering based object tracking method, because particle filtering is not limited to linear system dynamics and Gaussian noise. Moreover, compared to the iterative registration method used in [KLK14], particle filtering can be parallelized, so the real-time performance can be improved with parallel programming. We designed a novel feature, the Joint Color-Spatial Descriptor (JCSD), to evaluate the hypotheses in the particle filtering framework. The hypothesis evaluation with the proposed JCSD provides more robust likelihood values than the point-point correspondence based method [CC13], especially under partial occlusion. With this robust hypothesis evaluation, the obtained pose tracking results are accurate enough to achieve an object segmentation accuracy higher than 99%, which exceeds the result of [KLK14]. In addition to the segmentation accuracy, the method is computationally efficient enough to be implemented on the GPU in real-time. The experimental results showed that with around the minimum required number of particles the method can track three objects simultaneously at 21 FPS. Thanks to the accurate real-time segmentation of unknown objects, on-line object learning tasks will be investigated further in future work.

In order to adapt the object model to more recent observations, we used a forgetting factor to smoothly remove old observation points from the model and add new ones. With this forgetting factor, the object model is stable against pose tracking errors while remaining adaptive. However, our model update is simple and does not consider issues such as occlusion or object interpenetration. Thus, in future work, the model update method should be investigated further.

A limitation of our tracking method is that it is constrained to rigid objects; therefore, in the experiments we kept our hands as rigid as possible. However, in a dynamic human-robot environment, many non-rigid objects exist. In future work, each unknown object could be divided into approximately rigid local elements, and the proposed JCSD feature could be used to capture the non-rigidity of the objects.


List of Figures

1.1  Model-based tracking example
1.2  Model-free tracking example
1.3  Main steps of tracking
1.4  Segmentation task
2.1  Point-cloud
2.2  K-d tree
2.3  Graphical model
2.4  SIR graphic
3.1  Non-contact
3.2  Two example cases where simple Euclidean clustering fails to separate different objects. Three objects are represented in different colors.
3.3  Unsupervised object individuation framework proposed in [KLK14]
3.4  System overview
3.5  Hue space
3.6  Colorfulness
3.7  2D grid
3.8  Model update
3.9  CUDA framework
3.10 GPU implementation
4.1  Object segmentation result. Note that our approach can clearly segment the fingers in the stack/unstack and rotation cases.
4.2  Fitting distance
4.3  AR marker
4.4  Time vs. error


Bibliography

[AMGC02] M. Sanjeev Arulampalam, Simon Maskell, Neil Gordon, and Tim Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.

[ATDSV12] Aitor Aldoma, Federico Tombari, Luigi Di Stefano, and Markus Vincze. A global hypotheses verification method for 3D object recognition. In Computer Vision – ECCV 2012, pages 511–524. Springer, 2012.

[BC12] James Anthony Brown and David W. Capson. A framework for 3D model-based visual tracking using a GPU-accelerated particle filter. IEEE Transactions on Visualization and Computer Graphics, 18(1):68–80, 2012.

[Ben75] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18:509–517, September 1975.

[CC13] Changhyun Choi and Henrik I. Christensen. RGB-D object tracking: A particle filter approach on GPU. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1084–1091. IEEE, 2013.

[Che03] Zhe Chen. Bayesian filtering: From Kalman filters to particle filters, and beyond. Statistics, 182(1):1–69, 2003.

[CRM03] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5):564–577, 2003.

[DDFG01] Arnaud Doucet, Nando De Freitas, and Neil Gordon. An Introduction to Sequential Monte Carlo Methods. Springer, 2001.

[GS97] T. Gevers and A. Smeulders. Color based object recognition. Pattern Recognition, 32:453–464, 1997.

[GSS93a] Neil J. Gordon, David J. Salmond, and Adrian F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing), volume 140, pages 107–113. IET, 1993.


[GSS93b] Neil J. Gordon, David J. Salmond, and Adrian F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing), volume 140, pages 107–113. IET, 1993.

[KHRF11] Michael Krainin, Peter Henry, Xiaofeng Ren, and Dieter Fox. Manipulator and object tracking for in-hand 3D object modeling. The International Journal of Robotics Research, 30(11):1311–1327, 2011.

[KLK13] Seongyong Koo, Dongheui Lee, and Dong-Soo Kwon. GMM-based 3D object representation and robust tracking in unconstructed dynamic environments. In 2013 IEEE International Conference on Robotics and Automation (ICRA), pages 1114–1121. IEEE, 2013.

[KLK14] Seongyong Koo, Dongheui Lee, and Dong-Soo Kwon. Unsupervised object individuation from RGB-D image sequences. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4450–4457. IEEE, 2014.

[LN10] Dongheui Lee and Yoshihiko Nakamura. Mimesis model from partial observations for a humanoid robot. The International Journal of Robotics Research, 29(1):60–80, 2010.

[NKMVG03] Katja Nummiaro, Esther Koller-Meier, and Luc Van Gool. An adaptive color-based particle filter. Image and Vision Computing, 21(1):99–110, 2003.

[Nvi08] Nvidia. CUDA Programming Guide, 2008.

[OTDF+04] Kenji Okuma, Ali Taleghani, Nando De Freitas, James J. Little, and David G. Lowe. A boosted particle filter: Multitarget detection and tracking. In Computer Vision – ECCV 2004, pages 28–39. Springer, 2004.

[PA10] Vasilis Papadourakis and Antonis Argyros. Multiple objects tracking in the presence of long-term occlusions. Computer Vision and Image Understanding, 114(7):835–846, 2010.

[PKAW13] Jeremie Papon, Tomas Kulvicius, Eren Erdal Aksoy, and Florentin Worgotter. Point cloud video object segmentation using a persistent supervoxel world-model. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3712–3718. IEEE, 2013.

[PW15] Jeremie Papon and Florentin Worgotter. Spatially stratified correspondence sampling for real-time point cloud tracking. In 2015 IEEE Winter Conference on Applications of Computer Vision (WACV) (submitted), 2015.


[RC11] R. B. Rusu and S. Cousins. 3D is here: Point Cloud Library (PCL). In IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, May 9–13, 2011.

[SLP11] Alexander M. Schmidts, Dongheui Lee, and Angelika Peer. Imitation learning of human grasping skills from motion and force data. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1002–1007. IEEE, 2011.

[SQP02] Shamik Sural, Gang Qian, and Sakti Pramanik. A histogram with perceptually smooth color transition for image retrieval. In Fourth International Conference on Computer Vision, Pattern Recognition and Image Processing, pages 664–667, 2002.

[WCC+13] Wei Wang, Lili Chen, Dongming Chen, Shile Li, and K. Kuhnlenz. Fast object recognition and 6D pose estimation using viewpoint oriented color-shape histogram. In 2013 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2013.

[YJS06] Alper Yilmaz, Omar Javed, and Mubarak Shah. Object tracking: A survey. ACM Computing Surveys (CSUR), 38(4):13, 2006.

[YSZ+11] Hanxuan Yang, Ling Shao, Feng Zheng, Liang Wang, and Zhan Song. Recent advances and trends in visual tracking: A review. Neurocomputing, 74(18):3823–3831, 2011.