3D Urban Modelling
by
Ee Hui Lim
B.Eng.Hons (Monash University) 2004
A thesis submitted in total fulfilment of the requirements for the degree of
Doctor of Philosophy
in the
Department of Electrical and Computer Systems Engineering
Monash University
Clayton Victoria 3800
Australia
July 2011
3D Urban Modelling
Copyright © 2011
by
Ee Hui Lim
All Rights Reserved
Contents

Summary
Declaration
Preface
Acknowledgements
Dedication

1.0 INTRODUCTION
1.1 BACKGROUND AND MOTIVATION
1.1.1 Photogrammetry
1.1.2 LIDAR data
1.1.3 Why is 3D Urban Modelling difficult?
1.2 STRUCTURE OF THE THESIS
2.0 DATA ACQUISITION AND OCCLUSION REMOVAL
2.1 INTRODUCTION
2.2 TECHNOLOGY
2.3 DATA PRE-PROCESSING
2.4 PREVIOUS WORK IN IMAGE OCCLUSION REMOVAL
2.5 OCCLUSION REMOVAL WITH MINIMUM NUMBER OF IMAGES
2.5.1 Occlusion Detection
2.5.2 Occlusion Removal
2.6 EXPERIMENTAL VALIDATION
2.6.1 Indoor data set
2.6.2 Outdoor data set
2.6.3 Panoramic Data Set
2.7 CONCLUSION
3.0 FEATURE DESCRIPTORS FOR 3D CLASSIFICATION
3.1 INTRODUCTION
3.2 REGION COVARIANCE AS 3D FEATURE DESCRIPTOR
3.2.1 Extension to Saliency Features
3.2.2 Validation of the Extended Saliency Features
3.3 ESTIMATED NORMALS AS 3D FEATURE DESCRIPTOR
3.3.1 Delaunay/Voronoi Method
3.3.2 Numerical Optimization Methods
3.3.3 Evaluation of the Surface Normal Estimation Approaches
3.4 CONCLUSION
4.0 3D OVER-SEGMENTATION
4.1 INTRODUCTION
4.2 BACKGROUND OF OVER-SEGMENTATION
4.2.1 Shape of the Super-voxel
4.2.2 Scale Selection
4.3 SUPER-VOXEL – A 3D OVER-SEGMENTATION APPROACH
4.3.1 Sphere as Dividing Boundary
4.3.2 Automatic Scale Selection
4.3.3 Synthetic Data
4.3.4 Outdoor LIDAR Data
4.4 CONCLUSION
5.0 DATA CLASSIFICATION WITH MCRF
5.1 INTRODUCTION
5.2 BACKGROUND OF SUPERVISED CLASSIFICATION
5.2.1 Generative Model
5.2.2 Discriminative Model
5.2.3 Graphical Model
5.3 MULTI-SCALE CONDITIONAL RANDOM FIELD
5.4 PLANE PATCHES FITTING
5.5 RESULTS FOR DATA CLASSIFICATION
5.5.1 Synthetic data sets
5.5.2 Urban data sets
5.5.3 Summary of the Experiment Results
5.6 CONCLUSION
6.0 ROBUST SEGMENTATION
6.1 INTRODUCTION
6.2 BACKGROUND OF ROBUST SEGMENTATION
6.2.1 Region Growing
6.2.2 Model Fitting
6.2.3 Clustering
6.3 INFINITE GAUSSIAN MIXTURE MODEL
6.4 RESULTS
6.5 CONCLUSION
7.0 CONCLUSIONS
7.1 CONCLUDING REMARKS
APPENDIX I: DATA ACQUISITION
APPENDIX II: SINGULAR VALUE DECOMPOSITION
APPENDIX III: DERIVATION OF CONDITIONAL POSTERIOR DISTRIBUTION OF PARAMETERS FOR IGMM
REFERENCES
Summary
This dissertation concerns the development of techniques for the robust generation of 3D
terrestrial urban models. 3D urban models can be constructed by processing the data from
photogrammetry, which is the earliest remote sensing technology. Compared to traditional
photogrammetry, LIDAR (Light Detection and Ranging) is a relatively new and fast
method to obtain 3D urban models, which samples surfaces with high density and high
point accuracy. The use of a “Multisensor” (laser scanner and camera) permits more
complete and efficient data acquisition.
There are several challenges in 3D urban modelling. The construction of 3D
terrestrial urban models requires the acquisition of a large amount of LIDAR data which
can take months, and processing the data is memory-intensive. The very large amount of
data, the unavoidable noise due to the uncontrollable environment, the large variety of structure shapes, and the sheer number of structures make robust multi-structure
data processing extremely challenging. In addition, the number of structures and the
locations of the objects in the outdoor environment are unknown, making classical
statistical methods insufficient to segment or model the data accurately.
A series of algorithms contributing to the generation of urban models from
terrestrial LIDAR data is presented in this dissertation. The approach starts in Chapter 2
with improved solutions for removing inconsistencies (due to moving occlusions) between
the images and LIDAR data. An existing algorithm was modified which assumes that the occlusions are all independent objects, i.e., that a connected occluded region cannot consist of occlusions from different images. This assumption is often violated in outdoor environments due to the large amount of pedestrian overlap. Since a large difference in the discontinuity measure at the boundary of an occlusion is most likely to indicate that occlusion, the idea of the proposed improvement is to analyse the discontinuity measure along the boundary where occlusions occur, to separate any “overlapped” occlusions.
The pre-processed data, with occlusions removed, are then labelled into different
data classes. 3D data labelling in existing studies has been mostly point-based. This
introduces a large amount of redundant computation – labelling every point will result in a
high computational load which can be reduced by classifying a smaller sub-set of the data.
A 3D over-segmentation algorithm, based on the concept of the super-voxel, is introduced for
this purpose. This novel method, based on 3D scale theory, groups regions which are
homogeneous with respect to colour and geometry similarities. This method is shown to
efficiently reduce the outdoor terrestrial LIDAR data for classification. Feature descriptors
are then computed for the super-voxels. One of the feature descriptors, saliency, is shown
in this dissertation to be invariant to the adaptive size of the super-voxel used to compute
the descriptors. A hierarchical graphical model, the multi-scale Conditional Random Fields
(mCRF), is proposed to learn the parameters of the extracted features to label the super-
voxels into different data types.
In addition, a robust estimation method that successfully addresses a number of
issues associated with the segmentation of complex data with unknown numbers of
structures is described. The residual function of the Infinite Gaussian Mixture Model
(IGMM) is modified for the clustering of labelled data belonging to planar surfaces into
locally-delimited planes. The modified algorithm is evaluated on the labelled planar data.
The results of the proposed method show the robustness of the algorithm for the clustering
problem. The proposed method is also shown to be capable of handling missing data caused
by occlusion or transparent objects.
This thesis proposes several improvements to urban modelling methodology. The
proposed approaches in this dissertation have been tested on LIDAR data acquired from
outdoor terrestrial environments, and are effective in solving some of the existing
challenges in 3D urban modelling.
Declaration
1 July 2011
I declare that:
1. This thesis contains no material that has been accepted for the award of any other
degree or diploma in any university or institute.
2. To the best of my knowledge, this thesis contains no material that has previously
been published or written by another person except where due reference is made in
the text of the thesis.
Signed:
Ee Hui Lim
Preface
During my study at Monash University, a number of papers, which contain material used in
this thesis, have been published:
Lim, E. H.; Suter, D.; 3D Terrestrial LIDAR Classifications with Super-voxels and Multiscale Conditional Random Fields, in Computer-Aided Design, 41(10), pp. 701-710, 2009.

Lim, E. H.; Suter, D.; Unsupervised plane data and plane patches clustering for 3D terrestrial urban modelling based on modified Dirichlet process mixture model method, in Visualization, Imaging and Image Processing Conference, Palma de Mallorca, Spain, pp. 97-102, 2008.

Lim, E. H.; Suter, D.; Multi-scale Conditional Random Fields for over-segmented irregular 3D point clouds classification, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Alaska, USA, pp. 1-7, 2008.

Lim, E. H.; Suter, D.; Conditional Random Field for 3D Point Clouds with Adaptive Data Reduction, in International Conference on Cyberworlds, CW '07, Hanover, Germany, pp. 404-408, 24-26 Oct. 2007.

Lim, E. H.; Suter, D.; Classification of 3D LIDAR Point Clouds for Urban Modelling, in Proc. 21st Image and Vision Computing New Zealand, Great Barrier Island, New Zealand, pp. 149-154, 2006.

Lim, E. H.; Suter, D.; Occlusion Removal in Image for 3D Urban Modelling, in Proc. 21st Image and Vision Computing New Zealand, Great Barrier Island, New Zealand, pp. 191-196, 2006.
Acknowledgements
My first and primary acknowledgement must go to my advisor and mentor, Prof. David
Suter, from whom I have learnt so much. He has made a significant contribution to this
dissertation, by introducing me to this research area, providing invaluable guidance and
giving me opportunities to visit other computer vision laboratories. My PhD research would
not have been accomplished without his enlightening ideas, constructive criticism, warm
support, and the countless hours spent on the discussion and proofreading of my research.
I would also like to thank my associate supervisor, Prof. Raymond Jarvis, for his
insightful comments during the weekly seminars. I am also very grateful to my colleagues
in the Institute for Vision Systems Engineering for all the discussions and assistance they
offered during my candidature. I thank them sincerely for all their help. I would particularly
like to thank Dr. Nghia Ho for proofreading parts of the manuscript and discussing
numerous mathematical issues. My years at Monash would not have been so enjoyable without the help and friendship of my fellow PhD students and the staff in the ECSE department. Dr
Alex McKnight assisted by proofreading the final version of the thesis. I would like to
extend my gratitude to them all.
Thanks to Prof. Yasuhito Suenaga and David Suter, I was able to visit the Intelligent
Media Integration for Social Information Infrastructure under the 21st Century COE
(Centre of Excellence) Program for two months. I would like to thank all the staff and
students in the laboratories led by Prof Toyohide Watanabe and Prof. Masayuki Tanimoto
for sharing their experiences and insights in the Virtual Construction of Real World and the
FTV (Free Viewpoint Television) projects. I am also very grateful to have been able to visit
and discuss my work with Simon Kocbek and Prof. Peter Kokol of the System Design
laboratory at the University of Maribor, Slovenia. Thanks are owed also to Dr. Konrad
Schindler, Dr. Hanzi Wang, Dr Liang Wang and Dr Cathy Zhou. I have benefited greatly
from their valuable advice and kind support.
I would like to express my gratitude to my parents, who have been understanding, caring and passionate about my education, and to my relatives, friends and colleagues in Silcar, who have been encouraging and supportive. Last but not least, Qing Huang has always
been my inspiration. I truly share this accomplishment with them all.
Dedication
To my mother, my father and Aunt Kim.
1.0 INTRODUCTION
3D modelling of urban environments has become increasingly popular with the growth
and advances in computational capability. Meanwhile, the development of laser scanning
technology in the past decades has provided various high resolution and long range laser
scanners that allow rapid and efficient acquisition of 3D data. The demand for urban
modelling is apparent in both government and commercial users. An OEEPE survey [1]
conducted by the European Organization for Experimental Photogrammetric Research
showed that 95% of the participants were most interested in 3D urban modelling and 75%
were interested in 3D information about vegetation.
This rising demand has resulted in a wide range of applications for 3D urban
modelling. For instance, the generation of 3D urban models improves the practice of urban
environmental planning and design, by providing the possibility to simulate a proposed
change for the planning authorities. With these virtual urban models, applications such as
regional planning, precise navigation and disaster management can be enhanced. Many
governments and city councils are now using the models for planning and for studies of
climate, air quality, fire propagation and public safety. Moreover, new applications for
virtual reality walk-through and environmental simulation using urban models, such as
Google Earth¹ and Microsoft Virtual Earth², are being developed. Apart from the above-
mentioned applications, urban models are also used in signal propagation modelling in
telecommunications. Batty et al. [2] group these applications into 12 categories of use,
including emergency services, urban planning, telecommunications, architecture, facilities
and utilities management. Shiode [3] later summarized these applications into four
categories, (1) planning and design; (2) infrastructure and facility services; (3) commercial
sector and marketing; and (4) promotion and learning of information on cities.
Various terms are used in describing these 3D models, including “Urban Model”,
“City Model”, “Virtual City”, “Cybercity” and “Digital City”. Currently, urban models
have been developed for most major cities, especially in cities with populations of over
one million, including Tokyo, New York, Mexico, London, Paris, Los Angeles, Chicago
and Delhi [2]. These 3D urban models generally contain models of man-made structures
such as buildings, and some of the models also contain natural objects such as vegetation
and terrain. Particular interest is focused on building reconstruction, where the building
models usually contain the geometric description (shapes and positions) and the
radiometric description (texture) of the building. Natural objects, including terrain and
vegetation, are also modelled to provide, for example, a more realistic representation for
robot localisation applications and visualisation purposes.
1.1 BACKGROUND AND MOTIVATION
Early urban models were physical wooden models built from manual measurements, where survey crews physically placed conventional measurement devices on every key feature. With advances in computing technology, virtual models can be created with computer-aided design (CAD) tools, from design maps of existing buildings and from (though often impractical) manual measurements. Various sensors which will be discussed
in the following sections have been developed for efficient data acquisition.
¹ http://earth.google.com/
² http://www.microsoft.com/maps/
1.1.1 Photogrammetry
3D urban models can be constructed by processing data from photogrammetry [4-6].
Photogrammetry is the earliest remote sensing technology, dating back to 1860, when
Albrecht Meydenbauer made his first investigations into architectural photogrammetry.
Photogrammetry refers to determining geometric properties, including the position,
orientation, shape and size of objects, from photographs and videos. To understand how
photogrammetry obtains these geometric measurements, consider the human vision
system: a stereo (3D) view can be obtained when we see an object from two different
positions (left eye and right eye). Similarly, if an object is photographed from two
different positions, a 3D picture of the object can be acquired, i.e., depth perception is
achieved.
When two or more overlapping images (with known camera characteristics:
including lens focal length, image size and resolution) from different perspectives are
available, stereo image matching, which is a standard photogrammetry technique, can be
employed to determine geometric measurements. The position of the camera where the
images are acquired is not required, as the geometry measurements are determined directly
from the images. To compute the building and terrain surfaces from stereo images, a
number of common points are identified on each image. A projection line is then
constructed from the camera location to the point on the object. The intersection of these
rays, known as triangulation, determines the three-dimensional location of the point [7]. As pointed out before, triangulation is one of the ways our two eyes work together to estimate
distance.
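To make the triangulation step concrete, here is a minimal sketch (not this thesis's pipeline; the camera centres, ray directions and function name are assumptions for illustration). It recovers a 3D point from two back-projected rays, returning the midpoint of the shortest segment between them, since noisy rays rarely intersect exactly:

    import numpy as np

    def triangulate_midpoint(c1, d1, c2, d2):
        """Triangulate from two rays x = c + t*d (c: camera centre, d: direction).
        Noisy rays are skew, so return the midpoint of the shortest segment."""
        d1 = d1 / np.linalg.norm(d1)
        d2 = d2 / np.linalg.norm(d2)
        # Least-squares ray parameters t1, t2 minimising |(c1 + t1*d1) - (c2 + t2*d2)|
        A = np.stack([d1, -d2], axis=1)                        # 3x2 system
        (t1, t2), *_ = np.linalg.lstsq(A, c2 - c1, rcond=None)
        return 0.5 * ((c1 + t1 * d1) + (c2 + t2 * d2))

    # Two cameras 1 m apart viewing a point at (0, 0, 10)
    p = triangulate_midpoint(np.array([0., 0., 0.]), np.array([0., 0., 1.]),
                             np.array([1., 0., 0.]), np.array([-1., 0., 10.]))
    print(p)  # ~ [0, 0, 10]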
Data acquisition (taking pictures) for photogrammetry is fast compared to LIDAR
scanning (see next section). However, processing the image (to retrieve the 3D
measurements for complex images with multiple buildings and vegetation) can be quite
tedious. Comparisons of the accuracy and robustness of reconstruction from LIDAR (Light Detection and Ranging) data and from stereo images have been presented [8, 9]. In short, LIDAR data acquisition is comparable with photogrammetry to a certain extent and will replace photogrammetry in some cases. The two technologies are largely complementary, and their integration can provide more complete and accurate data acquisition.
1.1.2 LIDAR data
LIDAR (Light Detection and Ranging) is a relatively new technology. It works by emitting a laser pulse; the distance is obtained via the time-of-flight method, by precisely measuring the time for the pulse to return to the source and converting it to a distance using the speed of light. The range (or depth), together with the direction of the pulse,
are recorded by the scanner to generate the 3D coordinates of the point on the target from
which the pulse is reflected. Another method for measuring distance is via phase
differencing, which uses a continuous wave with a continuous phase change. The method
calculates the distance based on the phase of a returning pulse [10, 11]. The coordinates
are expressed in a coordinate system fixed to the laser scanner, not an absolute coordinate
system fixed to the earth. These 3D points are commonly known as “point clouds”.
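As a minimal illustration of the time-of-flight computation, and of how a range-plus-direction measurement becomes a scanner-centred 3D point (the function and variable names are assumptions for this sketch, not the scanner's actual interface):

    import numpy as np

    C = 299_792_458.0  # speed of light in m/s

    def tof_point(round_trip_time, azimuth_deg, elevation_deg):
        """One pulse return -> 3D point in the scanner-fixed coordinate system."""
        r = 0.5 * C * round_trip_time          # halve: the pulse travels out and back
        az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
        return np.array([r * np.cos(el) * np.cos(az),
                         r * np.cos(el) * np.sin(az),
                         r * np.sin(el)])

    # A return after ~333.6 ns corresponds to a target ~50 m away
    print(tof_point(333.6e-9, azimuth_deg=45.0, elevation_deg=10.0))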
LIDAR is a fast method for sampling surfaces with high density and high accuracy. The range image generally stores not only range or depth: the returned laser intensity (which is particularly useful for its illumination-invariant property) can also be obtained. With
calibration, the use of a “Multisensor” system (laser scanner and camera) permits more
complete and efficient data acquisition. This “Multisensor” system provides a high
resolution and complete coverage of the environment for urban modelling.
There are two types of urban model in general: airborne and terrestrial. The
airborne urban model can be built with airborne laser scanner data and satellite or aerial
images. The airborne laser scanner is generally mounted on an aircraft, a helicopter or an
Unmanned Aerial Vehicle (UAV) [12]. Data acquisition for a terrestrial urban model, on
the other hand, is via a close-range terrestrial laser scanner mounted on a mobile robot or
vehicle.
Airborne data consists only of information on the roof-tops of buildings and the terrain shapes; building façades cannot be represented in detail. The
airborne model is suitable for applications where a birds-eye view of the urban model is
desirable. On the other hand, ground-based or terrestrial modelling is capable of
reproducing detailed building façade scans, while the building roof-top details are
generally unavailable. The terrestrial urban model is preferable for applications such as a
walkthrough in virtual reality, environmental simulation or for robot navigation. In the
case where both building roof-top and façade details are required, the fusion of the
airborne and ground-based models permits a complete urban model to be developed.
1.1.3 Why is 3D Urban Modelling difficult?
The construction of 3D terrestrial urban models requires the acquisition of a large amount
of LIDAR data which can take months, and it is memory-intensive to store and process the
data. The very large amount of data, the unavoidable noise due to the uncontrollable
environment, the large variety of structure shapes, and the sheer number of
structures make robust multi-structure data processing extremely challenging. In addition,
the number of structures and the location of objects in the outdoor environment are
unknown, making classical statistical methods insufficient to segment or model the data
accurately.
The following section outlines several challenges of 3D urban modelling.
Challenge I: Inconsistencies between the images and the LIDAR data
The acquisition of both image and LIDAR data is prone to anomalous occlusion due to
moving pedestrians and other objects in the uncontrolled outdoor environment. This is
because there is a distinct time difference between the data acquisition from the laser
scanner and the camera, i.e., the data acquisition processes for the laser scanner and the
camera do not occur simultaneously. As a result, anomalous occlusion or inconsistency
between the images and the LIDAR data (when a moving object does not coincide at the
same location on both the laser scan and the image, explained in detail in Section 2.3) may
occur. Removing the inconsistencies in the images can be challenging. In a busy
environment, simple median filtering may not be sufficient to remove most of the
inconsistencies, as any given portion of the image scenes may be occluded more than 50%
of the time. Also, the median filter does not take any spatial continuity properties into
account. A more complicated image processing solution is required for image occlusion
removal. Based on the concept that when an occlusion happens, there is generally a
discontinuity around the boundary of the occlusion, Herley [13] proposed a method which
removes occlusion using a minimum number of images. However, the method is based on
the assumption that the occlusions are independent objects, which is often violated in
outdoor environments especially along a pathway during busy hours.
Challenge II: Large amount of complex outdoor data
With fast data acquisition, the amount of acquired data is massive. It is possible to collect
50,000 points per second to generate a point cloud that contains millions of points. A
complete city model usually requires registration of several scans from different locations
(with each scan being large in size). Storing or visualising the raw data is memory-demanding, and processing the data requires a long time. Manually reconstructing a 3D
urban model is costly and time consuming. An automated surface reconstruction algorithm
is therefore highly desirable.
There exist several point cloud reduction techniques, such as the coarse-to-fine
point cloud simplification by Moenning and Dodgson [14], using farthest point sampling.
Farthest point sampling is based on the idea of minimizing any reconstruction error by
repeatedly adding one sample point at a time and placing the sample point in the middle of
the least-known sampling domain. Similarly, Alexa et al. [15] reduce the point cloud by
removing the points with least contribution to the moving least squares (MLS)
representation of the underlying surface. These point cloud simplification methods can be
a useful step to speed up the subsequent surface reconstruction process.
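A minimal numpy sketch of the farthest point sampling idea described above (the function name and the greedy seed choice are my own): each new sample is placed farthest from the samples chosen so far, i.e. in the middle of the least-known region.

    import numpy as np

    def farthest_point_sampling(points, k):
        """Greedily select k of the input points as a reduced point cloud."""
        chosen = [0]                                        # arbitrary seed
        d = np.linalg.norm(points - points[0], axis=1)      # distance to sample set
        for _ in range(k - 1):
            idx = int(np.argmax(d))                         # farthest point so far
            chosen.append(idx)
            d = np.minimum(d, np.linalg.norm(points - points[idx], axis=1))
        return points[chosen]

    cloud = np.random.rand(100_000, 3)
    print(farthest_point_sampling(cloud, 512).shape)        # (512, 3)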
Another method to reduce the data is via Delaunay Triangulation [16-18], which is
a common method for indoor object reconstructions from range data. The Delaunay
Triangulation is a set of triangles that connect each data point to its neighbours, with the
property that for each triangle, the unique circle circumscribed about the triangle contains
no data point. Nevertheless, the outdoor terrestrial laser scanned point clouds have very
different properties compared to the smaller indoor objects, which make triangulation and
other direct reconstruction/model fitting methods difficult. The outdoor range data often
suffer from occlusions due to moving (e.g. pedestrians) or still (e.g. vegetation)
obstructions during data acquisition. The data also have the property of varying density
due to different distances of scanned objects from the laser scanner. In addition, a large
outdoor range data set contains multiple multi-structure objects, including cluttered
vegetation. Due to these properties, direct reconstruction is very difficult and challenging.
Triangulating vegetation data often causes unwanted spikes (as shown in Figure 1-1), or
even connects the near-by vegetation and building data into the same surface. In addition,
it is difficult to represent the edges and corners properly with simple triangulations. Also,
extra knowledge is required to recover occlusions as building surfaces are often obstructed
by vegetation, as shown in Figure 1-1.
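For reference, on well-behaved data the triangulation itself is routine; a small sketch with scipy.spatial (an assumed dependency, shown here on random 2D points rather than the problematic outdoor scans) builds exactly the structure defined above:

    import numpy as np
    from scipy.spatial import Delaunay   # assumed available

    pts = np.random.rand(30, 2)
    tri = Delaunay(pts)
    # tri.simplices holds (n_triangles, 3) point indices; by the Delaunay
    # property, no input point lies inside any triangle's circumscribed circle.
    print(tri.simplices.shape)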
In a complex environment that contains various data types, especially an outdoor
environment with a relatively large amount of data belonging to vegetation, one method to
model the data is to geometrically fit the dense point clouds with primitive models (in
other words, surface reconstruction) [19, 20]. For example, hundreds or thousands of data points acquired from a plane can be defined using three points, or four points which represent the
four vertices of the planar rectangle in urban environments.

Figure 1-1 Triangulations of outdoor range data with RiScan Pro software (edge clearing threshold = 0.05 m; depth factor = 8; depth threshold = 0.05 m)

Vegetation such as trees and
bushes can be replaced with generic models or simplified with primitive models,
depending on the required level of detail. In order to fit the data with the appropriate
geometric models, the data can be classified into different classes, such as man-made
buildings, terrain, vegetation; or at a lower level, the raw data can be classified into
different geometrical classes, i.e. planar, linear or cluttered.
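These lower-level geometrical classes can be illustrated with the eigenvalues of a local 3x3 covariance matrix, which previews the saliency features of Chapter 3; the hard thresholds below are illustrative assumptions of this sketch, not the classifier used in this thesis:

    import numpy as np

    def geometric_class(neighbourhood):
        """Label a local neighbourhood of 3D points as planar, linear or
        cluttered from its covariance eigenvalues (ascending l0 <= l1 <= l2)."""
        l0, l1, l2 = np.linalg.eigvalsh(np.cov(neighbourhood.T))
        if l1 > 5 * l0 and l2 < 5 * l1:     # two dominant directions: a surface
            return "planar"
        if l2 > 5 * l1:                      # one dominant direction: edge or pole
            return "linear"
        return "cluttered"                   # comparable spread: vegetation-like

    slab = np.random.rand(200, 3) * [1.0, 1.0, 0.01]   # thin planar slab
    print(geometric_class(slab))                       # -> planar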
Challenge III: Data classification

3D outdoor data classifications are not new. Several previous attempts at urban
modelling use aerial LIDAR data, where the acquired data consist of 2D (or 2.5D) birds-
eye view point clouds. Methods to classify the data that have been applied to aerial urban
modelling include the utilization of linear classifiers, or clustering methods with features
extracted from the height difference [21-24], variations of surface normal vectors [22] and
colours [25]. These methods are often inappropriate for data classification in terrestrial
urban modelling. For example, vegetation removal is commonly performed by filtering via
changes in the height differences. However, rooftop information is generally unavailable
in terrestrial or ground-based data acquisition.
In previous work on terrestrial urban modelling, the classification of raw point
clouds has been point-based. With the amount of data acquired, as mentioned before,
extracting features and labelling every single point is computationally intensive. It has
been shown in work on 2D pixel classification [26] that classifying every pixel is
redundant and this is also an issue in 3D data. However, over-segmentation techniques
used in 2D images are not applicable to 3D data: the multi-structure and varying density of
the complex data makes it challenging to group the 3D data into regions that are
homogeneous with respect to certain chosen properties. A method capable of reducing the
complex 3D data into a smaller sub-set will greatly reduce the computation required in
processing the data.
Challenge IV: Unknown number of structures and unknown location
With the data belonging to man-made surfaces extracted, classical statistical methods may
be applied to segment or model the data. However, the number of structures and locations
of the objects in the outdoor environment are often unknown. This information is essential
for the classical statistical methods or estimators to robustly segment the data. Robust
estimators such as RANSAC are essentially designed to only handle single-structure
segmentation contaminated with gross outliers. In order to handle multi-structure
segmentation, RANSAC can be applied sequentially to detect and remove the inliers of the
best fitted planes from the data set. This process is not optimal and determining when to
stop the fit-remove process is not straightforward.
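A compact sketch of the sequential fit-remove loop criticised above (plain RANSAC plane fitting; the inlier threshold, iteration count and minimum-inlier stopping rule are illustrative assumptions, and choosing that stopping rule well is exactly the difficulty noted in the text):

    import numpy as np

    def ransac_plane(pts, thresh=0.02, iters=500, seed=0):
        """Best plane n.x + d = 0 by RANSAC; returns (n, d, inlier_mask)."""
        rng = np.random.default_rng(seed)
        best_n, best_d, best_mask = None, None, None
        for _ in range(iters):
            p0, p1, p2 = pts[rng.choice(len(pts), 3, replace=False)]
            n = np.cross(p1 - p0, p2 - p0)
            if np.linalg.norm(n) < 1e-12:                 # degenerate sample
                continue
            n = n / np.linalg.norm(n)
            mask = np.abs((pts - p0) @ n) < thresh        # point-to-plane distance
            if best_mask is None or mask.sum() > best_mask.sum():
                best_n, best_d, best_mask = n, -n @ p0, mask
        return best_n, best_d, best_mask

    def sequential_ransac(pts, min_inliers=500):
        """Fit a plane, remove its inliers, repeat until planes become too small."""
        planes = []
        while len(pts) > min_inliers:
            n, d, mask = ransac_plane(pts)
            if mask is None or mask.sum() < min_inliers:  # ad-hoc stopping rule
                break
            planes.append((n, d))
            pts = pts[~mask]
        return planes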
Alternatively, clustering methods provide an approach to segmenting the data
simultaneously. Determining the number of clusters, which is the fundamental problem of
cluster validity, is one of the main problems in clustering methods. Most solutions to the unknown number of clusters require an estimate of the maximum number of clusters, which is often difficult to obtain for a large number of points.
1.2 STRUCTURE OF THE THESIS
Given the aforementioned challenges, this thesis presents a contribution to 3D Urban
Modelling with terrestrial LIDAR data and calibrated camera image information. The
outdoor LIDAR and image data are acquired with a system that consists of a Riegl LMS-
Z420i Terrestrial Laser Scanner and a calibrated high-resolution Nikon D100 digital
camera. Scans from different positions are registered in a local coordinate system with the
provided software RiScan Pro³. This dissertation is organised as follows, as illustrated in Figure 1-2:
Chapter 2 proceeds with description of the equipment used in data acquisition and
collection. This chapter concerns the aforementioned challenge regarding the
inconsistencies between the images and LIDAR data due to moving occlusions.
• The image occlusions are removed by extending the image occlusion removal
approach (with minimum number of images) proposed by Herley [13] to remove
occlusions that are not independent. The extended solution is explained in detail
and evaluated with indoor and outdoor data in the chapter.
³ RiScan Pro is the companion software for the RIEGL terrestrial scanner.
Chapter 3 – 5 describe robust and efficient 3D data classification methods to label the
outdoor LIDAR data (with complex properties as described above) into different data
classes. Chapter 3 starts with an overview and comparison of feature descriptors for the
classification of the LIDAR data, including the saliency features (as a function of the
eigenvalues of the region covariance) and the estimated surface normal vectors. The size
of the region neighbourhood used to compute the features cannot be fixed due to the
nature of the LIDAR data, as explained in Section 1.1.3.
• An issue is identified with the adaptive size of the region covariance used to compute the saliency features. We modify the saliency features to be invariant to the size of the adaptive region covariance by normalising the 1st and 2nd largest eigenvalues with the size of the region covariance. The extended saliency features
eigenvalues with the size of the region covariance. The extended saliency features
and the estimated surface normal vector are evaluated on synthetic and outdoor
terrestrial LIDAR data sets.
Chapter 4 addresses redundant computation in the existing point-based point cloud
classification approach.
• To avoid the redundant computation, we introduce a 3D over-segmentation
algorithm, namely super-voxel, and apply the algorithm to the outdoor terrestrial
LIDAR data. The algorithm, which is based on 3D scale theory, reduces the
original data to a smaller sub-set of the data by grouping the data into the super-
voxel subcomponents, depending on the level of a similarity criterion. This method
is shown to efficiently reduce the outdoor terrestrial LIDAR data for classification.
The features explained in Chapter 3 are then extracted from every super-voxel and Chapter
5 describes the learning model for the super-voxel classification.
• We propose a hierarchical graphical model, the multi-scale Conditional Random
Fields (mCRF), to learn the parameters of the extracted local and regional features.
A comparison of the classification results shows improvement in accuracy over
existing methods using 3D over-segmentation and mCRF.
Building and terrain models can then be derived from the classified data belonging to the
planar surface. In Chapter 6, the Infinite Gaussian Mixture Model (IGMM) is introduced
to robustly segment extracted data belonging to planar surfaces.
• A robust estimation method is described that successfully addresses a number of
the challenges associated with the segmentation of complex data with unknown
numbers of structures. The residual function of the Infinite Gaussian Mixture
Model (IGMM) is modified for the clustering of labelled data belonging to planar
surfaces into locally delimited planes. The proposed method is evaluated on the
labelled planar data and the robustness of the algorithm for the clustering problem
is demonstrated.
Finally, Chapter 7 concludes this dissertation by summarizing the contributions and
discussing directions for future research.
Figure 1-2 Overview of the approach to 3D Urban Modelling: data acquisition, registration and occlusion removal (Chapter 2); feature extraction (Chapter 3); data classification on super-voxels (Chapter 4); learning model and plane patch fitting (Chapter 5); robust segmentation (Chapter 6), leading to the digital surface model; vegetation data is left as further work (Chapter 7).
2.0 DATA ACQUISITION AND OCCLUSION REMOVAL
2.1 INTRODUCTION
The use of a Multisensor system – the laser scanner and camera – in the LIDAR (Light Detection and Ranging) method permits a much faster, more complete and more efficient
data acquisition process compared to photogrammetry. The laser scanner provides
geometry data, whereas the image taken provides colour information for realistic texture
mapping, which is valuable information for further point cloud analysis [27, 28].
In this chapter, the technologies involved in the creation of an urban model are
described, including the pre-processing required to prepare the data for data classification
and geometry fitting.
2.2 TECHNOLOGY
The equipment used in our research is a Multisensor system that includes a high-
performance long-range Riegl LMS-Z420i Terrestrial Laser Scanner with a wide field-of-
view and a calibrated high-resolution Nikon D100 6 megapixel digital camera with a
14mm lens firmly mounted on top of the scanner (Figure 2-1); the camera rotates with the
laser scanner when in operation. The operating range of the scanner is greater than 800 m for scanned objects with 80% reflectivity, with an accuracy of 10 mm. This single-shot accuracy can be improved to 5 mm by acquiring several scans in a sequence. The field of view is 80º x 360º and the scanner can be tilted up to 180º. The measuring rate is 8000 points/sec with a scan resolution of up to 0.004 deg. An 80º x 360º scan with a scan resolution of 0.06 deg (the resolution selected for our data acquisition) requires around 20 minutes.
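The quoted acquisition time can be sanity-checked from these figures; a back-of-the-envelope estimate (assuming one measured point per angular step, which idealises the scanner's actual timing):

    h_steps = 360 / 0.06                    # ~6000 azimuth steps
    v_steps = 80 / 0.06                     # ~1333 elevation steps per column
    points = h_steps * v_steps              # ~8.0 million points per scan
    minutes = points / 8000 / 60            # at 8000 points/sec
    print(f"{points / 1e6:.1f} M points, ~{minutes:.0f} minutes")   # ~17 minutes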
Figure 2-1 (a) Riegl LMS-Z420i with mounted Nikon D100; (b) Riegl LMS-Z420i on cart
The camera provides calibrated photographs, which in turn are automatically registered with the laser-scanned data. A software package called RiSCAN PRO, which manages the acquisition, registration and archiving tasks, is used to export the 3D point clouds and the colour information to an ASCII file for further processing. Figure
2-2 and Figure 2-3 show an example of two registered laser scans mapped with colour
information from the images using the RiSCAN PRO software. Figure 2-2 shows the front
view of the registered scans. The green and red dots indicate the two different scan
positions (locations 1 and 2). Figure 2-3i shows the birds-eye view of the registered scans
in real colour, whereas Figure 2-3ii shows the same view in simple colour (Green: location
1; Red: location 2). The RiScan Pro software carries out the registration automatically
with special targets manually placed in the environment.
The laser scanner is mounted on a cart and data collection is achieved in a stop-
and-go manner. In order to reduce the amount of time spent in data acquisition, the method explained in [29], which plans the Next Best View, can be implemented. With this method, the overlap of data points would be kept to a minimum.
Currently, the data collection is based on manually selecting the locations that best cover
the region of interest. The system can be upgraded for continuous scanning in the case of a
more complex environment.
Figure 2-2 Front view of registered LIDAR scans
Figure 2-3 Birds-eye view of registered LIDAR scans: (i) real colour; (ii) simple colour
2.3 DATA PRE-PROCESSING
The result of combining image and laser data capture is, however, prone to anomalous occlusion due to moving pedestrians and other objects. This is because there is a distinct time difference between the data acquisition from the laser scanner and the camera: the data acquisition processes for the laser scanner and the camera do not occur simultaneously⁴.
This results in inconsistency or anomalous occlusions, where a moving object does not
coincide at the same location on both the laser scan and the image.
To illustrate this problem, consider the following scenario: imagine a pedestrian
happened to be at point A in Figure 2-4 when the laser scanning was taking place. As a
result, part of the laser scan of the terrain at point A would be occluded by the pedestrian.
On completion of the laser scanning, i.e. after time t, the pedestrian moved to point B as
the camera was taking images. The image of the tree at point B would be occluded by the
pedestrian. The aforementioned anomalous occlusion can be seen when both LIDAR data
and colour images are combined, as demonstrated in Figure 2-5.
Figure 2-4 Example of a situation where scan artefacts could occur: (a) at time = 0; (b) at time = t
⁴ In our experiment, the Riegl terrestrial laser scanner scans the environment for some time (around 20 minutes for a resolution of 0.06 deg), followed by the camera taking seven images (with an overlap of 15%). For the purpose of occlusion removal, another set of laser scans and images is taken after the first one.
Figure 2-5 Scanned results of occlusion: (a) human object (3D points) mapped with grass image texture; (b) human image mapped onto tree trunk
As described in the above scenario, the LIDAR data of the pedestrian captured in
position A would be colour mapped with grass (Figure 2-5a), which is the image of
position A after some time t. On the other hand, the human image taken at time t would be
colour mapped onto the tree trunk and onto the ground at position B (Figure 2-5b). This
can be an issue for visualisation requiring accurate texturing. Apart from being
“unrealistic”, such false data collection can be a problem for further analysis. For example,
if point cloud classification is based on the colour property, the green human object may
be recognised as vegetation instead.
Figure 2-6 Example of LIDAR data showing raw infrared intensity readings: (a), (b) input LIDAR data I and II with moving occlusions; (c) occlusion-free LIDAR data
In general, two types of occlusions have to be addressed: occlusions in the LIDAR
data and occlusions in the image data. The occlusion in the LIDAR data can usually be
solved by taking more than two scans at the same scan location and taking the greatest
depth of each point, assuming each data point is not occluded more than once in the set of
LIDAR scans from the same scan location. Figure 2-6 depicts an example of moving
occlusion removal from two LIDAR data scans using the method described. Vertical
spikes in the input images are due to pedestrians moving at a speed that is higher than the
laser scanning speed.
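This LIDAR-side fix reduces to an element-wise maximum over co-registered range images; a minimal sketch (assuming the scans are stored as float depth arrays aligned pixel-for-pixel, with zero marking a dropped return — both assumptions of this illustration):

    import numpy as np

    def remove_moving_occlusions(range_scans):
        """Keep the greatest depth per ray: a pedestrian returns a shorter range
        than the background, so the maximum recovers the static scene, assuming
        each ray is occluded in at most one of the scans."""
        stack = np.stack(range_scans).astype(float)   # (n_scans, rows, cols)
        stack[stack == 0] = np.nan                    # ignore dropped returns
        return np.nanmax(stack, axis=0)

    scan1 = np.array([[10.0, 2.5], [9.0, 10.0]])      # 2.5 m: pedestrian
    scan2 = np.array([[10.0, 10.0], [2.5, 10.0]])     # pedestrian has moved
    print(remove_moving_occlusions([scan1, scan2]))   # background depths remain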
Removing occlusions in images is more difficult. This chapter presents a method
for image occlusion removal. This method works with the minimum required number of input images and is robust in detecting occlusion boundaries in the input images.
2.4 PREVIOUS WORK IN IMAGE OCCLUSION REMOVAL
Image occlusion removal can be achieved by manual intervention. For example,
Ulm [30] removed obstacles like cars or trees on terrestrial images by manual retouching
of the artefacts or occlusions. However, this is very tedious for a large collection of
images. The inconsistency caused by occlusion in the images can be otherwise removed
using automatic computer vision techniques. Removing occlusion in images for urban
modelling is somewhat analogous to background modelling [31-33]. In background
modelling, a long stream of video is taken from the same standpoint to obtain the
background model using robust statistical methods. The estimated background model can
then be employed to extract foreground objects for various purposes, including traffic
monitoring, human action recognition and object tracking.
Since the background is more likely to appear in a scene, one simple statistical
method is median filtering. Assuming two or more images contain views of the same scene at different times, an occlusion-free image can be formed with median filtering: the
final occlusion-free image is assigned to the median of the N input images from the same
viewpoint. However, the above assumption does not always hold. In outdoor
environments, any given portion of the image scenes may be occluded more than 50% of
the time. In addition, the median filter does not take any spatial continuity properties into
account. Therefore, some occlusion might only be partially removed. There are various
algorithms in background modelling that attempt to overcome the problems with the
limitation of the median or mean. Wang [34] proposed a more complicated solution that
locates all “stable sub-sequences” of pixel values in the video stream. The most “reliable”
sub-sequence is then chosen using the RANSAC algorithm. The initial background model
carries the mean value of the intensities over that sub-sequence.
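For reference, the simple median baseline discussed at the start of this section is a one-liner over a stack of same-viewpoint images (shown here only as the baseline the text argues is insufficient in busy scenes):

    import numpy as np

    def median_background(images):
        """Per-pixel median over N registered images; fails wherever a pixel
        is occluded in more than half of the inputs."""
        return np.median(np.stack(images), axis=0).astype(np.uint8)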
The difference between background modelling and the image inconsistency
problem in this thesis is that, in the application studied here, there is not a large stream of
images for every viewpoint. For our problem, an efficient method is adopted that is based
on Herley's image occlusion removal algorithm [13]. Herley shows that multiple images
(>2 images) are not always necessary in solving image occlusion. An occlusion-free image
can be formed automatically, provided that each location of the image is unoccluded in at least one input. This is based on the concept that, when an occlusion occurs, there is generally
a discontinuity around the boundary of the occlusion.
To obtain the occlusion-free image, Herley first constructed a “consensus image” that acquires the value of any two images that agree at each location, leaving “occlusion holes” elsewhere. The occlusion holes are connected closed sets of pixels which have to be filled in.
Herley’s algorithm assumes that the occlusions are all independent objects; one occlusion
hole cannot consist of occlusions from different images. Based on this assumption, each
connected set can be filled with data from a single image, and hence the problem is
simplified to determining which image is the best. This works by comparing the similarity
of the occlusion’s outer boundary in the “consensus image” with the occlusion’s inner
boundary in all the input images (details are illustrated in the next section).
However, this assumption is often violated in an outdoor environment, as for
example, during a busy period along a narrow pathway with a large number of pedestrian
overlaps. To remedy this, Herley’s approach is extended to enable the removal of
occlusion when a single occlusion boundary requires information from more than a single
image. In addition, our proposed algorithm includes the ability to detect unremoved
occlusions. In the case where complete occlusion removal is not possible (no single
unobstructed view of the background is seen in any input images), the algorithm is capable
of detecting such cases. This additional step allows shape retrieval or manual retouching to
recover the particular image. This is important, as the number of images in the acquired
urban image database is large, and it can be very time-consuming to examine all the processed images to pick out those that need further processing.
2.5 OCCLUSION REMOVAL WITH MINIMUM NUMBER OF IMAGES
Let the images I0(i,j), I1(i,j), …, IM-1(i,j) be a set of input images obtained from the calibrated camera, taken at different times from the same viewpoint. Therefore, Im(i,j) = In(i,j) ∀m, n, unless either Im or In is occluded at that location or affected by illumination changes. The steps of the occlusion removal algorithm are detailed below:
2.5.1 Occlusion Detection
Constructing consensus image
As explained previously, the occlusion removal algorithm [13] starts by constructing a
“consensus image” U, which contains pixels that are similar in Im. The “consensus image”
can be constructed from two or more images by acquiring pixel values from any two
images that have a difference less than threshold α.
$$U(i,j) = \begin{cases} I_m(i,j) & \text{if } |I_m(i,j) - I_n(i,j)| < \alpha \text{ for any } m, n \\ 0 & \text{otherwise} \end{cases} \qquad (2\text{-}1)$$

$$I'_m(i,j) = \begin{cases} U(i,j) & \text{for } U(i,j) \neq 0 \\ I_m(i,j) & \text{otherwise} \end{cases} \qquad (2\text{-}2)$$
Each pixel in any two images Im and In is compared with a threshold α, where α is a small
value to allow some matching error. If the similarity is low, the consensus image U is
assigned a pixel value of zero. Otherwise, the consensus image carries the agreed pixel value from images Im and In. Each of the Im then forms a new image I′m that matches the “consensus image” except where the “consensus image” is zero.
The visual similarity can be measured using various features such as intensity or
colour, gradient, contour, texture, or spatial layout. A popular choice for similarity is
colour due to its simplicity and robustness against scaling, rotation, partial occlusion, and
non-rigid deformation. However, the RGB colour space is sensitive to the change of
illumination. Outdoor environments cannot be controlled and the illumination condition
may vary for the set of images from the same viewpoint. The normalized RGB space was
employed:
$$I_R = \frac{R}{R+G+B}, \qquad I_G = \frac{G}{R+G+B}, \qquad I_B = \frac{B}{R+G+B} \qquad (2\text{-}3)$$
The consensus image is then constructed from any two images with colour similarity less
than α, which was set to 5 in our experiments:
$$c = (I_{mR} - I_{nR})^2 + (I_{mG} - I_{nG})^2 + (I_{mB} - I_{nB})^2 \qquad (2\text{-}4)$$
Figure 2-7a and Figure 2-7b show an example of the input images from the same
viewpoint in RGB space. Both images were converted into the normalised RGB space, as
shown in Figure 2-7c and Figure 2-7d. The resulting “consensus image” based on the
visual similarity of the normalised RGB of the input images is shown in Figure 2-7e. The
black regions in the consensus image are the estimated locations of the occlusion.
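The construction in equations (2-1)–(2-4) can be sketched directly in numpy (scaling the chromaticities to 0–255 is an assumption made here so that the threshold α = 5 has the magnitude used in the text):

    import numpy as np

    def normalised_rgb(img):
        """Eq. (2-3): chromaticity per pixel, scaled to 0..255."""
        s = img.sum(axis=2, keepdims=True).clip(min=1)
        return 255.0 * img / s

    def consensus(images, alpha=5.0):
        """Eqs. (2-1)-(2-4): keep pixels where any two views agree in
        normalised RGB; zero-valued pixels form the occlusion holes S_p."""
        norm = [normalised_rgb(im.astype(np.float64)) for im in images]
        out = np.zeros_like(norm[0])
        for m in range(len(norm)):
            for n in range(m + 1, len(norm)):
                c = np.sum((norm[m] - norm[n]) ** 2, axis=2)   # eq. (2-4)
                agree = c < alpha
                out[agree] = norm[m][agree]    # value where the two views agree
        return out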
Figure 2-7 Construction of the consensus image: (a), (b) image sequence with occlusions; (c), (d) input images in normalised RGB space; (e) consensus image; (f) filtered consensus image
Discarding consensus image noise
The consensus image at this stage may contain a large number of occlusion ‘holes’, i.e.,
pixels with zero RGB value, that are caused by image noise. In an outdoor environment,
moving trees and bushes (due to wind⁵) can cause mismatches in the input images, thereby
generating a large number of small holes in the consensus image. Eliminating these
relatively small occlusion ‘holes’ at this stage frees computation time for the more complex processing that follows.
⁵ Occlusion can be divided into static and moving occlusions in general. Vegetation may be classified as a type of static occlusion in some building modelling applications. In our application, vegetation is identified as part of the terrestrial urban model and would not be removed from the image.
A morphological filter is employed to fill in the holes that are relatively small (less than 0.01% of the total pixels). It is important to select an image filter that does not change the position of the occlusion boundary (as erosion or dilation would), since the accuracy of the true occlusion boundary has a great effect on the occlusion removal. Figure 2-7f shows the result of filtering the original “consensus image” in Figure 2-7e.
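A sketch of this noise-suppression step using connected-component labelling from scipy.ndimage (an assumed dependency); components below the area threshold are filled from one input image, so the boundaries of the remaining, genuine occlusion holes are left untouched:

    import numpy as np
    from scipy import ndimage   # assumed available

    def suppress_noise_holes(consensus_img, fallback, frac=1e-4):
        """Fill holes smaller than frac of the image area (0.01%) from a
        fallback image such as I_1; larger holes await boundary analysis."""
        hole = np.all(consensus_img == 0, axis=2)
        labels, n = ndimage.label(hole)
        sizes = ndimage.sum(hole, labels, index=np.arange(1, n + 1))
        small_ids = 1 + np.flatnonzero(sizes < frac * hole.size)
        small = np.isin(labels, small_ids)
        out = consensus_img.copy()
        out[small] = fallback[small]
        return out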
2.5.2 Occlusion Removal
Forming closed connected set in consensus image
Each occlusion which appears as a connected hole in the consensus image is grouped
together as Sp, p=1,2,…,P, where P is the number of occlusion regions in the consensus
image. For instance, in Figure 2-7f, the consensus image has three ‘holes’ – P=3 and
Figure 2-9 shows an example of grouped connected zero pixels. Only holes greater than 5% of the largest hole are considered. In order to fill in the holes, the internal and external
boundaries of the hole in the input images are compared, to identify which of the input
images has the non-occluded data.
Figure 2-8 Definition of external boundary
Figure 2-9 Grouping of occlusion holes in the consensus image
The internal and external boundaries of the holes are defined as follows: For each
set of Sp, the internal boundary of each occlusion, Bmp, m=1,2,…M, where M is the number
of input images, is defined as the set of pixels in the zero-connected region that has at least
one neighbouring pixel in the background image. Therefore, over all the input images and occlusion sets, there will be M×P internal boundaries. The green outlines in Figure 2-10a-f are
examples of the internal boundaries of the holes in the “consensus image”. For each set of
Sp, the external boundary Ep is defined as the set of pixels which is not in Sp and has at
least one neighbour in Sp. The values of the external boundary are taken from the
consensus image.
Figure 2-10 The occlusion boundaries: (a) B11; (b) B21; (c) B12; (d) B22; (e) B13; (f) B23
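The two boundary sets defined above can be read off with binary morphology; a sketch under the same definitions (scipy.ndimage assumed; hole is the boolean mask of one connected set Sp):

    import numpy as np
    from scipy import ndimage   # assumed available

    EIGHT = np.ones((3, 3), dtype=bool)    # 8-connected structuring element

    def boundaries(hole):
        """Internal boundary: pixels of S_p with a neighbour outside the hole.
        External boundary: pixels outside S_p with a neighbour inside it."""
        internal = hole & ~ndimage.binary_erosion(hole, structure=EIGHT)
        external = ndimage.binary_dilation(hole, structure=EIGHT) & ~hole
        return internal, external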
The discontinuity across the boundaries for Sp can then be computed. Note that the
external boundary in general is larger than the internal boundary. For example, in Figure
2-8, pixels in black represent occluded data: the external boundary pixels are labelled
“u,t,w,x,y,z” and the internal boundary “a,b,c” for the purpose of illustration. The number
of pixels in the external boundary is matched to the internal boundary as a requirement for
the discontinuity measure: for every pixel in the internal boundary (for example, pixel “a” in Figure 2-8), the median of its 8-connected pixels⁶ that lie in the external boundary (for example, “u, t, w”) is computed as the “matched” external
boundary pixel. For instance, in Figure 2-8, the corresponding new external boundary for
the internal boundary },,{ cba is )},(),,,,(),,,({ zymedianzyxwmedianwtumedian .
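The boundary construction can be sketched as follows. This is a minimal Python example for a single-channel image (the thesis operates on RGB); the function name and its arguments are hypothetical, and the internal/external boundaries are derived with standard morphological erosion and dilation:

```python
import numpy as np
from scipy import ndimage

EIGHT = ndimage.generate_binary_structure(2, 2)  # 8-connectivity

def matched_boundaries(hole_mask, image_m, consensus):
    """Return paired internal/external boundary values for one hole S_p.

    Internal boundary values are read from input image I_m; each internal
    pixel is paired with the median of its 8-connected neighbours that lie
    on the external boundary (values taken from the consensus image).
    """
    internal = hole_mask & ~ndimage.binary_erosion(hole_mask, structure=EIGHT)
    external = ndimage.binary_dilation(hole_mask, structure=EIGHT) & ~hole_mask

    b_vals, e_vals = [], []
    h, w = hole_mask.shape
    for y, x in np.argwhere(internal):
        # 8-connected neighbours of (y, x) that are external-boundary pixels
        nb = [(yy, xx) for yy in range(y - 1, y + 2)
                       for xx in range(x - 1, x + 2)
              if (yy, xx) != (y, x) and 0 <= yy < h and 0 <= xx < w
              and external[yy, xx]]
        if nb:
            b_vals.append(image_m[y, x])
            e_vals.append(np.median([consensus[yy, xx] for yy, xx in nb]))
    return np.array(b_vals), np.array(e_vals)
```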
Figure 2-11 (a) Plot of the D3 function; (b) “Boundary separation” of S3; New boundaries (c) B14 (d)
B24 (e) B15 (f) B25
Testing each set of Sp for occlusion overlap: “Boundary Separation”
As explained in Section 2.5, a large difference in the discontinuity measure at the
boundary is most likely to indicate an occlusion. The discontinuity measure, dmp, is the
absolute difference between the internal boundary and the matched external boundary of Sp in image Im:
6 8-connected pixels are neighbours to every pixel that touches one of their edges or corners
$$ d_{mp} = \left| B_{mp} - E_{p} \right| \qquad ( 2\text{-}5 ) $$

$$ L_{mp} = \sum_{i=1}^{k} d_{mp}(i) \qquad ( 2\text{-}6 ) $$
where k is the number of pixels in the boundary.
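Continuing the sketch above, Eq. 2-5 and Eq. 2-6 reduce to a few lines, assuming the matched internal/external boundary values b_vals and e_vals from the hypothetical helper:

```python
import numpy as np

def discontinuity(b_vals, e_vals):
    """Per-pixel discontinuity d_mp (Eq. 2-5) and its sum L_mp (Eq. 2-6)."""
    d_mp = np.abs(b_vals.astype(float) - e_vals.astype(float))
    return d_mp, float(d_mp.sum())
```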
Herley [13] proposed the sum of the discontinuity measure, Lmp, as an indicator that
reveals the level of discontinuity over the boundary of Sp in image Im. He proposed that,
for every hole Sp, the data from the Im with the smallest Lmp be used to fill the hole. A small Lmp
indicates that data from Im are sufficient to cover the hole Sp. In the case where a
hole must be patched partly from Im and partly from In, i.e., the occlusions are not
independent objects, the discontinuity indicator Lmp would be large for all Im.
We extend Herley’s algorithm by providing a solution to allow each hole in the
consensus image to be filled with data from one or more of the Im. For a set Sp in U, if Im is
occluded over part of Sp (subset I), and not occluded over the rest (subset II); In is
occluded over part of Sp (subset II7), and not occluded over the rest (subset I); the
proposed algorithm will be able to fill subset I from In and subset II from Im. For example,
in Figure 2-10, S1 would be filled from I1 and S2 can be filled from either I1 or I2. The set
S3, however, would have to be filled partly from I1 and partly from I2. If subsets I and II intersect, i.e., part of
Sp is occluded in both Im and In8, manual image restoration would be required to remove
the occluded data. The discontinuity measure can be compared with a threshold to identify
any Sp that is not completely resolved.
To break a set Sp into subsets, a process called “boundary separation”, so that the
subsets of Sp can be filled from both Im and In, an occlusion-overlap measure Dp9 is first defined:
$$ D_{p} = d_{np} - d_{mp} \qquad ( 2\text{-}7 ) $$
7 Assuming subset I and II do not intersect
8 The common occluded area of subset I and subset II should not be identical in both Im and In, since, if this were the case, the area would not be in the set Sp.
9 Although the occlusion-overlap measure Dp can be defined for more than two images, the simplest way is to process two images at a time. The resulting image is then compared with the third image, and so forth.
Consider that, if a set Sp can be filled solely from Im, the occlusion-overlap
measure Dp would always be positive, as the discontinuity measure dnp would always be
larger than dmp. For example, in Figure 2-10, d11 will be consistently small and d21
consistently large. If Sp has to be patched partly from Im and partly from In,
Dp would be positive over the portion of the boundary that should be patched from Im, and
negative over the portion that should be patched from In. Hence the problem
of deciding whether a set Sp can be filled solely from Im is simplified to identifying whether there exists
a zero-crossing in the occlusion-overlap measure Dp. The location of the zero-crossing in
Dp indicates where to break the set Sp into sub-sets. For instance, in Figure 2-11a, D3
= d23 − d13 is plotted. The negative regions indicate filling from I2 and the positive regions from
I1. The new boundary (shown in Figure 2-11b) between the two sub-sets is formed by
connecting the boundary locations of the zero-crossings in D3 (labelled as A and B). Set S3
is therefore divided to form two new sub-sets: S4 and S5.
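A minimal sketch of the zero-crossing test, assuming the discontinuity measures d_mp and d_np have already been sampled in order along the hole boundary (the contour-ordering step itself is omitted):

```python
import numpy as np

def boundary_separation(d_np, d_mp):
    """Split a hole's boundary where D_p = d_np - d_mp changes sign (Eq. 2-7).

    Returns the zero-crossing positions along the (ordered) boundary and a
    per-pixel choice: fill from I_m where D_p > 0, from I_n where D_p < 0.
    """
    D = d_np - d_mp
    crossings = np.nonzero(np.diff(np.sign(D)) != 0)[0]  # locations A, B, ...
    fill_from_m = D > 0
    return crossings, fill_from_m
```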
Filling up consensus image zero regions
The image with the minimum Lmp, as defined in Eq. 2.6, is then selected to cover Sp.
Consider the example in Figure 2-10a,b: L11 = 16 and L21 = 49. Therefore, S1 will be filled
from I1. The final result for occlusion removal in the example is shown in the next section.
Static occlusion, for example vegetation or parked cars in an outdoor environment,
would not be removed. These objects are considered consistent occlusions, i.e., they also
exist in the LIDAR point cloud. Therefore, removal of these objects in the image is
unnecessary. Static occlusions can be removed via classification of the LIDAR data,
which will be detailed in the following chapters.
2.6 EXPERIMENTAL VALIDATION
A selection of our results is shown in Figure 2-13 (indoors), Figure 2-14 (outdoors)
and Figure 2-15 (outdoors). The proposed method is also evaluated on two sets of seven
images, taken by a calibrated Nikon D100 6-megapixel digital camera with a 14mm lens,
which together cover a field of view of up to 360°. The calibrated images with occlusions removed
are then stitched together (shown in Figure 2-16 and Figure 2-17) before being mapped
onto the LIDAR point cloud.
2.6.1 Indoor data set
Our proposed method was evaluated on a set of indoor data as shown in Figure 2-12a,b.
The results are shown in Figure 2-13a-c. Without “boundary separation”, S3 would
have to be filled from a single image with the lowest Lmp, which happened to be I1 (as
shown in Figure 2-13a). With “boundary separation”, part of the result image is still
occluded (as shown in Figure 2-13b); this is because the occluded portion is never seen in
any of the input images. The algorithm is capable of detecting this issue and prompts the user for
manual retouching or shape retrieval [35].
Figure 2-12 (a), (b) Indoor image sequence with occlusion
Figure 2-13 Results of artefact removal for input images in Figure 2-12 (a) Without “boundary
separation” (b) With “boundary separation” (c) After manual retouch
2.6.2 Outdoor data set
Outdoor data set I
The proposed occlusion removal method with “boundary separation” is compared against
Herley’s method in the outdoor data set shown in Figure 2-14a. Part of the occlusion in
both images overlapped, as shown in the filtered consensus image in Figure 2-14b. The
result of occlusion removal without “boundary separation” still includes occlusion (the full
figure of a person is not removed), despite that region being unoccluded in one
of the input images. With “boundary separation”, all occlusions are removed automatically,
as the part where the occlusions overlap is relatively small.
Figure 2-14 Input images with occlusions and resulting images: (a) Input image sequence; (b) Consensus image; (c) Without “boundary separation”; (d) With “boundary separation”
Outdoor data set II
The proposed occlusion-removal method was evaluated on an outdoor data set in a busy
environment, as shown in Figure 2-15a. The consensus image is shown in Figure 2-15b
and the final “occlusion-free” image is shown in Figure 2-15c. Except for parts of the
image that are occluded in both inputs, most of the large occlusions in the input sequence
are removed. Some of the occlusions (far from the camera) that are not removed are
relatively small and insignificant.
Figure 2-15 Input images with occlusions and result of implementation II: (a) Input image sequence; (b) Consensus image; (c) Result image
2.6.3 Panoramic Data Set
Panoramic Data Set I
The images in Figure 2-16a,b are original input image sets that are stitched into panoramic
images. Figure 2-16c is the occlusion-free panoramic image. The original images were
individually processed for occlusion removal before being stitched into the panoramic
image. A simple linear image blending technique was used, where the adjoining areas of
the component images were matched for colour, brightness and contrast, in order to
correct the different exposures in different images.
Note that blending is not performed on the pixels belonging to the sky. To explain
this, we note first that the range of intensity levels that can be recorded by a sensor without
clipping or saturation is often referred to as the dynamic range. The JPEG images stored
by a digital camera provide only 8-bits of intensity information, which is usually
insufficient to capture the entire dynamic range for real outdoor scenes containing bright
and dark areas. When images of different exposures are stitched together to form a
panoramic image, a higher dynamic range, i.e., greater than 8-bits, is required for image
blending. However, blending of the sky is unnecessary for the proposed application in this
thesis. This is because the sky is not picked up by the LIDAR. Figure 2-16c depicts the
result of the combination of the occlusion-free image and the LIDAR data.
Panoramic Data Set II
Similar to the “Panoramic Data Set I”, Figure 2-17a and b depict the panoramic
view of the images acquired from the calibrated Nikon D100 digital camera. Occlusions in
the input images were removed. The stitched and blended panoramic view is shown in
Figure 2-17c. As mentioned in Section 2.5, analogous to vegetation (as a type of static
occlusion in building modelling), the parked cars are not removed. The result of the
combination of the occlusion-free image and the LIDAR data is shown in Figure 2-17d.
Figure 2-16: (a),(b) Input images; (c) Resulting image after occlusion removal and colour blending; (d)
Combination of the occlusion-free image and the LIDAR data
Figure 2-17: (a),(b) Input images; (c) Resulting image after occlusion removal and colour blending; (d)
Combination of the occlusion-free image and the LIDAR data
2.7 CONCLUSION
In this chapter, the problem of occlusion inconsistency between the acquired LIDAR
data and the image data was identified. The importance of pre-processing of the data to
remove the moving occlusion (that causes the inconsistency) in both range and image data
was shown. The proposed occlusion removal approaches for both data types were
explained and evaluated on the acquired urban data set.
Occlusion removal in LIDAR data can generally be solved by taking more than
two scans in the same scan location and taking the greatest depth of each point. For image
data, the algorithm for image occlusion removal using a minimum number of images detailed in [13]
was extended to include the capability of detecting and removing occlusions in input
images that overlap. The extended algorithm was tested on outdoor images that included
moving occlusions such as pedestrians. The algorithm is capable of removing most of the
occlusions that cause the inconsistency. The occlusions that are not removed are in general
due to the lack of an unobstructed view of the background and are relatively small and
insignificant. Static occlusions, for example vegetation and cars parked in one spot over
the entire duration of image capture and laser scan, can be removed by processing the
LIDAR data, i.e. classifying the LIDAR into different classes and removing the particular
objects. This will be explained in the rest of the thesis.
3.0 FEATURE DESCRIPTORS FOR 3D CLASSIFICATION
3.1 INTRODUCTION
This chapter describes the extraction of features for the classification of outdoor-scanned
LIDAR data. In order to reconstruct 3D city models from the acquired LIDAR data, first
the raw data have to be divided into different classes. A low-level classification would be
to divide the data into “planar” (for example, façades of man-made buildings, pathways
and terrain), “linear” (for example, cylindrical poles, tree trunks) and “cluttered” (for
example, vegetation, trees) data types. As discussed in Chapter 2, one of the most
important tasks in the classification process is to extract the right features, i.e. features that
will capture the relevant relationships among observations.
The term ‘features’ means variables extracted from the raw data that are
appropriate and distinctive for correct classification with low probability of mismatch. A
feature descriptor can also be seen as a special form of data and dimensionality reduction.
Generally, feature extraction is divided into feature construction and feature selection.
Feature construction can be seen as the process of obtaining new variables by a linear or
non-linear transformation of the original raw data. Feature selection is used to determine if
the extracted features are sufficient to identify the different classes and to eliminate
redundant features [36].
We focus on feature construction in this chapter because feature selection is
usually only necessary when there is a relatively large number of features and relatively
few (training) data. In such a case, many features are often redundant or have such a high
dimensionality that it is difficult to determine which features are appropriate for which
classes. In contrast, we have a relatively large amount of training data and fewer features.
Thus, for our classification problem, we concentrate on the construction of distinctive
feature descriptors for outdoor LIDAR data.
Existing features used for 3D data classification include intensity [23], height [21-
23, 37], surface curvature [21], spin image [37-39], shape distribution [40, 41], local
tensors [42], shape maps [43], 3D active contour [44], normals [22] and colour [45]. These
features are often combined together, or treated independently as feature descriptors, for
urban classification. For example, in order to identify the building points, Anguelov et al.
[37] first use height to filter out most of the terrain points. The authors then compute the
minimum eigenvalue of the region covariance (of the spatial coordinates of 100 sampled
points in a cube of radius 0.5 meters) to identify the principal plane location. The cube is
then partitioned into 3x3x3 bins around the point, oriented with respect to the principal
plane. The percentage of points lying in various sub-cubes provides the information on the
local distribution. The location of the planes in the data can then be identified. In order to
identify linear objects such as trees, Anguelov et al. [37] compute a cylinder of radius 0.25
meters which extends vertically to include all the points in a “column”. The percentage of
the points that lie in various segments of the column (e.g., between 2m and 2.5m) is then
computed as one of the feature descriptors. Another feature, an indicator of whether a
point lies within 2m of the ground, is used to identify the bushes. Most of these features
rely on fixed thresholds and therefore require readjustment for different data types or
different scanning resolutions.
In another example, Triebel et al. [46] first divide the data into walls using a plane
extraction algorithm. Several methods exist to robustly extract planes from outdoor data
which are explained in detail in Chapter 8. The authors then use three types of feature
descriptor to distinguish ‘window’, ‘wall’ and ‘gutter’: 1) the cosine of the angles between
the local normal vectors near each point and the plane normal vector; 2) the distribution of
neighboring points in front of and behind the extracted planes and 3) the normalised
height of each point.
In general, two of the most common geometric features are region covariance and
estimated normals. The covariance matrix extracted from a region is usually sufficient as a
region descriptor to match the region under different poses, rotations and scales [47].
region covariance does not have any information regarding the ordering and the number of
points, which means that it has a certain scale and rotation invariance over the regions. For
the purpose of discriminating data belonging to the flat terrain from data belonging to the
building surface, information on the direction of the estimated surface normal vector is
useful. In the next sections, we focus on the region covariance and estimated normals. The
features are computed and evaluated on both synthetic and outdoor LIDAR data.
3.2 REGION COVARIANCE AS 3D FEATURE DESCRIPTOR
One of the popular features for 3D data classification is derived from the estimated region
covariance matrix [47]. A region covariance matrix is a matrix of covariances between
elements of a number of neighbouring data. For example, we generated 3D data of a plane
with Gaussian noise of standard deviation of 0.05, as shown in Figure 3-1. This has the
following covariance matrix:
$$
\begin{bmatrix}
833.505 & 0.325 & -1.8765 \\
0.325 & 833.495 & -2.8041 \\
-1.8765 & -2.8041 & 0.2429
\end{bmatrix}
$$
Figure 3-1 Plane data with Gaussian noise of standard deviation = 0.05
By performing principal eigenvalue decomposition of a region covariance matrix,
also known as Principal Component Analysis (PCA), the data is transformed to a new
coordinate system. If a 3-dimensional data set is given, the greatest variance by any
projection of the data comes to lie on the first coordinate, the second greatest variance on
the second coordinate and the third greatest variance on the third coordinate. The
variances describe the shape and the coordinates describe the orientation of the data. We
can use Singular Value Decomposition (SVD)10 to perform PCA [48].
For example, the data in Figure 3-1 have eigenvalues of 0.0025, 833.3 and 833.9.
The smallest eigenvalue, 0.0025, which is the estimated variance along the least
dominant direction of the data, matches the variance of the generated plane data11. This
smallest eigenvalue of the estimated region covariance is often computed to determine the
planarity of the data [49, 50]. For instance, Stamos and Allen [50] classify each point into
planar and non-planar data by thresholding the smallest eigenvalue of the
covariance matrix of the k neighbouring data. In practice, the smallest eigenvalue for a planar
data set may range from a small value (for a relatively smooth building surface) to a larger
value (for a rougher building surface). Classification using solely the smallest eigenvalue
can be difficult, as a small cluttered data set may be confused with a rough plane.
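The example is easy to reproduce. In the following numpy sketch, the ±50 uniform spread is an assumption chosen because its variance, 100²/12 ≈ 833, is of the same order as the eigenvalues quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic plane in the x-y plane: large spread in x and y, Gaussian
# noise of standard deviation 0.05 in z (cf. Figure 3-1).
n = 1000
pts = np.column_stack([rng.uniform(-50, 50, n),
                       rng.uniform(-50, 50, n),
                       rng.normal(0.0, 0.05, n)])

cov = np.cov(pts, rowvar=False)        # 3x3 region covariance matrix
eigvals = np.linalg.eigvalsh(cov)      # ascending order
print(eigvals)  # smallest eigenvalue is close to 0.05**2 = 0.0025
```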
10 Details of the SVD algorithm can be found in Appendix II
11 variance = (standard deviation)² = (0.05)² = 0.0025
To minimise the dependency of the features on the smallest eigenvalues, Lalonde
et al. [51] derived a saliency feature descriptor using the relationship between the
eigenvalues of the region covariance, instead of solely the smallest eigenvalue. Let
λ1 > λ2 > λ3 be the eigenvalues of the covariance matrix of the k nearest neighbours. In the case
of clutter, λ1 ≈ λ2 ≈ λ3 and there is no dominant direction. For points on surfaces (where
λ1, λ2 >> λ3) and for linear structures (where λ1 >> λ2, λ3), the saliency features are
evaluated using Eq. 3.1:
$$
\begin{bmatrix} \text{clutter-ness} \\ \text{surface-ness} \\ \text{curve-ness} \end{bmatrix}
=
\begin{bmatrix} \lambda_{1} \\ \lambda_{1} - \lambda_{2} \\ \lambda_{2} - \lambda_{3} \end{bmatrix}
\qquad ( 3\text{-}1 )
$$
3.2.1 Extension to Saliency Features
The described saliency features have a disadvantage. To explain it, we first note
that the region covariance of a point p, as described in the previous section, may be defined
over i) the k neighbouring points, or ii) all points within a defined distance (which may be
determined adaptively for different data points) from point p. In both cases, the size of the
region covariance, defined by the distance of the neighbouring point furthest away from p,
may vary considerably. As a result of this size variation, the measures of ‘surface-ness’ and
‘curve-ness’ are inconsistent and can spread over a large range. To eliminate the effect of
size changes, we normalised the 1st and 2nd largest eigenvalues with the size of the region
covariance, r. The new saliency features are as follows:
$$
\begin{bmatrix} \text{clutter-ness} \\ \text{surface-ness} \\ \text{curve-ness} \end{bmatrix}
=
\begin{bmatrix} \log(\lambda_{1}) \\ \log\!\big((\lambda_{1} - \lambda_{2})/r\big) \\ \log\!\big(\lambda_{2}/r - \lambda_{3}/r\big) \end{bmatrix}
\qquad ( 3\text{-}2 )
$$
After normalisation, the eigenvalues become invariant to the size of the region
covariance, especially for planar-like and linear-like data. The features are more
distinguishable, as depicted in the experiment with synthetically-generated data (Figure
3-7). For point-like data, the region covariance is relatively small and the eigenvalues are
similar in size; therefore, normalisation of the 3rd largest eigenvalue is unnecessary.
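A compact sketch of both feature vectors, assuming the neighbourhood is given as a (k, 3) array and taking r as the distance from the centroid to the furthest neighbour (one plausible definition of the region size):

```python
import numpy as np

def saliency_features(neighbours, normalise=False, eps=1e-12):
    """Saliency features of Eq. 3-1 and their size-normalised form (Eq. 3-2).

    `neighbours` is a (k, 3) array of points around the point of interest.
    """
    centred = neighbours - neighbours.mean(axis=0)
    cov = centred.T @ centred / len(neighbours)
    l1, l2, l3 = np.sort(np.linalg.eigvalsh(cov))[::-1]   # λ1 ≥ λ2 ≥ λ3
    if not normalise:
        return np.array([l1, l1 - l2, l2 - l3])           # Eq. 3-1
    r = np.linalg.norm(centred, axis=1).max()             # region size
    return np.log(np.array([l1,                           # Eq. 3-2
                            (l1 - l2) / r,
                            l2 / r - l3 / r]) + eps)
```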
3.2.2 Validation of the Extended Saliency Features
In this section, the extended features defined in Eq. 3.2 are evaluated using synthetically
generated data and outdoor LIDAR data acquired from a Riegl terrestrial laser scanner.
3.2.2.1 Synthetic Data
The extended features are evaluated using a set of synthetic data with 600 planar
data and 600 cluttered data, as shown in Figure 3-2. The planar data are corrupted by
Gaussian noise of standard deviation 0.05 and the cluttered data by Gaussian noise with variance
of 3. The size of the region covariance is adaptively determined12. The eigenvalues of the
region covariance, the saliency features and the normalised saliency features are computed
for every point, as depicted in Figure 3-3 to Figure 3-8:
12 The size of the region covariance is determined using an extended 3D scale theory approach, explained in
detail in Chapter 4 - 3D Over-segmentation. The optimal size of the region covariance is iteratively
estimated with equations that depend on the estimated curvature, density, noise and the colours of the points.
Figure 3-2 Plane data with Gaussian noise of standard deviation = 0.05 with cluttered data
Figure 3-3 Eigenvalues for Planar Data
Figure 3-4 Eigenvalues for Cluttered Data
As shown in Figure 3-3, the two largest eigenvalues for planar data are relatively
large compared to the smallest eigenvalue. The three eigenvalues are similar for cluttered
data, as shown in Figure 3-4.
Figure 3-5 Saliency Features for Planar Data
Figure 3-6 Saliency Features for Cluttered Data
Figure 3-7 Normalised Saliency Features for Planar Data
Figure 3-8 Normalised Saliency Features for Cluttered Data
As shown in Figure 3-5, most of the saliency values of the planes features are
greater than the saliency values of the lines and points features for planar data. Also, most
of the saliency values of the points features are greater than the saliency values of the lines
and planes features for cluttered data (Figure 3-6). The normalisation of the saliency
values (Figure 3-7 and Figure 3-8) successfully discriminates the points data – all saliency
values of the points features are greater than the saliency values of the planes and lines
features. The distinctive saliency features are then exploited to geometrically classify the
data with a learning model.
3.2.2.2 Outdoor LIDAR Data
We also validated the extended method with outdoor LIDAR data of 3648 points
shown in Figure 3-9 in the following experiment. The outdoor LIDAR data consist of a
vertical plane (building wall, shown in blue); a horizontal plane (pathway and grass
terrain, shown in green) and clutter (tree, shown in red).
Figure 3-9 Outdoor LIDAR data
Figure 3-10 Eigenvalues for Planar Data
Figure 3-11 Eigenvalues for Cluttered Data
As shown in Figure 3-10, the two largest eigenvalues for planar data are relatively
large compared to the smallest eigenvalue. The three eigenvalues are similar for cluttered
data, as shown in Figure 3-11.
Figure 3-12 Saliency Features for Planar Data
Figure 3-13 Saliency Features for Cluttered Data
Figure 3-14 Normalised Saliency Features for Planar Data
Figure 3-15 Normalised Saliency Features for Cluttered Data
3.3 ESTIMATED NORMALS AS 3D FEATURE DESCRIPTOR
In addition to region covariance, as explained in Section 3.1, the estimated surface normal
is useful to discriminate data belonging to the flat terrain from data belonging to the
building surface. This section explains the general methods to estimate surface normal
vectors from 3D data (in Sections 3.3.1 and 3.3.2). The explained methods are then
evaluated and compared for the synthetic and outdoor LIDAR data.
A normal to a surface at a point is the normal to the estimated tangent plane to that
surface at that point. The surface normal is useful in distinguishing some data classes and
can also be a good representation of texture. There are several ways to estimate the tangent
plane. One method is via Delaunay Triangulation (DT) or its dual Voronoi Diagram, as
shown in Figure 3-16 below.
3.3.1 Delaunay/Voronoi Method
The Delaunay Triangulation is a set of triangles that connect each data point to its
neighbours, with the property that for each triangle, the unique circle circumscribed about
the triangle contains no data point. The normal vector for the data point can be calculated
as the weighted average of the normal vectors of the triangles formed by each data point
and pairs of its neighbours. There are numerous variations of the weighted average,
including angle-weighted, area-weighted, centroid-weighted and gravitational-weighted
methods. Comparisons of these methods can be found in [52] and [53].
Figure 3-16 Delaunay Triangulation and its dual Voronoi Diagram
Related to the Delaunay triangulation, the Voronoi diagram is a closest-point
plotting technique which consists of a set of Voronoi polygons. A Voronoi polygon for
point p encloses all the intermediate points that are closer to point p than to any other point
in the set of coplanar points, as shown in Figure 3-16. The centres of the circumscribed spheres of the
Delaunay triangles, known as the Delaunay balls, are the vertices of the Voronoi
diagram. In noise-free data, the normal vector for a data point can be approximated by
its pole [54]. The pole is defined as the line through each data point and its furthest
Voronoi vertex. The poles are also the centres of the largest Delaunay balls incident to the
point p on both sides of the sampled surface; these largest Delaunay balls are also known
as the polar balls.
For noisy data, Dey et al. [55] extended the Delaunay Balls algorithm by
introducing Big Delaunay balls (BDB), i.e. only Delaunay balls incident on point p larger
than a threshold are used to estimate the normals. The algorithm starts by computing the
Delaunay Triangulation for the data points P. For each point p, the average distance to the
k nearest neighbor, λp, is computed. Next, the Delaunay ball incident to the point p with
radius greater than cλp is marked as BDB, where c is a user-defined parameter. The
normal vector for point p can then be estimated as the line through p and its pole. Note
that for points where none of the incident Delaunay balls is marked as a BDB, no normal
can be estimated. One solution is to interpolate the normals for these points
from the neighbouring normals. The sensitivity of the algorithm to noise relies on the
values of k and c. Dey et al. fixed k at five, i.e., λp is the average distance of point p to its
five nearest neighbours. The value of c is then determined empirically by finding the c
that minimises the error between a set of reference normals (computed from clean data)
and the normals estimated from the data with noise added.
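The BDB construction can be sketched as follows, assuming scipy’s Delaunay triangulation and ignoring degenerate tetrahedra; the O(n·m) incidence scan is kept deliberately simple:

```python
import numpy as np
from scipy.spatial import Delaunay, cKDTree

def bdb_normals(points, k=5, c=2.5):
    """Big-Delaunay-ball normal estimation, following Dey et al. in outline.

    For each point p, Delaunay balls incident on p with radius > c * λ_p
    (λ_p = average distance to the k nearest neighbours) are kept; the
    normal is the direction from p to its pole, the centre of the kept
    ball furthest from p. Points with no big ball get a NaN normal.
    """
    tri = Delaunay(points)
    # Circumcentres of the tetrahedra: requiring |c - v_i|^2 equal for all
    # vertices gives the 3x3 linear system 2(v_i - v_0) c = |v_i|^2 - |v_0|^2.
    centres = np.empty((len(tri.simplices), 3))
    radii = np.empty(len(tri.simplices))
    for i, s in enumerate(tri.simplices):
        v = points[s]
        A = 2.0 * (v[1:] - v[0])
        b = np.sum(v[1:] ** 2 - v[0] ** 2, axis=1)
        centres[i] = np.linalg.solve(A, b)
        radii[i] = np.linalg.norm(centres[i] - v[0])

    lam = cKDTree(points).query(points, k=k + 1)[0][:, 1:].mean(axis=1)

    normals = np.full_like(points, np.nan, dtype=float)
    for i, p in enumerate(points):
        incident = np.nonzero((tri.simplices == i).any(axis=1))[0]
        big = incident[radii[incident] > c * lam[i]]
        if len(big):
            pole = centres[big[np.argmax(np.linalg.norm(centres[big] - p, axis=1))]]
            normals[i] = (pole - p) / np.linalg.norm(pole - p)
    return normals
```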
3.3.2 Numerical Optimization Methods
Another approach for surface normal estimation is via numerical optimization methods.
One of the most commonly applied numerical optimization methods is least-squares
plane fitting, proposed by Hoppe et al. [56] in 1992. The method estimates
tangent planes of the point cloud data by fitting local planes with minimum fitting error to
the data. There are two kinds of fitting error. In traditional least squares, or the
regression plane, the x and y values are fixed and the fitting error is only in the z (vertical)
direction. A variation on the traditional least squares, the total least squares (TLS) or
orthogonal distance regression plane minimises the perpendicular distances to the plane
(as shown in Figure 3-17a), i.e. there is fitting error in all three coordinates. The traditional
least-squares fitting, which minimises the vertical deviations [yi − f(xi, a1, a2, a3, a4)] (as
shown in Figure 3-17b), where (a1, a2, a3, a4) are the plane coefficients, does not minimise
the actual deviation.
Figure 3-17 Fitting Error of (a) Total least squares (TLS) or the orthogonal distance regression (b)
Least squares
In total least squares, for every point p, the algorithm finds the best-fit local plane
n^T x = c that minimises the cost function e(n, c) under the constraint n^T n = 1. The classical
cost function is the sum of squares of the fitting errors for point p and its k nearest points, as
depicted in Eq. 3.3. The estimated surface normal n is then the unit vector perpendicular to
the fitted tangent plane. Since the cost function e can be stated as a linear problem in matrix-
vector notation, the minimiser can be expressed directly as the result of an SVD (details of
the SVD algorithm are given in Appendix II).
$$ e(\mathbf{n}, c) = \sum_{i=1}^{k} \left( \mathbf{n}^{T} \mathbf{p}_{i} - c \right)^{2} \qquad ( 3\text{-}3 ) $$
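The minimiser of Eq. 3-3 follows directly from the SVD of the centred neighbourhood: minimising over c first gives c = n^T p̄, and substituting back reduces the problem to finding the direction of smallest variance. A minimal sketch:

```python
import numpy as np

def tls_normal(neighbours):
    """Total-least-squares plane fit (Eq. 3-3) via SVD.

    The unit normal n minimising sum_i (n^T p_i - c)^2 with n^T n = 1 is
    the right singular vector of the centred data with the smallest
    singular value; the optimal c is then n^T p_bar.
    """
    p_bar = neighbours.mean(axis=0)
    _, s, vt = np.linalg.svd(neighbours - p_bar)
    n = vt[-1]                    # direction of smallest variance
    return n, float(n @ p_bar)    # plane n^T x = c
```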
Note that, in general, the estimated normal to a surface is not oriented. A surface normal
can be ‘positively’ (right-handed) or ‘negatively’ (left-handed) oriented, as shown in
Figure 3-18. A straightforward approach to orienting the normal vectors is to multiply
each normal vector by the sign of its z-component, so that all normal vectors have a positive
z-component. In the case where a consistent normal orientation is highly desirable (for
example, having the normal vectors of points belonging to the overhang of a roof pointing
downward instead of in the positive z-direction), the method described in [56] can be adopted.
The method works by first adjusting the sign of the estimated normal vector of the point with the
largest z coordinate to ensure it has a positive z-component. The orientation is then
“propagated” to the neighbouring points in a Riemannian graph, a connected graph
constructed to encode the geometric proximity of the point data.
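A simplified sketch of the propagation idea (the original method orders the propagation along a minimum spanning tree of the Riemannian graph; a breadth-first traversal of a kNN graph is used here for brevity, and assumes the graph is connected):

```python
import numpy as np
from scipy.spatial import cKDTree
from collections import deque

def orient_normals(points, normals, k=8):
    """Propagate a consistent normal orientation over a kNN proximity graph.

    Fix the sign of the normal at the point with the largest z, then flip
    each neighbour's normal if it disagrees (negative dot product) with
    its already-oriented parent.
    """
    nbrs = cKDTree(points).query(points, k=k + 1)[1][:, 1:]  # drop self
    seed = int(np.argmax(points[:, 2]))
    if normals[seed, 2] < 0:
        normals[seed] *= -1.0

    seen = np.zeros(len(points), dtype=bool)
    seen[seed] = True
    queue = deque([seed])
    while queue:
        i = queue.popleft()
        for j in nbrs[i]:
            if not seen[j]:
                if normals[i] @ normals[j] < 0:
                    normals[j] *= -1.0
                seen[j] = True
                queue.append(j)
    return normals
```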
Figure 3-18 Ambiguity of the Orientations of Surface Normals
3.3.3 Evaluation of the Surface Normal Estimation Approaches
We compared the performance of the BDB and the TLS methods for surface
normal estimation on the synthetic (Figure 3-2) and outdoor LIDAR (Figure 3-9) data.
The normal vectors (shown as red arrows) are estimated for every data point and plotted in
Figure 3-19 and Figure 3-20 respectively. In our experiment, the number of nearest
neighbours in TLS is determined adaptively13. In the BDB experiments, the five nearest
neighbours of p are selected for each p. The other user-defined parameter, c, is fixed at 2.5
in the experiment. Unlike TLS, the BDB algorithm is not as sensitive to the choice of k, as
the algorithm only requires the average distance of the point to its local neighbours.
Figure 3-19 Estimated normal vectors using the BDB method
13 Similar to the computation of region covariance, an adaptive k is essential for computation of TLS in
complicated outdoor data. This is explained in detail in Chapter 4 - 3D Over-segmentation.
3.3.3.1 Synthetic Data
Figure 3-20 Estimated normal vectors using the TLS method
For the synthetic data in our experiment, the reference normal is given by the plane
coefficients used to generate the data belonging to the plane. The data belonging to the
plane are corrupted by Gaussian noise of standard deviation 0.05 and the cluttered data are
Gaussian noise with variance of 3.
As mentioned in Section 3.3.1, the Delaunay/Voronoi method does not guarantee
normal estimation for all data points. As depicted in Figure 3-19, no normal is estimated
for some of the cluttered data. Given that the surface normal feature is generally
used for identifying supporting surfaces14, such as ground with horizontal support or
building walls with vertical support, not having a normal estimated at the cluttered data
points is not an issue.
With the TLS method, the estimation of surface normal for data belonging to a
plane outperforms the BDB method, as shown in Table 3-1 and the close-ups of the results
14 In our 3D data classification, the extended saliency features explained in Section 3.2.1 are used to filter
out the cluttered data. The estimated normal feature is then used to classify the planar data into different
classes.
in Figure 3-21 and Figure 3-22. The estimation of normal vectors using the BDB approach
is less robust to noisy data in the experiment. Also, BDB performs poorly in estimating the
normal vectors at the edge of the planar data. This can be explained by understanding the
nature of the BDB algorithm, in that the algorithm is designed for data belonging to an
irregular surface. In BDB, only intermediate points in the Voronoi Diagram are used in the
estimation for each data point.
In contrast, the estimation of normal vectors using the TLS method does not rely only
on local information. This is more desirable as, ideally, the estimation of normal
vectors of the data belonging to a plane should exploit information from all points on the
plane. The adaptive k estimation in the TLS approach can be designed to select a larger
number of k neighbouring points which belong to the same plane. The close-up of the
normal vectors estimation in Figure 3-22 shows the robustness of the algorithm in the
noisy data.
The mean square error of the estimated normal vectors and the ground truth for the
data belonging to the plane is shown in Table 3-1:
MSE        No of points    TLS             BDB
Plane      900             3.43 × 10⁻⁵     0.0714

Table 3-1 Mean Square Error of Normal Estimation of synthetic data using TLS and BDB
Figure 3-21 Estimated Normals using the BDB Approach (Close-up of data belonging to the plane in
Figure 3-19)
Figure 3-22 Estimated Normals using the TLS Approach (Close-up of data belonging to the plane in
Figure 3-20)
3.3.3.2 Outdoor LIDAR Data
We compared the approaches with the set of outdoor LIDAR data shown in Figure
3-9. The reference normal used for comparison is obtained by first manually segmenting
the vertical and horizontal planes. Least-square planes are then computed for both plane
data. The resulting normal vectors for both best-fit planes are used as the reference normal.
Similar to the synthetic data set, TLS works better than BDB in the normal vector
estimation for data belonging to planes, as depicted in Table 3-2, Figure 3-23 and Figure
3-24. The results show an MSE of 0.0148 for the vertical plane using the TLS method,
while the BDB method has an MSE of 0.028. As the size of the local neighbourhood for the TLS
method is adaptive, the estimation of normal vectors, particularly for data belonging to the
horizontal grass terrain, is more robust using TLS.
MSE                 No of points    TLS       BDB
Vertical plane      697             0.0148    0.028
Horizontal plane    1484            0.0815    1.2427

Table 3-2 Mean Square Error of Normal Estimation of LIDAR data using TLS and BDB
Figure 3-23 Estimated normal vectors using the BDB method
Figure 3-24 Estimated normal vectors using the TLS method
3.4 CONCLUSION
In this chapter, we provided background studies on feature descriptors for 3D outdoor
LIDAR data classification and selected two types of feature descriptors for our
experiments. We extended the saliency features and validated the method on both
synthetically-generated data and outdoor LIDAR data acquired from a terrestrial laser
scanner. The extended features are shown to be distinctive for the three data classes of
interest (linear, planar and cluttered). The features are also validated to be invariant to the
size of the region neighbourhood.
In order to classify data belonging to planes into different classes, we next selected
normal vectors as a feature descriptor. We compared the two most commonly used methods,
i.e. total least squares (TLS) and big Delaunay balls (BDB), for the estimation of surface
normal vectors. We showed that with adaptive estimation of nearest neighbouring points,
the TLS method is more robust for normal estimation of noisy multi-structure planar data.
4.0 3D OVER-SEGMENTATION
4.1 INTRODUCTION
Previous work in 3D data labelling has mostly been point-based [37, 46, 57, 58],
which introduces redundant computations for two reasons. Firstly, classifying every single
point is unnecessary. This is because most neighbouring points have similar features. For
example, points on the same plane will most likely have similar colour and similar
(estimated) normals. The amount of redundancy increases with the resolution, especially
for data with large planar surfaces. Therefore, labeling every point will result in a high
computational load which can be reduced by classifying a smaller sub-set of the data.
Next, depending on the type of learning model used for data labelling, i.e.
discriminative or generative, using the complete data set can be unnecessary. To
understand this, note that complete data are essential for the estimation of the prior parameters
(which depend on the ratio of data of different classes to the total amount of data) of the
learning model. Unlike the generative models, the discriminative models do not require
estimation of the priors. Additional training data with similar features do not affect the
estimated parameters of the discriminative learning model [59].
Previous work [60] has shown that discriminative models generally perform
better with larger training sets compared with generative classifiers, as the discriminative
classifier reaches a lower asymptotic error, albeit at a slower rate. In the urban modelling
application, the amount of training data is in general relatively large, making the
discriminative model suitable for the application. The discriminative model needs to “see”
all possibilities during training of the learning model15
. This requirement is achievable
with over-segmentation if the reduction of training data does not affect the degree of
variation of the data. We propose an over-segmentation algorithm that groups training data
with homogeneous features, therefore maintaining the degree of variation of the data to a
reasonable extent. An overview of the existing over-segmentation algorithm for 2D and
3D data will be discussed in the next section, followed by an elaboration of our proposed
algorithm and validation of the approach on synthetic and outdoor terrestrial 3D laser data.
4.2 BACKGROUND OF OVER-SEGMENTATION
One solution to the redundancy issue in computation is to group similar points and
then label the groups instead of individual points. This approach is well-established in 2D
image analysis. The raw image is often over-segmented into a higher level representation
(compared with the individual pixels) to avoid point-based classification. Over-
segmentation is the process by which the objects being segmented from the background
are themselves segmented into sub-components, depending on the level of a visual
similarity criterion. Complete segmentation generally requires cooperation with higher
processing levels that use specific knowledge of the problem domain (that match the real-
world object); whereas over-segmentation groups regions that are homogeneous with
respect to a chosen property such as brightness, colour or texture. Although over-
segmentation is often only a means to an end in segmentation problems, the process
15 To explain this statement, we first look at how the generative and discriminative models learn. The
generative model indirectly learns P(Y|X) on the basis of Bayes’ rule. In Bayes’ rule, the class-prior
probabilities and class-conditional densities are computed separately. For example, the generative Gaussian
Mixture Models (GMM) can be used to model the class-conditional densities, where a Gaussian is fitted to
data in each class. In contrast, the discriminative model directly learns P(Y|X) from the training data, for
example, making point estimates of the parameters using maximum likelihood. As a result, the
discriminative model needs to observe all possibilities during training.
increases the chances that boundaries of importance have been extracted for data
classification. Most of the previous segmentation approaches [61, 62] have shown that
extracting all objects of interest from the background, or each other, is difficult without
over-segmenting the data.
In 2D images, the graph cuts method is commonly employed for over-segmentation
(to group similar regions) [63]. An image is represented as a graph G = (V, E), where the
nodes V are the pixels in feature space and an edge in E is formed between every pair
of nodes. The graph can be partitioned into two disjoint sets, A and B, where A ∪ B = V
and A ∩ B = ∅, by removing the edges connecting the two parts. The degree of difference
between the two parts can be computed as the total weight of the edges that have been
removed, also known as the cut value. Finding the minimum cut value gives the
optimal bi-partitioning of the graph, i.e. an over-segmentation of the image.
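As a toy illustration, the cut value and bi-partition can be computed with networkx on a small pixel graph. Note that this is the seeded s-t minimum-cut variant rather than the normalised cut typically used in [63], and the capacity function and its scale are illustrative assumptions:

```python
import numpy as np
import networkx as nx

def min_cut_bipartition(image, seed_a, seed_b):
    """Toy illustration of bi-partitioning a pixel graph by a minimum cut.

    Capacities decay with intensity difference, so the cheapest set of
    edges to remove (the minimum cut value) tends to follow strong image
    boundaries. seed_a and seed_b are (row, col) pixels forced apart.
    """
    h, w = image.shape
    G = nx.DiGraph()
    for y in range(h):
        for x in range(w):
            for dy, dx in ((0, 1), (1, 0)):              # 4-neighbourhood
                yy, xx = y + dy, x + dx
                if yy < h and xx < w:
                    cap = float(np.exp(-abs(float(image[y, x]) -
                                            float(image[yy, xx])) / 10.0))
                    G.add_edge((y, x), (yy, xx), capacity=cap)
                    G.add_edge((yy, xx), (y, x), capacity=cap)
    cut_value, (part_a, part_b) = nx.minimum_cut(G, seed_a, seed_b)
    return cut_value, part_a, part_b
```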
However, there is a vast difference between 2D image pixel and 3D point cloud
processing – the sampling pattern of 3D point clouds is irregular, thus lacking an
organised lattice-like structure compared to the 2D image with regular lattice. In addition,
in terms of feature descriptors, the search for a coherent region in an image, compared
with 3D data segmentation, imposes very different requirements. Texture and colour are
generally used as similarity metrics for segmentation purposes in 2D images [64, 65],
whereas curveness is the main criterion for segmentation in 3D point clouds.
In reviewing the literature related to processing 3D point clouds, the author has not
located an algorithm similar to our proposed over-segmentation theory. To remove the
redundant data points, Triebel et al. [46] performed kd-tree pruning (in the 3D data
labelling process) which prunes according to the position of the point and its label.
Therefore, their method only reduces the 3D data used for model training and cannot be
applied to the inference of the model. The other methods of point cloud reduction [14, 15,
55, 66] are not suitable for direct application for the present purpose, as mentioned in the
introduction. We need an algorithm that not only reduces the point cloud set, but also
retains information from the removed point cloud. That is, the information from the
removed points should still contribute to the classification of the remaining point cloud,
similar to the over-segmentation approach for 2D image processing.
In this chapter, we explain our approach to 3D over-segmentation in which we
have designed an adaptive support region, namely super-voxel, for the purpose. Super-
voxels are the result of an over-segmentation of the 3D point cloud. The super-voxel
reduces the complexity of the raw data and provides a longer range of interaction to the
data. It is also a perceptually-consistent unit that is uniform in the underlying data
structure and colour. We also identify two important factors for the design of the 3D over-
segmentation algorithm that must be considered:
4.2.1 Shape of the Super-voxel
A major factor of concern in the design of the super-voxel is the shape. A closely-related
area to the design of a super-voxel is the design of a region descriptor. A region descriptor
is a compact intermediate representation of the input data, often computed from connected
sets of relatively homogeneous data. Previous studies in determining the shape of a region
descriptor can therefore be applied to the super-voxel. The most common shapes for region
descriptors include the sphere (3D shape contexts and harmonic shape contexts) [67],
the cylinder (SPIN images) [49] and the ellipsoid (Minimum Volume Ellipsoid) [68]. Previous
work [69, 70] has shown that the ellipsoid shape is preferable to the bounding box or
sphere for the approximation of the underlying object.
4.2.2 Scale Selection
The scale of the super-voxel is another crucial parameter. Lindeberg [71] shows
that the notion of scale selection is of utmost importance for automatic processing of
unknown data. However, previous work in the computation of region point descriptors
often assumes a fixed scale. For example, the regional point descriptor used in [67] is
computed as follows: each point cloud is divided into a fixed scale of 0.2-meter voxels
and one point is selected at random from each occupied voxel. In another study, Stamos
and Allen [50] compute the region descriptor from fixed k neighbouring data for every
point. Even though the authors claim the k value is optimal, the consistency of the grouped
points cannot be guaranteed. The variation in the structures of the data and sampling
density, as explained in Chapter 1, requires different scale levels. We have developed a
method that is capable of computing the scale of the super-voxel adaptively using 3D scale
theory. With super-voxels, the classification of the 3D data is then based on a reduced data
set of the original point clouds. The concept of reducing the data set is such that
geometrically similar features data are omitted from training and inference of the learning
model. With this method, the total processing time required for training and testing the
learning model can be reduced.
4.3 SUPER-VOXEL – A 3D OVER-SEGMENTATION APPROACH
We propose to over-segment the 3D data into super-voxels before classifying the
data, using algorithms adopted from 3D scale theory [51, 72]. The individual 3D points
are clustered together to form a higher-level representation, as shown in Figure 4-1. For
the p data points in Figure 4-1, n super-voxels, where n << p, are computed
based on the normalised colour similarity and the geometry of the data structure.
In this section, we will explain the scale and shape selection of the super-voxel.
Super-voxels are then computed on the selected data sets.
4.3.1 Sphere as Dividing Boundary
We have chosen a sphere over an ellipsoid as the shape of the super-voxel. The reason for
choosing a sphere (centred on a basis point p) over an ellipsoid or some irregular shape is to
avoid the effect of shape variation, which can affect the features extracted.
the shape of the segmented super-voxel region is ellipsoid, the super-voxel can be “longer”
in one direction, as shown in Figure 4-2a. The reason for the shape may be noise or the
surface scanned not being exactly flat. The resulting saliency features explained in
Chapter 3 that depend on the eigen-analysis of the scatter matrix will be linear-like, even
when the data is almost planar. This effect can be minimised using a sphere. As shown in
Figure 4-2b, both computed super-voxels (shown as two blue circles) are planar-like with
larger “surface-ness” saliency features, as explained in Chapter 3. By choosing a regular
shape, the over-segmented super-voxels will overlap each other as depicted in the same
figure. Despite having some points belonging to more than one super-voxel region, the
additional amount of computation is relatively small. Also, allowing overlap effectively
offers the advantage of providing maximum coverage of data with similar properties.
4.3.2 Automatic Scale Selection
Figure 4-1 Over-segmentation of 3D point clouds into super-voxels

Figure 4-2 Comparison of different shapes of super-voxel: (a) ellipsoid (yellow); (b) sphere (blue)

The radius of the super-voxel, r, is iteratively estimated with the following
algorithm [73, 74], which depends on the estimated curvature, density, noise and the colours
of the points:
Repeat:
1. Randomly pick a point p.
2. Check whether p already belongs to any previous super-voxel.
3. Start with k = 15 (adjusted manually according to the noise constant).
4. Iterate and refine (maximum of 10 steps) to estimate the optimal size of the super-voxel:
a. Compute the radius of the super-voxel, r, as the distance from p to its k-th nearest neighbour locally. If k < 2, set k = 2 in order to compute the least-squares solution for the estimation of the parameter d.
b. Fit a least-squares plane to p and its k nearest neighbours and compute d, where d is the shortest distance16 between p and the least-squares plane.
c. Compute µ, the average distance from p to all the points within the super-voxel.
d. Compute the estimated density ρ and the estimated curvature κ [75] locally:

$$ \rho \leftarrow \frac{k}{\pi r^{2}} \qquad ( 4\text{-}1 ) $$

$$ \kappa \leftarrow \frac{2d}{\mu^{2}} \qquad ( 4\text{-}2 ) $$

e. Use the known parameters to compute r_new:

$$ r_{\mathrm{new}} = (1 - damp)\, r_{\mathrm{old}} + damp \left[ \frac{1}{\kappa} \left( \frac{d_{1}\,\sigma_{n}}{\varepsilon \sqrt{\rho}} + d_{2}\,\sigma_{n}^{2} \right) \right]^{1/3} \left[ 1 - \max\big(\operatorname{var}(colour)\big) \right] \qquad ( 4\text{-}3 ) $$

f. Compute

$$ k_{\mathrm{new}} \leftarrow \rho\, \pi\, r_{\mathrm{new}}^{2} \qquad ( 4\text{-}4 ) $$

g. Stop if k_new > threshold or k_new saturates.
Until all points belong to a super-voxel.

Algorithm 4-1 3D Over-segmentation
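A Python sketch of one radius estimate under Algorithm 4-1 is given below; the stopping test, the small numerical guards and the exact neighbour bookkeeping are assumptions, and σ, d1, d2 are taken as already estimated:

```python
import numpy as np
from scipy.spatial import cKDTree

def supervoxel_radius(points, colours, p_idx, sigma, d1, d2,
                      eps=0.1, damp=0.2, k0=15, max_iter=10):
    """One super-voxel radius estimate, a sketch of Algorithm 4-1.

    sigma is the sensor noise constant; d1, d2 come from the fit in
    Eq. 4-6. Colours are assumed to be normalised RGB in [0, 1].
    """
    tree = cKDTree(points)
    p = points[p_idx]
    r = tree.query(p, k=k0 + 1)[0][-1]         # distance to k0-th neighbour
    for _ in range(max_iter):
        idx = tree.query_ball_point(p, r)
        k = max(len(idx) - 1, 2)
        nb = points[idx]
        # Least-squares plane through the neighbourhood: d is the distance
        # from p to the plane (smallest singular direction, cf. Eq. 3-3).
        centred = nb - nb.mean(axis=0)
        n = np.linalg.svd(centred)[2][-1]
        d = abs(n @ (p - nb.mean(axis=0)))
        mu = np.linalg.norm(nb - p, axis=1).mean()
        rho = k / (np.pi * r**2)                # Eq. 4-1
        kappa = 2.0 * d / mu**2                 # Eq. 4-2
        colour_term = 1.0 - colours[idx].var(axis=0).max()
        r_opt = ((1.0 / max(kappa, 1e-9)) *
                 (d1 * sigma / (eps * np.sqrt(rho)) + d2 * sigma**2)) ** (1.0 / 3.0)
        r_new = (1.0 - damp) * r + damp * r_opt * colour_term   # Eq. 4-3
        if abs(r_new - r) < 1e-6 * r:           # k_new (Eq. 4-4) saturates
            break
        r = r_new
    return r
```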
According to Chebyshev’s inequality, for every ε > 0:

$$ P\!\left( |p - \mu| \geq \frac{\sigma}{\sqrt{\varepsilon}} \right) \leq \varepsilon \qquad ( 4\text{-}5 ) $$
ε, which appears in Eq. 4.3, is set to 0.1 in our experiments. The noise constant σ can be
estimated experimentally by computing the average distance of every point (acquired from
a single plane) to the least-squares fitted plane. In the case where a different scanning
16 This can be computed using singular value decomposition (SVD), as explained in Section 3.3.2. The
shortest distance from p to the least-squares fitted plane is equivalent to the smallest singular value.
resolution is used, the noise constant has to be re-estimated for the different resolution
using the same approach. The accuracy of the estimation process relies on the level of
“flatness” and the amount of plane data used. To estimate the value of d1 and d2 in the
same equation, ground truth normals17
are required to solve the following linear
minimization problem (refer to [51] for more details):
$$ \min_{d_{1}, d_{2}} \sum_{i=0}^{N} \left( \left[ \frac{1}{\kappa_{i}} \left( \frac{d_{1}\,\sigma_{n}}{\varepsilon \sqrt{\rho_{i}}} + d_{2}\,\sigma_{n}^{2} \right) \right]^{1/3} - r_{i} \right)^{2} \qquad ( 4\text{-}6 ) $$
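Since cubing both sides of the residual makes the model linear in d1 and d2, the fit can be expressed as an ordinary least-squares problem. This linearisation is a convenient simplification, not necessarily the exact procedure of [51]:

```python
import numpy as np

def fit_d1_d2(r_ref, kappa, rho, sigma, eps=0.1):
    """Least-squares fit of d1, d2 in Eq. 4-6.

    r_ref are the reference radii derived from the ground-truth normals;
    cubing both sides gives d1*a_i + d2*b_i = r_i^3, linear in (d1, d2).
    """
    a = sigma / (eps * np.sqrt(rho)) / kappa
    b = (sigma ** 2 / kappa) * np.ones_like(kappa)
    d, *_ = np.linalg.lstsq(np.column_stack([a, b]), r_ref ** 3, rcond=None)
    return d  # [d1, d2]
```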
Another variable in Eq. 4.3, the damping factor damp, is set to 0.2 to prevent the iterations
from diverging or entering a marginally stable state.
To provide colour constraints within the super-voxel, we include colour properties
of the data in the estimation of the radius of the super-voxel (Eq. 4.3). First, the variances
of the normalized RGB values are computed for the super-voxels. The maximum of [varR
varG varB] is then used as a factor to reduce the radius if the colour within the super-voxel
is inconsistent. The proposed over-segmentation approach is validated on two sets of
synthetic data and two sets of outdoor LIDAR data. Super-voxels are computed and
plotted on the datasets. The noise constant in the algorithm is adjusted separately for the
synthetically generated and the outdoor LIDAR data sets.
4.3.3 Synthetic Data
Synthetic Data Set I
17 Similar to the estimation of the noise factor, the ground truth normal can be estimated by least square
fitting a plane, where the plane coefficient is the best estimated normal.
Figure 4-3 Super-voxels for synthetic data: (a) at the 10th iteration; (b) at the 20th iteration; (c) at the 30th iteration; (d) at the 40th iteration; (e) at the 50th iteration; (f) at the 100th iteration
We validated our algorithm on the synthetically generated data shown in the previous
chapter (Figure 3-2). With over-segmentation, the original data were reduced from 1500
data points to 129 super-voxels (which is around 8.6% of total amount of data). The
number of super-voxels on the plane is 19 (2.1% of all data belonging to the plane) and
the number of super-voxels on the clutter is 92 (15.3% of all data belonging to the clutter).
The average radius of the super-voxels for the data belonging to the plane is 5.32, while
the average radius for the data belonging to the clutter is smaller, at around 2.45. The
computation time is 74s on an Intel Core 2 Duo 2.13GHz CPU and 2GB of RAM.
Figure 4-4 Super-voxels for synthetic data at the 129th (final) iteration
Figure 4-3 shows the results of the over-segmentation at the 10th, 20th, 30th, 40th, 50th
and 100th iterations. The image in Figure 4-4 shows the final over-segmentation result and
Figure 4-5 shows the top view of the result. As depicted in these figures, the sizes of the
computed super-voxels for the data belonging to the clutter remain small compared to the
sizes of the super-voxels for the data belonging to the plane. Notice that there are some
super-voxels on the clutter data that appear much larger than the average size of the super-
voxels (on the clutter data). To explain this, the over-segmentation algorithm requires at
least three points in the super-voxel for computation of the least-squares fitted plane. The
clutter points which lie at a distance from the rest of the data therefore require a
larger super-voxel in order to include the minimum of two other points.
Figure 4-5 Bird’s-eye view of super-voxels for synthetic data at the 129th (final) iteration
Figure 4-6 Synthetic Data Set II
Synthetic Data Set II
To further evaluate the performance of the algorithm on data, we synthetically
generated two clusters of data (as shown in Figure 4-6) which also consisted of a plane
and a clutter. The difference from Synthetic Data Set I is that a portion of
the plane is closer to the clutter. Synthetic Data Set II consists of two sets (for training
and inference purposes) of a hundred points that form a plane with Gaussian noise of
standard deviation 0.01, and a hundred clutter points that represent vegetation.
The original data have been reduced from 200 data points to 52 super-voxels.
Similar to the results in the first data set, the computed super-voxels for the data belonging
to the clutter remain small compared to the super-voxels for the data belonging to the
plane. Because some portion of the clutter data is closer to the plane, as observed from the
over-segmentation result in Figure 4-7, the super-voxels fitted to the data belonging to the
plane that are located closer to the clutter are smaller, in order to exclude data belonging
to the clutter.
The number of super-voxels on the plane is 25 (25% of total data belonging to the
plane) and the number of super-voxels on the clutter is 32 (32% of total data belonging to
the clutter). The average radius of the super-voxels for the data belonging to the plane is
2.6, while the average radius for the data belonging to the clutter is 1.5. The computation
time is 12.1s. Details of the results from different viewpoints are shown in Figure 4-8 and
Figure 4-9.
Figure 4-7 Super-voxels for Synthetic Data Set II: (a) at the 5th iteration; (b) at the 10th iteration; (c) at the 15th iteration; (d) at the 20th iteration; (e) at the 30th iteration; (f) at the 52nd (final) iteration
Figure 4-8 Super-voxels for Synthetic Data Set II at the 52nd (final) iteration
Figure 4-9 Bird’s-eye view of super-voxels for Synthetic Data Set II at the 52nd (final) iteration
4.3.4 Outdoor LIDAR Data
Outdoor LIDAR Data Set I
The over-segmentation algorithm is also applied to the outdoor LIDAR data shown in
Figure 3-9. The original data set, which consists of 3648 points, has been reduced to 180
super-voxels (4.9% of the total data). The result during the iterations is shown in Figure
4-10, while close-ups from different viewpoints can be found in Figure 4-11 and Figure
4-12.
In order to evaluate the performance of the algorithm, we manually labelled this
data set into different planes and clutter. With over-segmentation, the result is similar to
the results for over-segmenting the synthetically-generated data sets, where the number of
super-voxels on the planes is relatively small compared to the number of super-voxels on
the clutter. The number of super-voxels on the vertical plane is 16 (2.3%), on the
horizontal plane is 27 (1.8%) and on the clutter is 111 (7.7%). The average radius of the
super-voxels for the data belonging to the vertical plane is 0.97, and for the data belonging
to the horizontal plane is 1.2; while the average radius for the data belonging to the clutter
is 0.29. The computation time is 237s on an Intel Core 2 Duo 2.13GHz CPU with 2GB of
RAM.
Figure 4-10 Super-voxels for outdoor LIDAR data: (a) at the 10th iteration; (b) at the 20th iteration; (c) at the 50th iteration; (d) at the 100th iteration; (e) at the 150th iteration; (f) at the 180th (final) iteration
Figure 4-11 Super-voxels for outdoor LIDAR data at the 180th (final) iteration
Figure 4-12 Bird’s-eye view of super-voxels for outdoor LIDAR data at the 180th (final) iteration
Outdoor LIDAR Data Set II
We tested the algorithm on a more complicated outdoor LIDAR data set, Data Set II, as
shown in Figure 4-13:
Figure 4-13 Outdoor LIDAR Data set II
The results during the iterations are shown in Figure 4-14 and Figure 4-15. The
final over-segmentation result is shown in Figure 4-16 and Figure 4-17 (top-view). The
original data have been reduced from 1500 data points to 279 super-voxels. The number of
super-voxels on the horizontal plane is 38 (2.1%), on the vertical plane is 54 (2.9%) and
the number of super-voxels on the clutter is 187 (6.7%). The average radius of the super-
voxels for the data belonging to the horizontal plane is 1.2, while for the data belonging to
the vertical plane it is 0.81, and the average radius for the data belonging to the clutter is
0.36. The computation time is around 7 minutes on an Intel Core 2 Duo 2.13GHz CPU
with 2GB of RAM. Notice that the super-voxels at the sparse horizontal plane are
relatively large compared to the denser vertical plane. This is because the estimated radius
of the super-voxels is inversely proportional to the estimated density (as defined in Eq. 4.2).
Figure 4-14 Super-voxels for Outdoor LIDAR Data Set II: (a) after 10 iterations; (b) after 30 iterations; (c) after 100 iterations
Figure 4-15 Super-voxels for Outdoor LIDAR Data Set II after 200 iterations
Figure 4-16 Super-voxels for Outdoor LIDAR Data Set II at the 276th (final) iteration
Figure 4-17 Top view of super-voxels for Outdoor LIDAR Data Set II at the 276th (final) iteration
4.4 CONCLUSION
In this chapter, we identified the problem with point-based classification. We
showed that the amount of original outdoor LIDAR data can be greatly reduced, avoiding
redundant computation for the classification of the data. We then proposed an effective
method to over-segment 3D outdoor LIDAR data. The algorithm is adaptive and capable
of accurately grouping data of similar properties, reducing original data to a relatively
small proportion. This greatly reduces the computation time required to classify the
original data, as only around 10% of the original data (depending on the resolution and the
complexity of the data) require labelling. Even though this additional pre-processing step
for classification may slightly increase the total processing time, over-segmentation
enhances the classification results, which will be described and shown in the following
chapter.
5.0 DATA CLASSIFICATION WITH MCRF
5.1 INTRODUCTION
In this chapter, we consider the supervised learning approach, which is a machine learning
technique to classify point clouds into different data types with extracted features. With
both the input and desired outputs (training data), supervised learning tries to find the
connection between the two sets of observations. The connection is a global model or a set
of local models that is capable of mapping the inputs to the desired outputs. With the
learned model, it is then possible to predict the output (as a continuous value in regression
problems or as a label for classification problems) for any valid input data. This is similar
to concept learning in human psychology, where the learner simplifies what has been
observed from the examples and uses the simplified concept to apply to future examples.
Existing learning approaches include logistic regression, neural networks (Multi-Layer Perceptron), Support Vector Machines (SVM), k-nearest neighbours, Gaussian mixture models (GMM), naïve Bayes, decision trees and Radial Basis Function (RBF) classifiers. Each of these learning models has strengths and weaknesses; this is captured by the ‘no free lunch’ (NFL) theorem of Wolpert [76]: no single classifier will outperform all
other classifiers on all learning tasks. In order to choose the optimal classifier, it is crucial
to understand the characteristics of the data set. Several empirical comparisons have been
carried out [77-80] to define the classifier selection criteria for different data
characteristics. However, none of the approaches has successfully predicted and explained
the classifier performance for different data sets [81]. Some research has proposed
combining classifiers of different natures to complement each other. For example, Barat et
al. [82] evaluated a set of neural and statistical classifiers and provided the appropriate
fusion rule. The drawback is an increase in classifier complexity and inefficiency.
The background to supervised classification and the models that have been used for effective 3D data classification are first described in this chapter. We demonstrate the
advantage of using a discriminative model over a generative model for our dataset. We
then show the need for a multi-scale graphical learning model for the classification
problem, and propose a multi-scale Conditional Random Field (mCRF) solution. The
proposed model is evaluated and the result is compared with some existing models. The
results confirm improvements over classification using logistic regression and Conditional
Random Fields.
5.2 BACKGROUND OF SUPERVISED CLASSIFICATION
Supervised classifiers can generally be divided into generative (model-based) and
discriminative classifiers. The following sections explain the details and differences of
both types of classifiers, and introduce the concept of graphical models, which combine graph theory and probability theory.
5.2.1 Generative Model
Popular generative models include: Bayes classifier, Hidden Markov Models and
Maximum Entropy Markov Models. These models define a joint probability distribution
of the observation and labelling sequences P(X,Y).
Consider the supervised learning problem:
We would like to approximate the unknown mapping function f: X→Y or P(Y | X), where
Y is the predicted output label and X is the input data.
The problem can be approached with the Naïve Bayes method which is a
generative supervised learning model based on Bayes’ theorem.
With Bayes’ rule:

$P(Y \mid X) = \dfrac{P(X \mid Y)\,P(Y)}{P(X)}$

( 5-1 )
In order to predict the output label for any data, we need to estimate P(X|Y) and
P(Y) using the training data pairs. P(X) is the expected data likelihood, i.e. the expectation of P(X|Y) over the prior distribution. It can also be seen as a normalisation term that ensures the probabilities sum to 1.
$P(X) = \sum_{Y} P(X \mid Y)\,P(Y)$

( 5-2 )
The naïve Bayes is called a generative classifier as P(X|Y) can be seen as a
distribution that describes how to generate random instances X conditioned on the
predicted target attribute Y. The term naïve is used because of the strong (naïve) independence assumption, i.e., the presence (or absence) of a particular feature of a class is
unrelated to the presence (or absence) of any other feature. This greatly simplifies the
problem, as only the variances for each class have to be computed instead of the whole
covariance matrix. Even though the independence assumption is fairly strong and
unrealistic, the naïve Bayes has surprisingly often performed much better than expected in
real-world learning problems.
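To make the estimation step concrete, the following minimal sketch (in Python, on hypothetical toy data rather than any data set from this thesis) implements a Gaussian naïve Bayes classifier: per-class priors P(Y) and per-feature means and variances for P(X|Y) are estimated from training pairs, and prediction takes the label maximising P(X|Y)P(Y).

```python
# Minimal Gaussian naive Bayes sketch (illustrative only; data are hypothetical).
import numpy as np

def fit_naive_bayes(X, y):
    """Estimate per-class priors P(Y) and per-feature means/variances for P(X|Y)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),          # prior P(Y=c)
                     Xc.mean(axis=0),           # per-feature mean
                     Xc.var(axis=0) + 1e-9)     # per-feature variance (no covariances)
    return params

def predict(params, x):
    """Return argmax_c P(x|c)P(c); P(X) is a common normaliser and can be dropped."""
    def log_joint(c):
        prior, mu, var = params[c]
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        return np.log(prior) + log_lik
    return max(params, key=log_joint)

# Toy usage: two noisy 2D clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = fit_naive_bayes(X, y)
print(predict(model, np.array([2.8, 3.1])))   # expected: 1
```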
For naïve Bayes with continuous inputs, P(X|Y) has to be modelled, for example by fitting Gaussian Mixture Models (GMM). Although the Gaussian Mixture Model is a parametric unsupervised learning method, it can be used as part of a supervised classification scheme by defining prototypes. To train the GMM for supervised learning, we first need to discover the number of clusters that exist and which clusters correspond to which classes. For more details on determining the optimal number of clusters, refer to Chapter 8. Bayesian classification with GMM has been
applied in classifying terrestrial point clouds into different data types. For instance, instead
of using a manually-fixed threshold [49, 50], Lalonde et al. [51] learned the distribution of
the saliency feature on hand-labelled data with GMM and estimated the parameters of the
GMM with the Expectation Maximization algorithm.
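A hedged sketch of this style of step is given below, using scikit-learn’s GaussianMixture; the saliency array and the number of components are placeholders, not the data or settings of Lalonde et al.

```python
# Sketch: fit a GMM to saliency-style features with EM, then read off soft
# cluster memberships. `saliency` is a hypothetical (n_points, 3) stand-in.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
saliency = rng.random((500, 3))                 # stand-in for real features

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
gmm.fit(saliency)                               # EM parameter estimation
posteriors = gmm.predict_proba(saliency[:5])    # soft cluster memberships
print(posteriors.round(2))
```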
The proposed approach by Lalonde et al. only clusters in the feature space.
Classifying only with local features (not taking neighbouring points into account) can be
very difficult due to ambiguity at the point level. Local classification can also lead to isolated false positives and false negatives. The reason that the authors use only local
features is because the application is real-time and the classifier is labelling every new
point as soon as it is acquired.
5.2.2 Discriminative Model
Another popular class of approaches comprises the discriminative models, including logistic regression, Conditional Random Fields (CRFs) [83] and Markov Random Fields (MRFs), which specify the probability of a label given an observation sequence, p(Y|X). By
modelling the conditional probability distribution instead of the joint probability
distribution, the discriminative models do not need to enumerate all possible observation
sequences, which may not be feasible [83].
For example, instead of estimating the joint probabilities, the logistic regression
computes P(Y | X) directly with the following parameterisation:
$P(Y = 1 \mid X) = \dfrac{1}{1 + \exp\big(w_0 + \sum_{i=1}^{n} w_i X_i\big)}$

( 5-3 )
$P(Y = 0 \mid X) = \dfrac{\exp\big(w_0 + \sum_{i=1}^{n} w_i X_i\big)}{1 + \exp\big(w_0 + \sum_{i=1}^{n} w_i X_i\big)}$

( 5-4 )
The training data are used to estimate $W = \langle w_0, w_1, \dots, w_n \rangle$ such that

$W \leftarrow \arg\max_{W} \prod_{l} P(Y^{l} \mid X^{l}, W)$

( 5-5 )
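A minimal sketch of this training step is shown below: it maximises the conditional log-likelihood of Eq. (5-5) by gradient ascent. It uses the common convention P(Y=1|X) = σ(w·x); Eqs. (5-3) and (5-4) place the exponential on the Y=0 branch instead, which only flips the sign of the weights. The toy data are hypothetical.

```python
# Sketch of discriminative training: choose W to maximise the conditional
# likelihood prod_l P(Y^l | X^l, W) by gradient ascent on the log-likelihood.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, iters=500):
    Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend the bias weight w0
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = sigmoid(Xb @ w)
        w += lr * Xb.T @ (y - p) / len(y)       # gradient of the log-likelihood
    return w

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (40, 2)), rng.normal(1, 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
w = train_logistic(X, y)
print(sigmoid(np.array([1.0, 0.9, 0.9]) @ w))   # P(Y=1) for a point near class 1
```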
Unlike naïve Bayes, logistic regression is a function approximating algorithm that
discriminatively predicts the label Y given any instance X. As mentioned before, no
classifier is optimal for all classification problems. Several authors have studied the comparison of generative and discriminative classifiers [84, 85]. It is generally
agreed that generative classifiers are more suitable when the acquired data are limited.
This is because the generative model has assumed the underlying distribution. Therefore,
the discriminative logistic regression converges at a slower rate during parameter
estimation compared to the generative naïve Bayes. With a sufficient amount of
comprehensive training data, discriminative classifiers have the advantage of being more
accurate. Even though logistic regression also assumes conditional independencies
between the input features, it is not as rigidly tied to the assumption as naïve Bayes is. When
the assumption is violated, the parameters of the logistic regression will be adjusted by the
conditional likelihood maximization algorithm to fit to the data. Real life data typically
cannot be modelled precisely with an exact distribution model. Thus, the discriminative
classifiers are more suitable for real life data in general.
5.2.3 Graphical Model
The aforementioned naïve Bayes and logistic regression methods are probabilistic
models. These models take only the node potentials into account. However, in 2D images
or 3D point clouds, a single pixel/point is very ambiguous by itself. Kumar [86] identified
this problem as the curse of ambiguity: since pixels/points from different classes can
appear similar, it is extremely difficult to identify the class label of each pixel/point
independently. By combining graph theory and probability theory, graphical models are
capable of modelling the spatial interaction between pixels/points. Jordan described
graphical models as “a natural tool for dealing with two problems that occur throughout
applied mathematics and engineering -- uncertainty and complexity -- and in particular
they are playing an increasingly important role in the design and analysis of machine
learning algorithms.” [87]. The previously mentioned supervised models such as hidden
Markov models (HMM), Markov random fields (MRF) and conditional random fields
(CRF) are examples of graphical models.
Using both generative and discriminative graphical models, Wolf et al. [58]
classified 3D points into navigable and non-navigable regions with an HMM locally, followed by global segmentation with an MRF. The HMM is implemented to learn the
difference between different classes, whereas MRF is implemented to enforce the spatial
constraint between the neighbouring points (smoothing). Instead of labelling and
smoothing the data labels in an ad-hoc manner, as proposed by Wolf et al., Anguelov et al.
[37] approached the 3D data classification problem with discriminative graphical models –
the associative Markov network (AMN). AMN allows effective inference using graph-cuts
[88]. The authors segmented 3D scan data into four classes: ground, tree, building and shrubbery. The experimental evaluation showed that the AMN predicted 93% of labels correctly, whereas an SVM (a non-graphical model) predicted 68% correctly. Also using AMNs,
Triebel et al. [46] classified point clouds into windows, walls and gutters. The features
employed include: the cosine of angles between the local normal vectors, distribution of
neighbours and the normalised height of the points. The results showed that the AMN
outperforms a generative model, the Bayes classifier, with sufficient training examples.
Similar to Wolf et al. and Anguelov et al., we only start to process data after all
data acquisition in one area has been completed. Therefore, we have the advantage of
“knowing” complete neighbouring data for each point. As a result, we can include the
neighbouring information into the learning model using a graphical model, as spatial
relationships exist among the input data. Also, with sufficient training data, a
discriminative graphical model, the CRF, is a suitable choice of learning model. In order to apply the learning model to the super-voxels introduced in Chapter 4, we propose a multi-scale approach, which is explained in detail in the next section.
5.3 MULTI-SCALE CONDITIONAL RANDOM FIELD
Conditional Random Fields are undirected graphical models which have shown
promising results in text processing [83, 89], image segmentation [26, 90], DNA sequence
prediction [91], table or diagram structure extraction from documents [92, 93] and more
recently, in 3D range data classification at point level [46, 94].
As stated previously, graphical models take neighboring data into account.
Therefore, in addition to the local node potentials, the pair-wise edge potentials are
included in the model. However, the edge potentials in a graphical model are limited in
providing long-range correlation, especially for high resolution data. In addition,
classifying every point using only the features of the point and of its neighboring points is
sensitive to the difference in resolution among different scans and scanner technologies
(the density of the point clouds also varies with respect to different distances from the
laser scanner). In 2D image labelling, multi-scale approaches [26] have been introduced to CRFs for super-pixel labelling. Given an over-segmentation of the data (as explained in Chapter 4), we construct the multi-scale Conditional Random Field (mCRF), as shown in Figure 5-1, for super-voxel labelling in the following setting:
Figure 5-1. Multi-scale Conditional Random Fields with local edges (green) and regional edges
(black).
Let $s = s_1, \dots, s_N$ be the observed feature vectors of the N super-voxels. Each feature vector consists of a combination of feature descriptors such as heights, colours, spin images and estimated normals.
Let $c = c_1, \dots, c_N$ be the labels in C of the observed super-voxels. In urban modelling, labels range from low-level ones such as ‘planar’ and ‘cluttered’ to higher-level ones such as ‘building’, ‘vegetation’, ‘tree trunk’, ‘grass’ and ‘man-made pathway’.
Let $x = x_1, \dots, x_M$ be the observed feature vectors of M points of the point cloud data, randomly selected within every super-voxel.
The mCRF with parameters $\theta = \{l, r\}$, where $l = \{\lambda_i, \lambda_{ij}\}$, $r = \{\lambda_i, \lambda_{ik}\}$, $\lambda_i = \{\lambda_i^1, \dots, \lambda_i^C\}$, $\lambda_{ij} = \{\lambda_{ij}^1, \dots, \lambda_{ij}^C\}$ and $\lambda_{ik} = \{\lambda_{ik}^1, \dots, \lambda_{ik}^C\}$, defines the conditional probability of a state sequence given an observation sequence as:

$P_l(c \mid x) = \dfrac{1}{Z_l} \prod_{(i,j) \in \varepsilon} \Psi_{ij}(c_i, c_j, x_i, x_j) \prod_{i=1}^{N} \Psi_i(c_i, x_i)$

$P_r(c \mid s) = \dfrac{1}{Z_r} \prod_{(i,k) \in S} \Psi_{ik}(c_i, c_k, s_i, s_k) \prod_{i=1}^{N} \Psi_i(c_i, s_i)$

$P(c \mid s, x, \theta) = P_r(c \mid s) \times P_l(c \mid x)$

( 5-6 )
Pl is the probability of the super-voxel being labelled as class c given the features of the mid-point of the super-voxel and of its neighbours within the super-voxel. Pr is the probability of the super-voxel being labelled as class c given the features of the mid-point of the super-voxel and the mid-points of its neighbouring super-voxels. The final conditional probability of the super-voxel being of class c is the product of these probabilities, as shown in Eq. (5-6) (under the assumption of independence between the regional and local features). In Eq. (5-6), $Z_l$ and $Z_r$ are the normalisation constants that make the conditional probabilities sum to one. The local edge potential, region edge potential and node potential are defined in Eqs. (5-7) to (5-9) as follows:
Local edge potential:

$\Psi_{ij}(c_i, c_j, x_i, x_j) = \dfrac{\exp\big\{\sum_{C} \lambda_{ij}^{C} f^{C}(c_i, c_j, x_i, x_j)\big\}}{\sum_{c_i, c_j} \exp\big\{\sum_{C} \lambda_{ij}^{C}\, x_i x_j\, c_i^{C} c_j^{C}\big\}}$

( 5-7 )
As shown in Figure 5-1, the local edge potential is used to exploit the label interactions between the point $x_i$ and m randomly selected neighbours within the super-voxel.
Region edge potential:

$\Psi_{ik}(c_i, c_k, s_i, s_k) = \dfrac{\exp\big\{\sum_{C} \lambda_{ik}^{C} f^{C}(c_i, c_k, s_i, s_k)\big\}}{\sum_{c_i, c_k} \exp\big\{\sum_{C} \lambda_{ik}^{C}\, s_i s_k\, c_i^{C} c_k^{C}\big\}}$

( 5-8 )
The regional edge potential provides a coarser constraint. The l closest neighbouring super-voxels are selected as the regional edges.
Node potential:

$\Psi_{i}(c_i, x) = \dfrac{\exp\big\{\sum_{C} \lambda_{i}^{C} f^{C}(c_i, x)\big\}}{\sum_{c_i} \exp\big\{\sum_{C} \lambda_{i}^{C}\, x_i\, c_i^{C}\big\}}$

( 5-9 )
The node potential is a discriminative logistic regression (maximum entropy classifier)
that models each label c as a linear function of x or s.
$f^{C}(c_i, c_j, x_i, x_j)$, $f^{C}(c_i, c_k, s_i, s_k)$ and $f^{C}(c_i, x)$ are feature functions, which are often binary-valued for categorical classes (such as in text applications). In our application with ordinal observations, the feature functions are real-valued; they are defined over the local data point features (for example, the logarithm of the saliency features) of the observation sequences x and s, the current state $c_i$ and the neighbouring states $c_j$ and $c_k$.
mCRFs learn by finding the node, local edge and regional edge weight vectors that maximise the log-likelihood. With a Gaussian prior with variance $\sigma_C^2$, the penalised log-likelihood is:

$L(\theta) = \sum_{i=1}^{N} \log P(c \mid s, x, \theta) - \sum_{C} \dfrac{\lambda_C^2}{2\sigma_C^2}$

( 5-10 )
where the second summation provides smoothing to avoid over-fitting [95]. The scaled
conjugate gradient optimization algorithm is used for the maximization.
Given the observation sequence x, inference in mCRFs finds the most likely state sequence $c_{\max}$:

$c_{\max} = \arg\max_{c}\, p(c \mid s, x, \theta)$

( 5-11 )
Since exact inference can be intractable in such models, approximate inference using
belief propagation is performed for finding cmax.
As explained in Section 4.3.1, the super-voxels can overlap. In inference, the
points belonging to more than one super-voxel will be labelled as the maximum of the
product of the conditional probabilities from the overlapped super-voxels. Let V be the
super-voxels that include point p, and the label of point p is therefore:
$c_{p,\max} = \arg\max_{c} \prod_{v \in V} p(c \mid p, \theta_v)$

( 5-12 )
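A small sketch of this rule: a point covered by several overlapping super-voxels takes the label that maximises the product of the per-class probabilities of those super-voxels (computed here as a sum of log-probabilities for numerical stability). The probability values are illustrative.

```python
# Sketch of Eq. (5-12): label a point covered by several overlapping
# super-voxels by the argmax of the product of their class probabilities.
import numpy as np

def label_overlapping_point(probs_per_voxel):
    """probs_per_voxel: (n_voxels, n_classes) conditional probabilities."""
    log_product = np.log(probs_per_voxel).sum(axis=0)   # log of the product
    return int(np.argmax(log_product))

probs = np.array([[0.7, 0.2, 0.1],      # voxel 1 strongly favours class 0
                  [0.4, 0.5, 0.1]])     # voxel 2 weakly favours class 1
print(label_overlapping_point(probs))   # class 0 wins the product
```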
The points labelled as ‘planar’ or ‘building’, ‘grass’ and ‘man-made pathway’ can then be
extracted to form a Digital Surface Model (DSM) and Digital Terrain Model (DTM). The
‘cluttered’ or ‘vegetation’ points can be reduced with data reduction techniques [14] or
replaced with generic models.
5.4 PLANE PATCHES FITTING
With the ‘planar’ or ‘building’, ‘grass’ and ‘man-made pathway’ points extracted, visualising a large-scale model would still require a great deal of memory and processing time. As explained in the previous chapter, for large-scale data, the raw data were divided into small voxels before being processed, to reduce the processing time. Similarly, for visualisation, one solution is to process one small voxel at a time and fit planes to the extracted planar data in that voxel.
To perform robust plane fitting for visualisation, Random Sample Consensus (RANSAC) is a commonly used method [96]. A general approach is to first fit the large-scale plane data with RANSAC and then refine the fit with least-squares fitting, as outlined in Algorithm 5-1. By representing the individual data with fitted plane models, a great reduction in memory space can be achieved.
The RANSAC algorithm for plane fitting is as follows:

Determine the inlier fraction w, the inlier threshold τ, the number of inliers m required for a good fit, and the success probability p, where n is the number of points in a minimal sample (three for a plane).

Compute the number of iterations i required:

$i = \dfrac{\log(1 - p)}{\log(1 - w^{n})}$

( 5-13 )

Repeat i times:

Hypothesis generation: randomly select 3 points and fit a plane through them, calculating the plane coefficients:

$a_p = y_1(z_2 - z_3) + y_2(z_3 - z_1) + y_3(z_1 - z_2)$
$b_p = z_1(x_2 - x_3) + z_2(x_3 - x_1) + z_3(x_1 - x_2)$
$c_p = x_1(y_2 - y_3) + x_2(y_3 - y_1) + x_3(y_1 - y_2)$
$d_p = -\big[x_1(y_2 z_3 - y_3 z_2) + x_2(y_3 z_1 - y_1 z_3) + x_3(y_1 z_2 - y_2 z_1)\big]$

( 5-14 )

If the coefficients are all zero (a degenerate sample), re-select 3 random points.

Model verification: compute the perpendicular distance from each point to the plane:

$d = \dfrac{|a_p x + b_p y + c_p z + d_p|}{\sqrt{a_p^2 + b_p^2 + c_p^2}}$

( 5-15 )

Scoring function: if the distance is less than τ, the point is an inlier.

Model refinement: if the number of inliers is greater than m, refit the set of inliers with least-squares fitting and recompute the perpendicular-distance residuals.

The best-fitting plane is the one with the least perpendicular-distance residual.

Algorithm 5-1 RANSAC
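The following runnable sketch implements Algorithm 5-1 with NumPy, using Eqs. (5-13) to (5-15); the parameter values are illustrative and the least-squares refinement step is omitted for brevity.

```python
# Sketch of Algorithm 5-1: RANSAC plane fitting (Eqs. 5-13 to 5-15).
import numpy as np

def fit_plane_3pts(p1, p2, p3):
    n = np.cross(p2 - p1, p3 - p1)          # normal (a_p, b_p, c_p) of Eq. (5-14)
    d = -n @ p1                             # d_p so that n.x + d = 0 on the plane
    return n, d

def point_plane_dist(points, n, d):
    return np.abs(points @ n + d) / np.linalg.norm(n)   # Eq. (5-15)

def ransac_plane(points, w=0.5, tau=0.05, p=0.99, n_sample=3):
    i = int(np.ceil(np.log(1 - p) / np.log(1 - w ** n_sample)))  # Eq. (5-13)
    best = (None, None, -1)
    rng = np.random.default_rng(0)
    for _ in range(i):
        idx = rng.choice(len(points), 3, replace=False)
        n, d = fit_plane_3pts(*points[idx])
        if np.allclose(n, 0):               # degenerate (collinear) sample
            continue
        inliers = point_plane_dist(points, n, d) < tau
        if inliers.sum() > best[2]:
            best = (n, d, inliers.sum())
    return best

# Toy data: a noisy z = 0 plane plus gross outliers.
rng = np.random.default_rng(4)
plane = np.column_stack([rng.uniform(-1, 1, (200, 2)), rng.normal(0, 0.01, 200)])
outliers = rng.uniform(-1, 1, (50, 3))
n, d, count = ransac_plane(np.vstack([plane, outliers]))
print(count, n / np.linalg.norm(n))         # ~200 inliers, normal ~ (0, 0, ±1)
```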
To estimate the “inlier threshold” τ in the RANSAC algorithm, the same method as used in estimating the noise constant or variance (for the over-segmentation algorithm in Chapter 4) can be applied. The “inlier threshold” determines whether a point is an inlier to the fitted model, and it is typically set at 1.5 times the estimated variance. There are a
number of variations to the RANSAC algorithm, including replacing the hard “inlier
threshold” with a weighted function. We will discuss some of the variations and
challenges of the RANSAC algorithm in Chapter 6.
5.5 RESULTS FOR DATA CLASSIFICATION
In this section, we demonstrate the advantage of performing over-segmentation
over individual data labelling for data classification using synthetic data. To do so, we
evaluate two classifiers, the Conditional Random Field (CRF) and the logistic regression
(which only considers the node potentials), using synthetic data to show the improvement
achieved by further taking the independencies among the neighbouring data into account.
The proposed method, the multi-scale Conditional Random Field (mCRF), is next
validated using two sets of complicated real-world data acquired from the terrestrial laser
scanner (shown in Chapter 2). We demonstrate the advantage of including the constraints
between super-voxels in the multi-scale method compared to the CRF. We also compare
our proposed method with triangulations and the direct plane fitting approach.
Lastly, we validate our algorithm on a large-scale real-world data set, registered
from seven terrestrial laser scans. The additional pre-processing step (required to handle
the relatively large data set) is explained, and the result is fitted with plane patches for
visualization.
5.5.1 Synthetic data sets
The proposed algorithm is compared with CRF without data reduction on a set of
synthetically generated data as shown in Figure 4-6, where the connecting neighbours are
selected from a fixed support region. The algorithm is also compared with a discriminative classifier that does not take edge potentials into account: logistic regression. A comparison of discriminative and generative classifiers for urban data can be found in [97].

Method | Accuracy (0-1) | Runtime (training) | Runtime (testing) | Training iterations | Train data (of 200) reduced to | Test data (of 200) reduced to
CRF with adaptive point reduction | 1 | 12.46 s | 0.109 s | 23 | 42 | 52
CRF without adaptive data reduction | 0.939 | 77.96 s | 0.516 s | 37 | N/A | N/A
Logistic regression | 0.835 | 11.344 s | 0.156 s | 18 | N/A | N/A

Table 5-1 Results for the synthetic data example

Figure 5-2. Classification results for synthetic data learned and inferred with (a) CRF with adaptive data reduction, (b) CRF without adaptive data reduction, (c) logistic regression
CRF with super-voxels
Six points are randomly selected from the adaptive support region for the edges in CRF
for every point p. The saliency features of the six points and point p are used as the feature
vector. From Table 5-1, we can see that the training data are reduced to 42 points (21% of the original data), requiring 12.46 s for parameter estimation on an Intel Core 2 Duo 2.13 GHz CPU with 2 GB of RAM.
For inference, the testing data are reduced to 52 points (26%). Most of the data reduction occurs within the plane regions, where the curvature (one of the factors in the computation of the super-voxel radius) is lower and the support region is therefore larger, since the support region is inversely proportional to the curvature. The
amount of reduction in data therefore depends on the ratio of the data type to the total
amount of data; i.e. for data sets with more planar data, the resulting reduction in the
amount of data will be more significant.
In the synthetic data experiment, the CRF managed to correctly classify all data points, as shown in Figure 5-2: blue represents “planar” data and red represents “cluttered” data.
CRF without super-voxels
For the selection of edge points, for every point p, three points were randomly picked from
a fixed radius and another three points were randomly picked from a fixed cylinder, as
described in the proposed method in [37]. As shown in Table 5-1, the time taken to train
the CRF with all data is much longer. The reason for a few misclassifications compared to
the “CRF with super-voxel” approach is due to a different edge points selection method:
in the “CRF with super-voxel” approach, the selected neighbouring edge points are from
an adaptive radius and thus are more likely to be from the same class.
Logistic regression
Similar to CRF, the logistic regression is also a discriminative classifier, but it takes only
node potentials into account. The time taken for training and inference of the learning model is similar to the time for the CRF with adaptive data reduction. However, without spatial information, the algorithm is prone to misclassifying a “flatter” cluttered region as “planar” data. This is because, in the computation of the saliency features of a point from the cluttered data, it is possible that the k nearest points selected to form the covariance
matrix are approximately coplanar. Without neighbouring information, it can be difficult
for the classifier to correctly classify such points. As a result, the logistic regression has
the worst classification accuracy compared to the other two approaches.
5.5.2 Urban data sets
We next validated our proposed mCRF-with-super-voxels approach on three sets of real outdoor scanned data. The mCRF was trained with hand-labelled outdoor laser-scanned data. The original segmented 57,734 training data points were reduced to 5,850 super-voxels automatically with our proposed over-segmentation algorithm. As a result, with over-segmentation, the total training time for the model is reduced to 10% of the original time. Note that the total reduction is less than the reduction achieved on the testing dataset (as most data reduction occurs in the planar regions, and we deliberately selected a balanced amount of data from different datasets to cover most of the data variation).
The training data were hand-labelled and the data were chosen from three scans
with different densities, scene configurations and lighting conditions. The total training
time was around 5 hours on an Intel Core 2 Duo 2.13GHz CPU with 2GB of RAM. We
then computed the super-voxels of these hand-labelled data. Four neighbours were
randomly selected for local edge features, and four nearest super-voxels were selected for
the regional edge features. This means that for every support region, we needed to
compute feature descriptors for only five points instead of every point within the support
region. In the following experiments, the learning model was trained to classify the data into 5 classes: vegetation, trunk, man-made objects (building, signboard), pathway and terrain (grass).
Data set 1
With the 3D over-segmentation, the 10,660 points shown in Figure 5-3 are reduced to 538 super-voxels. Therefore, only 5% of the original data has to be labelled, providing a large reduction in total inference time. Figure 5-3a shows the over-segmentation result.
Note the bigger super-voxels in the geometrically flat data (such as building and terrain
data). The computation time required for feature extraction is around 53.7s, and 0.1s for
inference of the super-voxels with CRF and 0.2s for mCRF.
The labelled data are shown in Figure 5-3d; our classification accuracy is around 86% for mCRFs (Figure 5-3c), compared to 78% for CRFs (Figure 5-3b). The feature descriptors used in both experiments are the same as those explained in Chapter 3. With a negligible difference in processing time, we show that the multi-scale approach improves the super-voxel labelling.

Figure 5-3. Data set 1. (a) 3D points over-segmented into super-voxels. (b) Labelled super-voxels’ mid-points with CRFs (yellow: man-made objects; red: vegetation; light blue: trunks; green: terrain; dark blue: pathways). (c) Labelled super-voxels’ mid-points with mCRFs. (d) Labelled original data with mCRFs.
Data set 2
A more complicated dataset with 158,922 points, shown in laser intensity in Figure 5-4, was over-segmented into 8,330 super-voxels (5.2% of the original data), as shown in Figure 5-5. Similarly, we saved around 95% of the processing time on inference. The
computation time taken for feature extraction from the super-voxels was around 15
minutes, 87s for inference in mCRFs, and 39s for inference in CRFs. The time difference
of 48s is almost negligible compared to the time taken for feature extraction.
Figure 5-4 Outdoor LIDAR data set II
Figure 5-5 Segmented super-voxels
Figure 5-6 (a) Labelled super-voxels’ mid-points with CRFs; (b) labelled super-voxels’ mid-points with mCRFs (yellow: man-made objects; red: vegetation; light blue: trunks; green: terrain; dark blue: pathways)
Figure 5-7 (a) Triangulated urban model; (b) enlargement of triangulated building surface; (c) RANSAC-fitted plane patches; (d) extracted building/terrain surface plane patches
The labelled super-voxels with CRFs are shown in Figure 5-6a with classification
accuracy around 72%. With the longer range of interaction provided by the regional edge
features in mCRFs, label accuracy was improved to 79%, as shown in Figure 5-6b.
Most misclassifications occur between ‘pathways’ and ‘terrain’, most likely due to their very similar features (flat surfaces with similar upward-pointing normal vectors) and to colour variation caused by building or vegetation shadows. For some applications, such as robotic navigation, which has different requirements (e.g. whether the terrain is navigable), the misclassification of terrain points in this experiment will not be an issue, because the mislabelled ‘pathway’ data will always be close to the real pathway location.
As direct triangulation is a popular and straightforward method for building reconstruction, we compare our proposed method with the triangulation approach. The
urban data are triangulated using RiScan Pro (the companion software for the RIEGL terrestrial scanner) with the following settings: edge clearing threshold = 0.05 m; depth factor = 8; depth threshold = 0.05 m. The data were reduced to
82,518 elements (triangles), which is about half of the original data, as shown in Figure
5-7a. Figure 5-7b shows the enlargement of the building surface where the drawbacks
mentioned in the introduction can be observed, including rough edges of the vegetation,
occlusions on the building surfaces and spikes between building surface and vegetation.
Different levels of triangulation can be generated with different settings, where the number of points used for triangulation can be further reduced, or increased for better precision. In short, triangulation can be a solution for straightforward object reconstruction where memory, processing time and visualisation artefacts are not constraints.
To compare our proposed method with a direct RANSAC-based approach [98], we
fitted plane patches on our classified “planar” data. In the direct RANSAC-based approach
proposed by Hansen et al., the original point cloud was divided into 3D voxels of fixed
size. For every voxel, a plane was fitted with RANSAC-based plane fitting as shown in
Figure 5-7c. In Hansen et al.’s approach, vegetation was also fitted with plane patches
then filtered out (plane patches with data density lower than a threshold will be omitted).
For the remaining plane patches, neighbouring planes with similar co-normalities and co-
planarities are grouped. The major grouped plane clusters that mainly contain
building/terrain data can then be extracted as shown in Figure 5-7d. The resulting number
of planes was 1522; thus, the approach requires only 6088 vertex points to represent the
planar surfaces in the building surface model.
However, the method proposed by Hansen et al. exhibits some disadvantages. Many plane patches that are fitted on building data are filtered out as non-planar patches in the grouping process. For example, dense vegetation close to a building or the terrain surface can deviate the fitted building plane. Consequently, during the grouping process, which groups according to the similarity of co-normality and co-planarity, the fitted plane patches of the building or terrain surface that are deviated by nearby vegetation may be filtered out. Also, in this approach, the plane-patch groups that contain fewer than a predefined number of plane patches are filtered out, to avoid plane patches that are fitted
on outliers or vegetation. This could result in useful patches that are fitted on relatively
small structures, which have fewer co-planar neighbouring plane patches, being filtered
out in the grouping process. Furthermore, as it is common for a city model to contain cylindrical buildings, plane patches fitted on cylindrical buildings could be filtered out during the plane-patch grouping process. In addition, the method proposed by Hansen et al. involves several thresholds, including the inlier threshold in the RANSAC process and the co-planarity threshold. These thresholds can be difficult to estimate empirically and have to be re-estimated for different data types.
Data set 3
Next, the proposed mCRF model was also tested on large-scale data labelling (a data set that contains more than 10 million points). For the large-scale data, the raw data were divided into small voxels before being processed, to reduce the processing time. With mCRFs and super-voxels, the total time required to label a single scan (1,009,942 points)
was reduced from around 17 hours (without data division) to 5.8 hours for 100x100 =
10,000 divisions, to 5.1 hours for 200 x 200 = 40,000 divisions, and to 1.6 hours with
400x400 = 160,000 divisions. The accuracy considerably dropped from 0.853 for the case
of “no division” to 0.795 for the case of 400x400 divisions. To remedy this, post-
processing steps that refine the object modelling can be applied.
For a complete scan of the area, a total of seven scans were stitched together (7,086,588 points). The classification accuracy with 400x400 = 160,000 divisions remained acceptable, as can be observed in Figure 5-8. A total of 12.8 hours was required for the computation of the scans.
Plane patches were fitted onto the labelled building, terrain and floor data using the
RANSAC algorithm (as explained in Section 5.4) as a post-processing step to
geometrically model the scene into Digital Terrain Model (DTM) and Digital Surface
Model (DSM). The plane-fitting process improved the classification result: isolated misclassified points and outliers were “filtered” out, while many of the small ‘holes’ caused by occlusion or misclassification (e.g. ‘building surface’ points labelled as ‘vegetation’) were recovered.
5.5.3 Summary of the Experiment Results
In short, the experiments conducted have shown the performance of our approach
in classifying terrestrial outdoor LIDAR data. We have shown improvements by including
multi-scale modelling, and we have compared our approach to the commonly-applied
methods for urban modelling. Our approach has succeeded in modelling most of the man-
made surfaces and ground planes accurately, compared to triangulations or the direct plane
fitting approach [98]. For large-scale data, we have shown that by combining model fitting
and classification approaches, efficient and accurate urban modelling can be achieved.
Figure 5-8. Plane fitting on labelled building and terrain data
5.6 CONCLUSION
We have presented an efficient and accurate method for 3D terrestrial range data
classification. We reduced the amount of data by over-segmenting the raw point clouds
into super-voxels, reducing the data (in most cases) to 5% of the original amount. We implemented the multi-scale Conditional Random Field to provide connectivity at the node, local edge and regional edge levels. The improvement in labelling precision of global classification (CRF) over local classification (logistic regression) has been demonstrated. We have also shown that the regional features of the super-voxels in the mCRF improve the classification accuracy of CRFs by 5% to 10%, while requiring only a negligible increase in the
computation time. We have also provided a strategy to handle relatively large-scale data,
and validated our proposed algorithm with an acquired real-world data-set.
6.0 ROBUST SEGMENTATION
6.1 INTRODUCTION
As discussed in the introduction, laser scanning technology has recently become capable
of producing dense point clouds. It is now possible to visualise highly-detailed urban
environments represented by 3D points, as shown in Figure 6-1.
3D points can be seen as the simplest form of geometric building blocks which can
be an effective display primitive [99]. As an alternative to 3D triangulation for
visualisation, point sets require little, if any, pre-processing for visualising urban
environments. However, this is memory-intensive, particularly for large-scale models. As
stated in Chapter 1, one of the solutions to urban modelling is to extract planar data (from
man-made structures) and then geometrically fit locally-delimited planes to the data.
Geometric modelling can be seen as representing potentially thousands of raw data points
with a single shape (and thus with few parameters). This results in a large reduction in
storage space and provides the ability to undertake geometric reasoning.
Figure 6-1 3D Terrestrial Outdoor Point Clouds
In order to fit a plane to the data, we need to robustly segment the planar data into
regions of locally delimited planes. Similar to the over-segmentation described in Chapter 4, robust segmentation groups similar regions together. The difference is that over-segmentation is more local, and multiple over-segmented segments can belong to the same object, whereas robust geometric fitting recovers shapes that are more meaningful in terms of the way we think of the world. Over-segmentation is often used as a preliminary
processing step when existing techniques (robust segmentation) are insufficient to
segment or model the data accurately. To date there has not been a successful
demonstration of segmentation of complicated outdoor data using only robust
segmentation. The enormous amount of data, the large varieties and shapes of the
structures, and the sheer number of structures make robust multi-structure segmentation
extremely challenging. The existing literature on multi-structure segmentation using
robust segmentation typically involves only a relatively small number of simple structures
[100].
In our proposed approach for the automatic generation of 3D outdoor urban models, the raw data are initially over-segmented into super-voxels (Chapter 4) so that they can be efficiently classified into different data types (Chapter 5). To model the extracted building data automatically, robust segmentation is required, because data from complex environments can be difficult to segment in 3D using simple (non-robust) statistical fitting. Moreover,
it takes much longer to fit models over a large search space. The current challenges in robust
multi-structure segmentation are as follows:
i) Gross outliers
Traditionally, classical statistical fitting methods, such as Least Square Fitting, are
sufficient for model fitting when only one structure is present in the data and there are no
outliers. Least square fitting is the simplest and most commonly used technique to find the
best fit for a set of data points. The best fit is the instance of the fitted model when the sum
of squared residuals is a minimum (a residual being the difference between an observed
value and the value given by the fitted model). Least squares methods were independently
developed by mathematicians Karl Friedrich Gauss in 1794, Adrien Marie Legendre in
1805 and Robert Adrain in 1808 [101]. However, this approach is highly sensitive to
outliers (for example, data belonging to one structure may be fitted to another close
structure). An extreme outlier with relatively large residual is capable of drastically off-
setting the fitted model. Therefore, the least square approach has a zero break-down point.
A more robust algorithm is therefore required.
In 1984, Rousseeuw proposed replacing the sum in the least squares method with a median. The Least Median of Squares (LMedS) algorithm can tolerate up to 50% outliers [102]. A class of estimators, the M-estimators (maximum-likelihood-type estimators), replaces the sum of squared residuals in least squares with a more slowly increasing loss function (of the data value and parameter estimate). However, these methods are not fully robust. The principal measures of the robustness of an estimator are its breakdown value (the fraction of outlying data points that can corrupt the estimator) and its influence function (which shows the effect on the estimator of changing one point of the sample) [103]. The M-estimators have breakdown points below 50%. Popular robust estimators that can handle more than 50% outliers include the Hough Transform and Random Sample Consensus (RANSAC) [104-106].
Non-robust statistical algorithms are typically more accurate. For example, in least squares fitting, although the sample mean is easily upset by contaminated data, it provides the most accurate estimate of the location of the normal distribution of residuals produced by the model. Computer vision algorithms such as RANSAC are robust but less accurate. Therefore, the robust segmentation methods are generally performed first, to detect and eliminate outliers, followed by the least squares method to refine the model parameters.
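The refinement stage can be sketched as a total least-squares plane fit: subtracting the centroid and taking the direction of least variance (the smallest singular vector) as the plane normal minimises the sum of squared orthogonal residuals. The data below are toy values.

```python
# Sketch of the least-squares refinement: fit a plane to presumed inliers by
# minimising the sum of squared orthogonal residuals (centroid + SVD).
import numpy as np

def lsq_plane(points):
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]                      # direction of least variance
    d = -normal @ centroid               # plane: normal . x + d = 0
    return normal, d

rng = np.random.default_rng(5)
pts = np.column_stack([rng.uniform(0, 1, (100, 2)), rng.normal(0, 0.01, 100)])
n, d = lsq_plane(pts)
print(np.round(n, 3))                    # approximately (0, 0, ±1)
```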
ii) Pseudo-Outliers
Although robust estimators are capable of handling most gross outliers, the
building surface data sets are multi-structured. Therefore, the robust segmentation
algorithm must tolerate both gross outliers and pseudo-outliers. Pseudo-outliers are
defined in [107] as “outliers to the structure of interest but inliers to a different structure”.
The previously-mentioned robust estimators, such as RANSAC, are essentially designed
to handle only single structure segmentation contaminated with gross outliers. In order to
handle multi-structure segmentation, RANSAC is sometimes applied sequentially to
detect and remove the inliers of the best-fitted planes from the data set (the fit-remove
strategy).
For instance, Hesami et al. [108] proposed a hierarchical approach to segment coarse-to-fine plane segments from complex buildings. The method starts by specifying the number of hierarchical levels for segmentation, and a user-defined input is computed for the robust estimator at each level. The user-defined threshold works as a fine-tuning parameter and indicates the ratio of the population of the smallest region that can be regarded as a separate region to the size of the entire population. At every level of the hierarchy, robust estimation is applied to the range data to sequentially group data into segments
until the calculated scale of noise (a surface is fitted to the data of each segment using least-squares fitting, and the scale of noise is calculated for the next hierarchy level of segmentation) is larger than the scale of noise of the measurement equipment. The sequential robust estimator approach is not optimal, as will be explained in detail in Section 6.2.2.2.
Another alternative is to employ unsupervised clustering techniques for simultaneous plane fitting (see Section 6.2.3).
iii) Unknown number of structures
To determine the number of structures or planes, the RANSAC fit-remove approach is
applied until the leftover points are fewer than a user-defined threshold, where the
threshold is usually difficult to estimate. In another segmentation approach, unsupervised
clustering, information criteria can be used to determine the optimal number of structures
or planes. The criterion scores for a number of clusters from two to some pre-set
maximum are computed and compared. The best score informally locates the optimal
number of structures or planes. However, determining the maximum number of clusters
can be difficult. Furthermore, due to occlusions, the decision as to whether a plane
segment belongs to another plane segment is ambiguous. This is because a single plane
may be broken into two by occlusion. For example, in Figure 6-2, the tree trunk occludes
part of the building surface. Ambiguity arises when trying to determine the actual number
of structures through segmentation of the disconnected building surface.
Figure 6-2 Ambiguity in the number of structures: a tree causes occlusion, splitting the occluded building surface into disconnected segments.
We recognise the problems mentioned above (more details are provided in Section 6.2) and propose to apply the Infinite Gaussian Mixture Model (IGMM) to
overcome the problem of robust multi-structure segmentation. The parameters of IGMM
are estimated by Gibbs sampling, and the number of clusters is allowed to grow with the
data (after the sampling has converged, one has a distribution over the number of clusters
given the data). IGMM has been applied in bioengineering for MRI classification [109,
110] and for gene expression clustering [111], document modelling for event detection
[112] and recently in computer vision for motion segmentation [113, 114]. In motion
segmentation, Jian and Chen [114] replaced the residual function for IGMM with the
Sampson distance in unsupervised clustering. With the clustered data, each cluster is fitted
with a plane using RANSAC to remove the outliers, much like “pre-segment with RANSAC and refine with least squares”. For plane data clustering, we modified the
residual function to include the prior knowledge (that data are planar) in Section 6.3 and
verified the algorithm on synthetic and real-world outdoor planar data.
This chapter is organised as follows. Relevant background information on robust multi-structure segmentation, and the major problems in previous work, are discussed in
Section 6.2. Our proposed approach for plane data segmentation/clustering is explained in
Section 6.3. The modification to the IGMM algorithm is also described. We then test the
proposed method and compare it with existing methods using two sets of terrestrial laser
scanned 3D urban data (plane data and plane patches) in Section 6.4.
6.2 BACKGROUND OF ROBUST SEGMENTATION
Generally, segmentation methods can be divided into three approaches: region growing,
model fitting and clustering.
6.2.1 Region Growing
Region growing is the most common bottom-up segmentation method, often used as a
post-processing step to refine the initial over-segmented (for example, by model fitting)
regions. There are several types of region growing methods. The simplest method starts
with a single seed. The seed pixel/point grows by merging with the neighbouring
pixels/points with similar properties. When the region stops growing, i.e. no new
pixel/point can be added, another seed, which does not yet belong to an existing region,
will be randomly chosen. This process is repeated until all pixels/points belong to some
region. However, by growing seed points sequentially, the current region dominates the
growth process, causing ambiguities around the edges of adjacent regions, so that different choices of seeds give different segmentation results. This is a common problem for the sequential segmentation approach, which biases the segmentation in favour
of the regions segmented first. One solution is to start with multiple randomly-sampled
seed points (or over-segmented regions). A number of regions will grow simultaneously
and similar regions will gradually merge. Region growing approaches have the advantage
of exploiting neighbouring pixels that are likely to be similar, thus ensuring the
smoothness of the segmentation.
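A minimal seed-based region-growing sketch is given below; the brute-force neighbour search, fixed radius and normal-similarity threshold are illustrative simplifications of practical implementations.

```python
# Sketch of seed-based region growing on a point cloud: a region absorbs
# unlabelled neighbours within a radius whose normals match the current point.
import numpy as np

def region_grow(points, normals, radius=0.2, angle_cos=0.95):
    labels = np.full(len(points), -1)
    region = 0
    for seed in range(len(points)):
        if labels[seed] != -1:               # seed already belongs to a region
            continue
        stack = [seed]
        labels[seed] = region
        while stack:
            i = stack.pop()
            near = np.where((np.linalg.norm(points - points[i], axis=1) < radius)
                            & (labels == -1))[0]
            for j in near:
                if abs(normals[i] @ normals[j]) > angle_cos:   # similar orientation
                    labels[j] = region
                    stack.append(j)
        region += 1
    return labels

rng = np.random.default_rng(6)
pts = rng.random((300, 3))
nrm = np.tile([0.0, 0.0, 1.0], (300, 1))     # toy: all normals identical
print(np.unique(region_grow(pts, nrm)).size) # number of grown regions
```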
One of the main problems with the region growing approach lies in determining
the thresholds of the similarity criteria – i.e., how similar is deemed to be sufficiently
similar. The similarity criteria can include colour, variance, intensity, motion, size,
normals and other popular features. Other sources of problems are noise in the 2D image or occlusion in the 3D data, which typically result in over-segmentation. Furthermore, the computation and resource costs of these methods are potentially high.
6.2.2 Model Fitting
There are a number of popular fitting approaches, including the following:
6.2.2.1 Hough Transform
To fit a plane using the Hough transform, a set of points (e.g. three points for a plane) is selected and the plane coefficients given by the plane equation z = ax + by + d are computed. One possible choice of parameter space corresponds to the plane
coefficients (a,b,d). However, similar to the well-known problem in line fitting of vertical
lines in 2D, there is the problem of vertical planes (resulting in large a and b values) which
give rise to unbounded values of the plane coefficients. Therefore, the choice of
parameterization is important, as unlike airborne laser scanning, terrestrial laser scanning
involves both vertical and horizontal planes. A better choice is to choose the parameter
space that consists of the plane’s normal vector and its distance from the origin.
The points will vote on a sinusoidal surface in the Hough parameter/accumulator
space and the intersection of the sinusoidal surfaces indicates the presence of a plane.
Thus, in the Hough transform approach, each point contributes to a globally-consistent solution, i.e. the physical plane which gave rise to that point.
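The sketch below illustrates such Hough voting for planes in a (θ, φ, ρ) parameterisation (normal direction plus distance from the origin), which avoids the unbounded coefficients of z = ax + by + d for vertical planes; the bin counts and ranges are arbitrary choices.

```python
# Sketch of Hough voting for planes: each point votes, for every candidate
# normal direction (theta, phi), into the rho bin of its signed distance.
import numpy as np

def hough_planes(points, n_theta=30, n_phi=15, rho_max=2.0, n_rho=40):
    acc = np.zeros((n_theta, n_phi, n_rho), dtype=int)
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    phis = np.linspace(0, np.pi, n_phi, endpoint=False)
    for ti, t in enumerate(thetas):
        for fi, f in enumerate(phis):
            n = np.array([np.sin(f) * np.cos(t), np.sin(f) * np.sin(t), np.cos(f)])
            rho = points @ n                      # signed distance per point
            bins = ((rho + rho_max) / (2 * rho_max) * n_rho).astype(int)
            ok = (bins >= 0) & (bins < n_rho)
            np.add.at(acc[ti, fi], bins[ok], 1)   # each point casts one vote here
    return acc, thetas, phis

rng = np.random.default_rng(7)
pts = np.column_stack([rng.uniform(-1, 1, (300, 2)), np.full(300, 0.5)])  # z = 0.5
acc, thetas, phis = hough_planes(pts)
ti, fi, ri = np.unravel_index(acc.argmax(), acc.shape)
print(acc.max(), phis[fi])    # peak of ~300 votes at phi = 0 (normal ~ +z)
```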
In 1990, the Randomized Hough Transform (RHT) [115] was introduced to
address the slow performance and high memory consumption problems of the Hough
Transform, and RHT has been applied to multi-structured segmentation [106]. By randomly selecting the point sets required to compute the parameters for the accumulator space, RHT generates a smaller subset of all parameter combinations. However, RHT still
suffers from the limitation of the Hough Transform caused by the quantisation in the
accumulator space. That is, unlike other robust fitting methods that provide infinite
precision, the Hough Transform and RHT have limited precision. When the bin size is set
to be larger than the optimum size, there is a loss in precision; if set too small, most bins
will have only a single count as there will be a one-to-one mapping between pixels and
bins. Estimation of the bin size can be a difficult challenge. In some cases, for example in
fundamental matrix estimation, without the inherent knowledge of the estimate variance, it
is rather difficult to estimate the bin size in the Hough transform.
6.2.2.2 RANSAC
The RANSAC algorithm, proposed by Fischler and Bolles in 1981 [96], is one of the most widely used approaches in the computer vision community due to its robustness to
outliers. Instead of finding the location of the narrowest band containing half the data, as
in LMedS, RANSAC finds the location of the densest band of a preset width. Unlike most
statistical estimation problems, the data used in computer vision generally include a high
ratio of outliers. RANSAC randomly samples a sub-set number of points and computes
the model parameters of each sub-set. The best fitted model is decided on the estimated
parameters that contain the largest number of inliers. By counting the number of inliers
with residuals less than a preset width/threshold, the inliers are assumed to be uniformly
distributed, i.e. carrying equal weights. Since residuals are more likely to follow a
Gaussian-like distribution, Torr et al. replaced the uniform kernel in the RANSAC scoring
function with a Gaussian kernel in MLESAC [116]. Similarly, Wang and Suter proposed ASKC [117] and tested it with both
a Gaussian and an Epanechnikov kernel. The authors showed that both kernels performed
better than a uniform kernel.
The main challenges in RANSAC-based multi-structure segmentation methods
(including the extended algorithm with modified scoring function) are as follows:
1) Determining the “inliers threshold” (scale estimation)
The “inliers threshold” found in the RANSAC scoring function (Algorithm 5-1) can be
seen as the estimated variance of the underlying model. In experiments where the
estimated variance is known, the “inliers threshold” is typically set at 1.5 times the
estimated variance. Determination of the “inliers threshold” is a chicken-and-egg problem.
The inliers have to be known before we can compute the variance, yet we need to know
the variance in advance to realise which data points are the inliers. Considerable effort has
been spent addressing this issue; a simple estimator would be the median or the more
commonly used MAD estimator which is also used in MLESAC [116]. Since the “inliers
threshold” for different candidate structures can be different, Konouchine et al. [118]
extended MLESAC by proposing an approach that is capable of adaptively estimating the
variance and the outlier ratio for every iteration. However, these estimators have a 50%
breakdown point.
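As an illustration of simple scale estimation, the sketch below computes the MAD-based estimate σ̂ = 1.4826 × median(|r − median(r)|), where the constant makes the MAD consistent for Gaussian residuals; the residuals are synthetic.

```python
# Sketch of a MAD-based scale estimate for the inlier threshold; tau is then
# often taken as roughly 1.5 sigma_hat, as in the text above.
import numpy as np

def mad_scale(residuals):
    med = np.median(residuals)
    return 1.4826 * np.median(np.abs(residuals - med))  # Gaussian-consistent MAD

rng = np.random.default_rng(8)
r = np.concatenate([rng.normal(0, 0.02, 500), rng.uniform(0.5, 2.0, 100)])
print(mad_scale(r))     # close to 0.02 despite ~17% gross outliers
```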
Wang [100] conducted a review of the robust scale estimators including ALKS,
RESC and MSSE. He found that these methods are limited in handling extreme outliers.
Wang and Suter then proposed integrating a robust scale estimator, TSSE [119], which
finds the densest band and the valley to estimate the variance in the ASKC framework.
Alternatively, Fan and Pylvanainen [120] derived the scale of the inliers from the statistics
of repeated inlier data points (Ensemble Inlier Sets) that are accumulated from all the
proposed models. However, these methods determine the inliers threshold using data
solely from a single mode. Estimating the variances of different models simultaneously
should be a more accurate approach.
2) Remaining points from removed structures
As previously mentioned, RANSAC is not optimal for multi-modal segmentation [104,
121]. In order to extract more than a single structure, RANSAC is repeated sequentially
(fit-remove) – the subsequent data for further segmentation are often either contaminated
with leftover data from extracted structures (when the “inlier threshold” is smaller than the
real variance), or contaminated with data from other structures (when the “inlier threshold” is larger than the real variance). Zuliani et al. [104] performed fit-remove a number of times and, with MultiRANSAC, selected the best fit-remove run with the most inliers. The authors show better results compared to RANSAC, but the method is computationally more expensive than
the sequential RANSAC approach. In addition, the number of structures has to be
predefined.
3) Determining the outlier ratio, which decides the number of samples required to ensure an outlier-free sample
In large-scale data, the percentage of outliers becomes relatively large due to the sheer
number of structures. This makes segmentation more inefficient, as the number of sampling iterations required for the RANSAC algorithm depends on the outlier ratio.
Since data that are far away are unlikely to be part of a single structure, one solution is to
select the neighbouring points of the first randomly sampled point as the sub-set of points
with higher probability. This is known as the minimal sample sets (MSS) [104, 121],
where after randomly picking a point xi, xj has the following probability of being drawn
with a Gaussian kernel:
$P(x_j \mid x_i) = \dfrac{1}{Z} \exp\left(-\dfrac{\lVert x_j - x_i \rVert^2}{\sigma^2}\right)$

( 6-1 )
where Z is a normalization constant and σ is the estimated variance of the residuals
(“inliers threshold”).
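A small sketch of this sampling scheme is given below: after drawing x_i, the remaining points of the minimal sample set are drawn with probability proportional to exp(−‖x_j − x_i‖²/σ²), per Eq. (6-1). The point set and σ are placeholders.

```python
# Sketch of minimal-sample-set (MSS) sampling per Eq. (6-1): bias the rest of
# the sample towards points near the first randomly drawn point.
import numpy as np

def sample_mss(points, sigma=0.1, k=3, rng=np.random.default_rng(9)):
    i = rng.integers(len(points))
    d2 = np.sum((points - points[i]) ** 2, axis=1)
    p = np.exp(-d2 / sigma ** 2)            # Gaussian-kernel weights
    p[i] = 0.0                              # do not redraw x_i itself
    p /= p.sum()                            # the 1/Z normalisation
    others = rng.choice(len(points), size=k - 1, replace=False, p=p)
    return np.concatenate([[i], others])

pts = np.random.default_rng(10).random((200, 3))
print(sample_mss(pts))                      # indices of one minimal sample set
```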
With MSS in the selection of the sample sets, the RANSAC algorithm requires
fewer (random sampling) iterations to ensure at least one random sample is free from
outliers with some probability p. However, the outlier ratio is often not known a priori, or
is difficult to estimate. In sequential RANSAC, the outlier ratio changes for different (fit-
remove) iterations. Konouchine et al. [118] re-estimated the outlier ratio for every
iteration using the Expectation Maximization algorithm. In practice, without knowledge of
the outlier ratio, most authors determine the number of iterations by setting a relatively
large number [104, 122].
4) Determining the stopping criteria for fit-remove sequential RANSAC
In most problems, we are often unsure of the number of structures in the data. In the fit-
remove strategy, model fitting is repeated until the leftover points are below a threshold.
However, it is often difficult to estimate a good threshold. In contrast, the mixture
modelling approach typically stops when the maximum number of iterations is reached or
upon saturation, i.e., when the objective function improvement between two consecutive
iterations is less than the minimum amount of improvement specified. In [118],
Konouchine et al. stopped the iteration when the probability of the selected outlier-free
sample reached 95%-97%. There is no proof that this solution will always work unless the
data set is relatively clean, i.e. contains only data that can be fitted with the assumed
model.
5) Determining the number of models for sequential multi-modal RANSAC methods
With the appropriate stopping criteria, it is possible to realise the number of models.
Another approach is to not rely on the stopping criteria and discover the number of models
during multi-structure segmentation. For instance, Toldo and Fusiello [121] proposed
using agglomerative clustering (see next section for more details of clustering) with the
Jaccard distance to cluster randomly sampled hypotheses. The method starts with random
sampling and then proceeds by linking points whose Jaccard distance is smaller than 1, and
stops when no such points are left. However, clustering in the random sampled
hypothesis space has the same shortcoming as general clustering methods – the number of
clusters can often be difficult to estimate.
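To make the linking criterion concrete, the sketch below (illustrative, not Toldo and Fusiello's implementation) computes the Jaccard distance between two points' preference sets, i.e. the sets of random hypotheses to which each point is an inlier; a distance of 1 means the sets are disjoint, which is why the linking stops at that value:

def jaccard_distance(pref_a, pref_b):
    # Jaccard distance between two preference sets (sets of the
    # hypothesis indices that a point is an inlier to).
    a, b = set(pref_a), set(pref_b)
    union = a | b
    if not union:
        return 1.0          # no shared hypotheses at all
    return 1.0 - len(a & b) / len(union)

# Points agreeing on most hypotheses are close; disjoint sets give 1.
print(jaccard_distance({0, 1, 2, 5}, {1, 2, 5, 7}))  # 0.4
print(jaccard_distance({0, 1}, {2, 3}))              # 1.0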
6.2.3 Clustering
Clustering provides an alternative approach to segmenting the data simultaneously, similar
to region growing approaches with multiple seeds. Clustering has the advantage of fitting
multiple models into a global fitting model that consists of several instances of the same
model, each corresponding to a different set of parameters, unlike model fitting where
inliers of other structures are treated as outliers (pseudo outliers).
When clustering data, the actual number of clusters is unknown. Determining the
number of clusters, which is the fundamental problem of cluster validity, is one of the
main problems in clustering methods. Several researchers have tried to address this
classical open problem, including Moreau [123], Dubes [124], Fraley [125] and Still [126].
Several solutions have been proposed; most rely on information criteria or, more
recently, the Dirichlet clustering process [127].
The following outlines the general methods for clustering and the approach to
assess cluster validity:
a. Hierarchical methods
One of the clustering methods is hierarchical agglomerative clustering: starting with as
many cluster centres as there are data, the closest centres are merged as the level proceeds
upwards to construct the tree. Alternatively, hierarchical divisive methods start by
grouping all data together in a single cluster and then splitting the cluster as it proceeds to
a higher level. In order to handle data with outliers, Frigui and Krishnapuram proposed the
Robust Competitive Agglomeration (RCA) algorithm [128]. The algorithm reduces the
effect of outliers by clustering the data into a large number of small clusters with fuzzy
clustering. The small clusters are then merged with a competitive agglomeration process. To
determine the number of clusters, the hierarchical tree is cut at a visually appealing level.
However, the assessment of “visually appealing” is intrinsically difficult.
In previous work for outdoor airborne laser-scanned data, Haala and Brenner [25]
utilised ISODATA, which is a combination of the agglomerative and divisive clustering
methods to cluster data into planar, vegetation and terrain data. The optimal number of
spectral clusters is determined based on the minimum distance criterion. The clusters are
split and merged in the feature space (normalised height and multi-spectral information
from colour-infrared aerial images) throughout the iterations. The drawback of this
method is that a number of parameters or thresholds have to be initialised by the operator,
which can be difficult to estimate.
b. Objective function approaches:
K-means
In contrast to the hierarchical methods, k-means clustering first decides on an objective
function that determines how well the clusters fit the data. The objective function J, which
is a squared error function (the total intra-cluster variance, Eq. ( 6-2 )), is minimised to
determine the centroids c of the clusters, where x is a 1×N feature matrix [129]:

J = \sum_{i=1}^{k} \sum_{j=1}^{N} (x_j - c_i)^T (x_j - c_i)    ( 6-2 )
The number of clusters can be determined by starting from a small k and increasing
the value of k until the convergence error is acceptable. However, it is possible to have a
configuration of the clusters that have converged but does not have the minimum
distortion or has over-fitted. One solution to this is to introduce a criterion that includes a
distortion (which is the objective function) and a penalty function that increases with the
number of parameters, such as the Schwartz Bayesian information criterion (BIC) [130] in
the following equation:
BIC = \sum_{i=1}^{k} \sum_{j=1}^{N} (x_j - c_i)^T (x_j - c_i) + \lambda\, m k \log N    ( 6-3 )
where m is the number of dimensions; k is the number of centres; N is the number of
feature data; λ = 0.5 or can be obtained empirically. Other criteria include the Akaike
information criterion (AIC) [129] (which tends to overfit the model as it does not penalise
the number of parameters as strongly as BIC), Minimum Message Length (MML), and
Minimum Description Length (MDL), hypothesis-testing and cross-validation.
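A sketch of this model-selection recipe (Python/NumPy; a simple Lloyd's k-means written for illustration, not the thesis code), using the penalised criterion of Eq. ( 6-3 ) with λ = 0.5 as suggested above:

import numpy as np

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]          # initial centroids
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C) ** 2).sum(-1), axis=1)
        C = np.array([X[labels == i].mean(0) if np.any(labels == i) else C[i]
                      for i in range(k)])
    labels = np.argmin(((X[:, None] - C) ** 2).sum(-1), axis=1)
    return ((X - C[labels]) ** 2).sum()                  # distortion J, Eq. (6-2)

def bic(J, k, m, N, lam=0.5):
    return J + lam * m * k * np.log(N)                   # Eq. (6-3)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in (0.0, 3.0, 6.0)])
scores = {k: bic(kmeans(X, k), k, X.shape[1], len(X)) for k in range(1, 7)}
print(min(scores, key=scores.get))   # the true k = 3 is typically selected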
Fuzzy Clustering
K-means clustering divides the data into crisp clusters where each data point belongs only
to exactly one cluster. In fuzzy clustering, each data point can belong to more than a single
cluster. A membership function is estimated for every point to indicate the degree of the
points belonging to different clusters. Introduced in 1981, the Fuzzy C-Means (FCM)
algorithm [131] is probably the most widely used fuzzy clustering algorithm.
Biosca and Lerma [132] recently implemented the Possibilistic C-Means (PCM),
which is a modification of the FCM algorithm that casts the clustering problem into the
possibility theory framework. PCM20 uses a more
complicated objective function (which includes a fuzzy partition matrix U = [u_ij] of
dimension k×n). It has been successfully applied to terrestrial scanned data and is therefore
chosen to verify our proposed solution (described in Section 6.3). The optimal number of
clusters is determined by minimizing the cost function, J. The process starts from k=2 and
stops when convergence is reached (i.e. |cost func(k-1) - cost func(k)| < ε):
J = \sum_{i=1}^{k} \sum_{j=1}^{N} u_{ij}^{m} (x_j - c_i)^T (x_j - c_i) + \sum_{i=1}^{k} \sum_{j=1}^{N} \eta_i (1 - u_{ij})^m    ( 6-4 )
where the membership function uij and prototype cluster ci are defined as follows:
20 http://mehr.sharif.edu/~amiri/download/Y_FCMC/Y_FCMC_Ver.1.0.zip
u_{ij} = \frac{1}{1 + \left( \frac{(x_j - c_i)^T (x_j - c_i)}{\eta_i} \right)^{1/(m-1)}}    ( 6-5 )
c_i = \frac{\sum_{j=1}^{N} u_{ij}^{m} x_j}{\sum_{j=1}^{N} u_{ij}^{m}}    ( 6-6 )
The initial values of the prototype clusters and membership function are computed
by applying the Fuzzy C-Means algorithm to the data.
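A compact sketch (Python/NumPy, illustrative only) of one PCM iteration, applying the membership update of Eq. ( 6-5 ) and the prototype update of Eq. ( 6-6 ):

import numpy as np

def pcm_step(X, C, eta, m=2.0):
    # Squared distances d2[i, j] between prototype i and point j.
    d2 = ((C[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    U = 1.0 / (1.0 + (d2 / eta[:, None]) ** (1.0 / (m - 1.0)))   # Eq. (6-5)
    Um = U ** m
    C_new = (Um @ X) / Um.sum(axis=1, keepdims=True)             # Eq. (6-6)
    return U, C_new

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 0.2, (50, 2)), rng.normal(4.0, 0.2, (50, 2))])
C = np.array([[0.5, 0.5], [3.5, 3.5]])
eta = np.array([1.0, 1.0])       # the eta scale parameters of Eq. (6-5)
for _ in range(10):
    U, C = pcm_step(X, C, eta)
print(C.round(2))                # prototypes converge onto the two clusters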
c. Parametric likelihood-based or model-based approaches:
The parametric likelihood-based methods resolve the clustering problem by estimating the
number of sources from the measurements. Among these model-based methods, an
example is the probabilistic mixture model which attempts to explain the data by
estimating the distribution and the density of the clusters. The Gaussian Mixture Model is
perhaps the most common method. In addition to computing the mean of the clusters in k-
means, it also computes the covariance and the prior probability of the clusters. The
parameters can be estimated using the expectation-maximization algorithm and the
number of clusters (or mixture models) can be chosen by maximising BIC [129] for k = 2
to kmax:
BIC = -L(\Theta; D) + \frac{p}{2} \log N    ( 6-7 )
L(x; \Theta) = \prod_{j=1}^{N} \sum_{i=1}^{k} \alpha_i \frac{1}{(2\pi)^{d/2} \det(\Sigma_i)^{1/2}} \exp\!\left\{ -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right\}    ( 6-8 )
N is the number of data and p is the number of free parameters; Θ = (α_1, …, α_k, θ_1, …, θ_k)
with θ_i = (µ_i, Σ_i), where µ_i is the mean of the cluster, Σ_i is the covariance and α_i is the
prior probability. The aforementioned criteria, such as AIC, can also be applied instead of the
BIC criterion.
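With a library implementation, this selection procedure reduces to a few lines; the sketch below uses scikit-learn's GaussianMixture purely as a convenience for the illustration (an assumption of this sketch, not the software used in the thesis):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(c, 0.3, size=(150, 2)) for c in (0.0, 3.0)])

# Fit GMMs by EM for k = 2 .. k_max and keep the BIC-optimal one.
best_k = min(range(2, 8), key=lambda k:
             GaussianMixture(n_components=k, random_state=0).fit(X).bic(X))
print(best_k)   # the two-component model is typically selected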
The above-mentioned clustering methods often cluster in a feature space that
frequently lacks spatial coherence. This can cause discontinuity (for example, in a surface-
normals feature space, two data of similar surface normals that are spatially far apart may
be grouped together as a single cluster) in clustered data in the neighbourhood of the data.
This is illustrated in the results of Set A in Section 6.4. To avoid this, we require a model-
fit clustering method which clusters solely in the geometry space (xyz).
In practice, the constraint of only grouping locally-close data can be enforced by
partitioning the large scale data into smaller voxels, similar to the work of Hansen et al.
[133]. However, the fitted plane patches are still discontinuous and require merging or
grouping into locally-delimited planes. Grouping plane patches is necessary for
visualisation purposes and to reduce the amount of data required to represent the same
object.
d. Non-parametric likelihood-based approach:
Infinite Gaussian Mixture Modelling (IGMM) is a non-parametric Bayesian modelling
approach for data clustering (i.e. the number of clusters does not need to be known in
advance), proposed by Rasmussen [134]. With the actual number of underlying models
unknown, instead of relying on a criterion, IGMM (which is a Bayesian approach to
mixture modelling) clusters the data in a statistically-principled manner.
The parameters of IGMM are estimated by Gibbs sampling and the number of
clusters is allowed to grow with the data. After the sampling has converged, one has a
distribution over the number of clusters given the data. The IGMM clustering process
avoids labels by constructing an “exchangeable cluster process” that consists of an infinite
sequence of points in R^d, with a random partition of the integers into k blocks, i.e. k
clusters [134]. Therefore, the value of k is estimated in the clustering process. More details are
provided in the following section.
6.3 INFINITE GAUSSIAN MIXTURE MODEL
As described above, instead of specifying a prior number of mixtures or a maximum
number of clusters, the IGMM determines the number of mixtures by Gibbs random
sampling. A number of samples are generated from the Gaussian distribution for the
estimation of parameters for different mixture components. Starting with one component
or mixture, the parameters and hyper-parameters are defined as in the description below
and are updated during the Gibbs sweeps.
The likelihood function for the finite Gaussian Mixture Model in Eq. ( 6-7 ) can be
written:

P(x \mid \Theta) = \sum_{i=1}^{k} \alpha_i\, G(\mu_i, \Sigma_i)    ( 6-9 )
The parameters in the likelihood function for the finite Gaussian Mixture Model in
Eq. ( 6-7 ) can be estimated using the expectation-maximization (EM) algorithm. The EM
algorithm is an iterative approach that estimates the parameters using two steps,
an expectation step and a maximization step [135]. Alternatively, the mixture model
parameters can be estimated using posterior sampling as indicated by Bayes' theorem as
shown in eq. ( 6-10 ), where the posterior distribution is proportional to the product of the
prior probability measure and the likelihood function.
P(\Theta \mid x) = \frac{P(\Theta)\, P(x \mid \Theta)}{P(x)} \propto P(\Theta)\, P(x \mid \Theta)    ( 6-10 )
where P(x) is the normalisation constant.
Unfortunately, the inference of the joint posterior probability is analytically
intractable, i.e. computations on the model cannot be done using exact mathematical
formulae. The computations have to be performed by means of computer simulations: a
two-step iterative procedure known as Gibbs sampling can be used. Gibbs sampling is
applicable when the joint distribution is not known explicitly, but the conditional
distribution of each variable is known. Gibbs sampling is used to update each variable, i.e.
{α, µ, Σ}, in turn from its conditional distribution given all other variables in the model.

P(\alpha, \mu, \Sigma \mid x) \propto P(\alpha, \mu, \Sigma)\, P(x \mid \alpha, \mu, \Sigma)    ( 6-11 )
For a finite number of mixtures, the likelihood of xn associated with mixture i in Eq ( 6-11 )
is then:
P(x_n \mid c_n = i, \mu_i, \Sigma_i) = G(\mu_i, \Sigma_i^{-1}) \propto \Sigma_i^{1/2} \exp\!\left( -\frac{\Sigma_i (x_n - \mu_i)^2}{2} \right)    ( 6-12 )
To decide whether to assign the sample to the existing mixture or a new mixture,
the conditional posteriors for the stochastic indicator variables, c = \{c_1, \dots, c_N\}, one for
each observation, are introduced to encode which class has generated the observation. The
indicators carry the value of the class to which the data point belongs, i.e. 1 to k and are
updated in the two-step iterations. The approach is based on the formulation described in
Rasmussen [134].
The conditional posterior distributions for the priors on component parameters and
the hyper-parameters are derived in Appendix III. The conditional prior for a single
indicator given all the others is:
P(c_j = i \mid c_{-j}, \alpha) = \begin{cases} \dfrac{N_{-j,i}}{N - 1 + \alpha} & \text{if } i \text{ is represented} \\ \dfrac{\alpha}{N - 1 + \alpha} & \text{if } i \text{ is unrepresented} \end{cases}    ( 6-13 )
where −j indicates all indexes except j and N_{−j,i} indicates the number of observations,
excluding x_j, that belong to mixture i.
Finally, Gibbs sampling can then be used to calculate the conditional posterior for
each cn:
P(c_n = i \mid c_{-n}, \mu_i, \Sigma_i, \alpha) \propto \begin{cases} \dfrac{N_{-n,i}}{N - 1 + \alpha}\, \Sigma_i^{1/2} \exp\!\left( -\dfrac{\Sigma_i (x_n - \mu_i)^2}{2} \right) & \text{if } i \text{ is represented} \\ \dfrac{\alpha}{N - 1 + \alpha} \displaystyle\int P(x_n \mid \mu_i, \Sigma_i)\, P(\mu_i \mid \lambda, \gamma)\, P(\Sigma_i \mid \beta, w)\, d\mu_i\, d\Sigma_i & \text{if } i \text{ is unrepresented} \end{cases}    ( 6-14 )
For the represented mixtures, the conditional posterior is given by the product of the prior
and the likelihood: it is based on \mu_i \sim P(\mu_i \mid c, x_j, \Sigma_i, \lambda, \gamma) and
\Sigma_i \sim P(\Sigma_i \mid c, x_j, \mu_i, \beta, w), as derived in Eqs. ( 10-2 ) and ( 10-8 ).
Due to the absence of the training data, the conditional posterior of the unrepresented
mixture has to be determined by the priors: it is based on \mu_i \sim P(\mu_i \mid \lambda, \gamma)
and \Sigma_i \sim P(\Sigma_i \mid \beta, w), as derived in Eqs. ( 10-1 ) and ( 10-7 ).
Modified IGMM
For clustering of the classified planar data or planar patches, we modified the
Euclidean distance function (x_i - \mu_i)^2 in the Gaussian mixture as follows: the spatial
observations (the coordinates of the range data) that belong to cluster j are fitted with a
plane via the orthogonal distance regression method using the SVD method, as explained
in Section 3.3.2. The plane coefficient for the best fitted plane is the eigenvector that
corresponds to the smallest eigenvalue. With the plane coefficients \{a, b, c, d\} for the j
mixtures, we can then replace the distance measure between the data and the cluster centre
with the distance measure between the data and the best-fitted plane. The distance function
for x = \{x_x, x_y, x_z\} is then the shortest point-to-plane distance, distance_{xj}, given as
follows:

t = -\frac{d + a x_x + b x_y + c x_z}{a^2 + b^2 + c^2}    ( 6-15 )

distance_{xj} = [a t;\; b t;\; c t]    ( 6-16 )
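A sketch of this modified residual (Python/NumPy; illustrative, not the thesis code): the plane is fitted to a cluster's points by SVD, and the orthogonal distance of Eq. ( 6-15 ) then replaces the Euclidean distance to the cluster centre:

import numpy as np

def fit_plane(P):
    # Orthogonal-distance plane fit: the normal is the right singular
    # vector of the centred data with the smallest singular value.
    c = P.mean(axis=0)
    _, _, Vt = np.linalg.svd(P - c)
    a, b, cc = Vt[-1]                     # unit normal (a, b, c)
    d = -np.dot(Vt[-1], c)                # plane: a x + b y + c z + d = 0
    return a, b, cc, d

def point_plane_distance(x, plane):
    a, b, c, d = plane
    # Magnitude of t in Eq. (6-15); the normal is unit-length here,
    # so a^2 + b^2 + c^2 = 1.
    return abs(a * x[0] + b * x[1] + c * x[2] + d) / np.sqrt(a*a + b*b + c*c)

rng = np.random.default_rng(6)
P = np.column_stack([rng.uniform(0, 5, 200), rng.uniform(0, 5, 200),
                     rng.normal(0, 0.01, 200)])   # noisy z = 0 plane
plane = fit_plane(P)
print(point_plane_distance(np.array([1.0, 2.0, 0.5]), plane))  # ~0.5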
With the Gibbs sampler that produces analytic conditional distributions for
sampling and the conditional posteriors defined, the two-step iterative algorithm
(Algorithm 6-1) can be employed to generate a sequence of samples in order to
approximate the conditional joint distribution. Geman and Geman [136] showed that,
under mild conditions, and after a large number of iterations, the sampler converges to a
set of samples from the joint posterior distribution.
Initialize c, k, µ, Σ, β, γ, π, w, α
For j = 2 : no. of iterations
    Two-step iterative algorithm:
    Step 1: Sample Normal distribution means and covariances given a current
    assignment of data to classes
        For i = 1 : k_rep
            Update N_i, the number of data points belonging to mixture i
            Update mixing weights: π_i = N_i / (N + α)
        End
        For i = 1 : N_j
            Sample µ_i ~ P(µ_i | λ, γ) for unrepresented mixture
            Sample Σ_i ~ P(Σ_i | β, w) for unrepresented mixture
        End
        For i = 1 : k_rep
            Sample µ_i ~ P(µ_i | c, x_j, Σ_i, λ, γ) for represented mixture
            Sample Σ_i ~ P(Σ_i | c, x_j, µ_i, β, w) for represented mixture
        End
        Update hyper-parameters:
            Sample λ ~ P(λ | µ, γ)
            Sample γ ~ P(γ | µ, λ)
            Sample w ~ P(w | Σ, β)
            Sample β ~ P(β | Σ, w)
            Sample α ~ P(α | k_rep, N)
    Step 2: Sample the assignment of data to classes given current values for the
    means and covariances (CRP)
        For n = 1 : N_j
            Sample the indicator c_nj for iteration j according to Eq. ( 6-14 )
        End
        Update k_rep, the number of represented mixtures
End
Algorithm 6-1 IGMM algorithm
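To illustrate Step 2, a minimal one-dimensional sketch (Python; not the thesis implementation) of sampling a single indicator from Eq. ( 6-14 ) is given below. Σ_i is treated as a precision, and the unrepresented-class integral is approximated by Monte Carlo averaging over draws from standard-normal and unit-Gamma priors, an assumption made purely for the sketch:

import numpy as np

def sample_indicator(x_n, counts, mus, precs, alpha, rng, prior_draws=50):
    N = counts.sum() + 1                          # total data including x_n
    lik = np.sqrt(precs) * np.exp(-0.5 * precs * (x_n - mus) ** 2)
    p = counts / (N - 1 + alpha) * lik            # represented classes
    mu0 = rng.normal(0.0, 1.0, prior_draws)       # assumed prior draws
    s0 = rng.gamma(1.0, 1.0, prior_draws)
    lik0 = np.mean(np.sqrt(s0) * np.exp(-0.5 * s0 * (x_n - mu0) ** 2))
    p = np.append(p, alpha / (N - 1 + alpha) * lik0)   # unrepresented class
    p /= p.sum()
    return rng.choice(len(p), p=p)   # index len(p)-1 opens a new class

rng = np.random.default_rng(7)
print(sample_indicator(0.1, np.array([40, 10]), np.array([0.0, 5.0]),
                       np.array([4.0, 4.0]), alpha=1.0, rng=rng))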
6.4 RESULTS
We evaluated our proposed modified IGMM algorithm for clustering of planar data or
patches on two sets of acquired outdoor data obtained from the terrestrial laser scanner.
On Dataset A, we evaluated our modified IGMM algorithm and compared its
performance with the IGMM and PCM algorithms on 8,416 labelled planar data, as shown in
Figure 6-3. The feature used is the spatial coordinates of the data.
PCM can be used to segment planar data by clustering in the feature space
(estimated normals and distance from the origin of the data). The estimated normals (n1, n2,
n3) of the plane data can be obtained by least square fitting k neighbouring points, where
the value k is computed adaptively depending on the estimated curvature, density and
noise of the surrounding data [72]. The distance from origin di of the data can then be
computed with the following equation [132]:
d_i = -n_1 x_i - n_2 y_i - n_3 z_i    ( 6-17 )
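A sketch (Python/NumPy, illustrative) of building these two PCM features: the normal is estimated by a least-squares plane fit over k neighbours, and the distance-from-origin feature follows Eq. ( 6-17 ); a fixed k is used here rather than the adaptive choice of [72]:

import numpy as np

def normal_and_distance(points, idx, k=20):
    p = points[idx]
    nn = np.argsort(((points - p) ** 2).sum(axis=1))[:k]   # k nearest points
    Q = points[nn]
    _, _, Vt = np.linalg.svd(Q - Q.mean(axis=0))
    n = Vt[-1]                     # right singular vector, smallest sigma
    d = -np.dot(n, p)              # distance-from-origin feature, Eq. (6-17)
    return n, d

rng = np.random.default_rng(8)
pts = np.column_stack([rng.uniform(0, 5, 500), rng.uniform(0, 5, 500),
                       1.0 + rng.normal(0.0, 0.01, 500)])   # noisy plane z = 1
n, d = normal_and_distance(pts, idx=0)
print(n.round(2), round(float(d), 2))   # normal ~ (0, 0, +-1), |d| ~ 1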
Dataset A
As shown in Figure 6-3, both IGMM and modified IGMM successfully grouped the major
planes. The modified IGMM segmented the major vertical plane into a single group. This
demonstrates that the modified IGMM is able to overcome missing data caused by occlusion
and the transparent glass door. In comparison, IGMM clustered together some of the data that
belonged to different planes (but were relatively close to each other), despite the data having
very different plane normals. The representation of the original data can be reduced by up to
99% (8,416 points versus 19 planes).
Figure 6-3 (Set A: plane data). a, b, c show the "bird's-eye" view and d, e, f show the isometric view of the clustered extracted plane data; processed with a, d: Modified IGMM; b, e: IGMM; and c, f: PCM.
As mentioned in Section 6.2.3 for PCM, the features to be fitted include the
normalised estimated plane normals and the distance d from the plane to the origin. The η
scale parameter in Eq. ( 6-5 ) was determined empirically and was set to 2 in the experiment.
The result shows the drawback of clustering in feature space – discontinuities in the
neighbourhood of the clusters are unavoidable. This can only be remedied by applying region
growing as a post-processing step.
Dataset B
In Dataset B, 1,187,563 data belonging to building surfaces were extracted from
7,086,588 data (a large model consisting of 7 registered scans). The point clouds were
partitioned into (400x400) 160,000 divisions for data classification and RANSAC plane
fitting was performed, as shown in Chapter 7. Using the aforementioned pre-processing
steps, the building data were reduced to 22,618 plane patches, as shown in Figure 6-4.
Figure 6-4a,c shows the centres of the plane patches clustered using the modified
IGMM algorithm, which resulted in 59 clusters. In comparison, 30 clusters were computed
with the IGMM algorithm as shown in Figure 6-4b,d. The modified IGMM is capable of
detecting most of the planar structure in the smaller buildings and segments the data into
different planar groups. One of the main differences between the behaviour of IGMM and
modified IGMM is shown in the polygon that surrounds the cylinder. The modified
IGMM managed to segment the data into different side groups, while IGMM grouped
more than one side in a single cluster.
To compare the performance of PCM with IGMM and the modified IGMM, the
plane patches in Dataset B are also clustered with PCM, which results in far too many
discontinuities. This is most likely due to the large variance in the distance from origin d.
Also, as stated in [132], the chance of a discontinuity occurring inside a neighbourhood
increases with its size. Dataset B contains a complex configuration with relatively fewer
data, making it more difficult to cluster in feature space (like PCM).
Figure 6-4 (Set B: plane patches). a, b show the "bird's-eye" view and c, d show the isometric view of the clustered extracted plane data; processed with a, c: Modified IGMM and b, d: IGMM.

6.5 CONCLUSION

In this chapter, the Infinite Gaussian Mixture Modelling (IGMM) method is shown to be
robust for the clustering of plane data and plane patches. Unlike traditional clustering-
based methods, the number of clusters (or the maximum number of clusters) does not have
to be pre-defined. Compared to RANSAC-based methods, the Gibbs sampling employed
for the IGMM fits multiple clusters simultaneously instead of sequentially. The IGMM
does not suffer from the "fit-remove strategy" issues explained in Section 6.2.2.2.
Furthermore, the modified Gaussian residual function employed in IGMM improves the
clustering accuracy for planar data and is capable of handling missing data caused by
occlusion or transparent objects. Both IGMM and modified IGMM are capable of
segmenting locally-delimited planes, even when working on data separated into different
groups. The algorithm can be extended to cluster non-planar objects, such as cylindrical or
cone objects, which are common in outdoor environments.
The drawback with clustering methods, or a simultaneous multi-model fitting that
fits every point, is that outliers or gross noise (e.g. uniform random noise when a Gaussian
noise distribution is assumed) are difficult to deal with. The method is also not robust to
data with unexpected structures that cannot be fitted with a pre-defined model. Therefore,
the terrestrial urban data have to be pre-processed to remove the non-planar data (e.g.
vegetation).
For further work, the clustered planes can be further modelled with Superquadric
[137], which requires only a small set of parameters to model a large variety of different
basic shapes. For visualisation purposes, texture maps can be created from the calibrated
colour camera images to provide a more realistic urban model.
7.0 CONCLUSIONS
7.1 CONCLUDING REMARKS
This dissertation has presented a series of algorithms for the generation of urban
models from terrestrial LIDAR data. The contribution of the work presented in this
dissertation is mainly in the emphasis on classifying over-segmented LIDAR data instead
of every single point. With the selected and improved feature descriptors for the over-
segmented LIDAR data and the multi-scale learning model, the proposed approach has
been shown to be an effective and accurate method for 3D outdoor LIDAR classification.
Specifically, this thesis has made the following contributions:
• The image occlusion algorithm proposed by Herley [13] is extended to detect
overlapped occlusions. The existing algorithm assumes that the occlusions are all
independent objects; one connected occluded region cannot consist of occlusions from
different images. However, this assumption is often violated in outdoor environments
due to the large number of pedestrian overlaps. By understanding that a large
difference in the discontinuity measure at the boundary of the occlusions is most
likely to indicate an occlusion, the proposal is to analyse the discontinuity measure of
the boundary where occlusions occur, to separate any overlapped occlusions. The
extended algorithm [138] described in Chapter 2 has been evaluated on indoor and
outdoor panoramic image sets and shows promising results.
• A novel method is proposed that robustly and efficiently classifies outdoor terrestrial
LIDAR data into different classes. 3D data labelling in previous work was mostly
point-based. This point-based approach introduces redundant computation – labelling
every point results in a high computational load which can be reduced by classifying a
smaller sub-set of the data. A 3D over-segmentation algorithm, super-voxel, is
introduced in Chapter 4 for this purpose. This proposed method, based on 3D scale
theory, groups regions which are homogeneous with respect to colour and geometry
similarities. This method is shown to efficiently reduce the outdoor terrestrial LIDAR
data for classification.
• An overview and comparison of feature descriptors for the classification of the
LIDAR data is provided in Chapter 3. Feature descriptors are computed for the super-
voxels. One of the feature descriptors, the saliency feature, is improved to be invariant
to the adaptive size of the super-voxel that is used to compute the descriptors. An
efficient classification method can then be employed to label the super-voxels into
different data types.
• A hierarchical graphical model, the multi-scale Conditional Random Fields (mCRF),
is proposed to learn the parameters of the extracted local and regional features, as
depicted in Chapter 5. The comparison of the classification results shows improvement
in accuracy with the 3D over-segmentation and mCRF. This is an extension to work
first published in [73] and more recently in [139].
• A robust estimation method is described in Chapter 6 that successfully addresses a
number of the issues associated with the segmentation of complex data with unknown
numbers of structures. The residual function of the Infinite Gaussian Mixture Model
(IGMM) is modified for clustering of labelled data belonging to planar surfaces into
locally-delimited planes. The modified algorithm is evaluated on the labelled planar
data. The result of the proposed method shows the robustness of the algorithm for the
clustering problem. The proposed method is also shown to be capable of handling
missing data caused by occlusion or transparent objects.
There are several limitations of the approaches presented in this thesis. To address
the existing limitations and also as part of future work, there are numerous interesting
research directions, including the following:
• Computing super-voxels for 3D data classification in aerial urban modelling and
indoor objects: The problem of redundancy in point-based modelling is not limited
to outdoor terrestrial LIDAR data classification. Further investigation of the
applicability of super-voxels in classification of 3D data of other kinds can be
conducted. In addition, to make the proposed approaches more robust and widely
applicable, it is necessary to test a wider variety of data in future work.
• The results of the object classification provided are promising, but they still offer
room for some improvement. Further investigation of the integration of this work
in 3D classification with 2D image classification can be conducted. Working in 3D
has the advantage of capturing depth and curvature features, and overcoming
limitations due to viewpoint and lighting variations. On the other hand, working
with 2D images is capable of acquiring high resolution in a short time and can
provide better edges in the segmentation. Classification or segmentation based
simultaneously on the 2D and the 3D data can increase the overall accuracy of the
classification results.
• As explained in Chapter 6, one of the challenges of the RANSAC sequential “fit-
remove” strategy for multi-structure robust segmentation is in determining the
“inliers threshold”. The “inliers threshold” for different structures might not be
optimised if fixed to the same value. One interesting approach to these threshold
estimation issues is to explore the possibility of using the estimated variances of
the mixture models, computed with the Infinite Gaussian Mixture Model, to
determine the “inliers thresholds”.
• Finally, for a complete solution to a final urban model, the clustered planes in
Chapter 6 can be further modelled with Superquadric [137], which requires only a
small set of parameters to model a large variety of different basic shapes. In
addition, for visualisation purposes, texture maps can be created from the
calibrated colour camera images to provide a more realistic urban model.
In short, this thesis has proposed several improvements to urban modelling
methodology. However, like all studies, the investigation has been limited by time and
resource constraints. The author hopes that this dissertation will be informative and useful
to researchers and stimulate further research in the area.
8.0 APPENDIX I: DATA ACQUISITION
The other data acquisition technology, InSAR, and several existing scanning
strategies using laser scanners are elaborated in this section:
Sensor Technology
Besides LIDAR, Interferometric Synthetic Aperture Radar (InSAR) is another recently
developed class of active sensor. Using satellites or aircraft, InSAR acquires the
measurements by computing the differences in the returned phase of the waves generated
by two or more synthetic aperture radars (SARs). The technology can measure deformation
changes at centimetre scale over time spans of days to years. However, its vertical accuracy
for airborne data acquisition was worse than 1 m in
Haithcoat et al.'s experiment [140], whereas LIDAR can achieve up to a few centimetres
of accuracy. According to Norheim et al. [141], the LIDAR DEM (Digital Elevation
Model) has less bias (~30 cm) and less variance (~90 cm) than the InSAR DEM (bias ~90
cm, variance ~3.5 m).
The advantage of InSAR over LIDAR is that it flies higher and faster than most
LIDAR systems, and the radar signal is able to penetrate fog and rain. InSAR also has a
lower per-area cost, and is capable of more rapid surveys of large areas and faster post-
survey processing. Some authors [142] suggest combining both sensors as they
complement each other. However, the use of InSAR is difficult for ground-based data
acquisition.
Scanning Strategy
For ground-based data acquisition, 2D line laser scanners and 3D laser scanners mounted
on a mobile robot or vehicle, or held by hand, can be used. The acquisition can be done in a
stop-and-go fashion [46, 143, 144] or in a continuous fashion [57, 144, 145]. The stop-
and-go approach measures the environment at several fixed locations, while the
continuous approach involves a moving sensor platform, such as a car or mobile cart.
Localisation of the scanning positions is generally done with GPS (Global Positioning
System) or by using aerial photos as a global map. The drawback of GPS as the source of
global position estimates in outdoor environments is that it tends to fail when operating in
dense urban environments, particularly in urban canyons where only a few satellites are
visible. An accurate differential GPS system that can fulfil the registration requirement is
expensive. The system also often includes an Inertial Navigation System that uses
computer and motion sensors to continuously track the position, orientation and velocity
of the scanner.
Stamos and Allen [143] acquired data on the environment in a stop-and-go fashion.
The authors used a Cyrax range scanner which has centimetre level accuracy up to 100m.
The authors scanned multiple range scans from several scan positions and integrated the
scans with images acquired from a camera. In another data acquisition example which
uses the continuous scanning method, Fruh and Zakhor [145] acquired 3D data with a
configuration of two 2D laser scanners and a camera. The approach uses aerial images to
precisely reconstruct the path of the acquisition vehicle in offline computations. One
scanner is mounted vertically to capture building facades, and the other is mounted
horizontally. Successive horizontal scans are matched with each other in order to
determine an estimate of the vehicle's motion, and relative motion estimates are
concatenated to form an initial path. Small errors in the initial path which accumulate to
become large over time can be eliminated by using Monte-Carlo-Localization with an
airborne map to correct global pose. By combining stop-and-go and continuous scanning
of the laser rangefinder, Asai et al. [146] first scanned the environment using the stop-and-
go approach to obtain a more complete scan of the environment. The non-measured
portions were then covered by the continuous scanning approach. The system includes a
terrestrial laser range finder, a van, a GPS and an INS sensor.
9.0 APPENDIX II: SINGULAR VALUE DECOMPOSITION
Singular Value Decomposition (SVD) transforms an m × n matrix A of rank r to
diagonal form using unitary matrices:

A = U \Sigma V^* = [u_1 \dots u_r] \, \mathrm{diag}(\sigma_1, \dots, \sigma_r) \, [v_1 \dots v_r]^*    (9-1)

where the diagonal matrix \Sigma = \mathrm{diag}\{\sigma_1, \sigma_2, \dots, \sigma_r\}, with \sigma_1 > \sigma_2 > \dots > \sigma_r, contains
the singular values, and the unitary matrices U and V contain the left and right singular
vectors for \Sigma.
Given that V is unitary21, Eq. (9-1) has the following equivalence:

A = U \Sigma V^* \;\Leftrightarrow\; AV = U\Sigma \;\Leftrightarrow\; A v_r = \sigma_r u_r    (9-2)

This can be interpreted as the unit vectors of an orthogonal coordinate system
\{v_1, v_2, \dots, v_r\} being mapped under A onto a new "scaled" orthogonal coordinate system
\{\sigma_1 u_1, \sigma_2 u_2, \dots, \sigma_r u_r\}, as shown in Figure 9-1. The vector v_r, which corresponds to the
smallest singular value \sigma_r, gives the 1-dimensional sub-space onto which the data have a
minimal projection, which is the estimated normal vector.

21 A unitary matrix has the property that the conjugate transpose V* is equal to its inverse V^{-1}.
Figure 9-1 Geometric Interpretation of the SVD
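The equivalence in Eq. (9-2) can be checked numerically; the short sketch below (Python/NumPy, illustrative) verifies A v_i = σ_i u_i for each singular triplet and shows that the last right singular vector minimises the projection norm:

import numpy as np

rng = np.random.default_rng(9)
A = rng.normal(size=(5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eq. (9-2): A v_i = sigma_i u_i holds for every singular triplet.
for i in range(len(s)):
    assert np.allclose(A @ Vt[i], s[i] * U[:, i])

# The last right singular vector gives the direction of minimal projection:
# ||A v_r|| = sigma_r, the smallest over all unit vectors, which is why v_r
# serves as the estimated normal of near-planar data.
print(s, np.linalg.norm(A @ Vt[-1]))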
The cost function in the least squares algorithm can be varied as a weighted cost
function to increase the influence of the nearby points over the distant points. Pauly et al.
[147] assign different weights to the fitting errors based on the distance of the
neighbouring points to p. The authors implemented a new cost function with a Gaussian as
the weighting function, where h^2 is chosen as one third of the square distance between p
and its kth nearest neighbour:

e(n, c) = \sum_{i=1}^{k} \left( n^T (p_i - c) \right)^2 e^{-\lVert p_i - p \rVert^2 / h^2}    (9-3)
Other variations include maximizing the cost function which is defined as the
angle (minimize the inner product) between the normal vector and the tangential vectors
(Vector SVD) [52].
10.0 APPENDIX III: DERIVATION OF CONDITIONAL
POSTERIOR DISTRIBUTION OF PARAMETERS FOR IGMM
Conditional Posterior Distribution of the component means, µ
Each component mean is given a Gaussian prior:
P(\mu_i \mid \lambda, \gamma) \sim G(\lambda, \gamma)    ( 10-1 )
where the hyper-parameters22
λ and γ , are common to all components. The conditional
posterior distribution for the component means can be obtained by multiplying the
likelihood from Eq. ( 6-9 ), conditioned on the indicators, by the prior P(\mu_i \mid \lambda, \gamma), as
shown below:
P(\mu_i \mid c, x, \Sigma_i, \lambda, \gamma) \sim G\!\left( \frac{\bar{x}_i N_i \Sigma_i + \lambda \gamma}{N_i \Sigma_i + \gamma},\; \frac{1}{N_i \Sigma_i + \gamma} \right)    ( 10-2 )
22 Hyper-parameter is defined as a parameter of a prior distribution in Bayesian analysis. The term hyper-
parameter is used to distinguish the parameters of the prior distributions (Gaussian and Gamma distributions)
from parameters of the model for the underlying system (Gaussian distribution).
Appendix III
Derivation of Conditional Posterior
Distribution of Parameters for
IGMM
141
where
λ : hyper-parameter - mean of the Gaussian priors
γ : hyper-parameter - variance of the Gaussian priors
\bar{x}_i : mean of the observations belonging to class i
N_i : number of observations belonging to class i
The hyper-parameters can be a single value, or can be computed by taking a probability
distribution on the hyper-parameter itself, called a hyper-prior. The hierarchical structure
of the prior distributions is more robust, as the hyper-parameters can also be updated in the
iterations. The derivation of the conditional posterior distributions for the updating of the
hyper-parameters is shown below.
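For illustration, a one-dimensional sketch (Python/NumPy, with Σ_i treated as a precision; not the thesis code) of drawing a component mean from the conditional posterior ( 10-2 ) inside a Gibbs sweep:

import numpy as np

def sample_mu(x_bar, N_i, S_i, lam, gam, rng):
    # Eq. (10-2): posterior precision N_i * Sigma_i + gamma; the mean is a
    # precision-weighted average of the class mean and the hyper-mean lambda.
    prec = N_i * S_i + gam
    mean = (x_bar * N_i * S_i + lam * gam) / prec
    return rng.normal(mean, np.sqrt(1.0 / prec))

rng = np.random.default_rng(10)
# With 200 observations of class mean ~2, the posterior concentrates near 2
# regardless of the prior mean lambda = 0.
draws = [sample_mu(x_bar=2.0, N_i=200, S_i=4.0, lam=0.0, gam=1.0, rng=rng)
         for _ in range(1000)]
print(round(float(np.mean(draws)), 3))   # close to 2.0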
Conditional Posterior of the Hyper-parameters for the means
The choice of the prior density of the hyper-parameters for the component means is the
conjugate density (for mathematical convenience) of the Gaussian distribution: λ and γ
are given vague23
Gaussian and Gamma hyper-priors:
P(\lambda) \sim G(\mu_x, \sigma_x^2)    ( 10-3 )

P(\gamma) \sim Ga(1, \sigma_x^{-2}) \propto \gamma^{-1/2} \exp\!\left( -\frac{\gamma \sigma_x^2}{2} \right)    ( 10-4 )
where \mu_x and \sigma_x^2 are the mean and variance of the observations.
The conditional posteriors of the hyper-priors are the product of the hyper-priors and
\prod_{i=1}^{k_{rep}} P(\mu_i \mid \lambda, \gamma):
23 The shape parameter of the Gamma hyper-prior is set to unity as proposed in Rasmussen's formulation,
corresponding to a very broad (vague) distribution.
P(\lambda \mid \mu, \gamma) \sim G\!\left( \frac{\mu_x \sigma_x^{-2} + \gamma \sum_{i=1}^{k_{rep}} \mu_i}{\sigma_x^{-2} + k_{rep}\,\gamma},\; \frac{1}{\sigma_x^{-2} + k_{rep}\,\gamma} \right)    ( 10-5 )
P(\gamma \mid \mu, \lambda) \sim Ga\!\left( k_{rep} + 1,\; \left[ \frac{\sigma_x^2 + \sum_{i=1}^{k_{rep}} (\mu_i - \lambda)^2}{k_{rep} + 1} \right]^{-1} \right)    ( 10-6 )
where k_rep is the number of represented mixtures.
Conditional Posterior Distribution of the component precision, Σ
Each component variance is given a Gamma prior:
P(\Sigma_i \mid \beta, w) \sim Ga(\beta, w^{-1}) \propto \Sigma_i^{\beta/2 - 1} \exp\!\left( -\frac{\Sigma_i \beta w}{2} \right)    ( 10-7 )
where the hyper-parameters β and w, are common to all components. The conditional
posterior distribution for the component precisions can be obtained by multiplying the
likelihood from Eq. ( 6-9 ), conditioned on the indicators, by the prior P(\Sigma_i \mid \beta, w),
simplified to the form:
P(\Sigma_i \mid c, x, \mu_i, \beta, w) \sim Ga\!\left( \beta + N_i,\; \left[ \frac{\beta w + \sum_{j: c_j = i} (x_j - \mu_i)^2}{\beta + N_i} \right]^{-1} \right)    ( 10-8 )
where
β : hyper-parameter - shape of the Gamma priors
w^{-1} : hyper-parameter - mean of the Gamma priors
The derivation of the conditional posterior distributions for the updating of the hyper-
parameters is shown below.
Conditional Posterior of the Hyper-parameters for the precision
The choice of the prior density of the hyper-parameters for the component precision is the
conjugate density of the Gamma distribution: β and w are given inverse Gamma and
Gamma hyper-priors:
P(\beta^{-1}) \sim Ga(1, 1) \;\Rightarrow\; P(\beta) \propto \beta^{-3/2} \exp\!\left( -\frac{1}{2\beta} \right)    ( 10-9 )
P(w) \sim Ga(1, \sigma_x^{-2})    ( 10-10 )
The conditional posteriors of the hyper-priors are the product of the hyper-priors and
\prod_{i=1}^{k_{rep}} P(\Sigma_i \mid \beta, w^{-1}), simplified to the form below. P(\beta \mid \Sigma, w) is
not of the standard form of a simple probability function. Since P(\beta \mid \Sigma_1, \dots, \Sigma_k, w)
is log-concave, the independent samples can be generated using the Adaptive Rejection
Sampling (ARS) technique [148].
P(w \mid \Sigma, \beta) \sim Ga\!\left( k_{rep}\,\beta + 1,\; \left[ \frac{\sigma_x^{-2} + \beta \sum_{i=1}^{k_{rep}} \Sigma_i}{k_{rep}\,\beta + 1} \right]^{-1} \right)    ( 10-11 )
P(\beta \mid \Sigma, w) \propto \Gamma\!\left( \frac{\beta}{2} \right)^{-k_{rep}} \exp\!\left( -\frac{1}{2\beta} \right) \left( \frac{\beta}{2} \right)^{(k_{rep}\,\beta - 3)/2} \prod_{i=1}^{k_{rep}} (\Sigma_i w)^{\beta/2} \exp\!\left( -\frac{\beta w \Sigma_i}{2} \right)    ( 10-12 )
Conditional Posterior Distribution of the mixing weights, π
The mixing weights are given symmetric Dirichlet priors with concentration parameter α/k:

P(\pi_1, \dots, \pi_k \mid \alpha) \sim Dirichlet(\alpha/k, \dots, \alpha/k) = \frac{\Gamma(\alpha)}{\Gamma(\alpha/k)^k} \prod_{i=1}^{k} \pi_i^{\alpha/k - 1}
Given the indicator, ci, that encodes the class to which the observation belongs and the
occupation number, Ni, which carries the number of observations associated with
component i, the inference of the mixing weights can be indirectly realised through the
inference of the indicators. With the standard Dirichlet integral, the priors for the
indicators depend only on α [134].
P(c_1, \dots, c_N \mid \alpha) = \int P(c_1, \dots, c_N \mid \pi_1, \dots, \pi_k)\, P(\pi_1, \dots, \pi_k)\, d\pi_1 \dots d\pi_k    ( 10-13 )

= \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)} \prod_{i=1}^{k} \frac{\Gamma(N_i + \alpha/k)}{\Gamma(\alpha/k)}    ( 10-14 )
\Gamma(\cdot) is the standard gamma function.
The conditional prior for a single indicator, with a finite number of mixtures k, given all the
others is then:

P(c_j = i \mid c_{-j}, \alpha) = \frac{N_{-j,i} + \alpha/k}{N - 1 + \alpha}    ( 10-15 )
where −j indicates all indexes except j and N_{−j,i} indicates the number of observations,
excluding x_j, that belong to mixture i.
Since the number of mixtures in the IGMM is infinite, the conditional prior in Eq. ( 10-15 )
is given as below as k tends to infinity:

P(c_j = i \mid c_{-j}, \alpha) = \begin{cases} \dfrac{N_{-j,i}}{N - 1 + \alpha} & \text{if } i \text{ is represented} \\ \dfrac{\alpha}{N - 1 + \alpha} & \text{if } i \text{ is unrepresented} \end{cases}    ( 10-16 )
Conditional Posterior of the hyper-parameters for the mixing weight
The choice of the prior density of the hyper-parameters for the mixing weight is an inverse
Gamma prior:
P(\alpha^{-1}) \sim Ga(1, 1)    ( 10-17 )
The conditional posterior of the hyper-prior, given k_rep and the number of data points, N, is
log-concave and can be sampled using the ARS:

P(\alpha \mid k_{rep}, N) \sim \frac{\alpha^{k_{rep} - 3/2} \exp\!\left( -1/(2\alpha) \right) \Gamma(\alpha)}{\Gamma(N + \alpha)}    ( 10-18 )
REFERENCES
[1] C. Fuchs, E. Gülch, and W. Förstner, "OEEPE Survey on 3D-city models," OEEPE
Publication, No. 35, Bundesamt für Kartographie und Geodäsie, Frankfurt, 1998.
[2] M. Batty, D. Chapman, S. Evans, M. Haklay, S. Kueppers, N. Shiode, A. Smith,
and P. M. Torrens, "Visualizing the City: Communicating Urban Design to
Planners and Decision Makers," CASA Working Paper Series, 2000.
[3] N. Shiode, "3D urban models: recent developments in the digital modelling of
urban environments in three-dimensions," GeoJournal, vol. 52, pp. 263-9, 2000.
[4] C. Braun, T. H. Kolbe, F. Lang, W. Schickler, V. Steinhage, A. B. Cremers, W.
Foerstner, and L. Pluemer, "Models for photogrammetric building reconstruction,"
Computers & Graphics (Pergamon), vol. 19, p. 109, 1995.
[5] D. Stevens, W. M. McKay, and D. Fowler, "Combined use of photogrammetry and
CAD in the reconstruction of fire damaged buildings," Proceedings of SPIE - The
International Society for Optical Engineering, Bellingham, WA, USA, pp. 77-85,
1990.
[6] K. Schindler and J. Bauer, "A model-based method for building reconstruction,"
Proceedings First IEEE International Workshop on Higher-Level Knowledge in
3D Modeling and Motion Analysis (HLK 2003), pp. 74-82, 2003.
[7] J. L. Davidson, "Stereo photogrammetry in geotechnical engineering research,"
Photogrammetric Engineering and Remote Sensing, vol. 51, pp. 1589-1596, 1985.
[8] E. P. Baltsavias, "A comparison between photogrammetry and laser scanning,"
ISPRS Journal of Photogrammetry and Remote Sensing, vol. 54, pp. 83-94, 1999.
[9] F. Ackermann, "Airborne laser scanning - Present status and future expectations,"
ISPRS Journal of Photogrammetry and Remote Sensing, vol. 54, pp. 64-67, 1999.
[10] H. Yoon and K. Park, "Development of a laser range finder using the phase
difference method," Proceedings of SPIE - The International Society for Optical
Engineering, Sapporo, Japan, 2005.
[11] G. Liu, Y. Wang, and G. Liu, "Design and simulation of a mixer and phase
difference measuring circuitry for laser range finding systems," Proceedings of
SPIE - The International Society for Optical Engineering, Xinjiang, China, 2006.
[12] J. Skaloud and J. Vallet, "High Accuracy Handheld Mapping System for Fast
Helicopter Deployment," In Joint International Symposium on Geospatial Theory,
Processing and Applications ISPRS Comm. IV, p. 6, 2002.
[13] C. Herley, "Automatic occlusion removal from minimum number of images," 2005
International Conference on Image Processing, Genova, Italy, pp. 1046-9, 2006.
[14] C. Moenning and N. A. Dodgson, "A new point cloud simplification algorithm,"
3rd IASTED International Conference on Visualization, Imaging, and Image
Processing (VIIP 2003) 8-10 Sep pp. 1027-1033, 2003.
[15] M. Alexa, J. Behr, D. Cohen-Or, S. Fleishman, D. Levin, and T. Silva, "Point Set
Surfaces," Proc. 12th IEEE Visualization Conf., San Diego, USA, p. 6, 2001.
[16] R. T. Whitaker and E. L. Juarez-Valdes, "On the reconstruction of height functions
and terrain maps from dense range data," IEEE Transactions on Image Processing,
vol. 11, pp. 704-16, 2002.
[17] H. T. Tanaka and F. Kishino, "Adaptive sampling and reconstruction for
discontinuity preserving texture-mapped triangulation," IEEE CAD-Based Vision
Workshop - Proceedings, Champion, PA, USA, pp. 298-303, 1994.
[18] Y. Yemez and C. J. Wetherilt, "A volumetric fusion technique for surface
reconstruction from silhouettes and range data," Computer Vision and Image
Understanding, vol. 105, pp. 30-41, 2007.
[19] J. Hu, S. You, and U. Neumann, "Approaches to large-scale urban modeling,"
Computer Graphics and Applications, IEEE pp. 62 - 69 2003.
[20] Y. Takase, N. Sho, A. Sone, and K. Shimiya, "Automatic generation of 3d city
models and related applications," International Archives of Photogrammetry,
Remote Sensing and Spatial Information Sciences, pp. 113--120, 2003.
[21] P. Krishnamoorthy, K. L. Boyer, and P. J. Flynn, "Robust detection of buildings in
digital surface models," Proceedings 16th International Conference on Pattern
Recognition, Quebec City, Que., Canada, pp. 159-63, 2002.
[22] F. Rottensteiner, "Automatic generation of high-quality building models from lidar
data," IEEE Computer Graphics and Applications, vol. 23, pp. 42-50, 2003.
[23] L. Matikainen, J. Hyyppa, and H. Hyyppä, "Automatic detection of buildings from
laser scanner data for map updating," ISPRS Commission III. Workshop 3-d
reconstruction from airborne laserscanner and InSAR data, 2003.
[24] G. Vosselman, "Fusion of laser scanning data, maps, and aerial photographs for
building reconstruction," International Geoscience and Remote Sensing
Symposium (IGARSS), Toronto, Ont., Canada, pp. 85-88, 2002.
[25] N. Haala and C. Brenner, "Extraction of buildings and trees in urban
environments," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 54,
pp. 130-137, 1999.
[26] X. He, R. S. Zemel, and M. A. Carreira-Perpinan, "Multiscale conditional random
fields for image labeling," Proceedings of the IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, Washington, DC, United States, pp.
695-702, 2004.
[27] F. Tsai and H. C. Lin, "Polygon-based texture mapping for cyber city 3D building
models," International Journal of Geographical Information Science, vol. 21, pp.
965-981, 2007.
[28] Y. Zhang, Z. Zhang, J. Zhano, and W. U. Jun, "3D building modelling with digital
map, lidar data and video image sequences," Photogrammetric Record, pp. 285-
302, 2005.
[29] C. I. Connolly, "The Determination of Next Best Views," Proceedings of the IEEE
International Conference on Robotics and Automation 1985.
[30] K. Ulm, "City models from aerial imagery – Integrating images and the
landscape," GEOInformatics, January/February, pp. 18-21, 2005.
[31] B. Hongqiang and Z. Zhaoyang, "Identification of occlusion regions based on
background rebuilding for automatic video object segmentation," Proc. SPIE - Int.
Soc. Opt. Eng. (USA), Beijing, China, pp. 883-6, 2003.
[32] T. Li, W. Chengke, L. Shigang, and Y. Yaoping, "Complete structure recovery
from long image sequence with occlusions," Proc. SPIE - Int. Soc. Opt. Eng.
(USA), Beijing, China, pp. 529-34, 2003.
[33] H. Wang and D. Suter, "A novel robust statistical method for background
initialization and visual surveillance," Lecture Notes in Computer Science
(including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), Hyderabad, India, pp. 328-337, 2006.
[34] Y. Wang and Q. Ji, "A dynamic conditional random field model for object
segmentation in image sequences," Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, San Diego, CA, United
States, pp. 264-270, 2005.
[35] M. Hoynck and J. R. Ohm, "Shape retrieval with robustness against partial
occlusion," 2003 IEEE International Conference on Acoustics, Speech, and Signal
Processing (Cat. No.03CH37404), Hong Kong, China, pp. 593-6, 2003.
[36] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection,"
Journal of Machine Learning Research, vol. 3, pp. 1157-82, 2003.
[37] D. Anguelov, B. Taskar, V. Chatalbashev, D. Koller, D. Gupta, G. Heitz, and A.
Ng, "Discriminative learning of Markov random fields for segmentation of 3D
scan data," Proceedings - 2005 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, CVPR 2005, San Diego, CA, United States, pp.
169-176, 2005.
[38] A. E. Johnson and M. Hebert, "Using spin images for efficient object recognition
in cluttered 3D scenes," IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 21, pp. 433-449, 1999.
[39] S. Matzka, Y. R. Petillot, and A. M. Wallace, "Determining efficient scan-patterns
for 3-D object recognition using spin images," Lecture Notes in Computer Science
(including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), Heidelberg, D-69121, Germany, pp. 559-570, 2007.
[40] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, "Matching 3D models with
shape distributions," Proceedings International Conference on Shape Modeling and
Applications, Los Alamitos, CA, USA, pp. 154-66, 2001.
[41] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, "Shape distributions," ACM
Transactions on Graphics, vol. 21, pp. 807-832, 2002.
[42] A. S. Mian, M. Bennamoun, and R. Owens, "Matching tensors for automatic
correspondence and registration," Computer Vision - ECCV 2004. 8th European
Conference on Computer Vision. Proceedings (Lecture Notes in Comput. Sci.
Vol.3022), Berlin, Germany, pp. 495-505, 2004.
[43] W. Zhaohui, W. Yueming, and P. Gang, "3D face recognition using local shape
map," 2004 International Conference on Image Processing (ICIP) (IEEE Cat.
No.04CH37580), Piscataway, NJ, USA, pp. 2003-6, 2004.
[44] T. W. Way, H.-P. Chan, M. M. Goodsitt, B. Sahiner, L. M. Hadjiiski, C. Zhou, and
A. Chughtai, "Effect of CT scanning parameters on volumetric measurements of
pulmonary nodules by 3D active contour segmentation: A phantom study," Physics
in Medicine and Biology, vol. 53, pp. 1295-1312, 2008.
[45] D. D. Lichti, "Spectral filtering and classification of terrestrial laser scanner point
clouds," Photogrammetric Record, vol. 20, pp. 218-240, 2005.
[46] R. Triebel, K. Kersting, and W. Burgard, "Robust 3D scan point classification
using associative Markov networks," Proceedings - IEEE International Conference
on Robotics and Automation, Orlando, FL, United States, pp. 2603-2608, 2006.
[47] O. Tuzel, F. Porikli, and P. Meer, "Region covariance: A fast descriptor for
detection and classification," Lecture Notes in Computer Science (including
subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), Graz, Austria, pp. 589-600, 2006.
[48] M. E. Wall, A. Rechtsteiner, and L. M. Rocha, "Singular value decomposition and
principal component analysis," A Practical Approach to Microarray Data Analysis,
Kluwer: Norwell, MA, pp. 91-109, 2003.
[49] R. Unnikrishnan and M. Hebert, "Robust extraction of multiple structures from
non-uniformly sampled data," IROS 2003, Las Vegas, NV, USA, pp. 1322-9, 2003.
[50] I. Stamos and P. K. Allen, "Integration of range and image sensing for photo-
realistic 3D modeling," Proceedings 2000 ICRA. Millennium Conference. , San
Francisco, CA, USA, pp. 1435-40, 2000.
[51] J. F. Lalonde, R. Unnikrishnan, N. Vandapel, and M. Hebert, "Scale selection for
classification of point-sampled 3D surfaces," Proceedings. Fifth International
Conference on 3-D Digital Imaging and Modeling, Ottawa, Ont., Canada, pp. 285-
92, 2005.
[52] K. Klasing, D. Althoff, D. Wollherr, and M. Buss, "Comparison of surface normal
estimation methods for range sensing applications," 2009 IEEE International
Conference on Robotics and Automation (ICRA), Piscataway, NJ, USA, pp. 3206-
11, 2009.
[53] J. Shuangshuang, R. R. Lewis, and D. West, "A comparison of algorithms for
vertex normal computation," Visual Computer, vol. 21, pp. 71-82, 2005.
[54] N. Amenta and M. Bern, "Surface reconstruction by Voronoi filtering,"
Proceedings of the Fourteenth Annual Symposium on Computational Geometry,
New York, NY, USA, pp. 39-48, 1998.
[55] T. K. Dey, G. Li, and J. Sun, "Normal estimation for point clouds: A comparison
study for a Voronoi based method," Point-Based Graphics, 2005 -
Eurographics/IEEE VGTC Symposium Proceedings, Stony Brook, NY, United
states, pp. 39-46, 2005.
[56] H. Hoppe, T. DeRose, T. Duchamp, J. McDonald, and W. Stuetzle, "Surface
reconstruction from unorganized points," Comput. Graph. (USA), Chicago, IL,
USA, pp. 71-8, 1992.
[57] J.-F. Lalonde, N. Vandapel, D. F. Huber, and M. Hebert, "Natural Terrain
Classification using Three-Dimensional Ladar Data for Ground Robot Mobility,"
Journal of Field Robotics, 23(10):839--861, October 2006.
[58] D. F. Wolf, G. S. Sukhatme, D. Fox, and W. Burgard, "Autonomous Terrain
Mapping and Classification Using Hidden Markov Models," in (ICRA)Proc. of the
IEEE International Conference on Robotics and Automation, pp. 2026-2031, 2005.
[59] J.-H. Xue and D. M. Titterington, "Comment on "on discriminative vs. generative
classifiers: A comparison of logistic regression and naive bayes"," Neural
Processing Letters, vol. 28, pp. 169-187, 2008.
[60] I. Ulusoy and C. M. Bishop, "Comparison of generative and discriminative
techniques for object detection and classification," in Toward Category-Level
Object Recognition Berlin, Germany: Springer, pp. 173-95, 2006.
[61] P. F. Felzenszwalb and D. P. Huttenlocher, "Efficient graph-based image
segmentation," International Journal of Computer Vision, vol. 59, pp. 167-181,
2004.
[62] C. Zhuo, F. Y. L. Chin, and R. H. Y. Chung, "Automated Hierarchical Image
Segmentation Based on Merging of Quadrilaterals," WSEAS Transactions on
Signal Processing, 2 (8),1063-1068, 2006.
[63] H. Xuming, R. S. Zemel, and D. Ray, "Learning and incorporating top-down cues
in image segmentation," Computer Vision - ECCV 2006. 9th European Conference
on Computer Vision. Proceedings, Part I (Lecture Notes in Computer Science Vol.
3951), Graz, Austria, pp. 338-51, 2006.
[64] R. de Luis-Garcia, R. Deriche, and C. Alberola-Lopez, "Texture and color
segmentation based on the combined use of the structure tensor and the image
components," Signal Processing, vol. 88, pp. 776-95, 2008.
[65] H. Permuter, J. Francos, and I. Jermyn, "A study of Gaussian mixture models of
color and texture features for image classification and segmentation," Pattern
Recognition, vol. 39, pp. 695-706, 2006.
[66] J. D. Boissonnat and F. Cazals, "Coarse-to-fine surface simplification with
geometric guarantees," Computer Graphics Forum, vol. 20, pp. 490-499, 2001.
[67] A. Frome, D. Huber, R. Kolluri, T. Bulow, and J. Malik, "Recognizing objects in
range data using regional point descriptors," Proceedings of the European
Conference on Computer Vision (ECCV), May, 2004.
[68] K. T. Abou-Moustafa and P. P. Ferrie, "The minimum volume ellipsoid metric," in
Pattern Recognition. Proceedings 29th DAGM Symposium. (Lecture Notes in
Computer Science vol. 4713)Berlin, Germany, pp. 335-44, 2007.
[69] J. Ming-Yi, L. Jing-Sin, S. Shen-Po, C. Yuh-Ren, H. Kao-Shing, and L. Wan-Chi,
"Fast and accurate collision detection based on enclosed ellipsoid," Robotica, vol.
19, pp. 381-94, 2001.
[70] E. Rimon and S. P. Boyd, "Obstacle collision detection using best ellipsoid fit,"
Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 18, pp.
105-126, 1997.
[71] T. Lindeberg, "Feature detection with automatic scale selection," International
Journal of Computer Vision, vol. 30, pp. 79-116, 1998.
[72] N. J. Mitra, A. Nguyen, and L. J. Guibas, "Estimating surface normals in noisy
point cloud data," Int. J. Comput. Geometry Appl. 14, vol. (4-5), pp. 261-276, 2004.
[73] E. H. Lim and D. Suter, "Conditional Random Field for 3D Point Clouds with
Adaptive Data Reduction," in New Advances in Shape Analysis and Geometric
Modeling (NASAGEM) Workshop held in conjunction with the International
Conference on Cyberworlds, 2007, pp. 404-408, 2007.
[74] E. H. Lim and D. Suter, "Multi-scale Conditional Random Fields for over-
segmented irregular 3D point clouds classification," in Computer Vision and
Pattern Recognition Workshops, 2008. IEEE Computer Society Conference on
CVPR Workshops., pp. 1-7, 2008.
[75] S. Gumhold, X. Wang, and R. MacLeod, "Feature extraction from point clouds,"
International Meshing Roundtable, Sandia National Laboratories, October 2001.
[76] D. H. Wolpert and W. G. Macready, "No free lunch theorems for optimization,"
IEEE Transactions on Evolutionary Computation, vol. 1, pp. 67-82, 1997.
[77] G. Agre and S. Peev, "On supervised and unsupervised discretization," Cybernetics
and Information Technologies, pp. 43-57, 2002.
[78] R. Caruana and A. Niculescu-Mizil, "An empirical comparison of supervised
learning algorithms," ACM International Conference Proceeding Series, New York,
NY 10036-5701, United States, pp. 161-168, 2006.
[79] S. Ray and M. Craven, "Supervised versus multiple instance learning: An
empirical comparison," ICML 2005 - Proceedings of the 22nd International
Conference on Machine Learning, New York, NY 10036-5701, United States, pp.
697-704, 2005.
[80] P. B. Brazdil, J. Gama, and B. Henery, "Characterizing the applicability of
classification algorithms using meta-level learning," in ECCV vol 784, pp. 83-102,
1994.
[81] C. M. V. D. Walt, "Data Measures that characterise Classification Problems,"
Master of Engineering, thesis, in Faculty of Engineering, the Built Environment
and Information Technology University of Pretoria, 2008
[82] C. Barat, H. Loaiza, E. Colle, and S. Lelandais, "Neural and statistical classifiers.
Can such approaches be complementary?," in Proceedings of the 17th IEEE
Instrumentation and Measurement Technology Conference (IMTC 2000), pp.
1480-1486, 2000.
[83] J. Lafferty, A. McCallum, and F. Pereira, "Conditional Random Fields:
Probabilistic Models for Segmenting and Labeling Sequence Data," in Proceedings
of the 18th International Conference on Machine Learning (ICML), pp. 282-289,
2001.
[84] T. Mitchell, Machine Learning: McGraw Hill, 1997.
[85] G. Bouchard and B. Triggs, "The trade-off between generative and discriminative
classifiers," COMPSTAT'2004 Symposium, 2004.
[86] S. Kumar, "Models for Learning Spatial Interactions in Natural Images for
Context-Based Classification," PhD thesis, The Robotics Institute, Carnegie
Mellon University, 2005.
[87] M. I. Jordan, Ed., Learning in Graphical Models: MIT Press, 1999.
[88] V. Kolmogorov and R. Zabih, "What Energy Functions Can Be Minimized via
Graph Cuts?," IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 26, pp. 147-159, 2004.
[89] W. Li and A. McCallum, "Rapid development of Hindi named entity recognition
using conditional random fields and feature induction," ACM Transactions on
Asian Language Information Processing, vol. 2, pp. 290-294, 2003.
[90] S. Kumar and M. Hebert, "Discriminative random fields: a discriminative
framework for contextual interaction in classification," in Proceedings of the Ninth
IEEE International Conference on Computer Vision (ICCV), Nice, France, pp.
1150-1157, 2003.
[91] D. H. Tran, T. H. Pham, K. Satou, and T. B. Ho, "Conditional random fields for
predicting and analyzing histone occupancy, acetylation and methylation areas in
DNA sequences," in Lecture Notes in Computer Science, Budapest, Hungary, pp.
221-230, 2006.
[92] D. Pinto, A. McCallum, X. Wei, and W. B. Croft, "Table Extraction Using
Conditional Random Fields," in Proceedings of the ACM SIGIR Conference,
Toronto, Canada, pp. 235-242, 2003.
[93] Y. Qi, M. Szummer, and T. P. Minka, "Diagram structure recognition by Bayesian
conditional random fields," in Proceedings of CVPR 2005, San Diego, CA, USA,
pp. 191-196, 2005.
[94] H. Andreasson, R. Triebel, and W. Burgard, "Improving plane extraction from 3D
data by fusing laser data and vision," in 2005 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS), Edmonton, Canada, pp. 2656-2661, 2005.
[95] S. F. Chen and R. Rosenfeld, "A Gaussian prior for smoothing maximum entropy
models," Technical Report CMU-CS-99-108, Carnegie Mellon University, 1999.
[96] R. C. Bolles and M. A. Fischler, "A RANSAC-Based Approach to Model Fitting
and Its Application to Finding Cylinders in Range Data," in Proceedings of the 7th
International Joint Conference on Artificial Intelligence (IJCAI-81), pp. 637-643,
1981.
[97] E. H. Lim and D. Suter, "Classification of 3D LIDAR point clouds for urban
modelling," in Image and Vision Computing New Zealand, pp. 149-154, Nov.
2006.
[98] W. Von Hansen, E. Michaelsen, and U. Thonnessen, "Cluster analysis and priority
sorting in huge point clouds for building reconstruction," in Proceedings of the
International Conference on Pattern Recognition (ICPR), Hong Kong, China, pp.
23-26, 2006.
[99] S. Rusinkiewicz and M. Levoy, "QSplat: a multiresolution point rendering system
for large meshes," in Proceedings of SIGGRAPH 2000, New York, NY, USA, pp.
343-352, 2000.
[100] H. Wang, "Robust Statistics for Computer Vision: Model Fitting, Image
Segmentation and Visual Analysis," PhD thesis, Department of Electrical and
Computer Systems Engineering, Monash University, 2004.
[101] S. M. Stigler, "Mathematical Statistics in the Early States," The Annals of
Statistics, vol. 6, pp. 239-265, 1978.
[102] P. J. Rousseeuw, "Least median of squares regression," Journal of the American
Statistical Association, vol. 79, pp. 871-880, 1984.
[103] B. Walczak, M. Daszykowski, K. Kaczmarek, and Y. Vander Heyden, "Robust
statistics in data analysis - A review," Chemometrics and Intelligent Laboratory
Systems, vol. 85, pp. 203-219, 2007.
[104] M. Zuliani, C. S. Kenney, and B. S. Manjunath, "The multiRANSAC algorithm
and its application to detect planar homographies," in Proceedings of the
International Conference on Image Processing (ICIP), pp. 153-156, 2005.
[105] A. Sarti and S. Tubaro, "Detection and characterisation of planar fractures using a
3D Hough transform," Signal Processing, vol. 82, pp. 1269-1282, 2002.
[106] Y. Ding, X. Ping, M. Hu, and D. Wang, "Range image segmentation based on
randomized Hough transform," Pattern Recognition Letters, vol. 26, pp.
2033-2041, 2005.
[107] C. V. Stewart, "Bias in robust estimation caused by discontinuities and multiple
structures," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.
19, pp. 818-833, 1997.
[108] R. Hesami, A. Bab-Hadiashar, and R. Hoseinnezhad, "Range segmentation of
large building exteriors: A hierarchical robust approach," Computer Vision and
Image Understanding, vol. 114, pp. 475-490, 2010.
[109] A. R. Ferreira da Silva, "A Dirichlet process mixture model for brain MRI tissue
classification," Medical Image Analysis, vol. 11, pp. 169-182, 2007.
[110] B. Thirion, A. Tucholka, M. Keller, P. Pinel, A. Roche, J. F. Mangin, and J. B.
Poline, "High level group analysis of fMRI data based on Dirichlet process mixture
models," in Information Processing in Medical Imaging, Kerkrade, Netherlands,
pp. 482-494, 2007.
[111] C. Rasmussen, B. de la Cruz, Z. Ghahramani, and D. Wild, "Modeling and
Visualizing Uncertainty in Gene Expression Clusters using Dirichlet Process
Mixtures," IEEE/ACM Transactions on Computational Biology and Bioinformatics,
2007.
[112] C. Zhang, S. Zhu, and Y. Gong, "Trend Analysis for Large Document Streams,"
in 5th International Conference on Machine Learning and Applications (ICMLA
'06), pp. 285-295, 2006.
[113] Y.-D. Jian and C.-S. Chen, "Two-view motion segmentation by mixtures of
Dirichlet process with model selection and outlier removal," in 2007 11th IEEE
International Conference on Computer Vision (ICCV), Rio de Janeiro, Brazil, pp.
1060-1067, 2007.
[114] Y.-D. Jian and C.-S. Chen, "Two-view motion segmentation by mixtures of
Dirichlet process with model selection and outlier removal," in Proceedings of the
IEEE International Conference on Computer Vision (ICCV), Rio de Janeiro,
Brazil, 2007.
[115] P. Kultanen, L. Xu, and E. Oja, "Randomized Hough transform (RHT)," in
Proceedings of the International Conference on Pattern Recognition (ICPR), pp.
631-635, 1990.
[116] P. H. S. Torr and A. Zisserman, "MLESAC: a new robust estimator with
application to estimating image geometry," Computer Vision and Image
Understanding, vol. 78, pp. 138-156, 2000.
[117] H. Wang, D. Mirota, M. Ishii, and G. D. Hager, "Robust motion estimation and
structure recovery from endoscopic image sequences with an adaptive scale kernel
consensus estimator," in 2008 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), Anchorage, AK, USA, 2008.
[118] A. Konouchine, V. Gaganov, and V. Vezhnevets, "AMLESAC: A New Maximum
Likelihood Robust Estimator," Graphicon-2005, Novosibirsk, Akademgorodok,
2005.
[119] H. Wang and D. Suter, "MDPE: a very robust estimator for model fitting and
range image segmentation," International Journal of Computer Vision, vol. 59, pp.
139-166, 2004.
[120] L. Fan and T. Pylvanainen, "Robust Scale Estimation from Ensemble Inlier Sets
for Random Sample Consensus Methods," ECCV, 2008.
[121] R. Toldo and A. Fusiello, "Robust Multiple Structures Estimation with J-Linkage,"
ECCV, 2008.
[122] H. Wang and D. Suter, "Robust adaptive-scale parametric model estimation for
computer vision," IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 26, pp. 1459-1474, 2004.
[123] J. V. Moreau and A. K. Jain, "How many clusters?," in Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR '86), Miami Beach, FL, USA, pp. 634-636, 1986.
[124] R. C. Dubes, "How many clusters are best? - an experiment," Pattern Recognition,
vol. 20, pp. 645-663, 1987.
[125] C. Fraley and A. E. Raftery, "How many clusters? Which clustering method?
Answers via model-based cluster analysis," Computer Journal, vol. 41, pp.
578-588, 1998.
[126] S. Still and W. Bialek, "How many clusters? An information-theoretic
perspective," Neural Computation, vol. 16, pp. 2483-2506, 2004.
[127] C. E. Rasmussen, "The Infinite Gaussian Mixture Model," Advances in Neural
Information Processing Systems 12, pp. 554-560, 2000.
[128] H. Frigui and R. Krishnapuram, "Robust competitive clustering algorithm with
applications in computer vision," IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 21, pp. 450-465, 1999.
[129] D. A. Forsyth and J. Ponce, Computer Vision: A Modern Approach: Pearson
Education International, 2003.
[130] T. Ishioka, "Extended k-means with an efficient estimation of the number of
clusters," in Intelligent Data Engineering and Automated Learning - IDEAL 2000
(Lecture Notes in Computer Science vol. 1983), Hong Kong, China, pp. 17-22,
2000.
[131] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms:
Plenum, NY, 1981.
[132] J. M. Biosca and J. L. Lerma, "Unsupervised robust planar segmentation of
terrestrial laser scanner point clouds based on fuzzy clustering methods," ISPRS
Journal of Photogrammetry and Remote Sensing, vol. 63, pp. 84-98, 2008.
[133] W. Von Hansen, E. Michaelsen, and U. Thonnessen, "Cluster analysis and priority
sorting in huge point clouds for building reconstruction," in Proceedings of the
International Conference on Pattern Recognition (ICPR), Hong Kong, China, pp.
23-26, 2006.
[134] C. E. Rasmussen, "The Infinite Gaussian Mixture Model," in Advances in Neural
Information Processing Systems 12, S. A. Solla, T. K. Leen, and K. R. Müller,
Eds., MIT Press, pp. 554-560, 2000.
[135] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from
Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society,
Series B (Methodological), vol. 39, pp. 1-38, 1977.
[136] S. Geman and D. Geman, "Stochastic Relaxation, Gibbs Distributions, and the
Bayesian Restoration of Images," IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. PAMI-6, pp. 721-741, 1984.
[137] G. Biegelbauer and M. Vincze, "Efficient 3D object detection by fitting
superquadrics to range image data for robot's object manipulation," in Proceedings
of the IEEE International Conference on Robotics and Automation (ICRA), Rome,
Italy, pp. 1086-1091, 2007.
[138] E. H. Lim and D. Suter, "Occlusion removal in image for 3D urban modelling," in
Image and Vision Computing New Zealand, pp. 191-196, 2006.
[139] E. H. Lim and D. Suter, "3D terrestrial LIDAR classifications with super-voxels
and multi-scale Conditional Random Fields," Computer-Aided Design, 2009.
[140] T. Haithcoat, W. Song, and J. Hipple, "Automated Building Extraction and
Reconstruction from LIDAR Data," R&D Program for NASA/ICREST Studies
Project Report, 2004.
[141] R. A. Norheim, V. R. Queija, and R. A. Haugerud, "Comparison of LIDAR and
INSAR DEMs with dense ground control," Proceedings, Environmental Systems
Research Institute 2002 User Conference, 2003.
[142] P. Gamba, F. Dell’Acqua, and B. Houshmand, "Comparison and Fusion of LiDAR
and InSAR Digital Elevation Models Over Urban Areas," International Journal of
Remote Sensing, vol. 24, no. 22, pp. 4289-4304, 2003.
[143] I. Stamos and P. K. Allen, "3-D model construction using range and image data,"
in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR 2000), Hilton Head Island, SC, USA, pp. 531-536, 2000.
[144] D. Munoz, N. Vandapel, and M. Hebert, "Directional Associative Markov
Network for 3-D Point Cloud Classification," Fourth International Symposium on
3D Data Processing, Visualization and Transmission, 2008.
[145] C. Fruh and A. Zakhor, "An Automated Method for Large-Scale, Ground-Based
City Model Acquisition," International Journal of Computer Vision, vol. 60, pp.
5-24, 2004.
[146] T. Asai, M. Kanbara, and N. Yokoya, "3D Modeling of Outdoor Scenes by
Integrating Stop-And-Go and Continuous Scanning of Rangefinder," in Proceedings
of the ISPRS Working Group V/4 Workshop 3D-ARCH 2005: "Virtual
Reconstruction and Visualization of Complex Architectures", Mestre-Venice, Italy,
2005.
[147] M. Pauly, R. Keiser, L. P. Kobbelt, and M. Gross, "Shape modeling with point-
sampled geometry," ACM Transactions on Graphics, vol. 22, pp. 641-650, 2003.
[148] W. R. Gilks and P. Wild, "Adaptive rejection sampling for Gibbs sampling,"
Applied Statistics, vol. 41, pp. 337-348, 1992.