3D Urban Modelling
by
Ee Hui Lim
B.Eng.Hons (Monash University) 2004
A thesis submitted in total fulfilment of the requirements for the degree of
Doctor of Philosophy
in the
Department of Electrical and Computer Systems Engineering
Monash University
Clayton Victoria 3800
Australia
July 2011
3D Urban Modelling
Copyright © 2011
by
Ee Hui Lim
All Rights Reserved
Contents

Summary
Declaration
Preface
Acknowledgements
Dedication

1.0 INTRODUCTION
1.1 BACKGROUND AND MOTIVATION
1.1.1 Photogrammetry
1.1.2 LIDAR data
1.1.3 Why is 3D Urban Modelling difficult?
1.2 STRUCTURE OF THE THESIS
2.0 DATA ACQUISITION AND OCCLUSION REMOVAL
2.1 INTRODUCTION
2.2 TECHNOLOGY
2.3 DATA PRE-PROCESSING
2.4 PREVIOUS WORK IN IMAGE OCCLUSION REMOVAL
2.5 OCCLUSION REMOVAL WITH MINIMUM NUMBER OF IMAGES
2.5.1 Occlusion Detection
2.5.2 Occlusion Removal
2.6 EXPERIMENTAL VALIDATION
2.6.1 Indoor data set
2.6.2 Outdoor data set
2.6.3 Panoramic Data Set
2.7 CONCLUSION
3.0 FEATURE DESCRIPTORS FOR 3D CLASSIFICATION
3.1 INTRODUCTION
3.2 REGION COVARIANCE AS 3D FEATURE DESCRIPTOR
3.2.1 Extension to Saliency Features
3.2.2 Validation of the Extended Saliency Features
3.3 ESTIMATED NORMALS AS 3D FEATURE DESCRIPTOR
3.3.1 Delaunay/Voronoi Method
3.3.2 Numerical Optimization Methods
3.3.3 Evaluation of the Surface Normal Estimation Approaches
3.4 CONCLUSION
4.0 3D OVER-SEGMENTATION
4.1 INTRODUCTION
4.2 BACKGROUND OF OVER-SEGMENTATION
4.2.1 Shape of the Super-voxel
4.2.2 Scale Selection
4.3 SUPER-VOXEL – A 3D OVER-SEGMENTATION APPROACH
4.3.1 Sphere as Dividing Boundary
4.3.2 Automatic Scale Selection
4.3.3 Synthetic Data
4.3.4 Outdoor LIDAR Data
4.4 CONCLUSION
5.0 DATA CLASSIFICATION WITH MCRF
5.1 INTRODUCTION
5.2 BACKGROUND OF SUPERVISED CLASSIFICATION
5.2.1 Generative Model
5.2.2 Discriminative Model
5.2.3 Graphical Model
5.3 MULTI-SCALE CONDITIONAL RANDOM FIELD
5.4 PLANE PATCHES FITTING
5.5 RESULTS FOR DATA CLASSIFICATION
5.5.1 Synthetic data sets
5.5.2 Urban data sets
5.5.3 Summary of the Experiment Results
5.6 CONCLUSION
6.0 ROBUST SEGMENTATION
6.1 INTRODUCTION
6.2 BACKGROUND OF ROBUST SEGMENTATION
6.2.1 Region Growing
6.2.2 Model Fitting
6.2.3 Clustering
6.3 INFINITE GAUSSIAN MIXTURE MODEL
6.4 RESULTS
6.5 CONCLUSION
7.0 CONCLUSIONS
7.1 CONCLUDING REMARKS
APPENDIX I: DATA ACQUISITION
APPENDIX II: SINGULAR VALUE DECOMPOSITION
APPENDIX III: DERIVATION OF CONDITIONAL POSTERIOR DISTRIBUTION OF PARAMETERS FOR IGMM
REFERENCES
Summary
This dissertation concerns the development of techniques for the robust generation of 3D
terrestrial urban models. 3D urban models can be constructed by processing the data from
photogrammetry, which is the earliest remote sensing technology. Compared to traditional
photogrammetry, LIDAR (Light Detection and Ranging) is a relatively new and fast
method to obtain 3D urban models, which samples surfaces with high density and high
point accuracy. The use of a “Multisensor” (laser scanner and camera) permits more
complete and efficient data acquisition.
There are several challenges in 3D urban modelling. The construction of 3D
terrestrial urban models requires the acquisition of a large amount of LIDAR data which
can take months, and processing the data is memory-intensive. The very large amount of
data, the unavoidable noise due to the uncontrollable environment, the large variety of structure shapes, and the sheer number of structures make robust multi-structure
data processing extremely challenging. In addition, the number of structures and the
locations of the objects in the outdoor environment are unknown, making classical
statistical methods insufficient to segment or model the data accurately.
A series of algorithms contributing to the generation of urban models from
terrestrial LIDAR data is presented in this dissertation. The approach starts in Chapter 2
with improved solutions for removing inconsistencies (due to moving occlusions) between
the images and LIDAR data. An existing algorithm was modified which assumes that the occlusions are all independent objects, i.e., that a connected occluded region cannot consist of occlusions from different images. This assumption is often violated in outdoor environments due to the large amount of pedestrian overlap. Since a large difference in the discontinuity measure at the boundary of an occlusion is most likely to indicate that occlusion, the idea of the proposed improvement is to analyse the discontinuity measure along the boundary where occlusions occur, to separate any “overlapped” occlusions.
The pre-processed data, with occlusions removed, are then labelled into different
data classes. 3D data labelling in existing studies has been mostly point-based. This
introduces a large amount of redundant computation – labelling every point will result in a
high computational load which can be reduced by classifying a smaller sub-set of the data.
A 3D over-segmentation algorithm, based on the concept of the super-voxel, is introduced for
this purpose. This novel method, based on 3D scale theory, groups regions which are
homogeneous with respect to colour and geometry similarities. This method is shown to
efficiently reduce the outdoor terrestrial LIDAR data for classification. Feature descriptors
are then computed for the super-voxels. One of the feature descriptors, saliency, is shown
in this dissertation to be invariant to the adaptive size of the super-voxel used to compute
the descriptors. A hierarchical graphical model, the multi-scale Conditional Random Fields
(mCRF), is proposed to learn the parameters of the extracted features to label the super-
voxels into different data types.
In addition, a robust estimation method that successfully addresses a number of
issues associated with the segmentation of complex data with unknown numbers of
structures is described. The residual function of the Infinite Gaussian Mixture Model
(IGMM) is modified for the clustering of labelled data belonging to planar surfaces into
locally-delimited planes. The modified algorithm is evaluated on the labelled planar data.
The results of the proposed method show the robustness of the algorithm for the clustering
problem. The proposed method is also shown to be capable of handling missing data caused
by occlusion or transparent objects.
This thesis proposes several improvements to urban modelling methodology. The
proposed approaches in this dissertation have been tested on LIDAR data acquired from
outdoor terrestrial environments, and are effective in solving some of the existing
challenges in 3D urban modelling.
Declaration
1 July 2011
I declare that:
1. This thesis contains no material that has been accepted for the award of any other
degree or diploma in any university or institute.
2. To the best of my knowledge, this thesis contains no material that has previously
been published or written by another person except where due reference is made in
the text of the thesis.
Signed:
Ee Hui Lim
Preface
During my study at Monash University, a number of papers, which contain material used in
this thesis, have been published:
Lim, E. H.; Suter, D.; 3D Terrestrial LIDAR Classifications with Super-voxels and Multiscale Conditional Random Fields, in Computer-Aided Design, 41(10), pp. 701-710, 2009.

Lim, E. H.; Suter, D.; Unsupervised plane data and plane patches clustering for 3D terrestrial urban modelling based on modified Dirichlet process mixture model method, in Visualization, Imaging and Image Processing Conference, Palma de Mallorca, Spain, pp. 97-102, 2008.

Lim, E. H.; Suter, D.; Multi-scale Conditional Random Fields for over-segmented irregular 3D point clouds classification, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Alaska, USA, pp. 1-7, 2008.

Lim, E. H.; Suter, D.; Conditional Random Field for 3D Point Clouds with Adaptive Data Reduction, in International Conference on Cyberworlds, CW '07, Hanover, Germany, pp. 404-408, 24-26 Oct. 2007.

Lim, E. H.; Suter, D.; Classification of 3D LIDAR Point Clouds for Urban Modelling, in Proc. 21st Image and Vision Computing New Zealand, Great Barrier Island, New Zealand, pp. 149-154, 2006.

Lim, E. H.; Suter, D.; Occlusion Removal in Image for 3D Urban Modelling, in Proc. 21st Image and Vision Computing New Zealand, Great Barrier Island, New Zealand, pp. 191-196, 2006.
Acknowledgements
My first and primary acknowledgement must go to my advisor and mentor, Prof. David
Suter, from whom I have learnt so much. He has made a significant contribution to this
dissertation, by introducing me to this research area, providing invaluable guidance and
giving me opportunities to visit other computer vision laboratories. My PhD research would
not have been accomplished without his enlightening ideas, constructive criticism, warm
support, and the countless hours spent on the discussion and proofreading of my research.
I would also like to thank my associate supervisor, Prof. Raymond Jarvis, for his
insightful comments during the weekly seminars. I am also very grateful to my colleagues
in the Institute for Vision Systems Engineering for all the discussions and assistance they
offered during my candidature. I thank them sincerely for all their help. I would particularly
like to thank Dr. Nghia Ho for proofreading parts of the manuscript and discussing
numerous mathematical issues. My years at Monash would not have been so enjoyable without the help and friendship of my fellow PhD students and the staff in the ECSE department. Dr
Alex McKnight assisted by proofreading the final version of the thesis. I would like to
extend my gratitude to them all.
Thanks to Prof. Yasuhito Suenaga and David Suter, I was able to visit the Intelligent
Media Integration for Social Information Infrastructure under the 21st Century COE
(Centre of Excellence) Program for two months. I would like to thank all the staff and
students in the laboratories led by Prof Toyohide Watanabe and Prof. Masayuki Tanimoto
for sharing their experiences and insights in the Virtual Construction of Real World and the
FTV (Free Viewpoint Television) projects. I am also very grateful to have been able to visit
and discuss my work with Simon Kocbek and Prof. Peter Kokol of the System Design
laboratory at the University of Maribor, Slovenia. Thanks are owed also to Dr. Konrad
Schindler, Dr. Hanzi Wang, Dr Liang Wang and Dr Cathy Zhou. I have benefited greatly
from their valuable advice and kind support.
I would like to express my gratitude to my parents, who have been understanding, caring and passionate about my education, and to my relatives, friends and colleagues in Silcar, who have been encouraging and supportive. Last but not least, Qing Huang has always
been my inspiration. I truly share this accomplishment with them all.
Dedication
To my mother, my father and Aunt Kim.
1.0 INTRODUCTION
3D modelling of urban environments has become increasingly popular with the growth
and advances in computational capability. Meanwhile, the development of laser scanning
technology in the past decades has provided various high resolution and long range laser
scanners that allow rapid and efficient acquisition of 3D data. The demand for urban
modelling is apparent in both government and commercial users. An OEEPE survey [1]
conducted by the European Organization for Experimental Photogrammetric Research
showed that 95% of the participants were most interested in 3D urban modelling and 75%
were interested in 3D information about vegetation.
This rising demand has resulted in a wide range of applications for 3D urban
modelling. For instance, the generation of 3D urban models improves the practice of urban
environmental planning and design, by providing the possibility to simulate a proposed
change for the planning authorities. With these virtual urban models, applications such as
regional planning, precise navigation and disaster management can be enhanced. Many
governments and city councils are now using the models for planning and for studies of
climate, air quality, fire propagation and public safety. Moreover, new applications for
virtual reality walk-through and environmental simulation using urban models, such as
Google Earth¹ and Microsoft Virtual Earth², are being developed. Apart from the above-
mentioned applications, urban models are also used in signal propagation modelling in
telecommunications. Batty et al. [2] group these applications into 12 categories of use,
including emergency services, urban planning, telecommunications, architecture, facilities
and utilities management. Shiode [3] later summarized these applications into four
categories, (1) planning and design; (2) infrastructure and facility services; (3) commercial
sector and marketing; and (4) promotion and learning of information on cities.
Various terms are used in describing these 3D models, including “Urban Model”,
“City Model”, “Virtual City”, “Cybercity” and “Digital City”. Currently, urban models
have been developed for most major cities, especially in cities with populations of over
one million, including Tokyo, New York, Mexico, London, Paris, Los Angeles, Chicago
and Delhi [2]. These 3D urban models generally contain models of man-made structures
such as buildings, and some of the models also contain natural objects such as vegetation
and terrain. Particular interest is focused on building reconstruction, where the building
models usually contain the geometric description (shapes and positions) and the
radiometric description (texture) of the building. Natural objects, including terrain and
vegetation, are also modelled to provide, for example, a more realistic representation for
robot localisation applications and visualisation purposes.
1.1 BACKGROUND AND MOTIVATION
Early urban models were physical wooden models built from manual measurements, where survey crews physically placed conventional measurement devices on every key feature. With advances in computing technology, virtual models can be created with computer-aided design (CAD) tools, from design maps of existing buildings and from (though often impractical) manual measurements. Various sensors which will be discussed
in the following sections have been developed for efficient data acquisition.
¹ http://earth.google.com/
² http://www.microsoft.com/maps/
1.1.1 Photogrammetry
3D urban models can be constructed by processing data from photogrammetry [4-6].
Photogrammetry is the earliest remote sensing technology, dating back to 1860, when
Albrecht Meydenbauer made his first investigations into architectural photogrammetry.
Photogrammetry refers to determining geometric properties, including the position,
orientation, shape and size of objects, from photographs and videos. To understand how
photogrammetry obtains these geometric measurements, consider the human vision
system: a stereo (3D) view can be obtained when we see an object from two different
positions (left eye and right eye). Similarly, if an object is photographed from two
different positions, a 3D picture of the object can be acquired, i.e., depth perception is
achieved.
When two or more overlapping images (with known camera characteristics:
including lens focal length, image size and resolution) from different perspectives are
available, stereo image matching, which is a standard photogrammetry technique, can be
employed to determine geometric measurements. The position of the camera where the
images are acquired is not required, as the geometry measurements are determined directly
from the images. To compute the building and terrain surfaces from stereo images, a
number of common points are identified on each image. A projection line is then
constructed from the camera location to the point on the object. The intersection of these
rays, known as triangulation, determines the three-dimensional location of the point [7]. As pointed out before, triangulation is one of the ways our two eyes work together to estimate
distance.
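To make the triangulation step concrete, here is a minimal sketch (not this thesis's pipeline; the camera centres, ray directions and function name are assumptions for illustration). It recovers a 3D point from two back-projected rays, returning the midpoint of the shortest segment between them, since noisy rays rarely intersect exactly:

    import numpy as np

    def triangulate_midpoint(c1, d1, c2, d2):
        """Triangulate from two rays x = c + t*d (c: camera centre, d: direction).
        Noisy rays are skew, so return the midpoint of the shortest segment."""
        d1 = d1 / np.linalg.norm(d1)
        d2 = d2 / np.linalg.norm(d2)
        # Least-squares ray parameters t1, t2 minimising |(c1 + t1*d1) - (c2 + t2*d2)|
        A = np.stack([d1, -d2], axis=1)                        # 3x2 system
        (t1, t2), *_ = np.linalg.lstsq(A, c2 - c1, rcond=None)
        return 0.5 * ((c1 + t1 * d1) + (c2 + t2 * d2))

    # Two cameras 1 m apart viewing a point at (0, 0, 10)
    p = triangulate_midpoint(np.array([0., 0., 0.]), np.array([0., 0., 1.]),
                             np.array([1., 0., 0.]), np.array([-1., 0., 10.]))
    print(p)  # ~ [0, 0, 10]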
Data acquisition (taking pictures) for photogrammetry is fast compared to LIDAR
scanning (see next section). However, processing the image (to retrieve the 3D
measurements for complex images with multiple buildings and vegetation) can be quite
tedious. Comparisons of the accuracy and robustness of reconstruction from LIDAR (Light Detection and Ranging) data and from stereo images have been presented [8, 9]. In short, LIDAR data acquisition is comparable with photogrammetry to a certain extent and will replace photogrammetry in some cases. The two technologies are largely complementary, and their integration can provide more complete and accurate data acquisition.
1.1.2 LIDAR data
LIDAR (Light Detection and Ranging) is a relatively new technology. It works by emitting a laser pulse; the distance is obtained via the time-of-flight method, by precisely measuring the time for the pulse to return to the source and converting it to a distance using the speed of light. The range (or depth), together with the direction of the pulse,
are recorded by the scanner to generate the 3D coordinates of the point on the target from
which the pulse is reflected. Another method for measuring distance is via phase
differencing, which uses a continuous wave with a continuous phase change. The method
calculates the distance based on the phase of a returning pulse [10, 11]. The coordinates
are expressed in a coordinate system fixed to the laser scanner, not an absolute coordinate
system fixed to the earth. These 3D points are commonly known as “point clouds”.
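As a minimal illustration of the time-of-flight computation, and of how a range-plus-direction measurement becomes a scanner-centred 3D point (the function and variable names are assumptions for this sketch, not the scanner's actual interface):

    import numpy as np

    C = 299_792_458.0  # speed of light in m/s

    def tof_point(round_trip_time, azimuth_deg, elevation_deg):
        """One pulse return -> 3D point in the scanner-fixed coordinate system."""
        r = 0.5 * C * round_trip_time          # halve: the pulse travels out and back
        az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
        return np.array([r * np.cos(el) * np.cos(az),
                         r * np.cos(el) * np.sin(az),
                         r * np.sin(el)])

    # A return after ~333.6 ns corresponds to a target ~50 m away
    print(tof_point(333.6e-9, azimuth_deg=45.0, elevation_deg=10.0))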
LIDAR is a fast method for sampling surfaces with high density and high accuracy. The range image generally stores not only range or depth: the returned laser intensity (which is particularly useful for its illumination-invariant property) can also be obtained. With
calibration, the use of a “Multisensor” system (laser scanner and camera) permits more
complete and efficient data acquisition. This “Multisensor” system provides a high
resolution and complete coverage of the environment for urban modelling.
There are two types of urban model in general: airborne and terrestrial. The
airborne urban model can be built with airborne laser scanner data and satellite or aerial
images. The airborne laser scanner is generally mounted on an aircraft, a helicopter or an
Unmanned Aerial Vehicle (UAV) [12]. Data acquisition for a terrestrial urban model, on
the other hand, is via a close-range terrestrial laser scanner mounted on a mobile robot or
vehicle.
Airborne data consists only of information on the roof-tops of buildings and the terrain shapes; building façades cannot be represented in detail. The
airborne model is suitable for applications where a birds-eye view of the urban model is
desirable. On the other hand, ground-based or terrestrial modelling is capable of
reproducing detailed building façade scans, while the building roof-top details are
generally unavailable. The terrestrial urban model is preferable for applications such as a
walkthrough in virtual reality, environmental simulation or for robot navigation. In the
case where both building roof-top and façade details are required, the fusion of the
airborne and ground-based models permits a complete urban model to be developed.
1.1.3 Why is 3D Urban Modelling difficult?
The construction of 3D terrestrial urban models requires the acquisition of a large amount
of LIDAR data which can take months, and it is memory-intensive to store and process the
data. The very large amount of data, the unavoidable noise due to the uncontrollable
environment, the large variety of structure shapes, and the sheer number of
structures make robust multi-structure data processing extremely challenging. In addition,
the number of structures and the location of objects in the outdoor environment are
unknown, making classical statistical methods insufficient to segment or model the data
accurately.
The following section outlines several challenges of 3D urban modelling.
Challenge I: Inconsistencies between the images and the LIDAR data
The acquisition of both image and LIDAR data is prone to anomalous occlusion due to
moving pedestrians and other objects in the uncontrolled outdoor environment. This is
because there is a distinct time difference between the data acquisition from the laser
scanner and the camera, i.e., the data acquisition processes for the laser scanner and the
camera do not occur simultaneously. As a result, anomalous occlusion or inconsistency
between the images and the LIDAR data (when a moving object does not coincide at the
same location on both the laser scan and the image, explained in detail in Section 2.3) may
occur. Removing the inconsistencies in the images can be challenging. In a busy
environment, simple median filtering may not be sufficient to remove most of the
inconsistencies, as any given portion of the image scenes may be occluded more than 50%
of the time. Also, the median filter does not take any spatial continuity properties into
account. A more complicated image processing solution is required for image occlusion
removal. Based on the concept that when an occlusion happens, there is generally a
discontinuity around the boundary of the occlusion, Herley [13] proposed a method which
removes occlusion using a minimum number of images. However, the method is based on
the assumption that the occlusions are independent objects, which is often violated in
outdoor environments especially along a pathway during busy hours.
Challenge II: Large amount of complex outdoor data
With fast data acquisition, the amount of acquired data is massive. It is possible to collect
50,000 points per second to generate a point cloud that contains millions of points. A
complete city model usually requires registration of several scans from different locations
(with each scan being large in size). Storing or visualising the raw data is memory-demanding, and processing the data requires a long time. Manually reconstructing a 3D
urban model is costly and time consuming. An automated surface reconstruction algorithm
is therefore highly desirable.
There exist several point cloud reduction techniques, such as the coarse-to-fine
point cloud simplification by Moenning and Dodgson [14], using farthest point sampling.
Farthest point sampling is based on the idea of minimizing any reconstruction error by
repeatedly adding one sample point at a time and placing the sample point in the middle of
the least-known sampling domain. Similarly, Alexa et al. [15] reduce the point cloud by
removing the points with least contribution to the moving least squares (MLS)
representation of the underlying surface. These point cloud simplification methods can be
a useful step to speed up the subsequent surface reconstruction process.
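A minimal numpy sketch of the farthest point sampling idea described above (the function name and the greedy seed choice are my own): each new sample is placed farthest from the samples chosen so far, i.e. in the middle of the least-known region.

    import numpy as np

    def farthest_point_sampling(points, k):
        """Greedily select k of the input points as a reduced point cloud."""
        chosen = [0]                                        # arbitrary seed
        d = np.linalg.norm(points - points[0], axis=1)      # distance to sample set
        for _ in range(k - 1):
            idx = int(np.argmax(d))                         # farthest point so far
            chosen.append(idx)
            d = np.minimum(d, np.linalg.norm(points - points[idx], axis=1))
        return points[chosen]

    cloud = np.random.rand(100_000, 3)
    print(farthest_point_sampling(cloud, 512).shape)        # (512, 3)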
Another method to reduce the data is via Delaunay Triangulation [16-18], which is
a common method for indoor object reconstructions from range data. The Delaunay
Triangulation is a set of triangles that connect each data point to its neighbours, with the
property that for each triangle, the unique circle circumscribed about the triangle contains
no data point. Nevertheless, the outdoor terrestrial laser scanned point clouds have very
different properties compared to the smaller indoor objects, which make triangulation and
other direct reconstruction/model fitting methods difficult. The outdoor range data often
suffer from occlusions due to moving (e.g. pedestrians) or still (e.g. vegetation)
obstructions during data acquisition. The data also have the property of varying density
due to different distances of scanned objects from the laser scanner. In addition, a large
outdoor range data set contains multiple multi-structure objects, including cluttered
vegetation. Due to these properties, direct reconstruction is very difficult and challenging.
Triangulating vegetation data often causes unwanted spikes (as shown in Figure 1-1), or
even connects the near-by vegetation and building data into the same surface. In addition,
it is difficult to represent the edges and corners properly with simple triangulations. Also,
extra knowledge is required to recover occlusions as building surfaces are often obstructed
by vegetation, as shown in Figure 1-1.
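For reference, on well-behaved data the triangulation itself is routine; a small sketch with scipy.spatial (an assumed dependency, shown here on random 2D points rather than the problematic outdoor scans) builds exactly the structure defined above:

    import numpy as np
    from scipy.spatial import Delaunay   # assumed available

    pts = np.random.rand(30, 2)
    tri = Delaunay(pts)
    # tri.simplices holds (n_triangles, 3) point indices; by the Delaunay
    # property, no input point lies inside any triangle's circumscribed circle.
    print(tri.simplices.shape)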
In a complex environment that contains various data types, especially an outdoor
environment with a relatively large amount of data belonging to vegetation, one method to
model the data is to geometrically fit the dense point clouds with primitive models (in
other words, surface reconstruction) [19, 20]. For example, hundreds or thousands of data points acquired from a plane can be defined using three points, or four points which represent the
four vertices of the planar rectangle in urban environments.

Figure 1-1 Triangulations of outdoor range data with RiScan Pro software (edge clearing threshold = 0.05 m; depth factor = 8; depth threshold = 0.05 m)

Vegetation such as trees and
bushes can be replaced with generic models or simplified with primitive models,
depending on the required level of detail. In order to fit the data with the appropriate
geometric models, the data can be classified into different classes, such as man-made
buildings, terrain, vegetation; or at a lower level, the raw data can be classified into
different geometrical classes, i.e. planar, linear or cluttered.
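These lower-level geometrical classes can be illustrated with the eigenvalues of a local 3x3 covariance matrix, which previews the saliency features of Chapter 3; the hard thresholds below are illustrative assumptions of this sketch, not the classifier used in this thesis:

    import numpy as np

    def geometric_class(neighbourhood):
        """Label a local neighbourhood of 3D points as planar, linear or
        cluttered from its covariance eigenvalues (ascending l0 <= l1 <= l2)."""
        l0, l1, l2 = np.linalg.eigvalsh(np.cov(neighbourhood.T))
        if l1 > 5 * l0 and l2 < 5 * l1:     # two dominant directions: a surface
            return "planar"
        if l2 > 5 * l1:                      # one dominant direction: edge or pole
            return "linear"
        return "cluttered"                   # comparable spread: vegetation-like

    slab = np.random.rand(200, 3) * [1.0, 1.0, 0.01]   # thin planar slab
    print(geometric_class(slab))                       # -> planar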
Challenge III: Data classification

3D outdoor data classifications are not new. Several previous attempts at urban
modelling use aerial LIDAR data, where the acquired data consist of 2D (or 2.5D) birds-
eye view point clouds. Methods to classify the data that have been applied to aerial urban
modelling include the utilization of linear classifiers, or clustering methods with features
extracted from the height difference [21-24], variations of surface normal vectors [22] and
colours [25]. These methods are often inappropriate for data classification in terrestrial
urban modelling. For example, vegetation removal is commonly performed by filtering via
changes in the height differences. However, rooftop information is generally unavailable
in terrestrial or ground-based data acquisition.
In previous work on terrestrial urban modelling, the classification of raw point
clouds has been point-based. With the amount of data acquired, as mentioned before,
extracting features and labelling every single point is computationally intensive. It has
been shown in work on 2D pixel classification [26] that classifying every pixel is
redundant and this is also an issue in 3D data. However, over-segmentation techniques
used in 2D images are not applicable to 3D data: the multi-structure and varying density of
the complex data makes it challenging to group the 3D data into regions that are
homogeneous with respect to certain chosen properties. A method capable of reducing the
complex 3D data into a smaller sub-set will greatly reduce the computation required in
processing the data.
Challenge IV: Unknown number of structures and unknown location
With the data belonging to man-made surfaces extracted, classical statistical methods may
be applied to segment or model the data. However, the number of structures and locations
of the objects in the outdoor environment are often unknown. This information is essential
for the classical statistical methods or estimators to robustly segment the data. Robust
estimators such as RANSAC are essentially designed to only handle single-structure
segmentation contaminated with gross outliers. In order to handle multi-structure
segmentation, RANSAC can be applied sequentially to detect and remove the inliers of the
best fitted planes from the data set. This process is not optimal and determining when to
stop the fit-remove process is not straightforward.
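A compact sketch of the sequential fit-remove loop criticised above (plain RANSAC plane fitting; the inlier threshold, iteration count and minimum-inlier stopping rule are illustrative assumptions, and choosing that stopping rule well is exactly the difficulty noted in the text):

    import numpy as np

    def ransac_plane(pts, thresh=0.02, iters=500, seed=0):
        """Best plane n.x + d = 0 by RANSAC; returns (n, d, inlier_mask)."""
        rng = np.random.default_rng(seed)
        best_n, best_d, best_mask = None, None, None
        for _ in range(iters):
            p0, p1, p2 = pts[rng.choice(len(pts), 3, replace=False)]
            n = np.cross(p1 - p0, p2 - p0)
            if np.linalg.norm(n) < 1e-12:                 # degenerate sample
                continue
            n = n / np.linalg.norm(n)
            mask = np.abs((pts - p0) @ n) < thresh        # point-to-plane distance
            if best_mask is None or mask.sum() > best_mask.sum():
                best_n, best_d, best_mask = n, -n @ p0, mask
        return best_n, best_d, best_mask

    def sequential_ransac(pts, min_inliers=500):
        """Fit a plane, remove its inliers, repeat until planes become too small."""
        planes = []
        while len(pts) > min_inliers:
            n, d, mask = ransac_plane(pts)
            if mask is None or mask.sum() < min_inliers:  # ad-hoc stopping rule
                break
            planes.append((n, d))
            pts = pts[~mask]
        return planes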
Alternatively, clustering methods provide an approach to segmenting the data
simultaneously. Determining the number of clusters, which is the fundamental problem of
cluster validity, is one of the main problems in clustering methods. Most solutions to the unknown number of clusters require an estimate of the maximum number of clusters, which is often difficult to obtain for a large number of points.
1.2 STRUCTURE OF THE THESIS
Given the aforementioned challenges, this thesis presents a contribution to 3D Urban
Modelling with terrestrial LIDAR data and calibrated camera image information. The
outdoor LIDAR and image data are acquired with a system that consists of a Riegl LMS-
Z420i Terrestrial Laser Scanner and a calibrated high-resolution Nikon D100 digital
camera. Scans from different positions are registered in a local coordinate system with the
provided software RiScan Pro³. This dissertation is organised as follows, as illustrated in Figure 1-2:
Chapter 2 proceeds with description of the equipment used in data acquisition and
collection. This chapter concerns the aforementioned challenge regarding the
inconsistencies between the images and LIDAR data due to moving occlusions.
• The image occlusions are removed by extending the image occlusion removal
approach (with minimum number of images) proposed by Herley [13] to remove
occlusions that are not independent. The extended solution is explained in detail
and evaluated with indoor and outdoor data in the chapter.
³ RiScan Pro is the companion software for the RIEGL terrestrial scanner.
Chapter 3 – 5 describe robust and efficient 3D data classification methods to label the
outdoor LIDAR data (with complex properties as described above) into different data
classes. Chapter 3 starts with an overview and comparison of feature descriptors for the
classification of the LIDAR data, including the saliency features (as a function of the
eigenvalues of the region covariance) and the estimated surface normal vectors. The size
of the region neighbourhood used to compute the features cannot be fixed due to the
nature of the LIDAR data, as explained in Section 1.1.3.
• An issue is identified with the adaptive size of the region covariance used to compute the saliency features. We modify the saliency features to be invariant to the size of the adaptive region covariance by normalising the 1st and 2nd largest eigenvalues with the size of the region covariance. The extended saliency features
eigenvalues with the size of the region covariance. The extended saliency features
and the estimated surface normal vector are evaluated on synthetic and outdoor
terrestrial LIDAR data sets.
Chapter 4 addresses redundant computation in the existing point-based point cloud
classification approach.
• To avoid the redundant computation, we introduce a 3D over-segmentation
algorithm, namely super-voxel, and apply the algorithm to the outdoor terrestrial
LIDAR data. The algorithm, which is based on 3D scale theory, reduces the
original data to a smaller sub-set of the data by grouping the data into the super-
voxel subcomponents, depending on the level of a similarity criterion. This method
is shown to efficiently reduce the outdoor terrestrial LIDAR data for classification.
The features explained in Chapter 3 are then extracted from every super-voxel and Chapter
5 describes the learning model for the super-voxel classification.
• We propose a hierarchical graphical model, the multi-scale Conditional Random
Fields (mCRF), to learn the parameters of the extracted local and regional features.
A comparison of the classification results shows improvement in accuracy over
existing methods using 3D over-segmentation and mCRF.
Building and terrain models can then be derived from the classified data belonging to the
planar surface. In Chapter 6, the Infinite Gaussian Mixture Model (IGMM) is introduced
to robustly segment extracted data belonging to planar surfaces.
• A robust estimation method is described that successfully addresses a number of
the challenges associated with the segmentation of complex data with unknown
numbers of structures. The residual function of the Infinite Gaussian Mixture
Model (IGMM) is modified for the clustering of labelled data belonging to planar
surfaces into locally delimited planes. The proposed method is evaluated on the
labelled planar data and the robustness of the algorithm for the clustering problem
is demonstrated.
Finally, Chapter 7 concludes this dissertation by summarizing the contributions and
discussing directions for future research.
Figure 1-2 Overview of the approach to 3D Urban Modelling: data acquisition, registration and occlusion removal (Chapter 2); feature extraction (Chapter 3); data classification on super-voxels (Chapter 4); learning model and plane patch fitting (Chapter 5); robust segmentation (Chapter 6), leading to the digital surface model; vegetation data is left as further work (Chapter 7).
2.0 DATA ACQUISITION AND OCCLUSION REMOVAL
2.1 INTRODUCTION
The use of a Multisensor system – the laser scanner and camera – in the LIDAR (Light Detection and Ranging) method permits a much faster, more complete and more efficient
data acquisition process compared to photogrammetry. The laser scanner provides
geometry data, whereas the image taken provides colour information for realistic texture
mapping, which is valuable information for further point cloud analysis [27, 28].
In this chapter, the technologies involved in the creation of an urban model are
described, including the pre-processing required to prepare the data for data classification
and geometry fitting.
2.2 TECHNOLOGY
The equipment used in our research is a Multisensor system that includes a high-
performance long-range Riegl LMS-Z420i Terrestrial Laser Scanner with a wide field-of-
view and a calibrated high-resolution Nikon D100 6 megapixel digital camera with a
14mm lens firmly mounted on top of the scanner (Figure 2-1); the camera rotates with the
laser scanner when in operation. The operating range of the scanner is greater than 800 m for scanned objects with 80% reflectivity, with an accuracy of 10 mm. This single-shot accuracy can be improved to 5 mm by acquiring several scans in a sequence. The field of view is 80º x 360º and the scanner can be tilted up to 180º. The measuring rate is 8000 points/sec with a scan resolution of up to 0.004 deg. An 80º x 360º scan with a scan resolution of 0.06 deg (the resolution selected for our data acquisition) requires around 20 minutes.
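The quoted acquisition time can be sanity-checked from these figures; a back-of-the-envelope estimate (assuming one measured point per angular step, which idealises the scanner's actual timing):

    h_steps = 360 / 0.06                    # ~6000 azimuth steps
    v_steps = 80 / 0.06                     # ~1333 elevation steps per column
    points = h_steps * v_steps              # ~8.0 million points per scan
    minutes = points / 8000 / 60            # at 8000 points/sec
    print(f"{points / 1e6:.1f} M points, ~{minutes:.0f} minutes")   # ~17 minutes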
Figure 2-1 (a) Riegl LMS-Z420i with mounted Nikon D100; (b) Riegl LMS-Z420i on cart
The camera provides calibrated photographs, which in turn are automatically registered with the laser-scanned data. A software package called RiSCAN PRO, which manages the acquisition, registration and archiving tasks, is used to export the 3D point clouds and the colour information to an ASCII file for further processing. Figure
2-2 and Figure 2-3 show an example of two registered laser scans mapped with colour
information from the images using the RiSCAN PRO software. Figure 2-2 shows the front
view of the registered scans. The green and red dots indicate the two different scan
positions (locations 1 and 2). Figure 2-3i shows the birds-eye view of the registered scans
in real colour, whereas Figure 2-3ii shows the same view in simple colour (Green: location
1; Red: location 2). The RiScan Pro software carries out the registration automatically
with special targets manually placed in the environment.
The laser scanner is mounted on a cart and data collection is achieved in a stop-
and-go manner. In order to reduce the amount of time spent in data acquisition, the method explained in [29], which plans the Next Best View, can be implemented. With this method, the overlap of data points would be kept to a minimum.
Currently, the data collection is based on manually selecting the locations that best cover
the region of interest. The system can be upgraded for continuous scanning in the case of a
more complex environment.
Figure 2-2 Front view of registered LIDAR scans
Figure 2-3 Birds-eye view of registered LIDAR scans: (i) real colour; (ii) simple colour
2.3 DATA PRE-PROCESSING
The result of combining image and laser data capture is, however, prone to anomalous occlusion due to moving pedestrians and other objects. This is because there is a distinct time difference between the data acquisition from the laser scanner and the camera: the data acquisition processes for the laser scanner and the camera do not occur simultaneously⁴.
This results in inconsistency or anomalous occlusions, where a moving object does not
coincide at the same location on both the laser scan and the image.
To illustrate this problem, consider the following scenario: imagine a pedestrian
happened to be at point A in Figure 2-4 when the laser scanning was taking place. As a
result, part of the laser scan of the terrain at point A would be occluded by the pedestrian.
On completion of the laser scanning, i.e. after time t, the pedestrian moved to point B as
the camera was taking images. The image of the tree at point B would be occluded by the
pedestrian. The aforementioned anomalous occlusion can be seen when both LIDAR data
and colour images are combined, as demonstrated in Figure 2-5.
Figure 2-4 Example of a situation where scan artefacts could occur: (a) at time = 0; (b) at time = t
⁴ In our experiment, the Riegl terrestrial laser scanner scans the environment for some time (around 20 minutes for a resolution of 0.06 deg), followed by the camera taking seven images (with an overlap of 15%). For the purpose of occlusion removal, another set of laser scans and images is taken after the first one.
Figure 2-5 Scanned results of occlusion: (a) human object (3D points) mapped with grass image texture; (b) human image mapped onto tree trunk
As described in the above scenario, the LIDAR data of the pedestrian captured in
position A would be colour mapped with grass (Figure 2-5a), which is the image of
position A after some time t. On the other hand, the human image taken at time t would be
colour mapped onto the tree trunk and onto the ground at position B (Figure 2-5b). This
can be an issue for visualisation requiring accurate texturing. Apart from being
“unrealistic”, such false data collection can be a problem for further analysis. For example,
if point cloud classification is based on the colour property, the green human object may
be recognised as vegetation instead.
Figure 2-6 Example of LIDAR data showing raw infrared intensity readings: (a), (b) input LIDAR data I and II with moving occlusions; (c) occlusion-free LIDAR data
In general, two types of occlusions have to be addressed: occlusions in the LIDAR
data and occlusions in the image data. The occlusion in the LIDAR data can usually be
solved by taking more than two scans at the same scan location and taking the greatest
depth of each point, assuming each data point is not occluded more than once in the set of
LIDAR scans from the same scan location. Figure 2-6 depicts an example of moving
occlusion removal from two LIDAR data scans using the method described. Vertical
spikes in the input images are due to pedestrians moving at a speed that is higher than the
laser scanning speed.
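This LIDAR-side fix reduces to an element-wise maximum over co-registered range images; a minimal sketch (assuming the scans are stored as float depth arrays aligned pixel-for-pixel, with zero marking a dropped return — both assumptions of this illustration):

    import numpy as np

    def remove_moving_occlusions(range_scans):
        """Keep the greatest depth per ray: a pedestrian returns a shorter range
        than the background, so the maximum recovers the static scene, assuming
        each ray is occluded in at most one of the scans."""
        stack = np.stack(range_scans).astype(float)   # (n_scans, rows, cols)
        stack[stack == 0] = np.nan                    # ignore dropped returns
        return np.nanmax(stack, axis=0)

    scan1 = np.array([[10.0, 2.5], [9.0, 10.0]])      # 2.5 m: pedestrian
    scan2 = np.array([[10.0, 10.0], [2.5, 10.0]])     # pedestrian has moved
    print(remove_moving_occlusions([scan1, scan2]))   # background depths remain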
Removing occlusions in images is more difficult. This chapter presents a method
for image occlusion removal. This method works with the minimum required number of input images and is robust in detecting occlusion boundaries in the input images.
2.4 PREVIOUS WORK IN IMAGE OCCLUSION REMOVAL
Image occlusion removal can be achieved by manual intervention. For example,
Ulm [30] removed obstacles like cars or trees on terrestrial images by manual retouching
of the artefacts or occlusions. However, this is very tedious for a large collection of
images. The inconsistency caused by occlusion in the images can be otherwise removed
using automatic computer vision techniques. Removing occlusion in images for urban
modelling is somewhat analogous to background modelling [31-33]. In background
modelling, a long stream of video is taken from the same standpoint to obtain the
background model using robust statistical methods. The estimated background model can
then be employed to extract foreground objects for various purposes, including traffic
monitoring, human action recognition and object tracking.
Since the background is more likely to appear in a scene, one simple statistical
method is median filtering. Assuming two or more images contain views of the same scene at different times, an occlusion-free image can be formed with median filtering: the
final occlusion-free image is assigned to the median of the N input images from the same
viewpoint. However, the above assumption does not always hold. In outdoor
environments, any given portion of the image scenes may be occluded more than 50% of
the time. In addition, the median filter does not take any spatial continuity properties into
account. Therefore, some occlusion might only be partially removed. There are various
algorithms in background modelling that attempt to overcome the problems with the
limitation of the median or mean. Wang [34] proposed a more complicated solution that
locates all “stable sub-sequences” of pixel values in the video stream. The most “reliable”
sub-sequence is then chosen using the RANSAC algorithm. The initial background model
carries the mean value of the intensities over that sub-sequence.
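For reference, the simple median baseline discussed at the start of this section is a one-liner over a stack of same-viewpoint images (shown here only as the baseline the text argues is insufficient in busy scenes):

    import numpy as np

    def median_background(images):
        """Per-pixel median over N registered images; fails wherever a pixel
        is occluded in more than half of the inputs."""
        return np.median(np.stack(images), axis=0).astype(np.uint8)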
The difference between background modelling and the image inconsistency
problem in this thesis is that, in the application studied here, there is not a large stream of
images for every viewpoint. For our problem, an efficient method is adopted that is based
on Herley's image occlusion removal algorithm [13]. Herley shows that multiple images
(>2 images) are not always necessary in solving image occlusion. An occlusion-free image
can be formed automatically, provided that each location of the image is unoccluded in at least one input. This is based on the concept that, when an occlusion occurs, there is generally
a discontinuity around the boundary of the occlusion.
To obtain the occlusion-free image, Herley first constructed a “consensus image” that acquires the value of any two images that agree at each location, leaving “occlusion holes” elsewhere. The occlusion holes are connected closed sets of pixels which have to be filled in.
Herley’s algorithm assumes that the occlusions are all independent objects; one occlusion
hole cannot consist of occlusions from different images. Based on this assumption, each
connected set can be filled with data from a single image, and hence the problem is
simplified to determining which image is the best. This works by comparing the similarity
of the occlusion’s outer boundary in the “consensus image” with the occlusion’s inner
boundary in all the input images (details are illustrated in the next section).
However, this assumption is often violated in an outdoor environment, as for
example, during a busy period along a narrow pathway with a large number of pedestrian
overlaps. To remedy this, Herley’s approach is extended to enable the removal of
occlusion when a single occlusion boundary requires information from more than a single
image. In addition, our proposed algorithm includes the ability to detect unremoved
occlusions. In the case where complete occlusion removal is not possible (no single
unobstructed view of the background is seen in any input images), the algorithm is capable
of detecting such cases. This additional step allows shape retrieval or manual retouching to
recover the particular image. This is important, as the number of images in the acquired
urban image database is large, and it can be very time-consuming to examine all the processed images to pick out those that need further processing.
2.5 OCCLUSION REMOVAL WITH MINIMUM NUMBER OF IMAGES
Let the images I0(i,j), I1(i,j), …, IM-1(i,j) be a set of input images obtained from the calibrated camera, taken at different times from the same viewpoint. Therefore, Im(i,j) = In(i,j) ∀m, n, unless either Im or In is occluded at that location or affected by illumination changes. The steps of the occlusion removal algorithm are detailed below:
2.5.1 Occlusion Detection
Constructing consensus image
As explained previously, the occlusion removal algorithm [13] starts by constructing a
“consensus image” U, which contains pixels that are similar in Im. The “consensus image”
can be constructed from two or more images by acquiring pixel values from any two
images that have a difference less than threshold α.
$$U(i,j) = \begin{cases} I_m(i,j) & \text{if } |I_m(i,j) - I_n(i,j)| < \alpha \text{ for any } m, n \\ 0 & \text{otherwise} \end{cases} \qquad (2\text{-}1)$$

$$I'_m(i,j) = \begin{cases} U(i,j) & \text{for } U(i,j) \neq 0 \\ I_m(i,j) & \text{otherwise} \end{cases} \qquad (2\text{-}2)$$
Each pixel in any two images Im and In is compared with a threshold α, where α is a small
value to allow some matching error. If the similarity is low, the consensus image U is
assigned a pixel value of zero. Otherwise, the consensus image carries the agreed pixel value from images Im and In. Each of the Im then forms a new image I′m that matches the “consensus image” except where the “consensus image” is zero.
The visual similarity can be measured using various features such as intensity or
colour, gradient, contour, texture, or spatial layout. A popular choice for similarity is
colour due to its simplicity and robustness against scaling, rotation, partial occlusion, and
non-rigid deformation. However, the RGB colour space is sensitive to the change of
illumination. Outdoor environments cannot be controlled and the illumination condition
may vary for the set of images from the same viewpoint. The normalized RGB space was
employed:
$$I_R = \frac{R}{R+G+B}, \qquad I_G = \frac{G}{R+G+B}, \qquad I_B = \frac{B}{R+G+B} \qquad (2\text{-}3)$$
The consensus image is then constructed from any two images with colour similarity less
than α, which was set to 5 in our experiments:
$$c = (I_{mR} - I_{nR})^2 + (I_{mG} - I_{nG})^2 + (I_{mB} - I_{nB})^2 \qquad (2\text{-}4)$$
Figure 2-7a and Figure 2-7b show an example of the input images from the same
viewpoint in RGB space. Both images were converted into the normalised RGB space, as
shown in Figure 2-7c and Figure 2-7d. The resulting “consensus image” based on the
visual similarity of the normalised RGB of the input images is shown in Figure 2-7e. The
black regions in the consensus image are the estimated locations of the occlusion.
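The construction in equations (2-1)–(2-4) can be sketched directly in numpy (scaling the chromaticities to 0–255 is an assumption made here so that the threshold α = 5 has the magnitude used in the text):

    import numpy as np

    def normalised_rgb(img):
        """Eq. (2-3): chromaticity per pixel, scaled to 0..255."""
        s = img.sum(axis=2, keepdims=True).clip(min=1)
        return 255.0 * img / s

    def consensus(images, alpha=5.0):
        """Eqs. (2-1)-(2-4): keep pixels where any two views agree in
        normalised RGB; zero-valued pixels form the occlusion holes S_p."""
        norm = [normalised_rgb(im.astype(np.float64)) for im in images]
        out = np.zeros_like(norm[0])
        for m in range(len(norm)):
            for n in range(m + 1, len(norm)):
                c = np.sum((norm[m] - norm[n]) ** 2, axis=2)   # eq. (2-4)
                agree = c < alpha
                out[agree] = norm[m][agree]    # value where the two views agree
        return out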
Figure 2-7 Construction of the consensus image: (a), (b) image sequence with occlusions; (c), (d) input images in normalised RGB space; (e) consensus image; (f) filtered consensus image
Discarding consensus image noise
The consensus image at this stage may contain a large number of occlusion ‘holes’, i.e.,
pixels with zero RGB value, that are caused by image noise. In an outdoor environment,
moving trees and bushes (due to wind⁵) can cause mismatches in the input images, thereby
generating a large number of small holes in the consensus image. Eliminating these
relatively small occlusion ‘holes’ at this stage frees computation time for the more complex processing that follows.
⁵ Occlusion can be divided into static and moving occlusions in general. Vegetation may be classified as a type of static occlusion in some building modelling applications. In our application, vegetation is identified as part of the terrestrial urban model and would not be removed from the image.
A morphological filter is employed to fill in the holes that are relatively small (less than 0.01% of the total pixels). It is important to select an image filter that does not change the position of the occlusion boundary (as erosion or dilation would), since the accuracy of the true occlusion boundary has a great effect on the occlusion removal. Figure 2-7f shows the result of filtering the original “consensus image” in Figure 2-7e.
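A sketch of this noise-suppression step using connected-component labelling from scipy.ndimage (an assumed dependency); components below the area threshold are filled from one input image, so the boundaries of the remaining, genuine occlusion holes are left untouched:

    import numpy as np
    from scipy import ndimage   # assumed available

    def suppress_noise_holes(consensus_img, fallback, frac=1e-4):
        """Fill holes smaller than frac of the image area (0.01%) from a
        fallback image such as I_1; larger holes await boundary analysis."""
        hole = np.all(consensus_img == 0, axis=2)
        labels, n = ndimage.label(hole)
        sizes = ndimage.sum(hole, labels, index=np.arange(1, n + 1))
        small_ids = 1 + np.flatnonzero(sizes < frac * hole.size)
        small = np.isin(labels, small_ids)
        out = consensus_img.copy()
        out[small] = fallback[small]
        return out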
2.5.2 Occlusion Removal
Forming closed connected set in consensus image
Each occlusion which appears as a connected hole in the consensus image is grouped
together as Sp, p=1,2,…,P, where P is the number of occlusion regions in the consensus
image. For instance, in Figure 2-7f, the consensus image has three ‘holes’ – P=3 and
Figure 2-9 shows an example of grouped connected zero pixels. Only holes greater than 5% of the largest hole are considered. In order to fill in the holes, the internal and external
boundaries of the hole in the input images are compared, to identify which of the input
images has the non-occluded data.
Figure 2-8 Definition of external boundary
Figure 2-9 Grouping of occlusion holes in the consensus image
The internal and external boundaries of the holes are defined as follows: For each
set of Sp, the internal boundary of each occlusion, Bmp, m=1,2,…M, where M is the number
of input images, is defined as the set of pixels in the zero-connected region that has at least
one neighbouring pixel in the background image. Therefore, over all the input images and occlusion sets, there will be M×P internal boundaries. The green outlines in Figure 2-10a-f are
examples of the internal boundaries of the holes in the “consensus image”. For each set of
Sp, the external boundary Ep is defined as the set of pixels which is not in Sp and has at
least one neighbour in Sp. The values of the external boundary are taken from the
consensus image.
Figure 2-10 The occlusion boundaries: (a) B11; (b) B21; (c) B12; (d) B22; (e) B13; (f) B23
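The two boundary sets defined above can be read off with binary morphology; a sketch under the same definitions (scipy.ndimage assumed; hole is the boolean mask of one connected set Sp):

    import numpy as np
    from scipy import ndimage   # assumed available

    EIGHT = np.ones((3, 3), dtype=bool)    # 8-connected structuring element

    def boundaries(hole):
        """Internal boundary: pixels of S_p with a neighbour outside the hole.
        External boundary: pixels outside S_p with a neighbour inside it."""
        internal = hole & ~ndimage.binary_erosion(hole, structure=EIGHT)
        external = ndimage.binary_dilation(hole, structure=EIGHT) & ~hole
        return internal, external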
The discontinuity across the boundaries for Sp can then be computed. Note that the
external boundary in general is larger than the internal boundary. For example, in Figure
2-8, pixels in black represent occluded data: the external boundary pixels are labelled
“u,t,w,x,y,z” and the internal boundary “a,b,c” for the purpose of illustration. The number
of pixels in the external boundary is matched to the internal boundary as a requirement for
the discontinuity measure: for every pixel in the internal boundary (for example, pixel “a” in Figure 2-8), the median of its 8-connected pixels⁶ that lie in the external boundary (for example, “u, t, w”) is computed as the “matched” external
boundary pixel. For instance, in Figure 2-8, the corresponding new external boundary for
the internal boundary },,{ cba is )},(),,,,(),,,({ zymedianzyxwmedianwtumedian .
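The boundary construction can be sketched as follows. This is a minimal Python example for a single-channel image (the thesis operates on RGB); the function name and its arguments are hypothetical, and the internal/external boundaries are derived with standard morphological erosion and dilation:

```python
import numpy as np
from scipy import ndimage

EIGHT = ndimage.generate_binary_structure(2, 2)  # 8-connectivity

def matched_boundaries(hole_mask, image_m, consensus):
    """Return paired internal/external boundary values for one hole S_p.

    Internal boundary values are read from input image I_m; each internal
    pixel is paired with the median of its 8-connected neighbours that lie
    on the external boundary (values taken from the consensus image).
    """
    internal = hole_mask & ~ndimage.binary_erosion(hole_mask, structure=EIGHT)
    external = ndimage.binary_dilation(hole_mask, structure=EIGHT) & ~hole_mask

    b_vals, e_vals = [], []
    h, w = hole_mask.shape
    for y, x in np.argwhere(internal):
        # 8-connected neighbours of (y, x) that are external-boundary pixels
        nb = [(yy, xx) for yy in range(y - 1, y + 2)
                       for xx in range(x - 1, x + 2)
              if (yy, xx) != (y, x) and 0 <= yy < h and 0 <= xx < w
              and external[yy, xx]]
        if nb:
            b_vals.append(image_m[y, x])
            e_vals.append(np.median([consensus[yy, xx] for yy, xx in nb]))
    return np.array(b_vals), np.array(e_vals)
```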
Figure 2-11 (a) Plot of the D3 function; (b) “Boundary separation” of S3; New boundaries (c) B14 (d)
B24 (e) B15 (f) B25
Testing each set of Sp for occlusion overlap: “Boundary Separation”
As explained in Section 2.5, a large difference in the discontinuity measure at the
boundary is most likely to indicate an occlusion. The discontinuity measure, dmp, is the
absolute difference between the internal boundary and the matched external boundary of Sp in image Im:
6 8-connected pixels are neighbours to every pixel that touches one of their edges or corners
$$ d_{mp} = \left| B_{mp} - E_{p} \right| \qquad ( 2\text{-}5 ) $$

$$ L_{mp} = \sum_{i=1}^{k} d_{mp}(i) \qquad ( 2\text{-}6 ) $$
where k is the number of pixels in the boundary.
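Continuing the sketch above, Eq. 2-5 and Eq. 2-6 reduce to a few lines, assuming the matched internal/external boundary values b_vals and e_vals from the hypothetical helper:

```python
import numpy as np

def discontinuity(b_vals, e_vals):
    """Per-pixel discontinuity d_mp (Eq. 2-5) and its sum L_mp (Eq. 2-6)."""
    d_mp = np.abs(b_vals.astype(float) - e_vals.astype(float))
    return d_mp, float(d_mp.sum())
```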
Herley [13] proposed the sum of the discontinuity measure, Lmp, as an indicator that
reveals the level of discontinuity over the boundary of Sp in image Im. He proposed that,
for every hole Sp, the data from the Im with the smallest Lmp be used to fill the hole. A small Lmp
indicates that data from Im are sufficient to cover the hole Sp. In the case where a
hole must be patched partly from Im and partly from In, i.e., the occlusions are not
independent objects, the discontinuity indicator Lmp would be large for all Im.
We extend Herley’s algorithm by providing a solution to allow each hole in the
consensus image to be filled with data from one or more of the Im. For a set Sp in U, if Im is
occluded over part of Sp (subset I), and not occluded over the rest (subset II); In is
occluded over part of Sp (subset II7), and not occluded over the rest (subset I); the
proposed algorithm will be able to fill subset I from In and subset II from Im. For example,
in Figure 2-10, S1 would be filled from I1 and S2 can be filled from either I1 or I2. The set
S3, however, would have to be filled partly from I1 and partly from I2. If subsets I and II intersect, i.e., part of
Sp is occluded in both Im and In8, manual image restoration would be required to remove
the occluded data. The discontinuity measure can be compared with a threshold to identify
any Sp that is not completely resolved.
To break a set Sp into subsets, a process called “boundary separation”, so that the
subsets of Sp can be filled from both Im and In, an occlusion-overlap measure Dp9 is first defined:
$$ D_{p} = d_{np} - d_{mp} \qquad ( 2\text{-}7 ) $$
7 Assuming subset I and II do not intersect
8 The common occluded area of subset I and subset II should not be identical in both Im and In, since, if this were the case, the area would not be in the set Sp.
9 Although the occlusion-overlap measure Dp can be defined for more than two images, the simplest way is to process two images at a time. The resulting image is then compared with the third image, and so forth.
Consider that, if a set Sp can be filled solely from Im, the occlusion-overlap
measure Dp would always be positive, as the discontinuity measure dnp would always be
larger than dmp. For example, in Figure 2-10, d11 will be consistently small and d21
consistently large. If Sp has to be patched partly from Im and partly from In,
Dp would be positive over the portion of the boundary that should be patched from Im, and
negative over the portion that should be patched from In. Hence the problem
of deciding whether a set Sp can be filled solely from Im is simplified to identifying whether there exists
a zero-crossing in the occlusion-overlap measure Dp. The location of the zero-crossing in
Dp indicates where to break the set Sp into sub-sets. For instance, in Figure 2-11a, D3
= d23 − d13 is plotted. The negative regions indicate filling from I2 and the positive regions from
I1. The new boundary (shown in Figure 2-11b) between the two sub-sets is formed by
connecting the boundary locations of the zero-crossings in D3 (labelled as A and B). Set S3
is therefore divided to form two new sub-sets: S4 and S5.
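A minimal sketch of the zero-crossing test, assuming the discontinuity measures d_mp and d_np have already been sampled in order along the hole boundary (the contour-ordering step itself is omitted):

```python
import numpy as np

def boundary_separation(d_np, d_mp):
    """Split a hole's boundary where D_p = d_np - d_mp changes sign (Eq. 2-7).

    Returns the zero-crossing positions along the (ordered) boundary and a
    per-pixel choice: fill from I_m where D_p > 0, from I_n where D_p < 0.
    """
    D = d_np - d_mp
    crossings = np.nonzero(np.diff(np.sign(D)) != 0)[0]  # locations A, B, ...
    fill_from_m = D > 0
    return crossings, fill_from_m
```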
Filling up consensus image zero regions
The image with the minimum Lmp, as defined in Eq. 2.6, is then selected to cover Sp.
Consider the example in Figure 2-10a,b: L11 = 16 and L21 = 49. Therefore, S1 will be filled
from I1. The final result for occlusion removal in the example is shown in the next section.
Static occlusion, for example vegetation or parked cars in an outdoor environment,
would not be removed. These objects are considered consistent occlusions, i.e., they also
exist in the LIDAR point cloud. Therefore, removal of these objects in the image is
unnecessary. Static occlusions can be removed via classification of the LIDAR data,
which will be detailed in the following chapters.
2.6 EXPERIMENTAL VALIDATION
A selection of our results is shown in Figure 2-13 (indoors), Figure 2-14 (outdoors)
and Figure 2-15 (outdoors). The proposed method is also evaluated on two sets of seven
images, taken by a calibrated Nikon D100 6-megapixel digital camera with a 14mm lens,
which together cover a field of view of up to 360°. The calibrated images with occlusions removed
are then stitched together (shown in Figure 2-16 and Figure 2-17) before being mapped
onto the LIDAR point cloud.
2.6.1 Indoor data set
Our proposed method was evaluated on a set of indoor data as shown in Figure 2-12a,b.
The results are shown in Figure 2-13a-c. Without “boundary separation”, S3 would
have to be filled from a single image with the lowest Lmp, which happened to be I1 (as
shown in Figure 2-13a). With “boundary separation”, part of the result image is still
occluded (as shown in Figure 2-13b); this is because the occluded portion is never seen in
any of the input images. The algorithm is capable of detecting this issue and prompts the user for
manual retouching or shape retrieval [35].
Figure 2-12 (a), (b) Indoor image sequence with occlusion
Figure 2-13 Results of artefact removal for input images in Figure 2-12 (a) Without “boundary
separation” (b) With “boundary separation” (c) After manual retouch
2.6.2 Outdoor data set
Outdoor data set I
The proposed occlusion removal method with “boundary separation” is compared against
Herley’s method in the outdoor data set shown in Figure 2-14a. Part of the occlusion in
both images overlapped, as shown in the filtered consensus image in Figure 2-14b. The
result of occlusion removal without “boundary separation” still includes occlusion (the full
figure of a person is not removed), despite that region being unoccluded in one
of the input images. With “boundary separation”, all occlusions are removed automatically,
as the part where the occlusions overlap is relatively small.
Figure 2-14 Input images with occlusions and resulting images: (a) Input image sequence; (b) Consensus image; (c) Without “boundary separation”; (d) With “boundary separation”
Outdoor data set II
The proposed occlusion-removal method was evaluated on an outdoor data set in a busy
environment, as shown in Figure 2-15a. The consensus image is shown in Figure 2-15b
and the final “occlusion-free” image is shown in Figure 2-15c. Except for parts of the
image that are occluded in both inputs, most of the large occlusions in the input sequence
are removed. Some of the occlusions (far from the camera) that are not removed are
relatively small and insignificant.
Figure 2-15 Input images with occlusions and result of implementation II: (a) Input image sequence; (b) Consensus image; (c) Result image
2.6.3 Panoramic Data Set
Panoramic Data Set I
The images in Figure 2-16a,b are original input image sets that are stitched into panoramic
images. Figure 2-16c is the occlusion-free panoramic image. The original images were
individually processed for occlusion removal before being stitched into the panoramic
image. A simple linear image blending technique was used, where the adjoining areas of
the component images were matched for colour, brightness and contrast, in order to
correct the different exposures in different images.
Note that blending is not performed on the pixels belonging to the sky. To explain
this, we note first that the range of intensity levels that can be recorded by a sensor without
clipping or saturation is often referred to as the dynamic range. The JPEG images stored
by a digital camera provide only 8-bits of intensity information, which is usually
insufficient to capture the entire dynamic range for real outdoor scenes containing bright
and dark areas. When images of different exposures are stitched together to form a
panoramic image, a higher dynamic range, i.e., greater than 8-bits, is required for image
blending. However, blending of the sky is unnecessary for the proposed application in this
thesis. This is because the sky is not picked up by the LIDAR. Figure 2-16c depicts the
result of the combination of the occlusion-free image and the LIDAR data.
Panoramic Data Set II
Similar to the “Panoramic Data Set I”, Figure 2-17a and b depict the panoramic
view of the images acquired from the calibrated Nikon D100 digital camera. Occlusions in
the input images were removed. The stitched and blended panoramic view is shown in
Figure 2-17c. As mentioned in Section 2.5, analogous to vegetation (as a type of static
occlusion in building modelling), the parked cars are not removed. The result of the
combination of the occlusion-free image and the LIDAR data is shown in Figure 2-17d.
Figure 2-16: (a),(b) Input images; (c) Resulting image after occlusion removal and colour blending; (d)
Combination of the occlusion-free image and the LIDAR data
Figure 2-17: (a),(b) Input images; (c) Resulting image after occlusion removal and colour blending; (d)
Combination of the occlusion-free image and the LIDAR data
2.7 CONCLUSION
In this chapter, the problem of occlusion inconsistency between the acquired LIDAR
data and the image data was identified. The importance of pre-processing of the data to
remove the moving occlusion (that causes the inconsistency) in both range and image data
was shown. The proposed occlusion removal approaches for both data types were
explained and evaluated on the acquired urban data set.
Occlusion removal in LIDAR data can generally be solved by taking more than
two scans in the same scan location and taking the greatest depth of each point. For image
data, the algorithm for image occlusion removal using a minimum number of images detailed in [13]
was extended to include the capability of detecting and removing occlusions in input
images that overlap. The extended algorithm was tested on outdoor images that included
moving occlusions such as pedestrians. The algorithm is capable of removing most of the
occlusions that cause the inconsistency. The occlusions that are not removed are in general
due to the lack of an unobstructed view of the background and are relatively small and
insignificant. Static occlusions, for example vegetation and cars parked in one spot over
the entire duration of image capture and laser scan, can be removed by processing the
LIDAR data, i.e. classifying the LIDAR into different classes and removing the particular
objects. This will be explained in the rest of the thesis.
3.0 FEATURE DESCRIPTORS FOR 3D CLASSIFICATION
3.1 INTRODUCTION
This chapter describes the extraction of features for the classification of outdoor-scanned
LIDAR data. In order to reconstruct 3D city models from the acquired LIDAR data, first
the raw data have to be divided into different classes. A low-level classification would be
to divide the data into “planar” (for example, façades of man-made buildings, pathways
and terrain), “linear” (for example, cylindrical poles, tree trunks) and “cluttered” (for
example, vegetation, trees) data types. As discussed in Chapter 2, one of the most
important tasks in the classification process is to extract the right features, i.e. features that
will capture the relevant relationships among observations.
The term ‘features’ means variables extracted from the raw data that are
appropriate and distinctive for correct classification with low probability of mismatch. A
feature descriptor can also be seen as a special form of data and dimensionality reduction.
Generally, feature extraction is divided into feature construction and feature selection.
Feature construction can be seen as the process of obtaining new variables by a linear or
non-linear transformation of the original raw data. Feature selection is used to determine if
the extracted features are sufficient to identify the different classes and to eliminate
redundant features [36].
We focus on feature construction in this chapter because feature selection is
usually only necessary when there is a relatively large number of features and relatively
few (training) data. In such a case, many features are often redundant or have such a high
dimensionality that it is difficult to determine which features are appropriate for which
classes. In contrast, we have a relatively large amount of training data and fewer features.
Thus, for our classification problem, we concentrate on the construction of distinctive
feature descriptors for outdoor LIDAR data.
Existing features used for 3D data classification include intensity [23], height [21-
23, 37], surface curvature [21], spin image [37-39], shape distribution [40, 41], local
tensors [42], shape maps [43], 3D active contour [44], normals [22] and colour [45]. These
features are often combined together, or treated independently as feature descriptors, for
urban classification. For example, in order to identify the building points, Anguelov et al.
[37] first use height to filter out most of the terrain points. The authors then compute the
minimum eigenvalue of the region covariance (of the spatial coordinates of 100 sampled
points in a cube of radius 0.5 meters) to identify the principal plane location. The cube is
then partitioned into 3x3x3 bins around the point, oriented with respect to the principal
plane. The percentage of points lying in various sub-cubes provides the information on the
local distribution. The location of the planes in the data can then be identified. In order to
identify linear objects such as trees, Anguelov et al. [37] compute a cylinder of radius 0.25
meters which extends vertically to include all the points in a “column”. The percentage of
the points that lie in various segments of the column (e.g., between 2m and 2.5m) is then
computed as one of the feature descriptors. Another feature, an indicator of whether a
point lies within 2m of the ground, is used to identify the bushes. Most of these features
rely on fixed thresholds and therefore require readjustment for different data types or
different scanning resolutions.
In another example, Triebel et al. [46] first divide the data into walls using a plane
extraction algorithm. Several methods exist to robustly extract planes from outdoor data
which are explained in detail in Chapter 8. The authors then use three types of feature
descriptor to distinguish ‘window’, ‘wall’ and ‘gutter’: 1) the cosine of the angles between
the local normal vectors near each point and the plane normal vector; 2) the distribution of
neighboring points in front of and behind the extracted planes and 3) the normalised
height of each point.
In general, two of the most common geometric features are region covariance and
estimated normals. The covariance matrix extracted from a region is usually sufficient as a
region descriptor to match the region under different poses, rotations and scales [47].
region covariance does not have any information regarding the ordering and the number of
points, which means that it has a certain scale and rotation invariance over the regions. For
the purpose of discriminating data belonging to the flat terrain from data belonging to the
building surface, information on the direction of the estimated surface normal vector is
useful. In the next sections, we focus on the region covariance and estimated normals. The
features are computed and evaluated on both synthetic and outdoor LIDAR data.
3.2 REGION COVARIANCE AS 3D FEATURE DESCRIPTOR
One of the popular features for 3D data classification is derived from the estimated region
covariance matrix [47]. A region covariance matrix is a matrix of covariances between
elements of a number of neighbouring data. For example, we generated 3D data of a plane
with Gaussian noise of standard deviation of 0.05, as shown in Figure 3-1. This has the
following covariance matrix:
$$
\begin{bmatrix}
833.505 & 0.325 & -1.8765 \\
0.325 & 833.495 & -2.8041 \\
-1.8765 & -2.8041 & 0.2429
\end{bmatrix}
$$
Figure 3-1 Plane data with Gaussian noise of standard deviation = 0.05
By performing principal eigenvalue decomposition of a region covariance matrix,
also known as Principal Component Analysis (PCA), the data is transformed to a new
coordinate system. If a 3-dimensional data set is given, the greatest variance by any
projection of the data comes to lie on the first coordinate, the second greatest variance on
the second coordinate and the third greatest variance on the third coordinate. The
variances describe the shape and the coordinates describe the orientation of the data. We
can use Singular Value Decomposition (SVD)10 to perform PCA [48].
For example, the data in Figure 3-1 have eigenvalues of 0.0025, 833.3 and 833.9.
The smallest eigenvalue, 0.0025, which is the estimated variance along the least
dominant direction of the data, matches the variance of the generated plane data11. This
smallest eigenvalue of the estimated region covariance is often computed to determine the
planarity of the data [49, 50]. For instance, Stamos and Allen [50] classify each point into
planar and non-planar data by thresholding the smallest eigenvalue of the
covariance matrix of the k neighbouring data. In practice, the smallest eigenvalue for a planar
data set may range from a small value (for a relatively smooth building surface) to a larger
value (for a rougher building surface). Classification using solely the smallest eigenvalue
can be difficult, as a small cluttered data set may be confused with a rough plane.
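The example is easy to reproduce. In the following numpy sketch, the ±50 uniform spread is an assumption chosen because its variance, 100²/12 ≈ 833, is of the same order as the eigenvalues quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic plane in the x-y plane: large spread in x and y, Gaussian
# noise of standard deviation 0.05 in z (cf. Figure 3-1).
n = 1000
pts = np.column_stack([rng.uniform(-50, 50, n),
                       rng.uniform(-50, 50, n),
                       rng.normal(0.0, 0.05, n)])

cov = np.cov(pts, rowvar=False)        # 3x3 region covariance matrix
eigvals = np.linalg.eigvalsh(cov)      # ascending order
print(eigvals)  # smallest eigenvalue is close to 0.05**2 = 0.0025
```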
10 Details of the SVD algorithm can be found in Appendix II
11 variance = (standard deviation)² = (0.05)² = 0.0025
To minimise the dependency of the features on the smallest eigenvalues, Lalonde
et al. [51] derived a saliency feature descriptor using the relationship between the
eigenvalues of the region covariance, instead of solely the smallest eigenvalue. Let
λ1 > λ2 > λ3 be the eigenvalues of the covariance matrix of the k nearest neighbours. In the case
of clutter, λ1 ≈ λ2 ≈ λ3 and there is no dominant direction. For points on surfaces (where
λ1, λ2 >> λ3) and for linear structures (where λ1 >> λ2, λ3), the saliency features are
evaluated using Eq. 3.1:
$$
\begin{bmatrix} \text{clutter-ness} \\ \text{surface-ness} \\ \text{curve-ness} \end{bmatrix}
=
\begin{bmatrix} \lambda_{1} \\ \lambda_{1} - \lambda_{2} \\ \lambda_{2} - \lambda_{3} \end{bmatrix}
\qquad ( 3\text{-}1 )
$$
3.2.1 Extension to Saliency Features
The described saliency features have a disadvantage. To explain it, we first note
that the region covariance of a point p, as described in the previous section, may be defined
over i) the k neighbouring points, or ii) all points within a defined distance (which may be
determined adaptively for different data points) from point p. In both cases, the size of the
region covariance, defined by the distance of the neighbouring point furthest away from p,
may vary considerably. As a result of this size variation, the measures of ‘surface-ness’ and
‘curve-ness’ are inconsistent and can spread over a large range. To eliminate the effect of
size changes, we normalised the 1st and 2nd largest eigenvalues with the size of the region
covariance, r. The new saliency features are as follows:
$$
\begin{bmatrix} \text{clutter-ness} \\ \text{surface-ness} \\ \text{curve-ness} \end{bmatrix}
=
\begin{bmatrix} \log(\lambda_{1}) \\ \log\!\big((\lambda_{1} - \lambda_{2})/r\big) \\ \log\!\big(\lambda_{2}/r - \lambda_{3}/r\big) \end{bmatrix}
\qquad ( 3\text{-}2 )
$$
After normalisation, the eigenvalues become invariant to the size of the region
covariance, especially for planar-like and linear-like data. The features are more
distinguishable, as depicted in the experiment with synthetically-generated data (Figure
3-7). For point-like data, the region covariance is relatively small and the eigenvalues are
similar in size; therefore, normalisation of the 3rd largest eigenvalue is unnecessary.
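A compact sketch of both feature vectors, assuming the neighbourhood is given as a (k, 3) array and taking r as the distance from the centroid to the furthest neighbour (one plausible definition of the region size):

```python
import numpy as np

def saliency_features(neighbours, normalise=False, eps=1e-12):
    """Saliency features of Eq. 3-1 and their size-normalised form (Eq. 3-2).

    `neighbours` is a (k, 3) array of points around the point of interest.
    """
    centred = neighbours - neighbours.mean(axis=0)
    cov = centred.T @ centred / len(neighbours)
    l1, l2, l3 = np.sort(np.linalg.eigvalsh(cov))[::-1]   # λ1 ≥ λ2 ≥ λ3
    if not normalise:
        return np.array([l1, l1 - l2, l2 - l3])           # Eq. 3-1
    r = np.linalg.norm(centred, axis=1).max()             # region size
    return np.log(np.array([l1,                           # Eq. 3-2
                            (l1 - l2) / r,
                            l2 / r - l3 / r]) + eps)
```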
3.2.2 Validation of the Extended Saliency Features
In this section, the extended features defined in Eq. 3.2 are evaluated using synthetically
generated data and outdoor LIDAR data acquired from a Riegl terrestrial laser scanner.
3.2.2.1 Synthetic Data
The extended features are evaluated using a set of synthetic data with 600 planar
data and 600 cluttered data, as shown in Figure 3-2. The planar data are corrupted by
Gaussian noise of standard deviation 0.05 and the cluttered data by Gaussian noise with variance
of 3. The size of the region covariance is adaptively determined12. The eigenvalues of the
region covariance, the saliency features and the normalised saliency features are computed
for every point, as depicted in Figure 3-3 to Figure 3-8:
12 The size of the region covariance is determined using an extended 3D scale theory approach, explained in
detail in Chapter 4 - 3D Over-segmentation. The optimal size of the region covariance is iteratively
estimated with equations that depend on the estimated curvature, density, noise and the colours of the points.
Figure 3-2 Plane data with Gaussian noise of standard deviation = 0.05 with cluttered data
Figure 3-3 Eigenvalues for Planar Data
Figure 3-4 Eigenvalues for Cluttered Data
As shown in Figure 3-3, the two largest eigenvalues for planar data are relatively
large compared to the smallest eigenvalue. The three eigenvalues are similar for cluttered
data, as shown in Figure 3-4.
Figure 3-5 Saliency Features for Planar Data
Figure 3-6 Saliency Features for Cluttered Data
Figure 3-7 Normalised Saliency Features for Planar Data
Figure 3-8 Normalised Saliency Features for Cluttered Data
As shown in Figure 3-5, most of the saliency values of the planes features are
greater than the saliency values of the lines and points features for planar data. Also, most
of the saliency values of the points features are greater than the saliency values of the lines
and planes features for cluttered data (Figure 3-6). The normalisation of the saliency
values (Figure 3-7 and Figure 3-8) successfully discriminates the points data – all saliency
values of the points features are greater than the saliency values of the planes and lines
features. The distinctive saliency features are then exploited to geometrically classify the
data with a learning model.
3.2.2.2 Outdoor LIDAR Data
We also validated the extended method with outdoor LIDAR data of 3648 points
shown in Figure 3-9 in the following experiment. The outdoor LIDAR data consist of a
vertical plane (building wall, shown in blue); a horizontal plane (pathway and grass
terrain, shown in green) and clutter (tree, shown in red).
Figure 3-9 Outdoor LIDAR data
Figure 3-10 Eigenvalues for Planar Data
Figure 3-11 Eigenvalues for Cluttered Data
As shown in Figure 3-10, the two largest eigenvalues for planar data are relatively
large compared to the smallest eigenvalue. The three eigenvalues are similar for cluttered
data, as shown in Figure 3-11.
Figure 3-12 Saliency Features for Planar Data
Figure 3-13 Saliency Features for Cluttered Data
Figure 3-14 Normalised Saliency Features for Planar Data
Figure 3-15 Normalised Saliency Features for Cluttered Data
3.3 ESTIMATED NORMALS AS 3D FEATURE DESCRIPTOR
In addition to region covariance, as explained in Section 3.1, the estimated surface normal
is useful to discriminate data belonging to the flat terrain from data belonging to the
building surface. This section explains the general methods to estimate surface normal
vectors from 3D data (in Sections 3.3.1 and 3.3.2). The explained methods are then
evaluated and compared for the synthetic and outdoor LIDAR data.
A normal to a surface at a point is the normal to the estimated tangent plane to that
surface at that point. The surface normal is useful in distinguishing some data classes and
can also be a good representation of texture. There are several ways to estimate the tangent
plane. One method is via Delaunay Triangulation (DT) or its dual Voronoi Diagram, as
shown in Figure 3-16 below.
3.3.1 Delaunay/Voronoi Method
The Delaunay Triangulation is a set of triangles that connect each data point to its
neighbours, with the property that for each triangle, the unique circle circumscribed about
the triangle contains no data point. The normal vector for the data point can be calculated
as the weighted average of the normal vectors of the triangles formed by each data point
and pairs of its neighbours. There are numerous variations of the weighted average,
including angle-weighted, area-weighted, centroid-weighted and gravitational-weighted
methods. Comparisons of these methods can be found in [52] and [53].
Figure 3-16 Delaunay Triangulation and its dual Voronoi Diagram
Related to the Delaunay triangulation, the Voronoi diagram is a closest-point
plotting technique which consists of a set of Voronoi polygons. A Voronoi polygon for
point p encloses all the intermediate points that are closer to point p than to any other point
in the set of coplanar points, as shown in Figure 3-16. The centres of the circumscribed spheres of the
Delaunay triangles, known as the Delaunay balls, are the vertices of the Voronoi
diagram. In noise-free data, the normal vector for a data point can be approximated by
its pole [54]. The pole is defined as the line through each data point and its furthest
Voronoi vertex. The poles are also the centres of the largest Delaunay balls incident to the
point p on both sides of the sampled surface; these largest Delaunay balls are also known
as the polar balls.
For noisy data, Dey et al. [55] extended the Delaunay Balls algorithm by
introducing Big Delaunay balls (BDB), i.e. only Delaunay balls incident on point p larger
than a threshold are used to estimate the normals. The algorithm starts by computing the
Delaunay Triangulation for the data points P. For each point p, the average distance to the
k nearest neighbor, λp, is computed. Next, the Delaunay ball incident to the point p with
radius greater than cλp is marked as BDB, where c is a user-defined parameter. The
normal vector for point p can then be estimated as the line through p and its pole. Note
that for points where none of the incident Delaunay balls is marked as a BDB, no normal
can be estimated. One solution is to interpolate the normals for these points
from the neighbouring normals. The sensitivity of the algorithm to noise relies on the
values of k and c. Dey et al. fixed k at five, i.e., λp is the average distance of point p to its
five nearest neighbours. The value of c is then determined empirically by finding the c
that minimises the error between a set of reference normals (computed from clean data)
and the normals estimated from the data with noise added.
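The BDB construction can be sketched as follows, assuming scipy’s Delaunay triangulation and ignoring degenerate tetrahedra; the O(n·m) incidence scan is kept deliberately simple:

```python
import numpy as np
from scipy.spatial import Delaunay, cKDTree

def bdb_normals(points, k=5, c=2.5):
    """Big-Delaunay-ball normal estimation, following Dey et al. in outline.

    For each point p, Delaunay balls incident on p with radius > c * λ_p
    (λ_p = average distance to the k nearest neighbours) are kept; the
    normal is the direction from p to its pole, the centre of the kept
    ball furthest from p. Points with no big ball get a NaN normal.
    """
    tri = Delaunay(points)
    # Circumcentres of the tetrahedra: requiring |c - v_i|^2 equal for all
    # vertices gives the 3x3 linear system 2(v_i - v_0) c = |v_i|^2 - |v_0|^2.
    centres = np.empty((len(tri.simplices), 3))
    radii = np.empty(len(tri.simplices))
    for i, s in enumerate(tri.simplices):
        v = points[s]
        A = 2.0 * (v[1:] - v[0])
        b = np.sum(v[1:] ** 2 - v[0] ** 2, axis=1)
        centres[i] = np.linalg.solve(A, b)
        radii[i] = np.linalg.norm(centres[i] - v[0])

    lam = cKDTree(points).query(points, k=k + 1)[0][:, 1:].mean(axis=1)

    normals = np.full_like(points, np.nan, dtype=float)
    for i, p in enumerate(points):
        incident = np.nonzero((tri.simplices == i).any(axis=1))[0]
        big = incident[radii[incident] > c * lam[i]]
        if len(big):
            pole = centres[big[np.argmax(np.linalg.norm(centres[big] - p, axis=1))]]
            normals[i] = (pole - p) / np.linalg.norm(pole - p)
    return normals
```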
3.3.2 Numerical Optimization Methods
Another approach for surface normal estimation is via numerical optimization methods.
One of the most commonly applied numerical optimization methods is least-squares
plane fitting, proposed by Hoppe et al. [56] in 1992. The method estimates
tangent planes of the point cloud data by fitting local planes with minimum fitting error to
the data. There are two kinds of fitting error. In traditional least squares, or the
regression plane, the x and y values are fixed and the fitting error is only in the z (vertical)
direction. A variation on the traditional least squares, the total least squares (TLS) or
orthogonal distance regression plane minimises the perpendicular distances to the plane
(as shown in Figure 3-17a), i.e. there is fitting error in all three coordinates. The traditional
least-squares fitting, which minimises the vertical deviations [yi − f(xi, a1, a2, a3, a4)] (as
shown in Figure 3-17b), where (a1, a2, a3, a4) are the plane coefficients, does not minimise
the actual deviation.
Figure 3-17 Fitting Error of (a) Total least squares (TLS) or the orthogonal distance regression (b)
Least squares
In total least squares, for every point p, the algorithm finds the best-fit local plane
n^T x = c that minimises the cost function e(n, c) under the constraint n^T n = 1. The classical
cost function is the sum of squares of the fitting errors for point p and its k nearest points, as
depicted in Eq. 3.3. The estimated surface normal n is then the unit vector perpendicular to
the fitted tangent plane. Since the cost function e can be stated as a linear problem in matrix-
vector notation, the minimiser can be expressed directly as the result of an SVD (details of
the SVD algorithm are given in Appendix II).
$$ e(\mathbf{n}, c) = \sum_{i=1}^{k} \left( \mathbf{n}^{T} \mathbf{p}_{i} - c \right)^{2} \qquad ( 3\text{-}3 ) $$
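The minimiser of Eq. 3-3 follows directly from the SVD of the centred neighbourhood: minimising over c first gives c = n^T p̄, and substituting back reduces the problem to finding the direction of smallest variance. A minimal sketch:

```python
import numpy as np

def tls_normal(neighbours):
    """Total-least-squares plane fit (Eq. 3-3) via SVD.

    The unit normal n minimising sum_i (n^T p_i - c)^2 with n^T n = 1 is
    the right singular vector of the centred data with the smallest
    singular value; the optimal c is then n^T p_bar.
    """
    p_bar = neighbours.mean(axis=0)
    _, s, vt = np.linalg.svd(neighbours - p_bar)
    n = vt[-1]                    # direction of smallest variance
    return n, float(n @ p_bar)    # plane n^T x = c
```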
Note that, in general, the estimated normal to a surface is not oriented. A surface normal
can be ‘positively’ (right-handed) or ‘negatively’ (left-handed) oriented, as shown in
Figure 3-18. A straightforward approach to orienting the normal vectors is to multiply
each normal vector by the sign of its z-component, so that all normal vectors have a positive
z-component. In the case where a consistent normal orientation is highly desirable (for
example, having the normal vectors of points belonging to the overhang of a roof pointing
downward instead of in the positive z-direction), the method described in [56] can be adopted.
The method works by first adjusting the sign of the estimated normal vector of the point with the
largest z coordinate to ensure it has a positive z-component. The orientation is then
“propagated” to the neighbouring points in a Riemannian graph, a connected graph
constructed to encode the geometric proximity of the point data.
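A simplified sketch of the propagation idea (the original method orders the propagation along a minimum spanning tree of the Riemannian graph; a breadth-first traversal of a kNN graph is used here for brevity, and assumes the graph is connected):

```python
import numpy as np
from scipy.spatial import cKDTree
from collections import deque

def orient_normals(points, normals, k=8):
    """Propagate a consistent normal orientation over a kNN proximity graph.

    Fix the sign of the normal at the point with the largest z, then flip
    each neighbour's normal if it disagrees (negative dot product) with
    its already-oriented parent.
    """
    nbrs = cKDTree(points).query(points, k=k + 1)[1][:, 1:]  # drop self
    seed = int(np.argmax(points[:, 2]))
    if normals[seed, 2] < 0:
        normals[seed] *= -1.0

    seen = np.zeros(len(points), dtype=bool)
    seen[seed] = True
    queue = deque([seed])
    while queue:
        i = queue.popleft()
        for j in nbrs[i]:
            if not seen[j]:
                if normals[i] @ normals[j] < 0:
                    normals[j] *= -1.0
                seen[j] = True
                queue.append(j)
    return normals
```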
Figure 3-18 Ambiguity of the Orientations of Surface Normals
3.3.3 Evaluation of the Surface Normal Estimation Approaches
We compared the performance of the BDB and the TLS methods for surface
normal estimation on the synthetic (Figure 3-2) and outdoor LIDAR (Figure 3-9) data.
The normal vectors (shown as red arrows) are estimated for every data point and plotted in
Figure 3-19 and Figure 3-20 respectively. In our experiment, the number of nearest
neighbours in TLS is determined adaptively13. In the BDB experiments, the five nearest
neighbours of p are selected for each p. The other user-defined parameter, c, is fixed at 2.5
in the experiment. Unlike TLS, the BDB algorithm is not as sensitive to the choice of k, as
the algorithm only requires the average distance of the point to its local neighbours.
Figure 3-19 Estimated normal vectors using the BDB method
13 Similar to the computation of region covariance, an adaptive k is essential for computation of TLS in
complicated outdoor data. This is explained in detail in Chapter 4 - 3D Over-segmentation.
3.3.3.1 Synthetic Data
Figure 3-20 Estimated normal vectors using the TLS method
For the synthetic data in our experiment, the reference normal is given by the plane
coefficients used to generate the data belonging to the plane. The data belonging to the
plane are corrupted by Gaussian noise of standard deviation 0.05 and the cluttered data are
Gaussian noise with variance of 3.
As mentioned in Section 3.3.1, the Delaunay/Voronoi method does not guarantee
normal estimation for all data points. As depicted in Figure 3-19, no normal is estimated
for some of the cluttered data. Given that the surface normal feature is generally
used for identifying supporting surfaces14, such as ground with horizontal support or
building walls with vertical support, not having a normal estimated at the cluttered data
points is not an issue.
With the TLS method, the estimation of surface normal for data belonging to a
plane outperforms the BDB method, as shown in Table 3-1 and the close-ups of the results
14 In our 3D data classification, the extended saliency features explained in Section 3.2.1 are used to filter
out the cluttered data. The estimated normal feature is then used to classify the planar data into different
classes.
in Figure 3-21 and Figure 3-22. The estimation of normal vectors using the BDB approach
is less robust to noisy data in the experiment. Also, BDB performs poorly in estimating the
normal vectors at the edge of the planar data. This can be explained by understanding the
nature of the BDB algorithm, in that the algorithm is designed for data belonging to an
irregular surface. In BDB, only intermediate points in the Voronoi Diagram are used in the
estimation for each data point.
In contrast, the estimation of normal vectors using the TLS method does not rely only
on local information. This is more desirable as, ideally, the estimation of normal
vectors of the data belonging to a plane should exploit information from all points on the
plane. The adaptive k estimation in the TLS approach can be designed to select a larger
number of k neighbouring points which belong to the same plane. The close-up of the
normal vectors estimation in Figure 3-22 shows the robustness of the algorithm in the
noisy data.
The mean square error of the estimated normal vectors and the ground truth for the
data belonging to the plane is shown in Table 3-1:
MSE        No of points    TLS             BDB
Plane      900             3.43 × 10⁻⁵     0.0714

Table 3-1 Mean Square Error of Normal Estimation of synthetic data using TLS and BDB
Figure 3-21 Estimated Normals using the BDB Approach (Close-up of data belonging to the plane in
Figure 3-19)
Figure 3-22 Estimated Normals using the TLS Approach (Close-up of data belonging to the plane in
Figure 3-20)
3.3.3.2 Outdoor LIDAR Data
We compared the approaches with the set of outdoor LIDAR data shown in Figure
3-9. The reference normal used for comparison is obtained by first manually segmenting
the vertical and horizontal planes. Least-square planes are then computed for both plane
data. The resulting normal vectors for both best-fit planes are used as the reference normal.
Similar to the synthetic data set, TLS works better than BDB in the normal vector
estimation for data belonging to planes, as depicted in Table 3-2, Figure 3-23 and Figure
3-24. The results show an MSE of 0.0148 for the vertical plane using the TLS method,
while the BDB method has an MSE of 0.028. As the size of the local neighbourhood for the TLS
method is adaptive, the estimation of normal vectors, particularly for data belonging to the
horizontal grass terrain, is more robust using TLS.
MSE                 No of points    TLS       BDB
Vertical plane      697             0.0148    0.028
Horizontal plane    1484            0.0815    1.2427

Table 3-2 Mean Square Error of Normal Estimation of LIDAR data using TLS and BDB
Figure 3-23 Estimated normal vectors using the BDB method
Figure 3-24 Estimated normal vectors using the TLS method
3.4 CONCLUSION
In this chapter, we provided background studies on feature descriptors for 3D outdoor
LIDAR data classification and selected two types of feature descriptors for our
experiments. We extended the saliency features and validated the method on both
synthetically-generated data and outdoor LIDAR data acquired from a terrestrial laser
scanner. The extended features are shown to be distinctive for the three data classes of
interest (linear, planar and cluttered). The features are also validated to be invariant to the
size of the region neighbourhood.
In order to classify data belonging to planes into different classes, we next selected
normal vectors as a feature descriptor. We compared the two most commonly used methods,
i.e. total least squares (TLS) and big Delaunay balls (BDB), for the estimation of surface
normal vectors. We showed that with adaptive estimation of nearest neighbouring points,
the TLS method is more robust for normal estimation of noisy multi-structure planar data.
4.0 3D OVER-SEGMENTATION
4.1 INTRODUCTION
Previous work in 3D data labelling has mostly been point-based [37, 46, 57, 58],
which introduces redundant computations for two reasons. Firstly, classifying every single
point is unnecessary. This is because most neighbouring points have similar features. For
example, points on the same plane will most likely have similar colour and similar
(estimated) normals. The amount of redundancy increases with the resolution, especially
for data with large planar surfaces. Therefore, labeling every point will result in a high
computational load which can be reduced by classifying a smaller sub-set of the data.
Next, depending on the type of learning model used for data labelling, i.e.
discriminative or generative, using the complete data set can be unnecessary. To
understand this, note that complete data are essential for the estimation of the prior parameters
(which depend on the ratio of data of different classes to the total amount of data) of the
learning model. Unlike the generative models, the discriminative models do not require
estimation of the priors. Additional training data with similar features do not affect the
estimated parameters of the discriminative learning model [59].
Previous work [60] has shown that discriminative models generally perform
better with larger training sets compared with generative classifiers, as the discriminative
classifier reaches a lower asymptotic error, albeit at a slower rate. In the urban modelling
application, the amount of training data is in general relatively large, making the
discriminative model suitable for the application. The discriminative model needs to “see”
all possibilities during training of the learning model15
. This requirement is achievable
with over-segmentation if the reduction of training data does not affect the degree of
variation of the data. We propose an over-segmentation algorithm that groups training data
with homogeneous features, therefore maintaining the degree of variation of the data to a
reasonable extent. An overview of the existing over-segmentation algorithm for 2D and
3D data will be discussed in the next section, followed by an elaboration of our proposed
algorithm and validation of the approach on synthetic and outdoor terrestrial 3D laser data.
4.2 BACKGROUND OF OVER-SEGMENTATION
One solution to the redundancy issue in computation is to group similar points and
then label the groups instead of individual points. This approach is well-established in 2D
image analysis. The raw image is often over-segmented into a higher level representation
(compared with the individual pixels) to avoid point-based classification. Over-
segmentation is the process by which the objects being segmented from the background
are themselves segmented into sub-components, depending on the level of a visual
similarity criterion. Complete segmentation generally requires cooperation with higher
processing levels that use specific knowledge of the problem domain (that match the real-
world object); whereas over-segmentation groups regions that are homogeneous with
respect to a chosen property such as brightness, colour or texture. Although over-
segmentation is often only a means to an end in segmentation problems, the process
15 To explain this statement, we first look at how the generative and discriminative models learn. The
generative model indirectly learns P(Y|X) on the basis of Bayes’ rule. In Bayes’ rule, the class-prior
probabilities and class-conditional densities are computed separately. For example, the generative Gaussian
Mixture Models (GMM) can be used to model the class-conditional densities, where a Gaussian is fitted to
data in each class. In contrast, the discriminative model directly learns P(Y|X) from the training data, for
example, making point estimates of the parameters using maximum likelihood. As a result, the
discriminative model needs to observe all possibilities during training.
increases the chances that boundaries of importance have been extracted for data
classification. Most of the previous segmentation approaches [61, 62] have shown that
extracting all objects of interest from the background, or each other, is difficult without
over-segmenting the data.
In 2D images, the graph cuts method is commonly employed for over-segmentation
(to group similar regions) [63]. An image is represented as a graph G = (V, E), where the
nodes V are the pixels in feature space and an edge in E is formed between every pair
of nodes. The graph can be partitioned into two disjoint sets, A and B, where A ∪ B = V
and A ∩ B = ∅, by removing the edges connecting the two parts. The degree of difference
between the two parts can be computed as the total weight of the edges that have been
removed, also known as the cut value. Finding the minimum cut value gives the
optimal bi-partitioning of the graph, i.e. an over-segmentation of the image.
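As a toy illustration, the cut value and bi-partition can be computed with networkx on a small pixel graph. Note that this is the seeded s-t minimum-cut variant rather than the normalised cut typically used in [63], and the capacity function and its scale are illustrative assumptions:

```python
import numpy as np
import networkx as nx

def min_cut_bipartition(image, seed_a, seed_b):
    """Toy illustration of bi-partitioning a pixel graph by a minimum cut.

    Capacities decay with intensity difference, so the cheapest set of
    edges to remove (the minimum cut value) tends to follow strong image
    boundaries. seed_a and seed_b are (row, col) pixels forced apart.
    """
    h, w = image.shape
    G = nx.DiGraph()
    for y in range(h):
        for x in range(w):
            for dy, dx in ((0, 1), (1, 0)):              # 4-neighbourhood
                yy, xx = y + dy, x + dx
                if yy < h and xx < w:
                    cap = float(np.exp(-abs(float(image[y, x]) -
                                            float(image[yy, xx])) / 10.0))
                    G.add_edge((y, x), (yy, xx), capacity=cap)
                    G.add_edge((yy, xx), (y, x), capacity=cap)
    cut_value, (part_a, part_b) = nx.minimum_cut(G, seed_a, seed_b)
    return cut_value, part_a, part_b
```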
However, there is a vast difference between 2D image pixel and 3D point cloud
processing – the sampling pattern of 3D point clouds is irregular, thus lacking an
organised lattice-like structure compared to the 2D image with regular lattice. In addition,
in terms of feature descriptors, the search for a coherent region in an image, compared
with 3D data segmentation, imposes very different requirements. Texture and colour are
generally used as similarity metrics for segmentation purposes in 2D images [64, 65],
whereas curveness is the main criterion for segmentation in 3D point clouds.
In reviewing the literature related to processing 3D point clouds, the author has not
located an algorithm similar to our proposed over-segmentation theory. To remove the
redundant data points, Triebel et al. [46] performed kd-tree pruning (in the 3D data
labelling process) which prunes according to the position of the point and its label.
Therefore, their method only reduces the 3D data used for model training and cannot be
applied to the inference of the model. The other methods of point cloud reduction [14, 15,
55, 66] are not suitable for direct application for the present purpose, as mentioned in the
introduction. We need an algorithm that not only reduces the point cloud set, but also
retains information from the removed point cloud. That is, the information from the
removed points should still contribute to the classification of the remaining point cloud,
similar to the over-segmentation approach for 2D image processing.
In this chapter, we explain our approach to 3D over-segmentation in which we
have designed an adaptive support region, namely super-voxel, for the purpose. Super-
voxels are the result of an over-segmentation of the 3D point cloud. The super-voxel
reduces the complexity of the raw data and provides a longer range of interaction to the
data. It is also a perceptually-consistent unit that is uniform in the underlying data
structure and colour. We also identify two important factors for the design of the 3D over-
segmentation algorithm that must be considered:
4.2.1 Shape of the Super-voxel
A major factor of concern in the design of the super-voxel is the shape. A closely-related
area to the design of a super-voxel is the design of a region descriptor. A region descriptor
is a compact intermediate representation of the input data, often computed from connected
sets of relatively homogeneous data. Previous studies in determining the shape of a region
descriptor can therefore be applied to the super-voxel. The most common shapes for region
descriptors include the sphere (3D shape contexts and harmonic shape contexts) [67],
the cylinder (SPIN images) [49] and the ellipsoid (Minimum Volume Ellipsoid) [68]. Previous
work [69, 70] has shown that the ellipsoid shape is preferable to the bounding box or
sphere for the approximation of the underlying object.
4.2.2 Scale Selection
The scale of the super-voxel is another crucial parameter. Lindeberg [71] shows
that the notion of scale selection is of utmost importance for automatic processing of
unknown data. However, previous work in the computation of region point descriptors
often assumes a fixed scale. For example, the regional point descriptor used in [67] is
computed as follows: each point cloud is divided into a fixed scale of 0.2-meter voxels
and one point is selected at random from each occupied voxel. In another study, Stamos
and Allen [50] compute the region descriptor from fixed k neighbouring data for every
point. Even though the authors claim the k value is optimal, the consistency of the grouped
points cannot be guaranteed. The variation in the structures of the data and sampling
density, as explained in Chapter 1, requires different scale levels. We have developed a
method that is capable of computing the scale of the super-voxel adaptively using 3D scale
theory. With super-voxels, the classification of the 3D data is then based on a reduced data
set of the original point clouds. The concept of reducing the data set is such that
geometrically similar features data are omitted from training and inference of the learning
model. With this method, the total processing time required for training and testing the
learning model can be reduced.
4.3 SUPER-VOXEL – A 3D OVER-SEGMENTATION APPROACH
We propose to over-segment the 3D data into super-voxels before classifying the
data, using algorithms adopted from 3D scale theory [51, 72]. The individual 3D points
are clustered together to form a higher-level representation, as shown in Figure 4-1. For
the p data points in Figure 4-1, n super-voxels, where n << p, are computed
based on the normalised colour similarity and the geometry of the data structure.
In this section, we will explain the scale and shape selection of the super-voxel.
Super-voxels are then computed on the selected data sets.
4.3.1 Sphere as Dividing Boundary
We have chosen a sphere over an ellipsoid as the shape of the super-voxel. The reason for
choosing a sphere (centred on a basis point p) over an ellipsoid or some irregular shape is to
avoid the effect of shape variation, which can affect the features extracted.
the shape of the segmented super-voxel region is ellipsoid, the super-voxel can be “longer”
in one direction, as shown in Figure 4-2a. The reason for the shape may be noise or the
surface scanned not being exactly flat. The resulting saliency features explained in
Chapter 3 that depend on the eigen-analysis of the scatter matrix will be linear-like, even
when the data is almost planar. This effect can be minimised using a sphere. As shown in
Figure 4-2b, both computed super-voxels (shown as two blue circles) are planar-like with
larger “surface-ness” saliency features, as explained in Chapter 3. By choosing a regular
shape, the over-segmented super-voxels will overlap each other as depicted in the same
figure. Despite having some points belonging to more than one super-voxel region, the
additional amount of computation is relatively small. Also, allowing overlap effectively
offers the advantage of providing maximum coverage of data with similar properties.
4.3.2 Automatic Scale Selection
Figure 4-1 Over-segmentation of 3D point clouds into super-voxels

Figure 4-2 Comparison of different shapes of super-voxel: (a) ellipsoid (yellow); (b) sphere (blue)

The radius of the super-voxel, r, is iteratively estimated with the following
algorithm [73, 74], which depends on the estimated curvature, density, noise and the colours
of the points:
Repeat:
1. Randomly pick a point p.
2. Check whether p already belongs to any previous super-voxel.
3. Start with k = 15 (adjusted manually according to the noise constant).
4. Iterate and refine (maximum of 10 steps) to estimate the optimal size of the super-voxel:
a. Compute the radius of the super-voxel, r, as the distance from p to its k-th nearest neighbour locally. If k < 2, set k = 2 in order to compute the least-squares solution for the estimation of the parameter d.
b. Fit a least-squares plane to p and its k nearest neighbours and compute d, where d is the shortest distance16 between p and the least-squares plane.
c. Compute µ, the average distance from p to all the points within the super-voxel.
d. Compute the estimated density ρ and the estimated curvature κ [75] locally:

$$ \rho \leftarrow \frac{k}{\pi r^{2}} \qquad ( 4\text{-}1 ) $$

$$ \kappa \leftarrow \frac{2d}{\mu^{2}} \qquad ( 4\text{-}2 ) $$

e. Use the known parameters to compute r_new:

$$ r_{\mathrm{new}} = (1 - damp)\, r_{\mathrm{old}} + damp \left[ \frac{1}{\kappa} \left( \frac{d_{1}\,\sigma_{n}}{\varepsilon \sqrt{\rho}} + d_{2}\,\sigma_{n}^{2} \right) \right]^{1/3} \left[ 1 - \max\big(\operatorname{var}(colour)\big) \right] \qquad ( 4\text{-}3 ) $$

f. Compute

$$ k_{\mathrm{new}} \leftarrow \rho\, \pi\, r_{\mathrm{new}}^{2} \qquad ( 4\text{-}4 ) $$

g. Stop if k_new > threshold or k_new saturates.
Until all points belong to a super-voxel.

Algorithm 4-1 3D Over-segmentation
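A Python sketch of one radius estimate under Algorithm 4-1 is given below; the stopping test, the small numerical guards and the exact neighbour bookkeeping are assumptions, and σ, d1, d2 are taken as already estimated:

```python
import numpy as np
from scipy.spatial import cKDTree

def supervoxel_radius(points, colours, p_idx, sigma, d1, d2,
                      eps=0.1, damp=0.2, k0=15, max_iter=10):
    """One super-voxel radius estimate, a sketch of Algorithm 4-1.

    sigma is the sensor noise constant; d1, d2 come from the fit in
    Eq. 4-6. Colours are assumed to be normalised RGB in [0, 1].
    """
    tree = cKDTree(points)
    p = points[p_idx]
    r = tree.query(p, k=k0 + 1)[0][-1]         # distance to k0-th neighbour
    for _ in range(max_iter):
        idx = tree.query_ball_point(p, r)
        k = max(len(idx) - 1, 2)
        nb = points[idx]
        # Least-squares plane through the neighbourhood: d is the distance
        # from p to the plane (smallest singular direction, cf. Eq. 3-3).
        centred = nb - nb.mean(axis=0)
        n = np.linalg.svd(centred)[2][-1]
        d = abs(n @ (p - nb.mean(axis=0)))
        mu = np.linalg.norm(nb - p, axis=1).mean()
        rho = k / (np.pi * r**2)                # Eq. 4-1
        kappa = 2.0 * d / mu**2                 # Eq. 4-2
        colour_term = 1.0 - colours[idx].var(axis=0).max()
        r_opt = ((1.0 / max(kappa, 1e-9)) *
                 (d1 * sigma / (eps * np.sqrt(rho)) + d2 * sigma**2)) ** (1.0 / 3.0)
        r_new = (1.0 - damp) * r + damp * r_opt * colour_term   # Eq. 4-3
        if abs(r_new - r) < 1e-6 * r:           # k_new (Eq. 4-4) saturates
            break
        r = r_new
    return r
```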
According to Chebyshev’s inequality, for every ε > 0:

$$ P\!\left( |p - \mu| \geq \frac{\sigma}{\sqrt{\varepsilon}} \right) \leq \varepsilon \qquad ( 4\text{-}5 ) $$
ε, which appears in Eq. 4.3, is set to 0.1 in our experiments. The noise constant σ can be
estimated experimentally by computing the average distance of every point (acquired from
a single plane) to the least-squares fitted plane. In the case where a different scanning
16 This can be computed using singular value decomposition (SVD), as explained in Section 3.3.2. The
shortest distance from p to the least-squares fitted plane is equivalent to the smallest singular value.
resolution is used, the noise constant has to be re-estimated for the different resolution
using the same approach. The accuracy of the estimation process relies on the level of
“flatness” and the amount of plane data used. To estimate the value of d1 and d2 in the
same equation, ground truth normals17
are required to solve the following linear
minimization problem (refer to [51] for more details):
$$ \min_{d_{1}, d_{2}} \sum_{i=0}^{N} \left( \left[ \frac{1}{\kappa_{i}} \left( \frac{d_{1}\,\sigma_{n}}{\varepsilon \sqrt{\rho_{i}}} + d_{2}\,\sigma_{n}^{2} \right) \right]^{1/3} - r_{i} \right)^{2} \qquad ( 4\text{-}6 ) $$
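Since cubing both sides of the residual makes the model linear in d1 and d2, the fit can be expressed as an ordinary least-squares problem. This linearisation is a convenient simplification, not necessarily the exact procedure of [51]:

```python
import numpy as np

def fit_d1_d2(r_ref, kappa, rho, sigma, eps=0.1):
    """Least-squares fit of d1, d2 in Eq. 4-6.

    r_ref are the reference radii derived from the ground-truth normals;
    cubing both sides gives d1*a_i + d2*b_i = r_i^3, linear in (d1, d2).
    """
    a = sigma / (eps * np.sqrt(rho)) / kappa
    b = (sigma ** 2 / kappa) * np.ones_like(kappa)
    d, *_ = np.linalg.lstsq(np.column_stack([a, b]), r_ref ** 3, rcond=None)
    return d  # [d1, d2]
```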
Another variable in Eq. 4.3, the damping factor damp, is set to 0.2 to prevent the iterations
from diverging or entering a marginally stable state.
To provide colour constraints within the super-voxel, we include colour properties
of the data in the estimation of the radius of the super-voxel (Eq. 4.3). First, the variances
of the normalized RGB values are computed for the super-voxels. The maximum of [varR
varG varB] is then used as a factor to reduce the radius if the colour within the super-voxel
is inconsistent. The proposed over-segmentation approach is validated on two sets of
synthetic data and two sets of outdoor LIDAR data. Super-voxels are computed and
plotted on the datasets. The noise constant in the algorithm is adjusted separately for the
synthetically generated and the outdoor LIDAR data sets.
4.3.3 Synthetic Data
Synthetic Data Set I
17 Similar to the estimation of the noise factor, the ground truth normal can be estimated by least square
fitting a plane, where the plane coefficient is the best estimated normal.
Figure 4-3 Super-voxels for synthetic data: (a) at the 10th iteration; (b) at the 20th iteration; (c) at the 30th iteration; (d) at the 40th iteration; (e) at the 50th iteration; (f) at the 100th iteration
We validated our algorithm on the synthetically generated data shown in the previous
chapter (Figure 3-2). With over-segmentation, the original data were reduced from 1500
data points to 129 super-voxels (which is around 8.6% of total amount of data). The
number of super-voxels on the plane is 19 (2.1% of all data belonging to the plane) and
the number of super-voxels on the clutter is 92 (15.3% of all data belonging to the clutter).
The average radius of the super-voxels for the data belonging to the plane is 5.32, while
the average radius for the data belonging to the clutter is smaller, at around 2.45. The
computation time is 74s on an Intel Core 2 Duo 2.13GHz CPU and 2GB of RAM.
Figure 4-4 Super-voxels for synthetic data at the 129th (final) iteration
Figure 4-3 shows the results of the over-segmentation at the 10th, 20th, 30th, 40th, 50th
and 100th iterations. The image in Figure 4-4 shows the final over-segmentation result and
Figure 4-5 shows the top view of the result. As depicted in these figures, the sizes of the
computed super-voxels for the data belonging to the clutter remain small compared to the
sizes of the super-voxels for the data belonging to the plane. Notice that there are some
super-voxels on the clutter data that appear much larger than the average size of the super-
voxels (on the clutter data). To explain this, the over-segmentation algorithm requires at
least three points in the super-voxel for computation of the least-squares fitted plane. The
clutter points which lie at a distance from the rest of the data therefore require a
larger super-voxel in order to include the minimum of two other points.
Figure 4-5 Bird’s-eye view of super-voxels for synthetic data at the 129th (final) iteration
Figure 4-6 Synthetic Data Set II
Synthetic Data Set II
To further evaluate the performance of the algorithm on data, we synthetically
generated two clusters of data (as shown in Figure 4-6) which also consisted of a plane
and a clutter. The difference from Synthetic Data Set I is that a portion of
the plane is closer to the clutter. Synthetic Data Set II consists of two sets (for training
and inference purposes) of a hundred points that form a plane with Gaussian noise of
standard deviation 0.01, and a hundred clutter points that represent vegetation.
The original data have been reduced from 200 data points to 52 super-voxels.
Similar to the results in the first data set, the computed super-voxels for the data belonging
to the clutter remain small compared to the super-voxels for the data belonging to the
plane. Because some portion of the clutter data is closer to the plane, as observed from the
over-segmentation result in Figure 4-7, the super-voxels fitted to the data belonging to the
plane that are located closer to the clutter are smaller, in order to exclude data belonging
to the clutter.
The number of super-voxels on the plane is 25 (25% of total data belonging to the
plane) and the number of super-voxels on the clutter is 32 (32% of total data belonging to
the clutter). The average radius of the super-voxels for the data belonging to the plane is
2.6, while the average radius for the data belonging to the clutter is 1.5. The computation
time is 12.1s. Details of the results from different viewpoints are shown in Figure 4-8 and
Figure 4-9.
Figure 4-7 Super-voxels for Synthetic Data Set II: (a) at the 5th iteration; (b) at the 10th iteration; (c) at the 15th iteration; (d) at the 20th iteration; (e) at the 30th iteration; (f) at the 52nd (final) iteration
Figure 4-8 Super-voxels for Synthetic Data Set II at the 52nd (final) iteration
Figure 4-9 Bird’s-eye view of super-voxels for Synthetic Data Set II at the 52nd (final) iteration
4.3.4 Outdoor LIDAR Data
Outdoor LIDAR Data Set I
The over-segmentation algorithm is also applied to the outdoor LIDAR data shown in
Figure 3-9. The original data set, which consists of 3648 points, has been reduced to 180
super-voxels (4.9% of the total data). The result during the iterations is shown in Figure
4-10, while close-ups from different viewpoints can be found in Figure 4-11 and Figure
4-12.
In order to evaluate the performance of the algorithm, we manually labelled this
data set into different planes and clutter. With over-segmentation, the result is similar to
the results for over-segmenting the synthetically-generated data sets, where the number of
super-voxels on the planes is relatively small compared to the number of super-voxels on
the clutter. The number of super-voxels on the vertical plane is 16 (2.3%), on the
horizontal plane is 27 (1.8%) and on the clutter is 111 (7.7%). The average radius of the
super-voxels for the data belonging to the vertical plane is 0.97, and for the data belonging
to the horizontal plane is 1.2; while the average radius for the data belonging to the clutter
is 0.29. The computation time is 237s on an Intel Core 2 Duo 2.13GHz CPU with 2GB of
RAM.
Figure 4-10 Super-voxels for outdoor LIDAR data: (a) at the 10th iteration; (b) at the 20th iteration; (c) at the 50th iteration; (d) at the 100th iteration; (e) at the 150th iteration; (f) at the 180th (final) iteration
Figure 4-11 Super-voxels for outdoor LIDAR data at the 180th (final) iteration
Figure 4-12 Bird’s-eye view of super-voxels for outdoor LIDAR data at the 180th (final) iteration
Outdoor LIDAR Data Set II
We tested the algorithm on a more complicated outdoor LIDAR data set, Data Set II, as
shown in Figure 4-13:
Figure 4-13 Outdoor LIDAR Data set II
The results during the iterations are shown in Figure 4-14 and Figure 4-15. The
final over-segmentation result is shown in Figure 4-16 and Figure 4-17 (top-view). The
original data have been reduced from 1500 data points to 279 super-voxels. The number of
super-voxels on the horizontal plane is 38 (2.1%), on the vertical plane is 54 (2.9%) and
the number of super-voxels on the clutter is 187 (6.7%). The average radius of the super-
voxels for the data belonging to the horizontal plane is 1.2, while for the data belonging to
the vertical plane it is 0.81, and the average radius for the data belonging to the clutter is
0.36. The computation time is around 7 minutes on an Intel Core 2 Duo 2.13GHz CPU
with 2GB of RAM. Notice that the super-voxels at the sparse horizontal plane are
relatively large compared to the denser vertical plane. This is because the estimated radius
of the super-voxels is inversely proportional to the estimated density (as defined in Eq. 4.2).
Figure 4-14 Super-voxels for Outdoor LIDAR Data Set II: (a) after 10 iterations; (b) after 30 iterations; (c) after 100 iterations
Figure 4-15 Super-voxels for Outdoor LIDAR Data Set II after 200 iterations
Figure 4-16 Super-voxels for Outdoor LIDAR Data Set II at the 276th (final) iteration
Figure 4-17 Top view of super-voxels for Outdoor LIDAR Data Set II at the 276th (final) iteration
4.4 CONCLUSION
In this chapter, we identified the problem with point-based classification. We
showed that the amount of original outdoor LIDAR data can be greatly reduced, avoiding
redundant computation for the classification of the data. We then proposed an effective
method to over-segment 3D outdoor LIDAR data. The algorithm is adaptive and capable
of accurately grouping data of similar properties, reducing original data to a relatively
small proportion. This greatly reduces the computation time required to classify the
original data, as only around 10% of the original data (depending on the resolution and the
complexity of the data) require labelling. Even though this additional pre-processing step
for classification may slightly increase the total processing time, over-segmentation
enhances the classification results, which will be described and shown in the following
chapter.
5.0 DATA CLASSIFICATION WITH MCRF
5.1 INTRODUCTION
In this chapter, we consider the supervised learning approach, which is a machine learning
technique to classify point clouds into different data types with extracted features. With
both the input and desired outputs (training data), supervised learning tries to find the
connection between the two sets of observations. The connection is a global model or a set
of local models that is capable of mapping the inputs to the desired outputs. With the
learned model, it is then possible to predict the output (as a continuous value in regression
problems or as a label for classification problems) for any valid input data. This is similar
to concept learning in human psychology, where the learner simplifies what has been
observed from the examples and uses the simplified concept to apply to future examples.
Existing learning approaches include logistic regression, neural networks (Multi-Layer Perceptron), Support Vector Machines (SVM), k-nearest neighbours, Gaussian mixture models (GMM), naïve Bayes, decision trees and Radial Basis Function (RBF) classifiers. Each of these learning models has strengths and weaknesses; this is captured by the ‘no free lunch’ (NFL) theorem of Wolpert [76]: no single classifier will outperform all
other classifiers on all learning tasks. In order to choose the optimal classifier, it is crucial
to understand the characteristics of the data set. Several empirical comparisons have been
carried out [77-80] to define the classifier selection criteria for different data
characteristics. However, none of the approaches has successfully predicted and explained
the classifier performance for different data sets [81]. Some research has proposed
combining classifiers of different natures to complement each other. For example, Barat et
al. [82] evaluated a set of neural and statistical classifiers and provided the appropriate
fusion rule. The drawback is an increase in classifier complexity and inefficiency.
The background to supervised classification and the models that have been used for effective 3D data classification are first described in this chapter. We demonstrate the
advantage of using a discriminative model over a generative model for our dataset. We
then show the need for a multi-scale graphical learning model for the classification
problem, and propose a multi-scale Conditional Random Field (mCRF) solution. The
proposed model is evaluated and the result is compared with some existing models. The
results confirm improvements over classification using logistic regression and Conditional
Random Fields.
5.2 BACKGROUND OF SUPERVISED CLASSIFICATION
Supervised classifiers can generally be divided into generative (model-based) and
discriminative classifiers. The following sections explain the details and differences of
both types of classifiers, and introduce the concept of graphical models, which combine graph theory and probability theory.
5.2.1 Generative Model
Popular generative models include: Bayes classifier, Hidden Markov Models and
Maximum Entropy Markov Models. These models define a joint probability distribution
of the observation and labelling sequences P(X,Y).
Consider the supervised learning problem:
We would like to approximate the unknown mapping function f: X→Y or P(Y | X), where
Y is the predicted output label and X is the input data.
The problem can be approached with the Naïve Bayes method which is a
generative supervised learning model based on Bayes’ theorem.
With Bayes’ rule:

$P(Y \mid X) = \dfrac{P(X \mid Y)\,P(Y)}{P(X)}$

( 5-1 )
In order to predict the output label for any data, we need to estimate P(X|Y) and
P(Y) using the training data pairs. P(X) is the expected data likelihood, i.e. the expectation of P(X|Y) over the prior distribution. It can also be seen as a normalisation term that ensures the probabilities sum to 1.
$P(X) = \sum_{Y} P(X \mid Y)\,P(Y)$

( 5-2 )
The naïve Bayes is called a generative classifier as P(X|Y) can be seen as a
distribution that describes how to generate random instances X conditioned on the
predicted target attribute Y. The term naïve is used because of the strong (naïve) independence assumption, i.e., the presence (or absence) of a particular feature of a class is
unrelated to the presence (or absence) of any other feature. This greatly simplifies the
problem, as only the variances for each class have to be computed instead of the whole
covariance matrix. Even though the independence assumption is fairly strong and
unrealistic, the naïve Bayes has surprisingly often performed much better than expected in
real-world learning problems.
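To make the estimation step concrete, the following minimal sketch (in Python, on hypothetical toy data rather than any data set from this thesis) implements a Gaussian naïve Bayes classifier: per-class priors P(Y) and per-feature means and variances for P(X|Y) are estimated from training pairs, and prediction takes the label maximising P(X|Y)P(Y).

```python
# Minimal Gaussian naive Bayes sketch (illustrative only; data are hypothetical).
import numpy as np

def fit_naive_bayes(X, y):
    """Estimate per-class priors P(Y) and per-feature means/variances for P(X|Y)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),          # prior P(Y=c)
                     Xc.mean(axis=0),           # per-feature mean
                     Xc.var(axis=0) + 1e-9)     # per-feature variance (no covariances)
    return params

def predict(params, x):
    """Return argmax_c P(x|c)P(c); P(X) is a common normaliser and can be dropped."""
    def log_joint(c):
        prior, mu, var = params[c]
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        return np.log(prior) + log_lik
    return max(params, key=log_joint)

# Toy usage: two noisy 2D clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = fit_naive_bayes(X, y)
print(predict(model, np.array([2.8, 3.1])))   # expected: 1
```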
For naïve Bayes with continuous inputs, P(X|Y) has to be modelled, for example by fitting Gaussian Mixture Models (GMM). Although the Gaussian Mixture Model is a parametric unsupervised learning method, it can be used as part of a supervised classification scheme by defining prototypes. To train the GMM for supervised learning, we first need to discover the number of clusters that exist and which clusters correspond to which classes. For more details on determining the optimal number of clusters, refer to Chapter 8. Bayesian classification with GMM has been
applied in classifying terrestrial point clouds into different data types. For instance, instead
of using a manually-fixed threshold [49, 50], Lalonde et al. [51] learned the distribution of
the saliency feature on hand-labelled data with GMM and estimated the parameters of the
GMM with the Expectation Maximization algorithm.
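A hedged sketch of this style of step is given below, using scikit-learn’s GaussianMixture; the saliency array and the number of components are placeholders, not the data or settings of Lalonde et al.

```python
# Sketch: fit a GMM to saliency-style features with EM, then read off soft
# cluster memberships. `saliency` is a hypothetical (n_points, 3) stand-in.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
saliency = rng.random((500, 3))                 # stand-in for real features

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
gmm.fit(saliency)                               # EM parameter estimation
posteriors = gmm.predict_proba(saliency[:5])    # soft cluster memberships
print(posteriors.round(2))
```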
The proposed approach by Lalonde et al. only clusters in the feature space.
Classifying only with local features (not taking neighbouring points into account) can be
very difficult due to ambiguity at the point level. Local classification can also lead to isolated false positives and false negatives. The reason that the authors use only local
features is because the application is real-time and the classifier is labelling every new
point as soon as it is acquired.
5.2.2 Discriminative Model
Another popular class of approaches comprises the discriminative models, including logistic regression, Conditional Random Fields (CRFs) [83] and Markov Random Fields (MRFs), which specify the probability of a label given an observation sequence, p(Y|X). By
modelling the conditional probability distribution instead of the joint probability
distribution, the discriminative models do not need to enumerate all possible observation
sequences, which may not be feasible [83].
For example, instead of estimating the joint probabilities, the logistic regression
computes P(Y | X) directly with the following parameterisation:
$P(Y = 1 \mid X) = \dfrac{1}{1 + \exp\big(w_0 + \sum_{i=1}^{n} w_i X_i\big)}$

( 5-3 )
$P(Y = 0 \mid X) = \dfrac{\exp\big(w_0 + \sum_{i=1}^{n} w_i X_i\big)}{1 + \exp\big(w_0 + \sum_{i=1}^{n} w_i X_i\big)}$

( 5-4 )
The training data are used to estimate $W = \langle w_0, w_1, \dots, w_n \rangle$ such that

$W \leftarrow \arg\max_{W} \prod_{l} P(Y^{l} \mid X^{l}, W)$

( 5-5 )
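A minimal sketch of this training step is shown below: it maximises the conditional log-likelihood of Eq. (5-5) by gradient ascent. It uses the common convention P(Y=1|X) = σ(w·x); Eqs. (5-3) and (5-4) place the exponential on the Y=0 branch instead, which only flips the sign of the weights. The toy data are hypothetical.

```python
# Sketch of discriminative training: choose W to maximise the conditional
# likelihood prod_l P(Y^l | X^l, W) by gradient ascent on the log-likelihood.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, iters=500):
    Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend the bias weight w0
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = sigmoid(Xb @ w)
        w += lr * Xb.T @ (y - p) / len(y)       # gradient of the log-likelihood
    return w

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (40, 2)), rng.normal(1, 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
w = train_logistic(X, y)
print(sigmoid(np.array([1.0, 0.9, 0.9]) @ w))   # P(Y=1) for a point near class 1
```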
Unlike naïve Bayes, logistic regression is a function approximating algorithm that
discriminatively predicts the label Y given any instance X. As mentioned before, no
classifier is optimal for all classification problems. Several authors have studied the comparison of generative and discriminative classifiers [84, 85]. It is generally
agreed that generative classifiers are more suitable when the acquired data are limited.
This is because the generative model has assumed the underlying distribution. Therefore,
the discriminative logistic regression converges at a slower rate during parameter
estimation compared to the generative naïve Bayes. With a sufficient amount of
comprehensive training data, discriminative classifiers have the advantage of being more
accurate. Even though logistic regression also assumes conditional independencies
between the input features, it is not as rigidly tied to the assumption as naïve Bayes is. When
the assumption is violated, the parameters of the logistic regression will be adjusted by the
conditional likelihood maximization algorithm to fit to the data. Real life data typically
cannot be modelled precisely with an exact distribution model. Thus, the discriminative
classifiers are more suitable for real life data in general.
5.2.3 Graphical Model
The aforementioned naïve Bayes and logistic regression methods are probabilistic
models. These models take only the node potentials into account. However, in 2D images
or 3D point clouds, a single pixel/point is very ambiguous by itself. Kumar [86] identified
this problem as the curse of ambiguity: since pixels/points from different classes can
appear similar, it is extremely difficult to identify the class label of each pixel/point
independently. By combining graph theory and probability theory, graphical models are
capable of modelling the spatial interaction between pixels/points. Jordan described
graphical models as “a natural tool for dealing with two problems that occur throughout
applied mathematics and engineering -- uncertainty and complexity -- and in particular
they are playing an increasingly important role in the design and analysis of machine
learning algorithms.” [87]. The previously mentioned supervised models such as hidden
Markov models (HMM), Markov random fields (MRF) and conditional random fields
(CRF) are examples of graphical models.
Using both generative and discriminative graphical models, Wolf et al. [58]
classified 3D points into navigable and non-navigable regions with an HMM locally, followed by global segmentation with an MRF. The HMM is implemented to learn the
difference between different classes, whereas MRF is implemented to enforce the spatial
constraint between the neighbouring points (smoothing). Instead of labelling and
smoothing the data labels in an ad-hoc manner, as proposed by Wolf et al., Anguelov et al.
[37] approached the 3D data classification problem with discriminative graphical models –
the associative Markov network (AMN). AMN allows effective inference using graph-cuts
[88]. The authors segmented 3D scan data into four classes: ground, tree, building and shrubbery. The experimental evaluation showed that the AMN predicted 93% of labels correctly, whereas an SVM (a non-graphical model) predicted 68% correctly. Also using AMNs,
Triebel et al. [46] classified point clouds into windows, walls and gutters. The features
employed include: the cosine of angles between the local normal vectors, distribution of
neighbours and the normalised height of the points. The results showed that the AMN
outperforms a generative model, the Bayes classifier, with sufficient training examples.
Similar to Wolf et al. and Anguelov et al., we only start to process data after all
data acquisition in one area has been completed. Therefore, we have the advantage of
“knowing” complete neighbouring data for each point. As a result, we can include the
neighbouring information into the learning model using a graphical model, as spatial
relationships exist among the input data. Also, with sufficient training data, a
discriminative graphical model, the CRF, is a suitable choice of learning model. In order to apply the learning model to the super-voxels introduced in Chapter 4, we propose a multi-scale approach, which is explained in detail in the next section.
5.3 MULTI-SCALE CONDITIONAL RANDOM FIELD
Conditional Random Fields are undirected graphical models which have shown
promising results in text processing [83, 89], image segmentation [26, 90], DNA sequence
prediction [91], table or diagram structure extraction from documents [92, 93] and more
recently, in 3D range data classification at point level [46, 94].
As stated previously, graphical models take neighboring data into account.
Therefore, in addition to the local node potentials, the pair-wise edge potentials are
included in the model. However, the edge potentials in a graphical model are limited in
providing long-range correlation, especially for high resolution data. In addition,
classifying every point using only the features of the point and of its neighboring points is
sensitive to the difference in resolution among different scans and scanner technologies
(the density of the point clouds also varies with respect to different distances from the
laser scanner). In 2D image labelling, multi-scale approaches [26] have been introduced to CRFs for super-pixel labelling. Given an over-segmentation of the data (as explained in Chapter 4), we construct the multi-scale Conditional Random Field (mCRF), as shown in Figure 5-1, for super-voxel labelling in the following setting:
Figure 5-1. Multi-scale Conditional Random Fields with local edges (green) and regional edges
(black).
Let $s = s_1, \dots, s_N$ be the observed feature vectors of the N super-voxels. Each feature vector consists of a combination of feature descriptors such as heights, colours, spin images and estimated normals.
Let $c = c_1, \dots, c_N$ be the labels in C of the observed super-voxels. In urban modelling, labels range from low-level ones such as ‘planar’ and ‘cluttered’ to higher-level ones such as ‘building’, ‘vegetation’, ‘tree trunk’, ‘grass’ and ‘man-made pathway’.
Let $x = x_1, \dots, x_M$ be the observed feature vectors of M points of the point cloud data, randomly selected within every super-voxel.
The mCRF with parameters $\theta = \{l, r\}$, where $l = \{\lambda_i, \lambda_{ij}\}$, $r = \{\lambda_i, \lambda_{ik}\}$, $\lambda_i = \{\lambda_i^1, \dots, \lambda_i^C\}$, $\lambda_{ij} = \{\lambda_{ij}^1, \dots, \lambda_{ij}^C\}$ and $\lambda_{ik} = \{\lambda_{ik}^1, \dots, \lambda_{ik}^C\}$, defines the conditional probability of a state sequence given an observation sequence as:

$P_l(c \mid x) = \dfrac{1}{Z_l} \prod_{(i,j) \in \varepsilon} \Psi_{ij}(c_i, c_j, x_i, x_j) \prod_{i=1}^{N} \Psi_i(c_i, x_i)$

$P_r(c \mid s) = \dfrac{1}{Z_r} \prod_{(i,k) \in S} \Psi_{ik}(c_i, c_k, s_i, s_k) \prod_{i=1}^{N} \Psi_i(c_i, s_i)$

$P(c \mid s, x, \theta) = P_r(c \mid s) \times P_l(c \mid x)$

( 5-6 )
Pl is the probability of the super-voxel being labelled as class c given the features of the mid-point of the super-voxel and of its neighbours within the super-voxel. Pr is the probability of the super-voxel being labelled as class c given the features of the mid-point of the super-voxel and the mid-points of its neighbouring super-voxels. The final conditional probability of the super-voxel being of class c is the product of these probabilities, as shown in Eq. (5-6) (under the assumption of independence between the regional and local features). In Eq. (5-6), $Z_l$ and $Z_r$ are the normalisation constants that make the conditional probabilities sum to one. The local edge potential, region edge potential and node potential are defined in Eqs. (5-7) to (5-9) as follows:
Local edge potential:

$\Psi_{ij}(c_i, c_j, x_i, x_j) = \dfrac{\exp\big\{\sum_{C} \lambda_{ij}^{C} f^{C}(c_i, c_j, x_i, x_j)\big\}}{\sum_{c_i, c_j} \exp\big\{\sum_{C} \lambda_{ij}^{C}\, x_i x_j\, c_i^{C} c_j^{C}\big\}}$

( 5-7 )
As shown in Figure 5-1, the local edge potential is used to exploit the label interactions between the point $x_i$ and m randomly selected neighbours within the super-voxel.
Region edge potential:

$\Psi_{ik}(c_i, c_k, s_i, s_k) = \dfrac{\exp\big\{\sum_{C} \lambda_{ik}^{C} f^{C}(c_i, c_k, s_i, s_k)\big\}}{\sum_{c_i, c_k} \exp\big\{\sum_{C} \lambda_{ik}^{C}\, s_i s_k\, c_i^{C} c_k^{C}\big\}}$

( 5-8 )
The regional edge potential provides a coarser constraint. The l closest neighbouring super-voxels are selected as the regional edges.
Node potential:

$\Psi_{i}(c_i, x) = \dfrac{\exp\big\{\sum_{C} \lambda_{i}^{C} f^{C}(c_i, x)\big\}}{\sum_{c_i} \exp\big\{\sum_{C} \lambda_{i}^{C}\, x_i\, c_i^{C}\big\}}$

( 5-9 )
The node potential is a discriminative logistic regression (maximum entropy classifier)
that models each label c as a linear function of x or s.
$f^{C}(c_i, c_j, x_i, x_j)$, $f^{C}(c_i, c_k, s_i, s_k)$ and $f^{C}(c_i, x)$ are feature functions, which are often binary-valued for categorical classes (such as in text applications). In our application with ordinal observations, the feature functions are real-valued; they are defined over the local data point features (for example, the logarithm of the saliency features) of the observation sequences x and s, the current state $c_i$ and the neighbouring states $c_j$ and $c_k$.
mCRFs learn by finding the node, local edge and regional edge weight vectors that maximise the log-likelihood. With a Gaussian prior with variance $\sigma_C^2$, the penalised log-likelihood is:

$L(\theta) = \sum_{i=1}^{N} \log P(c \mid s, x, \theta) - \sum_{C} \dfrac{\lambda_C^2}{2\sigma_C^2}$

( 5-10 )
where the second summation provides smoothing to avoid over-fitting [95]. The scaled
conjugate gradient optimization algorithm is used for the maximization.
Given the observation sequence x, inference in mCRFs finds the most likely state sequence $c_{\max}$:

$c_{\max} = \arg\max_{c}\, p(c \mid s, x, \theta)$

( 5-11 )
Since exact inference can be intractable in such models, approximate inference using
belief propagation is performed for finding cmax.
As explained in Section 4.3.1, the super-voxels can overlap. In inference, the
points belonging to more than one super-voxel will be labelled as the maximum of the
product of the conditional probabilities from the overlapped super-voxels. Let V be the
super-voxels that include point p, and the label of point p is therefore:
$c_{p,\max} = \arg\max_{c} \prod_{v \in V} p(c \mid p, \theta_v)$

( 5-12 )
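A small sketch of this rule: a point covered by several overlapping super-voxels takes the label that maximises the product of the per-class probabilities of those super-voxels (computed here as a sum of log-probabilities for numerical stability). The probability values are illustrative.

```python
# Sketch of Eq. (5-12): label a point covered by several overlapping
# super-voxels by the argmax of the product of their class probabilities.
import numpy as np

def label_overlapping_point(probs_per_voxel):
    """probs_per_voxel: (n_voxels, n_classes) conditional probabilities."""
    log_product = np.log(probs_per_voxel).sum(axis=0)   # log of the product
    return int(np.argmax(log_product))

probs = np.array([[0.7, 0.2, 0.1],      # voxel 1 strongly favours class 0
                  [0.4, 0.5, 0.1]])     # voxel 2 weakly favours class 1
print(label_overlapping_point(probs))   # class 0 wins the product
```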
The points labelled as ‘planar’ or ‘building’, ‘grass’ and ‘man-made pathway’ can then be
extracted to form a Digital Surface Model (DSM) and Digital Terrain Model (DTM). The
‘cluttered’ or ‘vegetation’ points can be reduced with data reduction techniques [14] or
replaced with generic models.
5.4 PLANE PATCHES FITTING
With the ‘planar’ or ‘building’, ‘grass’ and ‘man-made pathway’ points extracted, visualising a large-scale model would still require a great deal of memory and processing time. As explained in the previous chapter, for large-scale data, the raw data were divided into small voxels before being processed, to reduce the processing time. Similarly, for visualisation, one solution is to process one small voxel at a time and fit planes to the extracted planar data in that voxel.
To perform robust plane fitting for visualisation, Random Sample Consensus (RANSAC) is a commonly used method [96]. A general approach is to first fit the large-scale plane data with RANSAC and then refine the fit with least-squares fitting, as outlined in Algorithm 5-1. By representing the individual data with fitted plane models, a great reduction in memory space can be achieved.
The RANSAC algorithm for plane fitting is as follows:

Determine the inlier fraction w, the inlier threshold τ, the number of inliers m required for a good fit, and the success probability p, where n is the number of points in a minimal sample (three for a plane).

Compute the number of iterations i required:

$i = \dfrac{\log(1 - p)}{\log(1 - w^{n})}$

( 5-13 )

Repeat i times:

Hypothesis generation: randomly select 3 points and fit a plane through them, calculating the plane coefficients:

$a_p = y_1(z_2 - z_3) + y_2(z_3 - z_1) + y_3(z_1 - z_2)$
$b_p = z_1(x_2 - x_3) + z_2(x_3 - x_1) + z_3(x_1 - x_2)$
$c_p = x_1(y_2 - y_3) + x_2(y_3 - y_1) + x_3(y_1 - y_2)$
$d_p = -\big[x_1(y_2 z_3 - y_3 z_2) + x_2(y_3 z_1 - y_1 z_3) + x_3(y_1 z_2 - y_2 z_1)\big]$

( 5-14 )

If the coefficients are all zero (a degenerate sample), re-select 3 random points.

Model verification: compute the perpendicular distance from each point to the plane:

$d = \dfrac{|a_p x + b_p y + c_p z + d_p|}{\sqrt{a_p^2 + b_p^2 + c_p^2}}$

( 5-15 )

Scoring function: if the distance is less than τ, the point is an inlier.

Model refinement: if the number of inliers is greater than m, refit the set of inliers with least-squares fitting and recompute the perpendicular-distance residuals.

The best-fitting plane is the one with the least perpendicular-distance residual.

Algorithm 5-1 RANSAC
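The following runnable sketch implements Algorithm 5-1 with NumPy, using Eqs. (5-13) to (5-15); the parameter values are illustrative and the least-squares refinement step is omitted for brevity.

```python
# Sketch of Algorithm 5-1: RANSAC plane fitting (Eqs. 5-13 to 5-15).
import numpy as np

def fit_plane_3pts(p1, p2, p3):
    n = np.cross(p2 - p1, p3 - p1)          # normal (a_p, b_p, c_p) of Eq. (5-14)
    d = -n @ p1                             # d_p so that n.x + d = 0 on the plane
    return n, d

def point_plane_dist(points, n, d):
    return np.abs(points @ n + d) / np.linalg.norm(n)   # Eq. (5-15)

def ransac_plane(points, w=0.5, tau=0.05, p=0.99, n_sample=3):
    i = int(np.ceil(np.log(1 - p) / np.log(1 - w ** n_sample)))  # Eq. (5-13)
    best = (None, None, -1)
    rng = np.random.default_rng(0)
    for _ in range(i):
        idx = rng.choice(len(points), 3, replace=False)
        n, d = fit_plane_3pts(*points[idx])
        if np.allclose(n, 0):               # degenerate (collinear) sample
            continue
        inliers = point_plane_dist(points, n, d) < tau
        if inliers.sum() > best[2]:
            best = (n, d, inliers.sum())
    return best

# Toy data: a noisy z = 0 plane plus gross outliers.
rng = np.random.default_rng(4)
plane = np.column_stack([rng.uniform(-1, 1, (200, 2)), rng.normal(0, 0.01, 200)])
outliers = rng.uniform(-1, 1, (50, 3))
n, d, count = ransac_plane(np.vstack([plane, outliers]))
print(count, n / np.linalg.norm(n))         # ~200 inliers, normal ~ (0, 0, ±1)
```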
To estimate the “inlier threshold” τ in the RANSAC algorithm, the same method as used in estimating the noise constant or variance (for the over-segmentation algorithm in Chapter 4) can be applied. The “inlier threshold” determines whether a point is an inlier to the fitted model, and it is typically set at 1.5 times the estimated variance. There are a
number of variations to the RANSAC algorithm, including replacing the hard “inlier
threshold” with a weighted function. We will discuss some of the variations and
challenges of the RANSAC algorithm in Chapter 6.
5.5 RESULTS FOR DATA CLASSIFICATION
In this section, we demonstrate the advantage of performing over-segmentation
over individual data labelling for data classification using synthetic data. To do so, we
evaluate two classifiers, the Conditional Random Field (CRF) and the logistic regression
(which only considers the node potentials), using synthetic data to show the improvement
achieved by further taking the independencies among the neighbouring data into account.
The proposed method, the multi-scale Conditional Random Field (mCRF), is next
validated using two sets of complicated real-world data acquired from the terrestrial laser
scanner (shown in Chapter 2). We demonstrate the advantage of including the constraints
between super-voxels in the multi-scale method compared to the CRF. We also compare
our proposed method with triangulations and the direct plane fitting approach.
Lastly, we validate our algorithm on a large-scale real-world data set, registered
from seven terrestrial laser scans. The additional pre-processing step (required to handle
the relatively large data set) is explained, and the result is fitted with plane patches for
visualization.
5.5.1 Synthetic data sets
The proposed algorithm is compared with CRF without data reduction on a set of
synthetically generated data as shown in Figure 4-6, where the connecting neighbours are
selected from a fixed support region. The algorithm is also compared with a discriminative classifier that does not take edge potentials into account: logistic regression. A comparison of discriminative and generative classifiers for urban data can be found in [97].

Method | Accuracy (0-1) | Runtime (training) | Runtime (testing) | Training iterations | Train data (of 200) reduced to | Test data (of 200) reduced to
CRF with adaptive point reduction | 1 | 12.46 s | 0.109 s | 23 | 42 | 52
CRF without adaptive data reduction | 0.939 | 77.96 s | 0.516 s | 37 | N/A | N/A
Logistic regression | 0.835 | 11.344 s | 0.156 s | 18 | N/A | N/A

Table 5-1 Results for the synthetic data example

Figure 5-2. Classification results for synthetic data learned and inferred with (a) CRF with adaptive data reduction, (b) CRF without adaptive data reduction, (c) logistic regression
CRF with super-voxels
Six points are randomly selected from the adaptive support region for the edges in CRF
for every point p. The saliency features of the six points and point p are used as the feature
vector. From Table 5-1, we can see that the training data are reduced to 42 points (21% of the original data), requiring 12.46 s for parameter estimation on an Intel Core 2 Duo 2.13 GHz CPU with 2 GB of RAM.
For inference, the testing data are reduced to 52 points (26%). Most of the data reduction occurs within the plane regions, where the curvature (one of the factors in the computation of the super-voxel radius) is lower and the support region is therefore larger, since the support region is inversely proportional to the curvature. The
amount of reduction in data therefore depends on the ratio of the data type to the total
amount of data; i.e. for data sets with more planar data, the resulting reduction in the
amount of data will be more significant.
In the synthetic data experiment, the CRF managed to correctly classify all data points, as shown in Figure 5-2: blue represents “planar” data and red represents “cluttered” data.
CRF without super-voxels
For the selection of edge points, for every point p, three points were randomly picked from
a fixed radius and another three points were randomly picked from a fixed cylinder, as
described in the proposed method in [37]. As shown in Table 5-1, the time taken to train
the CRF with all data is much longer. The reason for a few misclassifications compared to
the “CRF with super-voxel” approach is due to a different edge points selection method:
in the “CRF with super-voxel” approach, the selected neighbouring edge points are from
an adaptive radius and thus are more likely to be from the same class.
Logistic regression
Similar to CRF, the logistic regression is also a discriminative classifier, but it takes only
node potentials into account. The time taken for training and inference of the learning model is similar to the time for the CRF with adaptive data reduction. However, without spatial information, the algorithm is prone to misclassifying a “flatter” cluttered region as “planar” data. This is because, in the computation of the saliency features of a point from the cluttered data, it is possible that the k nearest points selected to form the covariance
matrix are approximately coplanar. Without neighbouring information, it can be difficult
for the classifier to correctly classify such points. As a result, the logistic regression has
the worst classification accuracy compared to the other two approaches.
5.5.2 Urban data sets
We next validated our proposed mCRF-with-super-voxels approach on three sets of real outdoor scanned data. The mCRF was trained with hand-labelled outdoor laser-scanned data. The original segmented 57,734 training data points were reduced to 5,850 super-voxels automatically with our proposed over-segmentation algorithm. As a result, with over-segmentation, the total training time for the model is reduced to 10% of the original time. Note that the total reduction is less than the reduction achieved on the testing dataset (as most data reduction occurs in the planar regions, and we deliberately selected a balanced amount of data from different datasets to cover most of the data variation).
The training data were hand-labelled and the data were chosen from three scans
with different densities, scene configurations and lighting conditions. The total training
time was around 5 hours on an Intel Core 2 Duo 2.13GHz CPU with 2GB of RAM. We
then computed the super-voxels of these hand-labelled data. Four neighbours were
randomly selected for local edge features, and four nearest super-voxels were selected for
the regional edge features. This means that for every support region, we needed to
compute feature descriptors for only five points instead of every point within the support
region. In the following experiments, the learning model was trained to classify the data into 5 classes: vegetation, trunk, man-made objects (building, signboard), pathway and terrain (grass).
Data set 1
With the 3D over-segmentation, the 10,660 points shown in Figure 5-3 are reduced to 538 super-voxels. Therefore, only 5% of the original data has to be labelled, providing a large reduction in total inference time. Figure 5-3a shows the over-segmentation result.
Note the bigger super-voxels in the geometrically flat data (such as building and terrain
data). The computation time required for feature extraction is around 53.7s, and 0.1s for
inference of the super-voxels with CRF and 0.2s for mCRF.
The labelled data are shown in Figure 5-3d; our classification accuracy is around 86% for mCRFs (Figure 5-3c), compared to 78% for CRFs (Figure 5-3b). The feature descriptors used in both experiments are the same as those explained in Chapter 3. With a negligible difference in processing time, we show that the multi-scale approach improves the super-voxel labelling.

Figure 5-3. Data set 1. (a) 3D points over-segmented into super-voxels. (b) Labelled super-voxels’ mid-points with CRFs (yellow: man-made objects; red: vegetation; light blue: trunks; green: terrain; dark blue: pathways). (c) Labelled super-voxels’ mid-points with mCRFs. (d) Labelled original data with mCRFs.
Data set 2
A more complicated dataset with 158,922 points, shown in laser intensity in Figure 5-4, was over-segmented into 8,330 super-voxels (5.2% of the original data), as shown in Figure 5-5. Similarly, we saved around 95% of the processing time on inference. The
computation time taken for feature extraction from the super-voxels was around 15
minutes, 87s for inference in mCRFs, and 39s for inference in CRFs. The time difference
of 48s is almost negligible compared to the time taken for feature extraction.
Figure 5-4 Outdoor LIDAR data set II
Figure 5-5 Segmented super-voxels
Figure 5-6 (a) Labelled super-voxels’ mid-points with CRFs; (b) labelled super-voxels’ mid-points with mCRFs (yellow: man-made objects; red: vegetation; light blue: trunks; green: terrain; dark blue: pathways)
Figure 5-7 (a) Triangulated urban model; (b) enlargement of triangulated building surface; (c) RANSAC-fitted plane patches; (d) extracted building/terrain surface plane patches
The labelled super-voxels with CRFs are shown in Figure 5-6a with classification
accuracy around 72%. With the longer range of interaction provided by the regional edge
features in mCRFs, label accuracy was improved to 79%, as shown in Figure 5-6b.
Most misclassifications occur between ‘pathways’ and ‘terrain’, most likely due to their very similar features (flat surfaces with similar upward-pointing normal vectors) and to colour variation caused by building or vegetation shadows. For some applications, such as robotic navigation, which has different requirements (e.g. whether the terrain is navigable), the misclassification of terrain points in this experiment will not be an issue, because the mislabelled ‘pathway’ data will always be close to the real pathway location.
As direct triangulation is a popular and straightforward method for building reconstruction, we compare our proposed method with the triangulation approach. The
urban data are triangulated using RiScan Pro (the companion software for the RIEGL terrestrial scanner) with the following settings: edge clearing threshold = 0.05 m; depth factor = 8; depth threshold = 0.05 m. The data were reduced to
82,518 elements (triangles), which is about half of the original data, as shown in Figure
5-7a. Figure 5-7b shows the enlargement of the building surface where the drawbacks
mentioned in the introduction can be observed, including rough edges of the vegetation,
occlusions on the building surfaces and spikes between building surface and vegetation.
Different levels of triangulation can be generated with different settings, where the number of points used for triangulation can be further reduced, or increased for better precision. In short, triangulation can be a solution for straightforward object reconstruction where memory, processing time and visualisation artefacts are not constraints.
To compare our proposed method with a direct RANSAC-based approach [98], we
fitted plane patches on our classified “planar” data. In the direct RANSAC-based approach
proposed by Hansen et al., the original point cloud was divided into 3D voxels of fixed
size. For every voxel, a plane was fitted with RANSAC-based plane fitting as shown in
Figure 5-7c. In Hansen et al.’s approach, vegetation was also fitted with plane patches
then filtered out (plane patches with data density lower than a threshold will be omitted).
For the remaining plane patches, neighbouring planes with similar co-normalities and co-
planarities are grouped. The major grouped plane clusters that mainly contain
building/terrain data can then be extracted as shown in Figure 5-7d. The resulting number
of planes was 1522; thus, the approach requires only 6088 vertex points to represent the
planar surfaces in the building surface model.
However, the method proposed by Hansen et al. exhibits some disadvantages. Many plane patches that are fitted on building data are filtered out as non-planar patches in the grouping process. For example, dense vegetation close to a building or the terrain surface can deviate the fitted building plane. Consequently, during the grouping process, which groups according to the similarity of co-normality and co-planarity, the fitted plane patches of the building or terrain surface that are deviated by nearby vegetation may be filtered out. Also, in this approach, the plane-patch groups that contain fewer than a predefined number of plane patches are filtered out, to avoid plane patches that are fitted
on outliers or vegetation. This could result in useful patches that are fitted on relatively
small structures, which have fewer co-planar neighbouring plane patches, being filtered
out in the grouping process. Furthermore, as it is common for a city model to contain cylindrical buildings, plane patches fitted on cylindrical buildings could be filtered out during the plane-patch grouping process. In addition, the method proposed by Hansen et al. involves several thresholds, including the inlier threshold in the RANSAC process and the co-planarity threshold. These thresholds can be difficult to estimate empirically and have to be re-estimated for different data types.
Data set 3
Next, the proposed mCRF model was also tested on large-scale data labelling (a data set that contains more than 10 million points). For the large-scale data, the raw data were divided into small voxels before being processed, to reduce the processing time. With mCRFs and super-voxels, the total time required to label a single scan (1,009,942 points)
was reduced from around 17 hours (without data division) to 5.8 hours for 100x100 =
10,000 divisions, to 5.1 hours for 200 x 200 = 40,000 divisions, and to 1.6 hours with
400x400 = 160,000 divisions. The accuracy considerably dropped from 0.853 for the case
of “no division” to 0.795 for the case of 400x400 divisions. To remedy this, post-
processing steps that refine the object modelling can be applied.
For a complete scan of the area, a total of seven scans were stitched together (7,086,588 points). The classification accuracy with 400x400 = 160,000 divisions remained acceptable, as can be observed in Figure 5-8. A total of 12.8 hours was required for the computation of the scans.
Plane patches were fitted onto the labelled building, terrain and floor data using the
RANSAC algorithm (as explained in Section 5.4) as a post-processing step to
geometrically model the scene into Digital Terrain Model (DTM) and Digital Surface
Model (DSM). The plane-fitting process improved the classification result: isolated misclassified points and outliers were “filtered” out, while many of the small ‘holes’ caused by occlusion or misclassification (e.g. ‘building surface’ points labelled as ‘vegetation’) were recovered.
5.5.3 Summary of the Experiment Results
In short, the experiments conducted have shown the performance of our approach
in classifying terrestrial outdoor LIDAR data. We have shown improvements by including
multi-scale modelling, and we have compared our approach to the commonly-applied
methods for urban modelling. Our approach has succeeded in modelling most of the man-
made surfaces and ground planes accurately, compared to triangulations or the direct plane
fitting approach [98]. For large-scale data, we have shown that by combining model fitting
and classification approaches, efficient and accurate urban modelling can be achieved.
Figure 5-8. Plane fitting on labelled building and terrain data
5.6 CONCLUSION
We have presented an efficient and accurate method for 3D terrestrial range data
classification. We reduced the amount of data by over-segmenting the raw point clouds
into super-voxels, reducing the data (in most cases) to 5% of the original amount. We implemented the multi-scale Conditional Random Field to provide connectivity at the node, local edge and regional edge levels. The improvement in labelling precision of global classification (CRF) over local classification (logistic regression) has been demonstrated. We have also shown that the regional features of the super-voxels in the mCRF improve the classification accuracy of CRFs by 5% to 10%, while requiring only a negligible increase in the
computation time. We have also provided a strategy to handle relatively large-scale data,
and validated our proposed algorithm with an acquired real-world data-set.
6.0 ROBUST SEGMENTATION
6.1 INTRODUCTION
As discussed in the introduction, laser scanning technology has recently become capable
of producing dense point clouds. It is now possible to visualise highly-detailed urban
environments represented by 3D points, as shown in Figure 6-1.
3D points can be seen as the simplest form of geometric building blocks which can
be an effective display primitive [99]. As an alternative to 3D triangulation for
visualisation, point sets require little, if any, pre-processing for visualising urban
environments. However, this is memory-intensive, particularly for large-scale models. As
stated in Chapter 1, one of the solutions to urban modelling is to extract planar data (from
man-made structures) and then geometrically fit locally-delimited planes to the data.
Geometric modelling can be seen as representing potentially thousands of raw data points
with a single shape (and thus with few parameters). This results in a large reduction in
storage space and provides the ability to undertake geometric reasoning.
Figure 6-1 3D Terrestrial Outdoor Point Clouds
In order to fit a plane to the data, we need to robustly segment the planar data into
regions of locally delimited planes. Similar to the over-segmentation described in Chapter 4, robust segmentation groups similar regions together. The difference is that over-segmentation is more local, and multiple over-segmented segments can belong to the same object, whereas robust geometric fitting recovers shapes that are more meaningful in terms of the way we think of the world. Over-segmentation is often used as a preliminary
processing step when existing techniques (robust segmentation) are insufficient to
segment or model the data accurately. To date there has not been a successful
demonstration of segmentation of complicated outdoor data using only robust
segmentation. The enormous amount of data, the large varieties and shapes of the
structures, and the sheer number of structures make robust multi-structure segmentation
extremely challenging. The existing literature on multi-structure segmentation using
robust segmentation typically involves only a relatively small number of simple structures
[100].
In our proposed approach for the automatic generation of 3D outdoor urban models, the raw data are initially over-segmented into super-voxels (Chapter 4) so that they can be efficiently classified into different data types (Chapter 5). To model the extracted building data automatically, robust segmentation is required, because data from complex environments can be difficult to segment in 3D using simple (non-robust) statistical fitting. Moreover,
it takes much longer to fit models over a large search space. The current challenges in robust
multi-structure segmentation are as follows:
i) Gross outliers
Traditionally, classical statistical fitting methods, such as Least Square Fitting, are
sufficient for model fitting when only one structure is present in the data and there are no
outliers. Least square fitting is the simplest and most commonly used technique to find the
best fit for a set of data points. The best fit is the instance of the fitted model when the sum
of squared residuals is a minimum (a residual being the difference between an observed
value and the value given by the fitted model). Least squares methods were independently
developed by mathematicians Karl Friedrich Gauss in 1794, Adrien Marie Legendre in
1805 and Robert Adrain in 1808 [101]. However, this approach is highly sensitive to
outliers (for example, data belonging to one structure may be fitted to another close
structure). An extreme outlier with relatively large residual is capable of drastically off-
setting the fitted model. Therefore, the least square approach has a zero break-down point.
A more robust algorithm is therefore required.
In 1984, Rousseeuw proposed replacing the sum in the least squares method with a median. The Least Median of Squares (LMedS) algorithm can tolerate up to 50% outliers [102]. A class of estimators, the M-estimators (maximum-likelihood-type estimators), replaces the sum of squared residuals in least squares with a more slowly increasing loss function (of the data value and parameter estimate). However, these methods are not fully robust. The principal measures of the robustness of an estimator are its breakdown value (the fraction of outlying data points that can corrupt the estimator) and its influence function (which shows the effect on the estimator of changing one point of the sample) [103]. The M-estimators have breakdown points below 50%. Popular robust estimators that can handle more than 50% outliers include the Hough Transform and Random Sample Consensus (RANSAC) [104-106].
Non-robust statistical algorithms are typically more accurate. For example, in least squares fitting, although the sample mean is easily upset by contaminated data, it provides the most accurate estimate of the location of the normal distribution of residuals produced by the model. Computer vision algorithms such as RANSAC are robust but less accurate. Therefore, the robust segmentation methods are generally performed first, to detect and eliminate outliers, followed by the least squares method to refine the model parameters.
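The refinement stage can be sketched as a total least-squares plane fit: subtracting the centroid and taking the direction of least variance (the smallest singular vector) as the plane normal minimises the sum of squared orthogonal residuals. The data below are toy values.

```python
# Sketch of the least-squares refinement: fit a plane to presumed inliers by
# minimising the sum of squared orthogonal residuals (centroid + SVD).
import numpy as np

def lsq_plane(points):
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]                      # direction of least variance
    d = -normal @ centroid               # plane: normal . x + d = 0
    return normal, d

rng = np.random.default_rng(5)
pts = np.column_stack([rng.uniform(0, 1, (100, 2)), rng.normal(0, 0.01, 100)])
n, d = lsq_plane(pts)
print(np.round(n, 3))                    # approximately (0, 0, ±1)
```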
ii) Pseudo-Outliers
Although robust estimators are capable of handling most gross outliers, the
building surface data sets are multi-structured. Therefore, the robust segmentation
algorithm must tolerate both gross outliers and pseudo-outliers. Pseudo-outliers are
defined in [107] as “outliers to the structure of interest but inliers to a different structure”.
The previously-mentioned robust estimators, such as RANSAC, are essentially designed
to handle only single structure segmentation contaminated with gross outliers. In order to
handle multi-structure segmentation, RANSAC is sometimes applied sequentially to
detect and remove the inliers of the best-fitted planes from the data set (the fit-remove
strategy).
For instance, Hesami et al. [108] proposed a hierarchical approach to segment coarse-to-fine plane segments from complex buildings. The method starts by specifying the number of hierarchical levels for segmentation, and a user-defined input is computed for the robust estimator at each level. The user-defined threshold works as a fine-tuning parameter and indicates the ratio of the population of the smallest region that can be regarded as a separate region to the size of the entire population. At every level of the hierarchy, robust estimation is applied to the range data to sequentially group data into segments
until the calculated scale of noise (a surface is fitted to the data of each segment using least-squares fitting, and the scale of noise is calculated for the next hierarchy level of segmentation) is larger than the scale of noise of the measurement equipment. The sequential robust estimator approach is not optimal, as will be explained in detail in Section 6.2.2.2.
Another alternative is to employ unsupervised clustering techniques for simultaneous plane fitting (see Section 6.2.3).
iii) Unknown number of structures
To determine the number of structures or planes, the RANSAC fit-remove approach is
applied until the leftover points are fewer than a user-defined threshold, where the
threshold is usually difficult to estimate. In another segmentation approach, unsupervised
clustering, information criteria can be used to determine the optimal number of structures
or planes. The criterion scores for a number of clusters from two to some pre-set
maximum are computed and compared. The best score informally locates the optimal
number of structures or planes. However, determining the maximum number of clusters
can be difficult. Furthermore, due to occlusions, the decision as to whether a plane
segment belongs to another plane segment is ambiguous. This is because a single plane
may be broken into two by occlusion. For example, in Figure 6-2, the tree trunk occludes
part of the building surface. Ambiguity arises when trying to determine the actual number
of structures through segmentation of the disconnected building surface.
Figure 6-2 Ambiguity in the number of structures: a tree causes occlusion, splitting the occluded building surface into disconnected segments.
We recognise the problems mentioned above (more details are provided in Section 6.2) and propose to apply the Infinite Gaussian Mixture Model (IGMM) to
overcome the problem of robust multi-structure segmentation. The parameters of IGMM
are estimated by Gibbs sampling, and the number of clusters is allowed to grow with the
data (after the sampling has converged, one has a distribution over the number of clusters
given the data). IGMM has been applied in bioengineering for MRI classification [109,
110] and for gene expression clustering [111], document modelling for event detection
[112] and recently in computer vision for motion segmentation [113, 114]. In motion
segmentation, Jian and Chen [114] replaced the residual function for IGMM with the
Sampson distance in unsupervised clustering. With the clustered data, each cluster is fitted
with a plane using RANSAC to remove the outliers, much like “pre-segment with RANSAC and refine with least squares”. For plane data clustering, we modified the
residual function to include the prior knowledge (that data are planar) in Section 6.3 and
verified the algorithm on synthetic and real-world outdoor planar data.
This chapter is organised as follows. Relevant background information on robust multi-structure segmentation, and the major problems in previous work, are discussed in
Section 6.2. Our proposed approach for plane data segmentation/clustering is explained in
Section 6.3. The modification to the IGMM algorithm is also described. We then test the
proposed method and compare it with existing methods using two sets of terrestrial laser
scanned 3D urban data (plane data and plane patches) in Section 6.4.
6.2 BACKGROUND OF ROBUST SEGMENTATION
Generally, segmentation methods can be divided into three approaches: region growing,
model fitting and clustering.
6.2.1 Region Growing
Region growing is the most common bottom-up segmentation method, often used as a
post-processing step to refine the initial over-segmented (for example, by model fitting)
regions. There are several types of region growing methods. The simplest method starts
with a single seed. The seed pixel/point grows by merging with the neighbouring
pixels/points with similar properties. When the region stops growing, i.e. no new
pixel/point can be added, another seed, which does not yet belong to an existing region,
will be randomly chosen. This process is repeated until all pixels/points belong to some
region. However, by growing seed points sequentially, the current region dominates the
growth process, causing ambiguities around the edges of adjacent regions, so that different choices of seeds give different segmentation results. This is a common problem for the sequential segmentation approach, which biases the segmentation in favour
of the regions segmented first. One solution is to start with multiple randomly-sampled
seed points (or over-segmented regions). A number of regions will grow simultaneously
and similar regions will gradually merge. Region growing approaches have the advantage
of exploiting neighbouring pixels that are likely to be similar, thus ensuring the
smoothness of the segmentation.
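A minimal seed-based region-growing sketch is given below; the brute-force neighbour search, fixed radius and normal-similarity threshold are illustrative simplifications of practical implementations.

```python
# Sketch of seed-based region growing on a point cloud: a region absorbs
# unlabelled neighbours within a radius whose normals match the current point.
import numpy as np

def region_grow(points, normals, radius=0.2, angle_cos=0.95):
    labels = np.full(len(points), -1)
    region = 0
    for seed in range(len(points)):
        if labels[seed] != -1:               # seed already belongs to a region
            continue
        stack = [seed]
        labels[seed] = region
        while stack:
            i = stack.pop()
            near = np.where((np.linalg.norm(points - points[i], axis=1) < radius)
                            & (labels == -1))[0]
            for j in near:
                if abs(normals[i] @ normals[j]) > angle_cos:   # similar orientation
                    labels[j] = region
                    stack.append(j)
        region += 1
    return labels

rng = np.random.default_rng(6)
pts = rng.random((300, 3))
nrm = np.tile([0.0, 0.0, 1.0], (300, 1))     # toy: all normals identical
print(np.unique(region_grow(pts, nrm)).size) # number of grown regions
```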
One of the main problems with the region growing approach lies in determining
the thresholds of the similarity criteria – i.e., how similar is deemed to be sufficiently
similar. The similarity criteria can include colour, variance, intensity, motion, size,
normals and other popular features. Other sources of problems are noise in the 2D image or occlusion in the 3D data, which typically result in over-segmentation. Furthermore, the computation and resource costs of these methods are potentially high.
6.2.2 Model Fitting
There are a number of popular fitting approaches, including the following:
6.2.2.1 Hough Transform
To fit a plane using the Hough transform, a set of points (e.g. three points for a plane) is selected and the plane coefficients given by the plane equation z = ax + by + d are computed. One possible choice of parameter space corresponds to the plane
coefficients (a,b,d). However, similar to the well-known problem in line fitting of vertical
lines in 2D, there is the problem of vertical planes (resulting in large a and b values) which
give rise to unbounded values of the plane coefficients. Therefore, the choice of
parameterization is important, as unlike airborne laser scanning, terrestrial laser scanning
involves both vertical and horizontal planes. A better choice is to choose the parameter
space that consists of the plane’s normal vector and its distance from the origin.
The points will vote on a sinusoidal surface in the Hough parameter/accumulator
space and the intersection of the sinusoidal surfaces indicates the presence of a plane.
Thus, in the Hough transform approach, each point contributes to a globally-consistent solution, i.e. the physical plane which gave rise to that point.
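The sketch below illustrates such Hough voting for planes in a (θ, φ, ρ) parameterisation (normal direction plus distance from the origin), which avoids the unbounded coefficients of z = ax + by + d for vertical planes; the bin counts and ranges are arbitrary choices.

```python
# Sketch of Hough voting for planes: each point votes, for every candidate
# normal direction (theta, phi), into the rho bin of its signed distance.
import numpy as np

def hough_planes(points, n_theta=30, n_phi=15, rho_max=2.0, n_rho=40):
    acc = np.zeros((n_theta, n_phi, n_rho), dtype=int)
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    phis = np.linspace(0, np.pi, n_phi, endpoint=False)
    for ti, t in enumerate(thetas):
        for fi, f in enumerate(phis):
            n = np.array([np.sin(f) * np.cos(t), np.sin(f) * np.sin(t), np.cos(f)])
            rho = points @ n                      # signed distance per point
            bins = ((rho + rho_max) / (2 * rho_max) * n_rho).astype(int)
            ok = (bins >= 0) & (bins < n_rho)
            np.add.at(acc[ti, fi], bins[ok], 1)   # each point casts one vote here
    return acc, thetas, phis

rng = np.random.default_rng(7)
pts = np.column_stack([rng.uniform(-1, 1, (300, 2)), np.full(300, 0.5)])  # z = 0.5
acc, thetas, phis = hough_planes(pts)
ti, fi, ri = np.unravel_index(acc.argmax(), acc.shape)
print(acc.max(), phis[fi])    # peak of ~300 votes at phi = 0 (normal ~ +z)
```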
In 1990, the Randomized Hough Transform (RHT) [115] was introduced to
address the slow performance and high memory consumption problems of the Hough
Transform, and RHT has been applied to multi-structured segmentation [106]. By randomly selecting the point sets required to compute the parameters for the accumulator space, RHT generates a smaller subset of all parameter combinations. However, RHT still
suffers from the limitation of the Hough Transform caused by the quantisation in the
accumulator space. That is, unlike other robust fitting methods that provide infinite
precision, the Hough Transform and RHT have limited precision. When the bin size is set
to be larger than the optimum size, there is a loss in precision; if set too small, most bins
will have only a single count as there will be a one-to-one mapping between pixels and
bins. Estimation of the bin size can be a difficult challenge. In some cases, for example in
fundamental matrix estimation, without the inherent knowledge of the estimate variance, it
is rather difficult to estimate the bin size in the Hough transform.
6.2.2.2 RANSAC
The RANSAC algorithm, proposed by Fischler and Bolles in 1981 [96], is one of the most widely used approaches in the computer vision community due to its robustness to
outliers. Instead of finding the location of the narrowest band containing half the data, as
in LMedS, RANSAC finds the location of the densest band of a preset width. Unlike most
statistical estimation problems, the data used in computer vision generally include a high
ratio of outliers. RANSAC randomly samples a sub-set number of points and computes
the model parameters of each sub-set. The best fitted model is decided on the estimated
parameters that contain the largest number of inliers. By counting the number of inliers
with residuals less than a preset width/threshold, the inliers are assumed to be uniformly
distributed, i.e. carrying equal weights. Since residuals are more likely to follow a
Gaussian-like distribution, Torr et al. replaced the uniform kernel in the RANSAC scoring
function with a Gaussian kernel in MLESAC [116]. Similarly, Wang and Suter proposed ASKC [117] and tested it with both
a Gaussian and an Epanechnikov kernel. The authors showed that both kernels performed
better than a uniform kernel.
The main challenges in RANSAC-based multi-structure segmentation methods
(including the extended algorithm with modified scoring function) are as follows:
1) Determining the “inliers threshold” (scale estimation)
The “inliers threshold” found in the RANSAC scoring function (Algorithm 5-1) can be
seen as the estimated variance of the underlying model. In experiments where the
estimated variance is known, the “inliers threshold” is typically set at 1.5 times the
estimated variance. Determination of the “inliers threshold” is a chicken-and-egg problem.
The inliers have to be known before we can compute the variance, yet we need to know
the variance in advance to realise which data points are the inliers. Considerable effort has
been spent addressing this issue; a simple estimator would be the median or the more
commonly used MAD estimator which is also used in MLESAC [116]. Since the “inliers
threshold” for different candidate structures can be different, Konouchine et al. [118]
extended MLESAC by proposing an approach that is capable of adaptively estimating the
variance and the outlier ratio for every iteration. However, these estimators have a 50%
breakdown point.
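As an illustration of simple scale estimation, the sketch below computes the MAD-based estimate σ̂ = 1.4826 × median(|r − median(r)|), where the constant makes the MAD consistent for Gaussian residuals; the residuals are synthetic.

```python
# Sketch of a MAD-based scale estimate for the inlier threshold; tau is then
# often taken as roughly 1.5 sigma_hat, as in the text above.
import numpy as np

def mad_scale(residuals):
    med = np.median(residuals)
    return 1.4826 * np.median(np.abs(residuals - med))  # Gaussian-consistent MAD

rng = np.random.default_rng(8)
r = np.concatenate([rng.normal(0, 0.02, 500), rng.uniform(0.5, 2.0, 100)])
print(mad_scale(r))     # close to 0.02 despite ~17% gross outliers
```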
Wang [100] conducted a review of the robust scale estimators including ALKS,
RESC and MSSE. He found that these methods are limited in handling extreme outliers.
Wang and Suter then proposed integrating a robust scale estimator, TSSE [119], which
finds the densest band and the valley to estimate the variance in the ASKC framework.
Alternatively, Fan and Pylvanainen [120] derived the scale of the inliers from the statistics
of repeated inlier data points (Ensemble Inlier Sets) that are accumulated from all the
proposed models. However, these methods determine the inliers threshold using data
solely from a single mode. Estimating the variances of different models simultaneously
should be a more accurate approach.
2) Remaining points from removed structures
As previously mentioned, RANSAC is not optimal for multi-modal segmentation [104,
121]. In order to extract more than a single structure, RANSAC is repeated sequentially
(fit-remove) – the subsequent data for further segmentation are often either contaminated
with leftover data from extracted structures (when the “inlier threshold” is smaller than the
real variance), or contaminated with data from other structures (when the “inlier threshold” is larger than the real variance). Zuliani et al. [104] performed fit-remove a number of times and, with MultiRANSAC, selected the best fit-remove run with the most inliers. The authors show better results compared to RANSAC, but the method is computationally more expensive than
the sequential RANSAC approach. In addition, the number of structures has to be
predefined.
3) Determining the outlier ratio, which decides the number of samples required to ensure an outlier-free sample
In large-scale data, the percentage of outliers becomes relatively large due to the sheer
number of structures. This makes segmentation more inefficient, as the number of sampling iterations required for the RANSAC algorithm depends on the outlier ratio.
Since data that are far away are unlikely to be part of a single structure, one solution is to
select the neighbouring points of the first randomly sampled point as the sub-set of points
with higher probability. This is known as the minimal sample sets (MSS) [104, 121],
where after randomly picking a point xi, xj has the following probability of being drawn
with a Gaussian kernel:
$P(x_j \mid x_i) = \dfrac{1}{Z} \exp\left(-\dfrac{\lVert x_j - x_i \rVert^2}{\sigma^2}\right)$

( 6-1 )
where Z is a normalization constant and σ is the estimated variance of the residuals
(“inliers threshold”).
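A small sketch of this sampling scheme is given below: after drawing x_i, the remaining points of the minimal sample set are drawn with probability proportional to exp(−‖x_j − x_i‖²/σ²), per Eq. (6-1). The point set and σ are placeholders.

```python
# Sketch of minimal-sample-set (MSS) sampling per Eq. (6-1): bias the rest of
# the sample towards points near the first randomly drawn point.
import numpy as np

def sample_mss(points, sigma=0.1, k=3, rng=np.random.default_rng(9)):
    i = rng.integers(len(points))
    d2 = np.sum((points - points[i]) ** 2, axis=1)
    p = np.exp(-d2 / sigma ** 2)            # Gaussian-kernel weights
    p[i] = 0.0                              # do not redraw x_i itself
    p /= p.sum()                            # the 1/Z normalisation
    others = rng.choice(len(points), size=k - 1, replace=False, p=p)
    return np.concatenate([[i], others])

pts = np.random.default_rng(10).random((200, 3))
print(sample_mss(pts))                      # indices of one minimal sample set
```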
With MSS in the selection of the sample sets, the RANSAC algorithm requires
fewer (random sampling) iterations to ensure at least one random sample is free from
outliers with some probability p. However, the outlier ratio is often not known a priori, or
is difficult to estimate. In sequential RANSAC, the outlier ratio changes for different (fit-
remove) iterations. Konouchine et al. [118] re-estimated the outlier ratio for every
iteration using the Expectation Maximization algorithm. In practice, without knowledge of
the outlier ratio, most authors determine the number of iterations by setting a relatively
large number [104, 122].
4) Determining the stopping criteria for fit-remove sequential RANSAC
In most problems, we are often unsure of the number of structures in the data. In the fit-
remove strategy, model fitting is repeated until the leftover points are below a threshold.
However, it is often difficult to estimate a good threshold. In contrast, the mixture
modelling approach typically stops when the maximum number of iterations is reached or
upon saturation, i.e., when the objective function improvement between two consecutive
iterations is less than the minimum amount of improvement specified. In [118],
Konouchine et al. stopped the iteration when the probability of the selected outlier-free
sample reached 95%-97%. There is no proof that this solution will always work unless the
data set is relatively clean, i.e. contains only data that can be fitted with the assumed
model.
5) Determining the number of models for sequential multi-modal RANSAC methods
With the appropriate stopping criteria, it is possible to realise the number of models.
Another approach is to not rely on the stopping criteria and discover the number of models
during multi-structure segmentation. For instance, Toldo and Fusiello [121] proposed
using agglomerative clustering (see next section for more details of clustering) with the
Jaccard distance to cluster randomly sampled hypotheses. The method starts with random
sampling and then proceeds by linking points whose Jaccard distance is smaller than 1, and
stops when no such points are left. However, clustering in the random sampled
hypothesis space has the same shortcoming as general clustering methods – the number of
clusters can often be difficult to estimate.
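To make the linking criterion concrete, the sketch below (illustrative, not Toldo and Fusiello's implementation) computes the Jaccard distance between two points' preference sets, i.e. the sets of random hypotheses to which each point is an inlier; a distance of 1 means the sets are disjoint, which is why the linking stops at that value:

def jaccard_distance(pref_a, pref_b):
    # Jaccard distance between two preference sets (sets of the
    # hypothesis indices that a point is an inlier to).
    a, b = set(pref_a), set(pref_b)
    union = a | b
    if not union:
        return 1.0          # no shared hypotheses at all
    return 1.0 - len(a & b) / len(union)

# Points agreeing on most hypotheses are close; disjoint sets give 1.
print(jaccard_distance({0, 1, 2, 5}, {1, 2, 5, 7}))  # 0.4
print(jaccard_distance({0, 1}, {2, 3}))              # 1.0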
6.2.3 Clustering
Clustering provides an alternative approach to segmenting the data simultaneously, similar
to region growing approaches with multiple seeds. Clustering has the advantage of fitting
multiple models into a global fitting model that consists of several instances of the same
model, each corresponding to a different set of parameters, unlike model fitting where
inliers of other structures are treated as outliers (pseudo outliers).
When clustering data, the actual number of clusters is unknown. Determining the
number of clusters, which is the fundamental problem of cluster validity, is one of the
main problems in clustering methods. Several researchers have tried to address this
classical open problem, including Moreau [123], Dubes [124], Fraley [125] and Still [126].
Several solutions have been proposed; most rely on information criteria or, more
recently, the Dirichlet clustering process [127].
The following outlines the general methods for clustering and the approach to
assess cluster validity:
a. Hierarchical methods
One of the clustering methods is hierarchical agglomerative clustering: starting with as
many cluster centres as there are data, the closest centres are merged as the level proceeds
upwards to construct the tree. Alternatively, hierarchical divisive methods start by
grouping all data together in a single cluster and then splitting the cluster as it proceeds to
a higher level. In order to handle data with outliers, Frigui and Krishnapuram proposed the
Robust Competitive Agglomeration (RCA) algorithm [128]. The algorithm reduces the
effect of outliers by clustering the data into a large number of small clusters with fuzzy
clustering. The small clusters are then merged with a competitive agglomeration process. To
determine the number of clusters, the hierarchical tree is cut at a visually appealing level.
However, the assessment of “visually appealing” is intrinsically difficult.
In previous work for outdoor airborne laser-scanned data, Haala and Brenner [25]
utilised ISODATA, which is a combination of the agglomerative and divisive clustering
methods to cluster data into planar, vegetation and terrain data. The optimal number of
spectral clusters is determined based on the minimum distance criterion. The clusters are
split and merged in the feature space (normalised height and multi-spectral information
from colour-infrared aerial images) throughout the iterations. The drawback of this
method is that a number of parameters or thresholds have to be initialised by the operator,
which can be difficult to estimate.
b. Objective function approaches:
K-means
In contrast to the hierarchical methods, k-means clustering first decides on an objective
function that determines how well the clusters fit the data. The objective function J, which
is a squared error function (the total intra-cluster variance, Eq. ( 6-2 )), is minimised to
determine the centroids c of the clusters, where x is a 1×N feature matrix [129]:

J = \sum_{i=1}^{k} \sum_{j=1}^{N} (x_j - c_i)^T (x_j - c_i)    ( 6-2 )
The number of clusters can be determined by starting from a small k and increasing
the value of k until the convergence error is acceptable. However, it is possible to have a
configuration of the clusters that have converged but does not have the minimum
distortion or has over-fitted. One solution to this is to introduce a criterion that includes a
distortion (which is the objective function) and a penalty function that increases with the
number of parameters, such as the Schwartz Bayesian information criterion (BIC) [130] in
the following equation:
BIC = \sum_{i=1}^{k} \sum_{j=1}^{N} (x_j - c_i)^T (x_j - c_i) + \lambda\, m k \log N    ( 6-3 )
where m is the number of dimensions; k is the number of centres; N is the number of
feature data; λ = 0.5 or can be obtained empirically. Other criteria include the Akaike
information criterion (AIC) [129] (which tends to overfit the model as it does not penalise
the number of parameters as strongly as BIC), Minimum Message Length (MML), and
Minimum Description Length (MDL), hypothesis-testing and cross-validation.
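A sketch of this model-selection recipe (Python/NumPy; a simple Lloyd's k-means written for illustration, not the thesis code), using the penalised criterion of Eq. ( 6-3 ) with λ = 0.5 as suggested above:

import numpy as np

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]          # initial centroids
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C) ** 2).sum(-1), axis=1)
        C = np.array([X[labels == i].mean(0) if np.any(labels == i) else C[i]
                      for i in range(k)])
    labels = np.argmin(((X[:, None] - C) ** 2).sum(-1), axis=1)
    return ((X - C[labels]) ** 2).sum()                  # distortion J, Eq. (6-2)

def bic(J, k, m, N, lam=0.5):
    return J + lam * m * k * np.log(N)                   # Eq. (6-3)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in (0.0, 3.0, 6.0)])
scores = {k: bic(kmeans(X, k), k, X.shape[1], len(X)) for k in range(1, 7)}
print(min(scores, key=scores.get))   # the true k = 3 is typically selected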
Fuzzy Clustering
K-means clustering divides the data into crisp clusters where each data point belongs only
to exactly one cluster. In fuzzy clustering, each data point can belong to more than a single
cluster. A membership function is estimated for every point to indicate the degree of the
points belonging to different clusters. Introduced in 1981, the Fuzzy C-Means (FCM)
algorithm [131] is probably the most widely used fuzzy clustering algorithm.
Biosca and Lerma [132] recently implemented the Possibilistic C-Means (PCM),
which is a modification of the FCM algorithm that casts the clustering problem into the
possibility theory framework. PCM20 uses a more
complicated objective function (which includes a fuzzy partition matrix U = [u_ij] of
dimension k×n). It has been successfully applied to terrestrial scanned data and is therefore
chosen to verify our proposed solution (described in Section 6.3). The optimal number of
clusters is determined by minimizing the cost function, J. The process starts from k=2 and
stops when convergence is reached (i.e. |cost func(k-1) - cost func(k)| < ε):
J = \sum_{i=1}^{k} \sum_{j=1}^{N} u_{ij}^{m} (x_j - c_i)^T (x_j - c_i) + \sum_{i=1}^{k} \sum_{j=1}^{N} \eta_i (1 - u_{ij})^m    ( 6-4 )
where the membership function uij and prototype cluster ci are defined as follows:
20 http://mehr.sharif.edu/~amiri/download/Y_FCMC/Y_FCMC_Ver.1.0.zip
u_{ij} = \frac{1}{1 + \left( \frac{(x_j - c_i)^T (x_j - c_i)}{\eta_i} \right)^{1/(m-1)}}    ( 6-5 )
c_i = \frac{\sum_{j=1}^{N} u_{ij}^{m} x_j}{\sum_{j=1}^{N} u_{ij}^{m}}    ( 6-6 )
The initial values of the prototype clusters and membership function are computed
by applying the Fuzzy C-Means algorithm to the data.
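A compact sketch (Python/NumPy, illustrative only) of one PCM iteration, applying the membership update of Eq. ( 6-5 ) and the prototype update of Eq. ( 6-6 ):

import numpy as np

def pcm_step(X, C, eta, m=2.0):
    # Squared distances d2[i, j] between prototype i and point j.
    d2 = ((C[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    U = 1.0 / (1.0 + (d2 / eta[:, None]) ** (1.0 / (m - 1.0)))   # Eq. (6-5)
    Um = U ** m
    C_new = (Um @ X) / Um.sum(axis=1, keepdims=True)             # Eq. (6-6)
    return U, C_new

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 0.2, (50, 2)), rng.normal(4.0, 0.2, (50, 2))])
C = np.array([[0.5, 0.5], [3.5, 3.5]])
eta = np.array([1.0, 1.0])       # the eta scale parameters of Eq. (6-5)
for _ in range(10):
    U, C = pcm_step(X, C, eta)
print(C.round(2))                # prototypes converge onto the two clusters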
c. Parametric likelihood-based or model-based approaches:
The parametric likelihood-based methods resolve the clustering problem by estimating the
number of sources from the measurements. Among these model-based methods, an
example is the probabilistic mixture model which attempts to explain the data by
estimating the distribution and the density of the clusters. The Gaussian Mixture Model is
perhaps the most common method. In addition to computing the mean of the clusters in k-
means, it also computes the covariance and the prior probability of the clusters. The
parameters can be estimated using the expectation-maximization algorithm and the
number of clusters (or mixture models) can be chosen by maximising BIC [129] for k = 2
to kmax:
BIC = -L(\Theta; D) + \frac{p}{2} \log N    ( 6-7 )
L(x; \Theta) = \prod_{j=1}^{N} \sum_{i=1}^{k} \alpha_i \frac{1}{(2\pi)^{d/2} \det(\Sigma_i)^{1/2}} \exp\!\left\{ -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right\}    ( 6-8 )
N is the number of data and p is the number of free parameters; Θ = (α_1, …, α_k, θ_1, …, θ_k)
with θ_i = (µ_i, Σ_i), where µ_i is the mean of the cluster, Σ_i is the covariance and α_i is the
prior probability. The aforementioned criteria, such as AIC, can also be applied instead of the
BIC criterion.
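With a library implementation, this selection procedure reduces to a few lines; the sketch below uses scikit-learn's GaussianMixture purely as a convenience for the illustration (an assumption of this sketch, not the software used in the thesis):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(c, 0.3, size=(150, 2)) for c in (0.0, 3.0)])

# Fit GMMs by EM for k = 2 .. k_max and keep the BIC-optimal one.
best_k = min(range(2, 8), key=lambda k:
             GaussianMixture(n_components=k, random_state=0).fit(X).bic(X))
print(best_k)   # the two-component model is typically selected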
The above-mentioned clustering methods often cluster in a feature space that
frequently lacks spatial coherence. This can cause discontinuity (for example, in a surface-
normals feature space, two data of similar surface normals that are spatially far apart may
be grouped together as a single cluster) in clustered data in the neighbourhood of the data.
This is illustrated in the results of Set A in Section 6.4. To avoid this, we require a model-
fit clustering method which clusters solely in the geometry space (xyz).
In practice, the constraint of only grouping locally-close data can be enforced by
partitioning the large scale data into smaller voxels, similar to the work of Hansen et al.
[133]. However, the fitted plane patches are still discontinuous and require merging or
grouping into locally-delimited planes. Grouping plane patches is necessary for
visualisation purposes and to reduce the amount of data required to represent the same
object.
d. Non-parametric likelihood-based approach:
Infinite Gaussian Mixture Modelling (IGMM) is a non-parametric Bayesian modelling
approach for data clustering (i.e. the number of clusters does not need to be known in
advance), proposed by Rasmussen [134]. With the actual number of underlying models
unknown, instead of relying on a criterion, IGMM (which is a Bayesian approach to
mixture modelling) clusters the data in a statistically-principled manner.
The parameters of IGMM are estimated by Gibbs sampling and the number of
clusters is allowed to grow with the data. After the sampling has converged, one has a
distribution over the number of clusters given the data. The IGMM clustering process
avoids labels by constructing an “exchangeable cluster process” that consists of an infinite
sequence of points in R^d, with a random partition of the integers into k blocks, i.e. k
clusters [134]. Therefore, the value of k is estimated in the clustering process. More details are
provided in the following section.
6.3 INFINITE GAUSSIAN MIXTURE MODEL
As described above, instead of specifying a prior number of mixtures or a maximum
number of clusters, the IGMM determines the number of mixtures by Gibbs random
sampling. A number of samples are generated from the Gaussian distribution for the
estimation of parameters for different mixture components. Starting with one component
or mixture, the parameters and hyper-parameters are defined as in the description below
and are updated during the Gibbs sweeps.
The likelihood function for the finite Gaussian Mixture Model in Eq. ( 6-7 ) can be
written:

P(x \mid \Theta) = \sum_{i=1}^{k} \alpha_i\, G(\mu_i, \Sigma_i)    ( 6-9 )
The parameters in the likelihood function for the finite Gaussian Mixture Model in
Eq. ( 6-7 ) can be estimated using the expectation-maximization (EM) algorithm. The EM
algorithm is an iterative approach that estimates the parameters using two steps,
an expectation step and a maximization step [135]. Alternatively, the mixture model
parameters can be estimated using posterior sampling as indicated by Bayes' theorem as
shown in eq. ( 6-10 ), where the posterior distribution is proportional to the product of the
prior probability measure and the likelihood function.
P(\Theta \mid x) = \frac{P(\Theta)\, P(x \mid \Theta)}{P(x)} \propto P(\Theta)\, P(x \mid \Theta)    ( 6-10 )
where P(x) is the normalisation constant.
Unfortunately, the inference of the joint posterior probability is analytically
intractable, i.e. computations on the model cannot be done using exact mathematical
formulae. The computations have to be performed by means of computer simulations: a
two-step iterative procedure known as Gibbs sampling can be used. Gibbs sampling is
applicable when the joint distribution is not known explicitly, but the conditional
distribution of each variable is known. Gibbs sampling is used to update each variable, i.e.
{α, µ, Σ}, in turn from its conditional distribution given all other variables in the model.

P(\alpha, \mu, \Sigma \mid x) \propto P(\alpha, \mu, \Sigma)\, P(x \mid \alpha, \mu, \Sigma)    ( 6-11 )
For a finite number of mixtures, the likelihood of xn associated with mixture i in Eq ( 6-11 )
is then:
P(x_n \mid c_n = i, \mu_i, \Sigma_i) = G(\mu_i, \Sigma_i^{-1}) \propto \Sigma_i^{1/2} \exp\!\left( -\frac{\Sigma_i (x_n - \mu_i)^2}{2} \right)    ( 6-12 )
To decide whether to assign the sample to the existing mixture or a new mixture,
the conditional posteriors for the stochastic indicator variables, c = \{c_1, \dots, c_N\}, one for
each observation, are introduced to encode which class has generated the observation. The
indicators carry the value of the class to which the data point belongs, i.e. 1 to k and are
updated in the two-step iterations. The approach is based on the formulation described in
Rasmussen [134].
The conditional posterior distributions for the priors on component parameters and
the hyper-parameters are derived in Appendix III. The conditional prior for a single
indicator given all the others is:
P(c_j = i \mid c_{-j}, \alpha) = \begin{cases} \dfrac{N_{-j,i}}{N - 1 + \alpha} & \text{if } i \text{ is represented} \\ \dfrac{\alpha}{N - 1 + \alpha} & \text{if } i \text{ is unrepresented} \end{cases}    ( 6-13 )
where −j indicates all indexes except j and N_{−j,i} indicates the number of observations,
excluding x_j, that belong to mixture i.
Finally, Gibbs sampling can then be used to calculate the conditional posterior for
each cn:
P(c_n = i \mid c_{-n}, \mu_i, \Sigma_i, \alpha) \propto \begin{cases} \dfrac{N_{-n,i}}{N - 1 + \alpha}\, \Sigma_i^{1/2} \exp\!\left( -\dfrac{\Sigma_i (x_n - \mu_i)^2}{2} \right) & \text{if } i \text{ is represented} \\ \dfrac{\alpha}{N - 1 + \alpha} \displaystyle\int P(x_n \mid \mu_i, \Sigma_i)\, P(\mu_i \mid \lambda, \gamma)\, P(\Sigma_i \mid \beta, w)\, d\mu_i\, d\Sigma_i & \text{if } i \text{ is unrepresented} \end{cases}    ( 6-14 )
For the represented mixtures, the conditional posterior is given by the product of the prior
and the likelihood: it is based on \mu_i \sim P(\mu_i \mid c, x_j, \Sigma_i, \lambda, \gamma) and
\Sigma_i \sim P(\Sigma_i \mid c, x_j, \mu_i, \beta, w), as derived in Eqs. ( 10-2 ) and ( 10-8 ).
Due to the absence of the training data, the conditional posterior of the unrepresented
mixture has to be determined by the priors: it is based on \mu_i \sim P(\mu_i \mid \lambda, \gamma)
and \Sigma_i \sim P(\Sigma_i \mid \beta, w), as derived in Eqs. ( 10-1 ) and ( 10-7 ).
Modified IGMM
For clustering of the classified planar data or planar patches, we modified the
Euclidean distance function (x_i - \mu_i)^2 in the Gaussian mixture as follows: the spatial
observations (the coordinates of the range data) that belong to cluster j are fitted with a
plane via the orthogonal distance regression method using the SVD method, as explained
in Section 3.3.2. The plane coefficient for the best fitted plane is the eigenvector that
corresponds to the smallest eigenvalue. With the plane coefficients \{a, b, c, d\} for the j
mixtures, we can then replace the distance measure between the data and the cluster centre
with the distance measure between the data and the best-fitted plane. The distance function
for x = \{x_x, x_y, x_z\} is then the shortest point-to-plane distance, distance_{xj}, given as
follows:

t = -\frac{d + a x_x + b x_y + c x_z}{a^2 + b^2 + c^2}    ( 6-15 )

distance_{xj} = [a t;\; b t;\; c t]    ( 6-16 )
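A sketch of this modified residual (Python/NumPy; illustrative, not the thesis code): the plane is fitted to a cluster's points by SVD, and the orthogonal distance of Eq. ( 6-15 ) then replaces the Euclidean distance to the cluster centre:

import numpy as np

def fit_plane(P):
    # Orthogonal-distance plane fit: the normal is the right singular
    # vector of the centred data with the smallest singular value.
    c = P.mean(axis=0)
    _, _, Vt = np.linalg.svd(P - c)
    a, b, cc = Vt[-1]                     # unit normal (a, b, c)
    d = -np.dot(Vt[-1], c)                # plane: a x + b y + c z + d = 0
    return a, b, cc, d

def point_plane_distance(x, plane):
    a, b, c, d = plane
    # Magnitude of t in Eq. (6-15); the normal is unit-length here,
    # so a^2 + b^2 + c^2 = 1.
    return abs(a * x[0] + b * x[1] + c * x[2] + d) / np.sqrt(a*a + b*b + c*c)

rng = np.random.default_rng(6)
P = np.column_stack([rng.uniform(0, 5, 200), rng.uniform(0, 5, 200),
                     rng.normal(0, 0.01, 200)])   # noisy z = 0 plane
plane = fit_plane(P)
print(point_plane_distance(np.array([1.0, 2.0, 0.5]), plane))  # ~0.5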
With the Gibbs sampler that produces analytic conditional distributions for
sampling and the conditional posteriors defined, the two-step iterative algorithm
(Algorithm 6-1) can be employed to generate a sequence of samples in order to
approximate the conditional joint distribution. Geman and Geman [136] showed that,
under mild conditions, and after a large number of iterations, the sampler converges to a
set of samples from the joint posterior distribution.
Initialize c, k, µ, Σ, β, γ, π, w, α
For j = 2 : no. of iterations
    Two-step iterative algorithm:
    Step 1: Sample Normal distribution means and covariances given a current
    assignment of data to classes
        For i = 1 : k_rep
            Update N_i, the number of data points belonging to mixture i
            Update mixing weights: π_i = N_i / (N + α)
        End
        For i = 1 : N_j
            Sample µ_i ~ P(µ_i | λ, γ) for unrepresented mixture
            Sample Σ_i ~ P(Σ_i | β, w) for unrepresented mixture
        End
        For i = 1 : k_rep
            Sample µ_i ~ P(µ_i | c, x_j, Σ_i, λ, γ) for represented mixture
            Sample Σ_i ~ P(Σ_i | c, x_j, µ_i, β, w) for represented mixture
        End
        Update hyper-parameters:
            Sample λ ~ P(λ | µ, γ)
            Sample γ ~ P(γ | µ, λ)
            Sample w ~ P(w | Σ, β)
            Sample β ~ P(β | Σ, w)
            Sample α ~ P(α | k_rep, N)
    Step 2: Sample the assignment of data to classes given current values for the
    means and covariances (CRP)
        For n = 1 : N_j
            Sample the indicator c_nj for iteration j according to Eq. ( 6-14 )
        End
        Update k_rep, the number of represented mixtures
End
Algorithm 6-1 IGMM algorithm
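To illustrate Step 2, a minimal one-dimensional sketch (Python; not the thesis implementation) of sampling a single indicator from Eq. ( 6-14 ) is given below. Σ_i is treated as a precision, and the unrepresented-class integral is approximated by Monte Carlo averaging over draws from standard-normal and unit-Gamma priors, an assumption made purely for the sketch:

import numpy as np

def sample_indicator(x_n, counts, mus, precs, alpha, rng, prior_draws=50):
    N = counts.sum() + 1                          # total data including x_n
    lik = np.sqrt(precs) * np.exp(-0.5 * precs * (x_n - mus) ** 2)
    p = counts / (N - 1 + alpha) * lik            # represented classes
    mu0 = rng.normal(0.0, 1.0, prior_draws)       # assumed prior draws
    s0 = rng.gamma(1.0, 1.0, prior_draws)
    lik0 = np.mean(np.sqrt(s0) * np.exp(-0.5 * s0 * (x_n - mu0) ** 2))
    p = np.append(p, alpha / (N - 1 + alpha) * lik0)   # unrepresented class
    p /= p.sum()
    return rng.choice(len(p), p=p)   # index len(p)-1 opens a new class

rng = np.random.default_rng(7)
print(sample_indicator(0.1, np.array([40, 10]), np.array([0.0, 5.0]),
                       np.array([4.0, 4.0]), alpha=1.0, rng=rng))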
6.4 RESULTS
We evaluated our proposed modified IGMM algorithm for clustering of planar data or
patches on two sets of acquired outdoor data obtained from the terrestrial laser scanner.
On Dataset A, we evaluated our modified IGMM algorithm and compared its
performance with the IGMM and PCM algorithms on 8,416 labelled planar data, as shown in
Figure 6-3. The feature used is the spatial coordinates of the data.
PCM can be used to segment planar data by clustering in the feature space
(estimated normals and distance from the origin of the data). The estimated normals (n1, n2,
n3) of the plane data can be obtained by least square fitting k neighbouring points, where
the value k is computed adaptively depending on the estimated curvature, density and
noise of the surrounding data [72]. The distance from origin di of the data can then be
computed with the following equation [132]:
d_i = -n_1 x_i - n_2 y_i - n_3 z_i    ( 6-17 )
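A sketch (Python/NumPy, illustrative) of building these two PCM features: the normal is estimated by a least-squares plane fit over k neighbours, and the distance-from-origin feature follows Eq. ( 6-17 ); a fixed k is used here rather than the adaptive choice of [72]:

import numpy as np

def normal_and_distance(points, idx, k=20):
    p = points[idx]
    nn = np.argsort(((points - p) ** 2).sum(axis=1))[:k]   # k nearest points
    Q = points[nn]
    _, _, Vt = np.linalg.svd(Q - Q.mean(axis=0))
    n = Vt[-1]                     # right singular vector, smallest sigma
    d = -np.dot(n, p)              # distance-from-origin feature, Eq. (6-17)
    return n, d

rng = np.random.default_rng(8)
pts = np.column_stack([rng.uniform(0, 5, 500), rng.uniform(0, 5, 500),
                       1.0 + rng.normal(0.0, 0.01, 500)])   # noisy plane z = 1
n, d = normal_and_distance(pts, idx=0)
print(n.round(2), round(float(d), 2))   # normal ~ (0, 0, +-1), |d| ~ 1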
Dataset A
As shown in Figure 6-3, both IGMM and modified IGMM successfully grouped the major
planes. The modified IGMM segmented the major vertical plane into a single group. This
demonstrates that the modified IGMM is able to overcome missing data caused by occlusion
and the transparent glass door. In comparison, IGMM clustered together some of the data that
belonged to different planes (but were relatively close to each other), despite the data having
very different plane normals. The representation of the original data can be reduced by up to
99% (8,416 points versus 19 planes).
Figure 6-3 (Set A: plane data). a, b, c show the "bird's-eye" view and d, e, f show the isometric view of the clustered extracted plane data; processed with a, d: Modified IGMM; b, e: IGMM; and c, f: PCM.
As mentioned in Section 6.2.3 for PCM, the features to be fitted include the
normalised estimated plane normals and the distance d from the plane to the origin. The η
scale parameter in Eq. ( 6-5 ) was determined empirically and was set to 2 in the experiment.
The result shows the drawback of clustering in feature space – discontinuities in the
neighbourhood of the clusters are unavoidable. This can only be remedied by applying region
growing as a post-processing step.
Dataset B
In Dataset B, 1,187,563 data belonging to building surfaces were extracted from
7,086,588 data (a large model consisting of 7 registered scans). The point clouds were
partitioned into (400x400) 160,000 divisions for data classification and RANSAC plane
fitting was performed, as shown in Chapter 7. Using the aforementioned pre-processing
steps, the building data were reduced to 22,618 plane patches, as shown in Figure 6-4.
Figure 6-4a,c shows the centres of the plane patches clustered using the modified
IGMM algorithm, which resulted in 59 clusters. In comparison, 30 clusters were computed
with the IGMM algorithm as shown in Figure 6-4b,d. The modified IGMM is capable of
detecting most of the planar structure in the smaller buildings and segments the data into
different planar groups. One of the main differences between the behaviour of IGMM and
modified IGMM is shown in the polygon that surrounds the cylinder. The modified
IGMM managed to segment the data into different side groups, while IGMM grouped
more than one side in a single cluster.
To compare the performance of PCM with IGMM and the modified IGMM, the
plane patches in Dataset B are also clustered with PCM, which results in far too many
discontinuities. This is most likely due to the large variance in the distance from origin d.
Also, as stated in [132], the chance of a discontinuity occurring inside a neighbourhood
increases with its size. Dataset B contains a complex configuration with relatively fewer
data, making it more difficult to cluster in feature space (like PCM).
Figure 6-4 (Set B: plane patches). a, b show the "bird's-eye" view and c, d show the isometric view of the clustered extracted plane data; processed with a, c: Modified IGMM and b, d: IGMM.

6.5 CONCLUSION

In this chapter, the Infinite Gaussian Mixture Modelling (IGMM) method is shown to be
robust for the clustering of plane data and plane patches. Unlike traditional clustering-
based methods, the number of clusters (or the maximum number of clusters) does not have
to be pre-defined. Compared to RANSAC-based methods, the Gibbs sampling employed
for the IGMM fits multiple clusters simultaneously instead of sequentially. The IGMM
does not suffer from the "fit-remove strategy" issues explained in Section 6.2.2.2.
Furthermore, the modified Gaussian residual function employed in IGMM improves the
clustering accuracy for planar data and is capable of handling missing data caused by
occlusion or transparent objects. Both IGMM and modified IGMM are capable of
segmenting locally-delimited planes, even when working on data separated into different
groups. The algorithm can be extended to cluster non-planar objects, such as cylindrical or
cone objects, which are common in outdoor environments.
The drawback with clustering methods, or a simultaneous multi-model fitting that
fits every point, is that outliers or gross noise (e.g. uniform random noise when a Gaussian
noise distribution is assumed) are difficult to deal with. The method is also not robust to
data with unexpected structures that cannot be fitted with a pre-defined model. Therefore,
the terrestrial urban data have to be pre-processed to remove the non-planar data (e.g.
vegetation).
For further work, the clustered planes can be further modelled with Superquadric
[137], which requires only a small set of parameters to model a large variety of different
basic shapes. For visualisation purposes, texture maps can be created from the calibrated
colour camera images to provide a more realistic urban model.
7.0 CONCLUSIONS
7.1 CONCLUDING REMARKS
This dissertation has presented a series of algorithms for the generation of urban
models from terrestrial LIDAR data. The contribution of the work presented in this
dissertation is mainly in the emphasis on classifying over-segmented LIDAR data instead
of every single point. With the selected and improved feature descriptors for the over-
segmented LIDAR data and the multi-scale learning model, the proposed approach has
been shown to be an effective and accurate method for 3D outdoor LIDAR classification.
Specifically, this thesis has made the following contributions:
• The image occlusion algorithm proposed by Herley [13] is extended to detect
overlapped occlusions. The existing algorithm assumes that the occlusions are all
independent objects; one connected occluded region cannot consist of occlusions from
different images. However, this assumption is often violated in outdoor environments
due to the large number of pedestrian overlaps. By understanding that a large
difference in the discontinuity measure at the boundary of the occlusions is most
likely to indicate an occlusion, the proposal is to analyse the discontinuity measure of
the boundary where occlusions occur, to separate any overlapped occlusions. The
extended algorithm [138] described in Chapter 2 has been evaluated on indoor and
outdoor panoramic image sets and shows promising results.
• A novel method is proposed that robustly and efficiently classifies outdoor terrestrial
LIDAR data into different classes. 3D data labelling in previous work was mostly
point-based. This point-based approach introduces redundant computation – labelling
every point results in a high computational load which can be reduced by classifying a
smaller sub-set of the data. A 3D over-segmentation algorithm, super-voxel, is
introduced in Chapter 4 for this purpose. This proposed method, based on 3D scale
theory, groups regions which are homogeneous with respect to colour and geometry
similarities. This method is shown to efficiently reduce the outdoor terrestrial LIDAR
data for classification.
• An overview and comparison of feature descriptors for the classification of the
LIDAR data is provided in Chapter 3. Feature descriptors are computed for the super-
voxels. One of the feature descriptors, the saliency feature, is improved to be invariant
to the adaptive size of the super-voxel that is used to compute the descriptors. An
efficient classification method can then be employed to label the super-voxels into
different data types.
• A hierarchical graphical model, the multi-scale Conditional Random Fields (mCRF),
is proposed to learn the parameters of the extracted local and regional features, as
depicted in Chapter 5. The comparison of the classification results shows improvement
in accuracy with the 3D over-segmentation and mCRF. This is an extension to work
first published in [73] and more recently in [139].
• A robust estimation method is described in Chapter 6 that successfully addresses a
number of the issues associated with the segmentation of complex data with unknown
numbers of structures. The residual function of the Infinite Gaussian Mixture Model
(IGMM) is modified for clustering of labelled data belonging to planar surfaces into
locally-delimited planes. The modified algorithm is evaluated on the labelled planar
data. The result of the proposed method shows the robustness of the algorithm for the
clustering problem. The proposed method is also shown to be capable of handling
missing data caused by occlusion or transparent objects.
There are several limitations of the approaches presented in this thesis. To address
the existing limitations and also as part of future work, there are numerous interesting
research directions, including the following:
• Computing super-voxels for 3D data classification in aerial urban modelling and
indoor objects: The problem of redundancy in point-based modelling is not limited
to outdoor terrestrial LIDAR data classification. Further investigation of the
applicability of super-voxels in classification of 3D data of other kinds can be
conducted. In addition, to make the proposed approaches more robust and widely
applicable, it is necessary to test a wider variety of data in future work.
• The results of the object classification provided are promising, but they still offer
room for some improvement. Further investigation of the integration of this work
in 3D classification with 2D image classification can be conducted. Working in 3D
has the advantage of capturing depth and curvature features, and overcoming
limitations due to viewpoint and lighting variations. On the other hand, working
with 2D images is capable of acquiring high resolution in a short time and can
provide better edges in the segmentation. Classification or segmentation based
simultaneously on the 2D and the 3D data can increase the overall accuracy of the
classification results.
• As explained in Chapter 6, one of the challenges of the RANSAC sequential “fit-
remove” strategy for multi-structure robust segmentation is in determining the
“inliers threshold”. The “inliers threshold” for different structures might not be
optimised if fixed to the same value. One interesting approach to these threshold
estimation issues is to explore the possibility of using the estimated variances of
the mixture models, computed with the Infinite Gaussian Mixture Model, to
determine the “inliers thresholds”.
• Finally, for a complete solution to a final urban model, the clustered planes in
Chapter 6 can be further modelled with Superquadric [137], which requires only a
small set of parameters to model a large variety of different basic shapes. In
addition, for visualisation purposes, texture maps can be created from the
calibrated colour camera images to provide a more realistic urban model.
In short, this thesis has proposed several improvements to urban modelling
methodology. However, like all studies, the investigation has been limited by time and
resource constraints. The author hopes that this dissertation will be informative and useful
to researchers and stimulate further research in the area.
8.0 APPENDIX I: DATA ACQUISITION
The other data acquisition technology, InSAR, and several existing scanning
strategies using laser scanners are elaborated in this section:
Sensor Technology
Besides LIDAR, Interferometric Synthetic Aperture Radar (InSAR) is another recently
developed class of active sensor. Using satellites or aircraft, InSAR acquires the
measurements by computing the differences in the returned phase of the waves generated
by two or more synthetic aperture radars (SARs). The technology can measure deformation
changes at centimetre scale over time spans of days to years. However, its vertical accuracy
for airborne data acquisition was worse than 1 m in
Haithcoat et al.'s experiment [140], whereas LIDAR can achieve up to a few centimetres
of accuracy. According to Norheim et al. [141], the LIDAR DEM (Digital Elevation
Model) has less bias (~30 cm) and less variance (~90 cm) than the InSAR DEM (bias ~90
cm, variance ~3.5 m).
The advantage of InSAR over LIDAR is that it flies higher and faster than most
LIDAR systems, and the radar signal is able to penetrate fog and rain. InSAR also has a
lower per-area cost, and is capable of more rapid surveys of large areas and faster post-
survey processing. Some authors [142] suggest combining both sensors as they
complement each other. However, the use of InSAR is difficult for ground-based data
acquisition.
Scanning Strategy
For ground-based data acquisition, 2D line laser scanners and 3D laser scanners mounted
on a mobile robot or vehicle, or held by hand, can be used. The acquisition can be done in a
stop-and-go fashion [46, 143, 144] or in a continuous fashion [57, 144, 145]. The stop-
and-go approach measures the environment at several fixed locations, while the
continuous approach involves a moving sensor platform, such as a car or mobile cart.
Localisation of the scanning positions is generally done with GPS (Global Positioning
System) or by using aerial photos as a global map. The drawback of GPS as the source of
global position estimates in outdoor environments is that it tends to fail when operating in
dense urban environments, particularly in urban canyons where only a few satellites are
visible. An accurate differential GPS system that can fulfil the registration requirement is
expensive. The system also often includes an Inertial Navigation System that uses
computer and motion sensors to continuously track the position, orientation and velocity
of the scanner.
Stamos and Allen [143] acquired data on the environment in a stop-and-go fashion.
The authors used a Cyrax range scanner which has centimetre level accuracy up to 100m.
The authors scanned multiple range scans from several scan positions and integrated the
scans with images acquired from a camera. In another data acquisition example which
uses the continuous scanning method, Fruh and Zakhor [145] acquired 3D data with a
configuration of two 2D laser scanners and a camera. The approach uses aerial images to
precisely reconstruct the path of the acquisition vehicle in offline computations. One
scanner is mounted vertically to capture building facades, and the other is mounted
horizontally. Successive horizontal scans are matched with each other in order to
determine an estimate of the vehicle's motion, and relative motion estimates are
concatenated to form an initial path. Small errors in the initial path which accumulate to
become large over time can be eliminated by using Monte-Carlo-Localization with an
airborne map to correct global pose. By combining stop-and-go and continuous scanning
of the laser rangefinder, Asai et al. [146] first scanned the environment using the stop-and-
go approach to obtain a more complete scan of the environment. The non-measured
portions were then covered by the continuous scanning approach. The system includes a
terrestrial laser range finder, a van, a GPS and an INS sensor.
9.0 APPENDIX II: SINGULAR VALUE DECOMPOSITION
Singular Value Decomposition (SVD) transforms an m × n matrix A of rank r to
diagonal form using unitary matrices:

A = U \Sigma V^* = [u_1 \dots u_r] \, \mathrm{diag}(\sigma_1, \dots, \sigma_r) \, [v_1 \dots v_r]^*    (9-1)

where the diagonal matrix \Sigma = \mathrm{diag}\{\sigma_1, \sigma_2, \dots, \sigma_r\}, with \sigma_1 > \sigma_2 > \dots > \sigma_r, contains
the singular values, and the unitary matrices U and V contain the left and right singular
vectors for \Sigma.
Given that V is unitary21, Eq. (9-1) has the following equivalence:

A = U \Sigma V^* \;\Leftrightarrow\; AV = U\Sigma \;\Leftrightarrow\; A v_r = \sigma_r u_r    (9-2)

This can be interpreted as the unit vectors of an orthogonal coordinate system
\{v_1, v_2, \dots, v_r\} being mapped under A onto a new "scaled" orthogonal coordinate system
\{\sigma_1 u_1, \sigma_2 u_2, \dots, \sigma_r u_r\}, as shown in Figure 9-1. The vector v_r, which corresponds to the
smallest singular value \sigma_r, gives the 1-dimensional sub-space onto which the data have a
minimal projection, which is the estimated normal vector.

21 A unitary matrix has the property that the conjugate transpose V* is equal to its inverse V^{-1}.
Figure 9-1 Geometric Interpretation of the SVD
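The equivalence in Eq. (9-2) can be checked numerically; the short sketch below (Python/NumPy, illustrative) verifies A v_i = σ_i u_i for each singular triplet and shows that the last right singular vector minimises the projection norm:

import numpy as np

rng = np.random.default_rng(9)
A = rng.normal(size=(5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eq. (9-2): A v_i = sigma_i u_i holds for every singular triplet.
for i in range(len(s)):
    assert np.allclose(A @ Vt[i], s[i] * U[:, i])

# The last right singular vector gives the direction of minimal projection:
# ||A v_r|| = sigma_r, the smallest over all unit vectors, which is why v_r
# serves as the estimated normal of near-planar data.
print(s, np.linalg.norm(A @ Vt[-1]))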
The cost function in the least squares algorithm can be varied as a weighted cost
function to increase the influence of the nearby points over the distant points. Pauly et al.
[147] assign different weights to the fitting errors based on the distance of the
neighbouring points to p. The authors implemented a new cost function with a Gaussian as
the weighting function, where h^2 is chosen as one third of the square distance between p
and its kth nearest neighbour:

e(n, c) = \sum_{i=1}^{k} \left( n^T (p_i - c) \right)^2 e^{-\lVert p_i - p \rVert^2 / h^2}    (9-3)
Other variations include maximizing the cost function which is defined as the
angle (minimize the inner product) between the normal vector and the tangential vectors
(Vector SVD) [52].
10.0 APPENDIX III: DERIVATION OF CONDITIONAL
POSTERIOR DISTRIBUTION OF PARAMETERS FOR IGMM
Conditional Posterior Distribution of the component means, µ
Each component mean is given a Gaussian prior:
P(\mu_i \mid \lambda, \gamma) \sim G(\lambda, \gamma)    ( 10-1 )
where the hyper-parameters22
λ and γ , are common to all components. The conditional
posterior distribution for the component means can be obtained by multiplying the
likelihood from Eq. ( 6-9 ), conditioned on the indicators, by the prior P(\mu_i \mid \lambda, \gamma), as
shown below:
P(\mu_i \mid c, x, \Sigma_i, \lambda, \gamma) \sim G\!\left( \frac{\bar{x}_i N_i \Sigma_i + \lambda \gamma}{N_i \Sigma_i + \gamma},\; \frac{1}{N_i \Sigma_i + \gamma} \right)    ( 10-2 )
22 Hyper-parameter is defined as a parameter of a prior distribution in Bayesian analysis. The term hyper-
parameter is used to distinguish the parameters of the prior distributions (Gaussian and Gamma distributions)
from parameters of the model for the underlying system (Gaussian distribution).
Appendix III
Derivation of Conditional Posterior
Distribution of Parameters for
IGMM
141
where
λ : hyper-parameter - mean of the Gaussian priors
γ : hyper-parameter - variance of the Gaussian priors
\bar{x}_i : mean of the observations belonging to class i
N_i : number of observations belonging to class i
The hyper-parameters can be a single value, or can be computed by taking a probability
distribution on the hyper-parameter itself, called a hyper-prior. The hierarchical structure
of the prior distributions is more robust, as the hyper-parameters can also be updated in the
iterations. The derivation of the conditional posterior distributions for the updating of the
hyper-parameters is shown below.
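For illustration, a one-dimensional sketch (Python/NumPy, with Σ_i treated as a precision; not the thesis code) of drawing a component mean from the conditional posterior ( 10-2 ) inside a Gibbs sweep:

import numpy as np

def sample_mu(x_bar, N_i, S_i, lam, gam, rng):
    # Eq. (10-2): posterior precision N_i * Sigma_i + gamma; the mean is a
    # precision-weighted average of the class mean and the hyper-mean lambda.
    prec = N_i * S_i + gam
    mean = (x_bar * N_i * S_i + lam * gam) / prec
    return rng.normal(mean, np.sqrt(1.0 / prec))

rng = np.random.default_rng(10)
# With 200 observations of class mean ~2, the posterior concentrates near 2
# regardless of the prior mean lambda = 0.
draws = [sample_mu(x_bar=2.0, N_i=200, S_i=4.0, lam=0.0, gam=1.0, rng=rng)
         for _ in range(1000)]
print(round(float(np.mean(draws)), 3))   # close to 2.0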
Conditional Posterior of the Hyper-parameters for the means
The choice of the prior density of the hyper-parameters for the component means is the
conjugate density (for mathematical convenience) of the Gaussian distribution: λ and γ
are given vague23
Gaussian and Gamma hyper-priors:
P(\lambda) \sim G(\mu_x, \sigma_x^2)    ( 10-3 )

P(\gamma) \sim Ga(1, \sigma_x^{-2}) \propto \gamma^{-1/2} \exp\!\left( -\frac{\gamma \sigma_x^2}{2} \right)    ( 10-4 )
where \mu_x and \sigma_x^2 are the mean and variance of the observations.
The conditional posteriors of the hyper-priors are the product of the hyper-priors and
\prod_{i=1}^{k_{rep}} P(\mu_i \mid \lambda, \gamma):
23 The shape parameter of the Gamma hyper-prior is set to unity as proposed in Rasmussen's formulation,
corresponding to a very broad (vague) distribution.
P(\lambda \mid \mu, \gamma) \sim G\!\left( \frac{\mu_x \sigma_x^{-2} + \gamma \sum_{i=1}^{k_{rep}} \mu_i}{\sigma_x^{-2} + k_{rep}\,\gamma},\; \frac{1}{\sigma_x^{-2} + k_{rep}\,\gamma} \right)    ( 10-5 )
P(\gamma \mid \mu, \lambda) \sim Ga\!\left( k_{rep} + 1,\; \left[ \frac{\sigma_x^2 + \sum_{i=1}^{k_{rep}} (\mu_i - \lambda)^2}{k_{rep} + 1} \right]^{-1} \right)    ( 10-6 )
where k_rep is the number of represented mixtures.
Conditional Posterior Distribution of the component precision, Σ
Each component variance is given a Gamma prior:
P(\Sigma_i \mid \beta, w) \sim Ga(\beta, w^{-1}) \propto \Sigma_i^{\beta/2 - 1} \exp\!\left( -\frac{\Sigma_i \beta w}{2} \right)    ( 10-7 )
where the hyper-parameters β and w, are common to all components. The conditional
posterior distribution for the component precisions can be obtained by multiplying the
likelihood from Eq. ( 6-9 ), conditioned on the indicators, by the prior P(\Sigma_i \mid \beta, w),
simplified to the form:
P(\Sigma_i \mid c, x, \mu_i, \beta, w) \sim Ga\!\left( \beta + N_i,\; \left[ \frac{\beta w + \sum_{j: c_j = i} (x_j - \mu_i)^2}{\beta + N_i} \right]^{-1} \right)    ( 10-8 )
where
β : hyper-parameter - shape of the Gamma priors
w^{-1} : hyper-parameter - mean of the Gamma priors
The derivation of the conditional posterior distributions for the updating of the hyper-
parameters is shown below.
Conditional Posterior of the Hyper-parameters for the precision
The choice of the prior density of the hyper-parameters for the component precision is the
conjugate density of the Gamma distribution: β and w are given inverse Gamma and
Gamma hyper-priors:
P(\beta^{-1}) \sim Ga(1, 1) \;\Rightarrow\; P(\beta) \propto \beta^{-3/2} \exp\!\left( -\frac{1}{2\beta} \right)    ( 10-9 )
P(w) \sim Ga(1, \sigma_x^{-2})    ( 10-10 )
The conditional posteriors of the hyper-priors are the product of the hyper-priors and
\prod_{i=1}^{k_{rep}} P(\Sigma_i \mid \beta, w^{-1}), simplified to the form below. P(\beta \mid \Sigma, w) is
not of the standard form of a simple probability function. Since P(\beta \mid \Sigma_1, \dots, \Sigma_k, w)
is log-concave, the independent samples can be generated using the Adaptive Rejection
Sampling (ARS) technique [148].
P(w \mid \Sigma, \beta) \sim Ga\!\left( k_{rep}\,\beta + 1,\; \left[ \frac{\sigma_x^{-2} + \beta \sum_{i=1}^{k_{rep}} \Sigma_i}{k_{rep}\,\beta + 1} \right]^{-1} \right)    ( 10-11 )
P(\beta \mid \Sigma, w) \propto \Gamma\!\left( \frac{\beta}{2} \right)^{-k_{rep}} \exp\!\left( -\frac{1}{2\beta} \right) \left( \frac{\beta}{2} \right)^{(k_{rep}\,\beta - 3)/2} \prod_{i=1}^{k_{rep}} (\Sigma_i w)^{\beta/2} \exp\!\left( -\frac{\beta w \Sigma_i}{2} \right)    ( 10-12 )
Conditional Posterior Distribution of the mixing weights, π
The mixing weights are given symmetric Dirichlet priors with concentration parameter α/k:

P(\pi_1, \dots, \pi_k \mid \alpha) \sim Dirichlet(\alpha/k, \dots, \alpha/k) = \frac{\Gamma(\alpha)}{\Gamma(\alpha/k)^k} \prod_{i=1}^{k} \pi_i^{\alpha/k - 1}
Given the indicator, ci, that encodes the class to which the observation belongs and the
occupation number, Ni, which carries the number of observations associated with
component i, the inference of the mixing weights can be indirectly realised through the
inference of the indicators. With the standard Dirichlet integral, the priors for the
indicators depend only on α [134].
P(c_1, \dots, c_N \mid \alpha) = \int P(c_1, \dots, c_N \mid \pi_1, \dots, \pi_k)\, P(\pi_1, \dots, \pi_k)\, d\pi_1 \dots d\pi_k    ( 10-13 )

= \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)} \prod_{i=1}^{k} \frac{\Gamma(N_i + \alpha/k)}{\Gamma(\alpha/k)}    ( 10-14 )
\Gamma(\cdot) is the standard gamma function.
The conditional prior for a single indicator, with a finite number of mixtures k, given all the
others is then:

P(c_j = i \mid c_{-j}, \alpha) = \frac{N_{-j,i} + \alpha/k}{N - 1 + \alpha}    ( 10-15 )
where −j indicates all indexes except j and N_{−j,i} indicates the number of observations,
excluding x_j, that belong to mixture i.
Since the number of mixtures in the IGMM is infinite, the conditional prior in Eq. ( 10-15 )
is given as below as k tends to infinity:

P(c_j = i \mid c_{-j}, \alpha) = \begin{cases} \dfrac{N_{-j,i}}{N - 1 + \alpha} & \text{if } i \text{ is represented} \\ \dfrac{\alpha}{N - 1 + \alpha} & \text{if } i \text{ is unrepresented} \end{cases}    ( 10-16 )
Conditional Posterior of the hyper-parameters for the mixing weight
The choice of the prior density of the hyper-parameters for the mixing weight is an inverse
Gamma prior:
P(\alpha^{-1}) \sim Ga(1, 1)    ( 10-17 )
The conditional posterior of the hyper-prior, given k_rep and the number of data points, N, is
log-concave and can be sampled using the ARS:

P(\alpha \mid k_{rep}, N) \sim \frac{\alpha^{k_{rep} - 3/2} \exp\!\left( -1/(2\alpha) \right) \Gamma(\alpha)}{\Gamma(N + \alpha)}    ( 10-18 )
REFERENCES
[1] C. Fuchs, E. Gülch, and W. Förstner, "OEEPE Survey on 3D-city models," OEEPE
Publication, No. 35, Bundesamt für Kartographie und Geodäsie, Frankfurt, 1998.
[2] M. Batty, D. Chapman, S. Evans, M. Haklay, S. Kueppers, N. Shiode, A. Smith,
and P. M. Torrens, "Visualizing the City: Communicating Urban Design to
Planners and Decision Makers," CASA Working Paper Series, 2000.
[3] N. Shiode, "3D urban models: recent developments in the digital modelling of
urban environments in three-dimensions," GeoJournal, vol. 52, pp. 263-9, 2000.
[4] C. Braun, T. H. Kolbe, F. Lang, W. Schickler, V. Steinhage, A. B. Cremers, W.
Foerstner, and L. Pluemer, "Models for photogrammetric building reconstruction,"
Computers & Graphics (Pergamon), vol. 19, p. 109, 1995.
[5] D. Stevens, W. M. McKay, and D. Fowler, "Combined use of photogrammetry and
CAD in the reconstruction of fire damaged buildings," Proceedings of SPIE - The
International Society for Optical Engineering, Bellingham, WA, USA, pp. 77-85,
1990.
[6] K. Schindler and J. Bauer, "A model-based method for building reconstruction,"
Proceedings First IEEE International Workshop on Higher-Level Knowledge in
3D Modeling and Motion Analysis (HLK 2003), pp. 74-82, 2003.
[7] J. L. Davidson, "Stereo photogrammetry in geotechnical engineering research,"
Photogrammetric Engineering and Remote Sensing, vol. 51, pp. 1589-1596, 1985.
[8] E. P. Baltsavias, "A comparison between photogrammetry and laser scanning,"
ISPRS Journal of Photogrammetry and Remote Sensing, vol. 54, pp. 83-94, 1999.
[9] F. Ackermann, "Airborne laser scanning - Present status and future expectations,"
ISPRS Journal of Photogrammetry and Remote Sensing, vol. 54, pp. 64-67, 1999.
[10] H. Yoon and K. Park, "Development of a laser range finder using the phase
difference method," Proceedings of SPIE - The International Society for Optical
Engineering, Sapporo, Japan, 2005.
[11] G. Liu, Y. Wang, and G. Liu, "Design and simulation of a mixer and phase
difference measuring circuitry for laser range finding systems," Proceedings of
SPIE - The International Society for Optical Engineering, Xinjiang, China, 2006.
[12] J. Skaloud and J. Vallet, "High Accuracy Handheld Mapping System for Fast
Helicopter Deployment," In Joint International Symposium on Geospatial Theory,
Processing and Applications ISPRS Comm. IV, p. 6, 2002.
[13] C. Herley, "Automatic occlusion removal from minimum number of images," 2005
International Conference on Image Processing, Genova, Italy, pp. 1046-9, 2006.
[14] C. Moenning and N. A. Dodgson, "A new point cloud simplification algorithm,"
3rd IASTED International Conference on Visualization, Imaging, and Image
Processing (VIIP 2003) 8-10 Sep pp. 1027-1033, 2003.
[15] M. Alexa, J. Behr, D. Cohen-Or, S. Fleishman, D. Levin, and T. Silva, "Point Set
Surfaces," Proc. 12th IEEE Visualization Conf., San Diego, USA, p. 6, 2001.
[16] R. T. Whitaker and E. L. Juarez-Valdes, "On the reconstruction of height functions
and terrain maps from dense range data," IEEE Transactions on Image Processing,
vol. 11, pp. 704-16, 2002.
[17] H. T. Tanaka and F. Kishino, "Adaptive sampling and reconstruction for
discontinuity preserving texture-mapped triangulation," IEEE CAD-Based Vision
Workshop - Proceedings, Champion, PA, USA, pp. 298-303, 1994.
[18] Y. Yemez and C. J. Wetherilt, "A volumetric fusion technique for surface
reconstruction from silhouettes and range data," Computer Vision and Image
Understanding, vol. 105, pp. 30-41, 2007.
[19] J. Hu, S. You, and U. Neumann, "Approaches to large-scale urban modeling,"
Computer Graphics and Applications, IEEE pp. 62 - 69 2003.
[20] Y. Takase, N. Sho, A. Sone, and K. Shimiya, "Automatic generation of 3d city
models and related applications," International Archives of Photogrammetry,
Remote Sensing and Spatial Information Sciences, pp. 113--120, 2003.
[21] P. Krishnamoorthy, K. L. Boyer, and P. J. Flynn, "Robust detection of buildings in
digital surface models," Proceedings 16th International Conference on Pattern
Recognition, Quebec City, Que., Canada, pp. 159-63, 2002.
[22] F. Rottensteiner, "Automatic generation of high-quality building models from lidar
data," IEEE Computer Graphics and Applications, vol. 23, pp. 42-50, 2003.
[23] L. Matikainen, J. Hyyppa, and H. Hyyppä, "Automatic detection of buildings from
laser scanner data for map updating," ISPRS Commission III. Workshop 3-d
reconstruction from airborne laserscanner and InSAR data, 2003.
[24] G. Vosselman, "Fusion of laser scanning data, maps, and aerial photographs for
building reconstruction," International Geoscience and Remote Sensing
Symposium (IGARSS), Toronto, Ont., Canada, pp. 85-88, 2002.
[25] N. Haala and C. Brenner, "Extraction of buildings and trees in urban
environments," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 54,
pp. 130-137, 1999.
[26] X. He, R. S. Zemel, and M. A. Carreira-Perpinan, "Multiscale conditional random
fields for image labeling," Proceedings of the IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, Washington, DC, United States, pp.
695-702, 2004.
[27] F. Tsai and H. C. Lin, "Polygon-based texture mapping for cyber city 3D building
models," International Journal of Geographical Information Science, vol. 21, pp.
965-981, 2007.
[28] Y. Zhang, Z. Zhang, J. Zhano, and W. U. Jun, "3D building modelling with digital
map, lidar data and video image sequences," Photogrammetric Record, pp. 285-
302, 2005.
[29] C. I. Connolly, "The Determination of Next Best Views," Proceedings of the IEEE
International Conference on Robotics and Automation 1985.
[30] K. Ulm, "City models from aerial imagery – Integrating images and the
landscape," GEOInformatics, January/February, pp. 18-21, 2005.
[31] B. Hongqiang and Z. Zhaoyang, "Identification of occlusion regions based on
background rebuilding for automatic video object segmentation," Proc. SPIE - Int.
Soc. Opt. Eng. (USA), Beijing, China, pp. 883-6, 2003.
[32] T. Li, W. Chengke, L. Shigang, and Y. Yaoping, "Complete structure recovery
from long image sequence with occlusions," Proc. SPIE - Int. Soc. Opt. Eng.
(USA), Beijing, China, pp. 529-34, 2003.
[33] H. Wang and D. Suter, "A novel robust statistical method for background
initialization and visual surveillance," Lecture Notes in Computer Science
(including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), Hyderabad, India, pp. 328-337, 2006.
[34] Y. Wang and Q. Ji, "A dynamic conditional random field model for object
segmentation in image sequences," Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, San Diego, CA, United
States, pp. 264-270, 2005.
[35] M. Hoynck and J. R. Ohm, "Shape retrieval with robustness against partial
occlusion," 2003 IEEE International Conference on Acoustics, Speech, and Signal
Processing (Cat. No.03CH37404), Hong Kong, China, pp. 593-6, 2003.
[36] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection,"
Journal of Machine Learning Research, vol. 3, pp. 1157-82, 2003.
[37] D. Anguelov, B. Taskar, V. Chatalbashev, D. Koller, D. Gupta, G. Heitz, and A.
Ng, "Discriminative learning of Markov random fields for segmentation of 3D
scan data," Proceedings - 2005 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, CVPR 2005, San Diego, CA, United States, pp.
169-176, 2005.
[38] A. E. Johnson and M. Hebert, "Using spin images for efficient object recognition
in cluttered 3D scenes," IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 21, pp. 433-449, 1999.
[39] S. Matzka, Y. R. Petillot, and A. M. Wallace, "Determining efficient scan-patterns
for 3-D object recognition using spin images," Lecture Notes in Computer Science
(including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), Heidelberg, D-69121, Germany, pp. 559-570, 2007.
[40] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, "Matching 3D models with
shape distributions," Proceedings International Conference on Shape Modeling and
Applications, Los Alamitos, CA, USA, pp. 154-66, 2001.
[41] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin, "Shape distributions," ACM
Transactions on Graphics, vol. 21, pp. 807-832, 2002.
[42] A. S. Mian, M. Bennamoun, and R. Owens, "Matching tensors for automatic
correspondence and registration," Computer Vision - ECCV 2004. 8th European
Conference on Computer Vision. Proceedings (Lecture Notes in Comput. Sci.
Vol.3022), Berlin, Germany, pp. 495-505, 2004.
[43] W. Zhaohui, W. Yueming, and P. Gang, "3D face recognition using local shape
map," 2004 International Conference on Image Processing (ICIP) (IEEE Cat.
No.04CH37580), Piscataway, NJ, USA, pp. 2003-6, 2004.
[44] T. W. Way, H.-P. Chan, M. M. Goodsitt, B. Sahiner, L. M. Hadjiiski, C. Zhou, and
A. Chughtai, "Effect of CT scanning parameters on volumetric measurements of
pulmonary nodules by 3D active contour segmentation: A phantom study," Physics
in Medicine and Biology, vol. 53, pp. 1295-1312, 2008.
[45] D. D. Lichti, "Spectral filtering and classification of terrestrial laser scanner point
clouds," Photogrammetric Record, vol. 20, pp. 218-240, 2005.
[46] R. Triebel, K. Kersting, and W. Burgard, "Robust 3D scan point classification
using associative Markov networks," Proceedings - IEEE International Conference
on Robotics and Automation, Orlando, FL, United States, pp. 2603-2608, 2006.
[47] O. Tuzel, F. Porikli, and P. Meer, "Region covariance: A fast descriptor for
detection and classification," Lecture Notes in Computer Science (including
subseries Lecture Notes in Artificial Intelligence and Lecture Notes in
Bioinformatics), Graz, Austria, pp. 589-600, 2006.
[48] M. E. Wall, A. Rechtsteiner, and L. M. Rocha, "Singular value decomposition and
principal component analysis," A Practical Approach to Microarray Data Analysis,
Kluwer: Norwell, MA, pp. 91-109, 2003.
[49] R. Unnikrishnan and M. Hebert, "Robust extraction of multiple structures from
non-uniformly sampled data," IROS 2003, Las Vegas, NV, USA, pp. 1322-9, 2003.
[50] I. Stamos and P. K. Allen, "Integration of range and image sensing for photo-
realistic 3D modeling," Proceedings 2000 ICRA. Millennium Conference. , San
Francisco, CA, USA, pp. 1435-40, 2000.
[51] J. F. Lalonde, R. Unnikrishnan, N. Vandapel, and M. Hebert, "Scale selection for
classification of point-sampled 3D surfaces," Proceedings. Fifth International
Conference on 3-D Digital Imaging and Modeling, Ottawa, Ont., Canada, pp. 285-
92, 2005.
[52] K. Klasing, D. Althoff, D. Wollherr, and M. Buss, "Comparison of surface normal
estimation methods for range sensing applications," 2009 IEEE International
Conference on Robotics and Automation (ICRA), Piscataway, NJ, USA, pp. 3206-
11, 2009.
[53] J. Shuangshuang, R. R. Lewis, and D. West, "A comparison of algorithms for
vertex normal computation," Visual Computer, vol. 21, pp. 71-82, 2005.
[54] N. Amenta and M. Bern, "Surface reconstruction by Voronoi filtering,"
Proceedings of the Fourteenth Annual Symposium on Computational Geometry,
New York, NY, USA, pp. 39-48, 1998.
[55] T. K. Dey, G. Li, and J. Sun, "Normal estimation for point clouds: A comparison
study for a Voronoi based method," Point-Based Graphics, 2005 -
Eurographics/IEEE VGTC Symposium Proceedings, Stony Brook, NY, United
states, pp. 39-46, 2005.
[56] H. Hoppe, T. DeRose, T. Duchamp, J. McDonald, and W. Stuetzle, "Surface
reconstruction from unorganized points," Comput. Graph. (USA), Chicago, IL,
USA, pp. 71-8, 1992.
[57] J.-F. Lalonde, N. Vandapel, D. F. Huber, and M. Hebert, "Natural Terrain
Classification using Three-Dimensional Ladar Data for Ground Robot Mobility,"
Journal of Field Robotics, 23(10):839--861, October 2006.
[58] D. F. Wolf, G. S. Sukhatme, D. Fox, and W. Burgard, "Autonomous Terrain
Mapping and Classification Using Hidden Markov Models," in (ICRA)Proc. of the
IEEE International Conference on Robotics and Automation, pp. 2026-2031, 2005.
[59] J.-H. Xue and D. M. Titterington, "Comment on "on discriminative vs. generative
classifiers: A comparison of logistic regression and naive bayes"," Neural
Processing Letters, vol. 28, pp. 169-187, 2008.
[60] I. Ulusoy and C. M. Bishop, "Comparison of generative and discriminative
techniques for object detection and classification," in Toward Category-Level
Object Recognition Berlin, Germany: Springer, pp. 173-95, 2006.
[61] P. F. Felzenszwalb and D. P. Huttenlocher, "Efficient graph-based image
segmentation," International Journal of Computer Vision, vol. 59, pp. 167-181,
2004.
[62] C. Zhuo, F. Y. L. Chin, and R. H. Y. Chung, "Automated Hierarchical Image
Segmentation Based on Merging of Quadrilaterals," WSEAS Transactions on
Signal Processing, 2 (8),1063-1068, 2006.
[63] H. Xuming, R. S. Zemel, and D. Ray, "Learning and incorporating top-down cues
in image segmentation," Computer Vision - ECCV 2006. 9th European Conference
on Computer Vision. Proceedings, Part I (Lecture Notes in Computer Science Vol.
3951), Graz, Austria, pp. 338-51, 2006.
[64] R. de Luis-Garcia, R. Deriche, and C. Alberola-Lopez, "Texture and color
segmentation based on the combined use of the structure tensor and the image
components," Signal Processing, vol. 88, pp. 776-95, 2008.
[65] H. Permuter, J. Francos, and I. Jermyn, "A study of Gaussian mixture models of
color and texture features for image classification and segmentation," Pattern
Recognition, vol. 39, pp. 695-706, 2006.
[66] J. D. Boissonnat and F. Cazals, "Coarse-to-fine surface simplification with
geometric guarantees," Computer Graphics Forum, vol. 20, pp. 490-499, 2001.
[67] A. Frome, D. Huber, R. Kolluri, T. Bulow, and J. Malik, "Recognizing objects in
range data using regional point descriptors," Proceedings of the European
Conference on Computer Vision (ECCV), May, 2004.
[68] K. T. Abou-Moustafa and P. P. Ferrie, "The minimum volume ellipsoid metric," in
Pattern Recognition. Proceedings 29th DAGM Symposium. (Lecture Notes in
Computer Science vol. 4713)Berlin, Germany, pp. 335-44, 2007.
[69] J. Ming-Yi, L. Jing-Sin, S. Shen-Po, C. Yuh-Ren, H. Kao-Shing, and L. Wan-Chi,
"Fast and accurate collision detection based on enclosed ellipsoid," Robotica, vol.
19, pp. 381-94, 2001.
[70] E. Rimon and S. P. Boyd, "Obstacle collision detection using best ellipsoid fit,"
Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 18, pp.
105-126, 1997.
[71] T. Lindeberg, "Feature detection with automatic scale selection," International
Journal of Computer Vision, vol. 30, pp. 79-116, 1998.
[72] N. J. Mitra, A. Nguyen, and L. J. Guibas, "Estimating surface normals in noisy
point cloud data," Int. J. Comput. Geometry Appl. 14, vol. (4-5), pp. 261-276, 2004.
[73] E. H. Lim and D. Suter, "Conditional Random Field for 3D Point Clouds with
Adaptive Data Reduction," in New Advances in Shape Analysis and Geometric
Modeling (NASAGEM) Workshop held in conjunction with the International
Conference on Cyberworlds, 2007, pp. 404-408, 2007.
[74] E. H. Lim and D. Suter, "Multi-scale Conditional Random Fields for over-
segmented irregular 3D point clouds classification," in Computer Vision and
Pattern Recognition Workshops, 2008. IEEE Computer Society Conference on
CVPR Workshops., pp. 1-7, 2008.
[75] S. Gumhold, X. Wang, and R. MacLeod, "Feature extraction from point clouds,"
International Meshing Roundtable, Sandia National Laboratories, October 2001.
[76] D. H. Wolpert and W. G. Macready, "No free lunch theorems for optimization,"
IEEE Transactions on Evolutionary Computation, vol. 1, pp. 67-82, 1997.
[77] G. Agre and S. Peev, "On supervised and unsupervised discretization," Cybernetics
and Information Technologies, pp. 43-57, 2002.
[78] R. Caruana and A. Niculescu-Mizil, "An empirical comparison of supervised
learning algorithms," ACM International Conference Proceeding Series, New York,
NY 10036-5701, United States, pp. 161-168, 2006.
[79] S. Ray and M. Craven, "Supervised versus multiple instance learning: An
empirical comparison," ICML 2005 - Proceedings of the 22nd International
Conference on Machine Learning, New York, NY 10036-5701, United States, pp.
697-704, 2005.
[80] P. B. Brazdil, J. Gama, and B. Henery, "Characterizing the applicability of
classification algorithms using meta-level learning," in ECCV vol 784, pp. 83-102,
1994.
[81] C. M. V. D. Walt, "Data Measures that characterise Classification Problems,"
Master of Engineering, thesis, in Faculty of Engineering, the Built Environment
and Information Technology University of Pretoria, 2008
[82] C. Barat, H. Loaiza, E. Colle, and S. Lelandais, "Neural and statistical classifiers.
Can such approaches be complementary?," in Proceedings of the 17th IEEE
Instrumentation and Measurement Technology Conference (IMTC 2000), pp.
1480-1486, 2000.
[83] J. Lafferty, A. McCallum, and F. Pereira, "Conditional Random Fields:
Probabilistic Models for Segmenting and Labeling Sequence Data," in Proceedings
of the 18th International Conference on Machine Learning (ICML), pp. 282-289,
2001.
[84] T. Mitchell, Machine Learning: McGraw Hill, 1997.
[85] G. Bouchard and B. Triggs, "The trade-off between generative and discriminative
classifiers," COMPSTAT'2004 Symposium, 2004.
[86] S. Kumar, "Models for Learning Spatial Interactions in Natural Images for
Context-Based Classification," PhD thesis, The Robotics Institute, Carnegie
Mellon University, 2005.
[87] M. I. Jordan, Ed., Learning in Graphical Models: MIT Press, 1999.
[88] V. Kolmogorov and R. Zabih, "What Energy Functions Can Be Minimized via
Graph Cuts?," IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 26, pp. 147-159, 2004.
[89] W. Li and A. McCallum, "Rapid development of Hindi named entity recognition
using conditional random fields and feature induction," ACM Transactions on
Asian Language Information Processing, vol. 2, pp. 290-294, 2003.
[90] S. Kumar and M. Hebert, "Discriminative random fields: a discriminative
framework for contextual interaction in classification," in Proceedings of the Ninth
IEEE International Conference on Computer Vision (ICCV), Nice, France, pp.
1150-1157, 2003.
[91] D. H. Tran, T. H. Pham, K. Satou, and T. B. Ho, "Conditional random fields for
predicting and analyzing histone occupancy, acetylation and methylation areas in
DNA sequences," in Lecture Notes in Computer Science, Budapest, Hungary, pp.
221-230, 2006.
[92] D. Pinto, A. McCallum, X. Wei, and W. B. Croft, "Table Extraction Using
Conditional Random Fields," in Proceedings of the ACM SIGIR Conference,
Toronto, Canada, pp. 235-242, 2003.
[93] Y. Qi, M. Szummer, and T. P. Minka, "Diagram structure recognition by Bayesian
conditional random fields," in Proceedings of CVPR 2005, San Diego, CA, USA,
pp. 191-196, 2005.
[94] H. Andreasson, R. Triebel, and W. Burgard, "Improving plane extraction from 3D
data by fusing laser data and vision," in 2005 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS), Edmonton, Canada, pp. 2656-2661, 2005.
[95] S. F. Chen and R. Rosenfeld, "A Gaussian prior for smoothing maximum entropy
models," Technical Report CMU-CS-99-108, Carnegie Mellon University, 1999.
[96] R. C. Bolles and M. A. Fischler, "A RANSAC-Based Approach to Model Fitting
and Its Application to Finding Cylinders in Range Data," in Proceedings of the 7th
International Joint Conference on Artificial Intelligence (IJCAI-81), pp. 637-643,
1981.
[97] E. H. Lim and D. Suter, "Classification of 3D LIDAR point clouds for urban
modelling," in Image and Vision Computing New Zealand, pp. 149-154, Nov.
2006.
[98] W. Von Hansen, E. Michaelsen, and U. Thonnessen, "Cluster analysis and priority
sorting in huge point clouds for building reconstruction," in Proceedings of the
International Conference on Pattern Recognition (ICPR), Hong Kong, China, pp.
23-26, 2006.
[99] S. Rusinkiewicz and M. Levoy, "QSplat: a multiresolution point rendering system
for large meshes," in Proceedings of SIGGRAPH 2000, New York, NY, USA, pp.
343-352, 2000.
[100] H. Wang, "Robust Statistics for Computer Vision: Model Fitting, Image
Segmentation and Visual Analysis," PhD thesis, Department of Electrical and
Computer Systems Engineering, Monash University, 2004.
[101] S. M. Stigler, "Mathematical Statistics in the Early States," The Annals of
Statistics, vol. 6, pp. 239-265, 1978.
[102] P. J. Rousseeuw, "Least median of squares regression," Journal of the American
Statistical Association, vol. 79, pp. 871-880, 1984.
[103] B. Walczak, M. Daszykowski, K. Kaczmarek, and Y. Vander Heyden, "Robust
statistics in data analysis - A review," Chemometrics and Intelligent Laboratory
Systems, vol. 85, pp. 203-219, 2007.
[104] M. Zuliani, C. S. Kenney, and B. S. Manjunath, "The multiRANSAC algorithm
and its application to detect planar homographies," in Proceedings of the
International Conference on Image Processing (ICIP), pp. 153-156, 2005.
[105] A. Sarti and S. Tubaro, "Detection and characterisation of planar fractures using a
3D Hough transform," Signal Processing, vol. 82, pp. 1269-1282, 2002.
[106] Y. Ding, X. Ping, M. Hu, and D. Wang, "Range image segmentation based on
randomized Hough transform," Pattern Recognition Letters, vol. 26, pp.
2033-2041, 2005.
[107] C. V. Stewart, "Bias in robust estimation caused by discontinuities and multiple
structures," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.
19, pp. 818-833, 1997.
[108] R. Hesami, A. Bab-Hadiashar, and R. Hoseinnezhad, "Range segmentation of
large building exteriors: A hierarchical robust approach," Computer Vision and
Image Understanding, vol. 114, pp. 475-490, 2010.
[109] A. R. Ferreira da Silva, "A Dirichlet process mixture model for brain MRI tissue
classification," Medical Image Analysis, vol. 11, pp. 169-182, 2007.
[110] B. Thirion, A. Tucholka, M. Keller, P. Pinel, A. Roche, J. F. Mangin, and J. B.
Poline, "High level group analysis of fMRI data based on Dirichlet process mixture
models," in Information Processing in Medical Imaging, Kerkrade, Netherlands,
pp. 482-494, 2007.
[111] C. Rasmussen, B. de la Cruz, Z. Ghahramani, and D. Wild, "Modeling and
Visualizing Uncertainty in Gene Expression Clusters using Dirichlet Process
Mixtures," IEEE/ACM Transactions on Computational Biology and Bioinformatics,
2007.
[112] C. Zhang, S. Zhu, and Y. Gong, "Trend Analysis for Large Document Streams,"
in 5th International Conference on Machine Learning and Applications (ICMLA
'06), pp. 285-295, 2006.
[113] Y.-D. Jian and C.-S. Chen, "Two-view motion segmentation by mixtures of
Dirichlet process with model selection and outlier removal," in 2007 11th IEEE
International Conference on Computer Vision (ICCV), Rio de Janeiro, Brazil, pp.
1060-1067, 2007.
[114] Y.-D. Jian and C.-S. Chen, "Two-view motion segmentation by mixtures of
Dirichlet process with model selection and outlier removal," in Proceedings of the
IEEE International Conference on Computer Vision (ICCV), Rio de Janeiro,
Brazil, 2007.
[115] P. Kultanen, L. Xu, and E. Oja, "Randomized Hough transform (RHT)," in
Proceedings of the International Conference on Pattern Recognition (ICPR), pp.
631-635, 1990.
[116] P. H. S. Torr and A. Zisserman, "MLESAC: a new robust estimator with
application to estimating image geometry," Computer Vision and Image
Understanding, vol. 78, pp. 138-156, 2000.
[117] H. Wang, D. Mirota, M. Ishii, and G. D. Hager, "Robust motion estimation and
structure recovery from endoscopic image sequences with an adaptive scale kernel
consensus estimator," in 2008 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), Anchorage, AK, USA, 2008.
[118] A. Konouchine, V. Gaganov, and V. Vezhnevets, "AMLESAC: A New Maximum
Likelihood Robust Estimator," Graphicon-2005, Novosibirsk, Akademgorodok,
2005.
[119] H. Wang and D. Suter, "MDPE: a very robust estimator for model fitting and
range image segmentation," International Journal of Computer Vision, vol. 59, pp.
139-166, 2004.
[120] L. Fan and T. Pylvanainen, "Robust Scale Estimation from Ensemble Inlier Sets
for Random Sample Consensus Methods," ECCV, 2008.
[121] R. Toldo and A. Fusiello, "Robust Multiple Structures Estimation with J-Linkage,"
ECCV, 2008.
[122] H. Wang and D. Suter, "Robust adaptive-scale parametric model estimation for
computer vision," IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 26, pp. 1459-1474, 2004.
[123] J. V. Moreau and A. K. Jain, "How many clusters?," in Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR '86), Miami Beach, FL, USA, pp. 634-636, 1986.
[124] R. C. Dubes, "How many clusters are best? - an experiment," Pattern Recognition,
vol. 20, pp. 645-663, 1987.
[125] C. Fraley and A. E. Raftery, "How many clusters? Which clustering method?
Answers via model-based cluster analysis," Computer Journal, vol. 41, pp.
578-588, 1998.
[126] S. Still and W. Bialek, "How many clusters? An information-theoretic
perspective," Neural Computation, vol. 16, pp. 2483-2506, 2004.
[127] C. E. Rasmussen, "The Infinite Gaussian Mixture Model," Advances in Neural
Information Processing Systems 12, pp. 554-560, 2000.
[128] H. Frigui and R. Krishnapuram, "Robust competitive clustering algorithm with
applications in computer vision," IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 21, pp. 450-465, 1999.
[129] D. A. Forsyth and J. Ponce, Computer Vision: A Modern Approach: Pearson
Education International, 2003.
[130] T. Ishioka, "Extended k-means with an efficient estimation of the number of
clusters," in Intelligent Data Engineering and Automated Learning - IDEAL 2000
(Lecture Notes in Computer Science vol. 1983), Hong Kong, China, pp. 17-22,
2000.
[131] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms:
Plenum, NY, 1981.
[132] J. M. Biosca and J. L. Lerma, "Unsupervised robust planar segmentation of
terrestrial laser scanner point clouds based on fuzzy clustering methods," ISPRS
Journal of Photogrammetry and Remote Sensing, vol. 63, pp. 84-98, 2008.
[133] W. Von Hansen, E. Michaelsen, and U. Thonnessen, "Cluster analysis and priority
sorting in huge point clouds for building reconstruction," in Proceedings of the
International Conference on Pattern Recognition (ICPR), Hong Kong, China, pp.
23-26, 2006.
[134] C. E. Rasmussen, "The Infinite Gaussian Mixture Model," in Advances in Neural
Information Processing Systems 12, S. A. Solla, T. K. Leen, and K. R. Müller,
Eds., MIT Press, pp. 554-560, 2000.
[135] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from
Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society,
Series B (Methodological), vol. 39, pp. 1-38, 1977.
[136] S. Geman and D. Geman, "Stochastic Relaxation, Gibbs Distributions, and the
Bayesian Restoration of Images," IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. PAMI-6, pp. 721-741, 1984.
[137] G. Biegelbauer and M. Vincze, "Efficient 3D object detection by fitting
superquadrics to range image data for robot's object manipulation," in Proceedings
of the IEEE International Conference on Robotics and Automation (ICRA), Rome,
Italy, pp. 1086-1091, 2007.
[138] E. H. Lim and D. Suter, "Occlusion removal in image for 3D urban modelling," in
Image and Vision Computing New Zealand, pp. 191-196, 2006.
[139] E. H. Lim and D. Suter, "3D terrestrial LIDAR classifications with super-voxels
and multi-scale Conditional Random Fields," Computer-Aided Design, 2009.
[140] T. Haithcoat, W. Song, and J. Hipple, "Automated Building Extraction and
Reconstruction from LIDAR Data," R&D Program for NASA/ICREST Studies
Project Report, 2004.
[141] R. A. Norheim, V. R. Queija, and R. A. Haugerud, "Comparison of LIDAR and
INSAR DEMs with dense ground control," Proceedings, Environmental Systems
Research Institute 2002 User Conference, 2003.
[142] P. Gamba, F. Dell’Acqua, and B. Houshmand, "Comparison and Fusion of LiDAR
and InSAR Digital Elevation Models Over Urban Areas," International Journal of
Remote Sensing, vol. 24, no. 22, pp. 4289-4304, 2003.
[143] I. Stamos and P. K. Allen, "3-D model construction using range and image data,"
in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR 2000), Hilton Head Island, SC, USA, pp. 531-536, 2000.
[144] D. Munoz, N. Vandapel, and M. Hebert, "Directional Associative Markov
Network for 3-D Point Cloud Classification," Fourth International Symposium on
3D Data Processing, Visualization and Transmission, 2008.
[145] C. Fruh and A. Zakhor, "An Automated Method for Large-Scale, Ground-Based
City Model Acquisition," International Journal of Computer Vision, vol. 60, pp.
5-24, 2004.
[146] T. Asai, M. Kanbara, and N. Yokoya, "3D Modeling of Outdoor Scenes by
Integrating Stop-And-Go and Continuous Scanning of Rangefinder," in Proceedings
of the ISPRS Working Group V/4 Workshop 3D-ARCH 2005: "Virtual
Reconstruction and Visualization of Complex Architectures", Mestre-Venice, Italy,
2005.
[147] M. Pauly, R. Keiser, L. P. Kobbelt, and M. Gross, "Shape modeling with point-
sampled geometry," ACM Transactions on Graphics, vol. 22, pp. 641-650, 2003.
[148] W. R. Gilks and P. Wild, "Adaptive rejection sampling for Gibbs sampling,"
Applied Statistics, vol. 41, pp. 337-348, 1992.