Mapping of Indoor Environments
by Robots using Low-cost Vision Sensors
Trevor Taylor B.E. (Hons Chem), M.S.Ch.E., M.S.E.E.
Submitted in fulfilment of the requirements for the degree of
Doctor of Philosophy
Faculty of Science and Technology
Queensland University of Technology
2009
Dedicated to my sister, Lyndal, who lost her battle with cancer before I could finish
this thesis. Thanks for all the encouragement. I’ll miss you sis.
Mapping of Indoor Environments by Robots using Low-Cost Vision Sensors Trevor Taylor
KEYWORDS
Computer Vision; Simultaneous Localization and Mapping (SLAM); Concurrent
Mapping and Localization (CML); Mobile Robots
ABSTRACT
For robots to operate in human environments they must be able to make their own
maps: it is unrealistic to expect a user to enter a map into the robot’s
memory, existing floorplans are often incorrect, and human environments tend to
change. Traditionally, robots have used sonar, infra-red or laser range finders to
perform the mapping task. Digital cameras have become very cheap in recent years,
opening up new possibilities for robot perception. Any robot that must interact
with humans can reasonably be expected to have a camera for tasks such as face
recognition, so it makes sense to also use the camera for navigation. Cameras have
advantages over other sensors: they provide colour information (not available from
any other sensor), have better immunity to noise than sonar, and are not restricted
to operating in a single plane like laser range finders. However, there are also
disadvantages, the principal one being the effect of perspective.
This research investigated ways to use a single colour camera as a range sensor to
guide an autonomous robot and allow it to build a map of its environment, a process
referred to as Simultaneous Localization and Mapping (SLAM). An experimental
system was built using a robot controlled via a wireless network connection. Using
the on-board camera as the only sensor, the robot successfully explored and mapped
indoor office environments. The quality of the resulting maps is comparable to that
of maps reported in the literature for sonar or infra-red sensors. Although the
maps are not as accurate as ones created with a laser range finder, the solution using
a camera is significantly cheaper and is more appropriate for toys and early domestic
robots.
iv Faculty of Science and Technology QUT
TABLE OF CONTENTS
Introduction 1
1.1 Problem Statement ......................................................................................... 1
1.2 Aim ................................................................................................................ 2
1.3 Purpose........................................................................................................... 2
1.4 Scope.............................................................................................................. 3
1.5 Original Contributions ................................................................................... 4
1.6 Overview of the Thesis .................................................................................. 9
Background and Literature Review 13
2.1 Background .................................................................................................. 13
2.1.1 Brief history of Vision-based Robots ............................................ 14
2.1.2 Robot Motion................................................................................. 15
2.2 Computer Vision.......................................................................................... 18
2.2.1 Stereo versus Monocular Vision.................................................... 18
2.2.2 Digital Cameras ............................................................................. 18
2.2.3 Pinhole Camera Approximation .................................................... 20
2.2.4 Focal Length .................................................................................. 21
2.2.5 Perspective..................................................................................... 22
2.2.6 Range Estimation........................................................................... 22
2.2.7 Effects of Camera Resolution........................................................ 22
2.2.8 Field of View (FOV) Limitations .................................................. 23
2.2.9 Focus.............................................................................................. 24
2.2.10 Camera Distortion and Calibration................................................ 24
2.2.11 Uneven Illumination ...................................................................... 25
2.2.12 Image Preprocessing...................................................................... 26
2.3 Navigation.................................................................................................... 26
2.3.1 Obstacle Detection and Avoidance................................................ 27
2.3.2 Reactive Navigation....................................................................... 27
2.3.3 Visual Navigation .......................................................................... 28
2.3.4 Early Work in Visual Navigation .................................................. 29
2.3.5 The Polly Family of Vision-based Robots..................................... 30
2.3.6 The ‘Horswill Constraints’ .............................................................33
2.3.7 Floor Detection...............................................................................34
2.3.8 Image Segmentation .......................................................................35
2.3.9 Using Sensor Fusion.......................................................................36
2.4 Mapping........................................................................................................36
2.4.1 Types of Maps ................................................................................37
2.4.2 Topological Maps ...........................................................................37
2.4.3 Metric Maps....................................................................................39
2.4.4 Map Building..................................................................................42
2.4.5 Instantaneous Maps ........................................................................43
2.4.6 Inverse Perspective Mapping (IPM)...............................................43
2.4.7 Occupancy Grids ............................................................................45
2.4.8 Ideal Sensor Model.........................................................................48
2.5 Probabilistic Filters.......................................................................................50
2.5.1 Kalman Filters ................................................................................50
2.5.2 The Kalman Filter Algorithm.........................................................50
2.5.3 Particle Filters.................................................................................53
2.5.4 Particle Weights..............................................................................55
2.5.5 Resampling .....................................................................................55
2.5.6 Particle Diversity ............................................................................57
2.5.7 Number of Particles Required ........................................................57
2.5.8 Particle Deprivation........................................................................58
2.6 Localization ..................................................................................................59
2.6.1 Levels of Localization ....................................................................60
2.6.2 Localization Approaches ................................................................62
2.6.3 Behaviour-Based Localization .......................................................63
2.6.4 Landmark-Based Localization........................................................63
2.6.5 Dense Sensor Matching for Localization .......................................64
2.6.6 Scan Matching ................................................................................64
2.6.7 Improving the Proposal Distribution ..............................................65
2.7 Simultaneous Localization and Mapping (SLAM).......................................65
2.7.1 SLAM Approaches.........................................................................67
2.7.2 Map Features for Kalman Filtering ................................................68
2.7.3 Extensions to the Kalman Filter .....................................................69
2.7.4 Examples of Kalman Filter-based SLAM ..................................... 70
2.7.5 Particle Filter SLAM ..................................................................... 71
2.7.6 A Particle Filter SLAM Algorithm................................................ 72
2.7.7 Sample Motion Model ................................................................... 75
2.7.8 Measurement Model Map.............................................................. 76
2.7.9 Update Occupancy Grid ................................................................ 77
2.7.10 Stability of SLAM Algorithms ...................................................... 77
2.8 Visual SLAM............................................................................................... 79
2.8.1 Visual Localization........................................................................ 79
2.8.2 Appearance-based Localization..................................................... 79
2.8.3 Feature Tracking............................................................................ 81
2.8.4 Visual Odometry............................................................................ 85
2.8.5 Other Approaches .......................................................................... 85
2.9 Exploration................................................................................................... 86
2.9.1 Overview of Available Methods.................................................... 87
2.9.2 Configuration Space ...................................................................... 90
2.9.3 Complete Coverage ....................................................................... 91
2.9.4 The Distance Transform (DT) ....................................................... 91
2.9.5 DT Algorithm Example ................................................................. 92
2.10 Summary ...................................................................................................... 95
Design and Experimentation 97
3.1 Methodology used in this Research ............................................................. 97
3.2 Research Areas Investigated ........................................................................ 97
3.3 The Robots ................................................................................................... 98
3.3.1 Hemisson Robot............................................................................. 99
3.3.2 Yujin Robot ................................................................................. 100
3.3.3 Simulation.................................................................................... 101
3.3.4 X80 Robot.................................................................................... 103
3.4 Robot Motion ............................................................................................. 104
3.4.1 Piecewise Linear Motion ............................................................. 104
3.4.2 Odometry Errors .......................................................................... 106
3.4.3 Effect of Motion on Images......................................................... 106
3.5 Computer Vision........................................................................................ 107
3.5.1 Vision as the Only Sensor............................................................ 107
3.5.2 Experimental Environments .........................................................108
3.5.3 Monocular Vision.........................................................................110
3.5.4 Cheap Colour Cameras.................................................................110
3.5.5 Narrow Field of View...................................................................111
3.5.6 Removing Camera Distortion.......................................................113
3.5.7 Uneven Illumination.....................................................................116
3.5.8 Compensating for Vignetting .......................................................118
3.5.9 Camera Adaptation.......................................................................118
3.6 Navigation...................................................................................................119
3.6.1 Reactive Navigation .....................................................................120
3.6.2 Floor Segmentation Approaches ..................................................122
3.6.3 Colour Spaces ...............................................................................123
3.6.4 k-means Clustering.......................................................................123
3.6.5 EDISON .......................................................................................126
3.6.6 Other Image Processing Operations .............................................126
3.6.7 Implementing Floor Detection .....................................................129
3.7 Mapping......................................................................................................131
3.7.1 Occupancy Grids ..........................................................................131
3.7.2 Inverse Perspective Mapping (IPM).............................................133
3.7.3 Simplistic Sensor Model...............................................................139
3.7.4 Effect of Range Errors..................................................................141
3.7.5 Vision Sensor Noise Characteristics ............................................142
3.7.6 More Realistic Sensor Model .......................................................145
3.7.7 The Radial Obstacle Profile (ROP) ..............................................148
3.7.8 Angular Resolution of the ROP....................................................150
3.7.9 Calculating the ROP .....................................................................150
3.7.10 Effects of Camera Tilt ..................................................................151
3.7.11 Mapping the Local Environment..................................................153
3.7.12 Creating a Global Map .................................................................154
3.8 Localization ................................................................................................155
3.8.1 Odometry ......................................................................................155
3.8.2 Tracking Image Features ..............................................................156
3.8.3 Scan Matching of ROPs ...............................................................158
3.8.4 Using Hough Lines in the Map.....................................................161
3.8.5 Orthogonality Constraint ............................................................. 162
3.9 Simultaneous Localization and Mapping (SLAM).................................... 163
3.9.1 Implementation ............................................................................ 163
3.9.2 Incremental Localization ............................................................. 164
3.9.3 Motion Model .............................................................................. 165
3.9.4 Calculating Weights..................................................................... 167
3.9.5 Resampling .................................................................................. 169
3.9.6 Updating the Global Map ............................................................ 171
3.9.7 Experimental Verification of Localization .................................. 171
3.9.8 Quality of Generated Maps.......................................................... 174
3.10 Other SLAM Implementations .................................................................. 178
3.10.1 DP-SLAM.................................................................................... 179
3.10.2 GMapping .................................................................................... 180
3.10.3 MonoSLAM................................................................................. 181
3.11 Exploration Process ................................................................................... 183
3.12 Summary .................................................................................................... 184
Results and Discussion 185
4.1 Hardware.................................................................................................... 185
4.1.1 The Robots................................................................................... 185
4.1.2 Computational Requirements ...................................................... 186
4.2 Navigation.................................................................................................. 187
4.2.1 Final Floor Detection Algorithm ................................................. 187
4.3 Mapping ..................................................................................................... 191
4.3.1 Inverse Perspective Mapping (IPM)............................................ 191
4.3.2 The Radial Obstacle Profile (ROP) ............................................. 192
4.3.3 Pirouettes ..................................................................................... 193
4.3.4 Discrete Cosine Transform (DCT) of the ROP ........................... 193
4.3.5 Floor Boundary Detection for Creating Local Maps................... 196
4.3.6 Ignoring Vertical Edges............................................................... 198
4.3.7 Handling Detection Errors ........................................................... 198
4.3.8 Warping Camera Images to show Rotations ............................... 200
4.3.9 Creating a Global Map................................................................. 202
4.4 Incremental Localization ........................................................................... 203
4.4.1 Odometry Errors .......................................................................... 203
4.4.2 Using High-Level Features...........................................................204
4.4.3 Vertical Edges ..............................................................................204
4.4.4 Horizontal Edges ..........................................................................209
4.5 Difficulties with GridSLAM.......................................................................213
4.5.1 Failures in Simulation...................................................................214
4.5.2 Sensor Model................................................................................215
4.5.3 Particle Filter Issues .....................................................................217
4.5.4 Maintaining Particle Diversity .....................................................218
4.5.5 Determining the Number of Particles Required ...........................219
4.5.6 Improving the Proposal Distribution ............................................219
4.5.7 Avoiding Particle Depletion .........................................................220
4.5.8 Defining the Motion Model..........................................................221
4.5.9 Calculating Importance Weights using Map Correlation.............222
4.5.10 Resampling Issues ........................................................................224
4.5.11 Retaining Importance Weights to Improve Stability....................225
4.5.12 Adjusting Resampling Frequency ................................................225
4.6 Comparison with Other SLAM Algorithms ...............................................226
4.6.1 DP-SLAM.....................................................................................226
4.6.2 GMapping.....................................................................................228
4.7 Information Content Requirements for Stable SLAM................................230
4.8 Exploration .................................................................................................233
4.9 Experimental Results ..................................................................................236
4.10 Areas for Further Research.........................................................................243
4.10.1 Using Colour in the ROP..............................................................243
4.10.2 Floor Segmentation ......................................................................243
4.10.3 Map Correlation versus Scan Matching .......................................245
4.10.4 Information Content Required for Localization ...........................245
4.10.5 Closing Loops and Revisiting Explored Areas.............................246
4.10.6 Microsoft Robotics Developer Studio ..........................................246
4.11 Summary.....................................................................................................246
Conclusions 247
5.1 Motivation for the Research .......................................................................247
5.2 Mapping Indoor Environments Using Only Vision....................................248
5.3 Using Vision as a Range Sensor.................................................................248
5.4 Mapping and Exploration for Indoor Environments.................................. 249
5.5 Simultaneous Localization and Mapping (SLAM) for use with Vision... 250
5.6 General Recommendations ........................................................................ 251
5.6.1 Adapting the Environment for Robots......................................... 251
5.6.2 Additional Sensors....................................................................... 252
5.7 Contributions.............................................................................................. 253
5.7.1 Theoretical Contributions ............................................................ 253
5.7.2 Practical Contributions ................................................................ 254
5.8 Future Research ......................................................................................... 255
5.9 Summary .................................................................................................... 256
Glossary 257
References 269
LIST OF TABLES
Table 1 – Sample Data from a Distance Transform 93
Table 2 – Intrinsic Parameters for X80 robot 114
Table 3 – Effect of perspective on distance measurements 160
Table 4 – Comparison of Map Quality 178
Table 5 – Comparison of calculated rotation angles 209
Table 6 – Technical Specifications for SICK LMS 200 231
LIST OF FIGURES
Figure 1 – Sample occupancy grid map (Level 7) 5
Figure 2 – Sample occupancy grid map (Level 8) 6
Figure 3 – Screenshot of the Experimental Computer Vision program (ExCV) 9
Figure 4 – Two-Wheeled Differential Drive 16
Figure 5 – Camera Orientation and Vertical Field of View 19
Figure 6 – Pinhole Camera 20
Figure 7 – Example of a topological map 38
Figure 8 – Example of a metric map 39
Figure 9 – Metric map based on a square grid 41
Figure 10 – Occupancy grid created by a simulated robot 46
Figure 11 – Ideal sensor model 49
Figure 12 – Hypothetical example showing sonar rays 59
Figure 13 – EKF SLAM example in Matlab 70
Figure 14 – UKF SLAM example in Matlab 71
Figure 15 – Example of particle trajectories 72
Figure 16 – Configuration Space: (a) Actual map; and (b) C-Space map 91
Figure 17 – 3D view of a Distance Transform 94
Figure 18 – Hemisson robot with Swann Microcam 99
Figure 19 – Playing field showing Hemisson robot and obstacles 100
Figure 20 – Yujin robot with Swann Microcam 100
Figure 21 – Playing field using Lego Duplo obstacles 101
Figure 22 – Simulator view: (a) From robot, and (b) Top-down view 102
Figure 23 – Simulated environment using Microsoft Robotics Developer Studio: (a)
Robot view, and (b) Top-down view 103
Figure 24 – Tobor the X80 robot 103
Figure 25 – Environment showing floor, wall, door and ‘kick strip’ 109
Figure 26 – Sequence of images from a robot rotating on the spot (turning left by 30°
per image from top-left to bottom-right) 112
Figure 27 – Sample calibration images for X80 robot 114
Figure 28 – Complete distortion model for X80 camera 115
Figure 29 – Correcting for Radial Distortion: (a) Original image; and (b) Undistorted
image 116
Figure 30 – Examples of uneven illumination in corridors 116
Figure 31 – Correcting for Vignetting: (a) Original image; and (b) Corrected image 118
Figure 32 – Using vision to detect the floor: (a) Camera image; and (b) Extracted
floor region and steering direction 120
Figure 33 – Segmentation using k-means Clustering 124
Figure 34 – k-means Clustering using pixel location and colour 125
Figure 35 – Segmentation using EDISON: (a) Original image from X80 robot; and
(b) EDISON output 126
Figure 36 – Basic image processing: (a) Source image; (b) Morphological closing;
and (c) Pyramid segmentation 127
Figure 37 – Texture removal: (a) Original edges; (b) Textured area based on edgels;
and (c) Edges removed 127
Figure 38 – Effects of pre-filtering: (a) Image filtered with a 9x9 Gaussian; (b) Edges
obtained after filtering; and (c) Pyramid segmentation after filtering 128
Figure 39 – Flood Filling: (a) Fill with one seed point; (b) Fill from a different seed
point; and (c) Fill after pre-filtering 128
Figure 40 – Camera Field of View (FOV) showing coordinate system and parameters
for Inverse Perspective Mapping (IPM) 134
Figure 41 – Extrinsic parameters (Camera view) 137
Figure 42 – Extrinsic parameters (World view) 137
Figure 43 – Test image for Inverse Perspective Mapping 138
Figure 44 – Verification of Inverse Perspective Mapping 139
Figure 45 – Radial range as a function of pixel coordinates 141
Figure 46 – Corrected Floor Image: (a) Original; and (b) Hand-edited 142
Figure 47 – Frequency of edges in sample data set by scanline 143
Figure 48 – Mean of vision errors by scanline 144
Figure 49 – Standard Deviation of vision errors by scanline 144
Figure 50 – Range errors by scanline 145
Figure 51 – Complex sensor model for errors in range measurement 146
Figure 52 – Complex sensor model (with errors) for shorter range 147
Figure 53 – Typical Radial Obstacle Profile (ROP) Sweep 149
Figure 54 – ROP overlaid on a playing field 150
Figure 55 – Naïve local map showing extraneous walls 152
Figure 56 – Detecting vertical edges: (a) Simulated camera image; (b) Detected
floor; and (c) Detected vertical edges 152
Figure 57 – Corrected local map without walls due to vertical edges 153
Figure 58 – Matching SIFT features between images: (a) Small rotation; and (b)
Large rotation 157
Figure 59 – Overlap in field of view between rotations 159
Figure 60 – Hough Transform of a grid map: (a) Standard; and (b) Probabilistic 161
Figure 61 – Map Correlation for Particle 31: (a) Mask; (b) Global map segment; and
(c) Local map 168
Figure 62 – Map Correlation for Particle 48: (a) Mask; (b) Global map segment; and
(c) Local map 169
Figure 63 – Particle progression during localization and final path on map (from top-
left to bottom-right) 172
Figure 64 – Map Correlations versus map displacement 175
Figure 65 – Normalized SSD versus map displacement 176
Figure 66 – Ground truth for Level 8 176
Figure 67 – Sample SLAM maps for Level 8 177
Figure 68 – Validation of Windows version of DP-SLAM: (a) Map from the
Internet; and (b) Map produced for this Thesis 180
Figure 69 – Validation of Windows version of GMapping: (a) Map from the Internet;
and (b) Map produced for this Thesis 181
Figure 70 – Path to nearest unknown space 184
Figure 72 – Floor detected using Flood Fill and Edges 188
Figure 73 – Floor edges not detected 188
Figure 74 – Shadows detected as walls 189
Figure 75 – Enlarged image showing JPEG artefacts 190
Figure 76 – ROP sweep smoothed using DCT 193
Figure 77 – DCT Coefficients of ROP Data 195
Figure 78 – Floor detection: (a) Camera image; (b) Floor; (c) Contour; and (d) Local
map 198
Figure 79 – Examples of problems with images: (a) Interference; (b) Poor
illumination; (c) Too bright; and (d) Shadows 200
Figure 80 – Overlap between successive images: Top row – simulated camera
images; and Bottom row – warped images 201
Figure 81 – Map generated in simulation overlaid with the actual model 202
Figure 82 – Verticals in an image with a tilted camera 205
Figure 83 – Vertical edges in successive images for rotations on the spot 208
Figure 84 – Obtaining a local map from an image: (a) Image; (b) Floor; (c) Contour;
and (d) Local map 210
Figure 85 – Map created from odometry data: (a) Without incremental localization;
and (b) With incremental localization 211
Figure 86 – Map from simulation with random motion errors (50 particles): (a)
GridSLAM output; and (b) Ground Truth 214
Figure 87 – Map from simulation with no motion errors 215
Figure 88 – Local map created with complex sensor model: (a) Camera image; and
(b) Map showing uncertain range values 216
Figure 89 – Maps created with: (a) Simple sensor model; and (b) Complex sensor
model 217
Figure 90 – Particle trajectories: (a) Before; and (b) After closing a loop 218
Figure 91 – Divergent particle 224
Figure 92 – Ground truth map for simulation 226
Figure 93 – Test 1 with DP-SLAM: (a) Initial test run; and (b) After learning the
motion model 227
Figure 94 – Test 2 with DP-SLAM: (a) 50 Particles; and (b) 100 Particles 228
Figure 95 – Test 1 – GridSLAM compared to GMapping: (a) GridSLAM; and (b)
GMapping 229
Figure 96 – Test 2 – GridSLAM compared to GMapping: (a) GridSLAM; and (b)
GMapping 229
Figure 97 – GMapping with 50 Particles and 5cm grid size 230
Figure 98 – Comparison of vision and laser range finder fields of view 232
Figure 99 – Map produced by GMapping with truncated data 233
Figure 100 – Path generated by Distance Transform: (a) Original map; and (b)
Configuration space showing path 235
Figure 101 – Maps generated by: (a) Simulation; and (b) Yujin Robot 236
Figure 102 – Map of Level 7 produced using raw data 238
Figure 103 – Map of Level 7 using SLAM with 4m vision range 239
Figure 104 – Map of Level 7 using SLAM with 2m vision range 240
Figure 105 – Second map of Level 7 using SLAM 241
Figure 106 – Map of Level 8 using SLAM 242
PUBLICATIONS ARISING FROM THIS RESEARCH
The following publications resulted from work on this PhD. Some of this
material has been incorporated into this thesis. The list is in chronological order.
Taylor, T., Geva, S., & Boles, W. W. (2004, Jul). Monocular Vision as a Range
Sensor. Paper presented at the International Conference on Computational
Intelligence for Modelling, Control & Automation (CIMCA), Gold Coast, Australia.
Taylor, T., Geva, S., & Boles, W. W. (2004, Dec). Vision-based Pirouettes using the
Radial Obstacle Profile. Paper presented at the IEEE Conference on Robotics,
Automation and Mechatronics (RAM), Singapore.
Taylor, T., Geva, S., & Boles, W. W. (2005, Sep). Early Results in Vision-based
Map Building. Paper presented at the 3rd International Symposium on Autonomous
Minirobots for Research and Edutainment (AMiRE), Awara-Spa, Fukui, Japan.
Taylor, T., Geva, S., & Boles, W. W. (2005, Dec). Directed Exploration using a
Modified Distance Transform. Paper presented at the Digital Image Computing
Techniques and Applications (DICTA), Cairns, Australia.
Taylor, T., Geva, S., & Boles, W. W. (2006, Dec). Using Camera Tilt to Assist with
Localisation. Paper presented at the 3rd International Conference on Autonomous
Robots and Agents, Palmerston North, New Zealand.
Taylor, T. (2007). Applying High-Level Understanding to Visual Localisation for
Mapping. In S. C. Mukhopadhyay & G. Sen Gupta (Eds.), Autonomous Robots and
Agents (Vol. 76, pp. 35-42): Springer-Verlag.
Taylor, T., Boles, W. W., & Geva, S. (2007, Dec). Map Building using Cheap
Digital Cameras. Paper presented at the Digital Image Computing Techniques and
Applications (DICTA), Adelaide, Australia.
Johns, K., & Taylor, T. (2008). Professional Microsoft Robotics Developer Studio:
Wrox (Wiley Publishing).
ABBREVIATIONS
CCD – Charge Coupled Device (a type of image sensor for digital cameras)
CML – Concurrent Mapping and Localization (also called SLAM)
DCT – Discrete Cosine Transform
DT – Distance Transform
EKF – Extended Kalman Filter
FOV – Field Of View (of a camera)
HSI/L/V – Hue, Saturation and Intensity/Lightness/Value colour spaces
HVS – Human Visual System
LRF – Laser Range Finder
IPM – Inverse Perspective Mapping
IR – Infra-red
MCL – Monte Carlo Localization
PF – Particle Filter
RGB – Red, Green and Blue colour space
ROP – Radial Obstacle Profile
SIFT – Scale-Invariant Feature Transform
SLAM – Simultaneous Localization And Mapping
NOTE: There is also a Glossary at the end of this thesis.
STATEMENT OF ORIGINAL AUTHORSHIP
The work contained in this thesis has not been previously submitted to meet
requirements for an award at this or any other higher education institution. To the
best of my knowledge and belief, the thesis contains no material previously
published or written by another person except where due reference is made.
Signature:
Date: 6th March, 2009
ACKNOWLEDGEMENTS
After spending over seven years working part-time on a PhD, there is a
plethora of people who deserve thanks. I hope that I have not missed anyone.
I would like to thank my thesis supervisors, Dr. Shlomo Geva and Dr. Wageeh
Boles. They have both been very understanding, especially during the times when I
was making little progress due to my workload.
In the final stages, after some serious distractions including writing a book, the
death of my sister and getting a job in the Microsoft Robotics Group, Dr. Kerry
Raymond came to my aid in rewriting my thesis into a more conventional format. I
am deeply indebted to her for her positive attitude, pragmatic approach and for
pushing me to complete the work.
Several other colleagues at QUT have assisted me in various ways, including
innumerable discussions on topics related to my PhD. At the risk of leaving
somebody out, I will just say thank you to all the academic staff and graduate
students at QUT.
In 2006, I took a sabbatical at the Heinz Nixdorf Institute in Paderborn,
Germany, to work on my PhD. I would like to thank Prof. Ulrich Rückert for his
hospitality in providing me with accommodation and a stimulating research
environment.
The Faculty of IT (now Science and Technology) deserves thanks for the
financial support that enabled me to attend numerous conferences to present papers.
Thanks also to my previous Heads of School, Dr. Alan Underwood and Dr. Alan
Tickle, who supported my endeavours.
I would like to acknowledge the input received from a number of SLAM
researchers. In particular, Prof. Wolfram Burgard and Dr. Giorgio Grisetti from the
University of Freiburg were very helpful in diagnosing problems with the system
used in this thesis due to its unique characteristics.
Other researchers who have taken the time to discuss various SLAM problems
include: Dr. Andrew Davison (Imperial College London); Dr. Ronald Parr (Duke
University); and Dr. Tim Bailey (Australian Centre for Field Robotics).
In 2007, Kyle Johns in the Microsoft Robotics Group invited me to co-author a
book on Microsoft Robotics Developer Studio. This proved to be much more work
than I had anticipated. However, it was a key factor in my getting a job with
Microsoft in Seattle. Thank you Kyle. Thanks also to the editors at Wiley Publishing.
Most importantly, I would like to thank my wife, Denise, for putting up with
me through the trials and tribulations of working on a PhD part-time whilst teaching
full-time. Without her support and constant supply of nourishment, both physical and
emotional, I would literally not be here today.
And finally, to my parents, Geoff and Cynthia, I would like to say thank you
for providing me with a sound education – the best gift parents can give to a child.
Chapter One Introduction
1 Introduction
‘Map generation is a requirement for many
realistic mobile robotics applications.’
– G. Dudek and M. Jenkin,
Computational Principles of Robotics, 2005, Page 214.
Rodney Brooks, Head of the Artificial Intelligence Lab at MIT, wrote a book
about the future of robotics [Brooks 2002]. In it he stated (page 74):
Although their vision has improved drastically over the last two to
three years, today’s robots are still extraordinarily bad at vision
compared to animals and humans.
He claimed that Artificial Intelligence researchers had historically
underestimated the difficulty of computer vision because humans find it so easy. He
went on to say:
Those hard, unsolved problems include such simple visual tasks as
… being able to patch together a stable worldview when walking
through a room.
Brooks’ statement provided the principal inspiration for this research. The
objective has therefore been to build maps using only vision – a very complex task.
1.1 Problem Statement
With aging populations in many countries, there is an increasing need to care
for the elderly. This takes place in health care facilities, such as hospitals, or in
communal living areas like nursing homes. At the same time, work is underway to
automate mundane tasks such as delivering the mail in office buildings and vacuum
cleaning homes.
One solution is autonomous mobile robots that can operate in indoor
environments and perform tasks without human intervention. Invariably, these robots
need maps of their environment so that they can find their way around. Robots
should be able to build their own maps so that consumers – relatively
unsophisticated end-users – can set them up with as little effort as possible. Indeed, if
robots are going to make major inroads into the home market it will not be
acceptable to require the consumer to enter a map into the robot’s memory.
Although sonar and infra-red sensors have been widely used in the past, vision
is a much richer sensory medium. Even robotic toys, such as the WowWee
Robosapien RS Media [WowWee 2006], include a camera because it permits a much
wider range of tasks to be performed than would be possible with simple range
sensors. Tasks such as locating specific objects and face recognition, for instance,
become feasible using vision.
Given the requirement to build maps for homes, hospitals, offices and so on,
and the assumption that future robots will have cameras, it makes sense to
investigate the use of vision for mapping. This is an active area of research, but a
field that is still in its infancy with numerous different approaches currently being
developed.
1.2 Aim
The aim of this research was to investigate the use of computer vision using
cheap digital cameras by autonomous mobile robots for the mapping of indoor
environments.
1.3 Purpose
The main research question of this thesis can be inferred directly from the title:
How can an autonomous robot build maps of indoor environments using vision as its
only sensor?
Subsidiary questions arise naturally in breaking down the task into manageable
components. In particular:
• How can vision be used as a range sensor?
• What mapping and exploration techniques are appropriate for indoor
environments?
• Are existing SLAM (Simultaneous Localization and Mapping)
approaches applicable to visual information?
As this was applied research, one key purpose of this thesis is to document the
problems encountered. It is important to record both what did work and what did not.
This should help future researchers and warn them against travelling down the same
paths only to find that they are dead ends. Negative results are not always reported in
the literature, and instead the problem scope is narrowed down until a solution is
found. Failed experiments might sometimes be thrown away, but there can be
valuable information embodied in these failures. Therefore this thesis includes
details of failures as well as successes.
1.4 Scope
The robots used in this study all had a single colour camera (monocular vision)
as their sole sensory input device. This presents a severe limitation, and consequently
a significant challenge. (Strictly speaking, the camera was the only exteroceptive
sensor. The X80 robot used in the final experiments also used wheel encoders,
which are interoceptive sensors. The Hemisson and Yujin robots, however, did not use
any other sensors.)
To develop a commercial system, considerable effort would have to be put into
addressing the safety aspects of robots operating in human workspaces. One of the
best ways to ensure safety and reliability is to use multiple sensing modes. For
example, both infra-red and vision could be used together. These different sensors
complement each other and can ensure that a single sensor failure will not result in
an accident. The emphasis in this thesis, however, is only on vision.
This research was specifically limited to indoor environments with flat floors
and reasonably uniform lighting. The floor also needed to be fairly uniform in
colour, and in particular it could not have a pattern, so tiled floors were excluded.
Although this might seem like a strong restriction, it has often been assumed in the
literature that the floor is a uniform colour [Horswill 1993], [Howard & Kitchen
1997], [Cheng & Zelinsky 1998], [Lenser & Veloso 2003]. It is very common in
practice for the floors of offices to be painted a single colour, or carpet to be
primarily of one colour with a small amount of texture that can easily be filtered out
of the images.
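As a rough illustration of this kind of colour-based floor classification (not the actual algorithm used in this thesis), the following sketch labels pixels as floor if they are close to a reference floor colour. The per-channel tolerance and the idea of sampling the reference colour in front of the robot are illustrative assumptions.

```python
def floor_mask(image, ref_rgb, tol=40):
    """Label pixels as floor if they lie within tol of a reference floor colour.

    image   -- 2D list of (r, g, b) tuples, one inner list per image row
    ref_rgb -- reference floor colour, e.g. sampled just in front of the robot
    tol     -- per-channel tolerance that absorbs mild carpet texture

    Returns a 2D list of booleans (True = floor).
    """
    rr, rg, rb = ref_rgb
    return [[abs(r - rr) <= tol and abs(g - rg) <= tol and abs(b - rb) <= tol
             for (r, g, b) in row]
            for row in image]
```

A wider tolerance filters out more texture but risks absorbing low-contrast obstacles into the floor region.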
The primary objective of this thesis was to build maps autonomously. This
involved developing an effective exploration algorithm and the use of SLAM based
on visual information. Although SLAM can be studied independently of exploration
by driving a robot manually, autonomous operation is highly desirable. Therefore
exploration was included in the scope.
One of the common research topics in localization is the ‘kidnapped robot’
problem, also referred to as global localization, where the robot must determine
where it is on a map with no initial knowledge of its position. Given that the focus
was on building maps of unknown environments using SLAM, global localization
was outside the scope.
These assumptions and constraints are further elaborated in Chapter 2 –
Background and Literature Review.
1.5 Original Contributions
There are no robots on the market today (2008) that can reliably operate
autonomously in an unstructured environment solely using vision. Even in research
environments, the robots reported in the literature invariably used sonar or laser
range finders as their primary range sensors. Vision tends to be used as an additional
sensor for identifying objects, people, or locations, but not primarily for navigation
and especially not for mapping.
This project, which began in January 2002, has contributed to the body of
knowledge by demonstrating mapping of unknown indoor environments using only
vision. The principal contribution of this thesis is the confluence of range estimation
from visual floor segmentation with SLAM to produce maps from a single forward-
facing camera mounted on an autonomous robot. During the external review of this
thesis, contemporaneous work came to light [Stuckler & Behnke 2008] which
demonstrates that other researchers see value in the approach taken in this thesis.
The primary output of this project was a method of producing occupancy grid
maps that can be generated using only vision, which addresses the major research
question. The feasibility of vision-only mapping is confirmed by Figure 1 and Figure
2 which show examples of maps covering areas roughly 20 metres square for floors
7 and 8 in QUT S Block. (The ground truth is shown in red). These maps use the
standard convention for an occupancy grid in which white represents free space,
black cells are obstacles and shades of grey correspond to the (inverse) certainty that
a cell is occupied. They were produced by an X80 robot with a low-resolution colour
camera.
Figure 1 – Sample occupancy grid map (Level 7)
Figure 2 – Sample occupancy grid map (Level 8)
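The greyscale convention used in these maps (white for free space, black for obstacles, grey for uncertainty) amounts to a simple mapping from occupancy probability to pixel intensity. The sketch below assumes the grid stores P(occupied) per cell; the thesis's internal grid representation is not shown here.

```python
def grid_to_image(grid):
    """Render an occupancy grid as 8-bit grey levels.

    grid holds P(occupied) per cell in [0, 1]: free cells (p = 0) render
    white (255), occupied cells (p = 1) render black (0), and uncertain
    cells (p = 0.5) render mid-grey.
    """
    return [[int(255 * (1.0 - p) + 0.5) for p in row] for row in grid]
```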
With regard to the subsidiary research questions:
• The issue of using monocular vision as a range sensor was resolved
early in the work. Although vision is a highly non-linear range sensor,
due to perspective, it can be used reliably over short ranges.
• An Occupancy Grid was found to be a suitable map representation that
also lent itself to using a Distance Transform for exploration.
• Several insights were obtained into the operation of SLAM algorithms
with a restricted amount of input data, such as a cheap camera which
has a limited field of view and low resolution. Areas for further
investigation were identified.
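The Distance Transform mentioned above can be illustrated with a minimal grassfire (wavefront) sketch over a grid. The cell encoding and 4-connected neighbourhood are assumptions for illustration; the modified transform developed in this thesis (sections 3.11 and 4.8) adds further refinements.

```python
from collections import deque

FREE, OBSTACLE = 0, 1  # illustrative cell encoding

def distance_transform(grid, goals):
    """Grassfire (wavefront) distance transform over a 2D grid.

    grid  -- 2D list of cell states; distances propagate through FREE cells only
    goals -- list of (row, col) seed cells, e.g. unexplored frontier cells

    Returns a 2D list of integer step counts; unreachable cells stay None.
    """
    rows, cols = len(grid), len(grid[0])
    dist = [[None] * cols for _ in range(rows)]
    queue = deque()
    for r, c in goals:
        dist[r][c] = 0
        queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == FREE and dist[nr][nc] is None):
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    return dist
```

Moving the robot downhill along the resulting distance gradient leads it to the nearest goal cell; seeding the transform with unexplored cells turns it into an exploration strategy.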
One of the significant problems encountered in this research was the
assumption that SLAM was a solved problem due to the large body of literature on
the subject. For example, it was claimed in a series of tutorial articles in the IEEE
Robotics and Automation Magazine in 2006 that the problem has been solved, at
least in a theoretical sense [Durrant-Whyte & Bailey 2006, Bailey & Durrant-Whyte
2006].
Although several different approaches to Visual SLAM were identified in the
initial research, none of these approaches proved to be appropriate to the
experimental setup in this work. Corridors in office buildings tend to be bland and
featureless. Also, the robot had only a single camera and images were captured at a
very low frame rate (due to hardware limitations) and exhibited motion blur during
moves. Significant pose changes between images made feature tracking almost
impossible.
Despite the challenges of SLAM and the limitations inherent in using vision,
the final system was able to build maps of acceptable quality for navigating around a
building. This was the most significant output of the research. Many of the practical
problems encountered in using vision are also addressed.
In detail, the following areas are original work (with relevant citations for
published papers where applicable):
• Development of an efficient system of equations for performing Inverse
Perspective Mapping (IPM) to enable monocular vision to be used as a
range sensor [Taylor et al. 2004a] (see sections 3.7.2 and 4.3.1);
• A new approach to mapping using Radial Obstacle Profiles (ROPs) to
represent surrounding obstacles. ROPs are obtained by instructing the
robot to perform a pirouette (spin on the spot) and are used for creating
local maps [Taylor et al. 2004b, Taylor et al. 2005a] (see section 3.7.7
and 4.3.2);
• Extensions to the use of the Distance Transform for use in autonomous
exploration [Taylor et al. 2005a, Taylor et al. 2005b]. See sections 3.11
and 4.8. These extensions include modifications to the Distance
Transform for collaborative and/or directed exploration [Taylor et al.
2005b] (although most of this material is omitted from this thesis
because it is not directly relevant);
• Development of a simple method for detecting vertical edges in an
image when the camera is tilted [Taylor et al. 2006, Taylor 2007] (see
sections 3.7.10 and 4.4.3); and
• Development of methods for incremental localization based on
knowledge of the structure of the environment [Taylor et al. 2006,
Taylor 2007, Taylor et al. 2007]. This constraint-based localization
improves the stability of the GridSLAM algorithm and allows
operation with fewer particles than would otherwise be required. See
section 4.5.
In addition, this thesis contains details of previously unpublished work:
• Use of the Discrete Cosine Transform to compactly describe Radial
Obstacle Profiles for comparison purposes (see section 4.3.4).
This research produced a substantial amount of software, including:
• An OpenGL simulator for computer vision experiments, called the
SimBot_W32 program;
• A simulation environment built using the Microsoft Robotics
Developer Studio (MRDS) and sample code for producing occupancy
grid maps [Taylor 2006a, Johns & Taylor 2008];
• A library for controlling X80 robots including a version that runs on a
PDA allowing remote teleoperation of an X80 from a hand-held device
[Taylor 2006b]; and
• A complete working system based on X80 robots for map building in
indoor environments, known as ExCV.
A screenshot of the program for building maps using vision, called ExCV
(Experimental Computer Vision), is shown below in Figure 3. This diagram only
shows about half of the available windows that the program can display for
diagnostic purposes.
Figure 3 – Screenshot of the Experimental Computer Vision program (ExCV)
The ExCV computer program required expertise in software design and
development, simulation, graphics (using OpenGL), image processing, computer
vision, SLAM and robotics.
The author’s work with Microsoft Robotics Developer Studio (MRDS) was
instrumental in his being invited to co-author a textbook on MRDS. This book was
written concurrently with this thesis and published in May 2008 [Johns & Taylor
2008]. Some of the sample code published with the book was based on code
developed for this research.
1.6 Overview of the Thesis
This thesis is divided into five chapters:
1. Introduction
2. Background and Literature Review
3. Design and Experimentation
4. Results and Discussion
5. Conclusions
Chapters 2, 3 and 4 are organised according to the following logical structure.
(The terms in italics are defined in the following paragraphs). These topics cover all
the major aspects of visual map building, each of which has to be addressed when
developing a complete system.
• Robots and Robot Motion;
• Computer Vision;
• Navigation, including Floor Detection using Vision;
• Map Building from visual data using Inverse Perspective Mapping and
the Radial Obstacle Profile;
• Exploring unknown environments using the Distance Transform;
• Localization, and in particular Incremental Localization; and
• Simultaneous Localization and Mapping.
Chapter 2 (Background and Literature Review) provides background
information as well as a review of the applicable literature. The research covered a
broad range of topics, as should be evident from the bullet points above.
The design and development process is discussed in Chapter 3 (Design and
Experimentation). The nature of the research meant that the design evolved over
time based on experimental prototypes. Therefore the chapter incorporates some
results of the research which were fed back into the design.
Chapter 3 discusses using vision as a range sensor by identifying the floor
boundary and then applying Inverse Perspective Mapping to determine the real-
world coordinates of the boundary points from pixel coordinates in the image.
Converting to a real-world reference frame is necessary in order to perform basic
navigation, exploration, and map building.
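As a rough illustration of inverse perspective mapping under the flat-floor assumption, the following pinhole-camera sketch projects a floor-boundary pixel onto the ground plane. The parameters (focal length f in pixels, principal point, camera height h, downward tilt) are illustrative, and the equations derived in this thesis (sections 3.7.2 and 4.3.1) may differ in form.

```python
import math

def pixel_to_ground(u, v, f, cx, cy, h, tilt):
    """Project pixel (u, v) onto a flat floor seen by a tilted pinhole camera.

    f      -- focal length in pixels
    cx, cy -- principal point (image centre)
    h      -- camera height above the floor (metres)
    tilt   -- camera tilt below horizontal (radians)

    Returns (lateral, forward) floor coordinates relative to the camera,
    or None if the pixel looks at or above the horizon.
    """
    dx = u - cx
    dy = v - cy
    # Downward component of the viewing ray in a floor-aligned frame.
    denom = dy * math.cos(tilt) + f * math.sin(tilt)
    if denom <= 0:
        return None  # no intersection with the floor
    s = h / denom                                       # scale to reach the floor
    forward = s * (f * math.cos(tilt) - dy * math.sin(tilt))
    lateral = s * dx
    return lateral, forward
```

Note the non-linearity: rows near the horizon map to rapidly growing forward distances, which is why monocular range estimates are only reliable over short ranges.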
Building a map does not happen automatically just by wandering around at
random – the robot must perform some sort of directed exploration. Chapter 3
therefore discusses the algorithm that was developed to explore the environment and
how it ensured that the whole area was covered.
Once basic exploration and map building were proven to work in simulation, it
was necessary to move out of the artificial environment of the simulator and on to a
real robot. This immediately introduced one of the inherent complexities of the real
world – tracking the robot’s pose (position and orientation).
Robots can quickly get lost if they are not constantly monitoring the effects of
their own movements because commands to the wheels rarely produce exactly the
desired motions. Discovering where you are with respect to a map is a process called
localization, and it is also covered in Chapter 3.
There is a basic conundrum in robotics: building an accurate map requires
accurate localization, but localization requires a map. This can be resolved through a
process called Simultaneous Localization and Mapping (SLAM), sometimes also
referred to as Concurrent Mapping and Localization (CML). Chapter 3 covers the
basic principles of SLAM.
Chapter 4 (Results and Discussion) presents the results of the study and
discusses these outcomes. SLAM is not yet an exact science and there are still many
practical issues to be resolved. Chapter 4 covers some problems encountered in
implementing SLAM using only vision.
Existing SLAM algorithms proved to be unstable with the experimental
hardware used in this project due to insufficient information content in the input data
stream. However, by applying additional constraints and performing incremental
localization (in addition to the localization embedded in the SLAM algorithm), it
was possible to develop a stable system that successfully built maps of office
corridors. The chapter finishes with a list of areas for further investigation.
Finally, Chapter 5 (Conclusions) draws conclusions from the research and
reiterates the contributions made by this work. It directly addresses the original
research questions.
Note that there is a Glossary at the end of this thesis. Terms that are in the
Glossary appear in italics the first time they are used, and also to reinforce the fact
that explicit definitions are available. This saves having to include definitions or
footnotes in the main body of the text. There is some overlap between the Glossary
and the list of abbreviations in the front of the thesis.
Chapter Two Background and Literature Review
2 Background and Literature Review
‘Vision is our most powerful sense.
Vision is also our most complicated sense.’
– B.K.P. Horn,
Robot Vision, 1986, page 1.
2.1 Background
All robots need maps to achieve useful tasks [Thrun 1998], and map building is a
fundamental task for mobile robots [Grisetti et al. 2005]. The objective of this
research was therefore to achieve the task of mapping of indoor environments using
vision. In order to complete this task, the robot required the ability to:
• Measure distances to obstacles using images from a camera;
• Build maps based on what it saw;
• Explore its environment in a systematic and thorough way; and
• Keep track of its location and orientation in the presence of real-world
problems like wheel slippage.
Endowing a robot with these skills required research into a diverse range of
topics including computer vision, robot motion, inverse perspective mapping (IPM),
mapping algorithms, and exploration methods. Putting it all together required a
Simultaneous Localization and Mapping (SLAM) algorithm because mapping had to
be performed at the same time as exploration which in turn required localization to
ensure that the robot did not get lost. All of these topics are addressed in this chapter.
2.1.1 Brief history of Vision-based Robots
Robots have used vision for navigation to a greater or lesser extent for many
years. This section presents a few examples that are indicative of the types of
approaches that have been used. A more detailed review can be found in [DeSouza
& Kak 2002].
The earliest attempt at using computer vision was probably Shakey the robot,
developed at the Stanford Research Institute (SRI), which operated in an artificial
environment with coloured blocks. A
detailed history of Shakey is available in [Nilsson 1984]. It was called Shakey
because of the way it vibrated when it moved.
Shakey was followed by the Stanford Cart [Moravec 1977] which used two
cameras, one of which could be moved to provide different viewpoints for stereo
vision. It was very slow and test runs took hours with the available computing power
at the time. Image processing took so long that the shadows during outdoor test runs
became a problem because they would actually move noticeably between images.
Moravec moved to Carnegie-Mellon University (CMU) in 1980 where he
developed the CMU Rover [Moravec 1983]. This launched a long history of robotics
and vision at CMU, including the well-known CMUcam [CMU 2006]. Moravec’s
work gave us the Moravec Interest Operator [Moravec 1980], which was an early
attempt to select significant features from an image. Moravec is also famous for
formulating occupancy grids for mapping [Moravec & Elfes 1985], although these
were created using sonar, not vision.
In the early 1990s, one of the best known robots was Polly, which was built by
Ian Horswill [Horswill 1993]. This used a low resolution black and white camera to
detect the floor and people. To attract the robot’s attention it was literally necessary
to ‘shake a leg at it’. This robot had a built-in map of part of a building at MIT,
including the place where the carpet changed colour so it could navigate across this
apparent ‘boundary’. Polly was known to have several deficiencies, including
stopping when it saw shafts of light coming in through the windows.
Polly was the first in a long family of robots that operated by detecting the
floor, and thereby identifying the surrounding obstacles. These include [Lorigo et al.
1997, Howard & Kitchen 1997, Maja et al. 2000, Murray 2000]. This research
therefore followed in the Polly tradition and adopted what are referred to here as the
‘Horswill Constraints’ (explained later in this chapter in section 2.3.6). These are
constraints on the environment that simplify the task of locating the floor.
The ER-1 robot [Evolution Robotics 2005] (now discontinued) used a web
camera to perform what Evolution Robotics called Visual Simultaneous Localization
and Mapping (vSLAM). This was basically place recognition based on building a
large database of unique image features. These features were obtained using the
SIFT (Scale Invariant Feature Transform) algorithm [Lowe 2004]. It is common for
SIFT-based systems to use several thousand features in an image.
Andrew Davison [Davison & Murray 1998, Davison 1999] used features to
create what he refers to as MonoSLAM. This system built maps by tracking
hundreds of features from one image to the next and eventually determining their
physical location in the world by applying an Extended Kalman Filter (EKF).
Further discussion of Visual SLAM can be found in section 2.8.
2.1.2 Robot Motion
To construct accurate maps, the robot must know where it is with a high degree
of certainty. This pose consists of a 2D position and an orientation, which is a total
of three variables.
A common configuration for robots used both in research and industry is the
differential drive. This consists of two wheels that can be driven independently,
either forward or backward. There might also be a castor (or jockey) wheel for
balance, but it can be ignored as far as the robot’s motion is concerned. This drive
configuration is non-holonomic because the robot cannot instantaneously move in
any direction, although a combination of translations and rotations does allow the
robot to reach any point on a 2D plane.
The kinematics of this type of robot have been studied extensively, and the
generalised equations of motion are readily available from textbooks [Dudek &
Jenkin 2000, Choset et al. 2005].
Figure 4 – Two-Wheeled Differential Drive
Consider Figure 4 which shows a two-wheeled differential drive viewed from
above. This type of robot can drive forwards or backwards, rotate on the spot or
drive in arbitrary arcs.
If the robot motions are constrained to only translations or rotations, then the
equations of motion become trivial and are related directly to the geometry of the
robot and the resolution of the wheel encoders. Three parameters must be known
(refer to Figure 4):
• Number of ticks of the wheel encoders per wheel revolution, n
• Radius of the wheels, r
• Wheelbase or distance between the wheels, w.
For a translation (forward motion) both wheels must turn by the same amount.
Due to the way that the motors are mounted, this usually results in the wheel encoder
values changing in opposite directions, but both values should change by the same
absolute amount, ∆e. The distance travelled, d, is therefore:
d = 2πr · Δe / n (1)
Expressed in this form it is clear that the distance is equal to the perimeter of
the wheel times the ratio of the number of ticks moved to the number of ticks per
revolution.
Chapter Two Design and Experimentation
When the robot rotates on the spot, the wheels turn in opposite directions
around the circumference of a circle with diameter equal to the wheelbase, w. (This
circle is shown in Figure 4). In this case the angle of rotation, α, as measured around
the centre of the robot is:
α = 4πr · Δe / (n · w) (2)
These formulae can be easily rearranged to obtain the number of ticks required
for a particular motion.
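These relationships can be illustrated with a short sketch (the robot parameters below are hypothetical examples, not values from this thesis):

```python
import math

# Hypothetical example parameters (not the values used in this work).
N_TICKS = 100     # encoder ticks per wheel revolution, n
RADIUS = 0.03     # wheel radius in metres, r
WHEELBASE = 0.15  # distance between the wheels in metres, w

def translation_distance(delta_e):
    """Equation (1): distance travelled when each wheel moves delta_e ticks."""
    return 2 * math.pi * RADIUS * delta_e / N_TICKS

def rotation_angle(delta_e):
    """Equation (2): on-the-spot rotation in radians for delta_e ticks."""
    return 4 * math.pi * RADIUS * delta_e / (N_TICKS * WHEELBASE)

def ticks_for_distance(d):
    """Equation (1) rearranged to give the ticks needed for distance d."""
    return d * N_TICKS / (2 * math.pi * RADIUS)
```

With these example figures, one full wheel revolution (100 ticks) drives the robot about 0.19 m, or rotates it on the spot by about 2.5 radians.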
In much of the literature, robots are controlled by adjusting the (rotational)
speed of the wheels. Sometimes this is simply referred to as the speed of the robot,
but wheel (rotational) speed and the speed of the robot (axle speed) are not the same.
In this approach the robot is kept constantly moving and therefore needs to make
real-time decisions to avoid collisions and explore its environment.
In this thesis, a different approach was used whereby only pure translations or
rotations were performed. After each motion, the robot could take as long as
necessary to compute the next move. This ‘piecewise linear’ motion removes the
requirement to control the robot in real-time.
A key input to the mapping process is information relating to the actual
motions of the robot – how it transitions from one pose to the next. This might take
the form of the control input to the wheels, such as motor voltages or currents.
However, it is more likely to involve using wheel encoders to obtain odometry
information. To maintain the current pose, the robot must integrate odometry
information over time. Odometry is notoriously unreliable [Borenstein & Feng 1996,
Murphy 2000], and accumulated errors can result in the robot becoming lost
[Dellaert et al. 1999].
A great deal of research has focused on tracking camera movements, referred
to as visual odometry, or determining where the robot is based on what it sees, called
visual localization. These are discussed in the section on Visual SLAM.
2.2 Computer Vision
The primary objective of a computer vision system is to segment images to
obtain useful information about the surrounding environment. If software were
available that could perfectly segment an image and reliably identify all of the
distinct objects that it contained, a significant portion of the vision problem would
have been solved. Segmentation is still an active area of research as a topic in its
own right. For further elaboration, see the sections on floor detection (3.6.2 and
onwards) which outline various methods of segmentation that were trialled.
2.2.1 Stereo versus Monocular Vision
Stereo vision is a common configuration in the literature as well as in nature.
Using two cameras with a reasonable distance between them allows the stereo
disparity to be calculated, and hence depth or distance to objects in the field of view
established.
To extract depth information using stereo vision the correspondence between
pixels in the two images must be established unambiguously. This ‘data
correspondence’ problem is a recurring theme in computer vision. It is often
necessary to match points in two images. Optic flow, for instance, has exactly the
same problem as stereo disparity matching, but it is a temporal rather than a spatial
problem.
In addition to calibrating the individual cameras, a stereo ‘head’ also needs to
have the two cameras accurately calibrated so that their distance apart and vergence
angle are precisely known.
Stereo vision is not a pre-requisite for making range estimates. If the camera
parameters are known and the floor can be segmented in the image, then the range to
obstacles can be obtained using Inverse Perspective Mapping (see section 2.4.6)
applied to a single image, i.e. monocular vision.
2.2.2 Digital Cameras
Because digital cameras create images consisting of picture elements (pixels)
there are quantization effects when a digital image is created.
The number of pixels across the image is known as the horizontal resolution.
The vertical resolution is the number of rows or scanlines in the image. The ratio
between the horizontal and vertical resolutions is the aspect ratio of the camera.
Typical cameras have an aspect ratio of 4:3.
The cameras used in computer vision vary considerably in their specifications.
In this work cheap colour cameras (commonly referred to as web cameras) were
used. They have fairly low resolution (discussed below) which ensures that the
amount of computation involved in processing images is not excessive.
The intrinsic parameters of a camera describe its internal properties. The
extrinsic parameters define the configuration of the camera, and in particular its
orientation in terms of a real-world coordinate system.
Cameras are usually mounted on top of robots. They typically have a limited
field of view (FOV), measured as the visible angle between the left edge of the image
and the right edge (the horizontal field of view).
Figure 5 – Camera Orientation and Vertical Field of View
Figure 5 shows a side view of a robot with a camera. The vertical field of view
is shown (measured as the angle between the bottom scanline and the top scanline in
the image).
Notice that the camera in Figure 5 is tilted to make best use of the limited
FOV. Even with this tilt, there is still a blind area in front of the robot that cannot be
seen in the camera view.
2.2.3 Pinhole Camera Approximation
This work uses the pinhole camera approximation as a mathematical model of
the camera. The formulae for a pinhole camera are well known and usually covered
in the early chapters of textbooks on computer vision. Refer to [Horn 1986] page 20
for instance, which is one of the classic texts on computer vision. However, the same
equations can be found in many other textbooks with much more detailed coverage,
such as [Faugeras 1993, Hartley & Zisserman 2003].
Figure 6 – Pinhole Camera
In Figure 6 the ‘pinhole’ is at the origin of the camera coordinate system,
labelled O. A light ray from a point in the real world, P (x, y, z), passes through the
pinhole and hits the image plane at P' (x', y', z').
Note that the image is created on a plane at right angles to the camera’s
principal (optical) axis. To maintain a right-hand coordinate system, the z-Axis
points from the pinhole (the origin) towards the image plane. Therefore the distance
along the z-Axis to the image plane remains constant and is shown as f in the
diagram.
The equations relating the coordinates of P' to P can be obtained using simple
geometry and presented here without derivation:
x' = f · x / z (3)

y' = f · y / z (4)
Notice in these equations that the pixel coordinates vary inversely with the
distance of a point in the real world from the camera.
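Equations (3) and (4) amount to a two-line function; the following sketch (with illustrative values only) also demonstrates the inverse relationship with distance:

```python
def project(x, y, z, f):
    """Pinhole projection, equations (3) and (4): a world point (x, y, z)
    in camera coordinates maps to image coordinates (f*x/z, f*y/z)."""
    if z <= 0:
        raise ValueError("point must be in front of the pinhole (z > 0)")
    return f * x / z, f * y / z

# Doubling the distance of a point halves its image coordinates:
# project(1, 2, 4, 500) gives (125.0, 250.0); project(1, 2, 8, 500) gives (62.5, 125.0).
```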
If the lens in the camera is perfect, then the same equations apply. However,
real lenses are not perfect so some amount of distortion occurs which complicates
the equations.
2.2.4 Focal Length
The distance from the origin to the image plane in Figure 6, labelled f, is the
focal length. In a conventional camera, the lens is located at the origin and it can be
moved backwards and forwards along the optical axis to change the focus. This
involves changing the focal length.
Light passing through the camera lens will only be in focus if it comes from a
single plane in the real world parallel to the image plane. However, over a certain
range, called the depth of field, points in the real world will converge to
approximately the same place on the image plane. This is easy to understand in terms
of a digital camera because the image plane is broken up into picture elements, or
pixels, which correspond to a small area on the image plane.
For digital cameras the focal length is expressed in units of pixels because the
coordinates on the image plane are measured in pixels. (The ratios in equations 3 and
4 are dimensionless).
Note that the classical treatment of pinhole cameras relies on knowing the
camera’s focal length. This information is not always available for a camera.
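When the focal length in pixels is not published by the manufacturer, it can be estimated from the horizontal field of view and the image width using the standard pinhole relation f = (W/2) / tan(FOV/2). A minimal sketch with illustrative numbers:

```python
import math

def focal_from_fov(width_px, hfov_deg):
    """Estimate focal length in pixels from image width and horizontal FOV."""
    return (width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

# A 640-pixel-wide image with a 60 degree horizontal FOV implies
# a focal length of roughly 554 pixels.
```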
2.2.5 Perspective
A key feature of pinhole cameras (as opposed to fish-eye or wide angle lenses)
is that straight lines remain straight in the image. The camera, however, introduces a
perspective transformation.
The point at which ‘parallel’ lines in an image appear to converge is called a
vanishing point. In general, there can be multiple vanishing points for an image
depending on the geometry of the objects in view.
The primary problem with perspective is that depth information is lost in the
transformation from the 3D world to a 2D image. Referring back to Figure 6, the
pixel in the image could result from any point in the real world between the camera
origin and infinity along the light ray shown as (P'-O-P).
2.2.6 Range Estimation
Clearly, to build metric maps (maps that are drawn to scale) it is necessary to
measure distances. In this research vision was used as the range sensor. Recovering
depth information from images is one of the key areas of vision research. There are
many different ways to do this.
If the pixels in the image can be classified as either floor or non-floor, this
eliminates one degree of freedom in the 3D locations of the real-world points that
correspond to the floor pixels, which must be on the ground plane (floor). Simple
geometry can then be used to obtain range estimates assuming that the camera
configuration is known.
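This geometry can be sketched as follows, assuming a camera at a known height above the floor, tilted downwards by a known angle, with the focal length (in pixels) and the image row of the principal point also known (all parameter names here are illustrative):

```python
import math

def ground_range(v, cam_height, tilt_rad, f_px, cy):
    """Distance along the floor to the point seen at image row v,
    assuming that pixel lies on the ground plane (v increases downwards)."""
    # Angle of the viewing ray below the horizontal.
    depression = tilt_rad + math.atan((v - cy) / f_px)
    if depression <= 0:
        return math.inf  # ray at or above the horizon: no floor intersection
    return cam_height / math.tan(depression)
```

Rows lower in the image map to points closer to the robot, and rows at or above the horizon map to no floor point at all.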
2.2.7 Effects of Camera Resolution
Consider for example one of the earliest digital cameras that only had a
resolution of 64x48 pixels. Each horizontal scanline (one of the 48 rows of pixels)
corresponds to a large distance across the floor in the real world. Now imagine a
camera with a resolution of 640x480 (VGA resolution) that captures exactly the
same scene. Obviously the higher resolution means that the distances to obstacles
can be resolved more accurately.
Side Note:
Due to perspective effects, it is not possible to say that a resolution of 640x480
is 10 times more ‘accurate’ than 64x48. Distances in an image vary non-linearly.
Furthermore, the tilt angle of the camera has an effect as well. It is only
necessary to consider two cases to illustrate this – a camera oriented parallel to the
ground and one facing directly down at the ground.
For the horizontal camera, the centre scanline corresponds to the horizon,
which is at infinity. One scanline below this is still a long way away, but
significantly less than infinity, and so forth until the bottom scanline which is right in
front of the robot.
On the other hand for a camera oriented vertically, the distance from the
camera (measured across the floor) is zero at the centre scanline and only varies by
some small amount from the top scanline to the bottom scanline of the image.
Higher resolution for the horizontal camera will obviously result in much finer
granularity for the real-world distances between scanlines in the image. However, for
the vertical camera it might only make differences of fractions of a centimetre – not
enough to be significant on a scale that is the size of a robot.
Detecting edges in images is also affected by the resolution. Edges are
important features in image analysis. With a very low resolution an edge will be
subject to very bad ‘jaggies’ – a stair-step effect caused by the quantization of an
edge that does not align with one of the horizontal or vertical axes. These jaggies might
disguise the existence of an edge.
2.2.8 Field of View (FOV) Limitations
One of the significant problems of many cheap digital cameras is a limited
field of view (FOV). A typical web camera has a FOV of 40-60°. In contrast,
humans have a FOV of up to 200°. (The exact range depends on multiple factors
including age, race and sex.)
A restricted FOV means that the robot must ‘look around’ to get a good view
of the surrounding environment when building a map. Also, to make the best use of
the available FOV, the cameras on robots are often tilted downwards as in Figure 5.
Even with this tilt, there might be an area in front of the robot that cannot be seen,
labelled as the Blind Area in Figure 5.
Panoramic cameras have been widely used to overcome FOV limitations
[Yamazawa et al. 1995, Baker & Nayar 1998, Ishiguro 1998]. There are various
types of special-purpose cameras: omnidirectional cameras (360° FOV); panoramic
cameras (around 180° FOV) and cameras with wide-angle or fish-eye lenses.
The common factor in all of these is significant distortion of the image. In
these cases, standard perspective no longer applies and straight lines no longer
appear straight in the image. Nayar specifically developed a catadioptric camera (one
that uses mirrors) so that he could extract true perspective images from the
omnidirectional image [Nayar 1997].
2.2.9 Focus
Cheap cameras usually have a fixed focus, unlike the plethora of digital
cameras now on the market for consumer use which usually have an autofocus
mechanism. Autofocus can be an advantage as far as the quality of the image is
concerned, especially when attempting to extract edge information. However, if the
camera automatically changes the focus there is no way to obtain feedback from the
camera to determine the new focal length, which is one of the camera’s intrinsic
parameters.
The problem with fixed focus is that some part of the image will invariably be
out of focus. Typically cheap cameras are focussed at ‘infinity’. This leads to
blurring in the foreground and can make it harder to detect edges.
2.2.10 Camera Distortion and Calibration
A significant problem with many cameras is lens distortion. The effects are
highly non-linear. The primary effect is radial distortion. This results in straight lines
becoming increasingly curved towards the edges of the image. Clearly, when
building a map, it is important to perceive walls as being straight.
By taking pictures of a ‘calibration target’ (usually a checkerboard pattern)
from various different positions, it is possible to calculate the intrinsic parameters of
the camera, which include the focal length and various distortion parameters
[Bouguet 2006]. These parameters allow images to be undistorted by applying an
appropriate algorithm [Intel 2005]. See section 3.5.6 for details.
Given the intrinsic parameters, it is also possible to calculate the extrinsic
parameters for a given image using the same software. The extrinsic parameters
consist of a rotation matrix and a translation vector that transform camera
coordinates to real-world coordinates aligned with the calibration target. These
parameters can be used to obtain the camera tilt angle.
Another common problem with cheap cameras is vignetting which appears as a
darkening of the image towards the corners. This problem arises because the
camera’s CCD array is rectangular but the lens is round, which restricts light from
reaching the corners of the CCD. Although this problem is mentioned in textbooks
on computer vision, such as [Forsyth & Ponce 2003], no solutions are offered.
2.2.11 Uneven Illumination
If the lighting in the environment is not uniform, as in corridors where the
lighting levels vary, it might be necessary to adjust the brightness of the images in
the software to provide a more uniform range of values for floor pixels. This is a
different issue from vignetting.
Many cameras have difficulty when operating in low light conditions, resulting
in images of poor quality with many pixels approaching dark grey or black. The
human visual system (HVS) achieves a large dynamic range by gradually switching from cones (which
detect colour) to rods (which detect only intensity, i.e. greyscale). This is why human
‘night vision’ is black and white. However, digital cameras have not yet achieved
similar levels of performance.
The problem of poor and/or uneven illumination has plagued computer vision
for many years [Funt & Finlayson 1995, Funt et al. 1998]. The issue is referred to as
colour constancy – maintaining stable colour values despite differences in
illumination. A related problem is dealing with shadows which might be interpreted
as edges in the image where there are no real edges. These are still open research
problems.
2.2.12 Image Preprocessing
Given several of the issues discussed in this chapter, most computer vision
systems perform a preprocessing step on images prior to applying computer vision
algorithms. It is common practice to use a Gaussian blur or other smoothing filter to
reduce the effects of noise. Video signals frequently contain speckles which can be
reduced via filtering.
In addition, some compensation for illumination (brightness and contrast)
might be applied. These are fundamental image processing operations. See for
instance [Forsyth & Ponce 2003].
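As a concrete illustration of these preprocessing operations, the following NumPy sketch implements a separable Gaussian blur and a linear brightness/contrast adjustment (libraries such as OpenCV provide optimised equivalents; the parameter values are illustrative):

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Normalised 1-D Gaussian kernel of length 2*radius + 1."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-(x * x) / (2.0 * sigma * sigma))
    return k / k.sum()

def smooth(image, sigma=1.0):
    """Separable Gaussian blur of a 2-D greyscale image."""
    k = gaussian_kernel(sigma, radius=max(1, int(3 * sigma)))
    pad = len(k) // 2
    padded = np.pad(image.astype(float), pad, mode="reflect")
    # Convolve each row, then each column of the result.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def adjust(image, gain=1.0, bias=0.0):
    """Linear brightness/contrast compensation, clipped to the 8-bit range."""
    return np.clip(image.astype(float) * gain + bias, 0, 255)
```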
2.3 Navigation
[Murphy 2000] proposes four questions related to robot navigation:
1. Where am I going?
2. What’s the best way there?
3. Where have I been?
4. Where am I?
These questions relate to different areas of the overall problem of navigating in
an unknown environment. The first two are basically path planning – a task that
cannot be performed without a goal (or multiple goals) and some sort of map. In an
unknown environment, however, the robot does not have a map to begin with. The
third question clearly requires some form of memory, i.e. building a map and a past
history of positions, called a trajectory or a trail. The last question is probably the
most important and is referred to as localization.
When all of these questions are combined, they require both an exploration
algorithm and a Simultaneous Localization and Mapping (SLAM) algorithm
[Dissanayake et al. 2001]. (The questions above are in the order listed by Murphy;
the logical order depends on the application.) However, all of this is predicated on being able to sense the
environment accurately in order to determine the location of obstacles. This is the
task of the vision system.
2.3.1 Obstacle Detection and Avoidance
The most fundamental problem facing an autonomous robot using vision is
obstacle detection. Several researchers have made this point in the opening
paragraphs of their papers [Borenstein & Koren 1989, Camus et al. 1996, Lourakis &
Orphanoudakis 1998, Ulrich & Nourbakhsh 2000b]. However, despite decades of
research, there is no agreement within the Computer Vision community on how best
to achieve this task, especially as a precise definition of obstacle detection is
surprisingly difficult [Zhang et al. 1997].
In its simplest form, obstacle detection is the process of distinguishing an
obstacle from the floor (sometimes referred to as the ‘ground plane’ in the literature).
It is not necessary to understand what is seen in order to avoid obstacles – simply
distinguishing between the floor and all other objects (even moving humans) is
sufficient.
Many different methods of obstacle detection have been proposed, each with
their own advantages and disadvantages. Some are domain-specific and cannot be
generalised [Firby et al. 1995, Horswill 1994a].
Once reliable obstacle avoidance has been achieved, a higher-level objective
can be imposed on the robot: to perform a particular task or reach a specified goal.
2.3.2 Reactive Navigation
In the early days of robotics, the accepted practice was to use the Sense-Plan-
Act model for control of the robot. This involved sensing the environment,
constructing a model of the world that fitted the data or matching the data to an
existing model (or map), then planning what to do next according to some high-level
directive, and finally implementing the plan as an action.
Then [Brooks 1986] showed that simple reactive behaviours could be
combined to produce apparently intelligent behaviour. These came to be known as
‘emergent behaviours’ because they were not explicitly programmed, but emerged
from the combination of several simpler behaviours. Brooks challenged the
traditional wisdom of building a world model, saying that ‘the world is its own best
model’.
[Arkin 1998] developed this approach further and proposed robot architectures
based on combinations of behaviours. His book on behaviour-based robotics has
become a classic text in the field of robotics and it provides an excellent overview of
research in the field up to the late 1990s.
Navigation is independent of the sensor(s) used to obtain information about
obstacles. However, the focus in this research was on using vision.
2.3.3 Visual Navigation
Robots have been navigating using vision since the days of the Stanford Cart
[Moravec 1977]. However, this does not mean that the problem of navigation using
vision has been solved; far from it. Many robots today still use sensors such as sonar,
infra-red or laser range finders, and do not have cameras at all. Even those that do
have cameras often rely on them only to identify places or objects that they are
searching for, but use sonar or lasers for actual navigation.
Author after author has pointed out how complex a task vision is, and just as
importantly, how easily humans accomplish it: [Wandell 1995, Hoffman 1998,
Brooks 2002]. In their survey paper, [DeSouza & Kak 2002] also point out the lack
of progress in the field of visual navigation over the previous 30 years.
A wide range of image processing algorithms is available today as open source
code, such as the Intel Open Computer Vision Library [Intel 2005]. However,
processing steps such as edge detection, segmentation, line detection etc. do not
constitute vision on their own. Computer vision requires analysis of the scene and
high-level understanding. Complete scene interpretation is not always necessary to
achieve useful tasks [Dudek & Jenkin 2000]. In fact, [Horswill 1997] introduced the
idea of lightweight vision tasks.
2.3.4 Early Work in Visual Navigation
Some of the earliest work on visual obstacle avoidance was done by [Moravec
1977]. One legacy of his work was the Moravec Interest Operator. This was a way to
find points of interest in an image, where ‘interest’ was defined as an area that had
large changes in pixel values (using a simple sum of squared differences). These
tended to be corners. Features can then be used to track objects or for stereo
correlation.
Side Note:
Feature detection is still an active area of research. Various methods have been
developed over the years, such as the well-known Canny Edge Detector [Canny
1986], Harris Corner Detector [Harris & Stephens 1988], and the generalised Hough
Transform (for lines and curves) [Duda & Hart 1972].
The importance of features and establishing the correspondences between
features in multiple images was explained by Ullman [Ullman 1996] whose concepts
of high-level vision were built on underlying low-level features extracted from the
image.
In their classic paper, Shi and Tomasi [Shi & Tomasi 1994] identified a set of
‘good features to track’.
A more recent algorithm which has been widely applied is the Scale-Invariant
Feature Transform (SIFT) [Lowe 2004]. SIFT creates unique descriptors (128 bytes
long) for features of interest which are called keypoints. Sometimes thousands of
keypoints are obtained from a single image.
Extracted point features are usually used for tracking, not for identifying the
floor. Edges and segmentation (blob extraction) are more useful for finding floors.
Moravec developed the Stanford Cart, which later became the CMU Rover
[Moravec 1980, 1983]. The Cart moved at only 4 metres per hour. It ran outdoors
and it could become confused by changes in shadows between images because it
took so long to perform image processing. This was primarily a reflection of the
computer technology available at the time, which was a DEC KL-10 processor.
However, for its time, the Cart was a triumph of engineering.
2.3.5 The Polly Family of Vision-based Robots
In the late 1980s and early 1990s, Horswill [Horswill & Brooks 1988, Horswill
1993, Horswill 1994a] developed a robot, called Polly, which navigated mostly by
means of vision. It actually had ‘bump’ and sonar sensors and did not rely totally on
vision.
Polly operated in a restricted environment with a constant floor colour, except
for a particular boundary that was specifically programmed into the built-in map.
The robot had a monochrome camera, and used a resolution of only 64 x 48 pixels.
Although this might sound low, sometimes too much information is actually a
disadvantage, both in terms of processing power requirements and because
additional detail can tend to obscure the similarity of floor pixels.
There is evidence that humans operate at multiple resolutions precisely
because segmentation is quicker and more reliable on a smaller image [Wandell
1995]. In fact, one method of segmenting an image uses so-called Gaussian or
Laplacian Pyramids where the image resolution is successively reduced so that the
principal features stand out.
Horswill’s approach was as follows:
• Smooth the image.
• Look at a trapezium in the immediate foreground and determine the
average pixel value.
• Use this as ‘floor’ and then label every pixel, starting from the bottom
of the image and scanning up each column until there was a mismatch.
• The height of the columns indicated the distance to obstacles in what
Horswill referred to as a ‘Radial Depth Map’.
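The steps above can be sketched in code (a simplified reconstruction, not Horswill's implementation; the foreground patch location and tolerance are illustrative):

```python
import numpy as np

def radial_depth_map(image, tolerance=10):
    """Column-wise floor scan in the spirit of Polly: returns, for each
    column, the row of the first non-floor pixel found scanning upwards
    from the bottom of the image (0 if every pixel matched the floor)."""
    h, w = image.shape
    # Reference 'floor' value: mean of a patch in the immediate foreground.
    floor = image[h - h // 8:, w // 4: 3 * w // 4].mean()
    depth = np.zeros(w, dtype=int)
    for col in range(w):
        for row in range(h - 1, -1, -1):          # bottom to top
            if abs(float(image[row, col]) - floor) > tolerance:
                depth[col] = row
                break
    return depth
```

Larger row indices indicate obstacles nearer the robot, giving the radial depth map described above.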
Polly’s floor recognition process failed on any sort of texture like tile or carpet
patterns. It could not handle shiny floors or shadows. The robot would brake for
shafts of sunlight coming through doors. It also had a problem with obstacles of a
similar colour to the floor because it could not distinguish these ‘isoluminant edges’.
In addition, Horswill had to add a special condition for the carpet boundary on the
floor of the building where he worked.
Horswill went on to develop a second-generation robot called Frankie
[Horswill 1994b]. The constraint of a uniform colour for the floor was relaxed
somewhat and turned into a constraint on the frequency components of the floor
texture. An edge detector was also used to identify obstacles. Frankie could therefore
handle some small degree of texture, but there were still problems with shiny floors,
where the robot treated reflections as obstacles.
[Cheng & Zelinsky 1996] took the concept of the Polly vision system, but used
a very large grid size so that carpet texture would be averaged out. The robot was
given a goal, and so had competing behaviours: obstacle avoidance and driving
towards the goal. They noted that it is possible for the competition between
behaviours to result in stagnation or oscillation. To resolve this problem, it was
necessary to introduce some state information – in effect the robot needed a memory.
A subsequent paper [Cheng & Zelinsky 1998] explained how they added
Lighting Adaptation in an attempt to overcome variations in illumination. The
system used a Fujitsu vision processor to handle up to 100 template correlations at a
frame rate of 30 Hz, where the base template used for matching was updated using
the average intensity.
In another similar approach to Polly, [Howard & Kitchen 1997] developed
Robot J. Edgar which used simple edge detection to detect the first non-floor pixel in
each of several vertical image columns. The robot used a wide-angle lens and could
run at up to 30 cm/sec. The process was as follows:
• For each pixel column, apply a one dimensional edge filter starting
from the bottom of the image;
• Record the image location of the first edge; and
• Convert the image location into world coordinates using a lookup table.
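A minimal sketch of this column scan (with an illustrative threshold; the lookup-table conversion to world coordinates is omitted):

```python
import numpy as np

def first_edge_rows(image, threshold=30):
    """For each pixel column, scan upwards from the bottom of the image
    and return the row of the first strong vertical gradient (-1 if none)."""
    diffs = np.abs(np.diff(image.astype(float), axis=0))  # |img[r+1] - img[r]|
    h, w = image.shape
    rows = np.full(w, -1, dtype=int)
    for col in range(w):
        for r in range(h - 2, -1, -1):                    # bottom to top
            if diffs[r, col] > threshold:
                rows[col] = r
                break
    return rows
```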
The resulting map of boundary points exhibited error rates of 10% because of
noise. Consequently, the measurements from three frames were combined to produce
reliable obstacle maps.
However, this approach suffered from the same problems as Polly and Frankie,
and required camera calibration in order to construct the lookup table for
transformations back into real world coordinates.
[Lorigo et al. 1997] extended Horswill’s work and refined the method for
segmenting the floor pixels from obstacle pixels by using a colour image and a sum
of absolute differences between histograms of a reference area and the rest of the
pixels in the image.
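The histogram comparison can be sketched as follows (a simplified greyscale version of the idea; Lorigo et al. applied it per colour channel):

```python
import numpy as np

def histogram_sad(patch_a, patch_b, bins=16):
    """Sum of absolute differences between the normalised intensity
    histograms of two image patches."""
    ha, _ = np.histogram(patch_a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(patch_b, bins=bins, range=(0, 256))
    ha = ha / ha.sum()
    hb = hb / hb.sum()
    return float(np.abs(ha - hb).sum())

# Identical patches give 0; completely disjoint intensity ranges give 2.
```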
The robot, called Pebbles, was run both indoors and outdoors in the simulated
Mars environment at the Jet Propulsion Laboratories (JPL) in California. The system
achieved speeds of up to 30 cm/sec with 4 frames per second being processed.
The approach used three ‘vision modules’ – brightness gradients (edges), Red
and Green (from RGB) and Hue and Saturation (from HSV). An obstacle boundary
map was constructed by each of these modules, and the results fused together using
smoothing and median values. However, the image was still only 64 x 64 pixels, and
a sliding window of 20 x 10 pixels was used in computing the obstacle boundaries.
One of the limitations of this system was that it always assumed that the
reference area immediately in front of the robot was free of obstacles, and therefore
could be used as a template for the floor. Also, the robot’s only goal was to move
safely within cluttered environments – it did not do any mapping or localisation.
[Martin 2001] extended the original Polly system using Genetic Programming.
The objective was to show the use of Evolutionary Computing, rather than to
improve the performance of Polly.
Martin used a very limited set of examples – only 20 test runs. The robot had a
black and white camera and the available operations included the Sobel gradient
magnitude, Median filters and the Moravec Interest Operator. According to [Martin
2002], the best-of-run individuals had essentially the structure of Horswill’s and
Lorigo’s systems, although the window size was reduced to 8 x 8 and the process
essentially used only the Sobel operator. The resulting algorithm ran at about 10 Hz
on a 333 MHz Pentium II at a resolution of 320 x 240.
In summary, these Polly-derived systems failed if the floor was textured. In
contrast, [Lourakis & Orphanoudakis 1998] developed a system that used texture to
advantage to extract the floor from successive images. The system was based on a
ground reference, and detected obstacles that protruded from the ground using a
registration process.
2.3.6 The ‘Horswill Constraints’
In his work, Horswill [Horswill 1994a, 1994b] defined a series of constraints
on the environment in which Polly operated. These have been adopted by numerous
researchers who have continued working in the area.
In honour of Horswill’s work, the following constraints are referred to as the
‘Horswill Constraints’. Note that they have been reformulated by the author.
Ground-Plane Constraint – All obstacles rest on the ground plane and their
projections are contained within the interiors of their regions of contact with the
ground plane, i.e. no overhangs.
Background Texture Constraint – The environment is uniformly illuminated
and the ground plane has no significant texture. (Horswill expressed this in terms of
a minimum spatial frequency that would allow the texture to be removed by
smoothing the image.)
These constraints can be rephrased as follows:
1. The ground is flat so the floor is a plane, which simplifies the
geometry.
2. All obstacles sit on the floor and there is an edge where an obstacle
makes contact with the floor.
3. Obstacles differ in appearance from the ground in a way that is visually
easy to detect.
QUT Faculty of Science and Technology 33
Mapping of Indoor Environments by Robots using Low-cost Vision Sensors Trevor Taylor
4. There are no overhanging obstacles (which are likely to decapitate the
robot).
These constraints have been applied by independent researchers over a period
of many years: [Lorigo et al. 1997, Howard & Kitchen 1997, Thau 1997, Ulrich &
Nourbakhsh 2000a, 2000b, Martin 2001].
Although these constraints represent significant limitations on the environment
the robot can operate in, they are still applied today because many indoor
environments meet these constraints. There are, however, some obvious exceptions:
steps and ramps; floors with coverings that have a substantial texture, e.g. multi-
coloured tiles; or carpet with significant patterns.
Many indoor environments have skirting boards, or kick boards, along the
walls. These are generally a different colour from the floor and the wall. They
provide a clear visual boundary between the floor and the wall. Even if there are no
skirting boards, black linoleum strips are often stuck along the walls to make the
edges of the corridors stand out. This is common practice in hospitals, for instance.
2.3.7 Floor Detection
The basic principle behind floor detection is to segment the image so that the
floor appears as a single segment. Segmentation depends to a large extent on colour
constancy, which is affected by illumination. Both of these are active fields of
research.
One of the driving forces behind research into colour object recognition has
been robot soccer. In this scenario there is only a small range of distinct colours, but
the illumination can vary considerably between the test environment and the
competition environment. Many attempts have been made to develop colour
classification systems to handle this problem. These systems have relied on
techniques such as transformations to specialised colour spaces, training of a
classifier, and fuzzy logic [Reyes et al. 2006]. The key point to note here is that the
research is ongoing and this is far from being a solved problem, even though it is
highly constrained.
2.3.8 Image Segmentation
One approach to image segmentation is to use clustering whereby pixels that
are similar (according to some measure of similarity) are grouped together. A
popular clustering algorithm is k-means [Duda et al. 2001].
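As an illustration of the clustering idea only, the following sketch applies a minimal one-dimensional k-means to greyscale pixel values. The code is hypothetical (not taken from any of the systems cited) and the pixel values are invented:

```python
import random

def kmeans_1d(values, k, iterations=20, seed=0):
    """Minimal 1-D k-means: cluster scalar pixel values around k centres."""
    rng = random.Random(seed)
    centres = rng.sample(values, k)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centre.
        clusters = [[] for _ in range(k)]
        for v in values:
            i = min(range(k), key=lambda c: abs(v - centres[c]))
            clusters[i].append(v)
        # Update step: move each centre to the mean of its cluster
        # (keeping the old centre if a cluster happens to be empty).
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

# Two obvious intensity populations: dark floor versus bright obstacle.
pixels = [10, 12, 11, 9, 200, 198, 202, 199]
centres = kmeans_1d(pixels, k=2)
```

In image segmentation the same idea is applied to colour vectors rather than scalars, with the distance measure chosen to suit the colour space.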
[Ulrich & Nourbakhsh 2000b] developed a system that was intended to be used
in a variety of environments. The system used a resolution of 320 x 240 in colour
and ran at up to 10 Hz, which was deemed to be good enough for real-time use.
The floor detection process was as follows:
• Filter the image using a 5 x 5 Gaussian;
• Transform into HSI (Hue Saturation Intensity) colour space;
• Construct histograms of the Reference Area (Hue and Intensity) and
smooth them using a low-pass filter; and
• Compare pixels to the histograms to classify them as obstacle pixels if
either the hue histogram bin value or the intensity histogram bin value
was below a threshold. Otherwise they were floor pixels.
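The histogram test above can be sketched as follows. This is a simplified, hypothetical implementation using hue values only, with invented bin sizes and thresholds; it is not the authors' code:

```python
def build_histogram(values, bins=16, vmax=256):
    """Histogram of reference-area values (e.g. Hue), one count per bin."""
    hist = [0] * bins
    for v in values:
        hist[v * bins // vmax] += 1
    return hist

def classify(pixel, hist, bins=16, vmax=256, threshold=2):
    """Label a pixel 'floor' if its histogram bin is well populated,
    otherwise 'obstacle' (cf. the bin-value threshold test above)."""
    return 'floor' if hist[pixel * bins // vmax] >= threshold else 'obstacle'

# Hue values sampled from the reference area in front of the robot.
reference_hues = [30, 32, 31, 29, 33, 30, 31]
hist = build_histogram(reference_hues)

# A hue close to the reference is floor; a very different hue is not.
labels = [classify(h, hist) for h in (31, 200)]
```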
The primary objective was to relax the constraint that the reference area must
be free of obstacles. The reference area was only assumed to be the ground if the
robot had just travelled through it. A ‘Reference Queue’ was kept to ‘remember’ the
past reference areas.
Some of the innovations by Ulrich and Nourbakhsh were: a trapezoidal area
was taken as the reference to account for perspective effects (instead of a rectangle);
and ‘invalid’ values of Hue and Saturation (where Intensity was low) were ignored.
An additional feature of their work was an ‘assistive’ mode where the robot
could be manually driven for a period of time so that it could learn the appearance of
the ground. (This was referred to as ‘appearance-based navigation’ in the paper, but
this is not the common usage of the term in the vision community.)
Detecting the floor has also been a common topic in the Robot Soccer
literature. [Bruce et al. 2000] presented a scheme using the YUV colour space that
allowed them to track several hundred regions of 32 distinct colours at a resolution
of 640 x 480 at a frame rate of 30 Hz using a 700 MHz PC. The obvious
disadvantage of this method was that it used 32 predefined colours. [Das et al. 2001]
claimed that YUV works better than RGB for segmentation for robot soccer.
[Lenser & Veloso 2003] trained a robot to recognise the difference between the
floor and other obstacles. Their ‘visual sonar’ approach was therefore specific to the
particular environment where the robot was trained.
2.3.9 Using Sensor Fusion
Quite recently, the winning team in the DARPA Grand Challenge in 2005 used
a mixture of Gaussians to extract the road surface from camera images [Dahlkamp et
al. 2006]. However, in this case they had the benefit of a laser range finder which
allowed them to obtain training samples for recognizing the colour of the roadway
which was often dirt, but actually varied across the challenge course.
By identifying the clear area in front of their car as ‘road’, the Stanford team
could use this portion of the image to obtain a set of key pixel values using k-means
clustering. Then, by applying a mixture of Gaussians around these key pixel values,
they marked the whole of the image as either road or non-road pixels. For an
entertaining explanation, see the Google video [Thrun 2006].
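The train-on-the-clear-area idea can be sketched as follows. For brevity this uses a single Gaussian on greyscale values rather than the mixture of Gaussians on colour that the Stanford team used, and all numbers are invented:

```python
import math

def fit_gaussian(samples):
    """Mean and variance of training pixels from the 'known road' area."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, max(var, 1e-6)   # floor the variance to avoid div-by-zero

def is_road(pixel, mean, var, max_sigmas=3.0):
    """Accept a pixel as road if it lies within a few standard deviations."""
    return abs(pixel - mean) <= max_sigmas * math.sqrt(var)

# Greyscale training pixels from the laser-verified patch ahead of the car.
road_samples = [120, 125, 118, 122, 121]
mean, var = fit_gaussian(road_samples)

# A similar pixel is accepted as road; a very different one is rejected.
results = [is_road(p, mean, var) for p in (123, 60)]
```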
In this case, the vision processing was used to extend knowledge of the
environment beyond the limitations of the laser range finder, thereby enabling the
vehicle to travel faster. The important point to note, however, is that the process still
used a laser range finder as the primary sensor. This was not, therefore, an example
of vision-only processing.
2.4 Mapping
Map building by autonomous robots has been done in the past in a variety of
different ways and with a wide range of different sensors. The combination of
algorithms described here includes some new methods specifically for mapping
using vision.
One obvious function of a map is to enable the robot to remember the terrain
that it has driven over. This provides positive proof of free space. The robot might
use this prior information (in its memory) to predict what a new view should look
like for comparison with actual sensor information. It can adjust its pose in the
previous ‘obstacle map’ according to its own motion and then compare what it sees
with what is already in the map.
Before mapping can be done using vision, it is necessary to transform the
camera images back to representations of the real world. This is done by segmenting
out the floor and then using Inverse Perspective Mapping on the floor boundary
points, as explained below in section 2.4.6.
To draw maps correctly, the robot must always know precisely where it is. In
simulation this is not a problem because the robot’s motions are always perfect.
However, in the real world the robot’s pose can only be estimated because there are
random errors in motions. Determining the robot’s pose is a localization problem. As
should already be apparent, in practice it is not possible to completely separate
localization from mapping.
2.4.1 Types of Maps
For a comprehensive overview of mapping algorithms, see [Thrun 2002]. The
topic is also covered in several robotics textbooks, but Thrun is acknowledged as one
of the foremost experts in the world in this area.
There are two fundamentally different types of maps: Topological and Metric.
A third approach is also possible by combining topological and metric maps to
create a hybrid [Thrun 1998, Ulrich & Nourbakhsh 2000b, Tomatis et al. 2001,
2002]. These hybrid maps use a high-level topological map with local metric maps.
In effect, the map consists of a series of smaller maps linked together.
2.4.2 Topological Maps
Topological maps are directed graphs, i.e. sets of nodes containing
information about significant objects or places, connected by edges. Figure 7 shows a
sample topological map. The nodes in the map have been labelled in the order that
the robot visited them.
Figure 7 – Example of a topological map
Topological maps can have information attached to the ‘edges’ that join the
nodes together, such as distance estimates and the directions (headings) between
nodes. For example, [Egido et al. 2002] created a system for mapping corridors.
Corridors have various types of features, such as corners and T-junctions, which can
be used to identify nodes.
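A topological map of this kind is simply an annotated directed graph. A minimal sketch, with hypothetical node names and invented distance and heading annotations of the sort just described:

```python
# Each directed edge carries the annotations mentioned above: an
# estimated distance (metres) and a heading (degrees) between nodes.
topo_map = {
    'A': {'B': {'distance': 4.0, 'heading': 90.0}},
    'B': {'C': {'distance': 2.5, 'heading': 0.0}},
    'C': {},
}

def path_length(graph, path):
    """Sum the distance annotations along a sequence of visited nodes."""
    return sum(graph[a][b]['distance'] for a, b in zip(path, path[1:]))

total = path_length(topo_map, ['A', 'B', 'C'])
```

Note that nothing in this structure records where the nodes are in space, only how they connect, which is precisely why topological maps can be so compact.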
Many of the so-called ‘appearance based’ exploration and mapping strategies
have relied on recognising previously visited places using photos attached to the map
nodes.
Strictly speaking, a topological map does not have to record exact locations for
the nodes. Also, because they do not record empty space, topological maps tend to
be more compact than metric maps.
Note that the mental maps that humans build are basically topological. For
instance, when giving directions people often use phrases like ‘turn right at the post
office’. Humans do not record exact distances because they do not have the
appropriate sensors to do this. Instead, estimated distances are relative, such as ‘next
door’ (near) versus ‘a few blocks away’ (far). Everyone, for example, has heard of
the so-called ‘country mile’ that is generally much longer than a mile.
2.4.3 Metric Maps
Metric maps are exact representations of the world. These are precisely what
most people think of in relation to the word ‘map’ – a street directory or floor plan.
These maps can be drawn to scale and they have a specified orientation, or ‘north
point’.
Figure 8 – Example of a metric map
Figure 8 shows a metric map in which the significant junctions have been
labelled with the letters A to J. It is quite apparent how the robot traversed the map
simply by following the labels in order. Compare this with Figure 7 which is a
topological representation of the same map. Clearly, Figure 8 is much easier for a
human to understand and interpret. The physical arrangement of the topological
nodes on the page in Figure 7 does not correspond directly to that shown in Figure 8
because there is no requirement for a topological map to preserve spatial
relationships.
Metric maps can be either feature-based or tessellated. It is not clear from
Figure 8 what type of map it is. Because it consists only of line segments, it could be
either feature-based or tessellated.
Feature-based maps record the positions of significant ‘features’, like the
corners and intersections in Figure 8. Therefore they are generally maintained in
program memory as some form of tree structure. However, it is quite easy to draw a
2D representation of these maps because exact spatial information is available. The
resulting map would have a series of marked points, like drawing pins pushed into a
sheet of paper. In the example above these are the letters A to J. However, the walls
that are visible in this map would require additional information to describe them.
Features do not have to be limited to points. For instance, in Figure 8 it would
be possible to dispense with the labels A to J and instead record each of the line
segments that make up the walls. A wall, however, is not a point feature, and the
robot cannot observe all of it at once; hence walls have to be inferred from a
succession of observations (sensor readings).
In deriving their SPmap approach, [Castellanos et al. 1999] presented explicit
methods for the probabilistic modelling of various types of features, including line
segments. This allowed them to incorporate features other than simple points into a
Kalman Filter-based SLAM algorithm.
One of the difficulties with feature-based maps is defining what constitutes a
feature. If the criteria are too lax, there will be an inordinate number of features
recorded and this will quickly become a problem with memory capacity. On the
other hand, if the criteria for features are too stringent, then the features will be
sparse and they might not adequately describe the environment.
In the case of ‘appearance based’ mapping using images, it is quite common
for several thousand features to be selected from each image. These are usually
corners or other areas of the image that are highly distinctive. Probably the most
well-known method for selecting these features is the SIFT (Scale Invariant Feature
Transform) algorithm [Lowe 2004].
An alternative to using features is tessellation (dividing a plane into a set of
non-overlapping figures according to specific rules). Various types of tessellation are
possible: regular lattices or grids; Voronoi diagrams which use polygons; or
Delaunay triangulation which uses only triangles. Even for a regular grid, Quadtrees
can be used to create cells of differing sizes in order to reduce the total number of
cells. However, the simplest form of tessellation, and one that is intuitive and easy to
understand, is a square grid.
When the size of the map can be predicted in advance, as it almost always can
be for indoor environments, the map can be broken up into a set of square cells of
equal size. Some researchers have proposed non-square tessellations, but from here
on in this thesis it is assumed that a square grid is used.
Overlaying a grid onto Figure 8 might result in a map like Figure 9. The solid
black lines (which are a little hard to see) are the original walls and the map grid is
shown in light grey. Because the map grid resolution is relatively coarse, the dark
grey walls in this map are much thicker than the real walls, which results in thinner
corridors. This could be a problem for a large robot that relied on the map to decide
if an area was passable or not.
Figure 9 – Metric map based on a square grid
Also, notice that the grid walls appear to be offset slightly from the real walls
because the grid does not fit the walls exactly and the cells do not line up with the
real walls. In fact, in maps generated by a robot, it is very common for the walls not
to align with the grid. Small errors in the robot’s position, or arithmetic round-off
errors, can result in wall cells spilling over from one cell to the adjacent one. This
results in walls that have ‘jaggies’ even though they should be straight.
There is a large amount of wasted space in the map above. The robot cannot
reach the area outside the building walls. These cells can therefore never be
populated and their contents will always remain ‘unknown’. However, this approach
is simple to implement, easy to understand, and has been in widespread use in the
research community for many years.
2.4.4 Map Building
Building a map is a complex task, especially when the spatial information is
inaccurate. Several approaches have been taken, most of them probabilistic in nature,
such as Maximum Likelihood [Simmons et al. 2000] and Bayesian Filtering
[Dellaert et al. 1999].
Hidden Markov Models (HMMs) have been used [Filliat & Meyer 2000, 2001]
in an attempt to overcome the inherent uncertainty in the robot’s location.
Probabilistic methods using Markov processes (so-called Markov Localization) have
also been widely used and are well documented in a long succession of papers by a
key group of researchers including Thrun, Fox, and Burgard. For examples, see
[Thrun & Bucken 1996, Thrun 1998, Thrun et al. 2000] and in particular the
textbook Probabilistic Robotics [Thrun et al. 2005].
Note that a first-order Markov assumption is almost always made when
building a map. In other words, it is assumed that new information to be added to a
map does not depend on the previous motions of the robot (only the immediately
prior pose), or on the information in cells previously updated in the map. Put
differently, this means that past and future data are considered to be independent in a
statistical sense.
In general this is not true, especially in indoor environments. For instance, if
the robot identifies part of a wall and can determine which direction the wall is
heading in, then it is a fair assumption that it will continue in the same direction. The
robot can actually predict future map information based on prior information. This
prediction uses domain knowledge – knowing that humans build walls that tend to be
long (relative to the size of the robot). Of course the wall must eventually end, but
over much of the length of the wall there is a higher likelihood of finding another
piece of wall than of finding empty space as the robot moves along the wall.
[Pérez Lorenzo et al. 2004] explicitly used walls. A Hough transform of the
map was used to obtain a set of straight lines. If they were sufficiently long, these
lines were likely to correspond to walls. When adding new information to the map,
the data was checked to see if it contained extensions to existing walls, and this fact
was used to help localize the robot.
2.4.5 Instantaneous Maps
Assuming that the floor can be extracted from the camera image by some
algorithm (or combination of algorithms), the boundary of the resulting floor
segment in the image can be used to obtain an approximate top-down map of the
floor in front of the robot. The higher a scanline is in the image, the further the
corresponding floor point is from the camera. If an accurate metric map is not required, then the pixel
height might be sufficient information.
Badal [Badal et al. 1994] introduced the Instantaneous Obstacle Map (IOM)
which mapped the detected obstacles to a Polar Occupancy Grid and used it to
determine a steering direction. The IOM was not intended for building maps, but for
reactive navigation. Therefore the resolution was fairly coarse.
Horswill used a ‘Radial Depth Map’, but it was neither radial nor a true depth
map. For each pixel column in the image, Horswill used the height of the scanline
(row) where an obstacle was first encountered as a measure of distance but he did
not attempt to obtain real-world coordinates.
The relationship between a scanline in the image and the distance from the
robot of the corresponding point on the floor is highly non-linear due to perspective.
This is one of the challenges of visual mapping.
2.4.6 Inverse Perspective Mapping (IPM)
Knowledge of where obstacles are in the real world is essential in order to
navigate successfully. However, the required accuracy and resolution of this
information will depend on several factors, such as the speed of the robot and its
size.
When images are created inside a camera, the 3D information from the real
world undergoes a perspective transformation that converts it to a 2D representation.
Unfortunately, this transformation cannot be reversed using the image alone –
additional information is required.
Consider a ray of light travelling from a point in the real world to the camera.
Every point (3D coordinate) that it passes through along the way will be mapped to
the same pixel in the resulting image. Therefore, an inverse perspective
transformation from an image pixel back to a 3D world coordinate can only be
correct up to some arbitrary scale factor.
With the ground-plane constraint, however, it is possible to use one additional
piece of information for mapping out the location of obstacles: they rest on the floor.
This ignores the possibility of an overhang, such as the top of a desk which might
stick out beyond the legs. It also assumes that the intersection of the obstacle with
the floor has been correctly identified.
The floor is a plane, and this imposes a constraint on the 3D coordinates of
points that lie on the floor, which is sufficient to allow Inverse Perspective Mapping
(IPM) to produce unique coordinates for each floor pixel in the image. A simple
lookup table can be calculated for the entire image given the necessary geometric
information. This approach has been used in conjunction with optic flow [Mallot et
al. 1991], and for stereo vision [Bertozzi et al. 1998].
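Under the ground-plane constraint the lookup can be written down directly. The sketch below assumes an idealised pinhole camera at height h above the floor, tilted down by a known angle, with focal length f in pixels and principal point (cx, cy); the function name and all numeric values are illustrative, not taken from the systems cited:

```python
import math

def inverse_perspective(u, v, h, tilt, f, cx, cy):
    """Map image pixel (u, v) to ground-plane coordinates (X, Y) in metres.

    Camera at height h above a flat floor, optical axis tilted down by
    'tilt' radians.  Valid only for rays that actually hit the floor."""
    dx = (u - cx) / f                  # normalised ray direction, camera frame
    dy = (v - cy) / f
    denom = math.sin(tilt) + dy * math.cos(tilt)
    if denom <= 0:
        raise ValueError('ray is at or above the horizon')
    s = h / denom                      # scale factor to intersect the floor
    X = s * dx                         # lateral offset
    Y = s * (math.cos(tilt) - dy * math.sin(tilt))  # forward distance
    return X, Y

# The pixel on the optical axis maps to the floor point straight ahead,
# at a forward distance of h / tan(tilt).
X, Y = inverse_perspective(160, 120, h=0.3, tilt=math.radians(30),
                           f=500.0, cx=160.0, cy=120.0)
```

In practice the (X, Y) pair would be precomputed for every floor pixel, giving the lookup table mentioned above.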
Side Note:
Strictly speaking this is not correct. If the camera can see above the horizon,
then the pixels in the image that are above the horizon cannot be on the ground and
therefore there is no unique solution for their real-world locations. However, the
camera on a robot is often tilted downwards. In any case, when operating in indoor
environments the robot cannot see to the horizon because the walls get in the way.
The classical method for turning an image back into a three dimensional map
involves an analysis based on homogeneous coordinates and the perspective
transformation matrix. The transformation matrix can be determined via analysis
using an idealised pinhole camera and knowledge of the focal length of the camera
lens. In general, the focal length might be unknown. It therefore has to be obtained
via camera calibration as noted in section 2.3.10. The relevant transformations are
discussed in detail in textbooks such as [Faugeras 1993] and [Hartley and Zisserman
2003].
In the case of a digital camera, the pixel grid can be used to advantage.
[Bertozzi et al. 1998] derived formulae for use with a stereo system, but their result
was fairly complex.
Tilting the camera makes sense when it has a limited Field of View (FOV).
However, this seems to be largely ignored in the literature, with most cameras being
mounted horizontally with their principal axis parallel to the ground. In this simple
case, the ‘horizon’ is the centre scanline of the image. Anything above the horizon
cannot be part of the floor, which in a sense ‘wastes’ half of the image.
[Cheng & Zelinsky 1996] presented an analysis for a monocular camera where
the camera was tilted to the extent that the entire FOV of the camera intersected the
floor.
2.4.7 Occupancy Grids
Occupancy grids were first developed for use with sonar scans [Moravec &
Elfes 1985, Elfes 1989]. The basic idea was to break the world up into a set of grid
cells of a fixed size and assign a probability to each cell as to whether or not it was
occupied. Viewing these cells as a 2D image resulted in what would intuitively be
regarded as a top-down map. By using a single byte for the probability value, a
greyscale image could be used as a map where each pixel represents a grid cell.
This approach to building a metric map using sonar information was described
well in [Lee 1996]. The method has been used for many years for ranging devices
where the measurements are noisy or where there is significant uncertainty (as in the
case of sonar which is not highly focused).
More recently, Laser Range Finders (LRFs) have been used to obtain range
information. These have an enormous range compared to sonar, with very high
accuracy. For instance, some LRF models have a range of 80 metres with an error of
no more than 2cm across the entire range. However, compared to cheap web
cameras, LRFs are very expensive devices.
Figure 10 shows a typical occupancy grid created using a simulated robot.
According to the convention for occupancy grids, white indicates free space and
black cells are obstacles. Shades of grey represent different certainties about whether
a cell is occupied or not. The map is initially filled with grey (which is still visible
around the outside), which means no prior knowledge of the cell state, or a 50/50
chance of being occupied.
Figure 10 – Occupancy grid created by a simulated robot
In a strictly numerical sense, the map is really a Free Space grid, not an
Occupancy grid. Cells in the map of Figure 10 were represented using an 8-bit value
with an obstacle having a value of zero and free space a value of 255. This is why
the obstacles appear black and the free space is white.
Note that in reality the initial probability of a cell being occupied should not be
50%. The actual probability depends on how cluttered the environment is. If a map
was available to begin with, the average ‘density’ of obstacles could be calculated. In
many environments it is far less than 50%. For example, in the map above only
about 12% of the space is occupied by obstacles. This suggests that the map should
initially be filled with some value lower than 50%. However, it was left at 50%
which is actually more conservative than the ‘sparse’ value of 12%.
In the early papers on occupancy grids, the cells in the map contained what
was called a ‘certainty factor’ which related to the probability of a cell being
occupied and ranged from -1 to +1. (The original paper did not give a name to this
type of map and referred to it simply as a sonar map). An explicit algorithm was
given for combining successive sonar readings. Later the approach was put on a
more sound footing by introducing Bayes Rule as a means of updating cell values.
Bayes Rule relates conditional probabilities. In the case of mapping, the
objective is to calculate the probability that a cell is occupied based on a sensor
reading. Assume that, by some means, an estimate is already available of the
probability that a cell, m, is occupied. This is the prior probability (sometimes
simply called the prior), denoted as p(m).
When a new sensor reading, z, arrives, this new information must be
incorporated into the map. This new posterior probability (also just called the
posterior) can be calculated as follows:
p(m|z) = p(z|m) p(m) / p(z)                    (5)
The value p(m|z) is sometimes read as the ‘conditional probability of m given
z’, or in other words, the updated estimate of the map occupancy conditioned on the
new sensor reading. On the right-hand side of equation 5, p(z|m) is the probability of
getting a particular sensor reading given the actual occupancy associated with the
cell. This is usually called the sensor model. In practice, occupancy grid
implementations often work directly with the inverse sensor model, p(m|z): a
probability distribution that allows translation from sensor readings to occupancy
probabilities, usually in two dimensions. (See section 2.4.8 Ideal Sensor Model
below for more information.)
Notice that the denominator, p(z), does not depend on m and therefore is
effectively just a normalizing constant that will be the same for all values of m in the
posterior.
The derivation of occupancy grid mapping using Bayes Rule can be found in a
variety of papers and textbooks [Murphy 2000, Thrun 2003, Thrun et al. 2005].
Therefore the derivation is not included here. Details of the particular calculations
used in this thesis can be found in Chapter 3 – Design and Experimentation.
[Howard & Kitchen 1997] used the Log-likelihood to represent the probability
of a cell being occupied. This had a computational advantage because combining
new sensor information with the existing map required only additions, not
multiplications. Given the processing power available at the time the paper was
written, this was an important issue.
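The log-likelihood (log-odds) form can be sketched as follows; the measurement probabilities are invented for illustration:

```python
import math

def log_odds(p):
    """Convert a probability to log-odds form."""
    return math.log(p / (1.0 - p))

def to_probability(l):
    """Convert log-odds back to a probability."""
    return 1.0 - 1.0 / (1.0 + math.exp(l))

# A cell starts at l = 0 (p = 0.5, unknown).  Each sensor reading that
# suggests 'occupied' with probability p simply adds log_odds(p); no
# multiplications are needed, which was the computational advantage.
l = 0.0
for p_measurement in (0.7, 0.7, 0.7):
    l += log_odds(p_measurement)

p_occupied = to_probability(l)
```

After three agreeing readings the cell's occupancy probability has risen well above any single reading, exactly as repeated Bayes updates would give.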
Further refinements to occupancy grids have been proposed since the original
development, [Murray & Little 2000], and they have been called by other names
such as Certainty Grids or Evidence Grids (with subtle differences), but the principle
remains the same.
[Choi et al. 1999] used occupancy grids, but then applied morphological
open/close operations to the grids in order to clean them up. They pointed out that
odometry was not a reliable method for localization because of wheel slip, a theme
that has occurred repeatedly in the literature, for example [Gaussier et al. 2000].
In an interesting twist, [Hähnel 2004] showed that using a counting approach,
rather than Bayes Rule, produced results that were qualitatively the same but much
easier to calculate. The idea of the ‘counting model’ was to increment hit and miss
counters for a cell each time that it occurred in a range scan. If there were more hits
than misses, then the cell was probably occupied.
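Hähnel's counting model reduces to two counters per cell. A sketch of the idea, with hypothetical cell coordinates:

```python
from collections import defaultdict

hits = defaultdict(int)     # scans that ended (hit an obstacle) in this cell
misses = defaultdict(int)   # scans that passed through this cell

def record_scan(cells_traversed, end_cell):
    """One range scan: the beam passed through some cells, then hit one."""
    for c in cells_traversed:
        misses[c] += 1
    hits[end_cell] += 1

def occupied(cell):
    """A cell is deemed occupied if it was hit more often than missed."""
    return hits[cell] > misses[cell]

record_scan([(0, 0), (0, 1)], (0, 2))
record_scan([(0, 0), (0, 1)], (0, 2))
record_scan([(0, 0)], (0, 1))   # a later scan stops short, hitting (0, 1)
```

Cell (0, 2) was hit twice and never missed, so it is reported occupied; cell (0, 1) was missed twice and hit only once, so it is not.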
Taking this a step further, Hähnel only recorded the end-points of laser range
scans in order to avoid ray-casting to draw the laser information into the map. In this
case, the map effectively consisted just of obstacles – the free space and unknown
space were not represented explicitly. The robot did not need information about the
unknown space for exploration because Hähnel drove his robot manually.
Furthermore, Hähnel suggested that some amount of blurring of the map was
desirable for comparing maps because it resulted in a fitness measure that was
smoother.
2.4.8 Ideal Sensor Model
A key concept from the original work on occupancy grids was the sensor
model. Because sonar senses obstacles inside a cone, or beam, extending outward
from the sensor, it is not possible to determine exactly where an obstacle is located.
Consequently, a probability of occupancy is assigned to each of the cells that fall
inside the cone.
The sensor model for vision is quite different to sonar because a vision sensor
sees along individual light rays and the ‘beam’ is very narrow. However, vision has
other problems, the most important ones being non-linearity of the measured range
due to perspective and quantization due to the pixels in the camera.
It might be expected that the model used for Laser Range Finders would be
similar to vision because lasers also operate along light beams, but this is not the
case.
The sensor model for an ideal range sensor is shown in Figure 11.
Figure 11 – Ideal sensor model
This model says that for a given range measurement there is an obstacle
located at the specified distance (with absolute certainty); all the space between the
sensor and the obstacle is empty; and occupancy of the space beyond the obstacle is
unknown. The model is easy to implement, but unfortunately, it is too simplistic for
vision as will become apparent later in this thesis. For a detailed derivation of a
sensor model for a beam type range finder, see page 153 in [Thrun et al. 2005].
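For illustration, the ideal model can be written into an occupancy grid in a few lines. The grid encoding (-1 unknown, 0 free, 1 occupied) and the cell-coordinate interface below are assumptions made for this sketch, not the representation used in this thesis:

```python
import numpy as np

def apply_ideal_model(grid, sensor, obstacle):
    # Cells hold -1 (unknown), 0 (free) or 1 (occupied).
    # sensor and obstacle are (row, col) cell coordinates.
    r0, c0 = sensor
    r1, c1 = obstacle
    n = max(abs(r1 - r0), abs(c1 - c0))
    for i in range(n):
        # Every cell between the sensor and the obstacle is free.
        r = r0 + round(i * (r1 - r0) / n)
        c = c0 + round(i * (c1 - c0) / n)
        grid[r, c] = 0
    grid[r1, c1] = 1   # the obstacle itself, with absolute certainty
    # Cells beyond the obstacle are left untouched (unknown).

grid = np.full((10, 10), -1)
apply_ideal_model(grid, (5, 5), (5, 9))
```

The loop is the ray-casting step mentioned above: it is what the end-point-only approaches avoid by writing just `grid[r1, c1]`.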
On the other hand, the popularity of laser range finders in recent years has led
to the use of the ideal model because lasers are highly accurate over their entire
range. It is common practice to use ray-casting to populate a map using laser range
finder data, such as in the CARMEN package [CARMEN 2006]. In fact, some
researchers now use only the end-point of each laser ‘hit’ and do not bother to draw
free space into the map. (See for example GMapping from [OpenSLAM 2007].)
Mapping of Indoor Environments by Robots using Low-cost Vision Sensors Trevor Taylor
2.5 Probabilistic Filters
Most modern SLAM algorithms rely on the use of probabilistic filters. The
most commonly used forms of filters are Kalman Filters and Particle Filters. These
are discussed below.
2.5.1 Kalman Filters
Kalman Filters are discussed here for completeness. For a good explanation of
Kalman Filtering see [Choset et al. 2005]. The description in [Thrun et al. 2005]
follows a different approach with quite different notation. [Andrade-Cetto &
Sanfeliu 2006] provide extensive discussion of the stability of the Kalman Filter.
It is only in the last decade that the stability of Kalman Filter-based SLAM has
been proven [Newman 1999, Dissanayake et al. 2001]. However, it must be noted
that stability is only proven in the limit as the number of observations approaches
infinity.
A Kalman Filter is basically a recursive state estimator [Kalman 1960]. The
‘state’ consists of a set of ‘features’ observed in the environment plus the robot’s
pose. Estimating the location of the features effectively builds a map.
The reason for the early popularity of the Kalman Filter is that it is analytically
tractable and the update equations can be expressed in closed form. Being a recursive
filter, it is also computationally efficient.
2.5.2 The Kalman Filter Algorithm
The following is a brief definition of the Kalman Filter. The derivation can be
found in Probabilistic Robotics starting on page 40. The notation, shown below,
differs slightly from that of the textbook:
At State Transition matrix from prior state at time t
Bt State Transition matrix from control input at time t
Ct State to Measurement Transformation matrix at time t
Kt Kalman Gain at time t
Qt Measurement noise covariance at time t
Rt System noise covariance at time t
St Estimated covariance of the state at time t
ut Control input at time t
xt Current state at time t (location of features plus the robot pose)
zt Sensor measurements at time t
δt Measurement noise at time t
εt System noise at time t
µt Estimate of the state at time t
The system is assumed to be described by equation 6:
xt = At xt-1 + Bt ut + εt (6)
This equation says that the new state is a linear combination of the previous
state plus the control input with additive zero-mean Gaussian noise, εt, with a
covariance of Rt. There are three assumptions here: linear system dynamics; a
first-order Markov process; and noise that is both zero-mean and Gaussian.
The system state can only be observed indirectly through measurements, zt,
which are described by the following equation:
zt = Ct xt + δt (7)
Equation 7 states that the measurements are linearly related to the system state
plus some measurement noise, δt, that is zero-mean and has a covariance of Qt.
It is further assumed that the initial state is normally distributed with mean µ0
and covariance S0.
The Kalman Filter algorithm from page 42 of Probabilistic Robotics [Thrun et
al. 2005] can be stated as shown below. Note that the prime (') indicates temporary
variables.
Algorithm 1 – Kalman Filter
Kalman_Filter(µt-1, St-1, ut, zt)
/* Prediction Step */
µt' = At µt-1 + Bt ut
St' = At St-1 AtT + Rt
/* Update step */
Kt = St' CtT (Ct St' CtT + Qt)–1
µt = µt' + Kt (zt – Ct µt')
St = (I – Kt Ct) St'
Return µt, St
The filter works as follows:
Firstly, it estimates the new state and covariance in the Prediction Step using
the prior state values, the state transition matrices and noise covariance.
Secondly, it calculates the Kalman Gain, Kt, during the Update Step. This
matrix controls the extent to which the new measurement information is incorporated
into the state estimate. The state is updated using the error between the actual
measurements and the expected measurements that would have been seen if the
predicted state was correct. This error is called the innovation.
Note that there are several matrices that must be known ahead of time. When a
Kalman Filter is used for SLAM, these matrices degenerate to fairly simple forms
because the landmarks (or features) in the state are assumed to be independent of
each other and also stationary. Estimates are therefore required for:
• the motion model, which is how the robot’s pose changes in response
to the control input (elements of At and Bt);
• the measurement model, which is how the sensor responds to a given
feature (Ct); and
• the system and measurement noise covariances (Rt and Qt).
Furthermore, it is usually assumed that the system and sensor dynamics do not
change over time, so the relevant matrices are fixed and the subscript, t, can be
dropped.
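Algorithm 1 maps almost line-for-line onto code. The following NumPy sketch is illustrative only (matrix shapes are assumed consistent and the time-varying matrices are simply passed in as arguments); it is not the implementation used in this project:

```python
import numpy as np

def kalman_filter(mu, S, u, z, A, B, C, R, Q):
    # Prediction step: project the state and covariance forward.
    mu_p = A @ mu + B @ u
    S_p = A @ S @ A.T + R
    # Update step: blend in the measurement via the Kalman gain.
    K = S_p @ C.T @ np.linalg.inv(C @ S_p @ C.T + Q)
    mu_new = mu_p + K @ (z - C @ mu_p)   # (z - C @ mu_p) is the innovation
    S_new = (np.eye(len(mu)) - K @ C) @ S_p
    return mu_new, S_new

# 1-D example: a stationary state observed directly by a noisy sensor.
A = C = np.eye(1)
B = np.zeros((1, 1))
R = np.eye(1) * 0.01   # small system noise
Q = np.eye(1) * 1.0    # measurement noise
mu, S = kalman_filter(np.zeros(1), np.eye(1), np.zeros(1),
                      np.array([2.0]), A, B, C, R, Q)
```

In the 1-D example the prior variance and the measurement noise are equal, so the gain is close to 0.5 and the estimate moves roughly halfway toward the measurement, with the state covariance shrinking accordingly.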
2.5.3 Particle Filters
A popular form of localization is called Monte Carlo Localization (MCL).
There is an enormous amount of literature on the subject, although sometimes under
other names. Much of the work has been done by the trio of Thrun, Burgard and Fox.
In particular, Probabilistic Robotics by these authors is the definitive textbook in this
area.
MCL represents the belief in the current pose of the robot as a set of particles.
In simple terms, a particle consists of a pose estimate and any necessary state
information that is required to perform updates to the pose. As the robot moves,
these particles trace out trajectories (or paths) on the map.
If the set of particles is sufficiently large and has an appropriate distribution
that adequately represents the true posterior probability distribution, then the average
of the particles should be near to the actual pose of the robot. Ideally, one of the
particles should be very close to correct. If there is no such particle, then there are
insufficient particles in the set and the filter is likely to diverge.
A particle filter (for localization) operates as follows:
Algorithm 2 – Particle Filter
1. Perform a motion (based on the control input).
2. Obtain a new range scan (the measurement).
3. For each particle in the set calculate the new pose:
a. Apply the motion model to the existing (particle) pose using
the control input; and
b. For the new pose, compare the measurement to the global map
to obtain a weight (the importance factor).
4. Resample with probability proportional to the weights.
5. Decide where to go next and calculate the new control input.
6. Repeat from Step 1.
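Steps 3 and 4 of Algorithm 2 can be sketched as follows. The simple planar motion model, the noise levels and the weight_fn argument are assumptions made for this sketch, not the models used in this project:

```python
import random
import math

def step_particles(particles, u, z, weight_fn, motion_noise=0.05):
    # particles: list of (x, y, theta) poses.
    # u: (distance, rotation) control input.
    # weight_fn(pose, z): importance factor obtained by comparing
    #     the measurement z to the global map at the given pose.
    moved, weights = [], []
    for (x, y, th) in particles:
        # Step 3a: apply the motion model with sampled noise.
        d = u[0] + random.gauss(0, motion_noise)
        th = th + u[1] + random.gauss(0, motion_noise)
        pose = (x + d * math.cos(th), y + d * math.sin(th), th)
        moved.append(pose)
        # Step 3b: weight the particle against the global map.
        weights.append(weight_fn(pose, z))
    # Step 4: resample with probability proportional to the weights.
    return random.choices(moved, weights=weights, k=len(moved))

random.seed(1)
particles = [(0.0, 0.0, 0.0)] * 100
particles = step_particles(particles, (1.0, 0.0), None,
                           lambda pose, z: 1.0)
```

The sampled noise in step 3a is what spreads the particles out so that, with luck, at least one of them lands near the true pose.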
Note that there is an interesting issue here. With hundreds of particles
providing different poses there are several ways to determine the best estimate of the
pose. Taking an average seems intuitively correct, but the distribution might not be
anything like a Gaussian and could even be multi-modal. Selecting the particle that
has the best match to the existing map (highest importance factor) is another possible
approach.
Applying the motion model is referred to as sampling from the proposal
distribution. It can be considered to be adding to the tree of all possible particle
trajectories. Calculating the importance weights and then applying resampling results
in a pruning of the tree because some particles are thrown away and others are
duplicated.
Clearly this process is a compromise. It is desirable to maintain particle
diversity so that the particle set adequately represents the true posterior distribution
over the robot’s pose, but at the same time it is important to eliminate trajectories
that are obviously incorrect because they simply distort the resulting distribution and
ultimately would cause the filter to fail. Simplistically, the objective is to keep the
‘good’ particles and throw away the ‘bad’ ones. The difficulty is determining which
particles are good or bad, which is the responsibility of the map comparison step.
2.5.4 Particle Weights
An important step in a particle filter is to rank particles based on their weights.
The weights are calculated by comparing the local map information generated by
each particle against a global map. For localization only, the correct map is available
and there is no ambiguity in calculating the weights. However, in SLAM the map is
being constructed at the same time as the localization is taking place.
In SLAM each particle constructs its own map. Some of the maps will be
closer to the ground truth than others. However, the ground truth is unknown, by
definition. If a ‘bad’ particle survives for long enough, it will construct its own map
variant that fits its data, so its weight, when measured against its own map, can
remain relatively high. One possibility is to compare each particle map against the
average of all maps, thereby ensuring that an individual particle cannot construct its
own version of reality and maintain its weight.
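That suggestion can be sketched as follows; the probability-grid representation and the particular scoring function are assumptions for illustration only:

```python
import numpy as np

def weights_against_mean_map(particle_maps):
    # particle_maps: array of shape (n_particles, H, W) holding
    # occupancy probabilities in [0, 1], one grid per particle.
    mean_map = particle_maps.mean(axis=0)
    # Score each particle map by its mean absolute deviation from
    # the consensus map; smaller deviation gives a higher weight.
    errors = np.abs(particle_maps - mean_map).mean(axis=(1, 2))
    return 1.0 / (1.0 + errors)

maps = np.array([[[1.0]], [[1.0]], [[0.0]]])   # three 1x1 particle maps
w = weights_against_mean_map(maps)
```

A particle that agrees with its own map but disagrees with the majority is penalized, which is exactly the failure mode described above.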
2.5.5 Resampling
Resampling is necessary because there is a finite number of particles in the set.
The objective is to retain the best particles and eliminate particles that have poor
weights. Some of the particles are ‘cloned’ to replace the bad particles.
Note that resampling should not necessarily occur after every new input
sample is received, although this is what Algorithm 2 suggests. It is common
practice to perform resampling after a small number of consecutive sensor
measurements (and motions) have been made. This provides more information for
use in the calculation of the weights and is acceptable if the variance in the pose is
not too great over the small number of motions.
There are several different ways to perform resampling. The Low Variance
Sampler from page 110 of Probabilistic Robotics can be stated as follows:
Algorithm 3 – Low Variance Sampler
Low_Variance_Sampler(Pt)
Tt = Ø
r = random(0..1) / M
c = wt1
i = 1
For k = 1 to M
U = r + (k – 1) / M
While U > c
i = i + 1
c = c + wti
End While
Add pti to Tt
End For
Return Tt
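In code, Algorithm 3 looks like the following. This is a sketch only: the list-based representation of particles and weights is an assumption, and a guard on the index has been added for safety with unnormalized or degenerate weights:

```python
import random

def low_variance_sampler(particles, weights):
    M = len(particles)
    total = sum(weights)
    w = [wi / total for wi in weights]   # normalize, as the algorithm assumes
    r = random.uniform(0, 1) / M         # single random offset
    c = w[0]
    i = 0
    resampled = []
    for k in range(M):
        U = r + k / M                    # evenly spaced sampling points
        while U > c and i + 1 < M:
            i += 1
            c += w[i]
        resampled.append(particles[i])
    return resampled

random.seed(0)
out = low_variance_sampler(['a', 'b', 'c', 'd'], [0.0, 1.0, 0.0, 0.0])
```

Because all M samples share the one random offset r, a particle with weight w is selected either floor(M·w) or ceil(M·w) times, which is the low variance the algorithm's name refers to.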
Resampling is a stochastic process. The objective is to select particles with a
probability proportional to their importance weights. Particles are replaced after
selection, which means that the same particle can be selected more than once if it has
a high weight. However, because it is stochastic, it is possible to select a particle
with a very low weight – ‘good’ particles can be replaced by ‘bad’ ones.
Although resampling algorithms are designed to be unlikely to propagate bad
particles, it is not impossible. The author observed bad particles being propagated
during several test runs over multiple iterations. Consider for instance the case where
there is very little input information, i.e. few obstacles are visible such as in the
middle of a large open space. In this case the particles will all have very similar
weights and the stochastic nature of the resampling process can select bad particles.
The problem known as Particle Deprivation (discussed below) is symptomatic of the
propagation of bad particles.
2.5.6 Particle Diversity
A significant issue related to Particle Filters is the diversity of the particle set.
Particle Filters cannot propagate new information backwards in time – the
trajectories of particles cannot be updated retrospectively.
It is possible to think of particles in terms of their ancestry. Tracing back in
time, eventually all particles might have a single common ancestor. Looking at the
number of steps backward in time to reach this common ancestor gives some
indication of how well the filter can tolerate errors. In particular, for the filter to
‘close a loop’, some particle diversity must remain at the point where the loop closes.
Put simply, this means that the more particles there are, the larger the loop that can
be closed.
2.5.7 Number of Particles Required
Selection of the number of particles for a particle filter is difficult. Too few
particles can result in the filter failing, whereas too many particles will adversely
affect performance. Clearly, the lower bound is one particle which is not a particle
filter at all. The chance of one particle following the correct trajectory is quite small,
although not zero. On the other hand, it is impossible in practice to have an infinite
number of particles.
There is no formal procedure or theoretical foundation for selecting the number
of particles. Usually researchers establish the number of particles required by trial
and error. The process might simply be to start with a very large number of particles
and keep reducing the number until the filter fails.
Some algorithms reported in the literature have used thousands of particles,
such as DP-SLAM [Eliazar & Parr 2003]. Examples on the web site for DP-SLAM
(see the link from [OpenSLAM 2007]) used between 1,000 and 9,000 particles.
One of the objectives of GMapping [Grisetti et al. 2005, Grisetti et al. 2007]
was specifically to reduce the number of particles required because each particle
contained an occupancy grid map. Some examples used as few as 30 particles.
P-SLAM [Chang et al. 2007] predicted the structure of the environment based
on features already in the map. This worked well in indoor environments. However,
the authors showed a real-life example where 1,000 particles were required with a
Rao-Blackwellized Particle Filter in order to produce a good map. The minimum
number of particles quoted for P-SLAM was 200, but in this case the map was
distorted.
A recent paper [Stückler & Behnke 2008] that applied the environmental
assumption of orthogonal walls showed that acceptable quality maps could be
produced using only 50 particles. (NOTE: This paper was published concurrently
with this thesis. The concept of incremental localization developed in this thesis is
similar, but not derived from this paper.)
2.5.8 Particle Deprivation
One of the well-known problems with Particle Filters is referred to as particle
deprivation or depletion [Thrun et al. 2005, p122] or sample impoverishment
[Montemerlo & Thrun 2007, p63]. This happens when the particle set becomes
dominated by a few particles, leading to a dramatic reduction in diversity. If the
remaining particles do not properly represent the posterior distribution over the
robot’s pose, then the filter will diverge. This occurs in practice because there can
only be a finite number of particles.
Montemerlo and Thrun suggest that this problem occurs when there is a
mismatch between the proposal and posterior distributions due to the motion model
being very noisy, but the measurements being quite accurate. A substantial portion of
the particle set will therefore be thrown away.
The problem can be aggravated by resampling too frequently or a poor
objective function for calculating the weights.
2.6 Localization
Localization is the process that a robot uses to determine where it is in the
world. Usually this is done by comparing surrounding features with a map. The
definition of localization can therefore be re-stated as the process of updating the
pose of a robot based on sensor input.
Sonar sensors were used in a lot of the early research on localization. Typically
the sensors were arranged in a ring around the perimeter of the robot. This meant that
a full 360° sweep could be obtained without moving the robot, although the number
of ‘rays’ was usually limited to between 12 and 24.
Consider Figure 12 which shows a robot exploring an indoor environment. The
robot is facing to the left, or west, in the map. This is an artificially generated map
showing 12 rays traced from the robot. It is designed to illustrate several points about
sonar and range sensors in general.
Figure 12 – Hypothetical example showing sonar rays
The ray extending to the top left has reached its maximum range without
hitting an obstacle. The opposite sensor (bottom right) has also returned a maximum
range, but this is in fact an error that resulted from the signal bouncing off the wall
and into the adjoining room so no return signal was seen. The ray therefore appears
to have passed through the wall. These maximum range errors occur fairly often.
Notice in the top right that one of the rays does not extend all the way to the
wall. Short readings such as this happen for no apparent reason, although they do not
occur with a high probability.
Now consider if the robot were to move forward. The ray towards the upper
left that narrowly misses the obstacle would then intersect with the obstacle and be
much shorter.
Alternatively, if the robot turned to its left, the ray from the front sensor of the
robot, which hits the very corner of an obstacle in the diagram, would miss the
obstacle and travel all the way to the wall at the left.
Small changes in position can therefore have a significant effect on the range
values. This clearly illustrates how matching the rays to a map can be used for
localization.
This explanation is based on using only a small number of range-bearing
measurements. (The bearings are implicit in the arrangement of the sensors around
the robot.) For a sensor that generates significantly denser range data, such as a laser
range finder or vision, other approaches are possible.
In this project, the visual component of the system generates a floor boundary,
analogous to a laser range finder’s range scan. This boundary, referred to as the ROP, is
used to create a local map. By overlaying the local map on the existing global map
and comparing the two it is possible to obtain a measure of fit. This might be thought
of as the logical extension of Figure 12 to a very large number of rays. Similar
problems with overshooting and short range measurements also apply to vision, but
the reasons are quite different to sonar and lasers.
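A crude version of such a measure of fit can be sketched as follows; the grid encoding and the obstacle-overlap score are illustrative stand-ins, not the fitness function used in this project:

```python
import numpy as np

def map_fit(global_map, local_map, offset):
    # Both maps are occupancy grids: 1 occupied, 0 free, -1 unknown.
    # offset: (row, col) position of the local map's origin within
    # the global map. Scores +1 for each cell where both maps agree
    # that an obstacle is present.
    r0, c0 = offset
    h, w = local_map.shape
    window = global_map[r0:r0 + h, c0:c0 + w]
    return int(np.sum((window == 1) & (local_map == 1)))

g = np.full((5, 5), -1)
g[2, 2] = 1
g[2, 3] = 1
local = np.array([[1, 1]])
```

Evaluating this score over a small set of candidate offsets (and rotations) around the odometry estimate, and keeping the best, is the overlay-and-compare process described above.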
The biggest challenge to localization is erroneous odometry information. The
background to odometers is discussed in the classic textbook Sensors for Mobile
Robots [Everett 1995]. According to Murphy [Murphy 2000], ‘shaft encoders are
notoriously inaccurate’. Odometry errors are discussed at length in [Borenstein &
Feng 1996].
2.6.1 Levels of Localization
A common thread in the literature is the ‘Kidnapped Robot’ problem. This is
different from simply getting lost while exploring. A kidnapped robot ‘wakes up’ in
unfamiliar surroundings and has to determine where it is. Clearly this is the most
difficult localization problem, and is referred to as the global localization problem in
this thesis.
Since the focus is on SLAM for this thesis, when the robot starts up it is told its
current position. From then on it should always have a good idea of where it is
relative to the map that it is building. In any case, since the robot starts with no map
at all it clearly cannot localize itself after being moved while it was turned off.
Instantaneous ‘teleportation’ from one place to another does not happen during
a SLAM test run. So if the robot becomes lost, it must be somewhere near its last
known good position. In the worst case, it should be able to reverse the sequence of
motion commands that brought it to the current location and end up back where it
started from. It is not quite so simple if the wheels have slipped along the way, but
the point is that re-localizing only requires searching a small area of the map. This is
called local localization.
Each time the robot moves, it is important to track its new pose. The ‘control
input’ to the robot consists of a command to move a certain distance or to rotate by a
specified amount. In the absence of any other information, and assuming that the
robot can execute commands with reasonable accuracy, this control input can be
used as the odometry input for a tracking algorithm.
Wheel encoder information is a better measure of actual motions of the robot
than the control commands issued to the motors. However, there are several reasons
why the encoder information might not be correct:
• the wheels can slip, especially if the robot nudges an obstacle;
• the wheels can sink into the carpet, changing their effective diameter;
and
• the wheel diameters and/or the wheelbase might not be known
accurately (especially if the wheels are made from moulded plastic).
These inaccuracies result in errors when calculating the distance travelled.
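For a differential-drive robot, the dead-reckoning update from encoder ticks looks roughly like the following. The tick resolution and wheel geometry are illustrative values; any error in them biases every update, which is how the inaccuracies above accumulate:

```python
import math

def odometry_update(pose, ticks_left, ticks_right,
                    ticks_per_rev=500, wheel_diameter=0.10, wheelbase=0.30):
    # An error in wheel_diameter scales both wheel distances, and an
    # error in wheelbase scales the heading change, so small
    # calibration errors grow into large pose errors over time.
    x, y, th = pose
    per_tick = math.pi * wheel_diameter / ticks_per_rev
    dl = ticks_left * per_tick           # distance of left wheel
    dr = ticks_right * per_tick          # distance of right wheel
    d = (dl + dr) / 2.0                  # distance of the robot centre
    dth = (dr - dl) / wheelbase          # change in heading (radians)
    return (x + d * math.cos(th + dth / 2.0),
            y + d * math.sin(th + dth / 2.0),
            th + dth)

# One revolution of both wheels drives the robot straight ahead
# by one wheel circumference.
p = odometry_update((0.0, 0.0, 0.0), 500, 500)
```

Note that heading errors are the most damaging: a small error in dth rotates every subsequent translation, so the position error grows with distance travelled.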
Other information must therefore be used to refine estimated motions. This is
referred to as incremental localization because it incrementally adjusts the robot’s
pose between motions. The purpose is to improve the pose estimate prior to
executing the localization algorithm.
2.6.2 Localization Approaches
Localization methods fall into two broad categories:
• Kalman Filters which track features (also called landmarks or beacons)
using normal distributions to estimate errors in feature locations; and
• Markov localization methods, also called Monte Carlo Localization
(MCL) or Particle Filters, which approximate an arbitrary probability
distribution using a set of particles and operate on a grid-based map.
These techniques do not have to be mutually exclusive. For example,
probabilistic methods like MCL can use Kalman Filters in their particles instead of
occupancy grids.
It is important to note that most popular localization methods make the
assumption that the system can be represented as a first-order Markov process. In
terms of localization, this means that the robot’s pose at a particular time, t, depends
only on its pose at the previous time step, (t-1), and the control input applied to the
robot’s motors in the intervening time interval. In general, the pose at time t might
depend on a finite number of previous poses, but this is ignored.
Localization with a known map has been studied for many years. [Gutmann et
al. 1998] compared several methods, and suggested that there are three basic
approaches to localization:
1. Behaviour-based;
2. Landmark-based; and
3. Dense sensor matching.
The last two subsections discuss how to improve the quality of the pose
estimates prior to applying a localization algorithm.
2.6.3 Behaviour-Based Localization
Behaviour-based approaches usually generate topological maps. They might
involve behaviours such as wall following and recording features such as doorways.
It can therefore be difficult for these approaches to recognise places that have been
previously visited, and they cannot localize a robot accurately in a metric sense.
2.6.4 Landmark-Based Localization
There is a large body of literature that deals with using known landmarks
and/or man-made beacons for localization. Obviously, if the robot is building a map
then there cannot be any known landmarks to begin with.
Artificial landmarks were deliberately avoided in this project because
introducing them into the environment is not possible in many situations.
Furthermore, artificial landmarks introduce additional costs and maintenance
overhead.
An alternative is to choose natural landmarks, or ‘features’. It is important to
select only significant features because an excessive number of features will not
necessarily help localization. However, when operating in a corridor environment
many of the images might not show any obvious visible landmarks. This is discussed
further under Tracking Features.
There are other problems with this approach as well. In particular, landmarks
can look different from different directions or even be unidentifiable from a different
viewpoint. For instance, the front and back of a chair look quite different, but it is the
same object.
The Symmetries and Perturbations Map (SPMap) uses corners and
‘semiplanes’ as the principal features [Castellanos et al. 1999]. A ‘semiplane’ is a
line segment obtained from a laser range scan and typically represents a wall. They
developed probabilistic models for matching and tracking these distinguishing
features.
[Castellanos & Tardós 1999] addressed sensor fusion between laser range
finders and monocular vision. In the case of vision, they used vertical edges as
landmarks. However, the camera was horizontal which makes detecting vertical
edges trivial. The general case where the camera is tilted is discussed below.
2.6.5 Dense Sensor Matching for Localization
Two variants of dense sensor matching were compared empirically in
[Gutmann et al. 1998]: Markov localization (based on a grid); and scan matching
using a Kalman filter (which was feature-based). In essence, these are the two
primary methods still in use today, although Gutmann et al. suggested combining the
two techniques. In particular, it was concluded that scan matching was more accurate
than Markov localization, but was less robust as the noise level increased.
‘Catastrophic failures’ were observed by Gutmann et al. in some cases,
meaning that the localization algorithm diverged. It is interesting to note that even in
this early work there was mention of a requirement for ‘sufficient information’ for
scan matching to be successful.
[Gutmann & Konolige 1999] addressed the problem of large cyclic
environments which are known to present problems for localization algorithms. They
combined three different techniques (scan matching, consistent pose estimation and
map correlation) to create the LRGC (Local Registration / Global Correlation)
algorithm.
Some of the early papers on the subject of localization did not provide
experimental results for long test runs. Examples were given from simulations, or
quite short runs. It was only with work such as [Thrun et al. 2000] that evidence
started to appear to show that localization worked reliably over long periods of time
and long distances travelled.
2.6.6 Scan Matching
In recent years there has been a lot of work done on scan matching which
involves comparing two successive laser range scans. The original work dates back
to [Lu & Milios 1994, 1997] who applied various techniques to matching sonar
sweeps in order to determine the amount of rotation and/or translation.
[Hähnel 2004] used scan matching of laser range scans to improve pose
estimates before performing SLAM updates. He attributes the quality of his maps to
this extra step. [Grisetti et al. 2005] continued with this approach in the GMapping
algorithm.
[Nieto et al. 2005] introduced an approach called Scan-SLAM that
incorporated scan matching into an EKF SLAM algorithm.
Scan matching is not directly applicable to vision. Information must first be
extracted from an image that is similar to a laser range scan.
2.6.7 Improving the Proposal Distribution
The objective of improving the proposal distribution is to enhance the stability
of the filter by providing better pose estimates before performing the update step.
This is sometimes referred to as incremental localization. Scan matching (discussed
in the previous section) is one of the common approaches.
An alternative to scan matching is to compare features in the input data against
known features in the map and update the robot’s pose estimate prior to applying the
filter. The selected features must be significant such as walls or corners.
2.7 Simultaneous Localization and Mapping (SLAM)
For autonomous mobile robots to operate in human environments they need
SLAM (Simultaneous Localization and Mapping), also referred to as CML
(Concurrent Mapping and Localization). As domestic robots begin to appear in our
homes, it will simply not be acceptable for users to have to program maps into their
robots. Quite apart from the fact that most people do not have a map of their home,
they will not want to be bothered with the tedious process of entering a map into the
robot’s memory.
Leading researchers in the field have suggested that SLAM is now a ‘solved’
problem [Durrant-Whyte & Bailey 2006]. However, they go on to discuss the many
practical problems that still remain.
The primary problems that SLAM has to solve are:
• Determining the robot’s pose from uncertain input data; and
• Estimating the relative position of observed features from
measurements containing noise.
Building a map is not one of the significant problems if the input data is
accurate. Also, the exploration algorithm does not have a strong effect.
However, if the robot’s pose is not correct, or the observations contain gross
errors, then features will be inserted into the map in the wrong locations. These
errors accumulate, resulting in distortions of the map. It is even possible for the robot
to become completely lost.
It is generally acknowledged that the field of SLAM began with a paper by
[Smith & Cheeseman 1986] in which they addressed the issue of representing spatial
uncertainty in a systematic way. In a revised version [Smith et al. 1990], which is
also widely cited, they introduced the term ‘stochastic map’ which reflects the fact
that the input information is stochastic in nature.
This ground-breaking work discussed the use of a Kalman Filter and also
applied Monte Carlo simulation to robot pose estimation. Even at this preliminary
stage, the authors pointed out two of the most fundamental problems with SLAM:
non-Gaussian distributions; and non-linearities in the system.
Soon after the Smith & Cheeseman paper, another early researcher used a
geometric approach to develop a theoretical framework for SLAM [Durrant-Whyte
1988]. No experimental data was provided in this paper.
Various researchers under the tutelage of Durrant-Whyte have worked on
SLAM for close to two decades. Newman’s PhD thesis [Newman 1999] provided a
detailed theoretical background, including proofs that existing SLAM algorithms
converged (in the limit as time goes to infinity) and also a new algorithm that used
the relationships between map features rather than their absolute positions. The
general solution to the SLAM problem was published by [Dissanayake et al. 2001].
Most of the early work on SLAM was done using sonar sensors, such as [Lee
1996]. These are noisy and not very accurate. Therefore it made sense to use a
probabilistic approach.
More recently, Laser Range Finders (LRFs) have become popular. These
devices have a much greater range than sonar, and are extremely accurate by
comparison. However, LRFs cost significantly more than sonar sensors.
Vision is the most recent sensor to be used and there is a great deal of research
currently in progress on using vision for SLAM. One of the key advantages of
vision, which was part of the initial impetus for this research, is that cameras are now
very cheap. The disadvantage is that vision is a very complex sensor to use.
Consequently, many different approaches have been used for visual SLAM, as
explained later in this chapter in section 2.8.
2.7.1 SLAM Approaches
SLAM is basically a ‘chicken and egg’ problem. To localize itself, the
robot needs a map. However, the robot does not have a map initially because
building a map is the objective of SLAM (and an appropriate exploration algorithm).
One of the implicit assumptions in SLAM is that the process can be
represented by a first-order Markov model. In other words, the current state depends
only on the previous state.
In broad terms, there are two basic approaches to SLAM:
• Kalman Filtering; and
• Probabilistic methods.
It is also possible to merge the two approaches. For instance, a set of Kalman
Filters can be used in the particles of a Particle Filter.
This section briefly outlines these two different approaches. The key point is
that all SLAM methods use stochastic models – they allow for uncertainty in the
estimated locations of objects in the environment and in the estimated pose of the
robot.
2.7.2 Map Features for Kalman Filtering
Kalman Filters work with features in their state vectors and therefore are not
normally applied to grid-based maps, because each grid cell would become a variable
in the state vector. The size of the filter covariance matrix grows as N², where N is
the number of features, so the cost of each filter update grows with the map. This
can rapidly become unmanageable for large environments, or if the system is
over-zealous in selecting features.
Although it is tempting to consider features to be ‘points’ in space, there are
implementations using the Kalman Filter to track other types of features. For
example, [Castellanos & Tardós 1999] used what they referred to as corners,
segments and semiplanes. As long as the features can be described parametrically
and there are methods to unambiguously determine matches between features, then
the Kalman Filter can be applied.
This simplified discussion ignores the requirement to add new features to the
state vector as they are discovered, and to remove features if they later prove to
be inappropriate or mistaken, or simply to prune the state vector.
Some research has been directed towards minimizing the impact of the
expanding state matrix by breaking the map into a series of sub-maps or ignoring
features that are out of sensor range during updates. This can significantly reduce the
computational burden. However, it assumes that the sub-maps are all correctly
correlated with each other.
The nature of the Kalman Filter algorithm is such that the landmarks are fully
correlated. Therefore over time the correlation matrix will become static and the
Kalman gain will approach zero. This can happen very quickly according to
[Andrade-Cetto & Sanfeliu 2006]. In other words, no further updates to existing
landmark locations will occur – only the robot’s pose will be updated.
In theory, all landmarks must be observable at every step in order to perform a
full update. Furthermore, it must be possible to match each of the landmarks to the
corresponding one in the new observations, a process called data association. Clearly
this will never be the case in practice.
One significant problem with Kalman Filters is data association, which is the
process of matching the observed features between iterations of the filter. Any
textbook on SLAM will have a reference to this problem. See for example [Thrun et
al. 2005, Choset et al. 2005].
In order to update the state, it is necessary to know which feature corresponds
to which measurement. If a feature is associated with the wrong measurement, it can
cause the filter to become unstable. Other factors can also affect the quality of data
association such as the selection of spurious features or ambiguous features. Methods
have been developed to prune features that exhibit a large variance on the
assumption that they have resulted from a mismatch.
Features also must be invariant. In other words, they must not move around. In
theory the Kalman Filter can handle non-stationary environments. However, data
association errors resulting from moving objects can have drastic effects on the
correlation matrix.
2.7.3 Extensions to the Kalman Filter
As noted above, Kalman Filters assume that the system is linear and that any
signal noise has a Gaussian distribution. Real-world systems do not fit these
assumptions. To try to resolve this problem, the Extended Kalman Filter (EKF)
linearises the system around the current estimate using a first-order Taylor
expansion. Calculating the required Jacobians increases the complexity and
computational load.
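As a concrete illustration of this linearisation step, the following sketch computes the Jacobian that an EKF would evaluate at the current state estimate. The motion model here is a hypothetical minimal example, not the model used in this thesis:

```python
import numpy as np

def motion_jacobian(pose, u):
    """Jacobian of a simple 2D motion model with respect to the state.

    The model x' = x + d*cos(theta), y' = y + d*sin(theta),
    theta' = theta + w is nonlinear in theta, so the EKF approximates
    it by this Jacobian evaluated at the current estimate.
    """
    x, y, theta = pose
    d, w = u
    return np.array([
        [1.0, 0.0, -d * np.sin(theta)],  # partial derivatives of x'
        [0.0, 1.0,  d * np.cos(theta)],  # partial derivatives of y'
        [0.0, 0.0,  1.0],                # partial derivatives of theta'
    ])
```

The off-diagonal terms show where the nonlinearity enters: the further the heading estimate is from the truth, the worse the linear approximation becomes.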
The EKF has been one of the most popular methods for SLAM in recent years.
However, in an invited paper [Julier & Uhlmann 2004] state that the EKF has several
deficiencies and that ‘more than 35 years of experience in the estimation community
has shown that it is difficult to implement, difficult to tune, and only reliable for
systems that are almost linear on the time scale of the updates.’ This is a serious
concern.
To try to overcome the linear Gaussian restriction, the Unscented Kalman
Filter (UKF) was developed [Julier & Uhlmann 1997]. UKFs combine some of the
features of probabilistic estimation with the traditional EKF, so that rather than
linearizing using a Taylor expansion, the probability distribution is represented using
‘sigma points’.
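The sigma-point idea can be sketched as follows. This is a minimal illustration assuming the common scaled form with a single scaling parameter `lam`, not a complete UKF:

```python
import numpy as np

def sigma_points(mu, P, lam=2.0):
    """Generate the 2n+1 sigma points for an n-dimensional Gaussian.

    The points are the mean itself, plus and minus the columns of a
    matrix square root of (n + lam) * P. Propagating these points
    through the nonlinear function replaces the Taylor-series
    linearization used by the EKF.
    """
    mu = np.asarray(mu, dtype=float)
    n = len(mu)
    S = np.linalg.cholesky((n + lam) * P)  # matrix square root of scaled covariance
    pts = [mu]
    for i in range(n):
        pts.append(mu + S[:, i])
        pts.append(mu - S[:, i])
    return np.array(pts)
```

Because the points are placed symmetrically about the mean, their weighted sample statistics reproduce the original mean and covariance exactly, while the transformed points capture the distribution after the nonlinearity.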
2.7.4 Examples of Kalman Filter-based SLAM
The following diagrams for EKF SLAM (Figure 13) and UKF SLAM (Figure
14) were generated by the author by running sample code provided by Dr. Bailey on
the OpenSLAM web site [OpenSLAM 2007]. They are used to illustrate a couple of
key points, not to show any deficiencies in these approaches.
Figure 13 – EKF SLAM example in Matlab
In Figure 13, the actual trajectory of the robot and the estimated trajectory do
not match exactly. Furthermore, towards the end (bottom left) many of the estimated
feature locations (shown as error ellipses with a ‘+’ in the centre) are wrong.
Figure 14 – UKF SLAM example in Matlab
Figure 14 shows UKF SLAM where the robot has traversed the loop twice.
The second trajectory tracks the actual trajectory much more closely. Also, the
estimated errors in the feature locations have reduced significantly. However, the
estimated locations are still incorrect for many of the features.
The main point is that these are state-of-the-art algorithms, but they still cannot
produce good results. The importance of revisiting areas that have been seen
previously is also quite clear – as shown by trajectory improvements in Figure 14.
2.7.5 Particle Filter SLAM
[Thrun et al. 1998] were one of the first groups to apply particle filter methods
to SLAM. They formulated the problem using probability distributions and applied
Expectation Maximization (EM) to solve the problem.
The particle filter approach to SLAM is relatively recent, but it has gained
considerable attention in the literature [Doucet et al. 2001]. There are many different
names and algorithms. One basic approach, known as Monte Carlo SLAM, or
Particle Filters, can be described as follows:
Maintain a large collection of pose estimates and maps with associated
certainty estimates, and update these at each step so that the distribution of these
estimates eventually approximates the real probability distribution. The idea is to
keep enough ‘guesses’ that one of them will actually turn out to be correct.
Particle Filters can be applied to either feature or grid-based maps. For
instance, a Particle Filter might use a separate Kalman Filter for each particle. In the
case of occupancy grids, each particle must maintain a separate copy of the grid. The
result is that Particle Filters can be very memory and computationally intensive.
One way to view the output of a Particle Filter is by plotting the trajectories of
all of the particles. Figure 15 shows the trajectories of particles during an exploration
run.
Figure 15 – Example of particle trajectories
In Figure 15 the robot started at the bottom centre of the map. The locations of
the particles are shown as black dots at the ends of the grey trajectories. As it
progressed, the robot became increasingly uncertain about its pose. This is reflected
in the spreading of the particles trajectories. In this 2D representation it is not
possible to show the orientations of the particles, which also have some variation.
2.7.6 A Particle Filter SLAM Algorithm
Several SLAM algorithms are provided in pseudo-code in Probabilistic
Robotics [Thrun et al. 2005]. More recently, open source code has begun to appear
[OpenSLAM 2007].
FastSLAM is a popular form of particle filter, and FastSLAM 2.0 [Montemerlo
et al. 2003, Montemerlo & Thrun 2007] is the most recent incarnation. When applied
to an occupancy grid it is called GridSLAM.
The GridSLAM algorithm is discussed below based on the description in
[Thrun et al. 2005] which is the definitive textbook on the subject. The algorithm is
reproduced below for ease of reference, with some small notational changes. The
algorithm is expressed at a high level. The derivation is not discussed here because it
was given in [Thrun et al. 2005].
In the algorithm below, the following notation is used:
M       Number of particles
m_t^k   Map (global) associated with particle number k at time t
p_t^k   Particle number k at time t (which contains x_t^k, m_t^k, w_t^k)
P_t     Particle set at time t (which contains all particles for k = 1..M)
T_t     Temporary particle set
u_t     Control input at time t
w_t^k   Weight of particle number k at time t
x_t^k   Robot pose for particle number k at time t
z_t     Sensor measurements at time t
The GridSLAM algorithm from page 478 of Probabilistic Robotics [Thrun et
al. 2005] can be stated as shown below.
Algorithm 4 – FastSLAM Occupancy Grids
FastSLAM_Occupancy_Grids(P_{t-1}, u_t, z_t)
    T_t = Ø                          /* Empty temporary Particle Set */
    For k = 1 to M                   /* For each particle ... */
        x_t^k = Sample_Motion_Model(u_t, x_{t-1}^k)
        w_t^k = Measurement_Model_Map(z_t, x_t^k, m_{t-1}^k)
        m_t^k = Update_Occupancy_Grid(z_t, x_t^k, m_{t-1}^k)
        Add particle ‹x_t^k, m_t^k, w_t^k› to T_t
    End For
    P_t = Resample(T_t)              /* Select new Particle Set based on weights */
    Return P_t
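A minimal Python sketch of this loop might look as follows. The three model functions are supplied by the caller (their names follow the pseudocode), and the dictionary-based particle representation is an assumption for illustration only:

```python
import numpy as np

def fastslam_occupancy_grids(particles, u_t, z_t, rng,
                             sample_motion_model,
                             measurement_model_map,
                             update_occupancy_grid):
    """One iteration of the FastSLAM occupancy-grid algorithm.

    Each particle is a dict holding a pose 'x', a map 'm' and a
    weight 'w'. In a real implementation each resampled particle needs
    its own copy of the map, which is what makes grid-based particle
    filters so memory-hungry.
    """
    temp = []
    for p in particles:
        x = sample_motion_model(u_t, p['x'], rng)   # propose a new pose
        w = measurement_model_map(z_t, x, p['m'])   # weight: scan vs map
        m = update_occupancy_grid(z_t, x, p['m'])   # fold scan into map
        temp.append({'x': x, 'm': m, 'w': w})
    # Normalize the weights so they sum to 1.0, then resample with
    # replacement in proportion to weight.
    weights = np.array([p['w'] for p in temp], dtype=float)
    weights /= weights.sum()
    idx = rng.choice(len(temp), size=len(temp), p=weights)
    return [dict(temp[i]) for i in idx]
```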
This algorithm in turn uses several other algorithms. These are explained
below. However, some notes are appropriate at this point.
Each particle contains its own global map, m_t^k. Depending on the particle
trajectories, these maps can be substantially different. However, the same
measurement data, zt, is applied to all maps.
Particles also have a current pose, x_t^k, which is used when updating the map.
However, there is no trajectory maintained by the particle. The map implicitly
records the trajectory. This is a key point. It is a consequence of the Markov
assumption which states that the current pose depends only on the previous pose and
the control input, so there is no need to maintain a trajectory.
The control input, u_t, could be either the command issued to the robot or the
odometry data from the robot’s wheel encoders. In either case, it represents the
amount by which the robot has moved. Clearly, u_t is noisy, and this is one reason
why localization is required.
Sensor measurements (also called observations), z_t, can take many forms.
Usually they will be an array of range measurements, also referred to as a range
scan. For sonar sensors or a laser range finder it is quite obvious how this is derived.
For the vision system a simulated range scan is obtained from the floor boundary as
is explained above in the section on Inverse Perspective Mapping.
The importance weights, w_t^k, are normalized so that they sum to 1.0 and they
are therefore equivalent to the probability of a match between the current sensor
measurement (observation) and the global map.
FastSLAM 2.0, which is an improved version of FastSLAM, applies the most
recent measurement to pose prediction prior to calculating the new proposal
distribution. The result is improved performance and stability, and might be viewed
as a form of incremental localization.
2.7.7 Sample Motion Model
The motion model should represent the physical process that the robot
undertakes as it moves through the environment. There are many different motion
models that can be used.
The mechanics of robot motion are covered extensively in [Choset et al. 2005],
with a similar model in [Thrun et al. 2005]. The kinematics of various different
methods of robot locomotion, including the differential drive, were derived in
[Dudek & Jenkin 2000]. Estimation of the odometric error for a differential drive
was discussed in [Siegwart & Nourbakhsh 2004].
Note that the issue is not whether the motion model has a plausible physical
interpretation, but whether it adequately covers the possible range of motions, so
that the proposed distribution of poses encompasses the actual pose. If the
parameters of the motion model are too narrow, then the particle
filter will be unable to track the robot’s actual motion. Conversely, if the motion
model results in a distribution that is very broad then particles might ‘wander’ too
much.
Applying the motion model consists of calculating the new pose based on the
control input and then adding the noise.
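This two-step process can be sketched as follows, assuming a simple (translation, rotation) control and hypothetical noise gains; a deliberately simplified version of the odometry models in [Thrun et al. 2005]:

```python
import numpy as np

def sample_motion_model(pose, u, alphas, rng):
    """Sample a new pose from a simplified odometry-style motion model.

    pose = (x, y, theta); u = (d_trans, d_rot) is the measured forward
    distance and heading change. The control is perturbed with zero-mean
    Gaussian noise whose spread grows with the size of the motion, then
    applied to the pose. The noise gains in `alphas` are assumed
    parameters that must be tuned for the particular robot.
    """
    d_trans, d_rot = u
    a_trans, a_rot = alphas
    noisy_rot = d_rot + rng.normal(0.0, a_rot * abs(d_rot) + 1e-6)
    noisy_trans = d_trans + rng.normal(0.0, a_trans * abs(d_trans) + 1e-6)
    x, y, theta = pose
    theta_new = theta + noisy_rot
    return (x + noisy_trans * np.cos(theta_new),
            y + noisy_trans * np.sin(theta_new),
            theta_new)
```

Calling this once per particle produces the spread of hypothetical poses that the resampling step later prunes against the observations.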
2.7.8 Measurement Model Map
In the particle filter algorithm above, there is a step where the importance
weights of the particles must be calculated to select the best particles. With an
occupancy grid for the map representation, the obvious way to compare new
measurements against the map is using map correlation. This involves comparing the
values of corresponding cells between the local map (new observations) and the
global map.
[Schultz & Adams 1996] applied two different search algorithms (centre-of-
mass and hillclimber) to finding the best match between the long-term (global) map
and a short-term (local) map. They combined the inputs from a ring of sonar sensors
and a laser range finder with a narrow field of view. Several scans were combined to
produce a short-term map with sufficient ‘maturity’ to instigate a search.
To provide an objective function for comparing maps, Schultz & Adams used
two different methods: Binary match and Product match. The binary match simply
allocates a 1 or 0 depending on whether the cells in the two maps match or not. The
product match takes the product of the occupancy grid values for the two
corresponding cells. However, they did not find any significant difference in the
results produced using these two methods.
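The two objective functions can be sketched as follows. The threshold value and the flat-array representation of the overlapping cells are assumptions for illustration:

```python
import numpy as np

def binary_match(global_cells, local_cells, threshold=0.5):
    """Fraction of overlapping cells whose occupied/free labels agree.

    Each occupancy value is thresholded to occupied or free, then a
    cell scores 1 for a match and 0 otherwise, in the spirit of
    Schultz & Adams' binary match.
    """
    g = global_cells > threshold
    l = local_cells > threshold
    return float(np.mean(g == l))

def product_match(global_cells, local_cells):
    """Sum of products of corresponding occupancy values."""
    return float(np.sum(global_cells * local_cells))
```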
They did note that the search space had large regions in which many
registration poses resulted in the same match scores. In other words, there are many
ambiguous poses and the robot could not determine its pose uniquely. This is a
common problem, and it is aggravated by using small local maps. Unfortunately, the
size of the local map is primarily determined by the sensor range.
Direct comparison of the grid cells in the local and global maps can be
performed using the map correlation function defined in Probabilistic Robotics
(page 174) to calculate weights. This can be expressed by the following equation:
\[
\rho = \frac{\sum_{x,y}\left(m_{x,y}-\bar{m}\right)\left(m_{x,y,local}-\bar{m}\right)}
{\sqrt{\sum_{x,y}\left(m_{x,y}-\bar{m}\right)^{2}\cdot\sum_{x,y}\left(m_{x,y,local}-\bar{m}\right)^{2}}}
\tag{8}
\]
where the average map value is:
\[
\bar{m} = \frac{1}{2N}\sum_{x,y}\left(m_{x,y}+m_{x,y,local}\right)
\tag{9}
\]
In this notation m_{x,y} denotes a particular cell in the global map in real-world
coordinates. The corresponding value in the local map, m_{x,y,local}, depends on the
current pose of the robot. The pose has not been shown explicitly in the formula.
The value of ρ ranges from -1.0 to +1.0 and is calculated only in the area of
overlap. In particular, the values should only be compared in the region that the
sensor might actually observe. However, in some situations such as when the robot is
facing a nearby wall, this can result in a very small number of cells being included in
the calculation.
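A sketch of the correlation calculation, assuming the overlapping cells have already been extracted into flat arrays (with the local cells transformed into global coordinates by the particle's pose):

```python
import numpy as np

def map_correlation(global_cells, local_cells):
    """Correlation rho between overlapping occupancy cells.

    Note that the mean is taken jointly over both maps, and the result
    lies in [-1.0, +1.0]; +1.0 indicates a perfect match. The
    degenerate all-equal case returns 0 to avoid dividing by zero.
    """
    m_bar = np.concatenate([global_cells, local_cells]).mean()  # joint mean
    num = np.sum((global_cells - m_bar) * (local_cells - m_bar))
    den = np.sqrt(np.sum((global_cells - m_bar) ** 2) *
                  np.sum((local_cells - m_bar) ** 2))
    return float(num / den) if den > 0 else 0.0
```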
2.7.9 Update Occupancy Grid
The process for updating an occupancy grid given a local map involves the
application of Bayes Rule to combine the probabilities of occupancy from the local
map with the existing probabilities in the global map. Refer back to section 2.4.7 on
Occupancy Grids.
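In the usual log-odds form this update reduces to an addition per observed cell. A minimal sketch, in which the array shapes and the observation mask are assumptions:

```python
import numpy as np

def update_occupancy_grid(global_logodds, local_probs, observed_mask):
    """Bayes Rule update of a global occupancy grid in log-odds form.

    global_logodds: the global map stored as log-odds (0.0 = unknown);
    local_probs: occupancy probabilities from the latest observation;
    observed_mask: True for the cells the sensor actually observed.
    Adding log-odds is equivalent to applying Bayes Rule under the
    standard cell-independence assumption.
    """
    p = np.clip(local_probs, 1e-6, 1.0 - 1e-6)  # guard against log(0)
    out = global_logodds.copy()
    out[observed_mask] += np.log(p / (1.0 - p))[observed_mask]
    return out

def to_probability(logodds):
    """Convert a log-odds grid back to occupancy probabilities."""
    return 1.0 / (1.0 + np.exp(-logodds))
```

Repeated consistent observations drive each cell's probability towards 0 or 1, while unobserved cells are left untouched.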
2.7.10 Stability of SLAM Algorithms
The various proofs that exist for the solution of the SLAM problem
[Dissanayake et al. 2001, Montemerlo et al. 2003] apply in the limit as time, or
equivalently the number of observations, goes to infinity. Additionally, for a particle
filter, the number of particles must go to infinity. In other words, these algorithms
are asymptotically stable, but this says nothing about their behaviour in the short to
medium term.
The author previously encountered numerical problems with provably stable
adaptive filter algorithms which are conceptually similar to SLAM [Johnson &
Taylor 1980]. Problems such as round-off errors for floating-point numbers can
result in the failure of an algorithm if it is only marginally stable.
Only recently has information started to appear in the literature that suggests
SLAM is not as reliable as has been reported in the past. For instance, [Chang et al.
2007] showed a map that was produced using a Rao-Blackwellised Particle Filter
which was grossly distorted and would have to be classified as a SLAM failure.
These types of maps rarely appear in SLAM papers because there is a tendency to
only report positive results.
Other researchers have begun to challenge the stability of established SLAM
algorithms. In [Julier & Uhlmann 2001], it was shown that full-covariance SLAM
produced inconsistent maps after several hundred updates.
[Castellanos et al. 2004] analysed the consistency of EKF-SLAM and found
that it could fail under certain circumstances. They suggested breaking the map into
smaller sub-maps to avoid possible failures. This approach was extended further in
[Martinez-Cantin & Castellanos 2006] in what they called ‘robocentric local maps’.
Data association is a significant problem for Kalman Filter based SLAM
because incorrect associations can cause the filter to fail. This is well known and has
been mentioned in textbooks [Thrun et al. 2005, Choset et al. 2005] and was
discussed in detail in [Andrade-Cetto & Sanfeliu 2006].
The need for additional localization to improve pose estimates has become
apparent in recent years and has been reported by several researchers. For instance,
[Hähnel 2004] reported that using scan matching was crucial to producing maps that
did not contain inconsistencies.
[Bailey et al. 2006b] investigated the consistency of FastSLAM and came to
the somewhat startling conclusion that ‘FastSLAM in its current form cannot
produce consistent estimates in the long-term.’ They noted, however, that FastSLAM
might work effectively in the short term. This was at odds with [Hähnel et al. 2003],
where FastSLAM was proven to converge.
In a similar paper on EKF-SLAM [Bailey et al. 2006a], the researchers noted
that in their experiments EKF-SLAM would diverge if the true heading uncertainty
exceeded as little as 1.7 degrees. In circumstances where there is a large true heading
variance, the filter can fail suddenly and irretrievably – a catastrophic failure.
These recent papers suggest that the SLAM problem has not been solved in a
practical sense.
2.8 Visual SLAM
Visual SLAM is not a new concept, and many different methods have been
developed for using vision in SLAM.
There are several research threads running through the field of Visual SLAM,
including visual localization and visual odometry. Also, some researchers have
attempted to reconstruct 3D maps, whereas others assumed that the robot moved in
2D on a ground plane and used the homography of the ground plane as a constraint.
2.8.1 Visual Localization
Various approaches to vision-based localization have been reported in the
literature. Two fundamentally different methods are discussed below that rely on
information obtained directly from images:
• Appearance-based; and
• Feature Tracking.
Visual odometry, which is an application of feature tracking, is also discussed
briefly.
2.8.2 Appearance-based Localization
In appearance-based localization the robot captures images of various locations
and then determines where it is later by comparing images. The exact mechanism for
comparing images varies amongst researchers.
[Ulrich & Nourbakhsh 2000a] developed an appearance-based localization
system using an omni-directional camera. However, this was for topological
localization, not metric.
To complicate matters, the term ‘appearance based’ is also used in [Ulrich &
Nourbakhsh 2000b], but in this case they used a monocular colour camera and the
appearance that they refer to is the appearance of the floor. This later paper deals
solely with obstacle avoidance, and does not address localization.
In a widely-cited paper, [Thrun et al. 2000], the robot Minerva used a camera
pointed upwards at the ceiling, plus other sensors. The robot operated in a museum
for two weeks using a map of the ceiling to avoid getting lost. Images from the
camera were matched to the map using Monte Carlo Localization. Note that in this
case the robot had already traversed the environment and taken all of the necessary
pictures before it was required to localize itself.
[Wolf et al. 2005] used an image retrieval system to store images and later
retrieve the closest matches. MCL was also used in this case to select the best match.
Matching images is not a robust approach because images of the same location
taken from different viewpoints may fail to match. In particular, the view of a
location from a substantially different direction, such as looking up or down a
corridor, can be completely different.
Using a catadioptric or omni-directional lens [Nayar 1998] to obtain a full 360°
view can eliminate the problem of differing viewpoints. However, it introduces the
complex operation of matching images that might have been rotated.
One disadvantage of appearance-based approaches is that they require a large
amount of storage for the images, as well as an efficient method for finding the best
match. They can also have problems if the environment is dynamic, or even if the
illumination changes significantly, such as between day and night.
Variations on this approach might not require storage of entire images, but are
still based on what a scene looks like from a particular pose. For example, [Prasser,
Wyeth & Milford 2004] used the complex cell model where input images were
processed through Gabor filters to extract salient features. Template matching was
used to identify previously seen locations. The data was fed into a RatSLAM system
[Milford, Wyeth & Prasser 2004] which produced a topological map.
2.8.3 Feature Tracking
Tracking features in images is an area that has received a considerable amount
of attention.
At the macro level, the features used for tracking are usually corners or other
‘good features to track’ [Shi & Tomasi 1994]. Several other feature detectors have
been invented, such as the Moravec Interest Operator, SUSAN [Smith & Brady
1995] and the Harris Corner Detector [Harris & Stephens 1988].
As early as 1988, [Ayache & Faugeras 1988] developed methods for
combining multiple stereo images using depth information and an Extended Kalman
Filter to determine the 3D motion of a robot in a static environment. The matching
was based on horizontal and vertical edge segments extracted from the images.
Stereo imaging is a common approach to SLAM.
Monocular vision was used in [Bouguet & Perona 1995] with a recursive
motion estimator called the Essential Filter applied to features in image sequences of
around 2000 frames. They discuss how sampling frequency and the number of
features for tracking affect the accuracy of estimated motions. It is interesting that
they assumed a measurement noise of 1 pixel.
MonoSLAM [Davison 1999, Davison 2003] used features in the images from a
monocular camera which it tracked with an Extended Kalman Filter (EKF). The
examples showed how a smoothly-moving hand-held camera could have its ego-
motion tracked and build a 3D map. Test runs were only short – 20 seconds. The Shi
and Tomasi interest operator was used to select features, and then these were tracked
using a normalized Sum of Squared Differences correlation between frames in order
to estimate their 3D locations. No ground truth information was presented. Note that
the reported examples of MonoSLAM have not been for robots exploring
autonomously.
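The kind of patch correlation referred to here can be sketched as follows. This is a generic normalised SSD, not Davison's exact formulation:

```python
import numpy as np

def normalized_ssd(patch_a, patch_b):
    """Normalised sum-of-squared-differences between two image patches.

    Each patch is first normalised to zero mean and unit variance, so
    the score is insensitive to uniform brightness and contrast
    changes; 0 indicates a perfect match and larger values indicate
    worse matches.
    """
    a = (patch_a - patch_a.mean()) / (patch_a.std() + 1e-9)
    b = (patch_b - patch_b.mean()) / (patch_b.std() + 1e-9)
    return float(np.sum((a - b) ** 2))
```

In a tracker, this score would be evaluated over a small search window around each feature's predicted position, and the lowest-scoring location accepted as the match.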
Davison noted that the selected features can often be poor choices, such as
depth discontinuities or shadows and reflections which move as the robot changes its
vantage point. The solution was simple: leave these ‘bad’ features alone and let the
tracking algorithm discard them when suitable matches could not be found in
subsequent images.
MonoSLAM requires a high frame rate, or correspondingly small motions
between frames, in order to be able to track features. A wide-angle lens also helped
MonoSLAM by allowing features to remain in view for longer [Davison et al. 2004].
In feature-based SLAM the features are normally points. MonoSLAM has been
extended to use straight lines as well as point features [Smith et al. 2006]. Their main
contribution was a fast straight line detector.
One of the popular feature extraction algorithms in recent years has been SIFT
(Scale-Invariant Feature Transform) [Lowe 2004]. Sample code for SIFT can be
downloaded from Lowe’s web site at http://www.cs.ubc.ca/~lowe, but the
algorithm is the subject of a patent.
SIFT produces 128-byte ‘keypoints’ which should be unique. They are
typically corner-type features and therefore SIFT works best in cluttered
environments. In empty corridors, SIFT might not detect many features.
[Se et al. 2001, Se et al. 2002] applied the SIFT algorithm to the problem of
determining 3D motion by using a ‘Triclops’ stereo system with three cameras.
Localization was accomplished by a least-squares minimization of the errors
between matched landmarks. Again, a Kalman filter was used to handle the noise.
Hundreds of match points were used per image, with the database eventually
containing many thousands of match points. The example given was for a very small
area.
The V-GPS system [Burschka & Hager 2003] used a set of landmarks along a
path that were manually selected by the user during a ‘teaching step’. This database
of known image locations was used to triangulate and determine the camera’s
position. The authors pointed out that the tracking ability of the system was limited
by the frame rate and suggested that 30 Hz was desirable.
In a subsequent paper [Burschka & Hager 2004], the V-GPS system was
extended to perform SLAM as well. Features were automatically selected by the
system. The authors claimed that the system could operate on as few as three tracked
features, and showed convergence for pure translations and rotations. In effect, the
system implemented vision-based odometry. No results were presented for large test
runs.
[Miro, Dissanayake & Zhou 2005] developed a vision-based SLAM system
that used a stereo camera, with SIFT features providing the data association
between the landmarks from which range and bearing information was determined.
The vSLAM system developed by Evolution Robotics was based on SIFT for
feature detection. It was discussed in a two-part paper [Karlsson et al. 2005] and
[Goncalves et al. 2005]. vSLAM was one of the first commercial applications of
Visual SLAM and combined both vision and odometry based SLAM. The robot built
a map consisting of landmarks in a typical home environment, but made no attempt
to build a metric map.
The paper contains several useful pointers for feature-based Visual SLAM. It
was noted that the environment must contain ‘sufficient amount of visual features’,
although this was not quantified. It was also pointed out that more than one camera is
advantageous because of the different viewpoints which increase the chance of
reliably locating features from any given pose. A Landmark Database Manager was
required to manage the large amount of data obtained from features and prune poor
landmarks to save space.
One interesting point was the statement that the basic particle filter performs
poorly if the motion model generates too few hypothetical paths in regions where the
probability of a particular pose based on the current sensor reading is large. In other
words, if the filter generates motions that are basically random and do not fit the
observed features, then many of the particles will be ineffective. The paper suggests
using 200 particles and splitting them into 180 based on the motion model and 20
using the measurement model.
The system was tested in an indoor environment and shown to produce
reasonably accurate results. An interesting aspect of this work was that a SICK Laser
Range Finder was also used to gather data and perform SLAM. The vision system
did not perform as well as the LRF, although the differences are not clearly
quantified.
The paper stated that ‘monocular sensors can not directly retrieve depth
information from the scene’. However, if there are some constraints on the geometry,
such as a known (and correctly observable) ground plane, or if multiple frames are
combined, then depth information can be recovered for certain points in the image,
such as the intersection of the floor with obstacles.
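For the ground-plane case, the depth recovery can be sketched with a pin-hole model. All the parameter names here are assumptions, and the thesis's own Inverse Perspective Mapping uses the full calibrated camera geometry rather than this level-camera simplification:

```python
def floor_distance(row, cam_height, focal_px, horizon_row):
    """Distance to a floor point from its image row.

    For a level, forward-looking pin-hole camera at height cam_height
    above a flat floor, a floor pixel `drop` rows below the horizon
    satisfies drop = focal_px * cam_height / Z, so the distance is
    Z = focal_px * cam_height / drop.
    """
    drop = row - horizon_row        # rows below the horizon line
    if drop <= 0:
        return float('inf')         # at or above the horizon: not floor
    return focal_px * cam_height / drop
```

The reciprocal relationship also shows why range accuracy degrades with distance: far floor points are compressed into the few pixel rows just below the horizon.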
[Sim et al. 2006] used a Rao-Blackwellised Particle Filter (RBPF) on SIFT
features obtained using a stereo camera to obtain 3D maps. In one experiment, two
rooms of a laboratory were mapped. The resulting map contained 184,211 3D
landmarks and was constructed in real-time using only 100 particles on a 3.2GHz PC
with 4GB of memory. However, up to 2,000 particles could be used in real-time.
It is important to note that Sim et al. used weight normalization and adaptive
resampling to avoid starvation of the filter. They also found that visual odometry
outperformed dead reckoning.
In [Eade & Drummond 2006] the authors claim to be the first to apply a
FastSLAM-type particle filter to single-camera SLAM. They point out that, in
general, a single camera is a bearing-only sensor. However, Structure From Motion
(SFM) techniques allow the recovery of distance information as well. The selected
features were based solely on how distinctive they were, not on their significance in
the real world. They noted that a very large number of features were selected in a
short space of time. The system was tested with 50, 250 and 1,000 particles and it
was found that using more than 50 particles did not produce significant
improvements. Very limited testing was performed in the paper, and the
experimental data provided was for a small area which was feature-rich. Loop
closing over large trajectories was not evaluated.
A similar approach using SIFT and Monte Carlo Localization (MCL) was used
in [Bennewitz et al. 2006]. From each image they extracted 150-500 SIFT keypoints,
resulting in hundreds of thousands of points in the course of an experiment. The
authors discuss the use of the ‘number of effective particles’ to avoid flushing good
particles due to spurious observations. A total of 800 particles were used. The
experiments reported were for both a wheeled robot and a humanoid robot, although
the distances travelled were only short (20m and 7m respectively).
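The 'number of effective particles' mentioned above has a standard form, sketched here as a minimal illustration (not code from the cited work):

```python
def effective_particles(weights):
    # N_eff = 1 / sum(w_i^2) over the normalized importance weights.
    # N_eff is close to N when the weights are even and collapses towards 1
    # when a few particles dominate. Resampling only when N_eff falls below
    # a threshold (e.g. N/2) avoids flushing good particles because of a
    # single spurious observation.
    total = sum(weights)
    return 1.0 / sum((w / total) ** 2 for w in weights)
```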
2.8.4 Visual Odometry
The concept of visual odometry occurs frequently in the literature. In this
approach, the motion of the robot is estimated from successive camera images
and integrated over time to obtain pose information.
On the micro level (small patches of pixels), tracking is used to determine
optic flow [Beauchemin & Barron 1995]. The optic flow field describes the motion
of pixel patches in the image. The changes in the image can result either from the
ego-motion of the camera or from moving objects, but it is generally assumed that
the environment is static. Reliable optic flow requires either small motions between
frames or a high frame rate.
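The small-motion requirement can be seen in a minimal, deliberately naive sum-of-squared-differences block matcher (illustrative only; real optic flow implementations are considerably more sophisticated [Beauchemin & Barron 1995]). `prev` and `curr` are assumed to be 2D arrays of grey values:

```python
def patch_flow(prev, curr, y, x, patch=3, search=2):
    # Estimate how one small pixel patch moved between two grey-level
    # frames by exhaustive SSD block matching. The true displacement is
    # only found if it lies within `search` pixels of zero, which is why
    # optic flow needs small inter-frame motions or a high frame rate.
    def ssd(dy, dx):
        return sum((prev[y + py][x + px] - curr[y + py + dy][x + px + dx]) ** 2
                   for py in range(patch) for px in range(patch))
    return min(((dy, dx) for dy in range(-search, search + 1)
                for dx in range(-search, search + 1)),
               key=lambda d: ssd(*d))
```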
Visual odometry using a cheap monocular camera was addressed in [Campbell
et al. 2005]. Basically, the camera looked at the ground and tracked texture features.
The odometry system was tested on different robots in a very wide variety of indoor
and outdoor environments using commercial web cameras. Motions included
simultaneous translation and rotation. The key component was optic flow field
estimation which was performed from 2 to 10 times per second. One of the
advantages of this approach was that precipice detection could also be performed.
No attempt was made in this paper to perform localization or mapping.
However, it is significant because of the quality of the odometry and its robustness to
illumination.
2.8.5 Other Approaches
As early as 1999, researchers were comparing vision and lasers for localization
[Perez et al. 1999]. They noted that maps provide plenty of information for lasers to
correlate against, such as walls, but little information for vision. This is because they
only used vertical features such as door frames for visual comparisons.
In order to obtain range information from vision, Perez et al. used a trinocular
system which had a complicated calibration procedure. They also noted that laser
range information was more accurate than vision, and required virtually no overhead
to compute. However, laser range scans are limited in that they cannot provide any
additional information for matching, such as colour.
Visual Sonar [Lenser & Veloso 2003] detected obstacles by observing
occlusions of the floor on the assumption of a small set of consistent floor colours. In
particular, the system was designed for robot soccer where the playing field is well
defined. It was a purely reactive system and did not attempt to build maps.
Active vision, where the robot seeks to maximize the information gained, was
used by [Davison & Murray 2002]. The robot used a stereo head to obtain depth
information for selected features. The head could swivel to locate good features.
They noted two failure modes in their SLAM system, which was based on an
Extended Kalman Filter: data association errors and nonlinearities in the system.
The reported results were for very short test runs with a tethered robot.
The use of colour from video images was a key aspect of work on increasing
the resolution of 3D laser-generated maps [Andreasson et al. 2006]. Visual
information was used to interpolate between laser range scans because laser
information is relatively sparse compared to images.
2.9 Exploration
To build a complete map of the surrounding environment, as discussed above,
requires a better approach than simply wandering around at random. A random walk
will eventually cover the entire environment, but it is not an efficient method of
exploration.
Note that the exploration algorithm is an entirely separate issue from SLAM,
although certain exploration approaches can improve the quality of SLAM by
revisiting areas previously explored. This helps to increase the certainty of the map.
Selecting an ‘optimal’ path depends on the goal which, by definition, is
unknown during exploration. Therefore all exploration algorithms are a compromise
and can never be truly optimal.
Exploration requires two basic tasks to be performed:
• Path Planning; and
• Remembering what has been explored (mapping).
There is a multiplicity of path planning algorithms and this chapter reviews
them briefly, with more detail on the selected algorithm. Note that once exploration
is complete, path planning is also required for performing useful tasks.
Obstacle avoidance is also a key requirement for any robot motion, including
exploration. Exploration algorithms invariably include obstacles in their path
planning, so obstacle avoidance is a natural consequence – at least for static
obstacles, not necessarily dynamic obstacles.
The robot cannot plan paths inside unknown (unexplored) space, so it must
periodically re-evaluate its trajectory. The frequency of re-planning can be a
parameter of the algorithm and will be affected by the range of the sensors.
Autonomous exploration is an important capability for a robot. However, in
much of the literature the robot has been teleoperated, even for work dealing with
SLAM [Hähnel 2004]. SLAM research on its own does not require autonomous
exploration.
Note that many of the log files provided on Radish [Radish 2001], an online
repository for mapping log files, were created by manually driving the robots around.
In other words, some of the robots did not use an exploration algorithm even though
they might have been doing SLAM in real time. The key point here is that SLAM
does not imply exploration, or vice versa.
2.9.1 Overview of Available Methods
Many exploration algorithms consider the problem as one of determining the
optimal path from the robot’s current location to unknown space, where optimal is
generally considered to be the shortest distance. For a grid or lattice map
representation, this can be done using the classic Dijkstra or A* [Hart et al. 1968]
search algorithms. These treat the grid as a graph and determine the minimum
distance path by building a search tree. The search can be either depth-first or
breadth-first. The A* algorithm uses the straight-line distance to the goal as a
heuristic rule to guide the exploration, although the robot will often have to deviate
from this path to travel around obstacles. These algorithms generate paths that take
the robot close to corners.
[Kuffner 2004] offers some methods to improve these searches by reducing the
number of neighbouring nodes that are considered at each expansion of the path.
However, he notes that these methods are ‘generally impractical when dealing with
large grids’. This problem arises due to the recursive nature of the approach.
The A* algorithm is for static environments. It was extended to dynamic
environments by [Stentz 1994] and called the D* algorithm. Basically his approach
is to keep the original A* path and make local detours when cells are found to be
impassable due to dynamic obstacles.
For a good explanation of the A* and D* algorithms, including examples and
pseudo-code, see Appendix H in [Choset et al. 2005].
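As a brief illustration of the search described above, the following is a minimal A* sketch on a 4-connected grid with unit step costs and a Manhattan heuristic. It is illustrative only, not code from any of the cited works:

```python
import heapq

def astar(grid, start, goal):
    # A* over a 4-connected occupancy grid (0 = free, 1 = obstacle),
    # using the Manhattan distance to the goal as an admissible heuristic
    # to guide the search, as in [Hart et al. 1968].
    rows, cols = len(grid), len(grid[0])
    def h(p):
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    frontier = [(h(start), 0, start, [start])]
    visited = set()
    while frontier:
        f, g, pos, path = heapq.heappop(frontier)
        if pos == goal:
            return path
        if pos in visited:
            continue
        visited.add(pos)
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = pos[0] + dy, pos[1] + dx
            if 0 <= ny < rows and 0 <= nx < cols and grid[ny][nx] == 0:
                heapq.heappush(frontier, (g + 1 + h((ny, nx)), g + 1,
                                          (ny, nx), path + [(ny, nx)]))
    return None  # goal not reachable
```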
Frontier-based [Yamauchi et al. 1997] or wavefront methods are closely
related to these graph-based searches. The principle is easy to understand – an
advancing ‘wave’ visits each of the accessible cells in the map in much the same
way that a wave washes up on the beach and sweeps around obstacles. The ‘frontier’
is the interface between known and unknown space which advances as the robot
moves through the environment. Clearly the robot does not have to visit every cell in
order to determine whether it is occupied or not, depending on the range of its
sensors.
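Locating the frontier itself is straightforward on an occupancy grid, as the sketch below shows. The 0/1/2 encoding for free/obstacle/unknown is an assumption made for illustration:

```python
FREE, OBSTACLE, UNKNOWN = 0, 1, 2

def frontier_cells(grid):
    # A frontier cell is a known-free cell with at least one unknown
    # 8-connected neighbour: the boundary between explored and
    # unexplored space towards which the robot is sent.
    rows, cols = len(grid), len(grid[0])
    def touches_unknown(y, x):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if 0 <= ny < rows and 0 <= nx < cols and grid[ny][nx] == UNKNOWN:
                    return True
        return False
    return [(y, x) for y in range(rows) for x in range(cols)
            if grid[y][x] == FREE and touches_unknown(y, x)]
```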
Jarvis was one of the first proponents of using the Distance Transform (DT) as
an exploration algorithm [Jarvis 1984]. This is the algorithm that was selected, and it
is explained in detail later in section 2.9.4.
[Zelinsky 1992] designed an algorithm for path planning in an unknown
environment using quadtrees and distance transforms. Quadtrees are based on grids,
but they can be represented in the form of directed graphs. This allows conventional
graph searching techniques to be used.
Greedy Mapping is a simple exploration algorithm that always moves the robot
to the nearest unknown cell in the occupancy grid [Koenig et al. 2001]. The authors
address the lower and upper bounds on the distance travelled. Although Greedy
Mapping is not optimal, it is far from the worst case.
A common factor here is that all these methods depend on the connectivity
between nodes, which is either 4-way or 8-way for a square grid, and assign a cost to
the transition from one node to another. The cost can simply be the Euclidean
distance. However, the cost could also include other metrics. For instance, the energy
requirements are important for a robot running on a battery; elapsed time is
important for search and rescue; and safety (proximity to obstacles) might be
important if collision avoidance is a high priority.
Field based algorithms using an ‘artificial potential field’ were pioneered by
Khatib [Khatib 1985] who did a lot of the early work on robotic arms and
manipulators. [Borenstein & Koren 1989] took this idea and created Virtual Force
Fields which were a combination of Certainty Grids and Potential Fields.
One of the major problems with field-based algorithms is that they can become
trapped in local minima. Many exploration algorithms are susceptible to ‘local
minimum traps’ such as a U-shaped obstacle field. This was recognised from the
outset in [Borenstein & Koren 1989] and they provided some heuristics to escape
from these traps.
Even concurrently with this thesis, researchers were still publishing work
related to solving the local minimum problem. For instance, it might be necessary to
introduce some randomness to get out of the traps or simply to use wall following to
find the way out [Hong et al. 2006].
Exploration can also be based on pre-programmed behaviours. Sim and Dudek
examined a variety of geometrical search patterns for exploration referred to as
Seedspreader, Concentric, Figure of Eight, Random, Triangle, and Star [Sim &
Dudek 2003].
Sim and Dudek used features in the images in order to learn what the view
from a particular location looked like. This could then be used later for localization.
This is a common approach, but basing localization on appearance makes the
assumption that the environment will always look the same whenever it is revisited.
This is clearly not true in environments inhabited by humans who like to rearrange
the furniture.
The Random pattern performed well relative to the others. This is attributed to
the fact that the random path re-traversed previously seen territory, thereby reducing
the accumulated error by allowing the robot to re-localize itself.
The results are inconclusive, but Sim and Dudek do make the point that their
visual mapping is appearance-based and as such it does not take advantage of the
metric information available from the images.
2.9.2 Configuration Space
The first step in many exploration algorithms is to convert the current global
map to Configuration Space, or C-Space, as explained below. The objective of C-
Space is to provide a buffer between obstacles and the robot.
A C-Space map is created by expanding the obstacles in the map by the radius
of the robot so that it is not possible for the robot to collide with an obstacle as it
follows a path. Figure 16 (b) is a Configuration Space map after the robot has
finished exploring. The map uses the standard convention for occupancy grids where
white represents free space, black is obstacles and grey is unknown. The actual map
is shown in Figure 16 (a) for comparison. Notice how the walls are much thicker in
C-Space.
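The obstacle expansion can be sketched in a few lines (illustrative only; the radius is given in grid cells and a circular robot footprint is assumed):

```python
def to_cspace(grid, radius):
    # Configuration-space map: every obstacle cell (1) is grown by the
    # robot's radius, given here in grid cells, so that the path planner
    # can treat the robot as a single point.
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for y in range(rows):
        for x in range(cols):
            if grid[y][x] == 1:
                for ny in range(max(0, y - radius), min(rows, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(cols, x + radius + 1)):
                        if (ny - y) ** 2 + (nx - x) ** 2 <= radius * radius:
                            out[ny][nx] = 1
    return out
```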
Figure 16 – Configuration Space: (a) Actual map; and (b) C-Space map
2.9.3 Complete Coverage
In certain situations it is important for the robot to visit every free cell in
the map; the paths that achieve this are called paths of complete coverage, and
they have been developed using distance transforms [Zelinsky et al. 1993].
Applications such as vacuum cleaning [Oh et al. 2004] or
lawn mowing require the robots to traverse every accessible cell in the map to
complete their task.
However, complete coverage is not necessary when simply building a map.
Using vision as a range sensor allows the robot to see areas of the map without
physically visiting them, thereby avoiding the need to traverse large areas of the
map. A good range sensor can also avoid many potential traps and much wasted time
by allowing the robot to see dead ends before entering them. Furthermore, the robot
might also be able to see into cells that are not physically accessible because it is too
large to fit through a gap.
2.9.4 The Distance Transform (DT)
The Distance Transform (DT) was originally designed for use in image
processing [Rosenfeld & Pfaltz 1966], where an image is defined as a regular
grid of pixel values and the 'distance' between areas of the image needed to
be calculated.
The basic concept of a Distance Transform is simple. Consider the map as a
pool of water with the obstacles sticking out of it. Throw a pebble into the pool and
take snapshots as the ripples spread out. The advancing wave front will wrap around
obstacles. Ignore any ‘reflections’ that occur. As time progresses the wave front
moves further away from its origin – the distance increases. All points on the wave
front at any given instant in time are equidistant from the origin in the distance
transform sense, but not in the direct Euclidean sense of a straight line between two
points.
Frontier-based exploration [Yamauchi 1997] is conceptually similar to the
distance transform approach in that it locates the advancing frontier as the boundary
between open space and uncharted territory. However, Yamauchi uses a depth-first
search of the grid to find the shortest obstacle-free path to a frontier.
There are various different Distance Transform algorithms that can be used. In
particular, the distance measure can be changed. [Shih & Wu 2004] discuss
commonly used measures such as the Manhattan or City Block distance (4-
connected), Chessboard distance (8-connected) and Euclidean distance.
The main application of the Distance Transform has been in medical imaging.
However, Jarvis [Jarvis 1984] recognized that the Distance Transform could be used
as an exploration algorithm. For this purpose, all of the unknown space cells in the
map are marked as goals for the transform. The DT is performed on the C-Space
map, and can be followed from the robot’s current location to the nearest goal,
thereby finding the shortest path to the nearest unknown space.
Since Jarvis' early work, Distance Transforms have been used widely for path
planning in exploration [Murray & Jennings 1997, Wang et al. 2002].
2.9.5 DT Algorithm Example
For a square grid, there are both 4-connected and 8-connected versions of the
DT. The 8-connected version was chosen, which requires that the robot be able to
execute two types of moves – move forwards the distance of one grid cell (for north-
south and east-west moves), and move the square root of two times the cell size for
diagonal moves. Moving backwards is not necessary if the robot can rotate by 180
degrees.
Therefore, the use of the DT only requires the robot to rotate on the spot and
move forward by two different (but fixed) distances. There is no requirement to
move in real time or to make steering decisions on the fly. This is a significant
advantage of the piecewise linear approach because it is not dependent on the
computational resources available. Clearly, the fastest possible processor is desirable
so that the robot does not spend too much time ‘thinking’, but it is not essential to
the task.
To perform the DT, obstacles in the map are set to ‘infinity’, and hence
ignored as possible nodes on the path during the calculations. Free space is filled in
with a ‘big’ value, which is necessary so that the transform will correctly update the
cells. It is assumed that the total cost (distance) in a cell can never reach ‘big’, but
this is a fair assumption for practical purposes even when using a 32-bit integer for
the distance. Unknown space, which is the goal, is filled in with zeros.
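This initialisation and the subsequent wavefront relaxation can be sketched as follows. It is a minimal illustration: the 0/1/2 cell encoding is an assumption, and a unit diagonal step cost gives the chessboard metric, whereas a cost of the square root of two would better approximate Euclidean distance:

```python
import heapq

BIG = 10 ** 9            # stand-in for the 'big' free-space value
OBSTACLE = float('inf')  # obstacles are never relaxed

def distance_transform(grid):
    # grid codes: 0 = free, 1 = obstacle, 2 = unknown (the goal cells).
    # Unknown cells start at 0, free cells at BIG, obstacles at infinity;
    # an 8-connected wavefront then relaxes the free cells outward from
    # the goals, wrapping around obstacles.
    rows, cols = len(grid), len(grid[0])
    dist = [[OBSTACLE if grid[y][x] == 1 else (0 if grid[y][x] == 2 else BIG)
             for x in range(cols)] for y in range(rows)]
    heap = [(0, y, x) for y in range(rows) for x in range(cols)
            if grid[y][x] == 2]
    heapq.heapify(heap)
    while heap:
        d, y, x = heapq.heappop(heap)
        if d > dist[y][x]:
            continue  # stale entry
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (dy or dx) and 0 <= ny < rows and 0 <= nx < cols \
                        and dist[ny][nx] != OBSTACLE and dist[ny][nx] > d + 1:
                    dist[ny][nx] = d + 1
                    heapq.heappush(heap, (d + 1, ny, nx))
    return dist
```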
Table 1 – Sample Data from a Distance Transform
Consider the data shown in Table 1 which is a section from a DT (and
corresponds to the outlined region in Figure 17). The values in the cells are the
‘distance’ measure. The hash signs (##) indicate a numeric overflow in the
display: these cells hold the very large values that mark obstacles.
0 0 0 0 1 1 2 2 3 ## ## ## 0 0 0 0 0 ## ##
## ## ## ## 2 2 3 3 4 ## ## ## 0 0 0 0 0 0 1
## ## ## ## 3 3 4 4 5 ## ## ## 0 0 0 0 0 0 0
## ## ## ## 3 4 5 5 6 7 ## ## ## ## ## 0 0 ## 0
## ## ## ## 2 3 4 5 6 7 8 ## ## ## ## ## ## ## ##
0 ## ## ## 1 2 3 4 5 6 7 ## ## ## ## ## ## ## ##
0 0 0 0 0 1 2 3 4 5 6 ## ## ## ## ## ## ## ##
0 0 0 0 0 0 ## ## ## 6 7 8 9 10 11 12 13 14 ##
0 0 0 0 0 0 ## ## ## 7 8 9 10 11 12 13 14 15 ##
0 ## ## 0 ## ## ## ## ## 8 9 10 11 12 13 14 15 16 ##
## ## ## ## ## ## ## ## ## 9 10 11 12 13 14 15 16 17 ##
## ## ## ## ## ## ## ## ## 10 11 12 13 14 15 16 16 17 ##
## ## ## ## ## ## ## ## ## 11 12 ## 14 15 ## 16 15 16 ##
## ## ## ## 12 11 10 11 12 12 ## ## ## ## ## ## 14 15 ##
## ## ## ## 11 10 9 10 11 12 ## ## ## ## ## ## 13 14 ##
## ## ## ## 10 9 8 9 10 11 ## ## ## ## ## ## 12 13 ##
## ## ## ## 9 8 7 8 9 10 ## ## ## ## ## 10 11 12 ##
## ## ## ## ## 7 6 7 8 9 8 7 6 7 8 9 10 11 12
## ## ## ## ## ## 5 6 7 8 7 6 5 6 7 8 9 10 11
0 ## ## ## ## ## 4 5 6 7 6 5 4 ## ## ## ## ## ##
0 ## ## ## ## ## 3 4 5 6 5 4 3 ## ## ## ## ## ##
The DT ‘wave’ emanates from the zero cells (unknown space) and travels
‘uphill’ until it encounters an obstacle and has nowhere else to go. This is easy to see
– look down diagonally from the top left-hand corner of the table. In the top left, the
values are zero. There is a ‘channel’ between two obstacles where the values start to
increase 1, 2, 3.
To help visualise the DT, Figure 17 shows a 3D view of the cell values. The
area inside the (red) rectangle in the top left corner corresponds to the data shown in
Table 1. Valleys and ridges are quite apparent. The obstacles are the flat areas that
are highlighted in a darker colour.
Figure 17 – 3D view of a Distance Transform
Following a steepest descent path through the DT from the robot’s current
location (or in fact starting from any non-obstacle cell) will lead to the nearest goal,
which is unknown space. To reiterate, from a single DT it is possible to create a path
from anywhere (except inside an obstacle) to the nearest unknown space just by
‘rolling downhill’.
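The 'rolling downhill' procedure can be sketched directly (illustrative only; `dist` is assumed to be a completed distance transform as described above, with obstacles set to infinity and unknown space set to zero):

```python
def descend(dist, start):
    # Follow steepest descent through a completed distance transform:
    # from any non-obstacle cell, repeatedly step to the lowest-valued
    # 8-connected neighbour until a zero cell -- the nearest unknown
    # space -- is reached.
    rows, cols = len(dist), len(dist[0])
    y, x = start
    path = [start]
    while dist[y][x] > 0:
        value, (ny, nx) = min(
            (dist[j][i], (j, i))
            for j in range(max(0, y - 1), min(rows, y + 2))
            for i in range(max(0, x - 1), min(cols, x + 2))
            if (j, i) != (y, x))
        if value >= dist[y][x]:
            break  # no lower neighbour: cannot happen on a valid DT
        y, x = ny, nx
        path.append((y, x))
    return path
```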
2.10 Summary
This chapter has outlined a wide range of different subject areas that must be
addressed in order to perform mapping using vision. Although there has been a lot of
work on mapping using sonar and laser range finders, there is comparatively little
work in the literature on using only monocular vision to produce metric maps of
unknown environments.
3 Design and Experimentation
‘Vision is the process of discovering from images
what is present in the world, and where it is.’
– David Marr,
Vision, 1982, page 3
3.1 Methodology used in this Research
The basic research methodology used in this thesis was iterative
experimentation, a process sometimes referred to as action research. The design was
developed by a process of successive refinement based on prototypes using both real
robots and simulation. In principle this is similar to the approaches used today in
some software design methodologies.
A consequence of this methodology was that preliminary results were
incorporated into subsequent designs. Therefore this chapter includes some
intermediate results, but Chapter 4 – Results and Discussion includes the complete
list of significant results.
3.2 Research Areas Investigated
As outlined in Chapter 1, there were several areas that required investigation in
order to build a complete system for visual mapping. In brief, the main areas were
controlling the robot; extracting range information from the video stream; building
local maps; exploring the environment; and putting this all together with
simultaneous localization and mapping.
This chapter discusses each of the research areas in addition to outlining the
experimental environment. Due to the wide range of topics that must be covered, the
discussion might appear to be disjointed at first glance. However, they follow a
logical progression starting from the robots and ending at the ultimate objective
which was a complete visual SLAM system:
• The robots used, including simulation, and the environments;
• Controlling robot motion and relevance to the problem at hand;
• Aspects of computer vision relevant to the tasks of navigation and
mapping;
• Navigation based on vision;
• Mapping, including Inverse Perspective Mapping and the Radial
Obstacle Profile used for producing local maps, and combining local
maps into global maps;
• Exploration using the Distance Transform;
• Implementing localization to determine the robot’s pose; and
• Simultaneous Localization and Mapping.
3.3 The Robots
In the course of the research, several different types of robots were used:
• K-Team Hemisson;
• Yujin Soccer Robot;
• Simulated robot written by the author;
• Simulated robot in Microsoft Robotics Developer Studio; and
• Dr. Robot X80.
Early proof-of-concept work was done in a custom-built ‘playing field’ using a
Hemisson robot with a wireless colour camera attached to it. Unfortunately, the
Hemisson required a serial cable, or umbilical, which proved to be a serious
drawback. Later the research moved to a Yujin robot because it had a wireless
connection.
A simulation was developed in order to speed up experiments and provide
more strict control over the environment. If an algorithm does not work in
simulation, it is unlikely to work in the real world. A simulator can therefore save a
significant amount of time.
The X80 robot was roughly 35cm in diameter and ‘human sized’ in the sense
that it could operate in the corridors of typical office buildings.
3.3.1 Hemisson Robot
Hemisson robots are manufactured by K-Team in Switzerland [K-Team 2006],
the same company that makes the popular Khepera robot. It was intended to be a
cheap robot for the education market. This robot was selected for its low price and
size, making it suitable for use in a prototype. The robot is shown in Figure 18.
Figure 18 – Hemisson robot with Swann Microcam
The Hemisson was roughly 10cm in diameter and powered by a 9V battery.
Figure 18 also shows the wireless Swann Microcam, which was smaller than the 9V
battery that it ran off. Notice the serial cable attached to the back of the robot (left-
hand side of the photo).
For testing purposes, two ‘playing fields’ were built using 120cm x 90cm
sheets of blackboard Masonite for the floor. The intention was that the floor would
be matte black in colour. The walls around the playing fields were painted white.
These playing fields could be populated with various objects to represent obstacles.
Figure 19 shows one of the playing fields with the Hemisson and some obstacles.
(The curvature of the walls in the photo was due to the camera lens).
Figure 19 – Playing field showing Hemisson robot and obstacles
The Hemisson communicated with a PC via a serial cable that is clearly visible
in Figure 18. This proved to be a major problem because the Hemisson was not
powerful enough to pull the cable and it tended to get entangled with the obstacles.
3.3.2 Yujin Robot
Switching to using the Yujin Soccer robot shown in Figure 20 eliminated the
problem with the umbilical cable because it uses a wireless connection. The antenna
is visible at the back of the robot. Another Swann Microcam was attached to the
Yujin robot to supply a video stream.
Figure 20 – Yujin robot with Swann Microcam
Yujin robots were used in the MiroSot Robot Soccer League and therefore their
size was strictly regulated – they were 7cm cubes. The Yujin robot was controlled
from a PC using a radio modem. This provided autonomy, whilst still keeping the
small form factor that was necessary to work in the artificial playing field
environments.
For the Yujin, a slightly more complex environment was created and a
simulator was also written based on this. An example of the environment is shown in
Figure 21.
Figure 21 – Playing field using Lego Duplo obstacles
There are two important aspects of this field:
1. An additional wall (just left of centre) prevented the entire field from
being visible from a single location; and
2. There was a wall in the top-right corner at a 45° angle.
The size of the obstacles was of a comparable scale to the size of the robot.
However, this environment was sparsely populated compared with many real world
environments. It was also relatively small.
3.3.3 Simulation
Based on measurements of the Yujin playing field and the robot geometry, a
simulated environment was first created in VRML (Virtual Reality Modeling
Language), and then translated to OpenGL. The simulator generated the view from a
camera mounted on a robot located inside the playing field.
Figure 22(a) shows a typical simulated camera image. A top-down view,
created by moving the camera viewpoint to an overhead position, is shown in Figure
22(b). This confirmed that the simulated environment matched the real environment.
Figure 22 – Simulator view: (a) From robot, and (b) Top-down view
In June 2006, Microsoft announced the development of Microsoft Robotics
Studio (MSRS). Several beta versions (called Community Technology Previews)
followed in quick succession throughout the second half of 2006. The first official
release was in December 2006.
One of the key components of MSRS was a Simulator which included a
physics engine so that the simulation included gravity, friction, inertia, and so on.
This simulator was much more sophisticated than the one that had been developed
for this thesis and it was available free of charge. Based on prior work for the
simulation in this research, the author wrote a Maze Simulator [Taylor 2006a] for
MSRS. The Maze Simulator allowed simulated robots to roam around an
environment very similar to the one that had already been implemented.
In April 2008, Microsoft renamed MSRS to Microsoft Robotics Developer
Studio (MRDS). The Maze Simulator was used in a textbook on MRDS written by
the author and one of the members of the Microsoft Robotics Team [Johns & Taylor
2008].
An example of the MRDS simulation is shown in Figure 23. Two views are
shown – one from a camera located on top of a simulated Pioneer 3DX robot and the
other a view from above. The simulated environment was built from a map image
that could be specified in a configuration file, so it was easy to change.
Figure 23 – Simulated environment using Microsoft Robotics Developer Studio: (a) Robot view, and (b) Top-down view
The Pioneer 3DX in the simulation has a webcam and a Laser Range Finder.
Further work was also done on a simple example that explored the environment,
called ExplorerSim, and created maps using the LRF. This was released with the
textbook as well.
3.3.4 X80 Robot
In the final phase of the research, it was necessary to move from the playing
fields to a real human environment. For this purpose the author purchased an X80
robot from Dr. Robot Inc. in Canada [Dr Robot Inc. 2008b]. This robot was selected
because it had a built-in pan and tilt camera and a wireless Ethernet interface.
Figure 24 shows the original X80 robot, called Tobor. The pan and tilt camera
is visible at the front of the robot. Sonar and infrared sensors are also located across
the front of the robot. The WiFi antenna is located at the back of the robot but is not
visible in this photo.
Figure 24 – Tobor the X80 robot
The X80 was roughly 35cm in diameter with wheels 17cm in diameter and
wheel encoders with a resolution of 1200 ticks per revolution.
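From these figures, the distance travelled per encoder tick follows directly. This is a worked sketch using only the dimensions quoted above:

```python
import math

# Drive geometry quoted in the text for the X80: 17 cm wheel diameter
# and 1200 encoder ticks per wheel revolution.
WHEEL_DIAMETER_CM = 17.0
TICKS_PER_REV = 1200

def ticks_to_cm(ticks):
    # One revolution advances the wheel by its circumference, pi * d,
    # so each tick corresponds to roughly 0.045 cm of travel.
    return ticks * math.pi * WHEEL_DIAMETER_CM / TICKS_PER_REV
```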
3.4 Robot Motion
Two different methods were used for controlling the robots depending on their
onboard hardware: timed motions; and wheel encoder information.
Timed motions involve sending commands to the robot’s motors, waiting a
pre-determined amount of time, and then stopping the motors. The calculation of the
required time delay was based on a table constructed from experimental data.
Robots with wheel encoders can be controlled with reasonable accuracy.
However, significant errors were sometimes observed during moves. The motion
model is discussed in section 3.9.3.
For safety reasons, the X80 robot was programmed to check its front sonar
sensor prior to a move. The range information from the sonar sensor was not used
by the system for drawing maps, but only as confirmation that it was safe for the
robot to drive forwards.
3.4.1 Piecewise Linear Motion
Processing of images is computationally intensive which makes real-time
steering problematic. To eliminate this problem, robots in this research were
restricted to only two types of movements: rotation on the spot, or forward/backward
moves in a straight line. This is referred to as piecewise linear motion. It also fits
well with the use of the Distance Transform [Jarvis 1984] which was selected for
exploration.
Piecewise linear motion involves only two allowable movements: rotation and
translation. Because the robot has two wheels it can rotate on the spot. If both wheels
run at the same speed2, then the robot will move forwards or backwards in a straight
line, which is referred to as a translation. By combining these two types of moves, it
is possible to reach any point in a 2D space (provided that the point is accessible to
the robot – there must be a path that allows it to squeeze past obstacles) and it
approximates holonomic operation.
2 In reality, the wheel motors are usually opposed so that one must run ‘forwards’ while the other runs ‘backwards’ to produce straight-line motion. Their ‘speeds’ are therefore opposite.
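A move under piecewise linear motion decomposes into a rotation on the spot followed by a straight-line translation. The helper below is a minimal sketch of that decomposition, not the actual implementation.

```python
import math

def plan_move(x, y, theta, gx, gy):
    """Decompose a move from pose (x, y, theta) to goal (gx, gy) into a
    rotation on the spot followed by a translation. theta is in radians."""
    dx, dy = gx - x, gy - y
    heading = math.atan2(dy, dx)
    # Normalise the turn into [-pi, pi) so the robot takes the shorter way.
    turn = (heading - theta + math.pi) % (2 * math.pi) - math.pi
    distance = math.hypot(dx, dy)
    return turn, distance
```

Chaining such (turn, distance) pairs reaches any accessible point in the plane, which is what gives the approximation to holonomic operation.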
Piecewise linear motion has rarely been used in the literature, where the
emphasis has instead been on building fast algorithms that can control the robot's
motion in real-time. As CPU speeds increase, the pause required to process each
image shrinks and piecewise linear motion approaches continuous motion. In the
meantime, this approach
allowed the robots to take as long as necessary to process camera images before
making their next moves – there were no real-time considerations.
Another advantage of piecewise linear motion was that the equations
governing the motion of the robot were extremely simple. This was important in
developing the SLAM algorithm because a motion model was required for the robot,
i.e. how it responded to inputs (where inputs were commands to move).
The primary disadvantage of the piecewise linear approach to robot motion
became evident early in the research – it was not possible to use optic flow.
Extracting the optic flow field from a sequence of images requires, ideally, that the
images correspond to small motions. By processing images only after significant
motions have completed, the differences between images were too great for typical
optic flow algorithms to recognise the changes. In any case, the camera images from
the X80 robot suffered from significant motion blur, so they would not have been
suitable for optic flow.
Some vision-based SLAM algorithms also required images with small
incremental motions in order to make reliable estimates of the positions of features
in the images. MonoSLAM [Davison 1999] was reported to work best when the
camera was waved around, thereby providing a large number of different views of
the same objects.
Because the video stream was ignored during motions, if a human walked into
the path of a robot while it was moving, the robot could collide with the human.
Further work is required to eliminate this problem. For example, the X80 robot had
sonar sensors that could have been monitored continuously as a safety measure to
avoid such collisions, rather than just checking the sonar prior to a move.
Despite these drawbacks, piecewise linear motion was a good approach.
3.4.2 Odometry Errors
Odometry is notoriously unreliable, and it was not expected to be perfect.
However, the large errors that sometimes occurred were not anticipated.
For timed motions, as used on the Hemisson and Yujin robots, it proved to be
impossible to reliably time an operation under Windows. Even so-called ‘high
precision’ timer code [Wheeler 2003, Manko 2003] would randomly fail to produce
the correct result. The problem lay in the way that Windows implements
multi-tasking: occasionally the operating system would get bogged down in internal
housekeeping, such as flushing the disk cache. This caused the timer to overrun
significantly and resulted in the robot making moves that were significantly different
from what was expected.
Because the X80 robots had wheel encoders with 1200 ticks per revolution it
was expected that it would be possible to control the movement of the robots with
reasonable accuracy. However, large variations were occasionally observed during
experiments. Dropped network packets (because the robot used UDP which is not a
reliable transport) could result in the complete loss of a command, but no
explanation was found for the unusual movements.
The full impact of poor odometry only became apparent towards the end of the
work during the implementation of SLAM on an X80. In fact, there was literature
emerging concurrent with this thesis [Hähnel 2004; Grisetti et al. 2005; Nieto et al.
2005] that pointed to the importance of improving pose estimates before
incorporating new sensor data into a SLAM algorithm.
3.4.3 Effect of Motion on Images
There are advantages to having the robot stationary when it captures images.
Firstly, there was no vibration to affect the position of the camera. If the camera
moved up or down slightly, the pixel locations of the obstacle edges in the image
would also change, resulting in errors in the Inverse Perspective Mapping.
Secondly, the shutter speed of the camera could not be controlled on the cheap
cameras and it was fairly slow. This means that images were blurred if the robot was
moving when they were captured. Blurring tends to suppress edges, making it harder
to detect obstacles. With piecewise linear motion, however, the robot was not
moving when images were captured.
At intermediate points during a move, it was not possible to accurately
determine the pose of the robot. The pose was only known with any certainty at the
beginning and end of moves.
Therefore, the ExCV program disabled image capture during motions. It was
also found that a short delay was required after completing a motion before re-
enabling the camera, otherwise the images were still blurred while the robot ‘settled’.
In effect, this approach meant that the robot was blind during moves. This was
not a problem for rotations because the robot only turned on the spot and could not
collide with an obstacle. However, for translations there was a strong assumption
that the environment remained static, at least for a short period of time.
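The capture sequence can be sketched as follows; `execute_move` and `grab_frame` are hypothetical stand-ins for the robot and camera interfaces, and the settle delay is an illustrative value.

```python
import time

SETTLE_DELAY = 0.3  # seconds; illustrative, tuned experimentally

def move_then_capture(execute_move, grab_frame, settle=SETTLE_DELAY):
    """Run a motion with the camera disabled, wait for the robot to
    settle, then capture a single still frame for processing."""
    execute_move()       # blocking: returns when the move completes
    time.sleep(settle)   # let vibration die down to avoid motion blur
    return grab_frame()
```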
3.5 Computer Vision
The Horswill Constraints listed in Chapter 2 were taken as the basis for the
design of the vision system.
The vision system produced range data that was sufficient to create workable
maps. However, it still had some deficiencies. Some recommendations for future
work are discussed in the last section of Chapter 4.
The primary conclusion from the vision system was that several different
modalities should be used – as many as possible. The Human Visual System uses
many different modalities: texture, shading, edges, segmentation, pattern matching,
optic flow, occlusion, etc. In contrast, the vision system in this thesis only used edges
and colour-based segmentation.
Expanding the vision system to support other modalities might also involve
changing the camera hardware.
3.5.1 Vision as the Only Sensor
It was an objective of this research to use only vision for navigation and
mapping. Clearly, there was an implicit assumption that this was possible.
There was one subtle issue that had to do with odometry. For the X80 robots
that had wheel encoders to control motions, there was inherently another sensor
besides vision. Therefore the vision-only assumption was not entirely correct.
However, in the case of the Hemisson and Yujin robots there were no wheel
encoders and motions were controlled using timers, so the vision-only assumption
held true.
As noted above, a sonar sensor on the X80 was used for safety, but not for
mapping or exploration.
3.5.2 Experimental Environments
Two playing fields were originally built for use with the Hemisson and Yujin
robots. However, these proved to be too small for adequate experimentation because
the robots could see the entire environment during a single pirouette.
The playing field environment was broken up using additional walls. However,
even then the time required to explore the entire playing field was quite short, and so
it was not a good experimental environment except for the initial prototypes. This
limitation was one of the driving factors in moving away from the playing fields to a
human-scale environment.
In the final phase of the project an X80 robot was used to map corridors in an
office building. This environment had its own particular limitations too. For
example, the corridor illumination was extremely variable causing problems with
shadows and colour variations. However, the environment was representative of
many office buildings.
The office environments used in this research were challenging due to the
colours of the carpets, walls and doors – refer to Figure 25.
Figure 25 – Environment showing floor, wall, door and ‘kick strip’
Even though a colour camera was used, the blue carpet appeared almost grey
in the images. The walls were a pale blue which was almost grey. Furthermore, the
doors were off-white which is also a shade of grey. For this reason, hue was not a
suitable metric for comparison of pixels.
Notice in Figure 25 that there are black strips, called ‘kick strips’ or ‘scuff
plates’, along the walls. It was observed during experiments that these usually
provided clear edges for the edge detector. Edge detection is not perfect, however,
and sometimes the flood fill for the floor 'leaked' through tiny gaps in the edges.
When this happened, the edges from the black strips confined the leak to thin
'wisps', preventing the floor detection process from generating incorrect local
maps; the wisps had no impact on the usefulness of the maps.
Many office environments have kick strips similar to the ones in Figure 25, or
cable ducts along the walls. Sometimes kick strips are also added to items of
furniture, such as desks, to protect them. Even in homes, skirting boards are usually
used to hide the gaps between walls and floors that invariably occur during
construction. These strips could act as clear visual markers for the floor boundary.
Marking them with a distinct colour would be beneficial not only to robots but also
to visually-impaired people or in emergency situations such as a fire when rooms
and corridors might be full of smoke.
One conclusion from the experiments was that the environment affects the
design of the vision system. It is difficult to write a general-purpose system that
works in any environment, but some simple changes to the environment would help
to simplify the design.
As domestic robots become more widely used, interior designers might have to
take them into account in the designs of offices, hospitals and even homes. In
particular, the colours of walls and doors should be distinctly different from the
floor.
Consideration should be given to making human environments ‘robot friendly’
in future buildings. It should also be feasible to retro-fit existing buildings at little
additional cost during routine renovations. These changes could also prove helpful
for humans who are vision impaired. In fact, many buildings already have strips of
one kind or another along the edges of the walls.
3.5.3 Monocular Vision
Preliminary work by the author showed that it was possible to use a single
camera to obtain range information based on certain assumptions about the
environment and knowledge of the camera geometry [Taylor et al. 2004a]. (See
section 3.7.2 for details). It was not necessary, therefore, to use stereo vision which
had been used in the past to obtain range information.
Due to the complexity of de-warping the images from panoramic cameras, the
cost of these specialised cameras (requiring highly accurate mirrors/lenses), and their
large size, it was decided not to use them.
3.5.4 Cheap Colour Cameras
The reason for the choice of a low-resolution colour camera was to keep the
cost as low as possible on the assumption that for robots to be successful in the
domestic and office environments they would need to be attractively priced.
Web cameras with 320x240 pixel resolution are very cheap. The original
wireless camera used on the Hemisson and Yujin robots had this resolution and so
the simulation was designed to use this same resolution.
The original documentation for the X80 robot was misleading on the subject of
camera resolution. Although the camera was capable of higher resolution (352x288),
the on-board firmware in the robot only used 176x144 pixels. Furthermore, the
throughput of 15fps (frames per second) claimed for the camera was not achievable.
Dr. Robot, Inc. has since amended the online specifications to state that the video
can achieve “up to 4 fps”.
These problems were discussed with the manufacturer at length, but the final
outcome was that the resolution could not be changed and remained set at 176x144.
The maximum frame rate obtained during experiments was about 2fps due to the
internal bottlenecks in the hardware architecture.
Lower than anticipated resolution might have contributed to difficulties with
the map building process. However, it also highlighted an important point about
SLAM algorithms, namely that they require reasonably accurate input data.
The focal length of a web camera is generally not known. The equations
derived for Inverse Perspective Mapping (explained below in section 3.7.2)
effectively eliminated the focal length. Therefore it was not necessary to know the
focal length for the cameras used in this research. However, as shown below, the
intrinsic parameters for the cameras include the focal length.
3.5.5 Narrow Field of View
Cheap cameras typically have a narrow field of view. The Swann wireless
camera (used on the Hemisson and Yujin robots) only had a 60º Field of View
(FOV), which is fairly typical of such cameras. For the X80, the FOV was
determined to be 50.8º from the intrinsic parameters obtained via camera calibration
(explained below in section 3.5.6).
The cameras were therefore tilted downwards to see the floor in front of the
robots, but even then there were blind areas in front of the robots. Tilting the camera
also introduces a complication due to perspective effects because vertical edges do
not remain vertical in the image, which is discussed in the next section.
Because the robot could only see a small area in front of it, drawing a map was
difficult. Therefore an approach was developed whereby the robot rotates on the
spot, called a pirouette. During a pirouette the robot turns through a full 360° in 12
steps of 30° each. The choice of 30° increments was a compromise that allowed for
overlap between images, but minimised the number of turns because each turn has
the potential to introduce odometry errors. This approach of ‘looking around’ is a
form of active vision – making deliberate moves in order to obtain an improved view
of the world.
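The pirouette can be sketched as a simple loop; the `rotate` and `capture` callbacks are hypothetical stand-ins for the robot interface.

```python
STEP_DEG = 30            # 12 steps of 30 degrees cover a full turn
STEPS = 360 // STEP_DEG

def pirouette(rotate, capture):
    """Rotate on the spot through 360 degrees in fixed increments,
    capturing one image per heading ('looking around')."""
    images = []
    for _ in range(STEPS):
        images.append(capture())
        rotate(STEP_DEG)
    return images
```

Each of the 12 rotations is a chance for an odometry error, which is why the increment was kept as large as the image overlap allowed.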
Consider the sequence of images shown in Figure 26. These show what the
robot sees as it performs a pirouette. However, this image sequence only shows half
of the pirouette.
Figure 26 – Sequence of images from a robot rotating on the spot (turning left by 30° per image from top-left to bottom-right)
To a human, it is apparent that the robot was rotating on the spot. It is also
obvious that it was not aligned perfectly with the corridor walls. Despite changes in
illumination between images and even across an image, the floor is immediately
obvious to a human.
On closer inspection, it is also apparent that there are very few distinguishing
features in some of the images. Such is the nature of corridors. These examples are
raw images which exhibit radial distortion, although many people do not even notice
this until it is pointed out to them. As noted earlier, radial distortion was removed
before processing the images.
Note that this process was a major source of odometry errors because, with
each individual rotation, there was a chance that the robot would turn by the wrong
amount. This problem does not exist for some other sensor configurations. Sonar
sensors, for instance, are typically used in a ring around the circumference of the
robot, so the robot does not need to turn at all to obtain a sweep. LRFs usually have a
180 degree field of view, so again they do not usually turn to ‘look around’.
3.5.6 Removing Camera Distortion
A disadvantage of cheap cameras is that they tend to have poor quality lenses
(often made of moulded plastic). The most obvious effect is radial distortion, which
causes problems with the Inverse Perspective Mapping (IPM). (IPM is explained
below in section 3.7.2).
Camera calibration can be performed using software available from [Bouguet
2006] which was based on the work in [Bouguet & Perona 1995]. Once the camera
intrinsic parameters are known, the distortion can be removed from the images using
routines included in the OpenCV Library [Intel 2005].
The calibration procedure was clearly documented. It involved taking photos
of a calibration target from a variety of different positions and angles. Some
examples are shown in Figure 27. There is no significance in this particular set of
images. This camera, on an X80 robot, had a resolution of 176 x 144 pixels.
Figure 27 – Sample calibration images for X80 robot
After processing, the output was a set of intrinsic parameters. Table 2 shows
the parameters for the X80 robot based on an expanded set of over a dozen images
including those in Figure 27.
Table 2 – Intrinsic Parameters for X80 robot
Focal Length: 185.33677, 185.22794 ± 1.32945, 1.38745
Principal Point: 88.47254, 65.55508 ± 1.87102, 1.52126
Skew: Not estimated – pixels assumed to be square
Distortion Parameters: -0.29131, 0.11058, -0.00232, -0.00231, 0.0 ± 0.02385, 0.08816, 0.00139, 0.00101, 0.0
Pixel Error: 0.14943, 0.14104
Another way to view these parameters is a graphical representation of the
distortion model as shown below:
Figure 28 – Complete distortion model for X80 camera
Note that there was significant radial distortion, as indicated by the numbered
contour lines which show the number of pixels of radial error. The small arrows
indicate the direction of the distortion. Notice that the distortion was quite substantial
in the corners, reaching as high as 9 pixels in the bottom left corner. It should also be
noted that for this camera the principal point (the small circle) was not at the centre
of the image (which is marked with ‘x’).
The effects of radial distortion can be removed once the camera’s intrinsic
parameters are known. (The OpenCV Library has a function to do this.) Figure 29 (a)
shows an original image taken using a Swann Microcam on a Yujin robot, and
Figure 29 (b) shows the image after correction for radial distortion, referred to as the
undistorted image. The effect is most noticeable at the bottom edge of the grid.
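As an illustration, the distortion at any pixel can be evaluated directly from the Table 2 parameters using the standard Brown lens model (assumed here to be the model behind those coefficients). This evaluates the forward model, so the corner displacement need not match the figure's contour values exactly.

```python
import math

# Intrinsic parameters from Table 2 (X80 camera, 176x144 pixels).
FX, FY = 185.33677, 185.22794
CX, CY = 88.47254, 65.55508
K1, K2, P1, P2 = -0.29131, 0.11058, -0.00232, -0.00231

def distortion_offset(u, v):
    """Pixel displacement that the Brown lens model applies to an
    ideal pixel (u, v), using the calibrated parameters above."""
    x = (u - CX) / FX
    y = (v - CY) / FY
    r2 = x * x + y * y
    radial = K1 * r2 + K2 * r2 * r2
    dx = x * radial + 2 * P1 * x * y + P2 * (r2 + 2 * x * x)
    dy = y * radial + P1 * (r2 + 2 * y * y) + 2 * P2 * x * y
    return dx * FX, dy * FY

du, dv = distortion_offset(0, 143)       # bottom-left corner
corner_error = math.hypot(du, dv)        # largest near the corners
du0, dv0 = distortion_offset(CX, CY)     # zero at the principal point
```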
Figure 29 – Correcting for Radial Distortion: (a) Original image; and (b) Undistorted image
In further discussion of this research, especially map building, images are
assumed to be undistorted before being processed but this is not explicitly
mentioned.
3.5.7 Uneven Illumination
In the initial work the illumination was reasonably uniform because the
‘playing fields’ were relatively small. In simulation, illumination was not an issue
because the images were artificially generated.
However, in real corridors the effects of uneven illumination were quite
apparent. Figure 30 shows three images from the same test run. The first image (top
left) shows ideal illumination which was present in most areas of the corridors.
Figure 30 – Examples of uneven illumination in corridors
‘Hard’ shadows, such as in Figure 30 (b), could result in the vision system
detecting an edge that did not exist and consequently drawing a fictitious wall in the
map. Applying a Gaussian blur before processing the images helped to reduce the
incidence of phantom edges without significantly affecting the detection of real
walls.
Wide variations in illumination like Figure 30 (c) sometimes caused the vision
system to incorrectly segment the image and reduce the amount of floor that was
detected. In the worst case, the robot concluded that the corridor was impassable. To
attempt to mitigate this problem, the vision system used a moving average for the
floor pixel values. As it progressed around the building, the current value of an
average floor pixel was changed. However, there was a threshold on the amount of
variation and it was still possible for the floor to be too dark to be reliably detected.
In order to minimise these effects, the brightness of camera images was
automatically adjusted during a test run. When the robot was first started, it used a
patch of floor immediately in front of it as a template. The parameters of this floor
patch, minimum, maximum and average RGB values, were compared to new images
as the robot moved around.
If there was a noticeable difference in the average intensity, then the intensity
of the entire image was adjusted to match the initial template. This was done by
multiplying all pixel values by an appropriate amount. In other words, the brightness
of the image was adjusted proportionally so that the average intensity of the template
floor patch was the same as the initial patch.
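The proportional adjustment can be sketched as follows; pure-Python pixel lists are used for clarity, and the clipping to 8-bit range is an assumption rather than a documented detail.

```python
def adjust_brightness(image, template_avg, patch_avg):
    """Scale every pixel so the average intensity of the current floor
    patch matches the template patch captured at start-up.
    `image` is a list of rows of (R, G, B) tuples; `patch_avg` is the
    current average intensity of the reference floor region."""
    if patch_avg == 0:
        return image
    scale = template_avg / patch_avg
    clip = lambda v: max(0, min(255, int(round(v * scale))))
    return [[(clip(r), clip(g), clip(b)) for (r, g, b) in row]
            for row in image]
```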
Even when the system adjusted the image intensity to try to average out the
changes in illumination, it was not possible to entirely eliminate the problem of sharp
shadow edges. Clearly, adjusting the average intensity to match one side of the
shadow or the other will still leave a shadow.
A different approach to shadows proposed in the literature was switching to a
different colour space where the shadows should disappear. However, attempts to
use hue were not successful. The main problem with hue is that it is very unreliable
in low-light situations. It has no representation for shades of grey which were
prevalent in the experimental environments, and this causes the hue values to
fluctuate wildly. Furthermore, the environment turned out to contain a lot of colours
with a similar hue. This is discussed below under Colour Spaces.
Comprehensive image normalization was also found to be ineffective in
removing shadows. Therefore, shadows are still an open research topic.
3.5.8 Compensating for Vignetting
Vignetting makes the outer edges, especially the corners, of an image appear
darker than the centre of the image. This makes segmentation more difficult because
pixel values can vary considerably across the image.
A very simple approach was adopted in this project that improves image
quality, but cannot completely remove the vignetting effects. It was loosely based on
the work of Yu [Yu et al. 2004].
To reduce the effects of vignetting, pixel values in the image were adjusted by
a factor that was proportional to the square of the distance from the centre of the
image. This was done independently for each of the Red, Green and Blue
components, which also allows compensation for incorrect colour tint to be applied
at the same time. The resulting improvement can be seen by comparing Figure 31 (a)
and (b).
Figure 31 – Correcting for Vignetting: (a) Original image; and (b) Corrected image
Note that uniform illumination has an indirect effect on vignetting. The impact
of vignetting is worse in low-light situations, so good illumination helps to reduce
vignetting.
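A minimal sketch of this correction follows; the per-channel strength coefficients are illustrative, and giving each channel its own coefficient is what allows a colour tint to be corrected at the same time.

```python
def devignette(image, strength=(0.4, 0.4, 0.4)):
    """Brighten each pixel by a gain proportional to the squared
    distance from the image centre, independently per RGB channel.
    `image` is a list of rows of (R, G, B) tuples."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r2max = cx * cx + cy * cy          # normaliser: corner distance squared
    out = []
    for y, row in enumerate(image):
        new_row = []
        for x, px in enumerate(row):
            r2 = ((x - cx) ** 2 + (y - cy) ** 2) / r2max
            new_row.append(tuple(
                min(255, int(round(c * (1.0 + k * r2))))
                for c, k in zip(px, strength)))
        out.append(new_row)
    return out
```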
3.5.9 Camera Adaptation
Most cameras have automatic gain control – they adjust for the amount of light
entering the lens. This is desirable for obtaining the best possible images, but it can
also have adverse effects. Images were observed to get progressively brighter while
the robot stood still viewing the same scene.
Firstly, when the camera readjusts itself, the RGB values of pixels in the image
change. This means that colours seem to change. Any algorithm that relies on colour
matching in the RGB space will therefore have a problem because colours become
moving targets. One approach that was examined was to use a different colour space
to try to eliminate the effects of changing brightness, and to apply some ‘colour
constancy’ algorithms. Attempts to do this, for example using Comprehensive
Colour Image Normalization [Finlayson et al. 1998], proved unsuccessful.
The second problem was adaptation. If the camera remained stationary for a
significant amount of time, it adjusted itself to the point where the colours were
actually distorted. For instance, if there was a large red object in the camera’s field
of view, the image would take on a definite reddish hue over time. This effect was
very noticeable and could not be compensated for by changing to a different colour
space.
3.6 Navigation
This section discusses various aspects of navigation. In this research, the term
navigation refers primarily to obstacle detection for collision-free motion.
Much of the literature assumes static environments in which the obstacles do not
move. In a human environment this is not true, but the
additional complexity involved in handling a dynamic environment makes it a much
more difficult problem. Therefore it was assumed in this research that the
environment was static.
Feature extraction was not used in the vision system for this project. In other
words, no attempt was made to recognise objects. The floor was assumed to be of a
uniform colour and therefore, by definition, contained no features. Instead, this
work relied on detecting edges and performing segmentation to locate obstacles.
A simple approach to floor segmentation was used initially. It involved a
moving-average flood fill whereby an initial pixel value was selected at the bottom
of the image, and the pixel value to be matched varied continuously based on the
pixels previously seen while scanning vertically up an image column.
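The column scan can be sketched as follows; the threshold and averaging weight are illustrative values, and a single intensity channel stands in for the RGB comparison.

```python
def floor_column_scan(column, threshold=12.0, alpha=0.1):
    """Scan one image column from the bottom upwards, tracking a moving
    average of intensity; pixels remain 'floor' until one deviates from
    the average by more than `threshold`. Returns the count of floor
    pixels. `column` is ordered top to bottom."""
    avg = float(column[-1])          # seed with the bottom pixel
    count = 0
    for value in reversed(column):
        if abs(value - avg) > threshold:
            break                    # obstacle edge reached
        avg = (1.0 - alpha) * avg + alpha * value
        count += 1
    return count
```

The moving average lets the match tolerate gradual illumination gradients up the column while still stopping at a sharp obstacle boundary.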
3.6.1 Reactive Navigation
The first prototype, based on a Hemisson robot, used a reactive approach for
steering the robot to enable it to wander around. While purely reactive behaviours
are simple and therefore easy to implement, they have significant drawbacks. For
instance, the simple algorithm on the Hemisson robot aimed to move towards the
area of greatest free space. In order to do this, the robot segmented the images into
floor and non-floor pixels by starting at the bottom centre of the image and using a
moving-average flood fill.
Note that the reactive navigation system made no attempt to remember where
the robot had been or build a map. It just made instantaneous decisions about the
steering direction for the robot and was a proof of concept for the intended approach
to floor detection.
Figure 32 (a) is an image captured using an early version of the ExCV
program. It shows the view from the camera mounted on top of the robot.
Figure 32 – Using vision to detect the floor: (a) Camera image; and (b) Extracted floor region and steering direction
In Figure 32 (b), the floor has been segmented from the rest of the objects
(shown as a light grey area) and a direction has been selected for the next move, as
shown by the thick solid white line.
A couple of notes are appropriate at this point. Firstly, narrow margins around
the outside edges of the image were ignored as far as detecting the floor was
concerned because of the significant vignetting effect which could not be completely
eliminated.
Secondly, notice the importance of the detected obstacle edges in limiting the
spread of the floor (light grey) area. Also, the floor in this artificial environment is
reasonably uniform in colour. These factors made segmentation reliable.
Lastly, the selected direction was chosen from a small set of fixed directions
(which is why it did not align exactly with the corner of the floor in the top-right).
These directions were based on knowing the distance and direction to several points
in the image – a coarse grid overlaid on the image that corresponded to real-world
locations relative to the robot. This grid is not shown in the figure, but it was the
predecessor of the full Inverse Perspective Mapping.
The robot could wander around without bumping into obstacles. By seeking
the maximum free space, the robot avoided obstacles in its immediate vicinity.
However, the algorithm has a fundamental flaw: it drives the robot into corners. As
the robot approached a corner, the point that was farthest from the robot was the
corner itself. It, therefore, adjusted its steering direction to aim directly towards the
corner.
A behaviour was added to the program for the situation where the robot found
itself boxed in. This consisted of backing up for a pre-defined distance, turning to the
left by 90 degrees, and then starting again. In most cases this allowed the robot to
escape from corners, but it did not work in all situations. Oscillations could then
ensue.
Another common behaviour is wall following. Simple obstacle avoidance
behaviour suggests immediately turning away from a wall as soon as the robot
senses it. In contrast, wall following requires the robot to move towards a wall until
it reaches a certain specified distance from the wall and then turn to a direction
parallel to the wall. The distance from the wall should be maintained as the robot
moves forward. However, if the robot moves parallel to the wall using vision, the
limited field of view of the camera means that it actually cannot see the wall
immediately beside it – it can only see the wall some distance in front. Therefore
wall following could not be implemented successfully.
These simplistic behaviours were inefficient and vulnerable to infinite loops.
This was a good reason for producing a map – the map provided a ‘memory’ for the
robot so that it could avoid exploring areas that it had already visited.
3.6.2 Floor Segmentation Approaches
The primary perceptual task for navigation was to locate the floor, and hence to
detect obstacles, which is a segmentation problem. Several different methods were
investigated for this task, both alone and in combination. Examples are given below.
The following is a summary of some of the various methods that were trialled
for use in floor segmentation:
• Conversion to various other colour spaces before applying a flood fill,
including the Farnsworth ‘Perceptually Uniform Color Space’
[Wyszecki & Stiles 1967], the Improved HLS space [Hanbury 2002],
the ‘Color Invariance’ space [Geusebroek et al. 2001];
• Performing a Flood Fill independently on the Red, Green and Blue
image planes, as well as the Hue, and taking a majority vote;
• Converting to HLS colour space and then quantizing the values for
‘fuzzy’ colour matching;
• Using ‘Comprehensive Normalization’ [Finlayson et al. 1998] to pre-
process the RGB image before flood fill;
• Selecting the principal RGB pixel values for the floor using k-means
clustering [Duda et al. 2001];
• Applying the Mean Shift Algorithm [Comanicu & Meer 2002] using
the EDISON package [Rutgers 2004].
A wide variety of different image processing operations were applied to the
images, but none of them worked reliably on their own.
Chapter Three Design and Experimentation
3.6.3 Colour Spaces
Changing colour spaces did not have a significant benefit, possibly because
several spaces are just linear combinations of the others. There have been conflicting
reports in the literature as to the benefits of one colour space over another.
The Improved HLS system (IHLS) [Hanbury 2002] is claimed to be more
robust than HLS, but in practice little difference was found during this research. HLS
is known to have several problems. For instance, hue is a cyclic function and hue
also has no representation for black and white or shades of grey. In an environment
such as the one shown in Figure 25 and Figure 30 the hue values fluctuated wildly
across the image due to the proximity of the colours to shades of grey.
Methods such as ‘Color Invariance’ showed spectacular results in the examples
used in the literature. However, these were generally based on contrived images. As
soon as the algorithms were applied to real-world images from a robot’s camera,
they failed to produce good results. It should be borne in mind that the purpose of
many of these algorithms was to assist with object recognition or retrieval from
image databases. The test images, therefore, tended to be of high quality.
3.6.4 k-means Clustering
The k-means clustering algorithm was tested and it was found that the main
colours in the image could be determined. Some additional information was still
required to decide which of the clusters corresponded to the floor and which colours
were obstacles. The obvious assumption was that a reference area in the image
immediately in front of the robot was the floor.
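As an illustration of this idea (a minimal pure-Python sketch, not the thesis implementation; the toy one-row 'image', the value of k, and the reference patch size are invented for the example), any cluster whose label appears in a reference patch at the bottom of the frame can be treated as floor:

```python
def kmeans(pixels, k, iters=10):
    """Minimal k-means over RGB triples, seeded with the first k distinct
    colours so the toy example is deterministic."""
    means = list(dict.fromkeys(pixels))[:k]
    labels = [0] * len(pixels)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for idx, p in enumerate(pixels):
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
            labels[idx] = i
            clusters[i].append(p)
        # Move each mean to the centroid of its cluster.
        means = [tuple(sum(ch) / len(cl) for ch in zip(*cl)) if cl else m
                 for cl, m in zip(clusters, means)]
    return means, labels

# Toy frame stored bottom-up: 80 grey 'carpet' pixels then 20 red 'obstacle' pixels.
frame = [(120, 118, 122)] * 80 + [(200, 40, 40)] * 20
means, labels = kmeans(frame, k=2)

# Reference patch: assume the first few pixels (just in front of the robot) are floor.
floor_ids = set(labels[:10])
is_floor = [lab in floor_ids for lab in labels]
```

Note that this inherits the weakness discussed later: any pixel anywhere in the image whose colour falls into a 'floor' cluster is labelled floor, regardless of position.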
Figure 33 (a) shows a camera image after smoothing with a Gaussian filter. In
(b) the image was segmented using k-means clustering with 16 means. The
contouring effect is typical of clustering and is a consequence of variations in
illumination across the image.
Figure 33 – Segmentation using k-means Clustering (panels a–d)
With the classical approach to k-means clustering the number of clusters is
fixed and must be known in advance, which is a serious disadvantage when
developing general-purpose algorithms for unknown environments.
An attempt was made to circumvent this problem by adaptively changing the
number of clusters. The algorithm began with a relatively large number of clusters
(16). By examining the number of pixels in each cluster and throwing away clusters
with less than 5% of the total number of pixels, the number of clusters could be
reduced iteratively. Typically, an image could be represented by half a dozen
clusters. An example is shown in Figure 33 (c). It is difficult to see the difference
compared to (b), but there are fewer colours present in (c) as evidenced by the
reduction in speckling.
However, for carpet containing small flecks of different colours the extra
colours did not appear in sufficient quantity to be represented by a mean. Therefore,
at the pixel level, it was not possible to distinguish these flecks as part of the floor
because they were incorporated into other clusters. Applying a smoothing filter prior
to clustering reduced the impact of these flecks but did not eliminate them and they
are still visible in (c).
In Figure 33 (d) the cluster means inside the reference rectangle (shown in red)
were assumed to be floor and the image was re-coloured to show floor pixels in
green. This was quite a large reference area which caused problems if the robot got
too close to a wall. Notice that part of the floor still was not selected, and that the
floor leaked up the wall and door frame at the right-hand side.
This raises the main issue with clustering which is that clustering loses spatial
information – pixels from unrelated parts of the image can be grouped together just
because they are of a similar colour. (The same problem applies to matching using
colour histograms, which were also trialled).
An attempt was made to overcome this problem by incorporating pixel
coordinates into the feature vector to introduce a spatial bias. Examples of
segmented images incorporating spatial information are shown in Figure 34.
Figure 34 – k-means Clustering using pixel location and colour
The ‘means’ in k-means tend to move so as to minimize a squared-error criterion
function [Duda et al. 2001 page 527]. Therefore, if the number of means was
overspecified, a large area of uniform colour was broken up into sub-areas based
solely on the proximity of the pixels. This is clearly the case in Figure 34 (a),
which has a characteristic Voronoi pattern.
Pixel locations are measured in different units (pixels) from colour values
(usually a range of 0 to 255 for each of the Red, Green and Blue components). Pixel
distances and RGB values therefore cannot be directly compared. Some scaling is
usually necessary to try to make the elements of the feature vector contribute in
equal proportions. Figure 34 (b) shows the result of reducing the values of the pixel
coordinates by a factor of 10. The clusters were larger, but the speckling had
increased and the segmentation was still not suitable.
3.6.5 EDISON
The EDISON system is worth mentioning because it produced some good
results. The key to the success of the Mean Shift algorithm lies in determining the
boundaries between areas of different colour – it preserves edges. An example is
shown in Figure 35.
Figure 35 – Segmentation using EDISON: (a) Original image from X80 robot; and (b) EDISON output
Unfortunately, EDISON was extremely slow, increasing processing times from
several seconds per image up to as much as a minute. It also needed to be tuned for
the particular environment. Determining the optimal parameters was a time-consuming
process, and even once they had been found for a particular image, different
parameters might be required for another image in the same test run due to
changes in illumination. However, it was by far the most successful segmentation
method found during the project. As processor speeds increase in the next few years,
EDISON might become a viable option.
3.6.6 Other Image Processing Operations
A plethora of image processing operations were used to try to segment the
floor. However, the floors in the experimental environments were very difficult to
segment. For instance, the floor in Figure 36 was a particular problem because it had
carpet tiles with a pattern. This violated the uniform floor colour constraint, but
examples are included here because the exercise was instructive. Note that the
parameters for each case can be read from the control bar at the top of the window.
Figure 36 – Basic image processing: (a) Source image; (b) Morphological closing; and (c) Pyramid segmentation
Figure 36 (a) shows the original image from an X80 robot. Notice the texture
in the foreground, but also note that the texture disappears higher up in the image
and in particular outside the door on the left. The rectangular object in the centre is a
trap door for concealed power points.
In Figure 36 (b), a morphological closing operation has been performed.
Although this did smooth the image somewhat, it led to a blocky texture. In part
(c), pyramid segmentation was performed. This was reasonably successful, but it
treated the area outside the door as a separate segment.
Figure 37 – Texture removal: (a) Original edges; (b) Textured area based on edgels; and (c) Edges removed
Taking a different approach, an attempt was made to use the texture to
advantage as a means of segmenting the image. In Figure 37 (a), a Canny edge
detector was used to obtain a set of edges. These edge pixels, or ‘edgels’, were
grouped together in part (b) based on a minimum number of edgels within a window
of designated size. The area marked as texture was considered as a single segment.
When these edgels were subtracted from the original edge image, the result in part
(c) showed that most of the texture had disappeared. Unfortunately, the segmentation
produced in (b) was not as good as some other methods.
Figure 38 – Effects of pre-filtering: (a) Image filtered with a 9x9 Gaussian; (b) Edges obtained after filtering; and (c) Pyramid segmentation after filtering
It is common practice in image processing to filter (smooth) the image prior to
performing other operations. Figure 38 (a) shows the original image filtered by a 9x9
Gaussian. Even though it was obviously blurred, solid edges could still be obtained.
In fact, they corresponded to the main features in the image as shown in (b),
although there are extraneous edges across the door due to shadows. In part (c),
pyramid segmentation was once again applied. It was questionable whether this was
better or worse than previous attempts.
Figure 39 – Flood Filling: (a) Fill with one seed point; (b) Fill from a different seed point; and (c) Fill after pre-filtering
Another method for segmenting an image was to use an adaptive flood fill
where the matching pixel value varied across the image within a defined range. As
can be seen in Figure 39 (a) and (b), the initial starting point for the fill had a
significant effect on the result. In Figure 39 (c) the image was pre-filtered, which
resulted in a much smoother flood fill. However, the fill also tended to ‘leak’ out of
the floor area.
Some of these approaches showed promise, but none of them worked on their
own and yet a human can look at the original image and immediately determine
which parts of it are floor. Furthermore, they all required parameters and a
considerable amount of time was spent tuning them to obtain the examples above.
This brings up one of the major drawbacks of most segmentation schemes: the
parameters must be optimised for the environment. Humans perform dynamic
adjustments without even realising it. Robots, however, are not so intelligent.
The trap door in the middle of the image illustrates a particular problem that is
very hard to solve using a single image. Applying optic flow to a series of images
should show from the flow pattern that the trap door is co-planar with the floor.
However, segmentation techniques can only ‘remove’ the trap door if their
parameters are set to be very broad, reducing their ability to discriminate.
3.6.7 Implementing Floor Detection
As explained above, a wide variety of methods for detecting the floor were
investigated. None of them proved to be completely reliable. The final system
combined several image processing operations.
The fundamental problem was related to segmentation and colour constancy,
both of which are active fields of research. The colour of floor pixels in the images
appeared to change due to changes in illumination as the robot moved around its
environment, or even as it turned to look in a different direction. To try to minimize
these effects, the camera image brightness was adjusted up or down prior to
processing (which is a simple scaling of the RGB values). The segmentation
algorithm still had to allow for variations in colour.
Another important aspect was texture. According to the Horswill Constraints
stated previously, it was assumed that the floor had no significant texture. In practice
this was not the case, although it could mostly be removed by a smoothing filter.
Due to the similarity of the floor colour to the colour of the walls, the absence of
texture on the walls was used as an additional check that the robot was viewing the
floor and not a wall. In this sense the range of pixel values in a small patch of the
image was taken to be indicative of the amount of texture – a small range (referred to
as a ‘bland’ texture) implied a wall.
Edges detected in the image were found to be the best indication of the floor
outline. Therefore the most prominent edges were incorporated into the floor
detection procedure.
In the final version of the system, the process was simplified to using a Flood
Fill as follows:
• Smooth the image using a 9x9 Gaussian;
• Obtain the strongest edges in the image using a Canny Edge Detector;
• Create a mask image to prevent filling beyond the detected edges;
• Apply the adaptive FloodFill function from the OpenCV Library with
the edge mask.
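The final system used the adaptive FloodFill function from OpenCV, but the underlying idea of a fill constrained by an edge mask can be sketched in a few lines of pure Python (the toy greyscale image and the tolerance value are illustrative):

```python
from collections import deque

def flood_fill_floor(gray, edges, seed, tol=10):
    """BFS flood fill from `seed`: accept pixels whose grey value is within
    `tol` of the seed value, and never cross pixels marked in `edges`."""
    h, w = len(gray), len(gray[0])
    seed_val = gray[seed[0]][seed[1]]
    floor = [[False] * w for _ in range(h)]
    floor[seed[0]][seed[1]] = True
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < h and 0 <= nc < w and not floor[nr][nc]
                    and not edges[nr][nc]
                    and abs(gray[nr][nc] - seed_val) <= tol):
                floor[nr][nc] = True
                queue.append((nr, nc))
    return floor

# Toy 5x5 image: wall and floor are the same grey, but an edge row at r = 2
# (standing in for the Canny mask) separates them; the seed is at the bottom.
gray = [[100] * 5 for _ in range(5)]
edges = [[False] * 5 for _ in range(5)]
edges[2] = [True] * 5
floor = flood_fill_floor(gray, edges, seed=(4, 2))
```

A single missing pixel in the edge row would let the fill spread into the upper region, which mirrors the 'leak' problem described in the text.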
There were two main problems with this approach – the flood fill could ‘leak’
through gaps in the edges, and the fill could stop prematurely if there was a wide
range of illumination across the image.
To perform a flood fill, it was necessary to supply a ‘seed’ value for the pixels
to match. The robot determined the colour of the floor when it first started up using
an average of a reference patch in the bottom centre of the image. This assumed that
the area in front of the robot was clear initially. This was only a minor constraint and
it makes sense to do this in any case.
Care had to be taken to ensure that the fill was not initiated inside an obstacle.
Therefore the colour of the seed pixel was first checked against the known floor
colour. If it differed significantly from the floor, or if the texture was bland, then a
search was performed across the bottom of the image until the floor was found.
Based on the seed colour, the floor was detected using an adaptive flood fill where
the search value varied across the image.
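The seed-selection logic might be sketched as follows (a pure-Python illustration only: the tolerance, patch width, and 'bland' threshold are invented values, and the real system compared colours rather than single grey levels):

```python
def find_seed(gray, floor_val, tol=20, bland_range=3, patch=2):
    """Scan the bottom row for a usable flood-fill seed: a pixel whose grey
    value is close to the known floor value and whose local neighbourhood
    is not 'bland' (a very small value range suggests a wall)."""
    bottom = gray[-1]
    w = len(bottom)
    for c in range(patch, w - patch):
        vals = bottom[c - patch:c + patch + 1]
        close = abs(bottom[c] - floor_val) <= tol
        textured = max(vals) - min(vals) > bland_range
        if close and textured:
            return (len(gray) - 1, c)
    return None

# Toy bottom row: three dark obstacle pixels, then flecked carpet near 100.
gray = [[30, 30, 30, 98, 103, 97, 101, 99, 102, 98]]
seed = find_seed(gray, floor_val=100)
```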
The range limits for the flood fill and the parameters for the edge detector were
determined by the researcher by experimentation using images captured by the robot
during test runs. Automatically selecting these parameters is a possible area for
future research.
3.7 Mapping
Building maps was the major objective of this research. It was decided to
create metric maps using occupancy grids as the representation because these are
easily understandable by non-technical people such as the potential consumers of
home and office robots.
To build a metric map, it must be possible to obtain range information using
the robot’s sensors. In this research the only sensor was a digital camera. Therefore a
method had to be developed for extracting range information from images.
3.7.1 Occupancy Grids
This section explains how the occupancy grid maps were updated in this
research by showing the derivation of the update equation.
Firstly, the odds of an event, o(x), with probability p(x) are defined as:

o(x) = p(x) / (1 - p(x))   (10)
As a starting point the derivation begins with equation 12 from [Thrun 2003]
which is re-stated here as equation 11:
p(mx,y | z1…zt) / [1 - p(mx,y | z1…zt)] =
    [p(mx,y | zt) / (1 - p(mx,y | zt))]
    × [(1 - p(mx,y)) / p(mx,y)]
    × [p(mx,y | z1…zt-1) / (1 - p(mx,y | z1…zt-1))]   (11)
where:
p(mx,y| z1…zt) is the (posterior) probability that map cell mx,y is occupied given
the measurements z1 to zt,
p(mx,y| zt) is the probability that map cell mx,y is occupied given just the current
measurement zt (which is determined from the sensor model),
p(mx,y) is the probability that map cell mx,y is actually occupied in the real
world,
p(mx,y| z1…zt-1) is the (prior) probability that map cell mx,y is occupied given the
measurements z1 to zt-1.
In words, equation 11 says that the odds of a map cell being occupied at time t
are equal to the odds of the latest measurement times the odds of it being occupied at
time t-1 divided by the odds that the cell is actually occupied in the real world
(which is the middle factor inverted). This equation can be derived using Bayes
Rule, as shown by Thrun.
Equation 11 can be simplified by making an assumption that is common in the
literature of occupancy grids (and also mentioned in [Thrun 2003]), namely that the
initial value for the occupancy probability of all map cells is 0.5. In other words,
because the actual probability of occupancy of all the grid cells in the real world is
unknown, just assume a 50/50 chance.
Substituting 0.5 for p(mx,y) in equation 11 eliminates the middle term because
the numerator and denominator both become 0.5.
To simplify the notation, drop the subscript x,y which refers to the location of a
cell in the grid.
Define P as follows:
P = [p(m | zt) / (1 - p(m | zt))] × [p(m | z1…zt-1) / (1 - p(m | z1…zt-1))]   (12)
This is equivalent to the right-hand side of equation 11 after substituting 0.5
for p(mx,y). The two factors in P above are the odds of the cell being occupied based
on the latest sensor reading and the previous odds for the map cell.
So equation 11 can now be re-written as:
p(m | z1…zt) / [1 - p(m | z1…zt)] = P   (13)
After a small amount of rearrangement the new occupancy value for a map cell
at time t can be calculated as follows:
p(m | z1…zt) = P / (1 + P)   (14)
Because equation 14 is a recursive update equation, an initial estimate is
required to get started. This is, as previously stated, set to 0.5 – all the cells in the
map are initialized to this value at the start of a mapping run.
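The update defined by equations 12–14 amounts to only a few lines of code. A minimal sketch (the sensor-model probabilities 0.7 and 0.3 are illustrative values, not the ones used in the thesis):

```python
def update_cell(prior, p_z):
    """Occupancy grid update from equations 12-14: combine the odds of the
    current sensor reading with the prior odds, assuming p(m) = 0.5."""
    odds = (p_z / (1.0 - p_z)) * (prior / (1.0 - prior))   # equation 12
    return odds / (1.0 + odds)                             # equation 14

cell = 0.5                        # all cells start as 'unknown'
for p_z in (0.7, 0.7, 0.7):       # three consecutive 'occupied' readings
    cell = update_cell(cell, p_z)
```

With the 0.5 prior assumption, a single reading simply reproduces the sensor probability; repeated consistent readings push the cell towards certainty, while a contradictory reading pulls it back towards 0.5.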
The position of the robot needs to be incorporated when updating the global
map.
3.7.2 Inverse Perspective Mapping (IPM)
In order to use a camera as a range sensor, it is necessary to obtain distance
estimates from the images. This section describes the development of a set of
equations for Inverse Perspective Mapping (IPM) as published by the author [Taylor
et al. 2004a]. Note that this approach does not require the focal length of the camera.
The camera FOV is illustrated in the diagram in Figure 40. It shows the top
and side views of a robot (not to scale). Notice that there is a blind area immediately
in front of the robot. Also, the top scanline of the camera image in this diagram is
above the true horizon (the horizontal line through the centre of the camera). In
general, the horizon is not at the centre of the image because the camera is tilted
downwards to improve the visibility of the foreground.
Figure 40 – Camera Field of View (FOV) showing coordinate system and parameters for Inverse Perspective Mapping (IPM)
In the FOV diagram, γ is one-half of the horizontal FOV; α is one-half of the
vertical FOV; and δ is the camera tilt angle. (γ and α are related through the aspect
ratio of the camera, which is usually 4:3 for conventional video cameras.)
Image pixel coordinates are represented by (u, v) in the horizontal and vertical
directions respectively. Note that, by convention, v is measured downwards from the
top of the image. If the image resolution is m by n pixels, then the values of the
image coordinates (u, v) will range from 0 to (m-1) and 0 to (n-1) respectively.
Consider rays from the camera to points on the ground corresponding to
successive scanlines in the camera image. Each pixel in this vertical column of the
image corresponds to an angle of 2α/(n-1). Similarly, pixels in the horizontal
direction correspond to an arc of 2γ/(m-1).
Four parameters must be measured for the IPM calculation:
b Blind area between the camera (in the centre of robot) and the point
visible in the bottom scanline of the image;
h Height of the camera off the floor;
d Distance along the floor between the bottom scanline and the centre
scanline of the image; and
w Distance across the floor measured from the robot centreline to the edge
of the image at the centre scanline of the image.
The following relationships can be determined from the diagram:
α + β + δ = 90°   (15)

tan β = b / h   (16)

tan(α + β) = (b + d) / h   (17)

tan γ = w / (b + d)   (18)
The values of b, d and w can be measured to within a few millimetres by
placing a grid on the ground and locating grid edges in the image. A value for h can
be measured directly using a ruler.
The real-world coordinates (x, y) can then be calculated as follows:
β = tan⁻¹(b / h)   (19)

α = tan⁻¹((b + d) / h) - β   (20)

γ = tan⁻¹(w / (b + d))   (21)
y = h tan(β + 2α(n - 1 - v) / (n - 1))   (22)

x = y tan(γ(2u - (m - 1)) / (m - 1))   (23)
With these formulae, it is possible to construct a lookup table for the values of
x and y for any given u and v (and the values of the parameters). Image coordinates
can then easily be transformed to real world coordinates via a process that is very
computationally efficient.
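The lookup-table construction using equations 19–23 can be sketched as follows (pure Python; the geometry values are illustrative, chosen so that the top scanline stays below the horizon, unlike the situation in Figure 40):

```python
import math

def ipm_table(b, h, d, w, m, n):
    """Build a lookup table mapping image pixel (u, v) to floor
    coordinates (x, y) using equations 19-23 (all lengths in cm)."""
    beta = math.atan(b / h)                      # equation 19
    alpha = math.atan((b + d) / h) - beta        # equation 20
    gamma = math.atan(w / (b + d))               # equation 21
    table = {}
    for v in range(n):
        # v = 0 is the top scanline; the bottom scanline (v = n-1) sees y = b.
        y = h * math.tan(beta + 2 * alpha * (n - 1 - v) / (n - 1))   # eq. 22
        for u in range(m):
            x = y * math.tan(gamma * (2 * u - (m - 1)) / (m - 1))    # eq. 23
            table[(u, v)] = (x, y)
    return table

# Illustrative geometry: camera 30 cm high, blind area b = 20 cm,
# d = 30 cm to the centre scanline, w = 40 cm, 320x240 image.
table = ipm_table(b=20.0, h=30.0, d=30.0, w=40.0, m=320, n=240)
```

Building the table once at start-up reduces the per-frame cost to a dictionary (or array) lookup, which is the efficiency point made above.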
The values of b, h, d and w can be measured by placing a regular grid pattern
on the floor and capturing camera images. If necessary, the robot can be moved
around until the image coincides with markings on the grid.
A different approach was also examined for calculating the values of these
parameters using the extrinsic parameters obtained from the camera calibration
software discussed previously in section 3.5.6.
After completing the camera calibration, it was possible to obtain the extrinsic
information for a specific camera pose from the corresponding image. In other
words, given the intrinsic parameters, it was possible to calculate the transformation
from camera coordinates to a coordinate system aligned with the calibration target
(checkerboard pattern). This in turn allows calculation of the height of the camera
and the camera tilt angle using basic geometry.
Figure 41 and Figure 42 show how the extrinsic parameters can be interpreted
either from the camera coordinate frame, or a coordinate system attached to the
calibration target grid. (These figures were produced by the camera calibration
software [Bouguet 2006]). Figure 41 shows the placement of the calibration target
relative to the camera on the assumption that the camera remains fixed at the origin
(so the grids represent multiple calibration targets). In reality, the camera is moved
around and the calibration target remains fixed at the origin, which is shown in
Figure 42 (where the various camera viewpoints are represented by tetrahedrons).
Figure 41 – Extrinsic parameters (Camera view)
Figure 42 – Extrinsic parameters (World view)
The transformation between the two coordinate systems (the camera-centric
view in Figure 41 and the calibration target view in Figure 42) can be described by
the combination of a rotation matrix and a translation vector.
Neither of these coordinate systems matches the one used by the robot (see
Figure 40), so further calculations had to be done. An Excel spreadsheet was written
to do this, and the extrinsic parameters for each of the images were entered into
separate sheets. The results were averaged to give the necessary parameters, as well
as the camera tilt angle and field of view. This calculation was then incorporated into
the mapping section of the ExCV program (it already made use of the intrinsic
parameters for undistorting the images).
To verify the calculations, a simple test was set up. Sheets of paper were
arranged on the floor using a tape measure. The edges of these sheets are clearly
visible in the image (Figure 43).
Figure 43 – Test image for Inverse Perspective Mapping
It should be noted that the edges of the sheets are not horizontal, although they
should be. This was caused by the camera not being mounted perfectly horizontally
on the robot. As a consequence there were errors in the IPM calculations. It was
therefore necessary to compensate for this by rotating the image slightly before
applying IPM. The amount of rotation required can be determined from the extrinsic
parameters.
The locations of each sheet of paper were known and their distance from the
camera could therefore be plotted against the calculated data simply by reading off
the pixel values at the edges of each sheet in the image, as shown in Figure 44. Note
that the y-coordinate of a point in an image is the same for all pixels across a
scanline. Also, according to convention, the origin for pixel coordinates is the top-
left corner of the image so vertical pixel values start from zero at the top of the
image and increase downwards.
Figure 44 – Verification of Inverse Perspective Mapping: distance (cm) versus vertical pixel coordinate for the data series ‘Intrinsic params’, ‘Measured params’ and ‘Validation’
Figure 44 shows that up to about 100cm there was little difference between the
two methods of calculating the IPM lookup table (Intrinsic params versus Measured
params) and the values obtained from the calibration image (Validation). However,
by the time the distance reached 200cm there was a range of ±10cm either side of the
value obtained from the calibration image (Validation).
Note that the purpose of Figure 44 was to prove that the equations were
correct. It does not represent the noise characteristics inherent in the sensor model.
See section 3.7.5 for an analysis of the noise characteristics.
It is worth noting that the X80 robot had a pan and tilt camera. However, the
way that this was implemented using servo motors meant that it did not rotate around
the camera’s centre. Furthermore, positioning of the camera depended on the
accuracy of the servo motors. Small variations could occur, therefore, between test
runs. In particular, small changes in the camera tilt affected the estimated distance
towards the top of the image, especially for longer distances. It was therefore
necessary to check the camera’s extrinsic parameters from time to time.
3.7.3 Simplistic Sensor Model
For drawing local maps a simplistic model and ray-casting were used initially.
The procedure was as follows:
• Let the calculated range (using IPM) to the floor boundary be r, at a
given angle off the principal axis of the camera;
• Scan along a radial ‘ray’ in the map from the camera to the observed
floor boundary intersection point;
• For distances less than r, the probability of each cell was set to a low
value; and
• At the measured range, r, set the cell to a high value, and stop the ray-
casting.
In addition, the ExCV program incorporated a maximum vision distance. This
value was set when the program first started up. If the range exceeded this maximum
during ray-casting, the ray-casting was terminated. This meant that no obstacle was
recorded in the map, but all the free space up to the maximum range was recorded.
The justification for this maximum range was that beyond a certain range the
potential errors in determining the exact pixel location of the floor boundary
corresponded to errors in the range value that were larger than a grid cell, i.e. the
range information became unreliable.
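The ray-casting procedure above can be sketched for a single ray as follows (the cell size, ranges, and the free-space/obstacle probabilities are illustrative; the 0.35/0.7 pair follows the values quoted from [Choset et al. 2005] below rather than the extreme 0/1 values):

```python
def cast_ray(n_cells, cell_size, r, max_range, p_free=0.35, p_occ=0.7):
    """Probabilities contributed by one ray: free space up to the measured
    range r, an obstacle cell at r, and nothing beyond max_range."""
    probs = {}
    for i in range(n_cells):
        dist = (i + 0.5) * cell_size        # range to the cell centre
        if dist > max_range:
            break                           # beyond the maximum vision distance
        if dist < r:
            probs[i] = p_free               # observed free space
        else:
            probs[i] = p_occ                # the obstacle; stop the ray
            break
    return probs

# Obstacle at 10 cm: free cells up to it, then one occupied cell.
ray = cast_ray(n_cells=50, cell_size=1.0, r=10.0, max_range=40.0)
# Obstacle beyond the maximum vision distance: free space only, no obstacle.
far = cast_ray(n_cells=50, cell_size=1.0, r=45.0, max_range=40.0)
```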
One unintended consequence of the simple model was that it rapidly replaced
information that was already in the map. This happened if the values assigned to
the free space and obstacle probabilities were too extreme, viz. 0 and 1. If the robot’s
estimated pose was incorrect, the result was that the incoming data quickly
overwrote previous data (which might actually have been correct) and effectively
altered the map to match the incorrect pose.
In [Choset et al. 2005 page 334] the probability of occupancy in the ‘free’
space was only about 0.35, while the obstacle itself rated about 0.7. These values
were fairly close to the ‘unknown’ value of 0.5. In other words, the probabilities
were not as ‘black and white’ as the simple sensor model would suggest. Choset et
al. provided arguments as to why this should be the case.
3.7.4 Effect of Range Errors
Consider the case of a small error, such as plus or minus one pixel in the
image. Examining the inverse perspective mapping lookup table for the Yujin robot
with a 1cm grid and specific camera geometry, the error in the radial distance was
found to be less than 0.5cm for ranges up to about 35-40cm. (It varies across the
image).
Figure 45 clearly shows the non-linearity of the inverse perspective mapping.
Note that this figure only shows the surface for the left half of the image from 0 to
160 pixels instead of the full 320. Quite clearly, towards the top of the image (by
convention, vertical pixels start numbering from zero at the top of the image) there is
a very steep change in the range from one pixel (or scan line) to the next.
Figure 45 – Radial range as a function of pixel coordinates
Once the range reaches the point where a single pixel corresponds to more than
one grid cell, the uncertainty in the location of the obstacle rises rapidly. Well before
reaching the top of the image, a single pixel represents many centimetres of distance
along the floor. At this stage the information contributed to the map by an obstacle is
‘smeared’ over a large number of grid cells.
Therefore, information was ignored if it fell beyond a certain maximum
distance – 40cm for the Yujin robot and 2m for the X80 – which corresponded to the
point where a one pixel increment would result in the range changing by more than a
grid cell. If the range to an obstacle was less than this, then the sensor model looked
similar to the ideal model with a narrow peak at the obstacle range.
With this constraint, known as the maximum vision distance, the simple sensor
model was used initially.
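The maximum vision distance can be derived directly from the IPM geometry by walking up the image until a one-pixel step changes the range by more than one grid cell. A sketch (the geometry values and 1 cm grid are illustrative, not the thesis calibration):

```python
import math

def max_vision_distance(b, h, d, n, cell_size):
    """Walk up the image from the bottom scanline and return the largest
    range y for which a one-pixel step still changes y by at most one
    grid cell (using the equation 22 geometry)."""
    beta = math.atan(b / h)
    alpha = math.atan((b + d) / h) - beta
    y = lambda v: h * math.tan(beta + 2 * alpha * (n - 1 - v) / (n - 1))
    for v in range(n - 1, 0, -1):
        if y(v - 1) - y(v) > cell_size:
            return y(v)          # the next scanline up would skip a whole cell
    return y(0)

# Illustrative geometry (cm): camera 30 high, blind area 20, d = 30,
# 240 scanlines, 1 cm grid cells.
limit = max_vision_distance(b=20.0, h=30.0, d=30.0, n=240, cell_size=1.0)
```

Coarser grid cells push the limit further out, which is consistent with the different cut-offs quoted above for the two robots.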
3.7.5 Vision Sensor Noise Characteristics
The simple sensor model assumed no noise. However, this was clearly not true
because the vision system was not perfect.
To assess the noise characteristics of the vision sensor, the output from the
vision system’s floor detector was captured. These ‘floor’ images were then edited
by hand to correct any errors in the locations of the obstacles. A total of 100 images
were corrected and used to generate statistics for the noise.
To compare them against the vision system output, images were scanned
column by column looking up from the bottom of the image to find the first edge
pixel, i.e. the nearest obstacle. This resulted in a total of 17,600 data points.
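The column scan used to locate the nearest obstacle in each column, and the comparison against the hand-corrected images, can be sketched as follows. This is an illustrative reconstruction (the function names and the gap threshold are hypothetical), not the actual analysis code:

```python
import statistics

def first_edge_rows(img):
    """For each column, the row of the first non-floor pixel scanning up
    from the bottom; img[row][col] is True for floor, rows count from 0
    at the top of the image."""
    h, w = len(img), len(img[0])
    edges = []
    for col in range(w):
        row = h - 1
        while row >= 0 and img[row][col]:
            row -= 1
        edges.append(row)   # -1 would mean the whole column was floor
    return edges

def edge_error_stats(detected, corrected, gap_limit=10):
    """Per-column scanline differences between the detector output and the
    hand-corrected image, discarding gross failures (gaps) as outliers."""
    diffs = [d - c for d, c in zip(first_edge_rows(detected),
                                   first_edge_rows(corrected))
             if abs(d - c) <= gap_limit]
    return statistics.mean(diffs), statistics.pstdev(diffs)
```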
Consider the sample floor image and the corrected image in Figure 46.
Figure 46 – Corrected Floor Image: (a) Original; and (b) Hand-edited
Small gaps occurred in the edges for a variety of reasons. This was the primary
failure mode of the floor detector. Figure 46 (b) shows how these mistakes were
corrected by hand for the statistical analysis. Although it was possible to close these
gaps by increasing the sensitivity of the edge detector, this also enhanced shadows
and extracted carpet texture from the floor. The values used were a compromise.
In the initial statistics, the errors due to these gaps were included but they
resulted in very large standard deviations (to accommodate the large differences
from the correct scanline to the top of the image) and they also biased the means.
Usually these problems corrected themselves as the robot got closer to the obstacles,
Chapter Three Design and Experimentation
or they were too narrow for the robot to pass through anyway. So these particular
cases (1206 data points, or 6.85% of the total) were ignored in calculating the
statistics.
As mentioned above, shadows and floor texture were sometimes interpreted as
walls. (See section 4.2.1 for further examples). These errors reduced the estimated
range to the nearest obstacle which could affect path planning. Once again the robot
usually saw the same area later on and resolved these mistakes. These effects were
minimized by appropriate selection of the Canny edge detector parameters.
Figure 47 shows the distribution of the data points in this set of sample images
by scanline, where the scanlines are numbered downwards from the top of the image.
Figure 47 – Frequency of edges in sample data set by scanline
Note that there were very few data points for scanlines numbered above 95,
although every scanline from 95 upwards contained at least some data. This
region is at the bottom of the image and very close to the robot (less than
36cm). The nature of the exploration algorithm was such that the robot did not
need to approach walls closely.
The large spike at the bottom scanline (143) was caused by a couple of images
where the robot was too close to a wall and could not see the floor. In this case, the
local map had the ‘blind’ area in front of the robot filled in with a low probability
indicating an obstacle at an unknown range.
Scanlines below 6 (top of the image) corresponded to distances of more than
10 metres, and there was no data in this range for this data set, which is why it is
empty in the histogram. The remainder of the data points were fairly well distributed
over the range of scanlines from 20 to 80, or from about 2 metres down to 40cm
from the robot.
For each edge located in a corrected image, the first edge in the corresponding
column of the original image was also located. Differences between the two scanline
numbers indicated errors. These errors were summed to calculate the mean and
standard deviation applicable to obstacle edges.
Figure 48 – Mean of vision errors by scanline
Figure 49 – Standard Deviation of vision errors by scanline
It is interesting to note that the error distribution appeared to be bi-modal,
with two obvious peaks and very accurate data in the middle of the images. Also
note that a large number of the mean values were less than one pixel. In fact,
the mean error across all data points was only -0.24 pixels, with only 7
scanlines where the mean exceeded 1 pixel (ignoring scanline 143). The standard
deviations were also fairly small and only exceeded 1.0 in 38 cases.
Using the IPM data for the camera, error ranges were calculated for each
scanline by taking the mean plus or minus one standard deviation, and then
converting the resulting scanline number to a grid cell distance from the robot (with
a 5cm cell size). This is shown below where the error bars are superimposed on the
IPM range information.
Figure 50 – Range errors by scanline
In Figure 50 a value of 40 grid cells (scanline 22) corresponds to a distance of
2 metres. This was the maximum vision distance used in most of the experiments.
Given the low error rates over most of the image, an empirical estimate of ±1 pixel
was reasonable.
3.7.6 More Realistic Sensor Model
Details for the development of sensor models can be found in textbooks such
as [Thrun et al. 2005, Choset et al. 2005]. However, the sensor models are derived
for sensors that produce a continuous range of values. Vision on the other hand
produces quantized values due to the pixels in the image.
Furthermore, the textbook derivations assume that the range error has a
Gaussian distribution. This is clearly not the case for vision because each successive
scanline (pixel row) moving up towards the top of the image represents a larger and
larger distance due to perspective, so an error of as little as ±1 pixel results in an
error distribution that not only changes across the image, but is also significantly lop-
sided.
Single pixel errors in edge detection would barely be noticed by a human and
this is generally accepted as a typical error in image processing. Although edge
detectors do exist with sub-pixel resolution, ultimately the resolution of the camera
comes into play. When the edge corresponding to an obstacle is towards the top of
the image, perspective can make a significant difference to the estimated range. In
the development of the sensor model for vision a possible error of ±1 pixel was
assumed for the accuracy of edge detection.
However the key point is that errors in edge detection affect the map and the
primary purpose of this research was to produce maps. To account for range
uncertainty due to errors in the location of the floor boundary, a more sophisticated
model was developed that was similar to Figure 51. Note that the diagram is
exaggerated. The peak is not symmetrical because the distance from the camera does
not increase linearly for higher scanlines.
Figure 51 – Complex sensor model for errors in range measurement
Formulae are not provided for the sensor model because it was determined
using table lookup based on IPM (explained below). The peak of the curve in Figure
51 corresponds to the estimated range. The 'spread' of the peak is based on the
differences between the range to successive scanlines. Although it might not be
immediately apparent from Figure 51, the curve is not symmetrical about the peak
because an error of one pixel upwards in the image corresponds to a much larger
distance than an error of one pixel downwards due to perspective.
Also note that the shape of the distribution changes with the estimated distance
from the camera, so that the spread of the peak decreases for scanlines further down
the image. Figure 52 is for an obstacle that is closer to the camera than Figure 51,
and therefore the obstacle appears closer to the bottom of the image. Notice that the
peak is much sharper.
Figure 52 – Complex sensor model (with errors) for shorter range
Near the bottom of the image, the range difference between successive
scanlines was very small – less than a centimetre. With a grid size of several
centimetres, the peak in the distribution was therefore only a single cell wide (even
for errors of several pixels in determining the floor boundary) and the distribution
was essentially the same as for the simple sensor model as shown in Chapter 2.
Eventually, the range difference between successive scanlines exceeded the
size of a map grid cell and ‘shoulders’ appeared on the curve as shown in Figure 51
and Figure 52 above. For the X80, the grid size was usually set to 5cm. By the time
that the range had reached 2m, the difference was around 10 grid cells, or 50cm
which was more than the diameter of an X80 robot.
Note that the value for free space in this model was not set to zero and the peak
did not reach 1. These changes from the simple sensor model reflected the fact that
there was a small probability of bad sensor readings which gave entirely incorrect
ranges. They also prevented a single sensor update from immediately overwriting the
data in the map. Ideally, each cell in the map should be observed multiple times in
order to verify its occupancy state.
To implement this sensor model, the ExCV program built a table when it
started that contained the relevant values. The way that the values for this model
were obtained was as follows:
For each value of radial distance (measured in the centre of a cell):
1. Find the scanline corresponding to this range in the image;
2. Calculate the average change in range for the scanlines above and below, i.e.
±1 pixel; and
3. Convert this change in the range value into the equivalent number of map
grid cells and store this in the table. (This is referred to as the ‘spread’.)
When drawing the local map, if this spread was zero, then the obstacle was
simply drawn into the map at the specified range. However, if the spread was greater
than zero, then the values either side of the obstacle range needed to be calculated to
determine the starting and ending range values. The occupancy values were then
increased along the ‘ray’ from the robot from the starting range up to the estimated
range, and then decreased again on the other side out to the ending range.
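The steps above might be sketched as follows, with a hypothetical `range_of` function standing in for the IPM lookup table. The simple triangular peak here is illustrative only, since the actual model was asymmetric about the estimated range:

```python
def build_spread_table(range_of, img_h, cell=5.0, max_cells=40):
    """spread[i] = half-width (in cells) of the sensor-model peak for the
    i-th cell-centre range, from the average range change for a +/-1 pixel
    error. `range_of(row)` maps a scanline to a floor range (same units
    as `cell`); ranges decrease as the row number increases."""
    rows = list(range(1, img_h - 1))
    table = []
    for i in range(max_cells):
        r = (i + 0.5) * cell
        # Invert the IPM: nearest scanline for this cell-centre range.
        row = min(rows, key=lambda s: abs(range_of(s) - r))
        avg = (range_of(row - 1) - range_of(row + 1)) / 2.0
        table.append(int(avg // cell))
    return table

def apply_obstacle(ray, hit, spread, peak=0.9, floor=0.1):
    """Write a peak into `ray` (occupancy values along one ray from the
    robot), centred on the obstacle cell and ramping over `spread` cells
    on each side."""
    for c in range(max(0, hit - spread), min(len(ray), hit + spread + 1)):
        w = 1.0 - abs(c - hit) / (spread + 1)
        ray[c] = max(ray[c], floor + (peak - floor) * w)
```

When the spread is zero this degenerates to the simple model: the obstacle is drawn into a single cell at the estimated range.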
3.7.7 The Radial Obstacle Profile (ROP)
A Radial Obstacle Profile (ROP) is an array of distances to the obstacles in the
vicinity of the robot. It might be considered to be similar to a traditional sonar, radar
or laser range finder scan.
The term ROP was introduced to the literature by the author [Taylor et al.
2004b]. It is similar to a range scan but represents the information from a single
image.
The ‘width’ of an ROP is determined by the FOV of the camera. In this thesis
ROPs are 50° or 60° wide. In contrast, range scans from laser range finders usually
include information over an arc of 180°.
The angular increment between array elements in the ROP is fixed and is one
of the configuration parameters. It is determined from the FOV divided by the
horizontal resolution of the camera. (See the next section). This means that each
element in the ROP array only needs to contain a single value for the distance
because the array index corresponds to an angle.
If the robot spins around on the spot taking a succession of overlapping
images, referred to in this thesis as a pirouette, the resulting sequence of ROPs can
be combined into a full 360° ROP sweep.
The combined ROPs can be plotted in a similar way to a radar sweep, as
shown in Figure 53 below. The grey radial lines in the figure are 10° apart and the
circular lines are 20cm apart. The thin white line is a smoothed version of the ROP
obtained by using the Discrete Cosine Transform (DCT), which is explained further
in Chapter 4. The actual ROP is the thick white line.
Figure 53 – Typical Radial Obstacle Profile (ROP) Sweep
A different way to view an ROP is shown in Figure 54 where the individual
free-space maps have been overlaid on a top-down picture of the playing field.
Notice that the obstacles cast ‘shadows’ because the robot (a Hemisson at the centre
of the picture) could not see through the obstacles. (The alignment is not perfect
because the scale of the photograph was unknown and so the map had to be re-scaled
and overlaid by eye. However, the principle is nonetheless clearly illustrated.)
Figure 54 – ROP overlaid on a playing field
Note that the robot in Figure 54 can see the whole of its environment in a
single sweep. Additional walls were introduced in later experiments to force it to
explore.
3.7.8 Angular Resolution of the ROP
There is a limit on the angular resolution that is possible with the ROP. The
maximum angular resolution, i, can be calculated as follows:
i = FOV / (m − 1) (24)
where m is the horizontal resolution of the camera in pixels.
Note that this is not exact because it is calculated at the centre of the image and
the actual angle varies with the height in the image.
For example, a typical web camera has a FOV of 60° and a horizontal
resolution of 320 pixels. This means that the best possible ROP increment is 0.188
degrees. For most of the work either 0.5 or 0.25 of a degree has been used.
3.7.9 Calculating the ROP
In the initial implementation, the ROP was approximated as follows:
• Scan across the bottom of the floor map image one pixel at a time;
• For each pixel column, scan upwards until a non-floor pixel is found;
• Obtain the polar coordinates of this pixel (floor/obstacle intersection)
via a table lookup.
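The initial column-scan approximation might look like this, with `to_polar` standing in for the IPM lookup table (both names are hypothetical):

```python
def compute_rop(floor_img, to_polar, fov_deg=60.0, increment=0.5):
    """Approximate a Radial Obstacle Profile from a binary floor image.
    `to_polar(row, col)` returns (range, bearing_deg) of the floor point
    seen at that pixel, i.e. the IPM lookup. The ROP array index encodes
    the bearing, so each element holds only a distance."""
    h, w = len(floor_img), len(floor_img[0])
    n = int(fov_deg / increment) + 1
    rop = [float('inf')] * n    # index * increment - fov/2 = bearing
    for col in range(w):
        # Scan upwards from the bottom for the first non-floor pixel.
        for row in range(h - 1, -1, -1):
            if not floor_img[row][col]:
                rng, bearing = to_polar(row, col)
                i = round((bearing + fov_deg / 2) / increment)
                if 0 <= i < n:
                    rop[i] = min(rop[i], rng)
                break
    return rop
```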
In later versions of the code, however, ray casting was used – tracing a ray
from the camera outwards at a given angle. Note that the angle is measured across
the floor, not directly in the image. The rays remain straight however, because a
perspective transformation preserves straight lines.
To obtain the image pixel coordinates in the image, the ray must be
transformed back from real-world coordinates to image coordinates. This is
perspective mapping, the opposite of IPM.
For the purpose of testing the ROP and mapping algorithms it was sufficient to
use a simulator to confirm that maps were constructed correctly. With a simulator the
pose is known exactly, so errors in the pose do not affect the map generation. In
practice, ROPs are subject to a variety of errors due to incorrect segmentation or ‘off
by one’ problems in locating the edges of the floor. Cumulative odometry errors are
also a fact of life in the real world. Therefore a simulator is a very useful piece of
software because it allows experiments to be performed in an environment where
random errors do not exist. Errors can then be introduced into the simulation in a
controlled fashion to test the robustness of the algorithms.
3.7.10 Effects of Camera Tilt
One problem with the floor boundary that only became apparent during
development was the effect of camera tilt on vertical edges.
Following a pirouette the 12 individual maps were combined to create a local
map. In simulation, a typical result looked like the example in Figure 55. Notice that
there are ‘fuzzy’ lines in the map which are the result of round-off and quantization
due to the size of the map cells. However the real problem that needed to be
corrected was accounting for vertical edges.
In the map in Figure 55, some of the walls (solid black lines) have been
circled. These are not walls at all. These extraneous walls are an artefact of the floor
detection process as a consequence of applying a naïve approach.
Figure 55 – Naïve local map showing extraneous walls
Consider the view from the robot and the proposed floor outline shown in
Figure 56 (a) and (b) respectively. These figures correspond to the view inside the
top (larger) ellipse in Figure 55. If the outline in Figure 56 (b) was used to mark the
floor boundary in the local map using IPM then incorrect walls would result.
Humans recognise that the large (yellow) object near the middle of the image has
two vertical edges even though the vertical edges do not appear vertical in the image
due to the camera tilt.
Figure 56 – Detecting vertical edges: (a) Simulated camera image; (b) Detected floor; and (c) Detected vertical edges
A naïve approach would treat all of the edges of the floor contour, Figure 56
(c), as physical boundaries (walls). However, in Figure 56 (c), the vertical edges
have been detected in the floor contour and are marked with thicker (green) lines as
explained in [Taylor et al. 2006, Taylor 2007]. When drawing the local map the
robot knows not to mark these edges as walls.
After removing vertical edges from the floor contour, the local map looks like
Figure 57, which no longer has the extraneous walls.
Figure 57 – Corrected local map without walls due to vertical edges
A couple of additional points need to be made about this map. Firstly, the
simulator was constantly ‘taking photos’, just as a real camera would. Once the robot
had stood still for a while, the map became quite black and white – literally. (See the
circled area at the top of Figure 57). This was a result of the same data being added
to the map continuously and reinforcing what was already there.
On the other hand, as the robot does its pirouette, it takes a photo after each
turn and then immediately turns again. The result appears as a star pattern that shows
where the overlaps have occurred between photos. If the robot had taken several
photos after each turn and integrated them all into the map, then this pattern would
not be visible.
3.7.11 Mapping the Local Environment
The first stage in creating a global map is to produce a map of the immediately
surrounding environment, which is a local map. This local map can then be
combined with the global map using Bayes Rule.
Note that the local map is relative to the robot, so the robot is always at the
centre (origin) of the local map. Before incorporating the new information into the
global map, the local map has to be positioned relative to the global map. The local
map can be generated with the correct rotation using the robot’s current pose
information and it is then only a matter of applying the appropriate translation in
order to overlay it onto the global map.
Selecting an appropriate grid size for an occupancy grid map is dependent on
several factors, including:
• Memory limitations;
• Size of the robot; and
• Degree of clutter in the environment.
The author proposes as a general rule of thumb that the cell size should be
roughly an order of magnitude smaller than the size of the robot. Obviously the cell
size should not be larger than the size of the robot because it could get ‘lost’ inside a
cell. Conversely, a very small cell size increases the computational overhead. As the
cell size decreases, the number of cells increases quadratically – reducing the cell
size by a factor of two results in four times as many cells in the map.
Another consideration in the selection of the map grid size is the environment.
For instance, the narrowest corridors still need to be several grid cells wide in the
map. Otherwise errors in the map might squeeze a corridor to the point where it
appears to the robot that it is blocked.
When using the Hemisson robot (roughly an oval shape 12cm by 10cm) and
Yujin robot (a 7cm cube), a cell size of 1cm was used. There were no narrow
openings in the playing fields that the robot needed to pass through.
For the X80 robots (roughly 30cm in diameter), cell sizes of 5cm and 10cm were
used in the experiments. The narrowest corridor was just under 1m in width, or about
10 map cells wide even at the 10cm grid size.
3.7.12 Creating a Global Map
To create the global map, each new local map must be integrated into the
existing global map. This is done by registering the local map to the global map
based on the robot’s estimated pose, and then applying Bayes Rule for each
corresponding cell from the local map which does not have a value of ‘unknown’.
Values of ‘unknown’ (0.5) in the local map correspond to areas where no
information was obtained from the vision system. If Bayes Rule were applied to
these areas as well they would incorrectly affect existing information in the global
map.
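A minimal sketch of this update, assuming cell probabilities are fused with the standard binary Bayes rule for independent observations (the rotation step is taken as already applied, and the names are illustrative):

```python
def fuse_local_map(global_map, local_map, dx, dy):
    """Fold a robot-centred local map into the global occupancy grid with
    Bayes Rule, skipping 'unknown' (0.5) cells. `dx, dy` is the
    translation of the local map origin within the global map."""
    for y, row in enumerate(local_map):
        for x, l in enumerate(row):
            if l == 0.5:
                continue   # no information: leave the global cell alone
            p = global_map[y + dy][x + dx]
            # Binary Bayes update for a static cell.
            global_map[y + dy][x + dx] = (l * p) / (l * p + (1 - l) * (1 - p))
```

Skipping the 0.5 cells is what prevents unseen areas of the local map from washing out information already in the global map.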
3.8 Localization
Reliable localization is essential for producing an accurate map. Localization
presented several challenges which are discussed below.
Various methods of visual localization were tested in the experiments. These
are discussed in the following sections, but first some remarks about odometry are
appropriate because it is the source of errors in pose estimation.
3.8.1 Odometry
In the initial work using the Yujin robot, there was no odometry information
available from the robot. Although it does have wheel encoders, which are used by
the on-board controller to control wheel speed, the encoder information cannot be
obtained from the robot because the communication is one-way via a radio modem.
Therefore the ‘odometry’ consisted of timing the motions of the robot. This is
extremely unreliable because Windows is not a real-time operating system and
cannot accurately execute timed delays.
Each wheel encoder on an X80 robot generates 1200 ‘ticks’ for a complete
revolution of the wheel. An encoder counts upwards when the wheel rotates in one
direction, and counts down for rotations in the opposite direction. The tick count is a
16-bit value so it wraps around (back to zero) at 32767. The resolution of the
encoders equates to 0.3° in terms of wheel rotation. This is a fairly good resolution,
but the usual odometry problems still apply.
The wheel diameter for the X80 robots is nominally 165mm according to the
manufacturer. This gives a perimeter of 518.36mm. In the case where the robot is
moving forward (both wheels turning at the same speed) one tick corresponds to
travelling 0.43mm, assuming no slippage between the wheel and the ground.
It is of course impossible to move the robot by only a single tick. The
mechanical backlash in the motor/gear/axle assembly amounts to several ticks. Even
with the on-board PID controller operating in position mode (driving the motors to
make the wheel encoder tick count match a specified value), it is not possible to
position the robot to closer than 10 ticks from the desired position.
If the robot is driven to turn on the spot by turning the wheels in opposite
directions, then the distance travelled per tick is around the perimeter of a circle with
a diameter equal to the wheelbase (spacing between the wheels). The manufacturer
says that the wheelbase is 272mm, giving a perimeter of 854.51mm. Therefore a
single tick corresponds to 0.18° in terms of the orientation of the robot.
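The tick arithmetic above can be collected into a small differential-drive sketch (the function and constant names are illustrative, not from the actual control code):

```python
import math

TICKS_PER_REV = 1200
WHEEL_DIAMETER = 165.0   # mm, nominal
WHEELBASE = 272.0        # mm, nominal

# Distance travelled per encoder tick: wheel perimeter / ticks per rev.
MM_PER_TICK = math.pi * WHEEL_DIAMETER / TICKS_PER_REV

def odometry_delta(left_ticks, right_ticks):
    """Forward distance (mm) and heading change (degrees) from encoder
    deltas, using the differential-drive model and assuming no slip."""
    dl = left_ticks * MM_PER_TICK
    dr = right_ticks * MM_PER_TICK
    distance = (dl + dr) / 2.0
    # Heading change: wheel path difference over the wheelbase (radians).
    heading = math.degrees((dr - dl) / WHEELBASE)
    return distance, heading
```

One tick on both wheels gives about 0.43mm of travel; one tick on each wheel in opposite directions gives a rotation of about 0.18°, matching the figures above.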
3.8.2 Tracking Image Features
Many visual SLAM approaches use features, particularly if they are based on a
Kalman Filter. Therefore extracting features from the images was investigated.
Due to hardware limitations, the X80 robots did not produce frames fast
enough to allow optic flow to be calculated because optic flow requires a high frame
rate. Furthermore, the images were affected by motion blur while the robot was
moving. The hardware therefore precluded the use of optic flow.
Some experiments were carried out using SIFT as a possible method for
tracking the robot’s motion. Sample code for SIFT was downloaded from Lowe’s
web site at http://www.cs.ubc.ca/~lowe.
It was found that in most of the camera images there were very few SIFT
keypoints detected, and in many cases they were not useful. This was primarily because
SIFT looked for corner-type features. Even a ‘good’ image containing doorways
rarely produced matches between successive images during a rotation.
Figure 58 shows two examples of SIFT feature matching between successive
images when the robot rotated on the spot. Each diagram consists of two images, one
above the other, with corresponding keypoints connected by white lines.
Figure 58 – Matching SIFT features between images: (a) Small rotation; and (b) Large rotation
In Figure 58 (a), the robot had just completed a pirouette but clearly it did not
return exactly to its original orientation, although the error was only a few degrees.
There was only one matching keypoint found, even though there were roughly 20
keypoints in each image.
Figure 58 (b) shows a 30° rotation. In this case three matches were found, but
they were all false matches. This was a common problem. Obviously, if these
matches were to be used for localization then the robot would quickly become lost.
A further problem was that the detected features were often not at floor level.
This made determining their location in 3D space difficult, although not impossible.
It was found that interest operators tended to break down when confronted
with noisy images and either fail to produce any features or generate spurious
features. For example, SUSAN was unable to produce reliable features with actual
images from the X80 camera.
The main point is that feature tracking was not appropriate for the
experimental setup in this project.
3.8.3 Scan Matching of ROPs
Scan matching is not directly relevant to vision. However, the approach taken
in this thesis was to generate a Radial Obstacle Profile which is similar to a laser
range scan. Laser scans have been used with scan matching in the literature.
Early in the project, an attempt was made to use the Lu & Milios approach, but
it was not successful. There were two distinctly different reasons for this. They are
quite easy to understand in retrospect.
In their work, Lu & Milios had data for a full 360° scan at 1° intervals. The
algorithm for determining rotations slid one set of data over the other looking for the
best match. Because they had a full circle, there were no problems with boundary
conditions because the data sets could be wrapped around. Furthermore, they had a
lot of data points to work with.
However, in this research, when a robot rotated on the spot, the overlap
between camera images was only half of the field of view, or about 30 degrees. As
one data set was moved with respect to the other, the area of overlap reduced
until there was nothing to match. The algorithm had to ignore points with no
corresponding point. Consequently, the best match could often be found by sliding
the data until only a couple of points remained and they were a perfect match. The
resulting rotation estimates could therefore be grossly incorrect.
The diagram in Figure 59 shows a top-down view of a robot rotating on the
spot. The field of view of the robot is shown before and after a rotation. (It does not
matter which is which). Clearly, the amount of overlap is not large, so there is very
little data for comparison.
Figure 59 – Overlap in field of view between rotations
Reducing the amount of the rotation to increase the overlap would increase the
amount of data available for matching, but it would also mean that more rotations
would be required to complete a pirouette. As previously mentioned, the more times
the robot rotated the more opportunities there were for it to become disoriented
because motions were never exact. In fact, the error in rotations increased
significantly, sometimes to as much as 50%, when they were less than 10 degrees.
Note too that the camera is not in the centre of the robot in Figure 59, so the
vantage points for each of the views are separated in space. This makes comparisons
difficult because the effect is a combination of rotation and translation. It can also
have undesirable effects due to occlusion and it leaves a small area unseen, but these
effects are just noted in passing and are not addressed here.
For translations, there are entirely different problems – perspective combined
with the quantization effects involved in capturing a digital image. The result is that
successive scanlines higher up in an image equate to significant changes in the
corresponding locations of points on the floor.
For example, consider two points in successive images that are only one
scanline apart (a single pixel). Some typical values from the IPM (Inverse
Perspective Mapping) data for a robot are shown in the following table. Note that
scanlines count downwards from the top of the image, which is why the range
decreases with increasing scanline numbers.
Table 3 – Effect of perspective on distance measurements

Scanline (pixels) | Range from camera (cm) | Difference (cm) | Percentage Difference
21                | 201.17                 | (Reference 1)   | (Reference 1)
22                | 190.84                 | 10.33           | 5.41%
110               | 30.12                  | (Reference 2)   | (Reference 2)
111               | 29.79                  | 0.33            | 1.11%
At a range of around 2m, a difference of one scanline represents 10.33cm
across the floor. Near the bottom of the image, at a range of 30cm, the difference is
only 0.33cm.
Expressing differences as a percentage disguises the impact, but cannot
entirely eliminate it (as shown in the final column of Table 3). Besides, there could
be cases where one data point was an outlier or there was a sudden discontinuity.
Penalising or, just as importantly, failing to penalise a data point in these situations
would not help to obtain the correct alignment between the scans.
The problem is that the Lu & Milios approach implicitly assumed that the
range measurements were linearly distributed and that the measurement error was
constant regardless of distance. For a laser range finder, this is a valid assumption.
However, for a perspective image, this is simply not true. Not only are the distances
restricted to a set of discrete values, but the separation between them increases non-
linearly further up the image. The associated error in the estimated position also
increases dramatically at the same time so the significance of a single pixel error in
detecting an edge in the image becomes larger for scanlines higher up the image.
It was found that although the standard Iterative Closest Point (ICP) algorithm
applied to pure translations (forward moves) would converge, the result would often
be offset by some amount from the real distance travelled.
Attempts were made to change the algorithm by applying weights to the
distance measurements so that larger distances contributed less to the accumulated
error sum, but good results still could not be obtained.
Additional theoretical work might allow the Lu & Milios approach to be
modified for use with vision, but this remains an area for further investigation.
3.8.4 Using Hough Lines in the Map
Extracting Hough lines from a map to assist with localization has been done in
the past.
Figure 60 shows two different views of the Hough lines associated with a map
from this research. In (a), the raw lines are shown. The Hough Transform locates
lines parametrically by the angle (slope) and orthogonal distance from the origin. In
this case, the angular resolution was set to one degree and minimum line length to 10
pixels.
Figure 60 – Hough Transform of a grid map: (a) Standard; and (b) Probabilistic
The sheer number of Hough lines shown in Figure 60 (a) clearly makes the
result confusing. To address this kind of problem, OpenCV also implements what it
calls ‘Probabilistic’ Hough lines. This function attempts to locate the actual lines in the image, including the
end-points. An example is shown in Figure 60 (b) for the same map as in (a).
The lines shown in Figure 60 (b) were obtained for a minimum length of 20
pixels, which corresponded to 1 metre (at a 5cm grid size). At the edges of the map
there were wall segments that were shorter than this, so gaps resulted. However,
Mapping of Indoor Environments by Robots using Low-cost Vision Sensors Trevor Taylor
reducing the minimum length of the Hough lines produced more extraneous lines. In
other words, the minimum line length is a compromise.
This approach was not pursued further due to the difficulties explained above
but it warrants more investigation.
3.8.5 Orthogonality Constraint
Based on the concepts of scan matching and using Hough lines in the map, an
approach was developed for localization, referred to as Incremental Localization.
This involved assuming that the walls were orthogonal. In most buildings, walls are
built parallel or at right-angles to each other. If the robot can identify a wall in an
image, then it can use this information to determine its orientation provided that it
has a reasonable estimate of its current orientation.
In the simplest case, a robot can follow a corridor by aligning with the walls.
However, if the walls run either North-South or East-West (which are arbitrary
directions related to the orientation of the map, not real compass directions) then the
robot also knows to the nearest 90 degrees which direction it is travelling and can
‘snap to’ one of the primary directions.
Because this is one of the main contributions of this thesis, it is discussed in
detail in Chapter 4 rather than here. A brief description of the algorithm is as
follows:
• Apply the Douglas-Peucker algorithm to the ROP to find straight lines;
• Select lines longer than a specified threshold to ignore noise and small
objects;
• Check if any of these lines correspond to walls close to North-South or
East-West in the map; and
• If the difference between the direction of a wall and one of the primary
directions is small enough, then update the robot’s orientation to be
consistent with the wall.
More than one wall might be visible. In this case, the adjustment to the robot’s
orientation is a weighted average of the adjustments indicated by each wall based on
the length of the wall segment, i.e. longer walls carry a larger weight because they
are more likely to be correct.
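The length-weighted orientation correction just described can be sketched as follows. The tolerance value and the sign convention are assumptions for illustration only; the actual thresholds used in the research are discussed in Chapter 4.

```python
SNAP_TOLERANCE_DEG = 10.0   # hypothetical threshold; the text only says "small enough"

def orientation_correction(walls):
    """walls: list of (angle_deg, length) wall segments in the map frame.

    Each wall within tolerance of a primary direction votes for the
    rotation that would make it exactly North-South or East-West;
    votes are weighted by wall length, so longer walls dominate.
    """
    num = den = 0.0
    for angle, length in walls:
        residual = angle % 90.0          # offset from nearest 90-degree multiple
        if residual > 45.0:
            residual -= 90.0
        if abs(residual) <= SNAP_TOLERANCE_DEG:
            num += length * residual
            den += length
    # Correction to apply to the robot's heading (sign is a convention:
    # the heading error is assumed equal to the walls' apparent rotation).
    return -num / den if den > 0 else 0.0

# A long wall 2 degrees off axis and a short one 4 degrees off.
print(orientation_correction([(92.0, 40), (94.0, 10)]))   # → -2.4
```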
The effect of applying this constraint was to improve the SLAM results
significantly because it improved the motion estimates.
3.9 Simultaneous Localization and Mapping (SLAM)
The GridSLAM algorithm was selected from a recent textbook, Probabilistic
Robotics [Thrun et al. 2005], which is one of the definitive texts in the field of
SLAM. A Particle Filter algorithm was selected because it is more suitable for use
with occupancy grids than a Kalman Filter approach.
Because SLAM was well documented in the literature, a fundamental
assumption was made that reliable algorithms were available. However, this proved
not to be the case when using visual data from the hardware in this research.
Recently, several researchers created the OpenSLAM web site [OpenSLAM
2007]. Some of the SLAM packages from this web site were used for comparison
when problems were encountered with GridSLAM in order to verify that the
problems were generic and not restricted to a particular algorithm.
3.9.1 Implementation
In the algorithm below, the notation is the same as in Chapter 2 with the
addition of:
ut' Corrected control input at time t after Incremental Localization
The GridSLAM algorithm was modified slightly to incorporate incremental
localization, which is one of the innovations of this work. The new algorithm is
shown below.
Algorithm 5 – FastSLAM Occupancy Grids V2
FastSLAM_Occupancy_Grids_V2(P_{t-1}, u_t, z_t)
    T_t = Ø
    For k = 1 to M
        u_t' = Incremental_Localization(u_t)
        x_t^k = Sample_Motion_Model(u_t', x_{t-1}^k)
        w_t^k = Measurement_Model_Map(z_t, x_t^k, m_{t-1}^k)
        m_t^k = Update_Occupancy_Grid(z_t, x_t^k, m_{t-1}^k)
        Add particle ‹x_t^k, m_t^k, w_t^k› to T_t
    End For
    P_t = Resample(T_t)
    Return P_t
This modified FastSLAM algorithm uses yet another algorithm called
Incremental_Localization that was created by the author.
3.9.2 Incremental Localization
In the version of the algorithm used in this work the control input, ut, is first
corrected by using incremental localization before applying the motion model. The
procedure involved is discussed in detail in section 4.4 on Incremental Localization.
In summary, it involved locating walls in the images and making the assumption that
walls run North-South or East-West in the map. (These are nominal directions
corresponding to up-down and left-right in the map. They are not actual compass
directions.) The robot’s orientation could therefore be corrected so that it remained
aligned with the walls.
It is important to note that without the incremental localization step GridSLAM
was found to be only marginally stable and would often fail.
3.9.3 Motion Model
It was decided to use a simplified motion model similar to the approach used in
GMapping [OpenSLAM 2007]. It has no physical significance, but it is plausible
based on experience with the X80 robots. The model involves adding random errors
with a Gaussian distribution to the pose.
There were two basic types of motion that the robot performed during
exploration: forward moves and rotations on the spot. For each of these a Gaussian
noise component was applied to the robot position and the orientation. The variances
for position and orientation were maintained separately.
The equations relating the motions of the robot are shown below. These are
expressed in the 2D coordinate frame of the robot where the robot begins at the
origin facing along the positive x axis.
x = d + ζ,  y = η      (25)
α = θ + ξ              (26)
Where:
x, y = Components of the actual translation
d = Requested distance to translate
ζ = Translation error in x direction as a Gaussian
η = Translation error in y direction as a Gaussian
α = Actual angle of rotation
θ = Requested angle of rotation
ξ = Rotation error as a Gaussian
To apply the motion model to the robot’s pose, the translation vector, (x,y),
was first calculated and then transformed to the world coordinate system using the
robot’s current estimated orientation. The coordinates could then be added to the
current position.
The rotation was easier to apply because it was simply a matter of adding the
new rotation value to the current orientation.
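Putting the noise model and the coordinate transformation together, a minimal sketch might look like the following. The default sigmas use the values established later in this section (3.0 cm, 0.1 cm and 2.0 degrees), treated here as standard deviations; the function name is invented.

```python
import math
import random

def sample_motion_model(pose, d=0.0, theta=0.0, sx=3.0, sy=0.1, stheta=2.0):
    """Apply one commanded move to a particle pose with Gaussian noise.

    pose = (x, y, heading_deg) in world coordinates; d is the commanded
    forward distance in cm and theta the commanded rotation in degrees.
    """
    # Noisy translation in the robot frame (the robot faces +x).
    tx = d + random.gauss(0.0, sx)
    ty = random.gauss(0.0, sy)
    # Rotate the translation vector into the world frame, then add it.
    h = math.radians(pose[2])
    wx = pose[0] + tx * math.cos(h) - ty * math.sin(h)
    wy = pose[1] + tx * math.sin(h) + ty * math.cos(h)
    # The noisy rotation is simply added to the current orientation.
    return (wx, wy, pose[2] + theta + random.gauss(0.0, stheta))

random.seed(0)
print(sample_motion_model((0.0, 0.0, 90.0), d=30.0))
```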
Before determining the parameters for the motion noise, the parameters of the
encoder calculations (see section 2.1.2, Equations 1 and 2) were confirmed. The
manufacturer, Dr. Robot Inc, supplied the following information:
Wheel diameter = 165mm = 2r
Wheel base = 272mm = w
Ticks per revolution = 1200 = n
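Equations 1 and 2 are not reproduced here, but a standard differential-drive formulation using these parameters would look like the sketch below; it is assumed, not verified, to be equivalent to the equations in Chapter 2.

```python
import math

WHEEL_DIAMETER_MM = 165.0    # 2r
WHEEL_BASE_MM = 272.0        # w
TICKS_PER_REV = 1200         # n

MM_PER_TICK = math.pi * WHEEL_DIAMETER_MM / TICKS_PER_REV   # about 0.432 mm

def odometry(left_ticks, right_ticks):
    """Distance travelled and rotation for one sampling interval, in the
    textbook differential-drive form."""
    dl = left_ticks * MM_PER_TICK
    dr = right_ticks * MM_PER_TICK
    distance = (dl + dr) / 2.0                        # mm along the heading
    rotation = math.degrees((dr - dl) / WHEEL_BASE_MM)
    return distance, rotation

# One full revolution of both wheels: straight ahead by one wheel
# circumference (pi * 165 mm, roughly 518 mm), with no rotation.
print(odometry(1200, 1200))
```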
The robot was set up facing a wall and commanded to move backwards by
20cm five times in a row. The sonar measurement from the front sonar was recorded
in each case, and it was seen to increase by 20cm except in the last case where it
oscillated between 20 and 21cm. However, the sonar could only be read to the
nearest centimetre so this was not a significant error (less than 1cm in 100).
This was followed by five forwards movements of 20cm. The sonar readings in
these cases decreased by 20cm each time. Although this was not an accurate
assessment, it provided quick validation that the parameters were correct.
By trial and error, values were established for the standard deviation for the
position and orientation noise components. The motion model parameters (see the
equations above) were obtained by driving the robot manually and measuring the
resulting movements.
When commanded to move forward, X80 robots rarely move by exactly the
correct amount. They also tend to rotate slightly in the process. This probably
happens because the wheels do not start and stop simultaneously, resulting in a small
amount of torque being applied around the central axis of the robot. Other factors
such as wheel slip and unevenness of the floor can also contribute.
For a rotation, the robots do not always turn by exactly the desired amount. To
complicate matters, after a full pirouette the robots had often drifted slightly from
their starting position. This drift was more likely to be sideways than forwards or
backwards.
Two types of experiment were performed:
• Drive a fixed translate distance (30cm) and record the actual distance
travelled; and
• Rotate by the ‘sweep’ amount (30 degrees) and record the actual angle.
These experiments were repeated 20 times, and the variances were obtained
from the data.
A zero mean was found in all cases. A non-zero mean would indicate a
systematic error, such as an incorrect value for the wheel diameter or the distance
between the wheels. This type of error could have been corrected by adjusting the
commands to the wheel motors. However, no consistent errors were observed that
would indicate such systematic errors.
The variances found in the experiments were rounded up slightly for use in the
SLAM algorithm so that the particles would propagate across the full variance of
motions. The values used were 3.0cm and 0.1cm for the x and y variances
respectively, and 2.0 degrees for the rotation variance.
3.9.4 Calculating Weights
The measurement (sensor) model is discussed in detail in sections 3.7.6 and
3.7.2. The first step is to take an image and extract a range scan from it. This must
then be used to draw a local map.
There are parameters required for the measurement model. In particular, a
maximum visual range was set. There are also several parameters involved in
extracting the floor boundary from the video images. An incorrect choice for any of
these parameters can adversely affect the output of the measurement model. There is
no methodology for choosing these parameters other than trial and error.
After creating a local map using the sensor model, the map correlation
algorithm was used to obtain the particle weights. (This was discussed in section
2.7.8). The algorithm produces correlation values in the range -1.0 to +1.0.
The basic map correlation algorithm compares the entire local map to the
global map. However, the camera can only see a portion of the space in front of it
due to the limited field of view. Therefore the algorithm was modified to include
only those grid cells that were potentially visible, rather than the full square region of
the map. To do this, a mask was created as shown in Figure 61 (a).
Figure 61 – Map Correlation for Particle 31: (a) Mask; (b) Global map segment; and (c) Local map
In Figure 61 (a), the mask shows the area the camera could potentially see.
Clearly, a comparison of the entire map would be incorrect because the black area is
outside the visual range of the sensor and any comparison would be meaningless.
Notice that there is a round area at the centre of the map which represents the robot.
This area must be free space because it is occupied by the robot.
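A sketch of the masked correlation follows. The joint-mean form of the correlation is one reading of the Thrun et al. formula referenced in section 2.7.8, and the toy maps and mask are invented; only the idea of restricting the comparison to potentially visible cells comes from the text.

```python
import numpy as np

def masked_map_correlation(local_map, global_map, mask):
    """Correlation between local and global occupancy grids, restricted to
    the cells the camera could potentially see (mask == True).

    Returns a value in the range -1.0 to +1.0, with +1.0 a perfect match.
    """
    m = local_map[mask].astype(float)
    g = global_map[mask].astype(float)
    mean = (m.sum() + g.sum()) / (2 * m.size)   # joint mean over both maps
    num = np.sum((m - mean) * (g - mean))
    den = np.sqrt(np.sum((m - mean) ** 2) * np.sum((g - mean) ** 2))
    return float(num / den) if den > 0 else 0.0

# Toy 4x4 maps; only the two middle columns are inside the visible mask.
local = np.array([[0.9, 0.9, 0.1, 0.1]] * 4)
mask = np.zeros((4, 4), dtype=bool)
mask[:, 1:3] = True
print(masked_map_correlation(local, local.copy(), mask))   # perfect match
```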
The second map, Figure 61 (b), shows the result of overlaying the mask on the
current global map maintained by the particle. Only this area is included in the
calculation of the map correlation. The last map (c) shows the local map that was
generated from the vision system. (The background appears grey because this
indicates unknown space). The local map was compared against the global map and
the resulting correlation coefficient was 0.508773, which was the best particle
weight at this time step.
Now consider the particle with the worst correlation coefficient, 0.164626.
This is shown in Figure 62. The particle had a different position and orientation so
that the view in its global map did not match what the camera saw.
Figure 62 – Map Correlation for Particle 48: (a) Mask; (b) Global map segment; and (c) Local map
The global map segment and the local map in Figure 62 clearly do not match
well. However, it is difficult to tell with the naked eye that there are any differences
between the local maps in Figure 61 (c) and Figure 62 (c).
There is another subtle point in these figures that is also not immediately
apparent to the naked eye – the mask in Figure 62 contains several more white pixels
than Figure 61 due to the ‘jaggies’ along the top edge of the pie slice. Although the
difference is minor, it is sufficient to cause small differences in the calculated
weights for particles that might otherwise be considered ‘equally good’.
Another issue with a limited range in the sensor model is that sometimes there
is nothing to see. When the robot was moving down a wide open corridor, a range of
only 2 metres would sometimes mean that the robot could not see anything except
free space. In this case a large number of particles could have the same weight,
which allowed poor particles to survive for a period of time.
3.9.5 Resampling
The Low Variance Resampler discussed in Chapter 2 was used to perform
resampling.
In [Thrun et al. 2005] it is stated that particles with negative correlation
coefficients should have their weights set to zero. This will result in them being
removed from the particle set during resampling. Experience during test runs showed
that particles often had negative weights for brief periods of time. Even though they
were generally well-behaved they would be flushed from the particle set.
Consequently, a change was made to the algorithm so that if more than 50% of the
particles had a zero weight, the weights were redistributed rather than throwing away
particles that might have had a single bad result. Bad particles that continually had
poor weights would still be removed from the set eventually.
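This modification can be sketched as below, combined with the Low Variance Resampler from Thrun et al. The uniform fallback is one plausible reading of "redistributed"; the function names and demo weights are invented.

```python
import random

def prepare_weights(weights):
    """Zero out negative correlation weights, but if more than half the
    particles would be discarded, fall back to uniform weights instead of
    flushing particles that may have had a single bad result."""
    w = [max(0.0, x) for x in weights]
    if sum(1 for x in w if x == 0.0) > len(w) / 2:
        w = [1.0] * len(w)          # redistribute instead of flushing
    total = sum(w)
    return [x / total for x in w]

def low_variance_resample(particles, weights):
    """Low variance (systematic) resampling, as in Thrun et al. 2005."""
    n = len(particles)
    w = prepare_weights(weights)
    r = random.uniform(0.0, 1.0 / n)
    c, i, out = w[0], 0, []
    for k in range(n):
        u = r + k / n
        while u > c:
            i += 1
            c += w[i]
        out.append(particles[i])
    return out

random.seed(2)
# Three of five weights are negative: redistribution keeps every particle
# alive instead of flushing particles for a single bad step.
print(low_variance_resample(list("abcde"), [-0.1, -0.2, 0.5, -0.3, 0.1]))
```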
Another way to prevent particles from being unexpectedly flushed was also
used. It was based on advice from Dr. Grisetti at the University of Freiburg, and
involved what amounts to a ‘moving average’ of the particle weights to smooth out
changes. The moving average was calculated by adding successive weights together.
During resampling, all weights are normalized so that they sum to one. This is also
discussed in Chapter 4.
The resampling frequency is an important factor in the stability of the SLAM
algorithm, which is discussed in Chapter 4. It was decided that resampling would
occur only in one of two situations:
• After completing a pirouette; or
• After making a specified number of steps (translations or rotations).
A pirouette should not change the pose of the robot. However, it obtains
information about the surrounding environment for a full 360 degrees which should
provide good information for updating the map. Therefore it seemed sensible to
update the particle set after a pirouette.
Conversely, the first and last images in a pirouette sequence are the same view.
If the first image results in changes to the global map, then the last view will
reinforce this and is likely to give the particle a high weight. This could be the wrong
time to use the particle weight for a decision on resampling.
In most test runs, the number of steps before resampling was set to 5. This
value was selected using trial and error by comparing maps produced using saved
data sets.
For more discussion, see sections 4.5.10, 4.5.11 and 4.5.12.
3.9.6 Updating the Global Map
Each particle in a Particle Filter has its own global map. Updates to the global
maps are made using the pose for each particle and its local map. The update
equation was developed in section 3.7.1.
There are some design issues related to how to apply the local map information
to the global map, but these are discussed in Chapter 4 – Results and Discussion.
3.9.7 Experimental Verification of Localization
Because SLAM used a particle filter, it was important to test the localization
ability of the filter. A portion of a saved log file from a test run was re-run using the
filter for localization only. To do this, the final map (produced using SLAM) was
loaded into the ExCV program when it started up and updating of the map was
turned off. This effectively ensured that only the localization was performed.
Fifty particles were initialized at random around an approximate starting
location of the robot (not the true location of the robot). The program was then run to
see what the trajectories of the particles would be.
NOTE: If SLAM had not worked successfully to produce the map in the first
place, then this experiment could not have been performed. This example is intended
as an illustration of localization.
Figure 63 – Particle progression during localization and final path on map (from top-left to bottom-right)
The particles were initially scattered at random around the estimated location,
which is shown with a red ‘X’ in Figure 63 (a). Note that it is not possible to tell
from this 2D map that the orientations of the particles also varied. The particles were
distributed according to a Gaussian distribution with a zero mean and variance, σ, of
50cm in the x and y directions, and 20° for the orientation. Given that 2σ covers 95%
of a Gaussian distribution, this results in an effective range of about ±1m and ±40°.
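The initialization just described can be sketched as below, using the sigmas from the text (50 cm in x and y, 20 degrees in heading); the pose values and seed are invented for the example.

```python
import random

def init_particles(x0, y0, theta0, n=50, sigma_xy=50.0, sigma_theta=20.0):
    """Scatter n particle poses around an approximate starting pose."""
    return [(random.gauss(x0, sigma_xy),
             random.gauss(y0, sigma_xy),
             random.gauss(theta0, sigma_theta))
            for _ in range(n)]

random.seed(3)   # the program always used the same seed, so runs repeat
particles = init_particles(100.0, 200.0, 90.0)

# About 95% of the particles should fall within 2 sigma of the estimate
# (roughly +/- 1 m in position and +/- 40 degrees in heading).
inside = sum(1 for x, y, t in particles
             if abs(x - 100.0) <= 100.0
             and abs(y - 200.0) <= 100.0
             and abs(t - 90.0) <= 40.0)
print(inside, "of", len(particles))
```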
After a few steps and updates to the particle set, the paths had already started
to coalesce as shown in Figure 63 (b) (top row in the centre). However, one aberrant
particle headed off in the wrong direction as can be seen more clearly in (c).
This ‘bad’ particle received an initial orientation that was way off the true
value. This can happen when randomly initializing particles with a Gaussian
distribution due to the long ‘tail’ in the distribution. The initial random poses can
have a significant effect on the filter performance with a small number of particles.
The results were quite reproducible, however, because the random number seed was
always the same when the ExCV program was started. In this case, the first few steps
were in open space and there was very little to distinguish between particles due to
the limited range of the vision sensor.
Several steps later (c), the ‘bad’ particle had replicated, and these particles had
in turn split up. Notice, however, that all of the particles that were below the ‘X’ in
the map had disappeared. By (d) (bottom left of the figure) the ‘bad’ particles had
also disappeared.
Bad particles should not survive re-sampling. However, the ‘signal to noise’
ratio for vision is much worse than for other range sensors such as laser range
finders. The small number of particles used was barely enough to effectively capture
the distribution and in this case the algorithm was on the borderline of becoming
unstable. This establishes a minimum number of particles necessary for the
localization algorithm to work.
Finally, in (e) (bottom centre) the particles had clumped together because the
robot had seen the corner which provided a good reference point. The final path of
the robot is shown overlaid on the map in Figure 63 (f) (bottom right).
It is important to note that the starting point of the final trajectory in (f) is not
at the ‘X’. This is correct because the starting estimate was not exactly in the right
location. Had the initial proposal distribution (shown in (a)) been too narrow to cover
this location, then it is possible that the filter would not have converged.
This is a difficult test for localization because there are no obvious landmarks –
the environment consists mostly of a long corridor. It was selected deliberately to
test the stability of the algorithm. Note that the last stray particle did not disappear
until the robot turned the corner and became much more certain about where it was.
Lastly, note that the final particle set is entirely derived from only a few of the
original particles. If for any reason these had been eliminated along the way due to
resampling, then the filter would have failed. Resampling is a statistical process
therefore it is possible that sometimes the wrong particles might be kept.
3.9.8 Quality of Generated Maps
The nature of the experiments was such that the maps were grossly distorted if
SLAM failed. Quantitative comparison was not necessary in these cases because the
differences were clearly visible to the naked eye.
However, it could be difficult to visually compare the resulting maps from
SLAM test runs based on the same data set using different sets of parameters. A
formal evaluation method was useful in this case.
Two different methods were used:
• Map Correlation; and
• Normalized Sum of Squared Differences.
Note that these comparisons require a ‘ground truth’ map. This is usually
available in an experimental environment, but it is often not available in the real
world.
Map Correlation used the same formula from [Thrun et al. 2005] as for
calculating the particle weights (section 2.7.8). This was applied to the entire map
and the result was a value between -1.0 and +1.0 with +1.0 meaning a perfect match.
Negative values indicate a very poor match, and should not occur if the generated
map is reasonably similar to the ground truth.
A SLAM map for one of the test runs was compared to the ground truth by
offsetting it by up to 20 grid cells in each direction. The resulting correlation values
were plotted as shown in Figure 64. This showed the sensitivity to displacement of
the SLAM map and gave some idea of the type of surface produced by map
correlation.
Figure 64 – Map Correlations versus map displacement
The Sum of Squared Differences (SSD) was calculated as follows:
• Take the difference between the values of corresponding grid cells;
• Square these differences; and
• Sum them across the entire map.
The best possible value was zero (for a perfect match), and the worst possible
value was 1,647,993,600 for a 176x144 map. The resulting large numbers made it
difficult to understand the quality of the map, so two extra steps were added:
• Normalize the sum by dividing by the maximum possible value; and
• Subtract from 1.0 to invert the direction of curvature.
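The complete calculation can be sketched as follows. Grid cells are assumed to hold 8-bit values, which is consistent with the quoted worst case of 1,647,993,600 = 176 × 144 × 255²; the function name and toy maps are invented.

```python
import numpy as np

def normalized_ssd_score(map_a, map_b):
    """Map-quality score from the normalized sum of squared differences.

    Returns 1.0 for a perfect match and 0.0 for the worst possible
    disagreement between two 8-bit occupancy grids.
    """
    diff = map_a.astype(float) - map_b.astype(float)
    worst = map_a.size * 255.0 ** 2       # every cell maximally wrong
    return 1.0 - np.sum(diff ** 2) / worst

a = np.zeros((176, 144), dtype=np.uint8)
b = a.copy()
print(normalized_ssd_score(a, b))   # identical maps score 1.0
b[:, :72] = 255                     # half the cells maximally wrong
print(normalized_ssd_score(a, b))   # → 0.5
```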
This was also tested by displacing the same generated SLAM map against the
ground truth. The surface curvature was not as great as Map Correlation, which
made this method less practical.
Figure 65 – Normalized SSD versus map displacement
To evaluate the discriminatory power of the methods, three generated SLAM
maps for Level 8 were compared. Consider the following set of maps which show
first the ground truth for Level 8 (Figure 66) and then three SLAM maps generated
by different test runs (Figure 67).
Figure 66 – Ground truth for Level 8
Figure 67 – Sample SLAM maps for Level 8
The resulting values for comparison against ground truth were as shown in the
table below.
Table 4 – Comparison of Map Quality
Map Number | Normalized SSD | Map Correlation
1          | 0.8709         | 0.7877
2          | 0.8136         | 0.6927
3          | 0.7198         | 0.5564
The results from the two methods agreed in their rankings. Visual inspection of
the maps also showed that the ranking was correct. Note that these numbers are
difficult to interpret as a ‘goodness of fit’, and can only be used for comparison
purposes.
3.10 Other SLAM Implementations
Other SLAM packages were tested with sample data generated by the
simulator to compare their stability against GridSLAM. The results of applying these
algorithms to test data generated by the simulator are shown in Chapter 4 – Results
and Discussion.
In early 2007, several researchers joined together to create the OpenSLAM
web site [OpenSLAM 2007] which offered open source versions of several SLAM
packages for download. This provided an opportunity to run some test data through
packages that had been written independently to see if they could produce reliable
maps using visual data from the experimental environment.
Two SLAM packages were selected for comparison:
• DP-SLAM [Eliazar & Parr 2003]
• GMapping [Grisetti et al. 2005, Grisetti et al. 2007]
Both of these packages were provided in source form for Unix. They were
based on Particle Filters and therefore similar in architecture to GridSLAM.
The packages were ported to Windows. To verify that each package ran
correctly, public domain data sets were run through each of them and the results compared
to the corresponding maps available on the Internet. These are shown below simply
for validation.
3.10.1 DP-SLAM
For DP-SLAM, sample data and maps were obtained from the web site
http://www.cs.duke.edu/~parr/dpslam/ (accessed February, 2007).
The selected data set was Loop 5 from the Duke University Computer Science
Department on the Second Floor of the D-Wing of the LSRC building. This data set
was used courtesy of Austin Eliazar and Dr. Ronald Parr. The original map on the
web site was produced using 3,000 particles with a grid cell size of 3cm.
Figure 68 – Validation of Windows version of DP-SLAM: (a) Map from the Internet; and (b) Map produced for this Thesis
Although an exact mathematical comparison was not performed, it was clear
from visual inspection of the maps that the Windows version of the program was
working correctly.
3.10.2 GMapping
For GMapping, a data set and map were obtained from Radish [Radish 2007]
for the Newell-Simon Hall, A Level at Carnegie Mellon University. This data is
courtesy of Nick Roy. These files were available from the web site
http://cres.usc.edu/radishrepository/view-one.php?name=cmu_nsh_level_a (accessed
February 2007).
Figure 69 – Validation of Windows version of GMapping: (a) Map from the Internet; and (b) Map produced for this Thesis
Visual comparison shows that the maps are substantially the same, although
the approaches used to draw the maps differ – GMapping only writes the end-points
of laser hits into the map and does not do ray casting to draw the free space. This
confirmed that the Windows version of GMapping worked correctly.
3.10.3 MonoSLAM
The MonoSLAM system is well known in the area of visual SLAM. Therefore
it seemed reasonable to try to use it.
Davison released the source code for his visual SLAM package, called
MonoSLAM, which built 3D maps in real time [Davison 1999]. A C# version of the
code for Windows was available on the Internet from the following web site:
http://sluggish.uni.cc/monoSLAM/monoslam.htm.
MonoSLAM was tested using pre-recorded images from an X80 robot, but it
could not obtain a tracking lock. When Davison was contacted about the problems
with MonoSLAM, he replied that it requires a high frame rate. A wide-angle lens
also improves the chances of tracking an object because it appears in more frames
[Davison et al. 2004].
This was not a failing of the MonoSLAM package, but rather a limitation of
the experimental hardware used in this research. No qualifications were stated for
using MonoSLAM in the literature, although by implication the hardware should
have been similar in performance to the hardware listed in the relevant publications.
The issue was that the X80 robot was moving too far between images and the
tracking algorithm could not find matches. MonoSLAM had been shown to work
successfully with a hand-held video camera. However, the frame rate and resolution
were both much higher than for the X80.
In 2007, the author discussed MonoSLAM with one of Davison’s associates,
Dr. Neira, who explained that MonoSLAM has a tendency to lose tracking if the
camera is moved too fast. In Neira’s experience with making maps he found that he
had to move slowly and consistently with the camera.
Furthermore, MonoSLAM had no concept of scale because this could not be
determined from the available data. If the camera operator moved at varying speeds
some parts of the map would be stretched. In order to close large loops it was
therefore necessary to apply a map matching algorithm to a set of hierarchical maps
obtained using MonoSLAM [Clemente et al. 2007].
Lastly, note that MonoSLAM had typically been used in ‘feature rich’
environments with a lot of visible features that the algorithm could track. It had not
been used for exploration of indoor environments containing long featureless
corridors. The experiments using SIFT earlier in the chapter demonstrate why
MonoSLAM had a hard time working in the experimental environment.
3.11 Exploration Process
The Distance Transform (DT) was selected as the exploration algorithm
because it is grid-based, and therefore a good match for an occupancy grid map, and
also because it provides an indication of when the exploration is complete.
The DT must be applied to a binary image. The occupancy grid map, therefore,
has a threshold applied to it to select the cells where the probability of being
occupied is high. (The probability of a cell being occupied might not always reach
1.0, so a lower threshold was used – typically around 0.75. This value must be
greater than 0.5 which corresponds to unknown space.)
The next step is to calculate the C-Space map. This might involve changing to
a different grid size as well, usually a coarser grid. In the case of the X80 robot, the
usual map grid size was 5cm and the ‘translate distance’ was 30cm, which is roughly
the diameter of the robot. This ‘translate distance’ becomes the grid size for the DT.
When calculating the C-Space map, the cells are resized from the map grid size
to the DT grid size. It is highly desirable for the DT grid size to be the same as, or a
multiple of, the map grid size. If the DT grid is larger than the map grid then it may
happen that an obstacle will only partially occupy a DT cell. In this case the resulting
path is even more conservative.
However, it is also possible that an unfortunate alignment of the map with the
real world will result in a gap that is too narrow for the robot to pass through, even
though in reality the robot will fit. Due to uncertainty in the position of the robot and
its motions, safety (avoiding collisions) was considered to be of primary importance
and therefore this issue was ignored.
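The C-Space expansion and the resizing from the map grid to the coarser DT grid can be sketched as below (an illustrative stand-in, not the thesis implementation; the grid sizes of 5cm and 30cm quoted above give a resize factor of 6, and the toy radius used here is arbitrary). Marking a DT cell occupied if any map cell within it is occupied is exactly what makes the resulting paths conservative:

```python
import numpy as np

def cspace(obstacles, robot_radius_cells):
    """Grow each obstacle cell by the robot's radius so that the robot can
    subsequently be treated as a point."""
    out = np.zeros_like(obstacles)
    r = robot_radius_cells
    for y, x in zip(*np.nonzero(obstacles)):
        out[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1] = 1
    return out

def downsample(grid, factor):
    """Resize from the map grid to the DT grid: a DT cell is occupied if ANY
    map cell inside it is occupied, so a partially occupied cell blocks the
    whole DT cell (the conservative choice discussed above)."""
    h, w = grid.shape
    g = grid[:h - h % factor, :w - w % factor]
    return g.reshape(h // factor, factor, w // factor, factor).max(axis=(1, 3))

obstacles = np.zeros((12, 12), dtype=np.uint8)
obstacles[6, 6] = 1                # a single occupied 5cm map cell
c = cspace(obstacles, 2)           # toy radius of 2 cells (10cm) for the demo
dt_grid = downsample(c, 6)         # 5cm map cells -> 30cm DT cells
```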
Figure 70 shows an example of a DT path from the robot (which is the round
object near the middle of the diagram) to unknown space (grey area) in the top left.
Notice that the obstacles appear ‘fat’ because this is a C-Space map and they have
been expanded by the diameter of the robot.
Figure 70 – Path to nearest unknown space
While calculating the path it sometimes happens that there are two adjacent
cells with the same distance value. In order to minimize the number of turns, the
direction chosen is the one that the robot is currently facing, if possible.
When creating the path, if a cell is reached that has no surrounding cell with
a lower value, and the cell itself is not unknown space (zero), then no path
remains and exploration is complete. This is a very useful property of the DT
because the robot can tell when it has finished, assuming a closed environment –
which indoor environments are by definition.
The DT in OpenCV was modified to handle the requirements of exploration as
outlined above, and also to iterate until there were no more changes in the
distance values. (The original implementation performed only a single
forward/backward pass.) The code was verified by exporting the data and
checking it manually.
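The behaviour described above can be sketched as follows (a simplified 4-connected grassfire version for illustration only; the thesis modified the OpenCV implementation, and the tie-breaking that prefers the robot's current heading is omitted here):

```python
import numpy as np

INF = 10 ** 6

def neighbours(y, x, h, w):
    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
        if 0 <= ny < h and 0 <= nx < w:
            yield ny, nx

def distance_transform(free, unknown):
    """Distance to the nearest unknown cell, relaxed until the values stop
    changing (a single forward/backward pass is not always sufficient)."""
    h, w = free.shape
    d = np.full((h, w), INF, dtype=np.int64)
    d[unknown] = 0
    changed = True
    while changed:
        changed = False
        for y in range(h):
            for x in range(w):
                if unknown[y, x] or not free[y, x]:
                    continue
                best = min(d[ny, nx] for ny, nx in neighbours(y, x, h, w)) + 1
                if best < d[y, x]:
                    d[y, x] = best
                    changed = True
    return d

def next_step(d, y, x):
    """Descend towards unknown space.  Returns None when no neighbour has a
    lower value; if d[y, x] is then not 0, exploration is complete."""
    h, w = d.shape
    v, ny, nx = min((d[ny, nx], ny, nx) for ny, nx in neighbours(y, x, h, w))
    return (ny, nx) if v < d[y, x] else None

free = np.ones((3, 3), dtype=bool)
unknown = np.zeros((3, 3), dtype=bool)
unknown[0, 0] = True                 # an unexplored frontier cell
d = distance_transform(free, unknown)
step = next_step(d, 2, 2)            # first step of the path from (2, 2)
```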
3.12 Summary
In this chapter the essential aspects of the system design have been outlined. A
large variety of different subject areas had to be covered. Many of the design criteria
arose out of initial experiments, so the system design evolved throughout the term of
the research.
4 Results and Discussion
‘The problem of exploring an environment belongs to the
fundamental problems in mobile robotics.’
– Burgard, Moors, Stachniss and Schneider,
IEEE Transactions on Robotics, 21(3), 2005.
Due to the process of successive refinement, some of the results presented in
this chapter were also covered in Chapter 3 – Design and Experimentation. However,
the key results are in this chapter as well as some recommendations for future
research.
As noted in Chapter 1 – Introduction, some avenues of research turned out to
be dead ends. Rather than omitting these from the thesis, they have been included
here because the information might be useful to future researchers.
4.1 Hardware
This section includes brief discussion of the hardware components used in the
thesis experiments.
4.1.1 The Robots
The Hemisson and Yujin robots achieved their purpose which was to test
various prototypes of the system. Their motions were not characterized because they
were not used with the final SLAM algorithm.
A simulator proved to be invaluable because running experiments with real
robots can be very time consuming and at times very frustrating. The simulator
allowed the mapping and exploration algorithms to be tested without requiring
SLAM to be operational because there were no motion errors.
The X80 robot was found to have several hardware limitations, such as the
camera resolution, reliability of the WiFi interface and accuracy of the wheel
encoders. However, these are real-world problems and affect all robots to some
extent.
4.1.2 Computational Requirements
The robots used in this thesis did not have onboard computers and were
remotely controlled by a PC. Two different PCs were used in the course of the
investigation:
• Desktop PC with 1.5 GHz Intel Pentium processor, 1GB of memory.
• Laptop PC with 2.0 GHz Intel Pentium M processor, 1GB of memory.
All of the SLAM test runs were performed using an X80 robot. Consider the
following details for a typical test run:
Duration: 2 hours
Images received: 8,045 (just over one frame per second)
Processed images: 2,054
Although this is an average of 3 to 4 seconds between images, much of this
time was while the robot was moving. From the log file, it was determined that
motions took between 1.7 and 5.4 seconds. The average time to process an image
was found to be only 0.69 seconds.
Because of the low frame rate, the bandwidth utilization (over WiFi) was only
a few percent. Therefore the network was not a limiting factor.
The computational and memory requirements increase in direct proportion to
the number of particles used for SLAM. A faster PC would result in faster
processing times.
It is clear from these details that the robot did not move in what might be
considered “real time”. However, the program was not optimized and contained a
large amount of debugging code, so some improvement would be possible.
4.2 Navigation
The primary purpose of navigation was to locate the floor to detect obstacles.
This was basically a segmentation problem. Several different methods were
investigated for this task, both alone and in combination, as explained in the previous
chapter.
4.2.1 Final Floor Detection Algorithm
After much experimentation, as outlined in the preceding sections, the floor
detection algorithm that was used was as follows:
• Adjust image brightness to compensate for changes from the robot’s
very first view of the floor;
• Apply a Gaussian smoothing filter to remove carpet texture;
• Run an edge detector to find obstacle boundaries;
• Examine patches across the bottom of the image to see if the colour
matched the floor colour determined at startup and use the matching
points as seeds; and
• Perform a moving-average flood fill at each seed point with the edges
acting as a mask to constrain the flood fill.
Colour matching for seed points was based on the colour of a floor patch
obtained when the robot was initialized. The matching process was performed in
RGB space and also included the range of pixel values as a measure of texture so
that bland (non-texture) surfaces could be ignored.
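The last two steps (seed selection and the edge-constrained flood fill) can be sketched as below. This is a pure-Python stand-in for illustration; the actual system used OpenCV, and the colour/texture matching of seed patches is reduced here to a pre-computed seed list:

```python
from collections import deque
import numpy as np

def floor_from_seeds(edges, seeds):
    """Flood fill from seed pixels along the bottom of the image, with the
    detected edges acting as a mask that the fill may not cross
    (cf. cv2.floodFill with a mask image)."""
    h, w = edges.shape
    floor = np.zeros((h, w), dtype=bool)
    q = deque(s for s in seeds if not edges[s])   # reject seeds on an edge
    for s in q:
        floor[s] = True
    while q:
        y, x = q.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w
                    and not floor[ny, nx] and not edges[ny, nx]):
                floor[ny, nx] = True
                q.append((ny, nx))
    return floor

edges = np.zeros((5, 5), dtype=bool)
edges[2, :] = True                         # an obstacle boundary across the view
floor = floor_from_seeds(edges, [(4, 2)])  # seed at the bottom of the image
```

A gap in the edge row would let the fill leak into the upper region, which is exactly the 'break out' failure mode discussed later in this chapter.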
An example of the output from this algorithm is shown in Figure 72 (b) with
the input image (after filtering) in Figure 72 (a).
Figure 72 – Floor detected using Flood Fill and Edges
There were several ways in which the floor detection could fail. The most
common failures were the edge detector missing a real boundary, such as a door,
or conversely detecting shadows as edges.
Edge detection is a compromise. If the edge detector was too sensitive, then it
would pick up texture in the carpet. Gaussian smoothing helped to remove floor
texture, but it also disguised the real edges.
Figure 73 – Floor edges not detected
In Figure 73, the wall edge towards the top-left of the first image was not
detected because the edge was not distinct enough. In the second image, the wall at
the top-right is obscured due to shadow. These types of errors resulted in ‘break
outs’ that appeared in the map. In general, the resulting holes in the walls in the map
were filled in subsequent images as the robot got closer to the wall.
The opposite problem of ‘phantom’ walls is illustrated below. This is caused
by strong shadows that are interpreted as edges.
Figure 74 – Shadows detected as walls
In Figure 74, there is a small extension at the left-hand end of the black strip
on the right (a corner). Once again, this incorrect information was corrected as
the robot approached the corner.
Lastly, there were JPEG artefacts in the images that contributed to mistakes in
the edge detection.
Figure 75 – Enlarged image showing JPEG artefacts
Figure 75 shows an original image (with the extracted floor) and one that has
been enlarged by a factor of 300% to show the detail. Note the blurring along the
wall edges and even false edges below the walls and in the doorway. Although it is
difficult to see, the colour of the carpet also bled into the walls in some places. These
are typical JPEG artefacts resulting from the compression of the image for
transmission over wireless Ethernet. The robot did not provide an uncompressed
mode in the firmware so these false features were unavoidable.
4.3 Mapping
The mapping component of the system, based on occupancy grids, proved to
be very successful. It was tested extensively in simulation to verify the program code
before being used on real robots.
4.3.1 Inverse Perspective Mapping (IPM)
Equations were developed for Inverse Perspective Mapping based on the
camera geometry that were efficient to compute. These equations are highly non-
linear due to the nature of perspective. Consequently, errors in the estimated
distances increase rapidly as the distance from the robot increases. Eventually the
uncertainty becomes unacceptable.
Obtaining good results with this approach proved difficult. The problem lay in
measuring the values of b, h, d and w. Although these could be measured
approximately just with a ruler, the resulting values in the IPM lookup table for
longer distances were sensitive to small errors in the parameters.
Alternative methods of IPM are available, and in retrospect it might have been
more appropriate to use one of them rather than develop a new method.
As shown graphically in Chapter 3, the error between the calculated and
observed distances approached ±10cm for distances of 200cm or more. This
discrepancy was of some concern because it was equivalent to two grid cells on the
map (assuming a 5cm grid). However, given the low resolution and poor quality of
the camera, it was nonetheless a remarkable result because it was the equivalent of
only ±1 pixel in the image.
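The rapid growth of the range error can be illustrated with a generic ground-plane model (this is not the thesis's b, h, d, w formulation; the focal length, camera height and tilt below are assumed values chosen only to show the trend):

```python
import math

def ground_distance(v, f=300.0, cam_height=30.0, tilt_deg=30.0):
    """Distance (cm) along the floor to the point imaged v pixels below the
    image centre, for a pinhole camera at cam_height tilted down by tilt_deg.
    All parameter values are illustrative assumptions."""
    angle = math.radians(tilt_deg) + math.atan2(v, f)
    return cam_height / math.tan(angle)

near = ground_distance(100)                          # low in the image: close
far = ground_distance(5)                             # near the horizon: far
far_err = ground_distance(4) - ground_distance(5)    # 1-pixel error, far away
near_err = ground_distance(99) - ground_distance(100)
```

The same one-pixel perturbation produces a much larger distance error near the top of the floor region than near the bottom, which is why the IPM lookup table was so sensitive to small parameter errors at long range.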
Edge detection is generally considered to be accurate to within ±1 pixel, but
larger errors can be encountered in practice. Even small amounts of jitter in the
camera’s position could affect the location of an edge in the image by more than a
pixel. Because edge detection was the primary means of locating obstacles, the
accuracy of the edges was important. If the resolution of the camera were increased,
then an error of one pixel in detecting an obstacle would correspond to a smaller
distance.
The higher the camera resolution, the smaller the distance increments would be
in the IPM Lookup Table. Therefore a high resolution camera should produce more
accurate maps. Furthermore, a higher resolution would increase the distance before
the uncertainty became unacceptable, effectively allowing the robot to ‘see further’.
This should improve localization. Further work is required to confirm this
hypothesis.
4.3.2 The Radial Obstacle Profile (ROP)
Two different approaches were used to calculate the ROP. The first algorithm
traced up pixel columns from the bottom of the image until a non-floor pixel was
encountered. The second method involved tracing rays in the image that emanated
from the camera.
This latter approach is more computationally intensive, but it produces more
accurate results. In particular, it allows the ROP to see ‘around corners’. The
explanation for this is clear from Figure 32 (b) (in Chapter 3) which was obtained
using the first method. Notice the shape of the floor (light grey area) just past the
right-hand side of the first obstacle. The floor appears to rise vertically in the
floor image because the initial column-tracing implementation was used. This was
clearly not correct, and it ignored a small amount of floor that was visible in the
camera image of Figure 32 (a).
Note that this problem only arises if the camera is tilted. If the camera is
horizontal, it cannot see ‘around obstacles’. In fact, the camera is not seeing around
corners at all. Vertical edges appear tilted in the image due to the camera tilt, and so
a small region of the floor was ignored without ray casting.
There is another problem related to the tilted ‘vertical’ edges – a naïve
approach to mapping the floor inserted walls into the map corresponding to these
edges. However, this was incorrect. A method for eliminating this problem was
published by the author [Taylor et al. 2006, Taylor 2007] and is discussed in section
4.4.3.
4.3.3 Pirouettes
Each pirouette consisted of 12 steps, with rotations of 30° in between. An
image was captured after each step, and a local map was built from the image.
Experiments were conducted in which an average of three images was taken to produce a
more reliable image before processing. This proved to be unnecessary. However, it
might be desirable to process multiple images after each turn. This would allow the
map to be updated several times.
Note that rather than performing a complete sweep before integrating the local
map into the global map, the local map obtained from each image was individually
integrated into the global map (although the program still had the option to
accumulate local maps first). This was found to be more reliable because it allowed
incremental localization to be performed after each motion.
4.3.4 Discrete Cosine Transform (DCT) of the ROP
Because a ROP sweep is a single-dimension array, a Discrete Cosine
Transform (DCT) can be calculated for it. By truncating the DCT to only the first
few parameters, and then applying an inverse DCT, a much smoother curve can be
produced. An example is shown in Figure 76, courtesy of Dr. Shlomo Geva (the
author’s supervisor). This was effectively a low-pass filter.
Figure 76 – ROP sweep smoothed using DCT
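The truncate-and-invert filtering can be sketched as follows (an orthonormal DCT-II built directly with NumPy for illustration; the sweep below is synthetic, not thesis data):

```python
import numpy as np

def dct_smooth(signal, keep=24):
    """Low-pass filter a sweep by keeping only the first `keep` DCT
    coefficients and applying the inverse transform."""
    n = len(signal)
    j = np.arange(n)                                   # sample index
    basis = np.cos(np.pi * (2 * j[None, :] + 1) * j[:, None] / (2 * n))
    scale = np.full(n, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    D = scale[:, None] * basis                         # orthonormal DCT-II
    coeffs = D @ np.asarray(signal, dtype=float)
    coeffs[keep:] = 0.0                                # drop high frequencies
    return D.T @ coeffs                                # inverse transform

sweep = 100.0 + 20.0 * np.sin(np.linspace(0, 2 * np.pi, 72))  # synthetic ROP
smooth = dct_smooth(sweep, keep=24)
```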
Note that successive ROPs were deliberately overlapped, i.e. each 60° ROP
had a 30° overlap with the previous one. It was therefore possible to produce two
360° sweeps from a single pirouette. Figure 76 shows three curves: the combination
of the odd ROPs, the even ROPs and the average of the two. The curves were
obtained by truncating the DCT to only 24 coefficients, and then applying the
inverse DCT. The resulting curves do not fit the actual obstacle profile very well, but
that was not the intention.
One reason for investigating the DCT was not for filtering, but to see if an
ROP sweep could be condensed to only a few (around 20) values. This reduction in
data storage requirements was expected to help if large numbers of ROP sweeps
were recorded for later comparison in order to recognise previously visited places by
their ROP sweeps.
The DCTs had nothing to do with reactive navigation – they were intended as a
technique for localization. Experiments were performed by placing the robot at
different positions on a regular 5cm grid on a playing field, performing a
pirouette at each position to obtain an ROP sweep, and calculating the DCT
coefficients.
Figure 77 shows the first 20 DCT coefficients for 91 grid points. Some points
on the grid could not be included because there were obstacles at those locations.
Note that there are two other dimensions that cannot be easily represented
graphically – the x and y coordinates of the grid points.
Figure 77 – DCT Coefficients of ROP Data
In Figure 77 the DCT series (S1 to S91) are arranged in order of increasing x
coordinate and within that increasing y for each x. The peaks and valleys at the top
of the graph show the progression of the first coefficient (which invariably had the
largest value). The sudden discontinuities are where the x coordinate changed. This
indicates that the coefficients varied with x and y although the relationships are hard
to visualise.
Using a simple nearest-neighbour comparison against the DCT coefficients, it
was found that the robot’s location could be determined even when it was not exactly
on a grid point.
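The nearest-neighbour comparison amounts to the following (a sketch with made-up two-coefficient entries; the actual table held around 20 coefficients per grid point):

```python
import numpy as np

def localise(query_coeffs, table, positions):
    """Return the recorded grid position whose stored DCT coefficients are
    closest (in Euclidean distance) to the query sweep's coefficients."""
    dists = np.linalg.norm(table - query_coeffs, axis=1)
    return positions[int(np.argmin(dists))]

# Hypothetical table: coefficients recorded at three known grid positions.
table = np.array([[0.0, 0.0],
                  [1.0, 0.0],
                  [0.0, 2.0]])
positions = [(0, 0), (5, 0), (0, 5)]
pos = localise(np.array([0.9, 0.1]), table, positions)
```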
This approach would be vulnerable to aliasing if different parts of the
environment had the same physical layout, but for the playing field no two places
had the same DCT. One possibility for resolving aliasing might be to keep a history of
DCTs as the robot moved around. It is unlikely that the same sequence of DCTs
would result from two different places if the environment were sufficiently
‘cluttered’. However, it was obvious that long featureless corridors would have many
similar DCTs.
The DCT was phase-sensitive, which meant that it was not rotation invariant.
If the robot’s orientation was changed before an ROP sweep, the DCT would also
change. This could potentially have been used to determine the robot’s orientation if
the location were known. However, the purpose was to determine the robot’s
location. Therefore the ROP sweep would first have to be ‘rotated’ (shift the array
left or right with wrap-around) before applying the DCT. This was an unrealistic
limitation because the robot’s orientation was not generally known.
Building a large collection of DCTs to use for localization would only be
feasible if the robot’s orientation were always known. Otherwise each new ROP
sweep would have to be rotated several times and compared against the table of
previously recorded DCTs to try to find the best match for both orientation and
location. Given the computational overhead that this approach would incur, it was
not pursued any further.
4.3.5 Floor Boundary Detection for Creating Local Maps
The first step in building a map was to make a local map based on the view
from the robot’s camera.
As explained in Chapter 3 – Design and Experimentation, turning a video
camera into a range sensor can be achieved by detecting the floor in the images. The
boundary where the floor meets the walls or obstacles can be used, via inverse
perspective mapping, to obtain range information in the form of an ROP. However,
floor detection requires good segmentation of the images. Various approaches to
segmentation were discussed in Chapter 3, with the final algorithm in section 4.2.1.
Regardless of the sensor used, there is the possibility that it will either miss an
obstacle or incorrectly register an obstacle where there is none. For instance, sonar
signals can bounce off walls so that they are not returned to the sender. This
generates a ‘maximum range’ reading, and the wall is not seen. A maximum range
reading will also occur for sonar if the wall is acoustically absorbent.
In the case of vision, the most common problems were incorrectly identifying
the floor boundary, or missing it altogether. Despite trying many different
combinations of image processing algorithms, as discussed above, no single method
of floor detection was found that was highly reliable.
Edge detection was not perfect, and it was not unusual for an edge to be off by
one pixel or even more. The real edge could also fall between two pixels, which is
more likely at lower resolution. Many edges were not clear due to being in shadow
or simply because the edge itself was rounded rather than sharp.
Also, there were several reasons why edges appeared where they did not really
exist. Quantization of the colour space could introduce apparent edges if the floor
has a colour gradient due to uneven illumination. Shadows sometimes produced
edges as well.
Sometimes there were missing pixels in the edge detector output which created
gaps in the lines representing the floor boundary. (See below at the left-hand side of
Figure 78 and at the bottom of the map in Figure 88). The result was that the flood
fill could leak through these ‘holes’. It then appeared to the robot that the range to
the obstacle was very much larger than it actually was.
Usually these gaps were too narrow for the robot to fit through. (The code
performed a dilation followed by an erosion of the edge image to try to close some of
these small holes.) Consequently small gaps did not result in the robot colliding with
the wall as it tried to negotiate a non-existent opening.
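The dilation-then-erosion step is a morphological closing. A NumPy stand-in for the OpenCV calls (illustrative only) is:

```python
import numpy as np

def _shifted_stack(img, pad_value):
    """Stack the nine one-pixel shifts of img (its 3x3 neighbourhoods)."""
    p = np.pad(img, 1, constant_values=pad_value)
    h, w = img.shape
    return np.stack([p[y:y + h, x:x + w] for y in range(3) for x in range(3)])

def close_gaps(edges):
    """Morphological closing with a 3x3 kernel: dilate, then erode
    (cf. cv2.dilate followed by cv2.erode), sealing one-pixel holes
    in the detected floor boundary."""
    dilated = _shifted_stack(edges, False).any(axis=0)
    return _shifted_stack(dilated, True).all(axis=0)

edges = np.zeros((5, 7), dtype=bool)
edges[2, :] = True
edges[2, 3] = False        # a one-pixel hole the flood fill could leak through
closed = close_gaps(edges)
```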
In Figure 78, the air-conditioning grill at the left-hand side in (a) was not
properly detected. Only part of the edge was visible in the floor image (b), and so the
floor ‘leaked’ through. This caused a spurious wall to appear in the map (d).
Figure 78 – Floor detection: (a) Camera image; (b) Floor; (c) Contour; and (d) Local map
4.3.6 Ignoring Vertical Edges
Note that there was a vertical edge in the contour in Figure 78 (c) which was
correctly detected (and marked with a thick green line). It was not added to the map,
as can be seen from (d). Vertical edges were discussed in detail in section 3.7.10 in
connection with camera tilt and again below in section 4.4.3. Removing vertical
edges from the floor boundary was one of the contributions of this thesis [Taylor et
al. 2006, Taylor 2007].
4.3.7 Handling Detection Errors
In the original design, a maximum range was set for the vision system based on
the reliability of the range estimates. It was found that a limited range contributed to
problems with localization by reducing the information content in the input.
Therefore the range was increased and a more complex sensor model was
implemented to account for the uncertainty. It was found that SLAM was more
stable with a larger maximum vision distance. However, the maps became fuzzier.
Another problem that arose with the X80 robots was completely unexpected.
For the work with the Yujin robot the camera was positioned on the robot manually.
This gave complete control over its orientation. With the X80 robots, however, the
CCD cameras were mounted inside sealed units.
It was found that the cameras in the X80s were slightly rotated in their
mountings, perhaps due to poor quality control. Even a rotation of the CCD as small
as two or three degrees resulted in walls appearing to be at significantly incorrect
angles, especially when they were in the top half of the image. This was a side-effect
of the inverse perspective mapping being applied to edges that were skewed by as
little as a couple of pixels.
The effect of this systematic error was to produce discontinuities in the walls
obtained from successive images, but more importantly it affected localization. To
overcome this problem, the images had to be counter-rotated by a small amount
before applying the image processing.
It should also be noted that a pan and tilt camera assembly can cause problems.
Although being able to move the cameras around might seem like an advantage, it
can be difficult to reproduce exactly the same positions on different test runs.
The last issue was one of image quality. For example, the images below in
Figure 79 show respectively: (a) data corruption on the wireless network; (b) poor
illumination; (c) too much illumination; and (d) colour changes due to illumination
as well as extraneous shadows cast by a power cord.
Figure 79 – Examples of problems with images: (a) Interference; (b) Poor illumination; (c) Too bright; and (d) Shadows
Occasional bad images did not cause problems. The system either ignored
them and did not update the map, or corrected the map later when the area was seen
from a different viewpoint. It was sufficiently robust to be able to survive such
glitches.
4.3.8 Warping Camera Images to show Rotations
Given an initial camera view it is possible to ‘warp’ it to show what it should
look like when the camera rotates about its vertical axis. This can be done by moving
each pixel of the original image to a new location that is calculated based on the new
camera point of view and constructing a new image. Many of the updated pixel
coordinates fall outside the new image because some parts of the image move out of
view as the camera rotates. Therefore the warped image contains only a portion of
the original image.
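For a camera that rotates purely about its own optical centre, this warp is the homography H = K R K⁻¹, where K is the camera intrinsic matrix and R the rotation. A sketch with assumed intrinsics follows (as noted later in this section, the X80's off-centre camera breaks the pure-rotation assumption, so the real transformation was more complex):

```python
import numpy as np

def rotation_homography(f, cx, cy, yaw_deg):
    """Homography H = K R K^-1 mapping pixels of the original view to pixels
    of a view rotated by yaw_deg about the camera's vertical axis.  Valid
    only for a pure rotation about the optical centre; f, cx, cy are
    assumed intrinsics for illustration."""
    K = np.array([[f, 0.0, cx],
                  [0.0, f, cy],
                  [0.0, 0.0, 1.0]])
    c, s = np.cos(np.radians(yaw_deg)), np.sin(np.radians(yaw_deg))
    R = np.array([[c, 0.0, -s],
                  [0.0, 1.0, 0.0],
                  [s, 0.0, c]])
    return K @ R @ np.linalg.inv(K)

def warp_point(H, u, v):
    x = H @ np.array([u, v, 1.0])
    return x[0] / x[2], x[1] / x[2]

H30 = rotation_homography(300.0, 160.0, 120.0, 30.0)
u, v = warp_point(H30, 160.0, 120.0)   # where the old image centre moves to
```

Pixels whose warped coordinates fall outside the image bounds are exactly the parts of the scene that rotate out of view.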
Figure 80 shows a set of simulated camera images in the top row (a, b, c) with
a FOV of 60 degrees. The camera was rotated 30° to the left (a) and right (c) with
respect to the original view in (b). The bottom row (d, e, f) shows the ‘warped’
versions of the images. So image (d) is the warped version of (a), and (f) corresponds
to (c). The centre images (b) and (e) are the same because they are from the
reference viewpoint which does not move.
Figure 80 – Overlap between successive images: Top row – simulated camera images; and Bottom row – warped images
The overlapping areas between two images after a rotation are obvious from
(d) and (f). There is a very small area in the centre of the image that would still be
visible after another rotation. This is only true for a tilted camera. If the camera was
not tilted, a second rotation would have resulted in no overlap with the original
image because the FOV was twice the rotation angle. So a feature can remain in at
most two successive images during a pirouette.
These warped images were intended to be used as a means of confirming the
actual rotation by comparing the warped image against the new camera image.
However, the camera on the X80 robot was not located at the centre of the robot and
the actual transformation for the warped images was much more complex than the
one shown here. Furthermore, the X80 tended to move slightly sideways during
rotations and even small changes in position resulted in changes in the image that
could be incorrectly interpreted as rotations.
However, a useful conclusion can be drawn from this exercise: no feature, such
as the corner of an obstacle, was visible in the camera image for three successive
images (two rotations). This explains the difficulty of tracking features.
4.3.9 Creating a Global Map
The simulator was used to validate the mapping code. A generated map was
overlaid on the original model and the result is shown in Figure 81.
Note that this map was produced before the code had been modified to handle
vertical edges. This is why there is an extraneous piece of ‘wall’ at the tip of the long
wall near the centre of the map. However, the overall fit is very good. It
demonstrates that, in the absence of any errors in the robot’s pose, the resulting
occupancy grid is correct and therefore the mapping algorithm was correctly
implemented in the code.
Figure 81 – Map generated in simulation overlaid with the actual model
Accumulated round-off errors and quantization effects (due to the grid cell
size) sometimes resulted in an ‘off by one’ problem when updating the global map.
This caused obstacles to have fuzzy edges and walls not to be completely straight.
Note in Figure 81 that there are small regions behind obstacles where the robot
could not see from its vantage point. However, the robot realised that these areas
were inaccessible (due to the Exploration algorithm) and did not attempt to go there.
Apart from these minor unavoidable errors, the mapping component of the
code was proven to be correct. Even in simulation though, the generated maps will
never be perfect.
4.4 Incremental Localization
As discussed in the next section on SLAM, performing incremental
localization proved to be essential for the stability of SLAM. The purpose of
incremental localization is to improve pose estimates prior to SLAM updates.
Odometry is notorious for being unreliable. Odometry errors were the major
cause of localization problems. Therefore this topic deserves further discussion.
In the course of this research several different approaches were investigated for
improving the pose estimates for the robot. These are discussed in the following sub-
sections.
4.4.1 Odometry Errors
Rotations of the robot were the chief cause of localization problems. One
common cause of errors was the motors not starting simultaneously. This tended to
jerk the robot as it pivoted around the stationary wheel (which was slower to start
up). When the motors eventually stopped, the opposite happened because the motor
that started first reached its target position first and therefore stopped slightly earlier.
This combination of effects caused the robot to move sideways, even though the
objective was a pure rotation.
Even small errors in the estimated rotations accumulated and would eventually
cause the robot to become lost. It is easy to understand why. Consider the following
exaggerated example where the robot is commanded to turn 30°, but actually only
turns 25°. If it now moves forward by a distance of just one metre, it will also move
sideways by 8.7cm. (This is a 5° difference extrapolated over 1m of travel.) Clearly
this type of discrepancy, if left unchecked, will quickly result in large errors in the
robot’s position estimate.
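The figures quoted above follow directly from trigonometry:

```python
import math

# A commanded 30-degree turn that achieves only 25 degrees leaves a 5-degree
# heading error.  Over 1 m (100 cm) of forward travel this produces:
sideways_cm = 100 * math.sin(math.radians(5))         # lateral drift, ~8.7 cm
shortfall_cm = 100 * (1 - math.cos(math.radians(5)))  # lost forward distance
```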
Furthermore, if the robot makes a succession of rotations and loses 5° every
time, then it will soon think it is facing the wrong direction. Of course, a recurring
error of 5° would suggest that the robot control system needed adjustment and in
practice the errors can reasonably be expected to be much smaller than this.
Wheel encoder information provides much better estimates of actual motion
than the commands to the robot’s motors. For instance, the robot might not receive a
command due to radio frequency interference and not move at all. In this case, the
wheel encoder values would correctly show zero motion.
Missing commands were a constant source of problems while running
experiments with the Yujin robot because it had one-way communication via a radio
modem and could not detect lost commands.
The X80 used UDP, which is not a reliable protocol, and connections were
sometimes lost when the robot was on the other side of the building from where the
wireless access point was located. The connection could be re-established, but in the
meantime the robot might have ‘lost’ a command resulting in a discrepancy between
the expected pose and the actual pose. This networking issue was outside the scope
of this research.
4.4.2 Using High-Level Features
Methods for using line features in the images to assist with localization have
been published in the author’s related papers [Taylor et al. 2006, Taylor 2007].
These methods rely on features in the environment: vertical edges appear in
images at the corners of walls or around doors, and horizontal edges appear where
walls meet the floor. These are explained in more detail in the next two sub-sections.
4.4.3 Vertical Edges
When the camera is tilted, vertical edges in the real world do not appear
vertical in the images. Refer to Figure 82 which clearly shows vertical lines in a
simulated camera image. The slope of the lines varies across the image.
Chapter Four Results and Discussion
Figure 82 – Verticals in an image with a tilted camera
For vertical edges, a very simple relationship exists between the slope, m, of
the line in the camera image and the location of where the edge meets the floor in
real world coordinates.
The following derivation can be found in [Taylor et al. 2006] and is one of the
contributions of this research.
Start with the robot coordinate system {x, y, z} which has its origin at the same
point as the camera coordinate system {X, Y, Z} as shown in Figure 5 in Chapter 2.
Note that the positive x axis is into the page. (It might make more sense to have the
origin of the robot coordinates on the floor, but only the x and z coordinates are of
interest because the robot moves in two dimensions). Therefore, the height off the
floor (in the y direction) is not important except to the extent that it affects the size of
the blind area in front of the robot.
Ignore the camera tilt for now. Start with the standard pin-hole camera
equations:
u = fX / Z ,   v = fY / Z    (27)
where (u, v) are the coordinates of a pixel in the image (assuming the image
origin is at the centre) of a real-world point (X, Y, Z) expressed in the camera
coordinate system, and f is the camera focal length.
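As a minimal sketch, equation (27) can be written directly (the focal length and point below are arbitrary illustrative values):

```python
def project(f, X, Y, Z):
    # Pin-hole projection of a camera-frame point (X, Y, Z) to image
    # coordinates (u, v), with the image origin at the centre.
    return f * X / Z, f * Y / Z

# A point 2 m ahead, 0.5 m to the side and 0.25 m up, with f = 500 pixels:
u, v = project(500.0, 0.5, 0.25, 2.0)
```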
By using an edge detector and a line detector, e.g. the Canny edge detector and
probabilistic Hough transform in the Intel OpenCV Library, it is possible to obtain a
set of lines from an image.
The slope of any given line in an image (which is a 2D space) can be obtained
from two points on the line, (u1, v1) and (u2, v2), as follows:
m = (v2 − v1) / (u2 − u1)    (28)
where m is the slope of the line.
If the actual angle is required, then this can be calculated as arctan(m). Note
that the nature of the tangent function is such that m approaches infinity as the line
approaches vertical, which can create computational problems. An alternative would
be to use the inverse slope, which would move the singularity to lines of zero slope –
well outside the range of interest.
Substitute into equation (28), after using equation (27) to determine the image
coordinates for two real-world points, p1 and p2. After a small amount of
simplification the result is:
m = (Y1 Z2 − Y2 Z1) / (X1 Z2 − X2 Z1)    (29)
This equation describes the slope of a line between two points as it appears in
the image.
It is worth noting at this stage that the focal length has disappeared from the
equation. Also, it is irrelevant which point is chosen as p1.
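The independence from f is easy to confirm numerically. In this small check (with arbitrary points), the slope measured in pixel space is the same for two very different focal lengths, and matches equation (29), which contains no f at all:

```python
def slope_in_image(f, p1, p2):
    # Slope of the image-space line joining the pin-hole projections
    # (equation (27)) of two camera-frame points p1 and p2.
    (X1, Y1, Z1), (X2, Y2, Z2) = p1, p2
    u1, v1 = f * X1 / Z1, f * Y1 / Z1
    u2, v2 = f * X2 / Z2, f * Y2 / Z2
    return (v2 - v1) / (u2 - u1)

p1, p2 = (0.5, -0.2, 2.0), (0.3, 0.4, 1.5)
m_a = slope_in_image(1.0, p1, p2)
m_b = slope_in_image(500.0, p1, p2)
# Equation (29), with no focal length anywhere:
m_c = (p1[1] * p2[2] - p2[1] * p1[2]) / (p1[0] * p2[2] - p2[0] * p1[2])
```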
In a robot-centric coordinate system the effect of the camera tilt is that the x
coordinates of all points in robot space remain the same in camera space, i.e. x = X.
However, the y and z coordinates must be transformed. The new coordinates can
easily be calculated as explained below.
For notational convenience, let c = cos(θ) and s = sin(θ). The rotation about the
X-axis can then be expressed as a rotation matrix:
    ⎡ 1  0   0 ⎤
R = ⎢ 0  c  −s ⎥    (30)
    ⎣ 0  s   c ⎦
In terms of robot coordinates, the rotated coordinates of two points can be
converted to camera space as follows:
p1 = (x1, y1c − z1s, y1s + z1c)
p2 = (x2, y2c − z2s, y2s + z2c)    (31)
If the two points were originally selected so that they were both on a vertical
edge in the real world, then in the robot coordinate system x1 = x2 and z1 = z2, i.e.
only the y coordinate varies. Therefore the subscripts have been dropped on x and z
in the following equations.
Using a similar approach to that used to obtain equation (29), substitute these
new coordinates from (31) into equations (27) and (28) and simplify:
m = z (c² + s²)(y1 − y2) / (x s (y2 − y1))    (32)
Applying the trigonometric identity c² + s² = 1 and simplifying, the final result is:
m = −z / (x sin(θ))    (33)
The camera tilt angle, θ, can be determined from the extrinsic parameters once
the camera has been calibrated. The coordinates (x, z) are the location of the
intersection of the vertical edge with the floor in the real world relative to the robot’s
camera, i.e. measured across the floor.
Clearly, if the camera is horizontal so the tilt angle, θ, is zero, then m becomes
undefined. This corresponds to an infinite slope, or a vertical edge in the image. In
other words, with no tilt, all of the vertical edges actually appear vertical in the
images regardless of their position. This is only true in the special case of a camera
mounted horizontally.
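Equation (33) can be verified numerically by pushing two points of a real-world vertical edge through the tilt rotation of (31) and the projection of (27), then measuring the resulting image slope (the geometry below is arbitrary; any focal length gives the same slope):

```python
import math

def slope_of_vertical_edge(x, z, y1, y2, theta, f=1.0):
    # Project two points of a vertical edge at floor position (x, z)
    # through a camera tilted by theta about its X-axis (equation (31)),
    # then measure the slope of the line joining their images.
    c, s = math.cos(theta), math.sin(theta)
    pts = []
    for y in (y1, y2):
        X, Y, Z = x, y * c - z * s, y * s + z * c  # equation (31)
        pts.append((f * X / Z, f * Y / Z))          # equation (27)
    (u1, v1), (u2, v2) = pts
    return (v2 - v1) / (u2 - u1)

theta, x, z = 0.3, 0.5, 2.0
m = slope_of_vertical_edge(x, z, -0.2, 0.4, theta)
# Equation (33) predicts the same slope without any projection:
m_pred = -z / (x * math.sin(theta))
```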
Interestingly, when x is zero the slope is also infinite. This corresponds to a
vertical edge at the absolute centre of the image which is on the camera’s y-axis.
Using the slopes of the same vertical edge in two successive images for the
case where the robot rotates on the spot and the camera is at the centre of the robot,
the angle of rotation can be calculated from:
ψ = arctan(m1 sin(θ)) − arctan(m2 sin(θ))    (34)
The calculation is more complicated if the camera is not at the centre of the
robot. The derivation is not included here because ultimately it was not used.
However, an example is shown in Figure 83 where an X80 robot was commanded to
rotate on the spot while keeping a door frame in view. There are four images – the
original (a) plus three images after rotations of 10° each.
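Equation (34) can also be checked with a synthetic pure rotation. In this sketch the two slopes are generated from equation (33) rather than measured from images, and the recovered angle matches the commanded rotation:

```python
import math

def rotation_from_slopes(m1, m2, theta):
    # Equation (34): rotation angle from the slopes of one vertical edge
    # seen in two successive images, with the camera at the robot's centre.
    return math.atan(m1 * math.sin(theta)) - math.atan(m2 * math.sin(theta))

theta = 0.3                    # camera tilt (radians)
psi = math.radians(10.0)       # true rotation on the spot
x1, z1 = 1.0, 2.0              # edge position before the rotation
# Edge position after the robot rotates by psi:
x2 = x1 * math.cos(psi) - z1 * math.sin(psi)
z2 = x1 * math.sin(psi) + z1 * math.cos(psi)
m1 = -z1 / (x1 * math.sin(theta))  # equation (33), before
m2 = -z2 / (x2 * math.sin(theta))  # equation (33), after
psi_est = rotation_from_slopes(m1, m2, theta)
```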
Figure 83 – Vertical edges in successive images for rotations on the spot
By using a calibration target at each location (not shown in the images in
Figure 83) to obtain the extrinsic parameters, it was possible to determine the actual
rotation of the robot, δ, in each of the three cases. The angles calculated using
equation (34), ψ, are shown in Table 5.
Table 5 – Comparison of calculated rotation angles
Image Angle δ (°) Angle ψ (°)
Figure 83 (b) 10.96 10.72
Figure 83 (c) 10.33 8.09
Figure 83 (d) 10.39 14.92
Total 31.68 33.73
Note that this was a contrived example. The robot normally turned by 30°
during pirouettes, not 10°, so vertical edges did not stay in view for very long.
The number of vertical edges in images of corridors turned out to be quite
small. Furthermore, the results were not very reliable as can be seen from Table 5.
Nonetheless, it seems plausible that humans might use this type of information
to estimate rotations or the orientation of objects. In particular, consider the left half
of Figure 82. All of the verticals lean to the left at progressively larger angles
towards the left edge of the image. If an edge was visible that sloped to the right,
then it could not possibly correspond to a vertical edge in the real world. Obviously
the reverse is true on the right side of the image. This could be used for a simple
sanity check for straight edges detected in images to see if they might be vertical.
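This sanity check follows directly from equation (33): sign(m) = −sign(x), and u shares the sign of x, so a genuine vertical edge can never have m and u with the same sign. A minimal sketch (assuming a downward tilt θ > 0 and v measured upward from the image centre, as in equation (27)):

```python
def could_be_vertical(u, m):
    # Reject a detected line whose slope leans the "wrong" way for its
    # image position: equation (33) gives sign(m) = -sign(x), and u has
    # the same sign as x, so a real vertical edge always has u * m <= 0.
    return u * m <= 0
```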
4.4.4 Horizontal Edges
Another type of high-level feature that can be obtained from the images is the
boundaries between walls and the floor, or horizontal edges. Figure 84 shows the
process used for obtaining a local map. Firstly the floor was detected using edges
and a flood fill to create the filled image (b). Then, by ray tracing inside the filled
image and inverse perspective mapping, the floor map could be drawn (d).
Figure 84 – Obtaining a local map from an image: (a) Image; (b) Floor; (c) Contour; and (d) Local map
Note that the walls appear slightly curved in (d), especially the one on the left.
Also, the end of the corridor was quite uncertain and therefore messy in the map.
The extreme bulge in the end wall corresponded to the small blip in the filled image
– only a single pixel at almost the maximum visual range in this example of 4.0m.
The floor contour (c) was obtained by applying the Douglas-Peucker algorithm
to the floor image (b). This was not required for drawing the local map, although it
could be the topic of future research. In particular, the contour might be used to
straighten up the walls in the map. In this case though, the lines in the contour were
used for localization.
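For reference, the Douglas-Peucker simplification used to obtain the contour can be sketched in a few lines (a generic implementation for illustration, not the exact code used in this research):

```python
import math

def _dist_to_chord(p, a, b):
    # Perpendicular distance from point p to the line through a and b.
    (ax, ay), (bx, by), (px, py) = a, b, p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * px - dx * py + bx * ay - by * ax) / math.hypot(dx, dy)

def douglas_peucker(points, eps):
    # Recursively keep the point farthest from the end-to-end chord while
    # it deviates by more than eps; otherwise keep only the two endpoints.
    if len(points) < 3:
        return list(points)
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = _dist_to_chord(points[i], points[0], points[-1])
        if d > dmax:
            dmax, idx = d, i
    if dmax <= eps:
        return [points[0], points[-1]]
    left = douglas_peucker(points[:idx + 1], eps)
    right = douglas_peucker(points[idx:], eps)
    return left[:-1] + right
```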
By assuming that the walls are orthogonal and run either North-South or East-
West in the map, it was possible to obtain corrections to the robot’s orientation that
were absolute. This is a reasonable assumption for indoor environments.
Furthermore, buildings rarely change internally (except during renovations) so their
structure is a permanent feature.
In the example of Figure 84, there were three straight edges that were long
enough to be significant. Two of these related to the corridor walls and the third one
(in the middle) was the end of the corridor. All three were transformed to real world
coordinates and used to obtain estimates of the robot’s orientation.
If the difference between an orthogonal wall and one of the detected walls was
less than a threshold (set to 15° in the program), the pose of the robot was updated so
that it aligned with the walls.
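This alignment rule can be sketched as follows (the 15° threshold matches the value in the program; the function name is illustrative):

```python
def heading_correction(wall_angle_deg, threshold_deg=15.0):
    # Snap a detected wall direction to the nearest principal direction
    # (0, 90, 180 or 270 degrees) and return the rotation to apply to the
    # robot's orientation estimate, or None if the wall is too far off.
    nearest = round(wall_angle_deg / 90.0) * 90.0
    diff = wall_angle_deg - nearest
    return -diff if abs(diff) < threshold_deg else None
```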
Figure 85 shows two maps – one without localization (a), and the other with
localization (b). When using only the odometry information from the robot’s wheel
encoders, Figure 85 (a), the resulting map was distorted and clearly incorrect. By
applying the constraint that the walls should be orthogonal, the robot’s orientation
was adjusted to keep it correctly oriented for most of the trajectory, resulting in
Figure 85 (b).
Figure 85 – Map created from odometry data: (a) Without incremental localization; and (b) With incremental localization
In this case, there was a 45° wall in the bottom-right of the map marked as
point (1). It has been drawn at an angle in Figure 85 (a). It is not visible in (b)
because it was overwritten when the robot returned later.
The odometry information was so inaccurate that by the time the robot
returned to the 45° corner, marked as point (2), it mistook this diagonal wall for an
east-west wall and drew the corridor in (b) between points (2) and (3) at 45 degrees.
Later it reached the top-right corner, point (3), and again became confused.
This resulted in the map turning south rather than west, although it did ‘lock onto’
one of the principal directions.
Note that Figure 85 (b) was created only using incremental localization applied
to the raw odometry data without the benefit of SLAM. It illustrates clearly the
improvement that can be achieved by making a simple structural assumption about
the environment.
It should be feasible to add more discrete angles, such as 45°, 135°, 225° and
315°, to account for walls that are not at a multiple of 90 degrees. However, this was
not tested and has been left for future research; walls at such angles are, in any case,
unusual in buildings.
As already noted, errors in orientation have a more significant effect on the
robot's estimated pose than errors in its position, so it is vitally important that the
orientation is kept correct. This is what makes this method of localization so valuable.
However, Figure 85 (b) indicates that the robot still incorrectly estimated the
distances travelled. The corridor at the bottom of the map is far too wide. The correct
width is shown in (a), but in (b) the robot retraced its path and in the process it
redrew the corridor much wider. This problem, and the incorrect orientations, were
corrected by using a particle filter to perform SLAM in conjunction with the
incremental localization. See Figure 89 below.
Side Note:
As this thesis was in its final stages a new SLAM algorithm was published,
called P-SLAM. It used the structure of indoor environments to improve localization
by recognising features such as corners in corridors [Chang et al. 2007].
P-SLAM used an environmental structure predictor to ‘look ahead’ and create
possible local maps based on the types of structures previously found during
exploration. For instance, if it encountered a corner in a corridor, it assumed that the
corridor continued for some distance at a right angle to the one it was currently
following, because corners seen earlier in the exploration had produced similar
structures in the map. The key point is that there is an underlying
assumption that the environment has regular features. This would most likely be
invalidated by a single 45° wall as in the example above.
It is interesting to note that the examples given by Chang et al. showed that a
conventional Rao-Blackwellized Particle Filter (RBPF) with 200 particles produced
a map which was severely distorted. To correct the problem a minimum of 1000
particles was required for the RBPF. This was with a laser range finder and also
applying scan matching.
Conversely, P-SLAM was able to create a good map with only 200 particles. In
either case, the number of particles was substantially more than were used in the
experiments in this thesis (usually 50 or 100).
The P-SLAM paper appeared too late to be incorporated into this research. It is
a possible avenue for future work.
4.5 Difficulties with GridSLAM
The main focus of this work was not on developing new SLAM algorithms. To
the contrary, it was expected that existing algorithms could be used, but when it
came to finding appropriate software the results were disappointing. There was a
large discrepancy between what had been reported in the literature and the
availability of robust packaged software to make SLAM accessible to researchers
without the need to implement complex algorithms from scratch.
The approaches to Visual SLAM identified in the initial research were not
appropriate for the experimental hardware used in this research. The robot had only a
single camera; images were captured at a very low frame rate; and they exhibited
motion blur during moves. Using piecewise linear motion restricted the number of
images that were available for processing and the significant changes in the robot’s
pose between images meant that feature tracking was not feasible.
The major problem encountered in this research was getting SLAM to work
reliably. Despite SLAM having been an active research area for decades and claims
that the SLAM problem has been solved in a theoretical sense [Durrant-Whyte &
Bailey 2006, Bailey & Durrant-Whyte 2006], SLAM still has problems with the
practical aspects of implementation.
In particular, this research initially relied on algorithms published in
Probabilistic Robotics [Thrun et al. 2005]. While there was nothing fundamentally
wrong with these algorithms, they did not work reliably with the vision system and
there were some significant changes required to ensure convergence. Part of the
problem is related to the limited amount of information available from vision. As has
been pointed out previously, corridors do not provide a wealth of features. Both
these factors conspired to cause localization to fail.
While on sabbatical in Germany in 2006, the author discussed several issues
with Prof. Burgard, one of the authors of Probabilistic Robotics. It became clear that
the practical implementation of SLAM still falls short of the theory.
4.5.1 Failures in Simulation
In the process of testing the GridSLAM code using the simulator it was found
that GridSLAM would often fail to produce maps that were consistent with the
ground truth.
Figure 86 (a) is an example of a final map produced from a simulation run with
random errors in the robot’s motions. The ground truth map used in the simulation is
shown in Figure 86 (b) for comparison. There were 50 particles in this test run and
the map (a) shows the average of all the particle maps. The noise level was not
particularly high, but was representative of the noise encountered with the real
robots. Clearly, the map is not accurate although it is substantially correct.
Figure 86 – Map from simulation with random motion errors (50 particles): (a) GridSLAM output; and (b) Ground Truth
Most of the errors in the map in Figure 86 occurred due to an incorrectly
interpreted rotation half-way through the test run and it can be seen that the map is a
composite of two distinct maps. With limited FOV and range, the SLAM algorithm
is susceptible to errors in areas where there are few obstacles in view, which results
in a localization failure.
The simulator, vision system, mapping and exploration algorithms were all
effectively validated by performing test runs in the simulator with no motion errors.
This can be seen from the screenshot in Figure 87 where SLAM was not performed
because it is not necessary with perfect odometry.
Figure 87 – Map from simulation with no motion errors
Because of the problems encountered with the stability of the GridSLAM
algorithm, changes were made to the code during development to try to make the
filter more robust. These changes had no mathematical basis, and might invalidate
the original theory from which the filter was developed. They are explained in the
following sections.
4.5.2 Sensor Model
It has been explained previously how the accuracy of range estimates decreases
rapidly the further an obstacle is from the camera. Unless there is a long sequence of
very close obstacles, then the sensor measurements will generally contain a large
amount of error, which can be viewed as noise.
Experience on this project showed that the sensor model could have a
significant impact on the robustness of SLAM. In particular, if the model was too
simplistic then good information in the map was rapidly replaced by bad information
due to the clear-cut values for free space and obstacles.
In the case of the X80 robot, the images are converted to JPEG format by the
robot before transmission to the PC over the wireless LAN. This causes artefacts in
the images which sometimes induced artificial edges. As previously explained,
shadows can also induce artificial edges. Therefore the probability of an obstacle
should not be set to 1.0 at an edge. It should be somewhat lower to account for the
possibility that the detected edge was not an edge at all.
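This kind of caution fits naturally into a log-odds occupancy update, where the inverse sensor model assigns an obstacle probability below 1.0. A minimal sketch with assumed values (not the model parameters actually used in the thesis):

```python
import math

# Assumed inverse sensor model values: deliberately less than certain.
P_OCCUPIED = 0.7   # cell probability where an edge was detected
P_FREE = 0.4       # cell probability for cells traced as floor

def logit(p):
    return math.log(p / (1.0 - p))

def update_cell(log_odds, edge_detected):
    # Accumulate evidence in log-odds form: repeated consistent
    # observations drive a cell towards certainty, while a single bad
    # image (JPEG artefact, shadow) has only a limited effect.
    return log_odds + logit(P_OCCUPIED if edge_detected else P_FREE)

def probability(log_odds):
    return 1.0 - 1.0 / (1.0 + math.exp(log_odds))
```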
Also, the frequency of updates from the sensor can be viewed as affecting the
sensor model in a subtle way. Laser range finders typically provide several updates
per second. This means that the same area is seen repeatedly, which naturally
increases the certainty of the cells in the map. Therefore the individual updates can
be more conservative.
However, in this work just one camera image was used after each step during
pirouettes. This meant that the values used in the sensor model had more extreme
values than would have been necessary if several updates had been performed from
each vantage point. The cumulative effect would be the same, except that occasional
bad images would have less impact on the map. Fortunately, errors tended to be
corrected because the robot usually saw the same area several times.
For these reasons, a more realistic model was developed, as explained in
section 3.7.6. The following diagram, Figure 88, is an example of the realistic sensor
model. It shows a local map generated as the robot approached the opening of a
corridor.
Figure 88 – Local map created with complex sensor model: (a) Camera image; and (b) Map showing uncertain range values
The wall in the map (b) on the near right-hand side does not have much ‘fuzz’
because the error was less than one grid cell. However, the far wall on the left-hand
side of the corridor was too far away to obtain a reliable range measurement.
Therefore it is quite fuzzy. Note too that there were quantization effects due to both
the pixel quantization in the image and round-off to grid cells in the map.
Figure 88 also illustrates one of the problems with the vision system. The stray
ray at the bottom edge of the local map extends to the maximum vision distance.
This occurred because of a failure in the detection of the floor boundary that allowed
the floor to ‘leak’ through. In this case, it was an aluminium air-conditioning duct
that was a similar grey colour to the floor and very hard to detect. (Refer back to
Figure 78 for a better view of the duct).
Figure 89 – Maps created with: (a) Simple sensor model; and (b) Complex sensor model
Figure 89 shows two maps. The first map was created with the simple model.
The second map used the complex model. The walls are more clearly defined in the
second map, but they are qualitatively the same. The first model was less
computationally intensive.
4.5.3 Particle Filter Issues
There are several different parameters and approaches that can be used with
Particle Filters. Some issues with Particle Filters were identified during the literature
review. The issues included:
• Maintaining particle diversity (for a representative distribution);
• Determining the number of particles required for an accurate map;
• Improving the proposal distribution during an update;
• Avoiding particle depletion;
• Defining an appropriate motion model and determining its parameters;
• Obtaining importance weights with the widest possible spread of
values; and
• Implementing a suitable resampling algorithm.
The following sections discuss these issues in the context of this research.
There are still many open research questions related to these issues.
4.5.4 Maintaining Particle Diversity
Particle diversity is important for successful operation of a particle filter. The
motion model parameters also have an impact on diversity, but this is discussed
below in the section 4.5.8 on the Motion Model.
Figure 90 – Particle trajectories: (a) Before; and (b) After closing a loop
Figure 90 shows how particle trajectories are affected by loop closure. The
robot started at the bottom-centre of the map and proceeded to the left travelling
around the loop in a clockwise direction. (It made a brief deviation to the right in the
bottom-right corner.) In (a), it was just about to join up with its ‘tail’ at the point
where it started. Notice that the particle trajectories had spread out significantly into
a ‘fan’.
In Figure 90 (b) the robot had started to retrace the early portion of its path. As
the robot became more certain about where it was, based on seeing part of the map
again, many of the particles were discarded and their trajectories disappeared. The
net effect was that after closing the loop there was only a single trajectory (the single
line at the left-hand side) that could be traced back to the very beginning of the test
run.
Notice in Figure 90 (a) that there were (at least) two distinctly different paths
(based on clumps of particles). On the scale of this map, which is roughly 20m
square, the trajectories at the left side of the map were about 60cm apart. However,
after loop closure all but one of them were lost forever. This indicated that the number
of particles was very close to the minimum required to reliably close the loop.
Statistically speaking, it was quite likely that the robot’s pose was correct in
(b). However, the exact path that it took to reach where it ended up might actually
have been wrong. There could have been small errors in the map along the way, due
to an incorrect path, even though the final pose was close to being correct.
Sufficient particles are required to maintain diversity. However, this is always
a compromise because the computational requirements increase directly in
proportion to the number of particles.
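A standard diagnostic for this trade-off (a common particle-filter measure, not reported as being used in this research) is the effective sample size, which equals the particle count for uniform weights and collapses towards 1 as the weight mass concentrates on a few particles:

```python
def effective_sample_size(weights):
    # N_eff = 1 / sum(w_i^2) over normalized weights: near len(weights)
    # means healthy diversity, near 1 means one particle dominates.
    total = sum(weights)
    return 1.0 / sum((w / total) ** 2 for w in weights)
```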
4.5.5 Determining the Number of Particles Required
In the experiments, only 50 or 100 particles were typically used for
performance reasons. Even though piecewise linear motion allowed the robot to ‘sit
and think’, it was still using up battery power so elapsed time had to be minimized.
By re-running recorded video images and log files, various parameter settings
could be tested afterwards without having to be concerned about the battery life of
the robot. Using pre-recorded data was also essential to provide reproducible test
runs for comparison purposes when changes were made to the algorithms.
The maximum number of particles trialled was 500. Even this is a low number
compared with some examples reported in the literature. However, the program still
failed to produce a reliable map unless incremental localization was enabled (as
explained in section 4.4), which is one method of improving the proposal
distribution.
4.5.6 Improving the Proposal Distribution
In an attempt to improve the stability of the SLAM algorithm, code was added
to the filter to iteratively find the best local map for each particle during the selection
of new poses using the motion model. The number of iterations was limited because
it had a significant impact on the performance.
Another approach, which was suggested in Probabilistic Robotics, was to
iterate until the new pose was not inside an obstacle according to the particle’s own
map. This makes a lot of sense from a physical point of view. However, it is possible
for the particle’s map to be wrong, but the selected pose to actually be correct.
After trialling both of these approaches, no noticeable improvement was found
in the performance of the SLAM algorithm. In other words, it still failed to produce
accurate maps.
Other researchers have used scan matching of laser range finder scans. As
noted previously, the non-linear nature of inverse perspective mapping makes
conventional approaches to scan matching inappropriate. New theoretical
development will be required to make scan matching viable for vision. However, as
long as there is only a small FOV, scan matching will still have problems unless the
frame rate is increased substantially to keep the incremental changes between images
very small. This line of research warrants further investigation.
Using incremental localization based on wall alignment eventually proved
successful. This was discussed at length in section 4.4.2 on Incremental Localization
using high-level features.
4.5.7 Avoiding Particle Depletion
One of the first problems that became apparent while testing the SLAM code
was that sometimes the particle set would be ‘flushed’ and a small number of
particles would effectively take over. This was in fact a previously recognized
problem known as particle depletion / deprivation [Thrun et al. 2005] or sample
impoverishment [Montemerlo & Thrun 2007].
The problem is basically a consequence of over-sampling. The objective of
resampling is to maintain diversity. However, if too many particles are replaced
during a single resampling step, then diversity will be reduced substantially because
deleted particles are replaced with clones of existing particles.
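A standard way to limit this effect is low-variance (systematic) resampling, which uses a single random offset to place evenly spaced pointers over the cumulative weights, so the survivors track the weight distribution closely. A sketch (not necessarily the resampler used in this research):

```python
import random

def low_variance_resample(particles, weights, rng=random):
    # One random draw positions n evenly spaced pointers over the
    # cumulative weights; each pointer selects one surviving particle.
    n = len(particles)
    step = sum(weights) / n
    u = rng.uniform(0.0, step)
    out, cumulative, i = [], weights[0], 0
    for _ in range(n):
        while u > cumulative:
            i += 1
            cumulative += weights[i]
        out.append(particles[i])
        u += step
    return out
```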
Montemerlo and Thrun claimed that this was due to a mismatch between the
proposal and posterior distributions over the robot’s pose. They suggested that this
happened when the robot’s sensor was ‘too accurate’ relative to the robot’s motion
error. However, the author did not find this to be the case for the experimental
environment in this thesis. To the contrary, the sensor errors due to perspective were
significantly greater than the observed motion errors.
In this research it was found that ‘bad’ particles can be selected quite
unpredictably during resampling. Because the motion model uses probabilities, it can
happen that some of the particles might have poor pose estimates. However, there
might be insufficient information in the observations to eliminate these particles on
the next SLAM update because the particle weights are very similar.
If a bad particle is duplicated during the resampling process, then it has twice
the chance of affecting the filter stability. Each successive SLAM update that the bad
particles survive doubles the effect. Rapid growth in the number of bad particles
ensues, and the filter fails catastrophically.
This problem arises in the situation where the observations are not very
accurate, and therefore the calculation of the weights does not discriminate
sufficiently between ‘good’ and ‘bad’ particles. In a sense, this is the opposite of
what Montemerlo and Thrun suggest.
In [Bennewitz et al. 2006] the paper authors stated that: ‘Due to spurious
observations it is possible that good particles vanish because they have temporarily a
low likelihood.’
This argument supports the requirement for sufficient richness, or information
content, in the input data. A thought experiment helps to explain. If the input data is
‘bland’ so that all particles have equal weights, then wide variations in particle poses
are possible. For example, if the robot were in the middle of a large football stadium,
then it would have no way to discriminate between particles (because there would be
nothing to see within the range of the sensor) and any arbitrary pose could be
propagated because all weights would be equal.
4.5.8 Defining the Motion Model
The motion model consisted of simple translations and rotations (piecewise
linear motion). The noise distributions used in the motion model could be adjusted
via program parameters. Note that the initial pose of the robot was the same for all
particles because clearly it could only start from one location.
Experience from this research showed that the parameters of the motion model
could be set too low. This is easy to understand from a thought experiment. If the
variances of the motion model were set to zero, then the particles would have no
variation and the filter would be almost certain to fail even if the motion model were
otherwise perfect because the robot itself always has noisy motions. In other words,
the particles would be unable to track the real motion of the robot. Therefore, some
minimum variance is required to encompass the actual motion variance of the robot
and to ensure that there will be sufficient particle diversity.
Conversely, if the motion model error variances were set too high, the particles
wandered around a lot. This increased the chance of a rogue particle being generated
that deviated significantly from the true path. If this particle survived for a
substantial amount of time, the resampling process cloned it. The overall effect was
to make the filter unstable.
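A motion model for piecewise linear motion (rotate, then translate) with the noise variances exposed as parameters can be sketched as follows (illustrative, not the exact model used in the experiments); setting the sigmas to zero reproduces the degenerate no-diversity case described above:

```python
import math
import random

def sample_motion(pose, distance, dtheta, sigma_d, sigma_theta, rng=random):
    # pose = (x, z, heading); apply the commanded rotation and translation
    # with additive Gaussian noise drawn independently for each particle.
    x, z, heading = pose
    heading += dtheta + rng.gauss(0.0, sigma_theta)
    d = distance + rng.gauss(0.0, sigma_d)
    return (x + d * math.sin(heading), z + d * math.cos(heading), heading)
```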
As with the number of particles, there is no easy way to select the motion
model parameters. Texts such as [Thrun et al. 2005, Choset et al. 2005] offer
suggestions based on analysing the statistics of measured motions from logs of test
runs. The problem with this is that the real ground truth is rarely available. So the
estimated motion errors are based on the trajectory of the ‘best’ particle. This is
usually done off-line, but it could be done dynamically on-line. Further research is
required in this area.
4.5.9 Calculating Importance Weights using Map Correlation
During the resampling phase of the particle filter update it is necessary to
calculate weights for each of the particles to determine their ‘goodness of fit’ to the
existing map.
The Correlation-Based Measurement Model from [Thrun et al. 2005] was used
in this research. The textbook notes both the advantages and disadvantages of this
approach. In particular, there is no plausible physical explanation for this approach.
However, it has been widely used in the past.
It was found in this research that map correlation only produced a fairly narrow
range of weights. This arises to some extent due to the small area covered by the
222 Faculty of Science and Technology QUT
Chapter Four Results and Discussion
vision sensor. It is therefore possible for a ‘bad’ particle to survive resampling, and
even multiply. This was discussed above under Particle Depletion.
There are some other issues with map correlation as expressed in [Thrun et al.
2005]. Firstly, it suggests calculating the correlation over a local map that is square.
However, this includes many cells that are not in the robot’s sensor range because
the visible range is a segment of a circle. A mask was therefore added to the program
that included only the cells that were potentially visible. (See section 3.9.4).
Another issue is whether or not the summation should include cells where the
value is unknown space. In theory, there is no point including a cell from the global
map with a value of unknown because there is an equal chance of it being occupied
or unoccupied. (Such a cell has not been seen before). Similarly, the cells in the local
map with a value of unknown were not seen during the latest scan. It could therefore
be argued that they should not be included either.
In the author’s experience, excluding unknown cells from the correlation did
not have any significant impact on the stability of the SLAM algorithm. However,
the effect might have been minimal due to the fact that the number of cells involved
was small.
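The masking and unknown-cell handling described above can be sketched as a simple correlation-style weight. This is an illustrative stand-in, not the exact Correlation-Based Measurement Model from the textbook; the flat-list map representation and the convention that unknown cells hold the value 0.5 are assumptions.

```python
UNKNOWN = 0.5  # occupancy value for never-seen cells (assumed convention)

def correlation_weight(global_patch, local_map, mask, skip_unknown=True):
    """Correlation-style 'goodness of fit' between a local map and the
    matching patch of the global map.

    Only cells inside the visibility mask (the segment of a circle that
    the camera can actually see) are compared.  Cells that are unknown
    in either map can be skipped, since they carry no information.
    Maps are flat sequences of floats in [0, 1]; returns a value in
    [0, 1], with 1 meaning a perfect match.
    """
    total, count = 0.0, 0
    for g, l, visible in zip(global_patch, local_map, mask):
        if not visible:
            continue
        if skip_unknown and (g == UNKNOWN or l == UNKNOWN):
            continue
        total += 1.0 - abs(g - l)   # 1 for a perfect match, 0 for worst
        count += 1
    return total / count if count else 0.0
```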
One other approach investigated was to accumulate information into temporary
local maps before updating the global map. For example, it seemed sensible to
combine all of the local maps from a pirouette before running a SLAM update and
integrating the information into the global map. This feature was added to the ExCV
program and it was found that there was a compromise – delaying SLAM updates
also delayed localization and the robot pose error could increase in the meantime.
Combining several scans into a single local map before performing the
correlation was also tested. This led to more reliable matches, but it also meant that
there was potential for the robot to stray from the true path in the interim because it
made several moves before updating its pose estimate.
Instead of combining scans, an alternative approach was developed whereby a
moving average was kept for the weight of each particle. This had a smoothing effect
on the particle weights. The objective was to prevent a ‘bad’ particle from increasing
its chances of being cloned during resampling if it suddenly had a good correlation.
Averaging the weights also had the effect of ensuring that a ‘good’ particle did not
suddenly get thrown away due to a single bad match.
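The smoothing idea can be sketched as a simple windowed moving average per particle; the window length here is an illustrative choice, not the value used in the experiments.

```python
def smoothed_weight(history, new_weight, window=5):
    """Keep a short history of raw weights for one particle and use
    their mean as the importance weight, so that one lucky match cannot
    suddenly promote a 'bad' particle, and one bad match cannot kill a
    'good' one.  history is mutated in place."""
    history.append(new_weight)
    if len(history) > window:
        history.pop(0)
    return sum(history) / len(history)
```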
4.5.10 Resampling Issues
The author found that the Low Variance Sampler often kept particles with very
low weights, especially if the first particle in the set was one of these. This was due
to the nature of the algorithm, and happened when the random starting value, r, was
much less than the weight each particle would have if all weights were equal, i.e. 1/M.
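A minimal sketch of the Low Variance Sampler makes this behaviour easy to reproduce; the injectable rng argument exists only so the r ≈ 0 case can be demonstrated, and is not part of the standard algorithm.

```python
import random

def low_variance_resample(particles, weights, rng=random.random):
    """Low variance (systematic) resampling: draw one random start r in
    [0, 1/M), then take M equally spaced pointers.  When r is much
    smaller than 1/M, the first particle is selected even if its weight
    is tiny, which is the behaviour noted in the text."""
    M = len(particles)
    total = sum(weights)
    r = rng() / M
    out, i = [], 0
    c = weights[0] / total
    for m in range(M):
        u = r + m / M
        while u > c:
            i += 1
            c += weights[i] / total
        out.append(particles[i])
    return out
```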
Changes to the resampling frequency could affect the longevity of particles and
this had dramatic effects on the output of the filter. In extreme cases, a poor particle
could survive for a long time. It would split and create more particles, which
compounded the problem.
Figure 91 – Divergent particle
Figure 91 shows a particle that separated from the rest and then split several
times. This could potentially have caused the filter to diverge. In this case, however,
these aberrant particles were eliminated a short time later when the robot turned the
corner (at the top-right) and realised where it was (having already been there before).
It is also important to note that in Figure 91 a large portion of the trajectory is
represented by a single particle. The length of the ‘tail’ (to where all the particles
have a common ancestor) gives some indication of the size of a loop that can be
closed by the filter.
4.5.11 Retaining Importance Weights to Improve Stability
According to the GridSLAM algorithm, a new importance weight is calculated
for each particle every time a new sensor update is received. When the importance
weights are very similar this can cause problems during resampling because ‘bad’
particles might be selected. A typical case where this might occur is in a large open
corridor where the robot sees only free space. In this case, all particles are likely to
have the same weights. This problem is compounded by a limited sensor range
(discussed below in section 4.7).
To try to reduce this effect, the weights were kept as a property of the particles
and accumulated across iterations. Two different ways to do this were investigated.
In the first, the weights were simply added together, because they were subsequently
normalized anyway. In the second, the logarithm of the weights was taken to spread
them out; adding the logarithms was then equivalent to multiplying the weights together.
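The two accumulation variants can be sketched directly; the function names are illustrative, and the normalization step is shown because it is what makes the raw scale of the accumulated weights irrelevant.

```python
import math

def accumulate_raw(carried, new_weight):
    """Variant 1: add the new weight to the carried weight; the later
    normalization removes the overall scale."""
    return carried + new_weight

def accumulate_log(carried_log, new_weight):
    """Variant 2: add log-weights, which is equivalent to multiplying
    the raw weights and spreads similar weights further apart."""
    return carried_log + math.log(new_weight)

def normalize(weights):
    """Scale a set of weights so that they sum to one."""
    s = sum(weights)
    return [w / s for w in weights]
```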
4.5.12 Adjusting Resampling Frequency
The GridSLAM algorithm says that the particle set should be resampled on
every update. Over-sampling is known to cause problems with particle filters.
Therefore, resampling of the particle set was not performed on every iteration of the
SLAM algorithm. However, it was necessary to decide how long to wait before
resampling.
One simple approach, which was implemented in GMapping, was to perform a
SLAM update only after the robot had moved by a certain distance or rotated by a
certain amount. These parameters had to be established by trial and error because
there were no published methods for doing so.
The concept of a ‘Number of Effective Particles’ had been introduced for
selective resampling [Grisetti et al. 2005]. The formula was based on the particle
weights:
Neff = 1 / Σ_{i=1}^{M} (w^(i))^2                                        (35)
Neff is a measure of particle diversity. It works on the assumption that if all of
the particles are distributed according to the true posterior distribution then they
should have equal weights and the value of Neff would be equal to M. As the particles
start to diverge, the value of Neff will fall.
The literature suggested that resampling should only be done when Neff falls
below half the number of particles, or 50% of M. This did not occur very often
during the experiments. Therefore other criteria were used to decide when to
resample as explained previously.
The ExCV program accumulated the distance travelled by the robot, and the
total turn angle. When either of these exceeded a pre-set amount, then resampling
was performed. The value of Neff was used as a fallback, but this rarely triggered
resampling even when set to 65% of M.
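The Neff calculation of equation (35) and the combined resampling trigger described above can be sketched as follows; the distance, turn and Neff-fraction thresholds are illustrative, not the preset values used in the ExCV program.

```python
def n_eff(weights):
    """Effective number of particles (equation 35): 1 / sum of squared
    normalized weights.  Equals M when all weights are equal, and falls
    as the weights diverge."""
    s = sum(weights)
    return 1.0 / sum((w / s) ** 2 for w in weights)

def should_resample(dist, turn, weights, max_dist=1.0, max_turn=1.5,
                    neff_fraction=0.5):
    """Resample when the accumulated distance or turn angle exceeds a
    preset amount, with Neff falling below a fraction of M used as a
    fallback criterion."""
    M = len(weights)
    return (dist >= max_dist or abs(turn) >= max_turn
            or n_eff(weights) < neff_fraction * M)
```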
4.6 Comparison with Other SLAM Algorithms
Two data sets containing random motion errors were generated using the
simulation environment, shown in Figure 92, to test the stability of two other SLAM
packages in comparison to the GridSLAM implementation in this research. The
GridSLAM algorithm was unable to produce a correct map with these data sets (see
Figure 95 and Figure 96 below in the section on GMapping). For these test runs the
maximum vision distance was set to 8m.
Figure 92 – Ground truth map for simulation
4.6.1 DP-SLAM
The map in Figure 93 (a) shows the initial result for DP-SLAM with 50
particles. (Ignore the rotation of the map which is not relevant and is an artefact of
the program). Although DP-SLAM started out well, it became disoriented half way
through the test run. The map in Figure 93 (a) is collapsed on itself and should be
much larger as shown in (b).
Figure 93 – Test 1 with DP-SLAM: (a) Initial test run; and (b) After learning the motion model
The authors of DP-SLAM were contacted and the problem was discussed with
them. They suggested that the motion model might need to be updated to match the
robot, and another program was supplied that could learn the motion model. This
underscored the importance of a good motion model.
After performing some training runs to determine the motion model
parameters, the test data was again run through DP-SLAM. The number of particles
was also increased to 100. The resulting map is shown in Figure 93 (b). Although it
was significantly better, it still clearly had problems.
In Figure 93 (both a and b) there appear to be two distinct maps overlapping
each other with the second one slightly rotated with respect to the first. The most
likely cause was an area where the vision data is ambiguous due to the limited FOV
and range of the simulated camera. In this case the SLAM algorithm would be
unable to accurately determine the amount of rotation.
The second data set was also run with both 50 and 100 particles with the
results shown below in Figure 94. The second map was much better than the first
one, which was not surprising because particle filters should work better with more
particles.
Figure 94 – Test 2 with DP-SLAM: (a) 50 Particles; and (b) 100 Particles
The examples on the DP-SLAM web site used thousands of particles.
However, in order to keep the comparison on a similar footing with the other SLAM
algorithms, the number was limited to 100. Also, DP-SLAM was found to be slower
than GridSLAM and using 1,000 particles (or more) would have been impractical.
4.6.2 GMapping
Comparisons of GridSLAM against GMapping are shown in Figure 95 and
Figure 96. These are easier to compare directly than GridSLAM and DP-SLAM
because of the format of the GMapping output which is from the CARMEN log2pic
program. The GridSLAM examples used 50 particles and a grid cell size of 5cm; the
GMapping examples used 30 particles and a grid cell size of 10cm. The lower settings
for GMapping were adopted after it was found that the algorithm ran very slowly.
Figure 95 – Test 1 – GridSLAM compared to GMapping: (a) GridSLAM; and (b) GMapping
Figure 96 – Test 2 – GridSLAM compared to GMapping: (a) GridSLAM; and (b) GMapping
GMapping produced better results in both tests. However, the maps still
showed obvious flaws.
GMapping was definitely not fast enough for use in real-time. This appears to
be due to the fact that it is a more complex algorithm. However, another test run was
performed with the second data set using 50 particles and a 5cm grid size. This is
shown in Figure 97.
Figure 97 – GMapping with 50 Particles and 5cm grid size
This map from GMapping was substantially correct, although there is still a
double wall visible at the right-hand side.
4.7 Information Content Requirements for Stable SLAM
One aspect of SLAM that has been ignored in the literature is the Information
Content Requirements – the input (sensor measurement) data must provide enough
information for reliable SLAM to take place.
There are no stated requirements for ‘richness’ in the input data for any of the
major SLAM algorithms to the author’s knowledge. However, all algorithms should
include constraints on the input stream to satisfy stability and robustness criteria. The
literature implicitly assumes that the data will be sufficient for the task.
Problems arise from the nature of the vision sensor. There are three aspects to
information content that deserve mention:
• Field of View;
• Effective Range; and
• Frequency of updates.
The video camera on the X80 robot had a field of view of only 50.8°, low
resolution (176 x 144), and a low frame rate (about 2 fps). As a consequence of
perspective, it also had very limited range once the accuracy of the range data was
taken into account.
In contrast, humans have a field of view that varies from 160° for a single eye
to as much as 200° for stereo vision [Wandell 1995]. There is no ‘frame rate’ for
human vision because separate parts of the visual pathways run asynchronously.
However, visual information is updated around 20 times per second.
Side Note:
To actually see how large your field of view is, stretch your arms out wide and
wriggle your fingers. Even while you are looking straight ahead, you will be able to
detect the motion of your fingers even though your direction of view is at 90° to your
arms. Humans are very sensitive to peripheral motions – a legacy from survival in
the wild.
Now consider the sensor of preference for SLAM research – the Laser Range
Finder. The best-known LRFs are manufactured by SICK GmbH. The product
specifications are available on their web site (http://www.sick.com). The following
information is for the SICK LMS 200.
Table 6 – Technical Specifications for SICK LMS 200
Feature: Value
Field of View: 180°
Range: Maximum 80 m (262.5 ft)
Angular Resolution: 0.25° / 0.5° / 1.0° (selectable)
Response Time: 53 ms / 26 ms / 13 ms (up to 75 Hz)
Measurement Resolution: 10 mm (0.39 in)
System Error (good visibility, 23°C (73°F), reflectivity ≥ 10%): Typ. ±20 mm (mm-mode) at range 1…8 m (3.2…26.2 ft); Typ. ±4 cm (cm-mode) at range 1…20 m (3.2…65.6 ft)
Statistical Error, Standard Deviation (1 sigma): Typ. ±5 mm (at range ≤ 8 m, ≥ 10% reflectivity, ≤ 5 kLux)
To put this in context, the following diagram, Figure 98, compares the FOV for
vision with the FOV of a LRF.
Figure 98 – Comparison of vision and laser range finder fields of view
This diagram has been drawn to scale so that the outermost hemisphere
represents a range of 20m and a 180° field of view. This is sufficient to see all the
way to the end of most corridors in office buildings. However, the LMS 200 can
actually see up to 80m, so even this diagram is too small to show the full range.
In contrast, a camera with a 60° field of view is represented by the dotted lines.
The smaller (and darker coloured) segments correspond to ranges of 2m, 4m and 6m
respectively. For the X80 robot, with the camera tilted downwards, a maximum
range of only 2m was often used for reasons explained previously.
The researchers at the University of Freiburg noticed in the course of an
industry project that reducing the range of a laser range finder to just 8m made
SLAM much less reliable.
The accuracy of a LRF is basically constant across its entire range, with a
system error of only 2cm. For vision, the potential error in range measurements
increases rapidly with the distance from the camera. In fact, for the X80 robot, by the
time the range reached 6m, the very next pixel in the image corresponded to a point
more than 7m away – a possible error of up to 1m even assuming a conservative
error in locating the floor boundary of only one pixel.
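The rapid growth of range error with distance can be illustrated with a simple flat-floor camera model. The camera height, tilt and field-of-view values below are illustrative assumptions, not the X80 calibration.

```python
import math

def pixel_ranges(cam_height=0.3, tilt_deg=15.0, vfov_deg=40.0, rows=144):
    """Ground range seen by each image row for a downward-tilted camera
    above a flat floor.  Row 0 is the top of the image; rows whose ray
    does not reach below the horizon return None."""
    ranges = []
    for r in range(rows):
        frac = (r + 0.5) / rows                       # 0 at top, 1 at bottom
        angle = tilt_deg - vfov_deg / 2 + frac * vfov_deg  # below horizontal
        if angle <= 0:
            ranges.append(None)                       # at or above horizon
        else:
            ranges.append(cam_height / math.tan(math.radians(angle)))
    return ranges
```

The gap in range between adjacent rows grows dramatically towards the horizon, so a one-pixel error in locating the floor boundary translates to a range error of metres at long distances, which is why a small maximum vision range was enforced.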
To test the hypothesis that information content affects SLAM, the GMapping
algorithm was run using a modified log file where the data was truncated. Instead of
the full 180° of laser scan data, only 60° was used. In addition, the range of the data
was limited to 8m instead of 80m.
The resulting map is shown in Figure 99 and it has some obvious flaws.
Figure 99 – Map produced by GMapping with truncated data
In summary, the amount of information available from vision using a cheap
camera is much less than for a LRF. However, LRFs are very expensive compared to
web cameras.
Information content, or richness of the input signal, can also be considered in
the time domain, not just the spatial domain. As explained in Chapter 3, MonoSLAM
requires a high frame rate due to the way that it tracks features. In other words, when
it comes to tracking features in successive images, the temporal separation between
images can be an important design parameter too.
4.8 Exploration
For a robot to build a map autonomously, it must have an exploration
algorithm. The Distance Transform (DT) was used in this research. The DT, as
implemented using the Borgefors algorithm [Borgefors 1986], is often referred to as
a two-pass process (forward and backward). Each pass is similar to a convolution.
However, for some maps, it is necessary to make multiple forward/backward passes
[Jarvis 1993]. In the experiments this was rarely a problem, however, just to be
certain, the code iterated repeatedly until there were no changes. This always
resulted in executing a redundant iteration.
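The iterated two-pass process can be sketched as follows. The 3/4 chamfer weights and the grid encoding (0 for goal cells, 1 for free cells, None for obstacles) are conventional choices for a Borgefors-style transform, not necessarily those of the thesis implementation.

```python
INF = float('inf')

def distance_transform(grid):
    """Chamfer distance transform: forward and backward passes are
    repeated until no cell changes, since some maps need more than one
    pair of passes.  The final iteration is always redundant, as it is
    the one that detects that nothing changed."""
    h, w = len(grid), len(grid[0])
    d = [[0.0 if grid[y][x] == 0 else INF for x in range(w)]
         for y in range(h)]
    ORTH, DIAG = 3, 4                      # classic integer chamfer weights
    changed = True
    while changed:
        changed = False
        for y in range(h):                 # forward pass
            for x in range(w):
                if grid[y][x] is None:
                    continue
                for dy, dx, c in ((-1, -1, DIAG), (-1, 0, ORTH),
                                  (-1, 1, DIAG), (0, -1, ORTH)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and d[ny][nx] + c < d[y][x]:
                        d[y][x] = d[ny][nx] + c
                        changed = True
        for y in range(h - 1, -1, -1):     # backward pass
            for x in range(w - 1, -1, -1):
                if grid[y][x] is None:
                    continue
                for dy, dx, c in ((1, 1, DIAG), (1, 0, ORTH),
                                  (1, -1, DIAG), (0, 1, ORTH)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and d[ny][nx] + c < d[y][x]:
                        d[y][x] = d[ny][nx] + c
                        changed = True
    return d
```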
In the literature, it has often been assumed that the map cell size matches the cell
size for robot motions. If the cell size is small compared to the robot, this leads to
paths that hug the walls and to frequent direction changes.
These ‘too close’ paths, as they have been called [Zelinsky 1992], present two
problems. Firstly, the robot might side-swipe an object in passing, especially when
moving around corners. The second problem for the camera geometry in this
research was that the robot might move so close to an obstacle that the base of the
obstacle fell into the blind area in front of the robot. This made it impossible for the
robot to determine exactly where the obstacle was located.
To address the first problem, the robot always ‘looked’ before it moved. This
meant that the free space in front of the robot had to be re-assessed immediately
before each forward motion. A useful side-effect of this was that the robot could
detect objects that moved into its field of view, so in theory the environment did not
have to be completely static.
In the case of the X80 robot, an additional check could be made because the
robot had sonar sensors. For safety, the robot was programmed so that it would not
move forwards if the front centre sonar sensor value was less than the diameter of
the robot.
The second problem might be tackled by using active vision, for instance,
moving to a better vantage point in order to actually see the exact location of the
obstacle. A simple approach would be to move backwards, provided that the space
behind the robot was already known to be clear based on previous information. This
approach was not implemented. Instead, the ExCV program relied on the robot
seeing the blind space at some later stage of the exploration. In general this worked
fairly well because the robot did frequent pirouettes and therefore invariably looked
back at the area that it had come from.
Using a larger cell size for the Distance Transform was another way to
minimize the possibility of these problems occurring. Expanding the obstacles in the
map (morphological dilation) by the radius of the robot also helps to avoid collisions.
This is commonly referred to as Configuration Space and it resulted in the generated
paths having a margin around obstacles.
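The Configuration Space construction amounts to a morphological dilation of the obstacle cells by the robot radius; a minimal grid-based sketch, using an assumed 0/1 encoding for free and occupied cells:

```python
def dilate_obstacles(grid, radius_cells):
    """Build a configuration-space map: expand every obstacle cell by
    the robot radius (in cells) so that the planner can treat the robot
    as a point.  grid is a list of rows of 0 (free) / 1 (obstacle);
    a circular structuring element is used."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    r2 = radius_cells * radius_cells
    for y in range(h):
        for x in range(w):
            if grid[y][x] != 1:
                continue
            for dy in range(-radius_cells, radius_cells + 1):
                for dx in range(-radius_cells, radius_cells + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and dy * dy + dx * dx <= r2:
                        out[ny][nx] = 1
    return out
```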
Figure 100 – Path generated by Distance Transform: (a) Original map; and (b) Configuration space showing path
Figure 100 (a) shows a partially completed map during exploration. The robot
is the small circle in the corridor at the top of the map. Figure 100 (b) is the
corresponding Configuration Space map showing the path (in red) that was
generated by the Distance Transform. It is the shortest path that allows the robot to
reach unexplored space (shown as grey) without hitting an obstacle. Even though the
path in (b) appears to run close to the obstacles, it actually does not because the
obstacles have been expanded. Notice that the path consists only of straight line
segments (north-south, east-west or diagonals).
Another alternative for avoiding obstacles was to apply a ‘gradient field’
around obstacles. The combination of an artificial field and the DT was examined for
the purposes of multi-robot exploration in one of the author’s papers [Taylor et al.
2005b] in order to force two robots apart during exploration. The efficacy of this
approach has therefore been demonstrated, although in a different context.
Figure 101 shows maps from both the simulated environment and the Yujin
robot. The robot is the small black circle in the bottom left and its path (light grey
line) is shown starting from near the middle of the map. Notice how the trajectory
skirted the obstacles.
Figure 101 – Maps generated by: (a) Simulation; and (b) Yujin Robot
Larger cell sizes reduce the number of moves. Fewer moves should help to
reduce the cumulative odometry errors. However, large cell sizes prevent the robot
from entering narrow corridors or passing through tight gaps.
Figure 101 (b) is distorted due to odometry errors. Also, there are spikes at the
edges that occurred because the robot’s shadow fell on the wall and it interpreted this
as the floor (which was black). However, the basic shape of the map and the path
followed are similar to the simulation, which confirms the algorithms and their
implementation in the program code.
Side Note:
The maps and the trajectories that the robots followed in the two diagrams of
Figure 101 are qualitatively very similar. This verified the accuracy of the
simulation. In addition, note that the second map was produced without the benefit
of a SLAM algorithm but it was still substantially correct. For this small
environment, the effects of odometry errors were not significant enough to seriously
distort the map. However, the ‘banana’ shape of the second map is testimony to the
inaccuracy of command execution by the robot.
Lastly, note that the robots automatically stopped exploring at the ends of the
paths shown in Figure 101. This is a useful feature of the Distance Transform.
4.9 Experimental Results
This section contains examples of maps that were produced by an X80 robot in
S Block of QUT. Initially the robot had no map. The robot was started in the middle
of a corridor facing a wall so that it could orient itself initially. It was allowed to
explore on its own and build a map.
The process of building a map began with the robot performing a pirouette to
generate a ROP sweep. This gave it a 360° view of its surroundings, called a local
map. This local map was incorporated into the global map using SLAM, including
an incremental localization step. Based on the global map, the robot generated a path
to the nearest unexplored space using a Distance Transform. It then set out along this
path for a certain number of steps (usually set to 5 steps) incorporating new
information into the map as it went. After the specified number of steps, the robot
again performed a pirouette and the whole process was repeated until it determined
that there were no more areas left unexplored.
Typical test runs took about two hours to complete, due in part to the slow
movement of the robot (pirouettes take a lot of time) and also due to the
computations required. During test runs between 6,000 and 8,000 images were
captured and processed.
As an indication of the CPU requirements, re-processing of a saved log file
(with camera images) took about an hour on a laptop with a 2.0GHz Pentium M
(Centrino) processor.
The map in Figure 102 shows a map drawn using the raw data obtained from
the robot’s vision system on Level 7.
Figure 102 – Map of Level 7 produced using raw data
There are several problems with the map, including some incorrect rotations
and two corridors that are doubled up. Obviously SLAM is necessary to correct the
map.
Applying the modified GridSLAM algorithm outlined in this thesis resulted in
the map shown in Figure 103. This map was produced using 50 particles with a
maximum vision range of 4 metres and the complex sensor model. The map grid cell
size was 5cm.
Figure 103 – Map of Level 7 using SLAM with 4m vision range
There are several points to be made about the map in Figure 103.
• The robot had difficulty in distinguishing doors from the floor in
several places along the bottom of the map, resulting in the
characteristic ‘fans’ or ‘cones’. This was due to the similarity between
the colour of the doors and the carpet.
• There were two regions where the walls were at 45° – the bottom-left
corner and at the intersection of the right-hand corridor with the main
corridor (the horizontal corridor at the bottom of the map). The one at
the bottom-left is a little difficult to see because the robot interpreted a
door as the floor due to the similarity of the door colour with the carpet.
The one at the intersection of the right-hand corridor is clearly visible
and has the correct angle. This presented a challenge to the SLAM
algorithm.
• At the extreme right-hand side of the map, it appears that the robot saw
through the wall. However, there was no wall in the corridor – several
chairs had been placed there to prevent the robot from straying and it
could see between the legs of the chairs.
• This was a large loop with the sides of the rectangle approximately
10m in length. Loop closing is regarded as one of the significant tests of a
SLAM algorithm.
Figure 104 shows another map of the same area, but in this case the maximum
vision range was limited to 2 metres. This map also has a floor plan overlaid on it to
show the ‘ground truth’.
Figure 104 – Map of Level 7 using SLAM with 2m vision range
The ground truth in Figure 104 was obtained by using a tape measure to an
accuracy of one or two centimetres. Each pixel in the map represents 5cm (the grid
size), so errors in the measurements are at most the thickness of the lines (a single
pixel width). The two circles in the floor plan were concrete pillars.
There was a discrepancy in the map at the top left caused by two problems.
Firstly, this part of the corridor was very dark and the robot could not see the floor
properly. Secondly, at this point in the test run the robot lost communication with the
PC that was controlling it.
WiFi suffers from ‘black spots’ due to multi-path propagation and absorption
of the signal which is exacerbated by internal building walls that usually contain
metal studs. Because a motion command was lost, the robot became disoriented.
On re-establishing communication, it had to be driven manually
for a short distance until it could re-localize itself. This resulted in the ‘hole’ in the
corridor. However, wireless network connectivity was not one of the topics
addressed by this research.
Although this map is not perfect, it has all of the right features. The reason for
presenting a less than perfect map is to provide examples of the types of problems
encountered.
Additional lights were installed in the corridor and subsequent test runs
produced better results. An example is shown in Figure 105.
Figure 105 – Second map of Level 7 using SLAM
In this case the robot started at the centre of the top corridor. It completed a full
circuit before realising that it had not explored the bottom-left corner. In the process
of returning to the bottom-left it misjudged the top-left corner. The orientation of the
robot was wrong for a short period of time, and an extraneous corridor was started in
the map, but the localization procedure pulled it back on track.
As has been discussed several times in this thesis, the quality of the odometry
information from the robot was poor. Consequently the bottom corridor was slightly
curved in the map and SLAM did not completely correct it.
Further experiments were performed on Level 8 of the same building, which did
not contain a loop and had a more complex layout. An example map is shown in
Figure 106.
Figure 106 – Map of Level 8 using SLAM
Again, the map was substantially correct and the salient features can be seen.
Although some of the fine detail was blurred, the map would still be adequate for
navigating around the floor.
Points to note about Figure 106 are:
• The fuzzy area at the top centre which is not included in the ground
truth is a filing room that was cluttered with boxes, filing cabinets,
printers, etc. It had a different carpet from the rest of the floor and the
robot saw the join between the two carpets as a wall boundary. It had to
be driven into the room manually. Because it was not explored
autonomously it was not included in the ground truth.
• There were 45° walls in the bottom-left of the map similar to Level 7.
The walls in Figure 105 are more clearly defined.
4.10 Areas for Further Research
This section outlines several areas which were touched upon during the
research, but could not be explored fully due to time constraints. They merit further
investigation.
4.10.1 Using Colour in the ROP
Colour is a powerful tool for identifying and distinguishing amongst obstacles.
It should be possible to use the colour of obstacles to assist in the localization
process by incorporating it into the ROP.
However, there is a significant problem referred to in the field of computer
vision as Colour Constancy. Humans have the ability to perceive a wide range of
colours as actually resulting from the same base colour. Computers still have
difficulty with this.
Experiments were conducted with the Improved HLS system [Hanbury 2002]
and quantizing the Hue into 12 colours. Unfortunately, Hue has no representation for
black, white or shades of grey which are common colours in man-made
environments. Therefore, when the saturation was too low or too high the program
classified the ‘colour’ as black, white or grey depending on the lightness. This gave a
total of 15 ‘colours’.
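The 12-hue plus black/grey/white quantization can be sketched as follows. The saturation and lightness thresholds are illustrative; the thesis's exact cut-offs are not given, and the original classifier also treated extreme saturation values specially.

```python
def quantize_colour(h, s, l, hue_bins=12):
    """Quantize an HLS pixel into one of 12 hue bins, or 'black',
    'grey' or 'white' when the saturation is too low for the hue to be
    meaningful, giving 15 'colours' in total.
    h is in [0, 360); s and l are in [0, 1]."""
    if s < 0.2:                       # hue unreliable at low saturation
        if l < 0.25:
            return 'black'
        if l > 0.75:
            return 'white'
        return 'grey'
    return int(h / (360.0 / hue_bins)) % hue_bins   # bins 0..11
```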
In testing, it was found that it was not possible to reliably select the correct
colour. Furthermore, there is a problem with comparing colours when trying to
match two ROPs. For instance, is orange closer to red or yellow? Grey could
possibly match any colour in low light situations. Some sort of fuzzy logic needs to
be applied to colour comparisons.
4.10.2 Floor Segmentation
As has been explained in detail in this and previous chapters, colour is
primarily used to detect the floor. The author encountered a floor where the carpet
had a pattern consisting of two colours (instead of a single colour).
Mapping of Indoor Environments by Robots using Low-cost Vision Sensors Trevor Taylor
One approach to multi-coloured carpets might be to replace all the different
(but known) colours with a single colour throughout the image, thereby converting
the floor to a single colour. While this sounds reasonable in theory, in practice there
is no such thing as a set of ‘fixed’ colours in an image – all colours cover a range of
pixel values. Therefore it was not feasible to replace all the carpet colours, including
the minor variations due to illumination.
A k-means clustering algorithm was implemented in the ExCV program that
determined the most commonly occurring pixel values in an image and grouped the
pixels together in clusters. In one test environment the colour of the carpet was very
similar to the colour of the walls. This was probably by design when the building
was first outfitted. As a result, the floor and walls contained areas with the same
cluster number in widely different parts of the images.
This highlights a significant drawback to the clustering approach – it loses
spatial information. The same is true when using histograms, colour equalization,
and so on.
An attempt was made to add the (u,v) coordinates of the pixels to the
information vector for clustering, but this was not successful because it produced too
many small clusters based on spatial proximity rather than colour.
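The clustering experiment can be sketched with a minimal Lloyd's-algorithm implementation. This is illustrative only; the ExCV version, its feature vectors and its parameters are not reproduced here:

```python
import random

def kmeans(pixels, k, iters=20, seed=0):
    """Cluster feature vectors (e.g. (R,G,B) pixel values, optionally
    extended with (u,v) coordinates) into k groups by Lloyd's algorithm.
    Returns the cluster centres and a label for each input vector."""
    rng = random.Random(seed)
    centres = rng.sample(pixels, k)
    def nearest(p):
        return min(range(k),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pixels:
            clusters[nearest(p)].append(p)
        for i, members in enumerate(clusters):
            if members:   # recompute each centre as the mean of its members
                centres[i] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return centres, [nearest(p) for p in pixels]
```

Appending (u,v) coordinates to each vector reproduces the spatial-clustering experiment described above; as noted, the extra dimensions dominate the distance metric and fragment the clusters.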
Eventually the best approach that was found was to run a heavy smoothing
filter (Gaussian 11 x 11) on the images and then apply a flood fill algorithm. The
flood fill was constrained to lie within the major edges found in an image. Although
this resulted in the walls sometimes being blurred into the floor, there were generally
sharp edges between the wall and floor that limited the spread of the flood fill.
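The final approach can be illustrated on a toy grid: after smoothing and edge detection (omitted here), the flood fill is allowed to spread only where no major edge blocks it. The grid, seed and label values below are illustrative:

```python
from collections import deque

def flood_fill(image, edges, seed, label):
    """Label the 4-connected region of `image` around `seed`, refusing
    to cross any cell marked True in `edges` (e.g. the major edges found
    by an edge detector run after heavy smoothing)."""
    h, w = len(image), len(image[0])
    filled = [[False] * w for _ in range(h)]
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        if not (0 <= r < h and 0 <= c < w):
            continue
        if filled[r][c] or edges[r][c]:   # already visited, or edge barrier
            continue
        filled[r][c] = True
        image[r][c] = label
        queue.extend([(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)])
    return filled
```

Seeding the fill in front of the robot (where the floor is almost certain to be visible) and letting the wall/floor edge stop the spread is what limits the blurring problem described above.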
This is a perfect example of where a single image processing operation is not
sufficient to solve a problem. There is little doubt that humans apply a multitude of
algorithms simultaneously. Furthermore, humans can actually interpret what they see
and therefore apply logic to the segmentation problem, ‘seeing’ features that are not
really present at the level of individual pixels. A dotted line is a good example: a
human interprets it as a continuous line while simultaneously recognizing that it has
gaps.
There might be better ways to detect the floor, and this deserves further
investigation. Introducing additional methods in parallel, and then taking a majority
vote, might improve the reliability.
4.10.3 Map Correlation versus Scan Matching
As pointed out in the discussion on Localization, map correlation did not
produce a good spread of importance weights. Further research on alternative
correlation measures might produce a better result.
Scan matching (between successive ROPs) might be a better approach than
map correlation, but it will require some theoretical development to handle the
variation in range errors within an image. The Lu and Milios approach does not work
here because it implicitly assumes that the errors are uniform regardless of the
range.
4.10.4 Information Content Required for Localization
Localization proved to be the most difficult issue in this project. One aspect of
localization that has not been extensively researched is the information content
requirements. It is generally assumed that there is sufficient information available
from the sensors, whatever they may be, to enable localization to be performed
reliably. Placing this on a solid theoretical footing is extremely difficult. However,
some observations are appropriate at this point.
The range of the sensor has a significant impact on map correlation because a
larger range provides more information for comparison between the local and global
maps; a short range therefore reduces the effectiveness of the correlation.
A narrow field of view reduces the effectiveness of map correlation in the
same way, and cheap monocular cameras have quite a narrow FOV.
Most of the localization failures that were encountered in the experiments
could be attributed to one or both of these limitations. Quantifying these issues
would help future researchers to avoid falling into the same trap.
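The combined effect of range and field of view can be illustrated by counting the map cells that fall inside the sensor's viewing wedge. This is a rough back-of-envelope calculation, not a formula from this thesis:

```python
import math

def visible_cells(fov_deg, max_range, cell_size):
    """Count grid cells inside a sensor's viewing wedge (sensor at the
    origin, looking along +y).  A rough proxy for the number of map
    cells available to map correlation.  Assumes fov_deg <= 180."""
    half = math.radians(fov_deg) / 2.0
    n = int(max_range / cell_size)
    count = 0
    for i in range(-n, n + 1):
        for j in range(0, n + 1):
            x, y = i * cell_size, j * cell_size
            if 0 < math.hypot(x, y) <= max_range and abs(math.atan2(x, y)) <= half:
                count += 1
    return count
```

Halving either the FOV or the usable range cuts the cell count roughly in proportion, which is consistent with the localization failures observed with a narrow-FOV, short-range visual sensor.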
4.10.5 Closing Loops and Revisiting Explored Areas
Seeing areas that are already on the map allows the robot to confirm its pose.
The experimental validation that was performed (see Chapter 3) showed that this
worked well.
Returning to previously visited areas is partially an exploration issue, but it is
very important for localization. This was noted in [Newman et al. 2002] which
discussed an ‘explore and return’ algorithm. Therefore, modifications should be
made to the exploration algorithm to periodically backtrack.
Closing large loops is known to be a problem with many SLAM algorithms.
The larger the loop that must be traversed, the more particles that are required to
maintain diversity. Actively closing loops has been proposed by [Stachniss et al.
2005].
4.10.6 Microsoft Robotics Developer Studio
Some preliminary work has been done with MRDS, but the vision program has
not been integrated into this environment because the vision code was written in
unmanaged C++, and MRDS is primarily a managed C# environment.
The simulator in MRDS is excellent, and it includes a simulated WebCam. It
could be used to replace the simulator in the ExCV program. Work on the simulator
would be useful for producing samples for use in teaching robotics and computer
vision.
4.11 Summary
This chapter has presented the details of the results obtained from the research.
Although the basic approach to using vision for mapping proved to be feasible, the
limited amount of information available from the vision system made it very difficult
for the SLAM algorithm to remain stable.
Incremental localization was added to the GridSLAM algorithm based on the
assumption that the walls were orthogonal. This was necessary to improve the
stability of GridSLAM and produce reliable maps.
5 Conclusions
‘The greatest single shortcoming in conventional mobile robotics is,
without doubt, perception …’
– R. Siegwart and I.R. Nourbakhsh,
Introduction to Autonomous Mobile Robots, 2004, page 11.
5.1 Motivation for the Research
As robots are introduced into homes and offices they will need maps in order
to perform many common tasks. It is essential that they be able to create the
necessary maps on their own rather than relying on the user to provide maps.
These robots will invariably have cameras. Digital cameras are now very cheap
and can provide information that is not available from any other type of sensor.
Therefore vision should be considered as a source of information for building maps.
A great deal of work had already been done on mapping using robots before
this research began in 2002. However, most of this had been with sensors such as
laser range finders or sonar. This research pushed the envelope by attempting to
build maps based solely on monocular vision.
This chapter summarises the findings of this research in relation to the original
research questions and also offers some general recommendations and suggestions
for future areas of research.
5.2 Mapping Indoor Environments Using Only Vision
The primary research question was:
How can an autonomous robot build maps of indoor environments using vision
as its only sense?
Implicit in this question was the assumption that it was possible to use only
vision. The research investigated many of the issues associated with visual mapping,
and they are documented in this thesis. The main result of the research was a system
that enables robots to build maps, demonstrated to work in office corridors.
Three subsidiary questions were identified to break the research into smaller
pieces:
• How can vision be used as a range sensor?
• What mapping and exploration techniques are appropriate for indoor
environments?
• Are existing SLAM (Simultaneous Localization and Mapping)
approaches applicable to visual information?
The outcomes related to these subsidiary questions are addressed below.
5.3 Using Vision as a Range Sensor
Using a camera as a range sensor presents several challenges. Cheap cameras
typically have low resolution, a relatively narrow field of view, and are subject to
radial distortion. These characteristics of cheap cameras make them difficult to use
for mapping. Uneven illumination, including shadows, remains a significant problem
for vision. Even in indoor environments, artificial lighting does not necessarily
guarantee uniform illumination. Floor detection is not completely reliable when
based solely on the colour (and limited texture) of the floor.
Inverse Perspective Mapping (developed in section 3.7.2) provides a reliable
mechanism for determining the range to obstacles in an image once the camera has
been calibrated and assuming that the floor boundary can be correctly identified.
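Under a flat-floor assumption, the projection can be sketched as follows. This is an illustrative model; the calibrated formulation actually used is given in section 3.7.2, and all parameter values below are assumptions:

```python
import math

def inverse_perspective(u, v, h, tilt, f, cx, cy):
    """Project a pixel (u, v) assumed to lie on the floor to ground
    coordinates (x forward, y left, in metres), for a camera at height
    h (metres) tilted down by `tilt` radians, with focal length f and
    principal point (cx, cy) in pixels.  Illustrative flat-floor model."""
    du, dv = u - cx, v - cy
    # Denominator is proportional to how steeply the pixel's ray points down.
    denom = f * math.sin(tilt) + dv * math.cos(tilt)
    if denom <= 0:
        return None            # ray at or above the horizon: no floor hit
    x = h * (f * math.cos(tilt) - dv * math.sin(tilt)) / denom
    y = -h * du / denom        # positive y to the robot's left
    return x, y
```

Pixels lower in the image (larger v) map to nearer floor points, and the mapping becomes extremely sensitive near the horizon, which is the perspective effect that limits the useful visual range.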
This research took a novel approach to building local maps called pirouettes –
the robot spun around on the spot taking a series of photos of its surroundings.
Pirouettes allowed Radial Obstacle Profile (ROP) sweeps to be made. (See section
3.7.6). The output of the vision system therefore appeared similar to a sonar sweep
or laser range scan. Once the data was in the form of a range scan, other
open-source tools could be used, such as the CARMEN utilities and existing SLAM
algorithms. In other words, the camera was effectively turned into a range sensor.
5.4 Mapping and Exploration for Indoor Environments
The mapping and exploration components of this research were successful
given the purpose as stated in Chapter 1 – building maps of indoor environments
using vision. This can be seen from the sample maps in Chapters 3 and 4.
Mapping and exploration algorithms were initially developed using a
simulator, prior to live testing. The simulation environment provided exact motions
and perfect pose information which eliminated the need for a SLAM algorithm
during testing of the mapping and exploration code. This allowed confirmation of the
suitability of the algorithms and that the program code was correctly implemented.
Piecewise linear motion – restricting movements to forward translations or on-
the-spot rotations – eliminated the requirement for real-time processing, and it
allowed exploration to be performed on a regular grid which fitted in well with the
Distance Transform for exploration.
Occupancy Grids were found to be a suitable map representation that could
easily be used with the Distance Transform for exploration.
The grid size for the occupancy grid should be selected based on the
environment and the size of the robot. The author suggests that the grid size should
be roughly one order of magnitude smaller than the robot.
The Distance Transform was selected as the exploration algorithm for this
research. Many exploration algorithms suffer from local minima problems, but the
Distance Transform does not. The Distance Transform also has the useful feature of
indicating when there is nothing left to be explored (that is reachable by the robot).
On the other hand, the basic Distance Transform requires the robot to move on
a square grid rather than moving at random or by arbitrary amounts. Continuous
motion is also not feasible because the robot is constrained to move on a grid.
However, given the previously mentioned benefits of ‘piecewise linear’ motion and
the occasional adjustments that the robot made to its orientation in order to maintain
its alignment with the walls, this did not prove to be a problem in practice.
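The exploration scheme can be sketched as a BFS wavefront spreading from the unexplored cells, with the robot stepping downhill on the resulting distance field. The grid encoding and cell markers below are illustrative:

```python
from collections import deque

UNKNOWN, FREE, WALL = '?', '.', '#'

def distance_transform(grid):
    """Distance (in grid steps) from every reachable cell to the nearest
    unexplored cell, computed by a BFS wavefront.  Walls are impassable;
    unreachable cells stay None.  Such a field has no local minima except
    at the unexplored cells themselves."""
    h, w = len(grid), len(grid[0])
    dist = [[None] * w for _ in range(h)]
    queue = deque()
    for r in range(h):
        for c in range(w):
            if grid[r][c] == UNKNOWN:
                dist[r][c] = 0
                queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and dist[nr][nc] is None \
                    and grid[nr][nc] != WALL:
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    return dist

def next_step(grid, pos):
    """Step to the neighbouring cell with the smallest transform value.
    Returns None when nothing reachable is left to explore."""
    dist = distance_transform(grid)
    r, c = pos
    candidates = [(nr, nc)
                  for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                  if 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                  and dist[nr][nc] is not None]
    if dist[r][c] is None or not candidates:
        return None
    return min(candidates, key=lambda p: dist[p[0]][p[1]])
```

The `None` return value is the useful termination property mentioned above: when no unexplored cell is reachable, the transform never reaches the robot's cell and exploration is complete.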
5.5 Simultaneous Localization And Mapping (SLAM) for use with Vision
This research found that it was possible to produce maps using a modified
SLAM algorithm. Examples are shown in section 4.8.16.
SLAM is a complex problem. Although many different solutions have been
published in the literature, there is no single algorithm that clearly stands out as the
‘best of breed’. It is still an active area of research and practice has not yet caught up
with the theory.
Localization is the single greatest challenge for a mobile robot. It requires the
robot to track its own movements and to recognize places that it has visited before.
Odometry is notoriously unreliable, so a variety of methods have been invented to
improve the accuracy of motion estimates and assist with localization.
Localization was found to be unreliable with a visual range sensor using map
correlation. The reasons for this were the limited field of view of the camera and a
practical limit on the useful visual range due to perspective. These factors limited the
amount of information in the input data stream from the visual range sensor, and
hence the number of map cells available for comparison.
Knowledge of the structure of the environment was used to perform
incremental localization to compensate for this deficit in the input information. This
approach was shown to be effective at improving the pose estimates for the robot.
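In an orthogonal environment, the heading component of incremental localization amounts to snapping the estimated heading to the nearest of the four wall directions. This is a simplified sketch of the idea only; the full method is described in section 3.10.2:

```python
import math

def snap_heading(theta):
    """Given a heading estimate (radians) in a rectilinear environment,
    return the corrected heading and the correction applied, assuming
    the robot is roughly aligned with one of the four wall directions."""
    quarter = math.pi / 2.0
    snapped = round(theta / quarter) * quarter
    return snapped, snapped - theta
```

For example, a robot that believes it is heading at 87° while driving along a corridor is corrected to 90°, and the 3° correction can also be fed back to refine the odometry.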
A Particle Filter algorithm, called GridSLAM, was selected for use in this
research. This is a well-established algorithm published in a textbook by leading
researchers in the field. Problems were encountered in getting GridSLAM to work
successfully with the experimental setup in this research. Some of these problems
can be attributed to the equipment and the environment, but there is one fundamental
issue that is not addressed in the literature. There has been little work done on
specifying the requirements on the input data stream to guarantee the stability of
SLAM. The author refers to this as the ‘Information Content Requirement’.
The effect of a lack of ‘richness’ in the input data on the stability of SLAM
algorithms was confirmed using two other SLAM algorithms available as open
source implementations. However, incremental localization allowed GridSLAM to
produce acceptable maps.
5.6 General Recommendations
5.6.1 Adapting the Environment for Robots
There was an implicit assumption underlying this research that the
environment did not need to be modified to accommodate the robots. However,
humans constantly modify their environments and carefully regulate them to cater
for new technologies. The classic example is the automobile, which has resulted in
highly structured roads with clear markings and standardized dimensions.
If robots offer enough benefit, then people will be prepared to modify their
environments to better suit robots. It is recommended that in future the interior
design of buildings should be ‘robot friendly’.
This is not a difficult task. Mostly it just requires some forethought in the
design of buildings and the decoration of the internal spaces. These same changes
would also be beneficial to humans, especially those with poor eyesight, or in
situations such as a fire when the building is filled with smoke. Therefore it is a
very sensible proposition.
The required changes are simple. They are based on experience with the robots
in this research:
• Arrange lighting so that it is uniform throughout the environment. Spot
lights should be avoided. Dark areas should also be eliminated. This
makes sense from an occupational health and safety perspective
anyway.
• Use only uniformly coloured floor coverings. Alternatively, keep the
texture to a minimum so that it can easily be filtered out of camera
images.
• Ensure that all walls have a distinct strip along the bottom of the wall
where it meets the floor. Strips of different colours at waist level and/or
eye level might also be beneficial for both humans and robots by
helping to identify walls. One place where this is commonly done is on
glass doors to prevent people from walking into the doors.
• Paint all doors in a contrasting colour to the floor and walls. In
particular, consider changing one of the colour components (red, green
or blue) substantially so that doors are not visually similar to the floor.
The same applies to the door frame, which would assist in locating
doors even when they were open.
• Avoid unusual wall angles, curved walls and ‘niches’. Building only
straight walls with 90° corners will help robots to localize.
5.6.2 Additional Sensors
Humans perform multiple image processing operations simultaneously.
However, a vital aspect of human vision that makes it reliable is the ability to
interpret a scene even from a static image.
When analysing a scene, humans apply accumulated knowledge of the real
world including directional lighting effects, sophisticated pattern matching and
object recognition, high-level reasoning about distances and orientation, and
inferences regarding occlusions.
The current state of the art in computer vision still has a long way to go
before it approaches the complexity of human vision. Therefore it is highly
advisable to incorporate additional sensors into any robot that must operate in a
human environment and interact with humans. It is not recommended that vision be
used as the only sensor on a commercial robot.
5.7 Contributions
5.7.1 Theoretical Contributions
Few researchers have attempted vision-only navigation, exploration and
mapping. Although the problem has not been completely solved, positive
contributions to the body of knowledge have been made both in terms of what does
work and what does not. This thesis documents the problems and identifies some
solutions.
The core contribution of this thesis is the confluence of range estimation from
visual floor segmentation with SLAM to produce maps from a single forward-facing
camera mounted on an autonomous robot. Contemporaneously with the writing of
this thesis, work was published [Stuckler and Behnke 2008] that reinforces and
validates the approach taken in this thesis.
The following is a summary of the topics covered in publications by the author
during the course of this research:
• A method for Inverse Perspective Mapping (IPM) using monocular
vision [Taylor et al. 2004a]. See sections 3.7.2 and 4.5.1.
• A novel approach to creating local maps by directing the robot to
perform a pirouette (spin on the spot) to obtain a Radial Obstacle
Profile (ROP) [Taylor et al. 2004b, Taylor et al. 2005a]. See sections
3.7.6 and 4.5.2.
• Extensions to the use of the Distance Transform for use in exploration
[Taylor et al. 2005a, Taylor et al. 2005b]. See sections 3.8 and 4.6.
• Development of a simple method for detecting vertical edges in an
image when the camera is tilted [Taylor et al. 2006, Taylor 2007] (see
sections 3.5.6 and 4.5.6); and
• Development of methods for incremental localization based on
knowledge of the structure of the environment [Taylor et al. 2006,
Taylor 2007, Taylor et al. 2007]. By applying constraints, fewer
particles are required for stable operation of the GridSLAM algorithm.
See sections 3.10.2 and 4.7.
5.7.2 Practical Contributions
As noted in Chapter 1 – Introduction, a large amount of software was written
in the course of this project. The Experimental Computer Vision (ExCV) program
provides a robot-independent framework. It currently supports a Simulator (written
in OpenGL), Hemisson, Yujin and X80 robots.
From a purely practical perspective, this research produced over 80,000 lines
of source code, including the following (which is not an exhaustive list):
• An OpenGL simulator for computer vision experiments;
• An implementation of the GridSLAM algorithm;
• Facilities for recording and replaying test runs, including video;
• A program library for controlling X80 robots; and
• A complete working system based on X80 robots for map building in
indoor environments.
Versions of the code to control X80 robots were developed in both C++ and
C#. The code for the library is available under an open source licence from the
author’s company web site http://www.soft-tech.com.au/X80/.
The X80 software included a program that ran on a PDA for controlling an
X80 robot using a wireless LAN connection. This program, which allowed
teleoperation of an X80 from a hand-held device, was a world-first for this particular
robot.
Concurrent with this research, the author developed software examples for
Microsoft Robotics Developer Studio and co-authored a textbook on the subject
[Johns & Taylor 2008]. The software included a Maze Simulator and a simple
exploration program that performed mapping. These programs are available on the
Internet from the author’s book web site http://www.promrds.com/.
5.8 Future Research
Several topics for further investigation were listed at the end of Chapter 4 in
section 4.10. This section highlights the most important topics that the author
recommends for future research.
The floor segmentation algorithm that was ultimately used was not very
sophisticated. This deserves further investigation.
There are many aspects of particle filters and incremental localization
(discussed in section 4.8) that still warrant further research. Particle filters remain an
active research area.
The robot needs to see the same map features in many successive updates to
increase the correlations. The process adopted in this thesis of moving after every
image did not allow multiple images of the same location to be processed.
Furthermore, the large step size in the forward motions and rotations meant that there
was often not a lot of overlap between images.
A finer-grained step size, or capturing images on the fly, might have helped.
However, the hardware precluded this, due to both the low frame rate and the motion
blur in images taken while the robot was moving. Future work should use a camera
with a faster shutter speed (to reduce or eliminate motion blur); take photos more
frequently; and take multiple photos from the same vantage point.
In this research the robot did not deliberately return to previously explored
areas (unless it was just ‘passing through’ on its way to another unexplored area).
Actively backtracking might be a useful approach for improving localization.
Alternatively, the robot could simply turn to look again at nearby areas of the map
that were poorly defined.
Floor ‘contours’ were extracted from the ROPs using the Douglas-Peucker
algorithm in order to obtain straight wall segments. Although this information was
used for incremental localization, it was not used to improve the quality of the map.
It should be feasible to straighten out the walls in the map based on the floor
contours. Also, the robot occasionally drove too close to a wall and could not see the
floor. Information from prior wall contours might be used to anticipate this problem
and correct it.
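The Douglas-Peucker step can be sketched as follows. This is a textbook implementation rather than the thesis's code, and the tolerance value is application-dependent:

```python
def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = (dx * dx + dy * dy) ** 0.5
    if norm == 0.0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    return abs(dx * (ay - py) - dy * (ax - px)) / norm

def douglas_peucker(points, epsilon):
    """Simplify a polyline, keeping only the points that deviate from a
    straight line by more than epsilon (recursive split at the point of
    maximum deviation)."""
    if len(points) < 3:
        return list(points)
    i, dmax = 0, 0.0
    for k in range(1, len(points) - 1):
        d = perpendicular_distance(points[k], points[0], points[-1])
        if d > dmax:
            i, dmax = k, d
    if dmax <= epsilon:
        return [points[0], points[-1]]
    left = douglas_peucker(points[:i + 1], epsilon)
    right = douglas_peucker(points[i:], epsilon)
    return left[:-1] + right
```

Applied to a noisy ROP contour, this collapses runs of near-collinear range points into single wall segments, which is exactly the straight-wall information that could also be used to clean up the map.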
Localization was a significant issue in this research. The author has
hypothesised that the problems are related to the limited information content in the
input data stream that is available from vision. This should be investigated by
examining the effects of varying the field of view of the camera. Because this
requires changing the camera, initial work might be done in simulation.
The effect of increasing camera resolution on the reliability of IPM, and hence
SLAM, should also be investigated. The author conjectures that higher resolution
will improve the stability of SLAM by providing more accurate range information
and allowing the ‘maximum vision range’ to be increased. It should be possible to
perform initial testing by using a simulator and varying the resolution of the
simulated camera.
5.9 Summary
This research has demonstrated that mapping using only monocular vision
from a cheap camera is possible.
Some different approaches were used in this research from those that had been
used before and these contributed to the body of knowledge on visual mapping. In
particular, it introduced pirouettes and the Radial Obstacle Profile for local mapping
and piecewise linear motion for robot movements. The Distance Transform proved
to be a suitable exploration algorithm.
Vision provides a limited amount of information for mapping. Further research
is required into the input requirements for the stability of SLAM algorithms.
The characteristics of the environment can be used to perform incremental
localization. This allows the stability of SLAM to be improved so that maps can be
produced reliably.
Glossary
Words that are defined in the Glossary usually appear in italics the first time
that they are used in the text. In this Glossary, cross-references to other definitions
also appear in italics.
NOTE: The definitions provided here are the meanings as used in this thesis.
Autonomous Mobile Robot
An autonomous mobile robot is one that can move freely. It does not
necessarily have any on-board intelligence apart from that necessary to
control its wheels. ‘Autonomous’ therefore means unencumbered by a
cable connecting the robot to a computer. The word ‘mobile’ is important
because there are many robots that stay in a fixed location, such as the
ones that build cars.
Beacon
Beacons take many forms such as Infrared transmitters or barcodes
placed on walls or ceilings. The purpose is to assist robots with
localization. However, beacons require modifications to the environment
and are therefore expensive to install and maintain.
See also Feature, Landmark.
Bearing-only Sensor
A bearing-only sensor provides only a direction. For example, the
direction to a radio beacon can be determined by rotating the receiver’s
antenna to find the maximum signal strength, but not the distance to the
beacon.
See also Range Sensor.
Calibration
Camera calibration is the process whereby the intrinsic parameters of the
camera are determined. These parameters can then be used to undistort
camera images.
CCD
A Charge Coupled Device is one of the common sensors used in a digital
camera. It relies on the impact of photons to charge up small cells within
the CCD. The voltage can then be measured thereby giving an estimate
of the light intensity at a particular pixel location.
CML
CML stands for Concurrent Mapping and Localization. This term is
mostly used in Europe. In American literature it is commonly referred to
as Simultaneous Localization and Mapping (SLAM).
Computer Vision
Computer Vision is a process that takes an image as input (usually after
Image Processing) and produces a decision or an action as the result.
Edges / Edge Detection
An edge in an image is an area of discontinuity in the colour. It usually
corresponds to a physical edge of an object in the real world. Edges can
be found in images using edge detection algorithms.
EKF
Extended Kalman Filter. An EKF linearizes the system around the
current state so that the Kalman Filter equations can be applied.
See also Kalman Filter.
Encoder
See Wheel Encoder.
Extrinsic Parameters
Extrinsic parameters can be calculated from the image of a calibration
target using the intrinsic parameters for a camera. They consist of a
rotation matrix and a translation vector that can be used to transform
from the camera coordinate frame to the coordinate frame of the
calibration grid.
Feature
In computer vision, an image feature is something that can be easily
distinguished in an image, such as corners. Ideally a feature should have
a unique signature (obtained from the feature selection algorithm) so that
it can be tracked through a sequence of images.
In localization, a feature is a point in the real world that can be identified
uniquely. For visual localization, image features can be used as
localization features.
Field Of View (FOV)
The Field Of View of a camera is the viewing angle, expressed in
degrees, visible in the image at the centre scanline. Note that this is a
horizontal FOV.
Focal Length
In the traditional mathematical model of a camera the focal length is the
distance from the centre of the camera lens to the image plane. In the
case of a digital camera, this is expressed in units of pixels which does
not necessarily have any physical significance.
Global Localization
Global localization involves the robot determining where it is on a map
given no a priori information.
Global Map
A global map is created from local maps using an appropriate algorithm
for combining the maps. In the case of grid maps, this is usually an
application of Bayes Rule.
Greyscale
Greyscale images are expressed using pixels that have a single value
which is the intensity of the light at that point. They are sometimes also
referred to as Black & White images, although they actually contain
shades of grey.
Ground Plane
The ground plane is the surface that the robot travels on. For indoor
environments, it is the floor.
Holonomic (Motion)
If a robot can instantaneously change its direction of motion to move in
any direction, then it is referred to as holonomic. From a mathematical
viewpoint, it can follow any trajectory and is not constrained to follow
particular paths dictated by its method of locomotion. Cars, for example,
are not holonomic because they have a limited turning circle.
HSI, HSL, HSV
Hue, Saturation and Intensity / Lightness / Value are a set of related
colour spaces.
See also Hue, Intensity, Lightness, Saturation.
Hue
Hue is a measure of colour in the range of 0 to 360 degrees. Because it is
a circular metric, it has a discontinuity. The hue can be calculated from
the RGB values of a pixel but it is unreliable in low light situations
where it can swing wildly due to small variations in pixel values.
Image
An image is an array, usually rectangular, of pixels representing a scene
as viewed by a digital camera.
See also Greyscale, Pixel.
Image Processing
Image Processing is a process that takes an image as input and produces
another (modified) image as output. Contrast this with Computer Vision.
Importance Weights
Particles in a particle set are assigned importance weights as a measure
of how closely they represent the true value(s) being estimated. There are
many different criteria that can be used for determining the ‘goodness of
fit’. The resulting weights must be normalised so that they correspond to
the approximate probability of the particles being correct. These weights
are then used in resampling.
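Resampling from the normalised weights can be sketched as follows. Systematic (low-variance) resampling is one common choice, not necessarily the variant used in this research:

```python
import random

def normalise(weights):
    """Scale weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def systematic_resample(weights, rng=random):
    """Return the indices of the particles that survive resampling.
    Particles are kept in proportion to their normalised weights, using
    evenly spaced positions with one random offset per slot."""
    n = len(weights)
    w = normalise(weights)
    positions = [(i + rng.random()) / n for i in range(n)]
    indices, cum, j = [], w[0], 0
    for p in positions:
        while p > cum and j < n - 1:
            j += 1
            cum += w[j]
        indices.append(j)
    return indices
```

A particle with all of the weight is duplicated n times, while zero-weight particles are discarded; this is the mechanism by which a particle filter concentrates on the most plausible poses.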
Incremental Localization
Incremental Localization is a process that is performed after each motion
to improve the pose estimate before applying a SLAM algorithm.
Infra-red Sensor
Infra-red (IR) sensors come in two varieties: passive and active. Passive
IR sensors detect the presence of infra-red radiation from the
environment and this gives an approximate measure of the proximity of
obstacles. Active IR sensors send out an IR pulse and wait for it to
bounce back. In this sense they are similar to Sonar sensors and Laser
Range Finders. However, IR sensors are extremely noisy and limited to
very short range because IR signals are easily absorbed or blocked.
Intensity
Intensity is a measure of how bright a pixel is. For the RGB colour
space, it can be calculated as the average of the three colour components.
Intrinsic Parameters
Intrinsic parameters describe the properties of a particular camera. They
include the focal length, principal point, skew, and distortion parameters
for radial and tangential distortion. The calibration process involves
taking photos of a calibration grid from a variety of viewpoints.
See also Extrinsic Parameters.
Inverse Perspective Mapping (IPM)
Given the coordinates of a pixel in an image that is known to lie on the
floor, inverse perspective mapping can be used to convert the pixel
coordinates to real-world coordinates, thereby inverting the perspective
effect.
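A minimal sketch of flat-floor inverse perspective mapping, assuming an ideal pinhole camera at a known height, pitched down by a known angle (the function and parameter names here are illustrative, not those of any particular implementation):

```python
import math

def inverse_perspective(u, v, f, cx, cy, cam_height, pitch):
    """Map image pixel (u, v), assumed to lie on the floor, to
    (lateral, forward) ground-plane coordinates in metres.
    Assumes a pinhole camera at height cam_height, pitched down
    by 'pitch' radians, focal length f in pixels, principal
    point (cx, cy), and image y growing downwards."""
    x_c = (u - cx) / f                          # ray direction in the
    y_c = (v - cy) / f                          # camera frame
    down = math.sin(pitch) + math.cos(pitch) * y_c
    forward = math.cos(pitch) - math.sin(pitch) * y_c
    s = cam_height / down                       # scale at which the ray
    return s * x_c, s * forward                 # intersects the floor
```

For example, a camera 0.5 m above the floor pitched down 45° sees, at its principal point, the floor 0.5 m directly ahead.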
Inverse Sensor Model
An inverse sensor model is used to convert sensor readings into the
probabilities of occupancy.
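A deliberately simple three-valued inverse sensor model might look like this (the probabilities 0.3 and 0.7 are illustrative placeholders, not values used in this thesis):

```python
def inverse_sensor_model(cell_dist, measured_range, hit_tolerance=0.1):
    """Probability that a cell along the sensor beam is occupied,
    given a single range reading."""
    if cell_dist < measured_range - hit_tolerance:
        return 0.3   # the beam passed through: probably free
    if cell_dist <= measured_range + hit_tolerance:
        return 0.7   # near the detected obstacle: probably occupied
    return 0.5       # beyond the obstacle: no information
```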
Kalman Filter (KF)
A Kalman Filter is a recursive state estimator. The Kalman Filter was
originally invented for target tracking where it provided the best
estimate, in a least squares sense, of the position of a target. Because it is
recursive, it is quite efficient.
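As a minimal illustration of the recursive predict–update cycle (one-dimensional only, not the full matrix form used in SLAM):

```python
class ScalarKalman:
    """One-dimensional Kalman Filter sketch."""
    def __init__(self, x0, p0, q, r):
        self.x, self.p = x0, p0   # state estimate and its variance
        self.q, self.r = q, r     # process and measurement noise variances

    def predict(self, u=0.0):
        self.x += u               # simple additive motion model
        self.p += self.q          # uncertainty grows with motion

    def update(self, z):
        k = self.p / (self.p + self.r)   # Kalman gain
        self.x += k * (z - self.x)       # blend prediction and measurement
        self.p *= (1.0 - k)              # uncertainty shrinks
```

Only the current estimate and its variance are carried forward at each step, which is why the recursive formulation is so efficient.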
Landmark
A landmark is a point in the real world that can be uniquely identified for
the purposes of localization.
See also Feature.
Laser Range Finder (LRF)
A laser range finder (LRF) uses a laser to determine the distance to
nearby obstacles. The laser beam might be red or infra-red and the LRF
measures the time required for the beam to bounce off an obstacle and
return. For this reason they are classified as Time of Flight sensors. An
LRF typically uses a rotating mirror to fire a series of beams at different
angles. Therefore an LRF can provide both range and bearing.
Lightness
The lightness of a pixel is a measure of its intensity.
Local Localization
In this document, Local Localization means finding the robot’s pose
within a small region surrounding the last known pose. In other words, it
has a good estimate to begin with as opposed to Global Localization
where it has no idea of its pose.
Localization
Localization (also referred to as pose estimation in the literature)
involves the robot determining where it is relative to an existing map.
Markov Process
A discrete-time system that is described by a Markov process of the first
order is one where the current state of the system depends only on the
previous state and the input in the intervening period.
Monte Carlo Localization (MCL)
A Monte Carlo process is a random (stochastic) process. Monte Carlo
Localization uses particles to represent this randomness where each
particle has its own pose. The set of particles represents a probability
distribution for the true pose of the robot.
Monocular Vision
A single camera provides monocular vision. Two cameras give stereo
vision, and three produce trinocular vision.
Occupancy Grid
An occupancy grid is a type of metric map where the world is broken up
into a set of cells (usually a square grid). Each cell has an associated
value that indicates the probability that it is occupied.
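Occupancy grids are commonly implemented with cell values stored in log-odds form, so that a Bayesian update is a simple addition. A minimal sketch of this common implementation choice (not necessarily the representation used in this thesis):

```python
import math

def logit(p):
    """Convert a probability to log-odds form."""
    return math.log(p / (1.0 - p))

class OccupancyGrid:
    """Square grid of cells storing occupancy in log-odds form."""
    def __init__(self, size):
        # 0.0 in log-odds corresponds to a probability of 0.5 (unknown)
        self.cells = [[0.0] * size for _ in range(size)]

    def update(self, row, col, p_occupied):
        # Bayesian update: log-odds of independent evidence simply add
        self.cells[row][col] += logit(p_occupied)

    def probability(self, row, col):
        l = self.cells[row][col]
        return 1.0 - 1.0 / (1.0 + math.exp(l))
```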
Odometry
When a robot moves, the distance and direction of the motion can be
measured. This is referred to as odometry information.
Optic Flow
Optic flow is an estimate of localized motion, either of the camera or of
objects in the field of view, which is obtained from multiple successive
images. Optic flow algorithms generally track the movement of small
patches of pixels from one image to the next.
Orientation
The orientation of a robot is an angle measured with respect to a
designated ‘north’ direction. It determines which way the robot is facing.
Particle Filter (PF)
A particle filter is an approach to estimating a probability distribution
which might be multi-modal or non-Gaussian. It consists of a set of
particles, each of which holds the necessary information, such as the pose of a
robot. The particles are updated at each time step based on an input, and
then evaluated against a measurement to derive an importance weight.
The best particles are kept and the poorest particles are discarded in a
resampling process. Then the whole procedure is repeated.
Piecewise Linear (Motion)
Piecewise linear motion involves only two types of motion: rotations and
translations. It does not allow moving along curved paths. By combining
these two motions, however, it is possible to approximate curves or in
fact any arbitrarily complex trajectory. The accuracy of the
approximation depends on the length of the shortest translation.
Pinhole Camera Approximation
The pinhole camera model is an idealized approximation to a camera
based on all of the light entering the camera through a single point, or
pinhole. This simplifies the equations describing the camera optics.
Pirouette
A robot performs a pirouette by rotating on the spot in a series of steps
until it has completed a full circle. The purpose of this exercise is to
obtain a complete sweep of the surrounding environment for drawing a
map or for localization.
Pixel
The term pixel is a contraction of “picture element”. In a digital image,
the picture is divided into a rectangular array of pixels. A pixel is
therefore the smallest amount of information available in an image. Pixel
values are usually expressed as a triplet of Red, Green and Blue values.
Pose
The pose of a robot consists of its position and its orientation.
Position
For the purposes of this thesis, position refers to a pair of coordinates
that locate a robot in a 2D space, i.e. on the ground plane. The position
of the robot is always deemed to be the centre of the robot, regardless of
the shape of the robot.
Posterior Probability
The posterior probability of something is the probability after
incorporating some new information.
Prior Probability
The prior probability of something, also just called the prior, is the
probability before some new information is incorporated. For example, it
might be the probability of a map cell being occupied or the probability
that the robot has a particular pose.
Radial Obstacle Profile
A Radial Obstacle Profile (ROP) is a single-dimension array calculated
by processing an image. Each element in the array is the distance to the
nearest obstacle at a particular angle. The angular range and the
increment between angles are configuration parameters that are
constrained by the FOV of the camera and the camera resolution.
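For illustration, given obstacle points already extracted into the robot's frame of reference (the image-processing step is omitted here), an ROP can be built by binning the points by bearing and keeping the nearest distance per bin:

```python
import math

def radial_obstacle_profile(points, num_bins, max_range, fov=math.pi / 2):
    """Build an ROP from obstacle points (x, y) in the robot frame:
    one minimum distance per angular bin across the field of view.
    Points outside the FOV are ignored; empty bins hold max_range."""
    rop = [max_range] * num_bins
    for x, y in points:
        angle = math.atan2(y, x)            # bearing of the point
        if abs(angle) > fov / 2:
            continue
        b = int((angle + fov / 2) / fov * num_bins)
        b = min(b, num_bins - 1)            # clamp the edge case
        rop[b] = min(rop[b], math.hypot(x, y))
    return rop
```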
Range Sensor
A range sensor measures the distance to an object. All range sensors
have a maximum range, as well as a particular level of accuracy. Some
range sensors, such as laser range finders, also provide the bearing
(direction) associated with each range. These are sometimes referred to as
range and bearing sensors.
See also Laser Range Finder and Bearing-only Sensor.
Resampling
The particles in a particle set are periodically evaluated according to
some criterion and given an importance weight. These weights should be
proportional to the probability of the particle being correct. In other
words, they are estimates of ‘goodness of fit’. Based on the weights,
particles are selected to propagate to the next step.
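The selection step can be sketched with weighted sampling with replacement (one of several resampling schemes; low-variance resampling is a common alternative):

```python
import random

def resample(particles, weights):
    """Draw a new particle set of the same size, with replacement,
    where each particle's chance of survival is proportional to
    its normalised importance weight."""
    return random.choices(particles, weights=weights, k=len(particles))
```

Particles with negligible weight are unlikely to survive, while heavily weighted particles may be duplicated several times.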
Resolution
The resolution of a camera is the number of pixels in the horizontal and
vertical directions. A typical resolution for a cheap camera is 320 by 240.
See also Pixel.
RGB
RGB (Red, Green, Blue) is the most commonly used colour space.
However, many other colour spaces are also in use.
Saturation
The saturation of a pixel, expressed in the range 0 to 100%, is a measure
of how strong the colour is. Low saturation values correspond to washed-
out colours and high values are intense colours.
Scan
The term scan is usually used in connection with laser range finders. The
LRF returns a range scan, which is an array of distances and the
corresponding bearings.
Scanline
Each horizontal row of pixels in an image is a scanline. By convention,
scanlines number downwards from the top of the image.
Segmentation
Segmentation is the process of splitting an image up into regions that are
related according to some criterion. The most obvious example is
separating areas of the image based on colour. However, segmentation
can also be performed by using a texture measure in place of colour, or
applying edge detection to generate the boundaries between regions.
Sensor
A sensor is a device that measures something in the surrounding
environment and provides the information to the robot as input data.
Common sensors for robots include sonar, infra-red, laser range finders,
bump sensors, compasses, cameras and wheel encoders.
See also Laser Range Finder, Sonar Sensor, Range Sensor, Bearing-only
Sensor and Wheel Encoder.
Sensor Model
See inverse sensor model.
SIFT
Scale-Invariant Feature Transform. This is a popular feature detector
invented by Lowe [Lowe 2004].
SLAM
SLAM stands for Simultaneous Localization and Mapping. This term is
mostly used in America. In European literature, it is often referred to as
Concurrent Mapping and Localization (CML).
Sonar Sensor
A sonar sensor uses ultrasonic sound to detect nearby obstacles. The
sound bounces off the obstacle and the time it takes to return to the
sensor gives an estimate of the distance based on the speed of sound.
Sonar sensors are very noisy and not very accurate because the sound
waves spread out in a cone as they move away from the sensor. This
means that a returning signal could have come from a wide range of
directions. Sonar sensors also have problems with lost echoes due to
sound-absorbent materials, or multiple echoes from adjacent obstacles.
See also Laser Range Finder, Range Sensor and Bearing-only Sensor.
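The range calculation itself is straightforward: the pulse travels out and back, so the one-way distance is half the round-trip path:

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C

def sonar_range(echo_time):
    """Distance to an obstacle from the round-trip echo time in seconds."""
    return SPEED_OF_SOUND * echo_time / 2.0

print(sonar_range(0.01))  # a 10 ms round trip is roughly 1.715 m
```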
Sweep
A sweep usually refers to a full 360° range scan. Originally it was used
in connection with sonar rings which provided proximity measurements
from all around a robot. However, it is sometimes used loosely when
referring to scans over a smaller angle especially for laser range finders
which typically provide 180° scans.
Vignetting
Because a digital camera sensor is rectangular but the lens is round, the
amount of light reaching the sensor tends to fall off towards the edges.
This results in the image getting progressively darker further from the
centre – a problem known as vignetting.
Weights
See Importance Weights.
Wheel Encoder
A wheel encoder is a sensor that determines the amount by which a
wheel has rotated. These sensors usually count ‘ticks’ and their
resolution is specified in ticks per revolution. A tick is therefore
equivalent to a certain number of degrees of rotation.
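The conversions from ticks to rotation and to travelled distance follow directly (the encoder resolution and wheel radius below are hypothetical values for illustration):

```python
import math

TICKS_PER_REV = 512      # hypothetical encoder resolution
WHEEL_RADIUS = 0.04      # hypothetical wheel radius in metres

def ticks_to_degrees(ticks):
    """Wheel rotation in degrees for a given tick count."""
    return ticks * 360.0 / TICKS_PER_REV

def ticks_to_distance(ticks):
    """Distance rolled along the ground, assuming no wheel slip."""
    return ticks / TICKS_PER_REV * 2.0 * math.pi * WHEEL_RADIUS

print(ticks_to_degrees(256))  # half a revolution -> 180.0
```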
References
Aguirre, E., & Gonzalez, A. (2002). Integrating fuzzy topological maps and fuzzy
geometric maps for behavior-based robots. International Journal of
Intelligent Systems, 17(3), 333-368.
Andrade-Cetto, J., & Sanfeliu, A. (2006). Environment Learning for Indoor Mobile
Robots - A Stochastic State Estimation Approach to Simultaneous
Localization and Map Building (Vol. 23). Berlin, Germany: Springer-Verlag.
Andreasson, H., Triebel, R., & Lilienthal, A. (2006, Dec). Vision-based Interpolation
of 3D Laser Scans. Paper presented at the 3rd International Conference on
Autonomous Robots and Agents, Palmerston North, New Zealand.
Araujo, R., & de Almeida, A. T. (1998). Map building using fuzzy ART, and learning
to navigate a mobile robot on an unknown world. Paper presented at the
IEEE International Conference on Robotics and Automation (ICRA).
Araujo, R., & de Almeida, A. T. (1999). Learning sensor-based navigation of a real
mobile robot in unknown worlds. IEEE Transactions on Systems, Man and
Cybernetics, Part B, 29(2), 164-178.
Arkin, R. C. (1998). Behavior-Based Robotics. Cambridge, MA: MIT Press.
Ayache, N., & Faugeras, O. D. (1988). Building, Registrating, and Fusing Noisy
Visual Maps. International Journal of Robotics Research, 12(7), 45-65.
Badal, S., Ravela, S., Draper, B., & Hanson, A. (1994). A Practical Obstacle
Detection and Avoidance System. Paper presented at the Workshop on
Applications of Computer Vision, Sarasota, FL.
Bailey, T., Nieto, J., & Nebot, E. (2006). Consistency of the FastSLAM Algorithm.
Paper presented at the IEEE International Conference on Robotics and
Automation (ICRA).
Bailey, T., Nieto, J., & Nebot, E. (2006). Consistency of the EKF-SLAM Algorithm.
Paper presented at the IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS).
Bailey, T., & Durrant-Whyte, H. F. (2006, Sep). Simultaneous localization and
mapping (SLAM): part II. IEEE Robotics & Automation Magazine, 13, 108-
117.
Baker, S., & Nayar, S. K. (1998, Jan). A Theory of Catadioptric Image Formation.
Paper presented at the International Conference on Computer Vision (ICCV),
Bombay, India.
Ballard, D., & Brown, C. (1982). Computer Vision: Prentice Hall.
Beauchemin, S. S., & Barron, J. L. (1995). The Computation of Optical Flow. ACM
Computing Surveys, 27(3), 433-467.
Bennewitz, M., Stachniss, C., Burgard, W., & Behnke, S. (2006). Metric
Localization with Scale-Invariant Visual Features using a Single Perspective
Camera. Paper presented at the European Robotics Symposium (EUROS),
Palermo, Italy.
Bertozzi, M., Broggi, A., & Fascioli, A. (1998). Stereo inverse projection mapping:
theory and applications. Image and Vision Computing, 16, 585-590.
Bertozzi, M., Broggi, A., & Fascioli, A. (2000). Vision-based intelligent vehicles:
State of the art and perspectives. Robotics and Autonomous Systems, 32, 1-
16.
Borenstein, J., & Koren, Y. (1989). Real-Time Obstacle Avoidance for Fast Mobile
Robots. IEEE Transactions on Systems, Man, and Cybernetics, 19(5), 1179-
1187.
Borenstein, J., & Feng, L. (1996). Measurement and correction of systematic
odometry errors in mobile robots. IEEE Transactions on Robotics and
Automation, 12, 869-880.
Borenstein, J., Everett, H. R., & Feng, L. (1996). Where am I? Sensors and Methods
for Mobile Robot Positioning. Retrieved 10-Sep, 2003, from
http://www-personal.umich.edu/~johannb/shared/pos96rep.pdf
Borenstein, J., Everett, H. R., Feng, L., & Wehe, D. (1997). Mobile robot
positioning: Sensors and techniques. Journal of Robotic Systems, 14(4), 231-
249.
Borgefors, G. (1986). Distance Transformations in Digital Images. Computer Vision,
Graphics and Image Processing, 34, 344-371.
Bouguet, J.-Y., & Perona, P. (1995). Visual Navigation Using a Single Camera.
Paper presented at the International Conference on Computer Vision (ICCV).
Bouguet, J.-Y. (2006). Camera calibration toolbox for Matlab. Retrieved 19-Mar,
2006, from http://www.vision.caltech.edu/bouguetj/calib_doc
Brooks, R. A. (1986). A Robust Layered Control System for a Mobile Robot. IEEE
Journal of Robotics and Automation, RA-2(1), 14-23.
Brooks, R. A. (2002). Flesh and Machines: How Robots Will Change Us: Pantheon
Books.
Bruce, J., Balch, T., & Veloso, M. (2000). Fast and inexpensive color image
segmentation for interactive robots. Paper presented at the International
Conference on Intelligent Robots and Systems (IROS).
Burgard, W., Fox, D., Moors, M., Simmons, R., & Thrun, S. (2000). Collaborative
multi-robot exploration. Paper presented at the IEEE International
Conference on Robotics and Automation (ICRA), San Francisco, CA.
Burgard, W., Moors, M., Stachniss, C., & Schneider, F. (2005). Coordinated Multi-
Robot Exploration. IEEE Transactions on Robotics, 21(3), 376-388.
Burschka, D. & Hager, G.D. (2003, Oct). V-GPS - image-based control for 3D
guidance systems. Paper presented at the IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS).
Burschka, D. & Hager, G.D. (2004). V-GPS(SLAM): vision-based inertial system for
mobile robots. Paper presented at the IEEE International Conference on
Robotics and Automation (ICRA).
Campbell, J., Sukthankar, R., Nourbakhsh, I. R., & Pahwa, A. (2005, Apr). A Robust
Visual Odometry and Precipice Detection System Using Consumer-grade
Monocular Vision. Paper presented at the IEEE Conference on Robotics and
Automation (ICRA).
Camus, T., Coombs, D., Herman, M., & Hong, T. H. (1996). Real-time single-
workstation obstacle avoidance using only wide-field flow divergence. Paper
presented at the 13th International Conference on Pattern Recognition.
Canny, J. F. (1986). A computational approach to edge detection. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 8(6), 679-697.
CARMEN. (2007). Carmen Robot Navigation Toolkit. Retrieved 2-Feb, 2007, from
http://carmen.sourceforge.net/home.html
Castellanos, J. A., & Tardós, J. D. (1999). Mobile Robot Localization and Map
Building - A Multisensor Fusion Approach. Norwell, MA: Kluwer Academic
Publishers.
Castellanos, J. A., Montiel, J. M. M., Neira, J., & Tardós, J. D. (1999). The SPmap: a
probabilistic framework for simultaneous localization and map building.
IEEE Transactions on Robotics and Automation, 15(5), 948-952.
Castellanos, J. A., Neira, J., & Tardós, J. D. (2004, Jul). Limits to the consistency of
EKF-based SLAM. Paper presented at the 5th IFAC/EURON Symposium on
Intelligent Autonomous Vehicles, Lisbon, Portugal.
Chang, H. J., Lee, C. S. G., Lu, Y.-H., & Hu, Y. C. (2007). P-SLAM: Simultaneous
Localization and Mapping With Environmental-Structure Prediction. IEEE
Transactions on Robotics, 23(2), 281-293.
Cheng, G., & Zelinsky, A. (1996, Nov). Real-Time Visual Behaviours for Navigating
a Mobile Robot. Paper presented at the IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS).
Cheng, G., & Zelinsky, A. (1998). Goal-oriented behaviour-based visual navigation.
Paper presented at the IEEE International Conference on Robotics and
Automation (ICRA).
Choi, W., Ryu, C., & Kim, H. (1999). Navigation of a mobile robot using mono-
vision and mono-audition. Paper presented at the IEEE International
Conference on Systems, Man and Cybernetics.
Chong, K. S., & Kleeman, L. (1997, Sep). Indoor Exploration Using a Sonar Sensor
Array: A Dual Representation Strategy. Paper presented at the IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS).
Choset, H., Lynch, K. M., Hutchinson, S., Kantor, G., Burgard, W., Kavraki, L. E.,
et al. (2005). Principles of Robot Motion. Cambridge, MA: MIT Press.
Clemente, L. A., Davison, A. J., Reid, I. D., Neira, J., & Tardós, J. D. (2007, Jun).
Mapping Large Loops with a Single Hand-Held Camera. Paper presented at
the Robotics: Science and Systems (RSS), Atlanta, GA.
CMU. (2006). The CMUcam Vision Sensors. Retrieved 6-Dec, 2006, from
http://www.cs.cmu.edu/~cmucam/
Comaniciu, D., & Meer, P. (2002). Mean shift: A robust approach toward feature
space analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, 24,
603-619.
Dahlkamp, H., Kaehler, A., Stavens, D., Thrun, S., & Bradski, G. (2006, Aug). Self-
supervised Monocular Road Detection in Desert Terrain. Paper presented at
the Robotics: Science and Systems, Philadelphia, PA.
Das, A. K., Fierro, R., Kumar, V., Southall, B., Spletzer, J., & Taylor, C. J. (2001).
Real-time vision-based control of a nonholonomic mobile robot. Paper
presented at the IEEE International Conference on Robotics and Automation,
Seoul, Korea.
Davison, A. J., & Murray, D. W. (1998). Mobile Robot Localisation Using Active
Vision. Paper presented at the 5th European Conference on Computer Vision
(ECCV), Freiburg, Germany.
Davison, A. J. (1999). Mobile Robot Navigation using Active Vision. Unpublished
PhD, University of Oxford.
Davison, A. J., & Murray, D. W. (2002). Simultaneous localization and map-
building using active vision. IEEE Trans. Pattern Analysis and Machine
Intelligence, 24(7), 865-880.
Davison, A. J. (2003, Oct). Real-time simultaneous localisation and mapping with a
single camera. Paper presented at the 9th IEEE International Conference on
Computer Vision, Nice, France.
Davison, A. J., Cid, Y. G., & Kita, N. (2004, Jul). Real-time 3D SLAM with Wide-
Angle Vision. Paper presented at the 5th IFAC/EURON Symposium on
Intelligent Autonomous Vehicles (IAV), Lisbon, Portugal.
Dellaert, F., Burgard, W., Fox, D., & Thrun, S. (1999, Jun). Using the Condensation
Algorithm for Robust, Vision-based Mobile Robot Localization. Paper
presented at the IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR).
DeSouza, G. N., & Kak, A. C. (2002). Vision for Mobile Robot Navigation: A
Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence,
24(2), 237-267.
Dissanayake, M. W. M. G., Newman, P., Clark, S., Durrant-Whyte, H. F., & Csorba,
M. (2001). A solution to the simultaneous localization and map building
(SLAM) problem. IEEE Transactions on Robotics and Automation, 17(3),
229-241.
Doucet, A., de Freitas, N., & Gordon, N. (Eds.). (2001). Sequential Monte Carlo
Methods in Practice. New York, NY: Springer-Verlag.
Dr Robot Inc. (2008). Sentinel2 Robot with Constellation System. Retrieved 29-Jun,
2008, from
http://www.drrobot.com/products_item.asp?itemNumber=Sentinel2
Dr Robot Inc. (2008). X80 WiFi Robot. Retrieved 29-Jun, 2008, from
http://www.drrobot.com/products_item.asp?itemNumber=X80
Duda, R. O., & Hart, P. E. (1972). Use of the Hough Transformation to Detect Lines
and Curves in Pictures. Communications of the ACM, 15, 11-15.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern Classification (2nd ed.):
Wiley-Interscience.
Dudek, G., & Jenkin, M. R. M. (2000). Computational Principles of Mobile
Robotics: Cambridge University Press.
Durrant-Whyte, H. F. (1988). Uncertain geometry in robotics. IEEE Transactions of
Robotics and Automation, 4(1), 23-31.
Durrant-Whyte, H. F., & Bailey, T. (2006, Jun). Simultaneous localization and
mapping: part I. IEEE Robotics & Automation Magazine, 13, 99-110.
Eade, E., & Drummond, T. (2006, Jun). Scalable Monocular SLAM. Paper presented
at the IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR).
Egido, V., Barber, R., Boada, M. J. L., & Salichs, M. A. (2002, May). Self-
Generation by a Mobile Robot of Topological Maps of Corridors. Paper
presented at the IEEE International Conference on Robotics and Automation
(ICRA), Washington, DC.
Elfes, A. (1989). Using occupancy grids for mobile robot perception and navigation.
Computer, 22(6), 46-57.
Eliazar, A., & Parr, R. (2003). DP-SLAM: Fast, Robust Simultaneous Localization
and Mapping Without Predetermined Landmarks. Paper presented at the
International Joint Conference on Artificial Intelligence.
Everett, H. R. (1995). Sensors for Mobile Robots: Theory and Applications. Natick,
MA: A K Peters, Ltd.
Evolution Robotics. (2005). Robots, robot kits, OEM Solution: Evolution Robotics.
Retrieved 20-Jun, 2005, from http://www.evolution.com/er1/
Evolution Robotics. (2008). Northstar System. Retrieved 29-Jun, 2008, from
http://www.evolution.com/products/northstar/
Faugeras, O. D. (1993). Three-Dimensional Computer Vision. Cambridge, MA: MIT
Press.
Filliat, D., & Meyer, J.-A. (2000). Active Perception and Map Learning for Robot
Navigation. Paper presented at the 6th International Conference on
Simulation and Adaptive Behaviour.
Filliat, D., & Meyer, J.-A. (2001). Global localization and topological map-learning
for robot navigation. Paper presented at the 7th International Conference on
Simulation and Adaptive Behaviour.
Finlayson, G. D., Schiele, B., & Crowley, J. L. (1998). Comprehensive Colour
Image Normalization. Lecture Notes in Computer Science, 1406, 475.
Firby, R. J., Kahn, R. E., Prokopopowitz, P. N., & Swain, M. J. (1995). An
Architecture for Vision and Action. Paper presented at the Fourteenth
International Joint Conference on Artificial Intelligence (IJCAI).
Forsyth, D. A., & Ponce, J. (2003). Computer Vision - A Modern Approach. Upper
Saddle River, NJ: Prentice-Hall Inc.
Fox, D., Ko, J., Konolige, K., Limketkai, B., Schulz, D., & Stewart, B. (2006).
Distributed Multirobot Exploration and Mapping. Proceedings of the IEEE,
94(7), 1325-1339.
Funt, B. V., & Finlayson, G. D. (1995). Color constant color indexing. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(5), 522-529.
Funt, B., Barnard, K., & Martin, L. (1998). Is Machine Colour Constancy Good
Enough. Lecture Notes in Computer Science, 1406, 445-449.
Gaussier, P., Joulain, C., Banquet, J. P., Leprêtre, S., & Revel, A. (2000). The visual
homing problem: An example of robotics/biology cross fertilization. Robotics
and Autonomous Systems, 30, 155-180.
Geusebroek, J. M., van den Boomgaard, R., Smeulders, A. W. M., & Geerts, H.
(2001). Color invariance. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 23(12), 1338-1350.
Goncalves, L., di Bernardo, E., Benson, D., Svedman, M., Ostrowski, J., Karlsson,
N., & Pirjanian, P. (2005, Apr). A Visual Front-end for Simultaneous
Localization and Mapping. Paper presented at the IEEE International
Conference on Robotics and Automation (ICRA).
Grisetti, G., Stachniss, C., & Burgard, W. (2005). Improving Grid-based SLAM with
Rao-Blackwellized Particle Filters by Adaptive Proposals and Selective
Resampling. Paper presented at the IEEE International Conference on
Robotics and Automation (ICRA).
Grisetti, G., Stachniss, C., & Burgard, W. (2007). Improved Techniques for Grid
Mapping with Rao-Blackwellized Particle Filters. IEEE Transactions on
Robotics, 23(1), 34-46.
Gutmann, J.-S., Burgard, W., Fox, D., & Konolige, K. (1998). An experimental
comparison of localization methods. Paper presented at the IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS).
Gutmann, J.-S., & Konolige, K. (1999, Nov). Incremental mapping of large cyclic
environments. Paper presented at the International Symposium on
Computational Intelligence in Robotics and Automation, Monterey, CA.
Hähnel, D., Burgard, W., Fox, D., & Thrun, S. (2003, Oct). An efficient FastSLAM
algorithm for generating maps of large-scale cyclic environments from raw
laser range measurements. Paper presented at the IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS).
Hähnel, D. (2004). Mapping with Mobile Robots. Unpublished PhD, University of
Freiburg, Freiburg, Germany.
Hanbury, A. (2002). The Taming of the Hue, Saturation and Brightness Colour
Space. Paper presented at the 7th Computer Vision Winter Workshop, Bad
Aussee, Austria.
Harris, C., & Stephens, M. (1988). A Combined Corner and Edge Detector. Paper
presented at the Fourth Alvey Vision Conference, Manchester, UK.
Hart, P. E., Nilsson, N. J., & Raphael, B. (1968). A formal basis for the heuristic
determination of minimum cost paths. IEEE Transactions on Systems Science
and Cybernetics, 4(2), 100-107.
Hartley, R., & Zisserman, A. (2003). Multiple View Geometry in Computer Vision
(2nd ed.). Cambridge, UK: Cambridge University Press.
Hoffman, D. D. (1998). Visual Intelligence: W.W. Norton & Company.
Hong, S., Moradi, H., Lee, S., & Lee, S. (2006, Dec). A Comparative Study of Local
Minima Escape in Potential Field Method. Paper presented at the 3rd
International Conference on Autonomous Robots and Agents, Palmerston
North, New Zealand.
Horn, B. K. P. (1986). Robot Vision. Cambridge, MA: MIT Press.
Horswill, I. D., & Brooks, R. A. (1988). Situated Vision in a Dynamic World:
Chasing Objects. Paper presented at the American Association of Artificial
Intelligence, St. Paul, MN.
Horswill, I. D. (1993). Polly: A Vision-Based Artificial Agent. Paper presented at the
11th National Conference on Artificial Intelligence.
Horswill, I. D. (1994). Visual Collision Avoidance by Segmentation. Paper presented
at the Image Understanding Workshop.
Horswill, I. D. (1994). Specialization of Perceptual Processes. Unpublished PhD,
MIT.
Horswill, I. D. (1997). Visual architecture and cognitive architecture. Journal of
Experimental and Theoretical Artificial Intelligence (JETAI), 9(2-3), 277-
292.
Howard, A., & Kitchen, L. (1997). Fast Visual Mapping for Mobile Robot
Navigation. Paper presented at the IEEE International Conference on
Intelligent Processing Systems, Beijing, China.
Howard, A. (2006). Multi-robot Simultaneous Localization and Mapping using
Particle Filters. International Journal of Robotics Research, 25(12), 1243-
1256.
Intel. (2005). Open Source Computer Vision Library. Retrieved 2-Aug, 2005, from
http://www.intel.com/technology/computing/opencv
iRobot. (2006). iRobot Corporation: iRobot Roomba. Retrieved 9-Nov, 2006, from
http://www.irobot.com/sp.cfm?pageid=122
Ishiguro, H. (1998). Development of low-cost compact omnidirectional vision
sensors and their applications. Paper presented at the International
Conference on Information Systems, Analysis and Synthesis.
Jarvis, R. A. (1984). Collision-free Trajectory Planning Using Distance Transforms.
Paper presented at the National Conference and Exhibition on Robotics,
Melbourne, Australia.
Jarvis, R. A. (1993). Distance Transform Based Path Planning For Robot Navigation.
In Y. F. Zheng (Ed.), Recent Trends in Mobile Robots (Vol. 11, pp. 3-31):
World Scientific.
Johns, K., & Taylor, T. (2008). Professional Microsoft Robotics Developer Studio:
Wrox (Wiley Publishing).
Johnson, C. R., Jr., & Taylor, T. (1980). Failure of a parallel adaptive identifier with
adaptive error filtering. IEEE Transactions on Automatic Control, 25(6),
1248-1250.
Julier, S. J., & Uhlmann, J. K. (1997, Apr). A new extension of the Kalman filter to
nonlinear systems. Paper presented at the 11th SPIE International
Symposium on Aerospace/Defense Sensing, Simulation and Controls,
Orlando, FL.
Julier, S. J., & Uhlmann, J. K. (2001, May). A counter example to the theory of
simultaneous localization and map building. Paper presented at the IEEE
International Conference on Robotics and Automation (ICRA), Seoul, Korea.
Julier, S. J., & Uhlmann, J. K. (2004). Unscented filtering and nonlinear estimation.
Proceedings of the IEEE, 92(3), 401-422.
Kak, A. C., & DeSouza, G. N. (2002). Robotic vision: what happened to the visions
of yesterday? Paper presented at the 16th International Conference on Pattern
Recognition.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems.
Transactions of the ASME, Journal of Basic Engineering, 82, 35-45.
Karlsson, N., di Bernardo, E., Ostrowski, J., Goncalves, L., Pirjanian, P., &
Munich, M. E. (2005, Apr). The vSLAM Algorithm for Robust Localization
and Mapping. Paper presented at the IEEE International Conference on
Robotics and Automation (ICRA).
Khatib, O. (1985). Real-time obstacle avoidance for manipulators and mobile
robots. Paper presented at the IEEE International Conference on Robotics
and Automation (ICRA).
Koch, C. (2004). The Quest for Consciousness. Englewood, CO: Roberts and
Company Publishers.
Koenig, S., Tovey, C., & Halliburton, W. (2001). Greedy mapping of terrain. Paper
presented at the IEEE International Conference on Robotics and Automation
(ICRA).
K-Team. (2006). K-TEAM Corporation (Hemisson). Retrieved 10-Nov, 2006, from
http://www.k-team.com/kteam/index.php?site=1&rub=22&page=16&version=EN
Kuffner, J. J. (2004, Sep). Efficient Optimal Search of Uniform-Cost Grids and
Lattices. Paper presented at the IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), Sendai, Japan.
Kurzweil, R. (2005). The Singularity is Near - When Humans Transcend Biology:
Penguin Books.
Lee, D. (1996). The Map-Building and Exploration Strategies of a Simple Sonar-
Equipped Mobile Robot. Cambridge, UK: Cambridge University Press.
Lenser, S., & Veloso, M. (2003, Oct). Visual sonar: fast obstacle avoidance using
monocular vision. Paper presented at the IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS).
Lorigo, L. M., Brooks, R. A., & Grimson, W. E. L. (1997). Visually-guided obstacle
avoidance in unstructured environments. Paper presented at the IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS).
Lourakis, M. I. A., & Orphanoudakis, S. C. (1998, Jan). Visual Detection of
Obstacles Assuming a Locally Planar Ground. Paper presented at the Asian
Conference on Computer Vision (ACCV).
Lowe, D. G. (2004). Distinctive Image Features from Scale-Invariant Keypoints.
International Journal of Computer Vision, 60(2), 91-110.
Lu, F., & Milios, E. (1994). Robot pose estimation in unknown environments by
matching 2D range scans. Paper presented at the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR), Seattle,
WA.
Lu, F., & Milios, E. (1997). Globally consistent range scan alignment for
environment mapping. Autonomous Robots, 4, 333-349.
Maja, J. M., Takahashi, T., Wang, Z. D., & Nakano, E. (2000). Real-time obstacle
avoidance algorithm for visual navigation. Paper presented at the IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS).
Mallot, H. A., Bülthoff, H. H., Little, J. J., & Bohrer, S. (1991). Inverse perspective
mapping simplifies optical flow computation and obstacle detection.
Biological Cybernetics, 64, 177-185.
Manko, E. (2003). CodeGuru: Creating a High-Precision, High Resolution and
Highly Reliable Timer. Retrieved 14-Mar, 2003, from
http://www.codeguru.com/Cpp/W-P/system/timers/article.php/c5759/
Marr, D. (1982). Vision: a Computational Investigation into the Human
Representation and Processing of Visual Information. New York, NY: W.H.
Freeman.
Martin, M. C. (2001). The Simulated Evolution of Robot Perception. Unpublished
PhD Dissertation, Carnegie Mellon University, Pittsburgh, PA, USA.
Martin, M. C. (2002, Aug). Genetic Programming for Robot Vision. Paper presented
at the From Animals to Animats 7: The Seventh International Conference on
the Simulation of Adaptive Behavior.
Martinez-Cantin, R., & Castellanos, J. A. (2006, May). Bounding uncertainty in
EKF-SLAM: the robocentric local approach. Paper presented at the IEEE
International Conference on Robotics and Automation (ICRA).
Milford, M.J., Wyeth, G.F., & Prasser, D.P. (2004). Simultaneous Localization and
Mapping from Natural Landmarks using RatSLAM. Paper presented at the
Australasian Conference on Robotics and Automation, Canberra, Australia.
Miro, J. V., Dissanayake, G., & Zhou, W. (2005, Dec). Vision-based SLAM using
natural features in indoor environments. Paper presented at the International
Conference on Intelligent Sensors, Sensor Networks and Information
Processing.
Montemerlo, M., Thrun, S., Koller, D., & Wegbreit, B. (2003). FastSLAM 2.0: An
Improved Particle Filtering Algorithm for Simultaneous Localization and
Mapping that Provably Converges. Paper presented at the International Joint
Conferences on Artificial Intelligence (IJCAI).
Montemerlo, M., & Thrun, S. (2007). FastSLAM - A Scalable Method for the
Simultaneous Localization and Mapping Problem in Robotics (Vol. 27).
Berlin, Germany: Springer-Verlag.
Moravec, H. P. (1977, Aug). Towards Automatic Visual Obstacle Avoidance. Paper
presented at the 5th International Joint Conferences on Artificial Intelligence
(IJCAI), Cambridge, MA.
Moravec, H. P. (1980). Obstacle Avoidance and Navigation in the Real World by a
Seeing Robot Rover. Unpublished PhD, Stanford.
Moravec, H. P. (1983). The Stanford Cart and the CMU Rover. Proceedings of the
IEEE, 71(7), 872-884.
Moravec, H. P., & Elfes, A. (1985, Mar). High-resolution maps from wide-angle
sonar. Paper presented at the IEEE International Conference on Robotics and
Automation (ICRA), St. Louis, MO.
Murphy, R. R. (2000). Introduction to AI Robotics. Cambridge, MA: MIT Press.
Murray, D., & Jennings, C. (1997, Apr). Stereo vision based mapping and
navigation for mobile robots. Paper presented at the IEEE International
Conference on Robotics and Automation (ICRA), Albuquerque, New
Mexico.
Murray, D., & Little, J. J. (2000). Using Real-Time Stereo Vision for Mobile Robot
Navigation. Autonomous Robots, 8(2), 161-171.
Nayar, S. K. (1997, 17-19 June). Catadioptric omnidirectional camera. Paper
presented at the IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR), San Juan, Puerto Rico.
Newman, P. M. (1999). On the Structure and Solution of the Simultaneous
Localisation and Map Building Problem. Unpublished PhD, University of
Sydney, Sydney.
Newman, P. M., Leonard, J. J., Neira, J., & Tardós, J. D. (2002). Explore and return:
Experimental validation of real time concurrent mapping and localization.
Paper presented at the IEEE International Conference on Robotics and
Automation (ICRA).
Nieto, J., Bailey, T., & Nebot, E. (2005). Scan-SLAM: Combining EKF-SLAM and
Scan Correlation. Paper presented at the International Conference on Field
and Service Robotics (FSR).
Nilsson, N. J. (1984). Shakey the Robot (Technical Report No. 323). Menlo Park,
CA: SRI International.
Oh, J. S., Choi, Y. H., Park, J. B., & Zheng, Y. F. (2004). Complete coverage
navigation of cleaning robots using triangular-cell-based map. IEEE
Transactions on Industrial Electronics, 51(3), 718-726.
OpenSLAM. (2007). OpenSLAM. Retrieved 19 Feb, 2007, from
http://www.openslam.org/
Pei, S.-C., & Horng, J.-H. (1998). Finding the optimal driving path of a car using the
modified constrained distance transformation. IEEE Transactions on
Robotics and Automation, 14, 663-670.
Pérez, J. A., Castellanos, J. A., Montiel, J. M. M., Neira, J., & Tardós, J. D. (1999,
May). Continuous mobile robot localization: vision vs. laser. Paper presented
at the IEEE International Conference on Robotics and Automation (ICRA),
Detroit, Michigan.
Pérez Lorenzo, J. M., Vázquez-Martín, R., Núñez, P., Pérez, E., & Sandoval, F.
(2004, Dec). A Hough-based Method for Concurrent Mapping and
Localization in Indoor Environments. Paper presented at the IEEE
Conference on Robotics, Automation and Mechatronics (RAM), Singapore.
Prasser, D. P., Wyeth, G. F., & Milford, M. J. (2004, Oct). Biologically inspired visual
landmark processing for simultaneous localization and mapping. Paper
presented at the IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS).
Radish. (2006). Radish - The Robotics Data Set Repository. Retrieved 15-Nov, 2006,
from http://sourceforge.net/radish
Rekleitis, I., Sim, R., Dudek, G., & Milios, E. (2001). Collaborative exploration for
the construction of visual maps. Paper presented at the IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS), Maui,
Hawaii.
Reyes, N. H., Barczak, A. L., & Messom, C. H. (2006, Dec). Fast Colour
Classification for Real-time Colour Object Identification: AdaBoost Training
of Classifiers. Paper presented at the 3rd International Conference on
Autonomous Robots and Agents, Palmerston North, New Zealand.
Rosenfeld, A., & Pfaltz, J. L. (1966). Sequential operations in digital picture
processing. Journal of the Association for Computing Machinery, 13(4), 471-
494.
Rutgers University. (2004). Edge Detection and Image SegmentatiON system
(EDISON). Retrieved 12-Oct, 2004, from
http://www.caip.rutgers.edu/riul/research/code/EDISON
RxPG News. (2006). Microsaccades are indeed responsible for most of our visual
experience. Retrieved 10-Nov, 2006, from
http://www.rxpgnews.com/research/ophthalmology/article_3188.shtml
Schiele, B., & Crowley, J. L. (1994, May). A Comparison of Position Estimation
Techniques Using Occupancy Grids. Paper presented at the IEEE
International Conference on Robotics and Automation (ICRA), San Diego,
CA.
Schultz, A. C., & Adams, W. (1996). Continuous Localization using Evidence Grids
(No. NCARAI Report AIC-96-007). Washington, DC: Naval Research
Laboratory.
Se, S., Lowe, D., & Little, J. J. (2001). Vision-based Mobile robot localization and
mapping using scale-invariant features. Paper presented at the IEEE
International Conference on Robotics and Automation (ICRA), Seoul, Korea.
Se, S., Lowe, D., & Little, J. J. (2002). Mobile Robot Localization and Mapping with
Uncertainty using Scale-Invariant Visual Landmarks. International Journal
of Robotics Research, 21(8), 735-758.
Shi, J., & Tomasi, C. (1994). Good Features to Track. Paper presented at the IEEE
Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), Seattle, Washington.
Shih, F. Y., & Wu, Y.-T. (2004). Fast Euclidean distance transformation in two
scans using a 3 x 3 neighborhood. Computer Vision and Image
Understanding, 93, 195–205.
Siegwart, R., & Nourbakhsh, I. R. (2004). Introduction to Autonomous Mobile
Robots. Cambridge, MA: MIT Press.
Sim, R., & Dudek, G. (2003, Oct). Effective exploration strategies for the
construction of visual maps. Paper presented at the IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS).
Sim, R., Elinas, P., Griffin, M., Shyr, A., & Little, J.J. (2006, Jun). Design and
analysis of a framework for real-time vision-based SLAM using Rao-
Blackwellised particle filters. Paper presented at the 3rd Canadian
Conference on Computer and Robot Vision.
Simmons, R., Apfelbaum, D., Burgard, W., Fox, D., Moors, M., Thrun, S., et al.
(2000). Coordination for Multi-Robot Exploration and Mapping. Paper
presented at the 17th National Conference on Artificial Intelligence.
Sintorn, I.-M., & Borgefors, G. (2001). Weighted distance transforms in rectangular
grids. Paper presented at the 11th International Conference on Image
Analysis and Processing.
Smith, R. C., & Cheeseman, P. (1986). On the representation and estimation of
spatial uncertainty. International Journal of Robotics Research (IJRR), 5(4),
56-68.
Smith, R. C., Self, M., & Cheeseman, P. (1990). Estimating uncertain spatial
relationships in robotics. In I. J. Cox & G. T. Wilfong (Eds.), Autonomous
Robot Vehicles (pp. 167-193). New York: Springer-Verlag.
Smith, S. M., & Brady, J. M. (1995). SUSAN - A new approach to low level image
processing (No. TR95SMS1c): Oxford University.
Smith, P., Reid, I., & Davison, A. (2006). Real-Time Monocular SLAM with Straight
Lines. Paper presented at the British Machine Vision Conference (BMVC).
Stachniss, C., Hähnel, D., Burgard, W., & Grisetti, G. (2005). On Actively Closing
Loops in Grid-based FastSLAM. Advanced Robotics, 19(10), 1059-1080.
Stentz, A. (1994, May). Optimal and Efficient Path Planning for Partially-Known
Environments. Paper presented at the IEEE International Conference on
Robotics and Automation (ICRA).
Tang, K. W., & Jarvis, R. A. (2003). A Simple and Efficient Algorithm for Robotic
Swarm Exploratory Tasks (Technical Report No. MECSE-11-2003).
Melbourne, Australia: Department of Electrical and Computer Systems
Engineering, Monash University.
Taylor, T., Geva, S., & Boles, W. W. (2004, Dec). Vision-based Pirouettes using the
Radial Obstacle Profile. Paper presented at the IEEE Conference on
Robotics, Automation and Mechatronics (RAM), Singapore.
Taylor, T., Geva, S., & Boles, W. W. (2004, Jul). Monocular Vision as a Range
Sensor. Paper presented at the International Conference on Computational
Intelligence for Modelling, Control & Automation (CIMCA), Gold Coast,
Australia.
Taylor, T., Geva, S., & Boles, W. W. (2005, Dec). Directed Exploration using a
Modified Distance Transform. Paper presented at the Digital Image
Computing Techniques and Applications (DICTA), Cairns, Australia.
Taylor, T., Geva, S., & Boles, W. W. (2005, Sep). Early Results in Vision-based
Map Building. Paper presented at the 3rd International Symposium on
Autonomous Minirobots for Research and Edutainment (AMiRE), Awara-
Spa, Fukui, Japan.
Taylor, T. (2006). MSRS Code Page. Retrieved 10-Nov, 2006, from
http://sky.fit.qut.edu.au/~taylort2/MSRS/
Taylor, T. (2006). X80 WiRobot. Retrieved 10-Nov, 2006, from
http://sky.fit.qut.edu.au/~taylort2/X80/
Taylor, T., Geva, S., & Boles, W. W. (2006, Dec). Using Camera Tilt to Assist with
Localisation. Paper presented at the 3rd International Conference on
Autonomous Robots and Agents, Palmerston North, New Zealand.
Taylor, T. (2007). Applying High-Level Understanding to Visual Localisation for
Mapping. In S. C. Mukhopadhyay & G. Sen Gupta (Eds.), Autonomous
Robots and Agents (Vol. 76, pp. 35-42): Springer-Verlag.
Taylor, T., Boles, W. W., & Geva, S. (2007, Dec). Map Building using Cheap
Digital Cameras. Paper presented at the Digital Image Computing
Techniques and Applications (DICTA), Adelaide, Australia.
Thau, R. S. (1997). Reliably Mapping a Robot's Environment using Fast Vision and
Local, but not Global, Metric Data. Unpublished PhD, MIT.
Thrun, S., & Bucken, A. (1996). Integrating Grid-Based and Topological Maps for
Mobile Robot Navigation. Paper presented at the 13th National Conference
on Artificial Intelligence.
Thrun, S. (1998). Learning metric-topological maps for indoor mobile robot
navigation. Artificial Intelligence, 99(1), 21-71.
Thrun, S., Burgard, W., & Fox, D. (1998). A probabilistic approach to concurrent
mapping and localization for mobile robots. Machine Learning, 31(1-3), 29-
53.
Thrun, S., Beetz, M., Bennewitz, M., Burgard, W., Cremers, A. B., Dellaert, F., et al.
(2000). Probabilistic algorithms and the interactive museum tour-guide robot
Minerva. International Journal of Robotics Research, 19(11), 972-999.
Thrun, S. (2001). A probabilistic on-line mapping algorithm for teams of mobile
robots. International Journal of Robotics Research, 20(5), 335-363.
Thrun, S. (2002). Robotic Mapping: A Survey. In G. Lakemeyer & B. Nebel (Eds.),
Exploring Artificial Intelligence in the New Millennium (pp. 1-36): Morgan
Kaufmann.
Thrun, S. (2003). Learning Occupancy Grid Maps with Forward Sensor Models.
Autonomous Robots, 15(2), 111-127.
Thrun, S., Burgard, W., & Fox, D. (2005). Probabilistic Robotics. Cambridge, MA:
MIT Press.
Thrun, S. (2006). Winning the DARPA Grand Challenge. Retrieved 25-Nov, 2006,
from
http://video.google.com/videoplay?docid=8594517128412883394&q=thrun
Thrun, S., Montemerlo, M., Dahlkamp, H., Stavens, D., Aron, A., Diebel, J., et al.
(2007). Stanley: The Robot That Won the DARPA Grand Challenge. In M.
Buehler, K. Iagnemma & S. Singh (Eds.), The 2005 DARPA Grand
Challenge (pp. 1-43). Heidelberg: Springer.
Tomatis, N., Nourbakhsh, I., & Siegwart, R. (2001). Simultaneous localization and
map-building: a global topological model with local metric maps. Paper
presented at the IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS), Maui, Hawaii, USA.
Tomatis, N., Nourbakhsh, I., & Siegwart, R. (2002). Hybrid simultaneous
localization and map building: closing the loop with multi-hypothesis
tracking. Paper presented at the IEEE International Conference on Robotics
and Automation (ICRA).
Ullman, S. (1996). High-Level Vision. Cambridge, MA: MIT Press.
Ulrich, I., & Nourbakhsh, I. (2000, Apr). Appearance-Based Place Recognition for
Topological Localization. Paper presented at the IEEE International
Conference on Robotics and Automation (ICRA), San Francisco, CA.
Ulrich, I., & Nourbakhsh, I. (2000, Jul). Appearance-Based Obstacle Detection with
Monocular Color Vision. Paper presented at the National Conference on
Artificial Intelligence, Austin, TX.
Wandell, B. A. (1995). Foundations of Vision. Sunderland, MA: Sinauer Associates.
Wang, L. C., Yong, L. S., & Ang, M. H., Jr. (2002). Hybrid of global path planning
and local navigation implemented on a mobile robot in indoor environment.
Paper presented at the IEEE International Symposium on Intelligent Control.
Wheeler, J. (2003). CodeGuru: CWaitableTimer. Retrieved 14-Mar, 2003, from
http://www.codeguru.com/cpp/w-p/system/timers/article.php/c2837/
Wolf, J., Burgard, W., & Burkhardt, H. (2005). Robust Vision-Based Localization by
Combining an Image Retrieval System with Monte Carlo Localization. IEEE
Transactions on Robotics, 21(2), 208-216.
WowWee. (2006). RS Media. Retrieved 15-Dec, 2006, from
http://www.rsmediaonline.com/
Wyszecki, G., & Stiles, W. S. (1967). Color Science: John Wiley & Sons, Inc.
Yamauchi, B. (1997, Jul). A Frontier Based Approach for Autonomous Exploration.
Paper presented at the IEEE International Symposium on Computational
Intelligence in Robotics and Automation, Monterey, CA.
Yamazawa, K., Yagi, Y., & Yachida, M. (1995). Obstacle detection with
omnidirectional image sensor HyperOmni vision. Paper presented at the
IEEE International Conference on Robotics and Automation (ICRA).
Yu, W., Chung, Y., & Soh, J. (2004, Aug). Vignetting distortion correction method
for high quality digital imaging. Paper presented at the 17th International
Conference on Pattern Recognition.
Zelinsky, A. (1992). A mobile robot exploration algorithm. IEEE Transactions on
Robotics and Automation, 8(6), 707-717.
Zelinsky, A., Jarvis, R. A., Byrne, J. C., & Yuta, S. (1993). Planning Paths of
Complete Coverage of an Unstructured Environment by a Mobile Robot.
Paper presented at the International Conference on Advanced Robotics,
Tokyo, Japan.
Zhang, Z., Weiss, R., & Hanson, A. R. (1997). Obstacle Detection Based on