Content based Semi-Invariant
Search for
Natural, Symbolic and Sketch
Images
C.E. Tuke
Thesis submitted for the degree of
Doctor of Philosophy
University of York
Department of Computer Science
September 2004
Abstract
The overall aim of this thesis is to develop a working architecture that facilitates the
automatic recognition of single query image content from within a library of images,
producing similarity decisions comparable to human judgement on unconstrained image
content within a reasonable period (less than two minutes for an average query). The
secondary aim of this work is to investigate the implementation of an architecture that
can plausibly approximate human Gestalt grouping decisions in order to generate
description types useful for recognition. Images used in this work include linear,
greyscale and colour types depicting natural, cartoon, facial and symbolic content. The
architecture consists of a novel segmentation/grouping engine using a KD-Tree structure
to facilitate nearest neighbour decisions in an n-dimensional space and to generate a
multi-dimensional binary tree architecture of groups. These groups are then ranked by
salience, and a series of weighted description labels preserving group order and
description type is generated from the top n-groupings to form a library of descriptor
labels. For recognition we retrieve these labels from the pre-generated library, normalize
query and library labels against the library label set, and select the best correspondence
between the groups contained within the labels. The final stage is the evaluation of
general similarity from the corresponding label values, and the ranking and output of
results. Analysis shows that this architecture provides good results for natural colour
and greyscale images and reasonable results for symbolic and linear geometric image types.
Contents
Acknowledgements xiii
Declaration xv
1 Introduction 1
2 Recognition/Vision, Storage & Grouping Processes 5
2.1 Human Vision/Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Machine Vision/Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Gestalt Grouping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Colour Constancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.1 Assumptions of surface reflectance or illuminant properties in image 17
2.4.2 Assumptions of a finite image gamut . . . . . . . . . . . . . . . . . . 18
2.4.3 Measuring the illuminant indirectly from image properties . . . . . . 18
2.4.4 Limitations to Colour Constancy Techniques . . . . . . . . . . . . . 19
2.5 Invariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.1 Photometric Invariants . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.2 Geometric Invariants . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.3 Global Invariant Techniques . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.4 Local Invariant Techniques . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6.1 Measures used to guide segmentation . . . . . . . . . . . . . . . . . . 30
2.6.2 Region grouping processes . . . . . . . . . . . . . . . . . . . . . . . . 36
2.7 Signatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3 Preliminary Work and Experimentation 45
3.1 Signature Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Initial Experiments with Linear Gestalt segmentations . . . . . . . . . . . . 52
4 Gestalt Multi-Scale Feature Extraction 73
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 The Core Segmentation Algorithm . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 Seeding with Nearest Neighbours . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4 The Edge Evaluation Function . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.4.1 Generating a Multi-Scale Segmentation . . . . . . . . . . . . . . . . 87
4.5 A New Edge Evaluation Function . . . . . . . . . . . . . . . . . . . . . . . . 89
4.6 Updating Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.6.1 Appropriate Data Structures for K-Dimensional Nearest Neighbour
Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.6.2 Efficiently searching the KD Tree for nearest neighbours . . . . . . . 102
4.7 The Original Current State Group Description . . . . . . . . . . . . . . . . 116
4.8 The New Binary Tree Group Description . . . . . . . . . . . . . . . . . . . . 119
4.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5 Segment Ranking 127
5.1 Introduction and Motivation . . . . . . . . . . . . . . . . . . . . . . . . 127
5.1.1 Edge Based Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.1.2 Area Based Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.2 Combining Edge and Area Ranking . . . . . . . . . . . . . . . . . . . . . . . 131
5.3 Correspondence Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6 Generating Segment Group Descriptions 145
6.1 Overview Of Description Types . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.2 Photometric Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.3 Geometric Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.4 Pairwise Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.5 Preprocessing for Geometric Descriptions . . . . . . . . . . . . . . . . . . . 153
7 Searching the Label Database 155
7.1 Description Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
7.2 Normalizing Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.3 Weighting Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.4 Searching the Database Labels for Image Similarity . . . . . . . . . . . . . . 158
8 Evaluation of final algorithm performance 163
8.1 General effectiveness with different image types . . . . . . . . . . . . . . . . 163
8.2 Effects of global user weighting on recognition . . . . . . . . . . . . . . . . . 170
8.3 Tolerance to realistic transformations . . . . . . . . . . . . . . . . . . . . . . 172
8.4 Comparison with human decisions . . . . . . . . . . . . . . . . . . . . . . . 172
9 Conclusions 179
9.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
10 Self-Similar Convolution Image Distribution Histograms 185
11 The Linear Gestalt Grouping Algorithm and Data Types 189
12 The n-Dimensional KD Tree Structure and Algorithms 197
13 Combined Gestalt Grouping Algorithms and Results 207
14 Database Search Algorithms 213
List of Tables
2.1 Calculating two-dimensional correlation invariants . . . . . . . . . . . . . . 21
2.2 Calculating Moment invariants . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Common local invariant architectures . . . . . . . . . . . . . . . . . . . . . 22
2.4 Degrees of freedom (order) of primitive geometric image features . . . . . . 23
2.5 The four main levels of local geometric invariance . . . . . . . . . . . . . . . 24
4.1 Calculating two-dimensional correlation invariants . . . . . . . . . . . . . . 81
4.2 Minimum requirements for region description. . . . . . . . . . . . . . . . . . 96
6.1 Levels of geometric invariance . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.2 Types of colour description . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
List of Figures
1.1 Library Generation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Query Generation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Illusion caused by mid-level processing . . . . . . . . . . . . . . . . . . . . . 8
2.2 Illusion caused by mid-level processing . . . . . . . . . . . . . . . . . . . . . 9
2.3 Simultaneous Contrast I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Simultaneous Contrast II . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Simultaneous Contrast - Snake illusion . . . . . . . . . . . . . . . . . . . . . 11
2.6 Simultaneous Contrast - Colour Channels . . . . . . . . . . . . . . . . . . . 11
2.7 Gestalt Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.8 Angle between vectors, a similarity invariant . . . . . . . . . . . . . . . . . 24
2.9 The ratio of linear lengths, a similarity invariant . . . . . . . . . . . . . . . 25
2.10 Four point area ratio, an affine invariant. . . . . . . . . . . . . . . . . . . . 25
2.11 Ratio of parallel line lengths, an affine invariant . . . . . . . . . . . . . . . . 26
2.12 Geometric Hashing, two reference line segments are used to define a local
geometry that will remain unaffected by affine transformations. . . . . . . . 26
2.13 A point, tangent and two lines can form an affine invariant . . . . . . . . . 26
2.14 The Shape Query Using Image Database recognition system uses invariant
curve based signatures built up from different degrees of boundary smoothing 27
2.15 The point on the curve furthest from the endpoint vector is affine invariant 27
2.16 The projective invariant Cross Ratio. . . . . . . . . . . . . . . . . . . . . . . 27
2.17 A Two Dimensional Signature . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.1 The set of 12 natural query images . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 The set of 12 natural images searched . . . . . . . . . . . . . . . . . . . . . 46
3.3 Difficulties arising from the use of the neighbour region constraint and
boundary based culling. Generation of pairwise invariant signatures would
be impossible in these examples, where the grey areas are culled and each
remaining region therefore has no immediate neighbour. . . . . . . . . . . . 48
3.4 Direct storage of binary invariant signature. . . . . . . . . . . . . . . . . . . 49
3.5 Initial non-binary signature blurring and final binary signature. . . . . . . . 49
3.6 Initial non-binary signature blurring, using area bias and the final binary
signature calculated from mean value. . . . . . . . . . . . . . . . . . . . . . 50
3.7 Multi-scale central moment storage . . . . . . . . . . . . . . . . . . . . . . . 51
3.8 Examples from the Gestalt library . . . . . . . . . . . . . . . . . . . . . . . 60
3.9 Output from the Linear Gestalt Grouping algorithm . . . . . . . . . . . . . 61
3.10 Scale and region grouping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.11 Gestalt Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.12 The basic architecture of Thorisson’s algorithm . . . . . . . . . . . . . . . . 62
3.13 Minimum Boundary Distance versus Region Centroid Distance . . . . . . . 63
3.14 LGG algorithm example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.15 Resulting groupings in artificial images using LGGA . . . . . . . . . . . . . 64
3.16 Effect of description elements . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.17 Confusing the LGG algorithm - Clustering . . . . . . . . . . . . . . . . . . . 65
3.18 Images that should confuse the LGG algorithm - Context and Complexity 65
3.19 Linear regions output from the LGG algorithm . . . . . . . . . . . . . . . . 66
3.20 Linear regions output from the LGG algorithm . . . . . . . . . . . . . . . . 67
3.21 Segmentations using the LGG algorithm on a composite of test images . . . 68
3.22 Perceptually insignificant Linear groupings . . . . . . . . . . . . . . . . . . . 69
3.23 Resulting groupings in more complex photographic images using LGGA . . 70
3.24 Iterating the LGG algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.25 Connections that could be used to bridge regions . . . . . . . . . . . . . . . 72
4.1 Image grid based 8 nearest neighbours . . . . . . . . . . . . . . . . . . . . . 76
4.2 Segmentation primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3 Interior Difference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.4 Development of a segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.5 Example of a texture that can not be properly merged into a single segment
using only 8 neighbourhood edges . . . . . . . . . . . . . . . . . . . . . . . . 82
4.6 Showing how smoothing can decrease the 8 neighbour image grid problem . 83
4.7 Edges chosen using 8 nearest neighbours . . . . . . . . . . . . . . . . . . . . 83
4.8 Circular arrangement of larger segments . . . . . . . . . . . . . . . . . . . . 84
4.9 Example of the improved segmentation using Nearest Neighbour Seeding . . 92
4.10 Original decision function and the effect of the k component . . . . . . . . 93
4.11 KD Tree explained . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.12 Determining if a KD branch could contain a nearest neighbour . . . . . . . 105
4.13 Modifiable KD Tree Build Algorithm . . . . . . . . . . . . . . . . . . . . . . 106
4.14 Half rib orthogonal list structure . . . . . . . . . . . . . . . . . . . . . . . . 107
4.15 8NN Modifiable KD-Tree results surface and Exhaustive 8NN results. . . . 107
4.16 8NN Modifiable KD-Tree results with Exhaustive 8NN results subtracted
and decreasing numbers of search points. . . . . . . . . . . . . . . . . . . . . 108
4.17 8NN Modifiable KD-Tree results expressed as percentage difference to Ex-
haustive 8NN results, over decreasing numbers of search points. . . . . . . . 109
4.18 3D chart showing the performance of 8-NN algorithms . . . . . . . . . . . . 110
4.19 Modifiable KD Tree performance. . . . . . . . . . . . . . . . . . . . . . . . . 111
4.20 Performance of Nearest Neighbour algorithms . . . . . . . . . . . . . . . . . 111
4.21 Effects of dimensionality on 8-NN Search algorithms . . . . . . . . . . . 112
4.22 3D Chart Showing the performance of 8NN Search using Modifiable KD-Trees113
4.23 Showing optimal values for 8NN Modifiable KD-Tree search . . . . . . . . 114
4.24 Showing the segmentation process, with a delayed update . . . . . . . . . . 115
4.25 Current state grouping/segmentation process . . . . . . . . . . . . . . . . . 117
4.26 Merging and erasing parent groups connected by an edge . . . . . . . . . . 118
4.27 Merging parent groups without deletion in a tree based structure . . . . . . 121
4.28 Merging the same groups across different edges in the tree based structure. 121
4.29 Binary tree grouping/segmentation process . . . . . . . . . . . . . . . . . . 122
4.30 Benefits of retaining groups from previous generations . . . . . . . . . . . . 124
4.31 Drawbacks to retaining links from previous generations . . . . . . . . . . . . 124
4.32 Increase in time taken by grouping/segmentation engine as image dimension
increases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.1 Contradictions when using edge values to rank segment groups. . . . . . . . 129
5.2 Segment group scoring based upon edge values. . . . . . . . . . . . . . . . . 130
5.3 Results relating to segment groups that form the basis for the Founding
Edges used. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.4 Founding Groups and Image Type, an example. . . . . . . . . . . . . . . . . 132
5.5 Demonstrating the difference between Founding Edges and Parental Edges. 133
5.6 Size and Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.7 Segment group ranking results, natural . . . . . . . . . . . . . . . . . . . . . 137
5.8 Segment group ranking results, mondrian . . . . . . . . . . . . . . . . . . . 138
5.9 Results of combined Parental Edge and minimum Parental area ranking . . 139
5.10 Directly equivalent segment groups are important to recognition in photo-
graphic material . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.11 Segment ranking based upon correspondence . . . . . . . . . . . . . . . . . 141
5.12 Example showing final groups extracted by rank, example 1 . . . . . . . . . 142
5.13 Example showing final groups extracted by rank, example 2 . . . . . . . . . 143
6.1 RGB and HLS Space diagrams . . . . . . . . . . . . . . . . . . . . . . . . . 154
7.1 Differences in properties such as size and intensity during segmentation . . 160
7.2 Simplified example of direction dependence of recognition . . . . . . . . . . 161
7.3 Screenshot of algorithm output . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.1 Samples from the natural image set . . . . . . . . . . . . . . . . . . . . . . . 164
8.2 Samples from the facial image set . . . . . . . . . . . . . . . . . . . . . . . . 164
8.3 Samples from the cartoon image set . . . . . . . . . . . . . . . . . . . . . . 164
8.4 Samples from the Gestalt/symbolic image set . . . . . . . . . . . . . . . . . 165
8.5 Recognition score results for the natural image set . . . . . . . . . . . . . . 165
8.6 Contributions of description types to final score . . . . . . . . . . . . . . . . 166
8.7 Comparing rankings generated from greyscale and colour versions of the
same image queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.8 Correlation between component descriptor scores for both greyscale and
colour natural images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.9 Correlation between component descriptor rankings for both greyscale and
colour natural images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
8.10 Comparing rankings generated from greyscale and colour versions of the
same image queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
8.11 Greyscale natural image recognition performance with different global weight-
ings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
8.12 Changing context after transformation . . . . . . . . . . . . . . . . . . . 174
8.13 Similarity score performance over increasing rotation . . . . . . . . . . . . . 174
8.14 Similarity score performance over increasing translation . . . . . . . . . . . 175
8.15 Similarity score performance over Affine transformations(stretch) . . . . . . 175
8.16 Similarity score performance over Affine transformations(squeeze) . . . . . . 176
8.17 The colour rotation sample set . . . . . . . . . . . . . . . . . . . . . . . . . 177
8.18 The greyscale affine sample set . . . . . . . . . . . . . . . . . . . . . . . . . 177
8.19 Screenshot from the ‘Human-Like Survey’. . . . . . . . . . . . . . . . . . . . 178
9.1 Directing Edge Selection using a 2nd Order Edge Search Space and move-
able origin. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
10.1 Showing the invariance of self-similar convolution image histograms to im-
age transformations (but NOT perspective) . . . . . . . . . . . . . . . . . . 186
10.2 Screen-shot of a typical query and database response using the self similar
geometric histogram as an identifier. . . . . . . . . . . . . . . . . . . . . . . 187
Acknowledgements
I would like to thank my supervisor, Simon O’Keefe, and my assessor, Jim Austin, for
their invaluable support and advice, providing me with the opportunity to work at the
University of York in the first place. Thanks to the University of York and the Computer
Science department as a whole for making me feel welcome and providing a relaxed and
friendly atmosphere within which to complete this work. Special thanks to Tori Kemp
for the support and supplies, Joanna Moy for sharing the trials and tribulations of the
PhD experience, and a final thanks to my family for their invaluable encouragement and
support throughout the years.
Declaration
I declare that this thesis has been completed by myself and that, except where indicated
to the contrary, the research documented is entirely my own.
Charles Edward Tuke
Chapter 1
Introduction
The proliferation of image-based content over the Internet, coupled with increased
processor capabilities and the ready availability of image capture hardware for personal
computers, has largely been responsible for fuelling demand for a new generation of
methods to efficiently process the resulting visual imagery. There exists a particular
need to process visual imagery in a much more human fashion. Human-like recognition is
required in order to make sense of complex imagery very quickly and to base decisions
upon the abstract objects contained within images rather than their low-level intensity
information. The ability to process visual information in a human-like way is
particularly important when considering sketch and symbolic inputs, where the similarity
of equivalent sketch images is impossible to establish using direct low-level comparison
techniques and higher-level assumptions about image content are required. Another task
humans perform very well when compared to machine recognition is recognizing both
three-dimensional object similarity and symbolic similarity from limited two-dimensional
image information. While much of this success is due to the human ability to abstract and
compare to prior experience, experiments comparing recognition when using scrambled
shapes [Cutzu96, Edelman95] indicate that similarity can be determined in
three-dimensional objects from unfamiliar arrangements of localized object subcomponents
(although taking more time, and with a reduction in accuracy).
The proposed approach in this work is that, through the use of Gestalt segmenta-
tion/grouping processes to generate multiple semi-invariant descriptions 1, it is possible to
efficiently rate image similarity based on derived descriptions that capture both symbolic
and literal content from unconstrained image content. Such an approach will facilitate
recognition in photographic (regardless of perspective and viewing condition change), ar-
tificial, symbolic and sketch images without prior training. The architecture will have
1 Invariant descriptors are properties of an image object/shape that remain constant
regardless of image transformations.
Figure 1.1: Overview of the proposed library generation architecture.
three main elements: a Gestalt segmentation/grouping algorithm that is capable of
capturing symbolic groups; the creation of semi-invariant description labels from these
groups; and a recognition algorithm that can compare these descriptions (figures 1.1,
1.2). Results should approximate, and will be compared to, human similarity decisions.
Figure 1.2: Overview of the proposed image search process; pre-generated library labels
can be compared against a query label for fast similarity matching.
Chapter 2
Recognition/Vision, Storage &
Grouping Processes
The primary task of this thesis is to facilitate, as far as possible, machine recognition of
unconstrained single image content to give the impression of human-like recognition. In
order to approximate human-like performance we require the recognition of two dimen-
sional images and their content, which may be either two dimensional or three dimensional,
natural, artificial or symbolic in nature.
It is clear that if we wish to generate any recognition algorithm that gives the impres-
sion of human-like judgement then key aspects of human perceptual organization need to
be inherent in whatever primitives we extract from raw image content, regardless of the
final representation. The grouping of raw image information into specific visual objects
is of fundamental importance to this process; the segmentation stage of our algorithm
needs to supersede traditional approaches to include a wider range of Gestalt grouping
principles. While our immediate requirements are a practical approximation, an elegant
and plausible architecture for simulating human visual grouping would be ideal. Static
thresholds, or the reliance upon prior training, should be avoided in this work to allow
unconstrained recognition and a more globally justifiable solution. As will be further
discussed in section 2.3, the application of Gestalt principles is likely to require the abstraction of
image groups into a higher dimensional space of extracted and changing object properties.
While the efficient processing of high dimensional spaces is more problematic than group-
ing based purely around image pixel information, it does reduce many Gestalt principles
to the same inclusive framework. Similarity and proximity principles reduce to the same
proximity calculation within such a space, common fate and continuation also reduce to
the same measure and certain aspects common to colour constancy algorithms may also
be generated. The Gestalt principle of completion should be partially attained through
the use of convex hull based descriptors. Closure and simplicity still represent a challenge,
although it is anticipated that the application of convex hulls along with proximity, similar-
ity and continuation rules will facilitate these in the general case. Such an approach could
also facilitate future expansion to video image analysis, with the easy incorporation of a
time dimension that would naturally expand the perceptual grouping to include periodic-
ity principles and common fate over time. Research [Palmer96, Palmer00] indicates that
any such grouping process should operate, as far as feasible, at multiple scales (with
previous groupings forming parents to new groupings) and in as parallel a manner as possible. Although
true parallelization is likely to be beyond the scope of this work, the retention of previous
groupings between successive generations may go some way towards approximating this.
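
To make this concrete, a minimal sketch follows (in Python, with a hypothetical feature
layout and hypothetical weights; the actual engine is developed in chapter 4). It shows
how proximity and similarity, and with a time dimension common fate, all reduce to the
same nearest neighbour distance in one combined feature space:

    import numpy as np

    # Each region becomes a point in an n-dimensional feature space:
    # (x, y, luminance, hue_x, hue_y). Spatial proximity and appearance
    # similarity then reduce to the same distance calculation.
    regions = np.array([
        [10.0, 12.0, 0.8, 0.2, 0.1],
        [11.0, 13.0, 0.7, 0.2, 0.1],
        [80.0, 75.0, 0.2, 0.9, 0.4],
    ])

    # Hypothetical per-dimension weights trading spatial proximity
    # against appearance similarity.
    weights = np.array([1.0, 1.0, 50.0, 50.0, 50.0])

    def feature_distance(a, b):
        """Weighted Euclidean distance in the combined feature space."""
        return np.sqrt(np.sum(weights * (a - b) ** 2))

    # The nearest neighbour under this metric is the strongest grouping
    # candidate: regions 0 and 1 group long before 0 and 2.
    print(feature_distance(regions[0], regions[1])
          < feature_distance(regions[0], regions[2]))  # True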
Once we have these Gestalt geometric primitives, we will require a method of extracting
and efficiently storing information essential for fast recognition between images. The use
of invariant descriptions represents a simple and efficient method of overcoming the
detrimental effects of changing viewpoint, image transformations and lighting change
between images, and should enable us to approximate much of the invariant low-level
grouping that occurs in human vision. Signature representations seem a promising and
convenient method of storage for these invariant measures and should allow the level of
control and generalization required to simulate the fast 'first impression' aspect of
human vision. Through the careful choice of a range of descriptors that capture basic
qualities of image content, such as colour, texture, shape, size and position, it is
anticipated that such an algorithm should be able to generate results similar to
human judgements of similarity that do not rely heavily on prior knowledge/experience.
The following sections present a brief overview of the key areas involved in achieving
the above: vision, Gestalt grouping, preprocessing, invariance, segmentation and storage.
2.1 Human Vision/Recognition
Human perception of image content and shape recognition is still an area of much
ongoing debate. Where three dimensional shape recognition is concerned, two prominent
theories have emerged. [Marr82] and [Biederman87] advocate that recognition is due to
the encoding of shape information in terms of its three dimensional components; the
other approach is that the internal representation is a collection of two dimensional
views [Bulthoff95]. Variations on the latter approach also propose that recognition relies
upon the indexing of three dimensional shape by properties that remain similar regardless
of viewpoint, an approach paralleled by research into invariant representations (see section
2.5). If perfect invariance were the only method of recognition used in human vision (i.e. no
information about viewing transformations was registered by the viewer) then we would be
unable to perceive the transformations that the recognized object had undergone, which
is not the case. Conversely, research into the impact of extrinsic cues [Christou03] on
shape recognition (taking care to remove the possibility of the target object being encoded
with respect to its surrounding context) provides evidence that contextual information
about the viewpoint transformation increases the accuracy of recognition, although has
minimal impact upon response times. The impact of viewpoint transformations upon the
speed and accuracy of human recognition seems to vary according to the nature of the
specific recognition tasks. [Christou03] found that recognition performance decreased as
target objects were rotated away from the initial viewpoint, with performance decaying
as rotations approach 90° and improving slightly as the object is rotated toward 180°.
This, coupled with participants' verbal feedback, provided some indication that egocentric
encoding of two dimensional correspondences between object features plays at least some
part in the internal encoding of shape information in the human mind. Such encoding
would be least effective at right angles to the learnt viewing position. Contrary to this,
[Tarr90] finds that in certain cases recognition performance remains constant under
target transformation. It would appear that both orientation
dependence and invariance play a part in human recognition.
In realistic situations, object recognition is supplemented by the full range of hu-
man senses, reasoning and previous experience. While attempts to simulate aspects of
higher level human reasoning have been made (Bayesian reasoning and mid-level pro-
cessing [Forsyth98]), these methods often suffer from insufficiently consistent and com-
parable internal representations or image primitives. While recognition performance is
increased when extrinsic visual cues to movement around an object of interest are pro-
vided, [Simons98] shows an even greater improvement where the subject performs their
own movements around the object. This indicates that the subject's knowledge of their
body's position and movement in space is also an important factor in object recognition.
Inclusion of such non-visual stimuli, chaotic processes or any higher order of reasoning
is beyond the scope of this work, which will be based around limited two dimensional visual
cues. Fortunately, work in pre-attentive grouping [Yu01], segmentation and recognition
indicates that much of the initial recognition process is performed very quickly at a
relatively low level. Whilst still a non-trivial problem, this level of performance should be
achievable through machine recognition algorithms. Humans possess the ability to quickly
make basic distinctions between visual objects on a subconscious level, quickly determin-
ing the presence of familiarity or potential threats in the environment. This ability for
fast recognition and generalization of complex stimuli bears some similarity to properties
inherent in machine encoding using signatures (see section 2.7). Evidence suggests that
these initial responses are then refined, with conscious reasoning and recognition of other
factors such as viewpoint change and lighting conditions occurring later in the process.
While internal representations of objects are variant to viewpoint change and knowledge
Figure 2.1: A simple demonstration of Opponent Colour Theory and human sensitivity
adjustment to image content. Stare at the green square for a while, then quickly shift your
eyes to the spot in the adjacent white square. The majority of people with normal colour
vision will see a red square, which is the opponent colour to green; this also shows that
the eye has become de-sensitized to green.
of viewpoint change can increase accuracy of recognition, changes in lighting conditions
are largely ignored during the preliminary subconscious stages of recognition. The initial
grouping of raw visual information (analogous to an image segmentation) appears to incor-
porate a natural normalization of colour and intensity values at a low level. The presence
of lateral inhibition in centre-surround cells in human vision [Adelson00] (where light on
the centre cells is excitatory but light on the surround is inhibitory) suggests that the
actual signal transmitted to the brain is a normalized, edge based, description discarding
the absolute light intensity information received from cones (the Opponent Colour Model
[Hering64]). That centre-surround cells receive light information in red-green, yellow-blue
and black-white pairs explains why human vision cannot register certain colour combina-
tions such as greenish reds, yellowish blues or blackish whites, while yellowish reds and
bluish reds are easily perceived. CIE (Commission Internationale de l’Eclairage) Standard
Observer data based around Opponent Colour Theory show that de-sensitization to a
dominant colour stimulus occurs in human vision. Another result of this model of
human vision is the colour latency effect (figure 2.1).
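
The centre-surround behaviour described above can be approximated by a
difference-of-Gaussians filter. The sketch below is an illustrative approximation only,
not a model of the actual retinal circuitry, and the sigma values are arbitrary:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def centre_surround(image, sigma_centre=1.0, sigma_surround=3.0):
        """Excitatory centre minus inhibitory surround. A uniform field
        produces zero response, so absolute intensity is discarded and
        only local, edge-like contrast is transmitted."""
        return (gaussian_filter(image, sigma_centre)
                - gaussian_filter(image, sigma_surround))

    # A flat field yields (near) zero everywhere; an edge yields a
    # strong signed response either side of the transition.
    flat = np.ones((32, 32))
    edge = np.hstack([np.zeros((32, 16)), np.ones((32, 16))])
    print(np.abs(centre_surround(flat)).max())  # ~0
    print(np.abs(centre_surround(edge)).max())  # clearly non-zero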
If this is the case, then perceptual grouping and the perception of light in the brain can
only ever be determined in relation to the surrounding visual context. The evolutionary
advantage to this form of perception is that low level grouping and recognition of similar
objects can take place with a large degree of invariance to change in real-world lighting
conditions. Attempts to simulate this process in machine vision have resulted in the fields
of photometric invariance and colour constancy (see section 2.4), which usually pre-filter
raw image information before any significant grouping/segmentation processes occur. A
Figure 2.2: Left and centre images show the effect of mid-level processing: the central
square in both images appears to be a different shade to the human observer even though
it is in fact the same. The snake illusion on the right shows that such illusions can be both
local and contradictory (the four diamond squares are the same intensity).
thorough investigation into colour vision theory can be found in [Brainard01]. In reality,
work using optical illusions (figure 2.2) indicates that both grouping and light percep-
tion are fundamentally linked, with local geometric features also affecting the perception
lightness in low level vision. [Adelson93][Adelson00] shows that there is a two way rela-
tionship between the perception of colour and shading and the geometry of visual stimuli.
Three dimensional cues appear to have a dominant role to play in the human perception
of image content. Where shading is apparent in an image, these effects (intrinsic illumi-
nation image) are removed and the perceived lightness of the surface is greater than the
actual intensity registered on the retina [Gilchrist97]. In many cases, these perceptions of
colour difference are so compelling that test subjects refuse to believe that the physical
intensities are the same in the optical illusions. Whether this perception of shading and
transparency is the result of three dimensional interpretation or a more basic reaction to
two dimensional local configurations within image content is still unresolved.
Simultaneous contrast may be one of the two dimensional low-level processes that
facilitates this compensation for three dimensional lighting, shading and transparency
effects. Simultaneous contrast effects are the result of a localized normalization of object
intensity in the context of the surrounding neighbourhood intensity. Figures 2.3 and
2.4 demonstrate the difference between perceived intensity and true intensity: objects
with a darker background appear lighter than they really are, and objects with lighter
backgrounds appear darker than their true intensity values.
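
Simultaneous contrast viewed as localized normalization can be sketched very simply:
each pixel is re-expressed relative to the mean of its surround. This is an illustration
of the principle only, not a perceptual model, and the window size is arbitrary:

    import numpy as np
    from scipy.ndimage import uniform_filter

    def perceived_intensity(image, window=15):
        """Normalize each pixel against its local surround mean: the same
        physical intensity scores higher on a dark background (appears
        lighter) and lower on a light one (appears darker)."""
        return image - uniform_filter(image, size=window)

    # The same grey patch (0.5) on a dark (0.1) and a light (0.9) background.
    dark_bg = np.full((31, 31), 0.1)
    dark_bg[13:18, 13:18] = 0.5
    light_bg = np.full((31, 31), 0.9)
    light_bg[13:18, 13:18] = 0.5
    print(perceived_intensity(dark_bg)[15, 15])   # positive: looks lighter
    print(perceived_intensity(light_bg)[15, 15])  # negative: looks darker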
Of the two figures, figure 2.3 provides the best representation of Simultaneous Contrast
because of the lack of geometrical interference. The more extreme apparent intensity
differences in figure 2.4 cannot be totally explained in this way, as is demonstrated in
Figure 2.3: Simultaneous contrast in action. Both inner circles are in fact the same inten-
sity, but appear very different to the human observer due to their different surrounding
context.
Figure 2.4: The diamond shapes in the left hand examples are exactly the same intensity,
although they appear to the observer as very different shades. The right hand images
show the same examples compensated to appear the same intensity, although the actual
intensities are nearly twice the original. This is due to simultaneous contrast occurring
in human vision, where the different surrounding contexts to the diamonds have a large
impact upon their perceived brightness.
Figure 2.5: Snake illusion with its ‘anti-snake’ image on the left.
Simultaneous Contrast cannot account for the effectiveness of this illusion of intensity
difference as the anti-shadow image (left) does not show much difference in diamond in-
tensity, even though the surrounding background intensity is the same as the illusion.
It is only as the shadow geometry is added that the difference becomes more noticeable.
Figure 2.6: If simultaneous contrast effects relied upon summative or average intensity
values, rather than being calculated from individual colour channels/receptors, then these
three examples would all contain similar diamonds. The left image uses just the green
colour channel. The centre image introduces spurious blue elements which equal the same
summative (R+G+B) values as the left. The right image contains spurious blue elements
which equal the same average intensity (R+G+B)/3.
figure 2.5. If the comparison of immediate intensity background was responsible for the
effect in this image then the left hand image would show a much greater apparent intensity
difference between the top and bottom rows of diamonds. It is not until the full background
geometry surrounding the diamonds is replaced (giving the Gestalt impression of two dark
shadow strips across the pattern) that the intensity difference is apparent. The logical
explanation for such geometrical effects is that the eye is automatically interpreting the
image in a natural world context and adjusting intensity values to compensate for shadows
and transparencies [Adelson93].
Greyscale and single colour channel examples in figures 2.4 and 2.3 exhibit the same
effect; this indicates that simultaneous contrast is likely to be occurring independently in
the separate colour channels. Figure 2.6 demonstrates that, even though mean intensity
may remain the same, the introduction of other (disproportionate) colour elements disrupts
the simultaneous contrast effect.
Image normalization relative to surrounding image geometry appears to be a dominant
feature of this mid-level processing. All normalization processes require some form of
anchor feature from which to generate a canonical system. Research [Adelson00] indicates
that the following three major factors determine how an image is perceived (a minimal
sketch of the first two rules follows the list).
1. ‘Highest Luminance Rule’ - The surface of highest luminance in a scene tends
to be anchored to the perception of white [Gilchrist97] (used by Land and
McCann [Land71] in combination with Retinex theory)
2. ‘Largest Area Rule’ - A secondary factor, the largest visible area will tend to
appear white [Gilchrist97].
3. ‘Articulation’ - The size of colour constancy backgrounds (the surrounding
contextual area that affects the appearance of object colour) decreases as
overall colour and shape complexity in the image increases.
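
The promised sketch illustrates the first two anchoring rules in simplified form. This
is an illustration, not the cited models; the 'Largest Area Rule' is approximated here
by anchoring the most frequently occurring luminance level:

    import numpy as np

    def anchor_highest_luminance(image):
        """'Highest Luminance Rule': rescale so that the brightest
        surface in the scene is anchored to white (1.0)."""
        return image / image.max()

    def anchor_largest_area(image, n_bins=16):
        """'Largest Area Rule' (simplified): anchor the most frequently
        occurring luminance level to white instead."""
        hist, edges = np.histogram(image, bins=n_bins)
        dominant = 0.5 * (edges[hist.argmax()] + edges[hist.argmax() + 1])
        return image / dominant

    scene = np.random.default_rng(0).uniform(0.1, 0.6, (64, 64))
    print(anchor_highest_luminance(scene).max())  # exactly 1.0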
Another important factor when considering the segmentation of natural images is the
human perception of texture. Although often considered different disciplines, colour and
texture are fundamentally linked in the human perception of image content. Traditional
texture segmentation algorithms rely on localized structure or statistical clues to group
areas of an image together. The size of the context of a texture element has a fundamental
impact upon the perception of texture boundaries [Bergen88]. A method of combining
texture with colour in grouping processes is required if true human-like grouping is to be
simulated in machine vision.
Whatever preliminary processes occur in human vision, it is apparent that at some point
the raw visual input is abstracted and grouped in such a way that the scene becomes
a collection of perceived objects. Conscious perception of image content is in terms of
the human interpretation of the scene, not the actual intensity values of the image. For
example, an image of a chair is first perceived as a chair-like thing before the finer detail
(such as what it may be made of, or its orientation) is discerned. This ability to quickly
reach a preliminary impression of what an object may be provides further evidence for
signature-like internal representations in human vision, and at least some form of top
down processing or refinement process. Whether human visual grouping is a top down
process, bottom up process or a combination of the two, is still an unresolved issue. The
grouping of raw visual information into meaningful entities equates, at a simplified level,
to segmentation processes (see section 2.6) in machine vision. Whilst most machine vi-
sion approaches, including that of this work, cannot take into account the contribution
of higher level understanding and memory to image understanding, the study of Gestalt
processes (section 2.3) indicates that raw visual information can successfully be segmented
into perceptual groups that can form useful and simplified descriptions of image content
without the requirement for actual understanding. Further to traditional segmentation
rules, Gestalt grouping can facilitate the generation of symbolic correspondences not ex-
plicitly contained within an image, such as the perception of a larger circle from a loosely
circular arrangement of other features within the image. It is anticipated that the inclu-
sion of Gestalt grouping principles (and the ability to recognize more abstract symbolic
content) is one of the key ways in which this work will be able to give the impression of
human-like performance.
2.2 Machine Vision/Recognition
Machine vision is primarily concerned with the construction of meaningful and useful de-
scriptions based upon real world objects present in images. In this respect, computer vision
is attempting to emulate aspects of human visual perception and therefore is integrated
closely with the field of Artificial Intelligence. The most common tasks in computer vision
are generally related to the labelling, or segmentation, of input images into meaningful
objects. While at first glance this may seem a trivial task, the difficulties involved in
this basic vision process are compounded by the fact that this initial level of processing
is largely a subconscious activity in human vision. Where humans see a world of mean-
ingful shapes, textures and objects automatically, there is currently no formalized correct
method of achieving, or even evaluating, these abilities in machine vision. Even if hu-
man vision and recognition were thoroughly understood, computer technology available at
present has so far been unable to fully model the general recognition and understanding
capabilities present in human vision. Current research in machine vision usually repre-
sents a compromise between performance and speed, and is often hampered by the need to
re-invent the most basic, but inaccessible, talents of biological visual systems. This has
resulted in fragmented approaches to machine vision with large degrees of specialization
in different machine vision applications such as face recognition, geometry reconstruction,
satellite image analysis and trademark recognition.
Constructing a meaningful interpretation of a natural perspective image is a non-
trivial task, and in the case of single images can have no unique mathematically correct
solution. In the case of most photographic content recognition, the objects in an image
that need to be recognized and labelled are the projected images of three-dimensional
objects. The appearance of such objects can change dramatically at different orientations,
lighting conditions and when under occlusion. By the time conscious thought is applied
to the visual process the scene has already been segmented into separate objects in a
manner largely invariant to the multitude of possible differences in object appearance.
Object surfaces are seen as continuous, even where they may exhibit a reflective
highlight, and colours are initially perceived as being constant even in the presence of
shadows and other lighting effects. These low-level capabilities, that we take for granted,
are essential for the initial interpretation of image content, upon which logical reasoning
can later be applied. This suggests that the generation of representations that are to some
extent invariant to these unwanted environmental effects represents an important area of
computer vision research.
The difficulty with single images made up from real world projections is that the depth
component of objects is completely lost. The perceived shape and shading components
of the image are intrinsically connected to this missing information, with object occlusion
further complicating the task of generating a meaningful description of the scene. Although
it is impossible to uniquely determine the three dimensional information of objects depicted
by a single image, the use of assumptions about content, along with geometry cues, allows
the generation of a hypothesis or ‘best fit’ description of a single image scene. Although
human visual processes primarily operate in a stereoscopic domain, they are still capable
of generating accurate descriptions from single image sources such as photographs.
At the base level, domain knowledge consists of lighting, shading and perspective cues
where a basic underlying set of object and lighting properties are assumed. The description
of a scene can be further enhanced if higher-level domain or content knowledge is available
to refine the image description; this will not be the case in this work. Many processing
architectures employ a feedback process, where high-level information derived from low
level processes is fed back into the system to refine base assumptions about image content.
Due to the incomplete nature of single image representations of three dimensional
scenes, the imposition of assumptions about scene illumination, object surface properties,
or geometry type is essential to enable the reasoned analysis of the scene as a three di-
mensional entity. Shape from shading algorithms, such as those used in face recognition,
usually assume that surfaces have constant surface reflectance properties and either ambi-
ent or single point light sources. Projective invariants (explained on page 25) rely on the
assumption that the component features will be coplanar or co-linear in nature.
2.3 Gestalt Grouping
Gestalt principles derive from experimental observations related to human perceptual
grouping. This represents the low level process in which humans make sense of noisy
or incomplete image content, often extracting more geometrical and symbolic informa-
tion than an image may literally contain [Levine85]. Significant structures are perceived
and grouped almost instantaneously, at many levels of abstraction. Perceptual (Gestalt)
grouping is commonly associated with preconstancy, the grouping of raw visual informa-
tion into significant structures before any significant colour/lightness constancy occurs
in the human visual system. However, [Rock92, Rock64] indicates that some degree of
grouping occurs after the perception of lightness constancy and stereoscopic depth per-
ception. Gestalt grouping is a multi-level process, with large-scale groupings being formed
from previous groups, illusory contours and amodal completion [Palmer96, Palmer00].
The point at which visual Gestalt grouping actually occurs is still an unresolved issue
[Vecera97, Schulz03], although evidence suggests that it may well be a fundamental com-
ponent to many aspects of perceptual grouping. It is likely that Gestalt processes may
represent a global mechanism that accounts for all low-level perceptual processes, from
grouping, simultaneous contrast and texture to colour constancy and even certain aspects
of three dimensional interpretation. The following commonly accepted list of Gestalt prin-
ciples can be used to approximate what occurs at a fundamental stage of human perceptual
grouping.
• Proximity -
Objects in close proximity are likely to be grouped.
• Similarity -
Objects similar in appearance are likely to be grouped.
• Continuity -
Objects on a continuous path are likely to be grouped.
• Closure -
Boundaries/groupings will have a tendency to be perceived as closed paths.
• Completion -
Groupings will be perceived as complete objects, even if partially obscured.
• Region -
Objects contained within the same visual structure are likely to be grouped.
• Connectedness -
Objects connected together by other features are likely to be grouped.
• Common Fate -
Objects with the same degree of change in appearance or position are likely to be
grouped.
• Periodicity -
Objects that appear at the same time/frequency are likely to be grouped.
• Simplicity -
The simplest interpretation and grouping of image content will be favoured.
Figure 2.7: Examples of common Gestalt principles: (a) Proximity; (b) Similarity;
(c) Continuity, Closure, Simplicity; (d) Common Fate; (e) Closure, Completion.
There is some degree of overlap between these principles, as closure can often be con-
sidered the result of continuity, completion and simplicity. Similarly, if objects are
considered in a high-dimensional space including aspects such as size, position and ap-
pearance, then it follows that common fate is a result of continuation within this space.
Although this work is based around single image sources, video sequences could well in-
crease the dimensionality of this space to include time, in which case periodicity is an
implicit result of continuity in this space. Unfortunately, calculating continuity in such a
large dimensional space is a non-trivial task, although this thesis will go some way towards
achieving this.
While these Gestalt grouping principles are well known, the actual implementation
and ordering of these into an algorithmic framework remains highly problematic due to
the parallel and multi-level nature of the task. Localized proximity and similarity are
implicit to most segmentation algorithms, but the connection, continuity and closure of
disparate objects on a multi-scale level represents a much greater challenge. In this work we
intend to address, to some degree, the five main principles: proximity, similarity, continuity,
common fate and closure (figure 2.7).
2.4 Colour Constancy
The colour appearance in raw images is formed from a combination of illumination spectra
and the surface reflectance properties of the objects contained in the scene as the illumina-
tion is reflected from the objects back into the camera. A common aim of colour constancy
algorithms is to enable the separation of object and lighting properties within images, usu-
ally to enable lighting and pose invariant matching between objects of the same surface
properties. While colour constancy algorithms generally rely on world assumptions and
simplified environments, such as Mondrian surfaces under a single uniform light source,
they can be successfully used to reduce lighting effects detrimental to recognition tasks.
Unfortunately, the practical application of colour constancy theory is often adversely af-
fected by limitations in camera sensitivity which can cause noise and normalization errors
in calculations.
The separation into intrinsic images may not always be required for certain recognition
tasks; in such cases it may be sufficient to use illumination quasi-invariants. An example of
this can be found in [Koubaroulis00a], where the Multimodal Neighbourhood Signature
is formed from the cross-ratio (invariant to the linear illumination model) of
neighbouring patches.
illumination and reflectance is sufficient for natural images, although the assumption of a
single illumination source and Mondrian surfaces is not. Gamut algorithms [Finlayson95,
Brainard97, Barnard00] attempt to address these problems by utilizing information about
expected lighting and surface properties previously measured from target environments to
establish illumination constraints.
Conventional algorithmic approaches can be roughly categorized by the world assump-
tions they rely on to provide the extra constraints for colour constancy.
2.4.1 Assumptions of surface reflectance or illuminant properties in image
The ‘Grey-World Algorithm’ [Buchsbaum80] makes the assumption that the surface RGB
values in an image will average to grey if viewed under a canonical (white) light. With
these constraints, any deviation in the image average from grey has to be due to the
illumination of the scene. The ‘White Patch Algorithm’ which lies at the heart of many
Retinex algorithms [Land71] (derived from centre-surround theory), assumes that there
will be a surface in the image that shows maximal reflectance in the various colour bands
that can be used to anchor colour normalization.
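
Both assumptions are simple enough to sketch directly. These are illustrative
implementations under idealized conditions, ignoring the camera noise and normalization
errors noted earlier:

    import numpy as np

    def grey_world(image):
        """Grey-World: any deviation of the per-channel means from grey
        is attributed to the illuminant and divided out."""
        means = image.reshape(-1, 3).mean(axis=0)
        return image * (means.mean() / means)

    def white_patch(image):
        """White-Patch: assume the maximal response in each channel
        corresponds to a maximally reflective surface; anchor it to 1.0."""
        return image / image.reshape(-1, 3).max(axis=0)

    # A reddish illuminant biases an otherwise neutral scene; both
    # corrections recover roughly equal channel statistics.
    rng = np.random.default_rng(1)
    scene = rng.uniform(0.2, 0.8, (32, 32, 3)) * np.array([1.4, 1.0, 0.8])
    print(grey_world(scene).reshape(-1, 3).mean(axis=0))  # ~equal channels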
Retinex Theory [Land71, Rahman] asserts that reflectance is constant across space,
except where there exists a transition between objects/pigments.
Given these constraints, Retinex theory makes the following assumptions (a sketch
follows the list):
1. Reflectance change is shown as a large step edge in an image.
2. Illuminance changes gradually over space (low edge values).
3. The reflectance image and, conversely, the illumination map can be
extrapolated by removing the low edge derivatives caused by illumination
change from the image.
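
The sketch below is a toy one-dimensional illustration of these assumptions using
gradient thresholding in the log domain; it is not Land and McCann's path-based
algorithm, and the threshold value is arbitrary:

    import numpy as np

    def retinex_1d(signal, threshold=0.1):
        """Discard small log-domain derivatives (assumed slow illumination
        change, assumption 2) and rebuild the signal from the large step
        edges that remain (reflectance transitions, assumptions 1 and 3)."""
        log_s = np.log(signal)
        grad = np.diff(log_s)
        grad[np.abs(grad) < threshold] = 0.0  # remove illumination gradient
        return np.exp(log_s[0] + np.concatenate([[0.0], np.cumsum(grad)]))

    # Two reflectance patches under a smooth illumination ramp.
    x = np.linspace(0.0, 1.0, 100)
    reflectance = np.where(x < 0.5, 0.3, 0.8)
    observed = reflectance * (0.5 + 0.5 * x)  # ramped illumination
    recovered = retinex_1d(observed)
    print(recovered[10] / recovered[90])      # ~0.3/0.8, ramp removed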
2.4.2 Assumptions of a finite image gamut
Gamut algorithms (such as the CRULE algorithm [Forsyth90]) make use of the Coefficient
Rule, that although a camera is able to record a large range of colours, only a subset of these
will frequently occur in real images. A set of illuminants is mapped directly from target
environments and their convex hull taken to represent the canonical illumination gamut.
By combining real reflectance measurements taken from objects with the illuminations
possible from the canonical gamut we can add constraints to possible illuminants in images
by taking each RGB value in an image and calculating the gamut of illuminations that
would be required to map that RGB value into the canonical RGB convex hull. Each
coloured surface in an image generates a new illumination gamut set that can be intersected
with past results to provide ever-tighter constraints on the illumination source. The final
illuminant is chosen from the remaining set by either finding the mean illumination or the
illumination which would describe the greatest scene reflectivity. This approach is only
effective where a scene contains a sufficient number of differently coloured objects and
minimal specularity and shading. The use of colour direction, which is largely unaffected
by shading [Finlayson95, Barnard96, Barnard00], can overcome some of these limitations.
[SimonFraser] provides a useful explanation of gamut constraint theory.
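
A heavily simplified sketch of the coefficient-rule intersection is given below. It uses
per-channel diagonal maps against an interval approximation of the canonical gamut
rather than the full convex-hull machinery, and the canonical interval is an assumed value:

    import numpy as np

    # Canonical gamut approximated per channel as an interval of responses
    # observed under the canonical illuminant (assumed values).
    CANON_LO, CANON_HI = 0.05, 0.9

    def illuminant_constraints(pixels):
        """Each observed value v constrains the diagonal scaling s by
        CANON_LO <= s * v <= CANON_HI; intersecting the resulting
        intervals over all surfaces tightens the illuminant estimate."""
        lo = (CANON_LO / pixels).max(axis=0)  # tightest lower bounds
        hi = (CANON_HI / pixels).min(axis=0)  # tightest upper bounds
        return lo, hi

    surfaces = np.array([[0.2, 0.4, 0.1],
                         [0.6, 0.5, 0.3],
                         [0.1, 0.7, 0.2]])
    lo, hi = illuminant_constraints(surfaces)
    print((lo + hi) / 2)  # e.g. the interval midpoint as a point estimate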
2.4.3 Measuring the illuminant indirectly from image properties
Some approaches attempt to measure illumination through indirect methods. One method
used is the Dichromatic Reflectance Model and differences between the reflectance prop-
erties of ambient and specular illumination [Klinker88]. Ambient reflection components
are a combination of illumination and reflectance (object colour), specular reflection is
much more closely related to the original illumination source. This relationship can possi-
bly be used to estimate the illumination spectra and therefore facilitate colour constancy
correction.
2.4.4 Limitations to Colour Constancy Techniques
While a large number of well established colour constancy techniques do exist, it must be
noted that there remain several issues that can prove problematic with such approaches.
The most obvious are the limitations and inaccuracies that can occur through the im-
position of the artificial assumptions that these algorithms are based upon. The other
difficulty arises through the assumption that an intrinsic image can be generated from real
colour information available in an image based upon reflectance and lighting properties.
As discussed in section 2.1, there can be a vast difference between the human perception of
colour and the actual physical stimulus arriving at the eye. Perceptual effects such as
Simultaneous Contrast and those described by Opponent Colour Theory are not addressed
in the standard colour constancy model.
2.5 Invariance
Invariant theory was originally pioneered by 19th century mathematicians Boole, Cayley
and Gordan [Boole1872, Cayley, Gordan]. With the advent of machine vision, it has since
been developed as a practical and useful tool for image recognition. Invariants are prop-
erties of image content that remain stable regardless of image transformations that would
otherwise be detrimental to recognition. Invariants can be of different levels, depending
upon the number of transformations they are required to be invariant to, although the
higher the order of the invariant the more components it requires, increasing susceptibility
to noise.
2.5.1 Photometric Invariants
Optical and photometric invariants are descriptions of image colour or shading content that
are invariant to image or scene transformations. Especially useful where colour or shading
information is rich and geometric image content is unreliable, as in natural images, photo-
metric invariants provide very robust image content description qualities. Being generated
from photometric properties of the image, this type of invariant can be used to provide
an effective image description regardless of geometric image transformations. Although
this may be advantageous in the case of general image database search, the very fact that
geometric information is completely ignored can result in false image matches and make
photometric invariants unsuitable for pointwise mapping between images. Photographic
images contain colour information that is sensitive not only to the base colour of the ob-
jects in the scene, but also to lighting conditions and shading effects caused by object shape. The
combination of colour-constancy techniques coupled with evidence accumulation frame-
works can minimize these detrimental factors and facilitate useful recognition. A major
advantage of photometric descriptions over geometric descriptions is that they can usu-
ally be derived directly from pixel information in an image. Photometric properties of
images are also less sensitive to image transformations common in photographic material
when the viewpoint has altered. While a movement in camera position can have a rad-
ical difference upon the geometrical appearance of an object, its photometric properties
will generally remain stable. For similar reasons, photometric properties are compara-
tively tolerant to other image artifacts that can adversely affect geometric descriptions,
such as occlusion and motion. This makes them ideal for use as robust descriptors with
realistic/photographic image content.
2.5.2 Geometric Invariants
Geometric invariant features are concerned with generating descriptions of the apparent
geometry contained in an image that are invariant to unwanted image, object or viewpoint
transformations. While geometric invariants are usually applied to two dimensional image
content, they can be created to describe objects of any dimensionality. Early research into
geometric invariant descriptors was based around describing target objects as a whole
(Global Invariants), but the need to describe objects in situations where no true invariants
exist (for example, describing the content of projective scenes from a single image source)
has resulted in the use of localized features (local invariants) to generate semi-invariant
descriptors for evidence accumulation frameworks.
2.5.3 Global Invariant Techniques
Global invariants are formed from the global properties of an image, shape or object as
a whole. Correlation invariants (table 2.1) are formed directly from the image function
or pixel content and can be combined to higher orders to represent different levels of
completeness. First order correlation invariants represent a simple measure of region area
or, for non-binary images, mass.
Moment invariants tend to be based upon extracting positional cues, such as an image
function's centre of mass or object spread, from raw image information. Moments
can be used to effectively simplify image information and encode general shape trends in
image data as well as form the basis for invariant information (Table 2.2). An indirect
use of moment information to achieve invariant descriptions is as a basis for normalising
position, scale and orientation of image data using moments of different orders. [Reiss93]
provides a good overview of the use of correlation and moment invariants. Both correla-
tion and moment invariants can be considered primary members of the group of global
invariants that rely on image mass for their results. Another group of global invariants
can be generated from the boundary of an image object. Boundary based invariant features
include polynomial equations [Taubin92] and Arc Length Space [Gool92]. Although effective
at generating similarity and affine level invariants, these approaches rely heavily upon
accurate boundary information and are sensitive to occlusion.

1ST ORDER CORRELATION (OBJECT AREA/MASS)
    \sum_{x,y} f(x, y)

2ND ORDER CORRELATION
    \sum_{x,y} f(x, y) \, f(x + a, y + b)                (a, b = displacement values)

3RD ORDER CORRELATION
    \sum_{x,y} f(x, y) \, f(x + a, y + b) \, f(x + c, y + d)    (a, b, c, d = displacement values)

Table 2.1: Calculating two-dimensional correlation invariants
0TH ORDER MOMENT: MASS
    M = \sum_{x,y} f(x, y)

1ST ORDER MOMENT: CENTROID
    C_x = \frac{1}{M} \sum_{x,y} x \, f(x, y)
    C_y = \frac{1}{M} \sum_{x,y} y \, f(x, y)

2ND ORDER CENTRAL MOMENT: VARIANCE
    C_{xx} = \sum_{x,y} f(x, y)(x - C_x)(x - C_x)
    C_{xy} = \sum_{x,y} f(x, y)(x - C_x)(y - C_y)
    C_{yy} = \sum_{x,y} f(x, y)(y - C_y)(y - C_y)

Table 2.2: Calculating Moment invariants
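The quantities in Table 2.2 can be computed directly from a two-dimensional image function held as an array. A Python sketch using NumPy (written for this overview, not taken from the thesis implementation):

import numpy as np

def moments(f):
    # Mass, centroid and second-order central moments of a 2-D image
    # function f(x, y), following Table 2.2.
    ys, xs = np.indices(f.shape)                  # y = row, x = column
    M = f.sum()                                   # 0th order: mass
    cx = (xs * f).sum() / M                       # 1st order: centroid
    cy = (ys * f).sum() / M
    cxx = (f * (xs - cx) ** 2).sum()              # 2nd order central
    cxy = (f * (xs - cx) * (ys - cy)).sum()       # moments (variance)
    cyy = (f * (ys - cy) ** 2).sum()
    return M, (cx, cy), (cxx, cxy, cyy)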
Frequency based descriptors use signal theory to overcome geometric transformations.
Applicable to both pixel and boundary information, Fourier descriptors and Wavelets
describe the image objects in terms of their component frequencies.
Global invariants are usually formed using algebraic techniques that operate on the
low-level image data as a whole. With most global invariants being formed from basic
low-level object image features, they require little pre-processing and can be very fast and
efficient to implement. Unfortunately, the reliance of global invariants upon the entire shape
of the object image makes them very sensitive to real image effects such as occlusion or
overlap with image boundaries, rendering them ineffective as invariant descriptors under
these conditions.
2.5.4 Local Invariant Techniques
Local invariant features are defined from localized image information or geometry such
as points, lines or curves and are therefore more robust to occlusion than their global
counterparts. An image object can contain many local invariants, each of which represents
invariant evidence that can be used to identify it. Methods that use local invariants
usually feature some form of evidence accumulation such as histogram matches [Kliot98]
for database search or transformation evidence accumulation in parameter space. Such
invariant features can be used directly for recognition (for component labelling or canonical
mappings) or as a means to isolate the different transformations within the image to
facilitate a direct comparison. Table 2.3 lists some of the most common applications of
local invariant descriptions.
1. Signatures
Two independent invariant features plotted against each
other to form a single invariant canonical representation of
the object [Weiss88].
2. Histogram
A one dimensional histogram of a single description
[Kliot98].
3. Localised Signatures
Individual signature descriptions generated from the imme-
diate neighbourhood, for each image feature [Thacker95].
4. Creating anchor points
Invariant boundary properties used to describe shape
[Mokhtarian99] or reverse image transformations for tem-
plate matching.
5. Parameter Space (Hough Transform)
Invariant feature evidence plotted into parameter space to
facilitate the reversal of transformations between images
[Xilin99].
Table 2.3: Common local invariant architectures
Local geometric invariant features are derived through differentiation, the order of
which is determined by the number of transformations they need to be invariant to. Usually
generated from a combination of primitive geometric features such as lines, points and arcs,
these features represent different degrees of freedom that can be combined to generate an
invariant of a predictable order. Table 2.4, below, shows some common image primitive
features and their corresponding degrees of freedom.
Features            Degrees of Freedom
Tangent                     1
Points                      2
Curves                      3
Conics                      5

Table 2.4: Degrees of freedom (order) of primitive geometric image features
Invariance to a given set of transformations can be generated by ensuring that the sum
order of the component features exceeds the order of the transformations.
An example:
A Weak Projective Transformation has 8 degrees of freedom.
The Cross Ratio is a projective invariant derived from 5 co-planar points (see figure 2.16).
(While the standard model for a Cross Ratio uses 4 co-linear points, any 5 co-planar
points can be mapped into 4 co-linear points by a simple projection.)
Order of the Cross Ratio = 5 points × 2 degrees of freedom each = 10
Number of independent invariants from the Cross Ratio = 10 − 8 = 2
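For reference, the standard textbook form of the Cross Ratio of four collinear points A, B, C and D (with XY denoting the distance from point X to point Y) is:

\[
\mathrm{Cross}(A, B; C, D) = \frac{AC \cdot BD}{BC \cdot AD}
\]

This quantity is unchanged by any projective transformation of the line, which is what allows the construction above to yield projectively invariant measures.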
[Reiss93, Mundy92] contain more detail on the use of geometric primitives and the
mathematical principles behind local geometric invariance. Local geometric invariants can
be categorized according to their order of invariance and hence the image transformation
types they are unaffected by. In the case of image analysis, invariants fall into one of the
four orders of invariance presented in table 2.5.
NAME          ORDER    TRANSFORMATIONS INVARIANT TO
                       Translation  Rotation  Scale  Shear  Skew
Euclidean       3          YES         YES      NO     NO     NO
Similarity      4          YES         YES      YES    NO     NO
Affine          6          YES         YES      YES    YES    NO
Projective      8          YES         YES      YES    YES    YES

Table 2.5: The four main levels of local geometric invariance
Figure 2.8: Angle between vectors, a similarity invariant
Euclidean Invariants
Euclidean invariants are of relatively low order and are unaffected by Euclidean transforms
(translation and rotation). This makes Euclidean invariance ideal for most trademark
recognition or similarly constrained two dimensional document analysis tasks. Linear
length is a simple example of a Euclidean invariant.
Similarity Invariance
Similarity invariants are measures that remain unaffected by scaling and Euclidean trans-
forms. Similarity invariants are ideal for recognition in most document analysis tasks,
where this limited range of transformations is common. The angle between vectors (fig-
ure 2.8) and the ratio of linear lengths (figure 2.9) are both easily calculated Similarity
invariants and are commonly used to generate two-dimensional signature descriptions.
Curvature is also a commonly used Similarity invariant [Weiss88, Califano94], and when
combined with image smoothing to enable closure of image curves [Dudek97] can form a
useful approach to sketch image analysis.
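Both measures are inexpensive to compute. A minimal Python sketch (illustrative only; vector endpoints are assumed to be 2-D NumPy arrays):

import numpy as np

def angle_between(u, v):
    # Angle between two vectors (figure 2.8): invariant to translation,
    # rotation and uniform scaling.
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

def length_ratio(p0, p1, q0, q1):
    # Ratio of two linear lengths (figure 2.9), another similarity
    # invariant.
    return np.linalg.norm(p1 - p0) / np.linalg.norm(q1 - q0)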
Affine Invariants
Affine invariants are unaffected by shear plus the Similarity transforms. This makes them
particularly useful for our purposes, as affine transformations are common in photographic
Figure 2.9: The ratio of linear lengths, a similarity invariant
imagery. Invariants of this level are still fairly efficient to calculate and can provide usefully
tolerant quasi-invariant local measures for photographic image content, particularly where
objects are sufficiently distant from the camera to minimize projective effects. Area
ratios are particularly useful Affine invariants as area can be easily updated during a
segmentation process. Boundary descriptions can also be derived from area ratios by
using the ratio of areas between four points on a curve (figure 2.10) [Kliot98]. The ratio
of parallel linear lengths (figure 2.11) is another common example of an Affine Invariant
[Gros98]. Geometric Hashing [Lamdan88], uses adjoining line pairs to form the basis for an
affine local planar geometry into which the other graph features are mapped (figure 2.12).
Another affine invariant is the ratio of vectors formed from a point at a tangent to two
linear vectors [Reiss93]. Boundary curvature can be used to generate Affine invariants,
and has been successfully implemented in the SQUID image recognition system [Mokhtarian99]
(figure 2.14). A further useful property of curves is that the point on a curve which lies
furthest from the vector between the two curve endpoints will remain the same regardless
of affine transformations [Reiss93] (figure 2.15).
Figure 2.10: Four point area ratio, an affine invariant.
Projective Invariants
Projective invariants, or weak perspective invariants, are the highest order of invariant
possible when dealing with single images. Unaffected by skew as well as Affine transfor-
mations, they can be used to approximate perspective tolerant descriptions where image
content is generally planar in nature. The most commonly used projective invariant is the
Figure 2.11: Ratio of parallel line lengths, an affine invariant
Figure 2.12: Geometric Hashing, two reference line segments are used to define a local
geometry that will remain unaffected by affine transformations.
Figure 2.13: A point, tangent and two lines can form an affine invariant
Cross Ratio, which is the ratio of distances between four points on a line (figure 2.16). A
Cross Ratio can also be derived from five coplanar points by forming a vector between two
of the points and projecting the other points onto this line.
Properties of planar conics can also be used to generate high-level invariants [Mundy92,
Forsyth90a, Quan98, Brill92].
The major problem associated with implementing high order invariants is that a large
number of primitive features are required to generate them; this makes them sensitive to
initial feature extraction errors, aliasing and noise present in query images.
2.6 Segmentation
Image segmentation usually refers to the subdivision of an image into regions that repre-
sent disparate image subcomponents. In the general case, the aim of segmentation is to
Figure 2.14: The Shape Query Using Image Database recognition system uses invariant
curve based signatures built up from different degrees of boundary smoothing
Figure 2.15: The point on the curve furthest from the endpoint vector is affine invariant
Figure 2.16: The projective invariant Cross Ratio.
identify which parts of an image constitute separate visual objects. Segmentation is often
used as a first step towards the analysis of a given scene, assigning labels to areas of an
image to generate concise and computationally useful information about scene content. In
the case of image compression, segmentation can be used to eliminate redundant image
information, reducing storage requirements. The other main application of segmentation
is in image recognition and understanding, where regions of connected pixels with similar
characteristics will usually belong to a single visual object.
Segmentation algorithms can be roughly divided into five overlapping approaches: mo-
tion, edge, boundary, region and model techniques. Motion segmentation measures rely on
the principle that points in an image belonging to the same object in an animated sequence
of images will generally share the same common fate. Edge and boundary based techniques
rely on discontinuities at region borders to find region boundaries from which the regions
themselves can be derived. Region techniques rely on point proximity and continuity to
directly identify regions and, conversely, the boundaries between them. Model approaches
rely upon prior knowledge of image content to enhance the segmentation process and
directly search the image space for known objects.
While many of the above approaches should ideally yield the same results with the same
given image, problems arising from noise, image complexity and algorithm assumptions
mean that this is very rarely the case.
For natural images containing three-dimensional objects, the segmentation becomes
more problematic due to noise, complex lighting and shading effects. Pre-processing such
as Symmetric Neighbourhood Filtering [Harwood87], median filtering or Gaussian filtering
to reduce noise, and colour constancy techniques [Finlayson97b],[Finlayson01] to reduce
shading and lighting effects is often applied before edge or region detection in such images.
The following sections provide an overview of the four main segmentation approaches that
are applicable to a single image segmentation.
Edge Based Segmentation
Edge based techniques attempt to determine regions by detecting and linking the edge
discontinuities in an image into region boundaries. Edge detection techniques commonly
utilise localised partial derivatives calculated from pixel neighbourhood masks. The Roberts
[Roberts65], Laplacian, Sobel [Gonzalez92] and the popular Canny [Canny86] operators,
along with their many variations, provide quite effective edge detection using the derivatives
of pixel nearest neighbours and can be extremely efficient to implement. Unfortu-
nately, due to the bottom-up nature of these methods, these approaches can be sensitive
to camera artifacts, low resolution contours and sub-optimal contour groupings formed
from boundary intersections within the image at early stages of edge contour generation.
After region boundaries have been located, then the regions themselves can be easily
labelled. This process relies heavily upon the retrieval of fully enclosing boundaries, which
edge detectors alone are unlikely to generate in all but the most contrived or simple images,
due to imaging noise or similarity between adjacent regions.
Although binary and intensity measures are generally used with edge based techniques,
colour and texture measures, most often implemented with region segmentation, can also
be applied. To compensate for incomplete boundary recovery, most edge based techniques
require some form of post processing using image, boundary or geometric assumptions to
correct any errors. Edge relaxation [Zucker77] involves the use of localised edge direction
information to encourage the growth of weak adjoining edges whilst inhibiting edges due
to noise. Sequential edge detectors use strong edges as seeds for object boundaries from
which all possible edge paths are calculated and evaluated for likelihood and connectivity.
Because this process can be quite computationally intensive, it is often directed towards
likely paths based upon edge direction in order to reduce the number of potential paths
searched.
The Hough transform can be implemented with noisy edge data to gather evidence for,
and extract, line or curve segments. The Hough transform plots edge pixels into a curve
or line parameter space where evidence is accumulated for the existence of a particular
boundary segment. This information can then be used to inhibit outlying edge points
whilst encouraging weaker edge points which lie on the curve. An adaptation of this is
the Generalised Hough Transform, which can be used to detect the entire boundary of an
object with sparse edge data, although prior knowledge of the shape to be extracted is
required. One drawback to the use of Hough algorithms is that they have a tendency to
be computationally expensive. A variety of speed-up and shortcut techniques exist which
direct the search or use multiple edge points in combination, such as the Randomised
Hough Transform [Xu90].
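The core of the standard line-detecting transform is a small accumulator loop. The Python/NumPy sketch below (a generic textbook formulation rather than any specific variant discussed here) votes each edge point into (rho, theta) parameter space:

import numpy as np

def hough_lines(edge_points, image_shape, n_theta=180):
    # Each edge point (x, y) votes for every line
    # rho = x cos(theta) + y sin(theta) passing through it; peaks in
    # the accumulator correspond to lines supported by many points.
    h, w = image_shape
    max_rho = int(np.ceil(np.hypot(h, w)))
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((2 * max_rho + 1, n_theta), dtype=np.int32)
    cols = np.arange(n_theta)
    for x, y in edge_points:
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + max_rho, cols] += 1           # one vote per theta bin
    return acc, thetas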
Boundary methods: Snakes, Active Contour Models, Balloons
Boundary based segmentation lies somewhere between object recognition, edge detection
and region detection, although such methods are primarily concerned with detecting object
boundaries as a whole.
Snakes, or active contour models [Kass87] are moving curve segments that negotiate
and adapt to the image space to find a position that minimises an energy function, usually
to seek smooth curve segments whilst maximising edge potential. The practical imple-
mentation of snakes to real images can prove quite sensitive to initial conditions, so prior
knowledge of image content or careful seeding is required for effective results. Snake al-
gorithms result in multiple detected curves that, as in the case of edge detectors, require
some form of linking into complete boundaries before segmentation can be performed. Bal-
loons [Cohen93], implemented to seek out both 2D and 3D contours, are a more controlled
method of using snakes to find full region boundaries. Balloons represent continuous snake-
like boundaries that seek to find maximal area, maximal smoothness and maximal edge
potential and inherently remove the need for edge linking into boundaries. Snake/balloon
based boundary techniques are very effective where some information about the shape,
curvature or content of the image to be segmented is already known and use assumptions
to enhance the boundary detection. Although methods exist to adapt these approaches
to unknown image content, they can prove very sensitive to starting conditions and image
content.
Model Based Segmentation
These techniques are primarily devoted to the direct fitting of a known shape or model
to the image data. Similar in nature to Active Contour Models, and often based upon
these, they use edge information to adaptively fit a known object boundary to the image
data. The simplest form of this approach is template matching, where a known object,
which will usually cycle through a series of transformations, is tested directly for match
with the image data. In the case of deformable models, known boundaries are adaptively
fitted to image edge areas using fitness functions similar to Balloons. Active Shape Mod-
eling [Stegmann00] relies upon a statistical description of object boundaries, capturing
boundary variation by applying principal component analysis. Although very effective at
locating and segmenting known objects within images, and especially resistant to image
noise, their reliance upon prior knowledge of image content makes these approaches of
limited use to generalised image segmentation.
Region Based Segmentation
Segmentation based upon the detection of regions in an image approaches the problem
directly, grouping points into regions based upon their proximity and the similarity of
their characteristics. A segment is represented directly by some measure of the image
data contained within it and the boundary between regions is usually conceptualized as the
difference between these measures. Segments can be extracted directly from the statistical
properties of the image, generated as a result of merging lower level regions with similar
properties or created through the iterative subdivision of regions in a top down process.
Region based approaches are probably best defined in terms of the characteristic measures
and the actual region grouping processes that use these measures to create the segment.
2.6.1 Measures used to guide segmentation
There is a wide range of measures that can be used to form the basis of a segmentation
process; optimal measures are dependent upon the type of segmentation required and
anticipated image content. Essentially, these measures represent a localized property of
an image at a given position which can be compared with a neighbouring image point for
similarity to determine whether the two points belong to a common region. Measures can
be subdivided into single point (colour, intensity, binary) or multiple point descriptions.
Single Pixel Measures
Single pixel properties provide a measure of similarity between localised points to guide their
subsequent grouping into regions or segments. Although less appropriate for textured im-
ages, single point measures largely avoid the problems associated with sampling scale and
dimensionality, and can generate computationally inexpensive algorithms. Pixel intensity
is the most commonly used single point measure, representing grey-scale images, whereas
colour can be denoted through hue measures or component colour intensity measures such
as RGB.
If the image source is of a binary nature, or has been reduced to a binary image through
pre-filtering techniques such as thresholding, then similarity between individual points is
either true or false. Unless a texture measure is being used within a wider neighbourhood
of pixels, the proximity or inter-connectedness of same state points controls the region
grouping procedure. Although extremely efficient, binary image representations lack the
detail required by most modern analysis tasks and are most useful in highly simplified
or controlled environments, where a limited number of non-complex image regions are
expected.
If image content cannot easily be reduced to a binary measure, such as in natural im-
ages, we can increase the dimensionality of our measures to represent linear pixel intensity.
The image is represented as a function consisting of greyscale light intensity measures. In
the general case, images form a three dimensional surface consisting of the pixel position in
the image and it’s corresponding intensity value. In photographic images, a similar image
intensity between nearby points in an image indicate that the two points share the same
reflectance and surface properties and are therefore likely to belong to the same surface
or region. Segmentation through reflectance intensity still represents a good trade-off be-
tween processing time and the level of image detail, although valuable colour information
may be lost in the process. Grouping by intensity values and proximity results in good
segmentations where image regions are Mondrian in nature and under ambient illumination.
Changes due to surface shape, texture and lighting conditions within an image scene will
usually result in an over segmentation.
When dealing with photographic images, the maximum amount of information we can
use from a single point is that of colour. Colour information helps differentiate between
surfaces with similar reflectance and can also be used to overcome some of the three di-
mensional lighting difficulties encountered when grouping by intensity only. RGB (Red,
Green, Blue) and CMY (Cyan, Magenta, Yellow) colour descriptions are essentially in-
tensity measures of different parts of the visual spectrum, and can be viewed as a simple
extension of intensity descriptions. HSB (Hue, Saturation, Brightness) descriptions can
be more problematic as they combine linear and polar coordinate systems, but have useful
inherent invariant properties. The increase of point information facilitates the generation
of semi-invariant measures such as normalized rgb and hue that prove more tolerant to
lighting and shading changes in an image, although the Mondrian constraint is still as-
sumed. Segmentation based upon colour can be more computationally expensive than
intensity due to the increase in information that requires evaluating.
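The normalized rgb measure mentioned above is trivially derived from raw pixel values. A Python sketch (illustrative, assuming an RGB image in a NumPy array):

import numpy as np

def normalised_rgb(img):
    # Chromaticities r = R/(R+G+B), g = G/(R+G+B), b = B/(R+G+B):
    # a simple semi-invariant measure, comparatively tolerant to
    # intensity changes under the Mondrian assumption.
    f = img.astype(np.float64)
    total = f.sum(axis=2, keepdims=True)
    total[total == 0] = 1.0                      # leave black pixels at 0
    return f / total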
Texture Measures
Texture measures are generated from information provided by a group of pixels within an
image. Texture descriptions are usually generated either through the use of statistical mea-
sures describing a given rectangular or circular neighbourhood around a point or through
a functional description of image data. The resulting increase in information, often at
the expense of extra processing and poorer boundary localisation, can be further used to
overcome complex shading effects in real imagery. More importantly, texture measures
are capable of grouping complex and highly textured images, which are typical of natural
scenes, into disparate regions. With the introduction of wider neighbourhood information
comes the problem of dimensionality: how to reduce the dimensions of the description to
a manageable form, and which particular neighbourhood size or shape is most appropriate
to describe a given texture region.
Random Fields
Random fields rely on the assumption that a given two-dimensional neighborhood of point
values can be modelled by the parameters of the distribution function they best fit. Given
a distribution function and the point values, these parameters can be easily calculated,
although in the general case of image segmentation the actual distribution of an entire
image is unknown. Following from the Markov property, which states that the probability
of a point value given an entire image is the same as in a smaller surrounding area, and
the general prevalence of Gaussian distributions, sample regions are often assumed to be
Gaussian. This special class of fields is called Gaussian Markov Random Fields. Another
common class, which uses exponential distributions determined from image point values,
are Gibbs Markov Random Fields. Random Field parameters can be used to determine the
mean, variance and the directional auto covariance of a neighbourhood of image points.
Dissimilarity measures
Texture dissimilarity from a known, or assumed, distribution can provide a sparse, ef-
fective, description of texture. Chi-Square measures can be implemented to determine
the variation of a table of image values from a table of values following an expected dis-
tribution. The Kolmogorov-Smirnov distance measure and Cramer-von Mises distance
estimator can both provide useful dissimilarity measures.
Co-occurrence Matrices
Co-occurrence Matrices, also referred to as Spatial Grey Level Dependence Matrices, are
a series of matrices that each represent a given set of offsets from a sample point. Each
matrix stores the number of different grey level combinations that occur at the given
offset. Co-occurrence Matrices can be used to determine a number of region properties,
including mean, variance, entropy, energy, contrast, and correlation. Both [Conners80]
and [Singh80] show that co-occurrence matrices provide the best texture discrimination
when compared against other approaches, with the possible exception of Laws [Laws80]
texture measures. An example of the application of co-occurrence matrices to segmenting
multi-spectral texture images can be found in [Kasari96].
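A co-occurrence matrix for one offset is simple to construct. The following Python sketch (illustrative, assuming an 8-bit greyscale image) counts quantized grey-level pairs at displacement (dx, dy):

import numpy as np

def cooccurrence(img, dx, dy, levels=8):
    # Count how often quantized grey level i occurs at offset (dx, dy)
    # from grey level j, then normalize to joint probabilities.
    q = (img.astype(int) * levels) // 256        # quantize to `levels` bins
    h, w = q.shape
    glcm = np.zeros((levels, levels), dtype=np.float64)
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            glcm[q[y, x], q[y + dy, x + dx]] += 1
    return glcm / glcm.sum()

Properties such as energy (the sum of squared entries) or entropy then follow directly from the resulting matrix.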
Auto-correlation
This statistical measure determines how much a sample function varies with itself as it is
offset from the origin. When applied to two-dimensional neighbourhoods it can be used
as a measure of texture coarseness in different directions.
Moments
Calculating the moments of a point neighbourhood can provide a wide range of useful
texture descriptions for use with segmentation. Mass, center of mass, variance, skewness
and kurtosis (4th order moment representing non-gaussianity) can all be derived in this
way and used to indicate texture similarity.
Edges/Extrema in Unit Area
A simple count or normalized average of edges, corners or other notable extrema within a
point neighbourhood provides an extra measure of texture similarity. The Edge frequency
method is a simple summation of absolute point differences at given sample offsets that
can provide effective smoothness or roughness measures, although it is not a very effective
discriminant between textures when used in isolation.
Laws Texture Measures
Laws' method [Laws80] produces a set of 14 rotationally invariant, or 24 rotationally de-
pendent, measures derived through the application of simple convolution kernels. The
basis kernels are a set of 5 one-dimensional kernels of length 5, representing level, edge,
spot, wave and ripple respectively:
L5 = [  1  4  6  4  1 ]
E5 = [ -1 -2  0  2  1 ]
S5 = [ -1  0  2  0 -1 ]
W5 = [ -1  2  0 -2  1 ]
R5 = [  1 -4  6 -4  1 ]
These kernels are convolved with each other to generate 25 combination kernels which
are applied to the image to generate 25 separate new images. These images are replaced
by their texture energy measures, gained by summing the points in a window around
each given pixel. The images are then normalized by the level-level convolved image, which
is usually not used further. A limited degree of rotational invariance can be achieved by
summing the set of 24 images with their rotational opposites. For example, the L5E5
image, which is sensitive to vertical edges, can be summed with the E5L5 image which
is sensitive to horizontal edges to generate a combined image which is sensitive to both.
This results in a set of 14 unique images derived from texture properties, which in turn
provides 14 texture measures for every point in the original image.
Although a relatively simple approach, Laws' texture measures actually provide a very
effective set of texture descriptions which have been shown in [Conners80] and [Singh80]
to outperform most other techniques, with the exception of co-occurrence matrices.
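The construction of the 25 combination kernels amounts to a set of outer products of the five basis kernels. A Python sketch of this step (the filtering, energy and normalization stages described above are omitted):

import numpy as np

# The five 1-D Laws basis kernels: level, edge, spot, wave, ripple.
BASIS = {
    "L5": np.array([ 1,  4, 6,  4,  1]),
    "E5": np.array([-1, -2, 0,  2,  1]),
    "S5": np.array([-1,  0, 2,  0, -1]),
    "W5": np.array([-1,  2, 0, -2,  1]),
    "R5": np.array([ 1, -4, 6, -4,  1]),
}

def laws_kernels():
    # Each 2-D kernel (e.g. "L5E5") is the outer product of two 1-D
    # basis kernels; convolving the image with all 25 yields the 25
    # images from which the texture energy measures are derived.
    return {a + b: np.outer(ka, kb)
            for a, ka in BASIS.items()
            for b, kb in BASIS.items()}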
Fractals
Fractals, and multi-scale multifractals [Talon00, Kam99], derive descriptions through the
analysis of the image content's fractal geometry, and are usually based around co-occurrence
features. Fractal analysis can be performed at different fractal dimensions, or scales,
and can also be used to determine the optimum fractal dimensions to implement with a given
image. This approach shares considerable similarities with wavelet based descriptions,
especially in its ability to reduce noise whilst highlighting multi-scale signal information.
Primitive length texture features
Primitive length texture features rely on measures derived from the fact that coarse fea-
tures will have a larger number of interconnected points at a similar grey level, whilst fine
textures will have smaller clusters. They are calculated by recording the number of con-
nected points in a neighbourhood with a particular length and grey level in each direction.
This information is then biased to highlight either short primitive lengths (fine texture)
or long primitive lengths (coarse texture), which can then be used to measure primitive
length uniformity, grey level uniformity and a primitive percentage.
Gabor Filters
A Gabor filter is a linear two-dimensional, local filter type similar to a local band-pass
filter that extracts information from an image’s spatial and spatial frequency information.
Because one filter only samples a particular spatial frequency and orientation, a bank
of multiple Gabor filters is commonly used to provide texture descriptions. Although
the results can be used directly to categorize image textures, they are more commonly
combined to derive other measures such as complex moments, grating cell operators and
Gabor-energy quantities. A comparison of the effectiveness of different Gabor filter ap-
proaches can be found in [Kruizinga99].
Although this filter type can be effective at describing texture features, the need for a
large number of individual features and their static nature has led to them being largely
superseded by more flexible signal based techniques such as Fourier analysis and wavelets.
Fourier Analysis
Originally designed for linear signals, the Fourier series expansion of a continuous and pe-
riodic waveform provides a means of expanding a function into its major sine/cosine or
complex exponential terms. These individual terms represent various frequency compo-
nents which make up the original waveform and possess properties that can be useful for
segmentation and image processing.
The Discrete Fourier Transform was developed to be used where both time and fre-
quency variables are discrete, and is particularly useful for image analysis problems where
the time element can be replaced with axis position and the function run for each line and
column of the image.
The Fast Fourier Transform is a class of special algorithms which implement the Dis-
crete Fourier Transform with considerable savings in computational time. Whilst it is
possible to develop Fast Fourier Transform algorithms to work with any number of points,
the number of points used is generally limited to powers of 2 to allow maximum
efficiency to be obtained.
Fourier techniques have many useful properties when applied to computer images and
are primarily used for filtering, compression and the extraction of textural information.
Once the DFT of an image has been generated, the resulting frequencies can be manip-
ulated and then re-transformed back into a reconstructed image using the inverse DFT.
High frequencies in the DFT correspond to high frequency image elements such as edges
with low frequencies corresponding to general large scale image structure. If high frequen-
cies are filtered out of the image we are left with the overall shape of the image content,
with edge details lost. Conversely, if low frequencies are filtered out, then we are left with
information about the edges and their location in the image but little about overall shape,
a method useful for edge detection algorithms. In a similar way, neglecting very low (close
to 0) magnitude frequencies and storing only the largest frequency coefficients can re-
sult in image compression and de-noising that still retains general structure. The image
Power Spectrum can be determined from Fourier analysis to provide information about
the frequency and direction of a texture pattern. While this information is useful when
applied to images or sections of image that contain regular patterns of constant interval,
segment edges in the image can have an adverse effect upon these descriptions. For this
reason texture descriptions using Fourier analysis are often generated for localized image
neighbourhoods within the larger image, and such descriptions are sensitive to the scale
of these neighbourhoods and the type of image content.
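The low-pass filtering described above amounts to masking the DFT and inverting. A crude Python sketch (illustrative, assuming a 2-D greyscale array):

import numpy as np

def fft_lowpass(img, keep=0.1):
    # Zero out all but the lowest `keep` fraction of frequencies in
    # each axis, then invert: edge detail is lost while large-scale
    # image structure is retained.
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    cy, cx = h // 2, w // 2
    ry, rx = max(1, int(h * keep / 2)), max(1, int(w * keep / 2))
    mask = np.zeros((h, w))
    mask[cy - ry:cy + ry + 1, cx - rx:cx + rx + 1] = 1.0
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

Inverting the mask gives the complementary high-pass filter useful for edge extraction.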
As component frequencies represent image features at different scales, this allows the
analysis of texture information at a multi-scale level and can help determine optimum
scales for analysis. Extensions of multi-scale Fourier analysis to include affine invariant
analysis have also been proposed [Hsu93]. The highly popular descendant of Fourier
analysis, Wavelets, was developed to address some of the difficulties inherent with the
use of continuous sinusoidal signals.
Wavelets
Wavelets were developed through a combination of different scientific disciplines to address
the non-localised nature of the sinusoidal components used with Fourier analysis. Both
Wavelet analysis and Fourier analysis map discrete data into the frequency domain and
share many of the same features and advantages in texture description. The main differ-
ence between the two is that, unlike sines and cosines, wavelet functions are localised in
space and frequency. This localisation results in wavelets producing a much sparser
representation than Fourier approaches, which makes them more effective for application
to noise removal, compression and feature detection. Wavelets can also be applied with
varying image window sizes, which is an extremely useful property where the localisation
of image discontinuities is required. Many wavelet approaches use small windows with
high frequency analysis to detect discontinuities and apply larger windows with lower fre-
quencies to obtain detailed image frequency analysis. Another advantage is that whereas
Fourier analysis is restricted to two kinds of basis function, sine and cosine, there are a
large number of potential wavelet basis functions to choose from. This can result in more
effective analysis for different image content. Examples of wavelet types are the fractal
Daubechies, Coiflet, square-wave Haar, and Symmlet wavelets. An overview of wavelets,
their formation and applications can be found in [Graps95].
2.6.2 Region grouping processes
There are a wide variety of algorithms available to implement the segmentation of an im-
age and their effectiveness varies according to the nature of image content, homogeneity
measures and the intended nature of the segmentation. Bottom up algorithms begin with
low level information and gradually build regions through some form of merging process.
Top down approaches begin with the entire image as the first segment and iteratively
split segments until a threshold homogeneity is reached. Determining the optimal di-
mensionality for a segmentation is crucial, especially in the case where a single segmented
image surface is required. Top down approaches can settle on sub-optimal solutions due to
splitting across regions and bottom up segmentations are adversely affected by noise and
image imperfections. Both approaches have their limitations, so hybrid algorithms featur-
ing splitting and merging processes are often implemented in an attempt to integrate low
level and high level information. Some approaches attempt to overcome the dimensionality
problem by producing a set of segmentations at different scales and either post-evaluating
them or returning them all as valid results. The main difficulty with this approach, apart
from the lack of a definitive solution, is the increased processing requirements of these
algorithms and the need to maintain multiple different scale representations of the original
image at the same time.
The nature of the segmentation required is also important in influencing the choice
of algorithm. In the case of satellite imagery, where the desired texture elements
are already known, the algorithm becomes a basic search for the nearest match to
the known elements. Where the number of segments required is known, then the task
is to subdivide image content into the given number of regions whilst optimising some
homogeneity measure. The general case, where no prior assumptions are made about the
number or nature of regions, relies totally upon the evaluating function to define the
segmentation and is less likely to obtain useful segmentations. Region connectivity is also
an issue: many tasks require regions to be grouped together spatially, whereas for some
tasks this is not important and disconnected regions of similar properties can share the
same label; histogram techniques, which disregard spatial information, are usually much
more useful in the latter case.
Pre-attentive Segmentation
Pre-attentive segmentation algorithms represent pre-processing algorithms that can aid
in successive segmentation tasks. Such algorithms commonly analyse raw image data for
low level discontinuities, smooth contours and pop out targets and mark these as areas of
interest for other segmentation processes. [Yu01] presents such a strategy for pre-attentive
segmentation based upon nondirectional repulsion, detecting boundaries as discontinuities
in texture orientation.
Histogram Techniques
In some cases it may be advantageous to simplify the segmentation process through the use
of histogrammic representations. Histograms are usually implemented where images can be
separated into simple regions or regions of anticipated description, or as a preprocessing
step in seeding likely regions for other segmentation algorithms. Disparate regions will
have a tendency to show as peaks in a histogram of the image and troughs will indicate
potential segment boundaries. Although there may be labelling difficulties stemming from
the removal of spatial information, where unconnected regions are assigned the same label,
histogram segmentation can provide an efficient method of segmentation and is especially
useful where simple bimodal segmentations are required.
Histogram thresholding is one of the most commonly used methods for identifying
image regions, where all values below the threshold are disregarded as background and all
remaining image points are assigned labels corresponding to their histogrammic peaks. A
multi-scale or adaptive variation of this is the use of a Histogram Watershed, where the
threshold is gradually increased and adjusted.
Common histogram based segmentation algorithms are:
The Midpoint Method
Minimum Error Method
High-pass masking
Local and Global Thresholding
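As a concrete example of global histogram thresholding (Otsu's method, included here as a well-known illustration rather than one of the methods listed above), the threshold can be chosen to maximise the between-class variance of the grey-level histogram:

import numpy as np

def otsu_threshold(img):
    # Try every threshold t and keep the one maximising the
    # between-class variance w0*w1*(mu0 - mu1)^2 of the two classes.
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * p[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * p[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t            # pixels >= best_t form the foreground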
Primitives/Texture Databases
Primitive based segmentation processes subdivide an image space according to
its similarity to a database of anticipated textures. Where image content is known, such
as landsat imagery, then this reduces the problem to a least-cost matching process and
can generate very effective results. Statistical texture descriptions within the image are
compared with a library of textures and assigned labels that provide the closest match, a
process almost identical to texture recognition.
Because of the success of this constrained method and the growing need to automati-
cally process landsat data, this approach to segmentation has been widely researched and
implemented. See [Ma95] and [Austin96] for examples of this kind of segmentation.
Adjustable Models
Adjustable models rely upon user feedback to identify incorrect segment classifications
and adjust themselves to accommodate this.
Watershed Segmentation
Watershed segmentation is a well established approach that is usually implemented in non-
textured, Mondrian images and could be extended to deal with texture. Object boundaries
represent high values in the 1st derivative, edge space, of an image, whereas region cen-
tres represent shallow basins. The grouping of regions in edge space can be modelled as
comparable to water flowing downhill and collecting in the region basins.
One method of watershed segmentation is to treat each image point as a particle that
seeks the path of minimum cost through edge space until a stable state results, essentially
flowing downhill in edge space. Eventually, every point in an image will settle in a region
basin, and those points that belong to common basins can be assigned the same label in the
original image.
Another approach uses gradual flooding to determine regions, all edge values below a
threshold value are considered to be ’under water’ and are given a common label according
to which basin they share. As the threshold increases, more and more of the edge space
will be below the threshold until separate region basins will eventually meet. Once regions
meet, their labels are fixed and the flooding process is carried on with other regions until
the entire image is labelled and segmented.
Region Growing
Region growing methods represent bottom up approaches that begin by clustering, or growing,
regions from primitive data points into larger scale segment descriptions. An example
of this is Pairwise clustering, where each point in the image is initially given a unique
segment label and an iterative process is used to gradually merge together similar segment
descriptions into a larger single segment. Although this process can be initially slow, with
many segments to consider, the process can become progressively faster depending upon
the exact nature of the segment comparison and grouping algorithms.
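A minimal single-region version of this bottom-up process is a flood fill from a seed point. A Python sketch (illustrative only, using a fixed intensity tolerance):

import numpy as np
from collections import deque

def grow_region(img, seed, tol=10):
    # Flood-fill outwards from `seed`, absorbing 4-connected
    # neighbours whose intensity is within `tol` of the seed value.
    h, w = img.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    seed_val = int(img[seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w and not mask[ny, nx]
                    and abs(int(img[ny, nx]) - seed_val) <= tol):
                mask[ny, nx] = True
                queue.append((ny, nx))
    return mask              # boolean mask of the grown region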
K-Means Nearest Neighbour
This classic approach to segmentation has many variant methods of application and ini-
tialization. A number of initial region labels are usually chosen and their position in the
image is seeded either randomly or through some rough pre-processing such as histogram
thresholding. The iterative process examines each image point’s position and description
to determine which region the point lies closest to. Once all points in the image have been
assigned to their nearest region seed, the position of each region is recalculated as the
mean position of all member points and the assignment process is repeated. This usually
results in a movement of the region centre to a position of less-cost until all region centres
eventually reach a stable or periodic state. The final labelling of image points then rep-
resents the optimal position for each segment, given the starting conditions. K-means is
both a popular, efficient and effective technique but it’s sensitivity to initial seeding con-
ditions can make the results less predictable than other approaches. [Singh99] developes
the standard K-Nearest neighbour algorithm and applies it to texture segmentation.
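The assignment/update iteration described above is compact. A generic Python/NumPy sketch (illustrative, not the thesis implementation; `points` is assumed to be an (n, d) array combining pixel positions or descriptions):

import numpy as np

def kmeans(points, k, iters=20, seed=0):
    # Alternate two steps: assign every point to its nearest centre,
    # then move each centre to the mean of its assigned points.
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), k, replace=False)].astype(float)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None] - centres[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    return labels, centres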
Split and Merge
Split and merge algorithms base segmentation upon a top down approach, where the image
is recursively split into homogenous regions which are themselves subdivided until some
cost function is satisfied. The most common approach to split and merge is the use of
Quad-Tree splitting, which subdivides using a rectangular geometry, an example of which
can be found in [Smith94]. The main drawback to this splitting procedure is the artificial
imposition of a given geometry to the splitting process, which may result in regions that
span the geometry being incorrectly categorised as different segments. To address this, a
merging algorithm is usually applied to the final segments to correct any such errors.
Image Pyramids
Image pyramids attempt to address the problem of segmentation scale by providing a
multiscale representation of image content. Often implemented using Laplacian filters,
image pyramids are also extensively used in conjunction with frequency domain filters
such as Gabor, Fourier and Wavelets. These pyramid structures represent a bank of filters
that sample image content at gradually increasing resolutions, presenting a multiscale and
sometimes multi-directional description of image content.
Steerable image pyramids use basis functions which are directional derivatives; this
allows the pyramid's filter output in any direction to be calculated from a simple weighted
sum of previously determined function outputs. Steerable pyramids can be thought of as
a type of over-complete wavelet transform.
An application of Steerable Pyramids to achieve invariant texture recognition can be
found in [Greenspan94].
Simulated Annealing
Simulated Annealing is a function minimisation approach modelled upon the annealing
process in metals. Although many variants exist, the fundamental process
is to gradually arrive at a solution by probabilistically minimising an objective, or cost,
function. The threshold applied to this cost function is initially set high, so that many potential
solutions can be investigated without settling upon a local minimum, and is gradually reduced
until an optimum solution is found.
In practical terms, an initial segmentation of the image is first generated, often ran-
domly, and a slightly modified version is compared to it to determine if it represents a
better solution. If the new segmentation is better, or within the tolerance threshold, then
the new representation is kept and the process is iterated. After a given number of such
iterations, the threshold is reduced and the process repeated. This is analogous to reducing
the energy of the system until it ’freezes’ at the solution. In this way, an optimal segmen-
tation of the image will eventually be produced and the system is likely to have avoided
settling into early non-optimal solutions due to the higher tolerances in early development.
[Cook96] presents an example of Simulated Annealing to achieve image segmentation.
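The acceptance step at the heart of this process is small. The following generic Python sketch (illustrative; `cost` and `perturb` stand for problem-specific segmentation functions) uses the usual exponential acceptance rule:

import numpy as np

def anneal(initial, cost, perturb, t0=1.0, cooling=0.95, steps=2000, seed=0):
    # Accept a worse candidate with probability exp(-dE/T); lowering
    # the temperature T gradually "freezes" the system at a solution.
    rng = np.random.default_rng(seed)
    state, energy, t = initial, cost(initial), t0
    for step in range(1, steps + 1):
        candidate = perturb(state)
        de = cost(candidate) - energy
        if de < 0 or rng.random() < np.exp(-de / t):
            state, energy = candidate, energy + de
        if step % 50 == 0:
            t *= cooling                     # reduce the tolerance
    return state, energy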
Deterministic Annealing represents a combination of deterministic approaches guiding
the Simulated Annealing framework, giving greater control over initial and new seg-
mentation propositions. Although usually faster to arrive at a solution, Deterministic
Annealing is more likely to fall into local minima due to the less randomised nature of the
algorithm. Examples of segmentation through Deterministic Annealing can be found in
[Hoffmann98],[Hoffmann97].
Genetic Algorithms
As in most NP-hard search problems, segmentation can be achieved iteratively through
the use of Genetic Algorithms. Similar in nature to Simulated Annealing and modelled
upon natural evolution, this approach relies on a survival of the fittest process. Multiple
candidates for segmentation are first generated and then evaluated against a fitness func-
tion; those that perform the best, and a few chosen at random to maintain diversity, are
then ’mated’ with each other to create the next generation of segmentation candidates.
This next generation, generally seeded from the good candidates, is likely to represent
even better segmentation solutions. Gradually, after many such generations, an optimal
solution will be reached. Genetic algorithm based segmentation of natural scenes can be
found in [Albert90].
Bayesian Classifiers
A Bayesian approach can also be applied to image segmentation, using probability rules to
compare and classify different regions of texture or points in the image. The Iterated Con-
ditional Modes (ICM) algorithm, first proposed in [Besag86], which relies heavily upon the
Markov random field probability model, is commonly used in a Bayesian framework for
image restoration and can be applied to segmentation tasks. This algorithm relies upon the
high probability that an image point's value will be the same as neighbouring point values,
and iteratively updates and relabels these until a stable state is reached.
[Forbes98] provides an example of the use of Bayesian classification for segmentation,
implementing ICM.
Voting
The voting method is usually applied in a single scale texture segmentation process where
image content is subdivided into a series of overlapping grids. For each grid the nearest
predefined texture class is found and every point in the grid casts a vote for that class.
At the end of this relatively fast algorithm, each point is examined to determine which
class it voted for most often and is assigned the corresponding region label. Voting provides a fast grid-based
segmentation process that, to some extent, overcomes the effects of implementing the grid
geometry.
Minimum Length Encoding
Minimum Length Encoding is particularly useful where segmentation of images contain-
ing non-textured Mondrian surfaces is required. Based upon the Occam's Razor Princi-
ple, minimum length encoding examines polynomial function representations of an input
waveform, or series of image points, and subdivides them into a series of minimum cost
best-fit polynomials. This is usually achieved by generating a graph of adjacent potential
polynomial pairs along with a cost function associated with the complexity of replacing
them with a single polynomial. This process results in a sparse set of functions which,
when applied in segmentation, naturally correspond to individual surfaces of both different
colour/intensity and surface shading.
For this reason, Minimum Length Encoding can be most effective when dealing with
non-textured image segmentation, as can be seen in [Keren89] and [Peleg90].
Behavioural Systems
Behavioural systems use communal behaviour models, such as the complexity exhibited
by bee colonies, to generate self organising systems. These techniques define simple be-
havioural rules for low-level inter-communicating entities which lead to large scale complex
organisation. [Ramos00] and [Chialvo95] propose the use of swarm modeling for self or-
ganisation tasks such as segmentation and cognitive modeling. The self-organising nature
of these iterative processes can be seen as similar to many self organising neural network
approaches.
Neural Networks
Neural Network techniques lend themselves well to image segmentation as their parallel
nature can overcome many of the processing problems encountered in serial algorithms.
Self organising neural networks and Oscillatory Correlation nets [Chen00] such as LEGION
[Wang97] are ideal for the data clustering required for unsupervised segmentation. Most
networks suffer where region continuity is required as their parallel approach is not easily
applied to differentiating between separate regions of similar properties. [Austin96] uses
the AURA neural network architecture to search and label images from a library of pre-
encoded textures.
2.7 Signatures
Signatures form a simplified, non-reversible, accumulation of variables that can be used
for fast and approximate recognition. Sharing aspects in common with histograms and
parameter-space representations, signatures are usually formed from the plots of two or
more independent variables against each other (figure 2.17). While information is lost,
signatures can be a highly efficient means of fast image trait recognition. Plotting
two variables against each other in a two-dimensional space provides much
greater discrimination and separation of values than the use of single dimensional his-
togram descriptions. These plots can be linear or binary in nature, with the generality of
the description easily manipulated through the careful selection of signature dimension,
resolution, smoothing and storage. Care must be taken to constrain variables to maximum
and minimum ranges to facilitate normalization into signature coordinates. A common
approach is to use the results of variable ratios, constraining results to a 0 ≤ v ≤ 1 range,
which is also a common method used for the generation of localized geometric invariant de-
scriptions. Signature representations provide a very natural framework for the storage and
recognition of invariant description data [Weiss88, Squire00].
Figure 2.17: A simple two dimensional signature plotting the green-red and blue-red colour
ratios against each other
Earlier work using invariant measures combined with signature storage and statistical
comparison showed promising results, and it was originally anticipated that signature
storage would be the most likely method used in this thesis work. Towards the end of the
work, however, signature storage was abandoned in favour of one-to-one label based methods.
Chapter 3
Preliminary Work and
Experimentation
3.1 Signature Storage
A brief review of signature storage was presented in section 2.7; this work looks
into the practicality of encoding two dimensional signatures from invariant descriptions for
image component recognition. This section details preliminary work into the effectiveness
of signature storage and possible ways of encoding these to facilitate generalization and
efficient storage.
Local invariant values are usually generated as the expression of the relationship of one
image region property to another, and are therefore fractions by nature. The first issue
to be addressed is our strategy to normalize coordinate values into a predefined range
for each axis. The most convenient method of limiting ratio values is to ensure that the
smallest value is divided by the largest, limiting all results to a 0 ≤ v ≤ 1 range (0 where
both components are 0). While information about the directionality of the ratio is lost, if
consistently applied then this information is redundant as only one point for every pairwise
description is required. A greater problem is likely to occur where ratio components can
have different signs (and cannot be adjusted into a positive range), although this is not
common for image feature descriptions. For these preliminary tests, a signature size of
160 by 160 was selected.
Neighbouring Region Pairs or all Region Combinations?
The proximity of features and regions is regarded as an indicator that they will be in some
way related to each other [Alwis00]. When dealing with three dimensional image content,
there is a conflict between the need to generate descriptive invariants (where components
should be sufficiently different) and the need to limit components to localized ranges where
Mondrian planar geometry assumptions are most likely to hold (but components are likely
to be similar). The use of localized components also reduces the number of calculations
required from quadratic O(N²) growth to a more manageable linear range using only a
predefined subset of nearest neighbours. A neighbour constraint was incorporated into the
algorithm that could limit the signature values to components from neighbouring groups.

Figure 3.1: The set of 12 natural query images

Figure 3.2: The set of 12 natural images searched
Tests were conducted on the 12 natural images (figure 3.1, 3.2) using combined results
of simple geometric (area ratios, circularity ratios) and optical invariants (normalized
colour ratios) with the neighbour constraint switched on and off. The initial segmentation
methods tested were split and merge and watershed region growing techniques, with
multi-generation watershed region growing eventually selected to generate the best region
primitives. General scores attained in the target image match are reduced when the neigh-
bour constraint is active. However, the direct comparison of raw match scores between
different signature generation architectures is not appropriate in this case, as each image sig-
nature and resulting score range is unique to that invariant type. A more reliable indicator
of cumulative performance is the sum of rankings. The rank of the target image represents
the degree of image match compared to the full dataset of potential matching images, in
this case a rank of 1 indicates a successful match. Rankings for signatures results are
expressed as percentages.
If rank is taken into consideration then it was found that the use of the neighbour
constraint resulted in a marked improvement in the search results and the number of
successful image matches.
Region Culling
The next stage of this preliminary work involved looking into the effect of removing un-
wanted, underdeveloped groups from the signature input. A culling algorithm was applied
to discount groups that were too small to provide any useful geometric information (lead-
ing to distortion of ratio results due to image aliasing) and those intersecting the boundary
of the image (shape information is likely to be clipped).
if I > B/3 then cull
if B + B/2 > A then cull

where:
B = region boundary length (in pixels)
I = region boundary pixels intersecting the image boundary
A = region area (in pixels)
It was found that this kind of culling resulted in signatures that were too sparse when
applied to images of line drawings or images with few non-boundary regions (figure 3.3).
To avoid the culling of linear features, size based culling was reduced to a static threshold
based around group area.
The modified region culling and the neighbour constraint algorithms were tested on a
test-set of 12 natural images (figure 3.1, 3.2). Tests were carried out on both of the primary
signature types, one set for geometric properties such as area and relative position and
another on normalized colour. Little difference was found in recognition results while
processing speed was increased significantly, indicating that some form of boundary and
area based culling would be advantageous to our final algorithm.

Figure 3.3: Difficulties arising from the use of the neighbour region constraint and bound-
ary based culling. Generation of pairwise invariant signatures would be impossible in these
examples, where the grey areas are culled and each remaining region therefore has no
immediate neighbour.
Signature Comparison Criteria and Signature Storage
Initial development of the test software implemented direct binary signature storage. Both
invariant coordinates were calculated and scaled up to the appropriate signature size and
the point set to true on a binary array of the same size. These binary arrays (the signatures)
were then compared directly to each other. Figure 3.4 shows the original direct signature
storage and scoring criteria.
It was found that, unless a coarse signature size was used, this method of storage
lacked sufficient generalization and tolerance to small deviations in signature components.
To increase generalization, our next set of tests implemented non-Gaussian smoothing of
cumulative signature bins before reduction to binary storage for comparison (figure 3.5).
signature bin = 0 if S < M; 1 if S ≥ M

where S = the value of the cumulative signature bin and M = the mean bin value.

Figure 3.4: Direct storage of binary invariant signature.

Figure 3.5: Initial non-binary signature blurring and final binary signature.
The next test set was generated in the same way, but using component areas to in-
crement the signature points (rather than a simple count) before blurring and conversion
to binary storage (figure 3.6). The rationale behind this was that larger objects are more
perceptually significant and will be most tolerant to aliasing effects.
Unless we significantly decrease the resolution of signature descriptions, their direct
storage (even those reduced to binary representations) does not form a sufficiently compact
representation, especially where large numbers of signature descriptions for each image are
anticipated. It would also be convenient if our storage methods included the generalization
we have so far been using signature blurring to achieve. For these reasons, it was decided to
test the use of multi-scale moments to encode signature descriptions into a more compact
label form. The signature is subdivided into grids of different scale (figure 3.7), each grid
generates two coordinate values representing the normalized deviation of the grid central
moment from the actual center, each converted to 8 bit storage. The grid subdivision was
iterated six times to constrain the data storage requirement to one comparable with direct
binary storage. The actual resulting storage requirement at this resolution was still less
than the equivalent direct binary storage.

Figure 3.6: Initial non-binary signature blurring, using area bias, and the final binary
signature calculated from the mean value.
Final direct binary signature size = 160 × 160 = 25600 bits
Final moment signature size = 2(1 + 4 + 16 + 64 + 256 + 1024)(8) = 2(1365)(8) = 21840 bits
The actual method used to divide the signature image into multi-scale grids is shown
in Figure 3.7, and decreases the grid size by half at each level of scale. The final grid level
of scale actually implemented in the test algorithms was chosen to be 6, largely because
the storage requirements for the list of moments were then equivalent to those required by
the direct binary signature storage method. To avoid the need for grid weights, and to avoid
bias towards signatures with widely dispersed plots, empty grids were assigned the
moment value of an average distribution (0,0), representing the center of the grid.
Evaluation of signature similarity was then performed by calculating, for each grid cell,
the normalized length of the vector formed between the two stored moment vectors and
summing the results.
All five methods of signature storage (direct binary plot (D), blurred binary plot (DB),
blurred binary plot with area bias (DBA), multi-scale moment storage (M) and multi-scale
moment storage with area bias (MA)) were compared against each other.
Figure 3.7: The signature is divided into multi-scale grids and the two vector coordinates
for each central moment are stored in a single combined list. A 5 level example would
require 2(1 + 4 + 16 + 64 + 256) = 2(341) = 682 integer entries.
Neighbour Constraint: TRUE
Rank classification of the target image (1 = correct recognition):

Storage method:  D     DB       DBA  M        MA
Mean Rank:       3.75  2.33333  2.5  1.16667  1.16667
These results indicate that as well as allowing economical storage, multi-scale mo-
ment storage can produce improved recognition results, with implicit generalization, when
encoding signatures.
3.2 Initial Experiments with Linear Gestalt segmentations
While the development from traditional pixel-based segmentation methods toward a truly
Gestalt engine will be discussed later, it is first prudent to consider Gestalt grouping in a
more simplistic context. Images specifically designed to test Gestalt grouping principles
are pre-segmented using traditional segmentation algorithms to generate simple region
primitives that can then be used in conjunction with Gestalt grouping algorithms. In this
way, different approaches towards Gestalt grouping can be implemented and tested, and an
insight into the difficulties and issues of incorporating Gestalt grouping principles can be
gained. A truly Gestalt engine would need to apply Gestalt rules (as discussed on page 15)
to larger scale visual objects that can provide richer descriptions. Although our goal is to
develop an algorithm that is capable of dealing with both types of segmentation (symbolic
and photographic), it is appropriate to study both forms in isolation before beginning the
integral engine.
A segmentation/grouping engine was created based around similar fundamental princi-
ples, but specifically designed to operate upon simplified image content and the application
of Gestalt rules. A library of Gestalt images was created which could be used to test this
algorithm's ability to perceive symbolic Gestalt groupings using the three most important
Gestalt rules: proximity, similarity and continuity (figure 3.8).
Specifically, the algorithm was to work on pre-segmented primitives and attempt to
find the best linear groupings. For example, the top left image in figure 3.8 would be
presented to the algorithm as a set of 45 nodes describing the position and appearance of
the 45 distinct black object primitives in the image (the white background is discarded).
Using these primitives, the image can be grouped into a number of Gestalt regions which
depend upon the scale at which the image is examined. In figure 3.10 we can see the
two most apparent groupings, a purely linear grouping that forms 9 distinct groups and
a larger scale grouping of these lines to form 3 distinct groups/clusters. As it is the first
level of grouping into new linear primitives that this algorithm is designed to examine, we
would expect it to output the 9 group result given this image.
The Linear Gestalt Grouping algorithm was developed from the algorithm originally
detailed in [Thorisson94] to simulate perceptual grouping. This algorithm clusters pre-
generated image primitives into perceptually significant groups based upon similarity of
appearance (colour, shape, brightness, size) weighted by their proximity within the image.
Their approach is to create separate ordered edge lists for each appearance attribute by
calculating the similarity of each attribute between all pairs of objects in the image and as-
signing them an inversely proportional score. The resulting lists are then further weighted
by a score inversely proportional to the proximity of each pair of primitives that formed
the edge. The edges are then sorted with the highest scores at the head of the list, ensuring
that the most perceptually significant groupings (those that are similar in appearance and
proximal to each other) will be processed first. The algorithm then iteratively searches
through each list looking for the most significant difference between adjacent edge scores
(essentially evaluating second order edge differences). As each significant difference is
found, all primitives belonging to edges prior to that difference are labelled as a grouping.
The lists are then iteratively searched in the same way for smaller and smaller significant
differences until a minimum threshold value is reached. This process results in sets of per-
ceptually significant groupings and their subgroups for each of the appearance attributes.
Groupings from the different appearance lists are then combined (with identical groupings
merged) and assigned a 'goodness' score to form the output of the algorithm. Figure 3.11
shows a typical output from the Thorisson algorithm.
Whilst the use of ordered lists provides a useful starting point for Gestalt processing,
[Thorisson94] is only capable of detecting Gestalt cluster types based around similarity
and proximity. This results in the drawback that perceptually significant clusters of the
same type, but separate from each other in the image, are assigned to the same group.
For this thesis, we require the ability to separately identify such disconnected grouping
of the same type, so some modification to this algorithm is essential if we are to use
elements of this work in this thesis. A further drawback is that, in its original form, this
algorithm contains no elements relating to continuity or linearity, something which this
work intended to investigate.
The algorithm, whilst providing a useful starting point for this work (in particular
the use of ordered edge lists and appearance descriptions), was eventually altered into
something very different. The Linear Gestalt Grouping algorithm that evolved from this
starting point was to use a single edge list of combined features, first order edge differences
and the imposition of both weighting by continuity and a linear constraint to generate
groupings, as detailed below.
Similarity measures, data structures and equations relating to Gestalt similarity largely
follow those detailed in the original paper. The algorithm relies heavily on self-
normalization and treats the region similarity description types separately until final com-
bination. Proximity information is based around the minimum distance between region
boundaries, rather than region centroid, and is implemented as a function of the separate
similarity measures. An overview of the algorithm can be seen in figure 3.12.
Once a simple thresholded nearest neighbour algorithm has extracted the basic objects
from the highly artificial Gestalt images, the algorithm generates a set of edge lists. Each
edge list represents the difference between region pairs in that particular attribute. Size
(e1), colour (e2), orientation (e3) and circularity (e4) edge lists are generated and weighted
by the boundary proximity of the regions.
Size Difference: e1 = (s1 − s2)²

where s1 and s2 represent the areas of the two regions, in pixels.
Colour Difference: e2 = (R1 − R2)² + (G1 − G2)² + (B1 − B2)²

where R1, G1, B1 and R2, G2, B2 are the RGB colour components of the two regions.
Orientation Difference: e3 = min(θ1, θ2)

where θ1 and θ2 are the two angles (θ and 180° − θ) between the unit orientation vectors
o1 and o2.

Orientation vectors are calculated using second order correlation moments from the
centroid of the region. If ri is the set of pixel positions in region i (disregarding pixel
colour information), then:
oi = ( Σ over v⃗ ∈ ri of D(v⃗) ) / si ≡ (sum of vectors from the centroid) / (number of pixels in region i)
Function D() ensures that vectors in the left hand plane are reversed into the same
180° range:

D(v⃗) = c⃗i − v⃗ if (x component of c⃗i) > (x component of v⃗); v⃗ − c⃗i otherwise

with c⃗i being the centroid of region ri.
Circularity Difference: e4 = (circ1 − circ2)²

where circ1 and circ2 measure the circularity or linearity of the two regions and are
defined over the range 0 ≤ circi ≤ 1 (0 → circle, 1 → line):

circi = (bi − ai) / (si − ai) if bi > ai; 0 otherwise (region too small to use the boundary count)

ai = √(4π si)

With bi being the number of pixels that lie on the region boundary and si being the
number of pixels that form the region. ai is the circumference the region would have if it
were circular in shape, and also represents the theoretical minimum possible boundary for
the region. In reality it is possible for this theoretical minimum to exceed the boundary
pixel count (which is approximated by the image grid) of small regions, so a check is made
to compensate for this.
Once these attribute difference lists (ejk, where j indexes the attribute type) are gener-
ated from each region pair within the image, they are self-normalized (using the maximum
and minimum list values to map scores into the range 0 ≤ v ≤ 1) and summed with the
normalized minimum boundary distance (dk) between the region pair k:

dk = minimum distance between the pixel boundaries of the two regions joined by edge k

Normalized ejk ⇐ √(ejk² + dk²) / √2
While this algorithm makes use of the minimum boundary distance, which provides the
true minimum distance between two image regions, calculating this distance requires all
boundary points in one region to be compared with all boundary points in the other. This
approach can prove to be a considerable drawback where speed of operation is an issue.
Later work relies on the region centroid difference, calculated using first order moments,
as a less accurate but more efficient distance component (see figure 3.13).
Each list is now sorted so that ej,k−1 ≥ ej,k ≥ ej,k+1
It was found that the self-normalization of description types was causing unwanted
bias towards certain description types when the description lists are combined to form an
overall score. This also resulted in an over-sensitivity to change in image content when a
particular description had a narrow range of values. A different approach to normalizing
the score values of list items was implemented that would allow the production of an
overall score based, instead, upon the rank of an item within a description list.
ejk = k / (size of description list)
With this alteration, edges that have the same significance (but different score mea-
sures) in differing feature description lists are considered equivalent. The replacement of
actual proximity values with list placement values also neatly circumvents any potential
unit bias when combining significance scores of different feature types (for example, size
and colour) for a single edge.
We now have a set of j ordered edge lists, one per attribute, each with 0 ≤ k ≤ Size(ej)
entries, running from unlikely connections with small scores to the most likely with higher
scores: ej,k−1 ≤ ej,k ≤ ej,k+1.
At this point a different approach was taken to [Thorisson94], which continues treating
the attribute lists separately and runs the grouping algorithm for each, only combining
extracted groups afterward. We eliminate the need to run a separate grouping process for
each description type by combining the individual description lists into a single master list
(m) such that, for each region pair k:

mk = ( Σ from j=1 to J of ejk² ) / J

where J is the number of feature types.
The final master list is sorted so that pairings with higher scores (those with the most
perceptual significance) are at the front of the list.
The following data structures are required by the Linear Gestalt Grouping algorithm.
The main structure is the sorted edge list Edge which is traversed during algorithm ex-
ecution and contains pointers to individual nodes. Individual nodes represent the image
(region) primitives that are presented to the algorithm before execution and are stored in
the node list Nodes. Before execution, each node points to its own separate region, stored
in the list Regions, and vice versa. During execution regions are merged and developed by
having parent region node references added to their node lists and the component nodes
pointed toward the new child region structure. During execution, nodes also act as junc-
tions between edges; the linear constraint in the algorithm prevents more than two edges
connecting to each node (see algorithm 11.2 and figure 3.14).
The main Linear Gestalt Grouping algorithm (algorithm 11.3) runs through the edge
list, evaluating continuity and junction constraints and reordering edges once continuation
information is available. Edges that remain in the current position and follow the linear
constraint form bridges between regions, which are combined.
Each time an active edge is reached in the sorted edge list Edge it is first checked
in JunctionTest() to ensure that the edge could form a valid connection. Preventing
more than two edges from connecting to the same node ensures linear groupings only
and checking that each end of the edge resides in a different region prevents unnecessary
computation. Continuity is then evaluated each time a new edge junction occurs (11.5). If
neither of the nodes pointed to by the edge are connected by other edges then no continuity
evaluation is possible and the edge is automatically accepted. If the edge’s nodes connect
to other edges then a continuity evaluation is possible. In this situation the angle between
each edge junction is calculated (11.6) and the perceptual significance score of the edge is
altered according to the continuity of the worst junction and the predefined Continuity
Bias.
Edge.Score = Edge.Original Score + (Minimum Junction Angle / 180°) − Continuity Bias

where 0° ≤ Minimum Junction Angle ≤ 180°, so the normalized angle term lies in the
range 0 to 1. The result is then clamped:

Edge.Score = 0 if Edge.Score < 0; 1 if Edge.Score > 1; otherwise unchanged.
If the junction angle is lower than the predefined minimum angle (MC ) then the
junction is too acute for the continuity constraint and the edge is rejected. If the edge
is not rejected and the new score differs from its previous value then it is repositioned
in the list according to this score. If the edge has already been repositioned according
to continuity constraints then the new edge score will be the same as the previous score,
which indicates that the edge can be accepted and the regions merged. The algorithm repeats
this process until all edges have been evaluated and the output is a set of region groupings
that adhere to proximity, similarity and linear continuity constraints.
Discussion
Figure 3.15 shows how effective the Linear Gestalt Grouping algorithm is on the test
images it was designed to deal with. With simplified images where clustering does not
play an important part of the perceptual grouping process the algorithm correctly groups
the patterns into linear primitives. As the algorithm only updates edges according to local
junction continuity constraints, and does not update edge proximity or similarity scores as
regions develop, it is incapable of performing a multi-scale segmentation in its
current form. Given a series of primitive regions to work with, this algorithm can only
find perceptually significant linear groupings using the original primitives alone. Figure
3.19 shows the different linear regions that form the output from the algorithm.
As shown in figure 3.21, the influence of the Continuity Bias (CB) variable is fairly minimal
when compared to Minimum Continuity (MC) and serves to determine the value at which
continuity information begins to increase or decrease initial edge score values.
As discussed earlier, this algorithm is not designed to detect groupings where non-
linear clustering would be required. Figure 3.17 shows the confusion caused when the
Linear Gestalt Grouping algorithm attempts to segment such image types. These group-
ing errors are further compounded by the algorithm's determination to locate all linear
connections before termination, which results in a tendency towards over-segmentation and
perceptually insignificant connections. One possible method of dealing with this prob-
lem would be to iterate the algorithm and feed grouped regions back into the process. As
illustrated in figure 3.24, this would allow the generation of larger scale primitives and
allow cluster regions to be generated. The primary impediment to implementing such
a process is the need to determine when the linear segmentation should stop trying to
form new connections. At this point it remains unclear how this stopping point could be
evaluated so that more complex composite images such as used in figure 3.21 would cease
grouping once each sub-image is correctly grouped (as in 3.9 and 3.14). This tendency to
form links between what should ideally remain disparate regions (at least in that given
grouping generation) also highlights the difficulty of deciding just which types of connection
should form the basis of a bridge between two region primitives in the first place (figure
3.25). Note that a single pass implementation of the Linear Gestalt Grouping algorithm
could only ever join all the regions in this image as in figure 3.25b, although the discon-
tinuity required to join both vertical groupings would be disallowed for most Minimum
Continuity control settings. Related to this problem is the algorithm's tendency to search
for linear configurations even within complex and clustered regions where such groupings,
whilst valid, are in no way apparent.
The bottom-up approach of the algorithm, and its evaluation of the current linear
grouping regardless of other groupings present in the image that may perceptually obscure
it, can result in the generation of perceptually insignificant groups such as those in
figure 3.22. This is particularly obvious when we attempt to generate groupings from
real images such as in figure 3.23. From this image we can see that virtually all of the
resulting groupings cannot be perceived in the original image at all. The problem that
such a photographic image poses to the Linear Gestalt Grouping algorithm is that the
algorithm is trying to impose a non-parallel linear constraint upon image data where
segments are made up from non-linear clusters of primitives. The non-parallel way in
which linear connections are made (prohibiting other potential connections) during the
grouping results in low-level linear feature extraction that does not allow the further
generation of relationships between newly generated higher level features (or the replacement
of previously generated groupings given the new context) as the algorithm progresses.
While good low-level linear groupings are being generated here, it is apparent that these
groups are not the same perceptual groupings that we would associate with the image
content. While some of these negative effects would be reduced if image edge segments
were used as primitives (boundaries are linear in nature) it is apparent that a more parallel
and multi-scale architecture that combines both linear and clustering approaches would
be better suited to this kind of data.
The Linear Gestalt Grouping algorithm was designed to test the Linear and Gestalt
based grouping of primitives from simple test images. It is neither iterative, nor does it
provide a comprehensive framework from which we can build a full pixel-to-Gestalt-region
segmentation engine. Figure 3.23 demonstrates the resulting groupings when the algorithm is
applied directly to more complex photographic image types. In truth, this is not a fair
test of the algorithm as little attention has been paid to the initial segmentation process
from which the Linear Gestalt Grouping algorithm takes its primitives.
It can be seen from results such as those in figure 3.23 that the imposition of linear
and Gestalt continuity constraints, whilst appearing reasonable at first, can lead to an
algorithm that is too specific for general image content. The application of such Gestalt
constraints within algorithms does not necessarily result in a useful segmentation/grouping
process in practice. Further to this, there is a lack of knowledge relating to the degree
or the method in which the different Gestalt principles should be properly applied to
generate a human-like segmentation/grouping of image content. Care must be taken to
avoid adherence to Gestalt rules at the expense of realistic and useful results. While
Gestalt principles can be seen to be correct in the general case, they have been compiled
through careful observation of the final groupings and associations that humans tend to
make with image content and are not necessarily representative of the actual underlying
processes that generated those groupings in the first place. It can be argued that Gestalt
principles are actually the visual results of an underlying process rather than the actual
way in which human vision groups visual content. Given such results it seems reasonable
to proceed focusing on the most basic, and most reliable, Gestalt principle of proximity
rather than trying to implement the full set of Gestalt principles. As will be discussed
later, the application of the proximity principle to spaces that incorporate appearance as
well as positional data actually gives rise to many of the other Gestalt principles, such as
similarity and common fate.
If our segmentation algorithm is to work from the pixel level upwards whilst being as
intrinsically Gestalt in nature as possible then we will have to develop an algorithm that
can operate faster whilst handling very large numbers of primitive components. Whilst the
inclusion of continuation and linearity constraints would be beneficial, it may be sufficient
to approximate these to facilitate a multi-level engine that can segment images from the
pixel level upwards. The ability to deal with pixel clusterings in a similar way to most
standard segmentation engines would seem essential if the algorithm is to deal with real-
istically sized natural images in a practical time-frame. What is required is a compromise
segmentation (grouping) algorithm that, whilst retaining Gestalt principles, can operate
with the speed and efficiency of many standard segmentation algorithms.
The next section looks more closely at implementing such a mid-level segmentation engine and
how we may blend the two approaches to segmentation and grouping together.
Figure 3.8: Examples from the Gestalt library, compiled to test the algorithm's ability to
group objects using Gestalt principles.
Figure 3.9: Examples of successful outputs and linear groupings actually generated from
the final Linear Gestalt Grouping algorithm
Figure 3.10: The number of perceptual groups, regions or objects varies depending upon
the scale at which we view the image.
Figure 3.11: Gestalt clusters generated through the use of second order differences in
ordered feature lists, images taken from [Thorisson94].
SEGMENTATION INTO PRIMITIVES → GENERATE AN ORDERED EDGE LIST → PROCESS EDGE LIST → OUTPUT GESTALT GROUPS

Figure 3.12: The basic architecture of Thorisson's algorithm, [Thorisson94], forms the
basis for the Linear Gestalt Grouping algorithm.
Figure 3.13: The blue arrow represents the region centroid difference (a fast approximation
of region distance), the red line indicates the minimum boundary distance (a slow but
accurate measure).
Figure 3.14: A simple example of the Linear Gestalt Grouping algorithm in action, showing
the relationships of the Node, Region and Edge data structures.
Figure 3.15: Showing linear groupings generated by the LGG algorithm (MC:0.7, CB:0.8).
Green lines indicate the strength of connections (darker=weaker) and node colour indicates
region groups.
Figure 3.16: The effect of description elements on the final groupings
Figure 3.17: Images which require grouping into non-linear clusters should, and do, confuse
the Linear Gestalt Grouping algorithm.
Figure 3.18: Increasing the segmentation complexity and context can lead to difficulties.
Concise region descriptions (left) can cause obvious grouping errors where region primitives
have complex curves and structure. The segmentations to the right show the difficulties
caused where the algorithm reaches optimal groupings but continues to try and segment
the image further.
Figure 3.19: The full set of linear region (group) outputs from the Linear Gestalt Grouping
algorithm when applied to test image 3.8a (MC:0.7, CB:0.8)
Figure 3.20: The full set of linear region (group) outputs from the Linear Gestalt Grouping
algorithm when applied to test image 3.8h (MC:0.7, CB:0.8)
(a)–(p): all combinations of MC ∈ {0.2, 0.4, 0.6, 0.8} and CB ∈ {0.2, 0.4, 0.6, 0.8}.
Figure 3.21: Linear Gestalt Grouping algorithm used on a composite of test images. Green
lines indicate successful connections, black lines indicate connections considered but failed
and node colour indicates final group. Each segmentation is generated using different
minimum continuity (MC) and Continuity Bias (CB) values.
(a) Image being processed; (b) all linear groupings found; (c) perceptually insignificant
grouping; (d) perceptually significant grouping.

Figure 3.22: The Linear Gestalt Grouping algorithm cannot determine whether a grouping
is perceptually significant or not: it cannot detect the significant perceptual difference
between groupings (c) and (d).
Figure 3.23: Showing linear groupings generated by the LGG algorithm when applied to a
simple greyscale photograph (MC:0.7, CB:0.8). Central images show initial segmentation
results (the primitives used to form the surrounding linear groupings) and the original
image. Surrounding images are a small selection of linear groupings extracted by the
Linear Gestalt Grouping algorithm. Notice that extracted linear groups, whilst being
valid, are often not perceptually significant in the original image.
(a) 1st Generation (b) 2nd Generation (c) 3rd Generation
Figure 3.24: The Linear Gestalt Grouping algorithm can be used to generate new linear
primitives which could be fed back into the algorithm for larger scale and cluster groupings.
(a) Image primitives; (b) linear connection from 1st generation edges; (c) prior generation
edge forming a bridge between 2nd generation region primitives; (d) 2nd generation edges
between 2nd generation primitives.
Figure 3.25: Showing three examples of edge types that could be used to form the bridge
between perceptually related regions.
Chapter 4
Gestalt Multi-Scale Feature
Extraction
4.1 Motivation
The first step in this work is the extraction of useful information from the raw image data.
Although comparison of pixel to pixel information can be effective for direct matching
between identical images, most practical recognition systems require faster, more efficient
or more generalized searches. The most commonly used representation that can be toler-
ant to image content variance, yet still detect similarity, is the use of colour histograms.
While colour histograms are highly efficient to generate, and produce very good recogni-
tion results, they are dependent upon colour or intensity information being available in
the image content and do not encode geometric information. This makes them unsuitable
for sketch recognition tasks. Other pixel-based approaches include the derivation of higher
order global and algebraic invariants directly from pixel correlation and moment values.
Of slightly more complexity, but the focus of much current research interest, is the use of
Fourier descriptors and Wavelets to characterize image content in terms of their component
frequencies and the generation of invariant descriptors based upon these. The common
problem with the above approaches is that they are generally reliant upon global image
properties which can effectively describe image similarity, but are less useful when applied
to the detection of image parts. Deriving invariants from the pixel content of an image
can also be processor intensive due to the sheer number of pixel elements within images
of useful size. A further difficulty with pixel-based approaches, especially when applied in
areas such as sketch recognition, is that they are unable to make use of large-scale shape
and boundary information present in image content. A more useful and intuitive primitive
to use for image recognition is the image object or segment. As well as global colour,
texture and geometric properties, humans also perceive sub-components, inferred objects
and gestalt groupings within images. These image objects can be a direct representation of
a real-world object in a photographic image, or the more abstract objects inferred from
sketch based materials or visual occurrences of written language. Although the task of
subdividing an image into such abstract objects is extremely difficult to simulate on a
computer due to the reliance upon higher levels of processing and prior domain knowl-
edge, it is possible to use general grouping rules to satisfactorily segment most images into
meaningful objects. These Gestalt rules are based upon the observation of how humans
perceive and group image content and provide general principles that can be built upon by
practical machine vision applications. The full set of Gestalt rules and their effect upon
how we perceive image content is explained in more detail in the review on page 15 of this
thesis. Generally speaking, Proximity and Similarity are considered the most important
Gestalt rules. Continuity, Common Fate, Completion, Closure and Simplicity are usually
thought of as having less importance. Region, Connectedness and Periodicity are often
overlooked completely in Gestalt research. Applying all these rules in an efficient and
congruous manner to achieve image segmentations that perform as well as human based
interpretations is one of the key problems faced by researchers in machine vision today.
For the sake of efficiency, most segmentation engines only apply a subset of these gestalt
rules, and are primarily concerned with proximity on a pixel or texel level. Such common
simplifications include the use of the 8 pixel neighbourhood, where only the immediate
surrounding image pixels are considered for potential grouping. Many segmentation algo-
rithms operate on a pixel level only, assuming that image content is Mondrian in nature
and therefore all salient regions will consist of similar, interconnected, colour values. In
reality, most natural image content consists of objects that are highly textured and contain
large differences between pixel intensity values. Attempts to deal with such content have
resulted in research into the subdivision of image content using texels, larger scale areas
of common texture, rather than individual pixel values. Although these approaches go
some way towards addressing the problems associated with natural images, the selection
of appropriate scales and computational complexity issues still pose difficulties. As texels
are just a larger collection of image pixels, at some point during the segmentation process
the size of the texel must be determined. This is usually done in an arbitrary way, by
using a constant texel dimension, although there has also been research into the use of
dynamic scale selection or multi-scale approaches.
In this research, it is proposed that image object content should be extracted that is
consistent with a human-led segmentation. It follows that a description derived from prim-
itives similar to those a human would perceive is likely to facilitate human-like similarity
judgements. A time limit of approximately 5 minutes (using a commonly available 2.4
GHz computer) will be imposed upon the segmentation/grouping and recognition process
to retain its practical usefulness for potential further applications such as the core of a
search engine.
This part of the thesis will be concerned with the extraction of multi-scale image
primitives utilising as many Gestalt rules as practical whilst still maintaining real-time
processing speeds.
4.2 The Core Segmentation Algorithm
Although most segmentation/grouping/clustering algorithms are based around the ma-
nipulation of ordered edge lists, the way these lists are evaluated has a dramatic impact upon
the type of grouping that will occur. Approaches such as [Thorisson94] and [Shi97], which
subdivide the edge list directly into groups based around 2nd or 3rd order list discontinu-
ities are effective at the fast generation of perceptual clusters such as those in figure 3.11.
Optimal positions for subdivision of the list (which affect the scale of the clusters) can be
difficult to determine and only the first sets of clusters on the sorted edge list are actually
likely to represent good primitives. This represents a very parallel approach, where many
primitives need to be combined into a new cluster in a single step (with no information
about inter-cluster relationships being used). Different cut thresholds can effectively alter
the scale of the clusters but the relationships between subgroups or the use of higher or-
der Gestalt grouping rules such as continuity are difficult to incorporate (see figure 4.8).
Another way of processing an ordered edge list is to process each edge entry in order,
building up groups as each pairing is allowed or disallowed. While this is a less parallel
approach, we no longer need to define any cut thresholds and can update edge and group
descriptions as the grouping process develops. This is essential if we are to capture Gestalt
relationships between segmented sections in the image. The previous section concluded
that the Linear Gestalt Grouping algorithm was too specifically targeted at finding linear
structures and too slow to be useful in image types where clustering is more appropriate.
While we wish to keep the ordered progression through an edge list, we require much less
prohibitive grouping decisions from our core feature extraction algorithm.
A simple algorithm developed by Pedro F. Felzenszwalb and Daniel P. Huttenlocher
[Felzenwalb98] seems to offer just such a compromise. Offering ample opportunity for
further development with higher order Gestalt grouping, this segmentation algorithm deals
efficiently with Mondrian, textured and ramped image content. It makes clever use of edge
information to incorporate limited texture and higher level segmentation performance
while running at a speed equivalent to simpler pixel based algorithms once initialized
correctly. Sharing the approach used in the previous section
on Linear Gestalt Grouping, the algorithm is based around the processing of sorted edge
lists but without the prohibitive insistence upon linear groupings. In its original form,
the algorithm takes approximately O(N log N) time to initialize and O(N) to complete the
segmentation, where N is the number of pixels in the target image. Initialization requires
generating a sorted edge list, but this can be optimized using a median-of-three pivot
version of Robert Sedgewick's optimisation of the Quicksort algorithm ([Gosling96]).

Figure 4.1: Showing the selection of 8 nearest neighbours based upon the image grid and
the corresponding edge weight values generated.
Given the similarity in methodologies used plus the efficiency with which this algorithm
can generate a pixel level segmentation, this represents a good foundation from which an
algorithm capable of both low and high level Gestalt grouping and segmentation can be
developed. The original algorithm is not designed to be multi-scale, and instead seeks to
arrive at a meaningful segmentation that is neither too coarse nor too fine.
A graph-based approach is taken towards achieving a segmentation, with G = (V, E)
representing an undirected graph with vertices v ∈ V corresponding to the set of pixels
from the image grid to be segmented and the edges (vxy, vjk) ∈ E.
The three primitives used within the algorithm are nodes, edges and groups. Each
node represents a single image pixel. Pixels are connected by weighted edges.
Interconnected edges and nodes form larger groups, or image segments (see figure 4.2).
The algorithm is pre-seeded with a list of edges (E) generated from image pixel intensity
differences between 8 pixel grid neighbourhood pixels (figure 4.1). For each image pixel
(vxy) a set of 8 edges is generated with a corresponding weight W (e) ≡ W (vxy, vjk).
ei = (vxy, vjk) such that vxy ∈ V, vjk ∈ V, |x − j| ≤ 1, |y − k| ≤ 1 and |x − j| + |y − k| > 0

E = {ei} ordered such that W(ei) ≤ W(ei+1)
E is then sorted by weight value so that the lowest cost weights are at the beginning
of the list; edges will be processed in order of increasing weight.

Figure 4.2: Nodes, Edges and Groups (Segments) form the basic building blocks of the
segmentation algorithm.

Figure 4.3: The interior difference Int(S) of a region S is defined as the largest edge in
its minimum spanning tree, whilst the difference Dif(S1, S2) between two segments is the
minimum weight edge connecting them.
Weights represent a non-negative measure of difference between the two connected
vertices using colour, intensity, position or some other appropriate description attribute.
The original algorithm uses the absolute intensity I(v) difference between connected pixels
to form edge weights:

W(vxy, vjk) = |I(vxy) − I(vjk)|

(Our own implementation instead uses the squared Euclidean distance between the normalized
description coordinates C over the D description dimensions:
W(vxy, vjk) = Σ from i=0 to D−1 of (C(vxy)i − C(vjk)i)².)
Extracted regions, or segments, represent unique and non-overlapping internally con-
nected subgraphs of G. Each segment S has its own internal difference measure, Int(S),
which is the largest weighted edge present in the minimum spanning tree (MST) of seg-
ment S, itself made up of edges within the edge list E, S(V, E) ∈ G(V, E):

Int(S) = max over e ∈ MST(S, E) of W(e)
The difference between two segments, Dif(S1, S2), is defined to be the minimum weight
edge that connects the two segments (see figure 4.3):

Dif(S1, S2) = min over vi ∈ S1, vj ∈ S2 of W(vi, vj)
Notice that, given edges are processed in order of increasing weight, any edge internal
to a segment must be either smaller or equal to any edge connecting a segment pair. This
means that the new internal difference Int(S) of a merged pair of segments is identical to
the weight of the edge that joined the two (see development of internal difference vectors
in figure 4.4). Another consequence of using a sorted edge list is that if the current edge
being processed has vertices terminating in different segments, we can be assured that
it is the minimum weight edge connecting the two segments and that all segments are
automatically composed of a series of connected edges which form a minimum spanning
tree.
Initially, each vertex in V is labelled with its own unique segment and no grouping
between pixels has occurred. The algorithm then moves through the sorted edge list
(E) and, if the edge terminates in vertices belonging to different segments, compares the
current edge value against the region comparison function. This function (IsEdge(E))
determines if the edge is a true edge or not, basing its decision upon the current edge
value and the internal differences of the two segments linked by the edge. If the region
comparison function determines that the difference between the two vertices V that make
up the edge is too small to prevent a merge, then both vertices are adjusted to belong to
the same segment. This process represents a merging between two segments that the edge
vertices belong to, rather than just the two individual vertices themselves.
IF ( e(vxy, vij) ∈ G(V, E), vxy ∈ Sk, vij ∈ Sl, k ≠ l ) AND ( IsEdge(e) = FALSE ) THEN
    Sk(V) = Sk(V) + Sl(V)
    Sk(E) = Sk(E) + Sl(E) + e
    Int(Sk) = W(e)
    Sl = ∅
The next edge in E is then selected for processing until the end of the edge list is
reached, or only a single segment S remains. It can be seen that the region comparison
function utilises a simple measure of internal difference to incorporate textural information
into the segment merging process, which follows the pre-seeded edge connections formed
by initial pixel vertex values.
Figure 4.4 shows a simplified illustration of the separation of a dataset into groups,
with the multi-dimensional distances between pixel descriptions represented as distances
on a 2 dimensional plane. Each node (dot) n corresponds to an image pixel that is
described by its colour or intensity properties and has a unique pointer to the group in
which it belongs (initially each node has its own group which, in turn, contains a pointer
back to the node). Each node generates 8 edges e, which are described by the
Euclidean distance in intensity/colour between the node (pixel) and its 8 neighbouring
nodes (pixels). The edge structures also contain references back to the two nodes that
formed them.

Figure 4.4: Shows the development of segments and the relationship between Nodes,
Edges and Groups. Red lines indicate the current edge being processed, dark blue lines
indicate the maximum internal difference in a group (which is always equivalent to the
last edge that was responsible for an expansion of that group). All other colours indicate
the separate groupings being formed. The ordered edge list is shown at the top of each
box, with the edge, node and group relationships below.

The edges are sorted so that the weakest edges appear at the head of the
list and will therefore represent the best candidate edges across which groups are likely to
merge. Running through this ordered list, we apply each edge to the region comparison
function. If the region comparison function determines that the edge is a true edge, then
no action is performed and the next edge is selected for processing. Where the region
comparison function determines that an edge should form the basis of a merge between
the two groups at either end of the edge, these groups are merged into a single group. This
can be done easily because each edge contains a pointer to its two nodes, which in turn
point to the groups (segments) that they belong to. Two groups are merged by combining
their lists of node pointers into one of the groups and using their pointers to ensure that
member nodes now all point to that group. The orphaned group (which now contains no
nodes and has no references to it) can then be removed.
In this way, any subsequent edge that points to a node belonging to the new group
can instantly access information about that group. Table 4.1 shows the system of two-
way pointers used during the grouping process. Member nodes contain pointers to the
groups/segments they belong to and the groups, in turn, point back to their member
nodes. Eventually the entire edge list is processed, and we are left with the original image
subdivided into a number of groups/segments. Each pixel in the image is represented
by its node, which points to the group structure it belongs to. Conversely, each group
structure that survives the process contains a list of pointers to the nodes (and therefore
the pixels in the image) that belong to it.
EDGE (formed from the edge between two adjacent pixels):
    pointer to NODE (pixel) 1
    pointer to NODE (pixel) 2

NODE (formed from a single pixel; pointed to by multiple Edges):
    pointer to the GROUP (segment) it belongs to

GROUP (a set of pixels that have been merged into a single segment):
    a list of pointers to its member NODES

Table 4.1: The system of two-way pointers linking the Edge, Node and Group structures.
As all variables are taken from the pre-seeded edge list, this algorithm performs a
pseudo-texture segmentation extremely quickly in linear time. However, the use of a
pre-seeded edge list also means that the possible development pathways of groupings are
predetermined from raw pixel data and do not take into account any higher order struc-
ture present in the image. While the internal difference texture property changes during
segmentation and influences the grouping decision, the actual edge pathways considered
for grouping are all predetermined from 8 pixel neighbourhoods. In cases where pathways
cannot be formed between adjacent pixels (figure 4.5), the algorithm has no pathways
from which to begin the segmentation. To achieve a Gestalt segmentation that can
generate new potential edge pathways, based around higher order group descriptions as
the grouping process develops, will require a new approach to the seeding
Figure 4.5: Black elements of this constant texture cannot be joined to adjacent white
pixels due to the large edge between them, and cannot be joined to other black elements
because they are not adjacent on the 8 neighbourhood image grid. The result is one
large white segment and a very large number of black single pixel segments (see also figure
4.9). Values are edge values between each pixel and the central pixel.
and maintenance of the edge list. Our first task is to address the 8 neighbourhood seeding
problem.
4.3 Seeding with Nearest Neighbours
While Felzenszwalb's algorithm represents a good starting point for this work, it has a
number of drawbacks that need to be addressed. It was found that although seeding
the segmentation algorithm using the image grid's 8 nearest neighbours (figure 4.1) was
very efficient, it can result in undesirable side-effects when used on high contrast textures
or artificial images.
Texture areas with large weights between neighbouring vertices are not capable of
generating edge pathways that can lead to the merging of the texture area into a single
segment. This problem can be seen in figure 4.5.
One possible fix suggested in the paper is to smooth the original image before running
the segmentation algorithm. Smoothing (figure 4.6) reduces the magnitude of neighbouring
edge values and allows the interconnection of perceptual groupings that have disparate
pixel values. While this is effective, it is at best only a partial solution and has many
undesirable side effects related to smoothing out segment boundaries over the entire image.
The fundamental difficulty with the algorithm is its dependency upon the image grid,
which limits potential edge connections to adjacent vertices only. One solution to this
Figure 4.6: Smoothing can be used to partially overcome the 8 neighbour image grid
problem (values are edge values between pixels and the central pixel)
problem is to use the n-nearest neighbours of an image vertex, not governed by the image
grid (figure 4.7). This results in the algorithm being able to capture spatially non-local
regions and prevent the stalling of a segmentation due to local image grid discontinuities
which would otherwise form part of a high-texture segment. With the introduction of
n-nearest neighbour seeding we involve another set of dimensions in our calculations: the
image position coordinates. Whilst the range of possible image position coordinates is
Figure 4.7: The use of n-nearest neighbour algorithms to seed edges is slower than 8
neighbourhood seeding and requires the inclusion of proximity information in calculations,
but eliminates the dependency upon adjacency. It also decreases the tendency to generate
edge connections across different visual group boundaries.
Figure 4.8: A circular arrangement of segments is apparent in this image. With localised
connectivity at pixel level only, this image will never be segmented into a circle because
the circular perceptual arrangement occurs at a much larger scale.
fundamentally dependent upon the image dimension, the range of pixel intensity values is
restricted to a static range. Care must be taken when combining these two different types
of measure into a single edge calculation, and different normalization schemes and methods
of combination can lead to very different segmentation behaviour. One fundamental issue
is whether proximity should be treated as a separate entity to similarity. While solutions
that combine the two in the same architecture can prove more elegant, there is some
evidence that they may affect visual perception in very different ways. The problems
associated with the increase in dimensionality of pixel descriptions will be addressed in
more detail in later sections.
In terms of Gestalt theory, the use of n-nearest neighbours also increases the algorithm's
ability to facilitate boundary continuity and completion, at least at a single pixel vertex
level. Although segment internal differences are used during the segmentation process
to incorporate larger scale segment information, such large scale segment information is
not reflected in the actual edge pathways that the algorithm follows. This results in a
limitation similar to the restriction of edge seeding to the image grid. Those parts of the
image considered for merging during the segmentation process are actually predetermined
based upon single pixel vertex information and are in no way based upon valuable seg-
ment information that can only be determined once the algorithm is in operation. This
means that strong larger scale groupings based upon segment size and appearance may
be completely missed by the algorithm that is currently restricted to following edge path-
ways based upon low-level pixel information (see figure 4.8). To convert our algorithm
into an efficient Gestalt engine we will need to incorporate the speed and efficiency of
the pre-seeded segmentation with some form of edge list updating as more information
about larger scale features and relationships between segments becomes available. Just
how we include the ability to update the edge list during segmentation is discussed on
page 94. Another point to note when implementing n-nearest neighbour seeding as op-
posed to the 8 pixel neighbourhood is that the assumption of adjacency no longer holds.
Whilst many nearest pixel neighbours will indeed be adjacent to each other, this may
well not be the case if colour values in adjacent pixels differ greatly. This ability can be
critical to successfully segmenting certain image types where no immediate adjacency
pathways between pixels are available (as in figure 4.9). The use of n-nearest neighbour
seeding inevitably necessitates the introduction of the concept of pixel location distance
into edge calculation. When evaluating both colour and positional difference for edges, we
also face normalization difficulties inherent in combining two different data types into the
same calculation. Such difficulties will be inevitable in any Gestalt system where higher
level descriptions such as shape and texture will need to be combined with position and
colour, so the use of nearest neighbour seeding at this point is not as inexpedient as it may
first appear.
Although much of the time there is little perceivable advantage over adjacency grid
seeding, there is no doubt that the use of Nearest Neighbour seeding can significantly
improve segmentation results with some image types. Figure 4.9 shows a good example of
these benefits. Where the image has been seeded using adjacent image grid locations to
form edges, no edges have been formed between the black pixels in the image. This results
in these pixels remaining either ungrouped, being filtered out as noise or being merged
with the background. In the case where nearest neighbour seeding has been used, the edge
list has been pre-seeded with pathways that do not rely on pixel adjacency, and the correct
segmentation is therefore produced.
Even when using the optimized KD-Tree nearest neighbour algorithms described later,
the time overheads for nearest neighbour seeding are considerably higher than simple
image grid seeding. Even with the optimal KD-Tree configuration for 100 by 100 images
(a terminating layer of 12, from figure 4.23 on page 114), the average seeding time for
10000 pixel images using nearest neighbour seeding was 20.5 seconds. This is around 70
times slower than image grid seeding at 296 milliseconds. Whilst this is a large difference,
the delay will only ever occur once per image segmentation.
Because nearest neighbour seeding demonstrably improves results, it is the seeding
approach adopted in all subsequent work.
4.4 The Edge Evaluation Function
Fundamental to the segmentation algorithm proposed in [Felzenwalb98] is the Edge Evaluation
Function. It is this function, D(S1, S2), that determines whether a given segment
pair should remain distinct or be merged into a single region.
$$D(S_1, S_2) = \begin{cases} \text{true} & \text{if } Dif(S_1, S_2) > MInt(S_1, S_2) \\ \text{false} & \text{otherwise} \end{cases}$$

where the minimum internal difference is defined as:

$$MInt(S_1, S_2) = \min\big(Int(S_1) + \tau(S_1),\; Int(S_2) + \tau(S_2)\big)$$

and the threshold function $\tau$ is defined as:

$$\tau(S) = k / |S|$$
where k is some constant and |S| is the size of the segment S. The role of τ(S) in
the decision function is to set the scale of segmentation, with smaller components requiring
stronger evidence for a boundary.
Larger values of k result in a convergence towards a segmentation favouring larger
segment sizes (figure 4.10) whilst still preserving smaller regions of sufficient distinctness.
This ability to change the scale of the segmentation whilst maintaining smaller and dis-
tinct regions is useful but suffers from the disadvantage that a large number of single pixel
segments that were too distinct to merge remain in the segmentation. In the original
algorithm, the inclusion of the threshold function τ is vital to the segmentation, as the
expression Dif(S1, S2) > min(Int(S1), Int(S2)) will always be true given the precondition
that the edge list is sorted by increasing value and all Int(Si) are taken from a previous
entry in the edge list E. In other words, without the inclusion of the τ term the segmentation
would simply not begin.
As most components used within the Edge Evaluation Function are taken from the
pre-calculated edge list, with segment size being easily updated as segment merges occur,
the entire segmentation runs very quickly.
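A minimal sketch of this decision function (Python; the function names are illustrative, and the Dif/Int values are assumed to be tracked by the surrounding segmentation loop):

    def tau(k, size):
        """Threshold term: smaller segments demand stronger boundary evidence."""
        return k / size

    def edge_is_boundary(dif, int1, size1, int2, size2, k):
        """D(S1, S2): True means the edge is a real boundary (no merge);
        False means the edge is weak and the two segments should merge."""
        m_int = min(int1 + tau(k, size1), int2 + tau(k, size2))
        return dif > m_int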
In practice, there are several major drawbacks to the use of this form of Edge Eval-
uation Function as it stands. The first is the arbitrary nature of the all important k
constant. In [Felzenwalb98] Felzenszwalb and Huttenlocher do not examine the use of this
constant in great detail and are happy assigning it values that produce visually appealing
segmentations. However, the degree to which k affects a segmentation is fundamentally
linked to the potential size of segments |S| in an image. The use of this constant also acts
against the invariance of the segmentation to scale change: equivalent segments at different
scales will be dealt with differently by the function. Rather than using an arbitrary
constant value for k, it could instead be derived as a function of image dimension, which
would at least make it invariant to changes in image dimension. Although the inclusion of
segment size information as part of an ongoing segmentation is a desirable quality, the use
of this constant inherently biases the segmentation towards a certain segment size. The
static nature of this bias in the segmentation increases the segmentation’s sensitivity to
image context change, image dimension or scale change. A decision function which utilises
segment size information in a more rational manner is required.
A second drawback to the algorithm is that the decision function is using the sum
of two very different variables, with very different limits, to make a decision. Whereas
edge differences can only span between 0 and 255 in most intensity formats, or 0 to 1 in a
normalized system, the potential range and influence of the second τ(S) term is completely
different. Consequently it is difficult to determine the influence of the two terms relative
to each other without some form of normalization to bring them to comparable scales.
As an example, if we were to set k to 255 then the function τ(Si) would have a
range maximum of 255/1 = 255; if we try to combine this with an edge value based upon
an intensity format of range 0 to 1 then it can easily be seen that the τ(S) function is
driving the segmentation algorithm with very little influence from the Int(S) component.
Either some way of normalising these two disparate component values against each other
needs to be found, or the decision function should be completely overhauled.
The third problem surrounding the algorithm as it stands lies in the initialization of
values. Although vital to the segmentation process, the original paper fails to specify the
initialization parameters used. Intuitively, the algorithm begins with each pixel vertex
being equivalent to its own discrete segment. Each of the segments will consist of a single
pixel (|S| = 1) and therefore have an internal difference of zero (Int(S) = 0). This results
in the following decision function at initialization:
$$MInt(S_1, S_2) = \min\big(0 + \tau(S_1),\; 0 + \tau(S_2)\big) \equiv \min(k/|S_1|,\, k/|S_2|) \equiv \min(k/1,\, k/1) \equiv k$$

$$D(S_1, S_2) = \begin{cases} \text{true} & \text{if } Dif(S_1, S_2) > k \\ \text{false} & \text{otherwise} \end{cases}$$
This essentially reduces to k having total control over the very first generation
of segments, acting as a single threshold value. It also results in a very high
probability that all single pixels will be merged with their nearest neighbour for any value
of k greater than zero. The reliance of the entire algorithm upon a value of k determined
by trial and error does not sit well with a well designed architecture.
4.4.1 Generating a Multi-Scale Segmentation
The final difficulty with this algorithm, as it stands, is that it is designed to settle on a final
segmentation largely controlled by the k variable. In the context of this thesis, the segmentation
algorithm should exhibit multi-scale behaviour. To do this in the current framework would
require multiple segmentations using different values of k, as in Algorithm 4.1. Another
approach would be to force the grouping process to continue until a single segment remains,
storing valid and useful segments (determined by a function IsValidRegion()) before they
are destroyed in the segmentation merge process, as in Algorithm 4.2.
Algorithm 4.1 Using multiple values of k to approximate a multi-scale segmentation
FinalSegmentList = ∅
k = StartValue
WorkingList = ∅
do {
    InitSegments(WorkingList)
    WorkingList = Segmentation(k, WorkingList)
    FinalSegmentList += WorkingList
    Increase k
} until (|WorkingList| == 1)
Algorithm 4.2 Altering the Edge Evaluation Function to generate a segmentation converging upon a single segment, whilst extracting relevant segments.

FinalSegmentList = ∅
WorkingList = ∅
InitSegments(WorkingList)
do {
    S = NextSegment(WorkingList)
    FinalSegmentList += IsValidRegion(S)
} until (|WorkingList| == 1 OR E == ∅)
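A compact sketch of the second strategy (Python, reusing the merge() helper sketched in section 4.2; is_boundary and is_valid_region stand in for the Edge Evaluation and Segment Evaluation Functions):

    def multiscale_segmentation(edges, is_boundary, is_valid_region):
        """Run the merge process towards a single segment, harvesting useful
        groups (as judged by is_valid_region) before merges consume them."""
        final_segments = []
        for edge in edges:                     # edges sorted by increasing weight
            g1, g2 = edge.node1.group, edge.node2.group
            if g1 is g2 or is_boundary(g1, g2, edge):
                continue                       # true boundary: groups stay distinct
            for g in (g1, g2):
                if is_valid_region(g):         # harvest before the merge destroys it
                    final_segments.append(g)
            merge(g1, g2)
        return final_segments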
The segmentation process that both the above algorithms rely on is based around the
generation of an ordered edge list from pixel level information, followed by rapid
progression through this list applying the Edge Evaluation Function D(S1, S2) to determine
segment groupings. This results in a process where the initialization of the edge list takes
longer than the actual segmentation process. As can be seen in the algorithms above,
the first approach requires edge list initialization for every value of k, whereas the latter
algorithm only requires a single edge list initialization. For this reason, coupled with the
apparent arbitrariness of the control value k, the latter approach was taken and the Edge
Evaluation Function adjusted to converge towards a single segment without the need for
a k control value. This also opens up interesting possibilities for the selection of ‘valid’
segments to preserve from the ongoing segmentation process. The issue of good segment
extraction, the properties such segments will exhibit, and the implementation of a Segment
Evaluation Function will be further discussed in section 5.1.
4.5 A New Edge Evaluation Function
It is apparent that the Edge Evaluation Function as defined in [Felzenwalb98] is not optimal
for generating the multi-scale segmentation that this thesis requires. In particular,
a method of removing the dependence upon the external variable k to control the scale
of the entire grouping process was required. The desirable properties for a new function are
to segment the image using the same general criteria as the original function, keeping the
same simplicity and speed, whilst allowing a general convergence towards a single segment.
The following function, still based around internal differences, exhibits this desired
behaviour:
$$D(S_1, S_2) = \begin{cases} \text{true} & \text{if } Dif(S_1, S_2) > CInt(S_1, S_2) \\ \text{false} & \text{otherwise} \end{cases}$$
where D(S1, S2) is the decision function that determines whether or not the edge should
remain (true), or whether the edge is weak (false) and a merge between the two regions
connected to it will be initiated. The combined internal difference is defined as:
$$CInt(S_1, S_2) = Int(S_1) + Int(S_2)$$
However, this function does not include any form of size bias that was inherent in
the previous formulation of D(S1, S2) through the use of τ(S) = k/|S|. As it stands,
the decision function is determined purely through edge comparison with internal texture
differences and does not include any influence determined by the comparative sizes of
segments, which are readily available.
It can also be seen that the current function suffers from difficulties at the beginning
of a segmentation, where each group represents a single pixel that should, intuitively, be
seeded with an internal difference of 0. This would result in the Edge Evaluation Function
accepting all edges as true and no segmentation taking place.
All edges are valid at startup because:

$$Dif(S_1, S_2) > CInt(S_1, S_2), \quad \text{where } CInt(S_1, S_2) = 0$$
Whilst this difficulty can be worked around by artificially seeding all internal differences
to 1, a better solution would be to include another element in the function to begin the
segmentation process.
A slight modification to the above function allows the optional inclusion of size information
to influence the Edge Evaluation Function and bias it towards merging regions of
similar size. This is based around the relative size of segments and will remain the same
regardless of most image transformations, representing an improvement upon the way component
size was treated in the original Region Comparison Function whilst eliminating the
need for any control variable k.
$$D(S_1, S_2) = \begin{cases} \text{true} & \text{if } Dif(S_1, S_2) > CInt(S_1, S_2) \\ \text{false} & \text{otherwise} \end{cases}$$

where the combined internal difference is defined as:

$$CInt(S_1, S_2) = \frac{Int(S_1) + Int(S_2) + M \cdot SizeMod(S_1, S_2)}{1 + 0.5M}$$

given that:

$$0 \le M \le 1 \quad \text{(degree of size influence)}$$

$$SizeMod(S_1, S_2) = 1 - \frac{\big|\,|S_1| - |S_2|\,\big|}{MaxArea - 1}$$

and $MaxArea = ImageWidth \times ImageHeight$ (the maximum possible segment size).
Given this formulation, it is apparent that the Combined Internal Difference Function
will return a potential range of 0 ≤ CInt(S1, S2) ≤ 2 while 0 ≤ Dif(S1, S2) ≤ 1, which will
generally result in a bias towards merging segments and ultimately towards the desired single
segment solution (although this is by no means inevitable in every case). It should be noted
that the majority of segments processed by the Combined Internal Difference Function will
be considerably smaller than the MaxArea value by which they are normalized. The inclusion of
this size component will have a tendency to reduce the magnitude of CInt(S1, S2) and
increase the likelihood of edges being retained. The inclusion of the size component in
the function also overcomes the internal difference seeding problems discussed earlier,
because:
$$SizeMod(S_1, S_2) = 1 - \frac{(1 - 1)}{MaxArea - 1} = 1 - 0 = 1 \quad \text{at initialization.}$$
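A sketch of the complete new decision function (Python; this assumes normalized Dif/Int values in the range 0 to 1, and the default of M = 0.5 is purely illustrative):

    def size_mod(size1, size2, max_area):
        """Relative size bias: 1.0 for equal sized segments, falling towards
        0.0 as their sizes diverge."""
        return 1.0 - abs(size1 - size2) / (max_area - 1)

    def c_int(int1, int2, size1, size2, max_area, m):
        """Combined internal difference, with degree of size influence m in [0, 1]."""
        return (int1 + int2 + m * size_mod(size1, size2, max_area)) / (1.0 + 0.5 * m)

    def edge_is_boundary_new(dif, int1, int2, size1, size2, max_area, m=0.5):
        """New D(S1, S2): True keeps the edge as a boundary, False merges."""
        return dif > c_int(int1, int2, size1, size2, max_area, m)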
By taking the original algorithm and removing the termination criteria we have cre-
ated a basic segmentation algorithm that can efficiently generate groupings from raw pixel
information as well as a simple textural description of Internal difference. Whilst this
provides us with the basic data structures and decision functions required for a multi-scale
segmentation, it suffers from the same major limitation as previous work on Linear Gestalt
Grouping. Both algorithms can only group according to pathways laid down during ini-
tialization, which are calculated from primitives before the segmentation begins. In the
Linear Gestalt Grouping algorithm, these primitives are the results of an initial segmen-
tation whilst this algorithm uses pixel information. Although internal difference and size
information affects grouping decisions as the algorithm progresses, at present it can only
form connections along edges generated at initialization. This precludes potentially important
connections from forming due to higher level texture similarities, such as those shown
in figure 4.5, or group similarities, as shown in figures 3.24b and 3.24c. To allow true multi-scale
behaviour we need to find an efficient way of generating new edges to form potential pathways
as the segmentation progresses and larger region primitives are generated through
grouping. This is no trivial task, as the generation of every possible new edge at each stage
of the segmentation would inevitably result in an explosion in processing requirements and
a huge increase in the size of the edge list. The main processing overhead in the original
algorithm is the generation of edge pathways, which in itself has been limited to either
8-nearest neighbour or 8 pixel neighbourhood edges per region, before algorithm execu-
tion. The fact that this overhead occurs a single time only, in the preprocessing stages
of the segmentation, limits its impact. What is now required is an algorithm capable of
generating new edges as new region groups are formed in the segmentation, and of smoothly
and efficiently inserting them into the edge list whilst it is being processed. Such an
algorithm will also require a storage structure capable of retrieving high dimensional data
efficiently by nearest neighbour query.
(a) Original Image (b) NN Seeded, 100 active groups remain (c) NN Seeded, 3 active groups remain
(d) Original Image (e) Grid seeded, 100 active groups remain (f) Grid seeded, 1 active group remains
Figure 4.9: Although differences in segmentation between pixel grid and nearest neighbour
seeding are usually slight, here is an example where the use of Nearest Neighbour seeding
is crucial for an accurate segmentation. While nearest neighbour seeding allows the
algorithm to correctly join the black pixels despite them not being connected in the image
(above), this is not the case with 8 neighbourhood grid seeding where the only connections
presented to the algorithm are merges with the image background. Generated groupings
where merges have occurred are artificially coloured in these images (ungrouped regions
retain their original black or white colouring).
Figure 4.10: Segmentations using image grid edge seeding and increasing values of k with
the original region comparison function. Colour values have been normalized to a range
between 0 and 1, and the normalized Euclidean distance between pixels is used to determine
edges.
4.6 Updating Edges
As discussed in the previous section, although we have a fast and efficient pixel/texture
based segmentation algorithm, it suffers from the limitation that all possible segmentation
pathways are generated prior to the segmentation process. A truly Gestalt segmentation
process will require us to generate new possible segmentation pathways as new regions
are created by the segmentation algorithm. An exhaustive implementation of this process
would generate new edges between each newly created region and all other regions, and
insert them into the edge list.
Algorithm 4.3 Exhaustive edge update procedure

For each new region generated by the segmentation process:
    Remove the two old regions used to form the new one
        No_Of_Regions = No_Of_Regions - 2
    Remove the current edge that formed the new region
        No_Of_Edges = No_Of_Edges - 1
    Insert new edges between the new region and all old regions
        No_Of_Edges = No_Of_Edges + No_Of_Regions
As can be seen in algorithm 4.3, exhaustively adding new edges to the list would result
in an order n² growth in the number of edges in the edge list as the segmentation
proceeds. Whilst this would present difficulties in terms of the sheer size of the edge
list required, speed would also be severely affected by the requirement to insert each new
edge entry into the current edge list. As many of these new edge pathways are unlikely to
actually be used by the segmentation process at all, it makes sense to limit the number
of new edges placed in the edge list to the better edges only. These useful new edges can
be defined as the edge pathways between the n nearest neighbouring regions in feature
space. An efficient nearest neighbour extraction algorithm, as shown in algorithm 4.4,
would reduce the number of new edges to those that are most likely to be followed by the
segmentation algorithm, allowing the fast insertion of edges and keeping the size of the
edge list down to manageable dimensions.
Given a relatively small number of useful new edges, the edge list can be updated very
quickly as each new region is generated. The bottleneck in algorithm 4.4 becomes the
extraction of nearest neighbours in the function GetNeighbours().
Algorithm 4.4 Nearest neighbour edge update procedure

For each new region generated by the segmentation process:
    Remove the two old regions used to form the new one
        No_Of_Regions = No_Of_Regions - 2
    Remove the current edge that formed the new region
        No_Of_Edges = No_Of_Edges - 1
    Extract the n nearest neighbours to the new region
        GetNeighbours(n)
    Insert new edges between the new region and the n nearest old regions
        No_Of_Edges = No_Of_Edges + n
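A sketch of this update step (Python; the index object stands in for any updatable nearest neighbour structure, such as the modifiable KD-Tree described below, and the parent1/parent2 fields and all names are illustrative):

    import heapq, itertools

    _tiebreak = itertools.count()   # keeps heap ordering defined on equal weights

    def update_edges(edge_list, new_region, index, n=8):
        """After a merge creates new_region, seed only the n most promising
        new pathways: edges to its n nearest neighbours in feature space."""
        index.remove(new_region.parent1)     # the two consumed parent regions
        index.remove(new_region.parent2)     # leave the search space
        for neighbour, distance in index.nearest_neighbours(new_region, n):
            heapq.heappush(edge_list, (distance, next(_tiebreak), new_region, neighbour))
        index.insert(new_region)             # the new region becomes searchable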
4.6.1 Appropriate Data Structures for K-Dimensional Nearest Neighbour Query
Whilst many segmentation approaches concern themselves with pixel level information, or
simple texture measures, this Gestalt engine will have to manipulate and make nearest-neighbour
calculations in spaces of higher dimensionality. At the barest minimum, a full colour
Gestalt segmentation engine will require three pixel colour components, two segment position
values and some measure relating to higher level group descriptions. The minimum
requirements for region description anticipated in this work are listed in table 4.2.
Each region group can be considered to represent a point in an n-dimensional volume.
We require an algorithm that can search such a multi-dimensional space for nearest neigh-
bours with great efficiency, as this operation will be required whenever two parent groups
merge to form a new child. The space must also be updatable, so that new groups can be
added to it and included in the search without the need to re-submit all groups.
The nearest neighbour search problem (sometimes referred to as the closest-point prob-
lem) is common to a large number of problem domains, and has a large set of possible
algorithmic solutions with differing degrees of reliability and efficiency. Such algorithms
are necessary because the processing required to perform a direct comparison (often using
Mean Squared Error or Manhattan distance) grows rapidly with the number of elements
to be compared for proximity. The majority of approaches reduce computation time by
approximating the data in the search so that fewer comparisons are required to find matches
close to the true nearest neighbour.
POSITION
    1. X Position
    2. Y Position
COLOUR
    3. Mean Red Component Intensity
    4. Mean Green Component Intensity
    5. Mean Blue Component Intensity
APPEARANCE
    6. Region Size
    7. Minimum Internal Difference
Table 4.2: Minimum requirements for region description.
This process is often iterated to find even better approximate matches from the candidate set.
Once the algorithm reaches a scale at which the processing required for approximation
outweighs the cost of direct candidate comparison, direct comparison can then be used
to determine the exact nearest neighbour (if this level of accuracy is required).
There are three main approaches to optimizing the search for a nearest neighbour
from a candidate set: arranging the data items in such a way that nearest neighbours can
be indexed efficiently from their properties; projecting the data down to a simplified
coordinate system; or using a data structure that enables a more efficient
search for the item. These methods are not exclusive, and many variations that combine
these approaches have been developed.
Sorting data items into an ordered list enables us to use either direct indexing or list
subdivision to quickly identify nearest neighbours, due to the properties inherent in an
ordered list. In a similar way, the projection of data items into lower dimensionality
descriptions can allow us to quickly eliminate from exhaustive search large numbers of
points whose projections could not belong to a nearest neighbour.
If a direct indexing based upon the attributes of the data items to be searched can be
established (usually incurring a degree of approximation), then potential nearest neigh-
bours can be quickly extracted from the set without the need to directly compare each
data item. Trained neural networks and, in particular, Correlation Matrix Memories can
be used to successfully and efficiently establish such relationships so that the attributes
of a target point can be used to directly generate a set of potential nearest neighbours
without exhaustive search. [Hodge02] combines this approach with a dedicated high-speed
binary CMM architecture to enable nearest neighbour query much faster than the standard
computational approach.
Whilst data structures used to optimize nearest neighbour search are usually based
around kd-trees, Voronoi diagrams can also be used to partition a space into a simpler and
more efficient structure, decomposing the space into known regions such that each region
contains exactly the locations that are closer to its data point than to any other data
point. This is particularly useful for low dimensional data, but the size requirement of
storing the Voronoi structure rapidly becomes prohibitive as dimensionality increases. The
approach also does not lend itself well to updating, as large sections of the Voronoi structure
will require recalculation as new query points are added.
Although many variations of kd-tree have been devised, their purpose is always to
hierarchically decompose the search space into cells that contain subsets of the data points.
We can efficiently find which cells are likely to contain the nearest neighbour and reject
those cells that cannot; data points contained in the rejected cells can then be eliminated from
the search without the need to exhaustively test each point for proximity. Typically,
algorithms construct kd-trees by partitioning the data points into two sets (or cells) across
a splitting plane. The planes to partition across are selected either by cycling through
the dimensions of the space, by cutting along the largest dimension, or through the use of
Quadtree or Octree structures which partition across all dimensions at once (resulting
in four child cells for two dimensional data and eight child cells for three dimensional
data). Whilst non axis-parallel cutting planes have been used, they result in cell boundaries
that are much harder to retain and compare against. Kd-tree structures are effective in
moderate-dimensional spaces, and methods of compartmentalizing and negotiating such
trees can be optimized to the type of data being searched. Where such data characteristics
are inherent, such optimizations can remain effective when the tree structures have new
query points added. They become less effective where they are derived from
the initial data set and new data points are added that change the nature of the set as a
whole (requiring either recalculation of the structure or suffering an increasingly less
efficient search). Kd-tree data structures become less effective as the dimensionality of
the dataset increases past around 20, because the sphere which represents the current best search
radius progressively fills less volume relative to the cube structures used in the tree,
resulting in each approximation containing more false nearest neighbour candidates.
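For illustration, a textbook kd-tree with a cycling discriminator and best-distance pruning (Python; this is the generic median-split construction, not the modifiable predetermined-boundary variant adopted later in this chapter):

    import math

    class KDNode:
        def __init__(self, point, left=None, right=None):
            self.point, self.left, self.right = point, left, right

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def build(points, depth=0):
        """Build the tree, cycling the split axis (discriminator = depth mod k)."""
        if not points:
            return None
        axis = depth % len(points[0])
        points.sort(key=lambda p: p[axis])        # median split on this axis
        mid = len(points) // 2
        return KDNode(points[mid],
                      build(points[:mid], depth + 1),
                      build(points[mid + 1:], depth + 1))

    def nearest(node, target, depth=0, best=None):
        """Descend towards the target, then backtrack, pruning any branch whose
        splitting plane lies further away than the current best distance."""
        if node is None:
            return best
        if best is None or dist(target, node.point) < dist(target, best):
            best = node.point
        axis = depth % len(target)
        diff = target[axis] - node.point[axis]
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        best = nearest(near, target, depth + 1, best)
        if abs(diff) < dist(target, best):        # the far cell may hide a closer point
            best = nearest(far, target, depth + 1, best)
        return best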
Implementations utilizing the three main approaches were compared for use in this
work:
1. Modifiable KD Trees - a KD based approach using largest plane subdivision
2. Mean Constrained Projection - a projection based technique
3. Half Rib Orthogonal Lists - an ordering based technique indexing by axis
KD Trees
Tree structures are most commonly used to enable fast nearest neighbour access to multi-
dimensional data or multidimensional orthogonal range search. By subdividing the range
of data points into groups and iterating this process we can eliminate the need for direct
distance comparisons between a large number of points. This forms a tree like structure
where each tree node encompasses a decreasing number of data points as we move down
it. Each tree node must define the extent of its child nodes. Common approaches are
SS-trees (using bounding hyper-spheres), R-trees (using bounding hyper-boxes) and the
simpler KD Tree (using bounding hyper-planes). [Gonnet02] provides a good online guide
to different nearest neighbour search techniques utilizing tree structures.
KD Trees are a generalization of one dimensional Binary Search Trees to operation in
k dimensions. Branches of a KD Tree are generated by splitting at points along successive
dimensions so that level 0 of a tree will index dimension 0 of the query space, level 1
will index dimension 1, etc. The dimension that is indexed at a given level of the tree
is determined by the discriminator. The discriminator is usually calculated using i mod
k, where i is the current level and k is the dimensionality of the search space. Other
calculations for the discriminator exist, for example to subdivide a volume of search space
across its longest axis. The actual point at which the current dimension is subdivided is
determined by a decision function. The optimal subdivision will form a hyperplane that
separates points with the largest variance. Methods of selecting this division vary from
the use of the eigenvector with the largest eigenvalue of the covariance matrix, to simpler
approaches such as selecting the mean or median point in the current branch.
This branching approach is iterated until either a given level of coordinate precision is
reached, or a given number of tree layers have been developed.
Due to the KD Tree structure's hierarchical subdivision of multi-dimensional space, it
is an ideal candidate for multi-dimensional nearest neighbour search.
Time taken assembling a KD Tree structure is offset by the advantages of much faster
potential recall from the structure, depending upon the nature and accuracy of the recall
required. Whilst tree structures are generally accepted as the most efficient method of
searching for nearest neighbours, we require our algorithm to be fully updateable. For this
reason we have opted for a less efficient, but fully updateable KD Tree that subdivides
hyper-planes by predetermined values (figure 4.13) rather than by data content (which
would require the entire structure to be updated as new points were added or removed).
The tree structure must be searched carefully so that only sections of the KD space that
could possibly contain a nearest neighbour are evaluated directly. If the distance between the
nearest boundary of a hyperplane and the target point is greater than the current furthest
entry in the nearest neighbour table, then that branch, and all subsequent sub-branches,
cannot contain a feature point that belongs on the nearest neighbour table. In practice,
it is much more efficient to calculate the bounding sphere of each volume of feature space
referenced by a KD branch, as shown in figure 4.12. This approach has similarities to that
used by Sproull [Sproull91], but instead eliminates tree branches (representing
cubic volumes of the n dimensional search space) before the final points are compared.
Whilst this may allow the expansion of inappropriate branches, the increased simplicity
of the calculation (appendix 12.7) offsets any disadvantage incurred by expanding the
branch. The number of layers a KD Tree can expand to is fundamental to the efficiency
of its operation. Too many layers will result in an unnecessarily fine subdivision of feature
space which will increase the number of calculations made. Too few layers will result in
a coarser subdivision and larger number of point comparisons. As can be seen in figures
4.19 and 4.21, the optimal number of tree layers varies according to the number of points
in the search space, their distribution and the dimensionality of the search space.
The nearest neighbour algorithm will be required to store each current region as a point
in feature space and allow regions to be added or removed efficiently. Whilst this is easily
implemented using exhaustive, MCS or half-rib approaches, we must alter traditional KD-
Tree approaches in order to allow this real-time flexibility. It is important that the the
KD tree structure is modifiable in order to avoid the need to generate a new feature space
for every new nearest neighbour query. This has an important impact upon which type
of normalization we apply to our feature descriptions when using KD Tree search. Any
form of self normalization would result in the need to adjust maximum and minimum
boundaries in the feature space, and the positions of all entries present in the KD tree
would need altering with each shift in parameters. Whilst this is possible, the overheads
involved with changing the spatial positions of every region present in the KD tree every
time a new region is generated are far too high to be practical. The only practical solution
to these requirements is to normalize feature coordinates by maximum possible values and
assign these values as the boundaries of the KD tree feature space. In such a system
old region feature space positions referenced by the KD tree will retain their integrity
whilst allowing the KD tree to be updated with new region entries as the grouping process
continues (algorithm 12.8).
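The fixed-bound normalization this implies is straightforward; a sketch using the seven region dimensions of table 4.2 (Python; the field names are hypothetical):

    def normalize_region(region, image_width, image_height):
        """Normalize feature coordinates by their maximum possible values, so
        the KD tree boundaries never move as new regions are added."""
        max_area = image_width * image_height
        return (
            region.x / image_width,             # position in [0, 1]
            region.y / image_height,
            region.mean_r / 255.0,              # colour components in [0, 1]
            region.mean_g / 255.0,
            region.mean_b / 255.0,
            region.size / max_area,             # size in [0, 1]
            region.min_internal_difference,     # already a normalized edge value
        )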
Mean Constrained Search
Another method of quickly retrieving nearest neighbours from an n-dimensional space is
to first assemble all candidate match entries in an ordered list and use this ordering to
minimize the number of entries that will require expanding for full comparison. Searching
a simplified ordered list through indexing or iterative list subdivision (which we used in
this work) is considerably more efficient than exhaustive search through all entries. One
of the most common methods of reducing a set of n-dimensional vectors into a single list is
through projecting them to a single dimension. The simplest of these, Mean Constrained
Projection, maps each vector (item to be searched) to the mean value of its component
dimensions. The Mean Squared Error relationship between this mapping and the original
n-dimensional coordinates can be exploited to constrain the search to a smaller range of
candidate matches from a one dimensional list.
The relationship between two vectors x and y can be defined as:

$$\sum_{d=1}^{n}(x_d - y_d) = n(M_x - M_y)$$

(where M is the mean of all coordinates in that vector). Define $A_d = (x_d - y_d)$ and $B = (M_x - M_y)$, so that $\sum_{d=1}^{n} A_d = nB$. Then:

$$\sum_{d=1}^{n}(A_d - B) = 0$$

$$\sum_{d=1}^{n}(A_d - B)^2 \ge 0$$

$$\sum_{d=1}^{n}\big(A_d^2 - 2BA_d + B^2\big) \ge 0$$

$$\sum_{d=1}^{n}A_d^2 - 2B\sum_{d=1}^{n}A_d + nB^2 \ge 0$$

Substituting $nB$ for $\sum_{d=1}^{n} A_d$:

$$\sum_{d=1}^{n}A_d^2 - 2B(nB) + nB^2 \ge 0$$

which reduces to:

$$\sum_{d=1}^{n}A_d^2 - nB^2 \ge 0$$

$$\sum_{d=1}^{n}A_d^2 \ge nB^2$$

$$\frac{1}{n}\sum_{d=1}^{n}A_d^2 \ge B^2$$

$$\mathrm{MSE} = \frac{1}{n}\sum_{d=1}^{n}(x_d - y_d)^2 \ge (M_x - M_y)^2$$
This shows that the Mean Squared Error (MSE) is guaranteed to be greater than or equal
to the mean difference squared, and that if a new candidate y has a mean difference
squared greater than the Mean Squared Error of the previous best match
(MSEbestmatch), it cannot be closer to the target vector:
$$(M_x - M_y)^2 \ge \mathrm{MSE}_{bestmatch} \;\Rightarrow\; y \text{ cannot be closer than } y_{bestmatch}$$
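A sketch of how this bound prunes candidates (Python; a plain scan stands in for the mean-ordered list subdivision actually used):

    def mcs_nearest(target, candidates):
        """Mean Constrained Search: a candidate whose squared mean difference
        already exceeds the best MSE so far cannot be closer, so it is skipped
        without a full n-dimensional comparison."""
        n = len(target)
        m_target = sum(target) / n
        best, best_mse = None, float("inf")
        for cand, m_cand in candidates:        # (vector, precomputed mean) pairs
            if (m_cand - m_target) ** 2 >= best_mse:
                continue                       # pruned by the mean bound
            mse = sum((a - b) ** 2 for a, b in zip(target, cand)) / n
            if mse < best_mse:
                best, best_mse = cand, mse
        return best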
Ra and Kim [Ra93] report a computational complexity of between 5% and 12% of a full
search when using their mean-distance-ordered partial codebook search. Cheng and Lo
[Cheng96], evaluating the performance of their mean constrained selective technique on
differing numbers of 16 dimensional vectors, reported a 75% improvement in
search times.
Half Rib Orthogonal Lists
A half rib orthogonal list is formed by arranging data items so that each axis
of the n-dimensional space is represented by value ordered linked lists. A linked list is
generated for (and sorted by) each axis value for a point, with new linked lists branching
outwards for the next axis value until all axes have been encoded; the final element at
the terminus of the final axis is the data item itself (figure 4.14). Each node in the linked
lists represents a junction, an axis coordinate, and can contain a reference to a data item.
In this way, each data item to be stored in the n-dimensional space can be encoded into
the branching list structure.
To optimize a nearest neighbour search in this structure, each closest matching axis is
followed until a data point is encountered. All subsequent searches can then be limited
to the next closest axes (moving back down the branches of the structure), searching (and
expanding) only tree nodes that could possibly be closer than the furthest entry in the
nearest neighbour table (a spherical range formed around the target coordinate with a
radius equal to the distance from the furthest entry to the target).
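A rough sketch of the nested structure itself (Python; insertion only, with the pruned spherical-range search described above omitted for brevity):

    import bisect

    def rib_insert(rib, point, data, axis=0):
        """Each level is a sorted list of (coordinate, child) junctions for one
        axis; the terminus of the final axis holds the data item itself."""
        key = point[axis]
        i = bisect.bisect_left([k for k, _ in rib], key)
        if axis == len(point) - 1:
            rib.insert(i, (key, data))          # final axis: store the item
            return
        if i < len(rib) and rib[i][0] == key:
            child = rib[i][1]                   # existing junction for this value
        else:
            child = []
            rib.insert(i, (key, child))         # new junction branches outwards
        rib_insert(child, point, data, axis + 1)

    rib = []
    rib_insert(rib, (0.2, 0.7, 0.1), "region A")
    rib_insert(rib, (0.2, 0.3, 0.9), "region B")   # shares the first-axis junction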
Evaluation and implications
Given the expectation of image sizes ranging from 100 to 200 pixels square, we can calculate
the expected worst-case number of groups being queried by a nearest neighbour search as
ranging from 100² to 200² (10000 to 40000). For compatibility with the core segmentation
seeding algorithm, which adds 8 edges to the list for every group, we will be searching for
8 nearest neighbours for each new group submitted. The following experiments are based
around searching through identical random distributions of data points. Unlike MCS and
Half Rib search, the Updateable KD Tree algorithm's efficiency is considerably affected
by the minimum precision range and number of layers within the tree. Our first experiment
(figure 4.19) was to determine the optimal layer at which to terminate tree branching, given the search
conditions. The best overall terminating layer parameter for the KD Tree search is 15, so
we shall use this value in the next experiments. Figure 4.20 shows the direct comparison
between the three nearest neighbour approaches and an exhaustive search.
A notable problem with these results is that the Mean Constrained Search seems to
be performing much worse than expected. As can be seen in figure 4.21, the MCS perfor-
mance degrades as the dimensionality of the search space increases, eventually becoming
detrimental to the search process. Although a decrease in performance would be expected
as the dimensionality of the search space increases and the projected coordinates become
less effective at constraining the search space, these results show much worse performance
than anticipated. Of further concern is the poor performance when compared with the results
of [Cheng96, Ra93], which show significant increases in MCS performance when compared to
exhaustive search, even within 16 dimensional spaces. Although the MCS algorithm has
been thoroughly checked, and the experiments repeated, the cause of this discrepancy has
not been found. It was eventually decided to move ahead with the work, using the results
as generated.
A three-dimensional graph showing the full range of parameters, techniques and settings
can be seen in figure 4.18. Figure 4.22 shows a surface chart of Modifiable 8NN KD-Tree
results for the lower range of point spaces from 2500 to 20000. Figure 4.23 shows that
the best performance and optimal terminating layer configuration of the modifiable KD-Trees
follow a logarithmic curve. Figures 4.15, 4.16 and 4.17 compare different KD-tree
results with exhaustive search for 8 nearest neighbour searches over smaller and decreasing
numbers of points (as in the currently defined edge update algorithm).
From these results it is apparent that the performance of all approaches and individual
terminating layer settings follows a linear rule as the number of data points to be searched
increases. Where smaller numbers of points are to be searched, the advantages of using KD
trees to optimize performance decrease. It is also apparent that (for searches in spaces of
19999 to 79999 points) the modifiable KD tree structure, with a terminating layer of 15,
is the most appropriate nearest neighbour approach to be implemented.
4.6.2 Efficiently searching the KD Tree for nearest neighbours
The requirement for a nearest neighbour search, and subsequent generation and placement
of new edges in the edge list at each group merging, results in a considerable slowdown of
the algorithm. Given that we now have a segmentation algorithm that operates
efficiently on low level pixel data, and a slower edge update algorithm that is most useful for
grouping more developed segment groups, we can run the two processes together.
In this strategy, the segmentation algorithm developed from [Felzenwalb98] can be used
to efficiently generate the initial edge list and perform the primitive pixel level groupings.
Once the number of active groups in the segmentation process is reduced to a manageable
level (a static threshold on the number of groups remaining), we can begin using the
slower KD-Tree based edge update algorithm to provide the higher level Gestalt linkages
between developed segment groups, as in figure 4.24. At this point the modifiable KD-Tree
is constructed and filled with all existing group information. The 8 nearest neighbours of all current groups
are then found and their edges inserted into the appropriate place in the edge list (from the
current edge onwards). After this initialization the generation of new nearest neighbours
is much more efficient, as the KD-Tree space is updated (not rebuilt) at each update (see
figures 4.22 and 4.23 for performance over this lower range of point spaces).
A working threshold of 2500 groups was selected to begin the update function: the point
at which a 100x100 (10000 pixel) image has been simplified to one quarter of its original
group count.
A modifiable KD space with a terminating layer of 9 was found to be most appropriate
(figures 4.15, 4.16 and 4.17).
$$\mathrm{Avg.\ No.\ Points} = \frac{\sum_{k=1}^{2500} k}{2500} = 1250.5$$
Later changes in the way group information is stored and handled, detailed on page
119, change the optimal value for KD-tree terminating layers. In that architecture, when
the number of active groups in the segmentation reduces to a useful level, the modifiable
KD-Tree is constructed and filled with all existing group information (including previously
generated inactive parent groups). The 8 nearest non-related neighbours of all currently
active points are then found and their edges inserted into the appropriate place in the edge
list (from the current edge onwards). The number of groups in the search space actually
doubles rather than decreasing during the segmentation. This results in search spaces of
points between 19999 and 79999 points for expected image sizes and would make 15 the
optimal KD-tree terminating layer value (figure 4.19).
Figure 4.11: A KD tree subdivides K dimensional space by applying Binary Search Tree
branching along successive dimensions.
Figure 4.12: Using an encompassing sphere to quickly determine if a KD tree branch
references a volume of feature space that could possibly contain a nearest neighbour.
Figure 4.13: Modifiable KD Tree Build Algorithm
Figure 4.14: Showing a three dimensional half rib orthogonal list structure; each ordered
two-way list represents an axis branching from the previous axis.
(a) 8NN Exhaustive Search (b) 8NN KD-Tree Search
Figure 4.15: Searches conducted among decreasing numbers of points, as in the KD Update
segmentation algorithm.
Figure 4.16: 8NN Modifiable KD-Tree results with Exhaustive 8NN results subtracted.
Searches conducted among decreasing numbers of points, as in the KD Update segmenta-
tion algorithm.
Figure 4.17: 8NN Modifiable KD-Tree results expressed as percentage difference to Ex-
haustive 8NN results. Searches conducted among decreasing numbers of points, as in the
KD Update segmentation algorithm.
Figure 4.18: 3D chart showing the performance of 8 Nearest Neighbour algorithms (as
implemented in this work).
Figure 4.19: Modifiable KD Tree average performance in 8 nearest neighbour searches over
different terminating layer values.
Figure 4.20: Performance of 8-Nearest Neighbour algorithms (as implemented in this
work).
Figure 4.21: Showing the effect of dimensionality change on different Nearest Neighbour
algorithms (as implemented in this work).
Figure 4.22: 3D Chart Showing the performance of 8NN Search using Modifiable KD-Trees
with search spaces of 2500 to 20000 points. The optimal search parameters are highlighted
by the green line.
(a) Optimal Terminating Layers
(b) Optimal Search Times
Figure 4.23: Showing optimal values for 8NN Modifiable KD-Tree search of 2500 to 20000
points.
Figure 4.24: The slower KD Nearest Neighbour update does not begin until the image
content has been grouped and simplified to a certain level.
4.7 The Original Current State Group Description
The segmentation algorithm defined in [Felzenwalb98] treated the problem of segmentation
in terms of set theory, where each group structure contained a set of its member pixels.
As the Gestalt grouping algorithm was based upon this original design, it was natural to
keep this kind of group description (as shown in 13.1) during development. Given this
approach, when a new group is created through the merging of two groups (see 13.2) the
member pixels of one of the groups are added to the other. Once the group descriptions
have been properly updated and the member pixel nodes are adjusted to point to the new
group, the redundant empty group is either stored for later use by the segment filter, or
deleted. Because the member pixel nodes point to the new group, any edge in the edge
list that referenced the old groups now points to the new child group (the same structures
used in section 4.2). Although this allows old edges (originally generated between parent groups)
to still form future connections (as in figures 4.30 and 4.31), all information about the context
within which these edges were generated is lost. Group/segment information can only be
retained for later use if we copy it to another list during the segmentation (see figure 4.25), which
makes the evaluation of useful segments at this point even more critical.
Figure 4.26 shows how the group structures (13.1) and edge pointers are changed after
a merge occurs (algorithm detailed in 13.2). The reassignment of edge pointers from par-
ent groups to child groups results in multiple edges (of different value and from differing
stages of the segmentation) joining the same groups. Whilst this ‘current state’ approach is
economical in terms of memory usage (group numbers decrease whilst pixel information
remains constant) and the speed with which pixel information can be accessed, such minimal
advantages are offset by the loss of information. As work progressed it became increasingly
clear that more detailed information about the segmentation history and the context of edges
would be useful, allowing much greater scope and flexibility when generating signature
descriptions.
Figure 4.25: An overview of the current state grouping algorithm. Because parent groups
are destroyed during the creation of a child group, groups that score highly as segment
primitives need to be copied into a separate Segment League Table before they are de-
stroyed.
(a) Group structures before a merge on edge E1
(b) After the merge, the parent groups are effectively deleted, and any current edges linking to them (such as E2 in this example) are pointed to the new child group.
Figure 4.26: Showing the original grouping process, where group information is lost and
no descendent (D) information is retained. While old links are redirected to new child
groups (resulting in E2 and E3 linking the same groups, but with different values), any
information about the circumstances of the edge generation or the parent groups is lost.
4.8 The New Binary Tree Group Description
Rather than destroying group and edge information once it has been used in a successful
merge, the new binary tree group structures retain all information about that stage of the
segmentation process with relatively few overheads. If we wish to retain the full context of
grouping decisions during the segmentation then we must redesign the algorithm so that
no groups or edges are deleted. In such an algorithm, with an increasing number of groups,
it becomes impractical to store unique lists of image pixels for each group. To avoid this,
we add references between parent and child groups to generate a tree like structure that
can be traced from any child group to the groups representing individual pixel primitives.
Whilst this makes accessing pixel information an iterative and slightly slower process, we
have the advantage of being able to access grouping and decision information from the
tree structure after the segmentation has finished. Our new pixel primitive now becomes
a group structure itself, so we no longer have a need for node structures, which were
used to identify and redirect pixel group membership. As each group merging effectively
results in one less active group (2 parents made inactive, 1 child added) the maximum
number of group structures required can be calculated as:
$$G_p = 2G_i - 1$$

where $G_p$ is the number of groups generated by the entire segmentation and:

$$G_i = ImageWidth \times ImageHeight$$

is the number of groups generated at initialization (one per image pixel).
With anticipated sizes of our query images being anywhere between 10000 and 90000
pixels, this doubling of group structures has a negligible impact upon the algorithms. Of
more importance is the slower access to group pixel data, so every effort is made to reduce
the need for this.
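A sketch of the retained-history structure (Python; the pointer names follow figures 4.27 and 4.28, but the path-compression detail in active() is an illustrative choice, not necessarily the thesis implementation):

    class TreeGroup:
        """A group that is never deleted: merges create children while the
        full grouping history remains traceable in both directions."""
        def __init__(self, parents=None, founding_edge=None):
            self.parents = parents or []        # empty for single-pixel primitives
            self.descendent = None              # D: the child this group merged into
            self.youngest_descendent = self     # YD: fast route to the active group
            self.founding_edge = founding_edge  # FE: context of the creating edge

    def merge_groups(g1, g2, edge):
        """Create a child group; the parents become inactive but are retained."""
        child = TreeGroup(parents=[g1, g2], founding_edge=edge)
        for parent in (g1, g2):
            parent.descendent = child
            parent.youngest_descendent = child
        return child

    def active(group):
        """Follow YD pointers (compressing the path) to the currently active group."""
        while group.youngest_descendent is not group:
            group.youngest_descendent = group.youngest_descendent.youngest_descendent
            group = group.youngest_descendent
        return group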
Figure 4.27 shows an overview of the way the segmentation tree is built up during
the grouping algorithm. Unlike in figure 4.26, no information is lost during the creation of a
child group, although more care must be taken to keep track of which groups are active
within the process. Founding Edge, Descendent and Youngest Descendent pointers are
essential for keeping track of the data structure. It should also be noted that while the
segmentation tree can be traced in both directions, the path upwards from parent to child
is not necessarily the inverse of the path down through founding edges. Because we retain
full edge context, the two groups that originally generated an edge (and can be accessed
through the Founding Edge pointer) are not necessarily the same groups that go on to parent
the child generated by the edge (see figure 4.28). Different connections such as those in
figure 3.25 are not only possible, but identifiable. Whilst this may look complicated, it
allows the segmentation to group current objects based upon the relationships between
their subcomponents at any time during the process. Grouping decisions are now based
around optimal edges between groups at any point in the segmentation history, and the full
context of such decisions is retained. Information from these segmentation trees can now
be used to determine the recency of edge generation, segmentation ranking and provide
much richer group descriptions based around not only segment appearance but also a
segment's development and the tree structure itself. Algorithms 13.3, 13.4 and 13.5 detail
the new grouping algorithms that build up this tree.
(a) Group structures before a merge on edge E1.
(b) After the 1st merge, old groups and edges are retained.
Figure 4.27: Groups and edges (Ei) are retained while Descendent (D), Youngest Descendent
(YD) and Founding Edge (FE) pointers are updated to keep track of active groups
and the grouping history.
(a) Resulting data structures if the grouping from figure 4.27 continues over E2.
(b) Resulting data structures if the grouping from figure 4.27 continues over E3.
Figure 4.28: Merging the same groups across different edges. No part of the segmentation
history is lost in this grouping process. Youngest Descendent (YD) pointers provide fast
access to currently active groups, Descendent (D) pointers provide a grouping history and
Founding Edge (FE) pointers store the context of edges that build up the group structures.
Figure 4.29: An overview of the tree based grouping algorithm. Whilst groups are still
evaluated for their effectiveness as segment primitives, no separate league table is required
as no groups are deleted during the algorithm.
4.9 Discussion
The use of feature space to facilitate a level of Gestalt grouping with spatially disconnected
regions represents an elegant way of generating appropriate structures for human like
recognition. The binary storage tree serves the dual purpose of generating near-parallel
(multi-level) grouping decisions and a rich grouping history for each segment from which
further relationship descriptions could be generated. The retention of edge links from
previous generations of the algorithm also helps facilitate the development of Gestalt
linear and curvilinear groupings, preventing unwanted clumping behaviour between partly
developed primitives (figure 4.30).
Unfortunately, the reverse of this situation occurs where Gestalt groups that are not
fully developed can ‘leak’ into other groupings via older linkages from close matching
primitives (figure 4.31).
The main drawback to the grouping algorithm in this work is a rapidly rising processing
time requirement (figure 4.32) that becomes prohibitive as source image size increases. While some
of this processing requirement is due to the non-linear increase in pixels per image as image
dimension increases, reflected in processing times, a large part is taken by the maintenance
of the binary tree structure and the need to move through parent-child branches during
grouping. As the number of branches in this tree increases, so does the amount of time
taken to traverse the structure. This limits the practical range of image dimensions to
under 200 pixels if our 5 minute recognition target is to be met. In this work, this is not a major
problem as source images are currently restricted (and resized) to 100 by 100 pixels.
This represents the minimum image size range where reasonable geometric content can
be extracted in an acceptable time. As the recognition part of this work is based around
the use of simplified labels stored alongside library images, the grouping work required for
the library can be performed prior to any recognition queries. In a practical user query
situation, only the query image will need to be exhaustively segmented/grouped. Once
we have these raw segment groups, a method of evaluating them and filtering out any
partially developed or non-salient segments is required before they are used to generate
descriptions for the recognition engine.
(a) No retained links (b) Retaining links
Figure 4.30: Retaining groups from previous generations helps prevent clumping behaviour
at higher levels of grouping (left); linear structures are allowed to develop if older links are
retained (right).
Figure 4.31: Although retained links allow the grouping of linear features, they can also
encourage the premature flooding of disparate groups which would ideally be developed
as separate groups before joining.
Figure 4.32: Shows the increase in time taken by the grouping/segmentation engine as
image dimension increases.
Chapter 5
Segment Ranking
5.1 Introduction and Motivation
Once the segmentation algorithm has completed we are left with a segmentation tree
structure that can be used to generate our image description. While all groupings are
retained, there is a great deal of noise and redundancy in the segment tree; figure 5.11
shows some examples of this. Many groups will inevitably represent partially formed
segments before they develop into useful image primitives and will be highly sensitive to
noise and segmentation differences. Some segments will represent noise, or simply be too
small to be useful. Other segments may well be large, but too unreliable to use as the
basis for an image description (5.11c). To reduce the negative effect of such noisy groups
the results of the segmentation are evaluated and ranked by their fitness as useful image
primitives.
What constitutes a fit segment group?
1. Consistency
Even where segments may not explicitly represent visual objects, segment groupings
that consistently capture the similarities between images are considered fitter.
2. Object Definition
Segments should consistently describe/outline those objects common between the images.
3. Symbolic Content
Segment groupings that allow the extraction of symbolic and Gestalt group descriptions
(not just pixel-level clumps) are essential for symbolic similarity recognition and
recognition between different image types.
4. Richness of Description
Segments should provide descriptions that are rich enough to determine
similarity/dissimilarity between images. Groupings that are too common among a wide
variety of images, or are simply too small to provide distinctive descriptions, are of low fitness.
5.1.1 Edge Based Ranking
Good segments will provide image content descriptions that are common between images
that have common photographic or symbolic content. Without prior information or high
level reasoning, segments that represent literal photographic content are reliant upon the
tendency for objects in images to exhibit high colour or texture edges. In a similar way,
fully grown gestalt groupings that represent more symbolic image content can be detected
by the increase in difficulty in grouping with other groups. In this architecture, both types
of edges are treated as the same and can be identified by evaluating the edge between two
parent groups¹. Groupings featuring segments that are dissimilar indicate that the two
parents are fully developed segment groups that were forced together by the segmentation
process. Whilst this reasoning applies to most group properties, such as colour and texture,
it does not hold for size and area information. A characteristic of an underdeveloped
segment group is its combination with much smaller groups as the region grows to its
full potential (like a flood-fill). Figure 5.1 demonstrates that, unlike the other description
properties, a large difference in area between groups connected by an edge is an indicator
that neither is fully developed.
Scoring based around founding edges presents some difficulty, as a single group may
have been part of many founding edges, while others may never have taken direct part in
a grouping decision. This makes it impossible to evaluate the score in this way for a large
number of groups. There is also a larger likelihood that smaller groups will have been used
as parts of founding edges, as joining larger groups through edges created by subgroups
is allowed. By definition, most of the groups in a finite space will be small groups, and will
generate edges that will become founding edges.
Such an approach would use the following:
RankingScore = (1/n) ∑_{k=1}^{n} E_k
¹Which are not necessarily the same as the groups on either side of the founding edge or the child group.
(a) Very different colour and texture values indicate that the groups G2 and G3 were
well developed before being joined along the edge E, and should have high rank as primitives.
(b) Very different area values (especially where one group has a very small area) indicate
that groups G1 and G2 were still growing when being joined along edge E, and should
not be ranked highly as primitives.
Figure 5.1: Contradictions when using edge values to rank segment groups.
where n is the number of Founding Edges Ek that the current segment group is part
of. Each edge Ek has a normalized value between 0 and 1 that represents the difference
between the two groups along the Founding Edge (not including the area component).
In contrast, scoring based around the edges between fellow parent groups is much
simpler as all groups are guaranteed to have a single co-parent (with the exception of
the final group in the process, which represents the entire image). Figure 5.5 shows the
distinction between Founding Edges and Parental Edges.
Unlike Founding Edges, which are generated and stored during the segmentation pro-
cess, Parental Edges are evaluated during the later ranking process. Using Parental Edges
(and Parent Groups) provides greater simplicity, requires only a single edge calculation
to rank each Parent Group pairing and is more likely to pair groups of diverse description
(Founding Edges will never feature groups generated later in the segmentation process
than the Parental Edges). For this reason, it was decided that Parental Edge values would
be used to evaluate segments, as shown in algorithm 5.1.
(a) Evaluating the score of segment group 2
from the founding edges that it is linked with.
(b) Evaluating the score of segment group 2
from the edge between it and group 3, which
both form parents to child group 4.
Figure 5.2: Segment group scoring based upon edge values.
5.1.2 Area Based Ranking
Another important factor when evaluating the usefulness of a segment is its size. Whilst
the inclusion of a size bias may at first seem counterproductive to invariant recognition, it is
an important component in human vision and represents several practical advantages. Seg-
ment groups with larger areas have a greater capacity to contain useful textural, boundary
and shape information. Useful image primitives are also more likely to occur with larger
segments where aliasing in the image has a less detrimental effect and the reliability and
scope for variation in descriptions (such as those derived from boundaries) that depend
upon more developed levels of grouping are increased. In human vision, larger image objects
will also usually take precedence in terms of recognition over smaller ones. Conversely,
noise pixels and groups with little geometrical value will be those groups that have very
little area. As virtually all invariant descriptions are derived from relational measures and
ratios between properties, descriptions based around shape and area (which are limited to
a 1 pixel minimum resolution) will have a larger and more stable range in groups of larger
sizes (see figure 5.6).
Ranking of groups based around their size increases the importance of the less volatile
and information rich groupings while reducing the impact of noise. Another reason for
utilizing size and area information when ranking groups is that, as discussed earlier, large
differences in size indicate undeveloped regions and low differences in size indicate better
developed regions (figure 5.1). Unfortunately, when dealing with real images, the area
ratio alone is not sufficient when ranking groups. Using ratios alone would result in
(a) Photographic Images: Avg. 10000.3 out of 19999 are segment groups linked to Founding
Edges; Mean Founding Group area: 1.0; Mean Other Group area: 436.8.
(b) Images with Gestalt Content: Avg. 10165.6 out of 19999 are segment groups linked to
Founding Edges; Mean Founding Group area: 1.1; Mean Other Group area: 208.4.
Figure 5.3: Results relating to segment groups that form the basis for used Founding
Edges and differences between Gestalt and non-Gestalt image types. Groups associated
with Founding Edges have been used as part of a successful merge.
disproportionately high scores where single pixels are first joined into pixel pairs with
a ratio of 1:1, producing a maximal ranking. It is clear that an absolute measure of
group area should be used when ranking segment groups. Algorithm 5.2 uses the smallest
normalized area component of the two groups connected by a parental edge to determine
their rank scores.
5.2 Combining Edge and Area Ranking
Segmentation results using Parental Edge magnitude work well with Mondrian image types
(figure 5.8a) but are susceptible to noise in natural images (figure 5.7a). Conversely the use
of minimum Parental Area works effectively with natural images (figure 5.7b) but favours
undeveloped large areas in Mondrian images (figure 5.8b). Algorithm 5.3 combines both
measures to generate a ranking algorithm that works effectively for both image types (see
figure 5.9).
5.3 Correspondence Ranking
Although the combined ranking algorithm represents an improvement to segment scoring,
it can be seen from figure 5.9 that both image types still suffer from noise and the effects of
large background sub-segments. As one of the main aims of the segment grouping process
(a) Photographic Images: Avg. 10000 out of 19999 are segment groups linked to Founding
Edges; Mean Founding Group area: 1.0; Mean Other Group area: 363.66.
(b) Images with Gestalt Content: Avg. 10058 out of 19999 are segment groups linked to
Founding Edges; Mean Founding Group area: 1.173; Mean Other Group area: 303.782.
Figure 5.4: Results relating to segment groups that form the basis for used Founding
Edges, a specific example. The added gestalt groupings between larger groups, generated
by the superimposed black dot pattern, results in extra Founding Groups of larger area.
is to extract equivalent groups from different images (see figure 5.10), good segment groups
are those that remain present even after undergoing a transformation.
We can use this property as a means to further filter out noise and segment groupings
that do not sufficiently represent image objects, such as large background sub-segments.
The development and shape of large background sub-segments that score too highly are
largely dependent upon the way the segmentation process operates. This results in under-
developed segments with many possible identical connections evolving in different ways
depending upon the way the architecture orders identical connections. Such groups are
unlikely to have good corresponding groups in a transformed image (see figure 5.11) and
the comparison of segment groups generated between the original image and a transformed
image can be used to filter out these unwanted segment groups. Finally, only the more
robust segment groups will maintain equivalence under a transformation, a property which
(a) The Founding Edge of group 4 is gen-
erated by the relationship between ancestor
groups 2 and 3, which are also the direct par-
ent groups to group 4. This results in both
the Founding Edge and Parental Edge being
identical and linking identical groups.
(b) In this case, group 5 was created by the
merging of its parents, groups 1 and 4, the
edge between these is the Parental Edge. The
actual edge along which the merge occurred
(the Founding Edge) was between the ances-
tor sub-groups 1 and 2. This demonstrates
that the Founding Edge and the Parental
Edge are not always identical and can adjoin
different groups.
Figure 5.5: Demonstrating the difference between Founding Edges and Parental Edges.
makes them valuable as the basis of an invariant description between equivalent images.
The Correspondence ranking algorithm 5.4 details the process of finding consistent
segment groups by comparing them with (de-transformed) groups generated from the
same image, but under a transformation.
Whilst this is slower than ranking purely by area and grouping history, requiring two
separate segmentations of the same image, it can be sped up if we select transformations
that reduce the size of the source image, reducing the segmentation overheads. A tradeoff
must be made, depending upon the priorities of the segmentation engine, between speed
and better segment ranking.
Upon visual inspection of ranked segmentation results, the implementation of correspondence
ranking was observed to result in minimal improvement while nearly doubling
processing requirements. For the remainder of this work, it was decided that correspondence
ranking would be abandoned in favour of the much faster ranking by minimum area
algorithm detailed in the previous section (5.2). Figures 5.12 and 5.13 show examples
of final segmentation rankings.
As the ranking system is dependent upon the size of the optimal edge required to
Algorithm 5.1 Ranking by Parental Edge
Good for Mondrian images.
Less effective for natural images (noise pixels often have high edge values).
E = Edge(Group1, Group2)
Segment Ranking Score(Group1) = Segment Ranking Score(Group2) = 1 − E
Where edge E is the Parental Edge and has a normalized value between 0 and 1 that
represents the difference (not including the area component) between two groups that
were merged to form a child group. Both these parent groups will be assigned this score.
The Segment Ranking Score lies between 0 and 1, with lower values indicating higher fitness.
combine two parent groups into a new child group, weighted by the size of the smallest parent
group, the ranking output will have a tendency to appear in pairs with the same score (as
shown in rankings 1 and 2, 3 and 4 etc. in figure 5.13). Whilst the outputs do provide useful
groupings to base a search upon, they are not as good as expected. This would appear
to be partly due to approximations of descriptors such as shape and position resulting
in larger differences between parent groups than are actually perceived by humans, and
partly due to the rather simplistic (but fast) method used to actually rank the segments.
On the whole, whilst the segmentation engine is indeed providing us with many useful
groupings, the ranking system as currently defined is not optimally separating out the
useful groups from the less useful groups. However, as the entire segmentation history is
kept in this system, some method for reducing the number of segment groupings going
on to form descriptions for the search engine is essential, and these results were judged
sufficient for our current purposes. Improvement of the segment ranking engine is one
area that lends itself to exploration in further work and could greatly improve recognition
results in the final search engine.
(a) Area: 5 pixels (b) Area: 24 pixels (c) Area: 104 pixels
Figure 5.6: Larger groups allow greater resolution, less sensitivity to error and better
textural, shape and boundary information.
Algorithm 5.2 Ranking by smallest parent group
Good for natural images, but can miss important smaller details.
Natural filter for noise.
Less effective with Mondrian images.
A1 = Area(Group1)
A2 = Area(Group2)
S = min(A1, A2)
Segment Ranking Score(Group1) = Segment Ranking Score(Group2) = 1 − S
Where A1 and A2 are the normalized area values of both parent groups being evaluated.
The Segment Ranking Score lies between 0 and 1, with lower values indicating higher fitness.
Algorithm 5.3 Combined ranking algorithm
Combines both Parental Edge and minimum Parental area ranking.
E = Edge(Group1, Group2)
A1 = Area(Group1)
A2 = Area(Group2)
R = E × min(A1, A2)
Segment Ranking Score(Group1) = Segment Ranking Score(Group2) = 1 − R
Where edge E is the Parental Edge and has a normalized value between 0 and 1 that
represents the difference (not including the area component) between two parent groups
(Group1, Group2) that were merged to form a child group. A1 and A2 are the normalized
area values of both parent groups. The resulting score between 0 and 1 (where lower
values indicate higher fitness) is assigned to both parent groups.
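Expressed in code, the combined rule above reduces to a one-liner; the following Python sketch assumes all inputs are already normalized to [0, 1]:

```python
def combined_rank_score(edge_value, area1, area2):
    """Sketch of Algorithm 5.3: combined Parental Edge / minimum parent area ranking.

    edge_value: normalized Parental Edge difference (area component excluded)
    area1, area2: normalized areas of the two parent groups
    Returns a score in [0, 1]; lower values indicate higher fitness, and both
    parent groups receive the same score.
    """
    return 1.0 - edge_value * min(area1, area2)
```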
Algorithm 5.4 Correspondence ranking algorithm
S1 = Seg(I)
It = I(M × R × S)
S2 = Seg(It)
S2 = S2(S⁻¹ × R⁻¹ × M⁻¹)
Where I is the original image, Seg() is the segmentation algorithm, Sn is a list of segment
groups and M, R and S are Mirror, Rotation and Scale transformations (the second group
set is de-transformed by applying their inverses in reverse order). S1 and S2 should
now contain equivalent good segment groups, with bad or unreliable groups not having
equivalents.
For all Gi in S1:
    Ge = most similar segment group in S2
    E = Edge(Gi, Ge)
    Correspondence Score(Gi) = Segment Ranking Score(Gi) × E
Where both the Correspondence Score and Segment Ranking Score lie between 0 and 1,
with lower values indicating higher fitness.
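A loose Python sketch of this procedure is given below; `seg`, `transform`, `inverse` and `edge` are assumed callables standing in for the segmentation engine, the M·R·S transformation, its inverse, and the normalized edge difference, and `rank_score` is assumed to hold the score from Algorithm 5.3:

```python
def correspondence_scores(image, seg, transform, inverse, edge):
    """Correspondence ranking sketch (cf. Algorithm 5.4); all names are assumptions."""
    s1 = seg(image)                                   # groups from the original image
    s2 = [inverse(g) for g in seg(transform(image))]  # de-transformed groups from the copy
    scores = {}
    for g in s1:
        ge = min(s2, key=lambda h: edge(g, h))        # most similar de-transformed group
        scores[g] = g.rank_score * edge(g, ge)        # robust groups keep low (fit) scores
    return scores
```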
(a) Ranking based upon Primary Edge values.
(b) Ranking based upon minimum parent area
Figure 5.7: Top 20 ranked segments with a natural image type. Notice that the Primary
Edge based results suffer adversely from small noise pixels.
(a) Ranking based upon Primary Edge values.
(b) Ranking based upon minimum parent area
Figure 5.8: Top 20 ranked segments with a Mondrian image type. Notice that the undevel-
oped backgrounds in the minimum Parent area based results are being scored too highly
whereas the use of Primary Edge ranking successfully extracts the major groupings.
(a) Mondrian Image Type
(b) Natural Image Type
Figure 5.9: Top 20 Results of combined Parental Edge and minimum Parental area ranking.
(a) (b)
Figure 5.10: Showing a simulated example of desirable direct segment equivalence. In
order to recognise the similarity between objects in photographic images segment groups
that represent image objects regardless of transformation are encouraged.
(a) Original image (b) Transformed image
(c) Original group (d) Corresponding
group from trans-
formed set
(e) Low fitness group-
ing
(f) Nearest grouping in trans-
formed set
Figure 5.11: Segments of high fitness will appear consistently in images of the same subject,
regardless of transformations. Significant groups from the original set (shown on the
left) will have corresponding groups in the segmentation of a deliberately transformed
image (right). Partially grown segments (such as in e to f) will not have corresponding
counterparts in the transformed set and can therefore be assigned a low score.
Figure 5.12: Example showing final groups extracted by rank; whilst sufficient for our
purposes, these results indicate that the ranking engine still has scope for improvement.
Figure 5.13: Example showing final groups extracted by rank; whilst sufficient for our
purposes, these results indicate that the ranking engine still has scope for improvement.
Chapter 6
Generating Segment Group
Descriptions
6.1 Overview Of Description Types
At this point in the process we have a ranked list of segment groups, each with a traceable
grouping history and it’s own appearance attributes. From these features it is possible to
generate descriptions of different degrees of invariance that can be used to facilitate efficient
matching between individual segment group lists or pairwise signature descriptions. Such
identifiers can be broadly broken down into the following three primary types:
1. Appearance Descriptions
2. Relational Descriptions
3. Structural Descriptions
Parallel to this, it is important to consider the degree of invariance that description
types offer and which type of similarity we require. The different levels of invariance
rest along a scale that stretches from direct image content matching and cartoon recog-
nition to recognition of photographic image content undergoing perspective and lighting
change. These can be broadly separated into two categories: geometric (table 6.1) and
colour/intensity (table 6.2).
As a general rule, greater amounts of information are required to generate higher levels
of invariance. The very fact that each level of invariance disregards certain information
contained within the image (in order to remain constant under change in that information)
can result in a loss of descriptive power. High orders of invariance can also suffer from
error propagation where initial discrepancies in values (due to image aliasing or noise) have
a detrimental effect upon the final description, although this can be limited through the
More suitable for direct image matching, trademark recognition etc.
1. Euclidean Invariance
(Invariance to translation and rotation)
2. Similarity Invariance
(Euclidean invariance, scale invariance)
3. Affine Invariance
(Similarity invariance, distortion/stretching invariance)
4. Projective Invariance
(Affine invariance, perspective semi-invariance)
More suitable for photographic image content matching.
Table 6.1: Showing geometric invariants of increasing order. In general, low level invariants
provide good descriptors for tasks requiring direct match while higher levels are more
appropriate for content that has undergone transformation, such as that caused by changes
in viewpoint in photographic images.
careful use of reliability measures. Whilst full perspective invariants cannot be generated
from a single image source, projective and affine invariants can be effective for identifying
such transformed objects if a sufficient quantity of sub-features are considered.
6.2 Photometric Descriptions
Photometric descriptions can be generated from a single segment group and are based
around a segment's colour values. Information about colour, texture and intensity is most
effective when used to identify photographic content and is particularly robust to geometric
and perspective change. Whilst these measures are generally less effective at identifying
photographic objects under differing lighting conditions, or images (such as greyscale)
without colour diversity, satisfactory levels of semi-invariance can be achieved. Of partic-
ular use is the HLS (Hue, Luminance, Saturation) representation where colour properties
can be more naturally represented and separated out. Figure 6.1 shows both RGB (the
default computer representation of colour information) and the HLS space that can be
calculated from this. Luminance values remain constant over colour change, whereas Sat-
uration levels will generally remain constant over common image transformations such as
brightening/darkening. Saturation is also a useful, often overlooked, description property
of image type. Even with different luminance and hue values, images with similar satu-
ration levels will be perceived as having similar properties. Images with zero saturation
More suitable for photographic and colour/intensity variable content
1. Intensity differences
(semi invariance to intensity changes)
2. Normalized rgb
(semi invariance to saturation changes)
3. Hue descriptions
(semi invariance to changes in saturation and intensity)
4. Geometrical descriptions
(colour information abandoned in favour of geometry)
More suitable for symbolic content where geometry is used for recognition
Table 6.2: Types of colour description in rough ascending order. Many of these are
useless when used to describe grey-scale or black and white images; separation into distinct
categories is also more problematic.
levels are greyscale, mid level saturation suggests more ‘natural’ and photographic scenes
while large saturation values suggest artificial or even cartoon style images. Hue is espe-
cially useful for its tolerance to changes in natural lighting conditions, making it an ideal
measure for colour photographic image content. Unfortunately, Hue has the drawback of
being an angular measure, so for accurate similarity measures, polar comparison must be
used. Standard Euclidean, or even Manhattan, linear similarity measurements can be used
if the angular Hue measure is first transformed into a unit vector for comparison, although
this will result in a warping of similarity measurements. Further distortion occurs due to
transformation from an original discrete RGB space.
Descriptions based around colour/intensity values can be highly localized, generated
from small sub-neighborhoods of component pixels, or be created from average values of
segment appearance. Descriptions that are generated from RGB values include:
Intensity
I(R, G, B) = R + G + B
Normalized colours
r(R, G, B) = R / (R + G + B)
g(R, G, B) = G / (R + G + B)
b(R, G, B) = B / (R + G + B)
Saturation
S(R, G, B) = (max(R, G, B) − min(R, G, B)) / 255
where max() is the highest colour value and min() is the lowest colour value
Unit Hue vector
Hx(R, G, B) = sin(H(R, G, B)), Hy(R, G, B) = cos(H(R, G, B))
where:
H(R, G, B) = 0 if R = G = B (S = 0)
H(R, G, B) = ( (G − B) / (max(R, G, B) − min(R, G, B)) ) · π/3 if R = max(R, G, B)
H(R, G, B) = ( 2 + (B − R) / (max(R, G, B) − min(R, G, B)) ) · π/3 if G = max(R, G, B)
H(R, G, B) = ( 4 + (R − G) / (max(R, G, B) − min(R, G, B)) ) · π/3 if B = max(R, G, B)
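Assuming 8-bit RGB input, these definitions translate directly into code; the short Python sketch below follows them literally (the function name and the zero-sum guard for the normalized rgb values are added assumptions):

```python
import math

def photometric_descriptors(R, G, B):
    """Photometric values per the definitions above, for 8-bit channels (0-255).
    Returns intensity, normalized (r, g, b), saturation and the unit Hue vector."""
    total = R + G + B
    I = total
    r, g, b = ((R / total, G / total, B / total) if total else (0.0, 0.0, 0.0))
    mx, mn = max(R, G, B), min(R, G, B)
    S = (mx - mn) / 255.0
    if mx == mn:                       # R = G = B: hue undefined, set to 0
        H = 0.0
    elif mx == R:
        H = ((G - B) / (mx - mn)) * (math.pi / 3)
    elif mx == G:
        H = (2 + (B - R) / (mx - mn)) * (math.pi / 3)
    else:                              # B is the maximum channel
        H = (4 + (R - G) / (mx - mn)) * (math.pi / 3)
    return I, (r, g, b), S, (math.sin(H), math.cos(H))
```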
6.3 Geometric Descriptions
Unfortunately, these colour measures are almost useless when used to describe black and
white, cartoon, symbolic or shape dependent images. Identifying segment groups based
around shape is generally a more difficult task, but necessary where intensity/colour infor-
mation is either not available or disruptive to the similarity desired. Of particular use for
shape description is the convex hull of the segment group, which provides extra informa-
tion about the shape and also facilitates similarity recognition between collections of shapes,
sketch outlines and full segment groups that define similar gestalt shapes. As the convex
hull of a group is derived from that group’s contour, relationships between the group and
its convex hull are retained even under Affine transformations. Slightly harder to calculate
efficiently, but extremely useful for invariant description, are inherent shape topological
features such as Euler's number (which is based around the number of holes within an
object) and hole to region ratios. Topological descriptors are particularly appropriate for
the categorization of shapes into broad types, even where instances of appearance and
dimension may differ greatly. Such topological similarities are easily perceived and widely
used in human vision. Many description types of lower invariance, especially directional
vectors, can form the basis of ratios between multiple related groups in order to achieve
higher invariance levels, as will be shown later.
Single group non-invariants used to detect direct/symbolic similarity:
Proportionate Areas
Pa = area / (image width × image height)
PHa = convex hull area / (image width × image height)
Proportionate Centroid Positions
Pcx = x̄ / image width
Pcy = ȳ / image height
PHcx = hull x̄ / image width
PHcy = hull ȳ / image height
Semi-invariant descriptions derived from single segment groups:
Principal axes of inertia (translation and scale invariant)
AIx = sin(θ), AIy = cos(θ)
where θ = (1/2) tan⁻¹[ 2m11 / (m20 − m02) ]
Unit vector between centroid and hull centroid
vcx = vx / l
vcy = vy / l
where l = √(vx² + vy²)
vx = hull x̄ − x̄
vy = hull ȳ − ȳ
Eccentricity (Similarity invariant)
E = ( m20 + m02 + √((m20 − m02)² + 4m11²) ) / ( m20 + m02 − √((m20 − m02)² + 4m11²) )
Non-compactness (Similarity invariant)
NC(perimeter, area) = perimeter² / area
NCH(hull perimeter, hull area) = hull perimeter² / hull area
Boundary moments (Similarity invariant)
F1 = (p2)^(1/2) / o1
F2 = p3 / (p2)^(3/2)
F3 = p4 / (p2)²
where
or = (1/N) ∑_{i=1}^{N} [z(i)]^r
pr = (1/N) ∑_{i=1}^{N} [z(i) − o1]^r
N = number of boundary points
z(i) = sequence of Euclidean distances from boundary points to centroid
Shape moments up to order 3 (Similarity invariant)
ρ1 = ν20 + ν02
ρ2 = (ν20 − ν02)² + 4ν11²
ρ3 = (ν30 − 3ν12)² + (3ν21 − ν03)²
ρ4 = (ν30 + ν12)² + (ν21 + ν03)²
Shape moments up to order 4 (Affine invariant)
I1 = ( m20·m02 − m11² ) / m00⁴
I2 = ( m30²·m03² − 6·m30·m21·m12·m03 + 4·m30·m12³ + 4·m21³·m03 − 3·m21²·m12² ) / m00¹⁰
I3 = ( m20·(m21·m03 − m12²) − m11·(m30·m03 − m21·m12) + m02·(m30·m12 − m21²) ) / m00⁷
I4 = ( m20³·m03² − 6·m20²·m11·m12·m03 − 6·m20²·m02·m21·m03 + 9·m20²·m02·m12²
      + 12·m20·m11²·m21·m03 + 6·m20·m11·m02·m30·m03 − 18·m20·m11·m02·m21·m12
      − 8·m11³·m30·m03 − 6·m20·m02²·m30·m12 + 9·m20·m02²·m21²
      + 12·m11²·m02·m30·m12 − 6·m11·m02²·m30·m21 + m02³·m30² ) / m00¹¹
where (for all moment calculations above):
νpq = mpq / (m00)^γ
γ = (p + q)/2 + 1
mpq = ∑ (x − x̄)^p (y − ȳ)^q f(x, y)
where f(x, y) = 1 in this (binary) case,
x, y are the coordinates of a group pixel, and
x̄, ȳ are the group centroid coordinates.
Convex hull area and group area ratio (Affine invariant)
HA(hull area, group area) = hull area / group area
Euler's Number (Affine invariant)
EN = Filtered(S) − Filtered(N)
where Filtered(i) = number of elements in i of area greater than mean area(i) / 4,
S is the set of contiguous parts, and
N is the set of holes within the contiguous parts.
Hole Area and Region Area ratio
Hr = Hole area / Region area       if Region area ≥ Hole area
Hr = 2 − Region area / Hole area   if Region area < Hole area
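As a concrete illustration of how the moment-based measures are computed, the Python sketch below evaluates the central moments and two of the similarity invariants for a binary group supplied as a list of pixel coordinates. The function names are illustrative, and the centroid is recomputed on each call for clarity rather than speed:

```python
import math

def central_moment(pixels, p, q):
    """m_pq for a binary group; pixels is a list of (x, y) and f(x, y) = 1."""
    n = len(pixels)
    cx = sum(x for x, _ in pixels) / n   # group centroid x̄
    cy = sum(y for _, y in pixels) / n   # group centroid ȳ
    return sum((x - cx) ** p * (y - cy) ** q for x, y in pixels)

def eccentricity(pixels):
    """E from the second-order central moments; note that degenerate,
    line-like groups drive the denominator towards zero."""
    m20 = central_moment(pixels, 2, 0)
    m02 = central_moment(pixels, 0, 2)
    m11 = central_moment(pixels, 1, 1)
    root = math.sqrt((m20 - m02) ** 2 + 4 * m11 ** 2)
    return (m20 + m02 + root) / (m20 + m02 - root)

def non_compactness(perimeter, area):
    """NC = perimeter^2 / area (similarity invariant)."""
    return perimeter ** 2 / area
```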
6.4 Pairwise Descriptions
Whilst single region descriptions are very useful for recognition purposes, especially lower
level invariance, descriptions based around pairwise relationships of regions provide a larger
source of primitives from which higher levels of invariance can be derived. For pairwise
descriptions to be effective, two often conflicting conditions must be satisfied. The two
regions should be related to each other in such a way that they are both similarly trans-
formed by whatever transformations occur in image content. In the perspective case, this
requires the objects they represent to be placed far enough from the camera and close
enough to each other that the transformations between them can be approximated by
affine invariants. The second condition is that the descriptions generated from the two
regions contain enough dissimilarity to generate robust and distinctive invariant descrip-
tions from. In natural images, these two criteria do not often sit well together, as a large
factor in defining a relationship between two objects is their similarity.
Given our pairwise tree structure generated from the segmentation engine, we already
have two immediate types of pairwise region relationships we can use to generate pairwise
descriptions. Each region in our segmentation tree has been generated as a result of some
similarity relationship between two smaller parent/grandparent regions. This Founding
Edge between Founding Parents precipitates a merge between two distinct Parent Regions
(which could be the founding parents, or their progeny). Because Founding Edges may
have been generated during a much earlier generation of the segmentation process, the two
Founding Regions that cause the pairwise grouping may well be fairly distant ancestors,
and are not necessarily the same as the Parent Regions that are actually combined. Most
Founding Regions in this segmentation scheme are matched during the early stages and
can be considerably different to the child groups that they will actually join. Given that
the same content in different images should follow the same patterns of combination,
these parent groups form good candidates from which pairwise descriptions/relationships
can be derived. Region Parents are especially useful because (unlike Founding Parents)
they are not as constrained to being similar in appearance and can generate a wider range
of relational values.
Parent region colour/intensity based pairwise linear invariants:
Intensity Difference
I(Region1, Region2) = Abs(I(R1, G1, B1) − I(R2, G2, B2))
Red Channel Difference
R(Region1, Region2) = Abs(R1 − R2)
Green Channel Difference
G(Region1, Region2) = Abs(G1 − G2)
Blue Channel Difference
B(Region1, Region2) = Abs(B1 − B2)
Saturation Difference
S(Region1, Region2) = Abs(S1 − S2)
Parent region geometric pairwise translation invariants:
Region Centroid Axis Differences
(C1x − C2x) / image width
(C1y − C2y) / image height
√((C1x − C2x)² + (C1y − C2y)²) / √(image width² + image height²)
Parent region colour/intensity based pairwise ratios:
Ir = I1 / I2 if I2 ≥ I1;   I2 / I1 if I2 < I1
Rr = R1 / R2 if R2 ≥ R1;   R2 / R1 if R2 < R1
Gr = G1 / G2 if G2 ≥ G1;   G2 / G1 if G2 < G1
Br = B1 / B2 if B2 ≥ B1;   B2 / B1 if B2 < B1
Sr = S1 / S2 if S2 ≥ S1;   S2 / S1 if S2 < S1
Parent region geometric pairwise invariants:
Ar = Area1 / Area2 if Area2 ≥ Area1;   Area2 / Area1 if Area2 < Area1
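All of these ratios follow the same pattern, dividing the smaller value by the larger so the result stays in [0, 1]; a single helper captures it. The zero-value handling here is an added assumption, since the text defers such degenerate cases to the weighting scheme of section 7.3:

```python
def bounded_ratio(v1, v2):
    """Order-independent ratio in [0, 1]: the smaller value over the larger,
    matching the piecewise pairwise-ratio definitions above."""
    if v1 == 0 and v2 == 0:
        return 1.0                 # both zero: treat as identical (assumption)
    return min(v1, v2) / max(v1, v2)
```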
6.5 Preprocessing for Geometric Descriptions
While colour descriptions are generated as part of the segmentation process and can be
easily calculated from raw group information, some of the geometric descriptions require
preprocessing before they can be generated. Group boundary and outer boundary pixels
are required to generate convex hull vectors [Gra] and hole/region counting is required
to create Euler's number. The Grab Hole Details function performs a connective binary
segmentation on the group and results in descriptions of disparate hole and region areas
contained within the group. Resulting hole and region groups are rejected if they are
smaller than a threshold based upon the average size of their type.
Number of Holes = ∑ (holes with area > Mean Hole Area / 4)
Number of Regions = ∑ (regions with area > Mean Region Area / 4)
All holes that lie on the boundary of the grid used in this function are then discounted
so that the surviving holes are truly interior to the group regions.
Each group that will generate a boundary description is also put through the Strip To Boundary
function which removes its inner region pixels to leave the closed boundaries. After
boundary information has been used, these boundary pixels are then submitted to the
Strip To Outer Boundary function, that leaves only the most extreme boundary pixels
that Get Convex Hull uses to generate convex hull vectors.
(a) Our original point in RGB space. (b) Point transformed into HLS space
Figure 6.1: Showing ideal RGB and HLS spaces; the transformation from discrete RGB
space actually results in a double hexagonal pyramid subspace of HLS space being used.
Chapter 7
Searching the Label Database
7.1 Description Labels
The segmentation process retains a very large list of ranked regions linked together by edge
information into a pairwise grouping tree. The majority of the region groups, especially
those representing individual pixels, will be too small to provide any usable geometric
information. Only the highest ranked regions are likely to be of sufficient size and visual
significance to provide useful geometric information, or represent good object primitives
for human like recognition. Given the extra processing needed to generate geometric
descriptions, it is sensible to only use the best n regions from the ranked table when
creating full descriptions. A further motivation for reducing the number of groups used to
generate final descriptions is the increased processing overhead when generating geometric
information such as convex hulls and Euler's number (requiring the recognition of holes
within groups). In this work, the top 256 ranking groups are used to generate description
labels. Each of the selected regions contributes descriptions to a set of arrays that make
up a single label description of that particular image content. It is important to ensure
that the number of individual description types used is sufficient to cover a wide range
of invariant types and levels of invariance available. It was decided that 48 description
types would be sufficient to fulfil these criteria whilst remaining small enough to allow fast
comparison and analysis of results. Each label consists of these 48 description types, of
varying degrees of invariance, each made up of the values and weights generated from
the ranked list of segment groups. The order of each of the 256 individual values in a
description directly mirrors the rank of the segment that produced it, so values generated by
a specific group can easily be re-grouped.
Rank Table: Group A, Group B, Group C, Group D, ..
Results in the following image label[no. descriptions][no. groups used]:
Description[1]=Intensity: i(A), i(B), i(C), i(D), ...
Description[2]=Normalized Red: r(A), r(B), r(C), r(D), ...
....
Description[5]=Saturation: Sat(A), Sat(B), Sat(C), Sat(D), ...
Description[6]=Huex: Huex(A), Huex(B), Huex(C), Huex(D), ...
....
Description[12]=Eccentricity: e(A), e(B), e(C), e(D), ...
....
Each label is stored (without normalization of component values) with an identical
file number to the source image. In this way we are left with a series of labels that
can be searched for similarities and easily matched up with the images they represent
without the need to regenerate expensive segmentation groupings for the library images.
This offline library preprocessing for simplified description labels will facilitate very fast
database querying, also allowing the easy addition of new images to the existing library.
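Conceptually, each label is a 48 × 256 array of raw values indexed by description type and segment rank. The Python sketch below illustrates this arrangement; the names and the use of numpy are assumptions for illustration, not the thesis implementation:

```python
import numpy as np

NUM_DESCRIPTIONS = 48   # description types per label
NUM_GROUPS = 256        # top-ranked groups contributing values

def build_label(ranked_groups, descriptor_fns):
    """Assemble one image label: rows are description types, columns follow
    segment rank, so one column holds all values produced by a single group."""
    label = np.zeros((NUM_DESCRIPTIONS, NUM_GROUPS))
    for col, group in enumerate(ranked_groups[:NUM_GROUPS]):
        for row, fn in enumerate(descriptor_fns):
            label[row, col] = fn(group)   # raw, un-normalized value
    return label
```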
7.2 Normalizing Descriptions
The description types listed above will actually have values with differing degrees of sep-
aration and range. If these measures are to be combined into a single image similarity
evaluation strategy spanning description types then some method of normalizing them
into equivalent units of measure is required. While many of these measures (such as those
relating to area and colour) have easily definable limits, others (such as moments) are
much more difficult to constrain within a normalized range. Merely ensuring that values
lie within an agreed maximum limit is also insufficient, as each measure may well have
differing degrees of separation and normal value ranges which would result in some de-
scription types having a disproportionate influence upon the similarity evaluation (figure
7.1).
The solution used in this work is to use the same range normalization process upon each
description, based around the values generated during database construction rather than
the application of predefined thresholds/limits. In this case we not only need to determine
maximum threshold values for descriptions, but also minimum thresholds. While there
are many ways of determining appropriate upper and lower bounds from a data set, the
approach used in this work is through the use of mean values as shown below.
Normalization (ND) for each description type (D):
ND = (CD − lower) / (higher − lower)
where the constrained description (CD) is:
CD = D        if lower ≤ D ≤ higher
CD = higher   if D > higher
CD = lower    if D < lower
and the higher and lower bounds are determined using mean values:
lower = mean of all Di with Di < mean(D)
higher = mean of all Di with Di ≥ mean(D)
Although the use of median values may provide better higher and lower bounds, and
allow greater tolerance to non-representative values, mean values have been implemented
due to their comparative efficiency. Using this self normalization strategy automatically
constrains each description type to a range defined by the typical variation and bounds of
the values that are to be searched in the image database. This will result in descriptors
that can be evaluated for similarity and combined into an overall score without individual
description type bias.
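A minimal Python sketch of this self-normalization follows, assuming the mean-split reading of the bounds given above; the guards against empty splits and a zero span are added assumptions:

```python
def self_normalize(values):
    """Mean-split range normalization for one description type across the
    database: 'lower'/'higher' are the means of the values below/above the
    global mean; inputs are clamped to that range and rescaled to [0, 1]."""
    m = sum(values) / len(values)
    low = [v for v in values if v < m] or [m]    # guard against an empty split
    high = [v for v in values if v >= m] or [m]
    lower = sum(low) / len(low)
    higher = sum(high) / len(high)
    span = (higher - lower) or 1.0               # avoid division by zero
    return [(min(max(v, lower), higher) - lower) / span for v in values]
```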
7.3 Weighting Descriptions
When dealing with ratios we need the ability to handle the infinite values caused by
division by zero. One way to deal with these exceptions is to assign them a default value
and give them a weight which can be set to zero to remove them from consideration.
Weighting each description also allows us to adjust its influence to reflect the score of the
region group it is generated from.
Each description atom consists of:
value (double precision, normalized by the search database so that 0 ≤ value ≤ 1)
weight (double precision, 0 ≤ weight ≤ 1)
Another case where we may wish to adjust weighting values on an individual level
occurs where the nature of the data available makes a particular description inherently
unsound. An example of this is the volatility of rgb ratios and hue measurements in images
where saturation levels approach zero, which indicates that some weighting based around
saturation levels may be beneficial. Region size has already been used to influence the
region score, so individual weighting against the increased sensitivity to individual pixel
errors/image grid aliasing effects as groups decrease in size would seem unnecessary at
this point. Although not investigated in this work, the implications of such feature-based
weightings may well represent an interesting avenue for further work.
A further possible use for this weighting system that was originally intended to be
covered in this thesis was the use of weights to adjust the contribution of individual de-
scription types to generate optimal recognition results for different image types. Work
reported later in this thesis indicates that the adjustment of weights can indeed improve
recognition results in the general case and could be used to tailor searches to different
search requirements and image types. Automatic relaxation techniques and genetic al-
gorithms would be particularly useful here to determine optimal weighting based upon
search results given different types of source image. These weight profiles could then be
stored and selected either automatically or by the user to enhance and direct the search
engine performance. This topic is discussed later in section 8.2.
7.4 Searching the Database Labels for Image Similarity
Each database image is represented by its label, which in turn is subdivided into description
types which contain the individual descriptors generated from each region group in order
of rank. It is very unlikely in all but the most exact matching images that groups will
be ranked in the same order between two similar images, so a direct comparison of label
content will not be sufficient to determine similarity. We must first establish the best match
between descriptor indexes of each label, which is the equivalent of matching individual
region groups between the two images, as in algorithm 14.1. Correspondence scores are
weighted by both the weight (reliability measure) of the descriptor and a global user
weighting (that can be used to adjust the global input of a description type). Although
one of the fastest (and straightforward) methods of determining equivalence, this does
not represent a one to one correspondence between segments and it is possible for many
query segments to be equated with the same library image segment. This can result in
the counter-intuitive case where recognition scores between query and target images are
non-reversible. The similarity result returned from label comparison is entirely dependent
upon the direction of the query (figure 7.2).
Dependence upon directionality of the query is actually a useful property for this work,
where recognition by image sub-components is a desirable property. It can be argued that
human evaluation of similarity would exhibit a similar bias towards matching. One possible
problem with this type of correspondence selection is where large parts of the query image
are matched to a very small region of the library image, resulting in an unjustifiably high
score. It is anticipated that such problem are unlikely to occur in this work, as selection is
based around the similarity of a large number of description types (including proportional
non-invariants), so the possibility of such effects will be minimal. A possible avenue of
further work would be to investigate the relationship between this correspondence selection
and resulting differences in performance if a one to one constraint was imposed.
Once a best correspondence between region groups has been achieved we can use them
to calculate the similarity between the query and database labels (algorithm 14.2) which
is also weighted by both descriptor weight and a global user weighting. This process is
repeated for each database label, and each label (with its corresponding source image
reference) is sorted by similarity to the query image label. After sorting, each database
image and corresponding similarity score is displayed in a results window (figure 7.3).
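The sketch below gives one simplified Python reading of this best-match comparison; algorithms 14.1 and 14.2 are not reproduced in this chapter, so the structure and names here are assumptions. Each query group independently picks its closest library group, so many query groups may share one library group and the resulting scores are not reversible, as discussed above:

```python
def label_similarity(query, library, user_weights):
    """One-directional label comparison sketch. query/library are lists
    indexed [description][group] of (value, weight) atoms; user_weights is
    the global per-description weighting."""
    n_desc, n_groups = len(query), len(query[0])
    total, weight_sum = 0.0, 0.0
    for qg in range(n_groups):
        # best-corresponding library group for this query group (one-to-many)
        best = min(range(len(library[0])),
                   key=lambda lg: sum(abs(query[d][qg][0] - library[d][lg][0])
                                      for d in range(n_desc)))
        for d in range(n_desc):
            qv, qw = query[d][qg]
            lv, lw = library[d][best]
            w = qw * lw * user_weights[d]
            total += w * (1.0 - abs(qv - lv))    # invert difference into similarity
            weight_sum += w
    return total / weight_sum if weight_sum else 0.0
```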
Figure 7.1: Graphs plotting segment size (middle) and intensity (right) values against the
number of segments at each stage of the segmentation (red lines indicate the global mean
value). The behaviour and spread of these values during a segmentation are very different
indeed.
(a) Difference results (right) will be lower where query content (left) is a subcomponent of
library content (right)
(b) Difference results (right) will be higher where query content (left) contains the library content
(right)
Figure 7.2: Best match correspondence selection results in a one to many match between
query segments and library label segments. Similarity results are non-reversible, dependent
upon which image forms the query.
Figure 7.3: Screenshot of algorithm output
Chapter 8
Evaluation of final algorithm
performance
8.1 General effectiveness with different image types
A dataset of images has been compiled to test the algorithm's general effectiveness
with varying image types. The first set consisted of 55 natural photographic images with 35
equivalent query images, featuring the same image content from different camera positions
(figure 8.1). This dataset was copied and greyscaled to produce a further series of images
to test the effects of eliminating colour information. A further 3 datasets consisting of
facial images [Peipa] (figure 8.2), cartoon (figure 8.3) and Gestalt symbolic (figure 8.4)
images were also generated. Correspondences between query images and their equivalent
search target images in the library set were recorded and used to evaluate the performance
of the recognition algorithm. Results where the target images are ranked highly when
queried with their equivalent search images will indicate good recognition performance.
All libraries and queries from these sets were processed using an image dimension of 80 by
80 pixels.
The first set of tests evaluate the actual score values returned from the natural colour
image set. Score values are generated by inverting the absolute difference
between normalized (weighted) query and library percentage label descriptors. The result
is a percentage measure of how similar the image labels are to each other, because they are
based on difference measures the reported ranges will have a bias towards higher values.
Because all values are calculated in this way, the rank order of results is still preserved.
Figure 8.5 shows that successful recognition is occurring over the entire set, with target
images scoring above average in each query.
Figure 8.6 shows the contribution that each descriptor (intensity, eccentricity etc.) type
makes toward these final score values, calculated using the mean scores of each isolated
Figure 8.1: Samples from the natural image set, query images (above) and their equivalent
library targets (below)
Figure 8.2: Samples from the facial image set, query images (above) and their equivalent
library targets (below)
descriptor over the entire query set. As would be expected given the mean-based self-
normalization of the descriptor values, each descriptor is making an even contribution
toward the overall final score. While this contribution can be further weighted by the user
to artificially suppress description types, in this case user weighting is not activated.
Figure 8.3: Samples from the cartoon image set, query images (above) and their equivalent
library targets (below)
Figure 8.4: Samples from the Gestalt/symbolic image set, query images (above) and their
equivalent library targets (below)
Figure 8.5: Minimum, mean, maximum and target score results for each of the 35 query
images in the natural image set.
Colour based descriptions provide very effective geometry invariant descriptors for pho-
tographic imagery and are commonly used for recognition tasks. A major aim of this work
is to enable recognition across a diverse range of image content, much of which may not
feature colour information, so the next set of tests were to determine the recognition algo-
rithms dependency on colour information. Figure 8.7 shows that recognition performance
is significantly reduced when the greyscaled natural image set is queried. Although the
mean rank of target recognition is increased from 5.89 with the colour set to 18.46, this
still represents above average recognition.
Another feature evident from figure 8.7 is the unexpected lack of correlation between
Figure 8.6: Contributions of description types to final score, generated from the natural
colour library with user weighting disabled.
greyscale and colour results for each query image. While some deviation can be antic-
ipated due to the non-linear changes resulting from the label segment correspondence
process ‘switching’ between preferred segment matches, this would not be expected to
cause such large differences between greyscale and colour results on the same query. If
such switching between label segments is the cause, then it would be expected that ex-
amination of component descriptor performance would show a similar lack of correlation.
Figure 8.7: Comparing rankings generated from greyscale and colour versions of the same
image queries; rank scores of 1 indicate the target has been selected as the best match
However, figures 8.8 and 8.9 show that both the rankings¹ and the scores of individual de-
scriptions show a good degree of correlation (with the obvious exception of colour based
descriptors).
Given that the results are generated from the scores and rankings of the target images
only, it would seem that while target image results are behaving in a correlated manner
between greyscale and colour types the other images in the library may well be ‘switching’
to different label segment matches and interfering with performance. This represents an
area that could be studied in greater depth in future work.
Figure 8.10 shows the overall ranking results of target images through the five im-
age libraries: natural (colour), cartoon (colour), natural (greyscale), faces (greyscale) and
Gestalt/symbolic (black and white). The good performance of the facial image set when
compared to the natural image set is almost certainly due to the nature of the images in
the libraries. While the natural image set contains a larger number of closely matching
images and real-world transformations to impede recognition, the face database images
are taken in relatively controlled environments with fewer image transformation types be-
tween query and target. It can be seen that these results show promise, and the algorithm
is performing good recognition across the different image types.
¹All percentage rank scores are calculated as (max rank − rank) × 100 / (max rank − 1).
Figure 8.8: Correlation between component descriptor scores for both greyscale and colour
natural images
Figure 8.9: Correlation between component descriptor rankings for both greyscale and
colour natural images
Figure 8.10: Target image ranking results across the five image libraries; rank scores of 1
indicate the target has been selected as the best match
Currently, there is no automatic weighting compensation for descriptions that are redundant in certain image types. Greyscale query images will report 100 percent matches over colour descriptors when matching other greyscale image types. This will result in the recognition algorithm favouring images by the presence (or absence) of colour content. This may well be justifiable in terms of human recognition, as the distinction between colour and greyscale imagery does play a large part in similarity evaluation. The effect is also offset by the inclusion of many other forms of geometry based descriptions, which should avoid any unwanted bias. This may well represent an area of potential further work.
8.2 Effects of global user weighting on recognition
One area of interest is the possibility of tailoring queries to recognize different similarity or image types through the adjustment of global weights applied to the label descriptors.

Figure 8.11: Shows an improvement in greyscale image recognition performance through the global weighting of descriptor values.

As label descriptions are not weighted or normalized before storage in the library, they directly represent raw descriptions of image content. The benefit of this strategy is that once a label
is generated there is no need to repeat the processor-intensive segmentation/grouping processing for subsequent recognition queries. This also means that query weighting and
adjustment can be performed extremely easily and efficiently in real-time. This all relies
on the ability of global descriptor weighting to improve and control recognition results.
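As an illustration of why re-weighting is cheap at query time, the following minimal C++ sketch applies a set of global descriptor weights to stored raw label values when computing a difference score. The structure and function names here are hypothetical illustrations, not the thesis implementation:

#include <cmath>
#include <cstddef>
#include <vector>

// A stored label: one raw (unweighted) value per descriptor type.
// (A hypothetical simplification of the thesis label structure.)
struct Label { std::vector<double> values; };

// Weighted difference between a query label and a library label.
// Because labels store raw values, changing 'weights' requires no
// re-segmentation: only this cheap loop is re-run per query.
double weightedDifference(const Label& query, const Label& library,
                          const std::vector<double>& weights)
{
    double diff = 0.0, wsum = 0.0;
    for (std::size_t i = 0; i < query.values.size(); ++i) {
        diff += weights[i] * std::fabs(query.values[i] - library.values[i]);
        wsum += weights[i];
    }
    return wsum > 0.0 ? diff / wsum : 1.0; // 1.0 = maximally different
}

Re-running a query with new weights then costs only a single pass over the stored labels.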
The next experiment was set up to test whether this is a viable proposition, and to see if
we can improve on our greyscale natural image recognition results using purely geometric
descriptors. From the component description results in figure 8.9, it can be seen that, of all the geometry based descriptions, boundary compactness, area ratios, centroid y axis proportions and convex hull centroid y axis proportions are performing better than the other geometry descriptors. Figure 8.11 shows that recognition performance was improved, with mean rank reduced by 4 places (a 7.7 percent improvement), when the same experiment was repeated on the greyscale dataset using only these four descriptors. This indicates that there is good potential for automatic relaxation and user-based weighting adjustments to generate improved recognition results. A further unexpected outcome is that these results reveal some unanticipatedly good descriptor types for natural image content. The success of the y axis proportion measures (non-invariants expressing the y coordinate as a proportion of image size) makes sense when one considers that most camera movements in photographic images occur on the horizontal plane. This explains the effectiveness of these non-invariant y axis descriptors, and indicates that further analysis of such occurrences could provide insight into effective descriptor combinations.
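A minimal sketch of how such a y axis proportion descriptor might be computed, assuming a simple point-set region representation (the names are illustrative rather than the thesis code):

#include <cstddef>
#include <vector>

struct Pixel { int x, y; };

// Centroid y coordinate of a region, expressed as a proportion of the
// image height. This is not invariant to vertical translation or to
// rotation, but it is stable under the horizontal camera movements
// that dominate photographic imagery.
double centroidYProportion(const std::vector<Pixel>& region, int imageHeight)
{
    if (region.empty() || imageHeight <= 0) return 0.0;
    double sumY = 0.0;
    for (const Pixel& p : region) sumY += p.y;
    return (sumY / region.size()) / imageHeight; // range [0, 1)
}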
Figure 8.11 shows that the recognition engine successfully recognizes purely geometric image content in realistic and difficult image search situations.
8.3 Tolerance to realistic transformations
Our next set of tests was to determine the recognition algorithm's behaviour under image transformations. A series of artificial images was generated from the same three source images (linear, greyscale and colour source images, featuring the same geometry) after known transformations: rotation (figure 8.17), affine (figure 8.18) and translation. These image sets are for the evaluation of the algorithm's resistance to transformations and the difference between colour and geometrical similarity cues. All but the most artificial image transformations will result in the introduction of new image content and the loss of old image content, even if the result is the changing dimension of a white background. To emulate this, these library images are sub-images taken from a larger image context that is introduced or removed as the transformation requires. This change in context (figure 8.12) is certain to introduce a level of noise into query results as the query image is no longer querying the exact same content.
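As an illustration of the sub-image strategy, the following C++ sketch (illustrative names, not the test harness used here) crops a fixed-size window from a larger context image at a given offset; shifting the window emulates translation, with content leaving one edge while new context enters at the other:

#include <cstddef>
#include <vector>

// A trivial greyscale image: row-major pixel values.
struct Image {
    int width = 0, height = 0;
    std::vector<unsigned char> pixels; // size = width * height
};

// Crop a winW x winH sub-image whose top-left corner lies at (ox, oy)
// in the larger context image. Pixels falling outside the context are
// left at zero.
Image cropWindow(const Image& context, int ox, int oy, int winW, int winH)
{
    Image out;
    out.width = winW;
    out.height = winH;
    out.pixels.assign(static_cast<std::size_t>(winW) * winH, 0);
    for (int y = 0; y < winH; ++y)
        for (int x = 0; x < winW; ++x) {
            int cx = ox + x, cy = oy + y;
            if (cx >= 0 && cx < context.width && cy >= 0 && cy < context.height)
                out.pixels[static_cast<std::size_t>(y) * winW + x] =
                    context.pixels[static_cast<std::size_t>(cy) * context.width + cx];
        }
    return out;
}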
These experiments were conducted on images of dimension 100 by 100 pixels. The rotation results (figure 8.13) show similarity scores over a library of gradually rotating images, with variant descriptors allowing recognition of the closest rotation whilst invariant descriptors facilitate tolerant recognition levels for other rotations. The secondary peak in results at 180° is to be expected, as this rotation value does not introduce or remove image features from the query. Similar linear results can be seen for the translation results (figure 8.14), with the maximum translation value of 100 percent representing completely different image content to the query.
For both colour and greyscale image types, recognition levels gradually decrease at greater degrees of translation. Linear image types show a fairly level recognition rate for translations from 10 to 100 percent, indicating much poorer performance. Affine results (figures 8.15, 8.16) also show the expected drop in performance at increasing levels of affine transformation. The improved scoring during affine 'squeezing' when compared to 'stretching' is due to the continued (but distorted) presence of the query image features. Affine stretch results in the original features vanishing over the edge of the image.
8.4 Comparison with human decisions
Another important aspect of this work is how it compares with human similarity judgements. A JavaScript web survey was written to gather basic information regarding real human similarity decisions. This application presents five tests, four for the manmade, natural, human and cartoon/sketch categories and a final test that includes all categories. In each test the subject is presented with a target image and a series of five randomly chosen images. The task of the subject is to arrange the five images in order of similarity
to the target image. The 84 test subjects generated a set of 420 results to be used to
compare the performance of this work with human similarity decisions.
Evaluation of the first 30 of these search results indicated relatively poor performance (the algorithm's ranking decisions deviated from the human ranking decisions by a mean of 2.3, only marginally better than chance) when compared to previous results. One cause of this may be the limited number of images presented in each test, and the random manner in which they are selected. Many of the library images presented for similarity evaluation in the tests are likely to bear very little in common with the query image. While the human test subjects were still ranking such image sets by similarity, in many cases this may have involved a degree of randomness. The limitations of the dataset and the number of images per query mean that these results are hardly conclusive. This is definitely an area that warrants further examination in future work.
Figure 8.12: (a) query image, (b) translated image, (c) superimposed. This translation to the right results in the loss of information (rightmost yellow shaded region) and the addition of new information (left blue region).
Figure 8.13: Similarity score performance over increasing rotation
Figure 8.14: Similarity score performance over increasing translation
Figure 8.15: Similarity score performance over affine transformations (stretch)
Figure 8.16: Similarity score performance over affine transformations (squeeze)
Figure 8.17: The colour rotation sample set: (a) 0°, (b) 36°, (c) 72°, (d) 108°, (e) 144°, (f) 180°, (g) 216°, (h) 252°, (i) 288°, (j) 324°.
Figure 8.18: The greyscale affine sample set: (a) 0°, (b) 36°, (c) 72°, (d) 108°.
Figure 8.19: Screenshot from the ‘Human-Like Survey’.
Chapter 9
Conclusions
The aim of this thesis was to investigate, and implement, a plausible architecture for
recognition of general image content in a human-like way. First we reviewed the subject
area and formulated an approach that would represent a plausible architecture. The next
critical stage was the development of group primitives from raw image content that could
be used to generate higher level geometric and photometric description types that could
plausibly approximate a range of Gestalt grouping principles. At this point it was decided
that the segmentation/grouping algorithm was to be run directly upon image content,
without any prior filtering or colour normalization. The decision not to use colour nor-
malization on initial image content was a pragmatic one based around the requirement
of the algorithm to operate across a wide range of image types. A reliance upon colour
normalization would preclude the use of images without colour content such as greyscale
and black and white images. A further basis for this decision was the loss of potentially valuable direct match recognition information once an image has been colour normalized, which would have prevented us from testing its importance to recognition. A major
justification for implementing colour normalization would be to minimize the differences
between target and query images caused by shading or lighting changes. It was antic-
ipated that this would not be necessary in this case as the algorithm is based around
segment/group descriptions that include shape and non-colour related information that
should still facilitate good matching regardless of these effects.
Of more use to this thesis than colour normalization would have been the ability to
pre-filter images taking human perceptual effects such as simultaneous contrast into ac-
count. Illusions such as those shown in figures 2.3, 2.4 and 2.5 demonstrate that humans
do not necessarily perceive colour or intensity information in the same way as the actual
values contained within images. In such cases, while there may be a partial match between
colour values in a search image and a target image, a human observer may disagree due
to the change in perceived colour caused by the context and geometry within the image.
Although limited work towards pre-filtering images in such a way that actual colour val-
ues approximate perceived colour values was performed, it was decided that this form of
correction lay beyond the remit of this work. The result is that this algorithm is just as
susceptible to such discrepancies as conventional algorithms.
A novel use of a KD-tree architecture to facilitate n-dimensional Gestalt feature prox-
imity grouping decisions was developed alongside a binary tree storage method that allows
multi-dimensional grouping behaviour while retaining a full grouping relationship history.
The selection, ranking and weighting of group primitives based upon descriptive suitability
was then addressed, with surviving groups providing the basic descriptions for recognition.
Appropriate description types, of differing levels of invariance and type, capturing a wide range of image content were then proposed. Finally, a practical storage and recognition architecture was outlined and tested against a variety of different image types. The final result is a plausible architecture that can form the basis of a practical recognition algorithm and a useful platform from which to test the contribution of different image description types to human similarity judgement.
This work has successfully demonstrated that the proposed architecture provides ef-
fective, and efficient, recognition across a large range of image types and can successfully
recognize image content using both colour and geometric descriptions. The combination
of variant and semi-invariant description types does indeed facilitate recognition tolerant to image transformations whilst still allowing the distinction between the degree of such transformations. Results and experimentation are more limited than originally intended due to the sheer scale and complexity of the problem addressed, combined with the time constraints of the PhD itself. This means that certain areas of the algorithm may not be
optimal implementations, and there is likely to be scope for improvement in both efficiency
and efficacy. One area that would benefit from further work is the selection and weighting
of label descriptions, which suffered under time constraints. In particular, parallel work
on the possible advantages of using signature storage of invariants for recognition had
to be put aside in favour of completing the thesis work using label based descriptions.
While general results do appear to reflect human-like similarity judgements, initial tests
to evaluate this have not proven conclusive and also require further examination.
9.1 Future Work
This architecture provides a good starting point for many potential areas of future work. One aspect of human vision not directly implemented in this work is the pre-normalization of raw image data for levels of colour constancy and the simulation of simultaneous contrast effects. Simultaneous contrast, in particular, seems to be a major component of the human visual system, providing the basis for much of the low level grouping and perceptual organization decisions. Development of methods to simulate this as a pre-processing stage, or as an integral part of the Gestalt grouping engine, would represent a useful area of further work and should improve recognition. Similarly, a hue, saturation and luminance representation of initial image content may well have benefits.
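As a sketch of the kind of representation change suggested, the following converts an RGB pixel to hue, saturation and luminance using the standard RGB-to-HSL formulae (textbook code, not part of the implemented architecture):

#include <algorithm>
#include <cmath>

struct HSL { double h, s, l; }; // h in degrees [0,360), s and l in [0,1]

// Standard RGB (0-255) to HSL conversion.
HSL rgbToHsl(unsigned char r8, unsigned char g8, unsigned char b8)
{
    double r = r8 / 255.0, g = g8 / 255.0, b = b8 / 255.0;
    double maxc = std::max({r, g, b}), minc = std::min({r, g, b});
    double l = (maxc + minc) / 2.0, d = maxc - minc;
    if (d == 0.0) return {0.0, 0.0, l}; // achromatic (grey)
    double s = d / (1.0 - std::fabs(2.0 * l - 1.0));
    double h;
    if (maxc == r)      h = std::fmod((g - b) / d, 6.0);
    else if (maxc == g) h = (b - r) / d + 2.0;
    else                h = (r - g) / d + 4.0;
    h *= 60.0;
    if (h < 0.0) h += 360.0;
    return {h, s, l};
}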
Another area worth investigating is the label segment correspondence decisions that take place in this work before similarity evaluation. While the current architecture works well, it would be of interest to test this stage using a unique one-to-one correspondence criterion between segments in place of the current many-to-one approach.
The global weighting of descriptions to tailor recognition performance for different image types was investigated in this work, but would definitely be a worthwhile area of further study. It has been demonstrated that adjusting the global weightings applied to descriptors can significantly improve recognition results. A relaxation based approach to selecting weightings is likely to generate optimal recognition results tailored to a given library set and query type. The investigation of descriptor reliability over different image content types in general could allow us to generate stored descriptor weighting profiles that could be applied for optimal results tailored to image content. A further extension to this approach would be to provide an interactive user-interface facilitating query adjustment (section 8.2), allowing the user to refine their search to their requirements.
Finally, a feature considered but not implemented in this work due to time constraints was the incorporation of continuity constraints and organizational priming into the Gestalt grouping algorithm. Organizational priming, in this case, refers to a bias towards particular grouping arrangements based upon previous groupings. For example, where an image is perceived to contain many vertical linear groupings there will be a tendency to perceive other image content as vertical linear structures. This is essentially a global application of the continuation principle, where a linear feature being developed favours groupings that will form a continuous curve with the current feature (even if the continuous grouping is not the most proximal one). An interesting approach to
dealing with this would be through the use of a further n-dimensional space to store edge
values as they are generated in the grouping/segmentation stage. Rather than progressing
through an ordered list of edges (whose values are generated from an n-dimensional volume
nearest neighbour search) this list could be replaced by another n-dimensional volume of
edge values. With such an architecture (that would still be updateable) the next edge to
be processed would be the point in the space closest to the origin (which would then be
removed from the space). With edge selection based around proximity to the origin, this
would generate similar results to the ordered edge-list of the original grouping algorithm.
If, however, we use a moveable origin that moves in the direction of the last edge processed
then the next edge search becomes biased towards favouring edges of that particular direction in the n-dimensional space. Figure 9.1 shows a very simplified illustration
showing how this process should work. It is anticipated that this approach would facilitate the continuity principle and produce better groupings based upon image geometry. The use of such a second order edge space, using the same fundamental n-dimensional architecture as the nearest neighbour search space, would also represent a more elegant and complete architecture. Exact implementation issues, and whether or not such an edge selection process would have unwanted side-effects, have not been determined, but this approach would represent an interesting avenue of further research.
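A minimal sketch of the moveable-origin idea (illustrative C++; for brevity it assumes edges are stored as points in the edge feature space and uses a linear scan in place of the n-dimensional volume structure):

#include <cstddef>
#include <vector>

// An edge represented as a point in an n-dimensional edge space.
struct EdgePoint { std::vector<double> coords; };

// Squared Euclidean distance between two points.
static double sqDist(const std::vector<double>& a, const std::vector<double>& b)
{
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Repeatedly select the edge nearest to a moveable origin, then drift
// the origin towards the selected edge. 'bias' in [0,1] controls how
// strongly subsequent selections favour the last processed direction.
std::vector<EdgePoint> orderEdges(std::vector<EdgePoint> edges,
                                  std::size_t dims, double bias)
{
    std::vector<double> origin(dims, 0.0);
    std::vector<EdgePoint> ordered;
    while (!edges.empty()) {
        std::size_t best = 0;
        for (std::size_t i = 1; i < edges.size(); ++i)
            if (sqDist(edges[i].coords, origin) < sqDist(edges[best].coords, origin))
                best = i;
        // Move the origin towards the processed edge, biasing the next search.
        for (std::size_t d = 0; d < dims; ++d)
            origin[d] = (1.0 - bias) * origin[d] + bias * edges[best].coords[d];
        ordered.push_back(edges[best]);
        edges.erase(edges.begin() + best);
    }
    return ordered;
}

With bias set to zero, selection is by proximity to a fixed origin, reproducing the behaviour of the ordered edge list; increasing bias favours edges lying in the direction of the last edge processed.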
Figure 9.1: Directing edge selection using a 2nd order edge search space and moveable origin (turns 1–6).
Chapter 10
Self-Similar Convolution Image
Distribution Histograms
The following provides a brief overview of work undertaken during the writing of this
thesis (although not directly used for the work) and presented at the British Machine
Vision Conference 2001. Trademark recognition can be considered a specialized, simplistic
form of sketch recognition. Reading in this area [Alwis00] led to the development of
a novel invariant storage method for trademark descriptions. “Self Similar Convolution
Image Histograms” achieve limited invariant storage using scaled copies of the original
image as a filter to generate a spatio-intensity histogram which would form the basis of a
similarity invariant signature.
The basis of this technique is to generate an identifying signature from a binary trade-
mark image by using a scaled down version of itself as a convolution mask to generate a
scalar convolution image. As the convolution filter will always be aligned to the original
image, the normalized distribution histogram of the resulting grey scale image is an affine
invariant description based purely upon the image structure, which can then be used to search a database of binary images. [Tuke01] explains this process in detail and presents practical test results. Although this technique represents an interesting and novel approach to achieving invariance, it is only effective in the tightly constrained environments that trademark imagery represents. This fact, combined with the need for a specialised storage architecture incompatible with most invariant signature techniques, suggests that this technique is not applicable to generalised invariant image search.
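A minimal sketch of the self-similar convolution signature (illustrative C++ using a naive convolution; a simplification of, not a transcription from, the implementation presented in [Tuke01]):

#include <array>
#include <cstddef>
#include <vector>

// A binary image: row-major, pixel values 0 or 1.
struct BinaryImage {
    int w = 0, h = 0;
    std::vector<int> px; // size w * h
};

// Convolve the image with a scaled-down copy of itself (the mask), then
// return the normalized distribution histogram of the responses. Because
// the mask is derived from the image itself, it remains aligned with the
// image under the transformations applied to it.
std::array<double, 64> selfSimilarHistogram(const BinaryImage& img,
                                            const BinaryImage& mask)
{
    std::vector<double> response;
    double maxResp = 0.0;
    for (int y = 0; y + mask.h <= img.h; ++y)
        for (int x = 0; x + mask.w <= img.w; ++x) {
            double sum = 0.0;
            for (int my = 0; my < mask.h; ++my)
                for (int mx = 0; mx < mask.w; ++mx)
                    sum += img.px[(y + my) * img.w + (x + mx)]
                         * mask.px[my * mask.w + mx];
            response.push_back(sum);
            if (sum > maxResp) maxResp = sum;
        }
    // The normalized histogram of responses forms the stored signature.
    std::array<double, 64> hist{};
    if (maxResp == 0.0 || response.empty()) return hist;
    for (double r : response)
        hist[static_cast<int>((r / maxResp) * 63.0)] += 1.0;
    for (double& b : hist) b /= response.size();
    return hist;
}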
Figure 10.1: Showing the invariance of self-similar convolution image histograms to image
transformations (but NOT perspective)
Figure 10.2: Screen-shot of a typical query and database response using the self similar
geometric histogram as an identifier.
Chapter 11
The Linear Gestalt Grouping
Algorithm and Data Types
Algorithm 11.1 KD Data Types
BEST KD TYPE
(Linked list structure used to return results from nearest neighbour queries of the KD tree)
void *Group List (Pointer to the group structure being returned by this node)
float *Coords (The KD tree coordinates of the region being returned)
float Dist (Squared distance between this result and the query coordinates)
BEST KD TYPE *Next (Pointer to next item in linked list)
KD TYPE
(Dual linked list/tree structure used to store KD tree branches and leaves)
int Dimension (Dimension index on which the feature space is to be subdivided)
float Radius (Encompassing radius of the feature-space volume referenced by this branch)
void *Group (Pointer to a group structure on a leaf of the KD tree, otherwise NULL)
float *Coords (The centre of the bounding box referenced by this branch or
the KD tree coordinates of a leaf)
KD TYPE *Higher (Pointer to the next branch or next leaf)
KD TYPE *Lower (Pointer to the next branch or NULL when a leaf node)
KD DYNAMIC TREE CLASS
(A class to keep track of the KD tree structure, including functions)
private:
KD TYPE *Tree (The root of the iterative KD tree structure)
float Min[],Max[] (Store dimensions of the current section of feature space being worked on)
float Dist
int BestSize (Total number of groups held in the nearest neighbour table lists)
public:
bool Initialized (Flag to indicate status of KD tree)
int Dimensions (Dimensionality of space to store)
float Terminating Value (Minimum subdivision of space before branches become leaves)
float Furthest Dist (Current largest distance from origin)
BEST KD TYPE **Best (Data structure used to return nearest neighbour results)
int No To Find (Number of nearest neighbours to be found by a query)
int No Node
Algorithm 11.2 Linear Gestalt Grouping Algorithm Data Types
REGION TYPE
NODE TYPE *Node List (pointer to list of nodes that make up this region)
Description Variables (Variables that describe the region)
NODE TYPE
EDGE TYPE *Edge1=NULL (pointer to edge terminating in this node)
EDGE TYPE *Edge2=NULL (pointer to edge terminating in this node)
REGION TYPE *Region (pointer to the Region this node belongs to)
NODE TYPE *Next (pointer to next node in the list)
EDGE TYPE
NODE TYPE Node1 (terminating node of edge)
NODE TYPE Node2 (other terminating node of edge)
double Original Score (the perceptual significance score without continuity)
double Score (the working perceptual significance score, including continuity)
unsigned char Status=EDGE ENABLED (used to store information during processing)
EDGE TYPE *Next (pointer to next edge in the list)
Algorithm 11.3 Overview of the Linear Gestalt Grouping Algorithm
For each entry k in edge list e
{if (Edge[k].Status!=EDGE DISABLED)
if (JunctionTest(Edge[k]))
if (ContinuityTest(Edge[k]))
Reposition(Edge[k])
if (Edge[k].Status==EDGE REPOSITIONED)
{AddEdgeToRegion(Edge[k])
}}
Algorithm 11.4 Overview of the JunctionTest function
bool JunctionTest(Edge)
(Only linear structures allowed here; clustering with 3-way nodes returns false)
{LeftNode = Edge.Node1
RightNode = Edge.Node2
if (LeftNode.Edge2 or RightNode.Edge2)
{(A node already has 2 edges attached to it, so prevent further connections)
Edge.Status =EDGE DISABLED
return false
}else
{(Each node has either no edge or only a single edge running from it, valid)
return true
}}
Algorithm 11.5 Overview of the ContinuityTest function
bool ContinuityTest(Edge)
(Evaluate the continuity implications of adding this edge)
(Optionally disallow edges that have very poor continuity)
{OScore=Get Continuity Score(Edge) (returns a value in the range 0–1)
(Disallow edges with continuity scores lower than MC)
if (OScore <MC)
{Edge.Status =EDGE DISABLED
return false
}
(Add continuity values to edge score)
OScore = Edge.Original Score + OScore − CB
if (OScore < 0) OScore = 0
else if (OScore > 1) OScore = 1
Edge.Score = OScore
return true
}
Algorithm 11.6 Overview of the Get Continuity Score function
float Get Continuity Score(Edge)
(Return continuity implications of adding this edge)
{LeftNode = Edge.Node1
RightNode = Edge.Node2
A1 =Evaluate minimum angle (degrees) between edges connecting LeftNode
(180 if node isn’t a junction between edges)
A2 =Evaluate minimum angle (degrees) between edges connecting RightNode
(180 if node isn’t a junction between edges)
(Select the lowest angle (worst score))
if (A2 < A1)
A1 = A2
(Normalize angle to range between 0 and 1)
A1 = A1/180
return A1
}
Algorithm 11.7 Overview of the Reposition function
Reposition (Edge)
{Reposition Edge in the edge list based on its score
(between the current k position and the end of the list)
Edge.Status = EDGE REPOSITIONED
}
Algorithm 11.8 Overview of the AddEdgeToRegion function
AddEdgeToRegion (Edge)
{(Add all nodes in the region pointed to by Edge.Node2
to the node list of Edge.Node1.Region)
R1 = Edge.Node1.Region
R2 = Edge.Node2.Region
for each Node in R2.Node List
{Node.Region = R1
Remove Node from R2.Node List, place on R1.Node List
}Remove R2 from Regions
}
Chapter 12
The n-Dimensional KD Tree
Structure and Algorithms
Algorithm 12.1 Return the current largest axis, which will be split by the tree branching
process
Find Largest Dimension ()
{(Search for the largest axis defining the currently
referenced volume of space)
Largest Value=0
Largest Dimension=0
for all f in Dimensions
(Max and Min define the currently referenced volume of space)
if (Max[f]-Min[f]≥Largest Value)
{Largest Value=Max[f]-Min[f]
Largest Dimension=f
}return Largest Dimension
}
Algorithm 12.2 Create a new KD tree branch
KD TYPE *Create New Limb()
{(Create a new limb)
Allocate memory for Limb
Limb.Dimension=Find Largest Dimension()
if (Max[Limb.Dimension]-Min[Limb.Dimension]≤Terminating Value)
{(Is a leaf node)
Limb.Dimension= -1 (Dimension of -1 used to indicate leaf nodes)
}else
(Is a branch node)
for all f in Dimensions
{(Set Limb.Coords to the centre of the current branch bounding box)
Limb.Coords[f] = Min[f]+(Max[f]-Min[f])/2
(Accumulate the squared half-extents of the bounding box)
Limb.Radius+=((Max[f]-Min[f])/2)2
}(Square root of Limb.Radius is now the minimum encompassing radius
from Limb.Coords to the bounding box)
return Limb
}
Algorithm 12.3 Adding a point to the KD tree
Generate Point (Current Branch, Group, Coords)
{(Iteratively travel down tree branches, creating new branches where
required, add a leaf node pointing to a group.
The entire tree, leaves and branches, is made up
of iteratively generated KD TYPE nodes)
if (!Current Branch)
{(Limb currently set to NULL, need to generate a new one)
Current Branch=Create New Limb()
Fresh=true
}else Fresh=false
if (Current Branch.Dimension<0)
{(Limb is a leaf node, nodes behave in a linked list manner from now on)
(We have found where we will place our point group structure)
if (!Fresh)
{(More than one leaf on this branch, so need to generate
memory and insert into linked list of leaves)
Tmp=new KD TYPE
Tmp→Next=Current Branch
Current Branch=Tmp
}
(Copy across group references and coordinate values)
Current Branch→Group=Group
Dist=0
for all f in Dimensions
{Current Branch.Coords[f]=Coords[f]
Dist+=Coords[f]
}if (Dist>Furthest Dist) Furthest Dist=Dist
}else
{(Limb is a tree branch, let's see which limb to continue down)
if (Coords[Current Branch.Dimension]>
Current Branch.Coords[Current Branch.Dimension])
{(Entry being added belongs down the higher branch)
(Store current dimensions so we can restore them after iteration)
tmp=Min[Current Branch.Dimension]
(Reduce working volume dimensions ready for next iteration)
Min[Current Branch.Dimension]=
Current Branch.Coords[Current Branch.Dimension]
(Iterate down the higher branch)
Current Branch→Higher=
Generate Point(Current Branch→Higher, Group, Coords)
(Restore dimensions back to original values)
Min[Current Branch.Dimension]=tmp
}else
{(Entry being added belongs down the lower branch)
(Store current dimensions so we can restore them after iteration)
tmp=Max[Current Branch.Dimension]
(Reduce working volume dimensions ready for next iteration)
Max[Current Branch.Dimension]=
Current Branch.Coords[Current Branch.Dimension]
(Iterate down the lower branch)
Current Branch→Lower=
Generate Point(Current Branch→Lower, Group, Coords)
(Restore dimensions back to original values)
Max[Current Branch.Dimension]=tmp
}}
return Current Branch
}
Algorithm 12.4 Return the squared Euclidean distance between two points in feature
space
float Get Distance Between(Coords1, Coords2)
{(Find distance between Coords1 and Coords2)
dist=0
for all f in Dimensions
{dist=dist+(Coords1[f ]-Coords2[f ])2
}return dist
}
Algorithm 12.5 Copy coordinate values from one group to another
Copy Coords(E1, E2)
{(Copy coordinate values from E2 to E1)
for all f in Dimensions
E1.Coords[f ]=E2.Coords[f ]
}
Algorithm 12.6 Check to see if a leaf group belongs in the nearest neighbour table
Check Against Table(Branch, Group, Coords)
{(Ensure we don't try to place the target group; we don't
want the query group appearing in the neighbour table)
if (Group==Branch→Group) return
dist=Get Distance Between(Branch.Coords, Coords)
(See if the table is already full)
if (Best[No To Find-1])
{(Is full, we only need to query the final entry to see
if this entry belongs in the table)
if (dist>Best[No To Find-1].Dist)
{(Entry is further than furthest table entry,
leave neighbour table intact)
return
}}
(We know the entry belongs in the table,
so scan to find the entries position)
for all i in No To Find-1
{(If a blank entry in table encountered,
this is our neighbour table position)
if (!Best[i]) break
(If distances are greater or equal, insert at this point)
if (Best[i].Dist≥dist) break
}
(If the table has a blank entry then just create a new entry, otherwise
we may have to rearrange the existing table entries)
if (Best[i])
{if (Best[i].Dist==dist)
{(Best entry exists, and is at the same distance)
(No table shuffling needed, just add to the linked list pointed
at by the neighbour table)
if (Check List For Duplication(Best[i]))
{(The group is already present in the neighbour table, so exit)
return
}}
else
{(Best entry exists and is at a different distance to the new entry)
(A table shuffle is required to make space for the new entry)
(Destroy the final table entry, along with its linked list)
Destroy Entry(Best[No To Find-1])
(BestSize keeps track of the total number of groups held in
the nearest neighbour table lists)
BestSize--
(Move table entries along, making room for a new table entry)
for (f=No To Find-2;f≥i;f--) Best[f + 1]=Best[f]
Best[i]=NULL
}}
(Best[i] now points to an appropriate neighbour table linked list)
(Create a new BEST KD TYPE entry)
BPnt=new BEST KD TYPE
BPnt.Coords=new float[Dimensions]
BestSize++
Copy Coords(BPnt, Branch)
BPnt.Dist=dist
BPnt→Group=Branch→Group
BPnt→Next=Best[i]
Best[i]=BPnt
}
Algorithm 12.7 Determine if a KD tree branch is worth exploring
bool Is Branch Possibly Closer(Coords, Branch, OldDist)
{(See if this branch could possibly contain a leaf closer than
current entries on the neighbour table)
(Detect the distance between the encompassing sphere of the section of
feature space referenced by this branch and the target coordinates)
if (Get Distance Between(Branch.Coords, Coords)-Branch.Radius>OldDist)
return false
else
return true
}
Algorithm 12.8 Iteratively search the KD tree for the n-nearest neighbours to a point
in feature space
Iterate Find Points (Branch, Group, Coords, FirstOnly)
{(Search the KD tree for the n-nearest neighbours to a point in feature space)
(First check to ensure that the branch points to an active part of the feature space)
if (!Branch) return
(Are we dealing with a branch or a leaf?)
if (Branch.Dimension<0)
{(Is a leaf, search like a linked list)
(Check each leaf to see if it belongs in the nearest neighbour table)
Check Against Table(Branch, Group, Coords)
(Process any more leaves, unless rough mode is selected)
if (!FirstOnly)
Iterate Find Points(Branch→Higher, Group, Coords, FirstOnly)
}else
{(Is a branch)
(Check to see if it's worth exploring this block of feature space)
if (!Is Branch Possibly Closer(Coords, Branch, Best[No To Find-1]→Dist))
return
(This block of space could contain a nearest neighbour,
so continue iteratively probing the tree structure)
Iterate Find Points(Branch→Higher, Group, Coords, FirstOnly)
Iterate Find Points(Branch→Lower, Group, Coords, FirstOnly)
}return
}
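For readers who prefer compilable code, the following is a compact C++ sketch of the same idea: a KD tree queried with a pruning test analogous to Is Branch Possibly Closer. It is a simplified single-nearest-neighbour illustration using point-based nodes, not a transcription of the thesis structures:

#include <cstddef>
#include <memory>
#include <vector>

// A point-based KD tree node: each node stores one point and splits the
// space on dimension (depth % k). Simplified relative to the bounding-box
// branches used by the thesis structures.
struct Node {
    std::vector<double> pt;
    std::unique_ptr<Node> lo, hi;
};

static double sqDist(const std::vector<double>& a, const std::vector<double>& b)
{
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

void insert(std::unique_ptr<Node>& n, std::vector<double> p, std::size_t depth = 0)
{
    if (!n) { n = std::make_unique<Node>(); n->pt = std::move(p); return; }
    std::size_t dim = depth % n->pt.size();
    std::unique_ptr<Node>& child = (p[dim] < n->pt[dim]) ? n->lo : n->hi;
    insert(child, std::move(p), depth + 1);
}

// Recursive nearest-neighbour search. The far subtree is explored only if
// the splitting plane is closer to the query than the best distance found
// so far (the pruning test analogous to Is Branch Possibly Closer).
void nearest(const Node* n, const std::vector<double>& q, std::size_t depth,
             const Node*& best, double& bestSq)
{
    if (!n) return;
    double d = sqDist(n->pt, q);
    if (d < bestSq) { bestSq = d; best = n; }
    std::size_t dim = depth % q.size();
    double delta = q[dim] - n->pt[dim];
    const Node* nearSide = delta < 0 ? n->lo.get() : n->hi.get();
    const Node* farSide  = delta < 0 ? n->hi.get() : n->lo.get();
    nearest(nearSide, q, depth + 1, best, bestSq);
    if (delta * delta < bestSq) // prune: the far side cannot be closer
        nearest(farSide, q, depth + 1, best, bestSq);
}

A query initializes best to null and bestSq to infinity before calling nearest on the root; extending this to the n-nearest case replaces the single best entry with a table such as BEST KD TYPE.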
Chapter 13
Combined Gestalt Grouping
Algorithms and Results
Algorithm 13.3 shows the new group data structure. Rather than explicitly storing a
list of nodes that refer to member pixels, the new structure stores the original edge that
formed the group. If the group represents a single pixel then it can easily be recognised
by checking for a null value in the founding edge. Where a group is larger than a single
pixel, the founding edge points to the two original groups (or sub-groups) whose merging
resulted in this group. This results in a binary tree structure.
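To illustrate how the member pixels of a group can be recovered from this founding-edge binary tree without any explicit node list, a minimal C++ sketch (hypothetical names mirroring the structures below) can walk the tree recursively:

#include <utility>
#include <vector>

struct Group; // forward declaration

// An edge records the two (sub)groups whose merging formed a new group.
struct Edge { Group* group1; Group* group2; };

struct Group {
    Edge* foundingEdge = nullptr; // null => this group is a single pixel
    int pixelX = 0, pixelY = 0;   // only meaningful for single-pixel groups
};

// Recursively collect the member pixels of a group by descending its
// founding edges until single-pixel groups (null founding edge) are reached.
void collectPixels(const Group* g, std::vector<std::pair<int, int>>& out)
{
    if (!g) return;
    if (!g->foundingEdge) {       // leaf: a single pixel
        out.emplace_back(g->pixelX, g->pixelY);
        return;
    }
    collectPixels(g->foundingEdge->group1, out);
    collectPixels(g->foundingEdge->group2, out);
}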
Algorithm 13.1 Original Group Structure
GROUP TYPE
float Coords[NO COORDINATES] (group N dimensional coordinate)
NODE TYPE *Node List (pointer to list of nodes that make up this region)
Description Variables (Variables that describe the region)
GROUP TYPE *Next (pointer to next group in list)
GROUP TYPE *Prev (pointer to previous group in list)
Algorithm 13.2 Original group merge procedure
float Merge Groups (*Group1, *Group2, *Edge)
{(Loop through Group2’s Node List, making each node point to Group1)
for all n in Group2.Node List
{n.Group=Group1
}(Move Node List from Group2 to the end of Group1’s Node List)
Group1.Node List+=Group2.Node List
(Update Group1 description to the new combined group description)
Update Description(Group1, Group2)
(Remove and destroy Group2; Group1 is now the combined group)
Destroy(Group2)
(Remove the founding edge from the edge list and destroy it)
Destroy(Edge)
}
Algorithm 13.3 New tree based Group Structure
GROUP TYPE
float Coords[NO COORDINATES] (Group N dimensional coordinate)
EDGE TYPE *Founding Edge (Pointer to the edge that created this group)
Description Variables (Variables that describe the region)
GROUP TYPE *Descendent (Immediate child, if this is a subgroup)
GROUP TYPE *YoungestDescendent (Most developed child group)
GROUP TYPE *Next (pointer to next group in list)
GROUP TYPE *Prev (pointer to previous group in list)
Algorithm 13.4 Set YoungestDescendent procedure
void Set YoungestDescendent (GROUP TYPE *Grp, GROUP TYPE *Youngest)
{(Iteratively crawl both up and down the tree structure resetting
YoungestDescendent pointers to the new child group)
Grp.YoungestDescendent=Youngest
(Expand down the tree if the current group has a founding edge)
if (Grp.Founding Edge)
{(If parent groups haven't already been reset then reset them)
if (Grp.Founding Edge.Group1.YoungestDescendent!=Youngest)
Set YoungestDescendent(Grp.Founding Edge.Group1, Youngest)
if (Grp.Founding Edge.Group2.YoungestDescendent!=Youngest)
Set YoungestDescendent(Grp.Founding Edge.Group2, Youngest)
}(We have to consider crawling back up the tree too because
edge structures may point to less developed subgroups lower
down in the tree structure)
if (Grp.Descendent)
if (Grp.Descendent.YoungestDescendent!=Youngest)
Set YoungestDescendent(Grp.Descendent, Youngest)
}
Algorithm 13.5 New, tree-based, group merge procedure
float Merge Groups (*Group1, *Group2, *Edge)
{(Each group referred to by the founding edge could be an
old group that currently forms part of a larger child group,
so we need to actually combine the youngest descendants
of the groups)
RGroup1=Group1.YoungestDescendent
RGroup2=Group2.YoungestDescendent
(Create ChildGroup and add it to the group list)
ChildGroup=Create Group()
(The new child group is the immediate descendent of the two
merged groups, so set their descendent pointers to it)
RGroup1.Descendent=ChildGroup
RGroup2.Descendent=ChildGroup
(Update ChildGroup description to new combined group description)
Update Description(ChildGroup, RGroup1, RGroup2)
(Store the edge that formed this group)
ChildGroup.Founding Edge=Edge
(The new group is now the youngest descendent of that tree,
so we need to move through the tree and update each Group's YoungestDescendent pointer)
Set YoungestDescendent(ChildGroup, ChildGroup)
}
Chapter 14
Database Search Algorithms
Algorithm 14.1 Label Component Correspondence Algorithm
Generate Matches(query label, correspondence[])
{For each database image
{Load database label
For q=each group in query label
{min difference=1
For d=each group in database label
difference = (∑ difference between descriptor values × descriptor weights) /
(∑ descriptor weights × global user weighting for that description type)
if (difference < min difference)
{correspondence[q]=d
min difference=difference
}}
}}
}
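A compilable C++ sketch of this correspondence step (illustrative structures; the weighted-difference fraction follows the pseudocode above, and the many-to-one behaviour is visible in that several query groups may select the same database group):

#include <cmath>
#include <cstddef>
#include <vector>

// Per-group descriptor vectors for one label (illustrative simplification).
struct GroupDesc { std::vector<double> values, weights; };

// Weighted difference between two groups' descriptor vectors,
// following the fraction in Algorithm 14.1.
double groupDifference(const GroupDesc& q, const GroupDesc& d, double userWeight)
{
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < q.values.size(); ++i) {
        num += q.weights[i] * std::fabs(q.values[i] - d.values[i]);
        den += q.weights[i] * userWeight;
    }
    return den > 0.0 ? num / den : 1.0;
}

// For each query group, record the index of the best matching database
// group (a many-to-one correspondence: several query groups may map to
// the same database group).
std::vector<int> generateMatches(const std::vector<GroupDesc>& query,
                                 const std::vector<GroupDesc>& database,
                                 double userWeight)
{
    std::vector<int> correspondence(query.size(), 0);
    for (std::size_t q = 0; q < query.size(); ++q) {
        double minDiff = 1.0;
        for (std::size_t d = 0; d < database.size(); ++d) {
            double diff = groupDifference(query[q], database[d], userWeight);
            if (diff < minDiff) { minDiff = diff; correspondence[q] = static_cast<int>(d); }
        }
    }
    return correspondence;
}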
Algorithm 14.2 Similarity Evaluation
Compare Labels(query label, database label, correspondence[])
(Evaluates weighted differences between best matching groups from
the two labels, for each description type)
{overall similarity=0
For q=each description type in query label
{score[q]=0
sum weights=0
For d=each group in description
{w=query label[q][d].weight*global user weighting for that description type
v=query label[q][d].value
wd=database label[q][correspondence[d]].weight
vd=database label[q][correspondence[d]].value
diff=abs(v − vd)
if (w > wd) w=wd
diff=diff*w
sum weights=sum weights+w
score[q]=score[q]+diff
}if (sum weights > 0)
score[q]=score[q]/sum weights
else
score[q]=1
(Invert from difference to percentage similarity)
score[q]=100 − (score[q] ∗ 100)
overall similarity=overall similarity + score[q]
}overall similarity=overall similarity/number of description types
}
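A matching C++ sketch of the similarity evaluation (again with illustrative structures; the lower-weight rule and the inversion from difference to percentage similarity follow the pseudocode above):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One description type: per-group (weight, value) pairs, values in [0, 1].
struct Description { std::vector<double> weights, values; };

// Percentage similarity between query and database labels, given a
// per-description global user weighting and a precomputed group
// correspondence (database group index for each query group).
double compareLabels(const std::vector<Description>& query,
                     const std::vector<Description>& database,
                     const std::vector<double>& userWeights,
                     const std::vector<int>& correspondence)
{
    double overall = 0.0;
    for (std::size_t q = 0; q < query.size(); ++q) {
        double score = 0.0, sumW = 0.0;
        for (std::size_t d = 0; d < query[q].values.size(); ++d) {
            int c = correspondence[d];
            double w = query[q].weights[d] * userWeights[q];
            w = std::min(w, database[q].weights[c]);   // take the lower weight
            score += w * std::fabs(query[q].values[d] - database[q].values[c]);
            sumW  += w;
        }
        score = sumW > 0.0 ? score / sumW : 1.0;
        overall += 100.0 - score * 100.0; // invert difference to % similarity
    }
    return query.empty() ? 0.0 : overall / query.size();
}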
Bibliography
[Adelson93] Edward H. Adelson, Perceptual Organization and the Judgement of Brightness, Science, Volume 262, pp. 2042-2044, 1993
[Adelson00] Edward H. Adelson, Lightness Perception and Lightness Illusions, The New Cognitive Neurosciences (2nd Edition), MIT Press, Chapter 24, pp. 339-351, 2000. http://persci.mit.edu/people/adelson/publications/gazzan.dir/gazzan.htm/#intro
[Albert90] J. Albert, F. Ferri, J. Domingo and M. Vicens, An approach to natural scene segmentation by means of genetic algorithms with fuzzy data, in Pérez de la Blanca N., SanFeliu A. and Vidal E. (eds), Fourth National Symposium in Pattern Recognition and Image Analysis (Selected Papers), 1990, pp. 97-112
[Alwis00] T. P. G. L. S. Alwis, Content-Based Retrieval of Trademark Images, DPhil
Thesis Dept. of Computer Science, University of York. Feb 2000
[Austin96] J. Austin, High Speed Image Segmentation using a Binary Neural Network,
Aug. 1996
[Bach96] J. Bach, C. Fuller, A. Gupta, A. Hampapur, B. Horowitz, R. Humphrey,
R. Jain and C. Shu, Virage image search engine: An open framework for
image management, Proc. SPIE Storage and Retrieval for Still Image and
Video Databases IV, San Jose, California, pp. 76-87
[Barnard96] K. Barnard, G. Finlayson and B. Funt, Colour Constancy for Scenes
with Varying Illumination, 4th European Conference on Computer Vision,
Springer, 1996
[Barnard00] K. Barnard and G. Finlayson, Shadow Identification using Colour Ratios
Proceedings of the IS&T/SID Eighth Color Imaging Conference: Color Sci-
ence, Systems and Applications pp. 97-101, 2000.
[Bergen88] James R. Bergen and Edward H. Adelson, Early vision and texture percep-
tion Nature, Vol.333 pp.363-364, 1988
[Besag86] J. Besag, On the statistical analysis of dirty pictures, Journal of the Royal
Statistical Society, series B, 48, pp. 259-302, 1986.
[Biederman87] I. Biederman, Recognition by components: A theory of human image un-
derstanding, Psychological Review, 94, pp. 115-147, 1987.
[Boole1872] G. Boole, A Treatise on the Calculus of Finite Differences. 1872.
[Brainard97] D.H. Brainard, W.T. Freeman, Bayesian color constancy, Journal of the
Optical Society of America A, 14:1393-1411, 1997
[Brainard01] D.H. Brainard, Sensation and Perception: Color Vision Theory, The In-
ternational Encyclopedia of Social & Behavioral Sciences, Pergamon Press,
2001.
[Brill92] M.H. Brill, E.B. Barrett and P.M. Payton, Projective invariants for curves in
2 and 3 dimensions, Geometric Invariance in Computer Vision, pp. 193-214,
1992.
[Buchsbaum80] G. Buchsbaum, A Spatial Processor Model for Object Colour Perception,
Journal of the Franklin Institute, 310:1-26, 1980.
[Bulthoff95] H.H. Bulthoff, S.Y. Edelman and M.J. Tarr, How are three-dimensional
objects represented in the brain?, Cerebral Cortex, 5(3), pp. 247-260, 1995.
[Califano94] A. Califano and R. Mohan, Multidimensional Indexing for Recognizing Vi-
sual Shapes, IEEE Transactions on Pattern Analysis and Machine Intelli-
gence Vol 16, No 4, pp.373-391, April 1994.
[Canny86] J. Canny, A Computational Approach to Edge Detection, IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, Vol. 8, No. 6, Nov.
1986.
[Cayley] A. Cayley, Cross-referenced overview of his work and life history. Available via the internet from the University of St Andrews, St Andrews, Fife, Scotland. http://www-groups.dcs.st-andrews.ac.uk/history/Mathematicians/Cayley.html
[Chen00] K. Chen, D. Wang and X. Liu, Weight Adaptation and Oscillatory Corre-
lation for Image Segmentation, IEEE Trans. Neural Networks, vol. 11, no.
5, Sept 2000
[Cheng96] S.M. Cheng and K.T. Lo, Fast clustering process for vector quantisation
codebook design, Electronics Letters, vol. 32(4), pp. 311-312, February 1996
[Chialvo95] D. R. Chialvo and M. Millonas, How Swarms Build Cognitive Maps, The Bi-
ology and Technology of Intelligent Autonomous Agents, NATO ASI Series,
1995, pp. 439-450
[Christou03] C.G. Christou, B.S. Tjan and H. Bulthoff, Extrinsic cues aid shape recog-
nition from novel viewpoints, Journal Of Vision, pp. 183-197, 2003
[Cohen93] Cohen and Cohen, Finite element methods for active contour models and
balloons for 2D and 3D images, IEEE-Pattern Analysis and Machine Intel-
ligence, Nov. 1993.
[Conners80] R.W. Conners and C.A Harlow, A theoretical comparison of texture algo-
rithms, IEEE Transactions on Pattern Analysis and Machine Intelligence,
Vol.2, 1980, pp. 204-222.
[Cook96] R. Cook, I. McConnel and D. Stewart, Segmentation and Simulated An-
nealing, Microwave Sensing and Synthetic Aperture Radar, edited by G.
Franceschetti et al, Proc. SPIE 2958 (1996) pp30-35.
[Cutzu96] F. Cutzu and S. Edelman, Representation of object similarity in human
vision: psychophysics and a computational model, The Weizmann Institute
of Science, 1996
[Dudek97] G. Dudek and J.K. Tsotsos, Shape Representation and Recognition from
Multiscale Curvature, Computer Vision and Image Understanding, vol. 68,
no. 2, pp. 170-189(20), November 1997.
[Edelman95] S. Edelman, Representation of similarity in 3D object discriminations, Neu-
ral Computation, 7 pp. 407-422, 1995
[Felzenwalb98] P. F. Felzenszwalb and D. P. Huttenlocher, Efficiently Computing a Good
Segmentation, DARPA Image Understanding Workshop, 1998
[Finlayson92] Graham D. Finlayson, Mark S. Drew and Brian V. Funt, Diagonal Trans-
forms Suffice for Color Constancy IEEE Proceedings Fourth International
Conference on Computer Vision pp. 164-171, May 1993
[Finlayson95] Graham D. Finlayson Coefficient Color Constancy DPhil Thesis submission
Simon Fraser University
[Finlayson97a] Graham D. Finlayson, Mark S. Drew White-point preserving color correc-
tion 5th Color Imaging Conference: Color, Science, Systems and Applica-
tions, IS&T/SID, pp.258-261, Nov. 1997.
[Finlayson97b] G. Finlayson and S. Hordley, Selection for Gamut Mapping Colour Con-
stancy, British Machine Vision Conference, 1997
[Finlayson01] G. Finlayson and G. Schaefer, Hue that is invariant to brightness and
gamma, British Machine Vision Conference, Vol. 1, 2001, pp. 303-313
[Flickner95] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M.
Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele and P. Yanker, Query
by image and video content: The QBIC system, IEEE Computer 28(9),
pp.23-32, 1995.
[Forbes98] F. Forbes and A. Raftery, Bayesian Morphology: Fast Unsupervised
Bayesian Image Analysis,RR-3374, INRIA, March 1998.
[Forsyth90] D.A. Forsyth, A Novel Algorithm for Colour Constancy International Jour-
nal of Computer Vision, 5 pp. 5-36, 1990
[Forsyth90a] D. Forsyth, J. Munday, A. Zisserman and C. Brown, Projectively invariant
representations using implicit algebraic curves, Image Vision Computing 8,
pp. 130-136, 1990.
[Forsyth98] D.A. Forsyth Sampling, resampling and colour constancy Proceedings of the
Computer Vision and Pattern Recognition Conference 1998
[Gagaudakis03] G. Gagaudakis and P.L. Rosin, Shape measures for image retrieval, Pat-
tern Recognition Letters, vol. 24, no. 15, pp. 2711-2721, 2003.
[Gevers99] Theo Gevers and Arnold W.M. Smeulders Color Based Object Recognition
Pattern Recognition, 32(3) pp:453-464,1999.
[Gilchrist97] Gilchrist, 1997.
[Gonnet02] G. Gonnet, Scientific Computation (website),
http://linneus20.ethz.ch:8080/2 6 2.html, Institute for Scientific Com-
puting, ETH Zürich, Switzerland, 2002.
[Gonzalez92] R. Gonzalez and R. Woods, Digital Image Processing, Addison-Wesley Pub-
lishing Company, Chap. 4. 1992.
[Gool92] L. J. Van Gool, T. Moons, E. Pauwels and A. Oosterlinck, Semi-Differential
Invariants, Geometric Invariance in Computer Vision, pp. 157-192. Cam-
bridge, MA, MIT Press, 1992.
[Gordan] P.A. Gordan, Cross-referenced overview of his work and life history. Available via the internet from the University of St Andrews, St Andrews, Fife, Scotland. http://www-groups.dcs.st-andrews.ac.uk/history/Mathematicians/Gordan.html
[Gosling96] J. Gosling and A. Smith, Sun Microsystems, FastQSortAlgorithm
- A quick sort demonstration algorithm using a tri-median pivot,
http://www.cs.ubc.ca/spider/harrison/Java/FastQSortAlgorithm.java.html
[Gra] A version of Graham’s Algorithm, http://www.pms.informatik.uni-
muenchen.de/lehre/compgeometry/Gosper/convex hull/convex hull.html
[Graps95] A. Graps, An Introduction to Wavelets, IEEE Computational Science and
Engineering, Vol. 2, 1995
[Greenspan94] H. Greenspan, S. Belongie, R. Goodman and P. Perona, Rotation Invariant
Texture Recognition Using a Steerable Pyramid, ICPR 1994, vol 2, pp. 162-7
[Gros98] P. Gros, O. Bournez and E. Boyer, Using Local Planar Geometric Invariants
to Match and Model Images of Line Segments, Computer Vision and Image
Understanding, vol. 69, no. 2, pp. 135-155(21), February 1998.
[Harwood87] D. Harwood, M. Subbarao, H. Hakalahti and L.S. Davis, A New Class of
Edge-Preserving Smoothing Filters, Pattern Recognition Letters, Vol. 6,
1987, pp. 155-162
[Hering64] E. Hering, Outlines of a theory of the light senses, translated by Leo M.
Hurvich and Dorothea. Cambridge, MA:Harvard Univ. Press, 1964.
[Hodge02] V. Hodge and J. Austin. A High Performance k-NN Approach using Binary
Neural Networks, Neural Networks, Elsevier Science, 2002
[Hoffmann97] T. Hofmann, J. Puzicha and J. Buhmann, Deterministic Annealing for
Unsupervised Texture Segmentation, Proceedings EMMCVPR’97, Venice,
1997.
[Hoffmann98] T. Hofmann, J. Puzicha and J. Buhmann, Unsupervised Texture Segmen-
tation in a Deterministic Annealing Framework, IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, May, 1998.
[Hsu93] T. Hsu, A.D. Calway, R. Wilson, Texture Analysis using the Multiresolution
Fourier Transform, 8th Scandinavian Conference on Image Analysis, May,
1993.
[Iivarinen97] J. Iivarinen, M. Peura, J. Särelä and A. Visa, Comparison of Combined Shape Descriptors for Irregular Objects, 8th British Machine Vision Conference, 1997
[Kam99] L. Kam and J. Blanc-Talon, Multifractal Texture Characterization For Real
World Image Segmentation, ACIVS 1999
[Kasari96] M. Hauta-Kasari, J. Parkkinen, T. Jaaskelainen and R. Lenz, Generalized
Cooccurrence Matrix for Multispectral Texture Analysis, Proc. 13th Inter-
national Conference on Pattern Recognition 96, Aug. 1996
[Kass87] M. Kass, A. Witkin and D. Terzopoulos, Snakes: Active Contour Models,
International Journal of Computer Vision, Vol. 1, 1987, pp 321-331
[Katz35] D. Katz, The World of Colour, 1935 London: Kegan Paul.
[Keren89] D. Keren, R. Marcus and M. Werman, Segmenting and Compressing Wave-
forms by Minimum Length Encoding, Technical Report, Leibniz Center,
1989.
[Klinker88] G.J. Klinker, S.A. Shafer and T. Kanade, Color image analysis with an
intrinsic reflection model, Proceedings of the International Conference on
Computer Vision, 1988.
[Kliot98] M. Kliot and E. Rivlin, Invariant-Based Shape Retrieval in Pictorial
Databases, Computer Vision and Image Understanding, vol. 71, no. 2, pp.
182-197(16), 1998.
[Koubaroulis00a] D. Koubaroulis, J.Matas and J.Kittler, Illumination Invariant Object
Recognition Using The MNS Method, Proceedings of the 10th European
Signal Processing Conference, 2000
[Koubaroulis00b] D. Koubaroulis, J.Matas and J.Kittler Colour-based Image Retrieval
from Video Sequences In John P Eakins and Peter G B Enser, editors,
Proceedings of the Czech Pattern Recognition Workshop, pp1-12, 2000
[Kruizinga99] P. Kruizinga, N. Petkov and S.E. Grigorescu, Comparison of texture features
based of Gabor filters, Proceedings of the 10th International Conference on
Image Analysis and Processing, Sep. 1999, pp. 142-147
[Lamdan88] Y. Lamdan and H. J. Wolfson, Geometric hashing: a general and efficient
model-based recognition scheme, Proceedings of the 2nd International Con-
ference on Computer Vision, pp. 238-249, 1988.
[Land71] E. H. Land and J. J. McCann, Lightness and retinex theory, Journal of the Optical Society of America, 61(1), pp. 1-11, 1971
[Laws80] K. Laws, Textured Image Segmentation, Ph.D. Dissertation, University of
Southern California, January 1980.
[Levine85] M. D. Levine, Vision in Man and Machine, Publisher: McGraw-Hill, 1985
[Ma95] W.Y. Ma and B. S. Manjunath, Image indexing using a texture dictionary,
Proc. of SPIE conference on Image Storage and Archiving System, volume
2606. Oct. 1995
[Marr82] D. Marr, Vision, Freeman Press, 1982.
[Mokhtarian99] F. Mokhtarian and S. Abbasi, Shape-Based Indexing using Curvature
Scale Space with Affine Curvature, Proc. European Workshop on Content-
Based Multi-Media Indexing, pp. 255-262, 1999.
[Mundy92] J.L. Mundy and A. Zisserman (editors), Geometric Invariance In Computer
Vision, MIT Press, 1992
[Nagao95] Kenji Nagao and W. Eric L. Grimson, Recognizing 3D Objects Using Photometric Invariant, International Conference on Computer Vision, 1995, pp. 480-487
[Peipa] The Pilot European Image Processing Archive (PEIPA),
http://peipa.essex.ac.uk/
[Palmer96] Palmer, Neff and Besk, 1996.
[Palmer00] Palmer and Nelson, 2000.
[Peleg90] S. Peleg, D. Keren, R. Marcus and M. Werman, Segmentation by Minimum
Length Encoding, 10th International Conference of Pattern Recognition,
June 1990
[Pentland94] A. Pentland, R. Picard and S. Sclaroff, Photobook: Tools for content-based
manipulation of image databases, Proc. SPIE Storage and Retrieval for
Image and Video Databases II, San Jose, California, pp. 34-47.
[Peura97] M. Peura and J. Iivarinen, Efficiency of simple shape descriptors, Aspects
of Visual Form, World Scientific, pp. 443-451, 1997
[Quan98] L. Quan and F. Veillon, Joint Invariants of a Triplet of Coplanar Conics:
Stability and Discriminating Power for Object Recognition, Computer Vi-
sion and Image Understanding, vol. 70, no. 1, pp. 111-119(9), April 1998.
[Ra93] S.W. Ra and J.K. Kim, A fast mean-distance-ordered partial codebook search algorithm for image vector quantization, IEEE Transactions on Circuits and Systems II: Analogue and Digital Signal Processing, Vol. 40(9), pp. 576-579, September 1993
[Rahman] Z. Rahman, D. J. Jobson and G. A. Woodell, NASA Langley Research Centre. http://dragon.larc.nasa.gov/viplab/retinex/retinex.html
[Ramos00] V. Ramos, F. Almeida, Artificial Ant Colonies in Digital Image Habitats:
A Mass Behaviour Effect Study on Pattern Recognition, ANTS 2000 - 2nd
International Workshop on Ant Algorithms (From Ant Colonies to Artificial
Ants), Sep. 2000, pp. 113-116
[Reiss93] T.H. Reiss, Recognizing Planar Objects Using Invariant Image Features,
Publisher: Springer-Verlag, pp. 14, 1993.
[Roberts65] L. Roberts, Machine Perception of 3-D Solids, Optical and Electro-optical
Information Processing, MIT Press 1965.
[Rock92] Rock et al., Grouping can occur after perception of lightness constancy,
1992.
[Rock64] Rock and Brosgole, Perceptual grouping after stereoscopic depth perception,
1964.
[Rosin03] P.L. Rosin, Measuring shape: ellipticity, rectangularity, and triangularity,
Machine Vision and Applications, vol. 14, no. 3, pp. 172-184, 2003.
[Schulz03] M.F. Schulz and T. Sanocki, Time Course of Perceptual Grouping by Color,
Psychological Science, Vol. 14, Number 1, pp. 26-30, January 2003.
[Shi97] J. Shi and J. Malik, Normalized Cuts and Image Segmentation, IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 1997
[SimonFraser] Colour Constancy Algorithms, http://www.cs.sfu.ca/~colour/research/colour-constancy.html, Computational Vision Lab, Computing Science, Simon Fraser University, Burnaby, BC, Canada, V5A 1S6
[Simons98] D.J. Simons and R.F. Wang, Perceiving real-world viewpoint changes, Psy-
chological Science, 9, pp. 315-320, 1998.
[Singh80] S. Singh, M. Sharma and M. Markou, Evaluation of Texture Methods for Image Analysis, Pattern Recognition Letters, 1980
[Singh99] S. Singh, J. Haddon and M. Markou, Nearest Neighbour Strategies For Im-
age Understanding, Proc. Workshop on Advanced Concepts for Intelligent
Vision Systems, August, 1999
[Smith94] J.R. Smith and S.F. Chang, Quad-Tree Segmentation for Texture-Based
Query, Proc. 2nd Annual Multimedia Conference, San Francisco, Oct. 1994
[Smith97] J.R. Smith and S.F. Chang, Querying by color regions using the VisualSEEk
content-based visual query system, Intelligent Multimedia Information Re-
trieval, The MIT Press, Massachusetts Institute of Technology, Cambridge,
Massachusets and London, England, pp 23-41.
[Sproull91] R.F. Sproull, Refinements to nearest-neighbour searching in k-dimensional
trees, Algorithmica 6 (4), pp. 579-589, 1991.
[Squire00] D.M.G. Squire and T.M. Caelli, Invariance Signatures: Characterizing Con-
tours by Their Departures from Invariance, Computer Vision and Image
Understanding, vol. 77, no. 3, pp. 284-316(33), March 2000.
[Stegmann00] Mikkel B. Stegmann and Rune Fisker, On Properties of Active Shape Models, Technical Report, IMM-REP-2000-12. http://www.imm.dtu.dk/~aam/downloads/asmprops/
[Talon00] J. Blanc-Talon, Fractal Techniques in Image Analysis, Image Segmentation
and Image Compression, AISTA 2000
[Tarr90] M.J. Tarr, S. Pinker, When does human object recognition use a viewer-
centred reference frame? Psychological Science, 1, pp. 253-256, 1990.
[Taubin92] G. Taubin and D. Cooper, Object recognition based on moment (or alge-
braic) invariants, Geometric Invariance in Computer Vision, pp. 375-397.
Cambridge, MA, MIT Press, 1992.
[Thacker95] N.A. Thacker, P.A. Riocreux and R.B. Yates, Assessing the completeness
properties of pairwise geometric histograms, Image and Vision Computing,
Vol 13, No. 5, pp. 423 - 429, June 1995
[Thorisson94] K.R. Thorisson, Simulated Perceptual Grouping: An Application to
Human-Computer Interaction, Proceedings of the Sixteenth Annual Confer-
ence of the Cognitive Science Society. Atlanta, Georgia, 1994, pp. 876-881.
[Tuke01] C.E. Tuke, J. Austin and S.O’Keefe,Self-Similar Convolution Image Distri-
bution Histograms as Invariant Identifiers, British Machine Vision Confer-
ence, Sept 2001
[Vecera97] S.P. Vecera and M.J. Farah, Is visual image segmentation a bottom-up or
an interactive process?, Perception and Psychophysics, 59, pp. 1280-1296,
1997.
[Wang97] D. Wang and D. Terman, Image Segmentation based upon Oscillatory Cor-
relation, Neural Computation, Vol. 9, 1997, pp. 805-836
[Weiss88] I. Weiss, Projective invariants of shape, Proc. DARPA Image Understanding
Workshop, 1988
[Xilin99] Y. Xilin, I. Octavia, Line-Based Recognition Using a Multidimensional
Hausdorff Distance, IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol 21, No 9, pp.901-914, Sept 1999.
[Xu90] Lei Xu, Erkki Oja and Pekka Kultanen, Randomized Hough Transform, 10th International Conference on Pattern Recognition, 1990, pp. 631-635
[Yu01] S. X. Yu, J. Shi, Understanding Popout: Pre-attentive Segmentation
through Nondirectional Repulsion, CMU-RI-TR-01-20, Jul. 2001
[Zucker77] S.W. Zucker, R.A. Hummel and A. Rosenfeld, An application of Relaxation
Labelling to Line and Curve Enhancement, IEEE Transactions on Comput-
ers, Vol. 26, 1977, pp. 394-403