Content based Semi-Invariant
Search for
Natural, Symbolic and Sketch
Images
C.E. Tuke
Thesis submitted for the degree of
Doctor of Philosophy
University of York
Department of Computer Science
September 2004
Abstract
The overall aim of this thesis is to develop a working architecture that facilitates the
automatic recognition of single query image content from within a library of images,
producing similarity decisions comparable to human judgement on unconstrained image
content within a reasonable period (less than two minutes for an average query). The
secondary aim of this work is to investigate the implementation of an architecture that
can plausibly approximate human Gestalt grouping decisions in order to generate
description types useful for recognition. Images used in this work include linear,
greyscale and colour types depicting natural, cartoon, facial and symbolic content. The
architecture consists of a novel segmentation/grouping engine using a KD-Tree structure
to facilitate nearest neighbour decisions in an n-dimensional space and to generate a
multi-dimensional binary tree architecture of groups. These groups are then ranked by
salience, and a series of weighted description labels preserving group order and
description type is generated from the top n-groupings to form a library of descriptor
labels. For recognition we retrieve these labels from the pre-generated library, normalize
query and library labels against the library label set, and select the best correspondence
between the groups contained within the labels. The final stage is the evaluation of
general similarity from the corresponding label values, and the ranking and output of
results. Analysis shows that this architecture provides good results for natural colour
and greyscale images and reasonable results for symbolic and linear geometric image types.
Contents
Acknowledgements xiii
Declaration xv
1 Introduction 1
2 Recognition/Vision, Storage & Grouping Processes 5
2.1 Human Vision/Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Machine Vision/Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Gestalt Grouping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Colour Constancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.1 Assumptions of surface reflectance or illuminant properties in image 17
2.4.2 Assumptions of a finite image gamut . . . . . . . . . . . . . . . . . . 18
2.4.3 Measuring the illuminant indirectly from image properties . . . . . . 18
2.4.4 Limitations to Colour Constancy Techniques . . . . . . . . . . . . . 19
2.5 Invariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.1 Photometric Invariants . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.2 Geometric Invariants . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.3 Global Invariant Techniques . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.4 Local Invariant Techniques . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6.1 Measures used to guide segmentation . . . . . . . . . . . . . . . . . . 30
2.6.2 Region grouping processes . . . . . . . . . . . . . . . . . . . . . . . . 36
2.7 Signatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3 Preliminary Work and Experimentation 45
3.1 Signature Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Initial Experiments with Linear Gestalt segmentations . . . . . . . . . . . . 52
4 Gestalt Multi-Scale Feature Extraction 73
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 The Core Segmentation Algorithm . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 Seeding with Nearest Neighbours . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4 The Edge Evaluation Function . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.4.1 Generating a Multi-Scale Segmentation . . . . . . . . . . . . . . . . 87
4.5 A New Edge Evaluation Function . . . . . . . . . . . . . . . . . . . . . . . . 89
4.6 Updating Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.6.1 Appropriate Data Structures for K-Dimensional Nearest Neighbour
Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.6.2 Efficiently searching the KD Tree for nearest neighbours . . . . . . . 102
4.7 The Original Current State Group Description . . . . . . . . . . . . . . . . 116
4.8 The New Binary Tree Group Description . . . . . . . . . . . . . . . . . . . . 119
4.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5 Segment Ranking 127
5.1 Introduction and Motivation . . . . . . . . . . . . . . . . . . . . . . . . 127
5.1.1 Edge Based Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.1.2 Area Based Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.2 Combining Edge and Area Ranking . . . . . . . . . . . . . . . . . . . . . . . 131
5.3 Correspondence Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6 Generating Segment Group Descriptions 145
6.1 Overview Of Description Types . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.2 Photometric Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.3 Geometric Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.4 Pairwise Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.5 Preprocessing for Geometric Descriptions . . . . . . . . . . . . . . . . . . . 153
7 Searching the Label Database 155
7.1 Description Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
7.2 Normalizing Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.3 Weighting Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.4 Searching the Database Labels for Image Similarity . . . . . . . . . . . . . . 158
8 Evaluation of final algorithm performance 163
8.1 General effectiveness with different image types . . . . . . . . . . . . . . . . 163
8.2 Effects of global user weighting on recognition . . . . . . . . . . . . . . . . . 170
8.3 Tolerance to realistic transformations . . . . . . . . . . . . . . . . . . . . . . 172
8.4 Comparison with human decisions . . . . . . . . . . . . . . . . . . . . . . . 172
9 Conclusions 179
9.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
10 Self-Similar Convolution Image Distribution Histograms 185
11 The Linear Gestalt Grouping Algorithm and Data Types 189
12 The n-Dimensional KD Tree Structure and Algorithms 197
13 Combined Gestalt Grouping Algorithms and Results 207
14 Database Search Algorithms 213
List of Tables
2.1 Calculating two-dimensional correlation invariants . . . . . . . . . . . . . . 21
2.2 Calculating Moment invariants . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Common local invariant architectures . . . . . . . . . . . . . . . . . . . . . 22
2.4 Degrees of freedom (order) of primitive geometric image features . . . . . . 23
2.5 The four main levels of local geometric invariance . . . . . . . . . . . . . . . 24
4.1 Calculating two-dimensional correlation invariants . . . . . . . . . . . . . . 81
4.2 Minimum requirements for region description. . . . . . . . . . . . . . . . . . 96
6.1 Levels of geometric invariance . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.2 Types of colour description . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
List of Figures
1.1 Library Generation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Query Generation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Illusion caused by mid-level processing . . . . . . . . . . . . . . . . . . . . . 8
2.2 Illusion caused by mid-level processing . . . . . . . . . . . . . . . . . . . . . 9
2.3 Simultaneous Contrast I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Simultaneous Contrast II . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Simultaneous Contrast - Snake illusion . . . . . . . . . . . . . . . . . . . . . 11
2.6 Simultaneous Contrast - Colour Channels . . . . . . . . . . . . . . . . . . . 11
2.7 Gestalt Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.8 Angle between vectors, a similarity invariant . . . . . . . . . . . . . . . . . 24
2.9 The ratio of linear lengths, a similarity invariant . . . . . . . . . . . . . . . 25
2.10 Four point area ratio, an affine invariant. . . . . . . . . . . . . . . . . . . . 25
2.11 Ratio of parallel line lengths, an affine invariant . . . . . . . . . . . . . . . . 26
2.12 Geometric Hashing, two reference line segments are used to define a local
geometry that will remain unaffected by affine transformations. . . . . . . . 26
2.13 A point, tangent and two lines can form an affine invariant . . . . . . . . . 26
2.14 The Shape Query Using Image Database recognition system uses invariant
curve based signatures built up from different degrees of boundary smoothing 27
2.15 The point on the curve furthest from the endpoint vector is affine invariant 27
2.16 The projective invariant Cross Ratio. . . . . . . . . . . . . . . . . . . . . . . 27
2.17 A Two Dimensional Signature . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.1 The set of 12 natural query images . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 The set of 12 natural images searched . . . . . . . . . . . . . . . . . . . . . 46
3.3 Difficulties arising from the use of the neighbour region constraint and
boundary based culling. Generation of pairwise invariant signatures would
be impossible in these examples, where the grey areas are culled and each
remaining region therefore has no immediate neighbour. . . . . . . . . . . . 48
3.4 Direct storage of binary invariant signature. . . . . . . . . . . . . . . . . . . 49
3.5 Initial non-binary signature blurring and final binary signature. . . . . . . . 49
3.6 Initial non-binary signature blurring, using area bias and the final binary
signature calculated from mean value. . . . . . . . . . . . . . . . . . . . . . 50
3.7 Multi-scale central moment storage . . . . . . . . . . . . . . . . . . . . . . . 51
3.8 Examples from the Gestalt library . . . . . . . . . . . . . . . . . . . . . . . 60
3.9 Output from the Linear Gestalt Grouping algorithm . . . . . . . . . . . . . 61
3.10 Scale and region grouping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.11 Gestalt Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.12 The basic architecture of Thorisson’s algorithm . . . . . . . . . . . . . . . . 62
3.13 Minimum Boundary Distance versus Region Centroid Distance . . . . . . . 63
3.14 LGG algorithm example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.15 Resulting groupings in artificial images using LGGA . . . . . . . . . . . . . 64
3.16 Effect of description elements . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.17 Confusing the LGG algorithm - Clustering . . . . . . . . . . . . . . . . . . . 65
3.18 Images that should confuse the LGG algorithm - Context and Complexity 65
3.19 Linear regions output from the LGG algorithm . . . . . . . . . . . . . . . . 66
3.20 Linear regions output from the LGG algorithm . . . . . . . . . . . . . . . . 67
3.21 Segmentations using the LGG algorithm on a composite of test images . . . 68
3.22 Perceptually insignificant Linear groupings . . . . . . . . . . . . . . . . . . . 69
3.23 Resulting groupings in more complex photographic images using LGGA . . 70
3.24 Iterating the LGG algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.25 Connections that could be used to bridge regions . . . . . . . . . . . . . . . 72
4.1 Image grid based 8 nearest neighbours . . . . . . . . . . . . . . . . . . . . . 76
4.2 Segmentation primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3 Interior Difference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.4 Development of a segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.5 Example of a texture that can not be properly merged into a single segment
using only 8 neighbourhood edges . . . . . . . . . . . . . . . . . . . . . . . . 82
4.6 Showing how smoothing can decrease the 8 neighbour image grid problem . 83
4.7 Edges chosen using 8 nearest neighbours . . . . . . . . . . . . . . . . . . . . 83
4.8 Circular arrangement of larger segments . . . . . . . . . . . . . . . . . . . . 84
4.9 Example of the improved segmentation using Nearest Neighbour Seeding . . 92
4.10 Original decision function and the effect of the k component . . . . . . . . 93
4.11 KD Tree explained . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.12 Determining if a KD branch could contain a nearest neighbour . . . . . . . 105
4.13 Modifiable KD Tree Build Algorithm . . . . . . . . . . . . . . . . . . . . . . 106
4.14 Half rib orthogonal list structure . . . . . . . . . . . . . . . . . . . . . . . . 107
4.15 8NN Modifiable KD-Tree results surface and Exhaustive 8NN results. . . . 107
4.16 8NN Modifiable KD-Tree results with Exhaustive 8NN results subtracted
and decreasing numbers of search points. . . . . . . . . . . . . . . . . . . . . 108
4.17 8NN Modifiable KD-Tree results expressed as percentage difference to Ex-
haustive 8NN results, over decreasing numbers of search points. . . . . . . . 109
4.18 3D chart showing the performance of 8-NN algorithms . . . . . . . . . . . . 110
4.19 Modifiable KD Tree performance. . . . . . . . . . . . . . . . . . . . . . . . . 111
4.20 Performance of Nearest Neighbour algorithms . . . . . . . . . . . . . . . . . 111
4.21 Effects of dimensionality on 8-NN Search algorithms . . . . . . . . . . . 112
4.22 3D Chart Showing the performance of 8NN Search using Modifiable KD-Trees113
4.23 Showing optimal values for 8NN Modifiable KD-Tree search . . . . . . . . 114
4.24 Showing the segmentation process, with a delayed update . . . . . . . . . . 115
4.25 Current state grouping/segmentation process . . . . . . . . . . . . . . . . . 117
4.26 Merging and erasing parent groups connected by an edge . . . . . . . . . . 118
4.27 Merging parent groups without deletion in a tree based structure . . . . . . 121
4.28 Merging the same groups across different edges in the tree based structure. 121
4.29 Binary tree grouping/segmentation process . . . . . . . . . . . . . . . . . . 122
4.30 Benefits of retaining groups from previous generations . . . . . . . . . . . . 124
4.31 Drawbacks to retaining links from previous generations . . . . . . . . . . . . 124
4.32 Increase in time taken by grouping/segmentation engine as image dimension
increases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.1 Contradictions when using edge values to rank segment groups. . . . . . . . 129
5.2 Segment group scoring based upon edge values. . . . . . . . . . . . . . . . . 130
5.3 Results relating to segment groups that form the basis for the Founding
Edges used. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.4 Founding Groups and Image Type, an example. . . . . . . . . . . . . . . . . 132
5.5 Demonstrating the difference between Founding Edges and Parental Edges. 133
5.6 Size and Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.7 Segment group ranking results, natural . . . . . . . . . . . . . . . . . . . . . 137
5.8 Segment group ranking results, mondrian . . . . . . . . . . . . . . . . . . . 138
5.9 Results of combined Parental Edge and minimum Parental area ranking . . 139
5.10 Directly equivalent segment groups are important to recognition in photo-
graphic material . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.11 Segment ranking based upon correspondence . . . . . . . . . . . . . . . . . 141
5.12 Example showing final groups extracted by rank, example 1 . . . . . . . . . 142
5.13 Example showing final groups extracted by rank, example 2 . . . . . . . . . 143
6.1 RGB and HLS Space diagrams . . . . . . . . . . . . . . . . . . . . . . . . . 154
7.1 Differences in properties such as size and intensity during segmentation . . 160
7.2 Simplified example of direction dependence of recognition . . . . . . . . . . 161
7.3 Screenshot of algorithm output . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.1 Samples from the natural image set . . . . . . . . . . . . . . . . . . . . . . . 164
8.2 Samples from the facial image set . . . . . . . . . . . . . . . . . . . . . . . . 164
8.3 Samples from the cartoon image set . . . . . . . . . . . . . . . . . . . . . . 164
8.4 Samples from the Gestalt/symbolic image set . . . . . . . . . . . . . . . . . 165
8.5 Recognition score results for the natural image set . . . . . . . . . . . . . . 165
8.6 Contributions of description types to final score . . . . . . . . . . . . . . . . 166
8.7 Comparing rankings generated from greyscale and colour versions of the
same image queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.8 Correlation between component descriptor scores for both greyscale and
colour natural images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.9 Correlation between component descriptor rankings for both greyscale and
colour natural images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
8.10 Comparing rankings generated from greyscale and colour versions of the
same image queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
8.11 Greyscale natural image recognition performance with different global weight-
ings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
8.12 Changing context after transformation . . . . . . . . . . . . . . . . . . . 174
8.13 Similarity score performance over increasing rotation . . . . . . . . . . . . . 174
8.14 Similarity score performance over increasing translation . . . . . . . . . . . 175
8.15 Similarity score performance over Affine transformations(stretch) . . . . . . 175
8.16 Similarity score performance over Affine transformations(squeeze) . . . . . . 176
8.17 The colour rotation sample set . . . . . . . . . . . . . . . . . . . . . . . . . 177
8.18 The greyscale affine sample set . . . . . . . . . . . . . . . . . . . . . . . . . 177
8.19 Screenshot from the ‘Human-Like Survey’. . . . . . . . . . . . . . . . . . . . 178
9.1 Directing Edge Selection using a 2nd Order Edge Search Space and move-
able origin. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
10.1 Showing the invariance of self-similar convolution image histograms to im-
age transformations (but NOT perspective) . . . . . . . . . . . . . . . . . . 186
10.2 Screen-shot of a typical query and database response using the self similar
geometric histogram as an identifier. . . . . . . . . . . . . . . . . . . . . . . 187
Acknowledgements
I would like to thank my supervisor, Simon O’Keefe, and my assessor, Jim Austin, for
their invaluable support and advice, providing me with the opportunity to work at the
University of York in the first place. Thanks to the University of York and the Computer
Science department as a whole for making me feel welcome and providing a relaxed and
friendly atmosphere within which to complete this work. Special thanks to Tori Kemp
for the support and supplies, Joanna Moy for sharing the trials and tribulations of the
PhD experience, and a final thanks to my family for their invaluable encouragement and
support throughout the years.
Declaration
I declare that this thesis has been completed by myself and that, except where indicated
to the contrary, the research documented is entirely my own.
Charles Edward Tuke
Chapter 1
Introduction
The proliferation of image-based content over the Internet, coupled with increased
processor capabilities and the ready availability of image capture hardware for personal
computers, has largely been responsible for fuelling demand for a new generation of
methods to efficiently process the resulting visual imagery. There exists a particular
need to process visual imagery in a much more human fashion. Human-like recognition is
required in order to make sense of complex imagery very quickly and to base decisions
upon the abstract objects contained within images rather than their low-level intensity
information. The ability to process visual information in a human-like way is
particularly important when considering sketch and symbolic inputs, where the similarity
of equivalent sketch images is impossible to establish using direct low-level comparison
techniques and higher-level assumptions about image content are required. Another task
humans perform very well when compared to machine recognition is recognizing both
three-dimensional object similarity and symbolic similarity from limited two-dimensional
image information. While much of this success is due to the human ability to abstract and
compare to prior experience, experiments comparing recognition when using scrambled
shapes [Cutzu96, Edelman95] indicate that similarity can be determined in
three-dimensional objects from unfamiliar arrangements of localized object subcomponents
(although taking more time, and with a reduction in accuracy).
The proposed approach in this work is that, through the use of Gestalt segmenta-
tion/grouping processes to generate multiple semi-invariant descriptions 1, it is possible to
efficiently rate image similarity based on derived descriptions that capture both symbolic
and literal content from unconstrained image content. Such an approach will facilitate
recognition in photographic (regardless of perspective and viewing condition change), ar-
tificial, symbolic and sketch images without prior training. The architecture will have
1 Invariant descriptors are properties of an image object/shape that remain constant
regardless of image transformations.
Figure 1.1: Overview of the proposed library generation architecture.
three main elements: a Gestalt segmentation/grouping algorithm that is capable of
capturing symbolic groups; the creation of semi-invariant description labels from these
groups; and a recognition algorithm that can compare these descriptions (figures 1.1,
1.2). Results should approximate, and will be compared to, human similarity decisions.
Figure 1.2: Overview of the proposed image search process; pre-generated library labels
can be compared against a query label for fast similarity matching.
Chapter 2
Recognition/Vision, Storage &
Grouping Processes
The primary task of this thesis is to facilitate, as far as possible, machine recognition of
unconstrained single image content to give the impression of human-like recognition. In
order to approximate human-like performance we require the recognition of two dimen-
sional images and their content, which may be either two dimensional or three dimensional,
natural, artificial or symbolic in nature.
It is clear that if we wish to generate any recognition algorithm that gives the impres-
sion of human-like judgement then key aspects of human perceptual organization need to
be inherent in whatever primitives we extract from raw image content, regardless of the
final representation. The grouping of raw image information into specific visual objects
is of fundamental importance to this process; the segmentation stage of our algorithm
needs to supersede traditional approaches to include a wider range of Gestalt grouping
principles. While our immediate requirements are a practical approximation, an elegant
and plausible architecture for simulating human visual grouping would be ideal. Static
thresholds, or the reliance upon prior training, should be avoided in this work to allow
unconstrained recognition and a more globally justifiable solution. As will be further
discussed in section 2.3, the application of Gestalt principles is likely to require the abstraction of
image groups into a higher dimensional space of extracted and changing object properties.
While the efficient processing of high dimensional spaces is more problematic than group-
ing based purely around image pixel information, it does reduce many Gestalt principles
to the same inclusive framework. Similarity and proximity principles reduce to the same
proximity calculation within such a space, common fate and continuation also reduce to
the same measure and certain aspects common to colour constancy algorithms may also
be generated. The Gestalt principle of completion should be partially attained through
the use of convex hull based descriptors. Closure and simplicity still represent a challenge,
although it is anticipated that the application of convex hulls along with proximity, similar-
ity and continuation rules will facilitate these in the general case. Such an approach could
also facilitate future expansion to video image analysis, with the easy incorporation of a
time dimension that would naturally expand the perceptual grouping to include periodic-
ity principles and common fate over time. Research [Palmer96, Palmer00] indicates that
any such grouping process should operate, as far as feasible, at multiple scales (with
previous groupings forming parents to new groupings) and in as parallel a manner as possible. Although
true parallelization is likely to be beyond the scope of this work, the retention of previous
groupings between successive generations may go some way towards approximating this.
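
To make this concrete, a minimal sketch follows (in Python, with a hypothetical feature
layout and hypothetical weights; the actual engine is developed in chapter 4). It shows
how proximity and similarity, and with a time dimension common fate, all reduce to the
same nearest neighbour distance in one combined feature space:

    import numpy as np

    # Each region becomes a point in an n-dimensional feature space:
    # (x, y, luminance, hue_x, hue_y). Spatial proximity and appearance
    # similarity then reduce to the same distance calculation.
    regions = np.array([
        [10.0, 12.0, 0.8, 0.2, 0.1],
        [11.0, 13.0, 0.7, 0.2, 0.1],
        [80.0, 75.0, 0.2, 0.9, 0.4],
    ])

    # Hypothetical per-dimension weights trading spatial proximity
    # against appearance similarity.
    weights = np.array([1.0, 1.0, 50.0, 50.0, 50.0])

    def feature_distance(a, b):
        """Weighted Euclidean distance in the combined feature space."""
        return np.sqrt(np.sum(weights * (a - b) ** 2))

    # The nearest neighbour under this metric is the strongest grouping
    # candidate: regions 0 and 1 group long before 0 and 2.
    print(feature_distance(regions[0], regions[1])
          < feature_distance(regions[0], regions[2]))  # True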
Once we have these Gestalt geometric primitives, we will require a method of extracting
and efficiently storing information essential for fast recognition between images. The use
of invariant descriptions represents a simple and efficient method of overcoming the
detrimental effects of changing viewpoint, image transformations and lighting change
between images, and should enable us to approximate much of the invariant low-level
grouping that occurs in human vision. Signature representations seem a promising and
convenient method of storage for these invariant measures and should allow the level of
control and generalization required to simulate the fast 'first impression' aspect of
human vision. Through the careful choice of a range of descriptors that capture basic
qualities of image content, such as colour, texture, shape, size and position, it is
anticipated that such an algorithm should be able to generate results similar to
human judgements of similarity that do not rely heavily on prior knowledge/experience.
The following sections present a brief overview of the key areas involved in achieving
the above: vision, Gestalt grouping, preprocessing, invariance, segmentation and storage.
2.1 Human Vision/Recognition
Human perception of image content and shape recognition is still an area of much
ongoing debate. Where three dimensional shape recognition is concerned, two prominent
theories have emerged. [Marr82] and [Biederman87] advocate that recognition is due to
the encoding of shape information in terms of its three dimensional components; the
other approach is that the internal representation is a collection of two dimensional
views [Bulthoff95]. Variations on the latter approach also propose that recognition relies
upon the indexing of three dimensional shape by properties that remain similar regardless
of viewpoint, an approach paralleled by research into invariant representations (see section
2.5). If perfect invariance were the only method of recognition used in human vision (i.e. no
information about viewing transformations was registered by the viewer) then we would be
unable to perceive the transformations that the recognized object had undergone, which
is not the case. Conversely, research into the impact of extrinsic cues [Christou03] on
shape recognition (taking care to remove the possibility of the target object being encoded
with respect to its surrounding context) provides evidence that contextual information
about the viewpoint transformation increases the accuracy of recognition, although has
minimal impact upon response times. The impact of viewpoint transformations upon the
speed and accuracy of human recognition seems to vary according to the nature of the
specific recognition tasks. [Christou03] found that recognition performance decreased as
target objects were rotated away from the initial viewpoint, with performance decaying
as rotations approach 90° and improving slightly as the object is rotated toward 180°.
This, coupled with participants' verbal feedback, provided some indication that egocentric
encoding of two dimensional correspondences between object features plays at least some
part in the internal encoding of shape information in the human mind. Such encoding
would be least effective at right angles to the learnt viewing position. Contrary to this,
[Tarr90] finds that in certain cases recognition performance remains constant under
target transformation. It would appear that both orientation
dependence and invariance play a part in human recognition.
In realistic situations, object recognition is supplemented by the full range of hu-
man senses, reasoning and previous experience. While attempts to simulate aspects of
higher level human reasoning have been made (Bayesian reasoning and mid-level pro-
cessing [Forsyth98]), these methods often suffer from insufficiently consistent and com-
parable internal representations or image primitives. While recognition performance is
increased when extrinsic visual cues to movement around an object of interest are pro-
vided, [Simons98] shows an even greater improvement where the subject performs their
own movements around the object. This indicates that the subject's knowledge of their
body's position and movement in space is also an important factor in object recognition.
Inclusion of such non-visual stimuli, chaotic processes or any higher order of reasoning
is beyond the scope of this work, which will be based around limited two dimensional visual
cues. Fortunately, work in pre-attentive grouping [Yu01], segmentation and recognition
indicates that much of the initial recognition process is performed very quickly at a
relatively low level. Whilst still a non-trivial problem, this level of performance should be
achievable through machine recognition algorithms. Humans possess the ability to quickly
make basic distinctions between visual objects on a subconscious level, quickly determin-
ing the presence of familiarity or potential threats in the environment. This ability for
fast recognition and generalization of complex stimuli bears some similarity to properties
inherent in machine encoding using signatures (see section 2.7). Evidence suggests that
these initial responses are then refined, with conscious reasoning and recognition of other
factors such as viewpoint change and lighting conditions occurring later in the process.
While internal representations of objects are variant to viewpoint change and knowledge
Figure 2.1: A simple demonstration of Opponent Colour Theory and human sensitivity
adjustment to image content. Stare at the green square for a while, then quickly shift your
eyes to the spot in the adjacent white square. The majority of people with normal colour
vision will see a red square, which is the opponent colour to green; this also shows that
the eye has become de-sensitized to green.
of viewpoint change can increase accuracy of recognition, changes in lighting conditions
are largely ignored during the preliminary subconscious stages of recognition. The initial
grouping of raw visual information (analogous to an image segmentation) appears to incor-
porate a natural normalization of colour and intensity values at a low level. The presence
of lateral inhibition in centre-surround cells in human vision [Adelson00] (where light on
the centre cells is excitatory but light on the surround is inhibitory) suggests that the
actual signal transmitted to the brain is a normalized, edge based, description discarding
the absolute light intensity information received from cones (the Opponent Colour Model
[Hering64]). That centre-surround cells receive light information in red-green, yellow-blue
and black-white pairs explains why human vision cannot register certain colour combina-
tions such as greenish reds, yellowish blues or blackish whites, while yellowish reds and
bluish reds are easily perceived. CIE (Commission Internationale de l’Eclairage) Standard
Observer data based around Opponent Colour Theory show that de-sensitization to a
dominant colour stimulus occurs in human vision. Another result of this model of
human vision is the colour latency effect (figure 2.1).
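
The centre-surround behaviour described above can be approximated by a
difference-of-Gaussians filter. The sketch below is an illustrative approximation only,
not a model of the actual retinal circuitry, and the sigma values are arbitrary:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def centre_surround(image, sigma_centre=1.0, sigma_surround=3.0):
        """Excitatory centre minus inhibitory surround. A uniform field
        produces zero response, so absolute intensity is discarded and
        only local, edge-like contrast is transmitted."""
        return (gaussian_filter(image, sigma_centre)
                - gaussian_filter(image, sigma_surround))

    # A flat field yields (near) zero everywhere; an edge yields a
    # strong signed response either side of the transition.
    flat = np.ones((32, 32))
    edge = np.hstack([np.zeros((32, 16)), np.ones((32, 16))])
    print(np.abs(centre_surround(flat)).max())  # ~0
    print(np.abs(centre_surround(edge)).max())  # clearly non-zero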
If this is the case, then perceptual grouping and the perception of light in the brain can
only ever be determined in relation to the surrounding visual context. The evolutionary
advantage to this form of perception is that low level grouping and recognition of similar
objects can take place with a large degree of invariance to change in real-world lighting
conditions. Attempts to simulate this process in machine vision have resulted in the fields
of photometric invariance and colour constancy (see section 2.4), which usually pre-filter
raw image information before any significant grouping/segmentation processes occur. A
Figure 2.2: Left and centre images show the effect of mid-level processing: the central
square in both images appears to be a different shade to the human observer even though
it is in fact the same. The snake illusion on the right shows that such illusions can be both
local and contradictory (the four diamond squares are the same intensity).
thorough investigation into colour vision theory can be found in [Brainard01]. In reality,
work using optical illusions (figure 2.2) indicates that both grouping and light percep-
tion are fundamentally linked, with local geometric features also affecting the perception
lightness in low level vision. [Adelson93][Adelson00] shows that there is a two way rela-
tionship between the perception of colour and shading and the geometry of visual stimuli.
Three dimensional cues appear to have a dominant role to play in the human perception
of image content. Where shading is apparent in an image, these effects (intrinsic illumi-
nation image) are removed and the perceived lightness of the surface is greater than the
actual intensity registered on the retina [Gilchrist97]. In many cases, these perceptions of
colour difference are so compelling that test subjects refuse to believe that the physical
intensities are the same in the optical illusions. Whether this perception of shading and
transparency is the result of three dimensional interpretation or a more basic reaction to
two dimensional local configurations within image content is still unresolved.
Simultaneous contrast may be one of the two dimensional low-level processes that
facilitates this compensation for three dimensional lighting, shading and transparency
effects. Simultaneous contrast effects are the result of a localized normalization of object
intensity in the context of the surrounding neighbourhood intensity. Figures 2.3 and
2.4 demonstrate the difference between perceived intensity and true intensity: objects
with a darker background appear lighter than they really are, and objects with lighter
backgrounds appear darker than their true intensity values.
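
Simultaneous contrast viewed as localized normalization can be sketched very simply:
each pixel is re-expressed relative to the mean of its surround. This is an illustration
of the principle only, not a perceptual model, and the window size is arbitrary:

    import numpy as np
    from scipy.ndimage import uniform_filter

    def perceived_intensity(image, window=15):
        """Normalize each pixel against its local surround mean: the same
        physical intensity scores higher on a dark background (appears
        lighter) and lower on a light one (appears darker)."""
        return image - uniform_filter(image, size=window)

    # The same grey patch (0.5) on a dark (0.1) and a light (0.9) background.
    dark_bg = np.full((31, 31), 0.1)
    dark_bg[13:18, 13:18] = 0.5
    light_bg = np.full((31, 31), 0.9)
    light_bg[13:18, 13:18] = 0.5
    print(perceived_intensity(dark_bg)[15, 15])   # positive: looks lighter
    print(perceived_intensity(light_bg)[15, 15])  # negative: looks darker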
Of the two figures, figure 2.3 provides the best representation of Simultaneous Contrast
because of the lack of geometrical interference. The more extreme apparent intensity
differences in figure 2.4 cannot be totally explained in this way, as is demonstrated in
Figure 2.3: Simultaneous contrast in action. Both inner circles are in fact the same inten-
sity, but appear very different to the human observer due to their different surrounding
context.
Figure 2.4: The diamond shapes in the left hand examples are exactly the same intensity,
although they appear to the observer as very different shades. The right hand images
show the same examples compensated to appear the same intensity, although the actual
intensities are nearly twice the original. This is due to simultaneous contrast occurring
in human vision, where the different surrounding contexts to the diamonds have a large
impact upon their perceived brightness.
Figure 2.5: Snake illusion with its ‘anti-snake’ image on the left.
Simultaneous Contrast cannot account for the effectiveness of this illusion of intensity
difference as the anti-shadow image (left) does not show much difference in diamond in-
tensity, even though the surrounding background intensity is the same as the illusion.
It is only as the shadow geometry is added that the difference becomes more noticeable.
Figure 2.6: If simultaneous contrast effects relied upon summative or average intensity
values, rather than being calculated from individual colour channels/receptors, then these
three examples would all contain similar diamonds. The left image uses just the green
colour channel. The centre image introduces spurious blue elements which equal the same
summative (R+G+B) values as the left. The right image contains spurious blue elements
which equal the same average intensity (R+G+B)/3.
figure 2.5. If the comparison of immediate intensity background was responsible for the
effect in this image then the left hand image would show a much greater apparent intensity
difference between the top and bottom rows of diamonds. It is not until the full background
geometry surrounding the diamonds is replaced (giving the Gestalt impression of two dark
shadow strips across the pattern) that the intensity difference is apparent. The logical
explanation for such geometrical effects is that the eye is automatically interpreting the
image in a natural world context and adjusting intensity values to compensate for shadows
and transparencies [Adelson93].
Greyscale and single colour channel examples in figures 2.4 and 2.3 exhibit the same
effect; this indicates that simultaneous contrast is likely to be occurring independently in
the separate colour channels. Figure 2.6 demonstrates that, even though mean intensity
may remain the same, the introduction of other (disproportionate) colour elements disrupts
the simultaneous contrast effect.
Image normalization relative to surrounding image geometry appears to be a dominant
feature of this mid-level processing. All normalization processes require some form of
anchor feature from which to generate a canonical system. Research [Adelson00] indicates
that the following three major factors determine how an image is perceived (a minimal
sketch of the first two rules follows the list).
1. ‘Highest Luminance Rule’ - The surface of highest luminance in a scene tends
to be anchored to the perception of white [Gilchrist97] (used by Land and
McCann [Land71] in combination with Retinex theory)
2. ‘Largest Area Rule’ - A secondary factor, the largest visible area will tend to
appear white [Gilchrist97].
3. ‘Articulation’ - The size of colour constancy backgrounds (the surrounding
contextual area that affects the appearance of object colour) decreases as
overall colour and shape complexity in the image increases.
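
The promised sketch illustrates the first two anchoring rules in simplified form. This
is an illustration, not the cited models; the 'Largest Area Rule' is approximated here
by anchoring the most frequently occurring luminance level:

    import numpy as np

    def anchor_highest_luminance(image):
        """'Highest Luminance Rule': rescale so that the brightest
        surface in the scene is anchored to white (1.0)."""
        return image / image.max()

    def anchor_largest_area(image, n_bins=16):
        """'Largest Area Rule' (simplified): anchor the most frequently
        occurring luminance level to white instead."""
        hist, edges = np.histogram(image, bins=n_bins)
        dominant = 0.5 * (edges[hist.argmax()] + edges[hist.argmax() + 1])
        return image / dominant

    scene = np.random.default_rng(0).uniform(0.1, 0.6, (64, 64))
    print(anchor_highest_luminance(scene).max())  # exactly 1.0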
Another important factor when considering the segmentation of natural images is the
human perception of texture. Although often considered different disciplines, colour and
texture are fundamentally linked in the human perception of image content. Traditional
texture segmentation algorithms rely on localized structure or statistical clues to group
areas of an image together. The size of the context of a texture element has a fundamental
impact upon the perception of texture boundaries [Bergen88]. A method of combining
texture with colour in grouping processes is required if true human-like grouping is to be
simulated in machine vision.
Whatever preliminary processes occur in human vision, it is apparent that at some point
the raw visual input is abstracted and grouped in such a way that the scene becomes
a collection of perceived objects. Conscious perception of image content is in terms of
the human interpretation of the scene, not the actual intensity values of the image. For
example, an image of a chair is first perceived as a chair-like thing before the finer detail
(such as what it may be made of, or its orientation) is discerned. This ability to quickly
reach a preliminary impression of what an object may be provides further evidence for
signature-like internal representations in human vision, and at least some form of top
down processing or refinement process. Whether human visual grouping is a top down
process, bottom up process or a combination of the two, is still an unresolved issue. The
grouping of raw visual information into meaningful entities equates, at a simplified level,
to segmentation processes (see section 2.6) in machine vision. Whilst most machine vi-
sion approaches, including that of this work, cannot take into account the contribution
of higher level understanding and memory to image understanding, the study of Gestalt
processes (section 2.3) indicates that raw visual information can successfully be segmented
into perceptual groups that can form useful and simplified descriptions of image content
without the requirement for actual understanding. Further to traditional segmentation
rules, Gestalt grouping can facilitate the generation of symbolic correspondences not ex-
plicitly contained within an image, such as the perception of a larger circle from a loosely
circular arrangement of other features within the image. It is anticipated that the inclu-
sion of Gestalt grouping principles (and the ability to recognize more abstract symbolic
content) is one of the key ways in which this work will be able to give the impression of
human-like performance.
2.2 Machine Vision/Recognition
Machine vision is primarily concerned with the construction of meaningful and useful de-
scriptions based upon real world objects present in images. In this respect, computer vision
is attempting to emulate aspects of human visual perception and therefore is integrated
closely with the field of Artificial Intelligence. The most common tasks in computer vision
are generally related to the labelling, or segmentation, of input images into meaningful
objects. While at first glance this may seem a trivial task, the difficulties involved in
this basic vision process are compounded by the fact that this initial level of processing
is largely a subconscious activity in human vision. Where humans see a world of mean-
ingful shapes, textures and objects automatically, there is currently no formalized correct
method of achieving, or even evaluating, these abilities in machine vision. Even if hu-
man vision and recognition were thoroughly understood, computer technology available at
present has so far been unable to fully model the general recognition and understanding
capabilities present in human vision. Current research in machine vision usually repre-
sents a compromise between performance and speed, and is often hampered by the need to
re-invent the most basic, but inaccessible, talents of biological visual systems. This has
resulted in fragmented approaches to machine vision with large degrees of specialization
in different machine vision applications such as face recognition, geometry reconstruction,
satellite image analysis and trademark recognition.
Constructing a meaningful interpretation of a natural perspective image is a non-
trivial task, and in the case of single images can have no unique mathematically correct
solution. In the case of most photographic content recognition, the objects in an image
that need to be recognized and labelled are the projected images of three-dimensional
objects. The appearance of such objects can change dramatically at different orientations,
lighting conditions and when under occlusion. By the time conscious thought is applied
to the visual process the scene has already been segmented into separate objects in a
manner largely invariant to the multitude of possible differences in object appearance.
Object surfaces are seen as continuous, even where they may exhibit a reflective
highlight, and colours are initially perceived as being constant even in the presence of
shadows and other lighting effects. These low-level capabilities, that we take for granted,
are essential for the initial interpretation of image content, upon which logical reasoning
can later be applied. This suggests that the generation of representations that are to some
extent invariant to these unwanted environmental effects represents an important area of
computer vision research.
The difficulty with single images made up from real world projections is that the depth
component of objects is completely lost. The perceived shape and shading components
of the image are intrinsically connected to this missing information, with object occlusion
further complicating the task of generating a meaningful description of the scene. Although
it is impossible to uniquely determine the three dimensional information of objects depicted
by a single image, the use of assumptions about content, along with geometry cues, allows
the generation of a hypothesis or ‘best fit’ description of a single image scene. Although
human visual processes primarily operate in a stereoscopic domain, they are still capable
of generating accurate descriptions from single image sources such as photographs.
At the base level, domain knowledge consists of lighting, shading and perspective cues
where a basic underlying set of object and lighting properties are assumed. The description
of a scene can be further enhanced if higher-level domain or content knowledge is available
to refine the image description; this will not be the case in this work. Many processing
architectures employ a feedback process, where high-level information derived from low
level processes is fed back into the system to refine base assumptions about image content.
Due to the incomplete nature of single image representations of three dimensional
scenes, the imposition of assumptions about scene illumination, object surface properties,
or geometry type is essential to enable the reasoned analysis of the scene as a three di-
mensional entity. Shape from shading algorithms, such as those used in face recognition,
usually assume that surfaces have constant surface reflectance properties and either ambi-
ent or single point light sources. Projective invariants (explained on page 25) rely on the
assumption that the component features will be coplanar or co-linear in nature.
2.3 Gestalt Grouping
Gestalt principles derive from experimental observations related to human perceptual
grouping. This represents the low level process in which humans make sense of noisy
or incomplete image content, often extracting more geometrical and symbolic informa-
tion than an image may literally contain [Levine85]. Significant structures are perceived
and grouped almost instantaneously, at many levels of abstraction. Perceptual (Gestalt)
grouping is commonly associated with preconstancy, the grouping of raw visual informa-
tion into significant structures before any significant colour/lightness constancy occurs
in the human visual system. However, [Rock92, Rock64] indicates that some degree of
grouping occurs after the perception of lightness constancy and stereoscopic depth per-
ception. Gestalt grouping is a multi-level process, with large-scale groupings being formed
from previous groups, illusory contours and amodal completion [Palmer96, Palmer00].
The point at which visual Gestalt grouping actually occurs is still an unresolved issue
[Vecera97, Schulz03], although evidence suggests that it may well be a fundamental com-
ponent to many aspects of perceptual grouping. It is likely that Gestalt processes may
represent a global mechanism that accounts for all low-level perceptual processes, from
grouping, simultaneous contrast and texture to colour constancy and even certain aspects
of three dimensional interpretation. The following commonly accepted list of Gestalt prin-
ciples can be used to approximate what occurs at a fundamental stage of human perceptual
grouping.
• Proximity -
Objects in close proximity are likely to be grouped.
• Similarity -
Objects similar in appearance are likely to be grouped.
• Continuity -
Objects on a continuous path are likely to be grouped.
• Closure -
Boundaries/groupings will have a tendency to be perceived as closed paths.
• Completion -
Groupings will be perceived as complete objects, even if partially obscured.
• Region -
Objects contained within the same visual structure are likely to be grouped.
• Connectedness -
Objects connected together by other features are likely to be grouped.
• Common Fate -
Objects with the same degree of change in appearance or position are likely to be
grouped.
• Periodicity -
Objects that appear at the same time/frequency are likely to be grouped.
• Simplicity -
The simplest interpretation and grouping of image content will be favoured.
Figure 2.7: Examples of common Gestalt principles: (a) Proximity; (b) Similarity;
(c) Continuity, Closure, Simplicity; (d) Common Fate; (e) Closure, Completion.
There is some degree of overlap between these principles, as closure can often be con-
sidered the result of continuity, completion and simplicity. Similarly, if objects are
considered in a high-dimensional space including aspects such as size, position and ap-
pearance, then it follows that common fate is a result of continuation within this space.
Although this work is based around single image sources, video sequences could well in-
crease the dimensionality of this space to include time, in which case periodicity is an
implicit result of continuity in this space. Unfortunately, calculating continuity in such a
large dimensional space is a non-trivial task, although this thesis will go some way towards
achieving this.
While these Gestalt grouping principles are well known, the actual implementation
and ordering of these into an algorithmic framework remains highly problematic due to
the parallel and multi-level nature of the task. Localized proximity and similarity are
implicit to most segmentation algorithms, but the connection, continuity and closure of
disparate objects on a multi-scale level represents a much greater challenge. In this work we
intend to address, to some degree, the five main principles: proximity, similarity, continuity,
common fate and closure (figure 2.7).
2.4 Colour Constancy
The colour appearance in raw images is formed from a combination of illumination spectra
and the surface reflectance properties of the objects contained in the scene as the illumina-
tion is reflected from the objects back into the camera. A common aim of colour constancy
algorithms is to enable the separation of object and lighting properties within images, usu-
ally to enable lighting and pose invariant matching between objects of the same surface
properties. While colour constancy algorithms generally rely on world assumptions and
simplified environments, such as Mondrian surfaces under a single uniform light source,
they can be successfully used to reduce lighting effects detrimental to recognition tasks.
Unfortunately, the practical application of colour constancy theory is often adversely af-
fected by limitations in camera sensitivity which can cause noise and normalization errors
in calculations.
The separation into intrinsic images may not always be required for certain recognition
tasks; in such cases it may be sufficient to use illumination quasi-invariants. An example of
this can be found in [Koubaroulis00a], where the Multimodal Neighbourhood Signature
is formed from the cross-ratio (invariant to the linear illumination model) of
neighbouring patches.
illumination and reflectance is sufficient for natural images, although the assumption of a
single illumination source and Mondrian surfaces is not. Gamut algorithms [Finlayson95,
Brainard97, Barnard00] attempt to address these problems by utilizing information about
expected lighting and surface properties previously measured from target environments to
establish illumination constraints.
Conventional algorithmic approaches can be roughly categorized by the world assump-
tions they rely on to provide the extra constraints for colour constancy.
2.4.1 Assumptions of surface reflectance or illuminant properties in image
The ‘Grey-World Algorithm’ [Buchsbaum80] makes the assumption that the surface RGB
values in an image will average to grey if viewed under a canonical (white) light. With
these constraints, any deviation in the image average from grey has to be due to the
illumination of the scene. The ‘White Patch Algorithm’ which lies at the heart of many
Retinex algorithms [Land71] (derived from centre-surround theory), assumes that there
will be a surface in the image that shows maximal reflectance in the various colour bands
that can be used to anchor colour normalization.
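
Both assumptions are simple enough to sketch directly. These are illustrative
implementations under idealized conditions, ignoring the camera noise and normalization
errors noted earlier:

    import numpy as np

    def grey_world(image):
        """Grey-World: any deviation of the per-channel means from grey
        is attributed to the illuminant and divided out."""
        means = image.reshape(-1, 3).mean(axis=0)
        return image * (means.mean() / means)

    def white_patch(image):
        """White-Patch: assume the maximal response in each channel
        corresponds to a maximally reflective surface; anchor it to 1.0."""
        return image / image.reshape(-1, 3).max(axis=0)

    # A reddish illuminant biases an otherwise neutral scene; both
    # corrections recover roughly equal channel statistics.
    rng = np.random.default_rng(1)
    scene = rng.uniform(0.2, 0.8, (32, 32, 3)) * np.array([1.4, 1.0, 0.8])
    print(grey_world(scene).reshape(-1, 3).mean(axis=0))  # ~equal channels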
Retinex Theory [Land71, Rahman] asserts that reflectance is constant across space,
except where there exists a transition between objects/pigments.
Given these constraints, Retinex theory makes the following assumptions (a sketch
follows the list):
1. Reflectance change is shown as a large step edge in an image.
2. Illuminance changes gradually over space (low edge values).
3. The reflectance image and, conversely, the illumination map can be
extrapolated by removing the low edge derivatives caused by illumination
change from the image.
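
The sketch below is a toy one-dimensional illustration of these assumptions using
gradient thresholding in the log domain; it is not Land and McCann's path-based
algorithm, and the threshold value is arbitrary:

    import numpy as np

    def retinex_1d(signal, threshold=0.1):
        """Discard small log-domain derivatives (assumed slow illumination
        change, assumption 2) and rebuild the signal from the large step
        edges that remain (reflectance transitions, assumptions 1 and 3)."""
        log_s = np.log(signal)
        grad = np.diff(log_s)
        grad[np.abs(grad) < threshold] = 0.0  # remove illumination gradient
        return np.exp(log_s[0] + np.concatenate([[0.0], np.cumsum(grad)]))

    # Two reflectance patches under a smooth illumination ramp.
    x = np.linspace(0.0, 1.0, 100)
    reflectance = np.where(x < 0.5, 0.3, 0.8)
    observed = reflectance * (0.5 + 0.5 * x)  # ramped illumination
    recovered = retinex_1d(observed)
    print(recovered[10] / recovered[90])      # ~0.3/0.8, ramp removed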
2.4.2 Assumptions of a finite image gamut
Gamut algorithms (such as the CRULE algorithm [Forsyth90]) make use of the Coefficient
Rule, that although a camera is able to record a large range of colours, only a subset of these
will frequently occur in real images. A set of illuminants is mapped directly from target
environments and their convex hull taken to represent the canonical illumination gamut.
By combining real reflectance measurements taken from objects with the illuminations
possible from the canonical gamut we can add constraints to possible illuminants in images
by taking each RGB value in an image and calculating the gamut of illuminations that
would be required to map that RGB value into the canonical RGB convex hull. Each
coloured surface in an image generates a new illumination gamut set that can be intersected
with past results to provide ever-tighter constraints on the illumination source. The final
illuminant is chosen from the remaining set by either finding the mean illumination or the
illumination which would describe the greatest scene reflectivity. This approach is only
effective where a scene contains a sufficient number of differently coloured objects and
minimal specularity and shading. The use of colour direction, which is largely unaffected
by shading [Finlayson95, Barnard96, Barnard00], can overcome some of these limitations.
[SimonFraser] provides a useful explanation of gamut constraint theory.
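
A heavily simplified sketch of the coefficient-rule intersection is given below. It uses
per-channel diagonal maps against an interval approximation of the canonical gamut
rather than the full convex-hull machinery, and the canonical interval is an assumed value:

    import numpy as np

    # Canonical gamut approximated per channel as an interval of responses
    # observed under the canonical illuminant (assumed values).
    CANON_LO, CANON_HI = 0.05, 0.9

    def illuminant_constraints(pixels):
        """Each observed value v constrains the diagonal scaling s by
        CANON_LO <= s * v <= CANON_HI; intersecting the resulting
        intervals over all surfaces tightens the illuminant estimate."""
        lo = (CANON_LO / pixels).max(axis=0)  # tightest lower bounds
        hi = (CANON_HI / pixels).min(axis=0)  # tightest upper bounds
        return lo, hi

    surfaces = np.array([[0.2, 0.4, 0.1],
                         [0.6, 0.5, 0.3],
                         [0.1, 0.7, 0.2]])
    lo, hi = illuminant_constraints(surfaces)
    print((lo + hi) / 2)  # e.g. the interval midpoint as a point estimate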
2.4.3 Measuring the illuminant indirectly from image properties
Some approaches attempt to measure illumination through indirect methods. One method
used is the Dichromatic Reflectance Model and differences between the reflectance prop-
erties of ambient and specular illumination [Klinker88]. Ambient reflection components
are a combination of illumination and reflectance (object colour), specular reflection is
much more closely related to the original illumination source. This relationship can possi-
bly be used to estimate the illumination spectra and therefore facilitate colour constancy
correction.
2.4.4 Limitations to Colour Constancy Techniques
While a large number of well established colour constancy techniques do exist, it must be
noted that there remain several issues that can prove problematic with such approaches.
The most obvious are the limitations and inaccuracies that can occur through the im-
position of the artificial assumptions that these algorithms are based upon. The other
difficulty arises through the assumption that an intrinsic image can be generated from real
colour information available in an image based upon reflectance and lighting properties.
As discussed in section 2.1, there can be a vast difference between the human perception of
colour and the actual physical stimulus arriving at the eye. Perceptual effects such as
Simultaneous Contrast and those described by Opponent Colour Theory are not addressed
in the standard colour constancy model.
2.5 Invariance
Invariant theory was originally pioneered by 19th century mathematicians Boole, Cayley
and Gordan [Boole1872, Cayley, Gordan]. With the advent of machine vision, it has since
been developed as a practical and useful tool for image recognition. Invariants are prop-
erties of image content that remain stable regardless of image transformations that would
otherwise be detrimental to recognition. Invariants can be of different levels, depending
upon the number of transformations they are required to be invariant to, although the
higher the order of the invariant the more components it requires, increasing susceptibility
to noise.
2.5.1 Photometric Invariants
Optical and photometric invariants are descriptions of image colour or shading content that
are invariant to image or scene transformations. Especially useful where colour or shading
information is rich and geometric image content is unreliable, as in natural images, photo-
metric invariants provide very robust image content description qualities. Being generated
from photometric properties of the image, this type of invariant can be used to provide
an effective image description regardless of geometric image transformations. Although
this may be advantageous in the case of general image database search, the very fact that
geometric information is completely ignored can result in false image matches and make
photometric invariants unsuitable for pointwise mapping between images. Photographic
images contain colour information that is sensitive not only to the base colour of the ob-
jects in the scene, but also to lighting conditions and shading effects caused by object shape. The
combination of colour-constancy techniques coupled with evidence accumulation frame-
works can minimize these detrimental factors and facilitate useful recognition. A major
advantage of photometric descriptions over geometric descriptions is that they can usu-
ally be derived directly from pixel information in an image. Photometric properties of
images are also less sensitive to image transformations common in photographic material
when the viewpoint has altered. While a movement in camera position can have a rad-
ical difference upon the geometrical appearance of an object, its photometric properties
will generally remain stable. For similar reasons, photometric properties are compara-
tively tolerant to other image artifacts that can adversely affect geometric descriptions,
such as occlusion and motion. This makes them ideal for use as robust descriptors with
realistic/photographic image content.
2.5.2 Geometric Invariants
Geometric invariant features are concerned with generating descriptions of the apparent
geometry contained in an image that are invariant to unwanted image, object or viewpoint
transformations. While geometric invariants are usually applied to two dimensional image
content, they can be created to describe objects of any dimensionality. Early research into
geometric invariant descriptors was based around describing target objects as a whole
(Global Invariants), but the need to describe objects in situations where no true invariants
exist (for example, describing the content of projective scenes from a single image source)
has resulted in the use of localized features (local invariants) to generate semi-invariant
descriptors for evidence accumulation frameworks.
2.5.3 Global Invariant Techniques
Global invariants are formed from the global properties of an image, shape or object as
a whole. Correlation invariants (table 2.1) are formed directly from the image function
or pixel content and can be combined to higher orders to represent different levels of
completeness. First order correlation invariants represent a simple measure of region area
or, for non-binary images, mass.
Moment invariants tend to be based upon extracting positional cues, such as an image
function's centre of mass or object spread, from raw image information. Moments
can be used to effectively simplify image information and encode general shape trends in
image data as well as form the basis for invariant information (Table 2.2). An indirect
use of moment information to achieve invariant descriptions is as a basis for normalising
position, scale and orientation of image data using moments of different orders. [Reiss93]
provides a good overview of the use of correlation and moment invariants. Both correla-
tion and moment invariants can be considered primary members of the group of global
invariants that rely on image mass for their results. Another group of global invariants
can be generated from the boundary of an image object. Boundary based invariant features
include polynomial equations [Taubin92] and Arc Length Space [Gool92]. Although effective
at generating similarity and affine level invariants, these approaches rely heavily upon
accurate boundary information and are sensitive to occlusion.

1ST ORDER CORRELATION (OBJECT AREA/MASS)
    \sum_{x,y} f(x, y)

2ND ORDER CORRELATION
    \sum_{x,y} f(x, y) \, f(x + a, y + b)                (a, b = displacement values)

3RD ORDER CORRELATION
    \sum_{x,y} f(x, y) \, f(x + a, y + b) \, f(x + c, y + d)    (a, b, c, d = displacement values)

Table 2.1: Calculating two-dimensional correlation invariants
0TH ORDER MOMENT: MASS
    M = \sum_{x,y} f(x, y)

1ST ORDER MOMENT: CENTROID
    C_x = \frac{1}{M} \sum_{x,y} x \, f(x, y)
    C_y = \frac{1}{M} \sum_{x,y} y \, f(x, y)

2ND ORDER CENTRAL MOMENT: VARIANCE
    C_{xx} = \sum_{x,y} f(x, y)(x - C_x)(x - C_x)
    C_{xy} = \sum_{x,y} f(x, y)(x - C_x)(y - C_y)
    C_{yy} = \sum_{x,y} f(x, y)(y - C_y)(y - C_y)

Table 2.2: Calculating Moment invariants
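The quantities in Table 2.2 can be computed directly from a two-dimensional image function held as an array. A Python sketch using NumPy (written for this overview, not taken from the thesis implementation):

import numpy as np

def moments(f):
    # Mass, centroid and second-order central moments of a 2-D image
    # function f(x, y), following Table 2.2.
    ys, xs = np.indices(f.shape)                  # y = row, x = column
    M = f.sum()                                   # 0th order: mass
    cx = (xs * f).sum() / M                       # 1st order: centroid
    cy = (ys * f).sum() / M
    cxx = (f * (xs - cx) ** 2).sum()              # 2nd order central
    cxy = (f * (xs - cx) * (ys - cy)).sum()       # moments (variance)
    cyy = (f * (ys - cy) ** 2).sum()
    return M, (cx, cy), (cxx, cxy, cyy)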
Frequency based descriptors use signal theory to overcome geometric transformations.
Applicable to both pixel and boundary information, Fourier descriptors and Wavelets
describe the image objects in terms of their component frequencies.
Global invariants are usually formed using algebraic techniques that operate on the
low-level image data as a whole. With most global invariants being formed from basic
low-level object image features, they require little pre-processing and can be very fast and
efficient to implement. Unfortunately, the reliance of global invariants upon the entire shape
of the object image makes them very sensitive to real image effects such as occlusion or
overlap with image boundaries, rendering them ineffective as invariant descriptors under
these conditions.
2.5.4 Local Invariant Techniques
Local invariant features are defined from localized image information or geometry such
as points, lines or curves and are therefore more robust to occlusion than their global
counterparts. An image object can contain many local invariants, each of which represents
invariant evidence that can be used to identify it. Methods that use local invariants
usually feature some form of evidence accumulation such as histogram matches [Kliot98]
for database search or transformation evidence accumulation in parameter space. Such
invariant features can be used directly for recognition (for component labelling or canonical
mappings) or as a means to isolate the different transformations within the image to
facilitate a direct comparison. Table 2.3 lists some of the most common applications of
local invariant descriptions.
1. Signatures
Two independent invariant features plotted against each
other to form a single invariant canonical representation of
the object [Weiss88].
2. Histogram
A one dimensional histogram of a single description
[Kliot98].
3. Localised Signatures
Individual signature descriptions generated from the imme-
diate neighbourhood, for each image feature [Thacker95].
4. Creating anchor points
Invariant boundary properties used to describe shape
[Mokhtarian99] or reverse image transformations for tem-
plate matching.
5. Parameter Space (Hough Transform)
Invariant feature evidence plotted into parameter space to
facilitate the reversal of transformations between images
[Xilin99].
Table 2.3: Common local invariant architectures
Local geometric invariant features are derived through differentiation, the order of
which is determined by the number of transformations they need to be invariant to. Usually
generated from a combination of primitive geometric features such as lines, points and arcs,
these features represent different degrees of freedom that can be combined to generate an
invariant of a predictable order. Table 2.4, below, shows some common image primitive
features and their corresponding degrees of freedom.
Features            Degrees of Freedom
Tangent                     1
Points                      2
Curves                      3
Conics                      5

Table 2.4: Degrees of freedom (order) of primitive geometric image features
Invariance to a given set of transformations can be generated by ensuring that the sum
order of the component features exceeds the order of the transformations.
An example:
A Weak Projective Transformation has 8 degrees of freedom.
The Cross Ratio is a projective invariant derived from 5 co-planar points (see figure 2.16).
(While the standard model for a Cross Ratio uses 4 co-linear points, any 5 co-planar
points can be mapped into 4 co-linear points by a simple projection.)
Order of the Cross Ratio = 5 points × 2 degrees of freedom each = 10
Number of independent invariants from the Cross Ratio = 10 − 8 = 2
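For reference, the standard textbook form of the Cross Ratio of four collinear points A, B, C and D (with XY denoting the distance from point X to point Y) is:

\[
\mathrm{Cross}(A, B; C, D) = \frac{AC \cdot BD}{BC \cdot AD}
\]

This quantity is unchanged by any projective transformation of the line, which is what allows the construction above to yield projectively invariant measures.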
[Reiss93, Mundy92] contain more detail on the use of geometric primitives and the
mathematical principles behind local geometric invariance. Local geometric invariants can
be categorized according to their order of invariance and hence the image transformation
types they are unaffected by. In the case of image analysis, invariants fall into one of the
four orders of invariance presented in table 2.5.
NAME          ORDER    TRANSFORMATIONS INVARIANT TO
                       Translation  Rotation  Scale  Shear  Skew
Euclidean       3          YES         YES      NO     NO     NO
Similarity      4          YES         YES      YES    NO     NO
Affine          6          YES         YES      YES    YES    NO
Projective      8          YES         YES      YES    YES    YES

Table 2.5: The four main levels of local geometric invariance
Figure 2.8: Angle between vectors, a similarity invariant
Euclidean Invariants
Euclidean invariants are of relatively low order and are unaffected by Euclidean transforms
(translation and rotation). This makes Euclidean invariance ideal for most trademark
recognition or similarly constrained two dimensional document analysis tasks. Linear
length is a simple example of a Euclidean invariant.
Similarity Invariance
Similarity invariants are measures that remain unaffected by scaling and Euclidean trans-
forms. Similarity invariants are ideal for recognition in most document analysis tasks,
where this limited range of transformations is common. The angle between vectors (fig-
ure 2.8) and the ratio of linear lengths (figure 2.9) are both easily calculated Similarity
invariants and are commonly used to generate two-dimensional signature descriptions.
Curvature is also a commonly used Similarity invariant [Weiss88, Califano94], and when
combined with image smoothing to enable closure of image curves [Dudek97] can form a
useful approach to sketch image analysis.
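Both measures are inexpensive to compute. A minimal Python sketch (illustrative only; vector endpoints are assumed to be 2-D NumPy arrays):

import numpy as np

def angle_between(u, v):
    # Angle between two vectors (figure 2.8): invariant to translation,
    # rotation and uniform scaling.
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

def length_ratio(p0, p1, q0, q1):
    # Ratio of two linear lengths (figure 2.9), another similarity
    # invariant.
    return np.linalg.norm(p1 - p0) / np.linalg.norm(q1 - q0)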
Affine Invariants
Affine invariants are unaffected by shear plus the Similarity transforms. This makes them
particularly useful for our purposes, as affine transformations are common in photographic
Figure 2.9: The ratio of linear lengths, a similarity invariant
imagery. Invariants of this level are still fairly efficient to calculate and can provide usefully
tolerant quasi-invariant local measures for photographic image content, particularly where
objects are sufficiently distant from the camera to minimize projective effects. Area
ratios are particularly useful Affine invariants as area can be easily updated during a
segmentation process. Boundary descriptions can also be derived from area ratios by
using the ratio of areas between four points on a curve (figure 2.10) [Kliot98]. The ratio
of parallel linear lengths (figure 2.11) is another common example of an Affine Invariant
[Gros98]. Geometric Hashing [Lamdan88], uses adjoining line pairs to form the basis for an
affine local planar geometry into which the other graph features are mapped (figure 2.12).
Another affine invariant is the ratio of vectors formed from a point at a tangent to two
linear vectors [Reiss93]. Boundary curvature can be used to generate Affine invariants,
and has been successfully implemented in the SQUID image recognition system [Mokhtarian99]
(figure 2.14). A further useful property of curves is that the point on a curve which lies
furthest from the vector between the two curve endpoints will remain the same regardless
of affine transformations [Reiss93] (figure 2.15).
Figure 2.10: Four point area ratio, an affine invariant.
Projective Invariants
Projective invariants, or weak perspective invariants, are the highest order of invariant
possible when dealing with single images. Unaffected by skew as well as Affine transfor-
mations, they can be used to approximate perspective tolerant descriptions where image
content is generally planar in nature. The most commonly used projective invariant is the
Figure 2.11: Ratio of parallel line lengths, an affine invariant
Figure 2.12: Geometric Hashing, two reference line segments are used to define a local
geometry that will remain unaffected by affine transformations.
Figure 2.13: A point, tangent and two lines can form an affine invariant
Cross Ratio, which is the ratio of distances between four points on a line (figure 2.16). A
Cross Ratio can also be derived from five coplanar points by forming a vector between two
of the points and projecting the other points onto this line.
Properties of planar conics can also be used to generate high-level invariants [Mundy92,
Forsyth90a, Quan98, Brill92].
The major problem associated with implementing high order invariants is that a large
number of primitive features are required to generate them; this makes them sensitive to
initial feature extraction errors, aliasing and noise present in query images.
2.6 Segmentation
Image segmentation usually refers to the subdivision of an image into regions that repre-
sent disparate image subcomponents. In the general case, the aim of segmentation is to
Figure 2.14: The Shape Query Using Image Database recognition system uses invariant
curve based signatures built up from different degrees of boundary smoothing
Figure 2.15: The point on the curve furthest from the endpoint vector is affine invariant
Figure 2.16: The projective invariant Cross Ratio.
identify which parts of an image constitute separate visual objects. Segmentation is often
used as a first step towards the analysis of a given scene, assigning labels to areas of an
image to generate concise and computationally useful information about scene content. In
the case of image compression, segmentation can be used to eliminate redundant image
information, reducing storage requirements. The other main application of segmentation
is in image recognition and understanding, where regions of connected pixels with similar
characteristics will usually belong to a single visual object.
Segmentation algorithms can be roughly divided into five overlapping approaches: mo-
tion, edge, boundary, region and model techniques. Motion segmentation measures rely on
the principle that points in an image belonging to the same object in an animated sequence
of images will generally share the same common fate. Edge and boundary based techniques
rely on discontinuities at region borders to find region boundaries from which the regions
themselves can be derived. Region techniques rely on point proximity and continuity to
directly identify regions and, conversely, the boundaries between them. Model approaches
rely upon prior knowledge of image content to enhance the segmentation process and
directly search the image space for known objects.
While many of the above approaches should ideally yield the same results with the same
given image, problems arising from noise, image complexity and algorithm assumptions
mean that this is very rarely the case.
For natural images containing three-dimensional objects, the segmentation becomes
more problematic due to noise, complex lighting and shading effects. Pre-processing such
as Symmetric Neighbourhood Filtering [Harwood87], median filtering or Gaussian filtering
to reduce noise, and colour constancy techniques [Finlayson97b],[Finlayson01] to reduce
shading and lighting effects is often applied before edge or region detection in such images.
The following sections provide an overview of the four main segmentation approaches that
are applicable to a single image segmentation.
Edge Based Segmentation
Edge based techniques attempt to determine regions by detecting and linking the edge
discontinuities in an image into region boundaries. Edge detection techniques commonly
utilise localised partial derivatives calculated from pixel neighbourhood masks. The Roberts
[Roberts65], Laplacian, Sobel [Gonzalez92] and the popular Canny [Canny86] operators,
along with their many variations, provide quite effective edge detection using the derivatives
of pixel nearest neighbours and can be extremely efficient to implement. Unfortu-
nately, due to the bottom-up nature of these methods, these approaches can be sensitive
to camera artifacts, low resolution contours and sub-optimal contour groupings formed
from boundary intersections within the image at early stages of edge contour generation.
After region boundaries have been located, then the regions themselves can be easily
labelled. This process relies heavily upon the retrieval of fully enclosing boundaries, which
edge detectors alone are unlikely to generate in all but the most contrived or simple images,
due to imaging noise or similarity between adjacent regions.
Although binary and intensity measures are generally used with edge based techniques,
colour and texture measures, most often implemented with region segmentation, can also
be applied. To compensate for incomplete boundary recovery, most edge based techniques
require some form of post processing using image, boundary or geometric assumptions to
correct any errors. Edge relaxation [Zucker77] involves the use of localised edge direction
information to encourage the growth of weak adjoining edges whilst inhibiting edges due
to noise. Sequential edge detectors use strong edges as seeds for object boundaries from
which all possible edge paths are calculated and evaluated for likelihood and connectivity.
Because this process can be quite computationally intensive, it is often directed towards
likely paths based upon edge direction in order to reduce the number of potential paths
searched.
The Hough transform can be implemented with noisy edge data to gather evidence for,
and extract, line or curve segments. The Hough transform plots edge pixels into a curve
or line parameter space where evidence is accumulated for the existence of a particular
boundary segment. This information can then be used to inhibit outlying edge points
whilst encouraging weaker edge points which lie on the curve. An adaptation of this is
the Generalised Hough Transform, which can be used to detect the entire boundary of an
object with sparse edge data, although prior knowledge of the shape to be extracted is
required. One drawback to the use of Hough algorithms is that they have a tendency to
be computationally expensive. A variety of speed-up and shortcut techniques exist which
direct the search or use multiple edge points in combination, such as the Randomised
Hough Transform [Xu90].
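The core of the standard line-detecting transform is a small accumulator loop. The Python/NumPy sketch below (a generic textbook formulation rather than any specific variant discussed here) votes each edge point into (rho, theta) parameter space:

import numpy as np

def hough_lines(edge_points, image_shape, n_theta=180):
    # Each edge point (x, y) votes for every line
    # rho = x cos(theta) + y sin(theta) passing through it; peaks in
    # the accumulator correspond to lines supported by many points.
    h, w = image_shape
    max_rho = int(np.ceil(np.hypot(h, w)))
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((2 * max_rho + 1, n_theta), dtype=np.int32)
    cols = np.arange(n_theta)
    for x, y in edge_points:
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + max_rho, cols] += 1           # one vote per theta bin
    return acc, thetas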
Boundary methods: Snakes, Active Contour Models, Balloons
Boundary based segmentation lies somewhere between object recognition, edge detection
and region detection, although such methods are primarily concerned with detecting object
boundaries as a whole.
Snakes, or active contour models [Kass87] are moving curve segments that negotiate
and adapt to the image space to find a position that minimises an energy function, usually
to seek smooth curve segments whilst maximising edge potential. The practical imple-
mentation of snakes to real images can prove quite sensitive to initial conditions, so prior
knowledge of image content or careful seeding is required for effective results. Snake al-
gorithms result in multiple detected curves that, as in the case of edge detectors, require
some form of linking into complete boundaries before segmentation can be performed. Bal-
loons [Cohen93], implemented to seek out both 2D and 3D contours, are a more controlled
method of using snakes to find full region boundaries. Balloons represent continuous snake-
like boundaries that seek to find maximal area, maximal smoothness and maximal edge
potential and inherently remove the need for edge linking into boundaries. Snake/balloon
based boundary techniques are very effective where some information about the shape,
curvature or content of the image to be segmented is already known and use assumptions
to enhance the boundary detection. Although methods exist to adapt these approaches
to unknown image content, they can prove very sensitive to starting conditions and image
content.
Model Based Segmentation
These techniques are primarily devoted to the direct fitting of a known shape or model
to the image data. Similar in nature to Active Contour Models, and often based upon
these, they use edge information to adaptively fit a known object boundary to the image
data. The simplest form of this approach is template matching, where a known object,
which will usually cycle through a series of transformations, is tested directly for match
with the image data. In the case of deformable models, known boundaries are adaptively
fitted to image edge areas using fitness functions similar to Balloons. Active Shape Mod-
eling [Stegmann00] relies upon a statistical description of object boundaries, capturing
boundary variation by applying principal component analysis. Although very effective at
locating and segmenting known objects within images, and especially resistant to image
noise, their reliance upon prior knowledge of image content makes these approaches of
limited use to generalised image segmentation.
Region Based Segmentation
Segmentation based upon the detection of regions in an image approaches the problem
directly, grouping points into regions based upon their proximity and the similarity of
their characteristics. A segment is represented directly by some measure of the image
data contained within it and the boundary between regions is usually conceptualized as the
difference between these measures. Segments can be extracted directly from the statistical
properties of the image, generated as a result of merging lower level regions with similar
properties or created through the iterative subdivision of regions in a top down process.
Region based approaches are probably best defined in terms of the characteristic measures
and the actual region grouping processes that use these measures to create the segment.
2.6.1 Measures used to guide segmentation
There is a wide range of measures that can be used to form the basis of a segmentation
process; optimal measures are dependent upon the type of segmentation required and
anticipated image content. Essentially, these measures represent a localized property of
an image at a given position which can be compared with a neighbouring image point for
similarity to determine whether the two points belong to a common region. Measures can
be subdivided into single point (colour, intensity, binary) or multiple point descriptions.
Single Pixel Measures
Single pixel properties provide a measure of similarity between localised points to guide their
subsequent grouping into regions or segments. Although less appropriate for textured im-
ages, single point measures largely avoid the problems associated with sampling scale and
dimensionality, and can generate computationally inexpensive algorithms. Pixel intensity
is the most commonly used single point measure, representing grey-scale images, whereas
colour can be denoted through hue measures or component colour intensity measures such
as RGB.
If the image source is of a binary nature, or has been reduced to a binary image through
pre-filtering techniques such as thresholding, then similarity between individual points is
either true or false. Unless a texture measure is being used within a wider neighbourhood
of pixels, the proximity or inter-connectedness of same state points controls the region
grouping procedure. Although extremely efficient, binary image representations lack the
detail required by most modern analysis tasks and are most useful in highly simplified
or controlled environments, where a limited number of non-complex image regions are
expected.
If image content cannot easily be reduced to a binary measure, such as in natural im-
ages, we can increase the dimensionality of our measures to represent linear pixel intensity.
The image is represented as a function consisting of greyscale light intensity measures. In
the general case, images form a three dimensional surface consisting of the pixel position in
the image and it’s corresponding intensity value. In photographic images, a similar image
intensity between nearby points in an image indicate that the two points share the same
reflectance and surface properties and are therefore likely to belong to the same surface
or region. Segmentation through reflectance intensity still represents a good trade-off be-
tween processing time and the level of image detail, although valuable colour information
may be lost in the process. Grouping by intensity values and proximity results in good
segmentations where image regions are Mondrian in nature and under ambient illumination.
Changes due to surface shape, texture and lighting conditions within an image scene will
usually result in an over segmentation.
When dealing with photographic images, the maximum amount of information we can
use from a single point is that of colour. Colour information helps differentiate between
surfaces with similar reflectance and can also be used to overcome some of the three di-
mensional lighting difficulties encountered when grouping by intensity only. RGB (Red,
Green, Blue) and CMY (Cyan, Magenta, Yellow) colour descriptions are essentially in-
tensity measures of different parts of the visual spectrum, and can be viewed as a simple
extension of intensity descriptions. HSB (Hue, Saturation, Brightness) descriptions can
be more problematic as they combine linear and polar coordinate systems, but have useful
inherent invariant properties. The increase of point information facilitates the generation
of semi-invariant measures such as normalized rgb and hue that prove more tolerant to
lighting and shading changes in an image, although the Mondrian constraint is still as-
sumed. Segmentation based upon colour can be more computationally expensive than
intensity due to the increase in information that requires evaluating.
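The normalized rgb measure mentioned above is trivially derived from raw pixel values. A Python sketch (illustrative, assuming an RGB image in a NumPy array):

import numpy as np

def normalised_rgb(img):
    # Chromaticities r = R/(R+G+B), g = G/(R+G+B), b = B/(R+G+B):
    # a simple semi-invariant measure, comparatively tolerant to
    # intensity changes under the Mondrian assumption.
    f = img.astype(np.float64)
    total = f.sum(axis=2, keepdims=True)
    total[total == 0] = 1.0                      # leave black pixels at 0
    return f / total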
Texture Measures
Texture measures are generated from information provided by a group of pixels within an
image. Texture descriptions are usually generated either through the use of statistical mea-
sures describing a given rectangular or circular neighbourhood around a point or through
a functional description of image data. The resulting increase in information, often at
the expense of extra processing and poorer boundary localisation, can be further used to
overcome complex shading effects in real imagery. More importantly, texture measures
are capable of grouping complex and highly textured images, which are typical of natural
scenes, into disparate regions. With the introduction of wider neighbourhood information
comes the problem of dimensionality: how to reduce the dimensions of the description to
a manageable form, and which particular neighbourhood size or shape is most appropriate
to describe a given texture region.
Random Fields
Random fields rely on the assumption that a given two-dimensional neighborhood of point
values can be modelled by the parameters of the distribution function they best fit. Given
a distribution function and the point values, these parameters can be easily calculated,
although in the general case of image segmentation the actual distribution of an entire
image is unknown. Following from the Markov property, which states that the probability
of a point value given an entire image is the same as in a smaller surrounding area, and
the general prevalence of Gaussian distributions, sample regions are often assumed to be
Gaussian. This special class of fields is called Gaussian Markov Random Fields. Another
common class, which uses exponential distributions determined from image point values,
are Gibbs Markov Random Fields. Random Field parameters can be used to determine the
mean, variance and the directional auto covariance of a neighbourhood of image points.
Dissimilarity measures
Texture dissimilarity from a known, or assumed, distribution can provide a sparse, ef-
fective, description of texture. Chi-Square measures can be implemented to determine
the variation of a table of image values from a table of values following an expected dis-
tribution. The Kolmogorov-Smirnov distance measure and Cramer-von Mises distance
estimator can both provide useful dissimilarity measures.
Co-occurrence Matrices
Co-occurrence Matrices, also referred to as Spatial Grey Level Dependence Matrices, are
a series of matrices that each represent a given set of offsets from a sample point. Each
matrix stores the number of different grey level combinations that occur at the given
offset. Co-occurrence Matrices can be used to determine a number of region properties,
including mean, variance, entropy, energy, contrast, and correlation. Both [Conners80]
and [Singh80] show that co-occurrence matrices provide the best texture discrimination
when compared against other approaches, with the possible exception of Laws [Laws80]
texture measures. An example of the application of co-occurrence matrices to segmenting
multi-spectral texture images can be found in [Kasari96].
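A co-occurrence matrix for one offset is simple to construct. The following Python sketch (illustrative, assuming an 8-bit greyscale image) counts quantized grey-level pairs at displacement (dx, dy):

import numpy as np

def cooccurrence(img, dx, dy, levels=8):
    # Count how often quantized grey level i occurs at offset (dx, dy)
    # from grey level j, then normalize to joint probabilities.
    q = (img.astype(int) * levels) // 256        # quantize to `levels` bins
    h, w = q.shape
    glcm = np.zeros((levels, levels), dtype=np.float64)
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            glcm[q[y, x], q[y + dy, x + dx]] += 1
    return glcm / glcm.sum()

Properties such as energy (the sum of squared entries) or entropy then follow directly from the resulting matrix.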
Auto-correlation
This statistical measure determines how much a sample function varies with itself as it is
offset from the origin. When applied to two-dimensional neighbourhoods it can be used
as a measure of texture coarseness in different directions.
Moments
Calculating the moments of a point neighbourhood can provide a wide range of useful
texture descriptions for use with segmentation. Mass, center of mass, variance, skewness
and kurtosis (4th order moment representing non-gaussianity) can all be derived in this
way and used to indicate texture similarity.
Edges/Extrema in Unit Area
A simple count or normalized average of edges, corners or other notable extrema within a
point neighbourhood provides an extra measure of texture similarity. The Edge frequency
method is a simple summation of absolute point differences at given sample offsets that
can provide effective smoothness or roughness measures, although it is not a very effective
discriminant between textures when used in isolation.
Laws Texture Measures
Laws' method [Laws80] produces a set of 14 rotationally invariant, or 24 rotationally de-
pendent, measures derived through the application of simple convolution kernels. The
basis kernels are a set of 5 one-dimensional kernels of length 5, representing level, edge,
spot, wave and ripple respectively:
L5 = [  1  4  6  4  1 ]
E5 = [ -1 -2  0  2  1 ]
S5 = [ -1  0  2  0 -1 ]
W5 = [ -1  2  0 -2  1 ]
R5 = [  1 -4  6 -4  1 ]
These kernels are convolved with each other to generate 25 combination kernels which
are applied to the image to generate 25 separate new images. These images are replaced
by their texture energy measures, gained by summing the points in a window around
each given pixel. The images are then normalized by the level-level convolved image, which
is usually not used further. A limited degree of rotational invariance can be achieved by
summing the set of 24 images with their rotational opposites. For example, the L5E5
image, which is sensitive to vertical edges, can be summed with the E5L5 image which
is sensitive to horizontal edges to generate a combined image which is sensitive to both.
This results in a set of 14 unique images derived from texture properties, which in turn
provides 14 texture measures for every point in the original image.
Although a relatively simple approach, Laws' texture measures actually provide a very
effective set of texture descriptions which have been shown in [Conners80] and [Singh80]
to outperform most other techniques, with the exception of co-occurrence matrices.
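The construction of the 25 combination kernels amounts to a set of outer products of the five basis kernels. A Python sketch of this step (the filtering, energy and normalization stages described above are omitted):

import numpy as np

# The five 1-D Laws basis kernels: level, edge, spot, wave, ripple.
BASIS = {
    "L5": np.array([ 1,  4, 6,  4,  1]),
    "E5": np.array([-1, -2, 0,  2,  1]),
    "S5": np.array([-1,  0, 2,  0, -1]),
    "W5": np.array([-1,  2, 0, -2,  1]),
    "R5": np.array([ 1, -4, 6, -4,  1]),
}

def laws_kernels():
    # Each 2-D kernel (e.g. "L5E5") is the outer product of two 1-D
    # basis kernels; convolving the image with all 25 yields the 25
    # images from which the texture energy measures are derived.
    return {a + b: np.outer(ka, kb)
            for a, ka in BASIS.items()
            for b, kb in BASIS.items()}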
Fractals
Fractals, and multi-scale multifractals [Talon00, Kam99], derive descriptions through the
analysis of the image content's fractal geometry, and are usually based around co-occurrence
features. Fractal analysis can be performed at different fractal dimensions, or scales,
and can also be used to determine the optimum fractal dimensions to implement with a given
image. This approach shares considerable similarities with wavelet based descriptions,
especially in its ability to reduce noise whilst highlighting multi-scale signal information.
Primitive length texture features
Primitive length texture features rely on measures derived from the fact that coarse fea-
tures will have a larger number of interconnected points at a similar grey level, whilst fine
textures will have smaller clusters. They are calculated by recording the number of con-
nected points in a neighbourhood with a particular length and grey level in each direction.
This information is then biased to highlight either short primitive lengths (fine texture)
or long primitive lengths (coarse texture), which can then be used to measure primitive
length uniformity, grey level uniformity and a primitive percentage.
Gabor Filters
A Gabor filter is a linear two-dimensional, local filter type similar to a local band-pass
filter that extracts information from an image’s spatial and spatial frequency information.
Because one filter only samples a particular spatial frequency and orientation, a bank
of multiple Gabor filters is commonly used to provide texture descriptions. Although
the results can be used directly to categorize image textures, they are more commonly
combined to derive other measures such as complex moments, grating cell operators and
Gabor-energy quantities. A comparison of the effectiveness of different Gabor filter ap-
proaches can be found in [Kruizinga99].
Although this filter type can be effective at describing texture features, the need for a
large number of individual features and their static nature has led to them being largely
superseded by more flexible signal based techniques such as Fourier analysis and wavelets.
Fourier Analysis
Originally designed for linear signals, the Fourier series expansion of a continuous and pe-
riodic waveform provides a means of expanding a function into its major sine/cosine or
complex exponential terms. These individual terms represent various frequency compo-
nents which make up the original waveform and possess properties that can be useful for
segmentation and image processing.
The Discrete Fourier Transform was developed to be used where both time and fre-
quency variables are discrete, and is particularly useful for image analysis problems where
the time element can be replaced with axis position and the function run for each line and
column of the image.
The Fast Fourier Transform is a class of special algorithms which implement the Dis-
crete Fourier Transform with considerable savings in computational time. Whilst it is
possible to develop Fast Fourier Transform algorithms to work with any number of points,
the number of points used is generally limited to powers of 2 to allow maximum
efficiency to be obtained.
Fourier techniques have many useful properties when applied to computer images and
are primarily used for filtering, compression and the extraction of textural information.
Once the DFT of an image has been generated, the resulting frequencies can be manip-
ulated and then re-transformed back into a reconstructed image using the inverse DFT.
High frequencies in the DFT correspond to high frequency image elements such as edges
with low frequencies corresponding to general large scale image structure. If high frequen-
cies are filtered out of the image we are left with the overall shape of the image content,
with edge details lost. Conversely, if low frequencies are filtered out, then we are left with
information about the edges and their location in the image but little about overall shape,
a method useful for edge detection algorithms. In a similar way, neglecting very low (close
to 0) magnitude frequencies and storing only the largest frequency coefficients can re-
sult in image compression and de-noising that still retains general structure. The image
Power Spectrum can be determined from Fourier analysis to provide information about
the frequency and direction of a texture pattern. While this information is useful when
applied to images or sections of image that contain regular patterns of constant interval,
segment edges in the image can have an adverse effect upon these descriptions. For this
reason texture descriptions using Fourier analysis are often generated for localized image
neighbourhoods within the larger image, and such descriptions are sensitive to the scale
of these neighbourhoods and the type of image content.
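The low-pass filtering described above amounts to masking the DFT and inverting. A crude Python sketch (illustrative, assuming a 2-D greyscale array):

import numpy as np

def fft_lowpass(img, keep=0.1):
    # Zero out all but the lowest `keep` fraction of frequencies in
    # each axis, then invert: edge detail is lost while large-scale
    # image structure is retained.
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    cy, cx = h // 2, w // 2
    ry, rx = max(1, int(h * keep / 2)), max(1, int(w * keep / 2))
    mask = np.zeros((h, w))
    mask[cy - ry:cy + ry + 1, cx - rx:cx + rx + 1] = 1.0
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

Inverting the mask gives the complementary high-pass filter useful for edge extraction.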
As component frequencies represent image features at different scales, this allows the
analysis of texture information at a multi-scale level and can help determine optimum
scales for analysis. Extensions of multi-scale Fourier analysis to include affine invariant
analysis have also been proposed [Hsu93]. The highly popular descendant of Fourier
analysis, Wavelets, was developed to address some of the difficulties inherent with the
use of continuous sinusoidal signals.
Wavelets
Wavelets were developed through a combination of different scientific disciplines to address
the non-localised nature of the sinusoidal components used with Fourier analysis. Both
Wavelet analysis and Fourier analysis map discrete data into the frequency domain and
share many of the same features and advantages in texture description. The main differ-
ence between the two is that, unlike sines and cosines, wavelet functions are localised in
space and frequency. This localisation results in wavelets producing a much sparser
representation than Fourier approaches, which makes them more effective for application
to noise removal, compression and feature detection. Wavelets can also be applied with
varying image window sizes, which is an extremely useful property where the localisation
of image discontinuities is required. Many wavelet approaches use small windows with
high frequency analysis to detect discontinuities and apply larger windows with lower fre-
quencies to obtain detailed image frequency analysis. Another advantage is that whereas
Fourier analysis is restricted to two kinds of basis function, sine and cosine, there are a
large number of potential wavelet basis functions to choose from. This can result in more
effective analysis for different image content. Examples of wavelet types are the fractal
Daubechies, Coiflet, square-wave Haar, and Symmlet wavelets. An overview of wavelets,
their formation and applications can be found in [Graps95].
2.6.2 Region grouping processes
There are a wide variety of algorithms available to implement the segmentation of an im-
age and their effectiveness varies according to the nature of image content, homogeneity
measures and the intended nature of the segmentation. Bottom up algorithms begin with
low level information and gradually build regions through some form of merging process.
Top down approaches begin with the entire image as the first segment and iteratively
split segments until a threshold homogeneity is reached. Determining the optimal di-
mensionality for a segmentation is crucial, especially in the case where a single segmented
image surface is required. Top down approaches can settle on sub-optimal solutions due to
splitting across regions and bottom up segmentations are adversely affected by noise and
image imperfections. Both approaches have their limitations, so hybrid algorithms featur-
ing splitting and merging processes are often implemented in an attempt to integrate low
level and high level information. Some approaches attempt to overcome the dimensionality
problem by producing a set of segmentations at different scales and either post-evaluating
them or returning them all as valid results. The main difficulty with this approach, apart
from the lack of a definitive solution, is the increased processing requirements of these
algorithms and the need to maintain multiple different scale representations of the original
image at the same time.
The nature of the segmentation required is also important in influencing the choice
of algorithm. In the case of satellite imagery, where the desired texture elements
are already known, the algorithm becomes a basic search for the nearest match to
the known elements. Where the number of segments required is known, then the task
is to subdivide image content into the given number of regions whilst optimising some
homogeneity measure. The general case, where no prior assumptions are made about the
number or nature of regions, relies totally upon the evaluating function to define the
segmentation and is less likely to obtain useful segmentations. Region connectivity is also
an issue: many tasks require regions to be grouped together spatially, whereas for some
tasks this is not important and disconnected regions of similar properties can share the
same label; histogram techniques, which disregard spatial information, are usually much
more useful in the latter case.
Pre-attentive Segmentation
Pre-attentive segmentation algorithms represent pre-processing algorithms that can aid
in successive segmentation tasks. Such algorithms commonly analyse raw image data for
low level discontinuities, smooth contours and pop out targets and mark these as areas of
interest for other segmentation processes. [Yu01] presents such a strategy for pre-attentive
segmentation based upon nondirectional repulsion, detecting boundaries as discontinuities
in texture orientation.
Histogram Techniques
In some cases it may be advantageous to simplify the segmentation process through the use
of histogrammic representations. Histograms are usually implemented where images can be
separated into simple regions or regions of anticipated description, or as a preprocessing
step in seeding likely regions for other segmentation algorithms. Disparate regions will
have a tendency to show as peaks in a histogram of the image and troughs will indicate
potential segment boundaries. Although there may be labelling difficulties stemming from
the removal of spatial information, where unconnected regions are assigned the same label,
histogram segmentation can provide an efficient method of segmentation and is especially
useful where simple bimodal segmentations are required.
Histogram thresholding is one of the most commonly used methods for identifying
image regions, where all values below the threshold are disregarded as background and all
remaining image points are assigned labels corresponding to their histogrammic peaks. A
multi-scale or adaptive variation of this is the use of a Histogram Watershed, where the
threshold is gradually increased and adjusted.
Common histogram based segmentation algorithms are:
The Midpoint Method
Minimum Error Method
High-pass masking
Local and Global Thresholding
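As a concrete example of global histogram thresholding (Otsu's method, included here as a well-known illustration rather than one of the methods listed above), the threshold can be chosen to maximise the between-class variance of the grey-level histogram:

import numpy as np

def otsu_threshold(img):
    # Try every threshold t and keep the one maximising the
    # between-class variance w0*w1*(mu0 - mu1)^2 of the two classes.
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * p[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * p[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t            # pixels >= best_t form the foreground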
Primitives/Texture Databases
Primitive based segmentation processes subdivide an image space according to
its similarity to a database of anticipated textures. Where image content is known, such
as landsat imagery, then this reduces the problem to a least-cost matching process and
can generate very effective results. Statistical texture descriptions within the image are
compared with a library of textures and assigned labels that provide the closest match, a
process almost identical to texture recognition.
Because of the success of this constrained method and the growing need to automati-
cally process landsat data, this approach to segmentation has been widely researched and
implemented. See [Ma95] and [Austin96] for examples of this kind of segmentation.
Adjustable Models
Adjustable models rely upon user feedback to identify incorrect segment classifications
and adjust themselves to accommodate this.
Watershed Segmentation
Watershed segmentation is a well established approach that is usually implemented in non-
textured, Mondrian images and could be extended to deal with texture. Object boundaries
represent high values in the 1st derivative, edge space, of an image, whereas region cen-
tres represent shallow basins. The grouping of regions in edge space can be modelled as
comparable to water flowing downhill and collecting in the region basins.
One method of watershed segmentation is to treat each image point as a particle that
seeks the path of minimum cost through edge space until a stable state results, essentially
flowing downhill in edge space. Eventually, every point in an image will settle in a region
basin, and those points that belong to common basins can be assigned the same label in the
original image.
Another approach uses gradual flooding to determine regions, all edge values below a
threshold value are considered to be ’under water’ and are given a common label according
to which basin they share. As the threshold increases, more and more of the edge space
will be below the threshold until separate region basins will eventually meet. Once regions
meet, their labels are fixed and the flooding process is carried on with other regions until
the entire image is labelled and segmented.
Region Growing
Region growing methods represent bottom up approaches that begin by clustering, or growing,
regions from primitive data points into larger scale segment descriptions. An example
of this is Pairwise clustering, where each point in the image is initially given a unique
segment label and an iterative process is used to gradually merge together similar segment
descriptions into a larger single segment. Although this process can be initially slow, with
many segments to consider, the process can become progressively faster depending upon
the exact nature of the segment comparison and grouping algorithms.
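A minimal single-region version of this bottom-up process is a flood fill from a seed point. A Python sketch (illustrative only, using a fixed intensity tolerance):

import numpy as np
from collections import deque

def grow_region(img, seed, tol=10):
    # Flood-fill outwards from `seed`, absorbing 4-connected
    # neighbours whose intensity is within `tol` of the seed value.
    h, w = img.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    seed_val = int(img[seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w and not mask[ny, nx]
                    and abs(int(img[ny, nx]) - seed_val) <= tol):
                mask[ny, nx] = True
                queue.append((ny, nx))
    return mask              # boolean mask of the grown region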
K-Means Nearest Neighbour
This classic approach to segmentation has many variant methods of application and ini-
tialization. A number of initial region labels are usually chosen and their position in the
image is seeded either randomly or through some rough pre-processing such as histogram
thresholding. The iterative process examines each image point’s position and description
to determine which region the point lies closest to. Once all points in the image have been
assigned to their nearest region seed, the position of each region is recalculated as the
mean position of all member points and the assignment process is repeated. This usually
results in a movement of the region centre to a position of less-cost until all region centres
eventually reach a stable or periodic state. The final labelling of image points then rep-
resents the optimal position for each segment, given the starting conditions. K-means is
both a popular, efficient and effective technique but it’s sensitivity to initial seeding con-
ditions can make the results less predictable than other approaches. [Singh99] developes
the standard K-Nearest neighbour algorithm and applies it to texture segmentation.
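The assignment/update iteration described above is compact. A generic Python/NumPy sketch (illustrative, not the thesis implementation; `points` is assumed to be an (n, d) array combining pixel positions or descriptions):

import numpy as np

def kmeans(points, k, iters=20, seed=0):
    # Alternate two steps: assign every point to its nearest centre,
    # then move each centre to the mean of its assigned points.
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), k, replace=False)].astype(float)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None] - centres[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    return labels, centres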
Split and Merge
Split and merge algorithms base segmentation upon a top down approach, where the image
is recursively split into homogenous regions which are themselves subdivided until some
cost function is satisfied. The most common approach to split and merge is the use of
Quad-Tree splitting, which subdivides using a rectangular geometry, an example of which
can be found in [Smith94]. The main drawback to this splitting procedure is the artificial
imposition of a given geometry to the splitting process, which may result in regions that
span the geometry being incorrectly categorised as different segments. To address this, a
merging algorithm is usually applied to the final segments to correct any such errors.
Image Pyramids
Image pyramids attempt to address the problem of segmentation scale by providing a
multiscale representation of image content. Often implemented using Laplacian filters,
image pyramids are also extensively used in conjunction with frequency domain filters
such as Gabor, Fourier and Wavelets. These pyramid structures represent a bank of filters
that sample image content at gradually increasing resolutions, presenting a multiscale and
sometimes multi-directional description of image content.
Steerable image pyramids use basis functions which are directional derivatives; this
allows the pyramid's filter output in any direction to be calculated from a simple weighted
sum of previously determined function outputs. Steerable pyramids can be thought of as
a type of over-complete wavelet transform.
An application of Steerable Pyramids to achieve invariant texture recognition can be
found in [Greenspan94].
Simulated Annealing
Simulated Annealing is a function minimisation approach modelled upon the annealing
process in metals. Although many variants exist, the fundamental process
is to gradually arrive at a solution by probabilistically minimising an objective, or cost,
function. The threshold applied to this cost function is initially set high, so that many potential
solutions can be investigated without settling upon a local minimum, and is gradually reduced
until an optimum solution is found.
In practical terms, an initial segmentation of the image is first generated, often ran-
domly, and a slightly modified version is compared to it to determine if it represents a
better solution. If the new segmentation is better, or within the tolerance threshold, then
the new representation is kept and the process is iterated. After a given number of such
iterations, the threshold is reduced and the process repeated. This is analogous to reducing
the energy of the system until it ’freezes’ at the solution. In this way, an optimal segmen-
tation of the image will eventually be produced and the system is likely to have avoided
settling into early non-optimal solutions due to the higher tolerances in early development.
[Cook96] presents an example of Simulated Annealing to achieve image segmentation.
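The acceptance step at the heart of this process is small. The following generic Python sketch (illustrative; `cost` and `perturb` stand for problem-specific segmentation functions) uses the usual exponential acceptance rule:

import numpy as np

def anneal(initial, cost, perturb, t0=1.0, cooling=0.95, steps=2000, seed=0):
    # Accept a worse candidate with probability exp(-dE/T); lowering
    # the temperature T gradually "freezes" the system at a solution.
    rng = np.random.default_rng(seed)
    state, energy, t = initial, cost(initial), t0
    for step in range(1, steps + 1):
        candidate = perturb(state)
        de = cost(candidate) - energy
        if de < 0 or rng.random() < np.exp(-de / t):
            state, energy = candidate, energy + de
        if step % 50 == 0:
            t *= cooling                     # reduce the tolerance
    return state, energy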
Deterministic Annealing represents a combination of deterministic approaches guiding
the Simulated Annealing framework, giving greater control over initial and new seg-
mentation propositions. Although usually faster to arrive at a solution, Deterministic
Annealing is more likely to fall into local minima due to the less randomised nature of the
algorithm. Examples of segmentation through Deterministic Annealing can be found in
[Hoffmann98],[Hoffmann97].
Genetic Algorithms
As in most NP-hard search problems, segmentation can be achieved iteratively through
the use of Genetic Algorithms. Similar in nature to Simulated Annealing and modelled
upon natural evolution, this approach relies on a survival of the fittest process. Multiple
candidates for segmentation are first generated and then evaluated against a fitness func-
tion; those that perform the best, and a few chosen at random to maintain diversity, are
then ’mated’ with each other to create the next generation of segmentation candidates.
This next generation, generally seeded from the good candidates, is likely to represent
even better segmentation solutions. Gradually, after many such generations, an optimal
solution will be reached. Genetic algorithm based segmentation of natural scenes can be
found in [Albert90].
Bayesian Classifiers
A Bayesian approach can also be applied to image segmentation, using probability rules to
compare and classify different regions of texture or points in the image. The Iterated Con-
ditional Modes (ICM) algorithm, first proposed in [Besag86], which relies heavily upon the
Markov random field probability model, is commonly used in a Bayesian framework for
image restoration and can be applied to segmentation tasks. This algorithm relies upon the
high probability that an image point's value will be the same as neighbouring point values,
and iteratively updates and relabels these until a stable state is reached.
[Forbes98] provides an example of the use of Bayesian classification for segmentation,
implementing ICM.
Voting
The voting method is usually applied in a single scale texture segmentation process where
image content is subdivided into a series of overlapping grids. For each grid the nearest
predefined texture class is found and every point in the grid casts a vote for that class.
At the end of this relatively fast algorithm, each point is examined to determine which
class it voted for most often and is assigned the corresponding region label. Voting provides a fast grid-based
segmentation process that, to some extent, overcomes the effects of implementing the grid
geometry.
Minimum Length Encoding
Minimum Length Encoding is particularly useful where segmentation of images contain-
ing non-textured Mondrian surfaces is required. Based upon the Occam's Razor Princi-
ple, minimum length encoding examines polynomial function representations of an input
waveform, or series of image points, and subdivides them into a series of minimum cost
best-fit polynomials. This is usually achieved by generating a graph of adjacent potential
polynomial pairs along with a cost function associated with the complexity of replacing
them with a single polynomial. This process results in a sparse set of functions which,
when applied in segmentation, naturally correspond to individual surfaces of both different
colour/intensity and surface shading.
For this reason, Minimum Length Encoding can be most effective when dealing with
non-textured image segmentation, as can be seen in [Keren89] and [Peleg90].
Behavioural Systems
Behavioural systems use communal behaviour models, such as the complexity exhibited
by bee colonies, to generate self organising systems. These techniques define simple be-
havioural rules for low-level inter-communicating entities which lead to large scale complex
organisation. [Ramos00] and [Chialvo95] propose the use of swarm modeling for self or-
ganisation tasks such as segmentation and cognitive modeling. The self-organising nature
of these iterative processes can be seen as similar to many self organising neural network
approaches.
Neural Networks
Neural Network techniques lend themselves well to image segmentation as their parallel
nature can overcome many of the processing problems encountered in serial algorithms.
Self organising neural networks and Oscillatory Correlation nets [Chen00] such as LEGION
[Wang97] are ideal for the data clustering required for unsupervised segmentation. Most
networks suffer where region continuity is required as their parallel approach is not easily
applied to differentiating between separate regions of similar properties. [Austin96] uses
the AURA neural network architecture to search and label images from a library of pre-
encoded textures.
2.7 Signatures
Signatures form a simplified, non-reversible, accumulation of variables that can be used
for fast and approximate recognition. Sharing aspects in common with histograms and
parameter-space representations, signatures are usually formed from the plots of two or
more independent variables against each other (figure 2.17). While information is lost,
signatures can be a highly efficient means of fast image trait recognition. Plotting
two variables against each other in a two-dimensional space provides much
greater discrimination and separation of values than the use of single dimensional his-
togram descriptions. These plots can be linear or binary in nature, with the generality of
the description easily manipulated through the careful selection of signature dimension,
resolution, smoothing and storage. Care must be taken to constrain variables to maximum
and minimum ranges to facilitate normalization into signature coordinates. A common
approach is to use the results of variable ratios, constraining results to a 0 ≤ v ≤ 1 range,
which is also a common method used for the generation of localized geometric invariant de-
scriptions. Signature representations provide a very natural framework for the storage and
recognition of invariant description data [Weiss88, Squire00].
Figure 2.17: A simple two dimensional signature plotting the green-red and blue-red colour
ratios against each other
Earlier work using invariant measures combined with signature storage and statistical
comparison showed promising results, and it was originally anticipated that signature
storage would be the most likely method used in this thesis work. Towards the end of the
work, however, signature storage was abandoned in favour of one-to-one label based methods.
Chapter 3
Preliminary Work and
Experimentation
3.1 Signature Storage
A brief review of signature storage was presented in section 2.7; this work looks
into the practicality of encoding two dimensional signatures from invariant descriptions for
image component recognition. This section details preliminary work into the effectiveness
of signature storage and possible ways of encoding these to facilitate generalization and
efficient storage.
Local invariant values are usually generated as the expression of the relationship of one
image region property to another, and are therefore fractions by nature. The first issue
to be addressed is our strategy to normalize coordinate values into a predefined range
for each axis. The most convenient method of limiting ratio values is to ensure that the
smallest value is divided by the largest, limiting all results to a 0 ≤ v ≤ 1 range (0 where
both components are 0). While information about the directionality of the ratio is lost, if
consistently applied then this information is redundant as only one point for every pairwise
description is required. A greater problem is likely to occur where ratio components can
have different signs (and cannot be adjusted into a positive range), although this is not
common for image feature descriptions. For these preliminary tests, a signature size of
160 by 160 was selected.
Neighbouring Region Pairs or all Region Combinations?
The proximity of features and regions is regarded as an indicator that they will be in some
way related to each other [Alwis00]. When dealing with three dimensional image content,
there is a conflict between the need to generate descriptive invariants (where components
should be sufficiently different) and the need to limit components to localized ranges where
Mondrian planar geometry assumptions are most likely to hold (but components are likely
to be similar). The use of localized components also reduces the number of calculations
required from quadratic O(N²) growth to a more manageable linear range using only a
predefined subset of nearest neighbours. A neighbour constraint was incorporated into the
algorithm that could limit the signature values to components from neighbouring groups.

Figure 3.1: The set of 12 natural query images

Figure 3.2: The set of 12 natural images searched
Tests were conducted on the 12 natural images (figure 3.1, 3.2) using combined results
of simple geometric (area ratios, circularity ratios) and optical invariants (normalized
colour ratios) with the neighbour constraint switched on and off. The initial segmentation
methods tested were split and merge and watershed region growing techniques, with
multi-generation watershed region growing eventually selected to generate the best region
primitives. General scores attained in the target image match are reduced when the neigh-
bour constraint is active. However, the direct comparison of raw match scores between
different signature generation architectures is not appropriate in this case, as each image sig-
nature and resulting score range is unique to that invariant type. A more reliable indicator
of cumulative performance is the sum of rankings. The rank of the target image represents
the degree of image match compared to the full dataset of potential matching images, in
this case a rank of 1 indicates a successful match. Rankings for signatures results are
expressed as percentages.
If rank is taken into consideration then it was found that the use of the neighbour
constraint resulted in a marked improvement in the search results and the number of
successful image matches.
Region Culling
The next stage of this preliminary work involved looking into the effect of removing un-
wanted, underdeveloped groups from the signature input. A culling algorithm was applied
to discount groups that were too small to provide any useful geometric information (lead-
ing to distortion of ratio results due to image aliasing) and those intersecting the boundary
of the image (shape information is likely to be clipped).
if I > B/3 then cull
if B + B/2 > A then cull

where:
B = region boundary length (in pixels)
I = region boundary pixels intersecting the image boundary
A = region area (in pixels)
It was found that this kind of culling resulted in signatures that were too sparse when
applied to images of line drawings or images with few non-boundary regions (figure 3.3).
To avoid the culling of linear features, size based culling was reduced to a static threshold
based around group area.
The modified region culling and the neighbour constraint algorithms were tested on a
test-set of 12 natural images (figure 3.1, 3.2). Tests were carried out on both of the primary
signature types, one set for geometric properties such as area and relative position and
another on normalized colour. Little difference was found in recognition results while
processing speed was increased significantly, indicating that some form of boundary and
area based culling would be advantageous to our final algorithm.

Figure 3.3: Difficulties arising from the use of the neighbour region constraint and bound-
ary based culling. Generation of pairwise invariant signatures would be impossible in these
examples, where the grey areas are culled and each remaining region therefore has no
immediate neighbour.
Signature Comparison Criteria and Signature Storage
Initial development of the test software implemented direct binary signature storage. Both
invariant coordinates were calculated and scaled up to the appropriate signature size and
the point set to true on a binary array of the same size. These binary arrays (the signatures)
were then compared directly to each other. Figure 3.4 shows the original direct signature
storage and scoring criteria.
It was found that, unless a coarse signature size was used, this method of storage
lacked sufficient generalization and tolerance to small deviations in signature components.
To increase generalization, our next set of tests implemented non-Gaussian smoothing of
cumulative signature bins before reduction to binary storage for comparison (figure 3.5).
signature bin = 0 if S < M; 1 if S ≥ M

where S = the value of the cumulative signature bin and M = the mean bin value.

Figure 3.4: Direct storage of binary invariant signature.

Figure 3.5: Initial non-binary signature blurring and final binary signature.
The next test set was generated in the same way, but using component areas to in-
crement the signature points (rather than a simple count) before blurring and conversion
to binary storage (figure 3.6). The rationale behind this was that larger objects are more
perceptually significant and will be most tolerant to aliasing effects.
Unless we significantly decrease the resolution of signature descriptions, their direct
storage (even those reduced to binary representations) does not form a sufficiently compact
representation, especially where large numbers of signature descriptions for each image are
anticipated. It would also be convenient if our storage methods included the generalization
we have so far been using signature blurring to achieve. For these reasons, it was decided to
test the use of multi-scale moments to encode signature descriptions into a more compact
label form. The signature is subdivided into grids of different scale (figure 3.7), each grid
generates two coordinate values representing the normalized deviation of the grid central
moment from the actual center, each converted to 8 bit storage. The grid subdivision was
iterated six times to constrain the data storage requirement to one comparable with direct
binary storage. The actual resulting storage requirement at this resolution was still less
than the equivalent direct binary storage.

Figure 3.6: Initial non-binary signature blurring, using area bias, and the final binary
signature calculated from the mean value.
Final direct binary signature size = 160 × 160 = 25600 bits
Final moment signature size = 2(1 + 4 + 16 + 64 + 256 + 1024)(8) = 2(1365)(8) = 21840 bits
The actual method used to divide the signature image into multi-scale grids is shown
in Figure 3.7, and decreases the grid size by half at each level of scale. The final grid level
of scale actually implemented in the test algorithms was chosen to be 6, largely because
the storage requirements for the list of moments were then equivalent to those required by
the direct binary signature storage method. To avoid the need for grid weights, and to avoid
bias towards signatures with widely dispersed plots, empty grids were assigned the
moment value of an average distribution (0,0), representing the center of the grid.
Evaluation of signature similarity was then performed by calculating, for each grid cell,
the normalized length of the vector formed between the two stored moment vectors and
summing the results.
All five methods of signature storage (direct binary plot (D), blurred binary plot (DB),
blurred binary plot with area bias (DBA), multi-scale moment storage (M) and multi-scale
moment storage with area bias (MA)) were compared against each other.
Figure 3.7: The signature is divided into multi-scale grids and the two vector coordinates
for each central moment are stored in a single combined list. A 5 level example would
require 2(1 + 4 + 16 + 64 + 256) = 2(341) = 682 integer entries.
Neighbour Constraint: TRUE
Rank classification of the target image (1 = correct recognition):

Storage method:  D     DB       DBA  M        MA
Mean Rank:       3.75  2.33333  2.5  1.16667  1.16667
These results indicate that as well as allowing economical storage, multi-scale mo-
ment storage can produce improved recognition results, with implicit generalization, when
encoding signatures.
3.2 Initial Experiments with Linear Gestalt segmentations
While the development from traditional pixel-based segmentation methods toward a truly
Gestalt engine will be discussed later, it is first prudent to consider Gestalt grouping in a
more simplistic context. Images specifically designed to test Gestalt grouping principles
are pre-segmented using traditional segmentation algorithms to generate simple region
primitives that can then be used in conjunction with Gestalt grouping algorithms. In this
way, different approaches towards Gestalt grouping can be implemented and tested, and an
insight into the difficulties and issues of incorporating Gestalt grouping principles can be
gained. A truly Gestalt engine would need to apply Gestalt rules (as discussed on page 15)
to larger scale visual objects that can provide richer descriptions. Although our goal is to
develop an algorithm that is capable of dealing with both types of segmentation (symbolic
and photographic), it is appropriate to study both forms in isolation before beginning the
integral engine.
A segmentation/grouping engine was created based around similar fundamental princi-
ples, but specifically designed to operate upon simplified image content and the application
of Gestalt rules. A library of Gestalt images was created which could be used to test this
algorithm's ability to perceive symbolic Gestalt groupings using the three most important
Gestalt rules: proximity, similarity and continuity (figure 3.8).
Specifically, the algorithm was to work on pre-segmented primitives and attempt to
find the best linear groupings. For example, the top left image in figure 3.8 would be
presented to the algorithm as a set of 45 nodes describing the position and appearance of
the 45 distinct black object primitives in the image (the white background is discarded).
Using these primitives, the image can be grouped into a number of Gestalt regions which
depend upon the scale at which the image is examined. In figure 3.10 we can see the
two most apparent groupings, a purely linear grouping that forms 9 distinct groups and
a larger scale grouping of these lines to form 3 distinct groups/clusters. As it is the first
level of grouping into new linear primitives that this algorithm is designed to examine, we
would expect it to output the 9 group result given this image.
The Linear Gestalt Grouping algorithm was developed from the algorithm originally
detailed in [Thorisson94] to simulate perceptual grouping. This algorithm clusters pre-
generated image primitives into perceptually significant groups based upon similarity of
appearance (colour, shape, brightness, size) weighted by their proximity within the image.
Their approach is to create separate ordered edge lists for each appearance attribute by
calculating the similarity of each attribute between all pairs of objects in the image and as-
signing them an inversely proportional score. The resulting lists are then further weighted
by a score inversely proportional to the proximity of each pair of primitives that formed
the edge. The edges are then sorted with the highest scores at the head of the list, ensuring
that the most perceptually significant groupings (those that are similar in appearance and
proximal to each other) will be processed first. The algorithm then iteratively searches
through each list looking for the most significant difference between adjacent edge scores
(essentially evaluating second order edge differences). As each significant difference is
found, all primitives belonging to edges prior to that difference are labelled as a grouping.
The lists are then iteratively searched in the same way for smaller and smaller significant
differences until a minimum threshold value is reached. This process results in sets of per-
ceptually significant groupings and their subgroups for each of the appearance attributes.
Groupings from the different appearance lists are then combined (with identical groupings
merged) and assigned a 'goodness' score to form the output of the algorithm. Figure 3.11
shows a typical output from the Thorisson algorithm.
Whilst the use of ordered lists provides a useful starting point for Gestalt processing,
[Thorisson94] is only capable of detecting Gestalt cluster types based around similarity
and proximity. This results in the drawback that perceptually significant clusters of the
same type, but separate from each other in the image, are assigned to the same group.
For this thesis, we require the ability to separately identify such disconnected grouping
of the same type, so some modification to this algorithm is essential if we are to use
elements of this work in this thesis. A further drawback is that, in its original form, this
algorithm contains no elements relating to continuity or linearity, something which this
work intended to investigate.
The algorithm, whilst providing a useful starting point for this work (in particular
the use of ordered edge lists and appearance descriptions), was eventually altered into
something very different. The Linear Gestalt Grouping algorithm that evolved from this
starting point was to use a single edge list of combined features, first order edge differences
and the imposition of both weighting by continuity and a linear constraint to generate
groupings, as detailed below.
Similarity measures, data structures and equations relating to Gestalt similarity largely
follow those detailed in the original paper. The algorithm relies heavily on self-
normalization and treats the region similarity description types separately until final com-
bination. Proximity information is based around the minimum distance between region
boundaries, rather than region centroid, and is implemented as a function of the separate
similarity measures. An overview of the algorithm can be seen in figure 3.12.
Once a simple thresholded nearest neighbour algorithm has extracted the basic objects
from the highly artificial Gestalt images, the algorithm generates a set of edge lists. Each
edge list represents the difference between region pairs in that particular attribute. Size
(e1), colour (e2), orientation (e3) and circularity (e4) edge lists are generated and weighted
by the boundary proximity of the regions.
Size Difference: e1 = (s1 − s2)²

where s1 and s2 represent the areas of the two regions, in pixels.
Colour Difference: e2 = (R1 − R2)² + (G1 − G2)² + (B1 − B2)²

where R1, G1, B1 and R2, G2, B2 are the RGB colour components of the two regions.
Orientation Difference: e3 = min(θ1, θ2)

where θ1 and θ2 are the two angles (θ and 180° − θ) between the unit orientation vectors
o1 and o2.

Orientation vectors are calculated using second order correlation moments from the
centroid of the region. If ri is the set of pixel positions in region i (disregarding pixel
colour information), then:
oi = ( Σ over v⃗ ∈ ri of D(v⃗) ) / si ≡ (sum of vectors from the centroid) / (number of pixels in region i)
Function D() ensures that vectors in the left hand plane are reversed into the same
180° range:

D(v⃗) = c⃗i − v⃗ if (x component of c⃗i) > (x component of v⃗); v⃗ − c⃗i otherwise

with c⃗i being the centroid of region ri.
Circularity Difference: e4 = (circ1 − circ2)²

where circ1 and circ2 measure the circularity or linearity of the two regions and are
defined over the range 0 ≤ circi ≤ 1 (0 → circle, 1 → line):

circi = (bi − ai) / (si − ai) if bi > ai; 0 otherwise (region too small to use the boundary count)

ai = √(4π si)

With bi being the number of pixels that lie on the region boundary and si being the
number of pixels that form the region. ai is the circumference the region would have if it
were circular in shape, and also represents the theoretical minimum possible boundary for
the region. In reality it is possible for this theoretical minimum to exceed the boundary
pixel count (which is approximated by the image grid) of small regions, so a check is made
to compensate for this.
Once these attribute difference lists (ejk, where j indexes the attribute type) are gener-
ated from each region pair within the image, they are self-normalized (using the maximum
and minimum list values to map scores into the range 0 ≤ v ≤ 1) and summed with the
normalized minimum boundary distance (dk) between the region pair k:

dk = minimum distance between the pixel boundaries of the two regions joined by edge k

Normalized ejk ⇐ √(ejk² + dk²) / √2
While this algorithm makes use of the minimum boundary distance, which provides the
true minimum distance between two image regions, calculating this distance requires all
boundary points in one region to be compared with all boundary points in the other. This
approach can prove to be a considerable drawback where speed of operation is an issue.
Later work relies on the region centroid difference, calculated using first order moments,
as a less accurate but more efficient distance component (see figure 3.13).
Each list is now sorted so that ej,k−1 ≥ ej,k ≥ ej,k+1
It was found that the self-normalization of description types was causing unwanted
bias towards certain description types when the description lists are combined to form an
overall score. This also resulted in an over-sensitivity to change in image content when a
particular description had a narrow range of values. A different approach to normalizing
the score values of list items was implemented that would allow the production of an
overall score based, instead, upon the rank of an item within a description list.
ejk = k / (size of description list)
With this alteration, edges that have the same significance (but different score mea-
sures) in differing feature description lists are considered equivalent. The replacement of
actual proximity values with list placement values also neatly circumvents any potential
unit bias when combining significance scores of different feature types (for example, size
and colour) for a single edge.
We now have a set of j ordered edge lists, one per attribute, each with 0 ≤ k ≤ Size(ej)
entries, running from unlikely connections with small scores to the most likely with higher
scores: ej,k−1 ≤ ej,k ≤ ej,k+1.
At this point a different approach was taken to [Thorisson94], which continues treating
the attribute lists separately and runs the grouping algorithm for each, only combining
extracted groups afterward. We eliminate the need to run a separate grouping process for
each description type by combining the individual description lists into a single master list
(m) such that, for each region pair k:

mk = ( Σ from j=1 to J of ejk² ) / J

where J is the number of feature types.
The final master list is sorted so that pairings with higher scores (those with the most
perceptual significance) are at the front of the list.
The following data structures are required by the Linear Gestalt Grouping algorithm.
The main structure is the sorted edge list Edge which is traversed during algorithm ex-
ecution and contains pointers to individual nodes. Individual nodes represent the image
(region) primitives that are presented to the algorithm before execution and are stored in
the node list Nodes. Before execution, each node points to its own separate region, stored
in the list Regions, and vice versa. During execution regions are merged and developed by
having parent region node references added to their node lists and the component nodes
pointed toward the new child region structure. During execution, nodes also act as junc-
tions between edges; the linear constraint in the algorithm prevents more than two edges
connecting to each node (see algorithm 11.2 and figure 3.14).
The main Linear Gestalt Grouping algorithm (algorithm 11.3) runs through the edge
list, evaluating continuity and junction constraints and reordering edges once continuation
information is available. Edges that remain in the current position and follow the linear
constraint form bridges between regions, which are combined.
Each time an active edge is reached in the sorted edge list Edge it is first checked
in JunctionTest() to ensure that the edge could form a valid connection. Preventing
more than two edges from connecting to the same node ensures linear groupings only
and checking that each end of the edge resides in a different region prevents unnecessary
computation. Continuity is then evaluated each time a new edge junction occurs (11.5). If
neither of the nodes pointed to by the edge are connected by other edges then no continuity
evaluation is possible and the edge is automatically accepted. If the edge’s nodes connect
to other edges then a continuity evaluation is possible. In this situation the angle between
each edge junction is calculated (11.6) and the perceptual significance score of the edge is
altered according to the continuity of the worst junction and the predefined Continuity
Bias.
Edge.Score = Edge.Original Score + (Minimum Junction Angle / 180°) − Continuity Bias

where 0° ≤ Minimum Junction Angle ≤ 180°, so the normalized angle term lies in the
range 0 to 1. The result is then clamped:

Edge.Score = 0 if Edge.Score < 0; 1 if Edge.Score > 1; otherwise unchanged.
If the junction angle is lower than the predefined minimum angle (MC ) then the
junction is too acute for the continuity constraint and the edge is rejected. If the edge
is not rejected and the new score differs from its previous value then it is repositioned
in the list according to this score. If the edge has already been repositioned according
to continuity constraints then the new edge score will be the same as the previous score,
which indicates that the edge can be accepted and the regions merged. The algorithm repeats
this process until all edges have been evaluated and the output is a set of region groupings
that adhere to proximity, similarity and linear continuity constraints.
Discussion
Figure 3.15 shows how effective the Linear Gestalt Grouping algorithm is on the test
images it was designed to deal with. With simplified images where clustering does not
play an important part of the perceptual grouping process the algorithm correctly groups
the patterns into linear primitives. As the algorithm only updates edges according to local
junction continuity constraints, and does not update edge proximity or similarity scores as
regions develop, it is incapable of performing a multi-scale segmentation in its
current form. Given a series of primitive regions to work with, this algorithm can only
find perceptually significant linear groupings using the original primitives alone. Figure
3.19 shows the different linear regions that form the output from the algorithm.
As shown in figure 3.21, the influence of the Continuity Bias (CB) variable is fairly minimal
when compared to Minimum Continuity (MC) and serves to determine the value at which
continuity information begins to increase or decrease initial edge score values.
As discussed earlier, this algorithm is not designed to detect groupings where non-
linear clustering would be required. Figure 3.17 shows the confusion caused when the
Linear Gestalt Grouping algorithm attempts to segment such image types. These group-
ing errors are further compounded by the algorithm's determination to locate all linear
connections before termination, which results in a tendency towards over-segmentation and
perceptually insignificant connections. One possible method of dealing with this prob-
lem would be to iterate the algorithm and feed grouped regions back into the process. As
illustrated in figure 3.24, this would allow the generation of larger scale primitives and
allow cluster regions to be generated. The primary impediment to implementing such
a process is the need to determine when the linear segmentation should stop trying to
form new connections. At this point it remains unclear how this stopping point could be
evaluated so that more complex composite images such as used in figure 3.21 would cease
grouping once each sub-image is correctly grouped (as in 3.9 and 3.14). This tendency to
form links between what should ideally remain disparate regions (at least in that given
grouping generation) also highlights the difficulty of deciding just which types of connection
should form the basis of a bridge between two region primitives in the first place (figure
3.25). Note that a single pass implementation of the Linear Gestalt Grouping algorithm
could only ever join all the regions in this image as in figure 3.25b, although the discon-
tinuity required to join both vertical groupings would be disallowed for most Minimum
Continuity control settings. Related to this problem is the algorithm's tendency to search
for linear configurations even within complex and clustered regions where such groupings,
whilst valid, are in no way apparent.
The bottom-up approach of the algorithm, and its evaluation of the current linear
grouping regardless of other groupings present in the image that may perceptually obscure
it, can result in the generation of perceptually insignificant groups such as those in
figure 3.22. This is particularly obvious when we attempt to generate groupings from
real images such as in figure 3.23. From this image we can see that virtually all of the
resulting groupings cannot be perceived in the original image at all. The problem that
such a photographic image poses to the Linear Gestalt Grouping algorithm is that the
algorithm is trying to impose a non-parallel linear constraint upon image data where
segments are made up from non-linear clusters of primitives. The non-parallel way in
which linear connections are made (prohibiting other potential connections) during the
grouping results in low-level linear feature extraction that does not allow the further
generation of relationships between newly generated higher level features (or the replacement
of previously generated groupings given the new context) as the algorithm progresses.
While good low-level linear groupings are being generated here, it is apparent that these
groups are not the same perceptual groupings that we would associate with the image
content. While some of these negative effects would be reduced if image edge segments
were used as primitives (boundaries are linear in nature) it is apparent that a more parallel
and multi-scale architecture that combines both linear and clustering approaches would
be better suited to this kind of data.
The Linear Gestalt Grouping algorithm was designed to test the Linear and Gestalt
based grouping of primitives from simple test images. It is neither iterative, nor does it
provide a comprehensive framework from which we can build a full pixel-to-Gestalt-region
segmentation engine. Figure 3.23 demonstrates the resulting groupings when the algorithm is
applied directly to more complex photographic image types. In truth, this is not a fair
test of the algorithm as little attention has been paid to the initial segmentation process
from which the Linear Gestalt Grouping algorithm takes its primitives.
It can be seen from results such as those in figure 3.23 that the imposition of linear
and Gestalt continuity constraints, whilst appearing reasonable at first, can lead to an
algorithm that is too specific for general image content. The application of such Gestalt
constraints within algorithms does not necessarily result in a useful segmentation/grouping
process in practice. Further to this, there is a lack of knowledge relating to the degree
or the method in which the different Gestalt principles should be properly applied to
generate a human-like segmentation/grouping of image content. Care must be taken to
avoid adherence to Gestalt rules at the expense of realistic and useful results. While
Gestalt principles can be seen to be correct in the general case, they have been compiled
through careful observation of the final groupings and associations that humans tend to
make with image content and are not necessarily representative of the actual underlying
processes that generated those groupings in the first place. It can be argued that Gestalt
principles are actually the visual results of an underlying process rather than the actual
way in which human vision groups visual content. Given such results it seems reasonable
to proceed focusing on the most basic, and most reliable, Gestalt principle of proximity
rather than trying to implement the full set of Gestalt principles. As will be discussed
later, the application of the proximity principle to spaces that incorporate appearance as
well as positional data actually gives rise to many of the other Gestalt principles, such as
similarity and common fate.
If our segmentation algorithm is to work from the pixel level upwards whilst being as
intrinsically Gestalt in nature as possible then we will have to develop an algorithm that
can operate faster whilst handling very large numbers of primitive components. Whilst the
inclusion of continuation and linearity constraints would be beneficial, it may be sufficient
to approximate these to facilitate a multi-level engine that can segment images from the
pixel level upwards. The ability to deal with pixel clusterings in a similar way to most
standard segmentation engines would seem essential if the algorithm is to deal with real-
istically sized natural images in a practical time-frame. What is required is a compromise
segmentation (grouping) algorithm that, whilst retaining Gestalt principles, can operate
with the speed and efficiency of many standard segmentation algorithms.
The next section looks more closely at implementing such a mid-level segmentation engine and
how we may blend the two approaches to segmentation and grouping together.
Figure 3.8: Examples from the Gestalt library, compiled to test the algorithm's ability to
group objects using Gestalt principles.
Figure 3.9: Examples of successful outputs and linear groupings actually generated from
the final Linear Gestalt Grouping algorithm
Figure 3.10: The number of perceptual groups, regions or objects varies depending upon
the scale at which we view the image.
Figure 3.11: Gestalt clusters generated through the use of second order differences in
ordered feature lists, images taken from [Thorisson94].
SEGMENTATION INTO PRIMITIVES → GENERATE AN ORDERED EDGE LIST → PROCESS EDGE LIST → OUTPUT GESTALT GROUPS

Figure 3.12: The basic architecture of Thorisson's algorithm, [Thorisson94], forms the
basis for the Linear Gestalt Grouping algorithm.
Figure 3.13: The blue arrow represents the region centroid difference (a fast approximation
of region distance), the red line indicates the minimum boundary distance (a slow but
accurate measure).
Figure 3.14: A simple example of the Linear Gestalt Grouping algorithm in action, showing
the relationships of the Node, Region and Edge data structures.
Figure 3.15: Showing linear groupings generated by the LGG algorithm (MC:0.7, CB:0.8).
Green lines indicate the strength of connections (darker=weaker) and node colour indicates
region groups.
Figure 3.16: The effect of description elements on the final groupings
Figure 3.17: Images which require grouping into non-linear clusters should, and do, confuse
the Linear Gestalt Grouping algorithm.
Figure 3.18: Increasing the segmentation complexity and context can lead to difficulties.
Concise region descriptions (left) can cause obvious grouping errors where region primitives
have complex curves and structure. The segmentations to the right show the difficulties
caused where the algorithm reaches optimal groupings but continues to try and segment
the image further.
Figure 3.19: The full set of linear region (group) outputs from the Linear Gestalt Grouping
algorithm when applied to test image 3.8a (MC:0.7, CB:0.8)
Figure 3.20: The full set of linear region (group) outputs from the Linear Gestalt Grouping
algorithm when applied to test image 3.8h (MC:0.7, CB:0.8)
(a)–(p): all combinations of MC ∈ {0.2, 0.4, 0.6, 0.8} and CB ∈ {0.2, 0.4, 0.6, 0.8}.
Figure 3.21: Linear Gestalt Grouping algorithm used on a composite of test images. Green
lines indicate successful connections, black lines indicate connections considered but failed
and node colour indicates final group. Each segmentation is generated using different
minimum continuity (MC) and Continuity Bias (CB) values.
(a) Image being processed; (b) all linear groupings found; (c) perceptually insignificant
grouping; (d) perceptually significant grouping.

Figure 3.22: The Linear Gestalt Grouping algorithm cannot determine whether a grouping
is perceptually significant or not: it cannot detect the significant perceptual difference
between groupings (c) and (d).
Figure 3.23: Showing linear groupings generated by the LGG algorithm when applied to a
simple greyscale photograph (MC:0.7, CB:0.8). Central images show initial segmentation
results (the primitives used to form the surrounding linear groupings) and the original
image. Surrounding images are a small selection of linear groupings extracted by the
Linear Gestalt Grouping algorithm. Notice that extracted linear groups, whilst being
valid, are often not perceptually significant in the original image.
(a) 1st Generation (b) 2nd Generation (c) 3rd Generation
Figure 3.24: The Linear Gestalt Grouping algorithm can be used to generate new linear
primitives which could be fed back into the algorithm for larger scale and cluster groupings.
(a) Image primitives; (b) linear connection from 1st generation edges; (c) prior generation
edge forming a bridge between 2nd generation region primitives; (d) 2nd generation edges
between 2nd generation primitives.
Figure 3.25: Showing three examples of edge types that could be used to form the bridge
between perceptually related regions.
Chapter 4
Gestalt Multi-Scale Feature
Extraction
4.1 Motivation
The first step in this work is the extraction of useful information from the raw image data.
Although comparison of pixel to pixel information can be effective for direct matching
between identical images, most practical recognition systems require faster, more efficient
or more generalized searches. The most commonly used representation that can be toler-
ant to image content variance, yet still detect similarity, is the use of colour histograms.
While colour histograms are highly efficient to generate, and produce very good recogni-
tion results, they are dependent upon colour or intensity information being available in
the image content and do not encode geometric information. This makes them unsuitable
for sketch recognition tasks. Other pixel-based approaches include the derivation of higher
order global and algebraic invariants directly from pixel correlation and moment values.
Of slightly more complexity, but the focus of much current research interest, is the use of
Fourier descriptors and Wavelets to characterize image content in terms of their component
frequencies and the generation of invariant descriptors based upon these. The common
problem with the above approaches is that they are generally reliant upon global image
properties which can effectively describe image similarity, but are less useful when applied
to the detection of image parts. Deriving invariants from the pixel content of an image
can also be processor intensive due to the sheer number of pixel elements within images
of useful size. A further difficulty with pixel-based approaches, especially when applied in
areas such as sketch recognition, is that they are unable to make use of large-scale shape
and boundary information present in image content. A more useful and intuitive primitive
to use for image recognition is the image object or segment. As well as global colour,
texture and geometric properties, humans also perceive sub-components, inferred objects
and gestalt groupings within images. These image objects can be a direct representation of
a real-world object in a photographic image, or the more abstract objects inferred from
sketch based materials or visual occurrences of written language. Although the task of
subdividing an image into such abstract objects is extremely difficult to simulate on a
computer due to the reliance upon higher levels of processing and prior domain knowl-
edge, it is possible to use general grouping rules to satisfactorily segment most images into
meaningful objects. These Gestalt rules are based upon the observation of how humans
perceive and group image content and provide general principles that can be built upon by
practical machine vision applications. The full set of Gestalt rules and their effect upon
how we perceive image content is explained in more detail in the review on page 15 of this
thesis. Generally speaking, Proximity and Similarity are considered the most important
Gestalt rules. Continuity, Common Fate, Completion, Closure and Simplicity are usually
thought of as having less importance. Region, Connectedness and Periodicity are often
overlooked completely in Gestalt research. Applying all these rules in an efficient and
congruous manner to achieve image segmentations that perform as well as human based
interpretations is one of the key problems faced by researchers in machine vision today.
For the sake of efficiency, most segmentation engines only apply a subset of these gestalt
rules, and are primarily concerned with proximity on a pixel or texel level. Such common
simplifications include the use of the 8 pixel neighbourhood, where only the immediate
surrounding image pixels are considered for potential grouping. Many segmentation algo-
rithms operate on a pixel level only, assuming that image content is Mondrian in nature
and therefore all salient regions will consist of similar, interconnected, colour values. In
reality, most natural image content consists of objects that are highly textured and contain
large differences between pixel intensity values. Attempts to deal with such content have
resulted in research into the subdivision of image content using texels, larger scale areas
of common texture, rather than individual pixel values. Although these approaches go
some way towards addressing the problems associated with natural images, the selection
of appropriate scales and computational complexity issues still pose difficulties. As texels
are just a larger collection of image pixels, at some point during the segmentation process
the size of the texel must be determined. This is usually done in an arbitrary way, by
using a constant texel dimension, although there has also been research into the use of
dynamic scale selection or multi-scale approaches.
In this research, it is proposed that image object content should be extracted that is
consistent with a human-led segmentation. It follows that a description derived from prim-
itives similar to those a human would perceive is likely to facilitate human-like similarity
judgements. A time limit of approximately 5 minutes (using a commonly available 2.4
GHz computer) will be imposed upon the segmentation/grouping and recognition process
to retain its practical usefulness for potential further applications such as the core of a
search engine.
This part of the thesis will be concerned with the extraction of multi-scale image
primitives utilising as many Gestalt rules as practical whilst still maintaining real-time
processing speeds.
4.2 The Core Segmentation Algorithm
Although most segmentation/grouping/clustering algorithms are based around the ma-
nipulation of ordered edge lists, the way these lists are evaluated has a dramatic impact upon
the type of grouping that will occur. Approaches such as [Thorisson94] and [Shi97], which
subdivide the edge list directly into groups based around 2nd or 3rd order list discontinu-
ities are effective at the fast generation of perceptual clusters such as those in figure 3.11.
Optimal positions for subdivision of the list (which affect the scale of the clusters) can be
difficult to determine and only the first sets of clusters on the sorted edge list are actually
likely to represent good primitives. This represents a very parallel approach, where many
primitives need to be combined into a new cluster in a single step (with no information
about inter-cluster relationships being used). Different cut thresholds can effectively alter
the scale of the clusters but the relationships between subgroups or the use of higher or-
der Gestalt grouping rules such as continuity are difficult to incorporate (see figure 4.8).
Another way of processing an ordered edge list is to process each edge entry in order,
building up groups as each pairing is allowed or disallowed. While this is a less parallel
approach, we no longer need to define any cut thresholds and can update edge and group
descriptions as the grouping process develops. This is essential if we are to capture Gestalt
relationships between segmented sections in the image. The previous section concluded
that the Linear Gestalt Grouping algorithm was too specifically targeted at finding linear
structures and too slow to be useful in image types where clustering is more appropriate.
While we wish to keep the ordered progression through an edge list, we require much less
prohibitive grouping decisions from our core feature extraction algorithm.
A simple algorithm developed by Pedro F. Felzenszwalb and Daniel P. Huttenlocher
[Felzenwalb98] seems to offer just such a compromise. Offering ample opportunity for
further development with higher order Gestalt grouping, this segmentation algorithm deals
efficiently with Mondrian, textured and ramped image content. It makes clever use of edge
information to incorporate limited texture and higher level segmentation performance
while running at a speed equivalent to simpler pixel based algorithms once initialized
correctly. Sharing the approach used in the previous section
on Linear Gestalt Grouping, the algorithm is based around the processing of sorted edge
lists but without the prohibitive insistence upon linear groupings. In its original form,
the algorithm takes approximately O(N log N) time to initialize and O(N) to complete the
segmentation, where N is the number of pixels in the target image. Initialization requires
generating a sorted edge list, but this can be optimized using a median-of-three pivot
version of Robert Sedgewick's optimisation of the Quicksort algorithm ([Gosling96]).

Figure 4.1: Showing the selection of 8 nearest neighbours based upon the image grid and
the corresponding edge weight values generated.
Given the similarity in methodologies used plus the efficiency with which this algorithm
can generate a pixel level segmentation, this represents a good foundation from which an
algorithm capable of both low and high level Gestalt grouping and segmentation can be
developed. The original algorithm is not designed to be multi-scale, and instead seeks to
arrive at a meaningful segmentation that is neither too coarse nor too fine.
A graph-based approach is taken towards achieving a segmentation, with G = (V, E)
representing an undirected graph with vertices v ∈ V corresponding to the set of pixels
from the image grid to be segmented and the edges (vxy, vjk) ∈ E.
The three primitives used within the algorithm are nodes, edges and groups. Each
node represents a single image pixel. Pixels are connected by weighted edges.
Interconnected edges and nodes form larger groups, or image segments (see figure 4.2).
The algorithm is pre-seeded with a list of edges (E) generated from image pixel intensity
differences between 8 pixel grid neighbourhood pixels (figure 4.1). For each image pixel
(vxy) a set of 8 edges is generated with a corresponding weight W (e) ≡ W (vxy, vjk).
ei = (vxy, vjk) such that vxy ∈ V, vjk ∈ V, |x − j| ≤ 1, |y − k| ≤ 1 and |x − j| + |y − k| > 0

E = {ei} ordered such that W(ei) ≤ W(ei+1)
E is then sorted by weight value so that the lowest cost weights are at the beginning
of the list; edges will be processed in order of increasing weight.

Figure 4.2: Nodes, Edges and Groups (Segments) form the basic building blocks of the
segmentation algorithm.

Figure 4.3: The interior difference Int(S) of a region S is defined as the largest edge in
its minimum spanning tree, whilst the difference Dif(S1, S2) between two segments is the
minimum weight edge connecting them.
Weights represent a non-negative measure of difference between the two connected
vertices using colour, intensity, position or some other appropriate description attribute.
The original algorithm uses the absolute intensity I(v) difference between connected pixels
to form edge weights:

W(vxy, vjk) = |I(vxy) − I(vjk)|

(Our own implementation instead uses the squared Euclidean distance between the normalized
description coordinates C over the D description dimensions:
W(vxy, vjk) = Σ from i=0 to D−1 of (C(vxy)i − C(vjk)i)².)
Extracted regions, or segments, represent unique and non-overlapping internally con-
nected subgraphs of G. Each segment S has its own internal difference measure, Int(S),
which is the largest weighted edge present in the minimum spanning tree (MST) of seg-
ment S, itself made up of edges within the edge list E, S(V, E) ∈ G(V, E):

Int(S) = max over e ∈ MST(S, E) of W(e)
The difference between two segments, Dif(S1, S2), is defined to be the minimum weight
edge that connects the two segments (see figure 4.3):

Dif(S1, S2) = min over vi ∈ S1, vj ∈ S2 of W(vi, vj)
Notice that, given edges are processed in order of increasing weight, any edge internal
to a segment must be either smaller or equal to any edge connecting a segment pair. This
means that the new internal difference Int(S) of a merged pair of segments is identical to
the weight of the edge that joined the two (see development of internal difference vectors
in figure 4.4). Another consequence of using a sorted edge list is that if the current edge
being processed has vertices terminating in different segments, we can be assured that
it is the minimum weight edge connecting the two segments and that all segments are
automatically composed of a series of connected edges which form a minimum spanning
tree.
Initially, each vertex in V is labelled with its own unique segment and no grouping
between pixels has occurred. The algorithm then moves through the sorted edge list
(E) and, if the edge terminates in vertices belonging to different segments, compares the
current edge value against the region comparison function. This function (IsEdge(E))
determines if the edge is a true edge or not, basing its decision upon the current edge
value and the internal differences of the two segments linked by the edge. If the region
comparison function determines that the difference between the two vertices V that make
up the edge is too small to prevent a merge, then both vertices are adjusted to belong to
the same segment. This process represents a merging between two segments that the edge
vertices belong to, rather than just the two individual vertices themselves.
IF ( e(vxy, vij) ∈ G(V, E), vxy ∈ Sk, vij ∈ Sl, k ≠ l ) AND ( IsEdge(e) = FALSE ) THEN
    Sk(V) = Sk(V) + Sl(V)
    Sk(E) = Sk(E) + Sl(E) + e
    Int(Sk) = W(e)
    Sl = ∅
The next edge in E is then selected for processing until the end of the edge list is
reached, or only a single segment S remains. It can be seen that the region comparison
function utilises a simple measure of internal difference to incorporate textural information
into the segment merging process, which follows the pre-seeded edge connections formed
by initial pixel vertex values.
Figure 4.4 shows a simplified illustration of the separation of a dataset into groups,
with the multi-dimensional distances between pixel descriptions represented as distances
on a 2 dimensional plane. Each node (dot) n corresponds to an image pixel that is
described by its colour or intensity properties and has a unique pointer to the group in
which it belongs (initially each node has its own group which, in turn, contains a pointer
back to the node). Each node generates 8 edges e, which are described by the
Euclidean distance in intensity/colour between the node (pixel) and its 8 neighbouring
nodes (pixels). The edge structures also contain references back to the two nodes that
formed them.

Figure 4.4: Shows the development of segments and the relationship between Nodes,
Edges and Groups. Red lines indicate the current edge being processed, dark blue lines
indicate the maximum internal difference in a group (which is always equivalent to the
last edge that was responsible for an expansion of that group). All other colours indicate
the separate groupings being formed. The ordered edge list is shown at the top of each
box, with the edge, node and group relationships below.

The edges are sorted so that the weakest edges appear at the head of the
list and will therefore represent the best candidate edges across which groups are likely to
merge. Running through this ordered list, we apply each edge to the region comparison
function. If the region comparison function determines that the edge is a true edge, then
no action is performed and the next edge is selected for processing. Where the region
comparison function determines that an edge should form the basis of a merge between
the two groups at either end of the edge, these groups are merged into a single group. This
can be done easily because each edge contains a pointer to its two nodes, which in turn
point to the groups (segments) that they belong to. Two groups are merged by combining
their lists of node pointers into one of the groups and using their pointers to ensure that
member nodes now all point to that group. The orphaned group (which now contains no
nodes and has no references to it) can then be removed.
In this way, any subsequent edge that points to a node belonging to the new group
can instantly access information about that group. Table 4.1 shows the system of two-
way pointers used during the grouping process. Member nodes contain pointers to the
groups/segments they belong to and the groups, in turn, point back to their member
nodes. Eventually the entire edge list is processed, and we are left with the original image
subdivided into a number of groups/segments. Each pixel in the image is represented
by its node, which points to the group structure it belongs to. Conversely, each group
structure that survives the process contains a list of pointers to the nodes (and therefore
the pixels in the image) that belong to it.
EDGE (formed from the edge between two adjacent pixels):
    pointer to NODE (pixel) 1
    pointer to NODE (pixel) 2

NODE (formed from a single pixel; pointed to by multiple Edges):
    pointer to the GROUP (segment) it belongs to

GROUP (a set of pixels that have been merged into a single segment):
    a list of pointers to its member NODES

Table 4.1: The system of two-way pointers linking the Edge, Node and Group structures.
As all variables are taken from the pre-seeded edge list, this algorithm performs a
pseudo-texture segmentation extremely quickly in linear time. However, the use of a
pre-seeded edge list also means that the possible development pathways of groupings are
predetermined from raw pixel data and do not take into account any higher order struc-
ture present in the image. While the internal difference texture property changes during
segmentation and influences the grouping decision, the actual edge pathways considered
for grouping are all predetermined from 8 pixel neighbourhoods. In cases where pathways
cannot be formed between adjacent pixels (figure 4.5), the algorithm has no pathways
from which to begin the segmentation. To achieve a Gestalt segmentation that can
generate new potential edge pathways, based around higher order group descriptions as
the grouping process develops, will require a new approach to the seeding
Figure 4.5: Black elements of this constant texture cannot be joined to adjacent white
pixels due to the large edge between them, and cannot be joined to other black elements
because they are not adjacent on the 8 neighbourhood image grid. The result is one
large white segment and a very large number of black single pixel segments (see also figure
4.9). Values are edge values between each pixel and the central pixel.
and maintenance of the edge list. Our first task is to address the 8 neighbourhood seeding
problem.
4.3 Seeding with Nearest Neighbours
While Felzenszwalb's algorithm represents a good starting point for this work, it has a
number of drawbacks that need to be addressed. It was found that although seeding
the segmentation algorithm using the image grid's 8 nearest neighbours (figure 4.1) was
very efficient, it can result in undesirable side-effects when used on high contrast textures
or artificial images.
Texture areas with large weights between neighbouring vertices are not capable of
generating edge pathways that can lead to the merging of the texture area into a single
segment. This problem can be seen in figure 4.5.
One possible fix suggested in the paper is to smooth the original image before running
the segmentation algorithm. Smoothing (figure 4.6) reduces the magnitude of neighbouring
edge values and allows the interconnection of perceptual groupings that have disparate
pixel values. While this is effective, it is at best only a partial solution and has many
undesirable side effects related to smoothing out segment boundaries over the entire image.
The fundamental difficulty with the algorithm is its dependency upon the image grid,
which limits potential edge connections to adjacent vertices only. One solution to this
Figure 4.6: Smoothing can be used to partially overcome the 8 neighbour image grid
problem (values are edge values between pixels and the central pixel)
problem is to use the n-nearest neighbours of an image vertex, not governed by the image
grid (figure 4.7). This results in the algorithm being able to capture spatially non-local
regions and prevent the stalling of a segmentation due to local image grid discontinuities
which would otherwise form part of a high-texture segment. With the introduction of
n-nearest neighbour seeding we involve another set of dimensions in our calculations: the
image position coordinates. Whilst the range of possible image position coordinates is
Figure 4.7: The use of n-nearest neighbour algorithms to seed edges is slower than 8
neighbourhood seeding and requires the inclusion of proximity information in calculations,
but eliminates the dependency upon adjacency. It also decreases the tendency to generate
edge connections across different visual group boundaries.
Figure 4.8: A circular arrangement of segments is apparent in this image. With localised
connectivity at pixel level only, this image will never be segmented into a circle because
the circular perceptual arrangement occurs at a much larger scale.
fundamentally dependent upon the image dimension, the range of pixel intensity values is
restricted to a static range. Care must be taken when combining these two different types
of measure into a single edge calculation, and different normalization schemes and methods
of combination can lead to very different segmentation behaviour. One fundamental issue
is whether proximity should be treated as a separate entity to similarity. While solutions
that combine the two in the same architecture can prove more elegant, there is some
evidence that they may affect visual perception in very different ways. The problems
associated with the increase in dimensionality of pixel descriptions will be addressed in
more detail in later sections.
In terms of Gestalt theory, the use of n-nearest neighbours also increases the algorithm's
ability to facilitate boundary continuity and completion, at least at a single pixel vertex
level. Although segment internal differences are used during the segmentation process
to incorporate larger scale segment information, such large scale segment information is
not reflected in the actual edge pathways that the algorithm follows. This results in a
limitation similar to the restriction of edge seeding to the image grid. Those parts of the
image considered for merging during the segmentation process are actually predetermined
based upon single pixel vertex information and are in no way based upon valuable seg-
ment information that can only be determined once the algorithm is in operation. This
means that strong larger scale groupings based upon segment size and appearance may
be completely missed by the algorithm that is currently restricted to following edge path-
ways based upon low-level pixel information (see figure 4.8). To convert our algorithm
into an efficient Gestalt engine we will need to incorporate the speed and efficiency of
the pre-seeded segmentation with some form of edge list updating as more information
about larger scale features and relationships between segments becomes available. Just
how we include the ability to update the edge list during segmentation is discussed on
page 94. Another point to note when implementing n-nearest neighbour seeding as op-
posed to the 8 pixel neighbourhood is that the assumption of adjacency no longer holds.
Whilst many nearest pixel neighbours will indeed be adjacent to each other, this may
well not be the case if colour values in adjacent pixels differ greatly. This ability can be
critical to successfully segmenting certain image types where no immediate adjacency
pathways between pixels are available (as in figure 4.9). The use of n-nearest neighbour
seeding inevitably necessitates the introduction of the concept of pixel location distance
into edge calculation. When evaluating both colour and positional difference for edges, we
also face normalization difficulties inherent in combining two different data types into the
same calculation. Such difficulties will be inevitable in any Gestalt system where higher
level descriptions such as shape and texture will need to be combined with position and
colour, so the use of nearest neighbour seeding at this point is not as inexpedient as it may
first appear.
Although much of the time there is little perceivable advantage over adjacency grid
seeding, there is no doubt that the use of Nearest Neighbour seeding can significantly
improve segmentation results with some image types. Figure 4.9 shows a good example of
these benefits. Where the image has been seeded using adjacent image grid locations to
form edges, no edges have been formed between the black pixels in the image. This results
in these pixels remaining either ungrouped, being filtered out as noise or being merged
with the background. In the case where nearest neighbour seeding has been used, the edge
list has been pre-seeded with pathways that do not rely on pixel adjacency, and the correct
segmentation is therefore produced.
Even when using the optimized KD-Tree nearest neighbour algorithms described later,
the time overheads for nearest neighbour seeding are considerably higher than simple
image grid seeding. Even with the optimal KD-Tree configuration for 100 by 100 images
(a terminating layer of 12, from figure 4.23 on page 114), the average seeding time for
10000 pixel images using nearest neighbour seeding was 20.5 seconds. This is around 70
times slower than image grid seeding at 296 milliseconds. Whilst this is a large difference,
the delay will only ever occur once per image segmentation.
Because nearest neighbour seeding demonstrably improves results, it is the seeding
approach adopted in all subsequent work.
4.4 The Edge Evaluation Function
Fundamental to the segmentation algorithm proposed in [Felzenwalb98] is the Edge Evaluation
Function. It is this function, D(S1, S2), that determines whether a given segment
pair should remain distinct or be merged into a single region.
$$D(S_1, S_2) = \begin{cases} \text{true} & \text{if } Dif(S_1, S_2) > MInt(S_1, S_2) \\ \text{false} & \text{otherwise} \end{cases}$$

where the minimum internal difference is defined as:

$$MInt(S_1, S_2) = \min\big(Int(S_1) + \tau(S_1),\; Int(S_2) + \tau(S_2)\big)$$

and the threshold function $\tau$ is defined as:

$$\tau(S) = k / |S|$$
where k is some constant and |S| is the size of the segment S. The role of τ(S) in
the decision function is to set the scale of segmentation, with smaller components requiring
stronger evidence for a boundary.
Larger values of k result in a convergence towards a segmentation favouring larger
segment sizes (figure 4.10) whilst still preserving smaller regions of sufficient distinctness.
This ability to change the scale of the segmentation whilst maintaining smaller and dis-
tinct regions is useful but suffers from the disadvantage that a large number of single pixel
segments that were too distinct to merge remain in the segmentation. In the original
algorithm, the inclusion of the threshold function τ is vital to the segmentation, as the
expression Dif(S1, S2) > min(Int(S1), Int(S2)) will always be true given the precondition
that the edge list is sorted by increasing value and all Int(Si) are taken from a previous
entry in the edge list E. In other words, without the inclusion of the τ term the segmentation
would simply not begin.
As most components used within the Edge Evaluation Function are taken from the
pre-calculated edge list, with segment size being easily updated as segment merges occur,
the entire segmentation runs very quickly.
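A minimal sketch of this decision function (Python; the function names are illustrative, and the Dif/Int values are assumed to be tracked by the surrounding segmentation loop):

    def tau(k, size):
        """Threshold term: smaller segments demand stronger boundary evidence."""
        return k / size

    def edge_is_boundary(dif, int1, size1, int2, size2, k):
        """D(S1, S2): True means the edge is a real boundary (no merge);
        False means the edge is weak and the two segments should merge."""
        m_int = min(int1 + tau(k, size1), int2 + tau(k, size2))
        return dif > m_int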
In practice, there are several major drawbacks to the use of this form of Edge Eval-
uation Function as it stands. The first is the arbitrary nature of the all important k
constant. In [Felzenwalb98] Felzenszwalb and Huttenlocher do not examine the use of this
constant in great detail and are happy assigning it values that produce visually appealing
segmentations. However, the degree to which k affects a segmentation is fundamentally
linked to the potential size of segments |S| in an image. The use of this constant also acts
against the invariance of the segmentation to scale change: equivalent segments at different
scales will be dealt with differently by the function. Rather than using an arbitrary
constant value for k, it could instead be derived as a function of image dimension, which
would at least make it invariant to changes in image dimension. Although the inclusion of
segment size information as part of an ongoing segmentation is a desirable quality, the use
of this constant inherently biases the segmentation towards a certain segment size. The
static nature of this bias in the segmentation increases the segmentation’s sensitivity to
image context change, image dimension or scale change. A decision function which utilises
segment size information in a more rational manner is required.
A second drawback to the algorithm is that the decision function is using the sum
of two very different variables, with very different limits, to make a decision. Whereas
edge differences can only span between 0 and 255 in most intensity formats, or 0 to 1 in a
normalized system, the potential range and influence of the second τ(S) term is completely
different. Consequently it is difficult to determine the influence of the two terms relative
to each other without some form of normalization to bring them to comparable scales.
As an example, if we were to set k to 255 then the function τ(Si) would have a
range maximum of 255/1 = 255; if we try to combine this with an edge value based upon
an intensity format of range 0 to 1 then it can easily be seen that the τ(S) function is
driving the segmentation algorithm with very little influence from the Int(S) component.
Either some way of normalising these two disparate component values against each other
needs to be found, or the decision function should be completely overhauled.
The third problem surrounding the algorithm as it stands lies in the initialization of
values. Although vital to the segmentation process, the original paper fails to specify the
initialization parameters used. Intuitively, the algorithm begins with each pixel vertex
being equivalent to its own discrete segment. Each of the segments will consist of a single
pixel (|S| = 1) and therefore have an internal difference of zero (Int(S) = 0). This results
in the following decision function at initialization:
$$MInt(S_1, S_2) = \min\big(0 + \tau(S_1),\; 0 + \tau(S_2)\big) \equiv \min(k/|S_1|,\, k/|S_2|) \equiv \min(k/1,\, k/1) \equiv k$$

$$D(S_1, S_2) = \begin{cases} \text{true} & \text{if } Dif(S_1, S_2) > k \\ \text{false} & \text{otherwise} \end{cases}$$
This essentially reduces to k having total control over the very first generation
of segments, acting as a single threshold value. It also results in a very high
probability that all single pixels will be merged with their nearest neighbour for any value
of k greater than zero. The reliance of the entire algorithm upon a value of k determined
by trial and error does not sit well with a well designed architecture.
4.4.1 Generating a Multi-Scale Segmentation
The final difficulty with this algorithm, as it stands, is that it is designed to settle on a final
segmentation largely controlled by the k variable. In the context of this thesis, the segmentation
algorithm should exhibit multi-scale behaviour. To do this in the current framework would
require multiple segmentations using different values of k, as in Algorithm 4.1. Another
approach would be to force the grouping process to continue until a single segment remains,
storing valid and useful segments (determined by a function IsValidRegion()) before they
are destroyed in the segmentation merge process, as in Algorithm 4.2.
Algorithm 4.1 Using multiple values of k to approximate a multi-scale segmentation
FinalSegmentList = ∅
k = StartValue
WorkingList = ∅
do {
    InitSegments(WorkingList)
    WorkingList = Segmentation(k, WorkingList)
    FinalSegmentList += WorkingList
    Increase k
} until (|WorkingList| == 1)
Algorithm 4.2 Altering the Edge Evaluation Function to generate a segmentation converging upon a single segment, whilst extracting relevant segments.

FinalSegmentList = ∅
WorkingList = ∅
InitSegments(WorkingList)
do {
    S = NextSegment(WorkingList)
    FinalSegmentList += IsValidRegion(S)
} until (|WorkingList| == 1 OR E == ∅)
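A compact sketch of the second strategy (Python, reusing the merge() helper sketched in section 4.2; is_boundary and is_valid_region stand in for the Edge Evaluation and Segment Evaluation Functions):

    def multiscale_segmentation(edges, is_boundary, is_valid_region):
        """Run the merge process towards a single segment, harvesting useful
        groups (as judged by is_valid_region) before merges consume them."""
        final_segments = []
        for edge in edges:                     # edges sorted by increasing weight
            g1, g2 = edge.node1.group, edge.node2.group
            if g1 is g2 or is_boundary(g1, g2, edge):
                continue                       # true boundary: groups stay distinct
            for g in (g1, g2):
                if is_valid_region(g):         # harvest before the merge destroys it
                    final_segments.append(g)
            merge(g1, g2)
        return final_segments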
The segmentation process that both the above algorithms rely on is based around the
generation of an ordered edge list from pixel level information, followed by rapid
progression through this list applying the Edge Evaluation Function D(S1, S2) to determine
segment groupings. This results in a process where the initialization of the edge list takes
longer than the actual segmentation process. As can be seen in the algorithms above,
the first approach requires edge list initialization for every value of k, whereas the latter
algorithm only requires a single edge list initialization. For this reason, coupled with the
apparent arbitrariness of the control value k, the latter approach was taken and the Edge
Evaluation Function adjusted to converge towards a single segment without the need for
a k control value. This also opens up interesting possibilities for the selection of ‘valid’
segments to preserve from the ongoing segmentation process. The issue of good segment
extraction, the properties such segments will exhibit, and the implementation of a Segment
Evaluation Function will be further discussed in section 5.1.
4.5 A New Edge Evaluation Function
It is apparent that the Edge Evaluation Function as defined in [Felzenwalb98] is not optimal
for generating the multi-scale segmentation that this thesis requires. In particular,
a method of removing the dependence upon the external variable k to control the scale
of the entire grouping process was required. The desirable properties for a new function are
to segment the image using the same general criteria as the original function, keeping the
same simplicity and speed, whilst allowing a general convergence towards a single segment.
The following function, still based around internal differences, exhibits this desired
behaviour:
$$D(S_1, S_2) = \begin{cases} \text{true} & \text{if } Dif(S_1, S_2) > CInt(S_1, S_2) \\ \text{false} & \text{otherwise} \end{cases}$$
where D(S1, S2) is the decision function that determines whether or not the edge should
remain (true), or whether the edge is weak (false) and a merge between the two regions
connected to it will be initiated. The combined internal difference is defined as:
$$CInt(S_1, S_2) = Int(S_1) + Int(S_2)$$
However, this function does not include any form of size bias that was inherent in
the previous formulation of D(S1, S2) through the use of τ(S) = k/|S|. As it stands,
the decision function is determined purely through edge comparison with internal texture
differences and does not include any influence determined by the comparative sizes of
segments, which are readily available.
It can also be seen that the current function suffers from difficulties at the beginning
of a segmentation, where each group represents a single pixel that should, intuitively, be
seeded with an internal difference of 0. This would result in the Edge Evaluation Function
accepting all edges as true and no segmentation taking place.
All edges are valid at startup because:

$$Dif(S_1, S_2) > CInt(S_1, S_2), \quad \text{where } CInt(S_1, S_2) = 0$$
Whilst this difficulty can be worked around by artificially seeding all internal differences
to 1, a better solution would be to include another element in the function to begin the
segmentation process.
A slight modification to the above function allows the optional inclusion of size information
to influence the Edge Evaluation Function and bias it towards merging regions of
similar size. This is based around the relative size of segments and will remain the same
regardless of most image transformations, representing an improvement upon the way component
size was treated in the original Region Comparison Function whilst eliminating the
need for any control variable k.
$$D(S_1, S_2) = \begin{cases} \text{true} & \text{if } Dif(S_1, S_2) > CInt(S_1, S_2) \\ \text{false} & \text{otherwise} \end{cases}$$

where the combined internal difference is defined as:

$$CInt(S_1, S_2) = \frac{Int(S_1) + Int(S_2) + M \cdot SizeMod(S_1, S_2)}{1 + 0.5M}$$

given that:

$$0 \le M \le 1 \quad \text{(degree of size influence)}$$

$$SizeMod(S_1, S_2) = 1 - \frac{\big|\,|S_1| - |S_2|\,\big|}{MaxArea - 1}$$

and $MaxArea = ImageWidth \times ImageHeight$ (the maximum possible segment size).
Given this formulation, it is apparent that the Combined Internal Difference Function
will return a potential range of 0 ≤ CInt(S1, S2) ≤ 2 while 0 ≤ Dif(S1, S2) ≤ 1, which will
generally result in a bias towards merging segments and ultimately towards the desired single
segment solution (although this is by no means inevitable in every case). It should be noted
that the majority of segments processed by the Combined Internal Difference Function will
be considerably smaller than the MaxArea value by which they are normalized. The inclusion of
this size component will have a tendency to reduce the magnitude of CInt(S1, S2) and
increase the likelihood of edges being retained. The inclusion of the size component in
the function also overcomes the internal difference seeding problems discussed earlier,
because:
$$SizeMod(S_1, S_2) = 1 - \frac{(1 - 1)}{MaxArea - 1} = 1 - 0 = 1 \quad \text{at initialization.}$$
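A sketch of the complete new decision function (Python; this assumes normalized Dif/Int values in the range 0 to 1, and the default of M = 0.5 is purely illustrative):

    def size_mod(size1, size2, max_area):
        """Relative size bias: 1.0 for equal sized segments, falling towards
        0.0 as their sizes diverge."""
        return 1.0 - abs(size1 - size2) / (max_area - 1)

    def c_int(int1, int2, size1, size2, max_area, m):
        """Combined internal difference, with degree of size influence m in [0, 1]."""
        return (int1 + int2 + m * size_mod(size1, size2, max_area)) / (1.0 + 0.5 * m)

    def edge_is_boundary_new(dif, int1, int2, size1, size2, max_area, m=0.5):
        """New D(S1, S2): True keeps the edge as a boundary, False merges."""
        return dif > c_int(int1, int2, size1, size2, max_area, m)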
By taking the original algorithm and removing the termination criteria we have cre-
ated a basic segmentation algorithm that can efficiently generate groupings from raw pixel
information as well as a simple textural description of Internal difference. Whilst this
provides us with the basic data structures and decision functions required for a multi-scale
segmentation, it suffers from the same major limitation as previous work on Linear Gestalt
Grouping. Both algorithms can only group according to pathways laid down during ini-
tialization, which are calculated from primitives before the segmentation begins. In the
Linear Gestalt Grouping algorithm, these primitives are the results of an initial segmen-
tation whilst this algorithm uses pixel information. Although internal difference and size
information affects grouping decisions as the algorithm progresses, at present it can only
form connections along edges generated at initialization. This precludes potentially important
connections from forming due to higher level texture similarities, such as those shown
in figure 4.5, or group similarities, as shown in figures 3.24b and 3.24c. To allow true multi-scale
behaviour we need to find an efficient way of generating new edges to form potential pathways
as the segmentation progresses and larger region primitives are generated through
grouping. This is no trivial task, as the generation of every possible new edge at each stage
of the segmentation would inevitably result in an explosion in processing requirements and
a huge increase in the size of the edge list. The main processing overhead in the original
algorithm is the generation of edge pathways, which in itself has been limited to either
8-nearest neighbour or 8 pixel neighbourhood edges per region, before algorithm execu-
tion. The fact that this overhead occurs a single time only, in the preprocessing stages
of the segmentation, limits its impact. What is now required is an algorithm capable of
generating new edges as new region groups are formed in the segmentation, and of smoothly
and efficiently inserting them into the edge list whilst it is being processed. Such an
algorithm will also require a storage structure capable of retrieving high dimensional data
efficiently by nearest neighbour query.
(a) Original Image (b) NN Seeded, 100 active groups remain (c) NN Seeded, 3 active groups remain
(d) Original Image (e) Grid seeded, 100 active groups remain (f) Grid seeded, 1 active group remains
Figure 4.9: Although differences in segmentation between pixel grid and nearest neighbour
seeding are usually slight, here is an example where the use of Nearest Neighbour seeding
is crucial for an accurate segmentation. While nearest neighbour seeding allows the
algorithm to correctly join the black pixels despite them not being connected in the image
(above), this is not the case with 8 neighbourhood grid seeding where the only connections
presented to the algorithm are merges with the image background. Generated groupings
where merges have occurred are artificially coloured in these images (ungrouped regions
retain their original black or white colouring).
Figure 4.10: Segmentations using image grid edge seeding and increasing values of k with
the original region comparison function. Colour values have been normalized to a range
between 0 and 1, and the normalized Euclidean distance between pixels is used to determine
edges.
4.6 Updating Edges
As discussed in the previous section, although we have a fast and efficient pixel/texture
based segmentation algorithm, it suffers from the limitation that all possible segmentation
pathways are generated prior to the segmentation process. A truly Gestalt segmentation
process will require us to generate new possible segmentation pathways as new regions
are created by the segmentation algorithm. An exhaustive implementation of this process
would generate new edges between each newly created region and all other regions, and
insert them into the edge list.
Algorithm 4.3 Exhaustive edge update procedure

For each new region generated by the segmentation process:
    Remove the two old regions used to form the new one
        No_Of_Regions = No_Of_Regions - 2
    Remove the current edge that formed the new region
        No_Of_Edges = No_Of_Edges - 1
    Insert new edges between the new region and all old regions
        No_Of_Edges = No_Of_Edges + No_Of_Regions
As can be seen in algorithm 4.3, exhaustively adding new edges to the list would result
in an order n² growth in the number of edges in the edge list as the segmentation
proceeds. Whilst this would present difficulties in terms of the sheer size of the edge
list required, speed would also be severely affected by the requirement to insert each new
edge entry into the current edge list. As many of these new edge pathways are unlikely to
actually be used by the segmentation process at all, it makes sense to limit the number
of new edges placed in the edge list to the better edges only. These useful new edges can
be defined as the edge pathways between the n nearest neighbouring regions in feature
space. An efficient nearest neighbour extraction algorithm, as shown in algorithm 4.4,
would reduce the number of new edges to those that are most likely to be followed by the
segmentation algorithm, allowing the fast insertion of edges and keeping the size of the
edge list down to manageable dimensions.
Given a relatively small number of useful new edges, the edge list can be updated very
quickly as each new region is generated. The bottleneck in algorithm 4.4 becomes the
extraction of nearest neighbours in the function GetNeighbours().
Algorithm 4.4 Nearest neighbour edge update procedure

For each new region generated by the segmentation process:
    Remove the two old regions used to form the new one
        No_Of_Regions = No_Of_Regions - 2
    Remove the current edge that formed the new region
        No_Of_Edges = No_Of_Edges - 1
    Extract the n nearest neighbours to the new region
        GetNeighbours(n)
    Insert new edges between the new region and the n nearest old regions
        No_Of_Edges = No_Of_Edges + n
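A sketch of this update step (Python; the index object stands in for any updatable nearest neighbour structure, such as the modifiable KD-Tree described below, and the parent1/parent2 fields and all names are illustrative):

    import heapq, itertools

    _tiebreak = itertools.count()   # keeps heap ordering defined on equal weights

    def update_edges(edge_list, new_region, index, n=8):
        """After a merge creates new_region, seed only the n most promising
        new pathways: edges to its n nearest neighbours in feature space."""
        index.remove(new_region.parent1)     # the two consumed parent regions
        index.remove(new_region.parent2)     # leave the search space
        for neighbour, distance in index.nearest_neighbours(new_region, n):
            heapq.heappush(edge_list, (distance, next(_tiebreak), new_region, neighbour))
        index.insert(new_region)             # the new region becomes searchable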
4.6.1 Appropriate Data Structures for K-Dimensional Nearest Neighbour Query
Whilst many segmentation approaches concern themselves with pixel level information, or
simple texture measures, this Gestalt engine will have to manipulate and make nearest-neighbour
calculations in spaces of higher dimensionality. At the barest minimum, a full colour
Gestalt segmentation engine will require three pixel colour components, two segment position
values and some measure relating to higher level group descriptions. The minimum
requirements for region description anticipated in this work are listed in table 4.2.
Each region group can be considered to represent a point in an n-dimensional volume.
We require an algorithm that can search such a multi-dimensional space for nearest neigh-
bours with great efficiency, as this operation will be required whenever two parent groups
merge to form a new child. The space must also be updatable, so that new groups can be
added to it and included in the search without the need to re-submit all groups.
The nearest neighbour search problem (sometimes referred to as the closest-point prob-
lem) is common to a large number of problem domains, and has a large set of possible
algorithmic solutions with differing degrees of reliability and efficiency. Such algorithms
are necessary because the processing required to perform a direct comparison (often using
Mean Squared Error or Manhattan distance) grows rapidly with the number of elements
to be compared for proximity. The majority of approaches reduce computation time by
approximating the data in the search so that fewer comparisons are required to find matches
close to the true nearest neighbour.
POSITION
    1. X Position
    2. Y Position
COLOUR
    3. Mean Red Component Intensity
    4. Mean Green Component Intensity
    5. Mean Blue Component Intensity
APPEARANCE
    6. Region Size
    7. Minimum Internal Difference
Table 4.2: Minimum requirements for region description.
This process is often iterated to find even better approximate matches from the candidate set.
Once the algorithm reaches a scale at which the processing required for approximation
outweighs the cost of direct candidate comparison, direct comparison can then be used
to determine the exact nearest neighbour (if this level of accuracy is required).
There are three main approaches to optimizing the search for a nearest neighbour
from a candidate set: arranging the data items in such a way that nearest neighbours can
be indexed efficiently from their properties; projecting the data down to a simplified
coordinate system; or using a data structure that enables a more efficient
search for the item. These methods are not exclusive, and many variations that combine
these approaches have been developed.
Sorting data items into an ordered list enables us to use either direct indexing or list
subdivision to quickly identify nearest neighbours, due to the properties inherent in an
ordered list. In a similar way, the projection of data items into lower dimensionality
descriptions can allow us to quickly eliminate from exhaustive search large numbers of
points whose projections could not belong to a nearest neighbour.
If a direct indexing based upon the attributes of the data items to be searched can be
established (usually incurring a degree of approximation), then potential nearest neigh-
bours can be quickly extracted from the set without the need to directly compare each
data item. Trained neural networks and, in particular, Correlation Matrix Memories can
be used to successfully and efficiently establish such relationships so that the attributes
of a target point can be used to directly generate a set of potential nearest neighbours
without exhaustive search. [Hodge02] combines this approach with a dedicated high-speed
binary CMM architecture to enable nearest neighbour query much faster than the standard
computational approach.
Whilst data structures used to optimize nearest neighbour search are usually based
around kd-trees, Voronoi diagrams can also be used to partition a space into a simpler and
more efficient structure, decomposing the space into known regions such that each region
contains exactly the locations that are closer to its data point than to any other data
point. This is particularly useful for low dimensional data, but the size requirement of
storing the Voronoi structure rapidly becomes prohibitive as dimensionality increases. The
approach also does not lend itself well to updating, as large sections of the Voronoi structure
will require recalculation as new query points are added.
Although many variations of kd-tree have been devised, their purpose is always to
hierarchically decompose the search space into cells that contain subsets of the data points.
We can efficiently find which cells are likely to contain the nearest neighbour and reject
those cells that cannot; data points contained in the rejected cells can then be eliminated from
the search without the need to exhaustively test each point for proximity. Typically,
algorithms construct kd-trees by partitioning the data points into two sets (or cells) across
a splitting plane. The planes to partition across are selected either by cycling through
the dimensions of the space, by cutting along the largest dimension, or through the use of
Quadtree or Octree structures which partition across all dimensions at once (resulting
in four child cells for two dimensional data and eight child cells for three dimensional
data). Whilst non axis-parallel cutting planes have been used, they result in cell boundaries
that are much harder to retain and compare against. Kd-tree structures are effective in
moderate-dimensional spaces, and methods of compartmentalizing and negotiating such
trees can be optimized to the type of data being searched. Where such data characteristics
are inherent, such optimizations can remain effective when the tree structures have new
query points added. They become less effective where they are derived from
the initial data set and new data points are added that change the nature of the set as a
whole (requiring either recalculation of the structure or suffering an increasingly less
efficient search). Kd-tree data structures become less effective as the dimensionality of
the dataset increases past around 20, because the sphere which represents the current best search
radius progressively fills less volume relative to the cube structures used in the tree,
resulting in each approximation containing more false nearest neighbour candidates.
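For illustration, a textbook kd-tree with a cycling discriminator and best-distance pruning (Python; this is the generic median-split construction, not the modifiable predetermined-boundary variant adopted later in this chapter):

    import math

    class KDNode:
        def __init__(self, point, left=None, right=None):
            self.point, self.left, self.right = point, left, right

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def build(points, depth=0):
        """Build the tree, cycling the split axis (discriminator = depth mod k)."""
        if not points:
            return None
        axis = depth % len(points[0])
        points.sort(key=lambda p: p[axis])        # median split on this axis
        mid = len(points) // 2
        return KDNode(points[mid],
                      build(points[:mid], depth + 1),
                      build(points[mid + 1:], depth + 1))

    def nearest(node, target, depth=0, best=None):
        """Descend towards the target, then backtrack, pruning any branch whose
        splitting plane lies further away than the current best distance."""
        if node is None:
            return best
        if best is None or dist(target, node.point) < dist(target, best):
            best = node.point
        axis = depth % len(target)
        diff = target[axis] - node.point[axis]
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        best = nearest(near, target, depth + 1, best)
        if abs(diff) < dist(target, best):        # the far cell may hide a closer point
            best = nearest(far, target, depth + 1, best)
        return best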
Implementations utilizing the three main approaches were compared for use in this
work:
1. Modifiable KD Trees - a KD based approach using largest plane subdivision
2. Mean Constrained Projection - a projection based technique
3. Half Rib Orthogonal Lists - an ordering based technique indexing by axis
KD Trees
Tree structures are most commonly used to enable fast nearest neighbour access to multi-
dimensional data or multidimensional orthogonal range search. By subdividing the range
of data points into groups and iterating this process we can eliminate the need for direct
distance comparisons between a large number of points. This forms a tree like structure
where each tree node encompasses a decreasing number of data points as we move down
it. Each tree node must define the extent of its child nodes. Common approaches are
SS-trees (using bounding hyper-spheres), R-trees (using bounding hyper-boxes) and the
simpler KD Tree (using bounding hyper-planes). [Gonnet02] provides a good online guide
to different nearest neighbour search techniques utilizing tree structures.
KD Trees are a generalization of one dimensional Binary Search Trees to operation in
k dimensions. Branches of a KD Tree are generated by splitting at points along successive
dimensions so that level 0 of a tree will index dimension 0 of the query space, level 1
will index dimension 1, etc. The dimension that is indexed at a given level of the tree
is determined by the discriminator. The discriminator is usually calculated using i mod
k, where i is the current level and k is the dimensionality of the search space. Other
calculations for the discriminator exist, for example to subdivide a volume of search space
across its longest axis. The actual point at which the current dimension is subdivided is
determined by a decision function. The optimal subdivision will form a hyperplane that
separates points with the largest variance. Methods of selecting this division vary from
the use of the eigenvector with the largest eigenvalue of the covariance matrix, to simpler
approaches such as selecting the mean or median point in the current branch.
This branching approach is iterated until either a given level of coordinate precision is
reached, or a given number of tree layers have been developed.
Due to the KD Tree structure's hierarchical subdivision of multi-dimensional space, it
is an ideal candidate for multi-dimensional nearest neighbour search.
Time taken assembling a KD Tree structure is offset by the advantages of much faster
potential recall from the structure, depending upon the nature and accuracy of the recall
required. Whilst tree structures are generally accepted as the most efficient method of
searching for nearest neighbours, we require our algorithm to be fully updateable. For this
reason we have opted for a less efficient, but fully updateable KD Tree that subdivides
hyper-planes by predetermined values (figure 4.13) rather than by data content (which
would require the entire structure to be updated as new points were added or removed).
The tree structure must be searched carefully so that only sections of the KD space that
could possibly contain a nearest neighbour are evaluated directly. If the distance between the
nearest boundary of a hyperplane and the target point is greater than the current furthest
entry in the nearest neighbour table, then that branch, and all subsequent sub-branches,
cannot contain a feature point that belongs on the nearest neighbour table. In practice,
it is much more efficient to calculate the bounding sphere of each volume of feature space
referenced by a KD branch, as shown in figure 4.12. This approach has similarities to that
used by Sproull [Sproull91], but instead eliminates tree branches (representing
cubic volumes of the n dimensional search space) before the final points are compared.
Whilst this may allow the expansion of inappropriate branches, the increased simplicity
of the calculation (appendix 12.7) offsets any disadvantage incurred by expanding the
branch. The number of layers a KD Tree can expand to is fundamental to the efficiency
of its operation. Too many layers will result in an unnecessarily fine subdivision of feature
space which will increase the number of calculations made. Too few layers will result in
a coarser subdivision and larger number of point comparisons. As can be seen in figures
4.19 and 4.21, the optimal number of tree layers varies according to the number of points
in the search space, their distribution and the dimensionality of the search space.
The nearest neighbour algorithm will be required to store each current region as a point
in feature space and allow regions to be added or removed efficiently. Whilst this is easily
implemented using exhaustive, MCS or half-rib approaches, we must alter traditional KD-
Tree approaches in order to allow this real-time flexibility. It is important that the the
KD tree structure is modifiable in order to avoid the need to generate a new feature space
for every new nearest neighbour query. This has an important impact upon which type
of normalization we apply to our feature descriptions when using KD Tree search. Any
form of self normalization would result in the need to adjust maximum and minimum
boundaries in the feature space, and the positions of all entries present in the KD tree
would need altering with each shift in parameters. Whilst this is possible, the overheads
involved with changing the spatial positions of every region present in the KD tree every
time a new region is generated are far too high to be practical. The only practical solution
to these requirements is to normalize feature coordinates by maximum possible values and
assign these values as the boundaries of the KD tree feature space. In such a system
old region feature space positions referenced by the KD tree will retain their integrity
whilst allowing the KD tree to be updated with new region entries as the grouping process
continues (algorithm 12.8).
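The fixed-bound normalization this implies is straightforward; a sketch using the seven region dimensions of table 4.2 (Python; the field names are hypothetical):

    def normalize_region(region, image_width, image_height):
        """Normalize feature coordinates by their maximum possible values, so
        the KD tree boundaries never move as new regions are added."""
        max_area = image_width * image_height
        return (
            region.x / image_width,             # position in [0, 1]
            region.y / image_height,
            region.mean_r / 255.0,              # colour components in [0, 1]
            region.mean_g / 255.0,
            region.mean_b / 255.0,
            region.size / max_area,             # size in [0, 1]
            region.min_internal_difference,     # already a normalized edge value
        )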
Mean Constrained Search
Another method of quickly retrieving nearest neighbours from an n-dimensional space is
to first assemble all candidate match entries in an ordered list and use this ordering to
minimize the number of entries that will require expanding for full comparison. Searching
a simplified ordered list through indexing or iterative list subdivision (which we used in
this work) is considerably more efficient than exhaustive search through all entries. One
of the most common methods of reducing a set of n-dimensional vectors into a single list is
through projecting them to a single dimension. The simplest of these, Mean Constrained
Projection, maps each vector (item to be searched) to the mean value of its component
dimensions. The Mean Squared Error relationship between this mapping and the original
n-dimensional coordinates can be exploited to constrain the search to a smaller range of
candidate matches from a one dimensional list.
The relationship between two vectors x and y can be defined as:

$$\sum_{d=1}^{n}(x_d - y_d) = n(M_x - M_y)$$

(where M is the mean of all coordinates in that vector). Define $A_d = (x_d - y_d)$ and $B = (M_x - M_y)$, so that $\sum_{d=1}^{n} A_d = nB$. Then:

$$\sum_{d=1}^{n}(A_d - B) = 0$$

$$\sum_{d=1}^{n}(A_d - B)^2 \ge 0$$

$$\sum_{d=1}^{n}\big(A_d^2 - 2BA_d + B^2\big) \ge 0$$

$$\sum_{d=1}^{n}A_d^2 - 2B\sum_{d=1}^{n}A_d + nB^2 \ge 0$$

Substituting $nB$ for $\sum_{d=1}^{n} A_d$:

$$\sum_{d=1}^{n}A_d^2 - 2B(nB) + nB^2 \ge 0$$

which reduces to:

$$\sum_{d=1}^{n}A_d^2 - nB^2 \ge 0$$

$$\sum_{d=1}^{n}A_d^2 \ge nB^2$$

$$\frac{1}{n}\sum_{d=1}^{n}A_d^2 \ge B^2$$

$$\mathrm{MSE} = \frac{1}{n}\sum_{d=1}^{n}(x_d - y_d)^2 \ge (M_x - M_y)^2$$
This shows that the Mean Squared Error (MSE) is guaranteed to be greater than or equal
to the mean difference squared, and that if a new candidate y has a mean difference
squared greater than the Mean Squared Error of the previous best match
(MSEbestmatch), it cannot be closer to the target vector:
$$(M_x - M_y)^2 \ge \mathrm{MSE}_{bestmatch} \;\Rightarrow\; y \text{ cannot be closer than } y_{bestmatch}$$
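A sketch of how this bound prunes candidates (Python; a plain scan stands in for the mean-ordered list subdivision actually used):

    def mcs_nearest(target, candidates):
        """Mean Constrained Search: a candidate whose squared mean difference
        already exceeds the best MSE so far cannot be closer, so it is skipped
        without a full n-dimensional comparison."""
        n = len(target)
        m_target = sum(target) / n
        best, best_mse = None, float("inf")
        for cand, m_cand in candidates:        # (vector, precomputed mean) pairs
            if (m_cand - m_target) ** 2 >= best_mse:
                continue                       # pruned by the mean bound
            mse = sum((a - b) ** 2 for a, b in zip(target, cand)) / n
            if mse < best_mse:
                best, best_mse = cand, mse
        return best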
Ra and Kim [Ra93] report a computational complexity of between 5% and 12% of a full
search when using their mean-distance-ordered partial codebook search. Cheng and Lo
[Cheng96], evaluating the performance of their mean constrained selective technique on
differing numbers of 16 dimensional vectors, reported a 75% improvement in
search times.
Half Rib Orthogonal Lists
A half rib orthogonal list is formed by arranging data items so that each axis
of the n-dimensional space is represented by value ordered linked lists. A linked list is
generated for (and sorted by) each axis value for a point, with new linked lists branching
outwards for the next axis value until all axes have been encoded; the final element at
the terminus of the final axis is the data item itself (figure 4.14). Each node in the linked
lists represents a junction, an axis coordinate, and can contain a reference to a data item.
In this way, each data item to be stored in the n-dimensional space can be encoded into
the branching list structure.
To optimize a nearest neighbour search in this structure, each closest matching axis is
followed until a data point is encountered. All subsequent searches can then be limited
to the next closest axes (moving back down the branches of the structure), searching (and
expanding) only tree nodes that could possibly be closer than the furthest entry in the
nearest neighbour table (a spherical range formed around the target coordinate with a
radius equal to the distance from the furthest entry to the target).
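A rough sketch of the nested structure itself (Python; insertion only, with the pruned spherical-range search described above omitted for brevity):

    import bisect

    def rib_insert(rib, point, data, axis=0):
        """Each level is a sorted list of (coordinate, child) junctions for one
        axis; the terminus of the final axis holds the data item itself."""
        key = point[axis]
        i = bisect.bisect_left([k for k, _ in rib], key)
        if axis == len(point) - 1:
            rib.insert(i, (key, data))          # final axis: store the item
            return
        if i < len(rib) and rib[i][0] == key:
            child = rib[i][1]                   # existing junction for this value
        else:
            child = []
            rib.insert(i, (key, child))         # new junction branches outwards
        rib_insert(child, point, data, axis + 1)

    rib = []
    rib_insert(rib, (0.2, 0.7, 0.1), "region A")
    rib_insert(rib, (0.2, 0.3, 0.9), "region B")   # shares the first-axis junction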
Evaluation and implications
Given the expectation of image sizes ranging from 100 to 200 pixels square, we can calculate
the expected worst-case number of groups being queried by a nearest neighbour search as
ranging from 100² to 200² (10000 to 40000). For compatibility with the core segmentation
seeding algorithm, which adds 8 edges to the list for every group, we will be searching for
8 nearest neighbours for each new group submitted. The following experiments are based
around searching through identical random distributions of data points. Unlike MCS and
Half Rib search, the Updateable KD Tree algorithm's efficiency is considerably affected
by the minimum precision range and number of layers within the tree. Our first experiment
(figure 4.19) was to determine the optimal layer at which to terminate tree branching, given the search
conditions. The best overall terminating layer parameter for the KD Tree search is 15, so
we shall use this value in the next experiments. Figure 4.20 shows the direct comparison
between the three nearest neighbour approaches and an exhaustive search.
A notable problem with these results is that the Mean Constrained Search seems to
be performing much worse than expected. As can be seen in figure 4.21, the MCS perfor-
mance degrades as the dimensionality of the search space increases, eventually becoming
detrimental to the search process. Although a decrease in performance would be expected
as the dimensionality of the search space increases and the projected coordinates become
less effective at constraining the search space, these results show much worse performance
than anticipated. Of further concern is the poor performance when compared with the results
of [Cheng96, Ra93], which show significant increases in MCS performance when compared to
exhaustive search, even within 16 dimensional spaces. Although the MCS algorithm has
been thoroughly checked, and the experiments repeated, the cause of this discrepancy has
not been found. It was eventually decided to move ahead with the work, using the results
as generated.
A three-dimensional graph showing the full range of parameters, techniques and settings
can be seen in figure 4.18. Figure 4.22 shows a surface chart of Modifiable 8NN KD-Tree
results for the lower range of point spaces from 2500 to 20000. Figure 4.23 shows that
the best performance and optimal terminating layer configuration of the modifiable KD-Trees
follow a logarithmic curve. Figures 4.15, 4.16 and 4.17 compare different KD-tree
results with exhaustive search for 8 nearest neighbour searches over smaller and decreasing
numbers of points (as in the currently defined edge update algorithm).
From these results it is apparent that the performance of all approaches and individual
terminating layer settings follows a linear rule as the number of data points to be searched
increases. Where smaller numbers of points are to be searched, the advantages of using KD
trees to optimize performance decrease. It is also apparent that (for searches in spaces of
19999 to 79999 points) the modifiable KD tree structure, with a terminating layer of 15,
is the most appropriate nearest neighbour approach to be implemented.
4.6.2 Efficiently searching the KD Tree for nearest neighbours
The requirement for a nearest neighbour search, and subsequent generation and placement
of new edges in the edge list at each group merging, results in a considerable slowdown of
the algorithm. Given that we now have a segmentation algorithm that operates
efficiently on low level pixel data, and a slower edge update algorithm that is most useful for
grouping more developed segment groups, we can run the two processes together.
In this strategy, the segmentation algorithm developed from [Felzenwalb98] can be used
to efficiently generate the initial edge list and perform the primitive pixel level groupings.
Once the number of active groups in the segmentation process is reduced to a manageable
level (a static threshold on the number of groups remaining), we can begin using the
slower KD-Tree based edge update algorithm to provide the higher level Gestalt linkages
between developed segment groups, as in figure 4.24. At this point the modifiable KD-Tree
is constructed and filled with all existing group information. The 8 nearest neighbours of all current groups
are then found and their edges inserted into the appropriate place in the edge list (from the
current edge onwards). After this initialization the generation of new nearest neighbours
is much more efficient, as the KD-Tree space is updated (not rebuilt) at each update (see
figures 4.22 and 4.23 for performance over this lower range of point spaces).
A working threshold of 2500 groups was selected to begin the update function: the point
at which a 100x100 (10000 pixel) image has been simplified to one quarter of its original
group count.
A modifiable KD space with a terminating layer of 9 was found to be most appropriate
(figures 4.15, 4.16 and 4.17).
$$\mathrm{Avg.\ No.\ Points} = \frac{\sum_{k=1}^{2500} k}{2500} = 1250.5$$
Later changes in the way group information is stored and handled, detailed on page
119, change the optimal value for KD-tree terminating layers. In that architecture, when
the number of active groups in the segmentation reduces to a useful level, the modifiable
KD-Tree is constructed and filled with all existing group information (including previously
generated inactive parent groups). The 8 nearest non-related neighbours of all currently
active points are then found and their edges inserted into the appropriate place in the edge
list (from the current edge onwards). The number of groups in the search space actually
doubles rather than decreasing during the segmentation. This results in search spaces of
points between 19999 and 79999 points for expected image sizes and would make 15 the
optimal KD-tree terminating layer value (figure 4.19).
Figure 4.11: A KD tree subdivides K dimensional space by applying Binary Search Tree
branching along successive dimensions.
Figure 4.12: Using an encompassing sphere to quickly determine if a KD tree branch
references a volume of feature space that could possibly contain a nearest neighbour.
Figure 4.13: Modifiable KD Tree Build Algorithm
Figure 4.14: Showing a three dimensional half rib orthogonal list structure; each ordered
two-way list represents an axis branching from the previous axis.
(a) 8NN Exhaustive Search (b) 8NN KD-Tree Search
Figure 4.15: Searches conducted among decreasing numbers of points, as in the KD Update
segmentation algorithm.
Figure 4.16: 8NN Modifiable KD-Tree results with Exhaustive 8NN results subtracted.
Searches conducted among decreasing numbers of points, as in the KD Update segmenta-
tion algorithm.
Figure 4.17: 8NN Modifiable KD-Tree results expressed as percentage difference to Ex-
haustive 8NN results. Searches conducted among decreasing numbers of points, as in the
KD Update segmentation algorithm.
Figure 4.18: 3D chart showing the performance of 8 Nearest Neighbour algorithms (as
implemented in this work).
Figure 4.19: Modifiable KD Tree average performance in 8 nearest neighbour searches over
different terminating layer values.
Figure 4.20: Performance of 8-Nearest Neighbour algorithms (as implemented in this
work).
Figure 4.21: Showing the effect of dimensionality change on different Nearest Neighbour
algorithms (as implemented in this work).
Figure 4.22: 3D Chart Showing the performance of 8NN Search using Modifiable KD-Trees
with search spaces of 2500 to 20000 points. The optimal search parameters are highlighted
by the green line.
(a) Optimal Terminating Layers
(b) Optimal Search Times
Figure 4.23: Showing optimal values for 8NN Modifiable KD-Tree search of 2500 to 20000
points.
Figure 4.24: The slower KD Nearest Neighbour update does not begin until the image
content has been grouped and simplified to a certain level.
4.7 The Original Current State Group Description
The segmentation algorithm defined in [Felzenwalb98] treated the problem of segmentation
in terms of set theory, where each group structure contained a set of its member pixels.
As the Gestalt grouping algorithm was based upon this original design, it was natural to
keep this kind of group description (as shown in 13.1) during development. Given this
approach, when a new group is created through the merging of two groups (see 13.2) the
member pixels of one of the groups are added to the other. Once the group descriptions
have been properly updated and the member pixel nodes are adjusted to point to the new
group, the redundant empty group is either stored for later use by the segment filter, or
deleted. Because the member pixel nodes point to the new group, any edge in the edge
list that referenced the old groups now points to the new child group (the same structures
used in section 4.2). Although this allows old edges (originally generated between parent groups)
to still form future connections (as in figures 4.30 and 4.31), all information about the context
within which these edges were generated is lost. Group/segment information can only be
retained for later use if we copy it to another list during the segmentation (see figure 4.25), which
makes the evaluation of useful segments at this point even more critical.
Figure 4.26 shows how the group structures (13.1) and edge pointers are changed after
a merge occurs (algorithm detailed in 13.2). The reassignment of edge pointers from par-
ent groups to child groups results in multiple edges (of different value and from differing
stages of the segmentation) joining the same groups. Whilst this ‘current state’ approach is
economical in terms of memory usage (group numbers decrease whilst pixel information
remains constant) and the speed with which pixel information can be accessed, such minimal
advantages are offset by the loss of information. As work progressed it became increasingly
clear that more detailed information about the segmentation history and the context of edges
would be useful, allowing much greater scope and flexibility when generating signature
descriptions.
Figure 4.25: An overview of the current state grouping algorithm. Because parent groups
are destroyed during the creation of a child group, groups that score highly as segment
primitives need to be copied into a separate Segment League Table before they are de-
stroyed.
(a) Group structures before a merge on edge E1
(b) After the merge, the parent groups are effectively deleted, and any current edges linking to them (such as E2 in this example) are pointed to the new child group.
Figure 4.26: Showing the original grouping process, where group information is lost and
no descendent (D) information is retained. While old links are redirected to new child
groups (resulting in E2 and E3 linking the same groups, but with different values), any
information about the circumstances of the edge generation or the parent groups is lost.
4.8 The New Binary Tree Group Description
Rather than destroying group and edge information once it has been used in a successful
merge, the new binary tree group structures retain all information about that stage of the
segmentation process with relatively few overheads. If we wish to retain the full context of
grouping decisions during the segmentation then we must redesign the algorithm so that
no groups or edges are deleted. In such an algorithm, with an increasing number of groups,
it becomes impractical to store unique lists of image pixels for each group. To avoid this,
we add references between parent and child groups to generate a tree like structure that
can be traced from any child group to the groups representing individual pixel primitives.
Whilst this makes accessing pixel information an iterative and slightly slower process, we
have the advantage of being able to access grouping and decision information from the
tree structure after the segmentation has finished. Our new pixel primitive now becomes
a group structure itself, so we no longer have a need for node structures, which were
used to identify and redirect pixel group membership. As each group merging effectively
results in one less active group (2 parents made inactive, 1 child added) the maximum
number of group structures required can be calculated as:
$$G_p = 2G_i - 1$$

where $G_p$ is the number of groups generated by the entire segmentation and:

$$G_i = ImageWidth \times ImageHeight$$

is the number of groups generated at initialization (one per image pixel).
With anticipated sizes of our query images being anywhere between 10000 and 90000
pixels, this doubling of group structures has a negligible impact upon the algorithms. Of
more importance is the slower access to group pixel data, so every effort is made to reduce
the need for this.
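A sketch of the retained-history structure (Python; the pointer names follow figures 4.27 and 4.28, but the path-compression detail in active() is an illustrative choice, not necessarily the thesis implementation):

    class TreeGroup:
        """A group that is never deleted: merges create children while the
        full grouping history remains traceable in both directions."""
        def __init__(self, parents=None, founding_edge=None):
            self.parents = parents or []        # empty for single-pixel primitives
            self.descendent = None              # D: the child this group merged into
            self.youngest_descendent = self     # YD: fast route to the active group
            self.founding_edge = founding_edge  # FE: context of the creating edge

    def merge_groups(g1, g2, edge):
        """Create a child group; the parents become inactive but are retained."""
        child = TreeGroup(parents=[g1, g2], founding_edge=edge)
        for parent in (g1, g2):
            parent.descendent = child
            parent.youngest_descendent = child
        return child

    def active(group):
        """Follow YD pointers (compressing the path) to the currently active group."""
        while group.youngest_descendent is not group:
            group.youngest_descendent = group.youngest_descendent.youngest_descendent
            group = group.youngest_descendent
        return group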
Figure 4.27 shows an overview of the way the segmentation tree is built up during
the grouping algorithm. Unlike in figure 4.26, no information is lost during the creation of a
child group, although more care must be taken to keep track of which groups are active
within the process. Founding Edge, Descendent and Youngest Descendent pointers are
essential for keeping track of the data structure. It should also be noted that while the
segmentation tree can be traced in both directions, the path upwards from parent to child
is not necessarily the inverse of the path down through founding edges. Because we retain
full edge context, the two groups that originally generated an edge (and can be accessed
through the Founding Edge pointer) are not necessarily the same groups that go on to parent
the child generated by the edge (see figure 4.28). Different connections such as those in
figure 3.25 are not only possible, but identifiable. Whilst this may look complicated, it
allows the segmentation to group current objects based upon the relationships between
their subcomponents at any time during the process. Grouping decisions are now based
around optimal edges between groups at any point in the segmentation history, and the full
context of such decisions is retained. Information from these segmentation trees can now
be used to determine the recency of edge generation, segmentation ranking and provide
much richer group descriptions based around not only segment appearance but also a
segment's development and the tree structure itself. Algorithms 13.3, 13.4 and 13.5 detail
the new grouping algorithms that build up this tree.
(a) Group structures before a merge on edge E1.
(b) After the 1st merge, old groups and edges are retained.
Figure 4.27: Groups and edges (Ei) are retained while Descendent (D), Youngest Descendent
(YD) and Founding Edge (FE) pointers are updated to keep track of active groups
and the grouping history.
(a) Resulting data structures if the grouping from figure 4.27 continues over E2.
(b) Resulting data structures if the grouping from figure 4.27 continues over E3.
Figure 4.28: Merging the same groups across different edges. No part of the segmentation
history is lost in this grouping process. Youngest Descendent (YD) pointers provide fast
access to currently active groups, Descendent (D) pointers provide a grouping history and
Founding Edge (FE) pointers store the context of edges that build up the group structures.
Figure 4.29: An overview of the tree based grouping algorithm. Whilst groups are still
evaluated for their effectiveness as segment primitives, no separate league table is required
as no groups are deleted during the algorithm.
4.9 Discussion
The use of feature space to facilitate a level of Gestalt grouping with spatially disconnected
regions represents an elegant way of generating appropriate structures for human like
recognition. The binary storage tree serves the dual purpose of generating near-parallel
(multi-level) grouping decisions and a rich grouping history for each segment from which
further relationship descriptions could be generated. The retention of edge links from
previous generations of the algorithm also helps facilitate the development of Gestalt
linear and curvilinear groupings, preventing unwanted clumping behaviour between partly
developed primitives (figure 4.30).
Unfortunately, the reverse of this situation occurs where Gestalt groups that are not
fully developed can ‘leak’ into other groupings via older linkages from close matching
primitives (figure 4.31).
The main drawback to the grouping algorithm in this work is a rapidly rising processing
time requirement (figure 4.32) that becomes prohibitive as source image size increases. While some
of this processing requirement is due to the non-linear increase in pixels per image as image
dimension increases, reflected in processing times, a large part is taken by the maintenance
of the binary tree structure and the need to move through parent-child branches during
grouping. As the number of branches in this tree increases, so does the amount of time
taken to traverse the structure. This limits the practical range of image dimensions to
under 200 pixels if our 5 minute recognition target is to be met. In this work, this is not a major
problem as source images are currently restricted (and resized) to 100 by 100 pixels.
This represents the minimum image size range where reasonable geometric content can
be extracted in an acceptable time. As the recognition part of this work is based around
the use of simplified labels stored alongside library images, the grouping work required for
the library can be performed prior to any recognition queries. In a practical user query
situation, only the query image will need to be exhaustively segmented/grouped. Once
we have these raw segment groups, a method of evaluating them and filtering out any
partially developed or non-salient segments is required before they are used to generate
descriptions for the recognition engine.
(a) No retained links (b) Retaining links
Figure 4.30: Retaining groups from previous generations helps prevent clumping behaviour
at higher levels of grouping (left); linear structures are allowed to develop if older links are
retained (right).
Figure 4.31: Although retained links allow the grouping of linear features, they can also
encourage the premature flooding of disparate groups which would ideally be developed
as separate groups before joining.
Figure 4.32: Shows the increase in time taken by the grouping/segmentation engine as
image dimension increases.
Chapter 5
Segment Ranking
5.1 Introduction and Motivation
Once the segmentation algorithm has completed we are left with a segmentation tree
structure that can be used to generate our image description. While all groupings are
retained, there is a great deal of noise and redundancy in the segment tree; figure 5.11
shows some examples of this. Many groups will inevitably represent partially formed
segments before they develop into useful image primitives and will be highly sensitive to
noise and segmentation differences. Some segments will represent noise, or simply be too
small to be useful. Other segments may well be large, but too unreliable to use as the
basis for an image description (5.11c). To reduce the negative effect of such noisy groups
the results of the segmentation are evaluated and ranked by their fitness as useful image
primitives.
What constitutes a fit segment group?
1. Consistency
Even where segments may not explicitly represent visual objects, segment groupings
that consistently capture the similarities between images are considered fitter.
2. Object Definition
Segments should consistently describe/outline those objects common between the images.
3. Symbolic Content
Segment groupings that allow the extraction of symbolic and Gestalt group descriptions
(not just pixel-level clumps) are essential for symbolic similarity recognition and
recognition between different image types.
4. Richness of Description
Segments should provide descriptions that are rich enough to determine
similarity/dissimilarity between images. Groupings that are too common among a wide
variety of images, or are simply too small to provide distinctive descriptions, are of low fitness.
5.1.1 Edge Based Ranking
Good segments will provide image content descriptions that are common between images
that have common photographic or symbolic content. Without prior information or high
level reasoning, segments that represent literal photographic content are reliant upon the
tendency for objects in images to exhibit high colour or texture edges. In a similar way,
fully grown gestalt groupings that represent more symbolic image content can be detected
by the increase in difficulty in grouping with other groups. In this architecture, both types
of edges are treated as the same and can be identified by evaluating the edge between two
parent groups¹. Groupings featuring segments that are dissimilar indicate that the two
parents are fully developed segment groups that were forced together by the segmentation
process. Whilst this reasoning applies to most group properties, such as colour and texture,
it does not hold for size and area information. A characteristic of an underdeveloped
segment group is its combination with much smaller groups as the region grows to its
full potential (like a flood-fill). Figure 5.1 demonstrates that, unlike the other description
properties, a large difference in area between groups connected by an edge is an indicator
that neither is fully developed.
Scoring based around founding edges presents some difficulty, as a single group may
have been part of many founding edges, while others may never have taken direct part in
a grouping decision. This makes it impossible to evaluate the score in this way for a large
number of groups. There is also a larger likelihood that smaller groups will have been used
as parts of founding edges, as joining larger groups through edges created by subgroups
is allowed. By definition, most of the groups in a finite space will be small groups, and will
generate edges that will become founding edges.
Such an approach would use the following:
RankingScore = (1/n) ∑_{k=1}^{n} E_k
¹Which are not necessarily the same as the groups on either side of the founding edge or the child group.
(a) Very different colour and texture values indicate that the groups G2 and G3 were
well developed before being joined along the edge E, and should have high rank as primitives.
(b) Very different area values (especially where one group has a very small area) indicate
that groups G1 and G2 were still growing when being joined along edge E, and should
not be ranked highly as primitives.
Figure 5.1: Contradictions when using edge values to rank segment groups.
where n is the number of Founding Edges Ek that the current segment group is part
of. Each edge Ek has a normalized value between 0 and 1 that represents the difference
between the two groups along the Founding Edge (not including the area component).
In contrast, scoring based around the edges between fellow parent groups is much
simpler as all groups are guaranteed to have a single co-parent (with the exception of
the final group in the process, which represents the entire image). Figure 5.5 shows the
distinction between Founding Edges and Parental Edges.
Unlike Founding Edges, which are generated and stored during the segmentation pro-
cess, Parental Edges are evaluated during the later ranking process. Using Parental Edges
(and Parent Groups) provides greater simplicity, requires only a single edge calculation
to rank each Parent Group pairing and is more likely to pair groups of diverse description
(Founding Edges will never feature groups generated later in the segmentation process
than the Parental Edges). For this reason, it was decided that Parental Edge values would
be used to evaluate segments, as shown in algorithm 5.1.
(a) Evaluating the score of segment group 2
from the founding edges that it is linked with.
(b) Evaluating the score of segment group 2
from the edge between it and group 3, which
both form parents to child group 4.
Figure 5.2: Segment group scoring based upon edge values.
5.1.2 Area Based Ranking
Another important factor when evaluating the usefulness of a segment is its size. Whilst
the inclusion of a size bias may at first seem counterproductive to invariant recognition, it is
an important component in human vision and represents several practical advantages. Seg-
ment groups with larger areas have a greater capacity to contain useful textural, boundary
and shape information. Useful image primitives are also more likely to occur with larger
segments where aliasing in the image has a less detrimental effect and the reliability and
scope for variation in descriptions (such as those derived from boundaries) that depend
upon more developed levels of grouping are increased. In human vision, larger image objects
will also usually take precedence in terms of recognition over smaller ones. Conversely,
noise pixels and groups with little geometrical value will be those groups that have very
little area. As virtually all invariant descriptions are derived from relational measures and
ratios between properties, descriptions based around shape and area (which are limited to
a 1 pixel minimum resolution) will have a larger and more stable range in groups of larger
sizes (see figure 5.6).
Ranking of groups based around their size increases the importance of the less volatile
and information rich groupings while reducing the impact of noise. Another reason for
utilizing size and area information when ranking groups is that, as discussed earlier, large
differences in size indicate undeveloped regions and low differences in size indicate better
developed regions (figure 5.1). Unfortunately, when dealing with real images, the area
ratio alone is not sufficient when ranking groups. Using ratios alone would result in
(a) Photographic Images: Avg. 10000.3 out of 19999 are segment groups linked to Founding
Edges; Mean Founding Group area: 1.0; Mean Other Group area: 436.8.
(b) Images with Gestalt Content: Avg. 10165.6 out of 19999 are segment groups linked to
Founding Edges; Mean Founding Group area: 1.1; Mean Other Group area: 208.4.
Figure 5.3: Results relating to segment groups that form the basis for used Founding
Edges and differences between Gestalt and non-Gestalt image types. Groups associated
with Founding Edges have been used as part of a successful merge.
disproportionately high scores where single pixels are first joined into pixel pairs with
a ratio of 1:1, producing a maximal ranking. It is clear that an absolute measure of
group area should be used when ranking segment groups. Algorithm 5.2 uses the smallest
normalized area component of the two groups connected by a parental edge to determine
their rank scores.
5.2 Combining Edge and Area Ranking
Segmentation results using Parental Edge magnitude work well with Mondrian image types
(figure 5.8a) but are susceptible to noise in natural images (figure 5.7a). Conversely the use
of minimum Parental Area works effectively with natural images (figure 5.7b) but favours
undeveloped large areas in Mondrian images (figure 5.8b). Algorithm 5.3 combines both
measures to generate a ranking algorithm that works effectively for both image types (see
figure 5.9).
5.3 Correspondence Ranking
Although the combined ranking algorithm represents an improvement to segment scoring,
it can be seen from figure 5.9 that both image types still suffer from noise and the effects of
large background sub-segments. As one of the main aims of the segment grouping process
(a) Photographic Images: Avg. 10000 out of 19999 are segment groups linked to Founding
Edges; Mean Founding Group area: 1.0; Mean Other Group area: 363.66.
(b) Images with Gestalt Content: Avg. 10058 out of 19999 are segment groups linked to
Founding Edges; Mean Founding Group area: 1.173; Mean Other Group area: 303.782.
Figure 5.4: Results relating to segment groups that form the basis for used Founding
Edges, a specific example. The added gestalt groupings between larger groups, generated
by the superimposed black dot pattern, results in extra Founding Groups of larger area.
is to extract equivalent groups from different images (see figure 5.10), good segment groups
are those that remain present even after undergoing a transformation.
We can use this property as a means to further filter out noise and segment groupings
that do not sufficiently represent image objects, such as large background sub-segments.
The development and shape of large background sub-segments that score too highly are
largely dependent upon the way the segmentation process operates. This results in under-
developed segments with many possible identical connections evolving in different ways
depending upon the way the architecture orders identical connections. Such groups are
unlikely to have good corresponding groups in a transformed image (see figure 5.11) and
the comparison of segment groups generated between the original image and a transformed
image can be used to filter out these unwanted segment groups. Finally, only the more
robust segment groups will maintain equivalence under a transformation, a property which
(a) The Founding Edge of group 4 is gen-
erated by the relationship between ancestor
groups 2 and 3, which are also the direct par-
ent groups to group 4. This results in both
the Founding Edge and Parental Edge being
identical and linking identical groups.
(b) In this case, group 5 was created by the
merging of its parents, groups 1 and 4, the
edge between these is the Parental Edge. The
actual edge along which the merge occurred
(the Founding Edge) was between the ances-
tor sub-groups 1 and 2. This demonstrates
that the Founding Edge and the Parental
Edge are not always identical and can adjoin
different groups.
Figure 5.5: Demonstrating the difference between Founding Edges and Parental Edges.
makes them valuable as the basis of an invariant description between equivalent images.
The Correspondence ranking algorithm 5.4 details the process of finding consistent
segment groups by comparing them with (de-transformed) groups generated from the
same image, but under a transformation.
Whilst this is slower than ranking purely by area and grouping history, requiring two
separate segmentations of the same image, it can be sped up if we select transformations
that reduce the size of the source image, reducing the segmentation overheads. A tradeoff
must be made, depending upon the priorities of the segmentation engine, between speed
and better segment ranking.
Upon visual inspection of ranked segmentation results, the implementation of correspondence
ranking was observed to result in minimal improvement while nearly doubling
processing requirements. For the remainder of this work, it was decided that correspondence
ranking would be abandoned in favour of the much faster ranking by minimum area
algorithm detailed in the previous section (5.2). Figures 5.12 and 5.13 show examples
of final segmentation rankings.
As the ranking system is dependent upon the size of the optimal edge required to
Algorithm 5.1 Ranking by Parental Edge
Good for Mondrian images.
Less effective for natural images (noise pixels often have high edge values).
E = Edge(Group1, Group2)
Segment Ranking Score(Group1) = Segment Ranking Score(Group2) = 1 − E
Where edge E is the Parental Edge and has a normalized value between 0 and 1 that
represents the difference (not including the area component) between two groups that
were merged to form a child group. Both these parent groups will be assigned this score.
The Segment Ranking Score lies between 0 and 1, with lower values indicating higher fitness.
combine two parent groups into a new child group, weighted by the size of the smallest parent
group, the ranking output will have a tendency to appear in pairs with the same score (as
shown in rankings 1 and 2, 3 and 4 etc. in figure 5.13). Whilst the outputs do provide useful
groupings to base a search upon, they are not as good as expected. This would appear
to be partly due to approximations of descriptors such as shape and position resulting
in larger differences between parent groups than are actually perceived by humans, and
partly due to the rather simplistic (but fast) method used to actually rank the segments.
On the whole, whilst the segmentation engine is indeed providing us with many useful
groupings, the ranking system as currently defined is not optimally separating out the
useful groups from the less useful groups. However, as the entire segmentation history is
kept in this system, some method for reducing the number of segment groupings going
on to form descriptions for the search engine is essential, and these results were judged
sufficient for our current purposes. Improvement of the segment ranking engine is one
area that lends itself to exploration in further work and could greatly improve recognition
results in the final search engine.
(a) Area: 5 pixels (b) Area: 24 pixels (c) Area: 104 pixels
Figure 5.6: Larger groups allow greater resolution, less sensitivity to error and better
textural, shape and boundary information.
Algorithm 5.2 Ranking by smallest parent group
Good for natural images, but can miss important smaller details.
Natural filter for noise.
Less effective with Mondrian images.
A1 = Area(Group1)
A2 = Area(Group2)
S = min(A1, A2)
Segment Ranking Score(Group1) = Segment Ranking Score(Group2) = 1 − S
Where A1 and A2 are the normalized area values of both parent groups being evaluated.
The Segment Ranking Score lies between 0 and 1, with lower values indicating higher fitness.
Algorithm 5.3 Combined ranking algorithm
Combines both Parental Edge and minimum Parental area ranking.
E = Edge(Group1, Group2)
A1 = Area(Group1)
A2 = Area(Group2)
R = E × min(A1, A2)
Segment Ranking Score(Group1) = Segment Ranking Score(Group2) = 1 − R
Where edge E is the Parental Edge and has a normalized value between 0 and 1 that
represents the difference (not including the area component) between two parent groups
(Group1, Group2) that were merged to form a child group. A1 and A2 are the normalized
area values of both parent groups. The resulting score between 0 and 1 (where lower
values indicate higher fitness) is assigned to both parent groups.
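Expressed in code, the combined rule above reduces to a one-liner; the following Python sketch assumes all inputs are already normalized to [0, 1]:

```python
def combined_rank_score(edge_value, area1, area2):
    """Sketch of Algorithm 5.3: combined Parental Edge / minimum parent area ranking.

    edge_value: normalized Parental Edge difference (area component excluded)
    area1, area2: normalized areas of the two parent groups
    Returns a score in [0, 1]; lower values indicate higher fitness, and both
    parent groups receive the same score.
    """
    return 1.0 - edge_value * min(area1, area2)
```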
Algorithm 5.4 Correspondence ranking algorithm
S1 = Seg(I)
It = I(M × R × S)
S2 = Seg(It)
S2 = S2(S⁻¹ × R⁻¹ × M⁻¹)
Where I is the original image, Seg() is the segmentation algorithm, Sn is a list of segment
groups and M, R and S are Mirror, Rotation and Scale transformations (the second group
set is de-transformed by applying their inverses in reverse order). S1 and S2 should
now contain equivalent good segment groups, with bad or unreliable groups not having
equivalents.
For all Gi in S1:
    Ge = most similar segment group in S2
    E = Edge(Gi, Ge)
    Correspondence Score(Gi) = Segment Ranking Score(Gi) × E
Where both the Correspondence Score and Segment Ranking Score lie between 0 and 1,
with lower values indicating higher fitness.
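A loose Python sketch of this procedure is given below; `seg`, `transform`, `inverse` and `edge` are assumed callables standing in for the segmentation engine, the M·R·S transformation, its inverse, and the normalized edge difference, and `rank_score` is assumed to hold the score from Algorithm 5.3:

```python
def correspondence_scores(image, seg, transform, inverse, edge):
    """Correspondence ranking sketch (cf. Algorithm 5.4); all names are assumptions."""
    s1 = seg(image)                                   # groups from the original image
    s2 = [inverse(g) for g in seg(transform(image))]  # de-transformed groups from the copy
    scores = {}
    for g in s1:
        ge = min(s2, key=lambda h: edge(g, h))        # most similar de-transformed group
        scores[g] = g.rank_score * edge(g, ge)        # robust groups keep low (fit) scores
    return scores
```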
(a) Ranking based upon Primary Edge values.
(b) Ranking based upon minimum parent area
Figure 5.7: Top 20 ranked segments with a natural image type. Notice that the Primary
Edge based results suffer adversely from small noise pixels.
(a) Ranking based upon Primary Edge values.
(b) Ranking based upon minimum parent area
Figure 5.8: Top 20 ranked segments with a Mondrian image type. Notice that the undevel-
oped backgrounds in the minimum Parent area based results are being scored too highly
whereas the use of Primary Edge ranking successfully extracts the major groupings.
(a) Mondrian Image Type
(b) Natural Image Type
Figure 5.9: Top 20 Results of combined Parental Edge and minimum Parental area ranking.
(a) (b)
Figure 5.10: Showing a simulated example of desirable direct segment equivalence. In
order to recognise the similarity between objects in photographic images segment groups
that represent image objects regardless of transformation are encouraged.
(a) Original image (b) Transformed image
(c) Original group (d) Corresponding
group from trans-
formed set
(e) Low fitness group-
ing
(f) Nearest grouping in trans-
formed set
Figure 5.11: Segments of high fitness will appear consistently in images of the same subject,
regardless of transformations. Significant groups from the original set (shown on the
left) will have corresponding groups in the segmentation of a deliberately transformed
image (right). Partially grown segments (such as in e to f) will not have corresponding
counterparts in the transformed set and can therefore be assigned a low score.
Figure 5.12: Example showing final groups extracted by rank; whilst sufficient for our
purposes, these results indicate that the ranking engine still has scope for improvement.
Figure 5.13: Example showing final groups extracted by rank; whilst sufficient for our
purposes, these results indicate that the ranking engine still has scope for improvement.
Chapter 6
Generating Segment Group
Descriptions
6.1 Overview Of Description Types
At this point in the process we have a ranked list of segment groups, each with a traceable
grouping history and it’s own appearance attributes. From these features it is possible to
generate descriptions of different degrees of invariance that can be used to facilitate efficient
matching between individual segment group lists or pairwise signature descriptions. Such
identifiers can be broadly broken down into the following three primary types:
1. Appearance Descriptions
2. Relational Descriptions
3. Structural Descriptions
Parallel to this, it is important to consider the degree of invariance that description
types offer and which type of similarity we require. The different levels of invariance
rest along a scale that stretches from direct image content matching and cartoon recog-
nition to recognition of photographic image content undergoing perspective and lighting
change. These can be broadly separated into two categories: geometric (table 6.1) and
colour/intensity (table 6.2).
As a general rule, greater amounts of information are required to generate higher levels
of invariance. The very fact that each level of invariance disregards certain information
contained within the image (in order to remain constant under change in that information)
can result in a loss of descriptive power. High orders of invariance can also suffer from
error propagation where initial discrepancies in values (due to image aliasing or noise) have
a detrimental effect upon the final description, although this can be limited through the
More suitable for direct image matching, trademark recognition etc.
1. Euclidean Invariance
(Invariance to translation and rotation)
2. Similarity Invariance
(Euclidean invariance, scale invariance)
3. Affine Invariance
(Similarity invariance, distortion/stretching invariance)
4. Projective Invariance
(Affine invariance, perspective semi-invariance)
More suitable for photographic image content matching.
Table 6.1: Showing geometric invariants of increasing order. In general, low level invariants
provide good descriptors for tasks requiring direct match while higher levels are more
appropriate for content that has undergone transformation, such as that caused by changes
in viewpoint in photographic images.
careful use of reliability measures. Whilst full perspective invariants cannot be generated
from a single image source, projective and affine invariants can be effective for identifying
such transformed objects if a sufficient quantity of sub-features are considered.
6.2 Photometric Descriptions
Photometric descriptions can be generated from a single segment group and are based
around a segment's colour values. Information about colour, texture and intensity is most
effective when used to identify photographic content and is particularly robust to geometric
and perspective change. Whilst these measures are generally less effective at identifying
photographic objects under differing lighting conditions, or images (such as greyscale)
without colour diversity, satisfactory levels of semi-invariance can be achieved. Of partic-
ular use is the HLS (Hue, Luminance, Saturation) representation where colour properties
can be more naturally represented and separated out. Figure 6.1 shows both RGB (the
default computer representation of colour information) and the HLS space that can be
calculated from this. Luminance values remain constant over colour change, whereas Sat-
uration levels will generally remain constant over common image transformations such as
brightening/darkening. Saturation is also a useful, often overlooked, description property
of image type. Even with different luminance and hue values, images with similar satu-
ration levels will be perceived as having similar properties. Images with zero saturation
More suitable for photographic and colour/intensity variable content
1. Intensity differences
(semi invariance to intensity changes)
2. Normalized rgb
(semi invariance to saturation changes)
3. Hue descriptions
(semi invariance to changes in saturation and intensity)
4. Geometrical descriptions
(colour information abandoned in favour of geometry)
More suitable for symbolic content where geometry is used for recognition
Table 6.2: Types of colour description in rough ascending order. Many of these are
useless when used to describe grey-scale or black and white images; separation into distinct
categories is also more problematic.
levels are greyscale, mid level saturation suggests more ‘natural’ and photographic scenes
while large saturation values suggest artificial or even cartoon style images. Hue is espe-
cially useful for its tolerance to changes in natural lighting conditions, making it an ideal
measure for colour photographic image content. Unfortunately, Hue has the drawback of
being an angular measure, so for accurate similarity measures, polar comparison must be
used. Standard Euclidean, or even Manhattan, linear similarity measurements can be used
if the angular Hue measure is first transformed into a unit vector for comparison, although
this will result in a warping of similarity measurements. Further distortion occurs due to
transformation from an original discrete RGB space.
Descriptions based around colour/intensity values can be highly localized, generated
from small sub-neighborhoods of component pixels, or be created from average values of
segment appearance. Descriptions that are generated from RGB values include:
Intensity
I(R, G, B) = R + G + B
Normalized colours
r(R, G, B) = R / (R + G + B)
g(R, G, B) = G / (R + G + B)
b(R, G, B) = B / (R + G + B)
Saturation
S(R, G, B) = (max(R, G, B) − min(R, G, B)) / 255
where max() is the highest colour value and min() is the lowest colour value
Unit Hue vector
Hx(R, G, B) = sin(H(R, G, B)), Hy(R, G, B) = cos(H(R, G, B))
where:
H(R, G, B) = 0 if R = G = B (S = 0)
H(R, G, B) = ( (G − B) / (max(R, G, B) − min(R, G, B)) ) · π/3 if R = max(R, G, B)
H(R, G, B) = ( 2 + (B − R) / (max(R, G, B) − min(R, G, B)) ) · π/3 if G = max(R, G, B)
H(R, G, B) = ( 4 + (R − G) / (max(R, G, B) − min(R, G, B)) ) · π/3 if B = max(R, G, B)
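Assuming 8-bit RGB input, these definitions translate directly into code; the short Python sketch below follows them literally (the function name and the zero-sum guard for the normalized rgb values are added assumptions):

```python
import math

def photometric_descriptors(R, G, B):
    """Photometric values per the definitions above, for 8-bit channels (0-255).
    Returns intensity, normalized (r, g, b), saturation and the unit Hue vector."""
    total = R + G + B
    I = total
    r, g, b = ((R / total, G / total, B / total) if total else (0.0, 0.0, 0.0))
    mx, mn = max(R, G, B), min(R, G, B)
    S = (mx - mn) / 255.0
    if mx == mn:                       # R = G = B: hue undefined, set to 0
        H = 0.0
    elif mx == R:
        H = ((G - B) / (mx - mn)) * (math.pi / 3)
    elif mx == G:
        H = (2 + (B - R) / (mx - mn)) * (math.pi / 3)
    else:                              # B is the maximum channel
        H = (4 + (R - G) / (mx - mn)) * (math.pi / 3)
    return I, (r, g, b), S, (math.sin(H), math.cos(H))
```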
6.3 Geometric Descriptions
Unfortunately, these colour measures are almost useless when used to describe black and
white, cartoon, symbolic or shape dependent images. Identifying segment groups based
around shape is generally a more difficult task, but necessary where intensity/colour infor-
mation is either not available or disruptive to the similarity desired. Of particular use for
shape description is the convex hull of the segment group, which provides extra informa-
tion about the shape and also facilitates similarity recognition between collections of shapes,
sketch outlines and full segment groups that define similar gestalt shapes. As the convex
hull of a group is derived from that group’s contour, relationships between the group and
its convex hull are retained even under Affine transformations. Slightly harder to calculate
efficiently, but extremely useful for invariant description, are inherent shape topological
features such as Euler's number (which is based around the number of holes within an
object) and hole to region ratios. Topological descriptors are particularly appropriate for
the categorization of shapes into broad types, even where instances of appearance and
dimension may differ greatly. Such topological similarities are easily perceived and widely
used in human vision. Many description types of lower invariance, especially directional
vectors, can form the basis of ratios between multiple related groups in order to achieve
higher invariance levels, as will be shown later.
Single group non-invariants used to detect direct/symbolic similarity:
Proportionate Areas
Pa = area / (image width × image height)
PHa = convex hull area / (image width × image height)
Proportionate Centroid Positions
Pcx = x̄ / image width
Pcy = ȳ / image height
PHcx = hull x̄ / image width
PHcy = hull ȳ / image height
Semi-invariant descriptions derived from single segment groups:
Principal axes of inertia (translation and scale invariant)
AIx = sin(θ), AIy = cos(θ)
where θ = (1/2) tan⁻¹[ 2m11 / (m20 − m02) ]
Unit vector between centroid and hull centroid
vcx = vx / l
vcy = vy / l
where l = √(vx² + vy²)
vx = hull x̄ − x̄
vy = hull ȳ − ȳ
Eccentricity (Similarity invariant)
E = ( m20 + m02 + √((m20 − m02)² + 4m11²) ) / ( m20 + m02 − √((m20 − m02)² + 4m11²) )
Non-compactness (Similarity invariant)
NC(perimeter, area) = perimeter² / area
NCH(hull perimeter, hull area) = hull perimeter² / hull area
Boundary moments (Similarity invariant)
F1 = (p2)^(1/2) / o1
F2 = p3 / (p2)^(3/2)
F3 = p4 / (p2)²
where
or = (1/N) ∑_{i=1}^{N} [z(i)]^r
pr = (1/N) ∑_{i=1}^{N} [z(i) − o1]^r
N = number of boundary points
z(i) = sequence of Euclidean distances from boundary points to centroid
Shape moments up to order 3 (Similarity invariant)
ρ1 = ν20 + ν02
ρ2 = (ν20 − ν02)² + 4ν11²
ρ3 = (ν30 − 3ν12)² + (3ν21 − ν03)²
ρ4 = (ν30 + ν12)² + (ν21 + ν03)²
Shape moments up to order 4 (Affine invariant)
I1 = ( m20·m02 − m11² ) / m00⁴
I2 = ( m30²·m03² − 6·m30·m21·m12·m03 + 4·m30·m12³ + 4·m21³·m03 − 3·m21²·m12² ) / m00¹⁰
I3 = ( m20·(m21·m03 − m12²) − m11·(m30·m03 − m21·m12) + m02·(m30·m12 − m21²) ) / m00⁷
I4 = ( m20³·m03² − 6·m20²·m11·m12·m03 − 6·m20²·m02·m21·m03 + 9·m20²·m02·m12²
      + 12·m20·m11²·m21·m03 + 6·m20·m11·m02·m30·m03 − 18·m20·m11·m02·m21·m12
      − 8·m11³·m30·m03 − 6·m20·m02²·m30·m12 + 9·m20·m02²·m21²
      + 12·m11²·m02·m30·m12 − 6·m11·m02²·m30·m21 + m02³·m30² ) / m00¹¹
where (for all moment calculations above):
νpq = mpq / (m00)^γ
γ = (p + q)/2 + 1
mpq = ∑ (x − x̄)^p (y − ȳ)^q f(x, y)
where f(x, y) = 1 in this (binary) case,
x, y are the coordinates of a group pixel, and
x̄, ȳ are the group centroid coordinates.
Convex hull area and group area ratio (Affine invariant)
HA(hull area, group area) = hull area / group area
Euler's Number (Affine invariant)
EN = Filtered(S) − Filtered(N)
where Filtered(i) = number of elements in i of area greater than mean area(i) / 4,
S is the set of contiguous parts, and
N is the set of holes within the contiguous parts.
Hole Area and Region Area ratio
Hr = Hole area / Region area       if Region area ≥ Hole area
Hr = 2 − Region area / Hole area   if Region area < Hole area
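As a concrete illustration of how the moment-based measures are computed, the Python sketch below evaluates the central moments and two of the similarity invariants for a binary group supplied as a list of pixel coordinates. The function names are illustrative, and the centroid is recomputed on each call for clarity rather than speed:

```python
import math

def central_moment(pixels, p, q):
    """m_pq for a binary group; pixels is a list of (x, y) and f(x, y) = 1."""
    n = len(pixels)
    cx = sum(x for x, _ in pixels) / n   # group centroid x̄
    cy = sum(y for _, y in pixels) / n   # group centroid ȳ
    return sum((x - cx) ** p * (y - cy) ** q for x, y in pixels)

def eccentricity(pixels):
    """E from the second-order central moments; note that degenerate,
    line-like groups drive the denominator towards zero."""
    m20 = central_moment(pixels, 2, 0)
    m02 = central_moment(pixels, 0, 2)
    m11 = central_moment(pixels, 1, 1)
    root = math.sqrt((m20 - m02) ** 2 + 4 * m11 ** 2)
    return (m20 + m02 + root) / (m20 + m02 - root)

def non_compactness(perimeter, area):
    """NC = perimeter^2 / area (similarity invariant)."""
    return perimeter ** 2 / area
```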
6.4 Pairwise Descriptions
Whilst single region descriptions are very useful for recognition purposes, especially lower
level invariance, descriptions based around pairwise relationships of regions provide a larger
source of primitives from which higher levels of invariance can be derived. For pairwise
descriptions to be effective, two often conflicting conditions must be satisfied. The two
regions should be related to each other in such a way that they are both similarly trans-
formed by whatever transformations occur in image content. In the perspective case, this
requires the objects they represent to be placed far enough from the camera and close
enough to each other that the transformations between them can be approximated by
affine invariants. The second condition is that the descriptions generated from the two
regions contain enough dissimilarity to generate robust and distinctive invariant descrip-
tions from. In natural images, these two criteria do not often sit well together, as a large
factor in defining a relationship between two objects is their similarity.
Given our pairwise tree structure generated from the segmentation engine, we already
have two immediate types of pairwise region relationships we can use to generate pairwise
descriptions. Each region in our segmentation tree has been generated as a result of some
similarity relationship between two smaller parent/grandparent regions. This Founding
Edge between Founding Parents precipitates a merge between two distinct Parent Regions
(which could be the founding parents, or their progeny). Because Founding Edges may
have been generated during a much earlier generation of the segmentation process, the two
Founding Regions that cause the pairwise grouping may well be fairly distant ancestors,
and are not necessarily the same as the Parent Regions that are actually combined. Most
Founding Regions in this segmentation scheme are matched during the early stages and
can be considerably different to the child groups that they will actually join. Given that
the same content in different images should follow the same patterns of combination,
these parent groups form good candidates from which pairwise descriptions/relationships
can be derived. Region Parents are especially useful because (unlike Founding Parents)
they are not as constrained to being similar in appearance and can generate a wider range
of relational values.
Parent region colour/intensity based pairwise linear invariants:
Intensity Difference
I(Region1, Region2) = Abs(I(R1, G1, B1) − I(R2, G2, B2))
Red Channel Difference
R(Region1, Region2) = Abs(R1 − R2)
Green Channel Difference
G(Region1, Region2) = Abs(G1 − G2)
Blue Channel Difference
B(Region1, Region2) = Abs(B1 − B2)
Saturation Difference
S(Region1, Region2) = Abs(S1 − S2)
Parent region geometric pairwise translation invariants:
Region Centroid Axis Differences
(C1x − C2x) / image width
(C1y − C2y) / image height
√((C1x − C2x)² + (C1y − C2y)²) / √(image width² + image height²)
Parent region colour/intensity based pairwise ratios:
Ir = I1 / I2 if I2 ≥ I1;   I2 / I1 if I2 < I1
Rr = R1 / R2 if R2 ≥ R1;   R2 / R1 if R2 < R1
Gr = G1 / G2 if G2 ≥ G1;   G2 / G1 if G2 < G1
Br = B1 / B2 if B2 ≥ B1;   B2 / B1 if B2 < B1
Sr = S1 / S2 if S2 ≥ S1;   S2 / S1 if S2 < S1
Parent region geometric pairwise invariants:
Ar = Area1 / Area2 if Area2 ≥ Area1;   Area2 / Area1 if Area2 < Area1
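All of these ratios follow the same pattern, dividing the smaller value by the larger so the result stays in [0, 1]; a single helper captures it. The zero-value handling here is an added assumption, since the text defers such degenerate cases to the weighting scheme of section 7.3:

```python
def bounded_ratio(v1, v2):
    """Order-independent ratio in [0, 1]: the smaller value over the larger,
    matching the piecewise pairwise-ratio definitions above."""
    if v1 == 0 and v2 == 0:
        return 1.0                 # both zero: treat as identical (assumption)
    return min(v1, v2) / max(v1, v2)
```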
6.5 Preprocessing for Geometric Descriptions
While colour descriptions are generated as part of the segmentation process and can be
easily calculated from raw group information, some of the geometric descriptions require
preprocessing before they can be generated. Group boundary and outer boundary pixels
are required to generate convex hull vectors [Gra] and hole/region counting is required
to create Euler's number. The Grab Hole Details function performs a connective binary
segmentation on the group and results in descriptions of disparate hole and region areas
contained within the group. Resulting hole and region groups are rejected if they are
smaller than a threshold based upon the average size of their type.
Number of Holes = ∑ (holes with area > Mean Hole Area / 4)
Number of Regions = ∑ (regions with area > Mean Region Area / 4)
All holes that lie on the boundary of the grid used in this function are then discounted
so that the surviving holes are truly interior to the group regions.
Each group that will generate a boundary description is also put through the Strip To Boundary
function which removes its inner region pixels to leave the closed boundaries. After
boundary information has been used, these boundary pixels are then submitted to the
Strip To Outer Boundary function, that leaves only the most extreme boundary pixels
that Get Convex Hull uses to generate convex hull vectors.
(a) Our original point in RGB space. (b) Point transformed into HLS space
Figure 6.1: Showing ideal RGB and HLS spaces; the transformation from discrete RGB
space actually results in a double hexagonal pyramid subspace of HLS space being used.
Chapter 7
Searching the Label Database
7.1 Description Labels
The segmentation process retains a very large list of ranked regions linked together by edge
information into a pairwise grouping tree. The majority of the region groups, especially
those representing individual pixels, will be too small to provide any usable geometric
information. Only the highest ranked regions are likely to be of sufficient size and visual
significance to provide useful geometric information, or represent good object primitives
for human like recognition. Given the extra processing needed to generate geometric
descriptions, it is sensible to only use the best n regions from the ranked table when
creating full descriptions. A further motivation for reducing the number of groups used to
generate final descriptions is the increased processing overhead when generating geometric
information such as convex hulls and Euler's number (requiring the recognition of holes
within groups). In this work, the top 256 ranking groups are used to generate description
labels. Each of the selected regions contributes descriptions to a set of arrays that make
up a single label description of that particular image content. It is important to ensure
that the number of individual description types used is sufficient to cover a wide range
of invariant types and levels of invariance available. It was decided that 48 description
types would be sufficient to fulfil these criteria whilst remaining small enough to allow fast
comparison and analysis of results. Each label consists of these 48 description types, of
varying degrees of invariance, each made up of the values and weights generated from
the ranked list of segment groups. The order of each of the 256 individual values in a
description directly mirrors the rank of the segment that produced it, so values generated by
a specific group can easily be re-grouped.
Rank Table: Group A, Group B, Group C, Group D, ..
Results in the following image label[no. descriptions][no. groups used]:
Description[1]=Intensity: i(A), i(B), i(C), i(D), ...
Description[2]=Normalized Red: r(A), r(B), r(C), r(D), ...
....
Description[5]=Saturation: Sat(A), Sat(B), Sat(C), Sat(D), ...
Description[6]=Huex: Huex(A), Huex(B), Huex(C), Huex(D), ...
....
Description[12]=Eccentricity: e(A), e(B), e(C), e(D), ...
....
Each label is stored (without normalization of component values) with an identical
file number to the source image. In this way we are left with a series of labels that
can be searched for similarities and easily matched up with the images they represent
without the need to regenerate expensive segmentation groupings for the library images.
This offline library preprocessing for simplified description labels will facilitate very fast
database querying, also allowing the easy addition of new images to the existing library.
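Conceptually, each label is a 48 × 256 array of raw values indexed by description type and segment rank. The Python sketch below illustrates this arrangement; the names and the use of numpy are assumptions for illustration, not the thesis implementation:

```python
import numpy as np

NUM_DESCRIPTIONS = 48   # description types per label
NUM_GROUPS = 256        # top-ranked groups contributing values

def build_label(ranked_groups, descriptor_fns):
    """Assemble one image label: rows are description types, columns follow
    segment rank, so one column holds all values produced by a single group."""
    label = np.zeros((NUM_DESCRIPTIONS, NUM_GROUPS))
    for col, group in enumerate(ranked_groups[:NUM_GROUPS]):
        for row, fn in enumerate(descriptor_fns):
            label[row, col] = fn(group)   # raw, un-normalized value
    return label
```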
7.2 Normalizing Descriptions
The description types listed above will actually have values with differing degrees of sep-
aration and range. If these measures are to be combined into a single image similarity
evaluation strategy spanning description types then some method of normalizing them
into equivalent units of measure is required. While many of these measures (such as those
relating to area and colour) have easily definable limits, others (such as moments) are
much more difficult to constrain within a normalized range. Merely ensuring that values
lie within an agreed maximum limit is also insufficient, as each measure may well have
differing degrees of separation and normal value ranges which would result in some de-
scription types having a disproportionate influence upon the similarity evaluation (figure
7.1).
The solution used in this work is to use the same range normalization process upon each
description, based around the values generated during database construction rather than
the application of predefined thresholds/limits. In this case we not only need to determine
maximum threshold values for descriptions, but also minimum thresholds. While there
are many ways of determining appropriate upper and lower bounds from a data set, the
approach used in this work is through the use of mean values as shown below.
Normalization (ND) for each description type (D):
ND = (CD − lower) / (higher − lower)
where the constrained description (CD) is:
CD = D        if lower ≤ D ≤ higher
CD = higher   if D > higher
CD = lower    if D < lower
and the higher and lower bounds are determined using mean values:
lower = mean of all Di with Di < mean(D)
higher = mean of all Di with Di ≥ mean(D)
Although the use of median values may provide better higher and lower bounds, and
allow greater tolerance to non-representative values, mean values have been implemented
due to their comparative efficiency. Using this self normalization strategy automatically
constrains each description type to a range defined by the typical variation and bounds of
the values that are to be searched in the image database. This will result in descriptors
that can be evaluated for similarity and combined into an overall score without individual
description type bias.
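A minimal Python sketch of this self-normalization follows, assuming the mean-split reading of the bounds given above; the guards against empty splits and a zero span are added assumptions:

```python
def self_normalize(values):
    """Mean-split range normalization for one description type across the
    database: 'lower'/'higher' are the means of the values below/above the
    global mean; inputs are clamped to that range and rescaled to [0, 1]."""
    m = sum(values) / len(values)
    low = [v for v in values if v < m] or [m]    # guard against an empty split
    high = [v for v in values if v >= m] or [m]
    lower = sum(low) / len(low)
    higher = sum(high) / len(high)
    span = (higher - lower) or 1.0               # avoid division by zero
    return [(min(max(v, lower), higher) - lower) / span for v in values]
```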
7.3 Weighting Descriptions
When dealing with ratios we need the ability to handle the infinite values caused by
division by zero. One way to deal with these exceptions is to assign them a default value
and give them a weight which can be set to zero to remove them from consideration.
Weighting each description also allows us to adjust its influence to reflect the score of the
region group it is generated from.
Each description atom consists of:
value (double precision, normalized by the search database so that 0 ≤ value ≤ 1)
weight (double precision, 0 ≤ weight ≤ 1)
Another case where we may wish to adjust weighting values on an individual level
occurs where the nature of the data available makes a particular description inherently
unsound. An example of this is the volatility of rgb ratios and hue measurements in images
where saturation levels approach zero, which indicates that some weighting based around
saturation levels may be beneficial. Region size has already been used to influence the
region score, so individual weighting against the increased sensitivity to individual pixel
errors/image grid aliasing effects as groups decrease in size would seem unnecessary at
this point. Although not investigated in this work, the implications of such feature-based
weightings may well represent an interesting avenue for further work.
A further possible use for this weighting system that was originally intended to be
covered in this thesis was the use of weights to adjust the contribution of individual de-
scription types to generate optimal recognition results for different image types. Work
reported later in this thesis indicates that the adjustment of weights can indeed improve
recognition results in the general case and could be used to tailor searches to different
search requirements and image types. Automatic relaxation techniques and genetic al-
gorithms would be particularly useful here to determine optimal weighting based upon
search results given different types of source image. These weight profiles could then be
stored and selected either automatically or by the user to enhance and direct the search
engine performance. This topic is discussed later in section 8.2.
7.4 Searching the Database Labels for Image Similarity
Each database image is represented by its label, which in turn is subdivided into description
types which contain the individual descriptors generated from each region group in order
of rank. It is very unlikely in all but the most exact matching images that groups will
be ranked in the same order between two similar images, so a direct comparison of label
content will not be sufficient to determine similarity. We must first establish the best match
between descriptor indexes of each label, which is the equivalent of matching individual
region groups between the two images, as in algorithm 14.1. Correspondence scores are
weighted by both the weight (reliability measure) of the descriptor and a global user
weighting (that can be used to adjust the global input of a description type). Although
one of the fastest (and straightforward) methods of determining equivalence, this does
not represent a one to one correspondence between segments and it is possible for many
query segments to be equated with the same library image segment. This can result in
the counter-intuitive case where recognition scores between query and target images are
non-reversible. The similarity result returned from label comparison is entirely dependent
upon the direction of the query (figure 7.2).
Dependence upon directionality of the query is actually a useful property for this work,
where recognition by image sub-components is a desirable property. It can be argued that
human evaluation of similarity would exhibit a similar bias towards matching. One possible
problem with this type of correspondence selection is where large parts of the query image
are matched to a very small region of the library image, resulting in an unjustifiably high
score. It is anticipated that such problem are unlikely to occur in this work, as selection is
based around the similarity of a large number of description types (including proportional
non-invariants), so the possibility of such effects will be minimal. A possible avenue of
further work would be to investigate the relationship between this correspondence selection
and resulting differences in performance if a one to one constraint was imposed.
Once a best correspondence between region groups has been achieved we can use them
to calculate the similarity between the query and database labels (algorithm 14.2) which
is also weighted by both descriptor weight and a global user weighting. This process is
repeated for each database label, and each label (with its corresponding source image
reference) is sorted by similarity to the query image label. After sorting, each database
image and corresponding similarity score is displayed in a results window (figure 7.3).
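The sketch below gives one simplified Python reading of this best-match comparison; algorithms 14.1 and 14.2 are not reproduced in this chapter, so the structure and names here are assumptions. Each query group independently picks its closest library group, so many query groups may share one library group and the resulting scores are not reversible, as discussed above:

```python
def label_similarity(query, library, user_weights):
    """One-directional label comparison sketch. query/library are lists
    indexed [description][group] of (value, weight) atoms; user_weights is
    the global per-description weighting."""
    n_desc, n_groups = len(query), len(query[0])
    total, weight_sum = 0.0, 0.0
    for qg in range(n_groups):
        # best-corresponding library group for this query group (one-to-many)
        best = min(range(len(library[0])),
                   key=lambda lg: sum(abs(query[d][qg][0] - library[d][lg][0])
                                      for d in range(n_desc)))
        for d in range(n_desc):
            qv, qw = query[d][qg]
            lv, lw = library[d][best]
            w = qw * lw * user_weights[d]
            total += w * (1.0 - abs(qv - lv))    # invert difference into similarity
            weight_sum += w
    return total / weight_sum if weight_sum else 0.0
```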
Figure 7.1: Graphs plotting segment size (middle) and intensity (right) values against the
number of segments at each stage of the segmentation (red lines indicate the global mean
value). The behaviour and spread of these values during a segmentation are very different
indeed.
(a) Difference results (right) will be lower where query content (left) is a subcomponent of
library content (right)
(b) Difference results (right) will be higher where query content (left) contains the library content
(right)
Figure 7.2: Best match correspondence selection results in a one to many match between
query segments and library label segments. Similarity results are non-reversible, dependent
upon which image forms the query.
Figure 7.3: Screenshot of algorithm output
Chapter 8
Evaluation of final algorithm
performance
8.1 General effectiveness with different image types
A dataset of images has been compiled to test the algorithm's general effectiveness
with varying image types. The first set consisted of 55 natural photographic images with 35
equivalent query images, featuring the same image content from different camera positions
(figure 8.1). This dataset was copied and greyscaled to produce a further series of images
to test the effects of eliminating colour information. A further 3 datasets consisting of
facial images [Peipa] (figure 8.2), cartoon (figure 8.3) and Gestalt symbolic (figure 8.4)
images were also generated. Correspondences between query images and their equivalent
search target images in the library set were recorded and used to evaluate the performance
of the recognition algorithm. Results where the target images are ranked highly when
queried with their equivalent search images will indicate good recognition performance.
All libraries and queries from these sets were processed using an image dimension of 80 by
80 pixels.
The first set of tests evaluate the actual score values returned from the natural colour
image set. Score values are generated by inverting the absolute difference
between normalized (weighted) query and library percentage label descriptors. The result
is a percentage measure of how similar the image labels are to each other, because they are
based on difference measures the reported ranges will have a bias towards higher values.
Because all values are calculated in this way, the rank order of results is still preserved.
Figure 8.5 shows that successful recognition is occurring over the entire set, with target
images scoring above average in each query.
Figure 8.6 shows the contribution that each descriptor (intensity, eccentricity etc.) type
makes toward these final score values, calculated using the mean scores of each isolated
Figure 8.1: Samples from the natural image set, query images (above) and their equivalent
library targets (below)
Figure 8.2: Samples from the facial image set, query images (above) and their equivalent
library targets (below)
descriptor over the entire query set. As would be expected given the mean-based self-
normalization of the descriptor values, each descriptor is making an even contribution
toward the overall final score. While this contribution can be further weighted by the user
to artificially suppress description types, in this case user weighting is not activated.
Figure 8.3: Samples from the cartoon image set, query images (above) and their equivalent
library targets (below)
Figure 8.4: Samples from the Gestalt/symbolic image set, query images (above) and their
equivalent library targets (below)
Figure 8.5: Minimum, mean, maximum and target score results for each of the 35 query
images in the natural image set.
Colour based descriptions provide very effective geometry invariant descriptors for pho-
tographic imagery and are commonly used for recognition tasks. A major aim of this work
is to enable recognition across a diverse range of image content, much of which may not
feature colour information, so the next set of tests were to determine the recognition algo-
rithms dependency on colour information. Figure 8.7 shows that recognition performance
is significantly reduced when the greyscaled natural image set is queried. Although the
mean rank of target recognition is increased from 5.89 with the colour set to 18.46, this
still represents above average recognition.
Another feature evident from figure 8.7 is the unexpected lack of correlation between
Figure 8.6: Contributions of description types to final score, generated from the natural
colour library with user weighting disabled.
greyscale and colour results for each query image. While some deviation can be antic-
ipated due to the non-linear changes resulting from the label segment correspondence
process ‘switching’ between preferred segment matches, this would not be expected to
cause such large differences between greyscale and colour results on the same query. If
such switching between label segments is the cause, then it would be expected that ex-
amination of component descriptor performance would show a similar lack of correlation.
Figure 8.7: Comparing rankings generated from greyscale and colour versions of the same
image queries; rank scores of 1 indicate the target has been selected as the best match
However, figures 8.8 and 8.9 show that both the rankings¹ and the scores of individual de-
scriptions show a good degree of correlation (with the obvious exception of colour based
descriptors).
Given that the results are generated from the scores and rankings of the target images
only, it would seem that while target image results are behaving in a correlated manner
between greyscale and colour types the other images in the library may well be ‘switching’
to different label segment matches and interfering with performance. This represents an
area that could be studied in greater depth in future work.
Figure 8.10 shows the overall ranking results of target images through the five im-
age libraries: natural (colour), cartoon (colour), natural (greyscale), faces (greyscale) and
Gestalt/symbolic (black and white). The good performance of the facial image set when
compared to the natural image set is almost certainly due to the nature of the images in
the libraries. While the natural image set contains a larger number of closely matching
images and real-world transformations to impede recognition, the face database images
are taken in relatively controlled environments with fewer image transformation types be-
tween query and target. It can be seen that these results show promise, and the algorithm
is performing good recognition across the different image types.
¹All percentage rank scores are calculated as (max rank − rank) × 100 / (max rank − 1).
Figure 8.8: Correlation between component descriptor scores for both greyscale and colour
natural images
Figure 8.9: Correlation between component descriptor rankings for both greyscale and
colour natural images
Figure 8.10: Target image ranking results across the five image libraries; rank scores of 1
indicate the target has been selected as the best match
Currently, there is no automatic weighting compensation for descriptions that are redundant in certain image types. Greyscale query images will report 100 percent matches over colour descriptors when matching other greyscale image types. This will result in the recognition algorithm favouring images by the presence (or absence) of colour content. This may well be justifiable in terms of human recognition, as the distinction between colour and greyscale imagery does play a large part in similarity evaluation. The effect is also offset by the inclusion of many other forms of geometry based descriptions, which should avoid any unwanted bias. This may well represent an area of potential further work.
8.2 Effects of global user weighting on recognition
One area of interest is the possibility of tailoring queries to recognize different similarity or image types through the adjustment of global weights applied to the label descriptors.

Figure 8.11: Shows an improvement in greyscale image recognition performance through the global weighting of descriptor values.

As label descriptions are not weighted or normalized before storage in the library, they directly represent raw descriptions of image content. The benefit of this strategy is that once a label
is generated there is no need to repeat the processor-intensive segmentation/grouping processing for subsequent recognition queries. This also means that query weighting and
adjustment can be performed extremely easily and efficiently in real-time. This all relies
on the ability of global descriptor weighting to improve and control recognition results.
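As an illustration of why re-weighting is cheap at query time, the following minimal C++ sketch applies a set of global descriptor weights to stored raw label values when computing a difference score. The structure and function names here are hypothetical illustrations, not the thesis implementation:

#include <cmath>
#include <cstddef>
#include <vector>

// A stored label: one raw (unweighted) value per descriptor type.
// (A hypothetical simplification of the thesis label structure.)
struct Label { std::vector<double> values; };

// Weighted difference between a query label and a library label.
// Because labels store raw values, changing 'weights' requires no
// re-segmentation: only this cheap loop is re-run per query.
double weightedDifference(const Label& query, const Label& library,
                          const std::vector<double>& weights)
{
    double diff = 0.0, wsum = 0.0;
    for (std::size_t i = 0; i < query.values.size(); ++i) {
        diff += weights[i] * std::fabs(query.values[i] - library.values[i]);
        wsum += weights[i];
    }
    return wsum > 0.0 ? diff / wsum : 1.0; // 1.0 = maximally different
}

Re-running a query with new weights then costs only a single pass over the stored labels.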
The next experiment was set up to test whether this is a viable proposition, and to see if
we can improve on our greyscale natural image recognition results using purely geometric
descriptors. From the component description results in figure 8.9, it can be seen that, of all the geometry based descriptions, boundary compactness, area ratios, centroid y axis proportions and convex hull centroid y axis proportions are performing better than the other geometry descriptors. Figure 8.11 shows that recognition performance was improved, with mean rank reduced by 4 places (a 7.7 percent improvement), when the same experiment was repeated on the greyscale dataset using only these four descriptors. This indicates that there is good potential for automatic relaxation and user-based weighting adjustments to generate improved recognition results. A further unexpected outcome is that these results reveal some unanticipatedly good descriptor types for natural image content. The success of the y axis proportion measures (non-invariants expressing the y coordinate as a proportion of image size) makes sense when one considers that most camera movements in photographic images occur on the horizontal plane. This explains the effectiveness of these non-invariant y axis descriptors, and indicates that further analysis of such occurrences could provide insight into effective descriptor combinations.
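A minimal sketch of how such a y axis proportion descriptor might be computed, assuming a simple point-set region representation (the names are illustrative rather than the thesis code):

#include <cstddef>
#include <vector>

struct Pixel { int x, y; };

// Centroid y coordinate of a region, expressed as a proportion of the
// image height. This is not invariant to vertical translation or to
// rotation, but it is stable under the horizontal camera movements
// that dominate photographic imagery.
double centroidYProportion(const std::vector<Pixel>& region, int imageHeight)
{
    if (region.empty() || imageHeight <= 0) return 0.0;
    double sumY = 0.0;
    for (const Pixel& p : region) sumY += p.y;
    return (sumY / region.size()) / imageHeight; // range [0, 1)
}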
Figure 8.11 shows that the recognition engine successfully recognizes purely geometric image content in realistic and difficult image search situations.
8.3 Tolerance to realistic transformations
Our next set of tests was to determine the recognition algorithm's behaviour under image transformations. A series of artificial images was generated from the same three source images (linear, greyscale and colour source images, featuring the same geometry) after known transformations: rotation (figure 8.17), affine (figure 8.18) and translation. These image sets are for the evaluation of the algorithm's resistance to transformations and the difference between colour and geometrical similarity cues. All but the most artificial image transformations will result in the introduction of new image content and the loss of old image content, even if the result is the changing dimension of a white background. To emulate this, these library images are sub-images taken from a larger image context that is introduced or removed as the transformation requires. This change in context (figure 8.12) is certain to introduce a level of noise into query results as the query image is no longer querying the exact same content.
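As an illustration of the sub-image strategy, the following C++ sketch (illustrative names, not the test harness used here) crops a fixed-size window from a larger context image at a given offset; shifting the window emulates translation, with content leaving one edge while new context enters at the other:

#include <cstddef>
#include <vector>

// A trivial greyscale image: row-major pixel values.
struct Image {
    int width = 0, height = 0;
    std::vector<unsigned char> pixels; // size = width * height
};

// Crop a winW x winH sub-image whose top-left corner lies at (ox, oy)
// in the larger context image. Pixels falling outside the context are
// left at zero.
Image cropWindow(const Image& context, int ox, int oy, int winW, int winH)
{
    Image out;
    out.width = winW;
    out.height = winH;
    out.pixels.assign(static_cast<std::size_t>(winW) * winH, 0);
    for (int y = 0; y < winH; ++y)
        for (int x = 0; x < winW; ++x) {
            int cx = ox + x, cy = oy + y;
            if (cx >= 0 && cx < context.width && cy >= 0 && cy < context.height)
                out.pixels[static_cast<std::size_t>(y) * winW + x] =
                    context.pixels[static_cast<std::size_t>(cy) * context.width + cx];
        }
    return out;
}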
These experiments were conducted on images of dimension 100 by 100 pixels. The rotation results (figure 8.13) show similarity scores over a library of gradually rotating images, with variant descriptors allowing recognition of the closest rotation whilst invariant descriptors facilitate tolerant recognition levels for other rotations. The secondary peak in results at 180° is to be expected, as this rotation value does not introduce or remove image features from the query. Similar linear results can be seen for the translation results (figure 8.14), with the maximum translation value of 100 percent representing completely different image content to the query.
For both colour and greyscale image types, recognition levels gradually decrease at greater degrees of translation. Linear image types show a fairly level recognition rate for translations from 10 to 100 percent, indicating much poorer performance. Affine results (figures 8.15, 8.16) also show the expected drop in performance at increasing levels of affine transformation. The improved scoring during affine 'squeezing' when compared to 'stretching' is due to the continued (but distorted) presence of the query image features. Affine stretch results in the original features vanishing over the edge of the image.
8.4 Comparison with human decisions
Another important aspect of this work is how it compares with human similarity judgements. A JavaScript web survey was written to gather basic information regarding real human similarity decisions. This application presents five tests, four for the manmade, natural, human and cartoon/sketch categories and a final test that includes all categories. In each test the subject is presented with a target image and a series of five randomly chosen images. The task of the subject is to arrange the five images in order of similarity
to the target image. The 84 test subjects generated a set of 420 results to be used to
compare the performance of this work with human similarity decisions.
Evaluation of the first 30 of these search results indicated relatively poor performance (the algorithm's ranking decisions deviated from the human ranking decisions by a mean of 2.3, only marginally better than chance) when compared to previous results. One cause of this may be the limited number of images presented in each test, and the random manner in which they are selected. Many of the library images presented for similarity evaluation in the tests are likely to bear very little in common with the query image. While the human test subjects were still ranking such image sets by similarity, in many cases this may have involved a degree of randomness. The limitations of the dataset and the number of images per query mean that these results are hardly conclusive. This is definitely an area that warrants further examination in future work.
Figure 8.12: (a) query image, (b) translated image, (c) superimposed. This translation to the right results in the loss of information (rightmost yellow shaded region) and the addition of new information (left blue region).
Figure 8.13: Similarity score performance over increasing rotation
Figure 8.14: Similarity score performance over increasing translation
Figure 8.15: Similarity score performance over affine transformations (stretch)
Figure 8.16: Similarity score performance over affine transformations (squeeze)
Figure 8.17: The colour rotation sample set: (a) 0°, (b) 36°, (c) 72°, (d) 108°, (e) 144°, (f) 180°, (g) 216°, (h) 252°, (i) 288°, (j) 324°.
Figure 8.18: The greyscale affine sample set: (a) 0°, (b) 36°, (c) 72°, (d) 108°.
Figure 8.19: Screenshot from the ‘Human-Like Survey’.
Chapter 9
Conclusions
The aim of this thesis was to investigate, and implement, a plausible architecture for
recognition of general image content in a human-like way. First we reviewed the subject
area and formulated an approach that would represent a plausible architecture. The next
critical stage was the development of group primitives from raw image content that could
be used to generate higher level geometric and photometric description types that could
plausibly approximate a range of Gestalt grouping principles. At this point it was decided
that the segmentation/grouping algorithm was to be run directly upon image content,
without any prior filtering or colour normalization. The decision not to use colour nor-
malization on initial image content was a pragmatic one based around the requirement
of the algorithm to operate across a wide range of image types. A reliance upon colour
normalization would preclude the use of images without colour content such as greyscale
and black and white images. A further basis for this decision was the loss of potentially valuable direct match recognition information once an image has been colour normalized, which would have prevented us from testing its importance to recognition. A major
justification for implementing colour normalization would be to minimize the differences
between target and query images caused by shading or lighting changes. It was antic-
ipated that this would not be necessary in this case as the algorithm is based around
segment/group descriptions that include shape and non-colour related information that
should still facilitate good matching regardless of these effects.
Of more use to this thesis than colour normalization would have been the ability to
pre-filter images taking human perceptual effects such as simultaneous contrast into ac-
count. Illusions such as those shown in figures 2.3, 2.4 and 2.5 demonstrate that humans
do not necessarily perceive colour or intensity information in the same way as the actual
values contained within images. In such cases, while there may be a partial match between
colour values in a search image and a target image, a human observer may disagree due
to the change in perceived colour caused by the context and geometry within the image.
Although limited work towards pre-filtering images in such a way that actual colour val-
ues approximate perceived colour values was performed, it was decided that this form of
correction lay beyond the remit of this work. The result is that this algorithm is just as
susceptible to such discrepancies as conventional algorithms.
A novel use of a KD-tree architecture to facilitate n-dimensional Gestalt feature prox-
imity grouping decisions was developed alongside a binary tree storage method that allows
multi-dimensional grouping behaviour while retaining a full grouping relationship history.
The selection, ranking and weighting of group primitives based upon descriptive suitability
was then addressed, with surviving groups providing the basic descriptions for recognition.
Appropriate description types, of differing levels of invariance and type, capturing a wide range of image content were then proposed. Finally, a practical storage and recognition architecture was outlined and tested against a variety of different image types. The final result is a plausible architecture that can form the basis of a practical recognition algorithm and a useful platform from which to test the contribution of different image description types to human similarity judgement.
This work has successfully demonstrated that the proposed architecture provides ef-
fective, and efficient, recognition across a large range of image types and can successfully
recognize image content using both colour and geometric descriptions. The combination
of variant and semi-invariant description types does indeed facilitate recognition tolerant to image transformations whilst still allowing the distinction between the degree of such transformations. Results and experimentation are more limited than originally intended due to the sheer scale and complexity of the problem addressed, combined with the time constraints of the PhD itself. This means that certain areas of the algorithm may not be
optimal implementations, and there is likely to be scope for improvement in both efficiency
and efficacy. One area that would benefit from further work is the selection and weighting
of label descriptions, which suffered under time constraints. In particular, parallel work
on the possible advantages of using signature storage of invariants for recognition had
to be put aside in favour of completing the thesis work using label based descriptions.
While general results do appear to reflect human-like similarity judgements, initial tests
to evaluate this have not proven conclusive and also require further examination.
9.1 Future Work
This architecture provides a good starting point for many potential areas of future work. One aspect of human vision not directly implemented in this work is the pre-normalization of raw image data for levels of colour constancy and the simulation of simultaneous contrast effects. Simultaneous contrast, in particular, seems to be a major component of the human visual system, providing the basis for much of the low level grouping and perceptual organization decisions. Development of methods to simulate this as a pre-processing stage, or as an integral part of the Gestalt grouping engine, would represent a useful area of further work and should improve recognition. Similarly, a hue, saturation and luminance representation of initial image content may well have benefits.
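As a sketch of the kind of representation change suggested, the following converts an RGB pixel to hue, saturation and luminance using the standard RGB-to-HSL formulae (textbook code, not part of the implemented architecture):

#include <algorithm>
#include <cmath>

struct HSL { double h, s, l; }; // h in degrees [0,360), s and l in [0,1]

// Standard RGB (0-255) to HSL conversion.
HSL rgbToHsl(unsigned char r8, unsigned char g8, unsigned char b8)
{
    double r = r8 / 255.0, g = g8 / 255.0, b = b8 / 255.0;
    double maxc = std::max({r, g, b}), minc = std::min({r, g, b});
    double l = (maxc + minc) / 2.0, d = maxc - minc;
    if (d == 0.0) return {0.0, 0.0, l}; // achromatic (grey)
    double s = d / (1.0 - std::fabs(2.0 * l - 1.0));
    double h;
    if (maxc == r)      h = std::fmod((g - b) / d, 6.0);
    else if (maxc == g) h = (b - r) / d + 2.0;
    else                h = (r - g) / d + 4.0;
    h *= 60.0;
    if (h < 0.0) h += 360.0;
    return {h, s, l};
}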
Another area worth investigating is the label segment correspondence decisions that take place in this work before similarity evaluation. While the current architecture works well, it would be of interest to test this stage using a unique one-to-one correspondence criterion between segments in place of the current many-to-one approach.
The global weighting of descriptions to tailor recognition performance for different image types was investigated in this work, but would definitely be a worthwhile area of further study. It has been demonstrated that adjusting the global weightings applied to descriptors can significantly improve recognition results. A relaxation based approach to selecting weightings is likely to generate optimal recognition results tailored to a given library set and query type. The investigation of descriptor reliability over different image content types in general could allow us to generate stored descriptor weighting profiles that could be applied for optimal results tailored to image content. A further extension to this approach would be to provide an interactive user-interface facilitating query adjustment (section 8.2), allowing the user to refine their search to their requirements.
Finally, a feature considered but not implemented in this work due to time constraints was the incorporation of continuity constraints and organizational priming into the Gestalt grouping algorithm. Organizational priming, in this case, refers to a bias towards particular grouping arrangements based upon previous groupings. For example, where an image is perceived to contain many vertical linear groupings there will be a tendency to perceive other image content as vertical linear structures. This is essentially a global application of the continuation principle, where a linear feature being developed favours groupings that will form a continuous curve with the current feature (even if the continuous grouping is not the most proximal one). An interesting approach to
dealing with this would be through the use of a further n-dimensional space to store edge
values as they are generated in the grouping/segmentation stage. Rather than progressing
through an ordered list of edges (whose values are generated from an n-dimensional volume
nearest neighbour search) this list could be replaced by another n-dimensional volume of
edge values. With such an architecture (that would still be updateable) the next edge to
be processed would be the point in the space closest to the origin (which would then be
removed from the space). With edge selection based around proximity to the origin, this
would generate similar results to the ordered edge-list of the original grouping algorithm.
If, however, we use a moveable origin that moves in the direction of the last edge processed
then the next edge search becomes biased towards favouring edges of that particular direction in the n-dimensional space. Figure 9.1 shows a very simplified illustration
showing how this process should work. It is anticipated that this approach would facilitate the continuity principle and produce better groupings based upon image geometry. The use of such a second order edge space, using the same fundamental n-dimensional architecture as the nearest neighbour search space, would also represent a more elegant and complete architecture. Exact implementation issues, and whether or not such an edge selection process would have unwanted side-effects, have not been determined, but this approach would represent an interesting avenue of further research.
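A minimal sketch of the moveable-origin idea (illustrative C++; for brevity it assumes edges are stored as points in the edge feature space and uses a linear scan in place of the n-dimensional volume structure):

#include <cstddef>
#include <vector>

// An edge represented as a point in an n-dimensional edge space.
struct EdgePoint { std::vector<double> coords; };

// Squared Euclidean distance between two points.
static double sqDist(const std::vector<double>& a, const std::vector<double>& b)
{
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Repeatedly select the edge nearest to a moveable origin, then drift
// the origin towards the selected edge. 'bias' in [0,1] controls how
// strongly subsequent selections favour the last processed direction.
std::vector<EdgePoint> orderEdges(std::vector<EdgePoint> edges,
                                  std::size_t dims, double bias)
{
    std::vector<double> origin(dims, 0.0);
    std::vector<EdgePoint> ordered;
    while (!edges.empty()) {
        std::size_t best = 0;
        for (std::size_t i = 1; i < edges.size(); ++i)
            if (sqDist(edges[i].coords, origin) < sqDist(edges[best].coords, origin))
                best = i;
        // Move the origin towards the processed edge, biasing the next search.
        for (std::size_t d = 0; d < dims; ++d)
            origin[d] = (1.0 - bias) * origin[d] + bias * edges[best].coords[d];
        ordered.push_back(edges[best]);
        edges.erase(edges.begin() + best);
    }
    return ordered;
}

With bias set to zero, selection is by proximity to a fixed origin, reproducing the behaviour of the ordered edge list; increasing bias favours edges lying in the direction of the last edge processed.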
Figure 9.1: Directing edge selection using a 2nd order edge search space and moveable origin (turns 1–6).
Chapter 10
Self-Similar Convolution Image
Distribution Histograms
The following provides a brief overview of work undertaken during the writing of this
thesis (although not directly used for the work) and presented at the British Machine
Vision Conference 2001. Trademark recognition can be considered a specialized, simplistic
form of sketch recognition. Reading in this area [Alwis00] led to the development of
a novel invariant storage method for trademark descriptions. “Self Similar Convolution
Image Histograms” achieve limited invariant storage using scaled copies of the original
image as a filter to generate a spatio-intensity histogram which would form the basis of a
similarity invariant signature.
The basis of this technique is to generate an identifying signature from a binary trade-
mark image by using a scaled down version of itself as a convolution mask to generate a
scalar convolution image. As the convolution filter will always be aligned to the original
image, the normalized distribution histogram of the resulting grey scale image is an affine
invariant description based purely upon the image structure, which can then be used to search a database of binary images. [Tuke01] explains this process in detail and presents practical test results. Although this technique represents an interesting and novel approach to achieving invariance, it is only effective in the tightly constrained environments that trademark imagery represents. This fact, combined with the need for a specialised storage architecture incompatible with most invariant signature techniques, suggests that this technique is not applicable to generalised invariant image search.
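A minimal sketch of the self-similar convolution signature (illustrative C++ using a naive convolution; a simplification of, not a transcription from, the implementation presented in [Tuke01]):

#include <array>
#include <cstddef>
#include <vector>

// A binary image: row-major, pixel values 0 or 1.
struct BinaryImage {
    int w = 0, h = 0;
    std::vector<int> px; // size w * h
};

// Convolve the image with a scaled-down copy of itself (the mask), then
// return the normalized distribution histogram of the responses. Because
// the mask is derived from the image itself, it remains aligned with the
// image under the transformations applied to it.
std::array<double, 64> selfSimilarHistogram(const BinaryImage& img,
                                            const BinaryImage& mask)
{
    std::vector<double> response;
    double maxResp = 0.0;
    for (int y = 0; y + mask.h <= img.h; ++y)
        for (int x = 0; x + mask.w <= img.w; ++x) {
            double sum = 0.0;
            for (int my = 0; my < mask.h; ++my)
                for (int mx = 0; mx < mask.w; ++mx)
                    sum += img.px[(y + my) * img.w + (x + mx)]
                         * mask.px[my * mask.w + mx];
            response.push_back(sum);
            if (sum > maxResp) maxResp = sum;
        }
    // The normalized histogram of responses forms the stored signature.
    std::array<double, 64> hist{};
    if (maxResp == 0.0 || response.empty()) return hist;
    for (double r : response)
        hist[static_cast<int>((r / maxResp) * 63.0)] += 1.0;
    for (double& b : hist) b /= response.size();
    return hist;
}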
Figure 10.1: Showing the invariance of self-similar convolution image histograms to image
transformations (but NOT perspective)
Figure 10.2: Screen-shot of a typical query and database response using the self similar
geometric histogram as an identifier.
Chapter 11
The Linear Gestalt Grouping
Algorithm and Data Types
Algorithm 11.1 KD Data Types
BEST KD TYPE
(Linked list structure used to return results from nearest neighbour queries of the KD tree)
void *Group List (Pointer to the group structure being returned by this node)
float *Coords (The KD tree coordinates of the region being returned)
float Dist (Squared distance between this result and the query coordinates)
BEST KD TYPE *Next (Pointer to next item in linked list)
KD TYPE
(Dual linked list/tree structure used to store KD tree branches and leaves)
int Dimension (Dimension index on which the feature space is to be subdivided)
float Radius (Encompassing radius of the feature-space volume referenced by this branch)
void *Group (Pointer to a group structure on a leaf of the KD tree, otherwise NULL)
float *Coords (The centre of the bounding box referenced by this branch or
the KD tree coordinates of a leaf)
KD TYPE *Higher (Pointer to the next branch or next leaf)
KD TYPE *Lower (Pointer to the next branch or NULL when a leaf node)
KD DYNAMIC TREE CLASS
(A class to keep track of the KD tree structure, including functions)
private:
KD TYPE *Tree (The root of the iterative KD tree structure)
float Min[],Max[] (Store dimensions of the current section of feature space being worked on)
float Dist
int BestSize (Total number of groups held in the nearest neighbour table lists)
public:
bool Initialized (Flag to indicate status of KD tree)
int Dimensions (Dimensionality of space to store)
float Terminating Value (Minimum subdivision of space before branches become leaves)
float Furthest Dist (Current largest distance from origin)
BEST KD TYPE **Best (Data structure used to return nearest neighbour results)
int No To Find (Number of nearest neighbours to be found by a query)
int No Node
Algorithm 11.2 Linear Gestalt Grouping Algorithm Data Types
REGION TYPE
NODE TYPE *Node List (pointer to list of nodes that make up this region)
Description Variables (Variables that describe the region)
NODE TYPE
EDGE TYPE *Edge1=NULL (pointer to edge terminating in this node)
EDGE TYPE *Edge2=NULL (pointer to edge terminating in this node)
REGION TYPE *Region (pointer to the Region this node belongs to)
NODE TYPE *Next (pointer to next node in the list)
EDGE TYPE
NODE TYPE Node1 (terminating node of edge)
NODE TYPE Node2 (other terminating node of edge)
double Original Score (the perceptual significance score without continuity)
double Score (the working perceptual significance score, including continuity)
unsigned char Status=EDGE ENABLED (used to store information during processing)
EDGE TYPE *Next (pointer to next edge in the list)
Algorithm 11.3 Overview of the Linear Gestalt Grouping Algorithm
For each entry k in edge list e
{if (Edge[k].Status!=EDGE DISABLED)
if (JunctionTest(Edge[k]))
if (ContinuityTest(Edge[k]))
Reposition(Edge[k])
if (Edge[k].Status==EDGE REPOSITIONED)
{AddEdgeToRegion(Edge[k])
}}
Algorithm 11.4 Overview of the JunctionTest function
bool JunctionTest(Edge)
(Only linear structures allowed here; clustering with 3-way nodes returns false)
{LeftNode = Edge.Node1
RightNode = Edge.Node2
if (LeftNode.Edge2 or RightNode.Edge2)
{(A node already has 2 edges attached to it, so prevent further connections)
Edge.Status =EDGE DISABLED
return false
}else
{(Each node has either no edge or only a single edge running from it, valid)
return true
}}
Algorithm 11.5 Overview of the ContinuityTest function
bool ContinuityTest(Edge)
(Evaluate the continuity implications of adding this edge)
(Optionally disallow edges that have very poor continuity)
{OScore=Get Continuity Score(Edge) (returns a value in the range 0–1)
(Disallow edges with continuity scores lower than MC)
if (OScore <MC)
{Edge.Status =EDGE DISABLED
return false
}
(Add continuity values to edge score)
OScore = Edge.Original Score + OScore − CB
if (OScore < 0) OScore = 0
else if (OScore > 1) OScore = 1
Edge.Score = OScore
return true
}
Algorithm 11.6 Overview of the Get Continuity Score function
float Get Continuity Score(Edge)
(Return continuity implications of adding this edge)
{LeftNode = Edge.Node1
RightNode = Edge.Node2
A1 =Evaluate minimum angle (degrees) between edges connecting LeftNode
(180 if node isn’t a junction between edges)
A2 =Evaluate minimum angle (degrees) between edges connecting RightNode
(180 if node isn’t a junction between edges)
(Select the lowest angle (worst score))
if (A2 < A1)
A1 = A2
(Normalize angle to range between 0 and 1)
A1 = A1/180
return A1
}
Algorithm 11.7 Overview of the Reposition function
Reposition (Edge)
{Reposition Edge in the edge list based on its score
(between the current k position and the end of the list)
Edge.Status = EDGE REPOSITIONED
}
Algorithm 11.8 Overview of the AddEdgeToRegion function
AddEdgeToRegion (Edge)
{(Add all nodes in the region pointed to by Edge.Node2
to the node list of Edge.Node1.Region)
R1 = Edge.Node1.Region
R2 = Edge.Node2.Region
for each Node in R2.Node List
{Node.Region = R1
Remove Node from R2.Node List, place on R1.Node List
}Remove R2 from Regions
}
Chapter 12
The n-Dimensional KD Tree
Structure and Algorithms
Algorithm 12.1 Return the current largest axis, which will be split by the tree branching
process
Find Largest Dimension ()
{(Search for the largest axis defining the currently
referenced volume of space)
Largest Value=0
Largest Dimension=0
for all f in Dimensions
(Max and Min define the currently referenced volume of space)
if (Max[f]-Min[f]≥Largest Value)
{Largest Value=Max[f]-Min[f]
Largest Dimension=f
}return Largest Dimension
}
Algorithm 12.2 Create a new KD tree branch
KD TYPE *Create New Limb()
{(Create a new limb)
Allocate memory for Limb
Limb.Dimension=Find Largest Dimension()
if (Max[Limb.Dimension]-Min[Limb.Dimension]≤Terminating Value)
{(Is a leaf node)
Limb.Dimension= -1 (Dimension of -1 used to indicate leaf nodes)
}else
(Is a branch node)
for all f in Dimensions
{(Set Limb.Coords to the centre of the current branch bounding box)
Limb.Coords[f] = Min[f]+(Max[f]-Min[f])/2
(Accumulate the squared half-extents of the bounding box)
Limb.Radius+=((Max[f]-Min[f])/2)2
}(Square root of Limb.Radius is now the minimum encompassing radius
from Limb.Coords to the bounding box)
return Limb
}
Algorithm 12.3 Adding a point to the KD tree
Generate Point (Current Branch, Group, Coords)
{(Iteratively travel down tree branches, creating new branches where
required, add a leaf node pointing to a group.
The entire tree, leaves and branches, is made up
of iteratively generated KD TYPE nodes)
if (!Current Branch)
{(Limb currently set to NULL, need to generate a new one)
Current Branch=Create New Limb()
Fresh=true
}else Fresh=false
if (Current Branch.Dimension<0)
{(Limb is a leaf node, nodes behave in a linked list manner from now on)
(We have found where we will place our point group structure)
if (!Fresh)
{(More than one leaf on this branch, so need to generate
memory and insert into linked list of leaves)
Tmp=new KD TYPE
Tmp→Next=Current Branch
Current Branch=Tmp
}
(Copy across group references and coordinate values)
Current Branch→Group=Group
Dist=0
for all f in Dimensions
{Current Branch.Coords[f]=Coords[f]
Dist+=Coords[f]
}if (Dist>Furthest Dist) Furthest Dist=Dist
}else
{(Limb is a tree branch, let's see which limb to continue down)
if (Coords[Current Branch.Dimension]>
Current Branch.Coords[Current Branch.Dimension])
{(Entry being added belongs down the higher branch)
(Store current dimensions so we can restore them after iteration)
tmp=Min[Current Branch.Dimension]
(Reduce working volume dimensions ready for next iteration)
Min[Current Branch.Dimension]=
Current Branch.Coords[Current Branch.Dimension]
(Iterate down the higher branch)
Current Branch→Higher=
Generate Point(Current Branch→Higher, Group, Coords)
(Restore dimensions back to original values)
Min[Current Branch.Dimension]=tmp
}else
{(Entry being added belongs down the lower branch)
(Store current dimensions so we can restore them after iteration)
tmp=Max[Current Branch.Dimension]
(Reduce working volume dimensions ready for next iteration)
Max[Current Branch.Dimension]=
Current Branch.Coords[Current Branch.Dimension]
(Iterate down the lower branch)
Current Branch→Lower=
Generate Point(Current Branch→Lower, Group, Coords)
(Restore dimensions back to original values)
Max[Current Branch.Dimension]=tmp
}}
return Current Branch
}
Algorithm 12.4 Return the squared Euclidean distance between two points in feature
space
float Get Distance Between(Coords1, Coords2)
{(Find distance between Coords1 and Coords2)
dist=0
for all f in Dimensions
{dist=dist+(Coords1[f ]-Coords2[f ])2
}return dist
}
Algorithm 12.5 Copy coordinate values from one group to another
Copy Coords(E1, E2)
{(Copy coordinate values from E2 to E1)
for all f in Dimensions
E1.Coords[f ]=E2.Coords[f ]
}
Algorithm 12.6 Check to see if a leaf group belongs in the nearest neighbour table
Check Against Table(Branch, Group, Coords)
{(Ensure we don't try to place the target group; we don't
want the query group appearing in the neighbour table)
if (Group==Branch→Group) return
dist=Get Distance Between(Branch.Coords, Coords)
(See if the table is already full)
if (Best[No To Find-1])
{(Is full, we only need to query the final entry to see
if this entry belongs in the table)
if (dist>Best[No To Find-1].Dist)
{(Entry is further than furthest table entry,
leave neighbour table intact)
return
}}
(We know the entry belongs in the table,
so scan to find the entries position)
for all i in No To Find-1
{(If a blank entry in table encountered,
this is our neighbour table position)
if (!Best[i]) break
(If distances are greater or equal, insert at this point)
if (Best[i].Dist≥dist) break
}
(If the table has a blank entry then just create a new entry, otherwise
we may have to rearrange the existing table entries)
if (Best[i])
{if (Best[i].Dist==dist)
{(Best entry exists, and is at the same distance)
(No table shuffling needed, just add to the linked list pointed
at by the neighbour table)
if (Check List For Duplication(Best[i]))
{(The group is already present in the neighbour table, so exit)
return
}}
else
{(Best entry exists and is at a different distance to the new entry)
(A table shuffle is required to make space for the new entry)
(Destroy the final table entry, along with its linked list)
Destroy Entry(Best[No To Find-1])
(BestSize keeps track of the total number of groups held in
the nearest neighbour table lists)
BestSize--
(Move table entries along, making room for a new table entry)
for (f=No To Find-2;f≥i;f--) Best[f + 1]=Best[f]
Best[i]=NULL
}}
(Best[i] now points to an appropriate neighbour table linked list)
(Create a new BEST KD TYPE entry)
BPnt=new BEST KD TYPE
BPnt.Coords=new float[Dimensions]
BestSize++
Copy Coords(BPnt, Branch)
BPnt.Dist=dist
BPnt→Group=Branch→Group
BPnt→Next=Best[i]
Best[i]=BPnt
}
Algorithm 12.7 Determine if a KD tree branch is worth exploring
bool Is Branch Possibly Closer(Coords, Branch, OldDist)
{(See if this branch could possibly contain a leaf closer than
current entries on the neighbour table)
(Detect the distance between the encompassing sphere of the section of
feature space referenced by this branch and the target coordinates)
if (Get Distance Between(Branch.Coords, Coords)-Branch.Radius>OldDist)
return false
else
return true
}
Algorithm 12.8 Iteratively search the KD tree for the n-nearest neighbours to a point
in feature space
Iterate Find Points (Branch, Group, Coords, FirstOnly)
{(Search the KD tree for the n-nearest neighbours to a point in feature space)
(First check to ensure that the branch points to an active part of the feature space)
if (!Branch) return
(Are we dealing with a branch or a leaf?)
if (Branch.Dimension<0)
{(Is a leaf, search like a linked list)
(Check each leaf to see if it belongs in the nearest neighbour table)
Check Against Table(Branch, Group, Coords)
(Process any more leaves, unless rough mode is selected)
if (!FirstOnly)
Iterate Find Points(Branch→Higher, Group, Coords, FirstOnly)
}else
{(Is a branch)
(Check to see if it's worth exploring this block of feature space)
if (!Is Branch Possibly Closer(Coords, Branch, Best[No To Find-1]→Dist))
return
(This block of space could contain a nearest neighbour,
so continue iteratively probing the tree structure)
Iterate Find Points(Branch→Higher, Group, Coords, FirstOnly)
Iterate Find Points(Branch→Lower, Group, Coords, FirstOnly)
}return
}
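For readers who prefer compilable code, the following is a compact C++ sketch of the same idea: a KD tree queried with a pruning test analogous to Is Branch Possibly Closer. It is a simplified single-nearest-neighbour illustration using point-based nodes, not a transcription of the thesis structures:

#include <cstddef>
#include <memory>
#include <vector>

// A point-based KD tree node: each node stores one point and splits the
// space on dimension (depth % k). Simplified relative to the bounding-box
// branches used by the thesis structures.
struct Node {
    std::vector<double> pt;
    std::unique_ptr<Node> lo, hi;
};

static double sqDist(const std::vector<double>& a, const std::vector<double>& b)
{
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

void insert(std::unique_ptr<Node>& n, std::vector<double> p, std::size_t depth = 0)
{
    if (!n) { n = std::make_unique<Node>(); n->pt = std::move(p); return; }
    std::size_t dim = depth % n->pt.size();
    std::unique_ptr<Node>& child = (p[dim] < n->pt[dim]) ? n->lo : n->hi;
    insert(child, std::move(p), depth + 1);
}

// Recursive nearest-neighbour search. The far subtree is explored only if
// the splitting plane is closer to the query than the best distance found
// so far (the pruning test analogous to Is Branch Possibly Closer).
void nearest(const Node* n, const std::vector<double>& q, std::size_t depth,
             const Node*& best, double& bestSq)
{
    if (!n) return;
    double d = sqDist(n->pt, q);
    if (d < bestSq) { bestSq = d; best = n; }
    std::size_t dim = depth % q.size();
    double delta = q[dim] - n->pt[dim];
    const Node* nearSide = delta < 0 ? n->lo.get() : n->hi.get();
    const Node* farSide  = delta < 0 ? n->hi.get() : n->lo.get();
    nearest(nearSide, q, depth + 1, best, bestSq);
    if (delta * delta < bestSq) // prune: the far side cannot be closer
        nearest(farSide, q, depth + 1, best, bestSq);
}

A query initializes best to null and bestSq to infinity before calling nearest on the root; extending this to the n-nearest case replaces the single best entry with a table such as BEST KD TYPE.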
Chapter 13
Combined Gestalt Grouping
Algorithms and Results
Algorithm 13.3 shows the new group data structure. Rather than explicitly storing a
list of nodes that refer to member pixels, the new structure stores the original edge that
formed the group. If the group represents a single pixel then it can easily be recognised
by checking for a null value in the founding edge. Where a group is larger than a single
pixel, the founding edge points to the two original groups (or sub-groups) whose merging
resulted in this group. This results in a binary tree structure.
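To illustrate how the member pixels of a group can be recovered from this founding-edge binary tree without any explicit node list, a minimal C++ sketch (hypothetical names mirroring the structures below) can walk the tree recursively:

#include <utility>
#include <vector>

struct Group; // forward declaration

// An edge records the two (sub)groups whose merging formed a new group.
struct Edge { Group* group1; Group* group2; };

struct Group {
    Edge* foundingEdge = nullptr; // null => this group is a single pixel
    int pixelX = 0, pixelY = 0;   // only meaningful for single-pixel groups
};

// Recursively collect the member pixels of a group by descending its
// founding edges until single-pixel groups (null founding edge) are reached.
void collectPixels(const Group* g, std::vector<std::pair<int, int>>& out)
{
    if (!g) return;
    if (!g->foundingEdge) {       // leaf: a single pixel
        out.emplace_back(g->pixelX, g->pixelY);
        return;
    }
    collectPixels(g->foundingEdge->group1, out);
    collectPixels(g->foundingEdge->group2, out);
}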
Algorithm 13.1 Original Group Structure
GROUP TYPE
float Coords[NO COORDINATES] (group N dimensional coordinate)
NODE TYPE *Node List (pointer to list of nodes that make up this region)
Description Variables (Variables that describe the region)
GROUP TYPE *Next (pointer to next group in list)
GROUP TYPE *Prev (pointer to previous group in list)
Algorithm 13.2 Original group merge procedure
float Merge Groups (*Group1, *Group2, *Edge)
{(Loop through Group2’s Node List, making each node point to Group1)
for all n in Group2.Node List
{n.Group=Group1
}(Move Node List from Group2 to the end of Group1’s Node List)
Group1.Node List+=Group2.Node List
(Update Group1 description to the new combined group description)
Update Description(Group1, Group2)
(Remove and destroy Group2; Group1 is now the combined group)
Destroy(Group2)
(Remove the founding edge from the edge list and destroy it)
Destroy(Edge)
}
Algorithm 13.3 New tree based Group Structure
GROUP TYPE
float Coords[NO COORDINATES] (Group N dimensional coordinate)
EDGE TYPE *Founding Edge (Pointer to the edge that created this group)
Description Variables (Variables that describe the region)
GROUP TYPE *Descendent (Immediate child, if this is a subgroup)
GROUP TYPE *YoungestDescendent (Most developed child group)
GROUP TYPE *Next (pointer to next group in list)
GROUP TYPE *Prev (pointer to previous group in list)
Algorithm 13.4 Set YoungestDescendent procedure
void Set YoungestDescendent (GROUP TYPE *Grp, GROUP TYPE *Youngest)
{(Iteratively crawl both up and down the tree structure resetting
YoungestDescendent pointers to the new child group)
Grp.YoungestDescendent=Youngest
(Expand down the tree if the current group has a founding edge)
if (Grp.Founding Edge)
{(If parent groups haven't already been reset then reset them)
if (Grp.Founding Edge.Group1.YoungestDescendent!=Youngest)
Set YoungestDescendent(Grp.Founding Edge.Group1, Youngest)
if (Grp.Founding Edge.Group2.YoungestDescendent!=Youngest)
Set YoungestDescendent(Grp.Founding Edge.Group2, Youngest)
}(We have to consider crawling back up the tree too because
edge structures may point to less developed subgroups lower
down in the tree structure)
if (Grp.Descendent)
if (Grp.Descendent.YoungestDescendent!=Youngest)
Set YoungestDescendent(Grp.Descendent, Youngest)
}
Algorithm 13.5 New, tree-based, group merge procedure
float Merge Groups (*Group1, *Group2, *Edge)
{(Each group referred to by the founding edge could be an
old group that currently forms part of a larger child group,
so we need to actually combine the youngest descendants
of the groups)
RGroup1=Group1.YoungestDescendent
RGroup2=Group2.YoungestDescendent
(Create ChildGroup and add it to the group list)
ChildGroup=Create Group()
(The new child group is the immediate descendent of the two
merged groups, so set their descendent pointers to it)
RGroup1.Descendent=ChildGroup
RGroup2.Descendent=ChildGroup
(Update ChildGroup description to new combined group description)
Update Description(ChildGroup, RGroup1, RGroup2)
(Store the edge that formed this group)
ChildGroup.Founding Edge=Edge
(The new group is now the youngest descendent of that tree,
so we need to move through the tree and update each Group's YoungestDescendent pointer)
Set YoungestDescendent(ChildGroup, ChildGroup)
}
Chapter 14
Database Search Algorithms
Algorithm 14.1 Label Component Correspondence Algorithm
Generate Matches(query label, correspondence[])
{For each database image
{Load database label
For q=each group in query label
{min difference=1
For d=each group in database label
difference = (∑ difference between descriptor values × descriptor weights) /
(∑ descriptor weights × global user weighting for that description type)
if (difference < min difference)
{correspondence[q]=d
min difference=difference
}}
}}
}
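A compilable C++ sketch of this correspondence step (illustrative structures; the weighted-difference fraction follows the pseudocode above, and the many-to-one behaviour is visible in that several query groups may select the same database group):

#include <cmath>
#include <cstddef>
#include <vector>

// Per-group descriptor vectors for one label (illustrative simplification).
struct GroupDesc { std::vector<double> values, weights; };

// Weighted difference between two groups' descriptor vectors,
// following the fraction in Algorithm 14.1.
double groupDifference(const GroupDesc& q, const GroupDesc& d, double userWeight)
{
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < q.values.size(); ++i) {
        num += q.weights[i] * std::fabs(q.values[i] - d.values[i]);
        den += q.weights[i] * userWeight;
    }
    return den > 0.0 ? num / den : 1.0;
}

// For each query group, record the index of the best matching database
// group (a many-to-one correspondence: several query groups may map to
// the same database group).
std::vector<int> generateMatches(const std::vector<GroupDesc>& query,
                                 const std::vector<GroupDesc>& database,
                                 double userWeight)
{
    std::vector<int> correspondence(query.size(), 0);
    for (std::size_t q = 0; q < query.size(); ++q) {
        double minDiff = 1.0;
        for (std::size_t d = 0; d < database.size(); ++d) {
            double diff = groupDifference(query[q], database[d], userWeight);
            if (diff < minDiff) { minDiff = diff; correspondence[q] = static_cast<int>(d); }
        }
    }
    return correspondence;
}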
Algorithm 14.2 Similarity Evaluation
Compare Labels(query label, database label, correspondence[])
(Evaluates weighted differences between best matching groups from
the two labels, for each description type)
{overall similarity=0
For q=each description type in query label
{score[q]=0
sum weights=0
For d=each group in description
{w=query label[q][d].weight*global user weighting for that description type
v=query label[q][d].value
wd=database label[q][correspondence[d]].weight
vd=database label[q][correspondence[d]].value
diff=abs(v − vd)
if (w > wd) w=wd
diff=diff*w
sum weights=sum weights+w
score[q]=score[q]+diff
}if (sum weights > 0)
score[q]=score[q]/sum weights
else
score[q]=1
(Invert from difference to percentage similarity)
score[q]=100 − (score[q] ∗ 100)
overall similarity=overall similarity + score[q]
}overall similarity=overall similarity/number of description types
}
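A matching C++ sketch of the similarity evaluation (again with illustrative structures; the lower-weight rule and the inversion from difference to percentage similarity follow the pseudocode above):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One description type: per-group (weight, value) pairs, values in [0, 1].
struct Description { std::vector<double> weights, values; };

// Percentage similarity between query and database labels, given a
// per-description global user weighting and a precomputed group
// correspondence (database group index for each query group).
double compareLabels(const std::vector<Description>& query,
                     const std::vector<Description>& database,
                     const std::vector<double>& userWeights,
                     const std::vector<int>& correspondence)
{
    double overall = 0.0;
    for (std::size_t q = 0; q < query.size(); ++q) {
        double score = 0.0, sumW = 0.0;
        for (std::size_t d = 0; d < query[q].values.size(); ++d) {
            int c = correspondence[d];
            double w = query[q].weights[d] * userWeights[q];
            w = std::min(w, database[q].weights[c]);   // take the lower weight
            score += w * std::fabs(query[q].values[d] - database[q].values[c]);
            sumW  += w;
        }
        score = sumW > 0.0 ? score / sumW : 1.0;
        overall += 100.0 - score * 100.0; // invert difference to % similarity
    }
    return query.empty() ? 0.0 : overall / query.size();
}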
Bibliography
[Adelson93] Edward H. Adelson, Perceptual Organization and the Judgement of Brightness, Science, Volume 262, pp. 2042-2044, 1993
[Adelson00] Edward H. Adelson, Lightness Perception and Lightness Illusions, The New Cognitive Neurosciences (2nd Edition), MIT Press, Chapter 24, pp. 339-351, 2000. http://persci.mit.edu/people/adelson/publications/gazzan.dir/gazzan.htm/#intro
[Albert90] J. Albert, F. Ferri, J. Domingo and M. Vicens, An approach to natural scene segmentation by means of genetic algorithms with fuzzy data, in Pérez de la Blanca N., SanFeliu A. and Vidal E. (eds), Fourth National Symposium in Pattern Recognition and Image Analysis (Selected Papers), 1990, pp. 97-112
[Alwis00] T. P. G. L. S. Alwis, Content-Based Retrieval of Trademark Images, DPhil
Thesis Dept. of Computer Science, University of York. Feb 2000
[Austin96] J. Austin, High Speed Image Segmentation using a Binary Neural Network,
Aug. 1996
[Bach96] J. Bach, C. Fuller, A. Gupta, A. Hampapur, B. Horowitz, R. Humphrey,
R. Jain and C. Shu, Virage image search engine: An open framework for
image management, Proc. SPIE Storage and Retrieval for Still Image and
Video Databases IV, San Jose, California, pp. 76-87
[Barnard96] K. Barnard, G. Finlayson and B. Funt, Colour Constancy for Scenes
with Varying Illumination, 4th European Conference on Computer Vision,
Springer, 1996
[Barnard00] K. Barnard and G. Finlayson, Shadow Identification using Colour Ratios
Proceedings of the IS&T/SID Eighth Color Imaging Conference: Color Sci-
ence, Systems and Applications pp. 97-101, 2000.
[Bergen88] James R. Bergen and Edward H. Adelson, Early vision and texture percep-
tion Nature, Vol.333 pp.363-364, 1988
[Besag86] J. Besag, On the statistical analysis of dirty pictures, Journal of the Royal
Statistical Society, series B, 48, pp. 259-302, 1986.
[Biederman87] I. Biederman, Recognition by components: A theory of human image un-
derstanding, Psychological Review, 94, pp. 115-147, 1987.
[Boole1872] G. Boole, A Treatise on the Calculus of Finite Differences. 1872.
[Brainard97] D.H. Brainard, W.T. Freeman, Bayesian color constancy, Journal of the
Optical Society of America A, 14:1393-1411, 1997
[Brainard01] D.H. Brainard, Sensation and Perception: Color Vision Theory, The In-
ternational Encyclopedia of Social & Behavioral Sciences, Pergamon Press,
2001.
[Brill92] M.H. Brill, E.B. Barrett and P.M. Payton, Projective invariants for curves in
2 and 3 dimensions, Geometric Invariance in Computer Vision, pp. 193-214,
1992.
[Buchsbaum80] G. Buchsbaum, A Spatial Processor Model for Object Colour Perception,
Journal of the Franklin Institute, 310:1-26, 1980.
[Bulthoff95] H.H. Bulthoff, S.Y. Edelman and M.J. Tarr, How are three-dimensional
objects represented in the brain?, Cerebral Cortex, 5(3), pp. 247-260, 1995.
[Califano94] A. Califano and R. Mohan, Multidimensional Indexing for Recognizing Vi-
sual Shapes, IEEE Transactions on Pattern Analysis and Machine Intelli-
gence Vol 16, No 4, pp.373-391, April 1994.
[Canny86] J. Canny, A Computational Approach to Edge Detection, IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, Vol. 8, No. 6, Nov.
1986.
[Cayley] A. Cayley, Cross-referenced overview of his work and life history. Available via the internet from the University of St Andrews, St Andrews, Fife, Scotland. http://www-groups.dcs.st-andrews.ac.uk/history/Mathematicians/Cayley.html
[Chen00] K. Chen, D. Wang and X. Liu, Weight Adaptation and Oscillatory Corre-
lation for Image Segmentation, IEEE Trans. Neural Networks, vol. 11, no.
5, Sept 2000
[Cheng96] S.M. Cheng and K.T. Lo, Fast clustering process for vector quantisation
codebook design, Electronics Letters, vol. 32(4), pp. 311-312, February 1996
[Chialvo95] D. R. Chialvo and M. Millonas, How Swarms Build Cognitive Maps, The Bi-
ology and Technology of Intelligent Autonomous Agents, NATO ASI Series,
1995, pp. 439-450
[Christou03] C.G. Christou, B.S. Tjan and H. Bulthoff, Extrinsic cues aid shape recog-
nition from novel viewpoints, Journal Of Vision, pp. 183-197, 2003
[Cohen93] Cohen and Cohen, Finite element methods for active contour models and
balloons for 2D and 3D images, IEEE-Pattern Analysis and Machine Intel-
ligence, Nov. 1993.
[Conners80] R.W. Conners and C.A Harlow, A theoretical comparison of texture algo-
rithms, IEEE Transactions on Pattern Analysis and Machine Intelligence,
Vol.2, 1980, pp. 204-222.
[Cook96] R. Cook, I. McConnel and D. Stewart, Segmentation and Simulated An-
nealing, Microwave Sensing and Synthetic Aperture Radar, edited by G.
Franceschetti et al, Proc. SPIE 2958 (1996) pp30-35.
[Cutzu96] F. Cutzu and S. Edelman, Representation of object similarity in human
vision: psychophysics and a computational model, The Weizmann Institute
of Science, 1996
[Dudek97] G. Dudek and J.K. Tsotsos, Shape Representation and Recognition from
Multiscale Curvature, Computer Vision and Image Understanding, vol. 68,
no. 2, pp. 170-189(20), November 1997.
[Edelman95] S. Edelman, Representation of similarity in 3D object discriminations, Neu-
ral Computation, 7 pp. 407-422, 1995
[Felzenwalb98] P. F. Felzenszwalb and D. P. Huttenlocher, Efficiently Computing a Good
Segmentation, DARPA Image Understanding Workshop, 1998
[Finlayson92] Graham D. Finlayson, Mark S. Drew and Brian V. Funt, Diagonal Trans-
forms Suffice for Color Constancy IEEE Proceedings Fourth International
Conference on Computer Vision pp. 164-171, May 1993
[Finlayson95] Graham D. Finlayson Coefficient Color Constancy DPhil Thesis submission
Simon Fraser University
[Finlayson97a] Graham D. Finlayson, Mark S. Drew White-point preserving color correc-
tion 5th Color Imaging Conference: Color, Science, Systems and Applica-
tions, IS&T/SID, pp.258-261, Nov. 1997.
[Finlayson97b] G. Finlayson and S. Hordley, Selection for Gamut Mapping Colour Con-
stancy, British Machine Vision Conference, 1997
[Finlayson01] G. Finlayson and G. Schaefer, Hue that is invariant to brightness and
gamma, British Machine Vision Conference, Vol. 1, 2001, pp. 303-313
[Flickner95] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M.
Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele and P. Yanker, Query
by image and video content: The QBIC system, IEEE Computer 28(9),
pp.23-32, 1995.
[Forbes98] F. Forbes and A. Raftery, Bayesian Morphology: Fast Unsupervised
Bayesian Image Analysis,RR-3374, INRIA, March 1998.
[Forsyth90] D.A. Forsyth, A Novel Algorithm for Colour Constancy International Jour-
nal of Computer Vision, 5 pp. 5-36, 1990
[Forsyth90a] D. Forsyth, J. Munday, A. Zisserman and C. Brown, Projectively invariant
representations using implicit algebraic curves, Image Vision Computing 8,
pp. 130-136, 1990.
[Forsyth98] D.A. Forsyth Sampling, resampling and colour constancy Proceedings of the
Computer Vision and Pattern Recognition Conference 1998
[Gagaudakis03] G. Gagaudakis and P.L. Rosin, Shape measures for image retrieval, Pat-
tern Recognition Letters, vol. 24, no. 15, pp. 2711-2721, 2003.
[Gevers99] Theo Gevers and Arnold W.M. Smeulders Color Based Object Recognition
Pattern Recognition, 32(3) pp:453-464,1999.
[Gilchrist97] Gilchrist, 1997.
[Gonnet02] G. Gonnet, Scientific Computation (website),
http://linneus20.ethz.ch:8080/2 6 2.html, Institute for Scientific Com-
puting, ETH Zürich, Switzerland, 2002.
[Gonzalez92] R. Gonzalez and R. Woods, Digital Image Processing, Addison-Wesley Pub-
lishing Company, Chap. 4. 1992.
[Gool92] L. J. Van Gool, T. Moons, E. Pauwels and A. Oosterlinck, Semi-Differential
Invariants, Geometric Invariance in Computer Vision, pp. 157-192. Cam-
bridge, MA, MIT Press, 1992.
[Gordan] P.A. Gordan, Cross-referenced overview of his work and life history. Available via the internet from the University of St Andrews, St Andrews, Fife, Scotland. http://www-groups.dcs.st-andrews.ac.uk/history/Mathematicians/Gordan.html
[Gosling96] J. Gosling and A. Smith, Sun Microsystems, FastQSortAlgorithm
- A quick sort demonstration algorithm using a tri-median pivot,
http://www.cs.ubc.ca/spider/harrison/Java/FastQSortAlgorithm.java.html
[Gra] A version of Graham’s Algorithm, http://www.pms.informatik.uni-
muenchen.de/lehre/compgeometry/Gosper/convex hull/convex hull.html
[Graps95] A. Graps, An Introduction to Wavelets, IEEE Computational Science and
Engineering, Vol. 2, 1995
[Greenspan94] H. Greenspan, S. Belongie, R. Goodman and P. Perona, Rotation Invariant
Texture Recognition Using a Steerable Pyramid, ICPR 1994, vol 2, pp. 162-7
[Gros98] P. Gros, O. Bournez and E. Boyer, Using Local Planar Geometric Invariants
to Match and Model Images of Line Segments, Computer Vision and Image
Understanding, vol. 69, no. 2, pp. 135-155(21), February 1998.
[Harwood87] D. Harwood, M. Subbarao, H. Hakalahti and L.S. Davis, A New Class of
Edge-Preserving Smoothing Filters, Pattern Recognition Letters, Vol. 6,
1987, pp. 155-162
[Hering64] E. Hering, Outlines of a theory of the light senses, translated by Leo M.
Hurvich and Dorothea. Cambridge, MA:Harvard Univ. Press, 1964.
[Hodge02] V. Hodge and J. Austin. A High Performance k-NN Approach using Binary
Neural Networks, Neural Networks, Elsevier Science, 2002
[Hoffmann97] T. Hofmann, J. Puzicha and J. Buhmann, Deterministic Annealing for
Unsupervised Texture Segmentation, Proceedings EMMCVPR’97, Venice,
1997.
[Hoffmann98] T. Hofmann, J. Puzicha and J. Buhmann, Unsupervised Texture Segmen-
tation in a Deterministic Annealing Framework, IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, May, 1998.
[Hsu93] T. Hsu, A.D. Calway, R. Wilson, Texture Analysis using the Multiresolution
Fourier Transform, 8th Scandinavian Conference on Image Analysis, May,
1993.
[Iivarinen97] J. Iivarinen, M. Peura, J. Särelä and A. Visa, Comparison of Combined Shape Descriptors for Irregular Objects, 8th British Machine Vision Conference, 1997
[Kam99] L. Kam and J. Blanc-Talon, Multifractal Texture Characterization For Real
World Image Segmentation, ACIVS 1999
[Kasari96] M. Hauta-Kasari, J. Parkkinen, T. Jaaskelainen and R. Lenz, Generalized
Cooccurrence Matrix for Multispectral Texture Analysis, Proc. 13th Inter-
national Conference on Pattern Recognition 96, Aug. 1996
[Kass87] M. Kass, A. Witkin and D. Terzopoulos, Snakes: Active Contour Models,
International Journal of Computer Vision, Vol. 1, 1987, pp 321-331
[Katz35] D. Katz, The World of Colour, 1935 London: Kegan Paul.
[Keren89] D. Keren, R. Marcus and M. Werman, Segmenting and Compressing Wave-
forms by Minimum Length Encoding, Technical Report, Leibniz Center,
1989.
[Klinker88] G.J. Klinker, S.A. Shafer and T. Kanade, Color image analysis with an
intrinsic reflection model, Proceedings of the International Conference on
Computer Vision, 1988.
[Kliot98] M. Kliot and E. Rivlin, Invariant-Based Shape Retrieval in Pictorial
Databases, Computer Vision and Image Understanding, vol. 71, no. 2, pp.
182-197(16), 1998.
[Koubaroulis00a] D. Koubaroulis, J.Matas and J.Kittler, Illumination Invariant Object
Recognition Using The MNS Method, Proceedings of the 10th European
Signal Processing Conference, 2000
[Koubaroulis00b] D. Koubaroulis, J.Matas and J.Kittler Colour-based Image Retrieval
from Video Sequences In John P Eakins and Peter G B Enser, editors,
Proceedings of the Czech Pattern Recognition Workshop, pp1-12, 2000
[Kruizinga99] P. Kruizinga, N. Petkov and S.E. Grigorescu, Comparison of texture features
based of Gabor filters, Proceedings of the 10th International Conference on
Image Analysis and Processing, Sep. 1999, pp. 142-147
[Lamdan88] Y. Lamdan and H. J. Wolfson, Geometric hashing: a general and efficient
model-based recognition scheme, Proceedings of the 2nd International Con-
ference on Computer Vision, pp. 238-249, 1988.
[Land71] E. H. Land and J. J. McCann, Lightness and retinex theory, Journal of the Optical Society of America, 61(1), pp. 1-11, 1971
[Laws80] K. Laws, Textured Image Segmentation, Ph.D. Dissertation, University of
Southern California, January 1980.
[Levine85] M. D. Levine, Vision in Man and Machine, Publisher: McGraw-Hill, 1985
[Ma95] W.Y. Ma and B. S. Manjunath, Image indexing using a texture dictionary,
Proc. of SPIE conference on Image Storage and Archiving System, volume
2606. Oct. 1995
[Marr82] D. Marr, Vision, Freeman Press, 1982.
[Mokhtarian99] F. Mokhtarian and S. Abbasi, Shape-Based Indexing using Curvature
Scale Space with Affine Curvature, Proc. European Workshop on Content-
Based Multi-Media Indexing, pp. 255-262, 1999.
[Mundy92] J.L. Mundy and A. Zisserman (editors), Geometric Invariance In Computer
Vision, MIT Press, 1992
[Nagao95] Kenji Nagao and W. Eric L. Grimson, Recognizing 3D Objects Using Photometric Invariant, International Conference on Computer Vision, 1995, pp. 480-487
[Peipa] The Pilot European Image Processing Archive (PEIPA),
http://peipa.essex.ac.uk/
[Palmer96] Palmer, Neff and Besk, 1996.
[Palmer00] Palmer and Nelson, 2000.
[Peleg90] S. Peleg, D. Keren, R. Marcus and M. Werman, Segmentation by Minimum
Length Encoding, 10th International Conference of Pattern Recognition,
June 1990
[Pentland94] A. Pentland, R. Picard and S. Sclaroff, Photobook: Tools for content-based
manipulation of image databases, Proc. SPIE Storage and Retrieval for
Image and Video Databases II, San Jose, California, pp. 34-47.
[Peura97] M. Peura and J. Iivarinen, Efficiency of simple shape descriptors, Aspects
of Visual Form, World Scientific, pp. 443-451, 1997
[Quan98] L. Quan and F. Veillon, Joint Invariants of a Triplet of Coplanar Conics:
Stability and Discriminating Power for Object Recognition, Computer Vi-
sion and Image Understanding, vol. 70, no. 1, pp. 111-119(9), April 1998.
[Ra93] S.W. Ra and J.K. Kim, A fast mean-distance-ordered partial codebook search algorithm for image vector quantization, IEEE Transactions on Circuits and Systems II: Analogue and Digital Signal Processing, Vol. 40(9), pp. 576-579, September 1993
[Rahman] Z. Rahman, D. J. Jobson and G. A. Woodell, NASA Langley Research Centre. http://dragon.larc.nasa.gov/viplab/retinex/retinex.html
[Ramos00] V. Ramos, F. Almeida, Artificial Ant Colonies in Digital Image Habitats:
A Mass Behaviour Effect Study on Pattern Recognition, ANTS 2000 - 2nd
International Workshop on Ant Algorithms (From Ant Colonies to Artificial
Ants), Sep. 2000, pp. 113-116
[Reiss93] T.H. Reiss, Recognizing Planar Objects Using Invariant Image Features,
Publisher: Springer-Verlag, pp. 14, 1993.
[Roberts65] L. Roberts, Machine Perception of 3-D Solids, Optical and Electro-optical
Information Processing, MIT Press 1965.
[Rock92] Rock et al., Grouping can occur after perception of lightness constancy,
1992.
[Rock64] Rock and Brosgole, Perceptual grouping after stereoscopic depth perception,
1964.
[Rosin03] P.L. Rosin, Measuring shape: ellipticity, rectangularity, and triangularity,
Machine Vision and Applications, vol. 14, no. 3, pp. 172-184, 2003.
[Schulz03] M.F. Schulz and T. Sanocki, Time Course of Perceptual Grouping by Color,
Psychological Science, Vol. 14, Number 1, pp. 26-30, January 2003.
[Shi97] J. Shi and J. Malik, Normalized Cuts and Image Segmentation, IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 1997
[SimonFraser] Colour Constancy Algorithms, http://www.cs.sfu.ca/~colour/research/colour-constancy.html, Computational Vision Lab, Computing Science, Simon Fraser University, Burnaby, BC, Canada, V5A 1S6
[Simons98] D.J. Simons and R.F. Wang, Perceiving real-world viewpoint changes, Psy-
chological Science, 9, pp. 315-320, 1998.
[Singh80] S. Singh, M. Sharma and M. Markou, Evaluation of Texture Methods for Image Analysis, Pattern Recognition Letters, 1980
[Singh99] S. Singh, J. Haddon and M. Markou, Nearest Neighbour Strategies For Im-
age Understanding, Proc. Workshop on Advanced Concepts for Intelligent
Vision Systems, August, 1999
[Smith94] J.R. Smith and S.F. Chang, Quad-Tree Segmentation for Texture-Based
Query, Proc. 2nd Annual Multimedia Conference, San Francisco, Oct. 1994
[Smith97] J.R. Smith and S.F. Chang, Querying by color regions using the VisualSEEk
content-based visual query system, Intelligent Multimedia Information Re-
trieval, The MIT Press, Massachusetts Institute of Technology, Cambridge,
Massachusets and London, England, pp 23-41.
[Sproull91] R.F. Sproull, Refinements to nearest-neighbour searching in k-dimensional
trees, Algorithmica 6 (4), pp. 579-589, 1991.
[Squire00] D.M.G. Squire and T.M. Caelli, Invariance Signatures: Characterizing Con-
tours by Their Departures from Invariance, Computer Vision and Image
Understanding, vol. 77, no. 3, pp. 284-316(33), March 2000.
[Stegmann00] Mikkel B. Stegmann and Rune Fisker, On Properties of Active Shape Models, Technical Report, IMM-REP-2000-12. http://www.imm.dtu.dk/~aam/downloads/asmprops/
[Talon00] J. Blanc-Talon, Fractal Techniques in Image Analysis, Image Segmentation
and Image Compression, AISTA 2000
[Tarr90] M.J. Tarr, S. Pinker, When does human object recognition use a viewer-
centred reference frame? Psychological Science, 1, pp. 253-256, 1990.
[Taubin92] G. Taubin and D. Cooper, Object recognition based on moment (or alge-
braic) invariants, Geometric Invariance in Computer Vision, pp. 375-397.
Cambridge, MA, MIT Press, 1992.
[Thacker95] N.A. Thacker, P.A. Riocreux and R.B. Yates, Assessing the completeness
properties of pairwise geometric histograms, Image and Vision Computing,
Vol 13, No. 5, pp. 423 - 429, June 1995
[Thorisson94] K.R. Thorisson, Simulated Perceptual Grouping: An Application to
Human-Computer Interaction, Proceedings of the Sixteenth Annual Confer-
ence of the Cognitive Science Society. Atlanta, Georgia, 1994, pp. 876-881.
[Tuke01] C.E. Tuke, J. Austin and S.O’Keefe,Self-Similar Convolution Image Distri-
bution Histograms as Invariant Identifiers, British Machine Vision Confer-
ence, Sept 2001
[Vecera97] S.P. Vecera and M.J. Farah, Is visual image segmentation a bottom-up or
an interactive process?, Perception and Psychophysics, 59, pp. 1280-1296,
1997.
[Wang97] D. Wang and D. Terman, Image Segmentation based upon Oscillatory Cor-
relation, Neural Computation, Vol. 9, 1997, pp. 805-836
[Weiss88] I. Weiss, Projective invariants of shape, Proc. DARPA Image Understanding
Workshop, 1988
[Xilin99] Y. Xilin, I. Octavia, Line-Based Recognition Using a Multidimensional
Hausdorff Distance, IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol 21, No 9, pp.901-914, Sept 1999.
[Xu90] Lei Xu, Erkki Oja and Pekka Kultanen, Randomized Hough Transform, 10th International Conference on Pattern Recognition, 1990, pp. 631-635
[Yu01] S. X. Yu, J. Shi, Understanding Popout: Pre-attentive Segmentation
through Nondirectional Repulsion, CMU-RI-TR-01-20, Jul. 2001
[Zucker77] S.W. Zucker, R.A. Hummel and A. Rosenfeld, An application of Relaxation
Labelling to Line and Curve Enhancement, IEEE Transactions on Comput-
ers, Vol. 26, 1977, pp. 394-403