
MSc. Thesis: Scene layout segmentation of traffic environments using a Conditional Random Field

    Fernando Cervigni Martinelli

    Honda Research Institute Europe GmbH

A Thesis Submitted for the Degree of MSc Erasmus Mundus in Vision and Robotics (VIBOT)

    2010

  • 7/30/2019 03 Martinelli

    2/61

    Abstract

At least 80% of the traffic accidents in the world are caused by human mistakes. Whether drivers are too tired, drunk or speeding, most accidents have their root in the improper behavior of drivers. Many of these accidents could be avoided if cars were equipped with some kind of intelligent system able to detect inappropriate actions of the driver and autonomously intervene by controlling the car in emergency situations. Such an advanced driver assistance system needs to be able to understand the car environment and, from that information, predict the

appropriate behavior of the driver at every instant. In this thesis project we investigate the problem of scene understanding solely based on images from an off-the-shelf camera mounted to the car.

A system has been implemented that is capable of performing semantic segmentation and classification of road scene video sequences. The object classes which are to be segmented can be easily defined as input parameters. Some important classes for the prediction of the driver behavior include road, sidewalk, car and building, for example. Our system is trained in a supervised manner and takes into account information such as color, location, texture and also spatial context between classes. These cues are integrated within a Conditional Random Field model, which offers several practical advantages in the domain of image segmentation and classification. The recently proposed CamVid database, which contains challenging inner-city road video sequences with very precise ground truth segmentation data, has been used for evaluating the quality of our segmentation, including a comparison to state-of-the-art methods.

    Everything should be made as simple as possible, but not simpler . . .

    Albert Einstein

  • 7/30/2019 03 Martinelli

    3/61

    Contents

Acknowledgments

1 Introduction
1.1 Motivation
1.2 Goal
1.3 Thesis outline

2 Problem definition
2.1 Combined segmentation and recognition

3 State of the art
3.1 Features for image segmentation
3.1.1 Spatial prior knowledge
3.1.2 Sparse 3D cues
3.1.3 Gradient-based edges
3.1.4 Color distribution
3.1.5 Texture cues
3.1.6 Context features
3.2 Probabilistic segmentation framework
3.2.1 Conditional Random Fields
3.2.2 Energy minimization for label inference


3.3 Example: TextonBoost
3.3.1 Potentials without context
3.3.2 Texture-layout potential
3.4 Application to road scenes (Sturgess et al.)

4 Methodology
4.1 CRF framework
4.2 Basic model: location and edge potentials
4.3 Texture potential model
4.3.1 Feature vector and choice of filter bank
4.3.2 Boosting of feature vectors
4.3.3 Adaptive training procedure
4.4 Texture-layout potential model (context)
4.4.1 Training procedure
4.4.2 Practical considerations

5 Results
5.1 Model without context features
5.2 Model with context features
5.2.1 Influence of number of weak classifiers
5.2.2 Influence of the different model potentials
5.2.3 Influence of 3D features
5.3 CamVid sequences
5.4 Comparison to state of the art

6 Conclusions

Bibliography


    List of Figures

2.1 Example of ideal segmentation
3.1 3D features
3.2 Gradient-based edges
3.3 GrabCut: segmentation using color GMMs and user interaction
3.4 The Leung-Malik (LM) filter bank
3.5 Clique layouts
3.6 Sample results of TextonBoost
3.7 Image textonization
3.8 Texture-layout filters (context)
3.9 Sturgess: higher-order cliques
4.1 Examples of location potential
4.2 Intuitive example
4.3 Filter bank responses
4.4 The MR8 filter bank
4.5 3D features interpolation
4.6 Adaboost training
4.7 Adaboost classification
4.8 Software architecture


5.1 Confusion matrix without adaptive training
5.2 Confusion matrix after adaptive training
5.3 Example of texture-layout features (context)
5.4 Influence of number of weak classifiers
5.5 Influence of the different potentials
5.6 Confusion matrices for 4-class segmentation
5.7 Example of segmentations for 4-class set
5.8 Results for 11-class set segmentation
5.9 Example of segmentations for 11-class set
5.10 Comparison to state of the art
6.1 Adaptive scaling


    Acknowledgments

    I would like to thank above all my family for the constant support. They are always with me,

    even though they live on the other side of the Atlantic ocean.

My heartfelt thanks to my supervisors at Honda, Jannik Fritsch, who has been so nice and

    given me all the support I needed, and Martin Heracles, who has carefully revised this thesis

    report and given precious advice all along these four months. For his help with the iCub

    repository and for providing me with his essential CRF code, I would like to sincerely thank

    Andrew Dankers.

    I wish also to thank my supervisor, Prof. Fabrice Meriaudeau, and all professors of the

    Vibot Masters. It is hard to fathom how much I learned with you during these 2 years. Thanks

    also for offering this program, which has been an amazing and unforgettable experience.

    Last but not least, I wish to thank all my Vibot mates, who have been a great company

    studying before exams or chilling at the bar.


    Chapter 1

    Introduction

    1.1 Motivation

    Within the Honda Research Institute Europe (HRI-EU), the Attentive Co-Pilot project (ACP)

    conducts research on a multi-function Advanced Driver Assistance System (ADAS). It is desired

    and to be expected that, in the future, cars will autonomously respond to inappropriate actions

    taken by the driver. If he or she does not stop the car when the traffic lights are red or fallsasleep and slowly deviates from the normal driving course, the car should trigger an emergency

    procedure and warn the driver. A similar warning should come up, for example, if the driver

    gets distracted and the car in front inadvertently brakes, without the driver noticing it. It would

    be even safer if the car had the capability of not only recognizing it and warning the driver, but

also of taking over control in critical situations and safely correcting the driver's inappropriate

    actions. Since human mistakes, and not technical problems, are by far the main cause of traffic

    accidents, countless lives could be saved and much damage avoided if such reliable advanced

    driver assistance systems existed and were widely implemented.

    If, however, this Advanced Driver Assistance System is to become responsible for saving

lives, in a critical real-time context, it cannot afford to fail. In order to manage the extremely challenging task of building such an intelligent system, many smaller problems have to be

    successfully tackled. One of the most important is related to understanding and adequately

    representing the environment in which the car operates. For that, a variety of sensors and input

    data can be used. Indeed, participants of the DARPA Urban Challenge [5], which requires

    autonomous vehicles to drive through specific routes in a restricted city environment, rely on

    a wide range of sensors such as GPS, Radar, Lidar, inertial guidance systems as well as on the

    use of annotated maps.


    One of our aspirations, though, is to achieve the task of scene understanding by visual

    perception alone, using an off-the-shelf camera mounted in the car. We humans prove in our

    daily life as drivers that seeing the world is largely sufficient to achieve an understanding of the

    traffic environment. By ruling out the use of complicated equipment and sensing techniques, we

    aim at, once a reliable driver assistance system is achieved, manufacturing it cheap enough for

    it to be highly scalable. Considering their great potential of increasing the safety of drivers

    and therefore also of pedestrians, bicyclists, and other traffic participants, such advanced

driver assistance systems will most likely become an indispensable car component, like today's

    seat-belts.

    1.2 Goal

    A first step to understanding and representing the world surrounding the car is to segment

    the images acquired by the camera in meaningful regions and objects. In our case, meaningful

    regions are understood as the regions that are potentially relevant for the behavior of the

    driver. Examples of such regions are the road, sidewalks, other cars, traffic signs, pedestrians,

    bicyclists and so on. In contrast, in our context it is not so important, for example, to segment

    and distinguish a building on the side of the road as an individual class, since, as far as the

    driver behavior is concerned, it makes no difference whether there is a building, a fence or even

    a tree at that location.

    In order to correctly segment such meaningful regions, we need to consider semantic aspects

    of the scene rather than only its appearance, that is, even if the road consists of dark and bright

    regions because of shadows, it should still be segmented as only one semantic region. This can

    be achieved by supervised training using ground truth segmentation data.

    The work described in this thesis aims at performing this task of semantic segmentation,

    exploring the most recent insights of researchers in the field, as well as well-known and state-

    of-the-art image processing and segmentation techniques.

    1.3 Thesis outline

    This thesis is structured in five more chapters. In Chapter 2, the main goal of the investigation

done in this thesis project is formalized and explained. Chapter 3 investigates the state of the art

    in the field of semantic segmentation and road scene interpretation. Cutting-edge algorithms

    like TextonBoost are described in greater detail as they are fundamental to state-of-the-art

    methods. In Chapter 4, the methodology and implementation steps followed throughout this

    thesis project are detailed. Chapter 5 shows the results obtained for the CamVid database, both


for a set of four classes and for a set of eleven classes. A comparison of these results to the

    state of the art mentioned in Chapter 3 is also shown. Finally, in Chapter 6 the conclusions

    of the thesis are presented and suggestions regarding the areas on which future efforts should

    focus are given.


    Chapter 2

    Problem definition

    2.1 Combined segmentation and recognition

    The main goal of this thesis project is to investigate, implement and evaluate a system that per-

    forms segmentation of road scene images including a classification of the different object classes

involved. More specifically, each input color image $\mathbf{x} \in G^{M \times N \times 3}$, where $G = \{0, 1, 2, \ldots, 255\}$ and $M$ and $N$ are the image height and width, respectively, must be pixel-wise segmented. That means that each pixel $i$ of the image has to be assigned one of $N$ pre-defined classes or labels from a set $L = \{l_1, l_2, l_3, \ldots, l_N\}$. In mathematical terms, the segmentation investigated can be defined as a function $f$ that takes the color image $\mathbf{x} \in G^{M \times N \times 3}$ and returns a label image $f(\mathbf{x}) = \mathbf{y} \in L^{M \times N}$, also called a labeling of $\mathbf{x}$:

$$f : G^{M \times N \times 3} \to L^{M \times N}, \qquad f(\mathbf{x}) = \mathbf{y}.$$

    This is achieved by supervised training, which means that the system is given labeled training

    images, from which it should learn in order to subsequently segment new, unseen images.

According to state-of-the-art research, supervised segmentation techniques yield better

    results than unsupervised techniques (see Chapter 3). This is not surprising, since unsupervised

    segmentation techniques do not have ground truth information from which to learn semantic

properties, and hence can only segment the images based on purely data-driven features.

Figure 2.1 shows a typical inner-city road scene as considered in this thesis project, as well as an ideal segmentation, obtained by manual annotation. The example is taken from the CamVid

    database [4], which is a recently proposed image database with high-quality, manually-labeled

    ground truth which we use for training our system. The images have been acquired by a car-

    mounted camera, filming the scene in front of the car while driving in a city. More detail about

    the CamVid dataset is given in Chapter 5.

    Theoretically, it would be ideal if the segmentation algorithm proposed could precisely


Figure 2.1: (a) An example of a typical inner-city road scene extracted from the CamVid database. (b) The corresponding manually labeled ground truth, taking into account classes like road, pedestrian, sidewalk and sky, among others. The goal of the segmentation system to be implemented is to produce, given an image (a), an automatic segmentation that is as close as possible to the ground truth (b).

    segment all 32 classes annotated in the CamVid database. However, the more classes one tries

    to segment the more challenging and time-consuming the problem becomes. Although our

    system is supposed to be able to segment an arbitrary set of classes, as long as they are present

    in the training database, a compromise between computational efficiency and the number of

    classes to segment has to be reached. More importantly, many of the classes defined in the

CamVid database have little, if any, influence on the behaviour of the driver. Bearing

    this in mind, the segmentation algorithm should be optimized and tailored towards the most

behaviorally relevant classes.

    Furthermore, a related study recently conducted in the ACP Group suggests that, in order

to achieve a good prediction of the driver's behavior, more effort should be invested in how to

    use such a segmentation of meaningful classes in terms of segmentation-based features rather

    than in precisely segmenting a vast number of classes that may not influence, after all, how the

    driver controls the car [11].


    Chapter 3

    State of the art

The problem of image segmentation has been the focus, for some decades already, of countless image processing researchers around the globe. Although the problem itself is old, the solution to many segmentation tasks remains under active investigation even today, in particular for image segmentation applied to highly complex real-world scenes (e.g. traffic scenes). This

    chapter describes some of the techniques for image segmentation that have been applied in

areas related to the one investigated in this thesis project.

    3.1 Features for image segmentation

    3.1.1 Spatial prior knowledge

One of the simplest yet most useful cues that may be explored when segmenting images in a super-

    vised fashion is the location information of objects in the scene. For many object classes, there

    is an important correlation between the label to which a region in an image belongs and its

    location on the image. For instance, the fact that the road is mostly at the lower part of pictures

    could be helpful for its segmentation. The same applies for the segmentation of the sky, which

is normally at the upper part of an image. Many similar examples can be mentioned, such as buildings usually being on the sides of the image, which makes this feature powerful despite its simplicity.

    3.1.2 Sparse 3D cues

    Different regions in an image have often different depths. Therefore, if available, the information

    of how far each point in the image was from the camera when the image was acquired can be very


Figure 3.1: The algorithm proposed by Brostow et al. uses 3D point clouds estimated from video sequences and performs, using motion and structure features, a very satisfactory 11-class semantic segmentation.

    useful for segmentation purposes. Since individual images do not carry any depth information,

    3D cues can only be explored in specific cases where one can either measure or infer how far

    the objects in an image are. If the use of radars or equipment that directly measure distance

    is to be discarded, 3D information can be inferred by using a stereo camera set or, in the case

    of a single camera, by using structure-from-motion techniques [9]. When dealing with images

taken from an ordinary video sequence, structure-from-motion techniques must be applied.

    Figure 3.1, extracted from the work of Brostow et al. [3], shows how accurate the segmen-

    tation of road scenes can get only by using reconstructed 3D point clouds.

    Brostow et al. based their work on the following features, which can be extracted from the

    sparse 3D point cloud:

    Height above the ground;

Shortest distance from the car's path;

    Surface orientation of groups of points;

Residual error, which is a measure of how objects in the scene move with respect to the world;

    Density of points.

    3.1.3 Gradient-based edges

When one thinks of image segmentation, it is natural to expect that the label boundaries

    correspond to strong edges on the image being segmented. For example, the image of a blue

    car on a city street will have rather strong edges where, in a perfect labeling, the boundaries

    between the labels car and street are located. Some methods, like, for example, active contour

    snakes [13], explore gradient-based edge information for segmentation. Figure 3.2 shows an


Figure 3.2: (a) Original grayscale image of Lena. (b) Edge image obtained by calculating the image gradients. Edge-based segmentation methods explore the information in (b) to propose a meaningful segmentation of (a). Note how Lena's hat, face and shoulder could be quite well segmented only with this edge cue.

    example picture of Lena and its gradient. The white pixels have a greater probability of being

    located on boundaries between labels in a segmentation.

    Notice that although this is a very reasonable and useful cue, it can also turn out to be

    misleading. When dealing, for example, with shadowed scenes, very often there are stronger

edges inside regions that belong to the same label than there are on the boundaries between labels. This is particularly challenging for real-world scenes such as the traffic scenes considered

    in this thesis project. The way this cue was explored in this project is explained in detail in

    Section 3.3.1.

    3.1.4 Color distribution

Early methods, like [20], tackle the problem of image segmentation by relying solely on color

    features, which can be modeled as histogram distributions or by Gaussian Mixture Models

    (GMMs). A Gaussian Mixture Model represents a probability distribution, P(x), which is

    obtained by summing different Gaussian distributions:

$$P(\mathbf{x}) = \sum_k P_k(\mathbf{x}) \tag{3.1}$$

where

$$P_k(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k), \tag{3.2}$$

with $\mu_k$ and $\Sigma_k$ being the mean and (co)variance of the individual Gaussian distribution $k$.
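As an illustration only (this is not the implementation used in this thesis), such a color GMM can be fitted with an off-the-shelf library and then evaluated per pixel; the sketch below uses scikit-learn's GaussianMixture, and the function names are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_color_gmm(pixels_rgb, n_components=5):
    """Fit a GMM (Eqs. 3.1-3.2) to a set of pixel colors, one row per pixel."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(pixels_rgb)
    return gmm

def color_likelihood(gmm, image):
    """Per-pixel likelihood P(x) of an (H, W, 3) image under the fitted mixture."""
    h, w, _ = image.shape
    flat = image.reshape(-1, 3).astype(np.float64)
    log_p = gmm.score_samples(flat)          # log P(x) for each pixel
    return np.exp(log_p).reshape(h, w)
```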

The use of GMMs to model colors in images has also proven very efficient in binary


    Figure 3.3: Segmentation achieved by GrabCut using color GMMs and user interaction.

segmentation problems, as shown by Rother et al. [19] with their GrabCut algorithm. In such problems, one wants to separate a foreground object from the background for image editing,

    object recognition and so on. When possible, user interaction can be very useful to refine the

    results by giving important feedback after the initial automatic segmentation (see Figure 3.3).

However, in most cases, either the number of images to segment is prohibitive or the real-time nature of the segmentation task prevents any user interference at all. Both of these remarks hold true in the field of traffic scene segmentation for driver assistance.

    3.1.5 Texture cues

Along with color, texture information is often considered and can bring significant improvement to the segmentation accuracy, as in [7], where graylevel texture features were combined with color

    ones. Nowadays, most if not all the research effort on segmentation also incorporates texture

    information. This can be extracted and modeled in two main ways:

    1. Statistical Models, which try to describe the statistical correlation between pixel colors

within a restricted vicinity. Among such methods, co-occurrence matrices have been

    successfully used, for instance, for seabed classification [18];


Figure 3.4: The LM filter bank has a mix of edge, bar and spot filters at multiple scales and orientations. It has a total of 48 filters: 2 Gaussian derivative filters at 6 orientations and 3 scales, 8 Laplacian of Gaussian filters and 4 Gaussian filters.

    2. Filter bank convolution, where the image is convolved with a carefully selected set of filter

    primitives, usually composed of Gaussians, Gaussian derivatives and Laplacians. A well

    known example, the Leung-Malik (LM) filter bank [16], is shown in Figure 3.4. It is

    interesting to mention that such filter banks have similarities with the receptive fields of

    neurons in the human visual cortex.

    3.1.6 Context features

    Although color and texture may efficiently characterize image regions, they are far from enough

    for a high quality semantic segmentation if considered alone. For instance, even humans may

    not be able to tell apart, when looking only at a local patch of an image, a blue sky from the

    walls of a blue building. The key aspect of which humans naturally take advantage, and that

    allows them to unequivocally understand scenes, is the context. Even if one sees a building

    wall painted with exactly the same color as the sky, one just knows that that wall cannot be

the sky because it is surrounded by windows. In the case of road scene segmentation, typical spatial relationships between objects can be a very strong cue, for example the fact that the

    car is always on the road, which, in turn, is usually surrounded by sidewalks.

    With this in mind, computer vision researchers are now frequently looking beyond low-level

features and are more interested in contextual issues [7, 10, 14]. In Section 3.3, an example of how context in images can be exploited for segmentation is described.

    3.2 Probabilistic segmentation framework

    The choice of image features, described in the previous section, is independent of the theoretical

    framework or machine learning technique applied for segmentation inference. One can choose

    the very same features as in [7], where belief networks are used, and process them using Support


    Vector Machines, for example. In recent years, Conditional Random Fields (CRFs) have played

    an increasingly central role. CRFs have been introduced by Lafferty et al. in [15] and have ever

    since been systematically used in cutting-edge segmentation and classification approaches like

    TextonBoost [21], image sequence segmentation [27], contextual analysis of textured scenes [24]

    and traffic scene understanding [22], to name a few. Conditional Random Fields are based on

    Markov Random Fields and offer practical advantages for image classification and segmenta-

    tion. These advantages are explained in the next section, after the formal definition of Markov

    Random Fields is given.

3.2.1 Conditional Random Fields

In Random Field theory, an image can be described by a lattice S composed of sites i,

    which can be thought of as the image pixels. The sites in S are related to one another via

a neighborhood system, which is defined as $N = \{N_i, i \in S\}$, where $N_i$ is the set of sites neighbouring $i$. Additionally, $i \notin N_i$ and $i \in N_j \Leftrightarrow j \in N_i$.

    Let y denote a labeling configuration of the lattice S belonging to the set of all possible

    labelings Y. In the image segmentation context, y can be seen as a labeling image, where each

    of the sites (or pixels) i from the lattice S is assigned one label yi in the set of possible labels

$L = \{l_1, l_2, l_3, \ldots, l_N\}$, which are the object classes. The pair $(S, N)$ can be referred to as a

    Random Field.

    Moreover, (S,N) is said to be a Markov Random Field (MRF) if and only if

    P(y) > 0, y Y, and (3.3)

    P(yi|yS{i}) = P(yi|yNi) (3.4)

    That means, firstly, that the probability of any defined label configuration must be greater

than zero¹ and, secondly and most importantly, that the probability of a site assuming a given

    label just depends on its neighboring sites. The latter statement is also known as the Markov

    condition.

According to the Hammersley-Clifford theorem [1], an MRF as defined above can equiv-

    alently be characterized by a Gibbs distribution. Thus, the probability of a labeling y can be

    written as

$$P(\mathbf{y}) = Z^{-1} \exp(-U(\mathbf{y})), \tag{3.5}$$

where

$$Z = \sum_{\mathbf{y} \in Y} \exp(-U(\mathbf{y})) \tag{3.6}$$

¹ This assumption is usually taken for convenience, as, in practical terms, it does not influence the problem.


    is a normalizing constant called the partition function, and U(y) is an energy function of the

    form

$$U(\mathbf{y}) = \sum_{c \in C} V_c(\mathbf{y}). \tag{3.7}$$

    C is the set of all possible cliques and each clique c has a clique potential Vc(y) associated

    with it. A clique c is defined as a subset of sites in S in which every pair of distinct sites

    are neighbours, with single-site cliques as a special case (see Figure 3.5). Due to the Markov

    condition, the value of Vc(y) depends only on the local configuration of clique c.
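To make Eq. 3.7 concrete, the following minimal Python sketch (illustrative only, not the thesis implementation) evaluates the energy of a labeling on a 4-connected grid, assuming precomputed single-site potentials and a simple Potts-style pairwise clique potential; the actual clique potentials are application-specific, as discussed in Chapter 4.

```python
import numpy as np

def energy(labels, unary, pairwise_weight):
    """
    Energy U(y) = sum of clique potentials (Eq. 3.7) on a 4-connected grid.
    labels: (H, W) int array; unary: (H, W, n_labels) single-site potentials;
    pairwise cliques contribute a Potts penalty when neighbouring labels differ.
    """
    h, w = labels.shape
    # single-site clique potentials
    u = unary[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    # pairwise clique potentials (right and down neighbours cover each clique once)
    p = (labels[:, :-1] != labels[:, 1:]).sum() + (labels[:-1, :] != labels[1:, :]).sum()
    return u + pairwise_weight * p
```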


Figure 3.5: (a) Example of a 4-pixel neighborhood. (b) Possible unary clique layout. (c) Possible binary clique layouts.

Now let us consider the observation $x_i$, for each site $i$, which is a state belonging to a set of possible states $W = \{w_1, w_2, \ldots, w_n\}$. In this manner, we can represent the image we want to segment, where each pixel $i$ is assigned one state of the set $W$. If one thinks of a grayscale image with 8-bit resolution, for example, the set of possible states for each site (or pixel) would be defined as $W = \{0, 1, 2, \ldots, 255\}$. The segmentation problem then boils down to finding the labeling $\mathbf{y}$ such that $P(\mathbf{y}|\mathbf{x})$, the posterior probability of labeling $\mathbf{y}$ given the observation $\mathbf{x}$, is maximized. Bayes' theorem tells us that

$$P(\mathbf{y}|\mathbf{x}) = P(\mathbf{x}|\mathbf{y})P(\mathbf{y})/P(\mathbf{x}) \tag{3.8}$$

    where P(x) is a normalization factor, as Z in Eq. 3.5, and plays no role in the maximization.

Thanks to the Hammersley-Clifford theorem, one can greatly simplify this maximization problem by defining the clique potential functions $V_c(\mathbf{x}, \mathbf{y}, \theta)$ only locally. How to choose the forms

    and parameters of the potential functions for a specific application is a major topic in MRF

    modeling and will be further discussed in Chapter 4.

The main difference between MRFs and CRFs lies in the fact that MRFs are generative

    models, whereas CRFs are discriminative. That is, CRFs directly model the posterior distri-

    bution P(y|x) while MRFs learn the underlying distributions P(x|y) and P(y), arriving at the

    posterior distribution by applying the Bayes theorem.


In other words, for MRFs, the learned state-label joint probability is represented as P(y|x) = P(x|y)P(y)/P(x), where x represents the observation and y the corresponding labeling configuration. For CRFs, however, it is not required to model the label prior P(y) and the likelihood P(x|y) as for MRFs, since the posterior P(y|x) is modeled directly.

    This directly modeled posterior probability is simpler to implement and usually sufficient for

    segmenting images. Hence, for the road scene segmentation and classification problem at hand,

    CRFs are advantageous in comparison to MRFs. This is the main reason why they became so

    popular [21,22,27].

    3.2.2 Energy minimization for label inference

    Finding the labeling y that maximizes the a posteriori probability expressed in Eq. 3.5 is

    equivalent to finding y that minimizes the energy function in Eq. 3.7. An efficient way of

finding a good approximation of the energy minimum of such functions is the alpha-expansion graph-cut algorithm [2], which is widely used along with MRFs and CRFs. The idea of the alpha-

    expansion algorithm is to reduce the problem of minimizing a function like U(y) with multiple

    labels to a sequence of binary minimization problems. These sub-problems are referred to as

alpha-expansions, and will be briefly described for completeness (for details see [2]).

Suppose that we have a current image labeling $\mathbf{y}$ and one randomly chosen label $\alpha \in L = \{l_1, l_2, l_3, \ldots, l_N\}$. In the alpha-expansion operation, each pixel $i$ makes a binary decision: it can either keep its old label $y_i$ or switch to label $\alpha$, provided that this change decreases the value of the energy function. For that, we introduce a binary vector $\mathbf{s} \in \{0, 1\}^{M \times N}$ which indicates which pixels in the image (of size $M \times N$) keep their label and which switch to label $\alpha$. This defines the auxiliary configuration $\mathbf{y}[\mathbf{s}]$ as

$$y_i[\mathbf{s}] = \begin{cases} y_i, & \text{if } s_i = 0 \\ \alpha, & \text{if } s_i = 1 \end{cases} \tag{3.9}$$

This auxiliary configuration $\mathbf{y}[\mathbf{s}]$ transforms the function $U$ with multiple labels into a function of binary variables $U(\mathbf{s}) = U(\mathbf{y}[\mathbf{s}])$. If the function $U$ is composed of attractive potentials, which can be seen as a kind of convex function, the global minimum of this binary function² is guaranteed to be found exactly using standard graph cuts [21].

The expansion move algorithm starts with any initial configuration $\mathbf{y}^0$, which could be set, for instance, by taking, for each pixel, the label with maximum location prior probability³. It then computes optimal alpha-expansion moves for labels $\alpha$ in a random order, accepting the

² Notice that this does not mean that the global minimum of the multi-label function is found.
³ In the road scene segmentation case, for instance, pixels at the top of the image could start with label sky and pixels at the bottom with label road. This is equivalent to exploring the features described in Section 3.1.1.


    moves only if they decrease the energy function. The algorithm is guaranteed to converge, and

    its output is a strong local minimum, characterized by the property that no further alpha-

    expansion can decrease the value of function U.
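The outer expansion-move loop can be sketched as follows (Python, illustrative only); the binary subproblem solver is left as a hypothetical placeholder, since in practice the optimal switch mask of Eq. 3.9 is obtained exactly with a graph cut [2].

```python
import numpy as np

def alpha_expansion(labels, label_set, energy_fn, solve_binary_subproblem, max_sweeps=5):
    """
    Expansion-move outer loop (sketch). `solve_binary_subproblem(labels, alpha)`
    is a placeholder for a graph-cut solver returning the binary mask s of Eq. 3.9.
    """
    current = labels.copy()
    for _ in range(max_sweeps):
        improved = False
        for alpha in np.random.permutation(label_set):
            s = solve_binary_subproblem(current, alpha)   # s[i] == 1 -> switch to alpha
            proposal = np.where(s == 1, alpha, current)
            if energy_fn(proposal) < energy_fn(current):  # accept only if energy decreases
                current, improved = proposal, True
        if not improved:                                  # strong local minimum reached
            break
    return current
```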

    3.3 Example: TextonBoost

    One CRF-based approach to image segmentation that is currently fundamental for state-of-

    the-art methods is TextonBoost [21]. In their research, Shotton et al. have used the Microsoft

Research Cambridge (MSRC) database⁴, which is composed of 591 photographs of the following 21 object classes: building, grass, tree, cow, sheep, sky, airplane, water, face, car, bicycle, flower, sign, bird, book, chair, road, cat, dog, body, and boat. Approximately half of those pictures are picked for training, in a way that ensures proportional contributions from each class. Some

    results of their semantic segmentation on previously unseen images are shown in Figure 3.6.

Figure 3.6: TextonBoost results extracted from [21]. Above, unseen test images. Below, segmentation using a color-coded labeling. Textual labels are superimposed for better visualization.

Since the algorithm implemented for the segmentation of road scenes in this master's thesis

    has been mainly inspired by TextonBoost, a short description of the way it works is provided.

    The inference framework used is a conditional random field (CRF) model [15]. The CRF

    learns, through the training of the parameters of the clique potentials, the conditional distribu-

    tion over the possible labels given an input image. The use of a conditional random field allows

    the incorporation of texture, layout, color, location, and edge cues in a single, unified model.

The energy function $U(\mathbf{y}|\mathbf{x}, \theta)$, which is the sum of all the clique potentials (see Eq. 3.7), is defined as:

$$U(\mathbf{y}|\mathbf{x}, \theta) = \sum_i \Big[ \underbrace{\psi(y_i, i; \theta_\psi)}_{\text{location}} + \underbrace{\pi(y_i, x_i; \theta_\pi)}_{\text{color}} + \underbrace{\lambda_i(y_i, \mathbf{x}; \theta_\lambda)}_{\text{texture-layout}} \Big] + \sum_{(i,j) \in \varepsilon} \underbrace{\phi(y_i, y_j, g_{ij}(\mathbf{x}); \theta_\phi)}_{\text{edge}} \tag{3.10}$$

where $\mathbf{y}$ is the labeling or segmentation and $\mathbf{x}$ is a given image, $\varepsilon$ is the set of edges in a

⁴ The MSRC database can be downloaded at http://research.microsoft.com/vision/cambridge/recognition/


4-connected neighborhood, $\theta = \{\theta_\psi, \theta_\pi, \theta_\lambda, \theta_\phi\}$ are the model parameters, and $i$ and $j$ index

    pixels in the image, which correspond to sites in the lattice of the Conditional Random Field.

    Notice that the model consists of three unary potentials, which depend only on one site i in

    the lattice, and one pairwise potential, depending on pairs of neighboring sites.

    Each of the potentials is subsequently explained in a simplified way, for details please see [21].

    3.3.1 Potentials without context

    Location potential

The unary location potentials $\psi(y_i, i; \theta_\psi)$ capture the correlation of the class label and the absolute location of the pixel in the image. For the databases with which TextonBoost was tested, the location potentials had rather low importance since the context of the pictures is very

    the location potentials had a rather low importance since the context of the pictures is very

    diverse. In the case of our road scene segmentation, which is a more structured environment,

    they have had significantly more relevance, as discussed in Chapter 5.

    Color potential

    In TextonBoost, the color distributions of object classes are represented as Gaussian Mixture

    Models (see Section 3.1.4) in CIELab color space where the mixture coefficients depend on the

    class label. The conditional probability of the color x of a pixel labeled with class y is given by

$$P(x \mid y) = \sum_k P(x \mid k) P(k \mid y) \tag{3.11}$$

with color clusters (mixture components) $P(x \mid k)$. Notice that the clusters are shared between different classes, and that only the coefficients $P(k \mid y)$ depend on the class label. This makes the model more efficient to learn than a separate GMM for each class, which is important since

    TextonBoost takes into account a high number of classes.
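In practice the shared-cluster model of Eq. 3.11 amounts to a single matrix product between per-pixel cluster likelihoods and per-class mixing coefficients, as the following illustrative Python sketch shows (the array names are hypothetical, not taken from any implementation).

```python
import numpy as np

def class_color_likelihood(cluster_likelihoods, mixing_coeffs):
    """
    Eq. 3.11: P(x|y) = sum_k P(x|k) P(k|y), with clusters shared between classes.
    cluster_likelihoods: (n_pixels, n_clusters) array of P(x_i | k)
    mixing_coeffs:       (n_classes, n_clusters) array of P(k | y)
    returns:             (n_pixels, n_classes) array of P(x_i | y)
    """
    return cluster_likelihoods @ mixing_coeffs.T
```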

    Edge potential

The pairwise edge potentials have the form of a contrast-sensitive Potts model [2],

$$\phi(y_i, y_j, g_{ij}(\mathbf{x}); \theta_\phi) = \theta_\phi^T g_{ij}(\mathbf{x}) \, [y_i \neq y_j], \tag{3.12}$$

with $[\cdot]$ the zero-one indicator function:

$$[\text{condition}] = \begin{cases} 1, & \text{if condition is true} \\ 0, & \text{otherwise} \end{cases} \tag{3.13}$$


    The edge feature gij measures the difference in color between the neighboring pixels, as sug-

    gested by [19],

$$g_{ij} = \begin{bmatrix} \exp(-\beta \, \|x_i - x_j\|^2) \\ 1 \end{bmatrix} \tag{3.14}$$

    where xi and xj are three-dimensional vectors representing the CIELab colors of pixels i and j

    respectively. Including the unit element allows a bias to be learned, to remove small, isolated

regions⁵. The quantity $\beta$ is an image-dependent contrast term, and is set separately for each image to $\beta = (2 \langle \|x_i - x_j\|^2 \rangle)^{-1}$, where $\langle \cdot \rangle$ denotes an average over the image. The two scalar constants that compose the parameter vector $\theta_\phi$ are appropriately set by hand.
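For concreteness, the following Python sketch (illustrative only; it handles horizontal neighbour pairs, vertical pairs being computed analogously) evaluates the edge feature of Eq. 3.14 together with the image-dependent contrast term $\beta$.

```python
import numpy as np

def edge_features(image_lab):
    """
    Contrast-sensitive edge feature of Eq. 3.14 for horizontal neighbour pairs.
    image_lab: float array of shape (H, W, 3) in CIELab.
    Returns g_ij = [exp(-beta * ||x_i - x_j||^2), 1] for each pair (i, j).
    """
    diff2 = np.sum((image_lab[:, :-1, :] - image_lab[:, 1:, :]) ** 2, axis=2)
    beta = 1.0 / (2.0 * diff2.mean())        # image-dependent contrast term
    g = np.exp(-beta * diff2)
    ones = np.ones_like(g)                   # unit element that allows a bias to be learned
    return np.stack([g, ones], axis=-1)
```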

    3.3.2 Texture-layout potential

    The texture-layout potential is the most important contribution of TextonBoost. It is based on

    a set of novel features which are introduced in [21] as texture-layout filters. These new features

    are capable of, at once, capturing the correlation between texture, spatial layout, and textural

    context in an image.

    Here, we quickly describe how the texture-layout features are calculated and the boosting

    approach used to automatically select the best features and, thereby, learn the texture-layout

    potentials used in Eq. 3.10.

    Image textonization

    As a first step, the images are represented by textons [17] in order to arrive at a compact

representation of the vast range of possible appearances of objects or regions of interest⁶. The

    process of textonization is depicted in Figure 3.7, and proceeds as follows. At first, each of the

    training images is convolved with a 17-dimensional filter bank. The responses for all training

pixels are then whitened, so that they have zero mean and unit covariance, and clustered using a standard Euclidean-distance K-means clustering algorithm for dimension reduction. Finally, each pixel in each image is assigned to the nearest cluster center found with K-means, producing the texton map $T$, where pixel $i$ has value $T_i \in \{1, \ldots, K\}$.
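A minimal Python sketch of this textonization pipeline is given below; it is not the TextonBoost implementation, uses per-dimension standardization as a simplified stand-in for full whitening, and relies on SciPy and scikit-learn for the convolution and K-means steps.

```python
import numpy as np
from scipy.ndimage import convolve
from sklearn.cluster import KMeans

def textonize(images, filter_bank, n_textons=400):
    """
    Image textonization (sketch): filter-bank responses -> standardization -> K-means.
    `images` is a list of grayscale float arrays, `filter_bank` a list of 2-D kernels.
    Returns one texton map per image, with values in {0, ..., n_textons - 1}.
    """
    responses = [np.stack([convolve(img, f) for f in filter_bank], axis=-1)
                 for img in images]
    data = np.concatenate([r.reshape(-1, len(filter_bank)) for r in responses])
    mean, std = data.mean(0), data.std(0) + 1e-8        # simplified whitening
    kmeans = KMeans(n_clusters=n_textons).fit((data - mean) / std)
    # assign every pixel of every image to its nearest cluster centre
    return [kmeans.predict((r.reshape(-1, len(filter_bank)) - mean) / std)
                  .reshape(r.shape[:2]) for r in responses]
```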

    Texture-Layout Filters

    The texture-layout filter is defined by a pair (r, t) of an image region, r, and a texton t, as

illustrated in Figure 3.8. Region r is referenced relative to the pixel i being classified, and

    texton t belongs to the texton map T. For efficiency reasons, only rectangular regions are

⁵ The unit element means that for every pair of pixels that have different labels, a constant potential is added to the whole. This makes contiguous labels preferable when the energy function is minimized.
⁶ Textons have been proven effective in categorizing materials [25] as well as generic object classes [28].


Figure 3.7: The process of image textonization, as proposed by [21]. All training images are convolved with a filter bank. The filter responses are clustered using K-means. Finally, each

    pixel is assigned a texton index corresponding to the nearest cluster center to its filter response.

    implemented by TextonBoost, although any arbitrary region shape could be considered. A set

    R of candidate rectangles is chosen at random, such that every rectangle lies inside a fixed

    bounding box.

    The feature response at pixel i of texture-layout filter (r, t) is the proportion of pixels under

    the offset region r + i that have been assigned texton t in the textonization process,

$$v_{[r,t]}(i) = \frac{1}{\mathrm{area}(r)} \sum_{j \in (r+i)} [T_j = t] \,. \tag{3.15}$$

    Any part of the region r + i that lies outside the image does not contribute to the feature

    response.

    An efficient and elegant way to calculate the filter responses anywhere over an image can

    be achieved with the use of integral images [26]. For each texton t in the texton map T, a

separate integral image $I^{(t)}$ is calculated. In this integral image, the value at pixel $i = (u_i, v_i)$

    is defined as the number of pixels in the original image that have been assigned to texton t in

    the rectangular region with top left corner at (1, 1) and bottom right corner at (ui, vi):

$$I^{(t)}(i) = \sum_{j : (u_j \le u_i) \wedge (v_j \le v_i)} [T_j = t] \,. \tag{3.16}$$

    The advantage of integral images is that they can later be used to compute the texture-

layout filter responses in constant time: if $I^{(t)}$ is the integral image for texton channel $t$ defined as above, then the feature response is computed as:

$$v_{[r,t]}(i) = \left( I^{(t)}(r_{br}) - I^{(t)}(r_{bl}) - I^{(t)}(r_{tr}) + I^{(t)}(r_{tl}) \right) / \mathrm{area}(r) \tag{3.17}$$

where $r_{br}$, $r_{bl}$, $r_{tr}$ and $r_{tl}$ denote the bottom right, bottom left, top right and top left corners


Figure 3.8: Graphical explanation of texture-layout filters extracted from [21]. (a, b) An image and its corresponding texton map (colors represent texton indices). (c) Texture-layout filters are defined relative to the point i being classified (yellow cross). In this first example feature, region r1 is combined with texton t1 in blue. (d) A second feature where region r2 is combined with texton t2 in green. (e) The response $v_{[r_1,t_1]}(i)$ of the first feature is calculated at three positions in the texton map (magnified). In this example, $v_{[r_1,t_1]}(i_1) \approx 0$, $v_{[r_1,t_1]}(i_2) \approx 1$, and $v_{[r_1,t_1]}(i_3) \approx 1/2$. (f) The second feature $(r_2, t_2)$, where $t_2$ corresponds to grass, can learn that points $i$ (such as $i_4$) belonging to sheep regions tend to produce large values of $v_{[r_2,t_2]}(i)$, and hence can exploit the contextual information that sheep pixels tend to be surrounded by grass pixels.

    of rectangle r.
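The following Python sketch (illustrative only) builds one integral image per texton channel as in Eq. 3.16 and evaluates the feature response of Eq. 3.17; the handling of regions that fall partially outside the image (clipping, full-rectangle area in the denominator) is a simplifying assumption.

```python
import numpy as np

def texton_integral_images(texton_map, n_textons):
    """One integral image per texton channel (Eq. 3.16), padded so index 0 means 'empty'."""
    h, w = texton_map.shape
    ints = np.zeros((n_textons, h + 1, w + 1))
    for t in range(n_textons):
        ints[t, 1:, 1:] = np.cumsum(np.cumsum(texton_map == t, axis=0), axis=1)
    return ints

def texture_layout_response(ints, t, i, r):
    """
    Feature response v_[r,t](i) of Eq. 3.17. `i` = (row, col) of the pixel being
    classified, `r` = (top, left, bottom, right) offsets of the rectangle relative to i.
    """
    top, left, bottom, right = r
    I = ints[t]
    h, w = I.shape[0] - 1, I.shape[1] - 1
    r0 = np.clip(i[0] + top, 0, h); r1 = np.clip(i[0] + bottom, 0, h)
    c0 = np.clip(i[1] + left, 0, w); c1 = np.clip(i[1] + right, 0, w)
    area = max((bottom - top) * (right - left), 1)
    return (I[r1, c1] - I[r1, c0] - I[r0, c1] + I[r0, c0]) / area
```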

    Texture-layout features are sufficiently general to allow for an automatic learning of layout

    and context information. Figure 3.8 illustrates how texture-layout filters are able to model

    textural context and layout.

    Boosting of texture-layout filters

    A Boosting algorithm iteratively selects the most discriminative texture-layout filters (r, t) as

    weak learners and combines them into a strong classifier used to derive the texture-layout

    potential in Eq. 3.10. The boosting scheme used in TextonBoost shares each weak learner

    between a set of classes C, so that a single weak learner classifies for several classes at once.

    According to the authors, this allows for classification with cost sub-linear in the number of

    classes, and leads to improved generalization.

The strong classifier learned is the sum over the classification confidences $h_i^m(c)$ of $M$ weak


learners

$$H(y_i, i) = \sum_{m=1}^{M} h_i^m(y_i) \tag{3.18}$$

The confidence value $H(y_i, i)$ for pixel $i$ is then multiplied by a negative constant, so that a positive confidence turns into a negative energy, which will be preferred in the energy minimization, to give the texture-layout potentials $\lambda_i$ used in Eq. 3.10:

$$\lambda_i(y_i, \mathbf{x}; \theta_\lambda) = -\theta_\lambda \, H(y_i, i) \tag{3.19}$$

    Each weak learner is a decision stump based on the feature response v[r,t](i) of the form

$$h_i(c) = \begin{cases} a \, [v_{[r,t]}(i) > \theta] + b, & \text{if } c \in C \\ k_c, & \text{otherwise}, \end{cases} \tag{3.20}$$

with parameters $(a, b, k_c, \theta, C, r, t)$. The region $r$ and texton index $t$ together specify the texture-layout filter feature, and $v_{[r,t]}(i)$ denotes the corresponding feature response at position $i$. For the classes that share this feature, that is, $c \in C$, the weak learner gives $h_i(c) \in \{a + b, b\}$ depending on whether $v_{[r,t]}(i)$ is, respectively, greater or lower than the threshold $\theta$. For classes not sharing the feature ($c \notin C$), the constant $k_c$ ensures that unequal numbers of training

examples of each class do not adversely affect the learning procedure.

In order to choose the weak classifiers, TextonBoost uses the standard boosting algorithm

    introduced by Schapire et al. in [8], which will be explained for completeness. Suppose we are

choosing the $m$th weak classifier. Each training example $i$, a pixel in a training image, is paired with a target value $z_i^c \in \{-1, +1\}$, where $+1$ means that pixel $i$ has ground truth class $c$ and $-1$ that it does not, and is assigned a weight $w_i^c$ specifying its classification accuracy for class $c$ after the $m-1$ previous rounds of boosting. The $m$th weak classifier is chosen by minimizing an error function $J_{\mathrm{error}}$ weighted by $w_i^c$:

$$J_{\mathrm{error}} = \sum_c \sum_i w_i^c \left( z_i^c - h_i^m(c) \right)^2 \tag{3.21}$$

    The training examples are then re-weighted

$$w_i^c := w_i^c \, e^{-z_i^c h_i^m(c)} \tag{3.22}$$

    Minimizing the error function Jerror requires, for each new weak classifier, an expensive

brute-force search over the possible sharing classes $C$, features $(r, t)$, and thresholds $\theta$. As shown in [21], however, given these parameters, a closed-form solution does exist for $a$, $b$ and $k_c$.
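For completeness, the decision stump of Eq. 3.20 and the re-weighting step of Eq. 3.22 can be written in a few lines; the Python sketch below is illustrative only and omits the brute-force search over $(C, r, t, \theta)$ and the closed-form fit of $a$, $b$ and $k_c$.

```python
import numpy as np

def weak_response(v, a, b, k_c, theta, sharing_set, c):
    """Decision-stump weak learner of Eq. 3.20 for class c and feature response v."""
    if c in sharing_set:
        return a * (v > theta) + b
    return k_c

def reweight(w, z, h):
    """Boosting re-weighting of Eq. 3.22: w_i^c := w_i^c * exp(-z_i^c h_i^m(c))."""
    return w * np.exp(-z * h)
```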


    3.4 Application to road scenes (Sturgess et al.)

    In the more specific field of road scene segmentation, Sturgess et al. [22] have recently quite

successfully segmented inner-city road scenes into 11 different classes. Their method builds on the work of Shotton et al. (see Section 3.3) and on that of Brostow et al. [3], integrating the appearance-based features from TextonBoost with the structure-from-motion features from Brostow et al. (see Section 3.1.2) in a higher-order CRF. According to the authors, the use of higher-order cliques (that is, cliques with several pixels, instead of only pairs of pixels like in TextonBoost) produces accurate segmentations with precise object boundaries. Figure 3.9 shows how Sturgess et al. use an unsupervised meanshift segmentation of the input image to obtain regions that are used as higher-order cliques and included in the energy function U to be

    minimized.

Figure 3.9: The original image (left), its ground truth labelling (centre) and the meanshift segmentation of the image (right). The segments in the meanshift segmentation on the right are used to define higher-order potentials, allowing for more precise object boundaries in the final segmentation.

    Sturgess et al. achieved an overall accuracy of 84% compared to the previous state-of-

the-art accuracy of 69% [3] on the challenging CamVid database [4]. The work of Sturgess et al. is therefore especially important for this thesis, as it successfully tackles the same inner-city scene segmentation problem. The CamVid database will be described in more detail in Chapter 5, where the

    results obtained by our implementation are compared with those of Sturgess et al. [22].


    Chapter 4

    Methodology

    4.1 CRF framework

    After thorough consideration of related work, CRFs have been deemed very suitable and up-

to-date for dealing with the problem proposed in this thesis project. As discussed in Chapter 3, conditional random fields allow the incorporation of a wide variety of cues in a single, unified

    model. Moreover, state-of-the-art work in the field of image segmentation (see Section 3.3,

    TextonBoost) and also more specifically in the domain of inner-city road scene understanding

    (see Section 3.4, Sturgess et al.) has used CRFs. Sturgess et al. have been able to very

successfully segment eleven different classes in road scenes, some of which are very important

    to our final goal of driver behavior prediction.

    4.2 Basic model: location and edge potentials

Location and edge cues, as mentioned in Section 3.1, are very meaningful and can significantly con-

    tribute to the quality of any segmentation. In our case, location cues are all the more important

    because we deal with a very spatially structured scene. The road will, for example, never be at

the top of the image and the sky will never be at the bottom. We can then extract precious information as to where to expect our classes to be located in the picture.

    If, for a better understanding of the problem, we consider, at first, a model with just the

    location and edge potentials, then the energy function to be minimized in order to infer the


    most likely labeling becomes

$$U(\mathbf{y}|\mathbf{x}, \theta) = \sum_i \underbrace{\psi(y_i, i; \theta_\psi)}_{\text{location}} + \sum_{(i,j) \in \varepsilon} \underbrace{\phi(y_i, y_j, g_{ij}(\mathbf{x}); \theta_\phi)}_{\text{edge}} \,. \tag{4.1}$$

    The location potential is calculated based on the incidence, for all the training images, of each

    class at each pixel:

$$\psi(y_i, i; \theta_\psi) = -\log \left( \frac{N_{y_i,i} + \epsilon}{N_i + \epsilon} \right) \tag{4.2}$$

where $N_{y_i,i}$ is the number of pixels at position $i$ assigned class $y_i$ in the training images, $N_i$ is the total number of pixels at position $i$, and $\epsilon$ is a small integer to avoid the indefinition $\log(0)$ when $N_{y_i,i} = 0$ (we use $\epsilon = 1$). Figure 4.1 illustrates the location potential of classes road and sidewalk in images from the CamVid database.


Figure 4.1: (a) Location potential of class road, $\psi(\text{road}, i; \theta_\psi)$. (b) Location potential of class sidewalk, $\psi(\text{sidewalk}, i; \theta_\psi)$. The whiter, the greater the incidence of pixels from the corresponding class in the training images.
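A small Python sketch of how such a location potential could be accumulated from training label images is given below (illustrative only, not the thesis implementation; the sign convention follows Eq. 4.2, with low energy where the class incidence is high).

```python
import numpy as np

def location_potential(label_images, n_classes, eps=1.0):
    """
    Location potential of Eq. 4.2 from a stack of training label images
    (all of the same size, values in 0..n_classes-1).
    Returns an (n_classes, H, W) array of energies; low energy = high incidence.
    """
    h, w = label_images[0].shape
    counts = np.zeros((n_classes, h, w))
    for lab in label_images:
        for c in range(n_classes):
            counts[c] += (lab == c)
    n_total = len(label_images)              # pixels observed at each position
    return -np.log((counts + eps) / (n_total + eps))
```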

The pairwise edge potential has the form of a contrast-sensitive Potts model [2], as defined in TextonBoost:

$$\phi(y_i, y_j, g_{ij}(\mathbf{x}); \theta_\phi) = \theta_\phi^T g_{ij}(\mathbf{x}) \, [y_i \neq y_j], \tag{4.3}$$

with $[\cdot]$ the zero-one indicator function. The edge feature $g_{ij}$ measures the difference in color between the neighboring pixels, as suggested by [19],

$$g_{ij} = \begin{bmatrix} \exp(-\beta \, \|x_i - x_j\|^2) \\ 1 \end{bmatrix} \tag{4.4}$$

    With the help of an intuitive example, shown in Figure 4.2a, we can see how location and


edge potentials interact, resulting in a meaningful segmentation. In this example, we want to segment the toy image into three different classes, background, foreground-1 and foreground-2. Figures 4.2b, 4.2d and 4.2f show the unary location potentials $\psi(y_i, i; \theta_\psi)$ for classes foreground-1, foreground-2 and background, respectively, at every pixel $i$¹. A white pixel represents a high probability of a class being present at that pixel, which is equivalent to saying that the energy potential is low, impelling the function minimization to prefer labels where the pixels are white rather than black. Figure 4.2c shows the gradient image, which is a way to visualize the edge potential calculated as in Eq. 4.3. The segmentation boundaries are more likely to

    be located where the edge potential is white. Figure 4.2e shows the final segmentation obtained

    through the minimization of Eq. 4.1.


Figure 4.2: (a) Noisy toy image to be segmented. (c) Gradient image as basis for the edge potential. (b, d, f) Location potentials of classes foreground-1, foreground-2 and background, respectively. (e) Final segmentation inferred from the minimization of Eq. 4.1.

Note that the final segmentation correctly ignores the noise, as it is not present at the same pixels simultaneously in the edge and location potentials. The red and yellow structures inside

    the main blob are all segmented as class foreground-1 thanks to the contribution of its location

    potential. The constant term in Eq. 4.3, which adds a given cost for any pixel belonging to a

label boundary, helps suppress the appearance of noisy, small foreground regions.

¹ The location potential of the class background is complementary to the foreground classes' potentials. That is, when either class foreground-1 or foreground-2 is likely, class background is unlikely, and vice versa.


    4.3 Texture potential model

    Although the segmentation of the toy example, obtained with location and edge potentials

    described in the last section, was robust against noise, the location potentials provided were

very similar to the regions we wanted to segment. In real images, not only are the location potentials less correlated with the position of the labels, but there are also much more complex objects

    to be segmented that cannot be differentiated just by using location and edge potentials. The

    next step to a better segmentation is then modeling the texture information present in the

    images. We can represent this new potential by rewriting the energy function U as:

$$U(\mathbf{y}|\mathbf{x}, \theta) = \sum_i \Big[ \underbrace{\psi(y_i, i; \theta_\psi)}_{\text{location}} + \underbrace{\lambda_i(y_i, \mathbf{x}; \theta_\lambda)}_{\text{texture}} \Big] + \sum_{(i,j) \in \varepsilon} \underbrace{\phi(y_i, y_j, g_{ij}(\mathbf{x}); \theta_\phi)}_{\text{edge}} \tag{4.5}$$

    Note that the texture potential represents local texture only, i.e., it does not take into

    account context. It is merely a local feature. Context and layout are explored in Section 4.4,

    where the use of simplified texture-layout filters is investigated.

    In order to represent the texture information of the images to segment, we opted, similarly

    to TextonBoost [21], for the use of filter banks. By using an N-dimensional filter bank F,

    one obtains an N-dimensional feature vector, fx(i), for each pixel i. Each component of this

    vector is the result of the convolution of the input image converted to grayscale, x, and the

    corresponding filter shifted to the position of i:

f_x(i) = \big[ (F_1 * x)|_i, \; (F_2 * x)|_i, \; \ldots, \; (F_N * x)|_i \big]^T    (4.6)

Equivalently, the result of the convolution of an N-dimensional filter bank with an image can be understood by considering the convolution of the image with one filter component at a time. Figure 4.3 shows an example input image and the response images for some of the Leung-Malik filter bank components [16].
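To make the construction of Eq. 4.6 concrete, the sketch below convolves a grayscale image with a small set of filters and stacks the responses into a per-pixel feature vector, also appending the CIELab values (which anticipates Eq. 4.7). The kernels used here are simple placeholders standing in for the LM/MR8 banks, and all names are illustrative.

    % Per-pixel feature vectors from an N-dimensional filter bank (Eq. 4.6),
    % with the CIELab values appended afterwards (Eq. 4.7).
    rgb  = im2double(imread('road_scene.png'));
    gray = rgb2gray(rgb);
    bank = {fspecial('gaussian', 15, 1), ...
            fspecial('gaussian', 15, 2), ...
            fspecial('log',      15, 2)};          % placeholder filter bank
    N = numel(bank);
    [H, W] = size(gray);
    f = zeros(H, W, N);
    for k = 1:N
        f(:, :, k) = imfilter(gray, bank{k}, 'symmetric');   % response of filter k
    end
    lab = rgb2lab(rgb);                            % L, a, b channels
    f   = cat(3, f, lab);                          % H x W x (N + 3) feature volume
    % The feature vector of pixel i = (r, c) is squeeze(f(r, c, :)).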

    4.3.1 Feature vector and choice of filter bank

    The choice of the filter bank used to represent the texture in the images to be segmented was

    based on the following criteria:

    Good coverage of possible textures without too much redundancy between filters;


Figure 4.3: (a) Example of an inner-city road scene image. (b-f) Examples of responses of five different filter components of the LM filter bank, which are shown at the bottom left corner of each figure.

    Fast and efficient filter response calculation;

    Ready-to-use implementation available.

Considering these criteria, a suitable implementation by the Intelligent Systems Lab of the University of Amsterdam has been found. It is provided as a Matlab .mex file, i.e., pre-compiled C code that is called by Matlab at execution time. The libraries are freely available for research purposes².

Using this fast .mex implementation, five different filter banks have been assessed by segmenting images using only the texture potential in Eq. 4.5. Four classes have been considered: road, sidewalk, others³ and sky.

    The filter banks assessed were the following:

MR8 The MR8 filter bank consists of 38 filters but only 8 filter responses. The filter bank contains filters at multiple orientations, but their outputs are collapsed by recording only the maximum filter response across all orientations (see Figure 4.4);

MR8 - no maxima The rotation invariance of the MR8 filter bank, achieved by taking only the maximum response over all orientations, may not be a desired property of a texture filter bank used for segmentation, since some classes could be described by the orientation of their features. Therefore, a filter bank called MR8 - no maxima has been defined, where all 38 responses are kept;

²Source code at: http://www.science.uva.nl/mark.
³Class others is assigned to any pixel that is not labeled as one of the other three classes; it can thus be seen as the complement of the other three classes.

    MR8 - separate channels Here, the MR8 filter is applied individually to each of the

    three color channels, in an attempt to verify whether discriminative texture information

    is unevenly distributed over the color channels;

MR8 - multi-scale This filter bank is composed of three MR8 filter banks at three successive scales. Although the MR8 filter bank itself already uses filters at different scales, we found it worthwhile to cover even more scales, as road scenes almost always contain objects whose distance may vary by many orders of magnitude⁴;

⁴For instance, there might be a car immediately in front of the camera but also another one tens of meters away.

TextonBoost's filter bank This filter bank has 17 dimensions and is based on the CIELab color space. It consists of Gaussians at scales k, 2k and 4k, x and y derivatives of Gaussians at scales 2k and 4k, and Laplacians of Gaussians at scales k, 2k, 4k and 8k. The Gaussians are applied to all three color channels, while the other filters are applied only to the luminance.

    Figure 4.4: The MR8 filter bank is low dimensional, rotationally invariant and yet capable ofpicking out oriented features. Note that only the maximum response of the filters of each ofthe first 6 rows is taken.

As all the filter banks, except MR8 - separate channels and TextonBoost's filter bank, are convolved with grayscale images, we also concatenated to the texture feature vector f_x(i), which is the response of the filter bank, the L, a and b color values of the corresponding pixel:

f'_x(i) = \big[ f_x(i)^T, \; L_i, \; a_i, \; b_i \big]^T    (4.7)

In this manner, the color information was merged with the texture information, giving an extra cue to the Adaboost classifiers⁵.

    The results of the tests showed that the filter bank that yielded the best segmentation

    results and, thus, best represented the texture information in the road scene images was the

    MR8 - multi-scale. This is probably due to the aforementioned fact that road scene images

    have similar objects and regions that may vary greatly in depth. This variation is well captured

    by the multiple-scale characteristic of the MR8 - multi-scale filter bank.

Combination of 3D cues into the feature vector

    As discussed in section 3.1.2, 3D information can be extracted from images in a video sequence

    using structure from motion techniques. Those techniques can only infer the 3D position of

    characteristic points in the image, that is, points that can be located, described and then

matched in subsequent images. In this thesis this has been done using the Harris corner detector, with normalized cross-correlation over patches for matching. Other possible patch descriptors

    are, for example, SIFT and SURF.

All 3D features mentioned in section 3.1.2 have been concatenated, just like the L, a and b color values, to the feature vector described in Eq. 4.7:

f''_x(i) = \big[ f'_x(i)^T, \; \text{3Dfeature}_1(i), \; \ldots, \; \text{3Dfeature}_5(i) \big]^T    (4.8)

However, in order to include these 3D cues in our feature vector, they need to be defined for every pixel of an input image. That means we have to transform the sparse 3D features obtained with reconstruction techniques into dense features. This can be done by interpolation, where every pixel is assigned 3D feature values based on the values of the sparse neighboring points recovered by the reconstruction.
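A minimal Matlab sketch of this densification step is given below, assuming the sparse reconstruction yields image coordinates pts (M x 2) and one cue value per point (here, height above ground). The interpolation method and the zero-mean, unit-variance normalization are illustrative choices, not necessarily those used in the thesis.

    % Turn a sparse 3D cue into a dense per-pixel map by interpolation.
    % pts(:,1) = column (x) and pts(:,2) = row (y) of reconstructed points,
    % 'heights' their feature values; H and W are the image dimensions.
    F = scatteredInterpolant(pts(:,1), pts(:,2), heights, 'linear', 'nearest');
    [U, V]    = meshgrid(1:W, 1:H);                 % query every pixel
    heightMap = reshape(F(U(:), V(:)), H, W);
    % Normalize so this cue does not dominate the Euclidean distances used
    % later during clustering (see the remark on normalization below).
    heightMap = (heightMap - mean(heightMap(:))) / (std(heightMap(:)) + eps);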

⁵Tests have been performed with different color spaces, yielding the best results when CIELab was used. This comes from the fact that the CIELab color space is partially invariant to scene lighting modifications: only the L dimension changes, in contrast to the three dimensions of the RGB color space, for instance.


    Figure 4.5 shows an example of dense interpolation of the 3D feature height above ground

    for an image taken from the CamVid database.

Figure 4.5: (a) A dusk image taken from the CamVid database. (b) The calculated height above ground 3D feature. After determining a point cloud from structure from motion techniques, the sparse features have been interpolated to yield a dense representation. Notice how the sky has high values and that we can see a faint blob where the car is located in the original image.

    It is important to mention that, before concatenating them to the feature vector as shown

    in Eq. 4.8, the 3D features have been appropriately normalized. The normalization guarantees

    that they do not overshadow the texture and color features during the clustering process. This

    could happen if the values of the 3D features were much greater than the values of the other

features. Since the clustering method implemented uses Euclidean distances, such an imbalance

    in the feature values would result in biased cluster centers. The influence of the use of 3D

    features on the segmentation results is discussed in Chapter 5.

    4.3.2 Boosting of feature vectors

Having defined the feature vector as in Eq. 4.8, we then need to find patterns in the features extracted from training images and try to recognize them in new, unseen images. For instance, we want to learn which texture, color and 3D cues are typical of each of the classes we want to segment. Some of the machine learning techniques suitable for this task are neural networks, belief networks or Gaussian Mixture Models in the N-dimensional space (where N is the number of filters in the filter bank). Nonetheless, an Adaboost approach has been preferred for its generalization power and ease of use.

A short overview of the way Adaboost works is given here. For more details about its implementation and theoretical grounds, please see [8]. For this thesis project we have


Figure 4.6: Example of the training procedure for classifier road. The Q × K data matrix D is represented by the red vectors, whereas the 1 × K label vector L is indicated by the green arrows.

utilized a ready-to-use Matlab implementation from Moscow State University⁶.

Note that, since we are dealing with binary Adaboost classification, a classifier is trained for each of the classes we want to segment in a one-versus-all manner. For the training of each classifier, a learning data matrix D ∈ R^{Q×K} is taken as input by the Adaboost trainer. Matrix D has size Q × K, where Q is the number of dimensions⁷ of the feature vector from Eq. 4.8 and K is the number of training vectors (the feature vectors are extracted from pixels in the training images). Another input, a 1 × K vector L ∈ {0, 1}^{1×K}, contains the labels of the training data D. Vector L is comprised of ones for the pixels belonging to the class of the classifier being trained, and zeros otherwise. Figure 4.6 illustrates how individual classifiers for each class are trained.

The Adaboost classifier of class c is composed of M stump weak classifiers h_c(f),

h_c(f) = \begin{cases} 1 & \text{if } f_p > \theta \\ 0 & \text{otherwise} \end{cases}    (4.9)

where f_p is the p-th dimension of vector f and θ is a threshold. The strong classifier H_{class c}(f(i)) is built by choosing the most discriminative weak learners, minimizing the error with respect to the target value, as explained in Section 3.3.2. Figure 4.7 shows how a trained classifier outputs a confidence value between zero and one for feature vectors from unseen images.
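The following sketch shows how such a stump-based classifier could be evaluated on the feature vector of one pixel. The structure used for the weak classifiers and the normalization of the weighted vote to the [0, 1] range are assumptions made for illustration; the actual training and evaluation rely on the Modest AdaBoost toolbox mentioned above.

    % Illustrative evaluation of a boosted stump classifier (Eq. 4.9) for one
    % feature vector f.  'stumps' is an assumed struct array with fields
    % dim (feature index p), thresh (theta) and alpha (weak-learner weight).
    function conf = classifyPixel(f, stumps)
        vote = 0;
        for m = 1:numel(stumps)
            h    = double(f(stumps(m).dim) > stumps(m).thresh);  % stump of Eq. 4.9
            vote = vote + stumps(m).alpha * h;
        end
        conf = vote / sum([stumps.alpha]);   % confidence normalized to [0, 1]
    end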

    Once we have defined a strong classifier H for each class, the texture potential of Eq. 4.5

⁶Source code available at http://graphics.cs.msu.ru/en/science/research/machinelearning/modestada.
⁷Q = N (number of dimensions of the filter bank) + 3 (L, a, b) + 5 (3D features).


Figure 4.7: Given a trained classifier, a classification confidence is computed based on how similar the input feature vector is to the positive examples, and on how different it is from the negative ones, provided in the training phase illustrated in Figure 4.6.

    can be defined as:

\pi_i(y_i, x; \theta_\pi) = -\kappa \cdot H_{\text{class}\, y_i}(f''_x(i))    (4.10)

The output of the strong classifier H_{class y_i}(f''_x(i)) is multiplied by a negative constant -κ, so that a positive confidence turns into a negative energy, which will be preferred in the energy minimization. θ_π is the set of all parameters used in the Adaboost training of H, for instance the number of weak classifiers.

    4.3.3 Adaptive training procedure

    In order to make the training of Adaboost classifiers more tractable, not every pixel of every

    training image has been selected to build the training data matrix D. Since there is a lot

    of redundancy between pixels, this simplification has not adversely affected the quality of the

    Adaboost classifiers.

Although the pixels used for extracting training feature vectors were initially selected at random, a smarter, adaptive selection algorithm has been developed.

The adaptive training procedure works by iteratively choosing an unequal proportion of feature vectors from each label. The idea is that, based on the confusion matrix of a given segmentation experiment, we know the strengths and weaknesses of the trained classifiers. For instance, suppose that in a given segmentation experiment class sky is not confused as much as street and sidewalk. Then it is reasonable to choose, in the next segmentation experiment, more feature vectors from classes street and sidewalk and fewer from class sky for the training of classifiers street and sidewalk.

    Formally, if we represent the weight (or proportion) of training feature vectors from class

    i, used in the Adaboost training of classifier j, as Wij, the update of every weight after each


    segmentation iteration (experiment) can be expressed as:

W_{ij} = \begin{cases} \frac{1}{Z} \, W_{ij} \, e^{\gamma \, Cm_{ij}} & \text{if } i \neq j \\ \frac{1}{Z} \, W_{ij} \, e^{\gamma (1 - Cm_{ij})} & \text{if } i = j \end{cases}    (4.11)

where Cm_{ij} is the element in the i-th row and j-th column of the confusion matrix of the previous segmentation iteration, γ is a learning speed factor and Z is a normalization factor that guarantees that

\sum_i W_{ij} = 1,    (4.12)

or, in other words, that the sum of the proportions of feature vectors from each class remains equal to 1. The weights are all equally initialized as W_{ij} = 1/N_c, N_c representing the number of classes.

    Notice that in the case of a perfect segmentation, where the confusion matrix is equal to

    the identity matrix, the proportion of training feature vector samples Wij does not change.
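A compact Matlab sketch of one such update is shown below; Cm, W, gamma and Nc are illustrative variable names for the confusion matrix, the proportion matrix W_ij, the learning speed factor and the number of classes.

    % One update of the adaptive sampling proportions (Eqs. 4.11 and 4.12).
    Nc = size(Cm, 1);
    F  = exp(gamma * Cm);                           % factors for i ~= j
    F(1:Nc+1:end) = exp(gamma * (1 - diag(Cm)));    % factors for i == j (diagonal)
    W  = W .* F;                                    % re-weight proportions
    W  = bsxfun(@rdivide, W, sum(W, 1));            % normalize: sum_i W_ij = 1
    % For a perfect segmentation (Cm equal to the identity) every factor is 1,
    % so the proportions remain unchanged, as noted above.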

Although the adaptive learning algorithm considerably improved the segmentation quality (see Section 5.1), the use of local features alone is intrinsically limited. As precise and discriminative as a classifier may be, there are cases where class sidewalk is virtually identical to class road for every local feature imaginable. The natural next step towards a better segmentation is to use context information. Then, the fact that sidewalks normally run alongside roads, separating them from buildings or other regions, can be exploited to help us correctly differentiate what is locally indistinguishable.

    4.4 Texture-layout potential model (context)

In order to model contextual information, we opt for the texture-layout features introduced by TextonBoost. The resulting potential replaces the texture potentials explained in the previous section, as texture-layout features are more general. We then have the following energy function:

U(y | x, \theta) = \sum_i \Big[ \underbrace{\lambda(y_i, i; \theta_\lambda)}_{\text{location}} + \underbrace{\pi_i(y_i, x; \theta_\pi)}_{\text{texture-layout}} \Big] + \sum_{(i,j)} \underbrace{\psi(y_i, y_j, g_{ij}(x); \theta_\psi)}_{\text{edge}}    (4.13)

    In this equation, the texture-layout potentials are defined similarly to the way they are defined

    in TextonBoost:

\pi_i(y_i, x; \theta_\pi) = -\kappa \cdot H(y_i, i)    (4.14)


    The confidence H(yi, i) is the output of a strong classifier found by boosting weak classifiers,

H(y_i, i) = \sum_{m=1}^{M} h^m_{y_i}(i)    (4.15)

    Each weak classifier, in turn, is defined based on the response of a texture-layout filter:

h^m_{y_i}(i) = \begin{cases} a & \text{if } v_{[r,t]}(i) > \theta \\ b & \text{otherwise} \end{cases}    (4.16)

    Notice the difference from the definition in Eq. 3.20 of TextonBoost: bearing in mind our

    final goal of behavior prediction, we do not need to classify as many classes as in TextonBoost

    where up to 32 different classes are segmented. TextonBoost shares weak classifiers because

    the computation cost becomes sub-linear with the number of classes. Since we do not need as

    many classes, it is possible for us to simplify the calculation of strong classifiers by not using

    shared weak classifiers. Therefore, in our approach, each strong classifier has its own, exclusive

    weak classifiers.

    The texture-layout filter response v[r,t](i) is the proportion of pixels in the input image,

    from all those lying in the rectangle r with its origin shifted to pixel i, that have been assigned

    texton t in the textonization process illustrated in section 3.3.2:

v_{[r,t]}(i) = \frac{1}{\text{area}(r)} \sum_{j \in (r+i)} [T_j = t]    (4.17)
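In practice such a response can be evaluated in constant time with an integral image; a minimal Matlab sketch is given below. The texton map, the rectangle representation and all variable names are illustrative, and boundary clamping is omitted for brevity.

    % Texture-layout filter response v_[r,t](i) of Eq. 4.17 via an integral image.
    % textonMap is H x W with entries in 1..K; r = [rowMin rowMax colMin colMax]
    % are offsets of the rectangle relative to pixel i = (iRow, iCol).
    intT = cumsum(cumsum(double(textonMap == t), 1), 2);  % integral image of texton t
    intT = padarray(intT, [1 1], 0, 'pre');                % zero row/column in front
    r1 = iRow + r(1);  r2 = iRow + r(2);                   % absolute rectangle bounds
    c1 = iCol + r(3);  c2 = iCol + r(4);                   % (clamping to the image omitted)
    sumT = intT(r2+1, c2+1) - intT(r1, c2+1) - intT(r2+1, c1) + intT(r1, c1);
    v    = sumT / ((r2 - r1 + 1) * (c2 - c1 + 1));         % proportion of texton t in r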

    4.4.1 Training procedure

    We used, for our textonization process, the same feature vector definition as in Eq. 4.8, which

    contains texture, color and 3D cues.

In order to build a strong classifier (note that we need to train one strong classifier for each of the classes we want to segment our image into), weak classifiers are added one by one according to the following boosting procedure:

1. Generation of weak classifier candidates: Each weak classifier is composed of a texture-layout filter (r, t) and a threshold θ. The candidates are generated by randomly choosing a rectangular region r inside a bounding box, a texton index t ∈ T = {1, 2, ..., K}, where K is the number of clusters used in the textonization process, and finally a threshold θ between 0 and 1. For the addition of each weak classifier, an arbitrary number of candidates, N_cd, is generated.


2. Calculation of parameters a and b for all candidates: Each weak classifier candidate must also be assigned values a and b so that its response, h^m_c(i), is fully defined (see Eq. 4.16). As described by Torralba et al. [23], who use the same boosting approach (except that ours does not share weak classifiers), a and b can be calculated as follows:

b = \frac{\sum_i w^c_i \, z^c_i \, [v_{[r,t]}(i) \leq \theta]}{\sum_i w^c_i \, [v_{[r,t]}(i) \leq \theta]},    (4.18)

a = \frac{\sum_i w^c_i \, z^c_i \, [v_{[r,t]}(i) > \theta]}{\sum_i w^c_i \, [v_{[r,t]}(i) > \theta]},    (4.19)

where c is the label for which the classifier is being trained, z^c_i = +1 or z^c_i = -1 for pixels i whose ground truth label is, respectively, equal to or different from c, and w^c_i are the classification accuracy weights used by Adaboost (see Section 3.3.2).

Note, from Eq. 4.18 and Eq. 4.19, that, for the calculation of a and b, the response of the texture-layout filters, v_{[r,t]}(i), must be computed for all training pixels i and compared to the threshold θ (a sketch of this step is given after this list).

3. Search for the best weak classifier candidate: Once each weak classifier candidate is fully defined, that is, all parameters (r, t, θ, a, b) are set, the most discriminative among the candidates is found by minimizing the error function with respect to the target values z^c_i.
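Below is a minimal Matlab sketch of the fitting and ranking of one candidate (steps 2 and 3); v, z, w and theta are illustrative names for the filter responses of all training pixels, their +1/-1 targets, the current boosting weights and the candidate threshold.

    % Fit the stump outputs a and b (Eqs. 4.18-4.19) for one candidate (r, t, theta)
    % and compute the weighted error used to select the best candidate (step 3).
    above = v > theta;                                           % logical mask over pixels
    a = sum(w(above)  .* z(above))  / (sum(w(above))  + eps);    % Eq. 4.19
    b = sum(w(~above) .* z(~above)) / (sum(w(~above)) + eps);    % Eq. 4.18
    h   = a .* above + b .* (~above);                            % weak-classifier output (Eq. 4.16)
    err = sum(w .* (z - h).^2);                                  % weighted squared error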

In Chapter 5 we see how texture-layout strong classifiers can learn the context between objects. We also observe how the number of weak classifiers influences the segmentation quality.

    4.4.2 Practical considerations

    System architecture

Due to the short period of time available for this thesis work, the implementation of the software had to be efficient and fast. Owing to its flexibility, and to the variety of ready-to-use image processing, statistics, plotting and other functions available, Matlab has been the preferred tool for the implementation of the solution.

    Conditional Random Fields are, however, intrinsically highly demanding in computational

    resources. This is due to the iterative nature of the minimization procedure of the cost function

U, detailed in section 3.2.1. As Matlab is an interpreted programming language, it is significantly slower to process loops than compiled languages such as C or C++. Therefore, Matlab

    has proven to be unable to cope with the massive calculations needed for the segmentation

    inference, when the cost function U is minimized.


Figure 4.8: Software architecture. The Matlab layer is responsible for the higher-level processing, whereas the C++ layer handles the heavy energy minimization computation.

In the context of the iCub project [12], which is led by the RobotCub Consortium (consisting of several European universities), a good C++ framework for the minimization of Markov Random Field energy functions has been found. The main goal of the iCub platform is to study cognition through the implementation of biologically motivated algorithms. The project is open source: both the hardware design and the software are freely available.

The software implemented has therefore been based on a two-layer layout, as illustrated in Figure 4.8. Matlab, on the higher level, pre-processes the images (calculating, for instance, filter convolutions), whereas the C++ program calculates the minimum of the energy function U. In other words, the C++ layer infers, from the given clique potentials and the pre-processed Matlab data, the maximum a posteriori labeling.

The assessment of the quality of the segmentations, the storage of results and all complementary software functionalities are handled by Matlab on the higher-level layer.


    Implementation challenges and optimizations

Unlike in the case of the texture potential explained in the previous section, we could not find any ready-to-use Matlab implementation of the boosting procedure for the texture-layout potential, as it is very specific to this problem. The whole algorithm therefore had to be implemented from scratch. Moreover, since there are countless loops involved in the training algorithm described above, Matlab was ruled out as the programming environment for this part and was replaced by C++.

Two main practical problems were faced in the C++ implementation of the algorithm described above: firstly, the long processing time and, secondly, the lack of RAM memory.

1. Processing time: The boosting procedure described in the previous section requires computations over all training pixels. If we consider 100 images (a typical number for a training data set), each composed of, for instance, 800 × 600 pixels, we already have 48 million calculation loops for each step. This turns out to be impractical for today's processors. The solution found was to resize all dataset images before segmenting them and

    also to consider, as training pixels, only a subsampled set of each image. By resizing the

    images to half their original size and subsampling the training pixels in 5-pixel steps, we

    could already reduce the number of calculation loops 100 times. After this simplifica-

    tion was applied, the decrease in segmentation quality was almost imperceptible, which

    indicates that the information necessary for training the classifiers was not lost with the

    resizing and subsampling.

    2. RAM memory:

    As discussed in section 3.3, the use of integral images is essential for the efficiency of the

    calculation of the texture-layout filters v[r,t]. If we consider that 100 textons have been

    defined in the textonization process, we have, for each training image, 100 integral images,

    one for each texton index. Again, considering 100 training images already resized to half

    their original size, we have ten thousand 400 300 matrices (each matrix represents an

    integral image). If we use a normal int variable for each matrix elementwhich in C++

    occupies 4 byteswe need 10000 400 300 4 = 4.8 Gigabytes or RAM memory.

    The first attempt to avoid this memory problem was to load only some of the integral

    images at a time. However, for the calculation of the texture-layout filter responses of the

weak classifier candidates, all the integral matrices are necessary. They therefore had to be simultaneously accessible in RAM memory.

The solution was to use short unsigned integers, with only 2 bytes, which were big


enough for all the integral matrices analyzed⁸, and also to subsample the integral image matrices:

I^{(t)}(u, v) = I^{(t)}_{\text{sub}}\big( \text{round}((u, v) / \text{SubsamplingFactor}) \big)    (4.20)

Again, the subsampling barely changed the results of the final segmentations. One of the reasons is probably that the subsampling rate of 3 used is much smaller than the sizes of the rectangular regions r used in the texture-layout features. Although the subsampling reduced the amount of RAM memory necessary for loading the integral images, there is still a limit on the number of training images that can be used without causing memory problems.
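A small sketch of this storage scheme, assuming a per-texton texton map and the subsampling factor of 3 mentioned above, is given below; all variable names are illustrative.

    % Store the per-texton integral image subsampled and as 16-bit integers.
    intFull = cumsum(cumsum(double(textonMap == t), 1), 2);
    intSub  = uint16(intFull(1:sub:end, 1:sub:end));   % ~sub^2 fewer entries, 2 bytes each
    % Approximate lookup of the full-resolution value at pixel (u, v), Eq. 4.20
    % (clamping at the upper image border omitted for brevity):
    val = double(intSub(max(1, round(u/sub)), max(1, round(v/sub))));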

⁸Each short unsigned integer can store a number of up to 65535. If we consider a 400 × 300 pixel image, the maximum value of an integral image, if all pixels were assigned to one single texton, is 120000. However, since each pixel is assigned to one of many texton indexes, the integral image of each texton never has values close to the limit 65535.


    Chapter 5

    Results

    In this chapter, we investigate the performance of our semantic segmentation system on the

    challenging CamVid dataset and compare our results with existing work. Firstly, we show

preliminary results obtained with the texture features described in Section 4.3 without considering any context. We then analyse our final model with the context features (texture-layout

    features) described in Section 4.4. The effect of different aspects and parameters of the model

    is discussed before we present the best results obtained and analyse them quantitatively and

    qualitatively.

    5.1 Model without context features

    Figure 5.1 shows the confusion matrix of the segmentation of approximately 200 pictures, with

    classifiers trained on 140 other pictures, all randomly taken from the CamVid database. For this

    segmentation experiment, 500 training feature vectors have been randomly chosen per training

image. The segmentations have been computed by minimizing Eq. 4.5, which does not include any context feature. Notice that sidewalks are hardly recognized at all.

The adaptive training procedure described in Section 4.3.3 chooses, for the training of the Adaboost classifiers, more example feature vectors from labels that are confused (like road and sidewalk) than from those that are easily recognized (like sky). The confusion matrix of Figure 5.1 shows the results of the segmentation of the first iteration of this adaptive Adaboost training algorithm, where all training vectors are chosen randomly. After three iterations, examples are selectively chosen and the confusion matrix of the segmentation results, shown in Figure 5.2, shows much better discernment between classes that were initially mixed up.

Although the adaptive training procedure improved the segmentation quality, context information, as discussed in the next section, helps differentiate the classes even better.



Figure 5.1: Confusion matrix of a segmentation experiment choosing random feature vectors for training the Adaboost classifiers. Each row shows what proportion of the ground truth class has been assigned to each class by the classifiers. Class others is the union of all classes defined in the CamVid database except street, sidewalk and sky. For an ideal segmentation, the confusion matrix would be equal to the identity matrix.

Figure 5.2: Confusion matrix of the segmentation after three iterations of the adaptive training. Initially, 65% of class sidewalk was wrongly assigned to class road, as compared to only 25% with the adaptive learning. The percentage of class sidewalk correctly assigned also increased from 9% to 61%.

    5.2 Model with context features

    Our final model includes the texture-layout potential (see Section 4.4). This model and its

    results are discussed in detail in the following sections.

    5.2.1 Influence of number of weak classifiers

As illustrated in Figure 3.8, texture-layout filters work by exploiting the contextual correlation of textures (and, in our solution, also color) between neighboring regions. Figure 5.3 shows the rectangular region r of each of the first ten texture-layout features for the classifier of class road. Notice that the location distribution of the regions r is slightly biased towards either the

    top half or the bottom half of the image. This comes, probably, from the fact that most of

    the correlations between textures present in class road and other textures happen in a vertical

    fashion: the road is normally below other classes.


Figure 5.3: r regions of the first ten weak classifiers composing the strong classifier for class road. The yellow cross in the middle indicates the pixel i being classified and the blue rectangle represents the bounding box within which all the weak classifier candidates are created. The bigger the