
MSc. Thesis: Scene layout segmentation of traffic environments using a Conditional Random Field

    Fernando Cervigni Martinelli

    Honda Research Institute Europe GmbH

A Thesis Submitted for the Degree of MSc Erasmus Mundus in Vision and Robotics (VIBOT)

    2010

  • 7/30/2019 03 Martinelli

    2/61

    Abstract

At least 80% of the traffic accidents in the world are caused by human mistakes. Whether drivers are too tired, drunk or speeding, most accidents have their root in the improper behavior of drivers. Many of these accidents could be avoided if cars were equipped with some kind of intelligent system able to detect inappropriate actions of the driver and autonomously intervene by controlling the car in emergency situations. Such an advanced driver assistance system needs to be able to understand the car environment and, from that information, predict the

appropriate behavior of the driver at every instant. In this thesis project we investigate the problem of scene understanding solely based on images from an off-the-shelf camera mounted to the car.

A system has been implemented that is capable of performing semantic segmentation and classification of road scene video sequences. The object classes which are to be segmented can be easily defined as input parameters. Some important classes for the prediction of the driver behavior include road, sidewalk, car and building, for example. Our system is trained in a supervised manner and takes into account information such as color, location, texture and also spatial context between classes. These cues are integrated within a Conditional Random Field model, which offers several practical advantages in the domain of image segmentation and classification. The recently proposed CamVid database, which contains challenging inner-city road video sequences with very precise ground truth segmentation data, has been used for evaluating the quality of our segmentation, including a comparison to state-of-the-art methods.

    Everything should be made as simple as possible, but not simpler . . .

    Albert Einstein

  • 7/30/2019 03 Martinelli

    3/61

    Contents

Acknowledgments

1 Introduction
1.1 Motivation
1.2 Goal
1.3 Thesis outline

2 Problem definition
2.1 Combined segmentation and recognition

3 State of the art
3.1 Features for image segmentation
3.1.1 Spatial prior knowledge
3.1.2 Sparse 3D cues
3.1.3 Gradient-based edges
3.1.4 Color distribution
3.1.5 Texture cues
3.1.6 Context features
3.2 Probabilistic segmentation framework
3.2.1 Conditional Random Fields
3.2.2 Energy minimization for label inference


3.3 Example: TextonBoost
3.3.1 Potentials without context
3.3.2 Texture-layout potential
3.4 Application to road scenes (Sturgess et al.)

4 Methodology
4.1 CRF framework
4.2 Basic model: location and edge potentials
4.3 Texture potential model
4.3.1 Feature vector and choice of filter bank
4.3.2 Boosting of feature vectors
4.3.3 Adaptive training procedure
4.4 Texture-layout potential model (context)
4.4.1 Training procedure
4.4.2 Practical considerations

5 Results
5.1 Model without context features
5.2 Model with context features
5.2.1 Influence of number of weak classifiers
5.2.2 Influence of the different model potentials
5.2.3 Influence of 3D features
5.3 CamVid sequences
5.4 Comparison to state of the art

6 Conclusions

Bibliography


    List of Figures

2.1 Example of ideal segmentation
3.1 3D features
3.2 Gradient-based edges
3.3 GrabCut: segmentation using color GMMs and user interaction
3.4 The Leung-Malik (LM) filter bank
3.5 Clique layouts
3.6 Sample results of TextonBoost
3.7 Image textonization
3.8 Texture-layout filters (context)
3.9 Sturgess: higher-order cliques
4.1 Examples of location potential
4.2 Intuitive example
4.3 Filter bank responses
4.4 The MR8 filter bank
4.5 3D features interpolation
4.6 Adaboost training
4.7 Adaboost classification
4.8 Software architecture


5.1 Confusion matrix without adaptive training
5.2 Confusion matrix after adaptive training
5.3 Example of texture-layout features (context)
5.4 Influence of number of weak classifiers
5.5 Influence of the different potentials
5.6 Confusion matrices for 4-class segmentation
5.7 Example of segmentations for 4-class set
5.8 Results for 11-class set segmentation
5.9 Example of segmentations for 11-class set
5.10 Comparison to state of the art
6.1 Adaptive scaling


    Acknowledgments

    I would like to thank above all my family for the constant support. They are always with me,

    even though they live on the other side of the Atlantic ocean.

My heartfelt thanks to my supervisors at Honda, Jannik Fritsch, who has been so nice and

    given me all the support I needed, and Martin Heracles, who has carefully revised this thesis

    report and given precious advice all along these four months. For his help with the iCub

    repository and for providing me with his essential CRF code, I would like to sincerely thank

    Andrew Dankers.

    I wish also to thank my supervisor, Prof. Fabrice Meriaudeau, and all professors of the

    Vibot Masters. It is hard to fathom how much I learned with you during these 2 years. Thanks

    also for offering this program, which has been an amazing and unforgettable experience.

    Last but not least, I wish to thank all my Vibot mates, who have been a great company

    studying before exams or chilling at the bar.


    Chapter 1

    Introduction

    1.1 Motivation

    Within the Honda Research Institute Europe (HRI-EU), the Attentive Co-Pilot project (ACP)

    conducts research on a multi-function Advanced Driver Assistance System (ADAS). It is desired

    and to be expected that, in the future, cars will autonomously respond to inappropriate actions

    taken by the driver. If he or she does not stop the car when the traffic lights are red or fallsasleep and slowly deviates from the normal driving course, the car should trigger an emergency

    procedure and warn the driver. A similar warning should come up, for example, if the driver

    gets distracted and the car in front inadvertently brakes, without the driver noticing it. It would

    be even safer if the car had the capability of not only recognizing it and warning the driver, but

also of taking over control in critical situations and safely correcting the driver's inappropriate

    actions. Since human mistakes, and not technical problems, are by far the main cause of traffic

    accidents, countless lives could be saved and much damage avoided if such reliable advanced

    driver assistance systems existed and were widely implemented.

    If, however, this Advanced Driver Assistance System is to become responsible for saving

lives, in a critical real-time context, it cannot afford to fail. In order to manage the extremely challenging task of building such an intelligent system, many smaller problems have to be

    successfully tackled. One of the most important is related to understanding and adequately

    representing the environment in which the car operates. For that, a variety of sensors and input

    data can be used. Indeed, participants of the DARPA Urban Challenge [5], which requires

    autonomous vehicles to drive through specific routes in a restricted city environment, rely on

    a wide range of sensors such as GPS, Radar, Lidar, inertial guidance systems as well as on the

    use of annotated maps.


    One of our aspirations, though, is to achieve the task of scene understanding by visual

    perception alone, using an off-the-shelf camera mounted in the car. We humans prove in our

    daily life as drivers that seeing the world is largely sufficient to achieve an understanding of the

    traffic environment. By ruling out the use of complicated equipment and sensing techniques, we

    aim at, once a reliable driver assistance system is achieved, manufacturing it cheap enough for

    it to be highly scalable. Considering their great potential of increasing the safety of drivers

    and therefore also of pedestrians, bicyclists, and other traffic participants, such advanced

driver assistance systems will most likely become an indispensable car component, like today's

    seat-belts.

    1.2 Goal

    A first step to understanding and representing the world surrounding the car is to segment

    the images acquired by the camera in meaningful regions and objects. In our case, meaningful

    regions are understood as the regions that are potentially relevant for the behavior of the

    driver. Examples of such regions are the road, sidewalks, other cars, traffic signs, pedestrians,

    bicyclists and so on. In contrast, in our context it is not so important, for example, to segment

    and distinguish a building on the side of the road as an individual class, since, as far as the

    driver behavior is concerned, it makes no difference whether there is a building, a fence or even

    a tree at that location.

    In order to correctly segment such meaningful regions, we need to consider semantic aspects

    of the scene rather than only its appearance, that is, even if the road consists of dark and bright

    regions because of shadows, it should still be segmented as only one semantic region. This can

    be achieved by supervised training using ground truth segmentation data.

    The work described in this thesis aims at performing this task of semantic segmentation,

    exploring the most recent insights of researchers in the field, as well as well-known and state-

    of-the-art image processing and segmentation techniques.

    1.3 Thesis outline

    This thesis is structured in five more chapters. In Chapter 2, the main goal of the investigation

done in this thesis project is formalized and explained. Chapter 3 investigates the state of the art

    in the field of semantic segmentation and road scene interpretation. Cutting-edge algorithms

    like TextonBoost are described in greater detail as they are fundamental to state-of-the-art

    methods. In Chapter 4, the methodology and implementation steps followed throughout this

    thesis project are detailed. Chapter 5 shows the results obtained for the CamVid database, both


for a set of four classes and for a set of eleven classes. A comparison of these results to the

    state of the art mentioned in Chapter 3 is also shown. Finally, in Chapter 6 the conclusions

    of the thesis are presented and suggestions regarding the areas on which future efforts should

    focus are given.


    Chapter 2

    Problem definition

    2.1 Combined segmentation and recognition

    The main goal of this thesis project is to investigate, implement and evaluate a system that per-

    forms segmentation of road scene images including a classification of the different object classes

involved. More specifically, each input color image $\mathbf{x} \in G^{M \times N \times 3}$, where $G = \{0, 1, 2, \ldots, 255\}$ and $M$ and $N$ are the image height and width, respectively, must be pixel-wise segmented. That means that each pixel $i$ of the image has to be assigned one of $N$ pre-defined classes or labels from a set $L = \{l_1, l_2, l_3, \ldots, l_N\}$. In mathematical terms, the segmentation investigated can be defined as a function $f$ that takes the color image $\mathbf{x} \in G^{M \times N \times 3}$ and returns a label image $f(\mathbf{x}) = \mathbf{y} \in L^{M \times N}$, also called a labeling of $\mathbf{x}$:

$$f : G^{M \times N \times 3} \to L^{M \times N}, \qquad f(\mathbf{x}) = \mathbf{y}.$$

    This is achieved by supervised training, which means that the system is given labeled training

    images, from which it should learn in order to subsequently segment new, unseen images.

According to state-of-the-art research, supervised segmentation techniques yield better

    results than unsupervised techniques (see Chapter 3). This is not surprising, since unsupervised

    segmentation techniques do not have ground truth information from which to learn semantic

properties, and hence can only segment the images based on purely data-driven features.

Figure 2.1 shows a typical inner-city road scene as considered in this thesis project, as well as an ideal segmentation, obtained by manual annotation. The example is taken from the CamVid

    database [4], which is a recently proposed image database with high-quality, manually-labeled

    ground truth which we use for training our system. The images have been acquired by a car-

    mounted camera, filming the scene in front of the car while driving in a city. More detail about

    the CamVid dataset is given in Chapter 5.

    Theoretically, it would be ideal if the segmentation algorithm proposed could precisely


Figure 2.1: (a) An example of a typical inner-city road scene extracted from the CamVid database. (b) The corresponding manually labeled ground truth, taking into account classes like road, pedestrian, sidewalk and sky, among others. The goal of the segmentation system to be implemented is to produce, given an image (a), an automatic segmentation that is as close as possible to the ground truth (b).

    segment all 32 classes annotated in the CamVid database. However, the more classes one tries

    to segment the more challenging and time-consuming the problem becomes. Although our

    system is supposed to be able to segment an arbitrary set of classes, as long as they are present

    in the training database, a compromise between computational efficiency and the number of

    classes to segment has to be reached. More importantly, many of the classes defined in the

CamVid database have little, if any, influence on the behaviour of the driver. Bearing

    this in mind, the segmentation algorithm should be optimized and tailored towards the most

behaviorally relevant classes.

    Furthermore, a related study recently conducted in the ACP Group suggests that, in order

to achieve a good prediction of the driver's behavior, more effort should be invested in how to

    use such a segmentation of meaningful classes in terms of segmentation-based features rather

    than in precisely segmenting a vast number of classes that may not influence, after all, how the

    driver controls the car [11].


    Chapter 3

    State of the art

The problem of image segmentation has been the focus, for some decades already, of countless image processing researchers around the globe. Although the problem itself is old, the solution to many segmentation tasks remains under active investigation even today, in particular for image segmentation applied to highly complex real-world scenes (e.g. traffic scenes). This

    chapter describes some of the techniques for image segmentation that have been applied in

areas related to the one investigated in this thesis project.

    3.1 Features for image segmentation

    3.1.1 Spatial prior knowledge

One of the simplest yet most useful cues that may be explored when segmenting images in a super-

    vised fashion is the location information of objects in the scene. For many object classes, there

    is an important correlation between the label to which a region in an image belongs and its

    location on the image. For instance, the fact that the road is mostly at the lower part of pictures

    could be helpful for its segmentation. The same applies for the segmentation of the sky, which

is normally at the upper part of an image. Many similar examples can be mentioned, such as buildings usually being on the sides of the image, which makes this feature powerful despite its simplicity.

    3.1.2 Sparse 3D cues

    Different regions in an image have often different depths. Therefore, if available, the information

    of how far each point in the image was from the camera when the image was acquired can be very


Figure 3.1: The algorithm proposed by Brostow et al. uses 3D point clouds estimated from video sequences and performs, using motion and structure features, a very satisfactory 11-class semantic segmentation.

    useful for segmentation purposes. Since individual images do not carry any depth information,

    3D cues can only be explored in specific cases where one can either measure or infer how far

    the objects in an image are. If the use of radars or equipment that directly measure distance

    is to be discarded, 3D information can be inferred by using a stereo camera set or, in the case

    of a single camera, by using structure-from-motion techniques [9]. When dealing with images

taken from an ordinary video sequence, structure-from-motion techniques must be applied.

    Figure 3.1, extracted from the work of Brostow et al. [3], shows how accurate the segmen-

    tation of road scenes can get only by using reconstructed 3D point clouds.

    Brostow et al. based their work on the following features, which can be extracted from the

    sparse 3D point cloud:

    Height above the ground;

Shortest distance from the car's path;

    Surface orientation of groups of points;

Residual error, which is a measure of how objects in the scene move with respect to the world;

    Density of points.

    3.1.3 Gradient-based edges

When one thinks of image segmentation, it is natural to expect that the label boundaries

    correspond to strong edges on the image being segmented. For example, the image of a blue

    car on a city street will have rather strong edges where, in a perfect labeling, the boundaries

    between the labels car and street are located. Some methods, like, for example, active contour

    snakes [13], explore gradient-based edge information for segmentation. Figure 3.2 shows an


Figure 3.2: (a) Original grayscale image of Lena. (b) Edge image obtained by calculating the image gradients. Edge-based segmentation methods explore the information in (b) to propose a meaningful segmentation of (a). Note how Lena's hat, face and shoulder could be quite well segmented only with this edge cue.

    example picture of Lena and its gradient. The white pixels have a greater probability of being

    located on boundaries between labels in a segmentation.

    Notice that although this is a very reasonable and useful cue, it can also turn out to be

    misleading. When dealing, for example, with shadowed scenes, very often there are stronger

edges inside regions that belong to the same label than there are on the boundaries between labels. This is particularly challenging for real-world scenes such as the traffic scenes considered

    in this thesis project. The way this cue was explored in this project is explained in detail in

    Section 3.3.1.

    3.1.4 Color distribution

Early methods, like [20], tackle the problem of image segmentation by relying solely on color

    features, which can be modeled as histogram distributions or by Gaussian Mixture Models

    (GMMs). A Gaussian Mixture Model represents a probability distribution, P(x), which is

    obtained by summing different Gaussian distributions:

$$P(\mathbf{x}) = \sum_k P_k(\mathbf{x}) \tag{3.1}$$

where

$$P_k(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k), \tag{3.2}$$

with $\mu_k$ and $\Sigma_k$ being the mean and (co)variance of the individual Gaussian distribution $k$.
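As an illustration only (this is not the implementation used in this thesis), such a color GMM can be fitted with an off-the-shelf library and then evaluated per pixel; the sketch below uses scikit-learn's GaussianMixture, and the function names are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_color_gmm(pixels_rgb, n_components=5):
    """Fit a GMM (Eqs. 3.1-3.2) to a set of pixel colors, one row per pixel."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(pixels_rgb)
    return gmm

def color_likelihood(gmm, image):
    """Per-pixel likelihood P(x) of an (H, W, 3) image under the fitted mixture."""
    h, w, _ = image.shape
    flat = image.reshape(-1, 3).astype(np.float64)
    log_p = gmm.score_samples(flat)          # log P(x) for each pixel
    return np.exp(log_p).reshape(h, w)
```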

The use of GMMs to model colors in images has also proven very efficient in binary


    Figure 3.3: Segmentation achieved by GrabCut using color GMMs and user interaction.

segmentation problems, as shown by Rother et al. [19] with their GrabCut algorithm. In such problems, one wants to separate a foreground object from the background for image editing,

    object recognition and so on. When possible, user interaction can be very useful to refine the

    results by giving important feedback after the initial automatic segmentation (see Figure 3.3).

However, in most cases, either the number of images to segment is prohibitive or the real-time nature of the segmentation task prevents any user interference at all. Both of these remarks hold true in the field of traffic scene segmentation for driver assistance.

    3.1.5 Texture cues

Along with color, texture information is often considered and can bring significant improvement to the segmentation accuracy, as in [7], where graylevel texture features were combined with color

    ones. Nowadays, most if not all the research effort on segmentation also incorporates texture

    information. This can be extracted and modeled in two main ways:

    1. Statistical Models, which try to describe the statistical correlation between pixel colors

within a restricted vicinity. Among such methods, co-occurrence matrices have been

    successfully used, for instance, for seabed classification [18];


Figure 3.4: The LM filter bank has a mix of edge, bar and spot filters at multiple scales and orientations. It has a total of 48 filters: 2 Gaussian derivative filters at 6 orientations and 3 scales, 8 Laplacian of Gaussian filters and 4 Gaussian filters.

    2. Filter bank convolution, where the image is convolved with a carefully selected set of filter

    primitives, usually composed of Gaussians, Gaussian derivatives and Laplacians. A well

    known example, the Leung-Malik (LM) filter bank [16], is shown in Figure 3.4. It is

    interesting to mention that such filter banks have similarities with the receptive fields of

    neurons in the human visual cortex.

    3.1.6 Context features

    Although color and texture may efficiently characterize image regions, they are far from enough

    for a high quality semantic segmentation if considered alone. For instance, even humans may

    not be able to tell apart, when looking only at a local patch of an image, a blue sky from the

    walls of a blue building. The key aspect of which humans naturally take advantage, and that

    allows them to unequivocally understand scenes, is the context. Even if one sees a building

    wall painted with exactly the same color as the sky, one just knows that that wall cannot be

the sky because it is surrounded by windows. In the case of road scene segmentation, typical spatial relationships between objects can be a very strong cue, for example the fact that the

    car is always on the road, which, in turn, is usually surrounded by sidewalks.

    With this in mind, computer vision researchers are now frequently looking beyond low-level

features and are more interested in contextual issues [7, 10, 14]. In Section 3.3, an example of how context in images can be exploited for segmentation is described.

    3.2 Probabilistic segmentation framework

    The choice of image features, described in the previous section, is independent of the theoretical

    framework or machine learning technique applied for segmentation inference. One can choose

    the very same features as in [7], where belief networks are used, and process them using Support


    Vector Machines, for example. In recent years, Conditional Random Fields (CRFs) have played

    an increasingly central role. CRFs have been introduced by Lafferty et al. in [15] and have ever

    since been systematically used in cutting-edge segmentation and classification approaches like

    TextonBoost [21], image sequence segmentation [27], contextual analysis of textured scenes [24]

    and traffic scene understanding [22], to name a few. Conditional Random Fields are based on

    Markov Random Fields and offer practical advantages for image classification and segmenta-

    tion. These advantages are explained in the next section, after the formal definition of Markov

    Random Fields is given.

3.2.1 Conditional Random Fields

In Random Field theory, an image can be described by a lattice S composed of sites i,

    which can be thought of as the image pixels. The sites in S are related to one another via

a neighborhood system, which is defined as $N = \{N_i, i \in S\}$, where $N_i$ is the set of sites neighbouring $i$. Additionally, $i \notin N_i$ and $i \in N_j \Leftrightarrow j \in N_i$.

    Let y denote a labeling configuration of the lattice S belonging to the set of all possible

    labelings Y. In the image segmentation context, y can be seen as a labeling image, where each

    of the sites (or pixels) i from the lattice S is assigned one label yi in the set of possible labels

$L = \{l_1, l_2, l_3, \ldots, l_N\}$, which are the object classes. The pair $(S, N)$ can be referred to as a

    Random Field.

    Moreover, (S,N) is said to be a Markov Random Field (MRF) if and only if

    P(y) > 0, y Y, and (3.3)

    P(yi|yS{i}) = P(yi|yNi) (3.4)

    That means, firstly, that the probability of any defined label configuration must be greater

than zero¹ and, secondly and most importantly, that the probability of a site assuming a given

    label just depends on its neighboring sites. The latter statement is also known as the Markov

    condition.

According to the Hammersley-Clifford theorem [1], an MRF as defined above can equiv-

    alently be characterized by a Gibbs distribution. Thus, the probability of a labeling y can be

    written as

$$P(\mathbf{y}) = Z^{-1} \exp(-U(\mathbf{y})), \tag{3.5}$$

where

$$Z = \sum_{\mathbf{y} \in Y} \exp(-U(\mathbf{y})) \tag{3.6}$$

¹ This assumption is usually taken for convenience, as, in practical terms, it does not influence the problem.


    is a normalizing constant called the partition function, and U(y) is an energy function of the

    form

$$U(\mathbf{y}) = \sum_{c \in C} V_c(\mathbf{y}). \tag{3.7}$$

    C is the set of all possible cliques and each clique c has a clique potential Vc(y) associated

    with it. A clique c is defined as a subset of sites in S in which every pair of distinct sites

    are neighbours, with single-site cliques as a special case (see Figure 3.5). Due to the Markov

    condition, the value of Vc(y) depends only on the local configuration of clique c.
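To make Eq. 3.7 concrete, the following minimal Python sketch (illustrative only, not the thesis implementation) evaluates the energy of a labeling on a 4-connected grid, assuming precomputed single-site potentials and a simple Potts-style pairwise clique potential; the actual clique potentials are application-specific, as discussed in Chapter 4.

```python
import numpy as np

def energy(labels, unary, pairwise_weight):
    """
    Energy U(y) = sum of clique potentials (Eq. 3.7) on a 4-connected grid.
    labels: (H, W) int array; unary: (H, W, n_labels) single-site potentials;
    pairwise cliques contribute a Potts penalty when neighbouring labels differ.
    """
    h, w = labels.shape
    # single-site clique potentials
    u = unary[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    # pairwise clique potentials (right and down neighbours cover each clique once)
    p = (labels[:, :-1] != labels[:, 1:]).sum() + (labels[:-1, :] != labels[1:, :]).sum()
    return u + pairwise_weight * p
```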


Figure 3.5: (a) Example of a 4-pixel neighborhood. (b) Possible unary clique layout. (c) Possible binary clique layouts.

Now let us consider the observation $x_i$, for each site $i$, which is a state belonging to a set of possible states $W = \{w_1, w_2, \ldots, w_n\}$. In this manner, we can represent the image we want to segment, where each pixel $i$ is assigned one state of the set $W$. If one thinks of a grayscale image with 8-bit resolution, for example, the set of possible states for each site (or pixel) would be defined as $W = \{0, 1, 2, \ldots, 255\}$. The segmentation problem then boils down to finding the labeling $\mathbf{y}$ such that $P(\mathbf{y}|\mathbf{x})$, the posterior probability of labeling $\mathbf{y}$ given the observation $\mathbf{x}$, is maximized. Bayes' theorem tells us that

$$P(\mathbf{y}|\mathbf{x}) = P(\mathbf{x}|\mathbf{y})P(\mathbf{y})/P(\mathbf{x}) \tag{3.8}$$

    where P(x) is a normalization factor, as Z in Eq. 3.5, and plays no role in the maximization.

Thanks to the Hammersley-Clifford theorem, one can greatly simplify this maximization problem by defining the clique potential functions $V_c(\mathbf{x}, \mathbf{y}, \theta)$ only locally. How to choose the forms

    and parameters of the potential functions for a specific application is a major topic in MRF

    modeling and will be further discussed in Chapter 4.

The main difference between MRFs and CRFs lies in the fact that MRFs are generative

    models, whereas CRFs are discriminative. That is, CRFs directly model the posterior distri-

    bution P(y|x) while MRFs learn the underlying distributions P(x|y) and P(y), arriving at the

    posterior distribution by applying the Bayes theorem.


In other words, for MRFs, the learned state-label joint probability is represented as P(y|x) = P(x|y)P(y)/P(x), where x represents the observation and y the corresponding labeling configuration. For CRFs, however, it is not required to model the label prior P(y) and the likelihood P(x|y) as for MRFs, since the posterior P(y|x) is modeled directly.

    This directly modeled posterior probability is simpler to implement and usually sufficient for

    segmenting images. Hence, for the road scene segmentation and classification problem at hand,

    CRFs are advantageous in comparison to MRFs. This is the main reason why they became so

    popular [21,22,27].

    3.2.2 Energy minimization for label inference

    Finding the labeling y that maximizes the a posteriori probability expressed in Eq. 3.5 is

    equivalent to finding y that minimizes the energy function in Eq. 3.7. An efficient way of

finding a good approximation of the energy minimum of such functions is the alpha-expansion graph-cut algorithm [2], which is widely used along with MRFs and CRFs. The idea of the alpha-

    expansion algorithm is to reduce the problem of minimizing a function like U(y) with multiple

    labels to a sequence of binary minimization problems. These sub-problems are referred to as

alpha-expansions, and will be briefly described for completeness (for details see [2]).

Suppose that we have a current image labeling $\mathbf{y}$ and one randomly chosen label $\alpha \in L = \{l_1, l_2, l_3, \ldots, l_N\}$. In the alpha-expansion operation, each pixel $i$ makes a binary decision: it can either keep its old label $y_i$ or switch to label $\alpha$, provided that this change decreases the value of the energy function. For that, we introduce a binary vector $\mathbf{s} \in \{0, 1\}^{M \times N}$ which indicates which pixels in the image (of size $M \times N$) keep their label and which switch to label $\alpha$. This defines the auxiliary configuration $\mathbf{y}[\mathbf{s}]$ as

$$y_i[\mathbf{s}] = \begin{cases} y_i, & \text{if } s_i = 0 \\ \alpha, & \text{if } s_i = 1 \end{cases} \tag{3.9}$$

This auxiliary configuration $\mathbf{y}[\mathbf{s}]$ transforms the function $U$ with multiple labels into a function of binary variables $U(\mathbf{s}) = U(\mathbf{y}[\mathbf{s}])$. If the function $U$ is composed of attractive potentials, which can be seen as a kind of convex function, the global minimum of this binary function² is guaranteed to be found exactly using standard graph cuts [21].

The expansion move algorithm starts with any initial configuration $\mathbf{y}^0$, which could be set, for instance, by taking, for each pixel, the label with maximum location prior probability³. It then computes optimal alpha-expansion moves for labels $\alpha$ in a random order, accepting the

² Notice that this does not mean that the global minimum of the multi-label function is found.
³ In the road scene segmentation case, for instance, pixels at the top of the image could start with label sky and pixels at the bottom with label road. This is equivalent to exploring the features described in Section 3.1.1.


    moves only if they decrease the energy function. The algorithm is guaranteed to converge, and

    its output is a strong local minimum, characterized by the property that no further alpha-

    expansion can decrease the value of function U.
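The outer expansion-move loop can be sketched as follows (Python, illustrative only); the binary subproblem solver is left as a hypothetical placeholder, since in practice the optimal switch mask of Eq. 3.9 is obtained exactly with a graph cut [2].

```python
import numpy as np

def alpha_expansion(labels, label_set, energy_fn, solve_binary_subproblem, max_sweeps=5):
    """
    Expansion-move outer loop (sketch). `solve_binary_subproblem(labels, alpha)`
    is a placeholder for a graph-cut solver returning the binary mask s of Eq. 3.9.
    """
    current = labels.copy()
    for _ in range(max_sweeps):
        improved = False
        for alpha in np.random.permutation(label_set):
            s = solve_binary_subproblem(current, alpha)   # s[i] == 1 -> switch to alpha
            proposal = np.where(s == 1, alpha, current)
            if energy_fn(proposal) < energy_fn(current):  # accept only if energy decreases
                current, improved = proposal, True
        if not improved:                                  # strong local minimum reached
            break
    return current
```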

    3.3 Example: TextonBoost

    One CRF-based approach to image segmentation that is currently fundamental for state-of-

    the-art methods is TextonBoost [21]. In their research, Shotton et al. have used the Microsoft

Research Cambridge (MSRC) database⁴, which is composed of 591 photographs of the following 21 object classes: building, grass, tree, cow, sheep, sky, airplane, water, face, car, bicycle, flower, sign, bird, book, chair, road, cat, dog, body, and boat. Approximately half of those pictures are picked for training, in a way that ensures proportional contributions from each class. Some

    results of their semantic segmentation on previously unseen images are shown in Figure 3.6.

Figure 3.6: TextonBoost results extracted from [21]. Above, unseen test images. Below, segmentation using a color-coded labeling. Textual labels are superimposed for better visualization.

Since the algorithm implemented for the segmentation of road scenes in this master's thesis

    has been mainly inspired by TextonBoost, a short description of the way it works is provided.

    The inference framework used is a conditional random field (CRF) model [15]. The CRF

    learns, through the training of the parameters of the clique potentials, the conditional distribu-

    tion over the possible labels given an input image. The use of a conditional random field allows

    the incorporation of texture, layout, color, location, and edge cues in a single, unified model.

The energy function $U(\mathbf{y}|\mathbf{x}, \theta)$, which is the sum of all the clique potentials (see Eq. 3.7), is defined as:

$$U(\mathbf{y}|\mathbf{x}, \theta) = \sum_i \Big[ \underbrace{\psi(y_i, i; \theta_\psi)}_{\text{location}} + \underbrace{\pi(y_i, x_i; \theta_\pi)}_{\text{color}} + \underbrace{\lambda_i(y_i, \mathbf{x}; \theta_\lambda)}_{\text{texture-layout}} \Big] + \sum_{(i,j) \in \varepsilon} \underbrace{\phi(y_i, y_j, g_{ij}(\mathbf{x}); \theta_\phi)}_{\text{edge}} \tag{3.10}$$

where $\mathbf{y}$ is the labeling or segmentation and $\mathbf{x}$ is a given image, $\varepsilon$ is the set of edges in a

⁴ The MSRC database can be downloaded at http://research.microsoft.com/vision/cambridge/recognition/


4-connected neighborhood, $\theta = \{\theta_\psi, \theta_\pi, \theta_\lambda, \theta_\phi\}$ are the model parameters, and $i$ and $j$ index

    pixels in the image, which correspond to sites in the lattice of the Conditional Random Field.

    Notice that the model consists of three unary potentials, which depend only on one site i in

    the lattice, and one pairwise potential, depending on pairs of neighboring sites.

    Each of the potentials is subsequently explained in a simplified way, for details please see [21].

    3.3.1 Potentials without context

    Location potential

The unary location potentials $\psi(y_i, i; \theta_\psi)$ capture the correlation of the class label and the absolute location of the pixel in the image. For the databases with which TextonBoost was tested, the location potentials had rather low importance since the context of the pictures is very

    the location potentials had a rather low importance since the context of the pictures is very

    diverse. In the case of our road scene segmentation, which is a more structured environment,

    they have had significantly more relevance, as discussed in Chapter 5.

    Color potential

    In TextonBoost, the color distributions of object classes are represented as Gaussian Mixture

    Models (see Section 3.1.4) in CIELab color space where the mixture coefficients depend on the

    class label. The conditional probability of the color x of a pixel labeled with class y is given by

$$P(x \mid y) = \sum_k P(x \mid k) P(k \mid y) \tag{3.11}$$

with color clusters (mixture components) $P(x \mid k)$. Notice that the clusters are shared between different classes, and that only the coefficients $P(k \mid y)$ depend on the class label. This makes the model more efficient to learn than a separate GMM for each class, which is important since

    TextonBoost takes into account a high number of classes.
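In practice the shared-cluster model of Eq. 3.11 amounts to a single matrix product between per-pixel cluster likelihoods and per-class mixing coefficients, as the following illustrative Python sketch shows (the array names are hypothetical, not taken from any implementation).

```python
import numpy as np

def class_color_likelihood(cluster_likelihoods, mixing_coeffs):
    """
    Eq. 3.11: P(x|y) = sum_k P(x|k) P(k|y), with clusters shared between classes.
    cluster_likelihoods: (n_pixels, n_clusters) array of P(x_i | k)
    mixing_coeffs:       (n_classes, n_clusters) array of P(k | y)
    returns:             (n_pixels, n_classes) array of P(x_i | y)
    """
    return cluster_likelihoods @ mixing_coeffs.T
```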

    Edge potential

The pairwise edge potentials have the form of a contrast-sensitive Potts model [2],

$$\phi(y_i, y_j, g_{ij}(\mathbf{x}); \theta_\phi) = \theta_\phi^T g_{ij}(\mathbf{x}) \, [y_i \neq y_j], \tag{3.12}$$

with $[\cdot]$ the zero-one indicator function:

$$[\text{condition}] = \begin{cases} 1, & \text{if condition is true} \\ 0, & \text{otherwise} \end{cases} \tag{3.13}$$


    The edge feature gij measures the difference in color between the neighboring pixels, as sug-

    gested by [19],

$$g_{ij} = \begin{bmatrix} \exp(-\beta \, \|x_i - x_j\|^2) \\ 1 \end{bmatrix} \tag{3.14}$$

    where xi and xj are three-dimensional vectors representing the CIELab colors of pixels i and j

    respectively. Including the unit element allows a bias to be learned, to remove small, isolated

regions⁵. The quantity $\beta$ is an image-dependent contrast term, and is set separately for each image to $\beta = (2 \langle \|x_i - x_j\|^2 \rangle)^{-1}$, where $\langle \cdot \rangle$ denotes an average over the image. The two scalar constants that compose the parameter vector $\theta_\phi$ are appropriately set by hand.
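For concreteness, the following Python sketch (illustrative only; it handles horizontal neighbour pairs, vertical pairs being computed analogously) evaluates the edge feature of Eq. 3.14 together with the image-dependent contrast term $\beta$.

```python
import numpy as np

def edge_features(image_lab):
    """
    Contrast-sensitive edge feature of Eq. 3.14 for horizontal neighbour pairs.
    image_lab: float array of shape (H, W, 3) in CIELab.
    Returns g_ij = [exp(-beta * ||x_i - x_j||^2), 1] for each pair (i, j).
    """
    diff2 = np.sum((image_lab[:, :-1, :] - image_lab[:, 1:, :]) ** 2, axis=2)
    beta = 1.0 / (2.0 * diff2.mean())        # image-dependent contrast term
    g = np.exp(-beta * diff2)
    ones = np.ones_like(g)                   # unit element that allows a bias to be learned
    return np.stack([g, ones], axis=-1)
```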

    3.3.2 Texture-layout potential

    The texture-layout potential is the most important contribution of TextonBoost. It is based on

    a set of novel features which are introduced in [21] as texture-layout filters. These new features

    are capable of, at once, capturing the correlation between texture, spatial layout, and textural

    context in an image.

    Here, we quickly describe how the texture-layout features are calculated and the boosting

    approach used to automatically select the best features and, thereby, learn the texture-layout

    potentials used in Eq. 3.10.

    Image textonization

    As a first step, the images are represented by textons [17] in order to arrive at a compact

representation of the vast range of possible appearances of objects or regions of interest⁶. The

    process of textonization is depicted in Figure 3.7, and proceeds as follows. At first, each of the

    training images is convolved with a 17-dimensional filter bank. The responses for all training

pixels are then whitened, so that they have zero mean and unit covariance, and clustered using a standard Euclidean-distance K-means clustering algorithm for dimension reduction. Finally, each pixel in each image is assigned to the nearest cluster center found with K-means, producing the texton map $T$, where pixel $i$ has value $T_i \in \{1, \ldots, K\}$.
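A minimal Python sketch of this textonization pipeline is given below; it is not the TextonBoost implementation, uses per-dimension standardization as a simplified stand-in for full whitening, and relies on SciPy and scikit-learn for the convolution and K-means steps.

```python
import numpy as np
from scipy.ndimage import convolve
from sklearn.cluster import KMeans

def textonize(images, filter_bank, n_textons=400):
    """
    Image textonization (sketch): filter-bank responses -> standardization -> K-means.
    `images` is a list of grayscale float arrays, `filter_bank` a list of 2-D kernels.
    Returns one texton map per image, with values in {0, ..., n_textons - 1}.
    """
    responses = [np.stack([convolve(img, f) for f in filter_bank], axis=-1)
                 for img in images]
    data = np.concatenate([r.reshape(-1, len(filter_bank)) for r in responses])
    mean, std = data.mean(0), data.std(0) + 1e-8        # simplified whitening
    kmeans = KMeans(n_clusters=n_textons).fit((data - mean) / std)
    # assign every pixel of every image to its nearest cluster centre
    return [kmeans.predict((r.reshape(-1, len(filter_bank)) - mean) / std)
                  .reshape(r.shape[:2]) for r in responses]
```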

    Texture-Layout Filters

    The texture-layout filter is defined by a pair (r, t) of an image region, r, and a texton t, as

illustrated in Figure 3.8. Region r is referenced relative to the pixel i being classified, and

    texton t belongs to the texton map T. For efficiency reasons, only rectangular regions are

⁵ The unit element means that for every pair of pixels that have different labels, a constant potential is added to the whole. This makes contiguous labels preferable when the energy function is minimized.
⁶ Textons have been proven effective in categorizing materials [25] as well as generic object classes [28].


Figure 3.7: The process of image textonization, as proposed by [21]. All training images are convolved with a filter bank. The filter responses are clustered using K-means. Finally, each

    pixel is assigned a texton index corresponding to the nearest cluster center to its filter response.

    implemented by TextonBoost, although any arbitrary region shape could be considered. A set

    R of candidate rectangles is chosen at random, such that every rectangle lies inside a fixed

    bounding box.

    The feature response at pixel i of texture-layout filter (r, t) is the proportion of pixels under

    the offset region r + i that have been assigned texton t in the textonization process,

$$v_{[r,t]}(i) = \frac{1}{\mathrm{area}(r)} \sum_{j \in (r+i)} [T_j = t] \,. \tag{3.15}$$

    Any part of the region r + i that lies outside the image does not contribute to the feature

    response.

    An efficient and elegant way to calculate the filter responses anywhere over an image can

    be achieved with the use of integral images [26]. For each texton t in the texton map T, a

separate integral image $I^{(t)}$ is calculated. In this integral image, the value at pixel $i = (u_i, v_i)$

    is defined as the number of pixels in the original image that have been assigned to texton t in

    the rectangular region with top left corner at (1, 1) and bottom right corner at (ui, vi):

$$I^{(t)}(i) = \sum_{j : (u_j \le u_i) \wedge (v_j \le v_i)} [T_j = t] \,. \tag{3.16}$$

    The advantage of integral images is that they can later be used to compute the texture-

layout filter responses in constant time: if $I^{(t)}$ is the integral image for texton channel $t$ defined as above, then the feature response is computed as:

$$v_{[r,t]}(i) = \left( I^{(t)}(r_{br}) - I^{(t)}(r_{bl}) - I^{(t)}(r_{tr}) + I^{(t)}(r_{tl}) \right) / \mathrm{area}(r) \tag{3.17}$$

where $r_{br}$, $r_{bl}$, $r_{tr}$ and $r_{tl}$ denote the bottom right, bottom left, top right and top left corners


Figure 3.8: Graphical explanation of texture-layout filters extracted from [21]. (a, b) An image and its corresponding texton map (colors represent texton indices). (c) Texture-layout filters are defined relative to the point i being classified (yellow cross). In this first example feature, region r1 is combined with texton t1 in blue. (d) A second feature where region r2 is combined with texton t2 in green. (e) The response $v_{[r_1,t_1]}(i)$ of the first feature is calculated at three positions in the texton map (magnified). In this example, $v_{[r_1,t_1]}(i_1) \approx 0$, $v_{[r_1,t_1]}(i_2) \approx 1$, and $v_{[r_1,t_1]}(i_3) \approx 1/2$. (f) The second feature $(r_2, t_2)$, where $t_2$ corresponds to grass, can learn that points $i$ (such as $i_4$) belonging to sheep regions tend to produce large values of $v_{[r_2,t_2]}(i)$, and hence can exploit the contextual information that sheep pixels tend to be surrounded by grass pixels.

    of rectangle r.
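The following Python sketch (illustrative only) builds one integral image per texton channel as in Eq. 3.16 and evaluates the feature response of Eq. 3.17; the handling of regions that fall partially outside the image (clipping, full-rectangle area in the denominator) is a simplifying assumption.

```python
import numpy as np

def texton_integral_images(texton_map, n_textons):
    """One integral image per texton channel (Eq. 3.16), padded so index 0 means 'empty'."""
    h, w = texton_map.shape
    ints = np.zeros((n_textons, h + 1, w + 1))
    for t in range(n_textons):
        ints[t, 1:, 1:] = np.cumsum(np.cumsum(texton_map == t, axis=0), axis=1)
    return ints

def texture_layout_response(ints, t, i, r):
    """
    Feature response v_[r,t](i) of Eq. 3.17. `i` = (row, col) of the pixel being
    classified, `r` = (top, left, bottom, right) offsets of the rectangle relative to i.
    """
    top, left, bottom, right = r
    I = ints[t]
    h, w = I.shape[0] - 1, I.shape[1] - 1
    r0 = np.clip(i[0] + top, 0, h); r1 = np.clip(i[0] + bottom, 0, h)
    c0 = np.clip(i[1] + left, 0, w); c1 = np.clip(i[1] + right, 0, w)
    area = max((bottom - top) * (right - left), 1)
    return (I[r1, c1] - I[r1, c0] - I[r0, c1] + I[r0, c0]) / area
```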

    Texture-layout features are sufficiently general to allow for an automatic learning of layout

    and context information. Figure 3.8 illustrates how texture-layout filters are able to model

    textural context and layout.

    Boosting of texture-layout filters

    A Boosting algorithm iteratively selects the most discriminative texture-layout filters (r, t) as

    weak learners and combines them into a strong classifier used to derive the texture-layout

    potential in Eq. 3.10. The boosting scheme used in TextonBoost shares each weak learner

    between a set of classes C, so that a single weak learner classifies for several classes at once.

    According to the authors, this allows for classification with cost sub-linear in the number of

    classes, and leads to improved generalization.

The strong classifier learned is the sum over the classification confidences $h_i^m(c)$ of $M$ weak


learners

$$H(y_i, i) = \sum_{m=1}^{M} h_i^m(y_i) \tag{3.18}$$

The confidence value $H(y_i, i)$ for pixel $i$ is then multiplied by a negative constant, so that a positive confidence turns into a negative energy, which will be preferred in the energy minimization, to give the texture-layout potentials $\lambda_i$ used in Eq. 3.10:

$$\lambda_i(y_i, \mathbf{x}; \theta_\lambda) = -\theta_\lambda \, H(y_i, i) \tag{3.19}$$

    Each weak learner is a decision stump based on the feature response v[r,t](i) of the form

$$h_i(c) = \begin{cases} a \, [v_{[r,t]}(i) > \theta] + b, & \text{if } c \in C \\ k_c, & \text{otherwise}, \end{cases} \tag{3.20}$$

with parameters $(a, b, k_c, \theta, C, r, t)$. The region $r$ and texton index $t$ together specify the texture-layout filter feature, and $v_{[r,t]}(i)$ denotes the corresponding feature response at position $i$. For the classes that share this feature, that is, $c \in C$, the weak learner gives $h_i(c) \in \{a + b, b\}$ depending on whether $v_{[r,t]}(i)$ is, respectively, greater or lower than the threshold $\theta$. For classes not sharing the feature ($c \notin C$), the constant $k_c$ ensures that unequal numbers of training

examples of each class do not adversely affect the learning procedure.

In order to choose the weak classifiers, TextonBoost uses the standard boosting algorithm

    introduced by Schapire et al. in [8], which will be explained for completeness. Suppose we are

choosing the $m$th weak classifier. Each training example $i$, a pixel in a training image, is paired with a target value $z_i^c \in \{-1, +1\}$, where $+1$ means that pixel $i$ has ground truth class $c$ and $-1$ that it does not, and is assigned a weight $w_i^c$ specifying its classification accuracy for class $c$ after the $m-1$ previous rounds of boosting. The $m$th weak classifier is chosen by minimizing an error function $J_{\mathrm{error}}$ weighted by $w_i^c$:

$$J_{\mathrm{error}} = \sum_c \sum_i w_i^c \left( z_i^c - h_i^m(c) \right)^2 \tag{3.21}$$

    The training examples are then re-weighted

$$w_i^c := w_i^c \, e^{-z_i^c h_i^m(c)} \tag{3.22}$$

    Minimizing the error function Jerror requires, for each new weak classifier, an expensive

brute-force search over the possible sharing classes $C$, features $(r, t)$, and thresholds $\theta$. As shown in [21], however, given these parameters, a closed-form solution does exist for $a$, $b$ and $k_c$.
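For completeness, the decision stump of Eq. 3.20 and the re-weighting step of Eq. 3.22 can be written in a few lines; the Python sketch below is illustrative only and omits the brute-force search over $(C, r, t, \theta)$ and the closed-form fit of $a$, $b$ and $k_c$.

```python
import numpy as np

def weak_response(v, a, b, k_c, theta, sharing_set, c):
    """Decision-stump weak learner of Eq. 3.20 for class c and feature response v."""
    if c in sharing_set:
        return a * (v > theta) + b
    return k_c

def reweight(w, z, h):
    """Boosting re-weighting of Eq. 3.22: w_i^c := w_i^c * exp(-z_i^c h_i^m(c))."""
    return w * np.exp(-z * h)
```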


    3.4 Application to road scenes (Sturgess et al.)

    In the more specific field of road scene segmentation, Sturgess et al. [22] have recently quite

successfully segmented inner-city road scenes into 11 different classes. Their method builds on the work of Shotton et al. (see Section 3.3) and on that of Brostow et al. [3], integrating the appearance-based features from TextonBoost with the structure-from-motion features from Brostow et al. (see Section 3.1.2) in a higher-order CRF. According to the authors, the use of higher-order cliques (that is, cliques with several pixels, instead of only pairs of pixels like in TextonBoost) produces accurate segmentations with precise object boundaries. Figure 3.9 shows how Sturgess et al. use an unsupervised meanshift segmentation of the input image to obtain regions that are used as higher-order cliques and included in the energy function U to be

    minimized.

Figure 3.9: The original image (left), its ground truth labelling (centre) and the meanshift segmentation of the image (right). The segments in the meanshift segmentation on the right are used to define higher-order potentials, allowing for more precise object boundaries in the final segmentation.

    Sturgess et al. achieved an overall accuracy of 84% compared to the previous state-of-

the-art accuracy of 69% [3] on the challenging CamVid database [4]. The work of Sturgess et al. is therefore especially important for this thesis, as it successfully tackles the same inner-city scene segmentation problem. The CamVid database will be described in more detail in Chapter 5, where the

    results obtained by our implementation are compared with those of Sturgess et al. [22].


    Chapter 4

    Methodology

    4.1 CRF framework

    After thorough consideration of related work, CRFs have been deemed very suitable and up-

to-date for dealing with the problem proposed in this thesis project. As discussed in Chapter 3, conditional random fields allow the incorporation of a wide variety of cues in a single, unified

    model. Moreover, state-of-the-art work in the field of image segmentation (see Section 3.3,

    TextonBoost) and also more specifically in the domain of inner-city road scene understanding

    (see Section 3.4, Sturgess et al.) has used CRFs. Sturgess et al. have been able to very

successfully segment eleven different classes in road scenes, some of which are very important

    to our final goal of driver behavior prediction.

    4.2 Basic model: location and edge potentials

Location and edge cues, as mentioned in Section 3.1, are very meaningful and can significantly con-

    tribute to the quality of any segmentation. In our case, location cues are all the more important

    because we deal with a very spatially structured scene. The road will, for example, never be at

the top of the image and the sky will never be at the bottom. We can then extract precious information as to where to expect our classes to be located in the picture.

    If, for a better understanding of the problem, we consider, at first, a model with just the

    location and edge potentials, then the energy function to be minimized in order to infer the


    most likely labeling becomes

$$U(\mathbf{y}|\mathbf{x}, \theta) = \sum_i \underbrace{\psi(y_i, i; \theta_\psi)}_{\text{location}} + \sum_{(i,j) \in \varepsilon} \underbrace{\phi(y_i, y_j, g_{ij}(\mathbf{x}); \theta_\phi)}_{\text{edge}} \,. \tag{4.1}$$

    The location potential is calculated based on the incidence, for all the training images, of each

    class at each pixel:

$$\psi(y_i, i; \theta_\psi) = -\log \left( \frac{N_{y_i,i} + \epsilon}{N_i + \epsilon} \right) \tag{4.2}$$

where $N_{y_i,i}$ is the number of pixels at position $i$ assigned class $y_i$ in the training images, $N_i$ is the total number of pixels at position $i$, and $\epsilon$ is a small integer to avoid the indefinition $\log(0)$ when $N_{y_i,i} = 0$ (we use $\epsilon = 1$). Figure 4.1 illustrates the location potential of classes road and sidewalk in images from the CamVid database.


Figure 4.1: (a) Location potential of class road, $\psi(\text{road}, i; \theta_\psi)$. (b) Location potential of class sidewalk, $\psi(\text{sidewalk}, i; \theta_\psi)$. The whiter, the greater the incidence of pixels from the corresponding class in the training images.
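A small Python sketch of how such a location potential could be accumulated from training label images is given below (illustrative only, not the thesis implementation; the sign convention follows Eq. 4.2, with low energy where the class incidence is high).

```python
import numpy as np

def location_potential(label_images, n_classes, eps=1.0):
    """
    Location potential of Eq. 4.2 from a stack of training label images
    (all of the same size, values in 0..n_classes-1).
    Returns an (n_classes, H, W) array of energies; low energy = high incidence.
    """
    h, w = label_images[0].shape
    counts = np.zeros((n_classes, h, w))
    for lab in label_images:
        for c in range(n_classes):
            counts[c] += (lab == c)
    n_total = len(label_images)              # pixels observed at each position
    return -np.log((counts + eps) / (n_total + eps))
```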

The pairwise edge potential has the form of a contrast-sensitive Potts model [2], as defined in TextonBoost:

$$\phi(y_i, y_j, g_{ij}(\mathbf{x}); \theta_\phi) = \theta_\phi^T g_{ij}(\mathbf{x}) \, [y_i \neq y_j], \tag{4.3}$$

with $[\cdot]$ the zero-one indicator function. The edge feature $g_{ij}$ measures the difference in color between the neighboring pixels, as suggested by [19],

$$g_{ij} = \begin{bmatrix} \exp(-\beta \, \|x_i - x_j\|^2) \\ 1 \end{bmatrix} \tag{4.4}$$

    With the help of an intuitive example, shown in Figure 4.2a, we can see how location and


edge potentials interact, resulting in a meaningful segmentation. In this example, we want to segment the toy image into three different classes, background, foreground-1 and foreground-2. Figures 4.2b, 4.2d and 4.2f show the unary location potentials $\psi(y_i, i; \theta_\psi)$ for classes foreground-1, foreground-2 and background, respectively, at every pixel $i$¹. A white pixel represents a high probability of a class being present at that pixel, which is equivalent to saying that the energy potential is low, impelling the function minimization to prefer labels where the pixels are white rather than black. Figure 4.2c shows the gradient image, which is a way to visualize the edge potential calculated as in Eq. 4.3. The segmentation boundaries are more likely to

    be located where the edge potential is white. Figure 4.2e shows the final segmentation obtained

    through the minimization of Eq. 4.1.


Figure 4.2: (a) Noisy toy image to be segmented. (c) Gradient image as basis for the edge potential. (b, d, f) Location potentials of classes foreground-1, foreground-2 and background, respectively. (e) Final segmentation inferred from the minimization of Eq. 4.1.

Note that the final segmentation correctly ignores the noise, as it is not present at the same pixels simultaneously in the edge and location potentials. The red and yellow structures inside

    the main blob are all segmented as class foreground-1 thanks to the contribution of its location

    potential. The constant term in Eq. 4.3, which adds a given cost for any pixel belonging to a

label boundary, helps suppress the appearance of noisy, small foreground regions.

¹ The location potential of the class background is complementary to the foreground classes' potentials. That is, when either class foreground-1 or foreground-2 is likely, class background is unlikely, and vice versa.


    4.3 Texture potential model

    Although the segmentation of the toy example, obtained with location and edge potentials

    described in the last section, was robust against noise, the location potentials provided were

very similar to the regions we wanted to segment. In real images, not only are the location potentials less correlated with the position of the labels, but there are also much more complex objects

    to be segmented that cannot be differentiated just by using location and edge potentials. The

    next step to a better segmentation is then modeling the texture information present in the

    images. We can represent this new potential by rewriting the energy function U as:

$$U(\mathbf{y}|\mathbf{x}, \theta) = \sum_i \Big[ \underbrace{\psi(y_i, i; \theta_\psi)}_{\text{location}} + \underbrace{\lambda_i(y_i, \mathbf{x}; \theta_\lambda)}_{\text{texture}} \Big] + \sum_{(i,j) \in \varepsilon} \underbrace{\phi(y_i, y_j, g_{ij}(\mathbf{x}); \theta_\phi)}_{\text{edge}} \tag{4.5}$$

    Note that the texture potential represents local texture only, i.e., it does not take into

    account context. It is merely a local feature. Context and layout are explored in Section 4.4,

    where the use of simplified texture-layout filters is investigated.

    In order to represent the texture information of the images to segment, we opted, similarly

    to TextonBoost [21], for the use of filter banks. By using an N-dimensional filter bank F,

    one obtains an N-dimensional feature vector, fx(i), for each pixel i. Each component of this

    vector is the result of the convolution of the input image converted to grayscale, x, and the

    corresponding filter shifted to the position of i:

f_x(i) = \big[ (F_1 * x)|_i, \; (F_2 * x)|_i, \; \ldots, \; (F_N * x)|_i \big]^T    (4.6)

Equivalently, the result of the convolution of an N-dimensional filter bank with an image can be understood by considering the convolution of the image with one filter component at a time. Figure 4.3 shows an example input image and the response images for some of the Leung-Malik filter bank components [16].
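To make the construction of Eq. 4.6 concrete, the sketch below convolves a grayscale image with a small set of filters and stacks the responses into a per-pixel feature vector, also appending the CIELab values (which anticipates Eq. 4.7). The kernels used here are simple placeholders standing in for the LM/MR8 banks, and all names are illustrative.

    % Per-pixel feature vectors from an N-dimensional filter bank (Eq. 4.6),
    % with the CIELab values appended afterwards (Eq. 4.7).
    rgb  = im2double(imread('road_scene.png'));
    gray = rgb2gray(rgb);
    bank = {fspecial('gaussian', 15, 1), ...
            fspecial('gaussian', 15, 2), ...
            fspecial('log',      15, 2)};          % placeholder filter bank
    N = numel(bank);
    [H, W] = size(gray);
    f = zeros(H, W, N);
    for k = 1:N
        f(:, :, k) = imfilter(gray, bank{k}, 'symmetric');   % response of filter k
    end
    lab = rgb2lab(rgb);                            % L, a, b channels
    f   = cat(3, f, lab);                          % H x W x (N + 3) feature volume
    % The feature vector of pixel i = (r, c) is squeeze(f(r, c, :)).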

    4.3.1 Feature vector and choice of filter bank

    The choice of the filter bank used to represent the texture in the images to be segmented was

    based on the following criteria:

    Good coverage of possible textures without too much redundancy between filters;


Figure 4.3: (a) Example of an inner-city road scene image. (b-f) Examples of responses of five different filter components of the LM filter bank, which are shown at the bottom left corner of each figure.

    Fast and efficient filter response calculation;

    Ready-to-use implementation available.

Considering these criteria, a suitable implementation by the Intelligent Systems Lab of the University of Amsterdam has been found. It is provided as a Matlab .mex file, i.e., pre-compiled C code that is called by Matlab at execution time. The libraries are freely available for research purposes².

Using this fast .mex implementation, five different filter banks have been assessed by segmenting images using only the texture potential in Eq. 4.5. Four classes have been considered: road, sidewalk, others³ and sky.

    The filter banks assessed were the following:

MR8 The MR8 filter bank consists of 38 filters but only 8 filter responses. The filter bank contains filters at multiple orientations, but their outputs are collapsed by recording only the maximum filter response across all orientations (see Figure 4.4);

MR8 - no maxima The rotation invariance of the MR8 filter bank, achieved by taking only the maximum response over all orientations, may not be a desired property of a texture filter bank used for segmentation, since some classes could be described by the orientation of their features. Therefore, a filter bank called MR8 - no maxima has been defined, where all 38 responses are kept;

²Source code at: http://www.science.uva.nl/mark.
³Class others is assigned to any pixel that is not labeled as one of the other three classes; it can thus be seen as the complement of the other three classes.

    MR8 - separate channels Here, the MR8 filter is applied individually to each of the

    three color channels, in an attempt to verify whether discriminative texture information

    is unevenly distributed over the color channels;

MR8 - multi-scale This filter bank is composed of three MR8 filter banks at three successive scales. Although the MR8 filter bank itself already uses filters at different scales, we found it worthwhile to cover even more scales, as road scenes almost always contain objects whose distance may vary by many orders of magnitude⁴;

⁴For instance, there might be a car immediately in front of the camera but also another one tens of meters away.

TextonBoost's filter bank This filter bank has 17 dimensions and is based on the CIELab color space. It consists of Gaussians at scales k, 2k and 4k, x and y derivatives of Gaussians at scales 2k and 4k, and Laplacians of Gaussians at scales k, 2k, 4k and 8k. The Gaussians are applied to all three color channels, while the other filters are applied only to the luminance.

    Figure 4.4: The MR8 filter bank is low dimensional, rotationally invariant and yet capable ofpicking out oriented features. Note that only the maximum response of the filters of each ofthe first 6 rows is taken.

As all the filter banks, except MR8 - separate channels and TextonBoost's filter bank, are convolved with grayscale images, we also concatenated to the texture feature vector f_x(i), which is the response of the filter bank, the L, a and b color values of the corresponding pixel:

f'_x(i) = \big[ f_x(i)^T, \; L_i, \; a_i, \; b_i \big]^T    (4.7)

In this manner, the color information was merged with the texture information, giving an extra cue to the Adaboost classifiers⁵.

    The results of the tests showed that the filter bank that yielded the best segmentation

    results and, thus, best represented the texture information in the road scene images was the

    MR8 - multi-scale. This is probably due to the aforementioned fact that road scene images

    have similar objects and regions that may vary greatly in depth. This variation is well captured

    by the multiple-scale characteristic of the MR8 - multi-scale filter bank.

Combination of 3D cues into the feature vector

    As discussed in section 3.1.2, 3D information can be extracted from images in a video sequence

    using structure from motion techniques. Those techniques can only infer the 3D position of

    characteristic points in the image, that is, points that can be located, described and then

matched in subsequent images. In this thesis this has been done using the Harris corner detector, with normalized cross-correlation over patches for matching. Other possible patch descriptors

    are, for example, SIFT and SURF.

All 3D features mentioned in section 3.1.2 have been concatenated, just like the L, a and b color values, to the feature vector described in Eq. 4.7:

f''_x(i) = \big[ f'_x(i)^T, \; \text{3Dfeature}_1(i), \; \ldots, \; \text{3Dfeature}_5(i) \big]^T    (4.8)

However, in order to include these 3D cues in our feature vector, they need to be defined for every pixel of an input image. That means we have to transform the sparse 3D features obtained with reconstruction techniques into dense features. This can be done by interpolation, where every pixel is assigned 3D feature values based on the values of the sparse neighboring points recovered by the reconstruction.
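A minimal Matlab sketch of this densification step is given below, assuming the sparse reconstruction yields image coordinates pts (M x 2) and one cue value per point (here, height above ground). The interpolation method and the zero-mean, unit-variance normalization are illustrative choices, not necessarily those used in the thesis.

    % Turn a sparse 3D cue into a dense per-pixel map by interpolation.
    % pts(:,1) = column (x) and pts(:,2) = row (y) of reconstructed points,
    % 'heights' their feature values; H and W are the image dimensions.
    F = scatteredInterpolant(pts(:,1), pts(:,2), heights, 'linear', 'nearest');
    [U, V]    = meshgrid(1:W, 1:H);                 % query every pixel
    heightMap = reshape(F(U(:), V(:)), H, W);
    % Normalize so this cue does not dominate the Euclidean distances used
    % later during clustering (see the remark on normalization below).
    heightMap = (heightMap - mean(heightMap(:))) / (std(heightMap(:)) + eps);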

⁵Tests have been performed with different color spaces, yielding the best results when CIELab was used. This comes from the fact that the CIELab color space is partially invariant to scene lighting modifications: only the L dimension changes, in contrast to the three dimensions of the RGB color space, for instance.


    Figure 4.5 shows an example of dense interpolation of the 3D feature height above ground

    for an image taken from the CamVid database.

Figure 4.5: (a) A dusk image taken from the CamVid database. (b) The calculated height above ground 3D feature. After determining a point cloud from structure from motion techniques, the sparse features have been interpolated to yield a dense representation. Notice how the sky has high values and that we can see a faint blob where the car is located in the original image.

    It is important to mention that, before concatenating them to the feature vector as shown

    in Eq. 4.8, the 3D features have been appropriately normalized. The normalization guarantees

    that they do not overshadow the texture and color features during the clustering process. This

    could happen if the values of the 3D features were much greater than the values of the other

features. Since the clustering method implemented uses Euclidean distances, such an imbalance

    in the feature values would result in biased cluster centers. The influence of the use of 3D

    features on the segmentation results is discussed in Chapter 5.

    4.3.2 Boosting of feature vectors

Having defined the feature vector as in Eq. 4.8, we then need to find patterns in the features extracted from training images and try to recognize them in new, unseen images. For instance, we want to learn which texture, color and 3D cues are typical of each of the classes we want to segment. Some of the machine learning techniques suitable for this task are neural networks, belief networks or Gaussian Mixture Models in the N-dimensional space (where N is the number of filters in the filter bank). Nonetheless, an Adaboost approach has been preferred for its generalization power and ease of use.

A short overview of the way Adaboost works is given here. For more details about its implementation and theoretical grounds, please see [8]. For this thesis project we have


Figure 4.6: Example of the training procedure for classifier road. The Q × K data matrix D is represented by the red vectors, whereas the 1 × K label vector L is indicated by the green arrows.

utilized a ready-to-use Matlab implementation from Moscow State University⁶.

Note that, since we are dealing with binary Adaboost classification, a classifier is trained for each of the classes we want to segment in a one-versus-all manner. For the training of each classifier, a learning data matrix D ∈ R^{Q×K} is taken as input by the Adaboost trainer. Matrix D has size Q × K, where Q is the number of dimensions⁷ of the feature vector from Eq. 4.8 and K is the number of training vectors (the feature vectors are extracted from pixels in the training images). Another input, a 1 × K vector L ∈ {0, 1}^{1×K}, contains the labels of the training data D. Vector L is comprised of ones for the pixels belonging to the class of the classifier being trained, and zeros otherwise. Figure 4.6 illustrates how individual classifiers for each class are trained.

The Adaboost classifier of class c is composed of M stump weak classifiers h_c(f),

h_c(f) = \begin{cases} 1 & \text{if } f_p > \theta \\ 0 & \text{otherwise} \end{cases}    (4.9)

where f_p is the p-th dimension of vector f and θ is a threshold. The strong classifier H_{class c}(f(i)) is built by choosing the most discriminative weak learners, minimizing the error with respect to the target value, as explained in Section 3.3.2. Figure 4.7 shows how a trained classifier outputs a confidence value between zero and one for feature vectors from unseen images.
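The following sketch shows how such a stump-based classifier could be evaluated on the feature vector of one pixel. The structure used for the weak classifiers and the normalization of the weighted vote to the [0, 1] range are assumptions made for illustration; the actual training and evaluation rely on the Modest AdaBoost toolbox mentioned above.

    % Illustrative evaluation of a boosted stump classifier (Eq. 4.9) for one
    % feature vector f.  'stumps' is an assumed struct array with fields
    % dim (feature index p), thresh (theta) and alpha (weak-learner weight).
    function conf = classifyPixel(f, stumps)
        vote = 0;
        for m = 1:numel(stumps)
            h    = double(f(stumps(m).dim) > stumps(m).thresh);  % stump of Eq. 4.9
            vote = vote + stumps(m).alpha * h;
        end
        conf = vote / sum([stumps.alpha]);   % confidence normalized to [0, 1]
    end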

    Once we have defined a strong classifier H for each class, the texture potential of Eq. 4.5

⁶Source code available at http://graphics.cs.msu.ru/en/science/research/machinelearning/modestada.
⁷Q = N (number of dimensions of the filter bank) + 3 (L, a, b) + 5 (3D features).


Figure 4.7: Given a trained classifier, a classification confidence is computed based on how similar the input feature vector is to the positive examples, and on how different it is from the negative ones, provided in the training phase illustrated in Figure 4.6.

    can be defined as:

\pi_i(y_i, x; \theta_\pi) = -\kappa \cdot H_{\text{class}\, y_i}(f''_x(i))    (4.10)

The output of the strong classifier H_{class y_i}(f''_x(i)) is multiplied by a negative constant -κ, so that a positive confidence turns into a negative energy, which will be preferred in the energy minimization. θ_π is the set of all parameters used in the Adaboost training of H, for instance the number of weak classifiers.

    4.3.3 Adaptive training procedure

    In order to make the training of Adaboost classifiers more tractable, not every pixel of every

    training image has been selected to build the training data matrix D. Since there is a lot

    of redundancy between pixels, this simplification has not adversely affected the quality of the

    Adaboost classifiers.

Although the pixels used for extracting training feature vectors were initially selected at random, a smarter, adaptive selection algorithm has been developed.

The adaptive training procedure works by iteratively choosing an unequal proportion of feature vectors from each label. The idea is that, based on the confusion matrix of a given segmentation experiment, we know the strengths and weaknesses of the trained classifiers. For instance, suppose that in a given segmentation experiment class sky is not confused as much as street and sidewalk. Then it is reasonable to choose, in the next segmentation experiment, more feature vectors from classes street and sidewalk and fewer from class sky for the training of classifiers street and sidewalk.

    Formally, if we represent the weight (or proportion) of training feature vectors from class

    i, used in the Adaboost training of classifier j, as Wij, the update of every weight after each


    segmentation iteration (experiment) can be expressed as:

W_{ij} = \begin{cases} \frac{1}{Z} \, W_{ij} \, e^{\gamma \, Cm_{ij}} & \text{if } i \neq j \\ \frac{1}{Z} \, W_{ij} \, e^{\gamma (1 - Cm_{ij})} & \text{if } i = j \end{cases}    (4.11)

where Cm_{ij} is the element in the i-th row and j-th column of the confusion matrix of the previous segmentation iteration, γ is a learning speed factor and Z is a normalization factor that guarantees that

\sum_i W_{ij} = 1,    (4.12)

or, in other words, that the sum of the proportions of feature vectors from each class remains equal to 1. The weights are all equally initialized as W_{ij} = 1/N_c, N_c representing the number of classes.

    Notice that in the case of a perfect segmentation, where the confusion matrix is equal to

    the identity matrix, the proportion of training feature vector samples Wij does not change.
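A compact Matlab sketch of one such update is shown below; Cm, W, gamma and Nc are illustrative variable names for the confusion matrix, the proportion matrix W_ij, the learning speed factor and the number of classes.

    % One update of the adaptive sampling proportions (Eqs. 4.11 and 4.12).
    Nc = size(Cm, 1);
    F  = exp(gamma * Cm);                           % factors for i ~= j
    F(1:Nc+1:end) = exp(gamma * (1 - diag(Cm)));    % factors for i == j (diagonal)
    W  = W .* F;                                    % re-weight proportions
    W  = bsxfun(@rdivide, W, sum(W, 1));            % normalize: sum_i W_ij = 1
    % For a perfect segmentation (Cm equal to the identity) every factor is 1,
    % so the proportions remain unchanged, as noted above.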

Although the adaptive learning algorithm considerably improved the segmentation quality (see Section 5.1), the use of local features alone is intrinsically limited. As precise and discriminative as a classifier may be, there are cases where class sidewalk is virtually identical to class road for every local feature imaginable. The natural next step towards a better segmentation is to use context information. Then, the fact that sidewalks normally run alongside roads, separating them from buildings or other regions, can be exploited to help us correctly differentiate what is locally indistinguishable.

    4.4 Texture-layout potential model (context)

In order to model contextual information, we opt for the texture-layout features introduced by TextonBoost. The resulting potential replaces the texture potentials explained in the previous section, as texture-layout features are more general. We then have the following energy function:

U(y | x, \theta) = \sum_i \Big[ \underbrace{\lambda(y_i, i; \theta_\lambda)}_{\text{location}} + \underbrace{\pi_i(y_i, x; \theta_\pi)}_{\text{texture-layout}} \Big] + \sum_{(i,j)} \underbrace{\psi(y_i, y_j, g_{ij}(x); \theta_\psi)}_{\text{edge}}    (4.13)

    In this equation, the texture-layout potentials are defined similarly to the way they are defined

    in TextonBoost:

\pi_i(y_i, x; \theta_\pi) = -\kappa \cdot H(y_i, i)    (4.14)


    The confidence H(yi, i) is the output of a strong classifier found by boosting weak classifiers,

H(y_i, i) = \sum_{m=1}^{M} h^m_{y_i}(i)    (4.15)

    Each weak classifier, in turn, is defined based on the response of a texture-layout filter:

h^m_{y_i}(i) = \begin{cases} a & \text{if } v_{[r,t]}(i) > \theta \\ b & \text{otherwise} \end{cases}    (4.16)

    Notice the difference from the definition in Eq. 3.20 of TextonBoost: bearing in mind our

    final goal of behavior prediction, we do not need to classify as many classes as in TextonBoost

    where up to 32 different classes are segmented. TextonBoost shares weak classifiers because

    the computation cost becomes sub-linear with the number of classes. Since we do not need as

    many classes, it is possible for us to simplify the calculation of strong classifiers by not using

    shared weak classifiers. Therefore, in our approach, each strong classifier has its own, exclusive

    weak classifiers.

    The texture-layout filter response v[r,t](i) is the proportion of pixels in the input image,

    from all those lying in the rectangle r with its origin shifted to pixel i, that have been assigned

    texton t in the textonization process illustrated in section 3.3.2:

v_{[r,t]}(i) = \frac{1}{\text{area}(r)} \sum_{j \in (r+i)} [T_j = t]    (4.17)
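In practice such a response can be evaluated in constant time with an integral image; a minimal Matlab sketch is given below. The texton map, the rectangle representation and all variable names are illustrative, and boundary clamping is omitted for brevity.

    % Texture-layout filter response v_[r,t](i) of Eq. 4.17 via an integral image.
    % textonMap is H x W with entries in 1..K; r = [rowMin rowMax colMin colMax]
    % are offsets of the rectangle relative to pixel i = (iRow, iCol).
    intT = cumsum(cumsum(double(textonMap == t), 1), 2);  % integral image of texton t
    intT = padarray(intT, [1 1], 0, 'pre');                % zero row/column in front
    r1 = iRow + r(1);  r2 = iRow + r(2);                   % absolute rectangle bounds
    c1 = iCol + r(3);  c2 = iCol + r(4);                   % (clamping to the image omitted)
    sumT = intT(r2+1, c2+1) - intT(r1, c2+1) - intT(r2+1, c1) + intT(r1, c1);
    v    = sumT / ((r2 - r1 + 1) * (c2 - c1 + 1));         % proportion of texton t in r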

    4.4.1 Training procedure

    We used, for our textonization process, the same feature vector definition as in Eq. 4.8, which

    contains texture, color and 3D cues.

In order to build a strong classifier (note that we need to train one strong classifier for each of the classes we want to segment our image into), weak classifiers are added one by one according to the following boosting procedure:

1. Generation of weak classifier candidates: Each weak classifier is composed of a texture-layout filter (r, t) and a threshold θ. The candidates are generated by randomly choosing a rectangular region r inside a bounding box, a texton index t ∈ T = {1, 2, ..., K}, where K is the number of clusters used in the textonization process, and finally a threshold θ between 0 and 1. For the addition of each weak classifier, an arbitrary number of candidates, N_cd, is generated.


2. Calculation of parameters a and b for all candidates: Each weak classifier candidate must also be assigned values a and b so that its response, h^m_c(i), is fully defined (see Eq. 4.16). As described by Torralba et al. [23], who use the same boosting approach (except that ours does not share weak classifiers), a and b can be calculated as follows:

b = \frac{\sum_i w^c_i \, z^c_i \, [v_{[r,t]}(i) \leq \theta]}{\sum_i w^c_i \, [v_{[r,t]}(i) \leq \theta]},    (4.18)

a = \frac{\sum_i w^c_i \, z^c_i \, [v_{[r,t]}(i) > \theta]}{\sum_i w^c_i \, [v_{[r,t]}(i) > \theta]},    (4.19)

where c is the label for which the classifier is being trained, z^c_i = +1 or z^c_i = -1 for pixels i whose ground truth label is, respectively, equal to or different from c, and w^c_i are the classification accuracy weights used by Adaboost (see Section 3.3.2).

Note, from Eq. 4.18 and Eq. 4.19, that, for the calculation of a and b, the response of the texture-layout filters, v_{[r,t]}(i), must be computed for all training pixels i and compared to the threshold θ (a sketch of this step is given after this list).

3. Search for the best weak classifier candidate: Once each weak classifier candidate is fully defined, that is, all parameters (r, t, θ, a, b) are set, the most discriminative among the candidates is found by minimizing the error function with respect to the target values z^c_i.
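Below is a minimal Matlab sketch of the fitting and ranking of one candidate (steps 2 and 3); v, z, w and theta are illustrative names for the filter responses of all training pixels, their +1/-1 targets, the current boosting weights and the candidate threshold.

    % Fit the stump outputs a and b (Eqs. 4.18-4.19) for one candidate (r, t, theta)
    % and compute the weighted error used to select the best candidate (step 3).
    above = v > theta;                                           % logical mask over pixels
    a = sum(w(above)  .* z(above))  / (sum(w(above))  + eps);    % Eq. 4.19
    b = sum(w(~above) .* z(~above)) / (sum(w(~above)) + eps);    % Eq. 4.18
    h   = a .* above + b .* (~above);                            % weak-classifier output (Eq. 4.16)
    err = sum(w .* (z - h).^2);                                  % weighted squared error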

In Chapter 5 we see how texture-layout strong classifiers can learn the context between objects. We also observe how the number of weak classifiers influences the segmentation quality.

    4.4.2 Practical considerations

    System architecture

Due to the short period of time available for this thesis work, the implementation of the software had to be efficient and fast. Owing to its flexibility, and to the variety of ready-to-use image processing, statistics, plotting and other functions available, Matlab has been the preferred tool for the implementation of the solution.

    Conditional Random Fields are, however, intrinsically highly demanding in computational

    resources. This is due to the iterative nature of the minimization procedure of the cost function

U, detailed in section 3.2.1. As Matlab is an interpreted programming language, it is significantly slower to process loops than compiled languages such as C or C++. Therefore, Matlab

    has proven to be unable to cope with the massive calculations needed for the segmentation

    inference, when the cost function U is minimized.


Figure 4.8: Software architecture. The Matlab layer is responsible for the higher-level processing, whereas the C++ layer handles the heavy energy minimization computation.

In the context of the iCub project [12], which is led by the RobotCub Consortium (consisting of several European universities), a good C++ framework for the minimization of Markov Random Field energy functions has been found. The main goal of the iCub platform is to study cognition through the implementation of biologically motivated algorithms. The project is open source: both the hardware design and the software are freely available.

The software implemented has therefore been based on a two-layer layout, as illustrated in Figure 4.8. Matlab, on the higher level, pre-processes the images (calculating, for instance, filter convolutions), whereas the C++ program calculates the minimum of the energy function U. In other words, the C++ layer infers, from the given clique potentials and the pre-processed Matlab data, the maximum a posteriori labeling.

The assessment of the quality of the segmentations, the storage of results and all complementary software functionalities are handled by Matlab on the higher-level layer.


    Implementation challenges and optimizations

Unlike in the case of the texture potential explained in the previous section, we could not find any ready-to-use Matlab implementation of the boosting procedure for the texture-layout potential, as it is very specific to this problem. The whole algorithm therefore had to be implemented from scratch. Moreover, since there are countless loops involved in the training algorithm described above, Matlab was ruled out as the programming environment for this part and was replaced by C++.

Two main practical problems were faced in the C++ implementation of the algorithm described above: firstly, the long processing time and, secondly, the lack of RAM memory.

1. Processing time: The boosting procedure described in the previous section requires computations over all training pixels. If we consider 100 images (a typical number for a training data set), each composed of, for instance, 800 × 600 pixels, we already have 48 million calculation loops for each step. This turns out to be impractical for today's processors. The solution found was to resize all dataset images before segmenting them and

    also to consider, as training pixels, only a subsampled set of each image. By resizing the

    images to half their original size and subsampling the training pixels in 5-pixel steps, we

    could already reduce the number of calculation loops 100 times. After this simplifica-

    tion was applied, the decrease in segmentation quality was almost imperceptible, which

    indicates that the information necessary for training the classifiers was not lost with the

    resizing and subsampling.

    2. RAM memory:

    As discussed in section 3.3, the use of integral images is essential for the efficiency of the

    calculation of the texture-layout filters v[r,t]. If we consider that 100 textons have been

    defined in the textonization process, we have, for each training image, 100 integral images,

    one for each texton index. Again, considering 100 training images already resized to half

    their original size, we have ten thousand 400 300 matrices (each matrix represents an

    integral image). If we use a normal int variable for each matrix elementwhich in C++

    occupies 4 byteswe need 10000 400 300 4 = 4.8 Gigabytes or RAM memory.

    The first attempt to avoid this memory problem was to load only some of the integral

    images at a time. However, for the calculation of the texture-layout filter responses of the

weak classifier candidates, all the integral matrices are necessary. They therefore had to be simultaneously accessible in RAM memory.

The solution was to use short unsigned integers, with only 2 bytes, which were big


enough for all the integral matrices analyzed⁸, and also to subsample the integral image matrices:

I^{(t)}(u, v) = I^{(t)}_{\text{sub}}\big( \text{round}((u, v) / \text{SubsamplingFactor}) \big)    (4.20)

Again, the subsampling barely changed the results of the final segmentations. One of the reasons is probably that the subsampling rate of 3 used is much smaller than the sizes of the rectangular regions r used in the texture-layout features. Although the subsampling reduced the amount of RAM memory necessary for loading the integral images, there is still a limit on the number of training images that can be used without causing memory problems.
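A small sketch of this storage scheme, assuming a per-texton texton map and the subsampling factor of 3 mentioned above, is given below; all variable names are illustrative.

    % Store the per-texton integral image subsampled and as 16-bit integers.
    intFull = cumsum(cumsum(double(textonMap == t), 1), 2);
    intSub  = uint16(intFull(1:sub:end, 1:sub:end));   % ~sub^2 fewer entries, 2 bytes each
    % Approximate lookup of the full-resolution value at pixel (u, v), Eq. 4.20
    % (clamping at the upper image border omitted for brevity):
    val = double(intSub(max(1, round(u/sub)), max(1, round(v/sub))));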

⁸Each short unsigned integer can store a number of up to 65535. If we consider a 400 × 300 pixel image, the maximum value of an integral image, if all pixels were assigned to one single texton, is 120000. However, since each pixel is assigned to one of many texton indexes, the integral image of each texton never has values close to the limit 65535.


    Chapter 5

    Results

    In this chapter, we investigate the performance of our semantic segmentation system on the

    challenging CamVid dataset and compare our results with existing work. Firstly, we show

preliminary results obtained with the texture features described in Section 4.3 without considering any context. We then analyse our final model with the context features (texture-layout

    features) described in Section 4.4. The effect of different aspects and parameters of the model

    is discussed before we present the best results obtained and analyse them quantitatively and

    qualitatively.

    5.1 Model without context features

    Figure 5.1 shows the confusion matrix of the segmentation of approximately 200 pictures, with

    classifiers trained on 140 other pictures, all randomly taken from the CamVid database. For this

    segmentation experiment, 500 training feature vectors have been randomly chosen per training

image. The segmentations have been computed by minimizing Eq. 4.5, which does not include any context feature. Notice that sidewalks are hardly recognized at all.

The adaptive training procedure described in Section 4.3.3 chooses, for the training of the Adaboost classifiers, more example feature vectors from labels that are confused (like road and sidewalk) than from those that are easily recognized (like sky). The confusion matrix of Figure 5.1 shows the results of the segmentation of the first iteration of this adaptive Adaboost training algorithm, where all training vectors are chosen randomly. After three iterations, examples are selectively chosen and the confusion matrix of the segmentation results, shown in Figure 5.2, shows much better discernment between classes that were initially mixed up.

Although the adaptive training procedure improved the segmentation quality, context information, as discussed in the next section, helps differentiate the classes even better.



Figure 5.1: Confusion matrix of a segmentation experiment choosing random feature vectors for training the Adaboost classifiers. Each row shows what proportion of the ground truth class has been assigned to each class by the classifiers. Class others is the union of all classes defined in the CamVid database except street, sidewalk and sky. For an ideal segmentation, the confusion matrix would be equal to the identity matrix.

Figure 5.2: Confusion matrix of the segmentation after three iterations of the adaptive training. Initially, 65% of class sidewalk was wrongly assigned to class road, as compared to only 25% with the adaptive learning. The percentage of class sidewalk correctly assigned also increased from 9% to 61%.

    5.2 Model with context features

    Our final model includes the texture-layout potential (see Section 4.4). This model and its

    results are discussed in detail in the following sections.

    5.2.1 Influence of number of weak classifiers

As illustrated in Figure 3.8, texture-layout filters work by exploiting the contextual correlation of textures (and, in our solution, also color) between neighboring regions. Figure 5.3 shows the rectangular region r of each of the first ten texture-layout features for the classifier of class road. Notice that the location distribution of the regions r is slightly biased towards either the

    top half or the bottom half of the image. This comes, probably, from the fact that most of

    the correlations between textures present in class road and other textures happen in a vertical

    fashion: the road is normally below other classes.


Figure 5.3: r regions of the first ten weak classifiers composing the strong classifier for class road. The yellow cross in the middle indicates the pixel i being classified and the blue rectangle represents the bounding box within which all the weak classifier candidates are created. The bigger the