
Ball Detection via Machine Learning

RAFAEL OSORIO

Master of Science Thesis
Stockholm, Sweden 2009

Master's Thesis in Computer Science (30 ECTS credits)
at the School of Computer Science and Engineering
Royal Institute of Technology, year 2009
Supervisor at CSC was Örjan Ekeberg
Examiner was Anders Lansner

TRITA-CSC-E 2009:004
ISRN-KTH/CSC/E--09/004--SE
ISSN-1653-5715

Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.csc.kth.se

Abstract

This thesis evaluates a method for real-time detection of footballs in low-resolution images. The company Tracab uses a system of 8 camera pairs that cover the whole pitch during a football match. Stereo vision makes it possible to track the players and the ball in order to extract statistical data. In this report a method proposed by Viola and Jones is evaluated to see if it can be used to detect footballs in the images extracted by the cameras. The method is based on a boosting algorithm called Adaboost and has mainly been used for face detection. A cascade of boosted classifiers is trained from positive and negative example images of footballs. The objects in these images are much smaller than the typical objects that the method was developed for, and one question this thesis tries to answer is whether the method is applicable to objects of such small sizes.

The Support Vector Machine (SVM) method has also been tested to see if the performance of the classifier can be improved. Since the SVM method is time-consuming, it has been tested as the last step in the classifier cascade, using features selected by the boosting process as input.

In addition to this, a database of football images from 6 different matches has been produced, consisting of 10317 images used for training and 2221 images used for testing. Results show that detection can be made with improved performance compared to Tracab's existing software.

Sammanfattning

Bolldetektion via maskininlärning (Ball detection via machine learning)

This report examines a method for real-time detection of footballs in low-resolution images. The company Tracab uses 8 camera pairs that together cover a whole football pitch during a match. With the help of stereo vision it is possible to follow the players and the ball, and then offer statistics to fans. This report evaluates a method developed by Viola and Jones to see whether it can be used to detect footballs in the images from the 16 cameras. The method is based on a boosting algorithm called Adaboost, which has mainly been used for face detection. A cascade of boosted classifiers is trained from positive and negative example images of footballs. The images used here show small balls that are smaller than the typical objects the method was designed for, and one question this report tries to answer is whether the method is applicable to such small objects.

Support Vector Machines (SVM) have also been tested to see whether the performance of the classifier can be raised. Since SVM is a slow method, it has been integrated as a final step in the trained cascade, with features from the Viola and Jones method used as input to the SVM.

A database consisting of a training set and a test set has been created from 6 matches. The training set consists of 10317 images and the test set of 2221 images. The results show that detection can be done with higher precision compared to Tracab's current software.

Contents

Introduction
  1.1 Background
  1.2 Objective of the thesis
  1.3 Hit rate vs. false positive rate
  1.4 Related Work
  1.5 Thesis Outline
Image Database
  2.1 Ball tool
  2.2 Images
    2.2.1 Training set
    2.2.2 Negatives
    2.2.3 Test set
    2.2.4 Five-a-side
    2.2.5 Correctness
Theoretical background
  3.1 Overview
  3.2 Features
    3.2.1 Haar features
    3.2.2 Integral Image
  3.5 AdaBoost
    3.5.1 Analysis
    3.5.2 Weak classifiers
    3.5.3 Boosting
  3.6 Cascade
    3.6.1 Bootstrapping
  3.7 Support Vector Machine
    3.7.1 Overfitting
    3.7.2 Non-linearly separable data
    3.7.3 Features extracted with Adaboost
  3.8 Tying it all together
Method
  4.1 Training
  4.2 Step size and scaling
  4.3 Masking out the audience
  4.4 Number of stages
  4.5 Brightness threshold
  4.6 SVM
  4.7 OpenCV
Results
  5.1 ROC-curves
  5.2 Training results
  5.3 Using different images for training
    5.3.1 Image size
    5.3.2 Image sets
    5.3.3 Negative images
  5.4 Step Size
  5.5 Real and Gentle Adaboost
  5.6 Minimum hit rate and max false alarm rate
  5.7 Brightness threshold
  5.8 Number of stages
  5.9 Support Vector Machine
  5.10 Compared to existing detection
  5.11 Five-a-side
  5.12 Discussion
Conclusions and future work
  6.1 Conclusions
  6.2 Future work
Bibliography
Appendix 1
  Training set
  Test set 1

Chapter 1

Introduction

In this chapter the circumstances of the problem are presented, as well as the goal of the thesis. Related work is described and an outline of the thesis is given.

1.1 Background

This Master's thesis was performed at Svenska Tracab AB. Tracab has developed real-time camera-based technology for locating the positions of football players and the ball during football matches. Eight pairs of cameras are installed around the pitch, controlled by a cluster of computers. Fig 1 shows how pairs of cameras give stereo vision and how this makes it possible to calculate the X and Y coordinates of an object on the pitch.

Fig 1 - Eight camera pairs cover the pitch, giving stereo vision.

With this information it is possible to extract statistics such as the total distance covered by a player, a heat map where a warmer color means that the player has spent more time in that area of the pitch, completed passes, the speed and acceleration of the ball and of the players, and a lot more. The whole process is carried out in real time (25 times per second). The system is semi-automatic and is staffed with operators during the game. All moving objects that are player-like are shown as targets by the system. The operators need to assign the players to a target, since no face recognition or shirt number identification is done to identify the players. They must also remove targets that are not subjects of tracking, e.g. medics and ball boys.

One big advantage of the system is that it does not interfere with the game in any way. No transmitters or any other kind of device on the players or the ball are used.

1.2 Objective of the thesis

The objective of this Master's thesis is to improve the ball detection using machine learning techniques. Today the existing ball tracking method primarily uses the movement of an object to recognize the ball, rather than its appearance. In this report we will see if it is possible to shift the focus from using the movement to doing object detection in every frame. A key requirement is that the method has to be fast enough for real-time usage.

Tracab's technology is already good at detecting moving balls against a static background, so an aim for this project is to produce reasonable ball hypotheses in more difficult situations such as:

- The ball is partially occluded by players.
- The lighting conditions are uneven, especially when the sun only lights up a part of the pitch.
- Other objects, like the head or the socks of a player, look like the ball.
- The ball is still, e.g. at a free kick.

A classifier is to be trained to detect footballs, based on a labeled data set of ball / non-ball image regions from images captured by Tracab's cameras. When talking about image regions in this report, a smaller sub-window that is part of the whole image is what is meant (left of figure 2). When only talking about an image, the whole image is meant (right of figure 2).

Fig 2 - Example of an image region and an image captured by Tracab's cameras.

The classifier needs to be somewhat robust to changes in ball size and preferably also ball color, since these differ between situations.

One big difference between this project and previous studies of object detection, such as the paper of Viola and Jones, is the size of the object [32]. Here it is very small, only a few pixels wide. A big question is whether the method presented in this report can be applied to objects of this size.

Even with reasonably good detection of the ball it is difficult to tell the ball apart from other objects using only techniques based on the analysis of still images. One way of solving this is to examine the trajectory of the object in a sequence of images, in order to discard objects that do not move like a ball. Also, if the classifier detects the ball most of the time, only missing a few frames at a time, it is possible to do post-processing to calculate the most likely ball path between two detections. These two steps are already used today at Tracab and are not a part of this thesis. Hopefully results from this thesis can provide more accurate detections, thus improving the data input to these steps and reducing the amount of calculation that needs to be done in them.

The aim of this thesis is to evaluate if a machine learning approach can be used to detect a small object such as a ball. More specifically, different algorithms based on the work of Viola and Jones, which uses Adaboost, will be evaluated [32]. In addition to this, an extension of their work using an SVM at the last stage has been tested and evaluated, inspired by the study by Le and Satoh [13]. An overview of how the methods are combined can be seen in fig 3.

Fig 3 - Overview. Image regions of an image are extracted and a cascade of classifiers trained with Adaboost is run for each image region. A bright pixel is searched for before running an SVM classifier on the regions that have not been rejected in earlier stages.

[Figure: input (all image regions) -> cascade of classifiers -> brightness threshold -> SVM classifier -> ball; regions rejected at any stage are classified as non-ball.]

1.3 Hit rate vs. false positive rate

While it is possible to get close to 100% in hit rate, this will probably also lead to a very high false positive rate. Having a hit rate of 80% and 0.1% false positives could be great in some situations and for some applications; in other situations it is not acceptable. In a medical environment, for example, this may not be good enough, because you really want to be certain before giving some risky medicine. In this application there is no fixed limit that has to be reached. Instead it is the ratio between the hit rate and the false alarm rate that is interesting. Results will therefore be presented using Receiver Operating Characteristic (ROC) curves, with the hit rate on the y-axis and the false positive rate on the x-axis. The different results are obtained by varying a sensitivity parameter. In many applications a threshold is varied to get different detection rates. In this thesis the number of stages used for detection and the step size used during detection are varied to get the different rates. ROC-curves are commonly used to present this kind of results, which makes it easier to compare these results with others. A more detailed description of the ROC-curves is given in Section 5.1 along with the results.
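As a hedged illustration (not code from the thesis), one ROC point per detector configuration could be computed like this in Python; the counts below are made-up placeholders:

    # Hit rate (y-axis) vs. false positive rate (x-axis) for one configuration.
    def roc_point(detected_balls, total_balls, false_detections, negative_regions):
        hit_rate = detected_balls / total_balls
        false_positive_rate = false_detections / negative_regions
        return false_positive_rate, hit_rate

    # e.g. two configurations (say, 12 vs. 10 cascade stages; counts are hypothetical):
    curve = [roc_point(1800, 2221, 900, 10**6),
             roc_point(1950, 2221, 4000, 10**6)]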

Somewhat promising results have been achieved, as seen in fig 4. Compared to the existing method at Tracab we can see small improvements. More results can be seen in Chapter 5.

Fig 4 - Results compared to an existing method used by Tracab.

The open source project OpenCV has been used for most of the algorithms in this project [38]. Preprocessing of the image data has been done using Matlab. The SVM part has been done using libSVM [6], which has been integrated with OpenCV.
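For orientation, detection with a trained cascade looks roughly like the following sketch, which uses OpenCV's modern Python API rather than the 2009-era C API used in this project; the cascade file name is hypothetical:

    import cv2

    # A cascade trained on ball examples (hypothetical file name).
    cascade = cv2.CascadeClassifier("ball_cascade.xml")
    frame = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)

    # scaleFactor and the size bounds play the role of the scaling and
    # step-size parameters discussed in this thesis.
    detections = cascade.detectMultiScale(frame, scaleFactor=1.1,
                                          minNeighbors=3,
                                          minSize=(8, 8), maxSize=(24, 24))
    for (x, y, w, h) in detections:
        print("ball candidate at", (x, y), "size", (w, h))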

1.4 Related Work

Much of the research in object detection focuses on face detection and texture classification, while little work has been done on ball detection. Research on face detection and texture classification is therefore also presented. More work has been done on ball tracking in multiple image sequences, but that is not of interest for this work and will thus not be covered. The organization of this chapter is as follows: first, specific research on ball detection is presented; then, the most interesting methods found for face detection and texture recognition.

Much of the research on color-independent ball detection has been done using some kind of edge detection. A Circle Hough Transform (CHT) is used by D'Orazio and Guaragnella for circle detection [9]. Edges are first obtained by applying the Canny edge detector. Each edge point contributes a circle of radius R to an output accumulator space. When the radius is not previously known, the algorithm needs to be run for all possible radii, which requires a large amount of computation. Another disadvantage is that this method is bad at handling noisy images. The sub-windows found by the CHT step are then used to train a neural network with different wavelet filters as features. The Haar wavelet was found to be the best among Daubechies 2, Daubechies 3 and Daubechies 4 with different decomposition levels. A faster version of the Circle Hough Transform is used for ball detection by Scaramuzza et al [27]. Coath and Musumeci use edge-based arc detection to detect partially occluded balls [7].

Ville Lumikero thresholds a grayscale image into binary form [19]. He applies morphological operations such as dilation and erosion to clean the images from noise and to fill up holes in objects. The ball candidates are then thresholded by size and color. The remaining ball candidates are further processed by a tracking algorithm not described here.

Ancona et al [1, 3] implement an example-based method for ball detection by training a Support Vector Machine (SVM; see Burges' tutorial for more information [5]) with positive and negative example images of a ball. Images of footballs of 20x20 pixels are used as input to the SVM. The images are preprocessed with histogram equalization to reduce variations in image brightness and contrast. The SVM is a robust algorithm that searches for the hyperplane that maximizes the minimum distance from the hyperplane to the closest training point. SVMs are described in more detail in Section 3.7.

To do face detection, a bootstrapping technique is used by Osuna et al to improve the performance of an SVM [23]. Misclassified images that do not contain faces are stored and used as negative examples in later training phases. The importance of this step was first shown by Sung and Poggio [30]. To reduce the computation time of the SVM, a reduced set of vectors is introduced by Romdhani et al [26]. By applying the reduced vectors one after the other it may not be necessary to evaluate all of them: if an image is considered very unlikely to be a face early in the chain, it can be discarded fast. Regular detectors based on a single classifier, such as an SVM without the reduced vector set, are slow because they spend as much time evaluating negative responses as positive responses in an image. This approach of a coarse-to-fine cascaded detector has been used widely in the literature.

One of these cascade approaches is the algorithm proposed by Viola and Jones [31, 32]. Their algorithm uses a cascade of boosted classifiers to detect faces and mimics a decision tree behavior. In the first stages of the cascade only a few features are evaluated. Further down the cascade the stages become more complex; thus a lot of non-faces can be thrown away with little effort, while face candidates are examined more thoroughly. Viola and Jones have shown that their final classifier is very efficient and well suited for real-time applications (15 frames/s at 384x288 pixels on a 700 MHz Pentium III). Much of their work is based on Papageorgiou et al, who did not work directly on pixel values [24]. Instead they use a set of Haar filters as features. These features are selected by evaluating them in a statistical way and are then used as input to an SVM, which is used to classify new images. The features in Viola and Jones' report are selected with the help of a boosting technique called Adaboost. Adaboost is a boosting algorithm that enhances the performance of a simple, so-called weak classifier. Haar rectangles are used as features, so the task of the weak learner is to select the rectangle with the lowest classification error on the examples. After each round of learning, the examples are re-weighted so that the misclassified examples are given greater importance in the next step. Interestingly, it has been shown in a study by Freund and Schapire that this approach minimizes the training error exponentially in the number of rounds [10]. Viola and Jones also use a fast way of computing the Haar features by pre-calculating the Integral Image. The Integral Image entry I(x, y) is the rectangle sum to the left of and above the point (x, y). This makes it possible to calculate any rectangular sum in four array references, and thanks to this, the resulting detector is easily shifted in scale and location. A simple tradeoff between processing time and detection rate can be made by experimenting with the step size. This report is largely based on the work by Viola and Jones, but also regards extensions of their work. This is mainly due to the efficiency of the method shown in their study [32], as well as the promising proofs about the error that are explained in Section 3.5.1.

Many extensions of the work of Viola and Jones have been made, for example in the study by Mitri et al, where it is used for ball detection [21]. First, images are preprocessed by extracting edges using a Sobel filter. These images are then used as input to train the classifier. A variant of Adaboost called Gentle Adaboost is used, as proposed by Lienhart et al [15]. There it is shown that for the face detection problem, Gentle Adaboost outperforms both Discrete Adaboost (the original Adaboost is nowadays often referred to as the discrete version) and Real Adaboost, which was introduced by Schapire and Singer as an extension of the original version [28].

Lienhart et al also extend the work done by Viola and Jones by introducing the Rotated Integral Image [14]. With the help of this, an extended set of Haar features is used, which improves the detection rate of the classifier while the required computation time does not increase to the same extent.

Some improvements to the weak classifiers are made by Rasolzadeh [25]. The feature responses of the Haar wavelets for both the positive and the negative examples used in Viola and Jones are modeled using normal distributions. By doing this it is possible to improve the discriminative power of the weak classifiers by using multiple thresholds (i.e. two new thresholds are introduced where the two distributions intersect). In Viola and Jones' algorithm a single threshold is used by the weak classifiers to separate the two classes. They further show that this multi-thresholding procedure is a specific implementation of response binning. By calculating two histograms of the feature response, for the positive and the negative examples, over a certain number of bins, it is possible to determine multiple thresholds without modeling the response as normal distributions. The weak classifier hypothesis consists of comparing the two histograms of the feature response. This change can be implemented by directly replacing the old weak classifiers, without any other major changes to the algorithm. Their results suggest some improvement in detection rate while keeping the computing time low.

Another extension of the Viola and Jones algorithm, which combines their work with a Support Vector Machine at the last stage of the cascade, is presented in a study by Le and Satoh [13]. Two new stages are added to the cascade of classifiers. First, a stage to reject non-face regions more quickly has been added; it does so by using a larger window size and a larger step size. As the last stage of the new cascade an SVM is used. The Haar features that were selected by Adaboost in the previous stages are used to train the SVM; as they are already calculated, no recalculation is needed. In contrast to the extension of the Haar features proposed by Lienhart et al [14], a reduction of these features is implemented here. This is mainly done to reduce training time, and their experiments show that for rejection purposes no efficiency is lost. Wu et al [33] have been able to make an algorithm for training the cascade that is roughly 100 times faster than that of Viola and Jones. The difference is that Wu et al only train the weak classifier once per node, instead of once for each feature in the cascade.

Liu et al map the feature response into two histograms (one for the positive set and one for the negative set) and search for the feature that maximizes the divergence of the two classes [17]. This is done by using the Kullback-Leibler (KL) divergence as the margin of the data. Results are promising compared to Adaboost, but the final speed of the classifier is slower (2.5 frames/s at 320x240 pixels on a P4 1.8 GHz).

Lin and Liu train 8 different cascades to handle 8 different types of occluded faces [16]. If a sample is detected as a non-face, only the trained classifiers that contain features that do not intersect with the occluded part are evaluated. A majority of these classifiers should give positive responses if the sample was indeed a face. The sample is then evaluated with one of the new cascades. These additional stages result in a three times longer computing time (18 frames/s at 320x240 pixels on a P4 3.06 GHz). To avoid overfitting, the mechanism for selecting weak learners during boosting is reconsidered. Influenced by the Kullback-Leibler Boost, they use the Bhattacharyya distance as a measure of the separability of the two classes. They claim that the Bhattacharyya coefficient is much easier to compute than the Kullback-Leibler distance while maintaining the same performance.

A totally different approach to feature detection, called SIFT, has been developed by Lowe [18]. It is a highly appreciated method that has been used widely and proven effective [20]. It has a high matching percentage and is robust to lighting variations compared to other local feature methods. The most interesting part of this work is how the features are described. Points of interest are found by looking for peaks in the gradient image. Descriptors representing the local image gradients are extracted from the area around these points of interest. A 4x4 grid is constructed, and for each of these bins a histogram of 8 gradient orientations is computed. This representation has the advantage that it is good at capturing the small distinctive spatial patterns of an object. Rotation invariance is achieved by relating all the gradient orientations to a reference orientation. When a new image is to be classified, these descriptors are matched to the descriptors in the database trained with example images, using a nearest neighbor method. For further reading, a study by Mikolajczyk and Schmid evaluates an improvement to SIFT and compares different local descriptors [20].

A similar approach to SIFT is a feature descriptor called SURF [4]. The main difference is that instead of gradients, Haar features are calculated around a point of interest and represented as vectors. By using the Integral Image, the calculations can be made faster.

Local Binary Patterns (LBP) is another approach, used for texture classification by Ojala et al [22]. For each pixel, an occurrence histogram of features is calculated. A feature is represented by the signs of the differences in value between the center pixel and its neighbors (positive or negative responses). The neighbor pixels are chosen as equally spaced pixels on a circle of radius R. In the report it has been tested with different values of R and different numbers of neighbors. With this representation the output can take as many as 2^P different values, P being the number of neighbor points. Certain local binary patterns are overrepresented in the textures, and these patterns share the property of having few spatial transitions. These are called uniform, and the definition is that they contain less than three 0/1 changes in the pattern. By making the representation rotationally invariant and by only considering uniform patterns, this amount is reduced significantly. Note that the occurrence histogram does not save the spatial layout of the image; it only stores information about the frequency of the local features.
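As a hedged sketch (assuming numpy; not code from the cited work), the basic 8-neighbor, radius-1 LBP code and its occurrence histogram can be computed as follows; rotation invariance and the uniform-pattern restriction are omitted:

    import numpy as np

    def lbp_8_1(img):
        # One bit per neighbor: 1 if the neighbor is >= the center pixel.
        img = img.astype(np.int32)
        center = img[1:-1, 1:-1]
        shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                  (1, 1), (1, 0), (1, -1), (0, -1)]
        code = np.zeros_like(center)
        for bit, (dy, dx) in enumerate(shifts):
            nb = img[1 + dy:img.shape[0] - 1 + dy,
                     1 + dx:img.shape[1] - 1 + dx]
            code |= (nb >= center).astype(np.int32) << bit
        return code  # 2^8 = 256 possible patterns for P = 8 neighbors

    # The texture descriptor is the histogram of the codes, not their layout:
    # hist = np.bincount(code.ravel(), minlength=256)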

1.5 Thesis Outline

The rest of the thesis is organized into five chapters. In Chapter 2, the database of images needed for training and testing is described. Chapter 3 explains the theory behind the methods used: both the method proposed by Viola and Jones, based on a cascade of classifiers trained with Adaboost, and the method of using a Support Vector Machine as a classifier. It also describes how these two methods can be combined in different ways. Chapter 4 describes how the methods were used in this specific problem setting. Experimental results are shown and discussed in Chapter 5, and conclusions are drawn in Chapter 6, along with proposed future work.

Chapter 2

Image Database

The image database was produced semi-automatically from example images taken from the Tracab system cameras. This chapter describes how this was done and how the image sets have been created.

2.1 Ball tool

The images are extracted using a tool developed by Eric Hayman at Tracab. The procedure was to look at an image, click on the ball, and try to center the position as exactly as possible. At the same time the images were labeled with the degree of freeness of the ball and with the contrast in the image.

A free ball is not in contact with anything and is completely visible. The higher the freeness number of the ball, the closer it is to other objects. In the same way, the higher the contrast number, the harder it is to distinguish the ball from the background. Examples can be seen in fig 5.

The resulting information is a text file holding the path to the image, the x and y position of the ball in the image, the degree of freeness and the degree of contrast of the ball. Later this information is used to extract image regions that consist of the ball in the center along with some of the background. This way it is easy to do tests using image regions of different sizes for training, as described in Section 5.3.1.

Example (row created by the ball tool):

    Image path                         x       y      free  contrast
    C:\HEIF-GAIS\ball.000001.04.jpg    116.53  25.10  2     3
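A minimal sketch of reading such a row in Python (the format is taken from the example above; the parser itself is a hypothetical illustration):

    def parse_ball_row(line):
        # Split from the right so a path containing spaces stays intact:
        # the last four fields are always x, y, freeness and contrast.
        path, x, y, free, contrast = line.rsplit(None, 4)
        return path, float(x), float(y), int(free), int(contrast)

    row = r"C:\HEIF-GAIS\ball.000001.04.jpg 116.53 25.10 2 3"
    print(parse_ball_row(row))
    # ('C:\\HEIF-GAIS\\ball.000001.04.jpg', 116.53, 25.1, 2, 3)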

Description of the two scales:

Freeness

1. Free
2. Close to a player/line but not in contact
3. In contact with the player/line but not over it
4. Over the player/line
5. Partially occluded by a player

Contrast

The contrast scale ranges from 1 to 4, where 1 represents a very high contrast where the ball is easily distinguishable from the background, and 4 represents a very low contrast where it is difficult to separate the football from the background.

2.2 Images

The images taken by the 16 cameras and saved by the ball tool all have a resolution of 352x288 pixels. The whole images are saved by the ball tool as RGB images but are converted into grayscale for training and detection. The transformation from RGB to grayscale has been done in the same way as calculating the luminance Y of the YUV color space [35]: Y = 0.299R + 0.587G + 0.114B. Each channel has color values in the range [0..255], so Y lies in [0..255] as well.
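As a small illustration of this conversion (assuming numpy; not the Matlab code used in the project):

    import numpy as np

    def rgb_to_luma(rgb):
        # Y = 0.299 R + 0.587 G + 0.114 B; for 8-bit channels Y stays in [0, 255].
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        return (0.299 * r + 0.587 * g + 0.114 * b).astype(np.uint8)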

The images have been split into two groups: a training set and a test set. The training set, along with the negative samples, is used to train the cascade, while the test set is used to measure performance.

2.2.1 Training set

10317 positive example images were extracted for training from 6 different matches in Allsvenskan, the Champions League and an international match from 2007, all with different lighting conditions.

Table 1 shows the number of images that have a value equal to or lower than the contrast/freeness measures indicated on the left and at the top of the table.

Table 1 - The different types of images in the training set.

    Contrast/Free     1      2      3      4      5
    1                75     83     83     84     84
    2              3258   4988   5898   5982   6031
    3              4651   7447   9194   9446   9582
    4              4840   7845   9777  10120  10317

2.2.2 Negatives

To construct negatives, the same images are used as for the positive samples. The ball is removed from the image by setting the pixels in the area of the ball to black. The whole 352x288 image is then saved and labeled as a negative image. During training, image regions that are detected as positives are extracted from these images and used as negatives, since we know that there is no football in them. This procedure is called bootstrapping and is described further in Section 3.6.1.

2.2.3 Test set

2221 images were extracted from other sequences in the same matches, and testing is done on these. The image set is expected to have the same ratio of different kinds of images as a match has in general. Tests are done on images from the same matches as used in training, to avoid the problem of not having enough variety in the training data. This set is called test set 1, and the distribution of different images can be seen in table 2. The ratio of the different image types is similar to the ratio in the training set. The ratios in percent for the two sets can be seen in Appendix 1.

Table 2 - The different types of images in test set 1.

    Contrast/Free     1      2      3      4      5
    1                 6     12     12     12     12
    2               659   1071   1277   1284   1287
    3               951   1612   2097   2131   2139
    4               972   1643   2149   2202   2221

Tests have also been done on a match not used in training, though we should not expect good results in general on matches that have not been seen during training, when having only 6 different matches to train on. From this match there are 884 images. This set is called test set 2.

Fig 5 - Examples of image regions with different properties. (a) 2 on the free scale and 3 on the contrast scale. (b) 1 on the free scale and 2 on the contrast scale.

2.2.4 Five-a-side

New images have been collected from a five-a-side match where the cameras are positioned closer to the pitch. This is possible since the pitch in a five-a-side match is about 16 times smaller than normal. This setup gives higher-resolution images of the footballs, since they come closer to the cameras. The footballs are now between 2 and 8 pixels in radius in the images, which is significantly larger than before, and the texture of the ball becomes visible. The training set contains a total of 5937 images and the test set contains 2068 images, extracted in the same way as the training set. For this set, no analysis of the quality (freeness and contrast) of the images has been done. Also, to save time, the process of extracting footballs from the images has been made faster by mostly including easy targets.

2.2.5 Correctness

The data set contains football images of variable size and with a wide range of lighting conditions. Balls that were close to the cameras are larger than those far away from the cameras; they can vary by up to a couple of pixels in diameter. It is questionable whether the variation in lighting conditions in the extracted images is enough to capture the variance that exists in reality between all different matches. Ideally there would also be images from a much wider range of matches, to be able to generalize completely. This has not been done due to the large amount of time it takes to extract footballs from images manually. The same can be said about the problem of having to deal with different kinds of footballs. They are not always white: some are black-and-white checkered and others are even red. This could be solved by training several cascades.

It is also uncertain whether the test set represents a general set of images. To be able to detect the football as often as possible, it is optimal to have a training set that represents all the different image types that are present during a game.

Hopefully this is achieved automatically when taking a wide range of images without any special selection process.

Another thing that could affect the results negatively is the labeling of the data that has been done in table 1 and table 2. As always when humans are involved, this labeling is a result of subjective reasoning. Also, according to research done by gestalt psychologists, the eye is easily fooled [37].

Chapter 3

Theoretical background

This chapter gives an overview of the general approach used in the project and describes the theory needed to understand the method. It covers these areas: training a boosted classifier using Adaboost, combining the trained classifiers into a coarse-to-fine cascade, and training a Support Vector Machine to be used as the last stage of the classifier.

3.1 Overview

The algorithm used in this report is largely based on the work of Viola and Jones from 2001 [31]. It is a popular method that has been widely used (see Related Work). Some proofs about its generalization ability and the bound on the error have been made, which makes the algorithm very interesting; more about this can be found in the analysis of the method in Section 3.5.1. The method has mainly been developed and evaluated for face detection rather than ball detection.

The algorithm works in the following way. A classifier is trained using positive and negative image regions of an object, all of the same size. The classifier consists of several so-called weak classifiers (3.5.2), built from Haar-like features (3.2.1), which are trained using a boosting technique called Adaboost (3.5). The boosted weak classifiers are combined into a coarse-to-fine cascade of classifiers (3.6). The idea is to reject a lot of non-objects in the early stages, where the computation is light, reducing processing time, while positives get processed further. When classifying an image region, the classifier outputs whether the object is detected or not, like a binary classifier. The classifier is easily scaled and shifted to be able to detect objects at different locations and of different sizes in an image.

This algorithm is combined with a Support Vector Machine (SVM) at the end, as described in Section 3.7. SVMs have been reported to be good classifiers [26]. The SVM only needs to evaluate the image regions passed on by the Adaboost classifier in the previous stage, which makes it faster; otherwise, a big disadvantage of the SVM method is that it may be too slow for real time [23]. An overview of how the method works can be seen in fig 6.

Fig 6 - System overview. A cascade of classifiers trained with Adaboost is combined with a brightness threshold and an SVM classifier as the last stage. Image regions that make it through the system without being rejected are classified as footballs. Notice: same figure as fig 3.

3.2 Features

A feature is a characteristic that is used to distinguish objects from non-objects. The two main reasons why features are used instead of the pixel values directly are that it improves speed (explained in more detail in Section 3.2.2 about Integral Images) and that features can capture different kinds of properties in an image. Any feature can be used, such as the total sum of an area, variance, gradients etc.

3.2.1 Haar features

In this thesis, differences in pixel value between adjacent rectangles are used as features. The features can be seen in fig 7. The sum of the pixels in the white rectangle is subtracted from the sum of the pixels in the black rectangle. The resulting difference that comes from calculating the feature is called the feature response.


As shown in fig 7, 14 different features are used. With a base resolution of the detector of 12x12 pixels, the total number of possible features in my setup is 8893. This is a large number of features, but as we will see, only a portion of the entire set will be needed.

The features are called Haar-like features because they mimic the behavior of the Haar wavelet basis. Much like gradients, they capture the change in pixel value rather than the pixel value itself. They are insensitive to differences in mean intensity and scale.

Fig 7 - The extended set of features, as suggested by Lienhart et al [14].

Features 1a-b, 2a-d and 3a in figure 7 were in the original set of features used by Viola and Jones. The rest of the features were introduced by Lienhart et al [14]. The new set consists of 7 additional rectangle features that have been rotated by 45 degrees. Having a larger set of features makes it possible to capture the properties of the object more accurately, but it also affects the time it takes to train the classifier, since there are more features to evaluate. However, it does not automatically mean that a larger number of features will be used by the final classifier, which would affect its speed. To speed up the calculation of the features, an integral image is used.

3.2.2 Integral Image

An Integral Image is a matrix made to simplify the calculation of the sum of an upright rectangular area in an image. It is a pre-calculation step made to speed up other calculations. The value of the Integral Image (II) at II(x,y) is the sum of all the pixels of the original image (OI) up and to the left of OI(x,y). An example can be seen in fig 8.

    Original image          Integral Image

    155 201 226             155 356 582
     98  78  48             253 532 806
     14 111  44             267 657 975

Fig 8 - The original image (left) and the corresponding Integral Image (right).

This makes the calculation of the sum of any rectangle in the image faster. The formula is:

    II(x,y) = II(x-1,y) + II(x,y-1) - II(x-1,y-1) + OI(x,y)    (1)

where OI(x,y) is the original image and II(x,y) is the Integral Image.

Once the Integral Image is calculated, the rectangle D in figure 9 can be computed from the Integral Image values at the four corner points 1, 2, 3 and 4 by:

    D = 4 - 2 - 3 + 1 = (A+B+C+D) - (A+B) - (A+C) + A    (2)

Fig 9 - Any rectangle D can be computed from the Integral Image by: 4 - 2 - 3 + 1.

With the help of the Integral Image, any rectangle can thus be calculated with only 4 array references. The difference between two adjacent rectangles (the edge features) can be calculated with six array references, while three adjacent rectangles (the line features) require eight array references.
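The following is a small runnable sketch of equations (1) and (2) (assuming numpy; not the project's actual code). A leading zero row and column avoid special cases at the image border:

    import numpy as np

    def integral_image(oi):
        # II(x,y) = sum of all pixels up and to the left of (x,y), eq. (1).
        ii = oi.cumsum(axis=0).cumsum(axis=1)
        return np.pad(ii, ((1, 0), (1, 0)))

    def rect_sum(ii, top, left, h, w):
        # Any upright rectangle in four array references, eq. (2).
        return (ii[top + h, left + w] - ii[top, left + w]
                - ii[top + h, left] + ii[top, left])

    img = np.arange(16).reshape(4, 4)
    ii = integral_image(img)

    # A two-rectangle (edge) Haar feature: difference between adjacent halves.
    response = rect_sum(ii, 0, 2, 4, 2) - rect_sum(ii, 0, 0, 4, 2)
    print(response)  # 16 for this toy image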

Since the rectangles are small (the training samples are 8-15 pixels in both width and height), and therefore so are the features, the Integral Image does not help in all cases. When the rectangles are small enough it is faster to do the calculations directly on the pixels; this is true when the rectangles are smaller than 3 pixels. But the loss in time from using the Integral Image in those cases is very small. The calculations with the Integral Image are done in O(1).

For the rotated rectangles a different Integral Image is needed, called a Rotated Integral Image. The idea is the same as before, and it is calculated with the formula from Lienhart et al [14]:

    RII(x,y) = RII(x-1,y-1) + RII(x+1,y-1) - RII(x,y-2) + OI(x,y) + OI(x,y-1)    (3)

where RII is the Rotated Integral Image.

3.5 AdaBoost

The principal idea of boosting is to combine many weak classifiers to produce a more powerful one. This is motivated by the idea that it is difficult to find one single highly accurate classifier. The T weak classifiers h_t(x) are combined into a strong classifier by:

    H(x) = sign( sum_{t=1..T} a_t * h_t(x) )    (4)

where the weights a_t are found during boosting.

The weak classifiers are used to select the features, among the large number of features, that best separate the two classes: objects and non-objects. This kind of feature selection was first done in a statistical manner by Papageorgiou et al [24]. By using Adaboost this selection process can be optimized when the best feature is not obvious.

The boosting step of the algorithm is done by re-weighting the examples, putting more weight on the difficult ones. A new round of feature selection is then done with the new distribution. These weak classifiers add up to strong classifiers, which are combined to construct a cascade of strong classifiers.

3.5.1 Analysis

It has been proven by Schapire and Freund that the error of the final classifier drops exponentially fast if it is possible to find weak classifiers that classify more than 50% of the examples correctly [29].

The final training error is at most:

    prod_{t=1..T} 2 * sqrt( e_t * (1 - e_t) )    (5)

where e_t is the error of the t-th weak hypothesis. The bound on the error of the final classifier improves whenever any of the weak classifiers is improved.

In the same article it was also shown that the generalization error of the final classifier is, with high probability, bounded by the training error. This means that the final classifier is likely to generalize well on samples it has not seen before. They show that, with high probability, the generalization error is less than

    Pr[H(x) != y] + O( sqrt( T*d / m ) )    (6)

where Pr[] is the empirical probability on the training sample, T is the number of rounds of boosting, m is the size of the sample and d is the VC-dimension (Vapnik-Chervonenkis dimension) of the base classifier space. The VC-dimension of a hypothesis space H defined over an instance space X is the size of the largest finite subset of X shattered by H. Further explanation of the VC-dimension can for example be found in the tutorial by Sewell [34].

Schapire and Freund's analysis implies that overfitting may be a problem if training is run for too many rounds. However, their tests showed that boosting does not overfit even after thousands of rounds. They also found that the generalization error decreases even after the training error has reached zero. These are promising results that motivate the use of this method.

3.5.2 Weak classifiers

A weak classifier is a simple classifier that only has two prerequisites: it must be better than chance, i.e. classify more than 50% of the samples correctly, and it must be able to handle a set of weights over the training examples. The weights are needed in the boosting step.

In this case the weak classifier h_j consists of one feature f_j along with its threshold theta_j:

    h_j(x) = 1 if p_j * f_j(x) < p_j * theta_j, otherwise 0    (7)

The feature response f_j(x) is compared to the threshold theta_j. The variable p_j (the parity) is used to indicate the direction of the inequality sign.

In order to find the best weak classifier at each round of training, the feature responses from all samples are calculated, and by applying a threshold it is possible to separate the samples into two classes. During training the optimal threshold is determined for each feature, optimal meaning the threshold that minimizes the classification error of that feature.

In each step of boosting, the feature and its corresponding threshold with the lowest classification error are selected, along with a weight inversely proportional to the classification error of that feature. This weight can be seen as a measure of the importance of that particular weak classifier.

The error is calculated with respect to the weights w_i of the examples (i.e. the error is the sum of the weights of the misclassified examples):

    e_j = sum_i w_i * | h_j(x_i) - y_i |    (8)

where y_i is the true label of example x_i.
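A small sketch of this threshold search for a single feature (assuming numpy; an exhaustive search for illustration, not the project's optimized training code):

    import numpy as np

    def best_stump(responses, labels, weights):
        # responses: feature response per sample; labels in {0, 1};
        # weights sum to 1. Returns the (error, threshold, parity)
        # minimizing the weighted error of eq. (8).
        best = (np.inf, None, None)
        for theta in np.unique(responses):
            for parity in (1, -1):
                pred = (parity * responses < parity * theta).astype(int)
                err = np.sum(weights * (pred != labels))
                if err < best[0]:
                    best = (err, theta, parity)
        return best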

3.5.3 Boosting

In order to train and combine several weak classifiers, instead of using one more complex classifier, Adaboost repeats the training step with a modified distribution of the training set of examples. For each round, more emphasis is put on the more difficult examples: those examples which were wrongly classified by the previous weak classifier are given higher weights than the correctly classified examples. The weights are then normalized. This is done at each round of training until the total number of rounds is reached.

Different variants of Adaboost have been evaluated for face detection, and two of them have been used and compared for ball detection in this report [15]. Discrete Adaboost is the original version proposed by Freund and Schapire [29]. According to Lienhart et al [15], Gentle Adaboost is the most successful algorithm for face detection. Real Adaboost uses class probability estimates to construct real-valued contributions. They are all similar in computational complexity during classification, but differ somewhat during learning in the way they update the weights at each round of boosting.

The main idea is still the same in all three cases:

General pseudo-code for Adaboost

    Initialize weights w = 1/m (normalized)
    For t = 1..T:
        Train the weak learners using distribution w:
        fit the weak classifiers to the data and
        calculate their errors with respect to the weights.
        Choose the weak classifier with the lowest error
        and update w by increasing the weights of the
        misclassified examples.
    Output the final hypothesis as a weighted combination
    (weighted according to the errors) of the weak classifiers.
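A compact runnable version of Discrete Adaboost in this spirit (a sketch assuming numpy, with labels in {-1, +1} and a caller-supplied train_stump function; the weight update and the classifier weights follow the standard formulation):

    import numpy as np

    def adaboost(train_stump, X, y, T):
        # train_stump(X, y, w) must return a classifier h with h(X) in
        # {-1, +1} and weighted error below 0.5 (better than chance).
        m = len(y)
        w = np.full(m, 1.0 / m)
        ensemble = []
        for _ in range(T):
            h = train_stump(X, y, w)
            pred = h(X)
            eps = np.sum(w * (pred != y))
            alpha = 0.5 * np.log((1 - eps) / eps)  # importance of this round
            w *= np.exp(-alpha * y * pred)         # emphasize mistakes
            w /= w.sum()                           # re-normalize
            ensemble.append((alpha, h))
        # Final strong classifier, eq. (4): sign of the weighted vote.
        return lambda X: np.sign(sum(a * h(X) for a, h in ensemble))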

For more information on the different variants of Adaboost, see the comparative study by Friedman et al [11].

3.6 Cascade

By arranging several strong classifiers into a simple-to-complex cascade it is possible to reduce computation time. In the first stages of the cascade, simpler and faster classifiers are used. Since the majority of the image regions going into the first stage are non-objects, many of them are easily rejected by the early stages, while the majority of the positives are let through. Once an image region has been rejected at some stage, it is discarded as a non-object for the rest of the cascade and thus not evaluated further. A positive image region goes through the whole cascade and is evaluated at every stage, requiring further processing, but in total this is a rare event. Deeper in the cascade the classifiers get more and more complex, requiring more computation time. Also, with increasing stage number, the number of weak classifiers needed to achieve the desired false alarm rate at the given hit rate increases.

    The cascade of classifiers is trained by introducing goals in terms of positive

    detections and number of false positives. For example, to achieve a final

    classifier with a hit rate of 90% and a false positive rate of 0.1%, each stage in

    a classifier with 10 stages needs to have a hit rate of 99% (0.99^10 = 0.9) but

  • 23

    only a maximum in false positive rate of 50% (0.5^10 = 0.001). Each stage

    reduces both values but since the hit rate is close to one the result of the

    multiplication stays close to one while the result of the multiplication of the

    smaller false positive rate rapidly decreases towards zero. This is all done

    under the assumption that the different stages in the classifier are independent

    of each other.
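The per-stage targets follow directly from the overall goals under this independence assumption; a small Python sketch (with the example numbers above) makes the relation explicit:

    # Per-stage training targets derived from the overall cascade goals,
    # assuming independent stages (values from the example above).
    overall_hit_rate = 0.90
    overall_false_alarm = 0.001
    n_stages = 10

    stage_hit_rate = overall_hit_rate ** (1.0 / n_stages)        # ~0.990
    stage_false_alarm = overall_false_alarm ** (1.0 / n_stages)  # ~0.50

    print(f"per-stage hit rate >= {stage_hit_rate:.3f}, "
          f"per-stage false alarm <= {stage_false_alarm:.3f}")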

The cascade is formed by setting a minimum hit rate and a maximum false positive rate for every stage. The stage is trained and features are added until the desired hit rate and false positive rate have been reached. By specifying these goals it is possible to obtain a classifier with the desired properties.

    3. 6. 1 Bootstrapping

    A new negative set is constructed for each stage by selecting those image

    regions that were falsely detected by the classifier using all previous stages. A

    false detection as the one in figure 10 would be added to the negative set. This

    method is called bootstrapping. Intuitively this makes sense as we expect the

    new examples to help us get away from the current mistakes.

    Since at each stage the classifier becomes more and more accurate, it becomes

    more and more difficult to find false positives. Also the false positives get more

    and more similar to the true detections making the separation task harder. As a

    result, deeper stages are more likely to have a high rate of false positives.
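A hedged Python sketch of this bootstrapping step is given below; `accepts` and `regions_of` are hypothetical callables standing in for the real cascade and window scanner, not functions from the actual system.

    def mine_hard_negatives(accepts, regions_of, images_without_ball, n_needed):
        """Collect false positives of the cascade trained so far to use
        as the negative set when training the next stage. `accepts` and
        `regions_of` are hypothetical stand-ins for the real detector."""
        negatives = []
        for image in images_without_ball:
            for region in regions_of(image):
                if accepts(region):              # a false positive
                    negatives.append(region)
                    if len(negatives) >= n_needed:
                        return negatives
        # Deep stages may fail to find enough false positives, as noted above.
        return negatives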

Fig 10 A typical hit along with a false detection. The image region of the player's shoe is used as

    a negative sample in the training of the next stage.


    3. 7 Support Vector Machine

    Support vector machines (SVM) are used for data classification. The basics of

SVMs needed to understand the method are presented here.

In the same way as in the Adaboost case, a training set and a test set are needed to train and evaluate the SVM. Given a set of labeled data points (the training set) belonging to one of two classes, the SVM finds the hyperplane that best separates the two classes. It does this by maximizing the margin between the two classes. In the left image of fig 11 we can see an example of a hyperplane that separates the two classes with a small margin. In the right image the hyperplane that maximizes the margin has been found by the SVM. The points that constrain the margin are called support vectors.

    Fig 11 SVM finds the plane that maximizes the margin. The image to the right is considered to

have greater generalization capabilities. Image taken from DTREG's homepage [36].

    3. 7. 1 Overfitting

Fig 12 shows how a classifier that is fitted well to the training set may not generalize well. In image (a) the classifier has learnt to classify all training examples correctly. As seen in image (b) this results in some wrongly classified examples on the test set. In image (c), however, we see a classifier that, although it classifies one example from the training set wrongly, classifies all the examples in the test set correctly (as seen in image (d)). The latter classifier generalizes better because it allowed a wrongly classified example during training. This can be handled by introducing a penalty parameter C that weights the samples according to how they were classified. Misclassifying a sample now has a cost, and increasing C makes misclassification more expensive, adjusting the model more closely to the training data.


(a) Training data and an overfitting classifier   (b) Applied on test data

    (c) Training data and a better classifier (d) Applied on test data

Fig 12 An overfitting classifier and a more general classifier. Images from the LIBSVM guide [6].

    3. 7. 2 Non-linearly separable data

The examples in fig 11 show two linearly separable classes. With more complicated data, a line may not be enough to separate the two classes. To cope with this problem the data is mapped into a higher (maybe infinite) dimensional space by a function φ [6]. The function φ can take many forms. In this new space it may be possible to find a plane that separates the data. The problem with going into a higher dimensional space is that the calculations get more expensive, which makes the method slow. Therefore the kernel trick, first introduced by Aizerman et al, is used to solve this [1]. Since all SVM calculations can be done using the dot product between the training samples, the operations in the high dimensional space do not have to be performed. Instead we can try to find a function K(x_i, x_j) = φ(x_i) · φ(x_j), called the kernel function. Examples of popular kernels are the polynomial kernel, the Radial Basis Function (RBF), the linear kernel and the sigmoid kernel. As proposed by LIBSVM, the RBF is a good choice to start with:

K(x_i, x_j) = exp(-γ ||x_i - x_j||²),  γ > 0    (9)

The linear and the sigmoid kernels are special cases of the RBF kernel for certain values of the parameters (C, γ) [6]. The polynomial kernel is more complex in terms of the number of parameters to select.

When using the RBF kernel there are two parameters to select: C and γ. Since it may not be useful to achieve high training accuracy, these parameters have been evaluated by doing cross-validation on the training set. This is done by


dividing the training set into two parts, one for training and the other for testing. This is done repeatedly with different partitions to get a more accurate result. The parameter values have been tested by increasing them logarithmically and then doing cross-validation to measure the performance. The cross-validation helps us get around the problem of overfitting.
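A minimal sketch of such a logarithmic grid search is shown below, here using scikit-learn's SVC (which wraps LIBSVM) rather than the LIBSVM tools themselves; the data and the parameter ranges are placeholders.

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Placeholder data standing in for scaled feature responses.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 10))
    y_train = (X_train[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

    # Logarithmic grids for C and gamma, evaluated with 5-fold
    # cross-validation on the training set (ranges are illustrative).
    param_grid = {
        "C":     2.0 ** np.arange(-5, 16, 2),
        "gamma": 2.0 ** np.arange(-15, 4, 2),
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)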

    3. 7. 3 Features extracted with Adaboost

    Having a good classifier does not make sense unless the data points represent

    something meaningful. The idea is to use features extracted from some stage of

    the cascade constructed with Adaboost. Fig 12 could then be interpreted as

    having the feature response of one feature on the x-axis and the feature

    response of another feature on the y-axis. But unlike the examples in fig 11 and

fig 12, more than two features need to be used. This does not change any of

    the theory except that we move into a higher dimensional space.

The samples used for training were gathered by letting an Adaboost-trained classifier classify the image regions in the training set as in Chapter 2. The feature responses from the samples classified as positives were chosen as the positive training set, and the feature responses from some of the false positives from each image were selected to be part of the negative set. A detailed

    explanation on how this was done is given in Section 4.6.

    3. 8 Tying it all together

    To be able to add SVM as the last stage of the classifier we need to decide from

    which stage to take the features and how many features to use. Results in a face

detection study by Le and Satoh suggest that the switch from the Adaboost classifier to the SVM classifier can be made at any stage [13]. The same study also shows a big increase in performance when going from 25 to 75 features, while the difference between using 75 and 200 features

    is not significantly large. Since the objects to detect are different in this report

    and in the study by Le and Satoh there is no guarantee that the optimal number

    of features is the same. The speed of the classifier depends very much on the

    number of features that are used, so it is important to find an optimal tradeoff

    here.


    Chapter 4

    Method

    This section describes how boosted classifiers described in section 3 have been

    trained and how they are used on the specific task of detecting footballs. It

    describes how the classifiers are shifted in both location and scale across the

image during detection. To reduce the number of false detections a brightness threshold is introduced, and a mask is used to restrict detection to the only area where it is interesting to search for the ball: the pitch. A

    description of how the features for the SVM have been collected and tested is

    given.

    4. 1 Training

    Several different cascades are trained as described in Chapter 3 and the

    performances of these classifiers can be seen in Chapter 5.

    Image regions of sizes between 8x8 and 15x15 pixels have been used to train

    four different classifiers. Bigger image regions result in training samples that

include more of the background. If no background was included in the

    image regions used for training, the classifier would only learn the texture of

    the ball. Since the resolution is low, it is very difficult to distinguish any texture

    on the footballs. The idea here is therefore to include some of the background

    to give the classifier more information to work with. By including the

    background the classifier has the possibility of finding the difference between

the dark background and the bright ball. How much of the background should be included in the samples is not clear. If there is too little background,

    maybe the classifier will not be able to capture the property that the ball is

    white and round compared to the darker background. On the other hand, if too

much of the background is used the classifier will probably base its detections on the background instead.

The effect of using different parts of the training set has been evaluated. One classifier has been trained with easier images and another with harder images. So-called easier images are the images labeled with contrast 1 and 2

    and labeled with freeness 1, 2 and 3. The harder images have an additional

1747 images that have been labeled with contrast 3. Using harder images during training, where the ball is occluded and the contrast is bad, should result in better detection when the ball is close to a player or occluded in some other way, but it also makes it harder to distinguish between a ball and a non-ball. The rejection process will be more forgiving, letting more examples

    through the cascade since the training images have a wider diversity. One can

    expect that more false detections will be made, requiring a higher number of

    stages to reach the same level of false detections.

    Two classifiers have been trained to evaluate the importance of using a high

    number of negative samples. 2000 and 5000 false positives have been extracted

    to use as negative samples in the bootstrapping step.

    Discrete Adaboost and Gentle Adaboost, two different kinds of boosting

algorithms, are evaluated, and the minimum hit rate and maximum false alarm

    rate are varied to train three new classifiers.

    An overview of how the classifier is used can be seen in fig 6.

    4. 2 Step size and scaling

    As mentioned in Chapter 3, detection is done by sweeping a window of

    different sizes over the image, running the classifier at each image region.

Since the footballs are not perfectly aligned and have a small variation in position and size, the trained detector is somewhat insensitive to small shifts. An object can therefore be detected even when it is not perfectly centered.

    However, if not going through all possible image regions some objects are

    likely to be missed. The step size also affects the detector speed. With a step

    size of 1 pixel and running 10 different window sizes there are around one

    million image regions that the classifier needs to be run on. By simply

    increasing the step size to 2 (hence skipping one pixel at each step) the number

    of image regions can be halved and thus halving the total time of the

    classification. The step size is therefore a tradeoff between detection rate and

    time. When having such small objects to detect as is the case in this report, it is

    very likely that a small step size is required. The shift in location and window

    size has been tested with different step sizes and results are shown in Section

    5.3.

    Since the balls we want to detect range in size between 3 and 7 pixels in

    diameter there is no reason to search for objects of other sizes. The detection

    window is therefore scaled until it is larger than the biggest object possible, but

not more. Scaling can be done either by scaling the image region itself or by scaling the features. In this case the features are scaled, since this comes at no cost (see the section on the integral image, which shows that the size of a rectangle doesn't affect calculation time), while scaling the image region is time-consuming.
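A Python sketch of the shift-and-scale scan described above is given below; the classifier call and the pitch mask (Section 4.3) are hypothetical stand-ins, and the sizes match the 3-7 pixel balls mentioned above.

    def detect(image, classify_at, mask, min_size=3, max_size=8,
               step=1, scale_factor=1.1):
        """Sweep a detection window over the image at several scales.
        The window grows by `scale_factor` until it exceeds the largest
        possible ball; the features are scaled instead of the image, so
        larger windows cost no extra feature computation. `classify_at`
        and `mask` are hypothetical stand-ins."""
        h, w = image.shape
        detections = []
        size = float(min_size)
        while size <= max_size:
            win = int(round(size))
            for y in range(0, h - win, step):
                for x in range(0, w - win, step):
                    if not mask[y, x]:        # outside the pitch: skip
                        continue
                    if classify_at(image, x, y, win):
                        detections.append((x, y, win))
            size *= scale_factor
        return detections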

    4. 3 Masking out the audience

    Since the ball only needs to be detected when it is in play there is no need to

perform detection outside the pitch. Since the camera system has a model of the pitch it is easy to get the limits from there. For each match 16 different mask images are constructed, one for each camera (to the right in fig 13). When stepping through the x and y coordinates of an image, each position is first checked against the mask to see whether detection should be made there. By doing this a more accurate result can be achieved. The result can be seen to the left in

    figure 13. No detections are made outside the boundaries of the mask.

Fig 13 Detections (left) and the corresponding mask (right).

    4. 4 Number of stages

    What has been noticed when running the trained classifiers on the test data is

    that images from different matches respond very differently to the cascade.

    Images from a bright match may give a lot of false detections when running the

    cascade with many stages, while images from another match give hardly any

    false detections. But when increasing the number of stages used, positive

    detections are lost from the latter while only reducing the false positives from

    the former. This means that the optimal number of stages used for classification

    differs from match to match. So another way of getting the ROC-curves would

    be to set a limit on how many false detections the classifier is allowed to find in

    an image, and run the classifier until this limit has been reached. This increases


    the performance of the classifier when testing on a range of matches, and a test

    made this way is presented in Chapter 5. However during testing of other

    parameters this method would not be practicable, since it would be impossible

    to get any comparison data that depended only on the tested parameter.

    4. 5 Brightness threshold

    A brightness threshold has been used to eliminate some of the false detections.

    Since the features used do not capture the pixel value but only how the pixel

    values are related to each other, some of the detections have been found to be

    located on grass areas which are totally green. These can easily be rejected by

    looking at the brightness of the pixels in the detected area. If no single pixel is

    bright enough the detection can be ruled out.
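A minimal sketch of this rejection rule is shown below; the default threshold of 125 is the value that Section 5.7 reports worked on test set 1, but in general the threshold is tuned per match.

    import numpy as np

    def passes_brightness(gray_image, x, y, w, threshold=125):
        """Keep a detection only if at least one pixel in the detected
        region is brighter than the threshold (125 is the value found
        to work on test set 1; in general it is tuned per match)."""
        region = gray_image[y:y + w, x:x + w]
        return int(region.max()) >= threshold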

    Fig 14 Left: Image of false alarms on grass. Right: Mask that indicates the two different

    thresholds when an umbra is present.

    To the left in fig 14 we can see two false detections that have the same feature

    responses as a ball. It can be seen that these two detections consist of a brighter

    circular area in the center and darker pixels around. It also seems clear that the

    areas do not contain any pixels that are as white as a ball would be.

    To find the optimal threshold value for each individual match, a histogram of

    the intensity of the pixels in the whole image has been used. Usually, a peak in

    this histogram indicates the color of the grass. When there is an umbra because

    of the sun, it should be possible to find two peaks indicating the color of the

    grass. It is possible to extract a mask of where the umbra is as seen to the right

    in fig 14 and by looking at the brightness histogram it is possible to find a

    corresponding threshold that is optimal for the two different regions. On the

dark sections of the pitch the threshold is close to one peak; on the bright sections of the field the threshold is close to the other peak. The mask to the right in fig 14 indicates which of the two thresholds applies to each region.


    4. 6 SVM

    Two different types of features have been used to train and evaluate the SVM

    method: using the pixel value of the image regions directly and using the

    feature responses from a classifier trained with Adaboost. False positives have

been extracted in the same way. Both methods use images from the training set

    and the test set 1 described in Chapter 2.

    The training and testing procedure for the SVM using feature responses is as

    follows:

Training - Let a cascade with some low number of stages run on the training set

    discussed in Chapter 2. The feature responses of around 200 features starting

    from some stage are calculated, for both positives and negatives (one false

    positive from each image is taken to get an equal amount of positives and

negatives in the resulting training data). How many features should be used is not known, but results from the study made by Le and Satoh show that 200 is

    a good number [13]. Some different values are tested. The Support Vector

    Machine is then trained with these feature responses. By using feature

    responses from later stages it should be possible to change the SVM into

    classifying more difficult samples better. At the same time it should classify the

easier samples worse, but the cascade of boosted classifiers that is run before the SVM step is meant to take care of these.

    Testing - Let the same cascade classify the test set 1 discussed in Chapter 2.

Extract around 200 feature responses starting from some stage and label the regions that are classified as detections as positives, and the rest as negatives. Run the SVM on the extracted feature responses and evaluate the performance. By adding the SVM after different numbers of stages it is possible to construct a ROC-curve. Run the cascade up to stages that give a similar rejection or

    detection rate as the SVM and evaluate the performance. We can now compare

    how well the two different methods perform on the same positive and negative

    set.

As mentioned in the LIBSVM guide, scaling of the data is very important for achieving good results [6]. Without scaling the data it was not possible to get any acceptable results. The main reason why scaling is important is to avoid large numbers dominating small numbers. Scaling the data also makes the calculations easier, thus improving the efficiency of the classifier. All data, both the training and the testing data, has been scaled in the same way to a fixed range.
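A small sketch of such scaling is given below; scaling to [-1, +1] is one of the ranges recommended by the LIBSVM guide and is an assumption here, since the exact range is not stated above.

    import numpy as np

    def fit_scaler(X, lo=-1.0, hi=1.0):
        """Per-feature linear scaling to [lo, hi], fitted on the training
        data. The returned function applies the same transform to any
        data, so training and test data are scaled identically."""
        xmin, xmax = X.min(axis=0), X.max(axis=0)
        span = np.where(xmax > xmin, xmax - xmin, 1.0)  # avoid divide-by-zero
        return lambda Z: lo + (hi - lo) * (Z - xmin) / span

    rng = np.random.default_rng(0)
    X_train_raw = rng.normal(size=(100, 5))   # placeholder feature responses
    X_test_raw = rng.normal(size=(20, 5))

    scale = fit_scaler(X_train_raw)   # fit the scaling on training data only
    X_train = scale(X_train_raw)
    X_test = scale(X_test_raw)        # same transform applied to test data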


    4. 7 OpenCV

    The detection part of OpenCV was developed with face detection in mind. As a

    result it has been somewhat optimized for larger objects than the footballs in

    this thesis. This means that modifications to the code have been made to search

    for smaller objects.


    Chapter 5

    Results

    In this section results from the different trained classifiers are presented.

    Comparisons are made using ROC-curves. The performance is only shown for

    the values of interest since the training of later stages requires a lot of time.

    5. 1 ROC-curves

    To see the ratio between hit rate and false alarm rate, ROC-curves are used to

present the results. Hit rates are shown as percentages, while the false alarm rate shows the actual count of detected false positives per image.

A detection is counted as a positive hit if:

- the distance between the center of the detection and the center of the actual football is less than 30% of the width of the actual football
- the width of the detection window is within 50% of the actual football width

Other detections are regarded as false alarms.
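These two rules translate directly into a small matching function; the sketch below is one reading of them ("within 50%" taken as an absolute width difference of at most half the ball width).

    def is_hit(det_x, det_y, det_w, ball_x, ball_y, ball_w):
        """A detection counts as a positive hit if its center is within
        30% of the ball width from the ball center and its window width
        differs from the ball width by at most 50% of the ball width."""
        dist = ((det_x - ball_x) ** 2 + (det_y - ball_y) ** 2) ** 0.5
        center_ok = dist < 0.30 * ball_w
        width_ok = abs(det_w - ball_w) <= 0.50 * ball_w
        return center_ok and width_ok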

    A test with a perfect result has a ROC-curve that passes through the upper left

corner. That means 100% hit rate and 0 false alarms. The closer the curve is to this point, the better the accuracy of the classifier. Points that fall below the dotted

    line in fig 15 are the results of a classifier that is worse than chance.


    Fig 15 A ROC-curve. The closer the curve is to the top left corner the better.

    To get the curves, the number of stages used for detection is varied. More

stages mean a more specific classifier, while fewer stages let more of the image regions through the cascade as positives.

    The choice of number of stages influences the performance differently on

    different matches. In some matches a lot of detections are made using a specific

number of stages. Using the same number of stages on another match may result in very few detections. Since there are images from different

    matches in the test set, this is a problem. A test is therefore made by adjusting

    the number of stages used during detection. Depending on the number of

    detections made in the previous image, more or fewer stages are used for the

    next images. This improves performance as seen in Section 5.8.

    5. 2 Training results

Training and testing have been done on a 2.16 GHz computer with 2 GB RAM. In general it has taken on the order of days to train a classifier up to stage

    35. Some training sessions have been forced to stop earlier due to the training

    being too time consuming. It has however always been possible to get a ROC-

    curve that can be used to compare the performance with the others. The

    variables that affect training time are: the number of negatives used in the

    bootstrapping step, the size of the image regions, the minimum hit rate and the

    maximum false alarm. The first attribute can be very time consuming and it is

    therefore important to have a good set of negative images from where the

algorithm can find false positives. The last three attributes increase the number of features that are needed to reach the goal at each stage.

    Some examples of features chosen by the algorithm can be seen in figure 16.

    Fig 16 Example of features selected by Adaboost in early stages


    5. 3 Using different images for training

    Adaboost has been reported to be sensitive to noisy data [8]. Here some tests

    with different images used during training are reported.

    5. 3. 1 Image size

    Image regions of sizes between 8x8 and 15x15 pixels have been used to train 4

    different classifiers. Examples are given in figure 17.

Results show small differences in performance between the classifier trained using image regions of size 10x10 pixels and the classifier trained using image regions of size 12x12 pixels. Using image regions of size 8x8 pixels shows worse performance. Using image regions bigger than 12x12 does not improve performance, as seen in fig 18.

Size 8x8    Size 10x10    Size 12x12    Size 15x15

    Fig 17 Image regions of different sizes


Fig 18 By using image regions of size 12x12 the best performance is achieved, although there is not a big difference between the different classifiers.

    5. 3. 2 Image sets

    The next test shows the importance of having a good image database. Two

    classifiers trained with different images are compared, one trained with images

    in which the ball is occluded more and the contrast is worse. The results can be

    seen in fig 19.

    Around 4 more stages were required to get down to the same false alarm rate.

    At the same time the classifier that was trained with harder images shows a

    better performance in total.


    Fig 19 A classifier trained with images with less contrast and where the ball is partially

    occluded shows better performance than a classifier trained only on clearly visible footballs.

    5. 3. 3 Negative images

    Another modification that can be done to the set of images used in training is

    using more negatives in the bootstrapping step (see Section 3.6.1). The

bootstrapping step selects a number of false positives produced by the currently available classifier as examples of negatives. Until now the algorithm has used

    only 2000 negative samples at each stage. By letting the training procedure

    extract 5000 samples of negatives at each stage it should be possible to

    improve performance. One problem of using a high number of negatives in this

    step is that as training reaches later stages it becomes more and more difficult

    to find a large number of false positives. One should also mention that

    increasing this number immediately increases the time needed for training. As a

    reference it took 705 s to find 2000 false positives at stage 39, while it took

    under a second in the first stages. To find 5000 false positives at stage 39 took

    2105 s. In early stages the difference was not noticeable.


Fig 20 Using a higher number of negative samples for each stage of training increases performance by a couple of percentage points.

    The comparison in performance between using 5000 negatives at each stage

    and using 2000 negatives can be seen in fig 20. The two curves follow each

other, with the curve for the classifier trained with 5000 negatives a couple of percentage points higher.

One question that arises is how using a higher number of negatives affects the generalization performance. Is it only positive, or could the new classifier become too specialized to the training data? If the negatives are very

    similar to the positives the trained classifier is likely to have a decision

boundary that lies very close to the positives. By running the classifiers on the test set 2 described in Chapter 2 we can see some indication of how well the

    two classifiers generalize to the data. The difference in performance of the two

    classifiers is very small. There is not a big enough difference in performance

    when testing the generalization ability to be able to draw any conclusions about

    it. As seen in fig 21 the detection rates are very low.


    Fig 21 The classifier trained with 5000 negatives performs better in the lower regions of the

    ROC-curve when testing on a game that has not been used in training

    5. 4 Step Size

    During detection a detection window is swept across the image at different

    locations and at different scales. By shifting the window a few pixels at a time

it is possible to scan the whole image. A step factor is used to increase both the window size and the step size. For example, if the current step size is s, the window is shifted s pixels at each step. This means that when the detection window is

    large it is shifted more than one pixel at a time. Since the image regions used in

    training are not perfectly centered, a small amount of translational variability is

    trained into the classifier. While speeding up the classifier substantially,

    shifting more than one pixel at a time has resulted in a decrease of the

    performance.

With a scale factor of 1.2 the classifier classified around 24 images per second (it took between 87 and 96 seconds to classify all 2221 images). On the other hand it took about double the time with a scale factor of 1.1, which means around 13 images per second. Due to the higher performance when

using a small step factor, a fixed step size of 1 pixel and not using the scale

    factor for the step size has been tested. The window size is still increased with


    the step factor until the window is large enough. This way it only classifies

    about 9 images per second.

    Fig 22 A smaller step factor increases performance but also increases processing time.

    These results as seen in fig 22 clearly show a big increase in hit rate when

    decreasing the step size. In the forthcoming results a fixed step size of 1 pixel is

used along with a step factor of 1.1 for the window size. Also, a pre-calculation step, which used a step size of 2 pixels as a first pass, has been removed. By

    removing this step the detection rate improves as shown in figure 23 while

    decreasing the number of images processed per second to 8.5.


Fig 23 The old step size from fig 22 compared with the same classifier after removing a pre-calculation step.

    5. 5 Real and Gentle Adaboost

Two variants of the Adaboost algorithm have been evaluated. The Discrete Adaboost training was not able to finish, so it is left out of the comparison. What happened was that the boosting step was not able to improve performance by updating the weights, so the process got stuck. Lienhart also reported convergence problems when using LogitBoost for face detection and was not able to evaluate that method [14]. Lienhart could also show that the Gentle Adaboost was the best method among the Real, Discrete and Gentle Adaboost, at least for face detection. D. Le and S. Satoh also stated that the Discrete Adaboost is too weak to boost on a dataset that is hard to separate [13].


Fig 24 The performance of the Real Adaboost and the Gentle Adaboost is essentially the same.

    Figure 24 shows how the difference in performance between the two different

    variants of Adaboost, Real and Gentle, is minimal.

    5. 6 Minimum hit rate and max false

    alarm rate

    As described in Section 3.6 the min hit rate and max false alarm rate are used

    to set up the properties of the cascade. They describe the values each stage

    needs to reach in order to move on to the next stage.

We see that lowering the max false alarm rate improves performance significantly. Even better performance is achieved by using a higher minimum hit rate during training. It is worth noting that a higher min hit rate alone shows as good performance as raising the min hit rate and lowering the max false alarm rate at the same time. To make it easier to refer back to this classifier later in the report, the

    classifier with the best performance in fig 25 is called classifier 1.


    Fig 25 Comparison between using different values of the minimum hit rate and the false alarm

    rate during training. Better performance is achieved by using a higher min hit rate during

    training.

A question that arises is whether this procedure deteriorates the generalization performance of the classifier. Maybe these results reflect an overly specific classifier that has been adjusted too well to the training data. To test this, the

    performance of this classifier is compared with the performance of the

    classifier using a min hit rate of 0.995 on the test set 2. Test set 2 contains

images from a game not used for training. The results in fig 26 give some indication that the classifier has reduced its generalization ability. The

    classifier trained with an increased minimum hit rate and a decreased

    maximum false alarm rate still performs better than the classifier trained with

    the default values, but the difference is much smaller.

The downside of changing these limits is that it may hurt training. It can take a longer time for the classifier to reach the limits, increasing the total time taken to train the cascade. Some training sessions never even ended because they couldn't reach the limits. In addition, more features are usually needed to reach stricter limits, which means that the final classifier will be slower.


    Fig 26 The classifier trained with a higher min hit rate and a lower false alarm rate still

    performs better when testing on test set 2, but the difference between the two classifiers is

    minimal.

    5. 7 Brightness threshold

    When looking at the false positives that the classifiers detect, it can be seen that

    false detections are sometimes made on the green grass. By looking at them it

seems clear that they do not contain any white pixels, but are classified as footballs anyway. See Section 4.5 for more information. A simple way of

    removing these detections should be to reject the detections if they do not

    contain any white pixels. This is done by introducing a brightness threshold

    that rules out detections if no pixel in the detected area is brighter than the

    threshold. The performance when testing different values of the brightness

    threshold can be seen in fig 27.


Fig 27 How performance changes when using different threshold values to remove detections that are not bright enough.

    The images are saved by the ball tool in RGB space but later transformed into

grayscale with one color channel. The luminance Y in the YUV color space is used as the grayscale value. This is described in more detail in Section 2.2.

    The brightness threshold is added as the last step of detection.

    The results show that by adding this threshold, many of the false detections can

    be ruled out without decreasing the detection rate of the classifier. These tests

    were made on the test set 1, and the classifier used was again classifier 1. The

    best result achieved without losing any hits was using a threshold of 125. Using

20 stages of the cascade no positive detections were lost, while the number of false detections decreased from 5.4 to 4.3 per image.

    Further tests show that the threshold can be optimized even more. When

increasing the threshold value, only images from one match suffer from weaker detection while the detection rate of the other matches remains unchanged. As one might expect, this match had difficult lighting conditions and the football was often darker

    than normal. These results suggest that the threshold value can be optimized for

    each individual match.

    By using the threshold mask described in Section 4.5 the threshold can

    automatically be optimized for the current lighting conditions. The comparison

    is made in fig 27 between the cascaded classifier 1 with and without threshold.

    As expected the performance is increased even further using this new

    threshold.


    5. 8 Number of stages

By increasing or decreasing the number of stages used for detection in an image, depending on how many detections were made in the previous image, it is possible to adjust the classifier to each game, since the images in the test set are ordered.

    An upper and a lower limit for when to lower and raise the number of stages

    are needed. By varying these two limits it is possible to get a ROC-curve as in

    fig 28.

    Fig 28 By adapting the number of stages according to the number of detections made in the

    previous image better results are achieved.

    Results show that the classifier benefits from being adjusted to each game to

    maximize performance.
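A sketch of this adaptive rule is given below; the limits and stage bounds are hypothetical placeholders, since varying them is exactly what traces out the ROC-curve in fig 28.

    def adapt_stages(n_stages, n_detections, lower=2, upper=20,
                     min_stages=10, max_stages=35):
        """Choose how many cascade stages to use on the next image, based
        on the number of detections in the previous image. All limits are
        hypothetical placeholders."""
        if n_detections > upper and n_stages < max_stages:
            return n_stages + 1   # too many detections: be stricter
        if n_detections < lower and n_stages > min_stages:
            return n_stages - 1   # too few detections: be more permissive
        return n_stages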


    5. 9 Support Vector Machine

As mentioned in Section 4.6, there are two parameters that need to be selected when using the RBF kernel for SVM classification: C and γ. Unfortunately there is no way to generalize the selection of these parameters. For every data set there is a different optimal choice of parameter values. These values

    have been found using cross-validation (see Section 3.7.2). Only the

    performances of the best classifiers are shown in this section.

    In these tests the SVM has been integrated as the last step after the cascaded

classifier. The ROC-curve has been constructed by varying the stage at which the SVM takes over the classification task.

The first test uses an SVM model that has been trained with 283 feature

    responses from stages 16 and 17 of the best classifier from Section 5.6

(classifier 1). As seen in fig 29, the results of the SVM model stay on the worse side of the cascaded classifier in the ROC-curve. The results from a similar classifier trained directly on the pixel values are even worse and are therefore left out of the report.

Fig 29 Using the SVM as the last step, added after different numbers of stages. Using the SVM as the last stage does not show better results than the cascaded classifier when trained on 283 features.


Possible explanations for why the results from the SVM are worse than when only using the boosted classifier are that too few or too many features have been used, or that the features have been taken from stages too early or too late in the cascade. The next tests examine these possibilities.

The next two tests using SVM have been done on classifiers trained with the same number of feature responses (283) but using feature responses from earlier and later stages of the cascade. The feature responses have been taken starting from stages 2 and 26 respectively. The same cascaded classifier has been used, to be able to compare the effect of using more or less discriminative features. Using features from later stages for training should

result in a classifier that is better at separating harder samples than before. The results in fig 30 show that the overall performance of the two new classifiers is worse than that of the previous classifier.

A higher number of features does not help the SVM classifier, as seen in fig 31. On the other hand, when lowering the number of features the performance increases. The results in this figure come from a classifier trained with features from stage 15 onward, until the desired number of features has been reached. A zero-mean normalization has also been applied to each image region, but without any sign of improvement.

    Fig 30 Training the SVM on feature responses from earlier and later stages results in poor

    performance.


Fig 31 Using additional features does not improve the performance of the SVM classifier. Good

    performance is shown by the classifiers trained with very few features.

    5. 10 Compared to existing detection

    The main method for finding the ball today is based on tracking the movement

    of the ball. Since there has not been enough time to integrate the method

proposed in this report with the existing software used by Tracab, comparisons of the performance in finding a final ball hypothesis have not been made between the two methods. Instead, a specific detection step in Tracab's system

    is compared with the method proposed in this report.

The comparison has been done by running the two methods on the test set 1. A first step in Tracab's algorithm extracts possible ball candidates from the test

    set by looking for image regions that are brighter than the background. These

    regions are then stereo-matched. The resulting set of image regions are saved

    and used as the database in the comparison test. The next step of the method

    continues to narrow down the candidates by using a correlation method that

extracts ball-like candidates. This results in a detection rate and a false detection rate that are compared with the cascaded classifier, as can be seen in fig 32. The cascaded classifier has been run on the same database using different numbers of stages to get a ROC-curve.


    Fig 32 The method proposed in this report compared with a method in the existing system

    As seen in fig 32 the performance of the cascaded classifier is slightly higher.

    5. 11 Five-a-side

Five-a-side is a variant of the game in which each team has only five players and the pitch is much smaller, around 16 times smaller than in regular football. This test can be compared to having the same setup as before but with better cameras of higher resolution. Since

    technology evolves rapidly and prices fall, there are reasons to believe that

    better cameras will be used in the near future.

    The cascade has been trained in the same way as before as described in Chapter

3, but with images as described in Section 2.2.4. Due to the long training time, the training parameters have been set to a max false alarm rate of 0.4, a min hit

    rate of 0.995 and 2000 negatives for each stage. The SVM has been trained

    with 197 feature responses from stage 16 and is added as the last step of the

    cascade.

    Fig 34 shows the performance of the classifier run with and without SVM on

the test set described in Section 2.2.4. The hit rate is over 95% even at a low false alarm rate. As before, using the SVM does not increase performance. The two

    curves in fig 34 follow each other closely.


Fig 33 The system set up at a five-a-side pitch. The cameras are closer to the ball, giving footballs of higher resolution. The texture of the ball is now distinguishable.

    Fig 34 Very good performance is shown by the classifier trained using images of footballs of

    higher resolution. No improvement can be seen when using SVM as the last stage.

    In the same way as in Section 5.10 a comparison has also been made between

the classifier proposed in this section and a current detection method used by

    Tracab. This can be seen in fig 35.


    Fig 35 The method proposed in this report compared with a method in the existing system at

    Tracab.

    As seen the cascaded classifier outperforms the current method. Again, it is

important to remember that this is not the only method used by Tracab's system. In addition to a higher detection rate, the most positive result in this

    test is the low false alarm rate shown by both methods.

    5. 12 Discussion

Overall, the results show that the detection task set up for this thesis can be solved with pleasing results. Judging by the features it selects, the boosting procedure seems capable of extracting the information available in the sample images. In early stages the features are understandable. They capture

    the property of the football being bright in the middle and darker to the sides.

On the other hand, it is not as obvious what the features in later stages represent. This may be a sign that the process overfits the data, but studies have shown the method to be robust to overfitting [29].

    As expected, results in Section 5.3 indicate that the image database is crucial

for getting good performance. It is a little surprising that the best performance is achieved when using as much of the background as in the image regions of 12x12 pixels. The increased performance when using harder images is probably because the training image set related better to the test set 1 used in both tests. The results confirm the expectation that more stages are required when using harder images in order to reject as many samples as when using only the easier images.

    Also as expected, results show a big difference in performance when reducing

    the step size. This is the case because the objects dealt with in this report are

    small. Unfortunately this is directly related to the time needed to classify an

    image. Luckily it is very easy to do a tradeoff between processing time and

    performance.

    Tests of using a brightness threshold show how the boosting only trains the

    classifier to identify relative features, not exact pixel values. This is why the

brightness threshold can be successful. The tests of the brightness threshold, along with varying the number of stages, show the importance of adjusting the

    classifier to each game and lighting condition.

    The most disappointing results in this report have been shown by the Support

Vector Machine method. The results in this thesis contradict the results of a related study made by Le and Satoh, which suggest that a higher number of

    features extracted by a boosted classifier makes it easier for the SVM to

    separate the two classes [13]. This may be due to overfitting (Section 3.7.1).

    These results are discouraging since better results were expected from the

    Support Vector Machine method.

One of the goals of this thesis was to evaluate whether it was possible to improve on the football detection used today. In the comparison, one has to bear in mind that

    additional techniques are used by Tracab to find the final ball hypothesis. Both

    classifiers show good performance on the comparison made in Section 5.10.

    Another comparison would have been to include harder images such as

    footballs that are partially occluded by a player and therefore only visible in

one camera. This has not been possible. One advantage of the cascaded classifier, which can be seen in fig 32, may be that it makes it easy to rank the detections according to how confident they are. A detection that makes it

    through a high number of stages is more likely to be the actual ball than a

    detection that gets thrown away in an earlier stage.

    Even higher detection rates are shown in the test made on the five-a-side

    match. As the cameras are closer to the pitch, the texture of the ball now

becomes visible and the classifier has more to work with. This is part of the explanation of why the classifier shows such good results in fig 34.

However, when comparing the performance one should bear in mind that this test set is not the same as before, so it is impossible to compare these results directly. The labeling of the images regarding the contrast and how free the

    footballs are has not been done for this set, which makes it even more difficult


    to compare with the first test set. Also, to save time, the process of extracting

    footballs in the images has been made faster by mostly including easy targets

    and by removing detections in areas around tracked players. Of course this

    makes it easier to explain the high hit rates of the classifiers.

The SVM method shows similarly disappointing results here as in the earlier SVM tests. Although the performance of the SVM seems to have

    increased it still does not perform better than the cascaded classifier. This can

    also be an indication that the cascaded classifier is showing good results.


    Chapter 6

    Conclusions and future

    work

This chapter gives an overview of the results and the conclusions that can be drawn from them. Some thoughts on what needs to be improved in the future

    are presented.

    6. 1 Conclusions

    In this report a method for object detection has been used to detect small

    footballs in real time. Finding these footballs is a hard task mainly because the

    footballs are very small. This method has not been used on objects of this size

    before. Because of the size of the balls, a smaller spatial step size has been

    needed to achieve a desirable hit rate compared to what has been reported in

    previous reports. This results in a much slower detector. When using the best

classifier a speed of 8.5 images per second is achieved. On the other hand no

    optimization for speed has been done. By introducing the brightness threshold

    before the classifier the processing time could be reduced. It is also easy to do a

    tradeoff between processing time and performance. Tests made on images of

    footballs in higher resolution show increased performance. On the other hand,

    so does the method available at Tracab today.

    The overall performance shown by the classifier in tests is promising, but since

the method has not been implemented to produce a single final hypothesis of the best ball candidate, it has been difficult to make fair comparisons with the method

    available at Tracab today. Therefore it is difficult to say if this method

    implemented as a final ball detector would be better or worse than the one

    available today at Tracab.

    The idea of using a classifier such as SVM as the last stage has been shown not

    to work perfectly. Decreased performance when using a higher number of

    features during training of the SVM contradicts results in previous studies [13].

    This may be due to overfitting.


    6. 2 Future work

    The natural first next step to take would be to integrate the method into the

    existing system to see if it can be used to improve performance of finding a

    final ball hypothesis. This is the only way of getting a true comparison with the

    existing methods at Tracab.

    Big differences in performance can be seen when using different image sets

    (Section 5.3.2). The image set can therefore probably be improved and should

    definitely be revised. The current classifiers have been trained on 6 different

matches. This small set does not give enough variety in the images to cover all possible conditions regarding illumination and color. To get a classifier that

    generalizes well on any kind of new data it is necessary to use a wider range of

    matches. With such a data set it will be necessary to test the classifier on a wide

    range of data from matches not used during training. Another approach would

    be to train a cascade that is optimized for one setup. This could be done by

    training only on images with certain lighting conditions or images of a specific

    football. During classification one would for each match or maybe even for

    each image region start by examining which of the several trained classifier to

    use.

    The results from the SVM method suggest that the feature selection can be

done in a better way. Are the features relevant, or is one feature worth more than another and should be weighted up accordingly? These are some of the questions that need to be answered, and as a starting point one can read a survey addressing the problem of feature selection [12].

    In the near future cameras of higher resolution will probably be used so it is

    natural to continue the research towards this. The five-a-side test was a first

    step towards testing this.


    Bibliography

    1. M. Aizerman, E. Braverman, L. Rozonoer. Theoretical foundations of the

    potential function method in pattern recognition learning. Automation and Remote Control 25, pp 821-837, 1964.

    2. N. Ancona, A. Branca. Example based object detection in cluttered background with Support Vector Machines. Instituto Elaborazione Segnali ed Immagini. Bari, Italy 2000.

    3. N. Ancona, G. Cicirelli, E. Stella, A. Distante. Ball Detection in Static Images with SVM for Classification. Image and Vision Computing 21, pp 675-692, 2003.

    4. H. Bay, T. Tuytelaars, L. Van Gool. SURF: Speeded Up Robust Features. Proceedings of the ninth European Conference on Computer Vision, 2006.

5. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, 121-167, 1998.

    6. C. Chang, C. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

    7. G. Coath, P. Musumeci. Adaptive Arc Fitting for Ball Detection in RoboCup. APRS Workshop on Digital Image Analysing, 2003

    8. T. Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting and Randomization. Machine Learning, 1-22, 1999.

9. T. D'Orazio, C. Guaragnella, M. Leo, A. Distante. A new algorithm for ball recognition using circle Hough transform and neural classifier. Pattern Recognition 37, pp 393-408, 2003.

    10. Y. Freund, R. Schapire. A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting. European Conference on Computational Learning Theory, 1995.

11. J. Friedman, T. Hastie, R. Tibshirani. Additive Logistic Regression: a Statistical View of Boosting. Annals of Statistics, vol. 28, no. 2, pp. 337-407, 2000.

    12. I. Guyon, A. Elisseeff. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3, 2003.

13. D. Le, S. Satoh. A Multi-Stage Approach to Fast Face Detection. IEICE Trans. Inf. & Syst., Vol. E89-D, No. 7, 2006.

    14. R. Lienhart, J. Maydt. An Extended Set of Haar-like Features for Rapid Object Detection. IEEE ICIP 2002, Vol. 1, pp. 900-903, 2002.

    15. R. Lienhart, A. Kuranov, V. Pisarevsky. Empirical Analysis of Detection Cascades of Boosted Classifiers for Rapid Object Detection. MRL Technical Report, 2002.

    16. Y. Lin, T. Liu. Fast Object Detection with Occlusions. The 8th European Conference on Computer Vision (ECCV-2004), Prague, 2004.

    17. C. Liu, H. Shum. Kullback-Leibler Boosting. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 587-594, 2003.


    18. D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60(2), pp 91-110, 2004.

    19. Ville Lumikero. Football Tracking in Wide-Screen Video Sequences. Master Thesis in Computer Science, School of Electrical Engineering Royal Institute of Technology. Stockholm, 2004.

    20. K. Mikolajczyk, C. Schmid. A Performance Evaluation of Local Descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, VOL. 27, NO. 10, 2005.

21. S. Mitri, K. Pervölz, H. Surmann, A. Nüchter. Fast Color-Independent Ball Detection for Mobile Robots. Fraunhofer Institute for Autonomous Intelligent Systems (AIS), Sankt Augustin, Germany, 2004.

22. T. Ojala, M. Pietikäinen, T. Mäenpää. Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, 2001.

    23. E. Osuna, R. Freund, F. Girosi. Training Support Vector Machines: An Application to Face Detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Puerto Rico 1997.

    24. C. Papageorgiou, M. Oren, T. Poggio. A General Framework for Object Detection. International Conference on Computer Vision, 1998.

    25. B. Rasolzadeh. Response Binning: Improved Weak Classifiers for Boosting. Intelligent Vehicles Symposium, pp 344 349, 2006.

26. S. Romdhani, P. Torr, B. Schölkopf, A. Blake. Computationally Efficient Face Detection. Proceedings of the 8th International Conference on Computer Vision, 2001.

    27. D. Scaramuzza, S. Pagnottelli, P. Valigi. Ball Detection and Predictive Ball Following Based on a Stereoscopic Vision System. Proceedings of the 2005 IEEE International Conference on Robotics and Automation. Barcelona, Spain, 2005.

    28. R. Schapire, Y. Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning, 37(3):297-336, 1999.

    29. R. Schapire, Y. Freund. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of computer and system sciences 55, 119-139 (1997)

    30. K. Sung and T. Poggio. Example-based Learning for View-based Human Face Detection. A.I. Memo 1521, MIT A.I. Lab. 1994

    31. P.Viola, M.Jones. Robust Real-Time Object Detection. IEEE ICCV Workshop Statistical and Computational Theories of Vision, 2001.

32. P. Viola, M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp 511-518, 2001.

33. J. Wu, J. M. Rehg, M. D. Mullin. Learning a rare event detection cascade by direct feature selection. Advances in Neural Information Processing Systems (NIPS), 2003.

34. M. Sewell. http://www.svms.org/vc-dimension/ Accessed 2008-05-15.
35. C. Poynton. Frequently Asked Questions about Color. http://www.poynton.com/PDFs/ColorFAQ.pdf Accessed 2008-05-15.
36. DTREG. http://www.dtreg.com/svm.htm Accessed 2008-05-28.
37. http://www.gestalttheory.net/ Accessed 2008-05-28.
38. OpenCV library. http://opencvlibrary.sourceforge.net/ Accessed 2008-06-27.


    Appendix 1

Percentage of images that have a value equal to or lower than the one in the left column (contrast) and the top row (freeness):

Training set:

Contrast \ Free     1      2      3      4      5
1                 0.7    0.8    0.8    0.8    0.8
2                31.0   48.3   57.1   58.0   58.5
3                45.0   72.2   89.1   91.6   92.9
4                46.9   76.0   94.7   98.0    100

Test set 1:

Contrast \ Free     1      2      3      4      5
1                 0.2    0.5    0.5    0.5    0.5
2                29.7   48.2   57.5   57.8   57.9
3                42.9   72.6   94.4   95.9   96.3
4                43.8   74.0   96.8   99.1    100

