
Ball Detection via Machine Learning

RAFAEL OSORIO

Master of Science Thesis
Stockholm, Sweden 2009

Master's Thesis in Computer Science (30 ECTS credits)
at the School of Computer Science and Engineering
Royal Institute of Technology, year 2009
Supervisor at CSC was Örjan Ekeberg
Examiner was Anders Lansner

TRITA-CSC-E 2009:004
ISRN-KTH/CSC/E--09/004--SE
ISSN-1653-5715

Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.csc.kth.se

Abstract

This thesis evaluates a method for real-time detection of footballs in low-resolution images. The company Tracab uses a system of 8 camera pairs that cover the whole pitch during a football match. Stereo vision makes it possible to track the players and the ball in order to extract statistical data. In this report a method proposed by Viola and Jones is evaluated to see if it can be used to detect footballs in the images extracted by the cameras. The method is based on a boosting algorithm called Adaboost and has mainly been used for face detection. A cascade of boosted classifiers is trained from positive and negative example images of footballs. The objects in these images are much smaller than the typical objects that the method was developed for, and one question this thesis tries to answer is whether the method is applicable to objects of such small sizes.

The Support Vector Machine (SVM) method has also been tested to see if the performance of the classifier can be improved. Since the SVM method is time-consuming, it has been tested as the last step in the classifier cascade, using features selected by the boosting process as input.

In addition to this, a database of football images from 6 different matches has been produced, consisting of 10317 images used for training and 2221 images used for testing. Results show that detection can be made with improved performance compared to Tracab's existing software.

Sammanfattning

Bolldetektion via maskininlärning (Ball detection via machine learning)

This report examines a method for real-time detection of footballs in low-resolution images. The company Tracab uses 8 camera pairs that together cover a whole football pitch during a match. With the help of stereo vision it is possible to follow the players and the ball, and then offer statistics to fans. This report evaluates a method developed by Viola and Jones to see whether it can be used to detect footballs in the images from the 16 cameras. The method is based on a boosting algorithm called Adaboost, which has mainly been used for face detection. A cascade of boosted classifiers is trained from positive and negative example images of footballs. The images used here show small balls that are smaller than the typical objects the method was designed for, and one question this report tries to answer is whether the method is applicable to such small objects.

Support Vector Machines (SVM) have also been tested to see whether the performance of the classifier can be raised. Since SVM is a slow method, it has been integrated as a final step in the trained cascade, with features from the Viola and Jones method used as input to the SVM.

A database consisting of a training set and a test set has been created from 6 matches. The training set consists of 10317 images and the test set of 2221 images. The results show that detection can be done with higher precision compared to Tracab's current software.

Contents

Introduction
  1.1 Background
  1.2 Objective of the thesis
  1.3 Hit rate vs. false positive rate
  1.4 Related Work
  1.5 Thesis Outline
Image Database
  2.1 Ball tool
  2.2 Images
    2.2.1 Training set
    2.2.2 Negatives
    2.2.3 Test set
    2.2.4 Five-a-side
    2.2.5 Correctness
Theoretical background
  3.1 Overview
  3.2 Features
    3.2.1 Haar features
    3.2.2 Integral Image
  3.5 AdaBoost
    3.5.1 Analysis
    3.5.2 Weak classifiers
    3.5.3 Boosting
  3.6 Cascade
    3.6.1 Bootstrapping
  3.7 Support Vector Machine
    3.7.1 Overfitting
    3.7.2 Non-linearly separable data
    3.7.3 Features extracted with Adaboost
  3.8 Tying it all together
Method
  4.1 Training
  4.2 Step size and scaling
  4.3 Masking out the audience
  4.4 Number of stages
  4.5 Brightness threshold
  4.6 SVM
  4.7 OpenCV
Results
  5.1 ROC-curves
  5.2 Training results
  5.3 Using different images for training
    5.3.1 Image size
    5.3.2 Image sets
    5.3.3 Negative images
  5.4 Step Size
  5.5 Real and Gentle Adaboost
  5.6 Minimum hit rate and max false alarm rate
  5.7 Brightness threshold
  5.8 Number of stages
  5.9 Support Vector Machine
  5.10 Compared to existing detection
  5.11 Five-a-side
  5.12 Discussion
Conclusions and future work
  6.1 Conclusions
  6.2 Future work
Bibliography
Appendix 1
  Training set
  Test set 1

Chapter 1

Introduction

In this chapter the circumstances of the problem are presented, as well as the goal of the thesis. Related work is described and an outline of the thesis is given.

1.1 Background

This Master's thesis was performed at Svenska Tracab AB. Tracab has developed real-time camera-based technology for locating the positions of football players and the ball during football matches. Eight pairs of cameras are installed around the pitch, controlled by a cluster of computers. Fig 1 shows how pairs of cameras give stereo vision and how this makes it possible to calculate the X and Y coordinates of an object on the pitch.

Fig 1 - Eight camera pairs cover the pitch, giving stereo vision.

With this information it is possible to extract statistics such as the total distance covered by a player, a heat map where a warmer color means that the player has spent more time in that area of the pitch, completed passes, the speed and acceleration of the ball and of the players, and a lot more. The whole process is carried out in real time (25 times per second). The system is semi-automatic and is staffed with operators during the game. All moving objects that are player-like are shown as targets by the system. The operators need to assign the players to a target, since no face recognition or shirt number identification is done to identify the players. They must also remove targets that are not subjects of tracking, e.g. medics and ball boys.

One big advantage of the system is that it does not interfere with the game in any way. No transmitters or any other kind of device on the players or the ball are used.

1.2 Objective of the thesis

The objective of this Master's thesis is to improve the ball detection using machine learning techniques. Today the existing ball tracking method primarily uses the movement of an object to recognize the ball, rather than its appearance. In this report we will see if it is possible to shift the focus from using the movement to doing object detection in every frame. A key requirement is that the method has to be fast enough for real-time usage.

Tracab's technology is already good at detecting moving balls against a static background, so an aim for this project is to produce reasonable ball hypotheses in more difficult situations such as:

- The ball is partially occluded by players.
- The lighting conditions are uneven, especially when the sun only lights up a part of the pitch.
- Other objects, like the head or the socks of a player, look like the ball.
- The ball is still, e.g. at a free kick.

A classifier is to be trained to detect footballs, based on a labeled data set of ball / non-ball image regions from images captured by Tracab's cameras. When talking about image regions in this report, a smaller sub-window that is part of the whole image is what is meant (left of figure 2). When only talking about an image, the whole image is meant (right of figure 2).

Fig 2 - Example of an image region and an image captured by Tracab's cameras.

The classifier needs to be somewhat robust to changes in ball size and preferably also ball color, since these differ between situations.

One big difference between this project and previous studies of object detection, such as the paper of Viola and Jones, is the size of the object [32]. Here it is very small, only a few pixels wide. A big question is whether the method presented in this report can be applied to objects of this size.

Even with reasonably good detection of the ball it is difficult to tell the ball apart from other objects using only techniques based on the analysis of still images. One way of solving this is to examine the trajectory of the object in a sequence of images, in order to discard objects that do not move like a ball. Also, if the classifier detects the ball most of the time, only missing a few frames at a time, it is possible to do post-processing to calculate the most likely ball path between two detections. These two steps are already used today at Tracab and are not a part of this thesis. Hopefully results from this thesis can provide more accurate detections, thus improving the data input to these steps and reducing the amount of calculation that needs to be done in them.

The aim of this thesis is to evaluate if a machine learning approach can be used to detect a small object such as a ball. More specifically, different algorithms based on the work of Viola and Jones, which uses Adaboost, will be evaluated [32]. In addition to this, an extension of their work using an SVM at the last stage has been tested and evaluated, inspired by the study by Le and Satoh [13]. An overview of how the methods are combined can be seen in fig 3.

Fig 3 - Overview. Image regions of an image are extracted and a cascade of classifiers trained with Adaboost is run for each image region. A bright pixel is searched for before running an SVM classifier on the regions that have not been rejected in earlier stages.

[Figure: input (all image regions) -> cascade of classifiers -> brightness threshold -> SVM classifier -> ball; regions rejected at any stage are classified as non-ball.]

1.3 Hit rate vs. false positive rate

While it is possible to get close to 100% in hit rate, this will probably also lead to a very high false positive rate. Having a hit rate of 80% and 0.1% false positives could be great in some situations and for some applications; in other situations it is not acceptable. In a medical environment, for example, this may not be good enough, because you really want to be certain before giving some risky medicine. In this application there is no fixed limit that has to be reached. Instead it is the ratio between the hit rate and the false alarm rate that is interesting. Results will therefore be presented using Receiver Operating Characteristic (ROC) curves, with the hit rate on the y-axis and the false positive rate on the x-axis. The different results are obtained by varying a sensitivity parameter. In many applications a threshold is varied to get different detection rates. In this thesis the number of stages used for detection and the step size used during detection are varied to get the different rates. ROC-curves are commonly used to present this kind of results, which makes it easier to compare these results with others. A more detailed description of the ROC-curves is given in Section 5.1 along with the results.
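As a hedged illustration (not code from the thesis), one ROC point per detector configuration could be computed like this in Python; the counts below are made-up placeholders:

    # Hit rate (y-axis) vs. false positive rate (x-axis) for one configuration.
    def roc_point(detected_balls, total_balls, false_detections, negative_regions):
        hit_rate = detected_balls / total_balls
        false_positive_rate = false_detections / negative_regions
        return false_positive_rate, hit_rate

    # e.g. two configurations (say, 12 vs. 10 cascade stages; counts are hypothetical):
    curve = [roc_point(1800, 2221, 900, 10**6),
             roc_point(1950, 2221, 4000, 10**6)]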

Somewhat promising results have been achieved, as seen in fig 4. Compared to the existing method at Tracab we can see small improvements. More results can be seen in Chapter 5.

Fig 4 - Results compared to an existing method used by Tracab.

The open source project OpenCV has been used for most of the algorithms in this project [38]. Preprocessing of the image data has been done using Matlab. The SVM part has been done using libSVM [6], which has been integrated with OpenCV.
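For orientation, detection with a trained cascade looks roughly like the following sketch, which uses OpenCV's modern Python API rather than the 2009-era C API used in this project; the cascade file name is hypothetical:

    import cv2

    # A cascade trained on ball examples (hypothetical file name).
    cascade = cv2.CascadeClassifier("ball_cascade.xml")
    frame = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)

    # scaleFactor and the size bounds play the role of the scaling and
    # step-size parameters discussed in this thesis.
    detections = cascade.detectMultiScale(frame, scaleFactor=1.1,
                                          minNeighbors=3,
                                          minSize=(8, 8), maxSize=(24, 24))
    for (x, y, w, h) in detections:
        print("ball candidate at", (x, y), "size", (w, h))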

1.4 Related Work

Much of the research in object detection focuses on face detection and texture classification, while little work has been done on ball detection. Research on face detection and texture classification is therefore also presented. More work has been done on ball tracking in multiple image sequences, but that is not of interest for this work and will thus not be covered. The organization of this chapter is as follows: first, specific research on ball detection is presented; then, the most interesting methods found for face detection and texture recognition.

Much of the research on color-independent ball detection has been done using some kind of edge detection. A Circle Hough Transform (CHT) is used by D'Orazio and Guaragnella for circle detection [9]. Edges are first obtained by applying the Canny edge detector. Each edge point contributes a circle of radius R to an output accumulator space. When the radius is not previously known, the algorithm needs to be run for all possible radii, which requires a large amount of computation. Another disadvantage is that this method is bad at handling noisy images. The sub-windows found by the CHT step are then used to train a neural network with different wavelet filters as features. The Haar wavelet was found to be the best among Daubechies 2, Daubechies 3 and Daubechies 4 with different decomposition levels. A faster version of the Circle Hough Transform is used for ball detection by Scaramuzza et al [27]. Coath and Musumeci use edge-based arc detection to detect partially occluded balls [7].

Ville Lumikero thresholds a grayscale image into binary form [19]. He applies morphological operations such as dilation and erosion to clean the images from noise and to fill up holes in objects. The ball candidates are then thresholded by size and color. The remaining ball candidates are further processed by a tracking algorithm not described here.

Ancona et al [1, 3] implement an example-based method for ball detection by training a Support Vector Machine (SVM; see Burges' tutorial for more information [5]) with positive and negative example images of a ball. Images of footballs of 20x20 pixels are used as input to the SVM. The images are preprocessed with histogram equalization to reduce variations in image brightness and contrast. The SVM is a robust algorithm that searches for the hyperplane that maximizes the minimum distance from the hyperplane to the closest training point. SVMs are described in more detail in Section 3.7.

To do face detection, a bootstrapping technique is used by Osuna et al to improve the performance of an SVM [23]. Misclassified images that do not contain faces are stored and used as negative examples in later training phases. The importance of this step was first shown by Sung and Poggio [30]. To reduce the computation time of the SVM, a reduced set of vectors is introduced by Romdhani et al [26]. By applying the reduced vectors one after the other it may not be necessary to evaluate all of them: if an image is considered very unlikely to be a face early in the chain, it can be discarded fast. Regular detectors based on a single classifier, such as an SVM without the reduced vector set, are slow because they spend as much time evaluating negative responses as positive responses in an image. This approach of a coarse-to-fine cascaded detector has been used widely in the literature.

One of these cascade approaches is the algorithm proposed by Viola and Jones [31, 32]. Their algorithm uses a cascade of boosted classifiers to detect faces and mimics a decision tree behavior. In the first stages of the cascade only a few features are evaluated. Further down the cascade the stages become more complex; thus a lot of non-faces can be thrown away with little effort, while face candidates are examined more thoroughly. Viola and Jones have shown that their final classifier is very efficient and well suited for real-time applications (15 frames/s at 384x288 pixels on a 700 MHz Pentium III). Much of their work is based on Papageorgiou et al, who did not work directly on pixel values [24]. Instead they use a set of Haar filters as features. These features are selected by evaluating them in a statistical way and are then used as input to an SVM, which is used to classify new images. The features in Viola and Jones' report are selected with the help of a boosting technique called Adaboost. Adaboost is a boosting algorithm that enhances the performance of a simple, so-called weak classifier. Haar rectangles are used as features, so the task of the weak learner is to select the rectangle with the lowest classification error on the examples. After each round of learning, the examples are re-weighted so that the misclassified examples are given greater importance in the next step. Interestingly, it has been shown in a study by Freund and Schapire that this approach minimizes the training error exponentially in the number of rounds [10]. Viola and Jones also use a fast way of computing the Haar features by pre-calculating the Integral Image. The Integral Image entry I(x, y) is the rectangle sum to the left of and above the point (x, y). This makes it possible to calculate any rectangular sum in four array references, and thanks to this, the resulting detector is easily shifted in scale and location. A simple tradeoff between processing time and detection rate can be made by experimenting with the step size. This report is largely based on the work by Viola and Jones, but also regards extensions of their work. This is mainly due to the efficiency of the method shown in their study [32], as well as the promising proofs about the error that are explained in Section 3.5.1.

Many extensions of the work of Viola and Jones have been made, for example in the study by Mitri et al, where it is used for ball detection [21]. First, images are preprocessed by extracting edges using a Sobel filter. These images are then used as input to train the classifier. A variant of Adaboost called Gentle Adaboost is used, as proposed by Lienhart et al [15]. There it is shown that for the face detection problem, Gentle Adaboost outperforms both Discrete Adaboost (the original Adaboost is nowadays often referred to as the discrete version) and Real Adaboost, which was introduced by Schapire and Singer as an extension of the original version [28].

Lienhart et al also extend the work done by Viola and Jones by introducing the Rotated Integral Image [14]. With the help of this, an extended set of Haar features is used, which improves the detection rate of the classifier while the required computation time does not increase to the same extent.

Some improvements to the weak classifiers are made by Rasolzadeh [25]. The feature responses of the Haar wavelets for both the positive and the negative examples used in Viola and Jones are modeled using normal distributions. By doing this it is possible to improve the discriminative power of the weak classifiers by using multiple thresholds (i.e. two new thresholds are introduced where the two distributions intersect). In Viola and Jones' algorithm a single threshold is used by the weak classifiers to separate the two classes. They further show that this multi-thresholding procedure is a specific implementation of response binning. By calculating two histograms of the feature response, for the positive and the negative examples, over a certain number of bins, it is possible to determine multiple thresholds without modeling the response as normal distributions. The weak classifier hypothesis consists of comparing the two histograms of the feature response. This change can be implemented by directly replacing the old weak classifiers, without any other major changes to the algorithm. Their results suggest some improvement in detection rate while keeping the computing time low.

Another extension of the Viola and Jones algorithm, which combines their work with a Support Vector Machine at the last stage of the cascade, is presented in a study by Le and Satoh [13]. Two new stages are added to the cascade of classifiers. First, a stage to reject non-face regions more quickly has been added; it does so by using a larger window size and a larger step size. As the last stage of the new cascade an SVM is used. The Haar features that were selected by Adaboost in the previous stages are used to train the SVM; as they are already calculated, no recalculation is needed. In contrast to the extension of the Haar features proposed by Lienhart et al [14], a reduction of these features is implemented here. This is mainly done to reduce training time, and their experiments show that for rejection purposes no efficiency is lost. Wu et al [33] have been able to make an algorithm for training the cascade that is roughly 100 times faster than that of Viola and Jones. The difference is that Wu et al only train the weak classifier once per node, instead of once for each feature in the cascade.

Liu et al map the feature response into two histograms (one for the positive set and one for the negative set) and search for the feature that maximizes the divergence of the two classes [17]. This is done by using the Kullback-Leibler (KL) divergence as the margin of the data. Results are promising compared to Adaboost, but the final speed of the classifier is slower (2.5 frames/s at 320x240 pixels on a P4 1.8 GHz).

Lin and Liu train 8 different cascades to handle 8 different types of occluded faces [16]. If a sample is detected as a non-face, only the trained classifiers that contain features that do not intersect with the occluded part are evaluated. A majority of these classifiers should give positive responses if the sample was indeed a face. The sample is then evaluated with one of the new cascades. These additional stages result in a three times longer computing time (18 frames/s at 320x240 pixels on a P4 3.06 GHz). To avoid overfitting, the mechanism for selecting weak learners during boosting is reconsidered. Influenced by the Kullback-Leibler Boost, they use the Bhattacharyya distance as a measure of the separability of the two classes. They claim that the Bhattacharyya coefficient is much easier to compute than the Kullback-Leibler distance while maintaining the same performance.

A totally different approach to feature detection, called SIFT, has been developed by Lowe [18]. It is a highly appreciated method that has been used widely and proven effective [20]. It has a high matching percentage and is robust to lighting variations compared to other local feature methods. The most interesting part of this work is how the features are described. Points of interest are found by looking for peaks in the gradient image. Descriptors representing the local image gradients are extracted from the area around these points of interest. A 4x4 grid is constructed, and for each of these bins a histogram of 8 gradient orientations is computed. This representation has the advantage that it is good at capturing the small distinctive spatial patterns of an object. Rotation invariance is achieved by relating all the gradient orientations to a reference orientation. When a new image is to be classified, these descriptors are matched to the descriptors in the database trained with example images, using a nearest neighbor method. For further reading, a study by Mikolajczyk and Schmid evaluates an improvement to SIFT and compares different local descriptors [20].

A similar approach to SIFT is a feature descriptor called SURF [4]. The main difference is that instead of gradients, Haar features are calculated around a point of interest and represented as vectors. By using the Integral Image, the calculations can be made faster.

Local Binary Patterns (LBP) is another approach, used for texture classification by Ojala et al [22]. For each pixel, an occurrence histogram of features is calculated. A feature is represented by the signs of the differences in value between the center pixel and its neighbors (positive or negative responses). The neighbor pixels are chosen as equally spaced pixels on a circle of radius R. In the report it has been tested with different values of R and different numbers of neighbors. With this representation the output can take as many as 2^P different values, P being the number of neighbor points. Certain local binary patterns are overrepresented in the textures, and these patterns share the property of having few spatial transitions. These are called uniform, and the definition is that they contain less than three 0/1 changes in the pattern. By making the representation rotationally invariant and by only considering uniform patterns, this amount is reduced significantly. Note that the occurrence histogram does not save the spatial layout of the image; it only stores information about the frequency of the local features.
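As a hedged sketch (assuming numpy; not code from the cited work), the basic 8-neighbor, radius-1 LBP code and its occurrence histogram can be computed as follows; rotation invariance and the uniform-pattern restriction are omitted:

    import numpy as np

    def lbp_8_1(img):
        # One bit per neighbor: 1 if the neighbor is >= the center pixel.
        img = img.astype(np.int32)
        center = img[1:-1, 1:-1]
        shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                  (1, 1), (1, 0), (1, -1), (0, -1)]
        code = np.zeros_like(center)
        for bit, (dy, dx) in enumerate(shifts):
            nb = img[1 + dy:img.shape[0] - 1 + dy,
                     1 + dx:img.shape[1] - 1 + dx]
            code |= (nb >= center).astype(np.int32) << bit
        return code  # 2^8 = 256 possible patterns for P = 8 neighbors

    # The texture descriptor is the histogram of the codes, not their layout:
    # hist = np.bincount(code.ravel(), minlength=256)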

1.5 Thesis Outline

The rest of the thesis is organized into five chapters. In Chapter 2, the database of images needed for training and testing is described. Chapter 3 explains the theory behind the methods used: both the method proposed by Viola and Jones, based on a cascade of classifiers trained with Adaboost, and the method of using a Support Vector Machine as a classifier. It also describes how these two methods can be combined in different ways. Chapter 4 describes how the methods were used in this specific problem setting. Experimental results are shown and discussed in Chapter 5, and conclusions are drawn in Chapter 6, along with proposed future work.

Chapter 2

Image Database

The image database was produced semi-automatically from example images taken from the Tracab system cameras. This chapter describes how this was done and how the image sets have been created.

2.1 Ball tool

The images are extracted using a tool developed by Eric Hayman at Tracab. The procedure was to look at an image, click on the ball, and try to center the position as exactly as possible. At the same time the images were labeled with the degree of freeness of the ball and with the contrast in the image.

A free ball is not in contact with anything and is completely visible. The higher the freeness number of the ball, the closer it is to other objects. In the same way, the higher the contrast number, the harder it is to distinguish the ball from the background. Examples can be seen in fig 5.

The resulting information is a text file holding the path to the image, the x and y position of the ball in the image, the degree of freeness and the degree of contrast of the ball. Later this information is used to extract image regions that consist of the ball in the center along with some of the background. This way it is easy to do tests using image regions of different sizes for training, as described in Section 5.3.1.

Example (row created by the ball tool):

    Image path                         x       y      free  contrast
    C:\HEIF-GAIS\ball.000001.04.jpg    116.53  25.10  2     3
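A minimal sketch of reading such a row in Python (the format is taken from the example above; the parser itself is a hypothetical illustration):

    def parse_ball_row(line):
        # Split from the right so a path containing spaces stays intact:
        # the last four fields are always x, y, freeness and contrast.
        path, x, y, free, contrast = line.rsplit(None, 4)
        return path, float(x), float(y), int(free), int(contrast)

    row = r"C:\HEIF-GAIS\ball.000001.04.jpg 116.53 25.10 2 3"
    print(parse_ball_row(row))
    # ('C:\\HEIF-GAIS\\ball.000001.04.jpg', 116.53, 25.1, 2, 3)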

Description of the two scales:

Freeness

1. Free
2. Close to a player/line but not in contact
3. In contact with the player/line but not over it
4. Over the player/line
5. Partially occluded by a player

Contrast

The contrast scale ranges from 1 to 4, where 1 represents a very high contrast where the ball is easily distinguishable from the background, and 4 represents a very low contrast where it is difficult to separate the football from the background.

2.2 Images

The images taken by the 16 cameras and saved by the ball tool all have a resolution of 352x288 pixels. The whole images are saved by the ball tool as RGB images but are converted into grayscale for training and detection. The transformation from RGB to grayscale has been done in the same way as calculating the luminance Y of the YUV color space [35]: Y = 0.299R + 0.587G + 0.114B. Each channel has color values in the range [0..255], so Y lies in [0..255] as well.
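As a small illustration of this conversion (assuming numpy; not the Matlab code used in the project):

    import numpy as np

    def rgb_to_luma(rgb):
        # Y = 0.299 R + 0.587 G + 0.114 B; for 8-bit channels Y stays in [0, 255].
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        return (0.299 * r + 0.587 * g + 0.114 * b).astype(np.uint8)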

The images have been split into two groups: a training set and a test set. The training set, along with the negative samples, is used to train the cascade, while the test set is used to measure performance.

2.2.1 Training set

10317 positive example images were extracted for training from 6 different matches in Allsvenskan, the Champions League and an international match from 2007, all with different lighting conditions.

Table 1 shows the number of images that have a value equal to or lower than the contrast/freeness measures indicated on the left and at the top of the table.

Table 1 - The different types of images in the training set.

    Contrast/Free     1      2      3      4      5
    1                75     83     83     84     84
    2              3258   4988   5898   5982   6031
    3              4651   7447   9194   9446   9582
    4              4840   7845   9777  10120  10317

2.2.2 Negatives

To construct negatives, the same images are used as for the positive samples. The ball is removed from the image by setting the pixels in the area of the ball to black. The whole 352x288 image is then saved and labeled as a negative image. During training, image regions that are detected as positives are extracted from these images and used as negatives, since we know that there is no football in them. This procedure is called bootstrapping and is described further in Section 3.6.1.

2.2.3 Test set

2221 images were extracted from other sequences in the same matches, and testing is done on these. The image set is expected to have the same ratio of different kinds of images as a match has in general. Tests are done on images from the same matches as used in training, to avoid the problem of not having enough variety in the training data. This set is called test set 1, and the distribution of different images can be seen in table 2. The ratio of the different image types is similar to the ratio in the training set. The ratios in percent for the two sets can be seen in Appendix 1.

Table 2 - The different types of images in test set 1.

    Contrast/Free     1      2      3      4      5
    1                 6     12     12     12     12
    2               659   1071   1277   1284   1287
    3               951   1612   2097   2131   2139
    4               972   1643   2149   2202   2221

Tests have also been done on a match not used in training, though we should not expect good results in general on matches that have not been seen during training, when having only 6 different matches to train on. From this match there are 884 images. This set is called test set 2.

Fig 5 - Examples of image regions with different properties. (a) 2 on the free scale and 3 on the contrast scale. (b) 1 on the free scale and 2 on the contrast scale.

2.2.4 Five-a-side

New images have been collected from a five-a-side match where the cameras are positioned closer to the pitch. This is possible since the pitch in a five-a-side match is about 16 times smaller than normal. This setup gives higher-resolution images of the footballs, since they come closer to the cameras. The footballs are now between 2 and 8 pixels in radius in the images, which is significantly larger than before, and the texture of the ball becomes visible. The training set contains a total of 5937 images and the test set contains 2068 images, extracted in the same way as the training set. For this set, no analysis of the quality (freeness and contrast) of the images has been done. Also, to save time, the process of extracting footballs from the images has been made faster by mostly including easy targets.

2.2.5 Correctness

The data set contains football images of variable size and with a wide range of lighting conditions. Balls that were close to the cameras are larger than those far away from the cameras; they can vary by up to a couple of pixels in diameter. It is questionable whether the variation in lighting conditions in the extracted images is enough to capture the variance that exists in reality between all different matches. Ideally there would also be images from a much wider range of matches, to be able to generalize completely. This has not been done due to the large amount of time it takes to extract footballs from images manually. The same can be said about the problem of having to deal with different kinds of footballs. They are not always white: some are black-and-white checkered and others are even red. This could be solved by training several cascades.

It is also uncertain whether the test set represents a general set of images. To be able to detect the football as often as possible, it is optimal to have a training set that represents all the different image types that are present during a game.

Hopefully this is achieved automatically when taking a wide range of images without any special selection process.

Another thing that could affect the results negatively is the labeling of the data that has been done in table 1 and table 2. As always when humans are involved, this labeling is a result of subjective reasoning. Also, according to research done by gestalt psychologists, the eye is easily fooled [37].

Chapter 3

Theoretical background

This chapter gives an overview of the general approach used in the project and describes the theory needed to understand the method. It covers these areas: training a boosted classifier using Adaboost, combining the trained classifiers into a coarse-to-fine cascade, and training a Support Vector Machine to be used as the last stage of the classifier.

3.1 Overview

The algorithm used in this report is largely based on the work of Viola and Jones from 2001 [31]. It is a popular method that has been widely used (see Related Work). Some proofs about its generalization ability and the bound on the error have been made, which makes the algorithm very interesting; more about this can be found in the analysis of the method in Section 3.5.1. The method has mainly been developed and evaluated for face detection rather than ball detection.

The algorithm works in the following way. A classifier is trained using positive and negative image regions of an object, all of the same size. The classifier consists of several so-called weak classifiers (3.5.2), built from Haar-like features (3.2.1), which are trained using a boosting technique called Adaboost (3.5). The boosted weak classifiers are combined into a coarse-to-fine cascade of classifiers (3.6). The idea is to reject a lot of non-objects in the early stages, where the computation is light, reducing processing time, while positives get processed further. When classifying an image region, the classifier outputs whether the object is detected or not, like a binary classifier. The classifier is easily scaled and shifted to be able to detect objects at different locations and of different sizes in an image.

This algorithm is combined with a Support Vector Machine (SVM) at the end, as described in Section 3.7. SVMs have been reported to be good classifiers [26]. The SVM only needs to evaluate the image regions passed on by the Adaboost classifier in the previous stage, which makes it faster; otherwise, a big disadvantage of the SVM method is that it may be too slow for real time [23]. An overview of how the method works can be seen in fig 6.

Fig 6 - System overview. A cascade of classifiers trained with Adaboost is combined with a brightness threshold and an SVM classifier as the last stage. Image regions that make it through the system without being rejected are classified as footballs. Notice: same figure as fig 3.

3.2 Features

A feature is a characteristic that is used to distinguish objects from non-objects. The two main reasons why features are used instead of the pixel values directly are that it improves speed (explained in more detail in Section 3.2.2 about Integral Images) and that features can capture different kinds of properties in an image. Any feature can be used, such as the total sum of an area, variance, gradients etc.

3.2.1 Haar features

In this thesis, differences in pixel value between adjacent rectangles are used as features. The features can be seen in fig 7. The sum of the pixels in the white rectangle is subtracted from the sum of the pixels in the black rectangle. The resulting difference that comes from calculating the feature is called the feature response.


As shown in fig 7, 14 different features are used. With a base resolution of the detector of 12x12 pixels, the total number of possible features in my setup is 8893. This is a large number of features, but as we will see, only a portion of the entire set will be needed.

The features are called Haar-like features because they mimic the behavior of the Haar wavelet basis. Much like gradients, they capture the change in pixel value rather than the pixel value itself. They are insensitive to differences in mean intensity and scale.

Fig 7 - The extended set of features, as suggested by Lienhart et al [14].

Features 1a-b, 2a-d and 3a in figure 7 were in the original set of features used by Viola and Jones. The rest of the features were introduced by Lienhart et al [14]. The new set consists of 7 additional rectangle features that have been rotated by 45 degrees. Having a larger set of features makes it possible to capture the properties of the object more accurately, but it also affects the time it takes to train the classifier, since there are more features to evaluate. However, it does not automatically mean that a larger number of features will be used by the final classifier, which would affect its speed. To speed up the calculation of the features, an integral image is used.

3.2.2 Integral Image

An Integral Image is a matrix made to simplify the calculation of the sum of an upright rectangular area in an image. It is a pre-calculation step made to speed up other calculations. The value of the Integral Image (II) at II(x,y) is the sum of all the pixels of the original image (OI) up and to the left of OI(x,y). An example can be seen in fig 8.

    Original image          Integral Image

    155 201 226             155 356 582
     98  78  48             253 532 806
     14 111  44             267 657 975

Fig 8 - The original image (left) and the corresponding Integral Image (right).

This makes the calculation of the sum of any rectangle in the image faster. The formula is:

    II(x,y) = II(x-1,y) + II(x,y-1) - II(x-1,y-1) + OI(x,y)    (1)

where OI(x,y) is the original image and II(x,y) is the Integral Image.

Once the Integral Image is calculated, the rectangle D in figure 9 can be computed from the Integral Image values at the four corner points 1, 2, 3 and 4 by:

    D = 4 - 2 - 3 + 1 = (A+B+C+D) - (A+B) - (A+C) + A    (2)

Fig 9 - Any rectangle D can be computed from the Integral Image by: 4 - 2 - 3 + 1.

With the help of the Integral Image, any rectangle can thus be calculated with only 4 array references. The difference between two adjacent rectangles (the edge features) can be calculated with six array references, while three adjacent rectangles (the line features) require eight array references.
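The following is a small runnable sketch of equations (1) and (2) (assuming numpy; not the project's actual code). A leading zero row and column avoid special cases at the image border:

    import numpy as np

    def integral_image(oi):
        # II(x,y) = sum of all pixels up and to the left of (x,y), eq. (1).
        ii = oi.cumsum(axis=0).cumsum(axis=1)
        return np.pad(ii, ((1, 0), (1, 0)))

    def rect_sum(ii, top, left, h, w):
        # Any upright rectangle in four array references, eq. (2).
        return (ii[top + h, left + w] - ii[top, left + w]
                - ii[top + h, left] + ii[top, left])

    img = np.arange(16).reshape(4, 4)
    ii = integral_image(img)

    # A two-rectangle (edge) Haar feature: difference between adjacent halves.
    response = rect_sum(ii, 0, 2, 4, 2) - rect_sum(ii, 0, 0, 4, 2)
    print(response)  # 16 for this toy image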

Since the rectangles are small (the training samples are 8-15 pixels in both width and height), and therefore so are the features, the Integral Image does not help in all cases. When the rectangles are small enough it is faster to do the calculations directly on the pixels; this is true when the rectangles are smaller than 3 pixels. But the loss in time from using the Integral Image in those cases is very small. The calculations with the Integral Image are done in O(1).

For the rotated rectangles a different Integral Image is needed, called a Rotated Integral Image. The idea is the same as before, and it is calculated with the formula from Lienhart et al [14]:

    RII(x,y) = RII(x-1,y-1) + RII(x+1,y-1) - RII(x,y-2) + OI(x,y) + OI(x,y-1)    (3)

where RII is the Rotated Integral Image.

3.5 AdaBoost

The principal idea of boosting is to combine many weak classifiers to produce a more powerful one. This is motivated by the idea that it is difficult to find one single highly accurate classifier. The T weak classifiers h_t(x) are combined into a strong classifier by:

    H(x) = sign( sum_{t=1..T} a_t * h_t(x) )    (4)

where the weights a_t are found during boosting.

The weak classifiers are used to select the features, among the large number of features, that best separate the two classes: objects and non-objects. This kind of feature selection was first done in a statistical manner by Papageorgiou et al [24]. By using Adaboost this selection process can be optimized when the best feature is not obvious.

The boosting step of the algorithm is done by re-weighting the examples, putting more weight on the difficult ones. A new round of feature selection is then done with the new distribution. These weak classifiers add up to strong classifiers, which are combined to construct a cascade of strong classifiers.

3.5.1 Analysis

It has been proven by Schapire and Freund that the error of the final classifier drops exponentially fast if it is possible to find weak classifiers that classify more than 50% of the examples correctly [29].

The final training error is at most:

    prod_{t=1..T} 2 * sqrt( e_t * (1 - e_t) )    (5)

where e_t is the error of the t-th weak hypothesis. The bound on the error of the final classifier improves whenever any of the weak classifiers is improved.

In the same article it was also shown that the generalization error of the final classifier is, with high probability, bounded by the training error. This means that the final classifier is likely to generalize well on samples it has not seen before. They show that, with high probability, the generalization error is less than

    Pr[H(x) != y] + O( sqrt( T*d / m ) )    (6)

where Pr[] is the empirical probability on the training sample, T is the number of rounds of boosting, m is the size of the sample and d is the VC-dimension (Vapnik-Chervonenkis dimension) of the base classifier space. The VC-dimension of a hypothesis space H defined over an instance space X is the size of the largest finite subset of X shattered by H. Further explanation of the VC-dimension can for example be found in the tutorial by Sewell [34].

Schapire and Freund's analysis implies that overfitting may be a problem if training is run for too many rounds. However, their tests showed that boosting does not overfit even after thousands of rounds. They also found that the generalization error decreases even after the training error has reached zero. These are promising results that motivate the use of this method.

3.5.2 Weak classifiers

A weak classifier is a simple classifier that only has two prerequisites: it must be better than chance, i.e. classify more than 50% of the samples correctly, and it must be able to handle a set of weights over the training examples. The weights are needed in the boosting step.

In this case the weak classifier h_j consists of one feature f_j along with its threshold theta_j:

    h_j(x) = 1 if p_j * f_j(x) < p_j * theta_j, otherwise 0    (7)

The feature response f_j(x) is compared to the threshold theta_j. The variable p_j (the parity) is used to indicate the direction of the inequality sign.

In order to find the best weak classifier at each round of training, the feature responses from all samples are calculated, and by applying a threshold it is possible to separate the samples into two classes. During training the optimal threshold is determined for each feature, optimal meaning the threshold that minimizes the classification error of that feature.

In each step of boosting, the feature and its corresponding threshold with the lowest classification error are selected, along with a weight inversely proportional to the classification error of that feature. This weight can be seen as a measure of the importance of that particular weak classifier.

The error is calculated with respect to the weights w_i of the examples (i.e. the error is the sum of the weights of the misclassified examples):

    e_j = sum_i w_i * | h_j(x_i) - y_i |    (8)

where y_i is the true label of example x_i.
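A small sketch of this threshold search for a single feature (assuming numpy; an exhaustive search for illustration, not the project's optimized training code):

    import numpy as np

    def best_stump(responses, labels, weights):
        # responses: feature response per sample; labels in {0, 1};
        # weights sum to 1. Returns the (error, threshold, parity)
        # minimizing the weighted error of eq. (8).
        best = (np.inf, None, None)
        for theta in np.unique(responses):
            for parity in (1, -1):
                pred = (parity * responses < parity * theta).astype(int)
                err = np.sum(weights * (pred != labels))
                if err < best[0]:
                    best = (err, theta, parity)
        return best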

3.5.3 Boosting

In order to train and combine several weak classifiers, instead of using one more complex classifier, Adaboost repeats the training step with a modified distribution of the training set of examples. For each round, more emphasis is put on the more difficult examples: those examples which were wrongly classified by the previous weak classifier are given higher weights than the correctly classified examples. The weights are then normalized. This is done at each round of training until the total number of rounds is reached.

Different variants of Adaboost have been evaluated for face detection, and two of them have been used and compared for ball detection in this report [15]. Discrete Adaboost is the original version proposed by Freund and Schapire [29]. According to Lienhart et al [15], Gentle Adaboost is the most successful algorithm for face detection. Real Adaboost uses class probability estimates to construct real-valued contributions. They are all similar in computational complexity during classification, but differ somewhat during learning in the way they update the weights at each round of boosting.

The main idea is still the same in all three cases:

General pseudo-code for Adaboost

    Initialize weights w = 1/m (normalized)
    For t = 1..T:
        Train the weak learners using distribution w:
        fit the weak classifiers to the data and
        calculate their errors with respect to the weights.
        Choose the weak classifier with the lowest error
        and update w by increasing the weights of the
        misclassified examples.
    Output the final hypothesis as a weighted combination
    (weighted according to the errors) of the weak classifiers.
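A compact runnable version of Discrete Adaboost in this spirit (a sketch assuming numpy, with labels in {-1, +1} and a caller-supplied train_stump function; the weight update and the classifier weights follow the standard formulation):

    import numpy as np

    def adaboost(train_stump, X, y, T):
        # train_stump(X, y, w) must return a classifier h with h(X) in
        # {-1, +1} and weighted error below 0.5 (better than chance).
        m = len(y)
        w = np.full(m, 1.0 / m)
        ensemble = []
        for _ in range(T):
            h = train_stump(X, y, w)
            pred = h(X)
            eps = np.sum(w * (pred != y))
            alpha = 0.5 * np.log((1 - eps) / eps)  # importance of this round
            w *= np.exp(-alpha * y * pred)         # emphasize mistakes
            w /= w.sum()                           # re-normalize
            ensemble.append((alpha, h))
        # Final strong classifier, eq. (4): sign of the weighted vote.
        return lambda X: np.sign(sum(a * h(X) for a, h in ensemble))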

For more information on the different variants of Adaboost, see the comparative study by Friedman et al [11].

3.6 Cascade

By arranging several strong classifiers into a simple-to-complex cascade it is possible to reduce computation time. In the first stages of the cascade, simpler and faster classifiers are used. Since the majority of the image regions going into the first stage are non-objects, many of them are easily rejected by the early stages, while the majority of the positives are let through. Once an image region has been rejected at some stage, it is discarded as a non-object for the rest of the cascade and thus not evaluated further. A positive image region goes through the whole cascade and is evaluated at every stage, requiring further processing, but in total this is a rare event. Deeper in the cascade the classifiers get more and more complex, requiring more computation time. Also, with increasing stage number, the number of weak classifiers needed to achieve the desired false alarm rate at the given hit rate increases.

    The cascade of classifiers is trained by introducing goals in terms of positive

    detections and number of false positives. For example, to achieve a final

    classifier with a hit rate of 90% and a false positive rate of 0.1%, each stage in

    a classifier with 10 stages needs to have a hit rate of 99% (0.99^10 = 0.9) but

  • 23

    only a maximum in false positive rate of 50% (0.5^10 = 0.001). Each stage

    reduces both values but since the hit rate is close to one the result of the

    multiplication stays close to one while the result of the multiplication of the

    smaller false positive rate rapidly decreases towards zero. This is all done

    under the assumption that the different stages in the classifier are independent

    of each other.
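The per-stage targets follow directly from the overall goals under this independence assumption; a small Python sketch (with the example numbers above) makes the relation explicit:

    # Per-stage training targets derived from the overall cascade goals,
    # assuming independent stages (values from the example above).
    overall_hit_rate = 0.90
    overall_false_alarm = 0.001
    n_stages = 10

    stage_hit_rate = overall_hit_rate ** (1.0 / n_stages)        # ~0.990
    stage_false_alarm = overall_false_alarm ** (1.0 / n_stages)  # ~0.50

    print(f"per-stage hit rate >= {stage_hit_rate:.3f}, "
          f"per-stage false alarm <= {stage_false_alarm:.3f}")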

The cascade is formed by setting a minimum hit rate and a maximum false positive rate for every stage. The stage is trained and features are added until the desired hit rate and false positive rate have been reached. By specifying these goals it is possible to obtain a classifier with the desired properties.

    3. 6. 1 Bootstrapping

    A new negative set is constructed for each stage by selecting those image

    regions that were falsely detected by the classifier using all previous stages. A

    false detection as the one in figure 10 would be added to the negative set. This

    method is called bootstrapping. Intuitively this makes sense as we expect the

    new examples to help us get away from the current mistakes.

    Since at each stage the classifier becomes more and more accurate, it becomes

    more and more difficult to find false positives. Also the false positives get more

    and more similar to the true detections making the separation task harder. As a

    result, deeper stages are more likely to have a high rate of false positives.
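A hedged Python sketch of this bootstrapping step is given below; `accepts` and `regions_of` are hypothetical callables standing in for the real cascade and window scanner, not functions from the actual system.

    def mine_hard_negatives(accepts, regions_of, images_without_ball, n_needed):
        """Collect false positives of the cascade trained so far to use
        as the negative set when training the next stage. `accepts` and
        `regions_of` are hypothetical stand-ins for the real detector."""
        negatives = []
        for image in images_without_ball:
            for region in regions_of(image):
                if accepts(region):              # a false positive
                    negatives.append(region)
                    if len(negatives) >= n_needed:
                        return negatives
        # Deep stages may fail to find enough false positives, as noted above.
        return negatives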

Fig 10 A typical hit along with a false detection. The image region of the player's shoe is used as

    a negative sample in the training of the next stage.


    3. 7 Support Vector Machine

    Support vector machines (SVM) are used for data classification. The basics of

SVMs needed to understand the method are presented here.

In the same way as in the Adaboost case, a training set and a test set are needed to train and evaluate the SVM. Given a set of labeled data points (the training set) belonging to one of two classes, the SVM finds the hyperplane that best separates the two classes. It does this by maximizing the margin between the two classes. In the left image of fig 11 we can see an example of a hyperplane that separates the two classes with a small margin. In the right image the hyperplane that maximizes the margin has been found by the SVM. The points that constrain the margin are called support vectors.

    Fig 11 SVM finds the plane that maximizes the margin. The image to the right is considered to

have greater generalization capabilities. Image taken from DTREG's homepage [36].

    3. 7. 1 Overfitting

Fig 12 shows how a classifier that is fitted well to the training set may not generalize well. In image (a) the classifier has learnt to classify all training examples correctly. As seen in image (b) this results in some wrongly classified examples on the test set. In image (c), however, we see a classifier that, although it classifies one example from the training set wrongly, classifies all the examples in the test set correctly (as seen in image (d)). The latter classifier generalizes better because it allowed a wrongly classified example during training. This can be handled by introducing a penalty parameter C that weights the samples according to how they were classified. Misclassifying a sample now has a cost, and increasing C makes misclassification more expensive, adjusting the model more closely to the training data.


(a) Training data and an overfitting classifier   (b) Applied on test data

    (c) Training data and a better classifier (d) Applied on test data

Fig 12 An overfitting classifier and a more general classifier. Images from the LIBSVM guide [6].

    3. 7. 2 Non-linearly separable data

The examples in fig 11 show two linearly separable classes. With more complicated data, a line may not be enough to separate the two classes. To cope with this problem the data is mapped into a higher (maybe infinite) dimensional space by a function φ [6]. The function φ can take many forms. In this new space it may be possible to find a plane that separates the data. The problem with going into a higher dimensional space is that the calculations get more expensive, which makes the method slow. Therefore the kernel trick, first introduced by Aizerman et al, is used to solve this [1]. Since all SVM calculations can be done using the dot product between the training samples, the operations in the high dimensional space do not have to be performed. Instead we can try to find a function K(x_i, x_j) = φ(x_i) · φ(x_j), called the kernel function. Examples of popular kernels are the polynomial kernel, the Radial Basis Function (RBF), the linear kernel and the sigmoid kernel. As proposed by LIBSVM, the RBF is a good choice to start with:

K(x_i, x_j) = exp(-γ ||x_i - x_j||²),  γ > 0    (9)

The linear and the sigmoid kernels are special cases of the RBF kernel for certain values of the parameters (C, γ) [6]. The polynomial kernel is more complex in terms of the number of parameters to select.

When using the RBF kernel there are two parameters to select: C and γ. Since it may not be useful to achieve high training accuracy, these parameters have been evaluated by doing cross-validation on the training set. This is done by


dividing the training set into two parts, one for training and the other for testing. This is done repeatedly with different partitions to get a more accurate result. The parameter values have been tested by increasing them logarithmically and then doing cross-validation to measure the performance. The cross-validation helps us get around the problem of overfitting.
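A minimal sketch of such a logarithmic grid search is shown below, here using scikit-learn's SVC (which wraps LIBSVM) rather than the LIBSVM tools themselves; the data and the parameter ranges are placeholders.

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Placeholder data standing in for scaled feature responses.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 10))
    y_train = (X_train[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

    # Logarithmic grids for C and gamma, evaluated with 5-fold
    # cross-validation on the training set (ranges are illustrative).
    param_grid = {
        "C":     2.0 ** np.arange(-5, 16, 2),
        "gamma": 2.0 ** np.arange(-15, 4, 2),
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)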

    3. 7. 3 Features extracted with Adaboost

    Having a good classifier does not make sense unless the data points represent

    something meaningful. The idea is to use features extracted from some stage of

    the cascade constructed with Adaboost. Fig 12 could then be interpreted as

    having the feature response of one feature on the x-axis and the feature

    response of another feature on the y-axis. But unlike the examples in fig 11 and

fig 12, more than two features need to be used. This does not change any of

    the theory except that we move into a higher dimensional space.

The samples used for training were gathered by letting an Adaboost-trained classifier classify the image regions in the training set as in Chapter 2. The feature responses from the samples classified as positives were chosen as the positive training set, and the feature responses from some of the false positives from each image were selected to be part of the negative set. A detailed

    explanation on how this was done is given in Section 4.6.

    3. 8 Tying it all together

    To be able to add SVM as the last stage of the classifier we need to decide from

    which stage to take the features and how many features to use. Results in a face

detection study by Le and Satoh suggest that the switch from the Adaboost classifier to the SVM classifier can be made at any stage [13]. The same study also shows a big increase in performance when going from 25 to 75 features, while the difference between using 75 and 200 features

    is not significantly large. Since the objects to detect are different in this report

    and in the study by Le and Satoh there is no guarantee that the optimal number

    of features is the same. The speed of the classifier depends very much on the

    number of features that are used, so it is important to find an optimal tradeoff

    here.


    Chapter 4

    Method

    This section describes how boosted classifiers described in section 3 have been

    trained and how they are used on the specific task of detecting footballs. It

    describes how the classifiers are shifted in both location and scale across the

image during detection. To reduce the number of false detections a brightness threshold is introduced, and a mask is used to restrict detection to the only area where it is interesting to search for the ball: the pitch. A

    description of how the features for the SVM have been collected and tested is

    given.

    4. 1 Training

    Several different cascades are trained as described in Chapter 3 and the

    performances of these classifiers can be seen in Chapter 5.

    Image regions of sizes between 8x8 and 15x15 pixels have been used to train

    four different classifiers. Bigger image regions result in training samples that

include more of the background. If no background was included in the

    image regions used for training, the classifier would only learn the texture of

    the ball. Since the resolution is low, it is very difficult to distinguish any texture

    on the footballs. The idea here is therefore to include some of the background

    to give the classifier more information to work with. By including the

    background the classifier has the possibility of finding the difference between

the dark background and the bright ball. How much of the background should be included in the samples is not clear. If there is too little background,

    maybe the classifier will not be able to capture the property that the ball is

    white and round compared to the darker background. On the other hand, if too

much of the background is used the classifier will probably base its detections on the background instead.

The effect of using different parts of the training set has been evaluated. One classifier has been trained with easier images and another with harder images. So-called easier images are the images labeled with contrast 1 and 2

    and labeled with freeness 1, 2 and 3. The harder images have an additional

1747 images that have been labeled with contrast 3. Using harder images during training, where the ball is occluded and the contrast is bad, should result in better detection when the ball is close to a player or occluded in some other way, but it also makes it harder to distinguish between a ball and a non-ball. The rejection process will be more forgiving, letting more examples

    through the cascade since the training images have a wider diversity. One can

    expect that more false detections will be made, requiring a higher number of

    stages to reach the same level of false detections.

    Two classifiers have been trained to evaluate the importance of using a high

    number of negative samples. 2000 and 5000 false positives have been extracted

    to use as negative samples in the bootstrapping step.

    Discrete Adaboost and Gentle Adaboost, two different kinds of boosting

algorithms, are evaluated, and the minimum hit rate and maximum false alarm

    rate are varied to train three new classifiers.

    An overview of how the classifier is used can be seen in fig 6.

    4. 2 Step size and scaling

    As mentioned in Chapter 3, detection is done by sweeping a window of

    different sizes over the image, running the classifier at each image region.

Since the footballs are not perfectly aligned and have a small variation in position and size, the trained detector is somewhat insensitive to small shifts. An object can therefore be detected even when it is not perfectly centered.

    However, if not going through all possible image regions some objects are

    likely to be missed. The step size also affects the detector speed. With a step

    size of 1 pixel and running 10 different window sizes there are around one

    million image regions that the classifier needs to be run on. By simply

    increasing the step size to 2 (hence skipping one pixel at each step) the number

    of image regions can be halved and thus halving the total time of the

    classification. The step size is therefore a tradeoff between detection rate and

    time. When having such small objects to detect as is the case in this report, it is

    very likely that a small step size is required. The shift in location and window

    size has been tested with different step sizes and results are shown in Section

    5.3.

    Since the balls we want to detect range in size between 3 and 7 pixels in

    diameter there is no reason to search for objects of other sizes. The detection

    window is therefore scaled until it is larger than the biggest object possible, but

not more. Scaling can be done either by scaling the image region itself or by scaling the features. In this case the features are scaled, since this comes at no cost (see the section on the integral image, which shows that the size of a rectangle doesn't affect calculation time), while scaling the image region is time-consuming.
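A Python sketch of the shift-and-scale scan described above is given below; the classifier call and the pitch mask (Section 4.3) are hypothetical stand-ins, and the sizes match the 3-7 pixel balls mentioned above.

    def detect(image, classify_at, mask, min_size=3, max_size=8,
               step=1, scale_factor=1.1):
        """Sweep a detection window over the image at several scales.
        The window grows by `scale_factor` until it exceeds the largest
        possible ball; the features are scaled instead of the image, so
        larger windows cost no extra feature computation. `classify_at`
        and `mask` are hypothetical stand-ins."""
        h, w = image.shape
        detections = []
        size = float(min_size)
        while size <= max_size:
            win = int(round(size))
            for y in range(0, h - win, step):
                for x in range(0, w - win, step):
                    if not mask[y, x]:        # outside the pitch: skip
                        continue
                    if classify_at(image, x, y, win):
                        detections.append((x, y, win))
            size *= scale_factor
        return detections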

    4. 3 Masking out the audience

    Since the ball only needs to be detected when it is in play there is no need to

perform detection outside the pitch. Since the camera system has a model of the pitch it is easy to get the limits from there. For each match 16 different mask images are constructed, one for each camera (to the right in fig 13). When stepping through the x and y coordinates of an image, each position is first checked against the mask to see whether detection should be made there. By doing this a more accurate result can be achieved. The result can be seen to the left in

    figure 13. No detections are made outside the boundaries of the mask.

Fig 13 Detections (left) and the corresponding mask (right).

    4. 4 Number of stages

    What has been noticed when running the trained classifiers on the test data is

    that images from different matches respond very differently to the cascade.

    Images from a bright match may give a lot of false detections when running the

    cascade with many stages, while images from another match give hardly any

    false detections. But when increasing the number of stages used, positive

    detections are lost from the latter while only reducing the false positives from

    the former. This means that the optimal number of stages used for classification

    differs from match to match. So another way of getting the ROC-curves would

    be to set a limit on how many false detections the classifier is allowed to find in

    an image, and run the classifier until this limit has been reached. This increases


    the performance of the classifier when testing on a range of matches, and a test

    made this way is presented in Chapter 5. However during testing of other

    parameters this method would not be practicable, since it would be impossible

    to get any comparison data that depended only on the tested parameter.

    4. 5 Brightness threshold

    A brightness threshold has been used to eliminate some of the false detections.

    Since the features used do not capture the pixel value but only how the pixel

    values are related to each other, some of the detections have been found to be

    located on grass areas which are totally green. These can easily be rejected by

    looking at the brightness of the pixels in the detected area. If no single pixel is

    bright enough the detection can be ruled out.
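A minimal sketch of this rejection rule is shown below; the default threshold of 125 is the value that Section 5.7 reports worked on test set 1, but in general the threshold is tuned per match.

    import numpy as np

    def passes_brightness(gray_image, x, y, w, threshold=125):
        """Keep a detection only if at least one pixel in the detected
        region is brighter than the threshold (125 is the value found
        to work on test set 1; in general it is tuned per match)."""
        region = gray_image[y:y + w, x:x + w]
        return int(region.max()) >= threshold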

    Fig 14 Left: Image of false alarms on grass. Right: Mask that indicates the two different

    thresholds when an umbra is present.

    To the left in fig 14 we can see two false detections that have the same feature

    responses as a ball. It can be seen that these two detections consist of a brighter

    circular area in the center and darker pixels around. It also seems clear that the

    areas do not contain any pixels that are as white as a ball would be.

    To find the optimal threshold value for each individual match, a histogram of

    the intensity of the pixels in the whole image has been used. Usually, a peak in

    this histogram indicates the color of the grass. When there is an umbra because

    of the sun, it should be possible to find two peaks indicating the color of the

    grass. It is possible to extract a mask of where the umbra is as seen to the right

    in fig 14 and by looking at the brightness histogram it is possible to find a

    corresponding threshold that is optimal for the two different regions. On the

dark sections of the pitch the threshold is close to one peak; on the bright sections of the field the threshold is close to the other peak. The mask to the right in fig 14 indicates which of the two thresholds applies to each region.


    4. 6 SVM

    Two different types of features have been used to train and evaluate the SVM

    method: using the pixel value of the image regions directly and using the

    feature responses from a classifier trained with Adaboost. False positives have

been extracted in the same way. Both methods use images from the training set

    and the test set 1 described in Chapter 2.

    The training and testing procedure for the SVM using feature responses is as

    follows:

Training - Let a cascade with some low number of stages run on the training set

    discussed in Chapter 2. The feature responses of around 200 features starting

    from some stage are calculated, for both positives and negatives (one false

    positive from each image is taken to get an equal amount of positives and

negatives in the resulting training data). How many features should be used is not known, but results from the study made by Le and Satoh show that 200 is

    a good number [13]. Some different values are tested. The Support Vector

    Machine is then trained with these feature responses. By using feature

    responses from later stages it should be possible to change the SVM into

    classifying more difficult samples better. At the same time it should classify the

easier samples worse, but the cascade of boosted classifiers that is run before the SVM step is meant to take care of these.

    Testing - Let the same cascade classify the test set 1 discussed in Chapter 2.

Extract around 200 feature responses starting from some stage and label the regions that are classified as detections as positives, and the rest as negatives. Run the SVM on the extracted feature responses and evaluate the performance. By adding the SVM after different numbers of stages it is possible to construct a ROC-curve. Run the cascade up to stages that give a similar rejection or

    detection rate as the SVM and evaluate the performance. We can now compare

    how well the two different methods perform on the same positive and negative

    set.

As mentioned in the LIBSVM guide, scaling of the data is very important for achieving good results [6]. Without scaling the data it was not possible to get any acceptable results. The main reason why scaling is important is to avoid large numbers dominating small numbers. Scaling the data also makes the calculations easier, thus improving the efficiency of the classifier. All data, both the training and the testing data, has been scaled in the same way to a fixed range.
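A small sketch of such scaling is given below; scaling to [-1, +1] is one of the ranges recommended by the LIBSVM guide and is an assumption here, since the exact range is not stated above.

    import numpy as np

    def fit_scaler(X, lo=-1.0, hi=1.0):
        """Per-feature linear scaling to [lo, hi], fitted on the training
        data. The returned function applies the same transform to any
        data, so training and test data are scaled identically."""
        xmin, xmax = X.min(axis=0), X.max(axis=0)
        span = np.where(xmax > xmin, xmax - xmin, 1.0)  # avoid divide-by-zero
        return lambda Z: lo + (hi - lo) * (Z - xmin) / span

    rng = np.random.default_rng(0)
    X_train_raw = rng.normal(size=(100, 5))   # placeholder feature responses
    X_test_raw = rng.normal(size=(20, 5))

    scale = fit_scaler(X_train_raw)   # fit the scaling on training data only
    X_train = scale(X_train_raw)
    X_test = scale(X_test_raw)        # same transform applied to test data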


    4. 7 OpenCV

    The detection part of OpenCV was developed with face detection in mind. As a

    result it has been somewhat optimized for larger objects than the footballs in

    this thesis. This means that modifications to the code have been made to search

    for smaller objects.


    Chapter 5

    Results

    In this section results from the different trained classifiers are presented.

    Comparisons are made using ROC-curves. The performance is only shown for

    the values of interest since the training of later stages requires a lot of time.

    5. 1 ROC-curves

    To see the ratio between hit rate and false alarm rate, ROC-curves are used to

present the results. Hit rates are shown as percentages, while the false alarm rate shows the actual count of detected false positives per image.

A detection is counted as a positive hit if:

- the distance between the center of the detection and the center of the actual football is less than 30% of the width of the actual football
- the width of the detection window is within 50% of the actual football width

Other detections are regarded as false alarms.
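These two rules translate directly into a small matching function; the sketch below is one reading of them ("within 50%" taken as an absolute width difference of at most half the ball width).

    def is_hit(det_x, det_y, det_w, ball_x, ball_y, ball_w):
        """A detection counts as a positive hit if its center is within
        30% of the ball width from the ball center and its window width
        differs from the ball width by at most 50% of the ball width."""
        dist = ((det_x - ball_x) ** 2 + (det_y - ball_y) ** 2) ** 0.5
        center_ok = dist < 0.30 * ball_w
        width_ok = abs(det_w - ball_w) <= 0.50 * ball_w
        return center_ok and width_ok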

    A test with a perfect result has a ROC-curve that passes through the upper left

corner. That means 100% hit rate and 0 false alarms. The closer the curve is to this point, the better the accuracy of the classifier. Points that fall below the dotted

    line in fig 15 are the results of a classifier that is worse than chance.


    Fig 15 A ROC-curve. The closer the curve is to the top left corner the better.

    To get the curves, the number of stages used for detection is varied. More

stages mean a more specific classifier, while fewer stages let more of the image regions through the cascade as positives.

    The choice of number of stages influences the performance differently on

    different matches. In some matches a lot of detections are made using a specific

number of stages. Using the same number of stages on another match may result in very few detections. Since there are images from different

    matches in the test set, this is a problem. A test is therefore made by adjusting

    the number of stages used during detection. Depending on the number of

    detections made in the previous image, more or fewer stages are used for the

    next images. This improves performance as seen in Section 5.8.

    5. 2 Training results

Training and testing have been done on a 2.16 GHz computer with 2 GB RAM. In general it has taken on the order of days to train a classifier up to stage

    35. Some training sessions have been forced to stop earlier due to the training

    being too time consuming. It has however always been possible to get a ROC-

    curve that can be used to compare the performance with the others. The

    variables that affect training time are: the number of negatives used in the

    bootstrapping step, the size of the image regions, the minimum hit rate and the

    maximum false alarm. The first attribute can be very time consuming and it is

    therefore important to have a good set of negative images from where the

algorithm can find false positives. The last three attributes increase the number of features that are needed to reach the goal at each stage.

    Some examples of features chosen by the algorithm can be seen in figure 16.

    Fig 16 Example of features selected by Adaboost in early stages


    5. 3 Using different images for training

    Adaboost has been reported to be sensitive to noisy data [8]. Here some tests

    with different images used during training are reported.

    5. 3. 1 Image size

    Image regions of sizes between 8x8 and 15x15 pixels have been used to train 4

    different classifiers. Examples are given in figure 17.

Results show small differences in performance between the classifier trained using image regions of size 10x10 pixels and the classifier trained using image regions of size 12x12 pixels. Using image regions of size 8x8 pixels shows worse performance. Using image regions bigger than 12x12 does not improve performance, as seen in fig 18.

Size 8x8    Size 10x10    Size 12x12    Size 15x15

    Fig 17 Image regions of different sizes


Fig 18 By using image regions of size 12x12 the best performance is achieved, although there is not a big difference between the different classifiers.

    5. 3. 2 Image sets

    The next test shows the importance of having a good image database. Two

    classifiers trained with different images are compared, one trained with images

    in which the ball is occluded more and the contrast is worse. The results can be

    seen in fig 19.

    Around 4 more stages were required to get down to the same false alarm rate.

    At the same time the classifier that was trained with harder images shows a

    better performance in total.


    Fig 19 A classifier trained with images with less contrast and where the ball is partially

    occluded shows better performance than a classifier trained only on clearly visible footballs.

    5. 3. 3 Negative images

    Another modification that can be done to the set of images used in training is

    using more negatives in the bootstrapping step (see Section 3.6.1). The

bootstrapping step selects a number of false positives produced by the currently available classifier as examples of negatives. Until now the algorithm has used

    only 2000 negative samples at each stage. By letting the training procedure

    extract 5000 samples of negatives at each stage it should be possible to

    improve performance. One problem of using a high number of negatives in this

    step is that as training reaches later stages it becomes more and more difficult

    to find a large number of false positives. One should also mention that

    increasing this number immediately increases the time needed for training. As a

    reference it took 705 s to find 2000 false positives at stage 39, while it took

    under a second in the first stages. To find 5000 false positives at stage 39 took

    2105 s. In early stages the difference was not noticeable.


Fig 20 Using a higher number of negative samples for each stage of training increases performance by a couple of percentage points.

    The comparison in performance between using 5000 negatives at each stage

    and using 2000 negatives can be seen in fig 20. The two curves follow each

other, with the curve for the classifier trained with 5000 negatives a couple of percentage points higher.

One question that arises is how using a higher number of negatives affects the generalization performance. Is it only positive, or could the new classifier become too specialized to the training data? If the negatives are very

    similar to the positives the trained classifier is likely to have a decision

boundary that lies very close to the positives. By running the classifiers on the test set 2 described in Chapter 2 we can see some indication of how well the

    two classifiers generalize to the data. The difference in performance of the two

    classifiers is very small. There is not a big enough difference in performance

    when testing the generalization ability to be able to draw any conclusions about

    it. As seen in fig 21 the detection rates are very low.


    Fig 21 The classifier trained with 5000 negatives performs better in the lower regions of the

    ROC-curve when testing on a game that has not been used in training

    5. 4 Step Size

    During detection a detection window is swept across the image at different

    locations and at different scales. By shifting the window a few pixels at a time

it is possible to scan the whole image. A step factor is used to increase both the window size and the step size. For example, if the current step size is s, the window is shifted s pixels at each step. This means that when the detection window is

    large it is shifted more than one pixel at a time. Since the image regions used in

    training are not perfectly centered, a small amount of translational variability is

    trained into the classifier. While speeding up the classifier substantially,

    shifting more than one pixel at a time has resulted in a decrease of the

    performance.

With a scale factor of 1.2 the classifier classified around 24 images per second (it took between 87 and 96 seconds to classify all 2221 images). On the other hand it took about double the time with a scale factor of 1.1, which means around 13 images per second. Due to the higher performance when

using a small step factor, a fixed step size of 1 pixel and not using the scale

    factor for the step size has been tested. The window size is still increased with


    the step factor until the window is large enough. This way it only classifies

    about 9 images per second.

    Fig 22 A smaller step factor increases performance but also increases processing time.

    These results as seen in fig 22 clearly show a big increase in hit rate when

    decreasing the step size. In the forthcoming results a fixed step size of 1 pixel is

used along with a step factor of 1.1 for the window size. Also, a pre-calculation step, which used a step size of 2 pixels as a first pass, has been removed. By

    removing this step the detection rate improves as shown in figure 23 while

    decreasing the number of images processed per second to 8.5.


Fig 23 The old step size from fig 22 compared with the same classifier after removing a pre-calculation step.

    5. 5 Real and Gentle Adaboost

Two variants of the Adaboost algorithm have been evaluated. The Discrete Adaboost training was not able to finish, so it is left out of the comparison. What happened was that the boosting step was not able to improve performance by updating the weights, so the process got stuck. Lienhart also reported convergence problems when using LogitBoost for face detection and was not able to evaluate that method [14]. Lienhart could also show that the Gentle Adaboost was the best method among the Real, Discrete and Gentle Adaboost, at least for face detection. D. Le and S. Satoh also stated that the Discrete Adaboost is too weak to boost on a dataset that is hard to separate [13].


Fig 24 The performance of the Real Adaboost and the Gentle Adaboost is essentially the same.

    Figure 24 shows how the difference in performance between the two different

    variants of Adaboost, Real and Gentle, is minimal.

    5. 6 Minimum hit rate and max false

    alarm rate

    As described in Section 3.6 the min hit rate and max false alarm rate are used

    to set up the properties of the cascade. They describe the values each stage

    needs to reach in order to move on to the next stage.

We see that lowering the max false alarm rate improves performance significantly. Even better performance is achieved by using a higher minimum hit rate during training. It is worth noting that a higher min hit rate alone shows as good performance as raising the min hit rate and lowering the max false alarm rate at the same time. To make it easier to refer back to this classifier later in the report, the

    classifier with the best performance in fig 25 is called classifier 1.


    Fig 25 Comparison between using different values of the minimum hit rate and the false alarm

    rate during training. Better performance is achieved by using a higher min hit rate during

    training.

A question that arises is whether this procedure deteriorates the generalization performance of the classifier. Maybe these results reflect an overly specific classifier that has been adjusted too well to the training data. To test this, the

    performance of this classifier is compared with the performance of the

    classifier using a min hit rate of 0.995 on the test set 2. Test set 2 contains

images from a game not used for training. The results in fig 26 give some indication that the classifier has reduced its generalization ability. The

    classifier trained with an increased minimum hit rate and a decreased

    maximum false alarm rate still performs better than the classifier trained with

    the default values, but the difference is much smaller.

The downside of changing these limits is that it may hurt training. It can take a longer time for the classifier to reach the limits, increasing the total time taken to train the cascade. Some training sessions never even ended because they couldn't reach the limits. In addition, more features are usually needed to reach stricter limits, which means that the final classifier will be slower.


    Fig 26 The classifier trained with a higher min hit rate and a lower false alarm rate still

    performs better when testing on test set 2, but the difference between the two classifiers is

    minimal.

    5. 7 Brightness threshold

    When looking at the false positives that the classifiers detect, it can be seen that

    false detections are sometimes made on the green grass. By looking at them it

seems clear that they do not contain any white pixels, but are classified as footballs anyway. See Section 4.5 for more information. A simple way of

    removing these detections should be to reject the detections if they do not

    contain any white pixels. This is done by introducing a brightness threshold

    that rules out detections if no pixel in the detected area is brighter than the

    threshold. The performance when testing different values of the brightness

    threshold can be seen in fig 27.


Fig 27 How performance changes when using different threshold values to remove detections that are not bright enough.

    The images are saved by the ball tool in RGB space but later transformed into

grayscale with one color channel. The luminance Y in the YUV color space is used as the grayscale value. This is described in more detail in Section 2.2.

    The brightness threshold is added as the last step of detection.

    The results show that by adding this threshold, many of the false detections can

    be ruled out without decreasing the detection rate of the classifier. These tests

    were made on the test set 1, and the classifier used was again classifier 1. The

    best result achieved without losing any hits was using a threshold of 125. Using

20 stages of the cascade no positive detections were lost, while the number of false detections decreased from 5.4 to 4.3 per image.

    Further tests show that the threshold can be optimized even more. When

increasing the threshold value, only images from one match suffer from weaker detection while the detection rate of the other matches remains unchanged. As one might expect, this match had difficult lighting conditions and the football was often darker

    than normal. These results suggest that the threshold value can be optimized for

    each individual match.

    By using the threshold mask described in Section 4.5 the threshold can

    automatically be optimized for the current lighting conditions. The comparison

    is made in fig 27 between the cascaded classifier 1 with and without threshold.

    As expected the performance is increased even further using this new

    threshold.


    5. 8 Number of stages

By increasing or decreasing the number of stages used for detection in an image, depending on how many detections were made in the previous image, it is possible to adjust the classifier to each game, since the images in the test set are ordered.

    An upper and a lower limit for when to lower and raise the number of stages

    are needed. By varying these two limits it is possible to get a ROC-curve as in

    fig 28.

    Fig 28 By adapting the number of stages according to the number of detections made in the

    previous image better results are achieved.

    Results show that the classifier benefits from being adjusted to each game to

    maximize performance.
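A sketch of this adaptive rule is given below; the limits and stage bounds are hypothetical placeholders, since varying them is exactly what traces out the ROC-curve in fig 28.

    def adapt_stages(n_stages, n_detections, lower=2, upper=20,
                     min_stages=10, max_stages=35):
        """Choose how many cascade stages to use on the next image, based
        on the number of detections in the previous image. All limits are
        hypothetical placeholders."""
        if n_detections > upper and n_stages < max_stages:
            return n_stages + 1   # too many detections: be stricter
        if n_detections < lower and n_stages > min_stages:
            return n_stages - 1   # too few detections: be more permissive
        return n_stages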


    5. 9 Support Vector Machine

As mentioned in Section 4.6, there are two parameters that need to be selected when using the RBF kernel for SVM classification: C and γ. Unfortunately there is no way to generalize the selection of these parameters. For every data set there is a different optimal choice of parameter values. These values

    have been found using cross-validation (see Section 3.7.2). Only the

    performances of the best classifiers are shown in this section.

    In these tests the SVM has been integrated as the last step after the cascaded

classifier. The ROC-curve has been constructed by varying the stage at which the SVM takes over the classification task.

The first test uses an SVM model that has been trained with 283 feature

    responses from stages 16 and 17 of the best classifier from Section 5.6

(classifier 1). As seen in fig 29, the results of the SVM model stay on the worse side of the cascaded classifier in the ROC-curve. The results from a similar classifier trained directly on the pixel values are even worse and are therefore left out of the report.

Fig 29 Using the SVM as the last step, added after different numbers of stages. Using the SVM as the last stage does not show better results than the cascaded classifier when trained on 283 features.


Possible explanations for why the results from the SVM are worse than when only using the boosted classifier are that too few or too many features have been used, or that the features have been taken from stages too early or too late in the cascade. The next tests examine these possibilities.

The next two tests using SVM have been done on classifiers trained with the same number of feature responses (283) but using feature responses from earlier and later stages of the cascade. The feature responses have been taken starting from stages 2 and 26 respectively. The same cascaded classifier has been used, to be able to compare the effect of using more or less discriminative features. Using features from later stages for training should

result in a classifier that is better at separating harder samples than before. The results in fig 30 show that the overall performance of the two new classifiers is worse than that of the previous classifier.

A higher number of features does not help the SVM classifier, as seen in fig 31. On the other hand, when lowering the number of features the performance increases. The results in this figure come from a classifier trained with features from stage 15 onward, until the desired number of features has been reached. A zero-mean normalization has also been applied to each image region, but without any sign of improvement.

    Fig 30 Training the SVM on feature responses from earlier and later stages results in poor

    performance.


Fig 31 Using additional features does not improve the performance of the SVM classifier. Good

    performance is shown by the classifiers trained with very few features.

    5. 10 Compared to existing detection

    The main method for finding the ball today is based on tracking the movement

    of the ball. Since there has not been enough time to integrate the method

proposed in this report with the existing software used by Tracab, comparisons of the performance in finding a final ball hypothesis have not been made between the two methods. Instead, a specific detection step in Tracab's system

    is compared with the method proposed in this report.

The comparison has been done by running the two methods on the test set 1. A first step in Tracab's algorithm extracts possible ball candidates from the test

    set by looking for image regions that are brighter than the background. These

    regions are then stereo-matched. The resulting set of image regions are saved

    and used as the database in the comparison test. The next step of the method

    continues to narrow down the candidates by using a correlation method that

extracts ball-like candidates. This results in a detection rate and a false detection rate that are compared with the cascaded classifier, as can be seen in fig 32. The cascaded classifier has been run on the same database using different numbers of stages to get a ROC-curve.


    Fig 32 The method proposed in this report compared with a method in the existing system

    As seen in fig 32 the performance of the cascaded classifier is slightly higher.

    5. 11 Five-a-side

Five-a-side is a variant of the game in which each team has only five players and the pitch is much smaller, around 16 times smaller than in regular football. This test can be compared to having the same setup as before but with better cameras of higher resolution. Since

    technology evolves rapidly and prices fall, there are reasons to believe that

    better cameras will be used in the near future.

    The cascade has been trained in the same way as before as described in Chapter

3, but with images as described in Section 2.2.4. Due to the long training time, the training parameters have been set to a max false alarm rate of 0.4, a min hit

    rate of 0.995 and 2000 negatives for each stage. The SVM has been trained

    with 197 feature responses from stage 16 and is added as the last step of the

    cascade.

    Fig 34 shows the performance of the classifier run with and without SVM on

the test set described in Section 2.2.4. The hit rate is over 95% even at a low false alarm rate. As before, using the SVM does not increase performance. The two

    curves in fig 34 follow each other closely.


Fig 33 The system set up at a five-a-side pitch. The cameras are closer to the ball, giving footballs of higher resolution. The texture of the ball is now distinguishable.

    Fig 34 Very good performance is shown by the classifier trained using images of footballs of

    higher resolution. No improvement can be seen when using SVM as the last stage.

    In the same way as in Section 5.10 a comparison has also been made between

the classifier proposed in this section and a current detection method used by

    Tracab. This can be seen in fig 35.


    Fig 35 The method proposed in this report compared with a method in the existing system at

    Tracab.

    As seen the cascaded classifier outperforms the current method. Again, it is

important to remember that this is not the only method used by Tracab's system. In addition to a higher detection rate, the most positive result in this

    test is the low false alarm rate shown by both methods.

    5. 12 Discussion

Overall, the results show that the detection task set up for this thesis can be solved with pleasing results. Judging by the features it selects, the boosting procedure seems capable of extracting the information available in the sample images. In early stages the features are understandable. They capture

    the property of the football being bright in the middle and darker to the sides.

On the other hand, it is not as obvious what the features in later stages represent. This may be a sign that the process overfits the data, but studies have shown the method to be robust to overfitting [29].

    As expected, results in Section 5.3 indicate that the image database is crucial

for getting good performance. It is a little surprising that the best performance is achieved when using as much of the background as in the image regions of 12x12 pixels. The increased performance when using harder images is probably because the training image set related better to the test set 1 used in both tests. The results confirm the expectation that more stages are required when using harder images in order to reject as many samples as when using only the easier images.

    Also as expected, results show a big difference in performance when reducing

    the step size. This is the case because the objects dealt with in this report are

    small. Unfortunately this is directly related to the time needed to classify an

    image. Luckily it is very easy to do a tradeoff between processing time and

    performance.

    Tests of using a brightness threshold show how the boosting only trains the

    classifier to identify relative features, not exact pixel values. This is why the

brightness threshold can be successful. The tests of the brightness threshold, along with varying the number of stages, show the importance of adjusting the

    classifier to each game and lighting condition.

    The most disappointing results in this report have been shown by the Support

Vector Machine method. The results in this thesis contradict the results of a related study made by Le and Satoh, which suggest that a higher number of

    features extracted by a boosted classifier makes it easier for the SVM to

    separate the two classes [13]. This may be due to overfitting (Section 3.7.1).

    These results are discouraging since better results were expected from the

    Support Vector Machine method.

One of the goals of this thesis was to evaluate whether it was possible to improve on the football detection used today. In the comparison, one has to bear in mind that

    additional techniques are used by Tracab to find the final ball hypothesis. Both

    classifiers show good performance on the comparison made in Section 5.10.

    Another comparison would have been to include harder images such as

    footballs that are partially occluded by a player and therefore only visible in

one camera. This has not been possible. One advantage of the cascaded classifier, which can be seen in fig 32, may be that it makes it easy to rank the detections according to how confident they are. A detection that makes it

    through a high number of stages is more likely to be the actual ball than a

    detection that gets thrown away in an earlier stage.

    Even higher detection rates are shown in the test made on the five-a-side

    match. As the cameras are closer to the pitch, the texture of the ball now

becomes visible and the classifier has more to work with. This is part of the explanation of why the classifier shows such good results in fig 34.

However, when comparing the performance one should bear in mind that this test set is not the same as before, so it is impossible to compare these results directly. The labeling of the images regarding the contrast and how free the

    footballs are has not been done for this set, which makes it even more difficult


    to compare with the first test set. Also, to save time, the process of extracting

    footballs in the images has been made faster by mostly including easy targets

    and by removing detections in areas around tracked players. Of course this

    makes it easier to explain the high hit rates of the classifiers.

The SVM method shows similarly disappointing results here as in the earlier SVM tests. Although the performance of the SVM seems to have

    increased it still does not perform better than the cascaded classifier. This can

    also be an indication that the cascaded classifier is showing good results.


    Chapter 6

    Conclusions and future

    work

This chapter gives an overview of the results and the conclusions that can be drawn from them. Some thoughts on what needs to be improved in the future

    are presented.

    6. 1 Conclusions

    In this report a method for object detection has been used to detect small

    footballs in real time. Finding these footballs is a hard task mainly because the

    footballs are very small. This method has not been used on objects of this size

    before. Because of the size of the balls, a smaller spatial step size has been

    needed to achieve a desirable hit rate compared to what has been reported in

    previous reports. This results in a much slower detector. When using the best

classifier a speed of 8.5 images per second is achieved. On the other hand no

    optimization for speed has been done. By introducing the brightness threshold

    before the classifier the processing time could be reduced. It is also easy to do a

    tradeoff between processing time and performance. Tests made on images of

    footballs in higher resolution show increased performance. On the other hand,

    so does the method available at Tracab today.

    The overall performance shown by the classifier in tests is promising, but since

the method has not been implemented to produce a single final hypothesis of the best ball candidate, it has been difficult to make fair comparisons with the method

    available at Tracab today. Therefore it is difficult to say if this method

    implemented as a final ball detector would be better or worse than the one

    available today at Tracab.

    The idea of using a classifier such as SVM as the last stage has been shown not

    to work perfectly. Decreased performance when using a higher number of

    features during training of the SVM contradicts results in previous studies [13].

    This may be due to overfitting.


    6. 2 Future work

    The natural first next step to take would be to integrate the method into the

    existing system to see if it can be used to improve performance of finding a

    final ball hypothesis. This is the only way of getting a true comparison with the

    existing methods at Tracab.

    Big differences in performance can be seen when using different image sets

    (Section 5.3.2). The image set can therefore probably be improved and should

    definitely be revised. The current classifiers have been trained on 6 different

matches. This small set does not give enough variety in the images to cover all possible conditions regarding illumination and color. To get a classifier that

    generalizes well on any kind of new data it is necessary to use a wider range of

    matches. With such a data set it will be necessary to test the classifier on a wide

    range of data from matches not used during training. Another approach would

    be to train a cascade that is optimized for one setup. This could be done by

    training only on images with certain lighting conditions or images of a specific

    football. During classification one would for each match or maybe even for

    each image region start by examining which of the several trained classifier to

    use.

    The results from the SVM method suggest that the feature selection can be

done in a better way. Are the features relevant, or is one feature worth more than another and should be weighted up accordingly? These are some of the questions that need to be answered, and as a starting point one can read a survey addressing the problem of feature selection [12].

    In the near future cameras of higher resolution will probably be used so it is

    natural to continue the research towards this. The five-a-side test was a first

    step towards testing this.


    Bibliography

    1. M. Aizerman, E. Braverman, L. Rozonoer. Theoretical foundations of the

    potential function method in pattern recognition learning. Automation and Remote Control 25, pp 821-837, 1964.

    2. N. Ancona, A. Branca. Example based object detection in cluttered background with Support Vector Machines. Instituto Elaborazione Segnali ed Immagini. Bari, Italy 2000.

    3. N. Ancona, G. Cicirelli, E. Stella, A. Distante. Ball Detection in Static Images with SVM for Classification. Image and Vision Computing 21, pp 675-692, 2003.

    4. H. Bay, T. Tuytelaars, L. Van Gool. SURF: Speeded Up Robust Features. Proceedings of the ninth European Conference on Computer Vision, 2006.

5. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, 121-167, 1998.

    6. C. Chang, C. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

    7. G. Coath, P. Musumeci. Adaptive Arc Fitting for Ball Detection in RoboCup. APRS Workshop on Digital Image Analysing, 2003

    8. T. Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting and Randomization. Machine Learning, 1-22, 1999.

9. T. D'Orazio, C. Guaragnella, M. Leo, A. Distante. A new algorithm for ball recognition using circle Hough transform and neural classifier. Pattern Recognition 37, pp 393-408, 2003.

    10. Y. Freund, R. Schapire. A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting. European Conference on Computational Learning Theory, 1995.

11. J. Friedman, T. Hastie, R. Tibshirani. Additive Logistic Regression: a Statistical View of Boosting. Annals of Statistics, vol. 28, no. 2, pp. 337-407, 2000.

    12. I. Guyon, A. Elisseeff. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3, 2003.

13. D. Le, S. Satoh. A Multi-Stage Approach to Fast Face Detection. IEICE Trans. Inf. & Syst., Vol. E89-D, No. 7, 2006.

    14. R. Lienhart, J. Maydt. An Extended Set of Haar-like Features for Rapid Object Detection. IEEE ICIP 2002, Vol. 1, pp. 900-903, 2002.

    15. R. Lienhart, A. Kuranov, V. Pisarevsky. Empirical Analysis of Detection Cascades of Boosted Classifiers for Rapid Object Detection. MRL Technical Report, 2002.

    16. Y. Lin, T. Liu. Fast Object Detection with Occlusions. The 8th European Conference on Computer Vision (ECCV-2004), Prague, 2004.

    17. C. Liu, H. Shum. Kullback-Leibler Boosting. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 587-594, 2003.


    18. D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60(2), pp 91-110, 2004.

    19. Ville Lumikero. Football Tracking in Wide-Screen Video Sequences. Master Thesis in Computer Science, School of Electrical Engineering Royal Institute of Technology. Stockholm, 2004.

    20. K. Mikolajczyk, C. Schmid. A Performance Evaluation of Local Descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, VOL. 27, NO. 10, 2005.

21. S. Mitri, K. Pervölz, H. Surmann, A. Nüchter. Fast Color-Independent Ball Detection for Mobile Robots. Fraunhofer Institute for Autonomous Intelligent Systems (AIS), Sankt Augustin, Germany, 2004.

22. T. Ojala, M. Pietikäinen, T. Mäenpää. Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, 2001.

    23. E. Osuna, R. Freund, F. Girosi. Training Support Vector Machines: An Application to Face Detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Puerto Rico 1997.

    24. C. Papageorgiou, M. Oren, T. Poggio. A General Framework for Object Detection. International Conference on Computer Vision, 1998.

    25. B. Rasolzadeh. Response Binning: Improved Weak Classifiers for Boosting. Intelligent Vehicles Symposium, pp 344 349, 2006.

26. S. Romdhani, P. Torr, B. Schölkopf, A. Blake. Computationally Efficient Face Detection. Proceedings of the 8th International Conference on Computer Vision, 2001.

    27. D. Scaramuzza, S. Pagnottelli, P. Valigi. Ball Detection and Predictive Ball Following Based on a Stereoscopic Vision System. Proceedings of the 2005 IEEE International Conference on Robotics and Automation. Barcelona, Spain, 2005.

    28. R. Schapire, Y. Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning, 37(3):297-336, 1999.

    29. R. Schapire, Y. Freund. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of computer and system sciences 55, 119-139 (1997)

    30. K. Sung and T. Poggio. Example-based Learning for View-based Human Face Detection. A.I. Memo 1521, MIT A.I. Lab. 1994

    31. P.Viola, M.Jones. Robust Real-Time Object Detection. IEEE ICCV Workshop Statistical and Computational Theories of Vision, 2001.

32. P. Viola, M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp 511-518, 2001.

33. J. Wu, J. M. Rehg, M. D. Mullin. Learning a rare event detection cascade by direct feature selection. Advances in Neural Information Processing Systems (NIPS), 2003.

34. M. Sewell. http://www.svms.org/vc-dimension/ Accessed 2008-05-15.
35. C. Poynton. Frequently Asked Questions about Color. http://www.poynton.com/PDFs/ColorFAQ.pdf Accessed 2008-05-15.
36. DTREG. http://www.dtreg.com/svm.htm Accessed 2008-05-28.
37. http://www.gestalttheory.net/ Accessed 2008-05-28.
38. OpenCV library. http://opencvlibrary.sourceforge.net/ Accessed 2008-06-27.


    Appendix 1

Percentage of images that have a value equal to or lower than the one in the left column (contrast) and the top row (freeness):

Training set:

Contrast \ Free     1      2      3      4      5
1                 0.7    0.8    0.8    0.8    0.8
2                31.0   48.3   57.1   58.0   58.5
3                45.0   72.2   89.1   91.6   92.9
4                46.9   76.0   94.7   98.0    100

Test set 1:

Contrast \ Free     1      2      3      4      5
1                 0.2    0.5    0.5    0.5    0.5
2                29.7   48.2   57.5   57.8   57.9
3                42.9   72.6   94.4   95.9   96.3
4                43.8   74.0   96.8   99.1    100

