
Image processing and object tracking from a single camera

JOHAN SOMMERFELD

Master's Degree Project

Stockholm, Sweden 2006-12-13


Abstract

Over the last decades, computers have gained the ability to perform huge amounts of calculations and to handle information flows we thought impossible only ten years ago. Despite this, a computer can extract only a little information from an image compared with human vision. The way the human brain filters out useful information is not fully known, and this skill has not been carried over into computer vision science.

The aim of this thesis is to implement a system in Matlab that is able to track a specific object in a video stream from a single web camera. The system should combine fast and advanced algorithms, aiming at a better trade-off between accuracy and speed than either class of algorithms achieves alone. The system is tested by following a person's hand, with the person seated in front of a computer and the web camera mounted on the screen.

The goal is a system with the potential to be implemented in a real-time environment; the system therefore needs to be very fast. The work in this thesis is an initial step and is not implemented to run in real time.

The hardware used is a standard computer and a regular web camera with a 640x480 resolution at 30 frames per second (fps).

Overall, the system works as expected and was able to track a person's hand in numerous configurations. It outperforms the advanced algorithms in terms of the computational power needed, and it is more stable than the fast ones. A drawback is that the system parameters depend on the object and its surroundings.

Acknowledgments

The thesis was written at the Sound and Image Processing Laboratory at the School of Electrical Engineering, the Royal Institute of Technology (KTH), during the academic year 2005–2006. I would like to take this opportunity to thank my supervisor M.Sc. Anders Ekman for his patience when things progressed a bit slowly, Ph.D. Disa Sommerfeld for proofreading, and assistant professor Danica Kragić for pushing me forward.


Contents

1 Introduction
1.1 Background
1.2 Related work

2 Problem
2.1 System
2.2 Hardware

3 Method
3.1 Adaptive filters
3.1.1 The Least Mean Square algorithm
3.2 Motion detection
3.3 Pattern recognition
3.3.1 Parametric algorithms
3.3.2 Nonparametric algorithms
3.3.3 Linear discriminant
3.3.4 Support Vector Machines
3.3.5 ψ-learning

4 Implementation
4.1 Initiation
4.2 Detection
4.3 Recognition
4.3.1 Feature Space
4.3.2 Training
4.3.3 Detecting
4.4 Updating
4.5 Prediction
4.6 Optimization

5 Result
5.1 Simulations
5.2 Color spaces
5.3 Tracking
5.4 Speed

6 Discussion
6.1 Future work

A Mathematical cornerstones
A.1 Statistical Theory

B Simulation plots

Bibliography


Chapter 1

Introduction

Over the last decades, computers have gained the ability to perform huge amounts of calculations and to handle information flows we thought impossible only ten years ago. Despite this, a computer can extract only a little information from an image compared with human vision. The way the human brain filters out useful information is not fully known, and this skill has therefore not been carried over into computer vision science.

1.1 Background

Even if we have not been able to teach a computer to process visual input in a complex sense, there is quite a lot a computer can do when it comes to following movement and performing simpler recognition tasks.

One of the key tasks in a computer vision system is to extract the interesting areas (the foreground). Research on this has taken mainly two approaches. The first group uses advanced pattern recognition algorithms to extract the foreground. These methods often make little use of temporal redundancy, and they are slow because of the large number of computations needed. The second approach instead uses pixel-by-pixel processing, with only a few computations per pixel. In general the latter methods are fast and may be implemented in real-time applications. Their drawback is that, due to the lack of complexity in the algorithms, they are sensitive to noise and often need a static environment to function.

1.2 Related work

There are a few simple algorithms for tracking, for example: detection of discontinuities using Laplacian and Gaussian filters, often implemented with a simple kernel [1]; thresholding; and motion detection against a reference image. These algorithms are simple but sensitive to noise and hard to generalize. A set of more advanced algorithms involves iterations and/or transformations, such as the Hough transform, region-based segmentation and morphological segmentation. These algorithms are generally more robust to noise, although as images and/or frames grow larger they become slow [1].

Other algorithms make use of pattern recognition, such as neural networks, maximum likelihood and support vector machines [2]. First the image has to be translated into something the pattern recognition algorithms understand: the image is processed into a so-called feature vector. The majority of pattern recognition algorithms require a set of training data to form the decision boundary. The training is often slow, but thereafter the algorithm is fast. The problem is that extracting the feature vector can be a demanding task for the computer.

There are a number of interesting approaches to object tracking. Kyrki et al. [3] combine model-based features, such as a wire frame, with model-free features, such as points of interest on a calculated surface. Doulamis et al. [4] use an implementation of neural networks to track objects in a video stream; the neural network is adaptive and changes over time as the object moves. Comaniciu et al. [5] use a kernel-based solution for identifying an object. Amer [6] uses voting-based features and motion to detect objects, tuned for real-time processing. In the PhD thesis by Kragić [7], a multiple-cue algorithm is presented, using features that are fast to compute and relying on the assumption that not all cues fail at the same time. Cavallaro et al. [8] present a hybrid algorithm using information about both objects and regions, and Gastaud et al. [9] track objects using active contours.

Kragić [7] uses multiple cues for better tracking. Instead of combining multiple fast cues, the approach in the present thesis takes advantage of both the fast and the advanced algorithms, in order to achieve a system that outperforms the simple algorithms and operates faster than the advanced ones.


Chapter 2

Problem

The aim of this thesis is to implement a system in Matlab that is able to track a specific object in a video stream from a single web camera. The system should combine fast and advanced algorithms, aiming at a better trade-off between accuracy and speed than either class of algorithms achieves alone. The system is tested by following a person's hand, with the person seated in front of a computer and the web camera mounted on the screen.

2.1 System

The goal is a system with the potential to be implemented in a real-time environment; the system therefore needs to be very fast. It also needs a higher accuracy than the simple methods described in section 1.2. The system will use algorithms that need training. The work in this thesis is an initial step and is not implemented to run in real time.


Figure 2.1: The main blocks of the system.

At start-up, the user is asked to tell the system the whereabouts of the object to track. Since the system is implemented in Matlab, only a proof of concept is possible to achieve in this thesis. More specifically, the system is based on four blocks, see figure 2.1; a sketch of how they might fit together follows the list below.

Detection is responsible for detecting and segmenting the interesting parts of the image. The main algorithm here is most often one of the fast algorithms described in section 1.2.

Recognition is responsible for classifying the foreground extracted from the image by the detection block.

Updating is responsible for updating the representation of the tracked object, using information generated by the recognition block.

Prediction is responsible for using all available information to predict where to start the segmentation in the detection block, minimizing both the time consumed and the error probability.
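A minimal sketch of how the four blocks might chain together per frame; every function here is a hypothetical placeholder standing in for the corresponding block, not the thesis code:

```python
def detect(frame, poi):
    """Detection block: candidate foreground blocks near the point of interest."""
    return []  # placeholder

def recognize(blocks):
    """Recognition block: keep only blocks classified as the tracked object."""
    return blocks  # placeholder

def update(object_blocks, poi):
    """Updating block: refresh the object representation (point of interest)."""
    return poi  # placeholder

def predict(poi):
    """Prediction block: where to start segmenting the next frame."""
    return poi  # placeholder

def track(stream, poi):
    """One pass of the four-block pipeline per frame."""
    for frame in stream:
        candidates = detect(frame, poi)
        object_blocks = recognize(candidates)
        poi = update(object_blocks, poi)
        poi = predict(poi)
```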


2.2 Hardware

The hardware used is a standard computer and a regular web camera with a 640x480 resolution at 30 frames per second (fps). The computer is an Apple Power Mac G5 with two 2.7 GHz processors. The camera is an Apple iSight, and the video stream is in DV format. However, Matlab's Unix version only accepts uncompressed video, so the stream is converted to uncompressed true-color video.


Chapter 3

Method

3.1 Adaptive filters

An adaptive filter is a filter that changes over time depending on the signal. For a summary of the statistical theory used, see appendix A.1. Assume two non-stationary zero-mean signals with known statistics, i.e. covariance and cross-covariance

$$r_{yy}(n,m) = E[y(n)y(n+m)]$$
$$r_{xy}(n,m) = E[x(n)y(n+m)].$$

The problem of estimating $x(n)$ from past $y(n)$ may be written as

$$\hat{x}(n) = \sum_{k=0}^{N-1} \theta(k)\, y(n-k) = Y^T(n)\,\theta,$$


where $Y(n) = [y(n), \ldots, y(n-N+1)]^T$ and $\theta = [\theta(0), \ldots, \theta(N-1)]^T$. The MSE is then given by

$$\mathrm{MSE}(n,\theta) = E[(x(n) - \hat{x}(n))^2].$$

The optimal θ may be obtained from the orthogonality condition, which states that $Y^T(n)\theta$ is the linear MMSE estimate of $x(n)$ if the estimation error is orthogonal to the observations $Y(n)$:

$$E[(x(n) - Y^T(n)\theta)\, Y(n)] = 0. \qquad (3.1)$$

If we define the covariance matrices

$$\Sigma_{Yx}(n) = [r_{xy}(n,n), \ldots, r_{xy}(n, n-N+1)]^T$$

$$\Sigma_{YY}(n) = E[Y(n)Y^T(n)] =
\begin{bmatrix}
r_{yy}(0) & r_{yy}(1) & \cdots & r_{yy}(N-1) \\
r_{yy}(1) & r_{yy}(0) & \cdots & r_{yy}(N-2) \\
\vdots & & \ddots & \vdots \\
r_{yy}(N-1) & r_{yy}(N-2) & \cdots & r_{yy}(0)
\end{bmatrix},$$

and insert this in (3.1), we get

$$\Sigma_{Yx}(n) - \Sigma_{YY}(n)\,\theta = 0,$$

from which we get the optimal weights

$$\theta_{opt}(n) = \Sigma_{YY}^{-1}(n)\,\Sigma_{Yx}(n),$$


where θ depends on time. An algorithm to update θ is also needed. A common method is to take a step in the negative gradient direction of $\mathrm{MSE}(n,\theta)$:

$$\hat{\theta}(n) = \hat{\theta}(n-1) - \frac{\mu}{2}\,\partial_\theta \mathrm{MSE}(n,\theta)\Big|_{\theta=\hat{\theta}(n-1)}, \qquad (3.2)$$

where µ is a variable that controls the step size of the algorithm: a large µ is fast but can be unstable, while a small µ is slow but generally more stable. The gradient can be written as

$$\partial_\theta \mathrm{MSE}(n,\theta) = -2\Sigma_{Yx} + 2\Sigma_{YY}\,\theta. \qquad (3.3)$$

Inserting (3.3) in (3.2), we get

$$\hat{\theta}(n) = \hat{\theta}(n-1) + \mu\left(\Sigma_{Yx}(n) - \Sigma_{YY}\,\hat{\theta}(n-1)\right). \qquad (3.4)$$

3.1.1 The Least Mean Square algorithm

In general, the statistical information about the variables is not available; more likely, the only things available are $y(n)$ and $x(n)$. We will still use the steepest descent algorithm, equation 3.4, with some modifications.

Since the statistical information is not available, we cannot calculate the MSE. Instead we estimate it by relaxing the expression, dropping the expectation operator:

$$\widehat{\mathrm{MSE}}(n,\theta) = (x(n) - Y^T(n)\theta)^2.$$


The gradient is then

$$\partial_\theta \widehat{\mathrm{MSE}}(n,\theta) = -2\,(x(n) - Y^T(n)\theta)\,Y(n). \qquad (3.5)$$

Inserting (3.5) into (3.2), we get

$$\hat{\theta}(n) = \hat{\theta}(n-1) + \mu\, Y(n)\left(x(n) - Y^T(n)\hat{\theta}(n-1)\right). \qquad (3.6)$$

The theory for this section was collected from Hjalmarsson et al. [10], which also supplies more information about adaptive filters.
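As a concrete illustration, a minimal NumPy sketch of the LMS update in equation 3.6 (class and variable names are my own):

```python
import numpy as np

class LMSFilter:
    """Minimal LMS filter implementing the update in equation 3.6."""

    def __init__(self, n_taps, mu):
        self.theta = np.zeros(n_taps)  # weights theta(0), ..., theta(N-1)
        self.mu = mu                   # step size

    def step(self, Y, x):
        """Y: observations [y(n), ..., y(n-N+1)]; x: desired value x(n)."""
        x_hat = Y @ self.theta                   # estimate Y^T(n) theta
        self.theta += self.mu * Y * (x - x_hat)  # eq. 3.6 weight update
        return x_hat
```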

3.2 Motion detection

Motion detection is often built into a larger system and is tweaked to fit the other algorithms. One commonly used approach is to threshold a difference image:

$$d(x,y) = \begin{cases} 1 & \text{if } |img(x,y,t) - img(x,y,t-1)| > T \\ 0 & \text{else,} \end{cases}$$

where T is a threshold variable. Even better is to use a reference image

$$ref(x,y,t) = \alpha \cdot ref(x,y,t-1) + (1-\alpha) \cdot img(x,y,t) \qquad (3.7)$$


and then threshold against this reference image:

$$d(x,y) = \begin{cases} 1 & \text{if } |img(x,y,t) - ref(x,y,t)| > T \\ 0 & \text{else.} \end{cases} \qquad (3.8)$$

The rate at which the reference image is updated over time is controlled by α [1]. This is a fast algorithm but sensitive to noise.
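A sketch of equations 3.7 and 3.8 in NumPy (the default values for α and T are taken from the simulation settings reported in section 5.1):

```python
import numpy as np

def update_reference(ref, img, alpha=0.9):
    """Eq. 3.7: exponentially weighted reference image."""
    return alpha * ref + (1.0 - alpha) * img

def motion_mask(img, ref, T=14):
    """Eq. 3.8: binary motion mask from the difference image."""
    return (np.abs(img - ref) > T).astype(np.uint8)
```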

Irani et al. [11] have developed a method for robust tracking of motion, using multiple scales and translations to detect and track motions. Though robust, this technique puts a heavy load on the hardware, especially at the resolutions used in the present thesis.

3.3 Pattern recognition

Pattern recognition algorithms can seldom be fed raw data such as a video or audio stream; they need some sort of feature(s). These features span a domain called the feature space. The choice of feature space is essential, and in some cases even more critical than the choice of pattern recognition algorithm. One wants to keep the dimensionality as low as possible: the higher the dimensionality, the more training data is needed and the heavier the load the algorithms put on the computer. If the dimensionality is too low, however, the ability to separate patterns is reduced. If all statistics were known in advance, it would be possible to analytically derive an optimal decision surface, but in reality this never happens. Instead, a training set that is supposed to represent the distribution of the signal/pattern is used to tune the chosen al-


gorithm. There is a number of di�erent algorithms with di�erent approaches

in how to use the training set and the di�erent a prioris.

3.3.1 Parametric algorithms

Parametric algorithms use the training set to fit distributions chosen in advance. Once the distributions of the different patterns have been fitted, the decision boundary can be calculated using, for example, maximum likelihood or Bayesian parameter estimation. These algorithms generally have good convergence and performance if they are tuned right, but quite a lot of tuning is needed to adapt them to different problems. Another problem is the curse of dimensionality, which appears when the feature space increases in dimensionality [2]. To cope with this it is possible to use principal component analysis (PCA), which uses eigenvectors to decrease the dimensionality of the feature space [2]. The strength of parametric algorithms is that knowledge about the distributions can be taken into account, making better use of the available training data.

3.3.2 Nonparametric algorithms

The previous section discussed algorithms that use training data to estimate pre-decided distributions. Unfortunately, knowledge about the distribution of the patterns is rarely available. Nonparametric algorithms do not assume any particular distribution; instead they rely on the training data being accurately representative of the patterns.

One of the best known nonparametric algorithms is $k_n$ nearest neighbors.


The algorithm uses the training data to find the $k_n$ nearest neighbors of the point in the feature space corresponding to the pattern to be classified. The class to which the majority of the $k_n$ neighbors belong is assigned to that point. The strength of this algorithm is that, given sufficient training data, it can represent complex distributions. The drawback is that it puts a heavy load on the computer, and the complexity increases with the dimensionality and the amount of training data.

3.3.3 Linear discriminant

The previous sections discussed two techniques with different approaches to how the given training set is used.

This third algorithm lies more or less in between the two previous ones. We do not define a specific distribution in advance, and we do not keep all the training data as a basis for calculations at run time. The training data is used directly to train the classifier, which is a set of linear discriminant functions

$$g(x) = w^T x + w_0,$$

where x is the point in the feature space to be classified, w is the weight vector and $w_0$ is the bias [1]. Depending on the problem, a number of discriminant functions can be trained and used in recognition problems. For instance, if the classifier is supposed to be binary, one discriminant function is sufficient. If there are many patterns to be classified, the discriminant functions can be designed in multiple ways:


• One versus all is a training technique where the discriminant function is trained to separate the pattern connected with the discriminant function from all other patterns.

• One versus one creates multiple binary discriminants, each trained on two patterns against each other.

• In a linear machine, one discriminant per pattern is trained, and a point is classified as the pattern whose discriminant produces the highest value; a sketch of this rule follows below.

One problem with these algorithms is that there are regions where the classifier is undefined. The linear machine usually produces the least amount of undefined space, since undefined space only occurs where two or more discriminant functions are equal.
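A minimal sketch of the linear machine decision rule mentioned in the list above (names are my own):

```python
import numpy as np

def linear_machine(x, W, w0):
    """Classify x with one linear discriminant per pattern.

    W:  (n_classes, n_features) weight vectors, one per pattern
    w0: (n_classes,) biases
    """
    g = W @ x + w0            # g_i(x) = w_i^T x + w_i0 for every pattern i
    return int(np.argmax(g))  # pattern whose discriminant is highest
```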

3.3.4 Support Vector Machines

Support Vector Machines (SVM) are basically the same as linear discriminants, see section 3.3.3, with a few features added to improve behavior on small training sets and to allow more advanced hyperplanes.

The reason for wanting more advanced hyperplanes is that the dimensionality must be high enough to give good separation between the different patterns. To create advanced hyperplanes, the input data is mapped into a higher dimension, which is often done by kernels.


Once the data is mapped into the higher dimension, it is processed in the same manner as with a regular linear SVM. The techniques for choosing dimensions and constructing general kernels are a field of research beyond the scope of this thesis [12].

The linear SVM is similar to the binary linear discriminant. The main difference from a linear discriminant function is that during training the SVM algorithm works towards maximizing the distance from the training data to the hyperplane, called margin maximization. This often results in a hyperplane that produces good results even when only small training sets are available.

The training of the SVM is a minimization of the cost function

$$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \left(1 - y_i f(x_i)\right)_+, \qquad (3.9)$$

where C is a tuning parameter that controls the trade-off between training errors and margin maximization [13]. The $()_+$ function is plotted in figure 3.1: if $y_i f(x_i)$ is larger than 1 there is no penalty, but if it is less than 1 there is a linear penalty scaled by the tuning parameter C.

Figure 3.1: The $()_+$ function used in the cost function, equation 3.9, for SVM training.

The SVM algorithm has been widely used in pattern recognition, mainly for its good generalization [14–17].
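As an illustration, a sketch of evaluating the cost in equation 3.9 for a linear $f(x) = w^T x + b$ (variable names are my own):

```python
import numpy as np

def svm_cost(w, b, X, y, C):
    """Eq. 3.9: margin term plus hinge penalties; labels y in {-1, +1}."""
    f = X @ w + b                         # decision values f(x_i)
    hinge = np.maximum(0.0, 1.0 - y * f)  # (1 - y_i f(x_i))_+
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```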

3.3.5 ψ-learning

ψ-learning is a variant of the SVM algorithm, modified to generally produce better results on sparse training sets that are not linearly separable [18]. The mathematical difference lies in the cost function, which for ψ-learning looks like

$$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \left(1 - \psi(y_i f(x_i))\right). \qquad (3.10)$$

This cost function is similar to the SVM one (eq. 3.9), but with a ψ() function in place of the $()_+$ function; the ψ() function is plotted in figure 3.2. The difference between the two cost functions is that SVM generates a linear cost as soon as $y_i f(x_i) < 1$, i.e. for training data close to the decision hyperplane. In ψ-learning there is also a linear cost for training data close to the decision hyperplane, but only as long as the data remains correctly classified; once it is misclassified the cost is doubled but static. In practice this means that the algorithm does not care about the magnitude of a misclassification, only about the fact that there is one.

Figure 3.2: The ψ function used in the cost function, equation 3.10, for ψ-learning training.

The reason why this algorithm is more complex than SVM is that the minimization of its cost function, equation 3.10, cannot be solved directly with quadratic programming, as is the case with SVM [18].
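From the description above, the per-sample penalty $1 - \psi(y_i f(x_i))$ in equation 3.10 can be sketched as a piecewise function; this is my reading of the text and figure 3.2, not the exact definition from [18]:

```python
import numpy as np

def psi_penalty(u):
    """Per-sample cost 1 - psi(u) for u = y * f(x), as described in the text."""
    return np.where(u >= 1.0, 0.0,        # outside the margin: no penalty
           np.where(u >= 0.0, 1.0 - u,    # inside the margin: linear penalty
                    2.0))                 # misclassified: doubled, static cost
```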


Chapter 4

Implementation

The methods in chapter 3 were implemented to create a system capable of tracking a specific object in a video stream.

4.1 Initiation

The system needs to train the pattern recognition algorithm, and it requires a point from which to start tracking. Both are handled during the initiation phase. To initiate the pattern recognition algorithm, some training data is needed: the first frame is presented and the object to be tracked is chosen, see figure 4.1. When the training algorithm has finished, the user is prompted to choose a starting position, from which the system will start tracking. The training is further discussed in section 4.3.


Figure 4.1: The user manually chooses which blocks are the foreground/object; everything else is background. (a) foreground, (b) background.

4.2 Detection

Since the system is given a starting point for the object, it only needs to act when movement occurs. The detection is therefore a motion detection algorithm. The technique is rather simple and works in two steps: first the stream is filtered with a high-pass filter, and then a threshold is applied to the output in order to detect motion. Being very simple, this algorithm is not robust, but it is very fast. To reduce the impact of noise, we first run a low-pass filter on each frame. This is done with a filter kernel: if the scale of the filter is 5, the kernel is a 5x5 kernel with all elements equal to 1/5². The result is a smoother image, see figure 4.2.

Figure 4.2: The image smoothed at different scales: (a) original, (b) scale = 15, (c) scale = 30, (d) scale = 45.
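A sketch of this smoothing step for a grayscale frame, using SciPy's uniform_filter as a stand-in for explicit convolution with the s×s kernel of 1/s² entries:

```python
from scipy.ndimage import uniform_filter

def smooth(img, scale=5):
    """Low-pass a grayscale frame with an s-by-s averaging kernel."""
    return uniform_filter(img.astype(float), size=scale)
```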

The filter is implemented with the help of a reference image, see figure 4.3:

$$ref_n = \alpha \cdot ref_{n-1} + (1-\alpha) \cdot img_n, \qquad (4.1)$$



where $ref_{n-1}$ is the previous reference image, $img_n$ is the current image from the stream, and α is a variable tuning how fast the reference image adapts to changes. Subtracting the reference from the current image gives a value that describes the amount of change in color at every pixel:

$$diff_n = \| img_n - ref_n \|. \qquad (4.2)$$

A threshold is applied to the $diff_n$ image to reduce the noise, and at pixels with values ≠ 0 some kind of motion is assumed, see figure 4.3 [1].


Figure 4.3: Results from the detection algorithm: (a) reference image, equation 4.1; (b) difference image, equation 4.2; (c) motion detected. The motion-detected image is binary, with ones where the difference image exceeds the threshold and zeros otherwise.

4.3 Recognition

To track a specific object, motion detection is not sufficient, since the detection algorithm gives no information about what is moving. The recognition block, see section 2.1, is responsible for recognizing the object that is to be tracked.

The recognition system in the present thesis is based on the system used for video object segmentation by Liu et al. [13]. The learning algorithm used


is ψ-learning, described in section 3.3.5. The algorithm is trained during the initiation process and is then used throughout the whole simulation.

4.3.1 Feature Space

The ψ-learning algorithm does not work directly on the image; it needs to be provided with some form of feature space. The feature space is calculated on blocks of 9x9 pixels, so the image is divided into such blocks. There is an overlap of 1 pixel between the blocks: the first block spans pixels 0–8, the second block pixels 8–16, and so on for both x and y coordinates. The feature space is 24-dimensional, with 8 dimensions per color channel:

1. $c(0,0)$
2. $\sqrt{\sum_{j=1}^{N-1} c(0,j)^2}$
3. $\sqrt{\sum_{k=1}^{N-1} c(k,0)^2}$
4. $\sqrt{\sum_{k=1}^{N-1} \sum_{j=1}^{N-1} c(k,j)^2}$
5. $(B_{(-1,-1)} + B_{(-1,0)} + B_{(-1,1)})/3$
6. $(B_{(-1,1)} + B_{(0,1)} + B_{(1,1)})/3$
7. $(B_{(1,-1)} + B_{(1,0)} + B_{(1,1)})/3$
8. $(B_{(-1,-1)} + B_{(0,-1)} + B_{(1,-1)})/3$

Here $c(k,j)$ are the coefficients of the Discrete Cosine Transform (DCT), computed with Matlab's dct2 on the 9x9 blocks. In this case the first 3 coefficients (N = 3) of the DCT are used, since the high-frequency coefficients tend to be small.


B(−1,−1)  B(−1,0)  B(−1,1)
B(0,−1)   B(0,0)   B(0,1)
B(1,−1)   B(1,0)   B(1,1)

Figure 4.4: Neighbouring blocks of 9x9 pixels.

The last 4 dimensions are the average colors of the 9x9 neighboring blocks on each side, see figure 4.4. The combination of DCT coefficients and neighboring-block color values gives good classification of surfaces as well as grouping information, which reduces the impact of noise [13].
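A sketch of the four DCT features (dimensions 1–4 above) for one 9x9 block of a single color channel, using SciPy's 2-D DCT as a stand-in for Matlab's dct2:

```python
import numpy as np
from scipy.fft import dctn

def dct_features(block, N=3):
    """block: 9x9 pixel block from one color channel."""
    c = dctn(block, norm='ortho')            # 2-D DCT, analogous to dct2
    f1 = c[0, 0]                             # DC coefficient
    f2 = np.sqrt(np.sum(c[0, 1:N] ** 2))     # first-row energy
    f3 = np.sqrt(np.sum(c[1:N, 0] ** 2))     # first-column energy
    f4 = np.sqrt(np.sum(c[1:N, 1:N] ** 2))   # cross-term energy
    return np.array([f1, f2, f3, f4])
```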

4.3.2 Training

When the object has been chosen as described in section 4.1, the algorithm is trained on this data; the blocks that were not chosen are used as background, see figure 4.1. The training is done with Matlab's fminsearch, which needs a start point in the feature space. This start point is calculated as the minimum squared error solution via the pseudoinverse:

$$w = (A^T A)^{-1} A^T Y,$$

where w is the weight vector, A is a matrix in which each row represents a training point, and Y is a matrix containing the corresponding class for each training point.
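The start point can be sketched directly in NumPy; a least-squares solve computes the same $w$ as the pseudoinverse formula but is numerically safer:

```python
import numpy as np

def mse_start_point(A, Y):
    """Minimum squared error solution w = (A^T A)^{-1} A^T Y.

    A: (n_samples, n_features) training points, one per row
    Y: (n_samples,) class labels, e.g. +1 foreground / -1 background
    """
    w, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return w
```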


Figure 4.5: Classification of an entire frame: (a) classification output, (b) frame. The green dots represent blocks classified as foreground/object and the red ones blocks classified as background.

4.3.3 Detecting

After training, each frame needs to be converted into the feature space. The image is divided into blocks as described in section 4.3.1, and each block is then classified as either foreground or background, see figure 4.5. To handle noise better, at least two blocks must be connected in order to be accepted as part of the object.

4.4 Updating

When the detection is finished, a point of interest, used during the optimization of the system, is calculated. The point of interest is computed by finding the block or blocks with the lowest y coordinate ((0, 0) being the upper left corner) and taking the mean of the x coordinates in that group.
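A sketch of that computation (coordinates as (y, x) pairs, names my own):

```python
import numpy as np

def point_of_interest(blocks):
    """blocks: array of (y, x) coordinates of object-classified blocks."""
    blocks = np.asarray(blocks)
    y_min = blocks[:, 0].min()           # topmost row of object blocks
    top = blocks[blocks[:, 0] == y_min]  # all blocks in that row
    return y_min, top[:, 1].mean()       # (y, mean of x coordinates)
```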


4.5 Prediction

An LMS filter, see section 3.1.1, is used to predict the next point of interest, which is used in the optimization of the system.

The LMS filter is designed as a "one step ahead" filter [10]: we want to predict the next coordinate using previous observations. Two filters were implemented, one for each coordinate:

$$x(n+1) = \sum_{k=0}^{N} \theta_x(k)\, x(n-k)$$
$$y(n+1) = \sum_{k=0}^{N} \theta_y(k)\, y(n-k).$$

During simulations the filters mostly kept the previous 6 (N = 6) coordinates, and µ was set around $10^{-8}$.
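A self-contained sketch of one such one-step-ahead predictor, mirroring the LMS update from section 3.1.1 (N = 6 and µ = 10⁻⁸ as stated above; one instance of the weights per coordinate):

```python
import numpy as np

N, mu = 6, 1e-8          # filter length and step size from the text
theta_x = np.zeros(N)    # weights of the x-coordinate predictor
theta_y = np.zeros(N)    # weights of the y-coordinate predictor

def predict_and_adapt(theta, past, new_value, mu=mu):
    """Predict from the N previous coordinates (newest first), then adapt
    the weights with the eq. 3.6 update once the true coordinate arrives."""
    prediction = theta @ past                       # sum_k theta(k) c(n-k)
    theta += mu * past * (new_value - prediction)   # LMS weight update
    return prediction
```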

4.6 Optimization

To make the system run faster, a number of constraints were added in order to reduce the work load.

The detection described in section 4.2 is based on a filter that uses earlier images. It is therefore not suitable to reduce the work load by only processing parts of the image.

The task that generated the heaviest load on the computer was the conversion from pixel blocks to the feature space. In the study by Yi Liu et al. [13], which uses the same feature space, the DCT calculations are the major contributor to this load. Therefore, two constraints must be fulfilled for the conversion to be performed. The first constraint is that only a certain number of blocks, σ, around the previous point of interest are checked; during simulations, typical values of σ were 5, 7 and 11. In an image with resolution 640x480 there are 4524 blocks on which the conversion would otherwise have to be made; with σ = 7, and therefore only 225 blocks, the number of conversions is reduced by a factor of 20 (a short worked example follows below). The second constraint is that a block is only converted if motion is detected, see section 4.2, in a certain percentage, γ, of the pixels in the block; typical values of γ during simulations were 60–80%. After these constraints were applied, the conversion was no longer the bottleneck of the system.
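To make the savings concrete (reading σ as the number of blocks checked on each side of the predicted point of interest): σ = 7 gives a search window of (2·7 + 1)² = 15² = 225 blocks, against the 4524 overlapping 9x9 blocks of a full 640x480 frame, so the number of feature-space conversions drops by a factor of 4524/225 ≈ 20.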


Chapter 5

Result

The system was tested by following a person's hand. The camera was mounted on the screen, and the person sat down in front of the camera.

5.1 Simulations

The simulations were made on a sequence of 91 frames, with 5 different values of σ. Data on the tracking error (the Euclidean distance from ground truth) and the number of blocks calculated, see section 4.6, was collected. How the system performed with the different σ is presented in figure 5.1, and table 5.1 shows the average values over all frames. The plots are shown separately in appendix B.

Figure 5.1: Plots of the simulations. Tracking error is the Euclidean distance from ground truth at each frame; blocks calculated is the number of blocks calculated at each frame.

σ    Tracking error    Blocks calculated
5    43.29             21.90
7    26.18             31.96
9    19.12             53.38
11   19.01             78.56
13   143.62            59.03

Table 5.1: The average values of the plots in figure 5.1.

Besides σ, a number of variables have an impact on the performance of the system. Four variables control the motion detection: scale, which controls the smoothing of the image, see figure 4.2; α, which controls the rate at which the reference image is updated over time; diffThres,

which tunes at what point a difference is classed as motion; and γ, which controls the percentage of pixels in a block that must be classified as motion for the block to be evaluated. Two variables control the prediction: filterLength, the length of the LMS filter, and µ, which controls its step size. During the simulations the variables were set to

scale = 15
α = 0.9
diffThres = 14
γ = 80%
filterLength = 6
µ = 13 · 10⁻⁷.

5.2 Color spaces

A number of color spaces were evaluated to see if there were any major differences in performance. The error rate on the training set after training, i.e. the amount of misclassification when classifying the training set itself, is presented in table 5.2. The conversion from the RGB image was done either with Matlab's built-in functions or as described by Sazonov et al. [19]. The reason why the background has such a high error rate is that, in the example in section 4.1 (see figure 4.1), the face is not part of the object but has features similar to the hand. The NTSC conversion (the YIQ color space) is supplied by Matlab and was used most extensively during the tests.

color space       foreground error   background error   total error
RGB               0.76%              10.57%             8.84%
normalized RGB    1.82%              12.82%             11.26%
HSV               0.15%              10.57%             9.10%
TSL               5.77%              11.04%             10.30%
YCrCb             1.82%              13.52%             11.86%
NTSC              1.67%              7.17%              6.39%

Table 5.2: Error rates for a number of color spaces.

5.3 Tracking

Under favorable conditions, such as sufficient light and little or no disturbance in the background, the tracking worked well. The system still managed when noise, such as back light and/or motion of other objects in the background, was introduced. The filter allowed the system to keep working even when the tracking failed for short periods of time, snapping on again after a few frames. Due to limitations in the system, the tracking will fail if a block is misclassified as the object, which can only occur if motion is detected in the block; this happens at frame 38 for σ = 13.

Figure 5.2: Two frames with motion blur.

The reason why the system performs well with σ = 9 and σ = 11 is that the search area is large enough for reliable tracking, yet small enough to avoid possible noise in other parts of the image. Fast motion is something a standard DV camera cannot handle: it introduces motion blur, see figure 5.2, making the hand blur into the background and change in color and texture.

5.4 Speed

Since the system is implemented in Matlab, it is hard to judge whether it could run in real time. With the system optimized as described in section 4.6, it runs on the Apple computer at roughly 1.3 fps. This frame rate is achieved even though Matlab does not utilize both processors and has poor performance when it comes to loops, since it does not optimize them the way programs written in C/C++ would; the code as written is also not optimal in terms of minimizing the work load. For each frame there are roughly 20–200 blocks, depending on the size of σ, that need to be calculated, and the detection part consists of pixel-by-pixel computations. This system could therefore utilize the full power of computers with multiple cores, and perhaps even distributed systems.


Chapter 6

Discussion

Overall, the system works as expected. It outperforms the advanced algorithms in terms of the computational power needed, and it is more stable than the fast ones. A drawback is that the system parameters depend on the object and its surroundings. Much of the failure could probably be compensated for with more sophisticated equipment: a more advanced camera could be configured with a shorter shutter time, reducing the tracking failures caused by motion blur.

Problems due to limitations in the algorithms of the system are harder: when the tracker fails because of misclassification and motion, for example, the problem will not be solved by better hardware. Also, if the object is big and has no texture, so that it is registered as a flat surface, the motion algorithm will only detect motion along the contours, giving a false representation of the object.

To improve the system, it might be possible to model the shape of the object and feed that to an adaptive filter, such as the Kalman filter [10, 20].


Introducing the Kalman filter would allow more complex constraints that also adapt during runtime. For example, the updating of the point of interest could be forced to resemble the motion of a human hand, and the shape could be forced to change more continuously. The drawback of such constraints is that the system becomes less general and harder to configure.

6.1 Future work

Though not within the scope of this thesis, the performance of the system could probably be improved by implementing it in a low-level language such as C or C++. The code could then be optimized further, making sure no unnecessary computations are made; not until then can we measure how well the system performs in real time. Stereo vision might make foreground detection easier, but a real-time stereo system is not trivial. To make the system even faster, it might be possible, for a simple object like a hand, to use simpler pattern recognition algorithms.


Appendix A

Mathematical cornerstones

A.1 Statistical Theory

Many of today's algorithms and systems use different forms of a priori knowledge to enhance their results.

Probabilities

A few probabilities are frequently used when working with pattern recognition and other statistical frameworks. The regular probability

$$P_X(x)$$

describes how likely it is that the variable X takes the value x ($P(x)$ and $P(X = x)$ are different notations for the same thing).

Then there is the joint probability

$$P_{X,Y}(x, y),$$


which describes how likely it is that X takes the value x and Y takes the value y ($P(x, y)$ and $P(X = x, Y = y)$ are different notations for the same thing).

The conditional probability

$$P_{X|Y}(x|y)$$

describes how likely it is that X takes the value x given that Y takes the value y ($P(x|y)$ and $P(X = x|Y = y)$ are different notations for the same thing). The definition is

$$P_{X|Y}(x|y) = \frac{P_{X,Y}(x, y)}{P_Y(y)}.$$

Bayes formula

If we know both $P_X(x)$ and $P_{Y|X}(y|x)$, we can, from the definition of conditional probability, get

$$P_{X,Y}(x, y) = P_{X|Y}(x|y)\,P_Y(y) = P_{Y|X}(y|x)\,P_X(x),$$

which can be rewritten as

$$P_{Y|X}(y|x) = \frac{P_{X|Y}(x|y)\,P_Y(y)}{P_X(x)}.$$

This is known as Bayes formula [2, 21].


Expected value

The expected value is the mean of a stochastic variable or of a function of it:

$$E[X] = m_X$$
$$E[f(X)] = m_{f(X)}.$$

For a discrete stochastic variable the expected value is calculated as

$$E[X] = \sum_{x \in X} x\, P_X(x).$$

Variance

The expected value gives the mean of the stochastic variable or function. The variance gives the expected squared distance between the stochastic variable and $m_X$:

$$\mathrm{Var}[X] = \sigma^2 = E[(X - m_X)^2].$$

The variance can also be expressed as

$$\mathrm{Var}[X] = E[X^2] - (E[X])^2$$
$$\mathrm{Var}[f(X)] = E[f^2(X)] - (E[f(X)])^2.$$


Covariance

The covariance is defined as

$$r_{XY} = \mathrm{Cov}[X, Y] = E[(X - m_X)(Y - m_Y)] = \sum_{x \in X} \sum_{y \in Y} (x - m_X)(y - m_Y)\, P_{X,Y}(x, y).$$


Appendix B

Simulation plots

The plots of the simulations described in section 5.1, separated into independent plots. Tracking error is the Euclidean distance from ground truth at each frame; blocks calculated is the number of blocks calculated at each frame.


Bibliography

[1] Rafael C. Gonzalez and Richard E. Woods, Digital Image Processing, Prentice-Hall, Inc., second edition, 2001.

[2] Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification, Wiley & Sons, Inc., second edition, 2001.

[3] Ville Kyrki and Danica Kragić, "Tracking rigid objects using integration of model-based and model-free cues," 2005.

[4] Nikolaos D. Doulamis, Anastasios D. Doulamis, and Klimis Ntalianis, "Adaptive classification-based articulation and tracking of video objects employing neural network retraining."

[5] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer, "Kernel-based object tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564–575, 2003.

[6] Aishy Amer, "Voting-based simultaneous tracking of multiple video objects," 2003, vol. 5022, pp. 500–511, SPIE.

[7] Danica Kragić, Visual Servoing for Manipulation: Robustness and Integration Issues, Ph.D. thesis, Royal Institute of Technology, 2001.

[8] A. Cavallaro, O. Steiger, and T. Ebrahimi, "Tracking video objects in cluttered background," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 4, pp. 575–584, 2005.

[9] M. Gastaud, M. Barlaud, and G. Aubert, "Tracking video objects using active contours," in MOTION '02: Proceedings of the Workshop on Motion and Video Computing, Washington, DC, USA, 2002, p. 90, IEEE Computer Society.

[10] Håkan Hjalmarsson and Björn Ottersten, "Lecture notes in adaptive signal processing," Tech. Rep., Signals, Sensors and Systems, Stockholm, Sweden, 2002.

[11] Michal Irani, Benny Rousso, and Shmuel Peleg, "Computing occluding and transparent motions," Tech. Rep., Institute of Computer Science, Jerusalem, Israel, 1994.

[12] Christopher J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.

[13] Yi Liu and Yuan F. Zheng, "Video object segmentation and tracking using ψ-learning," IEEE Transactions on Circuits and Systems for Video Technology, 2005.

[14] Anastasios Tefas, Constantine Kotropoulos, and Ioannis Pitas, "Using support vector machines to enhance the performance of elastic graph matching for frontal face authentication," IEEE Trans. Pattern Anal. Mach. Intell., 2001.

[15] Daniel J. Sebald and James A. Bucklew, "Support vector machine techniques for nonlinear equalization," IEEE Transactions on Signal Processing, 2000.

[16] Edgar Osuna, Robert Freund, and Federico Girosi, "Training support vector machines: an application to face detection," IEEE Computer Vision and Pattern Recognition, 1997.

[17] Massimiliano Pontil and Alessandro Verri, "Support vector machines for 3d object recognition," IEEE Transactions on Pattern Anal. Mach. Intell., 1998.

[18] Xiaotong Shen, George C. Tseng, Xuegong Zhang, and Wing Hung Wong, "On ψ-learning," Journal of the American Statistical Association, 2003.

[19] Vladimir Vezhnevets, Vassili Sazonov, and Alla Andreeva, "A survey on pixel-based skin color detection techniques," Tech. Rep., Graphics and Media Laboratory, Faculty of Computational Mathematics and Cybernetics, Moscow, Russia, 2003.

[20] Monson H. Hayes, Statistical Digital Signal Processing and Modeling, Wiley & Sons, Inc., first edition, 1996.

[21] Arne Leijon, "Pattern recognition," Tech. Rep., Signals, Sensors and Systems, Stockholm, Sweden, 2005.