
BioMechanical Engineering

Face Recognition for Cognitive Robots

F. Gaisser

Master of Science Thesis


Face Recognition for Cognitive Robots

Master of Science Thesis

Compulsory part of Master of Science in Mechanical Engineering - BioMechanical Design at Delft University of Technology

F. Gaisser

January 20, 2013

Faculty of Mechanical, Maritime and Materials Engineering (3mE) · Delft University of Technology


Copyright © BioMechanical Engineering (BME)
Copyright © NEC Japan
All rights reserved.


Abstract

In the near future the elderly population will increase in size up to a point that there are not enough people to provide support. One solution would be to develop service robots that can perform household tasks and in that way allow the elderly to live independently for longer.

These service robots have to be able to adapt to changing environments, which requires a flexible framework that can recognize and learn objects regardless of the environment or the robot architecture. In this thesis such a framework, consisting of localization, description, classification and learning modules structured as a pipeline, is introduced. Different types of objects require different methods in each of the modules; for efficient memory usage these methods are dynamically loaded into the pipeline.

For human-robot interaction, users have to be robustly identified and learned online. Existing state-of-the-art methods for face recognition, such as K-Nearest Neighbours (KNN) and Principal Component Analysis (PCA), do not support online learning of faces and lack the recognition performance required in real-world situations. Hence the Class Average Principal Component Analysis (CAPCA) method is developed in this thesis as a descriptor. This method provides the required performance by increasing the separability of the classes, maximizing the inter-class and minimizing the intra-class variations. Training speed is increased significantly by selecting only the most representative samples.

Additionally, to allow for classification of unknown faces, the novel Certainty K-Nearest Neighbours (CertKNN) method is introduced. Its main benefit over state-of-the-art methods is that it finds the relation between the distance of a classification and the certainty of that classification. This relation is calculated automatically from the data belonging to each class, so that near-optimal unknown classification can be done.

Finally, to further improve recognition performance, a method has been developed that utilizes multiple frames in classification. To demonstrate the benefits of the introduced methods, extensive experiments have been performed on a state-of-the-art face recognition database. The best performance achieved was an F-measure of 96% for known and 90% for unknown classification. Training speed was increased up to 100 times, which allows for online learning of faces. Lastly, the introduced methods were applied on the Delft Robotics service robot and extensively tested in the RoboCup@Home challenge.


Table of Contents

Acknowledgements vii

1 Introduction 1
  1.1 Problem Definition 2
  1.2 Outline 2

2 Image Processing 3
  2.1 Image enhancement 3
    2.1.1 Color correction 3
    2.1.2 Denoising 5
    2.1.3 Resizing 5
  2.2 Detection 6

3 Descriptors 9
  3.1 Colour descriptor 10
  3.2 Shape descriptor 10
    3.2.1 Edge / PAS feature descriptor 11
    3.2.2 Size descriptor 11
    3.2.3 Fourier shape descriptor 11
    3.2.4 Functional Shape Features 12
    3.2.5 Shape descriptors for robotics 12
  3.3 Local descriptors 12
    3.3.1 Scale-Invariant Feature Transform: SIFT 12
    3.3.2 Gradient Location and Orientation Histogram: GLOH 13
    3.3.3 Speeded Up Robust Feature: SURF 13
    3.3.4 Histogram of Oriented Gradients: HOG 13
    3.3.5 Gabor filters 13
  3.4 3D based descriptors 14
  3.5 Dimensionality reduction methods 15
    3.5.1 Principal Component Analysis: PCA 15
    3.5.2 Bag of Words: BoW 15

4 Classification and Learning of Objects 17
  4.1 Classification methods 17
    4.1.1 SVM 19
    4.1.2 KNN 19
    4.1.3 Tree-like classification 20
    4.1.4 Boosting classification 20
    4.1.5 Combining descriptors and classifiers 20
    4.1.6 Certainty of classification 21
    4.1.7 Categories in Classification 22
  4.2 Learning 23
    4.2.1 Offline training 23
    4.2.2 Online training 23
    4.2.3 Classification and learning of unknown objects 24
    4.2.4 Class retrieval for unknown objects 24
  4.3 Classification applied on robots 25

5 Recognition Framework 27
  5.1 Architecture 27
    5.1.1 Detector 28
    5.1.2 Tracker 28
    5.1.3 Descriptor 29
    5.1.4 Memory 29
    5.1.5 Classifier 29
    5.1.6 Learning 29
    5.1.7 Dynamical loading 29

6 Aspects of Face recognition for robotics 31
  6.1 Related work 32
    6.1.1 Detection and Tracking 32
    6.1.2 Descriptors 32
    6.1.3 Classifiers 33
  6.2 Face recognition applied on a robot 34

7 Introduced methods for robust description and learning of faces 35
  7.1 Class Average PCA (CAPCA) 35
    7.1.1 Method description 35
    7.1.2 Obtaining the class average 36
    7.1.3 Differences with Linear Discriminant Analysis (LDA) 41
    7.1.4 Discussion 41
  7.2 Certainty KNN 42
    7.2.1 Method 42
    7.2.2 Unknown classification 44
    7.2.3 Improvement of CertKNN 45
  7.3 Multi-frame classification 45
    7.3.1 Method 45
    7.3.2 Discussion 47

8 Application in Robotics 49
  8.1 Face recognition for human robot interaction 49
    8.1.1 Task 49
    8.1.2 Facerecognition node 50
    8.1.3 Face tracking 50
    8.1.4 NeckController 51
  8.2 Experiments & Results 52
    8.2.1 Setup 52
    8.2.2 Experiment 1 53
    8.2.3 Experiment 2 54
    8.2.4 Experiment 3 55
    8.2.5 Experiment 4 56

9 Conclusion 57

Bibliography 59

Glossary 65
  List of Acronyms 65
  List of Symbols 65

List of Figures 67

List of Tables 69


Acknowledgements

Finishing your master's is a long and difficult road. The end is marked by finishing this thesis, but getting to the top of this steep mountain means taking an unclear and winding road. I would like to thank my supervisor Dr. ing. Maja Rudinac for her great guidance and support on this journey. She motivated me to understand the matter more clearly and to find better solutions.

This steep mountain was not climbed alone; many people came before me. Of them I would like to thank Aswin and Susanna for their collaboration and advice, which made the road clearer to me.

Finally I would like to thank Prof. P.P. Jonker for his guidance and for the opportunity to do my thesis in his department. I enjoyed doing the research and hope to stay.

Delft University of Technology
F. Gaisser
January 20, 2013


“A circle is the reflection of eternity. It has no beginning and it has no end - and if you put several circles over each other, then you get a spiral.”
— Maynard James Keenan


Chapter 1

Introduction

It is expected that in the year 2050 about a third of the world population will be 60 years and older. In the more developed countries the elderly will outnumber the young. [1] These elderly will need more care and support in their living than can currently be provided by the health care system. Therefore it is necessary to find solutions for this problem. One of these solutions is to support the elderly in such a way that they can live independently for longer. This can be done by using service robots which can take over certain tasks at home. [2, 3]

In the field of personal assistance or service robots a lot of research has been done. But even so, there is still a lot to explore. This can also be seen in the commercially successful robots currently available on the market. These are simple vacuuming or floor sweeping robots, like the Roomba and Scooba [4]. This could be described as the first generation of service robots. These robots have a low level of intelligence; they only need to know where obstacles are and how to avoid them. The more advanced ones can recognize whether a certain area is dirty and has to be cleaned. Also the number of Degrees of Freedom (DOF) is limited to three, namely two to navigate and one for performing their task. This allows for fairly simple programming and intelligence of the robot.

From service and personal assistance robots that provide support at home somewhat more is required. The next step in the evolution of service robots could be the addition of more complex tasks. These tasks could be actions like opening doors, handing over objects and maybe even tidying a room. [5, 6] To perform these tasks the robot has to understand more of its surroundings. It has to recognize different types of objects and classify them. Further, the robot has to understand what it can do with an object and what the state of the object is. [ref] For example, when opening a door, the door and the handle on it have to be recognized. Next the robot should understand what it can do with the handle, namely open the door. Further, the robot has to perceive that the door is open so it can go through.

This step towards a fully functional personal assistant robot is quite big. In the literature survey it was found that understanding and cognitive interaction with the robot's surroundings require a framework for dynamic visual recognition. This visual recognition framework provides the robot with a way to perceive its surroundings and lays a basis for affordance modeling.


1.1 Problem Definition

Because of the expected growth of the elderly population, there is a need for a way of supporting these people in their independent living. One of the possible solutions is to develop personal assistant robots. The current technology of personal assistant robots lacks the required robustness and intelligence to perform these tasks.

The first problem is the required visual skill of robustly recognizing objects in the varying surroundings of the robot. The robot will encounter various environments with different types of objects. This requires that the robot can recognize objects independent of the state of the object and of changes in its environment. It should also be able to learn objects that were previously unknown. In the field of pattern recognition many methods exist for visual recognition of objects, but they are designed for specific types of objects such as faces, cars, people, etc. This is not very efficient for robotic applications, given the various environmental conditions, contexts and the limitations of the robot's hardware.

Hence a flexible framework is required that can recognize and learn objects regardless of the environment or the robot architecture. To efficiently organize the framework, the current state-of-the-art methods for localization, recognition and learning need to be evaluated. Furthermore, dynamic loading of different methods within the framework is necessary.

Further, it has to be investigated whether such a framework can be utilized for the specific case of face recognition and learning necessary for human-robot interaction. Since the current state-of-the-art methods do not provide fast and precise face recognition, novel methods are required. Additionally, to allow online learning of novel users, methods for classification of unknown faces need to be developed.

This thesis provides solutions to the above-mentioned problems; a detailed outline of how they are presented is given below.

1.2 Outline

First, the literature survey is presented in chapters 2-4 to give insight into the structure of the framework and an overview of the available methods in visual recognition. Next, in chapter 5, the recognition framework and its components are introduced. An introduction to the aspects of face recognition and the requirements in robotic applications is given in chapter 6. The methods that satisfy these requirements are introduced in chapter 7 and extensive testing is detailed in chapter 8. Future work and conclusions follow.


Chapter 2

Image Processing

In the process of object recognition it is important that the data retrieved from the image to describe the object is consistent over time, place and environmental conditions. Properties like colour, light, viewpoint, scale and orientation of the object in the image play a vital role in robust retrieval of the description of an object. To obtain such a description the image has to be enhanced to improve the image quality, and the object has to be localized in the image. This chapter describes the processing required to obtain the data of an object from an image. First, image enhancement methods will be described. Next, methods for localizing the object in the image will be introduced.

2.1 Image enhancement

The robot can encounter different environmental conditions, like illumination intensity and light temperature. These conditions influence the appearance of the captured image of the environment. Further, the method of capturing an image of the robot's surroundings has a large influence on the appearance of the image. To be able to obtain a consistent description of an object, image enhancement methods are required that improve the quality of the image.

2.1.1 Color correction

The recognition of objects often depends on the perception of the color and light intensity of an object. Since the illumination temperature (color) and the illumination intensity have a large influence on the appearance of color and intensity in an image, it is of great importance to correct for any changes in these environmental conditions. A good example to keep in mind is a blue toy car. Under a warmer color temperature the car can appear not blue but purple under a red light, and green under a yellow light. In general the color of an object should be corrected towards a standard like D65 [7]. This standard describes the color temperature of daylight.


Figure 2.1: Color histogram correction [9]. Left: images from different cameras with different lighting. Right: corrected images.

Figure 2.2: Contrast correction [10]. Left top: original; rest: corrected.

The image can be enhanced to create a consistent color and intensity description. Here methods like normalisation, averaging and histograms [8] are often-used tools to correct changes in the appearance of color and intensity in the image. In [9] methods are proposed that use histograms to correct color differences in the image between cameras and under different illumination temperatures. Figure 2.1 shows an example of how an image can be corrected for color appearance due to different lighting conditions. Figure 2.2 shows an example of the correction of light intensity and contrast using histograms [10].
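As a minimal illustration of such colour correction, the sketch below implements the simple gray-world assumption rather than the histogram methods of [9]: each channel is rescaled until the three channel means are equal. The function name and the sample colours are illustrative assumptions, not taken from the thesis.

```python
def gray_world_correct(pixels):
    """White-balance a list of (r, g, b) pixels under the gray-world
    assumption: the average colour of a scene is assumed to be gray,
    so each channel is scaled until the channel means coincide."""
    n = len(pixels)
    # Per-channel means over all pixels.
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    gray = sum(means) / 3.0
    gains = [gray / m for m in means]
    return [
        tuple(min(255, int(round(p[c] * gains[c]))) for c in range(3))
        for p in pixels
    ]

# A scene captured under warm (reddish) light: red channel inflated.
warm = [(200, 100, 100), (180, 90, 90), (220, 110, 110)]
balanced = gray_world_correct(warm)
```

After correction the three channels of each pixel agree, which is the kind of consistency the descriptors in later chapters rely on.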

Figure 2.3: Effect of Gaussian blur on methods like edge detection [11]. Left: no blur, too many edges. Middle: blur, good edges. Right: too much blur, edges lost.


Figure 2.4: Gaussian blur [13]. Left: original images. Middle: normal blur, noise gone. Right: too much blur, details lost.

2.1.2 Denoising

The visual data can be subject to large quantities of noise, due to errors and imperfections in the image capturing process. Therefore the image is often improved with a noise reduction method, like a Gaussian filter, though many other methods exist. This kind of image enhancement can have a large influence on the performance of certain detection and description methods. This is important for methods like Harris interest points [12] and edge detection, where small features are used, see Figure 2.3. Unfortunately these and other features can disappear due to the denoising method used, which then also removes data that describes the object, as can be seen in Figure 2.4.
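A Gaussian filter like the one mentioned above can be sketched in plain Python as a separable convolution (rows, then columns). A real system would use an optimized library; this version with replicate borders is only illustrative:

```python
import math

def gaussian_kernel(sigma, radius):
    """Discrete 1-D Gaussian kernel, normalised to sum to 1."""
    k = [math.exp(-(x * x) / (2.0 * sigma * sigma))
         for x in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def blur_row(row, kernel):
    """Convolve one row with the kernel, replicating border pixels."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(row)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - radius, 0), len(row) - 1)
            acc += w * row[idx]
        out.append(acc)
    return out

def gaussian_blur(image, sigma=1.0):
    """Separable Gaussian blur: filter all rows, then all columns."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    rows = [blur_row(r, kernel) for r in image]
    cols = [blur_row(list(c), kernel) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

# A single bright pixel gets smeared over its neighbourhood,
# exactly the behaviour that can erase small features (Figure 2.4).
impulse = [[0.0] * 5 for _ in range(5)]
impulse[2][2] = 255.0
blurred = gaussian_blur(impulse, sigma=1.0)
```

Larger sigma spreads the impulse further, which is the "too much blur" failure mode shown in Figures 2.3 and 2.4.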

2.1.3 Resizing

Objects in image data can differ in size due to viewing distance. Therefore images sometimes have to be resized to describe the object according to the size that is expected. A good example would be an image of a car in the distance and a toy car photographed from close by, as seen in Figure 2.5. In both images the size of the car is about the same, but in reality the sizes of the two objects are quite different. Scaling according to viewing distance is a solution, but it requires depth information in addition to the image information.

Figure 2.5: Size appearance in both images is similar for differently sized objects. Left: real car. Right: toy car.
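The depth-based scaling suggested above amounts to one line under a pinhole camera model, in which apparent size is inversely proportional to distance. The helper below is a hypothetical illustration, not a function from the thesis:

```python
def normalized_size(size_px, depth_m, reference_depth_m=1.0):
    """Rescale an apparent pixel size to the size the object would
    have at a fixed reference distance. Under a pinhole model the
    apparent size is inversely proportional to the depth, so
    multiplying by depth undoes the perspective shrinking."""
    return size_px * depth_m / reference_depth_m

# A distant real car and a nearby toy car may both cover 100 pixels,
# but depth information separates them immediately.
real_car = normalized_size(100, depth_m=40.0)  # large at 1 m reference
toy_car = normalized_size(100, depth_m=0.5)    # small at 1 m reference
```

With a depth sensor, this normalised size can serve directly as a simple size descriptor.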


Figure 2.6: Image pyramid with scaling of 50% for each step.

But image data can also be resized to retrieve the data that is invariant to scale. At a lower scale only the critical information is retained. Scaling images to a series of different sizes to retrieve this data is called an image pyramid. An example can be seen in Figure 2.6; the same method can also be used in combination with a Gaussian filter or other methods [13] to create scale-invariant object descriptors.

Another application of scaling is to reduce the amount of data that has to be processed. This can be a major advantage in robotics, since processing speed is of great importance. Scaling is also often applied in object detection to achieve scale-invariant detection. The previously introduced image enhancement methods can be used to obtain a higher image quality.
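A basic image pyramid as described here can be sketched by repeatedly halving the image, averaging 2x2 blocks. This is only a sketch: a proper pyramid as in [13] applies a Gaussian pre-filter before each halving, which is omitted here.

```python
def halve(image):
    """Downscale a grayscale image by 50% by averaging 2x2 blocks."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * y][2 * x] + image[2 * y][2 * x + 1]
             + image[2 * y + 1][2 * x] + image[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(w)
        ]
        for y in range(h)
    ]

def pyramid(image, levels):
    """Image pyramid: the original plus repeated 50% downscalings,
    matching the structure shown in Figure 2.6."""
    out = [image]
    for _ in range(levels - 1):
        out.append(halve(out[-1]))
    return out

flat = [[1.0] * 4 for _ in range(4)]
levels = pyramid(flat, 3)  # sizes 4x4, 2x2, 1x1
```

A detector can then be run on every level, so that one fixed-size template matches objects at several apparent sizes.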

2.2 Detection

In robotics it is generally not the case that an object is located in the center of the captured image. Further, the scale and the orientation of the object are often different, while at the same time the viewpoint differs between the various encounters with the same object. The background of the image generally also changes. This requires a method that can robustly retrieve the location of the object in the image independent of scale, orientation, viewpoint and environment.

This type of method is called detection. In general detectors use either feature detection or segmentation to obtain parts of the image, which are classified to find the location of the object in the image. Segmentation is a method which divides the entire image into parts. There are many methods available to divide the image, but they are often tuned to a specific case. One method used in robotics is dominant plane segmentation [14] using 3D information. Here feature detection methods are described, since they can easily be used to find specific objects.

Feature detectors search the whole image for features which form regions of interest and could be part of an object. These features can be edges, corners, blobs or any other features [15, 12]. These features are then classified using a two-class classifier, with the classes object and background. The classifier has to be trained to be able to detect objects of a certain kind.

Figure 2.7: Harris point detector, detecting points in images with different viewpoints [12].

Using multiple, mostly overlapping areas that are classified as object, an area can be found containing the data describing the object. Because this type of detector can only detect objects of a certain kind, many types of detectors have to be generated to be able to detect a large set of objects.

Detecting various kinds of objects in an environment can be done using a single object detector or multiple detectors for more general types of objects, like faces, people, bottles, cups and chairs. The latter is more calculation-intensive but can be useful to search for a specific object in an environment.

Feature detectors like the Harris and Hessian affine region detectors [12] are invariant to scale. These methods match similar regions between images, which can be used in object detection. This is a fast and quite well performing method which can also be trained quickly. It can therefore be a good method for the detection of objects that have to be learned online. Another method is the Kadir-Brady saliency detector [15, 16]. This method is invariant to affine transformations and is very suitable for detection of objects from multiple viewpoints.

A different type of detector is the Haar-like feature detector, often used for the detection of faces or cars [17, 18, 19]. This method encodes the orientation of contrasts in regions and their relationships. It requires segmentation of the image into small parts that are used for encoding the orientation of the contrasts. The method can be boosted by combining other kinds of features and using false positives. It is fast in detection, but very slow in training, because it requires a large amount of training data; for face training, 5000 to 8000 faces are recommended. For the detection of general classes, like faces, the detector does not require online learning and this method is a very good choice. An example can be seen in Figure 2.9.
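The speed of Haar-like features comes from the integral image (summed-area table), which makes the sum over any rectangle an O(1) lookup. The sketch below evaluates a single hypothetical two-rectangle feature; a full detector combines thousands of such boosted features, which is far beyond this illustration.

```python
def integral_image(img):
    """Summed-area table: ii[y][x] holds the sum of img over all
    rows < y and columns < x. The extra zero row and column keep
    the rectangle lookups branch-free."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle [x, x+w) x [y, y+h), in O(1)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def haar_two_rect(ii, x, y, w, h):
    """Two-rectangle Haar-like feature: left half minus right half.
    A large response marks a vertical contrast edge in the window."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)

# A 4x4 patch whose left half is bright: strong positive response.
patch = [[255, 255, 0, 0] for _ in range(4)]
ii = integral_image(patch)
response = haar_two_rect(ii, 0, 0, 4, 4)
```

Because every feature costs only a handful of table lookups, thousands of windows per frame can be scanned, which is why detection is fast even though training is slow.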


Figure 2.8: Haar-like features for face detection

Figure 2.9: Face detection using Haar-like features


Chapter 3

Descriptors

In the previous step, regions that contain objects were found. This image data has to be transformed into a form that describes the object accurately and in a way that is unique between and within object classes. In this way objects of the same kind are described in the same way, while differing enough in description from other kinds of objects.

Descriptor methods can describe the object in different ways: is it red, or is it cube shaped? But there are a lot of objects that are red and cube shaped, so a single unique description, or a combination of descriptions, should be found to describe the object uniquely and accurately. These descriptions should also be invariant to environmental conditions.

For example, for a robot tidying children's toys it is important to find a way to describe toys that is unique to toys and is not used for other kinds of objects. In this case the function of an object makes it a toy: you can play with the object. It is rather difficult to derive the function of an object solely from its appearance; there is no colour, texture or shape that is used exclusively for objects you can play with. Therefore a way has to be found to classify an object as a toy.

How do people determine that an unknown object is a toy? They use various combinations of descriptions, like object features, context and interactions, to determine the function of an object. These are based on experience from previous observations of objects and on observations of the current situation. Therefore this research will focus on which combination of visual recognition methods can be used to determine the function and interaction of an object. Context and interaction with the object will be discussed at a later point in this literature survey.

To base classification of toys on visual features, combinations of description methods are required. These descriptors can describe properties of a certain type of toy. For example, a texture descriptor can describe the plush skin of a plush toy. But this descriptor does not describe all types of toys. Therefore it is also necessary to combine these descriptors of multiple subclasses in a toy classifier, to be able to learn an unknown object as a toy. In the next sections, description methods are described in detail.

Master of Science Thesis F. Gaisser


Figure 3.1: Example of toy cars for which a colour descriptor could work well.

3.1 Colour descriptor

The colour of an object can be used to describe it. This has the advantage of being simple and fast, but also disadvantages: what exactly can be defined as 'red', and what if the colour of the light changes? Then the perceived colour of the object also changes. Converting the colour to, for example, the HSV colour space (several similar colour spaces exist) can help to achieve an illuminant-invariant description. Here the colour is described with Hue, Saturation and Value: hue describes the colour itself, saturation how much colour is present, and value the illumination intensity. So when the light intensity changes, the hue does not change, and if the colour of the light changes, the saturation stays the same. [20]

For toy description, colour can be used to exclude objects as a certain type of toy. For example, the robot could learn wooden blocks as toys based on shape, supported by learning the brown colour. In that way a red plastic box with the same shape as a wooden block will not be classified as a wooden block.

For robotics a colour descriptor can be very useful, since it is fast, and combined with other descriptors it can provide a robust description of the object. Changes in colour due to illumination changes can be corrected by the previously introduced colour enhancement methods.
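The intensity invariance described above can be illustrated with Python's standard `colorsys` module: halving the light intensity of an RGB sample changes the value channel but leaves hue and saturation untouched. A minimal sketch; the sample colours are made-up illustrative values.

```python
import colorsys

# the same red surface seen under bright and under dim light (RGB in [0, 1])
bright = (0.8, 0.2, 0.2)
dim = (0.4, 0.1, 0.1)  # every channel halved: half the light intensity

h1, s1, v1 = colorsys.rgb_to_hsv(*bright)
h2, s2, v2 = colorsys.rgb_to_hsv(*dim)

# hue and saturation are (numerically almost) unchanged;
# only value, the intensity channel, has dropped
```

A descriptor built on hue and saturation only is therefore much less sensitive to the light intensity than one built directly on RGB.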

3.2 Shape descriptor

The shape of an object can be used to describe it. Shape can be more unique than colour, but it is also harder to describe. For example, what shape is a car? It is a long flat box with a trapezium on top, but from the front it looks like a rectangle. So how can objects be described in this way? Shape descriptors do this by describing objects using points, corners, edges or other primitives. Shape is sometimes considered the most important type of feature for describing an object. Therefore several types of shape descriptors are discussed next.


3.2.1 Edge / PAS feature descriptor

The shape of an object is often described by edges within the object, but also by the edge formed by the silhouette of the object. Therefore edge detection is often used in shape description. In [21] edges are detected and combined into continuous edges, called Pair of Adjacent Segments (PAS) features. These PAS features are combined into shapes, which are compared with shape models of objects. This appears to be a well-performing application of edge detection in a shape descriptor. One of the major remaining problems in shape description is variance in viewpoint and scale.

This method can be used to create a descriptor for the basic shape of objects, like the cube shape of wooden blocks or the cylindrical shape of a mug or a bottle, which could be a good basis for describing objects. It would also be a suitable description method for object detection.

3.2.2 Size descriptor

The size of an object can be used to describe it. Wooden blocks are unlikely to be larger than 10 to 15 cm. Size can also be used to differentiate between a toy car and a real one. This makes it a good descriptor for detecting objects: for example, in a person detector using HOG features, the size of the detected human has a certain range, which means a described area can be rejected if its size does not fit the model. Since the apparent size of an object depends heavily on how the image is captured, depth information or comparison with other objects has to be used.
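With depth information available, the metric size of a detected region follows from the pinhole camera model: real size ≈ pixel size × depth / focal length. A sketch of such a size gate; the focal length is a typical but assumed value, and the 10 to 15 cm range is the wooden-block example from the text.

```python
def metric_height(pixel_height, depth_m, focal_length_px):
    # pinhole camera model: h_real = h_pixels * depth / f
    return pixel_height * depth_m / focal_length_px

def could_be_wooden_block(pixel_height, depth_m, focal_length_px=525.0):
    # reject detections whose physical size falls outside 10-15 cm
    h = metric_height(pixel_height, depth_m, focal_length_px)
    return 0.10 <= h <= 0.15

# a 63-pixel-tall region seen 1 m away with f = 525 px is ~12 cm: plausible block;
# the same region at 4 m away would be ~48 cm: rejected
```

Such a gate is cheap and removes many false positives before the more expensive descriptors run.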

3.2.3 Fourier shape descriptor

This is a description method that can describe the outer contour of a closed 2D shape independent of location, scale and orientation. It can be used to detect shapes with an elliptical or spline basis. For the description of toys this method can be used to describe the shape of plush toys and parts of toys; see Figure 3.2 for an example. Due to viewpoint variance this method is not commonly used in robotics. [22]


Figure 3.2: Fourier shape descriptor [23]. a) original shape, b) contour, c) reconstructed shape with 420 Fourier descriptors, d) reconstructed shape with 28 Fourier descriptors.
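The invariances claimed above come directly from the FFT of the contour treated as complex numbers: dropping the DC term removes translation, dividing by the first harmonic removes scale, and taking magnitudes discards rotation and starting point. A minimal sketch assuming NumPy; a real implementation would first resample the contour uniformly.

```python
import numpy as np

def fourier_descriptor(contour_xy, n_coeffs=8):
    # contour points as complex numbers: z = x + iy
    z = contour_xy[:, 0] + 1j * contour_xy[:, 1]
    F = np.fft.fft(z)
    F[0] = 0                         # drop DC term: translation invariance
    F = F / np.abs(F[1])             # divide by first harmonic: scale invariance
    return np.abs(F[1:n_coeffs + 1])  # magnitudes: rotation/start-point invariance

# eight points sampled along a square contour, traversed counterclockwise
square = np.array([[0, 0], [1, 0], [2, 0], [2, 1],
                   [2, 2], [1, 2], [0, 2], [0, 1]], dtype=float)
moved = 3.0 * square + np.array([5.0, 7.0])  # scaled and translated copy

d1 = fourier_descriptor(square)
d2 = fourier_descriptor(moved)  # identical descriptor despite the transform
```

Because the FFT is linear, scaling and translating the contour changes only the magnitude of the harmonics and the DC term, which the normalization removes.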


3.2.4 Functional Shape Features

The shape of an object can sometimes say something about its function. Especially for features like buttons, knobs and levers, it can be learned that something can be done with them. Another example of a functional shape feature is a cylindrical shape with a closed bottom and an open top, which is commonly used for liquid containers. This knowledge can be used to create a functional shape descriptor that describes the interaction with an object.

Unfortunately a good descriptor of this kind has not yet been published. Therefore it is proposed here to develop such a method, using shape description techniques like PAS features, Haar wavelets, HOG or other local descriptors, to describe objects according to their function.

3.2.5 Shape descriptors for robotics

The previously introduced shape descriptors can all be applied in robotics, with some limitations. Humans distinguish objects quite well by shape: consider the case where you receive a present and, based on the shape you can feel through the wrapping paper, you know what the present is, like a book. Unfortunately no method exists yet that can describe the shape of an object as robustly as humans do.

3.3 Local descriptors

Local descriptors are a class of descriptors that describe features based on local properties of objects rather than global properties. They are commonly used and often describe the object quite well. The next sections describe the various methods.

3.3.1 Scale-Invariant Feature Transform: SIFT

This is a method introduced in [24] that is quite suitable for object detection [25]. It can describe features invariant to scale and to affine transformations up to a 40-degree viewpoint angle. It uses a four-step method: first, interest points that are invariant to scale and orientation are found using a Difference-of-Gaussians (DoG) method. Next, for each interest point a model is fit to determine scale and location, and stable keypoints are selected. Based on the local gradient directions, orientations are assigned to each keypoint. Finally, the area around each keypoint is described with local gradients at the selected scale. This makes the description invariant to scale, orientation, significant shape distortions and changes in illumination. Using an invariant point detector like Harris-Laplace, as in [12], could improve the performance of SIFT.

This method can be used for object detection of textured objects that look different from different viewpoints, for example complex-shaped objects like the frog or train in Figure 3.3. It can also be used to detect the orientation of an object, which can be useful in creating a more robust recognition pipeline.
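The first SIFT step, Difference-of-Gaussians interest point detection, can be sketched in a few lines of NumPy. This is a single-scale sketch only, without the scale-space pyramid, subpixel refinement or edge rejection of the full method; the blur width and threshold are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_blur(img, sigma):
    # separable Gaussian convolution with a truncated kernel
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kern = np.exp(-x**2 / (2.0 * sigma**2))
    kern /= kern.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kern, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kern, mode="valid")

def dog_keypoints(img, sigma=1.0, k=1.6, thresh=0.02):
    # Difference-of-Gaussians: subtract two blurs, keep strong local extrema
    dog = gaussian_blur(img, k * sigma) - gaussian_blur(img, sigma)
    points = []
    for y in range(1, dog.shape[0] - 1):
        for x in range(1, dog.shape[1] - 1):
            patch = dog[y - 1:y + 2, x - 1:x + 2]
            v = dog[y, x]
            if abs(v) > thresh and (v >= patch.max() or v <= patch.min()):
                points.append((y, x))
    return points

# a single bright dot is an isolated blob: the DoG responds at its centre
img = np.zeros((16, 16))
img[8, 8] = 1.0
```

The full detector repeats this over a pyramid of scales and also compares each candidate with its neighbours in the adjacent DoG layers.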


Figure 3.3: Example of how a local descriptor method, like SIFT, GLOH or SURF, can be used in the detection of objects. [24]

3.3.2 Gradient Location and Orientation Histogram: GLOH

The SIFT descriptor uses a rectangular grid, whereas GLOH [26] uses a log-polar location grid with 17 bins. The gradient orientations are quantized into 16 bins instead of 8, which results in a 272-bin histogram compared to 128. Using PCA this descriptor vector is reduced to 128 dimensions. This method has the same applications in robotics as SIFT, with slightly better performance for feature-based matching.

3.3.3 Speeded Up Robust Feature: SURF

SURF is an improved SIFT method [27]. It uses a fast-Hessian detector for interest points. In SIFT, image pyramids have to be built for the different scales; SURF instead uses a scalable filter for detection at multiple scales. Next, two Haar-wavelet filters are used to determine the orientation. Then a square region is created around the interest point, aligned with the retrieved orientation. This area is divided into smaller regions and described using the same local gradients. The major advantage of SURF over SIFT is that it is faster to compute, which makes it preferable over SIFT for object detection and its application in robotics.

3.3.4 Histogram of Oriented Gradients: HOG

This is a description method where for every point or region the orientation of the gradients is used to describe that area [28]. There it is used to detect humans in images, where it outperformed all other methods, like Haar wavelets, SIFT and shape contexts. This method can also be used for gesture and pose recognition. In robotics its application lies in describing complex and deformable shapes like persons; detection of persons is of great importance for learning from observation.
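The core of HOG, a gradient-orientation histogram over one cell, can be sketched as follows, assuming NumPy. This shows one 8-bin unsigned-orientation histogram, without the block normalization of the full method.

```python
import numpy as np

def hog_cell(cell, n_bins=8):
    # gradient magnitude and unsigned orientation per pixel
    gy, gx = np.gradient(cell.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi                   # fold into [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), mag.ravel())         # magnitude-weighted votes
    return hist / (np.linalg.norm(hist) + 1e-9)        # L2 normalization

# a vertical step edge: all gradient energy lies at orientation 0
edge = np.tile([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0], (8, 1))
h = hog_cell(edge)
```

The full descriptor concatenates such histograms over a grid of cells and normalizes them per overlapping block, which is what gives HOG its illumination robustness.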

3.3.5 Gabor filters

Gabor filters are a class of methods that can be used to describe the texture of objects. This can be very useful in describing toys, since toys of the same kind are often made of the same material, like plush toys and wooden blocks. [29, 30]


Figure 3.4: Example of Gabor filters: a combination of a Fourier and a Gaussian component.
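A Gabor kernel is literally the product the caption describes: a sinusoid (the Fourier part) windowed by a Gaussian envelope. A small NumPy sketch with illustrative parameter values:

```python
import numpy as np

def gabor_kernel(size, sigma, theta, wavelength, gamma=0.5):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # rotate coordinates so the sinusoid runs along angle theta
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr)**2) / (2.0 * sigma**2))  # Gaussian
    carrier = np.cos(2.0 * np.pi * xr / wavelength)                   # Fourier
    return envelope * carrier

kernel = gabor_kernel(size=21, sigma=4.0, theta=0.0, wavelength=8.0)
# a texture descriptor filters the image with a bank of such kernels at
# several orientations and wavelengths and collects the response energies
```

Plush fabric and wood grain produce very different response patterns over such a filter bank, which is what makes Gabor features useful for distinguishing toy materials.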

3.4 3D based descriptors

The previous descriptors are all based on 2D visual input. With current developments, such as the Kinect and the software available in PCL [31], object recognition is no longer largely limited to 2D. For many descriptors the change from 2D to 3D requires only slight changes; the colour descriptor, for example, still describes the colour of the object. For other descriptors the dimensionality of the description changes: the 2D HOG descriptor uses 8 bins for its gradients, but in 3D this increases to 26 bins, which reduces processing speed. Entirely different descriptors can also be developed, like the Viewpoint Feature Histogram [32]. This method creates a histogram of surface normals relative to the viewpoint, and can thereby describe the shape of an object invariant to viewpoint.

Figure 3.5: 3D image of a bunny. [31]

Figure 3.6: Histogram of viewpoints. [31]
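The idea behind the Viewpoint Feature Histogram can be sketched as binning the angles between the surface normals and the viewing direction. This is a strong simplification of the actual VFH descriptor in PCL, written here with NumPy for illustration only.

```python
import numpy as np

def normal_viewpoint_histogram(normals, view_dir, n_bins=8):
    # angle (as cosine) between each surface normal and the viewing direction
    v = view_dir / np.linalg.norm(view_dir)
    n = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    cos_angle = n @ v
    hist, _ = np.histogram(cos_angle, bins=n_bins, range=(-1.0, 1.0))
    return hist / hist.sum()

# a flat patch facing the camera: every normal is aligned with the view
normals = np.tile([0.0, 0.0, 1.0], (50, 1))
h = normal_viewpoint_histogram(normals, np.array([0.0, 0.0, 1.0]))
```

A curved object spreads its mass over many bins, while a flat surface facing the camera concentrates it in one, so the histogram encodes shape relative to the viewpoint.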


3.5 Dimensionality reduction methods

In object description, the data describing the object can have a large dimensionality, where often not all dimensions are used in a specific case. The amount of data can also be reduced to only the most important dimensions to increase the recognition speed. For this a dimensionality reduction method can be applied, like PCA or Bag of Words. A few dimensionality reduction methods are introduced here.

3.5.1 Principal Component Analysis: PCA

Principal Component Analysis (PCA) [14] is a commonly used method to decrease the amount of data significantly, increasing the recognition speed while not significantly decreasing the accuracy. Furthermore, this method can increase the variance between description samples, which can make classification more accurate.

PCA achieves this by projecting the input vectors onto the principal components, or eigenspace; see Figure 3.7. Here the black dots are 3-dimensional input vectors, and PC1 and PC2 are the vectors spanning a 2-dimensional principal component space. In general this space can be spanned by any number of eigenvectors of the covariance matrix of the input samples. By using this covariance matrix, the variance between the samples is maximized. Description vectors are projected onto this eigenspace, and the projection is then used as the description vector.

This method was developed by Karl Pearson [14], was first widely applied in visual recognition for face recognition [33], and is also used to reduce the size of the description vector in SIFT and SURF. Unfortunately this method has to be initialized using the training samples, to create the eigenspace onto which input samples are projected. This can take a lot of time for large sample sizes and large numbers of samples.
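The projection described above fits in a few lines of NumPy: compute the covariance of the mean-centered samples, take the eigenvectors with the largest eigenvalues, and project. A plain sketch, without the efficiency tricks needed for high-dimensional image data.

```python
import numpy as np

def pca_fit(X, n_components):
    # eigenvectors of the sample covariance matrix, largest eigenvalues first
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1][:n_components]
    return mean, vecs[:, order]

def pca_project(X, mean, components):
    # coordinates of the samples in the eigenspace
    return (X - mean) @ components

# five 3-D description vectors that actually vary in only two directions
X = np.array([[1.0, 2.0, 0.0], [3.0, 1.0, 0.0], [0.0, 5.0, 0.0],
              [2.0, 2.0, 0.0], [4.0, 0.0, 0.0]])
mean, comps = pca_fit(X, n_components=2)
proj = pca_project(X, mean, comps)   # 5 x 2: the reduced description vectors
```

Because the third coordinate carries no variance, projecting onto the top two components loses no information here: the original vectors can be reconstructed exactly as `proj @ comps.T + mean`.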

3.5.2 Bag of Words: BoW

The Bag of Words (BoW) model is based on the linguistic bag-of-words model and is sometimes also called the Bag of Features method [34]. It is a combination of a descriptor and a classifier. It retrieves a dense sampling of SIFT or SURF features, which are collected for a scene or object and compared with a dictionary. This dictionary is a collection of features grouped together into a 'bag' that describes a class. For example, a wooden block has the features red, cube and wood, and another block the features blue, cube and wood; such collections of features are called bags of features. The same is done for other classes, and these bags of features are then learned by a classifier. Using a k-means method with a large k, the projection of a bag of features generates a description vector, which is then used for description and learned by the class classifier. This kind of descriptor can be optimized for real-time description and classification, for example by replacing the k-means classifier with the random trees method, which decreases the projection time [34]. This method appears to be very useful for application in robotics.
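Once the dictionary has been built (for example by k-means on training features), describing a new object amounts to assigning each local feature to its nearest visual word and counting. A minimal sketch with a hand-made two-word dictionary, assuming NumPy:

```python
import numpy as np

def bow_histogram(features, codebook):
    # squared Euclidean distance from every feature to every visual word
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d.argmin(axis=1)                 # nearest visual word per feature
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                 # normalized bag-of-words vector

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])    # two visual words
features = np.array([[0.2, 0.1], [9.5, 10.1],      # local features observed
                     [10.2, 9.8], [0.1, 0.3]])     # on one object
h = bow_histogram(features, codebook)              # -> [0.5, 0.5]
```

The resulting fixed-length histogram is what gets fed to the class classifier, regardless of how many local features the object produced.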


Figure 3.7: Example of a projection of a 3-dimensional sample space onto a 2-dimensional eigenspace.


Chapter 4

Classification and Learning of Objects

The previous chapters explained how to obtain the location of an object in the image and how to describe this object. The data describing the object can be used by a classifier to determine the class of the object. Commonly this is done by comparing the description data with training samples of the same or similar objects. Section 4.1 introduces different classification methods that can be applied in robotics and explains how they are used to obtain the class of an object. It further discusses how classifiers can be used to classify multiple objects and to determine the certainty of classification.

Detection and description methods combined with classification methods form a pipeline that can recognize objects. These pipelines have to be trained with one or more sample images of reference objects. The training samples are obtained by creating a database of labeled images for different types of objects. Here it is important to understand that the same object can have different labels, while at the same time different objects can have the same label. This, and how to obtain the label for an unknown object, is explained in Section 4.2.

4.1 Classification methods

In this section several classification methods are introduced that are commonly used in visual recognition and can be applied in robotics. There are two types of classifiers: single-class classifiers, which can only describe one type of object and are often used in detectors, and multi-class classifiers, which can classify different types of objects and are often used in the final stage of the recognition pipeline.

All classification methods face the same problem of matching description data to the training data to find the class. Figure 4.1 shows an example of a two-dimensional space with 5 different classes, indicated with different colours. The ellipses give an indication of the area of each class based on the training samples. Points 1 through 4 are description vectors that have to be classified; the ground truth of these points is indicated by their colour, except for point 4, which is the description vector of an unknown object.


Figure 4.1: Two-dimensional Euclidean space with training samples of 5 classes and 4 differentdescription vectors.

One of the problems classifiers face is the separation of classes, which can be difficult because classes sometimes overlap in the used description space, as for the red and yellow classes and the yellow and green classes. This is often due to a missing description dimension or to bad training samples.

Another problem is that the description of an object often does not cover all the possible variations a service robot can encounter, so a description sample may not be close to the class it belongs to. The classification method therefore has to include description samples that are close enough to a class, but at the same time exclude description samples that are too far away and classify those as an unknown object.

Figure 4.2: SVM mapping of training samples. [35]



Figure 4.3: SVM separation with different kernels [35]. a) linear kernel, b) quadratic kernel, c) 2nd-degree polynomial kernel, d) spline kernel.

4.1.1 SVM

A Support Vector Machine (SVM) [36, 37] is a classification method that maps the features of two classes in such a way that there exists a hyperplane separating both classes. This hyperplane lies midway between the boundaries of the two classes and can be of linear or non-linear form, see Figure 4.3. An example of how a separating hyperplane is found can be seen in Figure 4.1, indicated by the coloured ellipses; there, point 1 is correctly classified.

In some cases the training data is not easily, or not at all, separable by a hyperplane, because the training samples of the classes overlap, as for the red and yellow classes. This can sometimes be solved by mapping the feature space, as shown in Figure 4.2. This mapping transforms the description data (input space) into a feature space that is separable by one of the different kernels.

The SVM is a single-class classifier that is often used in object matching because it classifies with good accuracy and quite fast, even for a large set of training samples. Although it is a single-class classification method, multiple single-class classifiers can be combined to create a multi-class classifier. Since this method is fast and quite accurate, it is often used in robotics, in detection methods as well as in object classification.
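Training a linear SVM amounts to minimizing the hinge loss plus a margin-regularization term. A toy batch subgradient-descent sketch, assuming NumPy; real implementations such as LIBSVM solve the dual problem and support the kernels shown in Figure 4.3.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.001, lr=0.1, epochs=500):
    # minimize  lam*|w|^2/2 + mean(max(0, 1 - y*(x.w)))  by subgradient descent
    Xa = np.hstack([X, np.ones((len(X), 1))])   # fold the bias into the weights
    w = np.zeros(Xa.shape[1])
    for _ in range(epochs):
        margins = y * (Xa @ w)
        violators = margins < 1                  # samples inside the margin
        grad = lam * w - (y[violators, None] * Xa[violators]).sum(axis=0) / len(Xa)
        w -= lr * grad
    return w

def svm_predict(w, X):
    Xa = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xa @ w)

# two well-separated clusters with labels -1 / +1
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
w = train_linear_svm(X, y)
```

Only the samples on or inside the margin (the support vectors) contribute to the gradient, which is the geometric intuition behind the method's name.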

4.1.2 KNN

K-Nearest Neighbours (KNN) [38] is a classification method that uses the Euclidean distance in an n-dimensional Euclidean space to classify an n-dimensional description vector. Every training sample is represented as a point in this space. For an unknown description vector, the k nearest training samples are found based on the Euclidean distance, where k is a parameter that is often set to an odd value to avoid ties between classes.

In Figure 4.1 the previously described example of a two-dimensional space with 5 classes can be seen. For point 1 the ground truth is the blue class; this class is also obtained by KNN, as shown by the three blue lines to the three closest training samples. Point 2 is the description vector of an object belonging to the green class, but because the description of the green class is incomplete, this point falls outside the class boundary used by SVM. KNN can still find the class based on the closest training samples, as shown by the lines. This is one of the advantages of KNN: it can find a class even when the description vector differs considerably from the training samples.
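KNN needs no training phase at all; classification is a sort on distance plus a majority vote. A self-contained sketch in plain Python, with made-up example samples:

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (vector, label) pairs; returns the majority label
    among the k training samples nearest to query (Euclidean distance)."""
    by_distance = sorted(
        train,
        key=lambda sample: sum((a - b) ** 2 for a, b in zip(sample[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

samples = [((0.0, 0.0), "block"), ((0.0, 1.0), "block"), ((1.0, 0.0), "block"),
           ((5.0, 5.0), "car"), ((5.0, 6.0), "car"), ((6.0, 5.0), "car")]
```

Because the distances to the nearest samples are available directly, this structure extends naturally to the certainty estimation discussed in Section 4.1.6.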


Section 4.1.6 explains how the certainty of classification can be obtained. KNN is a suitable method for obtaining this certainty. Because of this, and its ability to handle partly described classes, it is often used in robotics for object recognition. The KNN method can handle multi-class problems quite well and is very fast. Although the classification speed decreases with increasing description dimensionality and number of training samples, this can be improved by reducing the dimensionality with a reduction method like PCA. To improve the classification speed for a large set of training samples, the K-means method can be used; this method is similar to KNN, but clusters the training samples.

4.1.3 Tree-like classification

Another kind of classification method is the random forest [39]. This method creates a tree structure of decisions based on feature inputs; each decision branch has no knowledge of decisions in other branches. The random forest method creates random decisions based on the descriptions in the training samples. By evaluating the classification performance of the generated decision branches, some branches are excluded and new ones are created. Through this process a tree of random decision branches is built, and in a random forest multiple such trees are created. The classification result of each tree is either weighted or used in voting to obtain the final classification of the description vector.

This method performs well in object matching as well as in object recognition. Training the classifier is slower than for example SVM or KNN, because of the generation and evaluation of the different trees. In combination with boosting it can provide a very robust and fast object detection method, for cases that do not require online learning.

4.1.4 Boosting classification

Classification generally compares a description sample with the training samples of an object. In boosting methods like AdaBoost [40], the classification method is trained in multiple iterations, where the classification results of the previous iteration are used to improve the classification performance. AdaBoost has been proposed for fast and robust detection of faces. There a negative training set was also used: a set of images that do not contain the object to be trained.

For example, the face detection method used about 8000 images containing faces and about 5000 background-only images to create a face detection classifier. Training this classifier took several hours, so boosting is not suited for online learning. At the same time, classification with a boosted classifier is fast and robust compared to unboosted classifiers, which makes boosting useful for the classifier in a detector.

4.1.5 Combining descriptors and classifiers

As discussed before, combining multiple single-class classifiers into one multi-class classifier is a possible way of creating multi-class classifiers. Here the question arises how the outputs of the individual classifiers can be compared with each other. Voting can be a solution, but then multiple classifiers have to be able to vote for the same class, which is not the case for SVMs. Another way of determining the class is weighting. This can also be used for a similar problem


when multiple descriptors are used. Take for example a colour descriptor whose output is a hue in 360 degrees, and a HOG descriptor that uses a scale between 0 and 1 for the values in its histograms. This example already illustrates a problem that arises when combining descriptors and classifiers: their outputs are not directly comparable.

This can be solved by normalization, which requires knowing the scale of the output. For descriptors this is often easily obtained, since it is known what the descriptor describes, like a colour or the values in a histogram. For classifiers this is another story: the output of, for example, KNN is the Euclidean distance between training samples. If the input space, and thus the output of the descriptors, is normalized, then the output of the KNN method is also normalized. For SVM, normalization of the description space is not a valid option due to the mapping of the input space, so another method has to be found. Here the outputs of each classifier for all training samples can be used to obtain an indication of that classifier's output scale; the output of each classifier is then scaled accordingly.

Combining descriptors or classifiers is of great importance for recognizing multiple object classes at the same time. Take for example the following toys: wooden blocks, best described with a texture and a HOG descriptor; plush toys, best described with a texture and a Fourier shape descriptor; and toy cars, best described with a colour and a HOG descriptor. To create a pipeline that can detect toys and classify the type of toy, there are two approaches.

One can normalize the descriptor outputs and combine them into one description vector, which is then trained to a classifier. Here one can already see that for each toy a part of the description vector is unused, like the texture part for the toy car, which can result in less than optimal performance.

The other option is to train a different classifier for each descriptor output. The output of each classifier is then used to vote for the class output of the pipeline. For example, the colour classifier and the HOG shape classifier both output the class car, while the two other classifiers have different outputs. If the outputs of these two are not the same class, the car class wins. If the two outputs are the same class, a method is required to weight the outputs of the classifiers. This can be done with, for example, the distance in KNN, or in SVM with the distance to the separating hyperplane of each single-class SVM. In the latter case there is a weighted vote for each class from every classifier. This combination can provide a robust and fast recognition pipeline for multiple objects, which is very useful in robotics.
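The weighted voting described above reduces to summing, per class, the normalized weights that the individual classifiers attach to their outputs. A plain-Python sketch with made-up classifier outputs:

```python
def weighted_vote(outputs):
    # outputs: (class_label, weight) pairs, one per classifier in the pipeline;
    # weights are assumed to be already normalized to a common scale
    totals = {}
    for label, weight in outputs:
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)

# colour and HOG classifiers vote "toy car"; texture and shape disagree
outputs = [("toy car", 0.9),      # colour classifier
           ("toy car", 0.6),      # HOG shape classifier
           ("plush toy", 0.7),    # texture classifier
           ("plush toy", 0.5)]    # Fourier shape classifier
winner = weighted_vote(outputs)   # -> "toy car" (1.5 vs 1.2)
```

This is why the normalization step matters: if one classifier's weights live on a much larger scale than the others', it dominates the vote regardless of how informative it is.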

4.1.6 Certainty of classification

The certainty of a classification is important for decision making and for the detection of unknown objects. The latter is discussed further in Section 4.2.3, where the determination and learning of unknown objects is treated. Here we focus on how the certainty of classification can be determined.

For KNN classifiers it is quite simple to derive a certainty of classification from the distance to other points; it can also be based on the distance to the mean of the closest classes. For classifiers like SVM it is harder to find a certainty of classification: it can be based on the distance to other points in the class, or on the distance to the


boundary. For tree-like classifiers, methods exist that derive certainty from the number of votes. In all these methods, the distance or the number of votes still does not describe a true certainty value; the relation between the obtained value and the certainty has to be learned.

This concerns classification of a single frame, but in robotics multiple frames are available to determine the class of an object in view. This allows for a method where the certainties and classifications over multiple frames are used to determine the class and the overall certainty. The performance of multi-class pipelines is often unacceptably low in many current methods, with matching rates of 35-80% on the Caltech 256 database [41]. Using multiple frames allows for a drastic increase in performance, making these methods applicable to real-world robotic problems.
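One simple way to turn a raw classification distance into a certainty value, in the spirit of learning the relation from the class's own data, is to scale the distance by the typical intra-class distance and map it through a decaying function. This is an illustrative heuristic, not the method of any cited paper.

```python
import math

def certainty_from_distance(distance, intra_class_distances):
    # learn the distance scale from the class's own training data, then map
    # distance 0 -> certainty 1, large distance -> certainty near 0
    scale = sum(intra_class_distances) / len(intra_class_distances)
    return math.exp(-distance / scale)

# distances between training samples of the matched class
intra = [0.8, 1.1, 0.9, 1.2]
on_sample = certainty_from_distance(0.0, intra)   # 1.0: right on a sample
far_away = certainty_from_distance(5.0, intra)    # ~0.007: likely unknown
```

Thresholding such a certainty is one way to reject a query as an unknown object instead of forcing it into the nearest class.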

4.1.7 Categories in Classification

Objects are classified into a certain class by a classifier, but often multiple types of objects fall into the same class. For example, the objects mug, glass, etc. belong to the class cup, as can be seen in Figure 4.4. A mug and a tea glass both have a handle and a similar shape, but the same objects can also have dissimilar features, like their material. Therefore these two objects can be grouped together in the class cup, or differentiated into separate classes, like mug and tea glass. Thus different classifications can be made based on how the objects are categorized.

In classification the subject of categorization is of great importance: depending on how the objects are categorized when training a classifier, the classifier provides different labels for the classified objects. An example makes this clearer. Consider a plush toy, a teddy bear, which is classified by different classifiers. A classifier trained with objects categorized as toy and non-toy will provide the label toy for the plush toy. A classifier that can distinguish between different types of toys, like wooden block, toy car or plush toy, will give the label plush toy. This can be taken even further, because a classifier can also be trained on the different plush toys, like a teddy bear, a plush bunny and a Snoopy plush toy; this classifier will label the toy as a teddy bear.

These examples make clear that for a single object many different labels can exist, based on categorization by appearance, as for the mug, or by type, as for the teddy bear. Categorization can also be based on the function of the object; this is an important aspect of affordances and will be discussed in more detail in a later chapter.

Figure 4.4: Example of two cups that are not the same subtype of object.
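The dependence of the returned label on the training-time categorization can be sketched as a simple lookup table; the granularity levels and all class names below are illustrative, not from the thesis:

```python
# Hypothetical label tables for classifiers trained at different
# category granularities (all names are illustrative).
LABEL_HIERARCHIES = {
    "coarse":   {"teddy bear": "toy", "wooden block": "toy", "mug": "non-toy"},
    "type":     {"teddy bear": "plush toy", "wooden block": "wooden block"},
    "instance": {"teddy bear": "teddy bear"},
}

def label_for(obj, granularity):
    """Label a classifier trained at this granularity would return."""
    return LABEL_HIERARCHIES[granularity].get(obj, "unknown")

print(label_for("teddy bear", "coarse"))    # -> toy
print(label_for("teddy bear", "type"))      # -> plush toy
print(label_for("teddy bear", "instance"))  # -> teddy bear
```

The same physical object thus receives three different labels depending purely on how the training data was categorized.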

F. Gaisser Master of Science Thesis


4.2 Learning

As described before, a classifier compares the description data with its training samples. Therefore a classifier has to be trained using data that describes the objects and classes the classifier has to classify. This can be done by supplying the pipeline of detector, descriptor and classifier with a set of training images and class labels. This section describes the different types of training a classifier.

The learning of unknown objects is required in robotics and comes with some challenges that will be discussed here. The service robot is required to detect and learn new objects, and it would be very useful if the robot could do this without human interference. The methods that could facilitate these aspects are also discussed in this section.

4.2.1 Offline training

Learning objects can be done offline, before the robot has to perform its task [42]. In this way the robot is able to recognize objects from the start. If the learned object database is complete, describing all possible objects in the world, no new objects have to be learned while the robot is performing its tasks, though this is probably impossible to achieve.

Offline learning is often used in papers to demonstrate and test a developed description or classification method. This is done by generating a set of images that contain the objects; multiple random subsets are created for training the classifier and the rest of the set is used for testing the learned classifier. Multiple learning runs are used to ensure a consistent classification performance.

The main advantage of offline learning is that there is no time restriction on the learning process, compared to online learning. Testing can also be done on an existing database of images, so a level of performance can be guaranteed, based on the performance obtained in the testing phase. There are also disadvantages to offline learning: it is not possible to learn an unknown object directly when it is encountered. This results in a rigid robot which is incapable of adapting itself to an unknown situation. Still, this method can be used on robots for the detection of objects, where online learning is usually not required. Further, in object recognition a robust basis in the recognition of known objects can be guaranteed. Unknown objects can be related to similar objects and maybe even their class can be determined using similarity with known objects.
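The offline evaluation procedure described above (multiple random train/test splits, averaged over several runs) can be sketched as follows; the 1-nearest-neighbour classifier and the split parameters are illustrative assumptions:

```python
import numpy as np

def evaluate_offline(samples, labels, n_runs=5, train_frac=0.7, seed=0):
    """Average accuracy of a 1-NN classifier over several random
    train/test splits, as in typical offline evaluations (sketch)."""
    rng = np.random.default_rng(seed)
    n = len(samples)
    accuracies = []
    for _ in range(n_runs):
        idx = rng.permutation(n)
        cut = int(train_frac * n)
        train, test = idx[:cut], idx[cut:]
        # classify each test sample by its nearest training sample
        correct = 0
        for i in test:
            dist = np.linalg.norm(samples[train] - samples[i], axis=1)
            correct += labels[train][np.argmin(dist)] == labels[i]
        accuracies.append(correct / len(test))
    return float(np.mean(accuracies))
```

Averaging over several random splits is what makes the reported performance consistent rather than dependent on one lucky split.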

4.2.2 Online training

Online training is the learning of objects while the robot is active. This means that when the robot encounters an unknown object it can learn that object. This is required for the service robot to be able to adapt to its surroundings and learn previously unknown objects so they become known. But gathering description data of a previously unknown object and adding it to the database of known objects comes with challenges.

Relearning of already known objects can also be useful. The robot can encounter a known object and classify it for multiple frames as a certain class. But for a single or a few frames, where the variances of the perceived object are beyond the learned data, the robot could store this data and use it to expand the variance of the description of that class.


For a newly learned object it can be difficult to describe that object accurately for different variances, since the current encounter often describes only one possible variance of environmental conditions. Extending the boundaries of the class, or making them somewhat fuzzy, could allow for classifying the object correctly when viewing it under a different environmental variance at a next encounter. A way to get different viewpoint variances could be to move around the object or pick it up [43, 44].

4.2.3 Classification and learning of unknown objects

As described in the previous section, the ability to learn unknown objects is of great importance for a service robot. But how can it be determined whether an object is unknown? Here the certainty of classification is often used to determine whether an object is known or unknown.

To classify an object as unknown, the classification certainty is often used in combination with a threshold. This threshold has to be determined first, which can be done using the available training data: a single class is excluded from the training data and used to find the classification certainty for unseen objects. If this is done for all classes, a threshold can be obtained by averaging the thresholds of all single classes.

An aspect to keep in mind is that some methods, like PCA, transform the description space in such a way that there is less or no description space left for unknown objects. Further, the type of pipeline that is used can have a big influence on the classification of unknown objects. For example, consider a pipeline for classifying toys using a colour descriptor, trained with a yellow bath duck and with red and blue toy cars. If this pipeline has to classify a yellow toy car, it will classify it as a bath duck. Therefore it is important for unknown classification to keep the different variations of objects in mind when choosing the right methods for the pipeline.

The certainty of classification is generally obtained based on how well the classified object matches the trained data. There can be two cases: the object is known but the variances of the object are not known, and the case where the object is unknown. The difference between these is unknown and can often not be determined. Moreover, sometimes the variances within a class can be larger than the variances between classes, as described in [?] for the classification of faces. Therefore it is necessary that the class obtained for the unknown object is checked against existing classes, so that the description of the previously unknown variance can be added to the existing class.
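The leave-one-class-out threshold estimation described above can be sketched as follows; the use of the nearest-neighbour distance as the certainty measure is an illustrative assumption:

```python
import numpy as np

def unknown_threshold(descriptors, labels):
    """Estimate a distance threshold for 'unknown' classification by
    holding out one class at a time and measuring how far its samples
    lie from the remaining (known) training data (sketch)."""
    thresholds = []
    for held_out in np.unique(labels):
        known = descriptors[labels != held_out]
        unknown = descriptors[labels == held_out]
        # distance of each held-out sample to its nearest known sample
        dists = np.array([np.linalg.norm(known - u, axis=1).min()
                          for u in unknown])
        thresholds.append(dists.mean())
    # average over all held-out classes, as described in the text
    return float(np.mean(thresholds))
```

At run time, a test sample whose nearest-neighbour distance exceeds this threshold would be treated as unknown.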

4.2.4 Class retrieval for unknown objects

When encountering an unknown object there is no information other than the perceived data. Retrieving the class of an unknown object can be done through human interaction, but it would be practical for a service robot to learn objects without human interference. This can be done by looking for similar already learned objects [45]. For example, consider the case where the service robot knows cube shaped wooden blocks, using a shape and colour descriptor. The robot now encounters a cylindrical shaped wooden block and, based on its shape descriptor, classifies it as an unknown object. It could derive its class by looking at objects with similar colours. Using this information it can determine that this object is probably a wooden block of unknown shape. Note that this is actually a mixture of categories.


There are also other methods that can be used to determine the categories of an unknown object. The methods of context and observation of human interaction are discussed briefly here.

Context

The context [46, 47] can be used in the determination of the class of an unknown object. In the previous example a cylindrical shaped wooden block was found to be unknown. When children are playing, they often do not discriminate between the different parts of a group of toys, so it is very likely that other objects of the same kind will be lying close to the unknown cylindrical wooden block. The robot can look around, find a cube shaped wooden block next to the cylindrical shaped one, and deduce from that that the unknown object is probably also a wooden block and has the same categories as the cube shaped wooden block.

Other contextual indicators can be used as well, for example the location [46]. If an object is lying in the bedroom, it is probably a toy, but if it is in the kitchen, the likelihood of the unknown object being a toy is lower. A categorization of toys can also be based on location. Some toys belong to a certain child and should be tidied away in that child's room. Toys that are used in the bathtub or outside are often found in those vicinities and are probably tidied away close by.
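Combining a classifier score with such a location prior could be sketched as below; the prior values and the naive product rule are purely illustrative assumptions, not part of the cited methods:

```python
# Illustrative location priors: likelihood that an object found in a
# given room is a toy (numbers are made up for this sketch).
TOY_PRIOR = {"bedroom": 0.8, "kitchen": 0.2}

def toy_likelihood(classifier_score, room):
    """Naively combine a classifier score with the location prior;
    rooms without a stored prior fall back to a neutral 0.5."""
    return classifier_score * TOY_PRIOR.get(room, 0.5)
```

The same ambiguous detection thus becomes more or less plausible as a toy depending on where it was found.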

Observation of interaction

Another way of determining categorization is by observation of the actions performed with objects [48, 49, 50]. This can be done by determining the interaction and categorizing objects by the kind of interaction. For a toy the interaction is playing by a child; thus whether an object is playable (a toy) can be determined by observation. The same can be done for tidying: if the child tidies away a toy in a box, the robot can learn from this that the toy is tidyable. Here it is also interesting to make the relation between the toy and the box or location where the toy is tidied away. This concept will be discussed in more detail in chapter ?? about affordances.

Further, it is possible to use this as feedback: if the robot tidies away a toy car in a box where it should not be put, the child will remove the car from the box and put it where it belongs. This action should be observed and learned by the robot for future reference. It is important to keep in mind that this can apply to just that toy car or to all toy cars. Since there is already some human-robot interaction, it is possible for the robot to ask for that feedback.

4.3 Classification applied on robots

Offline trained classifiers can provide a good basis on a service robot, while online training is used for learning unknown objects. Retrieving the class of unknown objects poses a challenge in robot vision. Using methods of similarity and context in conjunction with human confirmation could provide a good basis for unknown class retrieval. With each encounter of known objects the classifier can determine whether the detected object can improve the current description of the class. In this way the classifier can become more robust to all variances.


Chapter 5

Recognition Framework

In the literature survey (Chapter 4) it was stated that robust recognition pipelines are required for recognizing multiple objects. Further, unknown classification of objects is required in order to learn them, and therefore online learning of classifiers is also recommended. As a conclusion of the literature survey it was suggested to combine these requirements in a recognition framework that would provide the basis for robust recognition pipelines. In this chapter an overview is given of this framework. Further, it is explained how the different requirements found in the literature survey and during the creation of the framework are implemented in the framework.

5.1 Architecture

Any recognition pipeline consists of certain stages in the process of obtaining the class of an object; these stages are illustrated in Figure 5.1. The first stage is the localization of objects in the input image. This is done by the detection module, which searches the input image for possible objects. Once the object is detected, a tracker is utilized to track the object position in subsequent frames, which enhances the localization speed. The viewpoints of the objects, as well as their locations in the input images, are stored in the short term memory, so that they are available to the tracker as well as to the next module.

In the next stage the localized object is described by one or multiple descriptors to robustly capture its appearance. These descriptors provide a description vector that is used in the classifier module to obtain the object class label. This is done by comparing the description vector with the description vectors of learned objects from the long term memory. The last module allows for learning of objects classified as unknown, or relearning of already learned objects. In this process information on the classified object is retrieved from the short term memory and stored into the long term memory.
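The four stages and the two memories could be wired together as in the following sketch; the module interfaces (`detect`, `update`, `describe`, `classify`) are hypothetical placeholders, not the thesis implementation:

```python
# Hypothetical sketch of the pipeline stages described above.
class RecognitionPipeline:
    def __init__(self, detector, tracker, descriptor, classifier):
        self.detector, self.tracker = detector, tracker
        self.descriptor, self.classifier = descriptor, classifier
        self.short_term_memory = []   # viewpoints/locations of current object
        self.long_term_memory = {}    # class label -> stored viewpoints

    def process_frame(self, image):
        # track from the last known location if available,
        # otherwise run full detection on the whole image
        region = (self.tracker.update(image, self.short_term_memory[-1])
                  if self.short_term_memory else self.detector.detect(image))
        if region is None:
            return None
        self.short_term_memory.append(region)
        vector = self.descriptor.describe(image, region)
        return self.classifier.classify(vector, self.long_term_memory)

    def learn(self, label):
        # move the collected viewpoints into long term memory
        self.long_term_memory.setdefault(label, []).extend(self.short_term_memory)
        self.short_term_memory = []
```

The classifier sees only the long term memory, while the tracker works from the short term memory, mirroring the data flow in Figure 5.1.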


Figure 5.1: Overview of the recognition pipeline in the recognition framework

5.1.1 Detector

The detector localizes the object in the input image. Since there are different kinds of objects,various detection methods can be used. These include:

• Boosted classifiers using Haar-like features for face detection

• Keypoint detectors (SIFT, SURF, GLOH etc.)

• Clustering based methods

which are available in the literature as explained in Section 2.2. Since the main focus of this thesis is on face recognition, the localization method specific to face detection [17, 18] is utilized. This method was chosen for its robustness and real-time performance.

5.1.2 Tracker

Once the object is initially detected, its position in subsequent frames is continuously tracked. This significantly increases the localization speed and accuracy by constraining the search to a specific region around the previous object position. One possible state of the art method is TLD [44], which is utilized in this framework.


5.1.3 Descriptor

In order for an object to be efficiently recognized, it has to be described in a robust manner, to obtain the most representative features that describe the object despite environmental variances. Since objects in real world settings can have various representative features, many different description methods are required to describe these objects. For face recognition the Principal Component Analysis (PCA) and Class Average Principal Component Analysis (CAPCA) methods have been implemented, but other methods like Fisherfaces [51] and Laplacianfaces [52] could be used as well. Further, for the recognition of objects other state of the art methods described in Chapter 3 could be utilized in order to make this framework applicable in a more general setting. The descriptor constantly interacts with the classifier in order to recognize the object as a certain class or as unknown.

5.1.4 Memory

The long term memory comprises the multiple viewpoints of the objects belonging to different classes. This serves as the training set for the classifier. The short term memory consists of appearance and spatial information of the current object to be recognized.

5.1.5 Classifier

The description vector of the object can be used by a classifier to determine its class. Commonly this is done by comparing the description data with training samples stored in the long term memory. Methods like Support Vector Machines (SVM) [36, 37], K-Nearest Neighbours (KNN) [38] and Random Forests [39] have been used as classifiers.

In order to learn a novel object, it first has to be classified as unknown before the learning process can be initiated. The Certainty K-Nearest Neighbours (CertKNN) method is introduced in Section 7.2 for this unknown classification.

5.1.6 Learning

In order to learn a novel object, multiple viewpoints of this object along with its class label have to be stored in the long term memory. This class label can be obtained through interaction with humans.

Given the large variances of the same object encountered by the robot, it is possible that the classification of known objects is not possible with the required certainty. In this case the appearance model of this object has to be extended. This is called relearning, and it also involves storing additional object viewpoints into the existing class in the long term memory.

5.1.7 Dynamical loading

The modules explained above together constitute the recognition pipeline illustrated in Figure 5.1. Depending on the context, different methods can be used within each of these modules. This is described in the following example, where a service robot receives a fetch and deliver task. Consider that the robot is asked to bring a cola and some chips to a person. One possible solution is to create a single recognition pipeline that recognizes all objects known to the robot. This is memory exhaustive and leads to slower and less precise recognition performance. Another solution would be to dynamically create context dependent subsets of the known objects. Each of the subsets requires specific methods for each of the modules. This requires a flexible framework where these methods can be dynamically loaded. In the example considered, when the robot looks for the cola only cans have to be detected and colour and texture have to be described. However, when the chips are being searched for, both cans and bags have to be detected. This emphasises the need for dynamic loading of methods.

In this thesis a framework which facilitates this concept has been developed. A fixed pipeline using this framework has been implemented and evaluated for the specific case of face recognition. Context dependent loading of the methods within the different modules is an interesting topic for future work.
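Context-dependent dynamic loading could be sketched with Python's `importlib`; the module names and the context table are illustrative assumptions, not the thesis implementation:

```python
import importlib

# Hypothetical mapping from task context to the method modules needed;
# module names are illustrative (each method would live in its own
# Python module inside, e.g., a 'detectors' package).
CONTEXT_METHODS = {
    "cola":  {"detector": "detectors.can",
              "descriptor": "descriptors.colour_texture"},
    "chips": {"detector": "detectors.can_and_bag",
              "descriptor": "descriptors.colour_texture"},
}

def build_pipeline(target, loader=importlib.import_module):
    """Load only the modules needed for the current task context."""
    methods = CONTEXT_METHODS[target]
    return {stage: loader(name) for stage, name in methods.items()}
```

Only the methods relevant to the current subset of objects are resident in memory; switching context simply rebuilds the pipeline with a different row of the table.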


Chapter 6

Aspects of Face recognition for robotics

In recent years more and more research is being done in robotics, and with this the application of robotics is growing. Face recognition can improve human-robot interaction substantially. For example, if a person asks the robot to get a drink, it is of great importance that the robot can recognize or learn who ordered the drink and recognize to whom it has to bring the drink. Most of the research in face recognition is based on pattern recognition and not applied in robotics. Applying the state of the art methods presented in this chapter to robotics introduces several challenges compared to their application in pattern recognition.

The first difference that arises is the aspect of time. In pattern recognition the goal is to obtain, with high precision, as much information as possible from a single image, which comes at the cost of processing time. However, in robotics, in order for the robot to react efficiently, all information needs to be processed in nearly real-time; the optimal processing speed is 15-30 frames per second. Therefore the amount of information obtained from a single image needs to be reduced. On the other hand, having a robot that interacts with its environment allows for processing of multiple images of the same object. Processing successive frames offers the advantage of obtaining information on multiple variances of the same object, which significantly improves classification performance.

Another difference is the need for robustness to variant data. In robotics the variations are quite large, due to the changeable environmental conditions and viewpoint variances the robot can encounter. Robustness can be obtained by an invariant method ?? or by using a good description of all variances as training data. The latter is difficult to obtain due to the vast amount of possible combinations and the unknown variances. At the same time, the description of a person is often obtained from a single or a few different encounters and therefore contains only a few variances.

The final difference is the need for high precision in recognizing known faces, as well as in the classification of unknown faces, which is required for online learning. The process of classifying and learning unknown faces is absent in regular pattern recognition, but in robotics this process is vital to adapt to the new users the robot will encounter. This has to be done while the robot is in its active phase and within a limited time frame.

An overview of the available methods is given in this chapter. In the following chapter three methods are proposed to meet the given requirements.

6.1 Related work

In this section a few face recognition methods available in pattern recognition are discussed with respect to their application in human-robot interaction. The methods are evaluated based on robustness and processing speed. The outline follows the different stages that can be found in a recognition pipeline.

6.1.1 Detection and Tracking

The most widely used method for face detection is the one developed by Viola and Jones [17, 18], which can quite robustly detect faces in real-time. It uses a boosted classification of Haar-like features for the detection of faces. This method can detect faces rotated up to 15 degrees left and right, but with almost no sideways rotation. Other methods are available, but they are less commonly used due to lower detection rates or slower speeds.

In robotics a combination of detection and tracking can be used for optimized speed. For example, the TLD method [44] uses a combination of tracking, detection and learning. This provides a robust method of tracking faces, because the changes in appearance due to pose variations and movement of the face are relearned by the tracker.

6.1.2 Descriptors

The methods used in face description can generally be divided in two categories.

1. Shape based

2. Appearance based

The methods of the first category describe the face by using differences in shape, like the distance between the eyes and the size of the face. The methods of the second category use the appearance of the face in the image to describe the face. Though shape based methods have a higher precision, they require more processing time. Obtaining the shape of a face from an image is also difficult. Therefore only appearance based methods will be considered.

EigenFaces

In [33] Principal Component Analysis (PCA) is applied to the recognition of faces. This method was called EigenFaces after the eigenvectors that are used to describe the faces. It uses the image data of a face as a single vector and applies the PCA method to find the eigenspace of the training samples, describing the differences between the training samples. As reported in [33] this method performs quite well and achieves almost real-time performance. Unfortunately, many training samples for each class, in as many variances as possible, are required, as the method is not robust to pose and illumination variances. In fact, [53] states that the variations due to illumination are greater than the variations between the faces of different people.
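A minimal EigenFaces sketch in NumPy, using the standard trick of eigendecomposing the small n×n matrix instead of the full covariance (assuming n, the number of training images, is smaller than the image dimension d):

```python
import numpy as np

def eigenfaces(faces, k):
    """Minimal PCA/EigenFaces sketch: `faces` is an (n, d) matrix of
    flattened face images, one row per image; returns the mean face
    and the top-k eigenfaces as columns of a (d, k) matrix."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # eigendecompose the small (n x n) matrix instead of the huge
    # (d x d) covariance, then map eigenvectors back to image space
    vals, vecs = np.linalg.eigh(centered @ centered.T)
    order = np.argsort(vals)[::-1][:k]
    eig = centered.T @ vecs[:, order]
    eig /= np.linalg.norm(eig, axis=0)   # unit-length eigenfaces
    return mean, eig

def describe(face, mean, eig):
    """Description vector: projection of a face onto the eigenfaces."""
    return (face - mean) @ eig
```

Classification then amounts to comparing these low-dimensional description vectors, e.g. with a nearest-neighbour rule.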

FisherFaces

PCA tries to maximize the differences between the training samples, but since multiple training samples are used for a single class, the differences within a class are also increased. This is generally not desired and is corrected by the method introduced in FisherFaces [51], based on Fisher's method [54]. This method defines the inter-class scatter matrix Sb as (6.1) and the intra-class scatter matrix Sw as (6.2), where μ is the average of all training samples, μj is the average of all training samples of class j, c is the number of classes and Nj is the number of training samples in class j.

    S_b = \sum_{j=1}^{c} (\mu_j - \mu)(\mu_j - \mu)^T                          (6.1)

    S_w = \sum_{j=1}^{c} \sum_{i=1}^{N_j} (x_i^j - \mu_j)(x_i^j - \mu_j)^T     (6.2)

Further, this method tries to find a linear projection space that minimizes the intra-class scatter relative to the inter-class scatter, by maximizing (6.3):

    \frac{\det S_b}{\det S_w}                                                  (6.3)

Fisherfaces creates a single dimension in the description space for each class from all training samples of that class. This also has the advantage of limited dimensionality and an increase in recognition speed. This method has been proven to outperform PCA in certain cases.
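The two scatter matrices can be computed directly from equations (6.1) and (6.2); note that, as written there, Sb is not weighted by the class sizes Nj (some formulations add that weight):

```python
import numpy as np

def scatter_matrices(X, y):
    """Inter-class (Sb) and intra-class (Sw) scatter matrices as in
    equations (6.1) and (6.2); X is (n, d), y holds the class labels."""
    mu = X.mean(axis=0)                       # average of all samples
    Sb = np.zeros((X.shape[1], X.shape[1]))
    Sw = np.zeros_like(Sb)
    for j in np.unique(y):
        Xj = X[y == j]
        mu_j = Xj.mean(axis=0)                # class average
        d = (mu_j - mu)[:, None]
        Sb += d @ d.T                         # (mu_j - mu)(mu_j - mu)^T
        Sw += (Xj - mu_j).T @ (Xj - mu_j)     # sum over samples of class j
    return Sb, Sw
```

The Fisher projection is then the direction (or subspace) maximizing the ratio in (6.3), e.g. obtained from the generalized eigenproblem of Sb and Sw.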

LaplacianFaces

The two previously presented methods assume linear variances of the face appearances. The proposed method of Laplacianfaces [52] handles this differently. Laplacianfaces are the optimal linear approximations to the eigenfunctions of the Laplace-Beltrami operator on the face manifold. This method is similar to the PCA method, but uses a transformation of the input vectors before creating the eigenspace. This transformation describes variances in pose and shape in a linear fashion, which improves the clustering of classes.

6.1.3 Classifiers

For face recognition there are often many different classes and many training samples, but at the same time a fast method of learning is required. Therefore methods like K-Nearest Neighbours (KNN) and Support Vector Machines (SVM) are commonly used; these methods have been explained in detail in Section 4.1.2 and Section 4.1.1. Since the pose and illumination variances are considered to be larger than the inter-class variances, the separability of classes is a challenge. Since SVM tries to find a separating hyperplane, this can become a problem for inseparable classes. In KNN no separating hyperplane has to be found, and therefore this method is proposed for variant classes like faces.
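A plain KNN vote, which needs no separating hyperplane and therefore tolerates overlapping classes, can be sketched as:

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=3):
    """Plain k-nearest-neighbour majority vote (sketch): no separating
    hyperplane is fitted, which suits classes that are not linearly
    separable, such as faces under large pose/illumination variances."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(train_y[nearest]).most_common(1)[0][0]
```

Training is just storing the samples, which is also why KNN fits the fast, incremental learning setting required on a robot.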

6.2 Face recognition applied on a robot

This master thesis proposes a robust face recognition method that also allows for the online learning of faces required for human-robot interaction. Due to the limited robustness of the state of the art face recognition methods and their inability to learn unknown faces, the development of novel methods is required. In the next chapter these methods are proposed. In Section 7.1 the Class Average Principal Component Analysis (CAPCA) method is introduced, which improves robustness to variant data. Further, a method is proposed that can classify unknown faces based on a certainty measure derived from inter- and intra-class variations. Finally these methods are extended to the case of multiple frames.


Chapter 7

Introduced methods for robust description and learning of faces

For a pipeline that can robustly describe and recognize faces while learning unknown faces online, new methods had to be developed. These methods are described in the next three sections. In Section 7.1 a description method is introduced that can robustly describe faces over different variances. Classifying faces with certainty, allowing for unknown classification, is achieved with the method described in Section 7.2. To take advantage of the multiple frames available and to increase the robustness of classification, a last method is introduced in Section 7.3.

7.1 Class Average PCA (CAPCA)

Section 3.5.1 describes how Principal Component Analysis (PCA) maximizes the variances between all training samples. This happens both for samples of different classes and for samples of the same class. To extract only the most representative information of the object class, the variances of the samples within that class need to be minimized. The optimal solution would be to find a single description that accurately describes the entire class and all of its variances. To solve this problem, the method of Class Average Principal Component Analysis (CAPCA) is proposed in this section.

7.1.1 Method description

For this method two assumptions are made in describing a class and its variations. The first assumption is that for every class there exists a central point that describes the class with zero variation. Secondly, it is assumed that all variations within a class are symmetric linear transformations. By calculating the average of the two outliers for every variation, the central point, or class average, of the class can be found. This class average describes the point with minimum overall distance to all variations that exist in that class. Therefore the intra-class variations are minimized by using the class average, without a transformation of the description space. By applying PCA to the class averages, the inter-class variations are maximized.

Figure 7.1: Finding the class average with different outliers (panels a, b and c).

Next, an example is given of how the class average can be found with various outliers in different training sets. In Figure 7.1a a class with two variances is shown; the black dots are a set of different training samples within that class. We can assume that the outliers of a class can be obtained by taking the training samples with the minimum and maximum values of the variances. The CAPCA method will use the training samples in red as the outliers in variance x and the blue training samples as the outliers in variance y, to find the average of that class, indicated by the black circle. The shaded rectangle describes the possible variances within that class.

In Figure 7.1b a different set of training samples is shown for the same two variances within that class. As can be seen, there now exist two outliers, in purple, that describe both variance x and variance y. These purple outliers have a larger distance to the average than the red and blue samples used in the first set and will therefore be used in finding the average. Since these two outliers describe both the red and blue variances, only the two purple training samples will be used in finding the average, and they describe the same possible variances indicated by the shaded rectangle. The previous examples all had two outliers on opposing sides of the average; in Figure 7.1c a selection of training samples is shown where this is not the case. Here three outliers, in orange, purple and green, are used to obtain the class average.
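The idea of using the per-variance outliers instead of the plain sample mean can be sketched as follows; taking the per-dimension minimum and maximum as the outliers is an illustrative simplification, not the thesis's final algorithm:

```python
import numpy as np

def class_average(samples):
    """Sketch of the idea in Figure 7.1: approximate the class average
    from the per-dimension outliers (min and max samples) rather than
    the plain sample mean, so that unevenly sampled variances do not
    bias the result."""
    lo = samples.min(axis=0)   # outlier on one side of each variance
    hi = samples.max(axis=0)   # outlier on the other side
    return (lo + hi) / 2.0
```

With nine samples at the origin and one at (10, 10), the plain mean is pulled to (1, 1) by the over-represented variance, while the outlier-based average stays at the centre (5, 5).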

7.1.2 Obtaining the class average

Taking the previous examples in mind, the question arises how to obtain the class average in practice, because in general the available description of a class is only a subset of the possible variations within that class. This makes it quite hard to find a description of all outliers for every variation within a class. Therefore a method is devised that finds the available outliers in the class description and uses these in an optimal way to obtain the class average.

At first one would think that using the average of all available training samples in a class would be a valid description of the class average. But this assumes an even distribution

F. Gaisser Master of Science Thesis


7.1 Class Average PCA (CAPCA)

of the variations within the set of training samples. Especially in robotics nothing could be further from the truth, because the obtained training samples are often from a single or a few encounters with the subject of the class. These encounters often describe only a few variances of all possible variances within that class. Therefore it is most probable that there are many training samples containing only a few variances and just a few or no training samples for all other variances. Since the variances within a class are not uniformly distributed, taking the average of all training samples will not result in the class average. Doing so, the influence of variances with only a few training samples will diminish.

Method

Since taking the average of all training samples does not provide a valid class average, a method has to be devised to find the outliers of the variances in the set of training samples. We first briefly present all steps of the method and then explain them in more detail.

1. Calculate class average of all training samples of a class

2. Create Eigenspace using the class averages

3. Train the projections of the class averages to KNN

4. Project all training samples and obtain intra-class distances using KNN

5. Find outliers and reject previous outlier

6. Calculate new class average of the found outliers of a class

7. Repeat from step 2 until all training samples have been rejected as outliers

Step 1
First the average of all training samples for each class is computed in the standard way. These averages now represent the mean image (appearance) of each person. It may come as a surprise that at first an average of all training samples is taken, where it was just explained that this would not give a good description of the class average because the less described variances disappear. Here this phenomenon is used to advantage, since the distances of the training samples of the disappeared variances are increased. Therefore the probability that these training samples will be chosen as outliers is also increased.

Step 2
Using PCA, a projection space of the obtained class averages is created. This projection space reduces the dimensionality of all training samples and maximizes the variances between the class averages.

Steps 3 & 4
The projections of the class averages are trained to a K-Nearest Neighbours (KNN) classifier. All training samples of each class are projected and the intra-class distances to their class average are obtained using the KNN classifier with a k of 3.


A k of three is used to be able to find the intra-class distance in the case of overlapping classes.

Step 5
An outlier is found based on the maximum intra-class distance of a training sample. This outlier is stored in the set of outliers and used in the following iterations. In the first iteration the set contains only one outlier; in the second and all successive iterations another outlier is added to the set. The outlier added in the previous iteration is rejected in the case it does not minimize the intra-class variations, which is the case if the outlier distance of the current iteration is larger than the outlier distance of the previous iteration.

Steps 6 & 7
A new class average is calculated using the set of outliers of each class. The whole process is repeated from step 2 to 6, until all training samples of all classes have been rejected as outliers.
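The iterative outlier selection in steps 2-6 can be sketched in a few lines. This is a minimal sketch under simplifying assumptions, not the thesis implementation: the eigenspace is built with a plain SVD, the KNN lookup of steps 3-4 is replaced by a direct distance computation, the rejection of a previous outlier in step 5 is omitted, and all names are illustrative.

```python
import numpy as np

def capca_iteration(classes, averages, outliers):
    """One simplified CAPCA iteration (steps 2-6).

    classes:  dict label -> (n_samples, n_features) array of training samples
    averages: dict label -> current class average, shape (n_features,)
    outliers: dict label -> list of sample indices selected so far
    Returns the updated (averages, outliers).
    """
    # Step 2: eigenspace spanned by the current class averages
    A = np.stack([averages[c] for c in classes])
    mean = A.mean(axis=0)
    _, _, Vt = np.linalg.svd(A - mean, full_matrices=False)
    project = lambda x: (x - mean) @ Vt.T

    for c, X in classes.items():
        # Steps 3-4: intra-class distances to the class average,
        # measured in the projected space
        dist = np.linalg.norm(project(X) - project(averages[c][None, :]), axis=1)
        dist[outliers[c]] = -np.inf        # never pick the same sample twice
        # Step 5: the farthest remaining sample becomes the new outlier
        outliers[c].append(int(np.argmax(dist)))
        # Step 6: new class average from the selected outliers only
        averages[c] = X[outliers[c]].mean(axis=0)
    return averages, outliers
```

Repeating `capca_iteration` until every training sample has been selected (step 7) yields the final class averages used to build the CAPCA eigenspace.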

Example

Using Figure 7.2 - Figure 7.6 a simplified example is given of how the class averages are found and how these are used in creating an eigenspace that only maximizes the inter-class variations. Please note that this is a much simplified version of reality, only used for describing the method and its special attributes. These figures show an example of two classes, each with two variances of which one is shared. The shared variance can be seen as a pose or illumination variance, whereas the class-specific variance is one of the appearance variances. The training samples are shown as black dots and the possible variations within the variances are shown by the shaded rectangle. Note that for the red class the training samples do not symmetrically describe the possible variations within that class. The coloured circles are the current class averages; in the eigenspace the new outlier is shown with a saturated colour and the previous outliers with a darker colour.

Figure 7.2 depicts the first iteration until step 6. First the class average using all training samples is obtained and used in creating an eigenspace. The training samples have a low representation in the average and therefore the spread of the samples is quite large. In Figure 7.2b the first outliers are shown for both classes. These automatically become the class averages, as shown in Figure 7.3a. Thus the other outlier of this variance can easily be found, because its distance is maximal, as shown in Figure 7.3b. Figure 7.4 shows the next iteration, where the first outlier of the second variance is obtained, and in Figure 7.5 the second outlier is found. This results in the eigenspace shown in Figure 7.6b.

The intra-class variances are decreased compared to Figure 7.2b, which is the goal of this method: increasing inter-class variances while decreasing intra-class variances to maximize the separability of the classes. As noted before, the training samples of the red class do not fully describe the possible variations of the class. Therefore the class average is shifted and does not fully describe the possible variations in the shared variance. To be able to also classify these variations as the red class, a method has to be found. This is explained in more detail in Chapter ??.


Figure 7.2: Finding outliers: Iteration 1 (a: description space, b: eigenspace)

Figure 7.3: Finding outliers: Iteration 2 (a: description space, b: eigenspace)


Figure 7.4: Finding outliers: Iteration 3 (a: description space, b: eigenspace)

Figure 7.5: Finding outliers: Iteration 4 (a: description space, b: eigenspace)


Figure 7.6: Finding outliers: Iteration 5 (a: description space, b: eigenspace)

7.1.3 Differences with Linear Discriminant Analysis (LDA)

LDA [51] is a method that clusters similar samples, such as samples of the same class. The method tries to minimize the intra-class variations and to maximize the inter-class variations, which is done by solving (6.3). The only similarity between LDA and CAPCA is that both methods try to minimize the intra-class variations and obtain an eigenspace with dimensionality equal to the number of classes. The difference is how this eigenspace is obtained. CAPCA tries to find the best description vector to describe each class, whereas LDA tries to find an optimal projection manifold that describes the class using all training samples and not the possible variances. LDA is therefore not very robust when the class is described by training samples of only a few variances or by one-sided samples in a variance.

7.1.4 Discussion

The description of a class could be improved further if it were possible to exclude variances that are shared between classes and only use the class-specific variances. These excluded variances could be considered variances due to environmental conditions like pose and illumination. It should be investigated whether a method can be found for obtaining these variances by looking at the changes of the Eigenvectors for each class and finding the Eigenvectors that are shared between classes. The projection onto these vectors would be removed from the description of the classes and only the remaining values would then be used in training and classification.


7.2 Certainty KNN

KNN is a fast classification method that uses the distance to the training samples to find the class of the description vector that has to be classified. One of the required aspects in classification of faces in robotics is the ability to detect unknown faces. One possible way to do this is to ascertain the certainty of classification: if the certainty for all possible classes is below a threshold, it can be assumed that the description vector describes an unknown face. For this reason it is of great importance to find a method that can robustly describe the certainty of classification.

The distances to the nearest neighbours can be used in the conventional KNN method to calculate the certainty of classification. But here the problem arises that the relation between the distance and the certainty is unknown. The threshold for unknown classification is therefore often found by manual estimation. While learning new classes or new training samples the relation between the distance and the certainty can change, so the threshold should change too. Further, while using PCA the distance relations will vary, because the eigenvectors of the covariance matrix change with different training samples.

Figure 7.7 shows a hypothetical case of a two-dimensional description space containing classes represented by circular shapes of different sizes. The distances between the classes are described by the dashed black lines. The coloured dots outside the classes are description vectors to be classified, while the lines to the training samples within the classes describe the distances obtained by KNN. Using this figure we would like to explain the different classification cases. The purple dot, which lies between the blue and red classes, represents a description vector of an unknown class. However, the distances of this unknown sample are very similar to those of the green dot, which represents a description vector of the green class. In this case, using distances obtained by a standard KNN method, we would not be able to differentiate between a known and an unknown class.

This calls for a classification method that obtains a certainty that is comparable between classes and that provides a consistent certainty for different training sets. Further, the relation between the certainty of classification and the classification of unknown objects should be obtained. This chapter introduces a method that achieves this.

7.2.1 Method

To consistently describe the certainty using the distance for an unknown description vector, one requires the relation between the distance and the certainty. The proposed method obtains this relation by using the distance between the two closest classes ||ObOg|| and the within-class distances Xc,b and Xc,g. In Figure 7.8 this is shown for the blue and green class. The distance between the centres of both classes describes the range from 100% certainty for the blue class to 100% certainty for the green class, as depicted in Figure 7.9. Since the size of the classes also has an influence on the distance-certainty relation, the maximum distance from the centre of the class to its outliers is used to describe the range of high probability of that class.

The relation between the classification certainty and the distance is then described by the polynomial function in (7.1). The distance x is the distance to the nearest class centre, Xc is the distance to the class boundary and Cc is the certainty at the class boundary. The inter-class distance is used to find the halfway distance X0 where the certainty is 0%. The


Figure 7.7: Example of distances found by KNN.

class boundary certainty is generally set to 70%, a value derived from experiments.

C(x) = 100 − a·x + b·x·(X0 − x)    (7.1)

a = 100 / X0

b = (Cc − (100 − a·Xc)) / (Xc · (X0 − Xc))

so that C(0) = 100, C(Xc) = Cc and C(X0) = 0.

Figure 7.9 shows the certainty for the blue and green class depending on the distance from the blue class centre, assuming the inter-class distance is 200 so X0 = 100. The class boundary Xc is 20 for the blue class and 40 for the green class.

The Certainty K-Nearest Neighbours (CertKNN) method is trained with the class centres, which can be obtained by averaging all training samples, and the outliers of the training

Figure 7.8: Example of distances found by KNN.


samples are used to find the class boundary distance of that class. It can be noted that in combination with the CAPCA method this method performs much better, since the class centres and the outliers of each class are already known. Using the class centres and boundaries the parameters a and b can be obtained for each class.
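The distance-to-certainty relation of (7.1) can be sketched as a small function. The parenthesization of b is reconstructed here under the assumption that the boundary conditions described in the text hold, i.e. C(0) = 100, C(Xc) = Cc and C(X0) = 0:

```python
def certainty(x, X0, Xc, Cc=70.0):
    """Certainty as a function of the distance x from the class centre
    (eq. 7.1): 100% at the centre, Cc at the class boundary Xc, and
    0% at the halfway distance X0 to the neighbouring class centre."""
    a = 100.0 / X0
    # b bridges the gap between the linear value (100 - a*Xc) and Cc
    # at the class boundary; the quadratic term vanishes at 0 and X0.
    b = (Cc - (100.0 - a * Xc)) / (Xc * (X0 - Xc))
    return 100.0 - a * x + b * x * (X0 - x)
```

With the worked example from the text (inter-class distance 200, so X0 = 100, and a blue class boundary Xc = 20), the certainty falls from 100% at the centre to 70% at the boundary and 0% at the halfway point.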

Figure 7.9: Certainty for the different distances to the class centres.

7.2.2 Unknown classification

In classification of a description vector, the distance to the closest class centres is obtained and the certainty is calculated for these classes. The class is determined as the class with the highest certainty. In the case there is no class with a certainty above the specified threshold, the object is classified as unknown. This threshold can be set to 0%, because then it is quite certain that the description vector does not belong to any of the classes. Figure 7.10 shows the certainty thresholds of 0% and 30% and shades the space describing unknown description vectors.

If we consider only the red and orange class in Figure 7.10, there exists no space between these classes for a threshold of 0%. However, if the threshold is set to any higher value, an unknown class can exist in the space in-between. The question arises how high this threshold should be. This can be determined by taking 100 − Cc as the threshold value.
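The unknown-classification rule can be sketched as follows. Equation (7.1), with b reconstructed so that the certainty equals Cc at the class boundary, is computed locally so the snippet is self-contained; all names are illustrative.

```python
def classify_unknown(dists, params, threshold=0.0):
    """Return (label, certainty); the label is 'unknown' when no
    per-class certainty exceeds the threshold.

    dists:  dict label -> distance from the sample to that class centre
    params: dict label -> (X0, Xc, Cc) parameters of eq. (7.1)
    """
    def certainty(x, X0, Xc, Cc):
        # eq. (7.1): 100% at the centre, Cc at Xc, 0% at X0
        a = 100.0 / X0
        b = (Cc - (100.0 - a * Xc)) / (Xc * (X0 - Xc))
        return 100.0 - a * x + b * x * (X0 - x)

    best_label, best_cert = "unknown", threshold
    for label, x in dists.items():
        c = certainty(x, *params[label])
        if c > best_cert:
            best_label, best_cert = label, c
    return best_label, best_cert
```

A sample close to the blue class centre is classified as blue with high certainty, while a sample at the halfway point between two classes gets 0% certainty everywhere and is therefore classified as unknown at the 0% threshold.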


Figure 7.10: Description of classes by CertKNN and unknown space at 0% and 30% threshold

7.2.3 Improvement of CertKNN

The proposed method assumes that the classes have an equal distance in each of the description dimensions. However, this is not a valid assumption in many cases. To solve this problem, CertKNN can be combined with the CAPCA method and the certainty calculated in the projected space. The intra-class distances are then defined by the maximum change of variance in the projection space. For each of the variances the local certainty is calculated using (7.1). The total certainty is calculated using the Euclidean distance on the local certainties. In our future work we plan to implement and validate this solution.

7.3 Multi-frame classification

One of the advantages in robotics is that in general more than one sample of the same object is available. If the recognition pipeline is fast enough for real-time processing, it can use up to 30 frames a second. This means that with only 1 second of fast recognition there are already up to 30 samples available, so the classification and its certainty can be improved by using more than one sample. In this section a method is proposed that uses the classification and the certainty of classification of multiple frames to improve these aspects.

7.3.1 Method

This method assumes that a certainty is provided by the classification method; if this is not the case, the proposed method uses 100% as the certainty of classification. The introduced method creates a relation between the number of classifications of a class n and


Figure 7.11: Certainties for different Nx

Figure 7.12: Certainty for different average certainties of classification

the maximum number of frames used N, and sums this with the average certainty of classification. This relation is described by (7.2).

Cm(n) = a·n + b·n·(N/2 − n)·(n − N) − c + Σ_{i=0}^{n} C(xi) / N    (7.2)

a = 100 / N

b = (Cx + c − a·Nx + 100) / (Nx · (N/2 − Nx) · (Nx − N))

c = −a · N/2

n: number of frames classified as the given class
N: maximum number of frames used
Nx: number of frames for the certainty threshold
Cx: certainty for Nx

This relation is based on a linear relation described by the parameter a and a non-linear relation described by the parameter b. The latter parameter is based on the required number of frames classified as a single class to be certain that the object is of that class. In CertKNN the class boundary is generally described with a 70% certainty. In the multi-frame classification Cx is generally set to 30%, to create a 100% certainty on the assumption that the average certainty of classification is 70% or higher. The parameter Nx describes the number of frames that have to be used to add this certainty and is generally set to 70-90% of the maximum number of frames used. In Figure 7.11 the certainty relation is shown for different Nx in the range 14-18. As can be seen, the certainty relation can be changed to accept few or many frames before the multi-frame method is certain of classification. Further, the parameter c can shift the relation up and down, so it adds more or less certainty, and the parameter a is calculated based on a range of 100% certainty, but a smaller range can also be used.

Because the average of the certainty of classification is obtained by dividing not by the number of frames, but by the maximum number of used frames, the certainty of classification will be low for the first few frames and will increase with the number of classified frames. In Figure 7.12 the certainty is shown in relation to the number of frames classified and for


average certainties of classification ranging from 0 to 100%. The parameter Nx is set to 18 to be quite certain of classification.

This method is used to increase the robustness of classification and decision making of the robot, which is of great importance for robotics. It comes only at the cost of speed, which depends on the maximum number of frames used. In general it is suggested to set this to the number of frames that is processed in 0.5 - 1 second, to provide fast and robust classification at the same time.
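The behaviour described above can be illustrated with a simplified stand-in for (7.2): a frame-count bonus that saturates at Cx once Nx frames of the same class have been seen, plus the per-frame certainties averaged over the maximum number of frames N rather than over the frames seen so far. The non-linear shaping term of (7.2) is omitted here, so this is a sketch of the idea, not the thesis formula.

```python
def multiframe_certainty(frame_certs, N=20, Nx=18, Cx=30.0):
    """Combined certainty after classifying len(frame_certs) frames
    as the same class; frame_certs holds the per-frame certainties."""
    n = len(frame_certs)
    count_term = Cx * min(n, Nx) / Nx      # grows with n, saturates at Cx
    average_term = sum(frame_certs) / N    # divide by N, not by n
    return count_term + average_term
```

With N = 20, Nx = 18 and an average per-frame certainty of 70%, the combined certainty reaches 100% only once all frames agree, and stays low for the first few frames, matching the behaviour shown in Figure 7.12.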

7.3.2 Discussion

The introduced method of using multiple frames to change the certainty of classification is novel and has so far been applied in face and object recognition. The main advantage of this method is that it improves the robustness of visual recognition and increases the performance.


Chapter 8

Application in Robotics

In the previous chapters a pipeline for robust recognition and online learning of faces in real time has been introduced. This pipeline has been applied on Robby, a service robot developed by the Delft Robotics team [55]. Robby is an affordable robot with a minimal number of degrees of freedom. For face recognition the following features of Robby are important: it has two degrees of freedom in the neck in order to focus on and track human users, and it has a Kinect that captures both 2D and 3D image data, required for the recognition pipeline. The software of Robby uses ROS [56] and therefore the face recognition pipeline has been implemented as a ROS node.

This chapter describes the implementation of the proposed methods into a ROS node and the features of this node. Further, results of applying the introduced recognition pipeline on different problems are presented.

8.1 Face recognition for human robot interaction

In order to bring robots into common households, they have to interact with humans. These interactions require communication based on speech with different users, which requires that the robot can learn and remember different users. Further, in interaction between humans it is important that a person focusses on the person they are interacting with; for human-robot interaction this is also required. This section first introduces the tasks the robot has to perform and how face recognition is applied in human-robot interaction, and then how this is implemented in a ROS node. Further it is explained how face tracking is applied to focus on a person.

8.1.1 Task

The Delft Robotics team is participating in the RoboCup@Home competition, where service robots compete against each other. These robots have to perform several common household tasks and are evaluated on how well they perform these tasks. In these challenges the robots are


required to be able to find and recognize a known person, as well as to learn an unknown oneby modelling its appearance and inquiring about its name.

8.1.2 FaceRecognizer node

Humans want to know with whom they are interacting, in order to know how to interact with that person. For example, a person interacts differently with a colleague than with a superior. Further, prior knowledge about a specific person from previous encounters defines how to interact with that person. Generally there are no appearance-based properties that describe relations like these. Therefore a robot should be able to recognize a person based on appearance, to find the person's name, and to create relations between the name of the person and how to interact with that person.

The name of the person is obtained by the introduced face recognition pipeline. This has to be implemented as a ROS node, since the structure of Robby's software is divided into several nodes. Each node performs a specific task, like recognizing a person's face, saying a certain sentence or determining an action. ROS nodes interact with other nodes by publishing information on a topic and reading information from a topic. Therefore, depending on the task of the recognition pipeline, the node has to be able to publish and receive certain information on topics.

The FaceRecognizer node listens to a topic where commands are given by the core of the robot. These commands are:

1. start recognizing

2. start recognizing for a specific person

3. start learning an unknown person

4. start looking for a person

5. stop

The first four commands start the face recognition process from camera inputs. In the case the pipeline finds a face, the location of the face in the image is published. For the first three commands the recognition pipeline continues to describe and classify the found face. If the pipeline has determined the class of the person with high enough certainty, the class is published. For the second command the name is only published if the found person matches the person that has to be found.

In the case the found person is unknown, the node publishes a temporary identifier for the person. The core then starts the process of retrieving the name of the person with the FaceLearner node. This is done in conjunction with the speech synthesis and recognition nodes [57]. The given name is learned together with the obtained training samples of the person.
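The command handling described above can be sketched as follows. The real node exchanges these messages over ROS topics; here both commands and publications are modelled as plain method calls, and the pipeline hooks (detect, classify, temporary_id) as well as the command strings are hypothetical stand-ins.

```python
class FaceRecognizer:
    """Minimal sketch of the FaceRecognizer command handling."""

    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.active = False

    def handle(self, cmd, target=None):
        if cmd == "stop":
            self.active = False
            return None
        self.active = True
        face = self.pipeline.detect()          # face location is published
        if face is None or cmd == "look":      # 'start looking for a person'
            return face
        label = self.pipeline.classify(face)   # None models an unknown face
        if label is None:
            return self.pipeline.temporary_id()  # hand over to FaceLearner
        if cmd == "recognize_person" and label != target:
            return None                        # only publish on a match
        return label
```

For the person-specific command the label is only returned (published) when it matches the requested person, mirroring the behaviour of the second command above.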

8.1.3 Face tracking

As described, it is important that the robot can focus on the person it is interacting with. This is done by turning the head in such a way that Robby is looking at the person. This requires


the position of the person relative to the head and a way to turn the head of Robby. The latter is done in the NeckController, which is described in more detail in Section 8.1.4. There are two methods of determining the position of the face relative to the head: with feedback, or by calculation of the rotation angles.

With feedback, the direction of rotation is determined and the head is rotated in that direction. Using the changing error between the desired position and the current position of the face in the image, it can be determined at what moment the rotation of the head should be stopped. To observe the changing error in position, the position of the face in the image must be obtained continuously. Unfortunately this is not always the case, because the movement creates a blur in the images that inhibits face detection.

The other method of determining the relative position is to find the relation between the pixel distances in the image and the physical-world distances. This requires the focal length of the camera and the distance at which the person is perceived. Robby has a Kinect mounted in its head that can provide depth information. The focal length is also easily obtained from either experiments or the specification of the camera by the manufacturer. Once the relation between the image and the physical world is obtained, the rotation of the head can be calculated.

The final method is a combination of both described methods: the relative position is obtained using the calculated rotations, and during the movement of the head the rotations can be updated in the case the face has been detected. A rotation of the head is only initiated when the face is outside a 10% range of the centre. This is done to prevent oscillation of the head and a restless appearance of the robot. Larger movements are tracked for the person that is being interacted with.
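The pixel-to-angle calculation can be sketched with the pinhole relation angle = atan(pixel offset / focal length in pixels). The function names, the focal length value used in the example and the exact deadband handling are assumptions for illustration; the 10% range around the centre comes from the text.

```python
import math

def rotation_to_face(u, v, img_w, img_h, f_px):
    """Pan/tilt angles (radians) that would centre a face detected at
    pixel (u, v) in an img_w x img_h image, given the focal length
    f_px in pixels (pinhole camera model)."""
    pan = math.atan2(u - img_w / 2.0, f_px)
    tilt = math.atan2(v - img_h / 2.0, f_px)
    return pan, tilt

def needs_move(u, v, img_w, img_h, deadband=0.10):
    """Only command a rotation when the face lies outside a 10% band
    around the image centre, to avoid an oscillating, restless head."""
    return (abs(u - img_w / 2.0) > deadband * img_w or
            abs(v - img_h / 2.0) > deadband * img_h)
```

A face already at the image centre yields zero pan and tilt, and small offsets inside the deadband trigger no movement at all.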

8.1.4 NeckController

The NeckController node converts movement goals into control commands for the motors. These motors can be controlled in two modes: position and speed mode. In position mode the control commands sent to the motors are positions; in speed mode the control commands are rotational speeds. Since in position mode the speed of movement cannot be controlled, speed mode was chosen.

Movement goals are received from other nodes and are converted into speeds for the motors of the neck. There are three types of movement goals:

1. Absolute position

2. Relative position

3. Rotation speed

These are described by the neck control message, containing y, z position in radians for the absolute position, y, z rotation in radians for the relative position, and y, z velocities for the rotation speed of the head. The neck sets the rotation goal of the head according to the given position or rotation. This goal is reached with proportional control of either the given speed or the default speed.
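The proportional control of a single neck joint in speed mode can be sketched as below; the gain and the speed limit are illustrative values, not taken from the thesis.

```python
def neck_speed(goal_rad, current_rad, kp=2.0, max_speed=1.0):
    """Proportional speed command for one neck joint: the commanded
    rotational speed grows with the remaining rotation error and is
    clamped to the motor's speed limit."""
    error = goal_rad - current_rad
    speed = kp * error
    return max(-max_speed, min(max_speed, speed))
```

Near the goal the commanded speed shrinks towards zero, which gives the smooth stop that pure position mode (with its fixed, uncontrollable speed) cannot provide.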


8.2 Experiments & Results

In order to evaluate the previously introduced methods, several experiments have been performed using a face database. First, to show the benefit of using the introduced methods, the overall performance is compared with that of the state of the art methods. The performance evaluation is based on recognition, unknown classification and speed. Further, the performance of unknown classification was tested for the CertKNN classifier for different training set sizes and certainty thresholds. Finally, the benefit of multiple-frame classification is shown.

8.2.1 Setup

To test the performance of the proposed methods, we used the standard PIE database [58]. It has 68 classes with many variations in viewpoint, illumination, facial expressions and talking. Of the state of the art databases, this is the only freely available one that has both enough classes and images per class, as well as examples of different variations. In [52] the PIE database is used with a set of 170 images; a similar set is created here to be able to compare the introduced methods with the results of the state of the art methods. The dataset consists of faces detected by the cascade face detector of [17]. Each class has a total of 170 images from two different viewpoints, divided into the following categories: 120 images captured while the subject was talking, 48 images containing illumination variation and 2 from expression variations. For each experiment random selections of images are taken to create the set of training samples; the rest of the images are used as test samples. For every experiment 8 sets of training and test samples are created, and the resulting performance of all 8 sets is averaged to filter out abnormalities.

The performance of the methods is measured using precision (8.1), recall (8.3) and F-measure (8.5). Precision describes how accurately the recognition pipeline can classify a face. In order for a robot to efficiently perform human-robot interaction tasks, it is important to have both high face recognition precision and high recall. Recall describes the classification performance of known classes with respect to unknown classification. Further, to show the performance of the unknown classification methods, fallout and specificity are defined as stated below. In the following sections the performed experiments and the obtained results are described.

Precision = TP / (TP + FP)        (8.1)

Fallout = TN / (TN + FN)          (8.2)

Recall = TP / (TP + FN)           (8.3)

Specificity = TN / (TN + FP)      (8.4)

F. Gaisser Master of Science Thesis

Page 67: Master Thesis Report: Face Recognition for Cognitive Robots · 2019-01-02 · robot architecture. In this thesis such a framework, consisting of localization, description, classification

8.2 Experiments & Results 53

F-measure = 2 · Precision · Recall / (Precision + Recall)        (8.5)

TP: True Positives are all samples that are classified as the correct class.
FP: False Positives are all samples that are classified as the wrong class.
TN: True Negatives are all samples that are correctly classified as unknown.
FN: False Negatives are all samples that are wrongly classified as unknown.
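The metrics (8.1)–(8.5) can be computed directly from these four counts. The sketch below follows the convention used above, in which TN counts samples correctly classified as unknown; note that (8.2) and (8.4) therefore differ from the textbook definitions of fallout and specificity.

```python
def metrics(tp, fp, tn, fn):
    """Evaluation metrics (8.1)-(8.5) with the TP/FP/TN/FN convention above."""
    precision = tp / (tp + fp)                 # Eq. (8.1)
    recall = tp / (tp + fn)                    # Eq. (8.3)
    fallout = tn / (tn + fn)                   # Eq. (8.2), as defined in the text
    specificity = tn / (tn + fp)               # Eq. (8.4), as defined in the text
    f_measure = 2 * precision * recall / (precision + recall)  # Eq. (8.5)
    return precision, recall, fallout, specificity, f_measure

p, r, fo, sp, f = metrics(tp=90, fp=10, tn=80, fn=20)
```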

8.2.2 Experiment 1

In order to compare the performance of the introduced methods with that of the state of the art, an experiment is devised that uses different pipelines. In this experiment different combinations of description and classification methods are tested: Principal Component Analysis (PCA) and Class Average Principal Component Analysis (CAPCA) are used as description methods, while K-Nearest Neighbours (KNN) and Certainty K-Nearest Neighbours (CertKNN) are used as classification methods.
Since the dimensionality of the description vector influences the performance, a common dimension size has to be found. The CAPCA method has a dimensionality equal to the number of classes (68), whereas the dimensionality of PCA can be chosen freely. Therefore the same dimensionality of 68 is used for both in this experiment.
In Table 8.1 the performance of the introduced methods in comparison with the existing methods is shown for a training set of 85 samples, which is comparable with the setup used in [52]. Using CertKNN instead of KNN increases the precision from 98.13% to 99.87%, while adding the possibility of classifying unknown classes. The best results are obtained by combining the introduced CAPCA and CertKNN methods. Compared with the state of the art methods, shown in Table 8.2, the performance is significantly higher.

Method            Precision (%)
PCA+KNN           98.13
PCA+CertKNN       99.87
CAPCA+KNN         97.88
CAPCA+CertKNN     99.75

Table 8.1: Performance comparison of introduced methods

Method            Precision (%)
Eigenfaces        79.4
Fisherfaces       94.3
Laplacianfaces    95.6

Table 8.2: Results of state of the art
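For reference, a minimal PCA + nearest-neighbour pipeline of the kind used as the baseline in this experiment can be sketched as follows. The data is random and the sketch does not reproduce the CAPCA or CertKNN specifics; it only illustrates the describe-then-classify structure of the pipelines being compared.

```python
import numpy as np

def pca_fit(X, dim):
    """Fit PCA on the rows of X; return the mean and top-`dim` components."""
    mean = X.mean(axis=0)
    # SVD of the centred data yields the principal axes as rows of vt.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:dim]

def describe(X, mean, components):
    """Project samples into the low-dimensional description space."""
    return (X - mean) @ components.T

def knn_classify(query, train_desc, train_labels):
    """1-nearest-neighbour classification in description space."""
    dists = np.linalg.norm(train_desc - query, axis=1)
    return train_labels[np.argmin(dists)]

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 64))                     # 20 fake "face" vectors
labels = np.array([i // 10 for i in range(20)])   # two classes of 10 samples
mean, comps = pca_fit(X, dim=5)
desc = describe(X, mean, comps)
pred = knn_classify(desc[0], desc, labels)
```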


8.2.3 Experiment 2

The performance of methods generally depends on the size of the training set. Therefore the performance is also obtained for different training set sizes, using the same experimental setup as in experiment 1; the results can be seen in Table 8.3. From the results it can be observed that the CAPCA+CertKNN method gives a high precision above 98%, even for small numbers of training samples. This can be explained by the property of the CAPCA method of finding the most representative samples and using only these in training. Such a high precision with a small training set makes this method very applicable on a robot and in other real-world applications requiring database acquisition.

Method \ # samples      5        10       20       40       85       120
PCA+KNN               62.04%   80.60%   86.73%   92.20%   98.13%   99.27%
PCA+CertKNN           79.68%   97.39%   99.70%   99.84%   99.87%   99.84%
CAPCA+KNN             63.18%   81.43%   87.19%   92.21%   97.88%   99.11%
CAPCA+CertKNN         98.96%   98.52%   99.19%   99.20%   99.75%   99.92%

Table 8.3: Precision

CertKNN has the benefit that it allows for unknown classification. Therefore the recall and F-measure of the methods are shown in Table 8.4. Using the combination of CAPCA and CertKNN increases the recall of the system significantly over using PCA+CertKNN. For training sets with more than 40 samples, it can be observed that PCA cannot separate the classes any more and that the recall decreases again. This does not happen for CAPCA, because this method obtains a separable description space.

Recall
Method \ # samples      5        10       20       40       85       120
PCA+CertKNN           64.44%   69.40%   72.01%   73.76%   68.94%   65.98%
CAPCA+CertKNN         66.92%   79.28%   83.88%   90.75%   96.29%   97.12%

F-measure
Method \ # samples      5        10       20       40       85       120
PCA+CertKNN           71.25%   81.04%   83.62%   84.84%   81.57%   79.45%
CAPCA+CertKNN         79.85%   87.86%   90.89%   94.79%   97.99%   98.50%

Table 8.4: Recall and F-measure

In Table 8.5 the training speed of the different methods is shown. For PCA+KNN the training time increases drastically as the training set grows. For more than 40 training samples per class, CAPCA can describe the training set more than 100 times faster than PCA. This is one of the major benefits of using CAPCA.
During a face learning process of two seconds, up to 60 samples can be obtained, while for a complete training procedure of turning the head left and right and recognizing a person's name, up to 300 samples can be obtained. Even if the sample rate is less than 30 Hz, the training set will be too large for training online. For a service robot this is unacceptable, because it has to be able to learn a person online. Therefore using a method like CAPCA, which selects only a few representative training samples, can be a solution to this problem.


The classification speed of PCA+KNN was already fast enough for real-time classification, but CAPCA+CertKNN reduces the classification time by up to a factor of 14.

Method \ # samples      5        10       20        40        85        120
PCA+KNN               5.25    24.00    178.63   1470.63   5476.63   5562.25
PCA+CertKNN           7.00    22.38    185.48   1567.50   4805.88   5515.63
CAPCA+KNN             2.38     3.75      7.13     13.63     37.13     38.75
CAPCA+CertKNN         3.25     4.13      7.38     15.50     39.63     45.63

Table 8.5: Training speed (sec)
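The idea behind CAPCA's sample reduction, selecting only representative training samples, can be illustrated with a simple iterative scheme: compute the class average, discard the samples farthest from it, and repeat. This is an illustrative simplification; the actual CAPCA outlier-selection procedure is described in Chapter 7.

```python
import numpy as np

def robust_class_average(samples, keep_fraction=0.8, iterations=3):
    """Illustrative representative-sample selection for one class.

    Repeatedly averages the samples and keeps only the fraction closest
    to the current average, so outliers stop influencing the class centre.
    """
    samples = np.asarray(samples, dtype=float)
    for _ in range(iterations):
        centre = samples.mean(axis=0)
        dists = np.linalg.norm(samples - centre, axis=1)
        keep = max(1, int(len(samples) * keep_fraction))
        # Keep the samples closest to the current centre.
        samples = samples[np.argsort(dists)[:keep]]
    return samples.mean(axis=0)
```

Training then only has to process the reduced sample set, which is why a method of this kind scales so much better than describing every training sample, as PCA does.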

8.2.4 Experiment 3

In the case that a person needs to be learned by a robot, this person first has to be classified as unknown. Therefore the performance of unknown classification is evaluated in this experiment. The original KNN method does not provide a way to classify unknown classes and PCA has too long training times; therefore these methods are not used in the evaluation of unknown classification and only CAPCA+CertKNN is used.
The training set consists of a varying number of classes, each with 85 images as training samples and 85 images as test samples. The remaining classes are used as test samples for unknown classification. In Table 8.6 the performance is shown for a varying number of trained classes and a class boundary certainty Cc (7.1) of 70. The performance increases with an increasing number of trained classes, which was expected, as the description space is more extensively defined and can therefore describe more differences between classes. Therefore as many classes as possible should be learned to get a high precision. The F-measure of 96.73% is comparable with the F-measure of 97.99% obtained in experiment 2, despite the addition of unknown classification.

                     Known                               Unknown
# trained classes    Precision   Recall    F-measure    Fallout   Specificity   F-measure
10                   20.70%      99.82%    34.29%       99.91%    34.29%        51.06%
20                   49.85%      99.01%    66.31%       99.31%    58.69%        73.78%
30                   68.47%      98.95%    80.93%       98.74%    64.21%        77.82%
40                   86.80%      98.61%    92.33%       97.56%    78.69%        87.11%
50                   95.24%      98.28%    96.73%       94.79%    86.45%        90.43%

Table 8.6: Performance for different number of trained classes

In CertKNN the certainty at the class boundary is a parameter that can be set. The performance for different values of Cc is shown in Table 8.7, for 50 trained classes. This parameter changes the acceptance of a known class, which can be seen from the decreasing precision for known classification; at the same time it changes the acceptance of unknown classes. A balance has to be found, and therefore a Cc of 70 is generally taken. For a dataset that has only a few known classes and where many unknown classifications are expected, Cc can be set to a higher value; for a dataset with many known classes that is used almost exclusively for known classification, Cc can be set to a lower value.


        Known                               Unknown
Cc      Precision   Recall    F-measure    Precision   Specificity   F-measure
50      96.33%      97.87%    97.10%       93.85%      89.73%        91.74%
60      95.87%      98.07%    96.96%       94.33%      88.37%        91.25%
70      95.24%      98.28%    96.73%       94.79%      86.45%        90.43%
80      94.53%      98.47%    96.46%       95.25%      84.31%        89.45%
90      93.85%      98.63%    96.18%       95.61%      82.21%        88.41%

Table 8.7: Performance for different values of Cc
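The role of Cc can be illustrated with a minimal distance-to-certainty rule: certainty is 100% at the class centre and drops to Cc at the class boundary, below which a query is rejected as unknown. The linear mapping and the per-class radius here are illustrative assumptions; the thesis derives the actual distance-certainty relation from the training data of each class.

```python
import numpy as np

def classify_with_certainty(query, centres, radii, cc=70.0):
    """Return (class index, certainty) or ("unknown", certainty).

    `centres` holds one class centre per row; `radii` holds the assumed
    class-boundary distance of each class. Certainty is 100% at the
    centre and falls linearly to `cc`% at the boundary.
    """
    dists = np.linalg.norm(centres - query, axis=1)
    best = int(np.argmin(dists))
    certainty = 100.0 - (100.0 - cc) * dists[best] / radii[best]
    # Below the boundary certainty the sample is rejected as unknown.
    if certainty < cc:
        return "unknown", certainty
    return best, certainty

centres = np.array([[0.0, 0.0], [10.0, 0.0]])
radii = np.array([2.0, 2.0])
label, cert = classify_with_certainty(np.array([0.5, 0.0]), centres, radii)
```

Raising cc shrinks the accepted region around each class (more queries become unknown); lowering it widens the region, mirroring the trade-off shown in Table 8.7.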

8.2.5 Experiment 4

In the previous experiments the multi-frame classification method was not used. In this experiment the classification performance for different parameters of the multi-frame classification is tested and evaluated using CAPCA+CertKNN. The training set is comparable with the 50-trained-classes set from experiment 3, but now the classification is only counted after N test samples.
In Table 8.8 the performance is shown for different N, where ref is the performance without multi-frame classification. The performance for known classification is best between 15 and 30 frames, whereas unknown classification improves with increasing N.

        Known                               Unknown
N       Precision   Recall    F-measure    Fallout   Specificity   F-measure
ref     95.24%      98.28%    96.73%       94.79%    86.45%        90.43%
4       96.50%      98.97%    97.72%       97.09%    90.50%        93.68%
8       96.68%      99.83%    98.23%       99.53%    91.34%        95.26%
15      96.78%      99.95%    98.34%       99.87%    92.32%        95.94%
30      97.02%      99.99%    98.48%       99.98%    94.37%        97.09%
60      96.54%      99.98%    98.23%       99.98%    96.96%        98.45%

Table 8.8: Performance for different N
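A minimal form of multi-frame classification is a majority vote over the per-frame labels after N frames. The vote below is an illustrative simplification; the actual method combines the per-frame certainties as described in Chapter 7.

```python
from collections import Counter

def multi_frame_label(frame_labels, n):
    """Decide on a label after the first n frames by majority vote."""
    votes = Counter(frame_labels[:n])
    label, _ = votes.most_common(1)[0]
    return label

# A single misclassified frame is outvoted by the surrounding frames.
frames = ["alice", "alice", "bob", "alice", "alice", "unknown"]
decision = multi_frame_label(frames, n=5)
```

This shows why a larger N smooths out occasional per-frame errors, at the cost of a longer delay before the robot commits to a decision.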


Chapter 9

Conclusion

Before service robots become part of every household, they need to be able to adapt to novel environments and new users. For good human-robot interaction, the facial appearance as well as the name of novel users has to be learned. One of the problems of the state of the art methods is that face learning was either too slow to be applied in a real-time application or performed too poorly for efficient interaction. Therefore in this thesis we propose a novel framework and a set of methods that enable real-time face recognition and learning with very high performance, applicable on a service robot.
Another problem that often occurs in the design of robot software architectures is that the classification and learning frameworks are specifically designed for certain types of objects or environments. In that way robot adaptation to novel environments is very limited. In this thesis, we introduce a novel framework that can perform recognition, classification and learning of all types of objects regardless of the environment or robot architecture.

The introduced recognition framework contains the following stages of the recognition and learning process. First the image is acquired and detection of objects is performed on the input image. The detected object is then described and classified in order to be either recognized or learned. In order to achieve robust learning, the framework allows for unknown object classification and online relearning of known objects as more data is acquired. Further, it is very beneficial to easily change the methods used in every part of the recognition process (e.g. multiple detection, description and classification methods can be used). Therefore the introduced framework allows for dynamic loading of recognition pipelines.
For online learning of faces, novel methods have been developed as a result of this thesis. First, to improve the performance as well as the speed of the state of the art Principal Component Analysis (PCA) method, we have introduced the Class Average Principal Component Analysis (CAPCA) method. This method improves the performance by creating a description space that is more easily separable, by increasing the inter-class distances while decreasing the intra-class distances. The description performance is improved by reducing the description variance within a class and obtaining a dimension for each class. To significantly increase the speed, the projection matrix is reduced in size by not taking all training samples, as in PCA, but by finding a single representation for each class.


Further, to allow for classification of unknown faces, the novel Certainty K-Nearest Neighbours (CertKNN) method has been introduced. The main benefit over the state of the art methods is finding the relation between the distance of a classification and the certainty of that classification. This relation is automatically calculated from the data belonging to each class. In that way nearly optimal unknown classification can be done. Finally, to further improve recognition performance, a method has been developed that utilizes multiple frames in classification.

To prove the benefits of the introduced methods, extensive experiments have been performed on a state of the art face recognition database. In all the experiments the combination of CAPCA and CertKNN had the best performance. First the introduced methods were compared with the state of the art methods, and a performance increase of 4% to 15% was shown. Despite small training sets, the introduced classification methods gave a very high performance of above 98% and a precision increase of more than 35%. As shown in the second experiment, one of the main benefits of the CAPCA method is its fast training and classification compared to the state of the art: training was sped up by up to a factor of 100 and classification by up to a factor of 14. This allows the usage of these methods in real-time applications. Further, the performance of unknown classification was tested for different numbers of trained classes as well as for different parameter settings. The best performance was achieved with 50 trained classes out of a total of 68, with an F-measure of 96% and 90% for known and unknown classification respectively. For multi-frame classification a performance increase of 8% was achieved for unknown classification. Lastly, the introduced methods were applied on the Delft Robotics service robot and extensively tested in the RoboCup@Home challenge.

To further improve the introduced methods, several guidelines can be proposed. First, to improve the CertKNN method, the class averages found by CAPCA could be used to describe the centres of the classes. These class averages give a much better representation of the class centres than taking the average of all training samples, as is originally done in CertKNN. Further, the training speed of CertKNN can be improved by using the outliers found by CAPCA to describe the class boundary. These outliers could also allow a better description of the class boundary than the current spherical shape. To further prove the applicability of these methods on a service robot, more extensive testing in real-world situations could be performed. Also, properties of the users such as age and gender could be classified.


Bibliography

[1] United Nations, "World population ageing 1950-2050," tech. rep., UN, New York, 2002.

[2] J. Forlizzi, "Robotic products to assist the aging population," Interactions, vol. 12, pp. 16–18, Mar. 2005.

[3] N. Roy, G. Baltus, D. Fox, F. Gemperle, J. Goetz, T. Hirsch, D. Margaritis, M. Montemerlo, J. Pineau, J. Schulte, and S. Thrun, "Towards personal service robots for the elderly," in Proceedings of the Workshop on Interactive Robots and Entertainment (WIRE 2000), (Pittsburgh, PA), May 2000.

[4] "iRobot."

[5] K. Yamazaki, R. Ueda, S. Nozawa, Y. Mori, T. Maki, N. Hatao, K. Okada, and M. Inaba, "A demonstrative research for daily assistive robots on tasks of cleaning and tidying up rooms," in Proceedings of the 14th Robotics Symposia, pp. 522–527, 2009.

[6] K. Ruskamp, "Door opening and closing for nonholonomic non-redundant service robots," Master's thesis, Delft University of Technology, 2011.

[7] N. Ohta and A. R. Robertson, CIE Standard Colorimetric System. John Wiley and Sons, Ltd, 2006.

[8] D. Van der Weken, M. Nachtegael, and E. Kerre, "Improved image quality measures using ordered histograms," in Multimedia Signal Processing, 2004 IEEE 6th Workshop on, pp. 67–70, Sept.–Oct. 2004.

[9] G. Finlayson, S. Hordley, G. Schaefer, and G. Y. Tian, "Illuminant and device invariant colour using histogram equalisation," Pattern Recognition, vol. 38, no. 2, pp. 179–190, 2005.

[10] J. Stark, "Adaptive image contrast enhancement using generalizations of histogram equalization," Image Processing, IEEE Transactions on, vol. 9, pp. 889–896, May 2000.

[11] "Wikipedia - Gaussian blur."

[12] K. Mikolajczyk and C. Schmid, "An affine invariant interest point detector," in 7th European Conference on Computer Vision (ECCV 2002), (INRIA Rhone-Alpes GRAVIR-CNRS 655, av. de l'Europe, 38330 Montbonnot, France), pp. 128–142, 2002.

[13] E. Adelson, C. Anderson, J. Bergen, P. Burt, and J. Ogden, "Pyramid methods in image processing," RCA Engineer, vol. 29, no. 6, pp. 33–41, 1984.

[14] K. Pearson, "On lines and planes of closest fit to systems of points in space," Philosophical Magazine, vol. 2, pp. 559–572, 1901.

[15] T. Kadir, A. Zisserman, and M. Brady, "An affine invariant salient region detector," in Computer Vision - ECCV 2004 (T. Pajdla and J. Matas, eds.), vol. 3021 of Lecture Notes in Computer Science, pp. 228–241, Springer Berlin / Heidelberg, 2004.

[16] T. Kadir and M. Brady, "Saliency, scale and image description," International Journal of Computer Vision, vol. 45, pp. 83–105, 2001.

[17] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, pp. 137–154, 2004.

[18] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, vol. 1, pp. I-511–I-518, 2001.

[19] R. Lienhart and J. Maydt, "An extended set of Haar-like features for rapid object detection," in Image Processing. 2002. Proceedings. 2002 International Conference on, vol. 1, pp. I-900–I-903, 2002.

[20] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek, "Evaluating color descriptors for object and scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1582–1596, 2010.

[21] V. Ferrari, F. Jurie, and C. Schmid, "From images to shape models for object detection," International Journal of Computer Vision, vol. 87, pp. 284–303, May 2010.

[22] M. Sonka, V. Hlavac, and R. Boyle, Image Processing, Analysis, and Machine Vision. Brooks and Cole Publishing, 2nd ed., 1998.

[23] M. Rudinac and X. Wang, Basic Image Processing for Robotics. Delft University of Technology.

[24] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, pp. 91–110, 2004.

[25] D. Lowe, "Object recognition from local scale-invariant features," in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2, pp. 1150–1157, 1999.

[26] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 27, no. 10, pp. 1615–1630, 2005.

[27] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Comput. Vis. Image Underst., vol. 110, pp. 346–359, June 2008.

[28] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, (Washington, DC, USA), pp. 886–893, IEEE Computer Society, 2005.

[29] M. Varma and A. Zisserman, "A statistical approach to texture classification from single images," International Journal of Computer Vision, vol. 62, pp. 61–81, 2005.

[30] T. Randen and J. Husoy, "Filtering for texture classification: a comparative study," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 21, pp. 291–310, Apr. 1999.

[31] "Point Cloud Library (PCL)."

[32] R. B. Rusu, G. Bradski, R. Thibaux, and J. Hsu, "Fast 3D recognition and pose using the viewpoint feature histogram," in Proceedings of the 23rd IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), (Taipei, Taiwan), Oct. 2010.

[33] M. Turk and A. Pentland, "Face recognition using eigenfaces," in Computer Vision and Pattern Recognition, 1991. Proceedings CVPR '91., IEEE Computer Society Conference on, pp. 586–591, June 1991.

[34] J. R. R. Uijlings, A. W. M. Smeulders, and R. J. H. Scha, "Real-time bag of words, approximately," in Proceedings of the ACM International Conference on Image and Video Retrieval, CIVR '09, (New York, NY, USA), pp. 6:1–6:8, ACM, 2009.

[35] O. Ivanciuc, "Applications of support vector machines in chemistry," in Reviews in Computational Chemistry (K. Lipkowitz and T. Cundari, eds.), vol. 23, pp. 291–400, Wiley-VCH, 2007.

[36] S. Maji, A. C. Berg, and J. Malik, "Classification using intersection kernel support vector machines is efficient," in CVPR, 2008.

[37] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Min. Knowl. Discov., vol. 2, pp. 121–167, June 1998.

[38] T. Cover and P. Hart, "Nearest neighbor pattern classification," Information Theory, IEEE Transactions on, vol. 13, pp. 21–27, Jan. 1967.

[39] L. Breiman, "Random forests," Machine Learning, vol. 45, pp. 5–32, 2001.

[40] Y. Freund and R. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory (P. Vitányi, ed.), vol. 904 of Lecture Notes in Computer Science, pp. 23–37, Springer Berlin / Heidelberg, 1995.

[41] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset."

[42] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

[43] M. Rudinac, G. Kootstra, D. Kragic, and P. Jonker, "Learning and recognition of objects inspired by early cognition," in International Conference on Intelligent Robots and Systems, 2012.

[44] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 34, pp. 1409–1422, July 2012.

[45] C. Lampert, H. Nickisch, and S. Harmeling, "Learning to detect unseen object classes by between-class attribute transfer," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 951–958, June 2009.

[46] A. Torralba, K. P. Murphy, and W. T. Freeman, "Using the forest to see the trees: exploiting context for visual object detection and localization," Commun. ACM, vol. 53, pp. 107–114, Mar. 2010.

[47] J. Shotton, J. Winn, C. Rother, and A. Criminisi, "TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context," International Journal of Computer Vision, vol. 81, pp. 2–23, 2009.

[48] H. Kjellström, J. Romero, and D. Kragić, "Visual object-action recognition: Inferring object affordances from human demonstration," Comput. Vis. Image Underst., vol. 115, pp. 81–90, Jan. 2011.

[49] M. Chen and A. Hauptmann, "MoSIFT: Recognizing human actions in surveillance videos," Tech. Rep. CMU-CS-09-161, Carnegie Mellon University, 2009.

[50] D. Moore, I. Essa, and M. H. Hayes, "Exploiting human actions and object context for recognition tasks," in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 1, pp. 80–86, 1999.

[51] P. Belhumeur, J. Hespanha, and D. Kriegman, "Eigenfaces vs. Fisherfaces: recognition using class specific linear projection," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 19, pp. 711–720, July 1997.

[52] X. He, S. Yan, Y. Hu, P. Niyogi, and H.-J. Zhang, "Face recognition using Laplacianfaces," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 27, pp. 328–340, Mar. 2005.

[53] Y. Adini, Y. Moses, and S. Ullman, "Face recognition: the problem of compensating for changes in illumination direction," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 19, pp. 721–732, July 1997.

[54] R. A. Fisher, "The statistical utilization of multiple measurements," Annals of Eugenics, vol. 8, no. 4, pp. 376–386, 1938.

[55] "Delft Robotics."

[56] "ROS."

[57] S. P. Rueda, "A speech-based dialogue system for household robots," Master's thesis, TU Delft, 2012.

[58] T. Sim, S. Baker, and M. Bsat, "The CMU pose, illumination, and expression (PIE) database," in Automatic Face and Gesture Recognition, 2002. Proceedings. Fifth IEEE International Conference on, pp. 46–51, May 2002.


Glossary

List of Acronyms

PCA Principal Component Analysis

CAPCA Class Average Principal Component Analysis

LDA Linear Discriminant Analysis

SVM Support Vector Machines

KNN K-Nearest Neighbours

CertKNN Certainty K-Nearest Neighbours


List of Figures

2.1 Color histogram correction [9] Left: images from different cameras with different lighting. Right: corrected images. . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2 Contrast correction [10] Left top: original, rest correction . . . . . . . . . . . . . 4

2.3 Effect of Gaussian blur on methods like edge detection. [11] Left: no blur, too many edges. Middle: blur, good edges. Right: too much blur, edges lost . . . . 4

2.4 Gaussian blur [13] Left: original images. Middle: normal blur, noise gone. Right: too much blur, details lost . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.5 Size appearance in both images is similar for different sized objects. . . . . . . . 5

2.6 Image pyramid with scaling of 50% for each step. . . . . . . . . . . . . . . . . . 6

2.7 Harris point detector, detecting points in images with different viewpoints. [12] . 7

2.8 Haar-like features for face detection . . . . . . . . . . . . . . . . . . . . . . . . 8

2.9 Face detection using haar-like features . . . . . . . . . . . . . . . . . . . . . . . 8

3.1 Example of toy cars where a colour descriptor could work on . . . . . . . . . . . 10

3.2 Fourier shape descriptor. [23] a) original shape, b) contour, c) reconstructed shape with 420 Fourier descriptors, d) reconstructed shape with 28 Fourier descriptors. . 11

3.3 Example of how a local descriptor method, like SIFT, GLOH or SURF, can be used in detection of objects. [24] . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.4 Example of Gabor filters, combination of Fourier and Gaussian. . . . . . . . . . . 14

3.5 3D image of a bunny. [31] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.6 Histogram of viewpoints. [31] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.7 Example of a projection of a 3-dimensional sample space onto a 2-dimensional Eigenspace. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4.1 Two-dimensional Euclidean space with training samples of 5 classes and 4 different description vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18


4.2 SVM mapping of training samples. [35] . . . . . . . . . . . . . . . . . . . . . . 18

4.3 SVM separation with different kernels. [35] a) linear kernel, b) quadratic kernel, c) 2nd degree polynomial kernel, d) spline kernel. . . . . . . . . . . . . . . . . . 19

4.4 Example of two cups, that are not the same subtype of objects. . . . . . . . . . 22

5.1 Overview of recognition pipeline in the recognition framework . . . . . . . . . . . 28

7.1 Finding class average with different outliers. . . . . . . . . . . . . . . . . . . . . 36

7.2 Finding outliers: Iteration 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

7.3 Finding outliers: Iteration 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

7.4 Finding outliers: Iteration 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

7.5 Finding outliers: Iteration 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

7.6 Finding outliers: Iteration 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

7.7 Example of distances found by KNN. . . . . . . . . . . . . . . . . . . . . . . . . 43

7.8 Example of distances found by KNN. . . . . . . . . . . . . . . . . . . . . . . . . 43

7.9 Certainty for the different distances to the class centres. . . . . . . . . . . . . . 44

7.10 Description of classes by CertKNN and unknown space at 0% and 30% threshold 45

7.11 Certainties for different Nx . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

7.12 Certainty for different average certainties of classification . . . . . . . . . . . . . 46


List of Tables

8.1 Performance comparison of introduced methods . . . . . . . . . . . . . . . . . . 53

8.2 Results of state of the art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

8.3 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

8.4 Recall and F-measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

8.5 Training speed (sec) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

8.6 Performance for different number of trained classes . . . . . . . . . . . . . . . . 55

8.7 Performance for different values of Cc . . . . . . . . . . . . . . . . . . . . . . 56

8.8 Performance for different N . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
