
Visual Articulated Hand Tracking

for Interactive Surfaces

Martin Tosas

Thesis submitted to the University of Nottingham

for the degree of Doctor of Philosophy

December 2006



Abstract

As computer systems become more and more embedded into our environment, the ability to

interact with them without the need for special equipment is very attractive. Vision-based

Human Computer Interaction (HCI) has the potential of making this possible in a form that is

both easy and natural for people to use. However, there are great technical challenges in the

creation of robust algorithms for vision-based HCI systems. One strategy to overcome these

technical challenges is to create vision algorithms that are more specific to a particular

application. This thesis develops visual articulated hand tracking algorithms for use in

interactive surfaces.

The possibility of using visual articulated hand tracking as a sensing technology for interactive

surfaces is very attractive compared with other sensing technologies because it is flexible,

cheap, and avoids the need for cumbersome hardware. A small number of attempts at

developing vision-based interactive surfaces have been made, but there have been no

previous attempts at using articulated hand tracking as a sensing technology for interactive

surfaces. This is because visual articulated hand tracking is difficult to achieve. Fortunately,

the 2D nature of interactive surfaces does not require full 3D articulated hand tracking.

This makes it possible to develop a specific but robust visual articulated hand tracker that is

constrained to a particular viewpoint – where the hand is approximately parallel to the

interactive surface.

This thesis develops a specific visual articulated hand tracking system which enables the

creation of a novel vision-based interactive surface, referred to as the Virtual Touch Screen

(VTS). The contents of a VTS are displayed using a projector or a head mounted display.

The VTS is made touch-sensitive by using visual hand tracking. The VTS can potentially be

used as, for example, an alternative to touch screens, an interface to mobile computing

devices, an interactive surface for information points, shop displays, video games, and as a

sterile interface for use in hospitals and clean rooms.


List of publications

Published papers:

Tosas, M., Li, B. (2004). Virtual Touch Screen for Mixed Reality. Lecture Notes in

Computer Science, Proc. ECCV 2004 Workshop on HCI, pp. 48-59.

Tosas, M., Li, B., Mills, S. (2005). Switching Template Fitting Methods During Articulated

Object Tracking. In Proc. IEE Int. Conf. Visual Information Engineering, VIE 2005, pp. 243-249.

Tosas, M., Li, B. (2007). Tracking Tree-Structured Articulated Objects Using Particle

Interpolation. Accepted for publication in the proceedings of CGIM2007.

Tosas, M., Li, B. (2007). Virtual Touch Screen: A Vision-Based Interactive Surface.

Accepted for publication in the proceedings of CGIM2007.

Tosas, M., Li, B., Mills, S. (2007). Fast Adaptable Skin Colour Detection in RGB Space.

Submitted to VISAPP 2007.


Acknowledgements

I would like to thank my first supervisor Bai Li for giving me the opportunity of starting this

PhD, for having confidence in my abilities and letting me pursue my visions about vision-based

interactive surfaces as the topic for this thesis, and for her invaluable help and

supervision during the course of this PhD. I would like to thank my second supervisor Steven

Mills for always being available and willing to give me help and advice both in

technical subjects and in my English writing. Other people in the department who I would

like to thank are William Armitage (who helped me to compile the OTL library), and Holger

Schnadelbach (who helped me build a wooden frame for the second generation VTS).

I would like to thank my family in Barcelona who have always given me their support despite

me being far away. I would also like to thank several friends who were important to me and

whose friendship often helped me to remain sane before and during my PhD. Thanks to

Sara Cathie, Adam Cox, Pablo Nogueira, Dora Tothfalussy, and Franca Supran.

My time at the University of Nottingham has been an exciting one. I have not just completed

a PhD but I have also learned and experienced lots of things that have enriched me. My

experience as a hall tutor in Sherwood Hall has enabled me to meet lots of people, and

experience university life from a different point of view. Often my dear tutor colleagues

made me feel as if I had a second family in the hall; thanks to you all. While in Nottingham I

have also become a salsa-dancing fan; this has helped a lot to keep my social life always

active during the otherwise rather solitary PhD experience. Thanks to salsa!


Table of Contents

1 INTRODUCTION
    1.1 VISUAL ARTICULATED HAND TRACKING
    1.2 VISUAL HAND TRACKING BASED INTERACTIVE SURFACES
    1.3 CONTRIBUTIONS
    1.4 ROADMAP OF THE THESIS

2 HAND TRACKING AND HCI: A LITERATURE REVIEW
    2.1 HUMAN HAND: ANATOMY, MOTION AND MODELLING
    2.2 VISUAL HAND TRACKING
    2.3 HAND GESTURES
    2.4 VISUAL HAND TRACKING IN HCI
    2.5 INTERACTIVE SURFACES
    2.6 SUMMARY

3 HAND CONTOUR TRACKING USING PARTICLE FILTERS AND DEFORMABLE TEMPLATES
    3.1 DEFORMABLE TEMPLATES
    3.2 MEASUREMENT MODEL
    3.3 THE CONDENSATION ALGORITHM APPLIED TO VISUAL CONTOUR TRACKING
            Resampling
            Prediction
            Measurement
    3.4 ARTICULATED TRACKING
        3.4.1 Partition sampling
        3.4.2 Incomplete particles in a chain of links
    3.5 TREE-STRUCTURED ARTICULATED OBJECTS
    3.6 INCOMPLETE PARTICLES IN TREE-STRUCTURED ARTICULATED OBJECTS
    3.7 PARTICLE INTERPOLATION
        3.7.1 Generating interpolated particles
        3.7.2 Differences from Condensation

4 IMPLEMENTATIONS AND RESULTS
    4.1 ARTICULATED HAND CONTOUR MODEL
    4.2 DYNAMICAL MODEL
    4.3 MEASUREMENT MODEL
        4.3.1 Measurement lines
        4.3.2 Skin colour based measurement
    4.4 RESAMPLING SCHEME
    4.5 PARTICLE-SET IMPLEMENTATION
    4.6 REFINING FINGER LENGTH ESTIMATES
    4.7 SWEEP IMPLEMENTATION
    4.8 PERFORMANCE MEASURES FOR ARTICULATED CONTOUR TRACKERS
        4.8.1 Cost function
        4.8.2 Contour distances
        4.8.3 Signal to Noise Ratio (SNR)
        4.8.4 Distance between model points
        4.8.5 Distance between model parameters
    4.9 TEST VIDEO SEQUENCE
    4.10 RESULTS AND COMPARISONS
    4.11 RELATIONSHIP BETWEEN PERFORMANCE MEASURES
    4.12 TECHNOLOGIES EMPLOYED IN THE IMPLEMENTATIONS
    4.13 CONCLUSIONS

5 A SKIN COLOUR CLASSIFIER FOR HCI
    5.1 PREVIOUS WORK ON SKIN COLOUR DETECTION
    5.2 DEVELOPMENT OF THE LC CLASSIFIER
        5.2.1 RGB histogram classifier
        5.2.2 Normalised RGB histogram classifier
        5.2.3 Projected RGB histogram
    5.3 THE LINEAR CONTAINER (LC) CLASSIFIER
    5.4 PERFORMANCE RESULTS
    5.5 TUNING AT VARIOUS RESOLUTIONS
    5.6 HCI USABILITY FACTORS
    5.7 TARGET IMPORTANCE SELECTION
    5.8 EXAMPLE OF A LC CLASSIFIER INITIALISATION IN HCI
    5.9 EXAMPLE OF DYNAMIC SKIN COLOUR MODELLING DURING TRACKING
    5.10 CONCLUSIONS

6 USING SKIN COLOUR AND EDGE FEATURES
    6.1 EDGE FEATURES VS. SKIN COLOUR FEATURES
    6.2 USING ONLY EDGE FEATURES IN THE MEASUREMENT FUNCTION
    6.3 COMBINING EDGE DETECTION AND SKIN COLOUR DETECTION IN THE MEASUREMENT FUNCTION
    6.4 CONCLUSIONS

7 TRACKING IMPROVEMENTS
    7.1 SWITCHING TEMPLATE FITTING METHODS DURING ARTICULATED TRACKING
        7.1.1 Fitting templates to the links of an articulated object
        7.1.2 Simplified articulated hand tracker
        7.1.3 Results with the simplified articulated hand tracker
        7.1.4 Tracking performance with the sweep implementation
        7.1.5 Conclusions
    7.2 QUASI RANDOM SAMPLING
        7.2.1 Quasi-random sequences
        7.2.2 Application of quasi-random sequences in Condensation
        7.2.3 Results
        7.2.4 Conclusions
    7.3 VARIABLE PROCESS NOISE (VPN)
    7.4 SKIN COLOUR GUIDED SAMPLING (SCGS)
        7.4.1 Skin coloured blob detection and analysis
        7.4.2 Combining low-level and high-level information
        7.4.3 Use of importance and initialisation particles
        7.4.4 Reinitialisation test
        7.4.5 Robustness test
        7.4.6 Conclusions
    7.5 COMBINING TRACKING IMPROVEMENTS

8 VIRTUAL TOUCH SCREEN
    8.1 THE VTS INTERFACE
        8.1.1 Hand tracking
        8.1.2 Operation
        8.1.3 Usability
    8.2 IMPLEMENTATIONS
        8.2.1 Projector based VTS (First Generation)
        8.2.2 Interface initialisation
            Initial tracking position
            Skin colour tone of the user's hand
            Estimation of a kinematic model
            Shape of the hand
        8.2.3 Touch detection
            Kinematic model
            Thresholds
            Moving threshold
            Debouncing
            Detecting the position of a finger click
            Dragging
        8.2.4 Projector based VTS (Second Generation)
        8.2.5 Tracking from the back of the hand
        8.2.6 HMD based VTS (Third Generation)
        8.2.7 Third generation VTS experiments
            VTS operation with a plain background
            VTS operation with a complex background
            VTS operation with resizable interfaces
            VTS based drawing application
    8.3 APPLICATIONS
    8.4 CONCLUSIONS

9 CLOSING DISCUSSION
    9.1 SUMMARY
    9.2 FUTURE WORK

APPENDIX A
APPENDIX B
BIBLIOGRAPHY


List of Figures

Figure 1.1: Articulated hand contour tracking.
Figure 1.2: VTS implementations.
Figure 1.3: Drawing application.
Figure 2.1: Human hand anatomy and degrees of freedom of each joint.
Figure 2.2: DigitEyes.
Figure 2.3: Stenger's hand tracker tracking an out-of-image-plane rotation.
Figure 2.4: MacCormick and Isard's articulated hand contour tracker.
Figure 2.5: Hand Mouse.
Figure 2.6: HandVu.
Figure 2.7: VisualPanel.
Figure 2.8: Visual Touchpad.
Figure 2.9: Steerable interfaces.
Figure 2.10: ARKB.
Figure 2.11: HoloWall.
Figure 2.12: TouchLight.
Figure 2.13: PlayAnywhere.
Figure 2.14: Canesta keyboard.
Figure 2.15: Virtual keyboard based on true-3D optical ranging.
Figure 2.16: DiamondTouch.
Figure 2.17: FingeRing.
Figure 2.18: SCURRY.
Figure 2.19: Senseboard.
Figure 3.1: A B-spline contour fitted to the middle finger of a hand.
Figure 3.2: Measurement lines distributed along a contour.
Figure 3.3: Measurement line normal to a hypothesized contour.
Figure 3.4: Weighted particle set approximation of a probability density.
Figure 3.5: Graphical representation of three particles from a particle set.
Figure 3.6: One time-step in the Condensation algorithm.
Figure 3.7: An intuitive partition sampling example.
Figure 3.8: Articulated object with three links forming a chain.
Figure 3.9: Algorithm for one time-step of partition sampling on the chain of links of Figure 3.8.
Figure 3.10: Particle set diagram showing two fictitious time-steps of partition sampling.
Figure 3.11: Tree-structured articulated object.
Figure 3.12: Algorithm for one time-step of partition sampling on the tree-structured articulated object of Figure 3.11(a).
Figure 3.13: Particle set diagram showing two fictitious time-steps of partition sampling for the example articulated hand.
Figure 3.14: Particle set diagram showing the particle interpolation process.
Figure 3.15: Algorithm for one time-step of partition sampling, and particle interpolation.
Figure 3.16: Graphical representation of the interpolation process using (rule 1).
Figure 4.1: Hand contour model.
Figure 4.2: Measurement lines used in the articulated hand contour.
Figure 4.3: Skin colour image with the measurement lines on top.
Figure 4.4: Score look-up table.


Figure 4.5: Algorithm to calculate the contour's score.
Figure 4.6: One time-step of tracking for the particle-set implementation.
Figure 4.7: Algorithm for one time-step of tracking with the particle-set implementation.
Figure 4.8: Procedure to refine finger length estimations.
Figure 4.9: Angle sweep pattern.
Figure 4.10: Sweep hand tracker implementation diagram for one time-step.
Figure 4.11: Algorithm for one time-step of tracking with the sweep implementation.
Figure 4.12: Test video sequence structure.
Figure 4.13: Performance comparison from frame 30 until 174.
Figure 4.14: Example frames of the particle-set tracker output from frame 30 to frame 174.
Figure 4.15: Example frames of the sweep tracker output from frame 30 to frame 174.
Figure 4.16: Performance comparison from frame 175 until 359.
Figure 4.17: Performance comparison from frame 360 until 890.
Figure 4.18: Example frames of the particle-set tracker output from frame 360 to frame 890.
Figure 4.19: Example frames of the sweep tracker output from frame 360 to frame 890.
Figure 4.20: Relationship between cost function and distance metric.
Figure 5.1: Skin colour RGB histogram.
Figure 5.2: Normalised RGB histogram.
Figure 5.3: Projection from rg to RGB.
Figure 5.4: LC classifier decision planes.
Figure 5.5: Possible decision planes to avoid dark pixels.
Figure 5.6: Initialisation image masks.
Figure 5.7: Tuning heuristics.
Figure 5.8: Ground truth masks.
Figure 5.9: Mediterranean subject test.
Figure 5.10: White Caucasian subject test.
Figure 5.11: Black African subject test.
Figure 5.12: Chinese subject test.
Figure 5.13: NTP when tuning at various resolutions.
Figure 5.14: NTP chart for four percentages of skin in SkinMask.
Figure 5.15: NTP chart for two different SkinMask containing 25% of skin colour.
Figure 5.16: Target importance for various tuning situations.
Figure 5.17: Initialisation sequence.
Figure 5.18: Modified cost function.
Figure 5.19: Dynamic tuning vs. static tuning.
Figure 5.20: Dynamic tuning vs. static tuning.
Figure 6.1: Skin colour vs. edges.
Figure 6.2: Exemplar frame.
Figure 6.3: Normalised histograms of feature positions along a measurement line.
Figure 6.4: Distance metric of the skin edge vs. the image edge based sweep tracker.
Figure 6.5: Situations in which the use of edges is essential for the correct location of the hand.
Figure 6.6: Combination matrix.
Figure 6.7: Performance of the sweep tracker using edges and skin colour in the measurement function.
Figure 7.1: Fitting methods for an articulated object.


Figure 7.2: Potential problem of method 2.
Figure 7.3: Simplified articulated hand contour model with 9 DOF.
Figure 7.4: Flow chart of the simplified hand tracker.
Figure 7.5: Selected frames from the critical zone.
Figure 7.6: Various misalignments for each method of fitting the articulated template.
Figure 7.7: X and Y variance of the hand pivot.
Figure 7.8: Variance of the rotation angle and scale factor.
Figure 7.9: Sweep implementation tracking performance when using the combined template fitting method.
Figure 7.10: Distributions of pseudo-random and quasi-random points.
Figure 7.11: Gaussian transformation of uniform pseudo-random and quasi-random points.
Figure 7.12: Distance metric performance measure when using VPN.
Figure 7.13: Distance metric performance measure when using VPN, separate charts.
Figure 7.14: Skin coloured blobs.
Figure 7.15: Importance samples and initialisation samples.
Figure 7.16: Reinitialisation test selected frames.
Figure 7.17: Distance metric performance measure when using SCGS.
Figure 7.18: Distance metric performance measure when using SCGS, separate charts.
Figure 7.19: Skin coloured blobs mixing.
Figure 7.20: Distance metric for the sweep tracker with combined tracking improvements.
Figure 7.21: Distance metric for the sweep tracker with combined tracking improvements, separate charts.
Figure 8.1: Six possible interface element configurations for the VTS.
Figure 8.2: Proposed VTS operation.
Figure 8.3: Set up of the first generation VTS.
Figure 8.4: Image processing for the first generation VTS.
Figure 8.5: Typing a telephone number on the first generation VTS.
Figure 8.6: Initialisation states of the VTS hand contour tracker.
Figure 8.7: Finding finger creases.
Figure 8.8: Hand undergoing flexion of middle finger.
Figure 8.9: Touch detection using a moving threshold.
Figure 8.10: Set up of the second generation VTS.
Figure 8.11: Keypad usage. Key press sequence.
Figure 8.12: Slider bar usage. Dragging sequence.
Figure 8.13: New hand contour template.
Figure 8.14: New initialisation image masks.
Figure 8.15: Set up of the third generation VTS.
Figure 8.16: Example virtual interfaces.
Figure 8.17: Illusion of depth perception.
Figure 8.18: Example frames of the actions occurring during the first experiment (third generation VTS).
Figure 8.19: Example frames of the actions occurring during the second experiment (third generation VTS).
Figure 8.20: Example frames of the actions occurring during the third experiment (third generation VTS).

xi

Figure 8.21: Drawing application toolbar (third generation VTS). ................................208 Figure 8.22: Automatic hand tracking reinitialisation feature........................................209 Figure 8.23: Example frames of the actions occurring during the drawing application

experiment (third generation VTS)..............................................................210 Figure 8.24: Example frames showing drawing with multiple fingers at the same time

(multiple points of input)...............................................................................211

List of Tables

Table 4.1: Parameter values for the hand tracker dynamical model.
Table 4.2: Finger angle constraints.
Table 5.1: LC classifier priori values.
Table 5.2: Execution time results.
Table 7.1: Contour distance performance metric comparison using three sampling methods.

1 Introduction

As computer systems become more and more embedded into our environment, the ability to

interact with them without the need for special equipment is very attractive. Vision-based

Human Computer Interaction (HCI) has the potential of making this possible in a form that is

both easy and natural for people.

Vision-based HCI core technologies are generally based on visual tracking and visual

template recognition algorithms. These algorithms can be designed to track and recognize

faces (for identity recognition or verification), recognize facial expressions (for mood or

attention detection), detect position and pose of human arms, legs, and body (for gesture

driven interfaces), and track the position and configuration of hands and fingers (for 3D hand

pointing, 3D mouse control, sign language recognition, etc.). However, despite the wide

range of potential applications, the creation of these algorithms still presents great technical

challenges, and often, some form of constraint needs to be put in place for these algorithms

to operate correctly. One strategy to overcome the technical problems is to make these

algorithms more application specific. Following this strategy, this thesis develops visual

articulated hand tracking algorithms aimed at interactive surfaces.

Interactive surfaces are surfaces that display information and allow users to interact with this

information by touching the surface either directly with their hands, using a stylus, or using

some form of wearable hardware. The concept includes traditional touch screens, but it goes

beyond them. An interactive surface can be presented on a table or a desk, on a wall, on a

book or on a piece of paper, on a shop display, or even on a virtual surface floating in the air.

The technologies used for both displaying information on the surface, and making the

surface sensitive vary considerably. The possibility of using visual hand tracking as the

sensing technology for interactive surfaces is very attractive with respect to other sensing

technologies because it is flexible (it could be deployed in various configurations), cheap (it

could operate with simple USB cameras), and it avoids the need for cumbersome hardware

(no special gloves or hardware need to be attached to the user's hands). A small number of

attempts at developing vision-based interactive surfaces have been made. These are typically

based on detecting reflected infrared light near the surface, tracking single fingers, or

tracking adorned hands. There have, however, been no previous attempts at using visual

articulated hand tracking as the sensing technology for interactive surfaces. This is partly

because full 3D visual articulated hand tracking is difficult to achieve.

Visual articulated hand tracking is currently an active and challenging area of research in the

computer vision community. Visual articulated hand tracking has a great potential in HCI,

especially in Virtual Reality (VR) and Augmented Reality (AR) environments. However, the

high number of degrees of freedom (DOF) of the hand models, the self-occlusion of the fingers, and the

kinematic singularities in the fingers' articulated motion make visual articulated hand

tracking very difficult. The challenge is even greater when a single camera, unadorned

hands, unconstrained background, and unconstrained illumination levels are required.

Fortunately, the 2D nature of interactive surfaces means that full 3D visual

articulated hand tracking is not required. This makes it possible to develop a specific but robust visual

articulated hand tracker that is constrained to a particular viewpoint – where the hand is

approximately parallel to the interactive surface.

This thesis develops a visual articulated hand tracking system which enables the creation of a

novel vision-based interactive surface, referred to as the Virtual Touch Screen (VTS). The

tracking algorithm is based on the contour tracking framework proposed by Blake and Isard

(1998), with a considerable extension to improve the efficiency of particle propagation

between time-steps in tracking tree-structured articulated objects, such as human hands. This

allows the creation of a novel 14 DOF articulated hand contour tracker, which is capable of

tracking in real-time the articulated contour of a hand with the palm approximately parallel

to the camera's image plane. The tracker uses a robust skin colour classifier to track the hand

contour, enabling the tracking of unadorned hands against cluttered backgrounds. The

tracker is specifically designed to track the finger motions of a hand operating a VTS. The

contents of the VTS can be displayed using either a projector or a see-through Head Mounted

Display (HMD). The VTS is made touch-sensitive by using the visual hand tracking system

developed in this thesis to determine when and where a user's finger touches the VTS. The

VTS interface can be used as a multi-point touch-sensitive surface, with the added ability to

also detect hand gestures hovering above the VTS

surface. This enables a large number of promising applications including alternatives to

touch sensitive panels, alternative interfaces for mobile computing devices, contactless

interactive surfaces for museums, information points, shop displays, and video games, and

sterile interfaces for use in hospitals and clean rooms.

The major contribution of this thesis is the development of a visual articulated hand tracking

system that could enable the creation of a vision-based interactive surface. The creation of

the VTS interface itself and related experiments constituted a comparatively small part of the

thesis. In other words, this thesis is focused on computer vision rather than HCI and

therefore the proposed VTS interface has not been tested with a representative group of users

in order to evaluate its usability in various situations and configurations. This is left as a part

of the future work of this thesis.

1.1 Visual articulated hand tracking


This thesis develops a visual articulated hand tracking system which enables the creation of a

novel vision-based interactive surface, referred to as the Virtual Touch Screen (VTS). The

intended use of the tracking system sets a number of demanding requirements on it. The

tracking system has to be able to perform robust articulated tracking of an unadorned hand,

using a single camera, against arbitrary backgrounds, and under a wide range of lighting

conditions. The tracking of the fingers has to be accurate enough to enable detection of

click events on the interactive surface. Finally, the visual articulated hand tracking has to

work in real-time. A visual articulated hand tracker of this description presents great

technical difficulties and has not yet been achieved in the computer vision community.

However, the 2D nature of interactive surfaces means that full 3D visual

articulated hand tracking is not required. Visual articulated hand tracking constrained to a particular

viewpoint greatly reduces the technical difficulties, and can be suitable for use in

interactive surfaces. Following this strategy, this thesis develops a 2D hand tracking system

that can both satisfy the above-mentioned requirements and track a hand in an orientation

approximately parallel to the camera's image plane.

The 2D hand tracking in this thesis is based on contours. A contour is a curve that defines the

2D boundary of an object as it appears in an image. In this thesis, the contour is that of a

hand approximately parallel to the camera's image plane. Hand contour tracking involves

matching a deformable hand template to the 2D contour of this hand as it moves within the

camera's field of view. In order to track the articulations of the fingers, the deformable hand

template has to match the changing contour of the fingers too. Existing contour tracking

algorithms based on particle filters (Blake and Isard, 1998) and partition sampling

(MacCormick and Isard, 2000) have the potential of tracking articulated objects in real-time

against cluttered backgrounds. However, they do not satisfy the demanding hand tracking

requirements of the VTS interface. This thesis further develops and improves these tracking

algorithms aiming at the creation of a suitable articulated hand contour tracking system for

the VTS. The improvements to these tracking algorithms are the following:

• A novel technique, referred to as particle interpolation, which makes it possible to

improve the efficiency of particle propagation between time-steps in tracking tree-

structured articulated objects using particle filters and partition sampling.

• A novel measurement function based on skin colour that is both faster and more reliable

than existing edge based measurement functions provided no other skin colour objects

appear in the background.

• A novel skin colour based importance sampling implementation, referred to as Skin

Colour Guided Sampling (SCGS), that allows the position, scale, and angle of the hand

contour to be estimated from low-level information, for users wearing either long or

short sleeves.

• A novel contour fitting method for articulated contour trackers, which improves tracking

agility and reduces jitter on the tracking output.

• A novel method for particle filter based contour trackers, referred to as Variable Process

Noise (VPN), which varies the size of the contour's search region in order to cope with

brisk target movements.
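
As a rough illustration of the particle filter framework that these techniques build on, the sketch below runs repeated predict, weight, and resample cycles for a one-dimensional state. It is a generic bootstrap filter and not the tracker developed in this thesis: the Gaussian likelihood, the function names `particle_filter_step` and `estimate`, and the parameters `process_noise` and `obs_noise` are all assumptions made for illustration.

```python
import math
import random


def particle_filter_step(particles, observation, process_noise, obs_noise, rng):
    """One predict-weight-resample cycle of a generic bootstrap particle filter.

    particles: list of floats (a hypothetical 1D state, e.g. a contour offset).
    process_noise: std. dev. of the assumed random-walk dynamics.
    obs_noise: std. dev. of the assumed Gaussian measurement likelihood.
    """
    # Predict: diffuse every particle with the process noise.
    predicted = [p + rng.gauss(0.0, process_noise) for p in particles]

    # Weight: score each particle against the observation.
    weights = [
        math.exp(-0.5 * ((observation - p) / obs_noise) ** 2) for p in predicted
    ]
    if sum(weights) == 0.0:
        # Every particle is far from the observation; fall back to uniform weights.
        weights = [1.0] * len(predicted)

    # Resample: draw a new particle set in proportion to the weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def estimate(particles):
    """Mean of the particle set, used here as the state estimate."""
    return sum(particles) / len(particles)


rng = random.Random(0)
particles = [rng.uniform(-10.0, 10.0) for _ in range(500)]
for _ in range(20):
    particles = particle_filter_step(
        particles, observation=3.0, process_noise=0.5, obs_noise=1.0, rng=rng
    )
```

In this sketch `process_noise` sets how far particles spread between time-steps, so a scheme that varies the size of the search region would change that value from one call to the next.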

These techniques are used in the development of a novel 14 DOF articulated hand contour

tracking system that can track the contour of a hand approximately parallel to the camera's

image plane, from either its front (palm) or back views. Figure 1.1 shows two snapshots of

the tracking output (blue hand contour) of this articulated hand contour tracking system.

Figure 1.1(a) shows a front view (palm). Figure 1.1(b) shows a back view.

Figure 1.1: Articulated hand contour tracking. (a) Front view. (b) Back view.

1.2 Visual hand tracking based interactive surfaces

A novel form of HCI, referred to as interactive surfaces, has emerged in recent years. An

interactive surface refers to a surface that can display information and can allow users to

interact with this information by touching the surface either directly with their hands, using a

stylus, or using some form of wearable hardware. The concept includes the traditional touch

screens but it goes beyond them. An interactive surface can be presented on a table or a desk,

on a wall, on a book or on a piece of paper, on a shop display, or even on a virtual surface

floating in the air. The technologies used for both displaying information on the surface and

making the surface interactive vary considerably. The use of computer vision in interactive

surfaces is attractive with respect to other sensing technologies because it is flexible and

avoids the need for expensive and cumbersome hardware.


This thesis proposes a novel vision-based interactive surface, referred to as the Virtual Touch Screen (VTS). In a

VTS the information is displayed either by a projector, which projects the information

onto a selected surface, or by a see-through Head Mounted Display (HMD), which

displays the contents so that they appear to users to be floating in their field of

view. The VTS is made touch-sensitive by visually tracking the user's hand and interpreting

its position and configuration. This interpretation results in the detection of click and

drag actions on the VTS. The left column of Figure 1.2 shows a projector based VTS, Figure 1.2(a), and two

examples of its operation: operating a keypad, Figure 1.2(b), and operating a slider bar,

Figure 1.2(c). The right column of Figure 1.2 shows an HMD based VTS, Figure 1.2(d), and

two examples of its operation: operating a keypad against a cluttered background, Figure

1.2(e), and resizing interface elements with a thumb gesture, Figure 1.2(f).

The VTS interface can be used as a multi-point touch-sensitive surface, with the added

ability to also detect hand gestures hovering above the VTS surface. This enables a large

number of applications for the VTS interface. A VTS could constitute an alternative to touch

sensitive panels (especially attractive for large panels). A VTS could become a flexible

interface for PDAs, or other mobile computing devices. A VTS could be used in museums,

information points, and shop displays in order to show interactive information relevant to

users. A VTS could constitute a cheap and flexible alternative to handheld keypads, controls

or pointing devices for existing HMD based AR environments. The entertainment industry

could also benefit from VTSs for video games. Finally, as the visual hand tracking system

used in the VTS does not require a physical surface to operate, the VTS could have

applications in scenarios where physical contact is not desired, for example, sterile interfaces

for use in hospitals, or clean rooms.

This thesis proposes the VTS interface, its possible configurations, possible operation, and

potential applications regardless of the particular visual hand tracking technology in use.

Then, these ideas are implemented with the help of the visual articulated hand

tracking system developed in this thesis. This results in three VTS generations, whose capabilities are tested with a

number of experiments. The final experiment is a VTS based drawing application. Figure 1.3

shows two snapshots of the drawing application in action.

Figure 1.2: VTS implementations. Left column shows a projector based VTS, (a), and two examples of

interface operation: operating a keypad, (b), and operating a slider bar, (c). Right column shows an

HMD based VTS, (d), and two examples of interface operation: operating a keypad against a

cluttered background, (e), and resizing interface elements with a thumb gesture, (f).

Figure 1.3: Drawing application. (a) Drawing demo. (b) Multiple points of input demo.

1.3 Contributions

The main contributions of this thesis are:

• Critical evaluation of existing visual tracking algorithms, with special focus on the use of

particle filters in the implementation of articulated contour trackers, and analysis of the

efficiency of particle propagation between time-steps in tracking tree-structured

articulated objects.

• Development of a novel technique, referred to as particle interpolation, which makes it

possible to improve the efficiency of particle propagation between time-steps in tracking

tree-structured articulated objects using particle filters and partition sampling.

• Development of a novel skin colour classifier, referred to as the Linear Container (LC)

classifier, and testing of the classifier under various conditions for use in hand tracking

for HCI. The classifier is robust to illumination (brightness) changes, requires less

storage, and is significantly faster than existing classifiers.

• Implementation of skin colour based importance sampling for the hand contour trackers

presented in this thesis. This, referred to as Skin Colour Guided Sampling (SCGS),

allows the estimate of position, scale and angle of the hand contour from low-level

information.

• Analysis of contour fitting methods for articulated contour trackers and development of

an improved contour fitting method that improves tracking agility and reduces jitter on

the tracking output.

• Implementation of variable process noise in a particle filter, in order to improve its

tracking agility. The concept of variable process noise has been used before in mono-modal

trackers, such as Kalman filters, but it has never been used before in a particle

filter scenario.

• Development of a novel 14 DOF articulated hand contour tracker. The tracker uses only

skin colour to track the hand contour, which enables real-time tracking of unadorned

hands against cluttered backgrounds.

• Implementation of a novel vision-based HCI interface called the Virtual Touch Screen

(VTS), and demonstration of its capabilities through a number of experiments.

1.4 Roadmap of the thesis

This chapter has given the background information needed for the remainder of the thesis.

The rest of the thesis is organized as follows:

Chapter 2 gives an overall view of the current visual hand tracking technologies and their

applications in HCI. It also reviews several examples of interactive surfaces, both using

vision and non-vision based technologies.

Chapter 3 reviews and expands the contour tracking framework developed by Blake and

Isard (1998). Within this framework, the use of partition sampling (MacCormick and Isard,

2000) for tracking tree-structured articulated objects is analysed, and efficiency problems in

the propagation of particles between time-steps are identified. Finally, a novel technique,

referred to as particle interpolation, that overcomes these efficiency problems is proposed.

Chapter 4 develops two novel 14 DOF articulated hand contour trackers using the particle

interpolation technique presented in Chapter 3. One of the trackers is entirely based on

stochastic processes, Section 4.5, while the other one uses a combination of stochastic and

deterministic processes, Section 4.7. Both trackers are designed to be used in HCI and

therefore are tested with tracking sequences that simulate HCI. The results of the tests are

analysed and the performance of the trackers is assessed in Section 4.10. The proposed hand

trackers use skin colour as the only cue for tracking. Thus, the skin colour detection needs to

be very robust.

Chapter 5 presents a novel skin colour classifier, referred to as the Linear Container (LC)

classifier, which is robust to illumination (brightness) changes, requires little storage, is

significantly faster than existing classifiers, and can be tuned to a particular skin tone. The

evaluation speed of the LC classifier guarantees that the hand trackers proposed in Chapter 4

can operate in real-time. The LC classifier is tested and compared with existing classifiers.

Finally, a method for rapidly adapting the LC classifier parameters to illumination changes

during tracking is presented in Section 5.9; this method enables the implementation of

dynamic skin colour modelling.

Chapter 6 analyses and compares the use of skin colour information with the use of edge

information in contour trackers. Typically, contour trackers use edge information. However,

the proposed articulated hand contour trackers use skin colour information only. The chapter

argues that the use of skin colour information alone in contour tracking is more attractive

than the use of edge information alone, because it can be faster to evaluate and it is more

accurate provided no other skin colour objects appear in the background.

Chapter 7 describes a number of techniques that improve the tracking performance of

articulated hand contour trackers:

• The order in which the segments of an articulated template are fitted to the corresponding

segments of an articulated object can affect the tracking agility and the level of jitter in

the tracking output. Section 7.1 analyses this phenomenon and proposes a fitting method

that improves tracking agility and reduces jitter levels in the tracking output.

• The use of stochastic processes in the hand tracking has the inconvenience that tracking

is not repeatable. Section 7.2 studies the use of quasi-random sampling in order to

improve the repeatability of the tracking.

• When a target object moves briskly the tracker may lose the location of its contour. This

situation can be prevented by controlling the size of the region in which this contour is

searched. Section 7.3 describes and tests a method, referred to as Variable Process Noise

(VPN), which varies the size of the contour's search region in order to cope with brisk

target movements.

• When the tracked hand exits the camera's field of view, tracking is lost. A method to

automatically regain tracking on the user's hand once it re-enters the camera's field

of view is presented in Section 7.4. This method, referred to as Skin Colour Guided

Sampling (SCGS), allows the estimate of position, scale and angle of the hand contour

from low-level information.

• Section 7.5 shows that the separate techniques described in Chapter 7 can be combined

in a single articulated hand tracker resulting in improved tracking performance.

Chapter 8 presents a novel vision-based interactive surface, referred to as Virtual Touch

Screen (VTS). The chapter describes the VTS interface, its possible configurations, its

proposed operation, and a discussion of potential VTS applications. Then, using the visual

articulated hand tracking system developed in previous chapters, a number of experiments

with various VTS implementations are presented. Chapter 8 finishes with the presentation of

a VTS based drawing application which illustrates the use of VTS interfaces to complete a

task.

Chapter 9 presents the conclusions of the thesis and future work directions.

The manuscript also contains two appendices. Appendix A shows the calculation of the

reverse kinematics of a chain of links. Appendix B contains a number of video sequences

that illustrate the results of various experiments throughout the thesis. The appendix is

available both on the enclosed CD and on the supporting webpage at:

http://www.cs.nott.ac.uk/~mtb/thesis

2 Hand tracking and HCI: a literature review

The most important use of visual hand tracking is its application for HCI. Visual hand

tracking, and in particular visual articulated hand tracking, can enable new ways of

interaction with computers, which can be more natural to people, less intrusive, and more

flexible than other ways of interaction based on hardware devices. However, the

implementation of a full 3D visual articulated hand tracking is a difficult and challenging

problem. The potential benefits of visual articulated hand tracking in HCI, and the difficulty

of the problem, have attracted a great deal of research on this topic during the last decade.

This chapter places this thesis in a wider context with regard to visual articulated hand

tracking and its applications for HCI. The chapter starts by reviewing human hand anatomy,

its motion, and common ways of modelling it. Then it places the visual articulated hand

tracking developed in this thesis into context among the existing visual hand tracking

techniques. Finally, it places the proposed Virtual Touch Screen (VTS) interface into context

inside an emerging field in HCI. This field is referred to as interactive surfaces.

2.1 Human hand: anatomy, motion and modelling

Before the current hand tracking technologies can be addressed in the next section, it is

important to review some basic facts about human hand anatomy, motion and modelling.

The human hand skeleton is composed of 27 bones. These bones can be divided into three

groups:

• Eight carpals

• Five metacarpals

• Phalanges (finger bones)

The carpals are found in the wrist, the metacarpals in the palm, and the phalanges in the

fingers. Joints between these bones have different numbers of degrees of freedom

(DOF). Figure 2.1 shows the human hand anatomy and degrees of freedom of each joint. For

the little, ring, middle, and index fingers, the Distal Interphalangeal (DIP) and

Proximal Interphalangeal (PIP) joints each have a single DOF. The Metacarpophalangeal (MCP)

joint has 2 DOF. The joints between carpals and metacarpals have 1 DOF more. The thumb

is a special case, having 1 DOF for the Interphalangeal (IP) joint, 1 DOF for the

Metacarpophalangeal (MP) joint, and 3 DOF for the Trapeziometacarpal (TM) joint.

The motion of the phalanges is described using a specific terminology:

• Flexion, refers to the movement of the fingers toward the palm.

• Extension, refers to the movement of the fingers away from the palm.

• Abduction, refers to the movement of the fingers away from the plane that divides the

hand between the middle and ring fingers.

• Adduction, refers to the movement of the fingers towards this plane.

Flexion and extension are exhibited by all the phalanges. Abduction and adduction are

exhibited only at the Metacarpophalangeal (MCP) joints.

A common approach to represent the anatomical structure of the human hand is by means of

a 3D kinematic hand model. These kinematic models are based on a simplified skeleton of

the human hand and they can represent the state of the hand over time (updating them as

hand motion occurs). They typically comprise the lengths of the finger segments, the joint

angles for each of the articulations, and constraints in the motion of the finger segments.


Figure 2.1: Human hand anatomy and degrees of freedom of each joint. (Figure reproduced from Sturman

(1992).)

An anatomically correct kinematic hand model has 26 DOF (and 6 DOF more if the 3D

position and orientation is considered). However, the Metacarpocarpal joints (situated inside

the hand palm) are generally disregarded. The hand can then be modelled with 2 DOF in the

MCP and TM joints, and 1 DOF for all the other joints, therefore simplifying the kinematic

model to 21 DOF.
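
The 21 DOF figure can be checked with a short tally. The sketch assumes that the thumb's metacarpophalangeal joint is counted among the 2 DOF MCP joints, which is one way of arriving at the stated total; the dictionary layout and all names are illustrative only.

```python
# DOF per joint in the simplified kinematic model described above:
# 2 DOF for MCP and TM joints, 1 DOF for every other joint.
# Assumption: the thumb's metacarpophalangeal joint counts as an MCP joint.
FINGER_JOINTS = {"MCP": 2, "PIP": 1, "DIP": 1}
THUMB_JOINTS = {"TM": 2, "MCP": 2, "IP": 1}

FINGERS = ["index", "middle", "ring", "little"]


def simplified_hand_dof():
    """Total joint DOF of the simplified hand model (global pose excluded)."""
    finger_dof = len(FINGERS) * sum(FINGER_JOINTS.values())
    thumb_dof = sum(THUMB_JOINTS.values())
    return finger_dof + thumb_dof


total = simplified_hand_dof()   # 4 fingers * 4 DOF + 5 thumb DOF
total_with_pose = total + 6     # plus 3D position and orientation
```

Under these assumptions the tally gives 21 joint DOF, or 27 once the 6 DOF of 3D position and orientation are included.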

The dimensions of the hand state can be considerably reduced by registering the range of

valid movements (for example using a data glove to gather empirical data of a hand moving

through all the possible configurations) and then analysing the data using Principal

Component Analysis (PCA) techniques. Using this technique, Wu, et al. (2001) managed to

reduce a hand kinematic model from 20 DOF to 7 DOF.
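
The idea behind this kind of reduction can be sketched on synthetic data. The example below is not the procedure of Wu, et al. (2001): the joint-angle recordings are generated artificially, only the first principal component is extracted, and power iteration stands in for a full PCA. All names and values are assumptions made for illustration.

```python
import math


def covariance(samples):
    """Sample covariance matrix of a list of equal-length vectors."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    centred = [[s[j] - mean[j] for j in range(d)] for s in samples]
    return [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_principal_component(cov, iterations=200):
    """Dominant eigenvector and eigenvalue of cov, found by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return v, eigenvalue


# Synthetic "data glove" recordings: three flexion angles (degrees) of one finger
# that mostly move together as the finger curls, plus a small independent wobble.
samples = []
for k in range(50):
    curl = 90.0 * k / 49.0
    wobble = 2.0 * math.sin(k)
    samples.append([curl, 0.8 * curl + wobble, 0.5 * curl - wobble])

cov = covariance(samples)
component, variance = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(3))
explained = variance / total_variance  # fraction of variance on the first axis
```

When joint angles are strongly correlated, as in this synthetic data, a single component captures nearly all of the variance, which is why a lower-dimensional state can stand in for the full set of joint angles.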

The use of hand motion constraints can also reduce the state space considerably; this

generally does not reduce the DOF but reduces the range of variation in the state space.

There are two sets of constraints that can be placed on the joint angle movements:

• Static constraints. The range of valid joint angles for a given joint.


• Dynamic constraints. The dependencies between joints due to sharing the same tendons.

There is a third category of more subtle constraints proposed by Lin, et al. (2000). These

constraints have nothing to do with limitations in the hand anatomy, but rather are a result of

common and natural movements. These constraints are mostly used in simulation of natural

movement of a hand.
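
A minimal sketch of how the static and dynamic constraints might be applied to a single finger follows. The angle limits and the coupling of the DIP angle to two thirds of the PIP angle are illustrative assumptions (the coupling is a commonly used approximation), not values taken from this thesis.

```python
# Hypothetical static limits in degrees for one finger; illustrative values only.
STATIC_LIMITS = {
    "MCP_flexion": (-15.0, 90.0),
    "MCP_abduction": (-15.0, 15.0),
    "PIP_flexion": (0.0, 110.0),
}

# Assumed dynamic constraint: the DIP joint follows the PIP joint.
DIP_TO_PIP_RATIO = 2.0 / 3.0


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def constrain_finger(angles):
    """Apply the static limits, then derive the DIP angle from the PIP angle."""
    constrained = {
        name: clamp(angles[name], *STATIC_LIMITS[name]) for name in STATIC_LIMITS
    }
    # Dynamic constraint: tie the DIP angle to the (already clamped) PIP angle.
    constrained["DIP_flexion"] = DIP_TO_PIP_RATIO * constrained["PIP_flexion"]
    return constrained


pose = constrain_finger(
    {"MCP_flexion": 120.0, "MCP_abduction": -5.0, "PIP_flexion": 60.0}
)
```

Here the out-of-range MCP flexion of 120 degrees is clamped to the assumed limit of 90 degrees, while the in-range angles pass through unchanged.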

2.2 Visual hand tracking

During the last decade, visual hand tracking has attracted a great deal of research in the

computer vision community. In 1995, Rehg's seminal work in articulated hand tracking

established what is now a classical approach to model based tracking of articulated objects

through video sequences. The classical approach to articulated hand tracking uses a

kinematic hand model, which also represents the volume of each of its segments and the palm

(for example using truncated cylinders, conic sections, or truncated quadrics). The initial

configuration of the model is known, and a fast frame rate is assumed (i.e. small differences

in the hand's configuration between two consecutive frames). The projection of the hand

model onto an adequate plane is then compared to the hand appearing in each frame of a

video sequence. The discrepancies between the hand model projection and the hand features

in the frame are then minimized. The minimization process makes small changes in the state

of the hand model, until a match between the hand model and the hand features is reached.

This procedure allows the state of the hand to be tracked throughout the video sequence. The

matching process has been formulated as a constrained nonlinear optimisation problem.

However, the high DOF of the kinematic hand models, self-occlusions of the fingers, and

kinematic singularities in the articulated motion, often cause the optimisation process to become

trapped in local minima. These problems make articulated hand tracking very difficult.
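
The fitting loop described above can be sketched on a toy problem: a planar finger with two joints whose projected joint and fingertip positions are matched to observed features. This is a simplified illustration and not Rehg's formulation: the segment lengths, the choice of features, and the simple local search used in place of a constrained nonlinear optimiser are all assumptions.

```python
import math

SEGMENT_LENGTHS = (4.0, 3.0)  # hypothetical lengths of two finger segments


def project(state):
    """Forward kinematics of a planar two-joint finger: joint and tip positions."""
    a1, a2 = state
    l1, l2 = SEGMENT_LENGTHS
    joint = (l1 * math.cos(a1), l1 * math.sin(a1))
    tip = (joint[0] + l2 * math.cos(a1 + a2), joint[1] + l2 * math.sin(a1 + a2))
    return [joint, tip]


def discrepancy(state, features):
    """Sum of squared distances between projected model points and features."""
    return sum(
        (px - fx) ** 2 + (py - fy) ** 2
        for (px, py), (fx, fy) in zip(project(state), features)
    )


def fit(state, features, step=0.1, min_step=1e-6):
    """Local search: apply small changes to the state while they reduce the error."""
    state = list(state)
    best = discrepancy(state, features)
    while step > min_step:
        improved = False
        for i in range(len(state)):
            for delta in (step, -step):
                candidate = list(state)
                candidate[i] += delta
                error = discrepancy(candidate, features)
                if error < best:
                    state, best, improved = candidate, error, True
        if not improved:
            step /= 2.0  # refine the search once no small change helps
    return state, best


# Features "observed" in the current frame, generated here from a known state.
true_state = (0.50, 0.30)
features = project(true_state)
# Start from the previous frame's state, assumed to be close to the current one.
fitted, error = fit((0.40, 0.20), features)
```

Because the search only accepts small improving changes, it relies on the starting state being close to the solution, which mirrors the fast frame rate assumption, and on harder problems it can stall in a local minimum.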

Rehg (1995) built a system called DigitEyes where a hand could be tracked against a black

background, in real-time (up to 10Hz), either using two cameras and a 27 DOF hand model,

or a single camera and a simplified 6 DOF hand model (which was applied in a 3D mouse

user-interface trial). The kinematic models that Rehg used provided geometric constraints on

the position of the hand features. In addition, self-occlusions of the fingers were handled

using layered templates, whose order was inferred from the current state of the kinematic

model. However, finger occlusions were only tracked off-line. One of the main weaknesses


of Rehg's system is the adaptation of the hand model to a new user. This process takes about

4 hours of interactive work where measurements of the length and the breadth of all the links

are made. Figure 2.2 shows the experimental test bed for the DigitEyes system and its 3D

hand model. In a similar system, Kuch and Huang (1995) simplified the model adaptation to

a new user with only three snapshots of the user's hand in three predefined configurations,

plus an interactive selection. A few years later, Rehg and Morris (1997) developed a method

to capture 3D motion using a 2D model. They used this method to register the 3D motion of

a person dancing. Afterwards, the system would recover the 3D motion from the 2D

registration using a 3D kinematic model of the person’s body.

Figure 2.2: DigitEyes. Left: experimental test bed for the DigitEyes system. Right top: hand image with

kinematic model overlaid. Right bottom: 3D view of the hand model. (Figure reproduced from

(Rehg, 1995).)

Following the same model based hand tracking approach, Stenger et al. (2001) constructed

an anatomically accurate hand model using truncated quadrics. Their method uses elegant

tools from projective geometry in order to generate 2D profiles of the model and handle self-

occlusions. The pose of the hand is estimated with an Unscented Kalman filter (UKF) using

one or more cameras. The work of Stenger et al. (2006) evolved into the discretisation of their

hand model state space, and subsequent organization of the discretised state space into a

hierarchy of hand templates. The hierarchy of templates contains all the possible (or allowed)

hand configurations. The approach is similar to hierarchical object detection. Areas of the

state space, which are unlikely to contain the current hand configuration, are rejected early

on at the top of the template hierarchy. A search down the template hierarchy refines further

the fitting of the hand model to the hand in the image. The search in the template hierarchy is


aided by a dynamic model, which sets only a weak prior assumption about the motion

continuity. The method produces good results, and it is capable of handling out-of-image-

plane rotations, fast motion, and automatic recovery of tracking. However, the system has

large memory requirements, and it does not work yet in real-time (it takes a few seconds to

process each frame of a tracking sequence). Figure 2.3 shows Stenger's hand tracker tracking

an out-of-image-plane hand rotation.

Figure 2.3: Stenger's hand tracker tracking an out-of-image-plane rotation. (Figure reproduced from

Stenger et al. (2006).)

A different approach to the model based hand tracking involves tracking the 2D contour of

the hand (as it appears in an image). Heap and Samaria (1995) used this approach and

introduced "smart snakes" in order to track and recognize hand gestures. Some time later,

Blake and Isard (1998) established a framework for tracking deformable 2D contours. One of

the most salient features of their work was the introduction of the Condensation algorithm

for real-time tracking contours against cluttered backgrounds. They demonstrated the

robustness of their tracking framework with a number of experiments (Isard, 1998; Isard and

Blake, 1998a; Isard and Blake, 1998b), such as: tracking the fast motion of a leaf on a

cluttered background, tracking people's profiles, tracking cars, tracking facial expressions,

and even tracking of articulated objects. However, tracking of articulated objects using this

framework alone was inefficient (as the number of particles required to track an articulated

object grows exponentially with the DOF of the object). MacCormick and Blake (1999)

introduced a new technique, called partition sampling, which makes it possible to avoid the

high cost of particle filters when tracking more than one object. Later, this technique was

used by MacCormick and Isard (2000) to implement a vision based articulated hand tracker.


Their hand tracker could track position, rotation and scale of the user's hand while in a

pointing configuration. In addition, the thumb had 2 DOF and the index finger had 1 DOF.

Figure 2.4 shows some snapshots of MacCormick and Isard's articulated hand contour

tracker. Partition sampling makes it possible to deal with large configuration spaces,

provided that certain conditions such as the ones found in articulated object tracking are met.

The articulated hand contour tracking developed in this thesis is based on partition sampling,

and Blake and Isard's framework. This thesis contributes to these techniques by making

possible efficient tracking of tree-structured articulated objects while using the hierarchical

structure of the object in the matching process.

Figure 2.4: MacCormick and Isard's articulated hand contour tracker. (Figure reproduced from

MacCormick and Isard (2000).)

A different approach to realise articulated hand tracking is that of Nolker and Ritter (1997).

They find the fingertips of a hand in a grey-scale image by means of a hierarchical neural

network. As the hand movement is highly constrained, the fingertip positions are enough to

roughly infer the 3D state of the hand. This allows them to update a 3D hand model from

only the fingertip positions (Nolker and Ritter, 1999). Recent research by Stefanov (2005)

uses a rather different approach to hand tracking. He combines Hough transform features

(circles) from the input image with behaviour knowledge (from structured interaction) in

order to guide and achieve robust hand tracking.


Skin colour is an important source of information in hand tracking. Tracking methods such

as Camshift (Bradski, 1998) rely entirely on skin colour. As colour is a low level feature,

skin colour based trackers are generally fast, and allow real-time operation. However, they

do not generally allow articulated tracking (the articulated hand tracking presented in this

thesis is an exception as it allows fast articulated hand tracking and is entirely based on skin

colour). Other image features are typically combined with skin colour in order to improve

hand tracking. For example, MacCormick and Blake (1999) use skin colour and edge

information; and Kolsch and Turk (2005) use a technique called flocks of features which

uses skin colour and KLT features to track a hand through rapid deformations.

2.3 Hand gestures

Hand gestures are a natural and powerful way of communicating actions or states to a

computer system. If those hand gestures are combined with pose and position hand tracking,

many parameters of an application can be controlled at the same time. Pavlovic et al. (1997)

define a gesture as a trajectory in parameter space. A gesture can be divided into three

stages: preparation, peak or stroke, and retraction. Quek (1995) suggested a taxonomy of

hand/arm gestures for HCI that divides gestures into Manipulative and Communicative

gestures. Communicative hand gestures inherently communicate an idea, which can be used

as a command to a system. Examples of communicative hand gestures are: the O.K. symbol,

thumbs up, waving the index finger (indicating a negation), and stop (showing the palm and

fingers extended). Manipulative hand gestures are the hand movements resulting from acting

on objects in the environment (object movement, rotation, etc). Examples of manipulative

hand gestures are: grasping a tool, drawing with a pen, or typing on a keyboard.

Computer systems that can recognize hand gestures from visual data typically follow one of

two approaches. The first approach is the use of 3D hand models and tracking (Davis and

Shah, 1994; Heap and Hogg, 1996; Kuch and Huang, 1995; Lee and Kunii, 1995; Rehg and

Kanade, 1995; Shimada et al., 1998; Wu and Huang, 1999). In this approach the

configuration of the hand model is estimated over time. The second approach is the use of

appearance-based models (Cui and Weng, 1996; Rosales and Sclaroff, 2000; Triesch and

Malsburg, 1996; Wu and Huang, 2000). These models aim to characterize the mapping from

the image feature space to the possible hand configuration space directly from a set of

training data. This approach often involves learning techniques. In either approach (3D hand


models or appearance-based models) Hidden Markov Models (HMMs) and their variations are

the most important techniques employed in modelling, learning, and recognition of hand

gestures (Yang et al., 1994; Stoll and Ohya, 1995; Starner and Pentland, 1995; Assan and

Grobel, 1997; Vogler and Metaxas, 1998).

2.4 Visual hand tracking in HCI

The most important area of application of visual hand tracking is HCI. Visual hand tracking,

often combined with hand gesture recognition, provides a natural way of interaction with

computer systems. Visual hand tracking is especially suited to VR and AR environments

because it allows users to interact with virtual objects in an intuitive and flexible way,

without requiring cumbersome hardware.

Visual hand tracking has often been used in order to implement interaction with Digital

Desks. The concept of a Digital Desk is to augment a physical desk by adding electronic

features to it, and allow users to interact with these electronic features on the desk, moving

information from the desk to the computer and vice versa. This is achieved by means of two

cameras situated on the top of the work-surface (for tracking the user's hands) and a projector

(to project the electronic features on top of the desk). Wellner (1993) introduced many of the

original Digital Desk ideas. He presented a number of example applications for the Digital

Desk. The first application was a calculator projected on the desk, which could be operated

directly with the finger. This application allowed the user to get numbers directly from some

document on the desk and feed them into the calculator's display. Another application was

called PaperPaint. This application allowed the user to draw on the desk and copy/move and

paste sections of the drawing. Finally, he presented the idea of double-desk. This idea

involves projecting the actions happening in two separate Digital Desks at the same time, on

the same desk.

Using a different methodology but also following the Digital Desk concept, Crowley et al.

(1995) implemented a finger drawing application. The system tracked the user’s fingertip

using a template of the fingertip and correlating it with the desk’s image. The correlation was

done only over the area surrounding the last detected fingertip position. If tracking was lost,

the user needed to put his finger on a square situated in one of the corners of the desk. More

recently in the EnhancedDesk (Koike et al., 2001), paper and digital information are


integrated on the desk. The system uses template matching and infrared cameras in order to

track the user's hands and fingers. Also related with the Digital Desk, Stafford and Robinson

(1996) presented BrightBoard. BrightBoard is a system that uses a video camera and audio

feedback to enhance the facilities of an ordinary whiteboard, allowing a user to control a

computer through simple marks made on the board.
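The local correlation search described above for the fingertip tracker of Crowley et al. (1995) can be illustrated with a short sketch. This is not their implementation: the normalised cross-correlation score, the function names `ncc` and `track_fingertip`, and the `radius` parameter are assumptions made for illustration only. The sketch shows why restricting the search to the neighbourhood of the last detected position is cheap, and why tracking is lost if the fingertip leaves that neighbourhood.

```python
import math

def ncc(patch, template):
    """Normalised cross-correlation of two equally sized grey-level patches."""
    p = [v for row in patch for v in row]
    t = [v for row in template for v in row]
    mp, mt = sum(p) / len(p), sum(t) / len(t)
    num = sum((a - mp) * (b - mt) for a, b in zip(p, t))
    den = math.sqrt(sum((a - mp) ** 2 for a in p) * sum((b - mt) ** 2 for b in t))
    return num / den if den > 0 else 0.0  # flat patches carry no information

def track_fingertip(image, template, last_pos, radius):
    """Search for the template only within `radius` pixels of `last_pos`.

    Positions are (row, col) of the top-left corner of the template.
    Returns the best position and its correlation score.
    """
    th, tw = len(template), len(template[0])
    h, w = len(image), len(image[0])
    r0, c0 = last_pos
    best_score, best_pos = -1.0, last_pos
    for r in range(max(0, r0 - radius), min(h - th, r0 + radius) + 1):
        for c in range(max(0, c0 - radius), min(w - tw, c0 + radius) + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            score = ncc(patch, template)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score
```

A low best score can be taken as a sign that tracking has been lost, at which point a reinitialisation step (such as the corner square mentioned above) is needed.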

Fish tank VR applications also benefit from the use of visual hand tracking. Bowden et al.

(1996) presented an application where the user could drive 3D engines on a computer screen

using their bare hands. Segen and Kumar (1998) described a multi-dimensional hand-gesture

interface system and its use in interactive spatial applications. The system acquires input data

from two cameras that look at the user's hand, recognizes three gestures and tracks the hand in

3D space. Five spatial parameters (position and orientation in 3D) are computed for the index

finger and the thumb, which gives the user simultaneous control of up to ten parameters of

an application. They demonstrated the capabilities of the system with some example

applications: video game control, piloting of a virtual fly-through over terrain by hand

pointing, interaction with 3D objects in a scene editor by grasping and moving objects in

space, and partial control of a human hand. Abe et al. (2000) described a system that tracks

the user’s hand using two cameras, one from the top-view, and another from the side-view.

The system allows drawing figures in 3D space and handling of 3D virtual objects on the

computer screen.

As an application for wearable computing, Kurata et al. (2001) described a system, named

Hand Mouse, where the user’s hand is tracked using a wearable camera. Users wear a HMD

that allows them to see tags and information relevant to the user’s environment. A colour

based mean shift algorithm is used in order to track the user’s hand. This allows users to use

their hand as a pointing device. A soft floating keyboard operated with a single finger is

shown as an example application. In a later work, Kurata et al. (2002) improved the system

using a Condensation algorithm with the lowest possible number of samples in order to

coarsely but rapidly track the user’s hand. Three promising applications were described: a

virtual universal remote control, a secure password input, and a real world OCR. The system

allowed selection of areas, and selection of points in space. The selection used the index and

thumb fingers in order to select the area between them (as seen from the wearable camera).

To select a point in space the user had to bring the index finger and thumb together until they

touch each other, at which point a selection was made. Figure 2.5 shows some snapshots

from Hand Mouse applications. Also using a wearable camera and a video see-through HMD,

Kölsch (2004) constructed a system called HandVu. The system can track the user's hand

through highly articulated hand motions by using "flocks of features". He presented a GUI

which allowed the user to complete multiple tasks, such as interacting with virtual objects

(using ARToolKit fiducial markers), operating buttons, selecting areas from the field of

view, and recognizing gestures. Figure 2.6 shows the HandVu set-up, some snapshots of

hand tracking through highly articulated motions, and some snapshots of the HandVu GUI.

Figure 2.5: Hand Mouse. (a) - (d) Selecting a rectangle by dragging. (e) Universal remote control application.

(f) Secure password input application. (g), (h) OCR application detects text regions automatically.

(Figure reproduced from (Kurata et al., 2002).)

2.5 Interactive surfaces

In recent years, the concept of interactive surfaces has attracted a great deal of research. An

interactive surface can display information and users can interact with this information by

touching the surface either directly with their hands, using a stylus, or using some form of

wearable hardware. The concept includes the traditional touch screens but it goes beyond

them. An interactive surface can be presented on a table or a desk, on a wall, on a book or on

a piece of paper, on a shop display, or even on a virtual surface floating in thin air. The

technologies used for both displaying information on the surface and making the surface

interactive vary considerably. Examples of display technologies include front or rear

projectors, HMD, or simply a computer screen. Some sensing technologies are based on

image processing, either using visible-light cameras, infrared cameras and infrared

illumination, or 3D range cameras. Other sensing technologies use special surfaces which

include touch sensitive elements (e.g. antennas, or pressure sensors). Finally, other sensing

technologies use wearable hardware such as data gloves, pressure sensors, gyros,

accelerometers, etc. Interactive surfaces are sometimes more specifically referred to as

interactive tabletops, virtual keyboards, or touch panels. In any case, they always lie on a

surface, and thus they are generally referred to as interactive surfaces. A short survey of a

number of recent interactive surface developments is presented next.

Figure 2.6: HandVu. (a) System set-up. (b) Tracking of highly articulated hand motions using flocks of

features. (c) Selecting a region with both hands. (d) Interacting with virtual objects and ARToolKit.

(e) Virtual keypad application. (Figure reproduced from (Kölsch, 2004).)

Image processing is the technology used in a number of interactive surface implementations.

Zhang et al. (2001) presented Visual Panel. Visual Panel is a vision-based interface which

employs an arbitrary quadrangle-shaped panel (e.g. an ordinary piece of paper) and a tip

pointer (e.g. a fingertip) as a wireless and mobile input device. The user's fingertip is tracked

and operations such as click and drag on the visual panel can be detected. The system

simulates clicking and pressing by holding the tip pointer in the same position for a short

period of time. This makes it possible to simulate a keyboard or a mouse on the visual panel.

The position of the panel itself is visually tracked and can provide 3D information, serving as

a virtual joystick or 3D mouse. The quadrangle panel is tracked using the Hough transform, and

the user's fingertip is tracked using a Kalman filter and fitting a conic to the fingertip.


Background subtraction is also used for reinitialisation. Figure 2.7 shows the VisualPanel set-up

and two applications.

Figure 2.7: VisualPanel. (a) Tracked panel and tracked fingertip. (b) Virtual keyboard. (c) Finger painting.

(Figure reproduced from (Zhang et al., 2001).)
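The dwell-based clicking used by Visual Panel (holding the tip pointer still for a short period) can be sketched as follows. This is only an illustrative sketch, not the method of Zhang et al. (2001): the function name `detect_dwell_clicks`, the pixel `tolerance`, and the `dwell_seconds` threshold are hypothetical choices.

```python
import math

def detect_dwell_clicks(positions, fps, dwell_seconds=1.0, tolerance=5.0):
    """Return the frame indices at which a dwell 'click' fires.

    positions: per-frame (x, y) tip positions in pixels.
    A click fires once the tip has stayed within `tolerance` pixels of an
    anchor point for `dwell_seconds`; it re-arms when the tip moves away.
    """
    needed = max(1, int(round(dwell_seconds * fps)))
    clicks = []
    anchor, count, fired = None, 0, False
    for i, (x, y) in enumerate(positions):
        if anchor is not None and math.hypot(x - anchor[0], y - anchor[1]) <= tolerance:
            count += 1  # still dwelling near the anchor
        else:
            anchor, count, fired = (x, y), 1, False  # tip moved: restart the dwell
        if count >= needed and not fired:
            clicks.append(i)
            fired = True  # one click per dwell
    return clicks
```

The trade-off in such a scheme is between responsiveness (short dwell) and accidental clicks when the user merely pauses (long dwell).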

Malik and Laszlo (2004) presented Visual Touchpad. Visual Touchpad is a low-cost vision-

based input device that allows fluid two-handed interactions with a planar surface. Two

downward-pointing cameras are attached above a planar surface, and a stereo hand tracking

system provides the 3D positions of a user's fingertips on and above the plane. Thus, the

planar surface can be used as a multi-point touch-sensitive device, but with the added ability

to also detect hand gestures hovering above the surface. Figure 2.8 shows three example

configurations for the Visual Touchpad.

Figure 2.8: Visual Touchpad. (a) Desktop set-up. (b) Laptop setup. (c) Hand-held setup. (Figure reproduced

from (Malik and Laszlo, 2004).)

In a slightly broader context but also using image processing, Pingali et al. (2003) proposed

the concept of steerable interfaces for pervasive computing spaces. This type of interface

can be displayed (using projectors) on a wall, or on any surface near to the user, and it can

be displayed when the user needs it. Users' hands and heads are tracked using cameras; this

allows them to interact with the displayed interfaces by touching them. Pingali et al. built a 3D

environment designer that allows defining the geometry of the environment, indicating the


available surfaces (where interfaces could be displayed), and positioning cameras and projectors

inside the environment. With this data a geometrical model of the environment is built,

which lets the system work out the best location to display an interface depending on the user’s

position and orientation of the head. They described the technologies needed to realize this

concept; these include projecting images at will on different surfaces, visually tracking users'

head orientation and position, tracking users' hands, and finally directing sound to different

places in the environment.

Figure 2.9: Steerable interfaces. (a) shows the everywhere display projector. (b) shows the office set-up.

(Figure reproduced from (Pingali et al., 2003).)

A system which presents some similarities to the HMD based VTS presented in Chapter 8 of

this thesis is ARKB (Lee and Woo, 2003). ARKB uses a video see-through HMD which

allows users to see a virtual keyboard laid horizontally (rendered on top of a fiducial

marker). Users can type on the keyboard with both hands. For that, users need to wear

markers on their fingers, which are recorded by two cameras that form part of the video see-

through HMD. Using stereovision, the system can detect when one of the markers is inside

the volume of a virtual key and consider the key as pressed. Figure 2.10 shows the user's

hand with the required markers, and the ARKB operation.

Developments on interactive surfaces often use infrared cameras and infrared illumination

combined with image processing. Rekimoto and Matsushita (1997) presented HoloWall.

HoloWall uses an infrared camera located behind the wall and infrared illumination.

Information is projected on the wall (which is opaque) using a rear-projector (with IR-cut

filter). The user's shape or any objects in front of the wall are invisible to the camera.

However, when the user's hand or other objects get near enough to the wall, they reflect

infrared light allowing them to be detected. The system can detect when the user touches the

wall with either fingers, hands, their body, or even physical objects. Figure 2.11 shows the

HoloWall configuration, and two example applications.

Figure 2.10: ARKB. (a) User's hand with markers. (b) User operating the ARKB. (Figure reproduced from (Lee

and Woo, 2003).)

Wilson (2004) developed TouchLight. TouchLight is a vision-based touch screen technology

which uses stereo image processing techniques to combine the output of two infrared

cameras placed behind a semi-transparent screen in front of the user. The image processing

allows the system to determine when an object is near the surface of the screen. A rear

projector (with IR-cut filter) projects information on the screen. Because of the exclusive

properties of the semi-transparent screen, TouchLight enables various unique applications,

such as video conferencing with gaze awareness and various forms of spatial displays. Figure

2.12 shows the TouchLight configuration and a TouchLight prototype. One year later,

Wilson (2005) developed PlayAnywhere. PlayAnywhere is a front-projected computer

vision-based interactive table. The system uses a projector to display information on a table.

An infrared camera together with infrared illumination is used to capture the user's hands.

Visual analysis of the hands' shadows results in the detection of clicks and drags on the

surface. Click and drag detection is demonstrated with the manipulation of various virtual

objects. Optical flow techniques are also used in order to enable panning, rotating, and

zooming of high-resolution maps. Figure 2.13 shows a PlayAnywhere prototype and an

example application.


Figure 2.11: HoloWall. (a) Configuration of the HoloWall. (b) A simple two-handed interface: a user

simultaneously manipulates two control points (and directions) on a Bézier curve. (c) A map

browser on a curtain: When a user touches a map projected on a curtain with one or two hands and

moves hands in the same direction, the map moves according to the hand movement. A user can

also expand or shrink the map by controlling the distance between two hands. (Figure reproduced

from (Rekimoto and Matsushita, 1997).)


Figure 2.12: TouchLight. (a) TouchLight physical configuration: DNP HoloScreen with two IR cameras and

IR illuminant behind screen. (b) TouchLight prototype. (Figure reproduced from (Wilson, 2004).)

Figure 2.13: PlayAnywhere. (a) PlayAnywhere prototype. (b) Flow field-based manipulation of objects applied

to panning, rotating, and zooming a high resolution map. (Figure reproduced from (Wilson, 2005).)

There are a number of "virtual keyboards" based on infrared cameras that have made it into

commercial products; one of them is the Canesta keyboard (Roeber et al., 2003). The

Canesta keyboard projects the image of a QWERTY keyboard onto any flat surface and

allows the user to input text by typing on the projected keys. The Canesta keyboard emits a

plane of infrared light slightly above the typing surface. A sensor module detects the

intersection of fingers with the infrared light. The data generated by the sensor module is

interpreted into mouse and keyboard events. Figure 2.14 shows an illustration of the Canesta

keyboard. Another virtual keyboard (this one still in research phase) is the one proposed by


Du et al. (2005). This virtual keyboard is projected onto a flat surface in a similar way to the

Canesta keyboard, but it can project other information too. The key press detection is

achieved by analysing the depth maps from a 3D optical range camera. Figure 2.15 shows a

prototype of this 3D optical ranging virtual keyboard.

Figure 2.14: Canesta keyboard. (Figure reproduced from (Roeber et al., 2003).)

Figure 2.15: Virtual keyboard based on true-3D optical ranging. (Figure reproduced from (Du et al., 2005).)

Other interactive surface technologies are based on sensing elements deployed on the

interactive surface, as in the case of touch screens. One of these technologies is

DiamondTouch (Dietz and Leigh, 2001). DiamondTouch is an interactive tabletop which uses

a set of antennas embedded in it. A transmitter is connected to the table and a receiver is

connected to each user (typically through the chair on which they sit). When a user touches the

table top, the signal from the transmitter can travel, from the tabletop, through the user, and

to the receiver attached to the user's chair. The position of contact can be found by processing

the receiver's signal. The system can detect simultaneous, multiple point inputs from

various users, and can identify each user. This enables new applications for group


collaboration on a common surface. A projector above the tabletop is used to display

feedback information onto the table. Figure 2.16 shows the DiamondTouch set-up and a

collaborative work environment.

Figure 2.16: DiamondTouch. (a) System set-up. (b) Collaborative work environment implemented with

DiamondTouch. (Figure reproduced from (Dietz and Leigh, 2001).)

Finally, there have been some attempts at creating interactive surfaces which are based on

wearable hardware. Those are generally concerned with the sensing of clicks and/or drags,

and the presentation of information and feedback is generally handled by a common

computer, PDA, or mobile phone screen. Osawa and Sugimoto (2002) use a data-glove in

order to recognize the user's hand movements. They implemented a VR keyboard

(immersive VR) with which they input Japanese characters. Fukumoto and Tonomura

(1997) created FingeRing. FingeRing uses accelerometers on each finger of the user's hand

in order to detect surface impacts. This allows symbols to be input by tapping various fingers

on a table, knee, or other surface. They implemented a chord keyboard. Figure 2.17 shows an

illustration of FingeRing. Kim et al. (2005) presented SCURRY. SCURRY is a wearable

input device developed by Samsung Advanced Institute of Technology. Based on inertial

sensors, this device allows a human operator to select a specified character (from a virtual

keyboard), an event, or operation through both hand motion and finger clicking. SCURRY

can also be used as a mouse, using the index, middle, and ring fingers as mouse buttons.

Figure 2.18 shows an illustration of SCURRY. Senseboard (2000) is another commercial

wearable hardware keyboard developed by Senseboard Technologies AB. It consists of two

rubber pads that slip onto the user's hands. Muscle movements in the palm are sensed and

translated into keystrokes with pattern recognition methods. Figure 2.19 shows an illustration

of Senseboard.


Figure 2.17: FingeRing. (Figure reproduced from (Fukumoto and Tonomura, 1997).)

Figure 2.18: SCURRY. (Figure reproduced from (Kim et al., 2005).)

Figure 2.19: Senseboard. (Figure reproduced from (Senseboard, 2000).)

2.6 Summary

In this chapter the basics about human hand anatomy and its modelling in computer systems

have been reviewed. The most important techniques for visual hand tracking have also been

reviewed and a number of their applications in HCI have been described. The starting point for

the visual hand tracking developed in this thesis is the work of Blake and Isard (1998) and

MacCormick and Isard (2000). They set up a framework for visual tracking of 2D contours.

From this starting point, this thesis further develops their work by making possible the


efficient tracking of tree-structured articulated objects while using the hierarchical structure

of the object in the matching process. Skin colour is a common visual cue used in hand

trackers. It is generally fast to process and this allows real-time tracking. Typically, hand

trackers that are entirely based on skin colour information do not allow articulated tracking

(Bradski, 1998; Kolsch and Turk, 2005). In contrast, the visual hand tracking developed in

this thesis is entirely based on skin colour, which makes it very fast, and it is fully

articulated. The visual hand tracking developed in this thesis is designed for HCI, and in

particular for use in interactive surfaces.

This chapter has described the concept of interactive surface and several examples of

interactive surfaces have been reviewed. The vision-based interactive surfaces are attractive

with respect to other sensing technologies because they are flexible and avoid the need of

expensive and cumbersome hardware. The interactive surface proposed in Chapter 8,

referred to as Virtual Touch Screen (VTS), uses the visual hand tracking developed in this

thesis. This sensing technology is novel in comparison with the sensing technologies for

interactive surfaces seen in this chapter. Most of the vision based interactive surfaces

reviewed in this chapter are based on detecting (rather than tracking) when the user's hand or

fingers are near or inside the interactive surface. They often rely on the interaction of light

and the user's hand or fingers when these are in the proximity of the surface (shadows,

reflected IR light, etc). In contrast, the visual hand tracking used in the VTS does not rely on

such light interactions in the proximity of the surface, for it does not require a physical

surface to operate – it only requires a single video stream containing a hand mostly parallel

to the camera's image plane (which itself is parallel to the VTS). Probably ARKB (Lee and

Woo, 2003) is the most similar work to the VTS (when an HMD is used), although their

sensing technology employs stereovision and requires adorned hands. When the VTS's

display technology is based on a screen and a projector, TouchLight (Wilson, 2004) or

PlayAnywhere (Wilson, 2005) are probably the most similar works to the VTS, again using

different sensing technologies.


3 Hand contour tracking using particle filters and deformable templates

Particle filters, also known as Sequential Monte Carlo methods (SMC), are sophisticated

model estimation techniques based on simulation. Particle filters in conjunction with

deformable templates have been widely used in the past in order to track hand contours

(Cootes et al., 1995; Heap and Hogg, 1998; Bowden, 1999; Blake and Isard, 1998;

MacCormick and Isard, 2000). Blake and Isard established a framework to perform contour

tracking by using deformable templates together with a particle filter known as the

Condensation filter. One of the strongest features of this framework is its ability to track

object contours against cluttered backgrounds, while still being able to keep the

computational requirements within a given resource.
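To make the idea of a particle filter concrete, the following minimal sketch shows one select-predict-measure iteration in the style of the Condensation filter. It is a toy illustration under stated assumptions, not the tracker developed in this thesis: the state is a single scalar, the dynamics are a random walk, the observation likelihood is Gaussian, and the names `condensation_step`, `estimate`, `process_std` and `obs_std` are hypothetical. Note that the number of particles stays constant from one time step to the next, which is what keeps the computational cost within a fixed budget.

```python
import math
import random

def condensation_step(particles, weights, observation, rng,
                      process_std=1.0, obs_std=2.0):
    """One select-predict-measure iteration on a scalar state."""
    n = len(particles)
    # Select: resample n particles with probability proportional to weight.
    chosen = rng.choices(particles, weights=weights, k=n)
    # Predict: apply (assumed) random-walk dynamics to each particle.
    predicted = [p + rng.gauss(0.0, process_std) for p in chosen]
    # Measure: re-weight each particle by an (assumed) Gaussian likelihood.
    new_w = [math.exp(-0.5 * ((observation - p) / obs_std) ** 2) for p in predicted]
    total = sum(new_w)
    if total == 0.0:  # every likelihood underflowed: fall back to uniform weights
        return predicted, [1.0 / n] * n
    return predicted, [w / total for w in new_w]

def estimate(particles, weights):
    """Weighted mean of the particle set."""
    return sum(p * w for p, w in zip(particles, weights))
```

In a contour tracker the scalar state would be replaced by the configuration vector of the deformable template, and the likelihood by an image measurement taken along the hypothesised contour.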

This chapter presents this framework, emphasising its application to hand contour tracking.

Firstly, the basic elements of the framework are presented, and their use within the

Condensation filter explained. Secondly, we will see how this framework can be expanded in


order to track articulated objects. Thirdly, we investigate the problems that arise when

tracking tree-structured articulated objects. Finally, particle interpolation is proposed as a

method to overcome these problems.

3.1 Deformable templates

The tracking techniques used in this thesis are based on modelling the contour of the target

object by using B-spline curves (Blake and Isard, 1998). These B-spline curves are handled

in a specific way by using a configuration or state vector of relatively few degrees of

freedom. In this context a B-spline curve is referred to as contour, referring to its shape, or as

a deformable template, often referring to both its shape and its capability to change

according to a state vector. Blake and Isard presented a framework for contour tracking,

which relies upon the use of deformable templates. This section presents a brief introduction

to this part of Blake and Isard's framework.

A B-spline curve is a parametric curve that allows representation of a smooth, natural-looking

shape by specifying only a small number of "control points". If the coordinates of

the control points are $(x_1, y_1), \ldots, (x_n, y_n)$ then the B-spline is a curve

$(x(s), y(s))^T$ parameterised by a real variable s on an interval of the real line:

$$\begin{pmatrix} x(s) \\ y(s) \end{pmatrix} = B(s) \begin{pmatrix} r^x \\ r^y \end{pmatrix} \tag{3.1}$$

where $B(s)$ is a $2 \times 2n$ matrix, called the metric matrix, whose entries are B-spline basis

functions (polynomials in s) and $r^x, r^y$ are $n \times 1$ column vectors containing the x- and y-

coordinates of the control points respectively. By convention, we will call such a B-spline

curve a contour. Figure 3.1 shows the contour of a hand's middle finger; the contour uses 8

control points indicated by blue crosses.
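
As an illustration of Equation (3.1), the following Python sketch evaluates a point on a B-spline as a weighted sum of control points. It assumes a closed, uniform cubic B-spline; this choice, and the function names, are made for the example only and are not taken from the thesis, whose spline order and end conditions are not stated in this section.

```python
import math


def cubic_bspline_point(control_points, s):
    """Evaluate a closed, uniform cubic B-spline at parameter s >= 0.

    control_points is a list of (x, y) pairs; s runs over [0, n) for n points.
    """
    n = len(control_points)
    span = math.floor(s)
    u = s - span  # local parameter within the active span, in [0, 1)
    # Uniform cubic B-spline basis functions (polynomials in u); they sum to 1.
    basis = (
        (1 - u) ** 3 / 6.0,
        (3 * u ** 3 - 6 * u ** 2 + 4) / 6.0,
        (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6.0,
        u ** 3 / 6.0,
    )
    # Weighted sum of the four control points that influence this span.
    x = sum(b * control_points[(span + k) % n][0] for k, b in enumerate(basis))
    y = sum(b * control_points[(span + k) % n][1] for k, b in enumerate(basis))
    return x, y


def sample_contour(control_points, samples_per_span=10):
    """Sample the whole closed contour at evenly spaced parameter values."""
    n = len(control_points)
    return [
        cubic_bspline_point(control_points, j / samples_per_span)
        for j in range(n * samples_per_span)
    ]
```

The basis values computed here play the role of the non-zero entries of $B(s)$ in Equation (3.1): the curve point is always a linear function of the control point coordinates.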

A contour can use a large number, n, of control points to define its shape. The vector space

of such a contour will have an undesirably large number (2n) of dimensions. In order to

manage the contour more easily, a vector subspace, termed shape space and denoted by X, is

defined. An element $x \in X$ is related to the control point coordinates $r^x, r^y$ by a linear

transformation with a fixed offset:

$$\begin{pmatrix} r^x \\ r^y \end{pmatrix} = W x + Q_0 \tag{3.2}$$

If the shape space has d dimensions, then W is the $2n \times d$ shape matrix, x is a $d \times 1$ column

vector generally referred to as the configuration or state of the contour, and $Q_0$ is a $2n \times 1$ vector

called the template for this object. Given a template for a rigid planar object, it is easy to

define a shape space X corresponding to translations, 2D Euclidean similarities, or affine

transformations of the template $Q_0$. The final configuration of the contour, defined by the

control point coordinates $r^x, r^y$, can be controlled by changing the state vector x.

For example, a shape matrix that represents a 2D affine transformation of the template $Q_0$

can be defined as:

$$W = \begin{pmatrix} \mathbf{1} & \mathbf{0} & Q_0^x & \mathbf{0} & \mathbf{0} & Q_0^y \\ \mathbf{0} & \mathbf{1} & \mathbf{0} & Q_0^y & Q_0^x & \mathbf{0} \end{pmatrix} \tag{3.3}$$

where $\mathbf{0} = (0, 0, \ldots, 0)^T$, $\mathbf{1} = (1, 1, \ldots, 1)^T$, and the template $Q_0$ decomposes as $(Q_0^x, Q_0^y)^T$,

$Q_0^x$ being the x-coordinates of the template control points, and $Q_0^y$ the y-coordinates of the

template control points. Using the shape matrix in Equation (3.3), here we see some

examples of transformations:

1. $x = (0, 0, 0, 0, 0, 0)^T$ represents the original template shape $Q_0$.

2. $x = (1, 0, 0, 0, 0, 0)^T$ represents the template translated 1 unit to the right.

3. $x = (0, 0, 1, 1, 0, 0)^T$ represents the template doubled in size.

4. $x = (0, 0, \cos\theta - 1, \cos\theta - 1, -\sin\theta, \sin\theta)^T$ represents the template rotated through angle θ.

5. $x = (0, 0, 1, 0, 0, 0)^T$ represents the template doubled in width.
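
The examples above can be checked numerically. The following Python sketch builds the planar affine shape matrix row by row and applies Equation (3.2) to a small template. The function names and the unit-square template are assumptions made for illustration; they do not come from the thesis.

```python
def affine_shape_matrix(q0x, q0y):
    """Build the 2n x 6 planar affine shape matrix W as a list of rows.

    The top n rows produce x-coordinates, the bottom n rows y-coordinates.
    """
    top = [[1.0, 0.0, qx, 0.0, 0.0, qy] for qx, qy in zip(q0x, q0y)]
    bottom = [[0.0, 1.0, 0.0, qy, qx, 0.0] for qx, qy in zip(q0x, q0y)]
    return top + bottom


def apply_state(q0x, q0y, state):
    """Return control point coordinates (r^x, r^y) = W x + Q_0."""
    w = affine_shape_matrix(q0x, q0y)
    q0 = list(q0x) + list(q0y)
    r = [
        sum(w_ij * x_j for w_ij, x_j in zip(row, state)) + q
        for row, q in zip(w, q0)
    ]
    n = len(q0x)
    return r[:n], r[n:]


# Hypothetical template: the four corners of a unit square.
square_x = [0.0, 1.0, 1.0, 0.0]
square_y = [0.0, 0.0, 1.0, 1.0]

translated = apply_state(square_x, square_y, [1, 0, 0, 0, 0, 0])
doubled = apply_state(square_x, square_y, [0, 0, 1, 1, 0, 0])
widened = apply_state(square_x, square_y, [0, 0, 1, 0, 0, 0])
```

Because the mapping is linear in the state vector, a zero state always returns the template itself.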

Contours of this type, together with their capability to change, are also known as deformable

templates. Deformable templates are linearly parameterised by their state vector. This linear

parameterisation is useful later in contour tracking because it simplifies the fitting algorithms

and avoids problems with local minima. However, there exist other types of deformable

templates that are not linearly parameterised. Yuille and Hallinan use a geometrical


parameterisation of their contours (Yuille and Hallinan, 1992), and MacCormick and Isard

use a non-linear parameterisation of their articulated hand contour, which includes joint

angles (MacCormick and Isard, 2000; Isard and MacCormick, 2000). The hand contour

model used in this thesis uses the latter non-linear parameterisation.

Figure 3.1: A B-spline contour fitted to the middle finger of a hand. The contour uses 8 control points,

shown here as blue crosses.

3.2 Measurement model

The tracking algorithms presented in the following sections need to measure how well a hand

contour fits the hand in an image. Within the tracking framework presented by Blake and

Isard, the fitness of a hypothesised contour configuration to the image features is expressed

as a conditional probability, referred to as the contour likelihood. If $x_t$ is the state of a

contour (modelled object) at time t, and $Z_t$ is the set of image features at time t, then the

likelihood of that contour representing the true object configuration is the conditional

probability $p(Z_t | x_t)$. This contour likelihood can be calculated in several ways, and a

detailed discussion about various ways of calculating it is given by MacCormick (2000).

Blake and Isard's approach to calculating a contour likelihood uses a set of line segments

normal to the contour at several informative points, the measurement points. These line

segments, termed measurement lines, are processed in order to find image features along

them. Each of the features found along the measurement lines has a measurable contribution

towards the contour likelihood. Figure 3.2 shows a hypothesised mouse-shaped contour and

several normal measurement lines. Along the measurement lines there are some black dots,

which represent the found features. This method of calculating a contour likelihood


represents a large saving of processing time in comparison to the original implementations of

active contours, or "snakes" (Kass et al. 1987), in which processing is performed on the

entire image, and the resulting edge-map is used as an energy surface across which the

contour moves.

In Blake and Isard's approach, the contributions towards the contour likelihood of each of the

features found along a measurement line, are calculated using a Gaussian centred on the

measurement point. Figure 3.3 depicts the measurement process for a measurement line. The

measurement line is normal to the hypothesised contour at the measurement point. The three

features found along the measurement line, black dots, correspond (from left to right) to: a

feature outside the target object due to cluttered background; the real contour position; and a

feature internal to the target object, due to it having a non-homogeneous texture. The feature

corresponding to the real contour has the highest Gaussian value, out of the three features,

therefore this feature will constitute the largest contribution to the measurement point

likelihood.

Figure 3.2: Measurement lines distributed along a contour. The thick white line represents a hypothesised

configuration of a mouse-shaped contour. The thin lines are measurement lines. The black dots

represent the detected features along the measurement line. (Figure reproduced from (MacCormick,

2000).)


Figure 3.3: Measurement line normal to a hypothesized contour.

The measurement point likelihood as formulated by Blake and Isard has

the form $p(z | x)$, where z is the set of features found along the measurement line¹, and x is

the position of the hypothesised contour on the measurement line:

$$p(z | x) \propto 1 + \frac{1}{\sqrt{2\pi}\,\sigma\alpha} \sum_m \exp\left(-\frac{v_m^2}{2\sigma^2}\right) \tag{3.4}$$

where σ is the standard deviation of the Gaussian; $\alpha = q\lambda$, where q is the probability that the

target object is not visible and λ is the spatial density of the background clutter

(following a Poisson process along the line); and $v_m = z_m - x$ is the distance between the m-th

feature found on the measurement line and the position x of the hypothesised contour on the

measurement line. Generally x is at the midpoint of the measurement line.

In practice, considerable economy can be applied when evaluating (3.4); it is not necessary

to include any of the features $z_1, \ldots, z_m$ for which $v_m$ satisfies:

$$\frac{1}{\sqrt{2\pi}\,\sigma\alpha} \exp\left(-\frac{v_m^2}{2\sigma^2}\right) \ll 1 \tag{3.5}$$

These features bring a negligible contribution to the measurement point likelihood. Then

(3.4) can be simplified as:

$$p(z | x) \propto \exp\left(-\frac{1}{2\sigma^2} f(v_1; \mu)\right) \tag{3.6}$$

where $f(v; \mu) = \min(v^2, \mu^2)$, $\mu = \sigma\sqrt{2\log\left(1 / (\sqrt{2\pi}\,\sigma\alpha)\right)}$ is a spatial scale constant, and $v_1$ is

the $v_m$ lying closest to the hypothesised position x.

¹ The specific image processing techniques employed to process the measurement lines and detect image features are described in (Blake and Isard, 1998).

In order to calculate the likelihood of the entire contour two assumptions have to be made:

first, observations $z_m$ are assumed to be mutually independent; second, the contour's

likelihood depends only on the object's configuration at the current time-step. Then, the

likelihood of the entire contour at time t is the product of each of the likelihoods of each of

the M measurement points:

$$p(Z_t | x_t) = \prod_{m=1}^{M} p(z_m | x_m) \tag{3.7}$$

This can be computed using the following simplified form:

$$p(Z_t | x_t) \propto \exp\left\{-\sum_{m=1}^{M} \frac{1}{2\sigma^2} f\left(z_1(s_m) - r(s_m); \mu\right)\right\} \tag{3.8}$$

where $s_m$ is the mth measurement point; $z_1(s)$ is the closest feature to the hypothesised

contour along the measurement line s; and $r(s)$ is the hypothesised contour position along

the measurement line s. According to Equation (3.8), the parameters that control the

computation of the contour likelihood are µ, σ and M. µ controls the clutter-resistance of

the tracker: if an object is expected to lie in a clutter-free environment then µ can be set

quite large, and as clutter density increases its value should decrease accordingly. σ should

be set according to the accuracy of the shape model. If the expected object appearance is

very well modelled by the shape space then a small value of σ can be used since features can

be expected to be found very close to the predicted curve. If however the shape model is

inaccurate, a larger value of σ will permit tracking of shapes that are not exactly within the

modelled space, while increasing the risk of distraction by clutter. The time spent in

calculating the contour likelihood depends largely on the number of measurement points


along the contour, so a large value of M will slow down the calculation time. Judicious

positioning of the measurement points $s_m$ at informative points along the contour, rather than

spacing the $s_m$ evenly, can allow a smaller M for an equivalent performance.
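
The simplified likelihood of Equation (3.8) can be sketched in a few lines of Python. The sketch works in log space and assumes that feature detection has already produced, for each measurement line, a list of feature positions along that line. The function name, the data layout, and the handling of a line with no features (its cost is saturated at µ²) are assumptions made for this example, not details taken from the thesis.

```python
def contour_log_likelihood(features_per_line, contour_positions, sigma, mu):
    """Unnormalised log of the contour likelihood in the simplified form (3.8).

    features_per_line: one list of feature positions per measurement line.
    contour_positions: hypothesised contour position r on each line.
    """
    total = 0.0
    for features, r in zip(features_per_line, contour_positions):
        if features:
            # Distance from the contour to the closest feature on this line.
            v1 = min(abs(z - r) for z in features)
            cost = min(v1 * v1, mu * mu)  # f(v; mu) = min(v^2, mu^2)
        else:
            # Assumption: a line without features contributes the saturated cost.
            cost = mu * mu
        total += cost / (2.0 * sigma ** 2)
    return -total
```

Working with the logarithm avoids numerical underflow when M is large; exponentiating the result and normalising over all hypotheses recovers relative likelihoods.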

3.3 The Condensation algorithm applied to visual contour tracking

At the core of the tracking techniques used in this thesis there is a particle filter known as

Condensation (Isard and Blake, 1998a). Particle filters have been used in a diverse range of

applied sciences, and the basic algorithms were discovered independently by researchers in

several of these disciplines. In the field of computer vision, and in particular contour

tracking, Isard and Blake made an important contribution by introducing the Condensation

algorithm. This section briefly describes this algorithm.

In visual contour tracking the task is to find the contour configuration of the target object

throughout T frames of a video sequence. The contour configuration of the target object at

frame t is denoted as $x_t$, with $t = 1, \ldots, T$. In order to find information about the target object, a

number of measurements are made at each frame, calculating the likelihoods for the

hypothesised contours. The measurements at frame t are denoted as $Z_t$. The measurements

acquired up to frame t are denoted as $\mathcal{Z}_t$:

$$\mathcal{Z}_t = \{Z_1, \ldots, Z_t\}$$

The information of interest for the location of the target object is expressed as a conditional

probability $p_t(x_t | \mathcal{Z}_t)$; this is the probability of a hypothesised contour given the history of

measurements. However, in general it is difficult to calculate $p_t(x_t | \mathcal{Z}_t)$ directly. For this

reason Bayes' theorem is applied at each time-step, obtaining a posterior

$p_t(x_t | \mathcal{Z}_t)$ based on all available information:

$$p_t(x_t | \mathcal{Z}_t) = \frac{p_t(Z_t | x_t)\, p_{t-1}(x_t | \mathcal{Z}_{t-1})}{p_t(Z_t)} \tag{3.9}$$

where $p_{t-1}(x_t | \mathcal{Z}_{t-1})$ is called the prior, and $p_t(Z_t | x_t)$ is the observation density. As usual in

filtering theory, a model for the expected motion between time-steps is adopted. This takes

the form of a conditional probability distribution $p_t(x_t | x_{t-1})$ termed the dynamics. Using the

dynamics, (3.9) can be re-written as:

$$p_t(x_t | \mathcal{Z}_t) = \frac{p_t(Z_t | x_t) \int_{x_{t-1}} p_t(x_t | x_{t-1})\, p_{t-1}(x_{t-1} | \mathcal{Z}_{t-1})\, dx_{t-1}}{p_t(Z_t)} \tag{3.10}$$

This is the equation that a filter must calculate or approximate. This equation suggests a

recursive implementation: the conditional probability of the target configuration at time t can

be approximated as a sum of the previous conditional probabilities of the target configuration

multiplied by the dynamics, all weighted by the observation density. The term $p_t(Z_t)$ is generally a

constant independent of $x_t$ for a given image; therefore it can be neglected in the case where

only relative likelihoods need to be considered. This is what filters such as the Kalman filter do

(Gelb, 1974; Welch and Bishop, 2002). In a Kalman filter the observation density is assumed

Gaussian, and the target configuration evolves as a Gaussian pulse throughout the tracking

task. However, it is an empirical fact that the observation densities occurring in visual

tracking problems are not at all Gaussian; this was the original motivation for Isard and

Blake to introduce the Condensation algorithm.

The fundamental idea behind Condensation is simply to simulate Equation (3.10). The

simulation uses the idea of a weighted particle set: this is a list of n pairs $(x_i, \pi_i)$, $i = 1, \ldots, n$,

where $x_i \in X$ (the configuration space) and $\pi_i \in [0, 1]$ is a weight with $\sum_{i=1}^{n} \pi_i = 1$. The

weighted particle set approximates the conditional density $p_t(x_t | \mathcal{Z}_t)$ at time t. Figure 3.4

shows a weighted particle set approximating a probability density. The particles are the grey

ellipses and their weight is proportional to their area. Picking one of the particles is

approximately the same as drawing randomly from the continuous probability function

shown. One of the strengths of this weighted particle set representation is that it allows the

representation of multimodal distributions.


Figure 3.4: Weighted particle set approximation of a probability density. (Figure reproduced from (Isard

and Blake, 1998a).)

In the context of contour tracking, each particle constitutes a hypothesised contour

configuration $x_i$, and its weight $\pi_i$ is the likelihood of the contour representing the real

target object. Figure 3.5 illustrates this using a hand as the target object. The figure shows

three hand contours, in different colours, which are the graphical representation of three

particles from a particle set. The thickness of each particle is proportional to the weight of

the particle.

Figure 3.5: Graphical representation of three particles from a particle set. The thickness of the hand

contours is proportional to the weight of the particle.

The application of certain operations on the particle set for each time-step allows simulating

the evolution in time of the conditional density $p_t(x_t | \mathcal{Z}_t)$. The repetition of these operations


at each time-step is what constitutes the Condensation algorithm. These operations on the

particle set are called: resampling, prediction, and measurement. Next, each of these

operations is briefly described:

Resampling

The first operation on the particle set is to sample (with replacement) n times. The particles

are chosen with probability equal to their weight $\pi_i$. This can be done efficiently with the

use of cumulative probabilities. Some particles, especially those with high weights, may be

chosen several times, leading to identical copies of elements in the new set. Others with

relatively low weights may not be chosen at all. After the resampling operation, the resulting

particles do not have a weight. The particles are endowed with a new weight at the

measurement step.

There are various methods to perform this sampling operation. Kitagawa (1996) used

deterministic and stratified sampling methods; and in the visual tracking field, Blake and

Isard experimented with both random and deterministic sampling methods. Independently of

the method used for the resampling, an important fact is that resampling should not affect the

distribution represented by the particle set. This fact allowed MacCormick (2000)

to complete the proof that Condensation is correct.
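
A minimal Python sketch of random resampling with cumulative probabilities is given below. It is one straightforward way of implementing the operation, written for illustration; it is not claimed to be the implementation used in this thesis, and the function name is hypothetical.

```python
import bisect
import itertools
import random


def resample(particles, weights, rng=random):
    """Draw len(particles) particles with replacement, with probability
    proportional to their weights. The returned particles carry no weight."""
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]
    resampled = []
    for _ in range(len(particles)):
        u = rng.random() * total
        # Binary search for the first cumulative weight strictly above u.
        index = bisect.bisect_right(cumulative, u)
        # Guard against floating-point rounding at the upper end.
        resampled.append(particles[min(index, len(particles) - 1)])
    return resampled
```

The binary search makes each draw cost O(log n) rather than O(n), which is the efficiency gain that cumulative probabilities provide.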

Prediction

Each of the particles of the new resampled particle set (hypotheses of contour

configurations) evolves from time-step to time-step following certain dynamics. Applying

these dynamics to a particle is referred to as prediction. In this thesis the dynamics used for

the particles follow the second-order auto-regressive processes (ARPs) described in (Blake

and Isard, 1998). A second-order ARP model expresses the state $x_t$ at time t as a linear

combination of the previous two states and some Gaussian noise:

$$x_t = A_2 x_{t-2} + A_1 x_{t-1} + B w \tag{3.11}$$

where $A_1, A_2$ are fixed $d \times d$ matrices which represent the deterministic components of the

dynamics; B is also a fixed $d \times d$ matrix that represents the stochastic component of the

dynamics; and w is a $d \times 1$ vector of independent random normal $N(0, 1)$ variates. The values of

$A_1, A_2$, and B in this thesis have been found empirically.


Looking at the particle set as a whole, the prediction operation can be interpreted as

convolving the particle set with the dynamics $p_t(x_t | x_{t-1})$ in order to produce a new particle

set (MacCormick, 2000).
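
The prediction of Equation (3.11) for a single particle can be sketched as follows. The matrices are passed in as lists of rows. The constant-velocity values used in the example ($A_1 = 2I$, $A_2 = -I$) are a common textbook choice and are purely illustrative; the empirically tuned values used in this thesis are not given here.

```python
import random


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def predict(x_prev1, x_prev2, a1, a2, b, rng=random):
    """Second-order ARP step: x_t = A2 x_{t-2} + A1 x_{t-1} + B w."""
    d = len(x_prev1)
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]  # independent N(0, 1) variates
    drift = [p + q for p, q in zip(mat_vec(a2, x_prev2), mat_vec(a1, x_prev1))]
    diffusion = mat_vec(b, w)
    return [m + n for m, n in zip(drift, diffusion)]


# Illustrative constant-velocity dynamics in two dimensions.
A1 = [[2.0, 0.0], [0.0, 2.0]]
A2 = [[-1.0, 0.0], [0.0, -1.0]]
B = [[0.1, 0.0], [0.0, 0.1]]
x_next = predict([2.0, 1.0], [1.0, 1.0], A1, A2, B)
```

Note that each particle must store its two previous states so that the second-order model can be applied.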

Measurement

At this point, the particle set is composed of n new particles whose configuration has been

predicted, according to their dynamics, from the original particles. However, the particles do

not have a weight yet. At this step, the image features are analysed in order to calculate the

likelihood of each particle (hypothesised contour) representing the target object. The

likelihood of a particle will constitute the new weight for that particle. The contours'

likelihood can be calculated as described in Section 3.2. Once the weights for all the particles

are generated, these weights are normalised so that their sum equals 1.

Looking at the particle set as a whole, the measurement operation can be interpreted as

multiplying the particle set by the observation density $p_t(Z_t | x_t)$ in order to produce a new

particle set (MacCormick, 2000).

Figure 3.6 depicts one time-step in the Condensation algorithm. The time-step begins with a

weighted particle set of size n from the previous time-step t-1, Figure 3.6(a). This particle set

is resampled in order to produce another set of n particles, and dynamics are applied to each

of the new particles, Figure 3.6(b). Finally, the contour likelihood is calculated for each of

the new particles. This is equivalent to multiplying the particle set by the observation density

$p_t(Z_t | x_t)$, Figure 3.6(c). The result is a new weighted particle set that will be used as a

starting point in the following time-step.

After any time-step of the Condensation algorithm, it is possible to "report" on the current

state, for example by evaluating some moment of the state density. In this thesis, the reported

state at each time-step will be that of the particle with highest weight.
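
The three operations and the reporting rule can be put together in a short Python sketch of one Condensation time-step. To keep the example small it tracks a one-dimensional state and uses first-order dynamics (a drift function plus Gaussian diffusion) instead of the second-order model of Equation (3.11). The `measure` and `drift` callables and all names are hypothetical placeholders.

```python
import math
import random


def condensation_step(particles, weights, measure, drift, noise_std, rng):
    """One time-step: resampling, prediction, measurement."""
    n = len(particles)
    # Resampling: draw n particles with replacement, proportionally to weight.
    resampled = rng.choices(particles, weights=weights, k=n)
    # Prediction: deterministic drift plus stochastic diffusion.
    predicted = [drift(x) + rng.gauss(0.0, noise_std) for x in resampled]
    # Measurement: the likelihood of each hypothesis becomes its new weight.
    raw = [measure(x) for x in predicted]
    total = sum(raw)
    new_weights = [r / total for r in raw]  # normalise so the weights sum to 1
    return predicted, new_weights


def report(particles, weights):
    """Report the state of the particle with the highest weight."""
    best = max(range(len(particles)), key=lambda i: weights[i])
    return particles[best]
```

For example, with a hypothetical likelihood peaked at 5.0, such as `lambda x: math.exp(-((x - 5.0) ** 2) / 2.0)`, repeated calls to `condensation_step` concentrate the particle set around that value.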

A remarkable quality of the Condensation algorithm is its simplicity, in comparison to other

algorithms such as the Kalman filter, despite its generality. The Condensation algorithm is capable of

tracking a multimodal distribution, and capable of doing this within a given computational

resource, determined by the size of the particle set. An interesting advantage of the

Condensation algorithm over other algorithms such as the Kalman filter is that Condensation avoids

overshooting, as it only tests hypotheses, and in real life there are no overshoots.

Figure 3.6: One time-step in the Condensation algorithm. (Figure adapted from (Isard, 1998)).

3.4 Articulated tracking

It is possible to use a Condensation filter to visually track articulated objects. In this case, a

particle from the filter has to contain the configuration of a suitable articulated contour. This

configuration vector will include the deformation parameters for each of the links plus the

parameters that define the relationship between links. Then, a Condensation filter could be

used to track the contour of that articulated object.

However, the size of the configuration vector for many articulated objects of interest, such as

the human body or the human hand, is too large (20-40 degrees of freedom) to be dealt

with directly by a Condensation filter. As the dimension of a particle's configuration space in a

Condensation filter increases, the number of particles needed to explore this configuration

space increases exponentially for a given level of performance, rendering the Condensation

filter ineffective. One possible solution to stop this increase in dimensionality could be using


PCA to reduce the number of dimensions needed, as described by Isard (1998). However,

even after this reduction in dimensions the configuration space may still be too big to be

explored using Condensation. Fortunately, there exists a technique that makes the tracking

of articulated objects tractable: partition sampling.

3.4.1 Partition sampling

A technique called partition sampling was introduced by MacCormick and Blake (1999) for

avoiding the high cost of particle filters when tracking more than one object. Later, this

technique was used by MacCormick and Isard (2000) to implement a vision based articulated

hand tracker. This technique makes it possible to deal with larger configuration spaces,

provided that certain conditions are met, such as the ones found in articulated object

tracking.

Partition sampling is based on a hierarchical decomposition of the problem's configuration

space X. The problem's configuration space is divided into a number of smaller

configuration spaces called partitions $X_1, \ldots, X_k$, where the posterior in a partition $X_j$

determines the search scope of the following partition $X_{j+1}$. Each partition $X_j$ has an

associated particle set $S_j$, the size of which is related to the number of dimensions of that

partition: fewer dimensions mean far fewer particles for a given level of coverage. As a

result, the sum of the sizes of the particle sets of each partition is much smaller than the size

of an equivalent particle set that could cover the original configuration space. This results in

needing far fewer particles to achieve the same level of performance.
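
As a rough worked example of this saving, assume (purely for illustration; this heuristic is not taken from the thesis) that covering one dimension to a given resolution takes c particles, so that covering a d-dimensional space takes about $c^d$ particles. For a six-dimensional space split into partitions of four, one and one dimensions, with c = 10:

$$c^{6} = 10^{6} \quad \text{versus} \quad c^{4} + c^{1} + c^{1} = 10^{4} + 10 + 10 = 10\,020$$

Under this assumption the partitioned search needs roughly one hundred times fewer particles than a direct search of the full space.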

From a general point of view, the objective of partition sampling is to use one's intuition

about the problem to choose a beneficial decomposition of the configuration space, the

dynamics, and the measurement function.

An intuitive idea of how partition sampling operates can be gained by considering the

example in Figure 3.7. The figure shows a two-dimensional configuration space (x,y), in

which there is a peak of a 2D likelihood function. Here, the original two-dimensional

configuration space has been decomposed into two partitions, one for the x coordinate, and

another for the y coordinate. Therefore, in order to locate this peak the search is split into two

stages. In the first stage, the x coordinate is explored. As a result, an area of high likelihood

in the x coordinate, the grey shaded area in Figure 3.7, is located. In the second stage, the y


coordinate is explored, but only for the particles that constituted the area of high likelihood

in the x coordinate. As a result, an area of high likelihood in the y coordinate, the hatched

area in Figure 3.7, is located. In this way the peak is located in two stages, and the combined

number of particles required to explore the two partitions is smaller than the number of

particles that would be required to directly explore the original configuration space (x, y).

Figure 3.7: An intuitive partition sampling example. The two-dimensional configuration space is divided into

two partitions, one for the x coordinate, and another for the y coordinate.
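
The two-stage search of this example can be sketched in Python as follows. The sketch assumes a Gaussian likelihood peak at the hypothetical position (3, 7), a weighting function for the first stage that depends on x only, and uniform sampling within fixed bounds; all of these, and the function names, are illustrative assumptions.

```python
import math
import random


def two_stage_search(g_x, g_xy, n_x, n_y, lo, hi, rng):
    """Locate a 2D likelihood peak by exploring x first, then y."""
    # Stage 1: explore the x coordinate and weight each hypothesis.
    xs = [rng.uniform(lo, hi) for _ in range(n_x)]
    x_weights = [g_x(x) for x in xs]
    # Keep x values in proportion to their weight (the high-likelihood band).
    kept_xs = rng.choices(xs, weights=x_weights, k=n_y)
    # Stage 2: explore y, but only for the x values kept in stage 1.
    pairs = [(x, rng.uniform(lo, hi)) for x in kept_xs]
    pair_weights = [g_xy(x, y) for x, y in pairs]
    best = max(range(n_y), key=lambda i: pair_weights[i])
    return pairs[best]


def g_x(x):
    """Hypothetical first-stage weighting function, peaked at x = 3."""
    return math.exp(-((x - 3.0) ** 2) / (2 * 0.5 ** 2))


def g_xy(x, y):
    """Hypothetical full likelihood, peaked at (3, 7)."""
    return g_x(x) * math.exp(-((y - 7.0) ** 2) / (2 * 0.5 ** 2))
```

With `n_x = n_y = 200` the search evaluates 400 hypotheses in total, whereas sampling the (x, y) plane directly at a comparable density per axis would take on the order of 200 × 200.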

Partition sampling can be used in a problem if the three following conditions hold

(MacCormick and Isard, 2000):

• The configuration space, X, can be partitioned as a Cartesian product $X = X_1 \times \cdots \times X_k$.

• The dynamics, h, can be decomposed as $h = h_1 * \cdots * h_k$ with each $h_j$ acting on

$X_j \times \cdots \times X_k$. The symbol ∗ denotes convolution.

• There are weighting functions $g_1, g_2, \ldots, g_{k-1}$ with each $g_j$ peaked in the same region as

the posterior restricted to $X_j$.

These conditions hold for articulated tracking. An example is used to describe how partition

sampling operates with articulated objects. For a rigorous description refer to (MacCormick

and Isard, 2000).

Let us consider, as an example, the articulated object of Figure 3.8. This articulated object is

composed of three links, of fixed length, connected by two-dimensional hinges, i.e. a chain

of three links. The whole object can translate (parameters x, y), rotate around the base link

centroid (parameter r), and scale (parameter s). The hinges that join the links have angle

parameters $\alpha_1$ and $\alpha_2$. If we wanted to visually track this object using a Condensation filter,

as described in Section 3.3, we would need to define a number of things: first, an articulated


contour that matches the articulations of the object and shares the same six parameters;

second, the dynamics h of the articulated object; and third, a measurement function g capable

of calculating a contour likelihood. In order to use partition sampling for this example we

can still use the same articulated contour as the one defined for Condensation, but the

configuration space, dynamics, and measurement function need to be decomposed for each

partition. A convenient decomposition of this articulated object's configuration space is:

• first partition $X_1$, parameters (x, y, r, s), corresponding to the base link.

• second partition $X_2$, parameter $\alpha_1$, corresponding to the L1 link.

• third partition $X_3$, parameter $\alpha_2$, corresponding to the L2 link.

Figure 3.8: Articulated object with three links forming a chain.

Each partition has a particle set; $X_1$ has a particle set $S_1$, $X_2$ has a particle set $S_2$, and $X_3$ has

a particle set $S_3$. The particles in each set are associated hierarchically, so that each particle

in $S_3$ will have an associated particle in $S_2$ (its parent), and each particle in $S_2$ will have

an associated particle in $S_1$. However, only a few selected particles in $S_1$ will have associated

particles in $S_2$, and only a few selected particles in $S_2$ will have associated particles in $S_3$.

These particle sets are complementary; thus, by putting together the associated particles

from each of the particle sets $S_1$, $S_2$, and $S_3$, it is possible to have a valid configuration for the

whole articulated object.

The dynamics have to decompose as $h = h_1 * h_2 * h_3$, with $h_1$ acting on $X_1$,

$X_2$, and $X_3$; $h_2$ acting on $X_2$ and $X_3$; and $h_3$ acting on $X_3$. The measurement functions

are specific for each partition; $g_1$, $g_2$, and $g_3$ will measure the likelihood of the contour

segments for the base, L1, and L2 links respectively.


Figure 3.9 shows one time-step of partition sampling for the example articulated object.

Before starting the first time-step, there will be one particle pre-selected in each particle set:

one particle in $S_1$, which will be associated to the pre-selected particle in $S_2$, and this one

will in turn be associated to another pre-selected particle in $S_3$. These three particles will

contain a known initial configuration of the articulated object's base, L1, and L2 links.

1. First partition $X_1$:
   1.1. From the selected particles in $S_1$ (of the previous time-step) generate new particles for $S_1$.
   1.2. Apply dynamics $h_1$ to each of the particles in $S_1$.
   1.3. Weight particles in $S_1$ using the measurement function $g_1$.
   1.4. Select particles from $S_1$.
2. Second partition $X_2$:
   2.1. From the selected particles in $S_1$ generate new particles for $S_2$.
   2.2. Apply dynamics $h_2$ to each of the particles in $S_2$.
   2.3. Weight particles in $S_2$ using the measurement function $g_2$.
   2.4. Select particles from $S_2$.
3. Third partition $X_3$:
   3.1. From the selected particles in $S_2$ generate new particles for $S_3$.
   3.2. Apply dynamics $h_3$ to each of the particles in $S_3$.
   3.3. Weight particles in $S_3$ using the measurement function $g_3$.
   3.4. Select particles from $S_3$.
4. Reorganize particles from $S_1$, $S_2$, and $S_3$ for the next time-step.

Figure 3.9: Algorithm for one time-step of partition sampling on the chain of links of Figure 3.8.

The first operation for each partition, [1.1], [2.1], and [3.1], uses the particles selected in the

previous partition; for [1.1] the previous partition becomes the same partition, but from the

previous time-step. A selected particle in the previous partition, referred to as parent, will

produce a number of particles in the new partition proportional to the weight of the parent

particle. The configuration of the child particles will be taken from the parent's associated

particles of the previous time-step.

The second operation for each partition, [1.2], [2.2], and [3.2], applies the relevant dynamics

to the particles of each partition. These dynamics consist of a deterministic drift plus a

stochastic diffusion.


The third operation for each partition, [1.3], [2.3], and [3.3], calculates a weight for each

particle in the respective set. Note that to measure the contour likelihood of a particle in $S_2$,

the measurement function $g_2$ needs to consider the particle's configuration together with its

associated parent's configuration in $S_1$. Likewise, to measure the contour likelihood of a

particle in $S_3$, the measurement function $g_3$ needs to consider the particle's configuration

together with its associated parents' configurations in $S_2$ and $S_1$. In other words,

in order to measure a contour hypothesis for the L2 link, the configurations of the base

and L1 links need to be known.

The fourth operation for each partition, [1.4], [2.4], [3.4], is to select the particles from a

particle set that constitute peaks of weight in the set.

At the end of the time-step [4.] the particles selected in each partition need to be reorganized

so that a particle in $S_1$ has exactly one particle associated in $S_2$, and this particle in $S_2$ has

exactly one particle associated in $S_3$. For that it may be necessary to replicate some particles

in $S_1$ or $S_2$.

Note that the sizes of the particle sets $S_1$, $S_2$, and $S_3$ can be different. The size of a particle

set is typically a trade-off between the number of particles required to reach a certain degree

of coverage, and thus accuracy, in a particular configuration space, and the desired amount

of allocated resources.
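
The time-step of Figure 3.9 can be sketched compactly in Python under some simplifying assumptions, none of which are taken from the thesis: each particle carries a full configuration tuple with one scalar per partition, so the parent links and the final reorganisation are implicit; "selecting the peaks of weight" is reduced to keeping the `n_select` highest-weight particles; and the dynamics and weighting functions are toy placeholders.

```python
import math
import random


def partition_sampling_step(selected, dynamics, weighters, sizes, n_select, rng):
    """One time-step over all partitions.

    selected: list of (configuration tuple, weight) kept from the previous step.
    dynamics[j](value, rng) moves coordinate j; weighters[j](prefix) scores the
    configuration up to and including partition j.
    """
    for j, (h, g, size) in enumerate(zip(dynamics, weighters, sizes)):
        configs = [config for config, _ in selected]
        weights = [weight for _, weight in selected]
        # x.1: each selected parent spawns children in proportion to its weight.
        children = rng.choices(configs, weights=weights, k=size)
        # x.2: apply the dynamics of partition j; other coordinates are inherited.
        moved = [c[:j] + (h(c[j], rng),) + c[j + 1:] for c in children]
        # x.3: weight each child given its own and its parents' configurations.
        weighted = [(c, g(c[:j + 1])) for c in moved]
        # x.4: select the highest-weight particles (simplified peak selection).
        weighted.sort(key=lambda cw: cw[1], reverse=True)
        selected = weighted[:n_select]
    return selected


def diffuse(value, rng):
    """Toy dynamics: no drift, Gaussian diffusion."""
    return value + rng.gauss(0.0, 0.3)


def make_weighter(target):
    """Toy weighting function peaked at a target value for the last coordinate."""
    def g(prefix):
        return math.exp(-((prefix[-1] - target) ** 2) / (2 * 0.2 ** 2))
    return g
```

Here the particle set sizes, for instance `(200, 100, 100)`, can differ per partition, which reflects the trade-off between coverage and allocated resources discussed above.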

In this example the number of parameters of the articulated object is only six. For a

configuration space of six dimensions a plain Condensation filter would probably be enough

in order to track a chain of three links (though not without using a considerable number of

particles). However, if the chain were longer, a plain Condensation filter would not be the

best choice.

When tracking a chain of N links, partition sampling greatly reduces the search in the

configuration space. In each partition the most likely configurations for a link are found,

using a Condensation time-step, and these likely configurations are used as starting points for

the search in the following link. At the end of the time-step, there will be a set of likely


configurations for each partition, all of them associated with their parents forming a series of

tree structures.

3.4.2 Incomplete particles in a chain of links

Continuing with the example of the previous section, at the end of a time-step of partition

sampling, there are three sets of selected particles, one for each partition. Recall that the

selected particles in a particle set are the particles that constitute peaks of weight in the set.

These selected particles are associated forming a series of tree structures. Before the next

time-step takes place, these particles need to be grouped in order to form a set of complete

particles. The term complete particle refers to three selected particles (one from each

partition) that are associated one to another, and that together form a valid configuration of

the articulated object. On the other hand, the term incomplete particle refers to a selected

particle in a partition whose associated particles in the following partition are not selected.

These two terms can be better understood by looking at Figure 3.10.

Figure 3.10: Particle set diagram showing two fictitious time-steps of partition sampling. Complete

particles are encircled in red.


Figure 3.10 shows two fictitious time-steps of partition sampling, for t=0, and t=1. Each of

the horizontal lines represents a particle set. There are three particle sets per time-step, one

for each partition: base, L1, and L2. The lengths of the horizontal lines are the same for all the

particle sets; however, this is just a representation and the particle sets could actually have

different sizes; for example, the base particle set could have 400 particles, and the L1

and L2 particle sets could have 100 particles each. The black dots represent the selected

particles in each particle set. At t=0 there are three selected particles in the base particle set.

This diagram is for illustrative purposes; in practice, there could be a large number of

selected particles in each particle set. Each selected particle gives rise to a number of

particles, in the following particle set, proportional to its weight. This is illustrated in the

diagram by arrows coming out of the selected particles and pointing to a section, indicated

with horizontal braces, of the following particle set. The longer the brace, the more particles

this brace involves in the particle set.

At the end of the t=0 time-step, it is possible to form two complete particles, encircled in red

on the diagram. These complete particles will be propagated to the following time-step t=1.

On the other hand, there are some selected particles that do not have any child selected

particles associated to them. These constitute incomplete particles.
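The grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Selected` record, the particle identifiers, and the `group_chain` function are hypothetical names introduced here, and each selected particle is assumed to remember the selected particle of the previous partition that generated it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Selected:
    """A selected particle and the selected particle that generated it.

    `parent` is None for particles of the first partition (hypothetical layout).
    """
    pid: str
    parent: Optional[str]


def group_chain(partitions):
    """Group the selected particles of a chain of partitions.

    Returns (complete, left_over):
      complete  -- tuples of ids, one id per partition, linked parent to child
      left_over -- ids of selected particles that belong to no complete particle
                   (the incomplete particles, plus any ancestor that has no
                   other complete descendant)
    """
    by_id = {p.pid: p for level in partitions for p in level}
    complete = []
    # Walk upwards from every selected particle of the last partition.
    for leaf in partitions[-1]:
        chain = [leaf.pid]
        node = leaf
        while node.parent is not None and node.parent in by_id:
            node = by_id[node.parent]
            chain.append(node.pid)
        if len(chain) == len(partitions):
            complete.append(tuple(reversed(chain)))
    used = {pid for chain in complete for pid in chain}
    left_over = [p.pid for level in partitions for p in level if p.pid not in used]
    return complete, left_over


# Toy data loosely following the t=0 time-step of Figure 3.10.
base = [Selected("b1", None), Selected("b2", None), Selected("b3", None)]
link1 = [Selected("l1a", "b1"), Selected("l1b", "b1"), Selected("l1c", "b3")]
link2 = [Selected("l2a", "l1a"), Selected("l2b", "l1c")]
complete, left_over = group_chain([base, link1, link2])
```

With this toy data two complete particles are formed, while `b2` and `l1b` are left without selected children and are therefore not propagated.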

Note that which particles get selected in a particle set depends on a few factors:

• The method of selection in use.

• The number and extension of the likelihood peaks in the configuration space represented

by that particle set.

• The level of coverage of a configuration space by a particle set. A sparse coverage may

result in likelihood peaks going unselected.

As a result, it is possible to find a selected particle with a large weight in its particle set, but

that produces no other selected particles in the following particle sets down the hierarchy.

This is an unfortunate waste of resources, as some particles with high weight may not be

propagated to the following time-step, just because they did not belong to a complete

particle.


3.5 Tree-structured articulated objects

The previous section discussed how articulated objects could be tracked using partition

sampling; however, it only addressed chains of links. Other articulated objects of greater

interest in this thesis, such as the human body, or a human hand, have tree structures.

Fortunately, it is possible to adapt partition sampling in order to track a tree-structured

articulated object. This section proposes a natural extension of partition sampling for

tracking of tree-structured articulated objects. This extension, however, as proposed here,

has some drawbacks, which will be resolved later in Section 3.7.

Let us consider the tree-structured articulated object of Figure 3.11(a). This is a simplified

articulated hand contour model, whose only purpose is illustrative. The hand contour model

assumes that the hand palm is always parallel to the camera's image plane, although allowing

for independent finger and thumb movements. The hand contour model is made of a hand

palm and five fingers, named L, R, M, I, T after Little, Ring, Middle, Index, and Thumb.

Each finger can rotate around its finger pivot (the black dot at the base of the finger) in order

to represent the abduction/adduction movements of the fingers. The finger pivots are situated

approximately where the Metacarpophalangeal (MPC) joint of each finger would be

expected to be. The length of the fingers can change in order to represent the 2D projection

of a finger's flexion/extension movement. The thumb is treated as if it were a finger; it is

modelled by only an angle and a length. The angle and length of a finger are denoted by

α and L respectively, with the name of the finger added as a subscript. The whole hand

is allowed to translate (x, y), rotate around the hand palm pivot (r), and scale (s). In total the

articulated hand contour has 14 parameters.

Figure 3.11: Tree-structured articulated object. (a) Simplified hand contour model with 14 DOF. (b) Particle

set tree for the articulated hand model; Palm is the parent particle set, and L, R, M, I and T are the

child particle sets.

The first step in order to use partition sampling with this object is to choose a convenient

decomposition of the configuration space, for example: one partition for the hand palm,

parameters (x, y, r, s); and one partition for each of the five fingers, each one with parameters

(α, L). The next step is to define six particle sets, six motion models, and six

measurement functions, one for each of the six partitions.

In the chain of links example, a parent partition was related to just one child partition. Here,

the hand palm partition is related to all the finger partitions at the same time, forming a two-

level tree of partitions. The particle sets associated to each partition form an equivalent two-

level tree. Figure 3.11(b) shows the particle set tree. A method for handling this tree structure

is to proceed as in the chain of links example for the case of only two levels: one level for the

hand palm particle set, and another level made by all the finger particle sets. In the first level,

hand palm, a selected particle will produce a number of particles, proportional to the selected

particle's weight, in each one of the finger particle sets. Then, after dynamics and

measurements are applied to each one of the finger sets, new particles are selected in each

one of the finger sets. Finally, selected particles from the hand palm and fingers particle sets

are grouped to form complete particles, and those are propagated to the following time-step.

The process for one time-step is described in Figure 3.12.

Unfortunately, the process of forming complete particles is more difficult with a tree-

structured articulated object than it is with a chain of links articulated object. The next

section will explain these difficulties.


1. For the parent particle set (Palm) do:
   1.1. Use the complete particles of the previous time-step to generate new particles for Palm.
   1.2. Apply dynamics to each of the particles in Palm.
   1.3. Weight particles in Palm.
   1.4. Select particles from Palm that constitute peaks of weight in the set.
2. For each of the child particle sets (L, R, M, I, and T) do:
   2.1. For each of the selected particles in Palm, generate a number of new particles in the finger particle set, proportional to the weight of the selected particle in Palm.
   2.2. Apply dynamics to each of the particles in the finger particle set.
   2.3. Weight particles in the finger particle set.
   2.4. Select particles, from the finger particle set, that constitute peaks of weight in the set.
3. Form complete particles for the next time-step.

Figure 3.12: Algorithm for one time-step of partition sampling on the tree-structured articulated object of

Figure 3.11(a).
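Step 2.1 of the algorithm can be illustrated with a short Python sketch that splits a finger particle set into subsets whose sizes are proportional to the weights of the selected palm particles. The function name `allocate_subsets` and the largest-remainder rounding are assumptions made for this illustration; the text only requires the sizes to be proportional to the weights.

```python
def allocate_subsets(parent_weights, n_children):
    """Split n_children particles among parents in proportion to their weights.

    Largest-remainder rounding (an assumption of this sketch) guarantees that
    the subset sizes add up to exactly n_children.
    """
    total = sum(parent_weights)
    if total <= 0:
        raise ValueError("at least one parent must have a positive weight")
    quotas = [w / total * n_children for w in parent_weights]
    sizes = [int(q) for q in quotas]
    # Hand the particles lost by truncation to the largest fractional parts.
    remaining = n_children - sum(sizes)
    order = sorted(range(len(quotas)),
                   key=lambda i: quotas[i] - sizes[i], reverse=True)
    for i in order[:remaining]:
        sizes[i] += 1
    return sizes


# Three selected palm particles sharing a finger particle set of 100 particles.
sizes = allocate_subsets([5, 3, 2], 100)
even = allocate_subsets([1, 1, 1], 100)
```

A palm particle with a very low weight may receive a small subset, or none at all, which is one of the reasons why a subset can end up without a selected particle.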

3.6 Incomplete particles in tree-structured articulated objects

Continuing with the previous section's example, Figure 3.13 shows a particle set diagram

representing two fictitious time-steps of partition sampling for the simplified articulated hand

of Figure 3.11(a). The horizontal lines represent particle sets, as in Figure 3.10. However, in

this diagram there are only two levels of hierarchy in each time-step; at the first level there is

the hand palm particle set, referred to as "Palm", and at the second level there are five

particle sets, one for each finger, referred to as L, R, M, I, and T. The selected particles are

indicated with black dots on the particle sets. The weight of the selected particles in the palm

particle set determines how big the portion of associated particles in the fingers particle sets

is. These portions of the fingers particle sets that are associated to the same parent particle

will be referred to as subsets, and in Figure 3.13 are separated by vertical dashed lines, four

subsets in the time-step t=0, and three subsets in the time-step t=1. The reader is reminded

that this type of particle set diagram only represents the relationships between particle sets.

In practice, the particle sets can have different sizes, and the number of selected particles

in each set can be much larger.

Each of the time-steps consists of four operations: first, the selected particles in the palm

particle set are used to generate new subsets in the fingers particle sets; second, dynamics are

applied to finger particle sets; third, measurements are applied to fingers particle sets; and


fourth, new particles are selected from the fingers particle sets. At the end of each time-step

the selected particles have to be grouped to form complete particles. In the current context a

complete particle is defined as: the combination of a selected particle in the palm particle set,

and from its associated subset, a selected particle in each of the finger particle sets.

We can see that at the end of time-step t=0 there is only one complete particle which

propagates to t=1. However, at the end of the time-step t=1 it is not possible to form any

complete particles. Figure 3.13 illustrates this potential situation. In practice, the more

links the tree-structured articulated object has, the more frequently incomplete particles

appear. This situation worsens if the finger particle sets have child particle sets of their own.

In the best case these incomplete particles are a waste of resources; in the worst case the

tracking cannot continue.
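The definition of a complete particle for the two-level tree can be made concrete with the following Python sketch. The data layout (a dictionary mapping each finger to its subsets, keyed by the parent palm particle) and the function name are hypothetical, and taking the first selected particle of a subset is only a placeholder for whatever choice an implementation makes when a subset contains several.

```python
FINGERS = ("L", "R", "M", "I", "T")


def form_complete_particles(palm_selected, finger_selected):
    """Group selected particles of a two-level tree into complete particles.

    palm_selected   -- ids of the selected particles in the palm particle set
    finger_selected -- finger -> {palm id -> ids of the selected particles
                       in the subset generated by that palm particle}
    Returns (complete, incomplete_palms).
    """
    complete, incomplete_palms = [], []
    for palm in palm_selected:
        picks = {}
        for finger in FINGERS:
            candidates = finger_selected.get(finger, {}).get(palm, [])
            if candidates:
                picks[finger] = candidates[0]  # placeholder choice
        # A palm particle is only usable if every finger subset contributed.
        if len(picks) == len(FINGERS):
            complete.append((palm, picks))
        else:
            incomplete_palms.append(palm)
    return complete, incomplete_palms


# Palm particle p1 has a selected particle in every finger subset;
# p2 only has one in the index finger subset.
selected = {f: {"p1": [f + "1"]} for f in FINGERS}
selected["I"]["p2"] = ["I9"]
complete, incomplete_palms = form_complete_particles(["p1", "p2"], selected)
```

A single empty subset is enough to discard a palm particle, which is why the problem becomes more frequent as the number of child particle sets grows.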

One way of avoiding incomplete particles is to force at least one selected particle in each

subset for each finger particle set, for example, the particle with highest weight in the subset.

However, this could lead to the selection of particles with very low weight, i.e. particles that

do not represent properly the relevant link; and this would probably constitute a waste of

particles in the child's particle set. Another possible solution is to combine various selected

particles in the same particle set, and then generate new particles so that each subset has at least

one selected particle. This leads to the following section.

3.7 Particle interpolation

The idea of particle interpolation in partition sampling involves creating new particles in a

particle set using the combined data of other particles, from the same or other particle sets,

and following certain creation constraints in order to form useful complete particles. The idea

of combining particles to form new ones, as opposed to the particle filter's primary method of

generating new particle sets, is not new in a broad sense. Various works on hybridization of

particle filters (PF) and genetic algorithms (GA) use crossover operators between particles

(Uosaki et al., 2004; Drugan and Thierens, 2004), to the extreme that in Kwok et al. (2005)

the resampling operation of a PF is completely replaced by crossovers and other techniques

popular in GAs. In the context of articulated visual tracking Pantrigo et al. (2005) used two

metaheuristics, known as path relinking, and scatter search, in order to create new particles,

from a few selected ones, that efficiently cover a search space.

Figure 3.13: Particle set diagram showing two fictitious time-steps of partition sampling for the example

articulated hand. At the end of the time-step t=0 there is one complete particle. At the end of the

time-step t=1 there are no complete particles.

However, particle interpolation as proposed here is the first attempt at interpolating particles in a partition sampling scheme with

the capability of producing particles that maintain coherence between partitions, and have

the highest possible weights. The ultimate aim of this particle interpolation is to avoid

incomplete particles. In order to describe particle interpolation we will continue with the


tree-structured articulated object example of Figure 3.11. Remember the fourth operation on

each finger particle set (see Figure 3.12). In this operation the particles that constitute peaks

of weight in the particle set are selected; however, there is no guarantee that each of the

subsets will contain at least one selected particle, which results in potential incomplete

particles. For this reason, the first aim of particle interpolation is to generate a new particle

for each subset, and the second aim is that the generated particle has the highest possible

weight.

Let us consider the particle with highest weight in a finger particle set. This particle has the

highest weight because the finger contour it represents matches the image features, a finger,

better than the others. This particle gives us information about where the real finger in the

image is. Particle interpolation consists of generating a particle, for each subset, that shares

some of the image features of the particle with highest weight, and is therefore expected to

achieve a high weight too, while being defined relative to the subset's parent particle in the palm

particle set. This process is illustrated in the particle set diagram of Figure 3.14. In this

particle set diagram there is a particle set for the palm, and a particle set for each of the five

fingers, for a time-step t=0. The palm particle set has four selected particles, and therefore

there are four subsets in the finger particle sets. The large black dots in each of the finger

particle sets represent the particle with highest weight in that set. The smaller red dots in the

finger particle sets represent the interpolated particles, one for each subset. The interpolated

particles in each particle set are calculated by combining data from the particle with highest

weight in the particle set and the parent of each one of the subsets; this is represented with

red arrows in the diagram, coming from the particle with highest weight and going to the

interpolated particles in each of the subsets.

We can see from the diagram in Figure 3.14 that at the end of the time-step there will be as

many complete particles as selected particles in the palm particle set. However, although the

interpolated particles are generated in a way that makes them likely to have a high weight, the

exact weight is not known. In order to be able to form complete particles with a known

weight, the interpolated particles need to be weighted.


Figure 3.14: Particle set diagram showing the particle interpolation process. The big black dots in the

finger particle sets are the particles with highest weight in the set. The smaller red dots in the finger

particle sets are the interpolated particles, one for each subset.

1. For the Palm particle set do:
   1.1. Use the complete particles of the previous time-step to generate new particles for Palm.
   1.2. Apply dynamics to each of the particles in Palm.
   1.3. Weight particles in Palm.
   1.4. Select particles from Palm that constitute peaks of weight in the set.
2. For each of the finger particle sets, i.e. L, R, M, I, and T, do:
   2.1. For each of the selected particles in Palm, generate a number of new particles in the finger particle set, proportional to the weight of the selected particle in Palm (the finger particles that come from the same parent particle in Palm form a subset).
   2.2. Apply dynamics to each of the particles in the finger particle set.
   2.3. Weight particles in the finger particle set.
   2.4. Select the particle with highest weight in the finger particle set.
   2.5. Generate a new interpolated particle for each subset, based on the particle selected in step 2.4.
   2.6. Weight the interpolated particles.
3. Form complete particles for the next time-step.

Figure 3.15: Algorithm for one time-step of partition sampling, and particle interpolation. The algorithm is

based on the articulated hand model of Figure 3.11.
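Steps 2.4 to 2.6 can be illustrated with a deliberately simplified one-dimensional analogue, sketched below in Python. It is not the hand model: here a palm particle is reduced to a single position, a finger particle to an offset relative to its parent, and the weight depends only on the absolute finger position. All names (`interpolation_step`, `weigh`, the identifiers `p1` and `p2`) are hypothetical.

```python
import math


def interpolation_step(finger_particles, palm_positions, weigh):
    """One-dimensional analogue of steps 2.4 to 2.6.

    finger_particles -- list of (palm id, offset relative to that palm)
    palm_positions   -- palm id -> position of the selected palm particle
    weigh            -- function from absolute finger position to weight
    Returns palm id -> (interpolated offset, weight), one entry per subset.
    """
    def absolute(particle):
        palm_id, offset = particle
        return palm_positions[palm_id] + offset

    # Step 2.4: the particle with highest weight in the whole finger set.
    best = max(finger_particles, key=lambda p: weigh(absolute(p)))
    target = absolute(best)

    interpolated = {}
    for palm_id, palm_pos in palm_positions.items():
        offset = target - palm_pos                 # step 2.5: relative to parent
        weight = weigh(palm_pos + offset)          # step 2.6: weight it
        interpolated[palm_id] = (offset, weight)
    return interpolated


# The "image" contains a finger at position 10.
weigh = lambda position: math.exp(-(position - 10.0) ** 2)
palms = {"p1": 0.0, "p2": 3.0}
fingers = [("p1", 9.0), ("p1", 12.0), ("p2", 2.0)]
result = interpolation_step(fingers, palms, weigh)
```

In this toy setting every subset receives an interpolated particle that reaches the same absolute position as the best particle, so all interpolated weights coincide with the best weight. In the hand model the weights generally differ, which is why step 2.6 is needed.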

Figure 3.15 shows the algorithm for one time-step of partition sampling, including particle

interpolation, for the example articulated hand model. The algorithm with particle


interpolation differs from that of Figure 3.12 in steps [2.4], [2.5], [2.6], and [3]. In step

[2.4] the particle with highest weight in the finger particle set is selected; this is a simpler

step than [2.4] without particle interpolation. Steps [2.5] and [2.6] are new. Finally, step

[3] is simpler than the equivalent step without particle interpolation, since in this case there is

exactly one complete particle for each subset.

3.7.1 Generating interpolated particles

This section describes the criteria used to generate interpolated particles. These criteria may

appear to be an "ad hoc" solution because they use specific knowledge about the problem, i.e.

visually tracking a hand that moves in a plane parallel to the image plane, using the

simplified contour model of Figure 3.11. However, similar criteria could be applied in other

partition sampling applications by interpreting the relationships between particle parameters.

We start by considering the particle with highest weight in a finger particle set. In the

previous section, we reasoned that this particle has the highest weight because the finger

contour it represents matches the image features, a finger, better than the others. Particle

interpolation assumes that the generated particle will also have high weight because it is

constructed in a way that shares some of those image features. The proposed method to

generate an interpolated particle makes use of this assumption with an additional constraint,

that the generated particle has to be consistent with its parent particle.

Let us consider a simplified version of the articulated hand model of Figure 3.11. This

version only includes the palm and the little finger. Let us consider two particles A, and B,

which use this simplified model, see Figure 3.16. Particle A is formed by two partitions: Ap

corresponding to the palm, and dependent on this one, Af corresponding to the finger.

Similarly, particle B is formed by Bp, and Bf. Now, let us assume that Af is the particle with

highest weight in the finger particle set, and Bf is a particle with low weight in the same

particle set. The interpolation procedure finds new parameters (length, and angle) for Bf in

order that it can share some image features with Af. The goal of this operation is to maximize

Bf's weight, while taking into account that the two particles come from different parents: Ap,

and Bp. Some example rules that attempt to maximize Bf's weight in this manner could be:


(rule 1) Bf maximizes its weight if its fingertip coordinates are the same as Af's

fingertip coordinates.

(rule 2) Bf maximizes its weight if both Bf and Af share a common point along

their respective major axes.

(rule 3) Bf maximizes its weight if it also maximizes its area overlap with Af.

(rule 4) Bf maximizes its weight if it has the same angle as Af.

Other rules are possible; however, after some experimentation (rule 1) produced the best

results and was adopted in the implementations of Chapter 4. The implementation of this rule

is detailed next, with the aid of Figure 3.16.

Figure 3.16: Graphical representation of the interpolation process using (rule 1). (a) Particle A's finger, Af,

has the highest weight. (b) Bf parameters $(\alpha_B, L_B)$ are updated in order that Bf's fingertip

coordinates match those of Af's fingertip.

In Figure 3.16(a) we can see a graphical representation of particle A's state, and particle B's

state. $(\mathit{cenX}_A, \mathit{cenY}_A)$ and $(\mathit{cenX}_B, \mathit{cenY}_B)$ are the palm pivots of Ap and Bp respectively.

$(\mathit{pivX}_A, \mathit{pivY}_A)$ and $(\mathit{pivX}_B, \mathit{pivY}_B)$ are the finger pivots of Af and Bf respectively. These

finger pivots can be calculated from the palm pivots and the palm's state, i.e. translation,

rotation, and scale. $(\mathit{tipX}_A, \mathit{tipY}_A)$ and $(\mathit{tipX}_B, \mathit{tipY}_B)$ are the fingertips of Af and Bf

respectively. These fingertips can be calculated from the finger pivots and the finger's state,

i.e. angle and length. We assume that Af has the highest weight, and Bf has a low weight. In

order for Bf to maximize its weight, its fingertip coordinates must be the same as Af's

fingertip coordinates, as shown in Figure 3.16(b). This can be achieved by updating Bf's parameters

$(\alpha_B, L_B)$ in the following manner:

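$$dx = \mathit{tipX}_A - \mathit{pivX}_B$$

$$dy = \mathit{tipY}_A - \mathit{pivY}_B$$

$$\mathit{AngleF} = \tan^{-1}\left(\frac{dx}{dy}\right) \qquad (3.12)$$

$$\mathit{LengthF} = \sqrt{dx^2 + dy^2} \qquad (3.13)$$

$$\alpha'_B = \mathit{AngleF} - \mathit{OriginalFingerAngle} - \mathit{angle} \qquad (3.14)$$

$$L'_B = \frac{\mathit{LengthF}}{\mathit{OriginalFingerLength} \cdot \mathit{scale}} \qquad (3.15)$$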

Equations (3.12) and (3.13) calculate the angle and Euclidean distance between Bf's pivot

and Af's fingertip. Equations (3.14) and (3.15) apply a normalisation to AngleF and LengthF

in order that $\alpha'_B$ is relative to Bp's rotation, angle; and $L'_B$ is a number between 0 and 1. In

these equations OriginalFingerAngle and OriginalFingerLength are the angle and length of

the finger in the template position, i.e. for $\alpha = 0$ and $L = 1$; and angle and scale are the

rotation and scale parameters of Bp's state.

Following this rule we can generate new finger particles for any palm particles, and the

weight of these new particles is likely to be high.
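The update of (rule 1) can be sketched in Python as follows. The sketch rests on assumptions that are not stated in the text: angles are taken to be measured from the image y-axis, the quadrant-aware `math.atan2` is used in place of a plain inverse tangent, and the forward model `finger_tip` is a hypothetical helper written only so that the round trip can be checked.

```python
import math


def finger_tip(pivot, alpha, length, palm_angle, palm_scale,
               orig_angle, orig_length):
    """Hypothetical forward model: fingertip reached from a finger pivot.

    Angles are measured from the +y axis (an assumption of this sketch).
    """
    theta = alpha + orig_angle + palm_angle
    reach = length * orig_length * palm_scale
    return (pivot[0] + reach * math.sin(theta),
            pivot[1] + reach * math.cos(theta))


def interpolate_rule1(tip_a, pivot_b, palm_angle_b, palm_scale_b,
                      orig_angle, orig_length):
    """New (alpha, L) for Bf so that its fingertip coincides with Af's."""
    dx = tip_a[0] - pivot_b[0]
    dy = tip_a[1] - pivot_b[1]
    angle_f = math.atan2(dx, dy)                          # cf. Equation (3.12)
    length_f = math.hypot(dx, dy)                         # Equation (3.13)
    alpha_b = angle_f - orig_angle - palm_angle_b         # Equation (3.14)
    length_b = length_f / (orig_length * palm_scale_b)    # Equation (3.15)
    return alpha_b, length_b


# Af: finger of the particle with highest weight (illustrative numbers).
ORIG_ANGLE, ORIG_LENGTH = 0.05, 40.0
tip_a = finger_tip((10.0, 20.0), 0.1, 0.9, 0.2, 1.1, ORIG_ANGLE, ORIG_LENGTH)

# Bf hangs from a different palm particle, with its own pivot, angle and scale.
pivot_b, angle_b, scale_b = (14.0, 18.0), -0.1, 0.95
alpha_b, length_b = interpolate_rule1(tip_a, pivot_b, angle_b, scale_b,
                                      ORIG_ANGLE, ORIG_LENGTH)
tip_b = finger_tip(pivot_b, alpha_b, length_b, angle_b, scale_b,
                   ORIG_ANGLE, ORIG_LENGTH)
```

After the update, the fingertip of Bf coincides with that of Af even though the two fingers are attached to different palm particles. A practical implementation would presumably also clamp the resulting angle and length to their valid ranges.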

3.7.2 Differences from Condensation

We have seen how particle interpolation can be used with partition sampling, and tree-

structured articulated objects. In this context, particle interpolation is essential in order to

prevent tracking from terminating, due to inability to form complete particles. At the same

time, particle interpolation provides an efficient solution in the sense of propagating as many

particles as possible from one time-step to the next one, and ensuring that these particles

have high weights. Particle interpolation makes it possible that, for each selected particle in

the palm particle set, a complete particle is formed and propagated to the next time-step.

However, the particle selection policies in partition sampling differ slightly from the

resampling operation in Condensation.


In the Condensation algorithm, ideally, the resampling operation does not affect the

distribution represented by the sample set. On the one hand, the resampling operation avoids

the unbounded growth of the sample set as it evolves from one time-step to the next one. On

the other hand, the resampling operation introduces noise in the sample set, from one time-

step to the next one, in order to prevent the sample set from degenerating into a few particles

that represent only the peaks of the distribution. It turns out that a particle set represents a

distribution more efficiently when most of the particles have equal weights (MacCormick,

2000).
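For comparison, a generic resampling draw of the kind used in particle filters can be sketched as below. This is a minimal illustration, not the implementation used in this thesis: particles are drawn with replacement with probability proportional to their weights, after which all of them carry the same weight. The function name `resample` is hypothetical.

```python
import random


def resample(particles, weights, n, rng=random):
    """Draw n particles with replacement, proportionally to their weights.

    After the draw every particle carries the same weight 1/n, so the
    set still represents the same distribution, only with equal weights.
    """
    chosen = rng.choices(particles, weights=weights, k=n)
    return chosen, [1.0 / n] * n


# A degenerate example: all the weight sits on particle "b".
rng = random.Random(0)
chosen, new_weights = resample(["a", "b", "c"], [0.0, 1.0, 0.0], 5, rng)
```

High-weight particles are duplicated and low-weight particles tend to disappear, but no explicit selection of peaks takes place, which is the difference from the selection policy of partition sampling discussed next.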

As opposed to the idea of resampling in Condensation, partition sampling produces a particle

set that represents only the peaks of the distribution. The benefit of partition sampling is that

we can explore high-dimensional configuration spaces using far fewer

particles than an equivalent Condensation filter would require. This is possible thanks to the

hierarchy between partitions. The search results in parent partitions allow focusing the

search for the child partitions, which in turn have a smaller space to search from. Particle

interpolation adds an extra focusing into the child's partition search space, by generating

particles with high weight, and ensuring that the same number of particles, that generated the

particle set, is propagated to the child's particle set.

The main reason for using a Condensation filter, instead of a Kalman filter, for visual

tracking is that Condensation's capability of keeping several hypotheses of the target

makes it resistant to background clutter and partial occlusion. In these situations some

hypotheses with medium weight will be carried from one time-step to the next one, because

they could turn later into high weight hypotheses. The fact that partition sampling, together

with particle interpolation, focuses the particles on the peaks of the underlying distribution,

may seem to reduce the tracking resistance against background clutter. However, despite the

chance that the clutter resistance of partition sampling and particle interpolation may be

reduced in comparison to an equivalent Condensation filter, it is not eliminated, as a number of

different hypotheses are always propagated between time-steps. On the other hand, partition

sampling and particle interpolation offer a robust, versatile, and time-efficient solution that

Condensation alone would not be able to offer.


Finally, as a comparison note, it is worth mentioning that the articulated tracking in this

thesis is based on the work of MacCormick and Isard (2000). They presented an interface-quality

hand tracking, which involved a tree-structured articulated hand contour and partition

sampling. The hand's configuration space was divided into 4 partitions: hand palm, index

finger, and two partitions for the thumb. However, the approach taken in their implementation

was to deal with each of the partitions in sequence: the particles selected in partition 1

generate the particles in partition 2, the particles selected in partition 2 generate the particles

in partition 3, and so on. Finally, the particles selected in partition 4 generate the particles in

partition 1, for the next time-step. This strategy avoids the task of forming complete

particles. However, in their solution, each time an extra partition is involved in the tracking,

the number of particles required to keep a fixed level of performance grows faster than the

number of particles required in our solution, for each extra partition. They point out that their

approach is valid mathematically but an approach that takes into account the tree structure

would be more appropriate and possibly more efficient.


4 Implementations and results

Chapter 3 presented Blake and Isard's framework for contour tracking. The chapter reviewed

the framework elements and the Condensation algorithm; and it was shown how partition

sampling can be used in order to track articulated objects, and tree-structured articulated

objects such as a human hand. Finally, particle interpolation was presented as a novel

solution to the incomplete particles problem in tree-structured articulated objects.

In this chapter, these techniques are put into practice with the implementation of two

articulated hand trackers. The first hand tracker, which we will refer to as the particle-set

implementation, implements exactly the concepts of Chapter 3. The second hand tracker,

which we will refer to as the sweep implementation, implements the same general ideas, but

using a deterministic search for the fingers. Both hand trackers use the same articulated hand

contour model, and a skin colour based measurement model, which provides various

advantages in comparison to the measurement models used by Blake and Isard.

There are numerous possible improvements to the two articulated hand trackers presented in

this chapter; however, these improvements are not used here and will be

covered later in Chapter 7.


4.1 Articulated hand contour model

The hand tracking techniques used in this thesis use the assumption that the user's hand palm

is always approximately parallel to the camera's image plane, although allowing for

independent finger and thumb movements. In order to support tracking of a hand in such a

configuration we built a hand contour model, shown in Figure 4.1(a). The hand contour

model is an articulated BSpline template constructed from 50 control points. Each finger can

rotate around its finger pivot in order to represent the abduction/adduction movements of the

fingers. The finger pivots are situated approximately where the Metacarpophalangeal (MPC)

joint of each finger would be expected. The length of the fingers can change in order to

represent the 2D projection of the finger's flexion/extension movement as it would be

perceived from the camera's point of view. Finally, the thumb consists of two segments

which can rotate around their pivots but, contrary to the fingers, the thumb segments

have a constant length (as the thumb is assumed to flex only on the same plane as the palm),

see Figure 4.1(b). Altogether the articulated hand contour has 14 DOF which are represented

by the following state vector:

$$(x, y, \alpha, \lambda, \theta_0, l_0, \theta_1, l_1, \theta_2, l_2, \theta_3, l_3, \theta_4, \theta_5)$$

Where

$x$ is the x-coordinate for the centroid of the hand.

$y$ is the y-coordinate for the centroid of the hand.

$\alpha$ is the rotation angle of the whole hand.

$\lambda$ is the scale of the whole hand.

$\theta_0, \theta_1, \theta_2, \theta_3$ are the angles with respect to the hand palm for the little, ring, middle, and index finger respectively.

$l_0, l_1, l_2, l_3$ are the lengths of the little, ring, middle, and index finger respectively, as perceived from the camera's point of view.

$\theta_4$ is the angle of the first segment of the thumb with respect to the hand palm.

$\theta_5$ is the angle of the second segment of the thumb with respect to the first segment of the thumb.


The first four parameters $(x, y, \alpha, \lambda)$ are a non-linear representation of a Euclidean similarity

transform applied to the whole BSpline template. The finger and thumb angles are 0º when

they are in the template position. The angles of the fingers and thumb are only allowed to

change within a range of valid angles. Note that in Figure 4.1(b) the two thumb pivots are off

the major axis of their corresponding thumb segments. This offset allows for a better

mapping of the thumb's articulated contour movement to the real thumb joints movement, as

seen in a 2D projection. The length of the fingers is relative to the template's finger lengths;

it is 1 when they have the same length as in the template, smaller than 1 when shorter, and

greater than 1 when longer. The two thumb segments have a fixed length of 1.

Figure 4.1: Hand contour model. (a) Hand contour showing its 50 control points. (b) Articulated hand contour

showing its joint parameters.
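The state vector can be held in a small Python record such as the one sketched below. The class name `HandState`, the field names, and the default values for the whole-hand rotation and scale are assumptions made for illustration; the finger and thumb defaults follow the template position described above (angles of 0 and relative lengths of 1).

```python
from dataclasses import dataclass, astuple


@dataclass
class HandState:
    """One configuration of the articulated hand contour (14 DOF).

    Fields are declared in the same order as the state vector.
    """
    x: float              # x-coordinate of the hand centroid
    y: float              # y-coordinate of the hand centroid
    alpha: float = 0.0    # rotation of the whole hand (default is an assumption)
    lam: float = 1.0      # scale of the whole hand (default is an assumption)
    theta0: float = 0.0   # little finger angle
    l0: float = 1.0       # little finger length, relative to the template
    theta1: float = 0.0   # ring finger angle
    l1: float = 1.0       # ring finger length
    theta2: float = 0.0   # middle finger angle
    l2: float = 1.0       # middle finger length
    theta3: float = 0.0   # index finger angle
    l3: float = 1.0       # index finger length
    theta4: float = 0.0   # first thumb segment, relative to the palm
    theta5: float = 0.0   # second thumb segment, relative to the first

    def as_vector(self):
        """Return the parameters as a flat tuple in state-vector order."""
        return astuple(self)


template = HandState(x=160.0, y=120.0)
vector = template.as_vector()
```

The thumb segments carry no length parameter because their length is fixed at 1.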

4.2 Dynamical model

The dynamical model used in the proposed hand tracker consists of one-dimensional

oscillators, one for each parameter of the articulated hand model, as described in (Blake and

Isard, 1998). Each oscillator is defined by three parameters: a damping constant β, a natural

frequency f, and a root-mean-square average displacement ρ. Table 4.1 shows the

parameter values used in the dynamical model. These values were found empirically to suit

how the user's hand moves. Note that the natural frequencies of all the oscillators are zero.

This means that the parameters of the articulated hand model are not expected to oscillate,

although the dynamical model can represent oscillations if needed.
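As an illustration of how such an oscillator can be simulated, the sketch below assumes the usual discrete-time second-order autoregressive form, x_k = a1 x_{k-1} + a2 x_{k-2} + b w_k, with x measured relative to its mean and w_k unit Gaussian noise. The exact discretisation used in the implementation is not reproduced here, and the frame interval `tau` is a hypothetical value. The noise gain b is chosen so that the steady-state root-mean-square displacement equals ρ.

```python
import math


def oscillator_coefficients(beta, f, rho, tau):
    """Coefficients of a damped oscillator sampled every tau seconds.

    beta -- damping constant (1/s), f -- natural frequency (Hz),
    rho  -- desired root-mean-square displacement, tau -- frame interval (s).
    """
    a1 = 2.0 * math.exp(-beta * tau) * math.cos(2.0 * math.pi * f * tau)
    a2 = -math.exp(-2.0 * beta * tau)
    # Steady-state variance of the AR(2) process for a unit noise gain.
    unit_variance = (1.0 - a2) / ((1.0 + a2) * ((1.0 - a2) ** 2 - a1 ** 2))
    b = rho / math.sqrt(unit_variance)
    return a1, a2, b


def step(x_prev, x_prev2, coefficients, noise):
    """Advance one parameter by one time-step."""
    a1, a2, b = coefficients
    return a1 * x_prev + a2 * x_prev2 + b * noise


# Values of the x parameter in Table 4.1; tau = 0.04 s is hypothetical.
coefficients = oscillator_coefficients(beta=6.0, f=0.0, rho=50.0, tau=0.04)
a1, a2, b = coefficients
decayed = step(1.0, 1.0, coefficients, noise=0.0)
```

With f = 0 the two poles of the process coincide, so a displacement decays towards the mean without oscillating, in agreement with the remark above.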


4.3 Measurement model

The two hand contour tracker implementations presented in this chapter use a measurement

model based on that described in Section 3.2. This section proposes a measurement model

that uses measurement lines too (as in Section 3.2), but the processing of these lines differs

from that of Section 3.2 in that only skin colour features are used.

Parameter   β (s⁻¹)   f (Hz)   ρ
x           6         0        50 pixels
y           6         0        45 pixels
α           6         0        0.3 rad
λ           6         0        0.1
θ0          8         0        0.2 rad
l0          10        0        0.2 pixels
θ1          8         0        0.2 rad
l1          10        0        0.2 pixels
θ2          8         0        0.2 rad
l2          10        0        0.2 pixels
θ3          8         0        0.2 rad
l3          10        0        0.2 pixels
θ4          8         0        0.2 rad
θ5          8         0        0.2 rad

Table 4.1: Parameter values for the hand tracker dynamical model.

4.3.1 Measurement lines

Figure 4.2 shows the location of the measurement lines used in the articulated hand contour

of Section 4.1. There are 70 measurement lines, 19 on the palm, 10 on each finger, 5 on the

first thumb segment, and 6 on the second thumb segment. The measurement lines on the

hand palm and thumb are normal to the contour. However, the measurement lines in the

fingers are normal to the finger's axis, for the 8 closest lines to the hand, and parallel to the


finger's axis, for the 2 lines on the fingertip. The average length of a measurement line is 20

pixels (see footnote 2).

Blake and Isard suggest using an anti-aliasing scheme, such as bilinear interpolation, when sampling their measurement lines. In contrast, the approach adopted in this thesis is to retrieve the image pixels along a measurement line drawn with the Bresenham line algorithm. This approach may seem less accurate; however, if an anti-aliasing scheme were used on the measurement lines, other inaccuracies present in the hand contour model (see footnote 3) would obscure the gain in accuracy. As we will soon see, the pixels along the measurement line are classified as skin colour or non-skin colour, and interpolating the pixel values could interfere with this classification. Finally, non-anti-aliased measurement lines can be sampled faster. This matters for real-time performance, since the measurement lines are heavily used in the tracking algorithms.
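The sampling strategy described above can be sketched as follows. This is a minimal illustration using the standard integer Bresenham algorithm; the function names and the row-major `image[y][x]` layout are assumptions of the sketch, not details given in the thesis.

```python
def bresenham(x0, y0, x1, y1):
    """Return the integer pixel coordinates on the line from (x0, y0) to (x1, y1)."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    points = []
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:      # step along x
            err += dy
            x0 += sx
        if e2 <= dx:      # step along y
            err += dx
            y0 += sy
    return points

def sample_line(image, x0, y0, x1, y1):
    """Read pixel values along the line, with no interpolation between pixels."""
    return [image[y][x] for x, y in bresenham(x0, y0, x1, y1)]

horizontal = bresenham(0, 0, 20, 0)
diagonal = bresenham(0, 0, 14, 14)
```

The number of pixels visited equals the larger of the two coordinate differences plus one, so it depends on the orientation of the line as well as on its Euclidean length.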

Figure 4.2: Measurement lines used in the articulated hand contour.

2 When rotating a line of length L, drawn using a Bresenham algorithm, the resulting line can sometimes involve a number of pixels slightly different than L. These are the pixels that later will be scanned to find features.

3 The hand contour model can only undergo Euclidean similarity transformations. It is assumed that the user will keep the hand parallel to the image plane of the camera; however, this will often not be the case, resulting in the hand contour not fitting the tracked hand perfectly.


4.3.2 Skin colour based measurement

In Section 3.2, we saw how the measurement lines are processed in order to find image features. Typically these image features are edges and valleys. MacCormick and Isard (2000), however, have used edges in combination with skin colour for their hand contour tracking: they first calculate a likelihood based on the edges found along the measurement line, and then increase or decrease this likelihood after testing whether the correct end of the measurement line is in a skin colour area (the hand interior) and the other end is in a non-skin colour area (the exterior of the hand). The hand trackers presented in this chapter use skin colour in a rather different way.

In Chapter 5 a skin colour classifier, called the LC classifier, is presented. This skin colour

classifier possesses a number of interesting features that make it especially suitable for use in

HCI. Some of these features are: its computational time and storage efficiency; its tuning to a

specific person's skin colour; its resistance to illumination (brightness) changes; and that the

detected skin colour areas tend to be solid and clearly defined without requiring post-

processing. This classifier makes it possible to detect the edge of an object along a

measurement line, based entirely on the skin colour information. The edge of an object found

in this manner is referred to as a skin edge. This is the approach used in all the hand tracking

experiments of this thesis, unless otherwise specified. Figure 4.3 shows the output of the LC

skin colour classifier applied to the whole image, with the measurement lines drawn on top

in order to see where the skin edges of the hand would be detected.

Figure 4.3: Skin colour image with the measurement lines on top.


The position of the skin edges on each measurement line is used to calculate the contour

likelihood. However, the procedure used in this thesis to calculate the contour likelihood

differs slightly from that explained in Section 3.2. In Section 3.2 a simplified expression to

calculate the contour likelihood is given in Equation 3.8. This expression involves the

calculation of a log-likelihood for each measurement point. These log-likelihoods are then

added and an exponential is taken over the total in order to obtain a probability. The

approach used in this thesis to calculate the contour likelihood does not use probabilities; it

uses a score instead, although, effectively, this score works as a likelihood.

The procedure to calculate the contour likelihood is as follows. Each measurement line is scanned, starting from the outer part of the line, until two consecutive pixels are classified as skin colour; this point is called SkinEdge. The distance from this position to the midpoint of the line is then used to access a look-up table of scores. The scores obtained in this way, one for each measurement line, are multiplied together, resulting in the final contour score. Figure 4.4 shows the score look-up table. The look-up table is made from Gaussian values taken at integer distances. The Gaussian's standard deviation was found empirically and the mean is situated at the origin. The Gaussian is transformed so that its maximum value is 2 and its minimum value is 0.5. This means that each measurement point can potentially double or halve the score. If no SkinEdge is found along the measurement line, the contribution of the line is to halve the score. Notice that with this measurement function the score for the whole hand, with its 70 measurement lines, could potentially go from $2^{70} \approx 1.18 \times 10^{21}$ to $(0.5)^{70} \approx 8.47 \times 10^{-22}$. Figure 4.5 shows the procedure to calculate the contour score.

Figure 4.4: Score look-up table.


1. Score = 1
2. For each of the measurement lines in a contour repeat:
   2.1. For each of the pixels along the measurement line, starting from the end of the line exterior to the contour, use the LC skin colour classifier to determine whether the pixels are skin colour or not.
   2.2. Once two consecutive pixels are classified as skin colour, stop scanning the line and store in i the position just before that point.
   2.3. Calculate the distance between i and the midpoint of the measurement line. This is the SkinEdge value.
   2.4. Then use SkinEdge to access the look-up table in Figure 4.4: Score = Score * Look-upTable[SkinEdge]
3. End.

Figure 4.5: Algorithm to calculate the contour's score.

This approach has a number of advantages with respect to that of Section 3.2:

• Its implementation only requires multiplications and accesses to a look-up table, avoiding the use of an exponential for each measurement point. The contour likelihood is therefore calculated faster than with the approach of Section 3.2. This is important for achieving real-time operation, as the measurement function is heavily used in the tracking algorithm.

• It allows penalising the score for measurement lines that do not follow the right non-skin/skin pattern. If a measurement line does not contain any skin colour pixels, its contribution is to halve the score. If a measurement line only contains skin colour pixels, its contribution is to halve the score. If a measurement line contains first skin colour and then non-skin colour, its contribution is to halve the score. Only when a measurement line follows the pattern of first non-skin colour and then skin colour can its contribution be greater than 0.5. This results in a very sensitive measurement function, in the sense that the contour hypotheses that are very close to the real contour will have a very large score, while the remaining contour hypotheses will have, on average, a score around 1. This makes the measurement function very resistant to outliers.

One disadvantage of this measurement function is that, because it does not use probabilities, the Condensation algorithm can no longer be interpreted in terms of the propagation of conditional probabilities. However, in practical terms, this does not affect the tracking performance.
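The scoring procedure of Figure 4.5 can be sketched as follows. The affine rescaling of the Gaussian (0.5 + 1.5·g), the value of sigma, the reading of step 2.2 as "the index of the first pixel of the skin pair", and all function names are assumptions of this sketch; the thesis only states that the table ranges between 0.5 and 2 and that its standard deviation was found empirically.

```python
import math

def make_score_table(max_distance, sigma):
    """Gaussian centred at distance 0, rescaled to peak at 2 and decay towards 0.5."""
    return [0.5 + 1.5 * math.exp(-(d * d) / (2.0 * sigma * sigma))
            for d in range(max_distance + 1)]

def find_skin_edge(is_skin):
    """Scan from the outer end; return the index where two consecutive skin pixels start."""
    for i in range(len(is_skin) - 1):
        if is_skin[i] and is_skin[i + 1]:
            return i
    return None  # no skin edge on this line

def contour_score(lines, table):
    """Multiply the per-line scores; `lines` holds one list of booleans per measurement line."""
    score = 1.0
    for is_skin in lines:
        edge = find_skin_edge(is_skin)
        if edge is None:
            score *= 0.5  # a line without a skin edge halves the score
            continue
        distance = abs(edge - len(is_skin) // 2)
        score *= table[min(distance, len(table) - 1)]
    return score

table = make_score_table(max_distance=10, sigma=3.0)   # sigma is a placeholder value
ideal_line = [False] * 10 + [True] * 10                 # edge exactly at the midpoint
best = contour_score([ideal_line] * 70, table)
worst = contour_score([[False] * 20] * 70, table)
```

A line made entirely of skin pixels yields an edge at the outer end, far from the midpoint, so its table entry is close to 0.5; this is how the wrong skin/non-skin patterns end up being penalised.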


4.4 Resampling scheme

Section 3.3 described the resampling operation, within the Condensation algorithm, as an operation that does not affect the distribution represented by the sample set. However, when using partition sampling this changes slightly: for efficiency reasons, the resampling operation on a particle set generates a new particle set containing only the weight peaks of the original particle set, as discussed in Section 3.7.2.

There are various ways of implementing this resampling operation for partition sampling. In

(Isard and MacCormick, 2000) a threshold is used. All the particles with weight above the

threshold are selected, in order to generate the new particle set. The threshold is calculated,

for each time-step, from the particle with largest weight minus a constant offset. In this

method of resampling, there are two extreme situations which can potentially arise: in the

first situation only one particle is selected; and in the second situation all the particles are

selected; the number of particles selected depends on the weight distribution of the particle

set. This method of resampling may be suitable for their partition sampling implementation,

in which partitions are processed in a circular sequence, see final note in Section 3.7.2.

However, in the implementation of partition sampling for tree-structured articulated objects described in Section 3.5, the threshold-based resampling scheme would cause the number of particles required, and consequently the time required, to process the partitions to vary widely. For this reason, the chosen method of resampling is one in which the number of selected particles is fixed. In each resampling operation, the N highest weight particles in the particle set are selected. This guarantees that each time-step has a fixed duration, resulting in a stable frame rate throughout the tracking. Selecting the highest weighted 10% of the particles in the particle set was found to produce good results. Isard and Blake (1998b) also observed this to be the percentage of particles that may have high enough weight to be used as a base, that is, to be used to generate particles in the following particle set.

One feature added to the resampling scheme is that the particle with highest weight in the

original particle set is always copied to the new particle set. The particle with highest weight

in a particle set is used for display purposes. Hence, if there is no change at all in the

configuration of the tracked object, from one time-step to the next, the output of the tracking


will fit the object at least as well as in the previous time-step. This technique helps the

tracker to display a more stable output.
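The difference between the two resampling strategies discussed in this section can be sketched as follows. The function names and the use of a non-strict comparison against the threshold are assumptions of this sketch.

```python
def select_by_threshold(weights, offset):
    """Threshold scheme: keep particles whose weight reaches (largest weight - offset).

    The number of selected particles varies with the weight distribution.
    """
    threshold = max(weights) - offset
    return [i for i, w in enumerate(weights) if w >= threshold]

def select_top_fraction(weights, fraction=0.10):
    """Fixed-size scheme: keep the N highest-weight particles, best first.

    N depends only on the set size, so every time-step costs the same.
    The best particle is always index 0 of the result, so it is always kept.
    """
    n = max(1, int(round(fraction * len(weights))))
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:n]

weights = [float(i) for i in range(20)]
kept = select_top_fraction(weights)          # 10% of 20 particles
```

With a 250-particle set, as used for the palm later in this chapter, the fixed-size scheme always selects 25 particles regardless of how the weights are distributed.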

4.5 Particle-set implementation

Here we present an articulated hand tracker capable of tracking the contour of a hand through

a video sequence. The tracker can handle the rigid movement of the hand, and the

independent movement of each finger, according to the hand contour model described in

Section 4.1. This implementation is referred to as the particle-set version, because each of

the partitions is covered by a particle set, exactly as described in Chapter 3. This hand tracker

uses the following features:

• Partition sampling and particle interpolation as described in Chapter 3.

• The articulated hand contour model described in Section 4.1.

• Dynamics as described in Section 4.2.

• The measurement model described in Section 4.3.

• The resampling scheme described in Section 4.4.

The articulated hand model has 14 parameters, (x, y, α, λ, θ₀, l₀, θ₁, l₁, θ₂, l₂, θ₃, l₃, θ₄, θ₅), which we decompose into 7 partitions, as follows:

• One partition for the hand palm, comprising the parameters (x, y, α, λ), whose associated particle set we will refer to as Palm.

• Four partitions, one for each finger, comprising the parameters (θ₀, l₀), ..., (θ₃, l₃), whose associated particle sets we will refer to as L, R, M, and I.

• Two partitions for the thumb: one for the first thumb segment, comprising the parameter θ₄, and another for the second thumb segment, comprising the parameter θ₅. We will refer to their associated particle sets as T1 and T2 respectively.
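The decomposition above can be written down as a small data structure. This is only a sketch: the dictionary layout and the helper function are hypothetical, and the pairing of the indices 0 to 3 with L, R, M and I is assumed from the order in which they are listed.

```python
# Partition name -> parameters it owns and the partition it hangs from.
PARTITIONS = {
    "Palm": {"params": ("x", "y", "alpha", "lambda"), "parent": None},
    "L":    {"params": ("theta0", "l0"), "parent": "Palm"},
    "R":    {"params": ("theta1", "l1"), "parent": "Palm"},
    "M":    {"params": ("theta2", "l2"), "parent": "Palm"},
    "I":    {"params": ("theta3", "l3"), "parent": "Palm"},
    "T1":   {"params": ("theta4",), "parent": "Palm"},
    "T2":   {"params": ("theta5",), "parent": "T1"},
}

def processing_order(partitions):
    """Order the partitions so that each parent comes before its children.

    Assumes the parent links form a tree (no cycles).
    """
    ordered, pending = [], dict(partitions)
    while pending:
        for name, spec in list(pending.items()):
            if spec["parent"] is None or spec["parent"] in ordered:
                ordered.append(name)
                del pending[name]
    return ordered

order = processing_order(PARTITIONS)
total_params = sum(len(spec["params"]) for spec in PARTITIONS.values())
```

Processing the partitions in this order ensures that the palm is located before the fingers, and the first thumb segment before the second.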

A particle set diagram representing one time-step of the particle-set hand tracker version is

shown in Figure 4.6. This diagram is very similar to the one shown in Figure 3.14, with the

addition of two segments for the thumb. Each of the thumb segments has a unique parameter

for its angle (the length of the segment is constant). The second thumb segment depends

hierarchically on the first thumb segment; therefore, the subset sizes in the T2 particle set

are determined by the weights of the selected particles in the T1 particle set.
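One way of turning parent weights into subset sizes is sketched below. The thesis only states that the sizes are proportional to the weights of the selected parent particles; the largest-remainder rounding used here, and the function name, are assumptions made so that the sizes always add up to the size of the child particle set.

```python
def subset_sizes(parent_weights, total):
    """Split `total` child particles among parents in proportion to their weights."""
    weight_sum = sum(parent_weights)
    exact = [total * w / weight_sum for w in parent_weights]
    sizes = [int(e) for e in exact]
    # Hand out the particles lost to truncation, largest fractional part first.
    by_remainder = sorted(range(len(exact)),
                          key=lambda i: exact[i] - sizes[i], reverse=True)
    for i in by_remainder[: total - sum(sizes)]:
        sizes[i] += 1
    return sizes

# Three selected parents sharing a 100-particle child set.
sizes = subset_sizes([5, 3, 2], 100)
```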


Figure 4.6: One time-step of tracking for the particle-set implementation.

An algorithm showing the specific operations for one time-step of the particle-set hand

tracker is shown in Figure 4.7. The algorithm is very similar to the one described in Figure

3.15, but differs from it in that there is one extra level in the hand tree structure, the second

thumb segment.

Some comments about the particle-set hand tracker are:

• The number of particles used in each particle set is set as follows: 250 particles for the

Palm particle set; and 100 particles for each of the finger particle sets, including the two

thumb segment particle sets.

• The contour templates are fit in the following order: first, the hand palm; second, the

fingers and first thumb segment; finally, the second thumb segment. Alternatives to this

order will be discussed in Section 7.1.


1. For the Palm particle set do:
   1.1. Use the complete particles of the previous time-step to generate new particles for Palm.
   1.2. Apply dynamics to each of the particles in Palm.
   1.3. Weight particles in Palm.
   1.4. Select particles from Palm that constitute peaks of weight in the set. (Resampling)
2. For each of the finger particle sets, i.e. L, R, M, I, and the first thumb segment particle set, T1, do:
   2.1. For each of the selected particles in Palm, generate a number of new particles in the finger particle set, proportional to the weight of the selected particle in Palm (the particles, in the finger particle set, that come from the same parent particle in Palm form a subset).
   2.2. Apply dynamics to each of the particles in the finger particle set.
   2.3. Weight particles in the finger particle set.
   2.4. Select the particle with highest weight in the finger particle set.
   2.5. Generate a new interpolated particle for each subset, based on the particle selected in step 2.4. Select the new interpolated particles.
   2.6. Weight the interpolated particles.
3. For the second thumb segment particle set, T2, do:
   3.1. For each of the selected particles in T1, generate a number of new particles in the T2 particle set, proportional to the weight of the selected particle in T1 (the particles, in the T2 particle set, that come from the same parent particle in T1 form a subset).
   3.2. Apply dynamics to each of the particles in the T2 particle set.
   3.3. Weight particles in the T2 particle set.
   3.4. Select the particle with highest weight in the T2 particle set.
   3.5. Generate a new interpolated particle for each subset, based on the particle selected in step 3.4. Select the new interpolated particles.
   3.6. Weight the interpolated particles.
4. Form complete particles for the next time-step.

Figure 4.7: Algorithm for one time-step of tracking with the particle-set implementation.

• The finger lengths and finger angles are constrained. The minimum allowed length is

0.15, and the maximum allowed length is 1.2. Recall that a length of 1 is the length

of the finger in the original finger contour template, see Section 4.1. The finger angles

are constrained specifically for each finger according to Table 4.2. The values for these

angle limits have been found empirically, in order that they suit the expected finger

movements.

• In order to bootstrap the tracking an initial complete particle is needed. This initial

complete particle contains a known configuration for the articulated hand contour that

matches the configuration of the target hand in the first frame of the tracking sequence.


Finger             Minimum angle (rad)    Maximum angle (rad)
Little             -0.6                   0.4
Ring               -0.25                  0.15
Middle             -0.25                  0.15
Index              -0.45                  0.4
Thumb segment 1    -0.9                   0.9
Thumb segment 2    -1.3                   0.35

Table 4.2: Finger angle constraints.
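The angle limits of Table 4.2 and the finger length limits given above can be applied to a finger hypothesis as in the following sketch. Clamping out-of-range values to the nearest limit is an assumption of this sketch; the thesis does not say whether such hypotheses are clamped or discarded. The names used are hypothetical.

```python
# Angle limits in radians, copied from Table 4.2.
ANGLE_LIMITS = {
    "Little": (-0.6, 0.4),
    "Ring": (-0.25, 0.15),
    "Middle": (-0.25, 0.15),
    "Index": (-0.45, 0.4),
    "Thumb segment 1": (-0.9, 0.9),
    "Thumb segment 2": (-1.3, 0.35),
}
# Finger length limits, relative to the template length of 1.
LENGTH_LIMITS = (0.15, 1.2)

def clamp(value, low, high):
    """Return value limited to the interval [low, high]."""
    return max(low, min(high, value))

def constrain_finger(name, angle, length):
    """Clamp a finger's angle and length to their allowed ranges."""
    low, high = ANGLE_LIMITS[name]
    return clamp(angle, low, high), clamp(length, *LENGTH_LIMITS)

constrained = constrain_finger("Ring", 0.5, 1.5)
```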

4.6 Refining finger length estimates

The visual hand tracking in this thesis was designed as a solution to the hand tracking

requirements of the VTS application. In this application the flexion/extension of the fingers

could be very fast, for example when typing on a virtual keypad, or clicking on a virtual

button. The flexion/extension movements of the fingers are modelled, in the articulated

contour model, as changes in the length of the finger contour. The parameter that governs the

dynamics of the finger contour length could be tuned in order to suit these fast changes;

however, the required value would also make the measured finger contour length less stable.

On the other hand, the measured finger contour length needs to be as accurate as possible in

order that the inferred flexion/extension of the fingers is accurate enough for the VTS. For

these reasons, a mechanism, additional to the described contour tracking, is needed in order

to guarantee this extra accuracy.

This section introduces a technique that allows previously found finger length estimations to

be refined. The technique consists of using two parallel measurement lines placed along the

finger contour. The length of these lines is slightly longer than the maximum possible finger

length. The pixels along these measurement lines are tested for skin colour, starting from the

base of the finger, in order to find a skin colour edge, from skin colour to non-skin colour.

The position of the skin colour edge is used to correct the length of the finger. The procedure

is illustrated in Figure 4.8.


(a)

(b)

(c)

Figure 4.8: Procedure to refine finger length estimations. (a) the middle finger length is initially estimated by

the articulated hand contour tracking. (b) two measurement lines are used in order to find a more

accurate length of the middle finger. (c) the length of the middle finger is updated.

For this technique to produce good results, it needs to be able to handle noise along the

measurement lines, and potentially, skin colour from adjacent fingers. In order to achieve

this, the measurement lines are processed in the following way:

1. The pixels of the measurement line are tested for skin colour using the Linear Container

skin-colour classifier described in Chapter 5. The results are stored in an array: a pixel

is stored as 1 if it is skin coloured, and as 0 if it is not.

2. A one-dimensional morphological erosion is applied to the array in order to remove

noise; one pixel erosion considering 1 as foreground, and 0 as background.

3. Morphological dilation is applied to the array in order to fill any skin holes inside the

finger; 25 pixel dilation.

4. Morphological erosion is applied to the array in order to return the skin-colour edge to

its original location inside the array; 24 pixels erosion.

5. Finally, the array is scanned in order to locate a skin-colour edge.
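Steps 2 to 5 above can be sketched as follows. The sketch reads an "n pixel" erosion or dilation as a one-dimensional structuring element reaching n samples to each side, and it pads the far end of the array with non-skin samples so that the dilated edge cannot run off the array; both choices, and the function names, are assumptions rather than details given in the thesis.

```python
def erode(bits, n):
    """n-pixel erosion: keep a 1 only if every sample within n positions is 1."""
    size = len(bits)
    return [int(all(bits[max(0, i - n):min(size, i + n + 1)])) for i in range(size)]

def dilate(bits, n):
    """n-pixel dilation: set a 1 if any sample within n positions is 1."""
    size = len(bits)
    return [int(any(bits[max(0, i - n):min(size, i + n + 1)])) for i in range(size)]

def clean_line(bits, noise=1, fill=25):
    """Steps 2-4: erode by `noise`, dilate by `fill`, erode by `fill - noise`."""
    padded = list(bits) + [0] * fill          # assumed padding with non-skin samples
    cleaned = erode(dilate(erode(padded, noise), fill), fill - noise)
    return cleaned[:len(bits)]

def skin_to_nonskin_edge(bits):
    """Step 5: count the skin samples from the finger base up to the first non-skin one."""
    for i, b in enumerate(bits):
        if not b:
            return i
    return len(bits)

# Finger of 40 skin samples with a 3-sample hole, plus one isolated noise sample.
line = [1] * 15 + [0] * 3 + [1] * 22 + [0] * 10 + [1] + [0] * 9
raw_edge = skin_to_nonskin_edge(line)
clean_edge = skin_to_nonskin_edge(clean_line(line))
```

In this example the unprocessed line reports an edge at the hole, whereas the processed line reports it at the true end of the finger; the erosions and dilation cancel out, so the edge itself is not displaced.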

This processing of the measurement lines makes the location of a skin-colour edge, on the

measurement line, more accurate. In addition, two parallel measurement lines are used in

order to combine their skin-colour edge positions, and, in this way, obtain a more reliable

length for the finger. If the distance between the skin-colour edges of the two measurement

lines is smaller than 0.3 (remember, the length of the fingers goes from 0 to 1), the resulting

fingertip position is calculated as an average of the two skin-colour edge positions;


otherwise, the fingertip position is calculated from the skin-colour edge that is further away

from the base of the finger.
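The rule for combining the two skin-colour edges can be sketched as follows, assuming that both edge positions are expressed in the normalised finger-length units used above and measured from the base of the finger. The function name is hypothetical.

```python
def combine_edges(edge_a, edge_b, max_gap=0.3):
    """Fuse the skin-colour edge positions found on the two measurement lines."""
    if abs(edge_a - edge_b) < max_gap:
        # The two lines agree: average them.
        return 0.5 * (edge_a + edge_b)
    # The two lines disagree: trust the edge further from the finger base.
    return max(edge_a, edge_b)

agreeing = combine_edges(0.75, 1.0)
disagreeing = combine_edges(0.25, 1.0)
```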

The procedure presented here, for refining a previously estimated finger length, is applied

directly after step 2.4 (only to the fingers, not to the first thumb segment) of the

particle-set version algorithm in Figure 4.7. This step corresponds to selecting the particle

with highest weight in the particle set. After this particle is selected, the procedure for

refining the finger length estimation is applied. The resulting particle, with the refined finger

length, will be used in the following step, in order to generate interpolated particles for each

of the subsets of the particle set.

4.7 Sweep implementation

Section 4.5 described an articulated hand tracker implementation based on the techniques presented in Chapter 3. In this section a new articulated hand tracker implementation is described. This implementation is also based on the techniques presented in Chapter 3; however, the particle sets for the fingers are replaced by a deterministic search of the fingers' positions. The approach has two major advantages with respect to the particle-set implementation: increased accuracy in locating the fingers, and consequently increased tracking accuracy; and faster computation.

This hand tracker works mostly in the same way as the particle-set implementation: it uses the same hand contour model, the same dynamical model, the same measurement model, and the same resampling scheme. The only difference is the way in which finger positions are found. In the particle-set implementation, the particles in the finger particle sets correspond to many potential finger positions, distributed according to the finger dynamics. The reported finger position is determined by the particle with highest weight in this finger particle set. The accuracy with which the finger position is found depends on the number of particles in the finger particle set; more particles mean better accuracy.

The hand tracker presented in this section replaces the fingers' particle sets with a deterministic search as a means of finding the finger positions. This search involves measuring the fitness of each finger contour over a specific range of angles. The angle whose finger contour has the highest fitness is selected for that finger. The angle is swept in a progressive pattern: the sweep starts with the finger angle and length of the previous time-step, and then the angle changes gradually with increasing steps, first with positive increments and then, after returning to the initial angle, with negative increments. Note that, with this approach, no dynamics are applied to the finger angle. Figure 4.9 illustrates this angle sweep pattern. Once the finger angle is found in this way, the length of the finger is found using the finger length refinement procedure of Section 4.6.

Figure 4.9: Angle sweep pattern. (a) First, the finger angle changes with positive increasing steps. (b) Then,

the finger angle changes with negative increasing steps.

It turns out that this sweep procedure requires fewer measurements, of the finger contour

fitness, than the particle-set implementation does. Using the presented progressive angle

sweep pattern, 15 angle positions are enough to estimate the finger angle with better

accuracy, on average, than using the particle set method with 100 particles. This is the reason

why the sweep implementation is faster than the particle-set implementation – it essentially

involves fewer measurement steps.
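A possible form of the progressive angle sweep is sketched below. The thesis specifies 15 angle positions with increasing steps on either side of the previous angle, but not the step sizes; the linearly growing increments, the base step of 0.01 rad, the clamping to the angle limits and the function names are all assumptions of this sketch. The `fitness` argument stands for the contour score of the finger at a given angle.

```python
def sweep_offsets(positions=15, step=0.01):
    """Angle offsets: 0, then growing positive steps, then growing negative steps."""
    per_side = (positions - 1) // 2
    positive, offset = [], 0.0
    for k in range(1, per_side + 1):
        offset += k * step              # each increment is larger than the previous one
        positive.append(offset)
    return [0.0] + positive + [-o for o in positive]

def sweep_finger_angle(previous_angle, fitness, limits, positions=15, step=0.01):
    """Return the candidate angle with the highest fitness."""
    low, high = limits
    candidates = [min(high, max(low, previous_angle + o))
                  for o in sweep_offsets(positions, step)]
    return max(candidates, key=fitness)

offsets = sweep_offsets()
# Toy fitness peaking at 0.1 rad, searched within the ring finger limits of Table 4.2.
best_angle = sweep_finger_angle(0.0, lambda a: -abs(a - 0.1), (-0.25, 0.15))
```

Because the steps grow with distance, small angle changes are resolved finely while large changes are still reachable with few evaluations.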

Conceptually, this algorithm uses partition sampling, as the configuration space is partitioned

exactly as in the particle-set version, and the searches in each of the partitions are performed

hierarchically. The finger position is found in two stages, first the finger angle and then the

finger length. This is as if we had partitioned the configuration space one level more, one

partition for the finger angles, and another partition for the finger lengths.


Figure 4.10 represents one time-step for the sweep articulated hand tracker implementation.

The top horizontal line is the Palm particle set. Dots on this particle set represent selected

particles, and the bigger dot is the particle with highest weight. From this particle, five

branches come out, one for each finger, and one for the first thumb segment: L for little, R

for ring, M for middle, I for index, and T1 for the first thumb segment. Each of the finger

branches undergoes, first, a sweep procedure (triangles with S on them) and then, a finger

length refinement (boxes with R on them). Once the angle and length of fingers are found,

these are used in order to generate interpolated particles, one for each of the selected

particles in the Palm particle set. This stage is represented in Figure 4.10 as a smaller

horizontal line with four particles (dots) on it. The thumb segments follow a slightly different

path. They only require angle estimation, which then can be used to generate interpolated

particles. The angle of the second thumb segment is estimated based on the angle of the first thumb segment. At

the end of the time-step it will be possible to form as many complete particles as particles

were selected in the Palm particle set. Figure 4.11 shows the algorithm for one time-step of

tracking with the sweep implementation.

Figure 4.10: Sweep hand tracker implementation diagram for one time-step. The triangles, with an S on

them, represent the sweep procedure. The boxes, with an R on them, represent the finger length

refinement procedure. The horizontal lines at the bottom represent particle interpolation; the red

dots are interpolated particles from the black dot, the particle with highest weight.


1. For the Palm particle set do:
   1.1. Use the complete particles from the previous time-step to generate new particles for Palm.
   1.2. Apply dynamics to each of the particles in Palm.
   1.3. Weight particles in Palm.
   1.4. Select particles from Palm that constitute peaks of weight in the set. (Resampling)
2. For each of the fingers, i.e. L, R, M, I, do:
   2.1. Apply the sweep pattern, using as starting position the finger angle of the previous time-step.
   2.2. Apply the finger length refinement.
   2.3. Use the angle and length found in the previous two steps in order to generate a new interpolated particle for each of the particles selected in the Palm particle set.
   2.4. Weight the interpolated particles.
3. For the first thumb segment do:
   3.1. Apply the sweep pattern, using as starting position the first thumb angle of the previous time-step.
   3.2. Use the angle found in the previous step in order to generate an interpolated particle for each of the selected particles in the Palm particle set.
   3.3. Weight the interpolated particles.
4. For the second thumb segment do:
   4.1. Apply the sweep pattern, using as starting position the second thumb angle of the previous time-step.
   4.2. Use the angle found in the previous step in order to generate an interpolated particle for each of the interpolated particles in the first thumb segment.
   4.3. Weight the interpolated particles.
5. Form complete particles for the next time-step.

Figure 4.11: Algorithm for one time-step of tracking with the sweep implementation.

4.8 Performance measures for articulated contour trackers

In the previous sections, we have presented two articulated hand contour tracker

implementations. Both implementations make it possible to track the user's hand contour

through a video sequence. However, we may ask: how accurate is the tracking in each of

these implementations? In order to answer this question, we need to define suitable

performance measures for articulated contour trackers. The performance measurement of an

articulated contour tracker is not an easy task. One problem is that the accuracy of the global

tracking and the accuracy of the articulated tracking interact with each other, which makes it

difficult to give a general benchmark. In particular, it is not very meaningful to give a single

value to explain the quality of a particular tracker on a particular video-sequence. For this


reason, the performance measures treated in this section will refer to a single time-step of

tracking. Hence, for a tracking sequence, the results will be in the form of a sequence of

values, one for each time-step of tracking.

This section presents five performance measures: cost function, contour distance, SNR,

distance between model points, and distance between model parameters. The first one is

based on a value internal to the tracker, the cost function; the rest are based on a ground

truth. The ground truth for the articulated hand contour tracker requires placing, for each

frame of a test video sequence, an articulated hand contour exactly on top of the real hand.

This process is done manually, and this means that it is costly, and not completely accurate.

4.8.1 Cost function

In Section 4.3, we presented a measurement model. Within this measurement model we

define a function whose return value describes the degree of fit between an articulated hand

contour and a real hand in an image. This function is referred to as a weighting function

when dealing with particle filters, and as a fitness function, when referring to template

matching. Here, in a more general sense, we refer to it as a cost function. Most trackers use a

cost function in their tracking mechanisms as a way of evaluating how good any particular

tracking state is, at a given time-step. Generally, the objective of a tracker is to maximize (or

minimize) the value of the cost function by changing the values of some model variables

from one time-step to the next one. This is an important point to bear in mind when

comparing various tracker implementations that use the same cost function: one tracker

implementation is better than another if it better maximizes (or

minimizes) the value of the cost function.

However, a cost function is not originally designed to be a performance measure. The cost

function may be a simplification and have limited accuracy itself, because of computation

speed requirements, or because of measurement model limitations. In fact, as we will see in

Section 4.11, when the proposed skin colour based cost function is compared, as a

performance measure, with an external performance measure based on a ground truth, we

can sometimes observe a disagreement between the two performance measures.


We finish this section by pointing out that, as trackers in most cases already provide the cost function, its use as a performance measure is very attractive; however, a proper benchmarking of the tracker should use other performance measures too.

4.8.2 Contour distances

A performance measure for contour trackers can be constructed by calculating, for every

frame of a tracking sequence, a distance metric between the tracker's output contour and a

ground truth contour. A simple distance metric between two contours could be defined in

terms of their control points. Let $\mathbf{x} = (x_i, y_i)$ and $\mathbf{x}' = (x'_i, y'_i)$ be two sets of control points, for the tracker output contour and the ground truth contour respectively. A distance metric can be given by the norm of their difference:

$$\|\mathbf{x} - \mathbf{x}'\| = \left( \sum_{i=0}^{N-1} (x_i - x'_i)^2 + (y_i - y'_i)^2 \right)^{1/2} \qquad (4.1)$$

However, the contours used in this thesis are represented by cubic B-splines parametric

curves, and (4.1) does not take into account the distances between corresponding points on

each curve. A better distance metric can be formulated by including the B-spline metric

matrix as given in (Blake and Isard, 1998). Given two cubic B-splines ( )sP and ( )s′P

defined by their control points ( , )i ix y and ( , )i ix y′ ′ , a more accurate distance metric, d,

measures the difference between corresponding points on each spline, sampled densely and

uniformly over the parametric curves. The distance metric is given by,

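$$d(\mathbf{x}, \mathbf{x}') = \left( \int_0^N \|\mathbf{P}(s) - \mathbf{P}'(s)\|^2 \, ds \right)^{1/2} = \left( \int_0^N \Big( \sum_{i=0}^{N-1} (x_i - x'_i) B_i(s) \Big)^2 ds + \int_0^N \Big( \sum_{i=0}^{N-1} (y_i - y'_i) B_i(s) \Big)^2 ds \right)^{1/2} \qquad (4.2)$$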

where $B_i(s)$ is the $i$-th cubic B-spline basis function.

A performance measure based on this distance metric can be convenient, and gives useful

information about how similar, or different, the output of a contour tracker is to a ground

truth contour.
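As an illustration of the two metrics above, the following sketch evaluates (4.1) directly and approximates the integral in (4.2) by dense uniform sampling. It assumes a closed, uniform cubic B-spline with one span per control point; this periodic parameterisation, the function names, and the sampling density are assumptions made for the example and are not taken from the tracker implementation.

```python
import math


def control_point_distance(ctrl_a, ctrl_b):
    """Equation (4.1): Euclidean norm of the control-point differences."""
    return math.sqrt(sum((xa - xb) ** 2 + (ya - yb) ** 2
                         for (xa, ya), (xb, yb) in zip(ctrl_a, ctrl_b)))


def cubic_basis(u):
    """Four uniform cubic B-spline blending weights for local parameter u in [0, 1)."""
    return ((1 - u) ** 3 / 6.0,
            (3 * u ** 3 - 6 * u ** 2 + 4) / 6.0,
            (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6.0,
            u ** 3 / 6.0)


def spline_point(ctrl, s):
    """Point P(s) on a closed uniform cubic B-spline, with s in [0, N)."""
    n = len(ctrl)
    span = int(math.floor(s)) % n
    weights = cubic_basis(s - math.floor(s))
    x = sum(w * ctrl[(span + j) % n][0] for j, w in enumerate(weights))
    y = sum(w * ctrl[(span + j) % n][1] for j, w in enumerate(weights))
    return x, y


def spline_distance(ctrl_a, ctrl_b, samples_per_span=50):
    """Equation (4.2), with the integral approximated by the midpoint rule."""
    n = len(ctrl_a)
    ds = 1.0 / samples_per_span
    total = 0.0
    for k in range(n * samples_per_span):
        s = (k + 0.5) * ds
        xa, ya = spline_point(ctrl_a, s)
        xb, yb = spline_point(ctrl_b, s)
        total += ((xa - xb) ** 2 + (ya - yb) ** 2) * ds
    return math.sqrt(total)
```

For a pure translation of all control points the two metrics return the same value, because the basis functions sum to one at every $s$; they differ when only part of the contour moves, which is the case that (4.2) is meant to capture.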

4.8.3 Signal to Noise Ratio (SNR)

A popular measure for evaluating the performance of a tracker is the SNR. The SNR is an

image processing based measurement (Tissainayagam and Suter, 2002); thus, it is

independent of the parameterisation of the contour representation. The SNR is defined in


terms of the overlap between the output contour of the tracking and the ground truth contour.

The output SNR (in dB), denoted $SNR_{out}$, for a single frame is calculated as follows:

$$signal = 2 \sum_{x,y} I_{ref}(x, y)^2$$

$$noise = \sum_{x,y} \left( I_{ref}(x, y) - I_{track}(x, y) \right)^2$$

$$SNR_{out}\,(\mathrm{dB}) = 10 \log_{10} \frac{signal}{noise} \qquad (4.3)$$

where $I_{ref}$ and $I_{track}$ are binary images. $I_{ref}$ is 1 for points inside the ground truth contour,

and 0 outside. $I_{track}$ is similar, but uses the tracked contour. The scale factor 2 in the signal

value was chosen so that an SNR of 0 (i.e. signal = noise) would occur if the tracker silhouette

consisted of a shape of the same area as the ground truth shape, but inaccurately placed so

that there is no overlap between the two. This is the worst-case scenario where the tracker

has completely failed to track the object.
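The SNR of (4.3) can be sketched in a few lines. The example below assumes that the two silhouettes are already available as binary masks stored as nested lists of 0s and 1s; how the masks are rasterised from the contours is not shown, and the handling of the two degenerate cases (perfect overlap and an empty reference) is a choice made for this sketch only.

```python
import math


def output_snr_db(ref_mask, track_mask):
    """Output SNR in dB for one frame, following equation (4.3)."""
    signal = 2 * sum(v * v for row in ref_mask for v in row)
    noise = sum((r - t) ** 2
                for ref_row, track_row in zip(ref_mask, track_mask)
                for r, t in zip(ref_row, track_row))
    if signal == 0:
        raise ValueError("reference mask is empty")
    if noise == 0:
        return math.inf  # perfect overlap: the ratio is unbounded
    return 10.0 * math.log10(signal / noise)
```

Two silhouettes of equal area with no overlap give 0 dB, which reproduces the worst case described above.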

4.8.4 Distance between model points

The previous performance measures refer to the whole contour as a single entity; however, in

articulated tracking we may want to know how well each individual articulation is

tracked. One easy way of calculating this is by measuring the 2D Euclidean distance

between two corresponding points, situated on some articulation of the tracking output

contour and a ground truth contour. Some candidate points are: the origin of the articulated

contour, any of the finger pivots, and the fingertips. The measurement of the tracking

accuracy of the individual fingertips, against a ground truth, is of special importance in the

applications of Chapter 8, where the movement and position of the fingertips determine the

location of the 'touch' on an interactive surface.
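A minimal sketch of this per-point measure is given below. The point names used as dictionary keys are hypothetical labels chosen for the example; any set of corresponding points (contour origin, finger pivots, fingertips) could be used.

```python
import math


def point_errors(tracked_points, truth_points):
    """2D Euclidean distance between corresponding model points, per point name."""
    return {name: math.dist(tracked_points[name], truth_points[name])
            for name in truth_points}
```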

4.8.5 Distance between model parameters

As the proposed articulated hand contour tracking is model based, an alternative benchmark

is the discrepancy, for each time-step of tracking, between the model parameters and the

ground truth model parameters. This performance measure allows examining individual

model parameters but relies completely on the ground truth. This makes the measure a bit

inflexible, as there could be situations in which a combination of model parameters could


differ from the ground truth, but the resulting contour still fits correctly in some areas of

interest, for example the fingertips.

4.9 Test video sequence

In order to assess the performance of the hand tracking implementations presented in this

thesis, a test video sequence was created. As the purpose of the articulated hand tracking is to

be used in the VTS applications of Chapter 8, this video sequence was designed to test

each of the VTS hand tracking requirements. The video sequence consists of a VTS user

holding their open hand in front of a camera, parallel to the camera's image plane,

exactly as if the user were interacting with a VTS. Then, the user starts moving their hand

following certain patterns of rigid and articulated movement. Figure 4.12 shows a diagram of

the test video sequence structure. The video sequence has a total of 890 frames (although

tracking starts at frame 30) and it is composed of the following five sections:

• From frame 30 to frame 174, the hand moves rigidly with splayed fingers. This section

tests the tracker's capability for rigid tracking.

• From frame 175 to frame 359, the hand moves rigidly with splayed fingers again; this

time with very brisk movement. The movement is so intense, sometimes more than 50

pixels of swing from one frame to the next one, that this section of the test video

sequence can only be successfully tracked by using some of the techniques introduced in

Chapter 7. Note that the brisk motion not only involves fast translations but also fast

rotation, and fast changes in scale. This section tests the tracker's capability for brisk

rigid tracking.

• From frame 360 to frame 402, the hand remains static but fingers flex and extend, as if

the user was typing on a specific area of a VTS. This section tests the tracker's capability

for articulated finger tracking, isolating it from global hand tracking.

• From frame 403 to frame 572, the hand moves rigidly at the same time that fingers flex

and extend, as if the user was typing on a wider area of a VTS. This section tests the

combined tracker's capability for simultaneous global and articulated tracking.

• From frame 573 to frame 890, the hand moves rigidly while a finger is kept flexed, as if

the user was dragging an object on the VTS. The dragging of the finger describes a

square on the VTS; this is first done for the index finger, then for the middle finger, and

finally for the ring finger. This section tests the tracker's capability for tracking the rigid

movement of the hand while a finger remains flexed.


In addition to the main movement pattern of each section, at the end of the first, second, and

last sections the thumb is flexed. This allows testing the tracking for the two joints of the

thumb.

Figure 4.12: Test video sequence structure.
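When results are reported per section, it can be convenient to encode the frame ranges listed above as data. The sketch below does this; the short section labels follow the names used later in the chapter (rigid, brisk rigid, static typing, dynamic typing, dragging), while the data layout and function name are assumptions of the example.

```python
# (first frame, last frame, label) for the five sections of the test sequence.
SECTIONS = [
    (30, 174, "rigid"),
    (175, 359, "brisk rigid"),
    (360, 402, "static typing"),
    (403, 572, "dynamic typing"),
    (573, 890, "dragging"),
]


def section_of(frame):
    """Return the label of the section containing `frame`, or None before tracking starts."""
    for first, last, label in SECTIONS:
        if first <= frame <= last:
            return label
    return None
```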

A ground truth for this test video sequence was created in the following way: for each frame

of the video sequence an articulated template, like the one in Figure 4.1(b), was manually

adjusted in order to match the hand configuration in that frame. Once the template was

adjusted for each frame, the parameters of the articulated template were stored. This allows

the recreation of the tracking for the whole test video sequence. The test video sequence is

available in Appendix B and on the supporting webpage as "video_sequence_4.1.avi". The

ground truth for the test video sequence is available in Appendix B and on the supporting

webpage as "ground_truth.txt".

4.10 Results and comparisons

This section presents tracking results of the two articulated hand contour tracking

implementations: the particle-set implementation, and the sweep implementation. The

particle-set implementation uses 250 particles for the Palm particle set, and 100 particles for

each of the finger and two thumb partitions. The sweep implementation uses 250 particles

for the Palm particle set; the finger and thumb positions are found using the deterministic

search described in Section 4.7. Both hand tracker implementations use the resampling

scheme described in Section 4.4, which involves propagating 10% of the particles from one

time-step to the next.
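The selection step of this resampling scheme can be sketched as follows. Only the choice of the 10% of particles with the highest weight is shown; how the remainder of the particle set is regenerated is described in Section 4.4 and is not reproduced here. The function name and the representation of the weights as a plain list are assumptions of the example.

```python
def select_survivors(weights, fraction=0.10):
    """Indices of the highest-weighted particles that are propagated to the next time-step."""
    count = max(1, round(len(weights) * fraction))
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return ranked[:count]
```

With the 250 particles of the Palm particle set this keeps 25 particles per time-step.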

Both tracker implementations are tested with the video sequence described in the previous

section. However, the section between frames 175 and 359, corresponding to brisk rigid

hand motion, cannot be successfully tracked by either of the two trackers. For this reason the


test is performed in three sections, initialising the tracker at the beginning of each one. The

first section goes from frames 30 to 174, and corresponds to the rigid hand motion. The

second section goes from frames 175 to 359, and corresponds to the brisk rigid hand motion.

During this section both trackers fail to keep a lock on the target. The third section goes from

frames 360 to 890, and corresponds to static typing, dynamic typing, and dragging. The two

tracker implementations calculate three performance measures for each frame of the

tracking: cost function, contour distance, and SNR. The results are presented in

the form of charts; the vertical axis is specific to each performance measure, and the horizontal

axis is the frame number. The cost function is shown on a logarithmic scale because of the

large range its values can span, potentially from about

8.47E-22 to 1.180E+21. This large range is due to the way the cost

function is calculated (see Section 4.3.2 for further information). Each chart displays the

mean and variance of its values, with the exception of the cost function chart, which

shows the log variance (see footnote 4).
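The log variance reported for the cost function is the variance of the base-10 logarithm of the per-frame values. A minimal sketch follows; it uses the population variance (division by the number of values), which is an assumption, since the text does not say which variance estimator is used.

```python
import math


def log_variance(values):
    """Variance of the base-10 logarithm of strictly positive values."""
    logs = [math.log10(v) for v in values]
    mean = sum(logs) / len(logs)
    return sum((x - mean) ** 2 for x in logs) / len(logs)
```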

Videos for the tracking output of both trackers are available in Appendix B and on the

supporting webpage as:

"video_sequence_4.2.avi" for particle-set implementation frames 30-174;

"video_sequence_4.3.avi" for particle-set implementation frames 360-890;

"video_sequence_4.4.avi" for sweep implementation frames 30-174;

"video_sequence_4.5.avi" for sweep implementation frames 360-890.

Figure 4.13 shows a comparison between the particle-set results, left column, and the sweep

results, right column, for frames 30 to 174. Figure 4.13(a) and (b) show the cost function

charts for the particle-set and sweep implementations respectively. We can observe that the

average of the cost function values for the sweep implementation is larger, and the log

variance is smaller, than for the particle-set implementation. A larger value for the cost

function indicates that the fit between the hand contour resulting from the tracking, and the

real hand on the image, is better. A smaller value for the log variance indicates that the

tracking is more stable. Figure 4.13(c) and (d) show the contour distance charts for the

particle-set and sweep implementations respectively. The contour distance is calculated

using the distance metric of Section 4.8.2. A smaller value of the metric indicates

that the hand contour in the tracking is closer to the hand contour in the ground truth. We can

see that, for this section of the tracking, the metric values for the sweep implementation are

considerably smaller than the values for the particle-set implementation. The variance of the

metric is also smaller in the sweep implementation than in the particle-set implementation.

This indicates that the sweep implementation tracking is more stable. Finally, in Figure

4.13(e) and (f) we can see the SNR charts for the particle-set and sweep implementations

respectively. A larger SNR value indicates that the overlap between the hand contour in the

tracking and the hand contour in the ground truth is greater. The SNR in both charts is quite

similar; however, if we look at the average and variance values, we can see that the sweep

implementation has a slightly larger average SNR, about 1 dB more, and less variance.

4 We calculate the log variance by taking the base-10 logarithm of the data, and then calculating its variance.

Figure 4.13: Performance comparison from frame 30 until 174. Left column: particle-set results; right column: sweep results. (a), (b) Cost function; (c), (d) contour distance; (e), (f) SNR.

Figure 4.14 and Figure 4.15 show four example frames for the particle-set tracker output and

the sweep tracker output respectively. These frames illustrate the type of rigid motions and

common fitting errors for each tracker between frames 30 and 174.


Figure 4.14: Example frames (50, 72, 130, and 158) of the particle-set tracker output from frame 30 to frame 174.

Figure 4.15: Example frames (50, 72, 130, and 158) of the sweep tracker output from frame 30 to frame 174.



Figure 4.16 shows a comparison between the particle-set results, left column, and the sweep

results, right column, for frames 175 to 359. During this section the motions of the target

hand are so intense that the tracking is lost for both trackers. Note that both trackers have a

stopping mechanism, which stops the tracking from continuing if the cost function goes

below 1E-4. If this mechanism did not exist the tracking would continue for longer; however,

the lock could be on the wrong object.
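The stopping mechanism amounts to a threshold test on the per-frame cost function value. The sketch below illustrates it on a list of cost values; the function name is hypothetical, and whether the frame that triggers the stop is itself reported is an assumption of the example (here it is not counted).

```python
STOP_THRESHOLD = 1e-4  # tracking stops once the cost function falls below this value


def frames_tracked(costs, threshold=STOP_THRESHOLD):
    """Number of frames processed before the cost function first drops below the threshold."""
    for n, cost in enumerate(costs):
        if cost < threshold:
            return n
    return len(costs)
```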


Figure 4.16: Performance comparison from frame 175 until 359. Both trackers fail at keeping a lock onto the target hand. Left column: particle-set results; right column: sweep results.


Figure 4.17 shows a comparison between the particle-set results, left column, and the sweep

results, right column, for frames 360 to 890. Figure 4.17(a) and (b) show the cost function

charts. We can see that the cost function values for the sweep implementation are again

larger on average than the cost function values for the particle-set version. The log variance

of the cost function for both implementations is approximately the same. Figure 4.17(c) and

(d) show the distance metric charts for the particle-set and sweep implementations


respectively. We can see that both the average and the variance of the metric for the sweep

implementation are considerably smaller than the average and variance for the particle-set

implementation. Finally, in Figure 4.17(e) and (f) we can see the SNR charts for the particle-

set and sweep implementations respectively. Both charts look quite similar; however, if we

look at the average SNR value, the sweep implementation has a slightly larger one than the

particle-set implementation. Also, the SNR variance in the sweep implementation is slightly

smaller, which indicates that the tracking for this implementation is slightly more stable.

Figure 4.18 and Figure 4.19 show four example frames for the particle-set tracker output and

the sweep tracker output respectively. These frames illustrate the type of rigid motions and

common fitting errors for each tracker between frames 360 and 890.


Figure 4.17: Performance comparison from frame 360 until 890. Left column: particle-set results; right column: sweep results. (a), (b) Cost function; (c), (d) contour distance; (e), (f) SNR.


Figure 4.18: Example frames (523, 584, 627, and 817) of the particle-set tracker output from frame 360 to frame 890.

Figure 4.19: Example frames (523, 584, 627, and 817) of the sweep tracker output from frame 360 to frame 890.


There is certain variability among the three performance measure results for each

implementation; this is due to the different ways in which the performance measures are

calculated. Despite this variability in the results, we can clearly conclude that the sweep hand

tracker implementation produces a more accurate output, and it is slightly more stable, than

the particle-set implementation. Finally, another advantage of the sweep implementation is

that it is about 1.42 times faster than the particle-set implementation. This difference is

important in real-time systems as it can increase the responsiveness of the system.

4.11 Relationship between performance measures

In the previous section we used three indicators, cost function, contour distance, and SNR, in

order to assess the performance of the two hand tracking implementations. If we look at the

three charts for the sweep implementation, we can see that, on average, the three indicators

agree: when one indicator shows good performance the other two tend to show good

performance too, and when one indicator shows bad performance the other two tend to show

bad performance too. However, if we look at the results for individual frames we can often

see small disagreements, for example: for a particular frame the cost function could rise,

indicating a better fit of the hand model to the image, and the distance metric could rise too,

indicating a larger distance from the ground truth. The same is true between cost function

and SNR; and even, to a lesser degree and because of different reasons, between distance

metric and the SNR.

The main cause of these disagreements is that the distance metric and the SNR are based on

a ground truth, but the cost function is based on a measurement model, as discussed in

Section 4.3. On the one hand, the ground truth is calculated manually and can potentially

involve human errors. On the other hand, the measurement model used in the cost function

explores only certain areas of the image, the measurement lines, in order to evaluate the

fitness between a contour model and the image features. This is an efficient measurement

model in the sense of not having to explore the whole image in order to assess the contour

likelihood. However, the measurement lines can pick up noise and other non-target-related

image features that could result in the incorrect fitness of a contour. Finally, the contours of

the hand in the image (used to establish the ground truth) may differ slightly from the

contours of the hand in the skin colour image (used to calculate the fitness of the contour).


Depending on what type of features are searched along the measurement lines, the cost

function can be more or less prone to report incorrect contour likelihoods. In Section 6.1 we

will see that, under certain assumptions, the skin colour based measurement model is less

prone to report incorrect likelihoods than an equivalent edge detection based measurement

model.

In order to study the relationship between the cost function and a ground truth based

measurement function, the following experiment is performed:

• We run the sweep hand tracker implementation with the test video sequence, section for

rigid tracking; while tracking is locked onto the target, we collect 30000 hand contour

hypotheses, these are particles from the Palm particle set. Some of these hypotheses will

be well aligned on top of the hand, as it appears on the image, and therefore will have a

high weight. Other hypotheses will be poorly aligned or placed away from the hand and

therefore will have a low weight. We evaluate the cost function and distance metric for

all of them.

The results of this experiment are collected in a chart containing 30000 points, one for each

hand contour hypothesis. Figure 4.20(a) shows this chart using logarithmic axes for both the

cost function and distance metric. In this chart it is possible to see a relationship between the

cost function and the distance metric. As the distance metric increases, the cost function

decreases. The strength of this relationship can be calculated by performing a Spearman's rank

correlation on the dataset. The resulting correlation coefficient is -0.6941. This indicates a fairly

strong negative correlation between the cost function and the distance metric. In Figure

4.20(b) we can see the same data using linear axes. This chart shows that the cost function is

very low at large distances and grows extremely rapidly once the distance metric falls below 200. Figure

4.20(c) shows a close-up view, in which it is possible to see that the cost function peaks at

a distance metric of around 20. These observations give us a better insight into the distance

metric performance charts of Section 4.10. One thing to note is that even in an idealized

tracker (that is a tracker that could find the hypothesis with the largest possible weight for

every frame of a tracking sequence) a chart of the distance metric for some tracking sequence

will show an average distance metric of about 20. Another thing to note is that a hypothesis

with a large weight is certainly going to be near the real contour. As the weight of a

hypothesis decreases, it can be said that for a small interval the distance to the real contour

increases. However, a low weight does not tell us anything about how

near the hypothesis is to the real contour. In general it would be far from the real contour, but it could also

be near.
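Spearman's rank correlation, used above to quantify the relationship between the cost function and the distance metric, is the Pearson correlation of the ranks of the two variables. The following sketch computes it with the standard library only, assigning average ranks to tied values; the function names are chosen for the example, and the data in any call would be the per-hypothesis pairs of distance metric and cost function values.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(x, y):
    """Spearman's rank correlation coefficient."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks are compared, the coefficient is unaffected by the very large dynamic range of the cost function values.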

Figure 4.20: Relationship between cost function and distance metric. (a) Both axes in logarithmic scale, (b) both axes in linear scale, (c) both axes in linear scale, close-up view.

4.12 Technologies employed in the implementations

The development of the hand contour trackers presented in this chapter has involved a

number of third-party libraries:

• Oxford Tracking Library (OTL, 1999). Developed by the Oxford Visual Dynamics

Group, this library provides a framework for contour tracking. This library was useful for

understanding the implementation of contour trackers, and was used in early

experiments. From there, a new library was specifically written to suit the requirements

of the hand trackers developed in this thesis. The only part of OTL that the trackers in this

thesis still use is the part related to B-spline drawing.


• OpenCV (OpenCV, 2006). OpenCV is an open source computer vision library developed

by Intel. Several functions from this library are used in the implementation of the hand

trackers in this thesis.

• DirectShow SDK (DirectShow, 2005). This library is used for video streaming under

Microsoft Windows platforms.

4.13 Conclusions

This chapter has presented two articulated hand contour tracker implementations based on

the techniques described in Chapter 3. These are named the particle-set and sweep

implementations. Both implementations were benchmarked using the same test video

sequence, and three performance measures were evaluated for each implementation. The

sweep implementation produced the best results in both tracking accuracy and computational

speed. Note that an increase in the number of particles used in any of this chapter's hand

tracking implementations would also increase the tracking accuracy. However, all the hand

tracking implementations in this thesis use only 250 particles – unless otherwise specified.

This number of particles is chosen because it produces a reasonably good tracking performance

while allowing the tracker to run in real-time.

The two hand trackers presented in this chapter are novel for a number of reasons:

• The two trackers use partition sampling in combination with particle interpolation. The

combination of these two techniques makes it possible to track a 14 DOF articulated

hand template using only 250 particles. An equivalent hand contour tracker using only

Condensation would be unfeasible, as it would require nearly 250 million particles in

order to have a similar tracking accuracy.

• The measurement model is exclusively based on skin colour.

• The resampling scheme uses only the 10% of the particles with highest weight.

• The particular point of view from which the hand is tracked: palm parallel to camera's

image plane. In practice, small variations from the plane parallel to the camera's image

plane are still tracked, especially variations on the tilt of the hand with respect to the

vertical.

• The articulated hand model used is novel in its own right, especially for the representation of

the finger flexion/extension as a change in the perceived finger length.


The measurement model used in this chapter uses a skin colour classifier that is especially

well suited for use in HCI. This classifier is presented in the following chapter.


5 A skin colour classifier for HCI

This chapter proposes a skin colour classifier, which is designed to detect skin colour in

tracking applications, such as the ones found in Human Computer Interaction (HCI) systems;

in particular, this classifier is used in the measurement function, Section 4.3, of the

articulated hand trackers presented in this thesis. The classifier, named the Linear Container

(LC) classifier, uses four decision planes in order to define a volume of the RGB space

where skin colours are likely to appear. The main features of the LC classifier are:

• Rapid evaluation.

• Minimal storage requirements.

• Resistance to illumination (brightness) changes equivalent to that of classifiers that work

in normalised RGB.
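The idea of bounding a skin colour volume in RGB space with four decision planes can be sketched as below. The plane coefficients are purely hypothetical values chosen for illustration; they are not the parameters of the LC classifier, which are found in the initialisation step. In this sketch every plane passes through the origin (zero offset), which is an assumption of the example.

```python
# Hypothetical planes (a, b, c, d): a colour is on the skin side of a plane
# when a*R + b*G + c*B + d >= 0. These values are illustrative only.
EXAMPLE_PLANES = [
    (1.0, -1.0, 0.0, 0.0),   # R >= G
    (0.0, 1.0, -1.0, 0.0),   # G >= B
    (-1.0, 2.0, 0.0, 0.0),   # 2G >= R
    (0.0, -1.0, 2.0, 0.0),   # 2B >= G
]


def is_skin(rgb, planes=EXAMPLE_PLANES):
    """True when the colour lies on the skin side of all four decision planes."""
    red, green, blue = rgb
    return all(a * red + b * green + c * blue + d >= 0
               for a, b, c, d in planes)
```

Each test costs a handful of multiplications and comparisons, and only sixteen coefficients need to be stored. With zero offsets, as assumed here, scaling a colour by a positive brightness factor does not change the decision.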

The classifier needs an initialisation step, where a single training image, with marked skin

and background areas, is analysed in order to find the model parameters of the classifier.

Other features of the LC classifier related to the initialisation step are:

• It can be tuned to maximize the detection of skin colour for a specific person, and for

specific illumination conditions.


• Fast calculation of the model parameters, which can allow the model to be updated

dynamically.

• Even when the classifier is initialised under non-ideal conditions, the skin

detection rates are still high.

• Good generalisation of training data.

The LC classifier is tested under various illuminations and on various skin tones; the

results are then compared with other classifiers. The time the LC classifier spends during the

initialisation step can be greatly reduced by reducing the resolution of the training image.

The performance of the LC classifier is studied when the initialisation step is repeated at

various decimated resolutions. Some HCI usability factors, related to non-ideal

initialisations, are also explored. Finally, the use of the LC classifier is illustrated with two example

applications: firstly, an initialisation example of the LC classifier in an HCI application;

secondly, a dynamic tuning procedure for the LC classifier is presented in the context of

contour tracking.

5.1 Previous Work on Skin Colour Detection

Skin colour provides an important source of information for computer vision systems that

monitor people. The skin colour cue is widely used in face detection and recognition

systems, various types of surveillance, vision-based biometric systems, and vision-based

HCI systems. All these application areas use skin colour to track, locate and interpret people,

with relatively efficient, fast, low-level methods.

The goal of skin colour detection is to build a decision rule that can discriminate between the

skin and non-skin colour pixels of an image. Because of the importance of skin colour

detection there have been numerous approaches to solve this task. The various approaches

can be grouped into the following four categories: non-parametric skin distribution

modelling, parametric skin distribution modelling, explicitly defined skin region modelling,

and dynamic skin colour modelling (Vezhnevets et al., 2003).

Non-parametric skin distribution modelling uses training data to estimate a skin colour

distribution. This estimation process is sometimes referred to as the construction of a Skin


Probability Map (SPM) assigning a probability value to each point of a discretized colour

space (Jones and Rehg, 1999; Brand and Mason, 2000; Gomez, 2002). A SPM can be

implemented by a colour histogram, and such approaches normally use the chrominance

plane of some colour space in order to offer resistance to illumination changes (Jones and

Rehg, 1999; Chen, et al., 1995; Zarit, et al., 1999; Schumeyer and Barner, 1998). SPMs can

use a Bayes classification rule in order to improve their performance, in this case two colour

histograms are required; one for the probability of skin colour, and another for the

probability of non-skin colour (Jones and Rehg, 1999; Zarit, et al., 1999; Chai and

Bouzerdoum, 2000). The main disadvantages of SPMs are the high storage requirements and

the fact that their performance directly depends on the representativeness of the training

images.

Parametric skin distribution modelling can represent skin colour in a more compact form.

Common examples of parametric modelling model a skin colour distribution using a single

Gaussian (Menser and Wien, 2000; Terrillon, et al., 2000; Ahlberg, 1999), or a mixture of

Gaussians (Jones and Rehg, 1999; Terrillon, et al., 2000; Yang and Ahuja, 1999).

Expectation Maximization (EM) algorithms are used on training data to find the model

parameters that produce the best fit. The goodness of fit, and therefore the performance of

the model, depends on the shape of the chosen model and the chosen colour space. This

dependence of performance on the colour space is stronger in the case of parametric

modelling than it is in the case of non-parametric modelling (Brand and Mason, 2000; Lee

and Yoo, 2002).

Another way to build a skin colour classifier is to define explicitly, through a number of

rules, the boundaries of a skin cluster in some colour space; this is called explicitly defined

region modelling. The obvious advantage of this method is its computational simplicity,

which has attracted many researchers (Ahlberg, 1999; Peer, et al., 2003; Fleck, et al., 1996;

Jorda, et al., 1999), as it leads to the construction of a very rapid classifier. However, in order

to achieve high recognition rates both a suitable colour space and adequate decision rules

need to be found empirically. Gomez and Morales (2002) proposed a method that can build a

set of rules automatically by using machine learning algorithms on training data. They

reported results comparable to the Bayes SPM classifier in RGB space for their data set.


Finally, we have dynamic skin colour modelling. This category of skin modelling methods is

designed for skin detection during tracking. Skin detection in this category is different from

static image analysis in a number of aspects. First, in principle, the skin models in this

category can be less general, i.e. tuned for a specific person, camera, or lighting. Second, an

initialisation stage is possible, in which the skin region of interest is segmented from the

background by a different classifier or manually; this makes it possible to obtain a skin

classification model that is optimal for the given conditions. Finally, this category of skin

models can update themselves in order to match changes in lighting conditions.

Some of the methods in this category use Gaussian distribution adaptation (Yang and Ahuja,

1998), or dynamic histograms (Soriano, et al., 2000; Stern and Efros, 2002; Sigal, et al.,

2000). In (Soriano, et al., 2000) a skin locus, in rg space, is constructed beforehand from

training data. Then, during tracking, their dynamic skin colour histogram is updated with

pixels from the bounding box of the target, provided these pixels belong to the skin locus.

This makes the dynamic histogram less likely to adapt to colour distributions other than that

of skin.

The proposed LC classifier belongs to the last two categories. The classifier is implemented

using rules similar to those of the explicitly defined skin region models; however, these rules

are parameterised in order that they can be tuned to specific conditions, during an

initialisation stage. The parameters of the LC classifier can also be recalculated rapidly in

order to adapt to changing illumination conditions.

5.2 Development of the LC classifier

The LC was designed to overcome some of the shortcomings of three other skin colour

classifiers. This section briefly describes these three skin colour classifiers that lead towards

the construction of the final LC classifier.

5.2.1 RGB histogram classifier

One of the simplest skin colour classifiers is an SPM implemented with an RGB histogram

(Jones and Rehg, 1999; Chen, et al., 1995; Zarit, et al., 1999; Schumeyer and Barner, 1998).

The RGB colour space is quantised into a number of bins, for example 256x256x256 bins.

Each bin, defined by a triad of values, stores the number of times this particular colour

occurred in the training skin images. After the training stage, a pixel can be tested as being


skin colour or not by using the RGB components of the pixel to form the address of a bin in

the histogram. The main features of the RGB histogram classifier are:

• Very fast.

• Large storage requirements.

• Poor resistance to illumination changes.

• Need of a large data set for training.

• Larger bin size can reduce storage requirements, and account for training data sparsity;

however, then, the false positives can increase.

Jones and Rehg (1999) reported the 32x32x32 quantisation as being the best compromise

regarding storage, generalisation, and detection rates. Figure 5.1 shows a skin colour

histogram made from a single skin colour sample. The quantisation is 32x32x32, and only

the bins with a count greater than 5 are shown.

Figure 5.1: Skin colour RGB histogram.
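A minimal sketch of an RGB histogram classifier with 32x32x32 quantisation is shown below. It assumes 8-bit colour components, stores only the occupied bins in a dictionary, and uses a hypothetical minimum bin count as the decision rule; none of these choices is taken from the classifiers cited above.

```python
from collections import Counter

BINS = 32                 # bins per colour component
BIN_WIDTH = 256 // BINS   # 8 intensity levels per bin for 8-bit components


def bin_of(rgb):
    """Address of the histogram bin for an 8-bit (R, G, B) colour."""
    red, green, blue = rgb
    return (red // BIN_WIDTH, green // BIN_WIDTH, blue // BIN_WIDTH)


def train_histogram(skin_pixels):
    """Count how often each quantised colour occurs in the training skin pixels."""
    return Counter(bin_of(pixel) for pixel in skin_pixels)


def is_skin(rgb, histogram, min_count=1):
    """Classify a colour as skin when its bin was seen at least `min_count` times."""
    return histogram[bin_of(rgb)] >= min_count
```

A colour that falls in a trained bin is accepted, whereas the same colour at half the brightness falls in a different bin and is rejected, which illustrates the poor resistance to illumination changes listed above.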

5.2.2 Normalised RGB histogram classifier

Normalised RGB can be easily obtained from the RGB values by the following

normalisation procedure:

$$r = \frac{R}{R+G+B}, \qquad g = \frac{G}{R+G+B}, \qquad b = \frac{B}{R+G+B} \qquad (5.1)$$

As the sum of the three normalised components is 1, the third component does not hold

significant information and can thus be omitted. The remaining components, often called

“pure colours”, have a diminished dependence on brightness. A property of this

representation is that for matte surfaces normalized RGB is invariant (under certain

assumptions) to changes of surface orientation relative to the light source (Skarbek and


Koschan, 1994). This, together with the transformation simplicity has made this colour space

a popular choice among several researchers (Zarit, et al., 1999; Lee and Yoo, 2002; Peer, et

al., 2003; Stern and Efros, 2002; Yang and Ahuja, 1998; Brown, et al., 2001; Soriano, 2000;

Oliver, et al., 1997). A skin colour histogram using the r and g components is therefore going

to be more resistant to illumination (brightness) changes than a bare skin colour RGB

histogram. The main features of the rg histogram classifier are:

• Fast, but slower than an RGB histogram. This is because every pixel has to be

normalised before accessing the histogram.

• Good resistance to illumination (brightness) changes.

• Needs less training data than an RGB histogram.

Figure 5.2 shows an rg histogram made from a single skin colour sample.

Figure 5.2: Normalised RGB histogram.
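The brightness independence of the rg representation can be checked with a short sketch of Equation 5.1. The function names, the handling of pure black, and the default of 64 bins per axis are assumptions made for this example.

```python
def normalise_rg(r, g, b):
    """Return the (r, g) components of Equation 5.1 for an RGB triple."""
    total = r + g + b
    if total == 0:
        return (0.0, 0.0)  # assumption: pure black is mapped to (0, 0)
    return (r / total, g / total)


def rg_bin(r, g, b, bins=64):
    """Address of the rg histogram bin for an RGB triple."""
    rn, gn = normalise_rg(r, g, b)
    return (min(int(rn * bins), bins - 1), min(int(gn * bins), bins - 1))


bright = normalise_rg(200, 150, 120)
dark = normalise_rg(100, 75, 60)   # same colour at half brightness
print(bright, dark)
print(rg_bin(200, 150, 120) == rg_bin(100, 75, 60))
```

Both triples map to the same point of the rg-plane, so they address the same histogram bin. The division per pixel is the extra cost that makes this classifier slower than the plain RGB histogram.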

5.2.3 Projected RGB histogram

The projection from RGB (3D space) to normalised RGB (2D space) corresponds with a

cone in the original 3D RGB space, in that each point in the rg-plane corresponds to a 3D

line of colour values in the original RGB space. These lines meet at (0, 0, 0), and points

along the lines correspond to scaling of white illumination. Therefore, a skin colour cluster in

the rg-plane corresponds to a cone-like cluster in RGB space. This is illustrated in Figure 5.3.


Figure 5.3: Projection from rg to RGB. (a) Skin colour cluster, in the rg-plane, from a single sample. (b) The

rg-plane skin colour cluster projected to RGB space; each point in the rg-plane becomes a line in

RGB space.

A new skin colour classifier is proposed: the projected RGB histogram classifier. This

classifier tries to combine the evaluation speed of the RGB histogram and the illumination

independence of the rg histogram. The procedure to construct this projected histogram is to

create an rg histogram from the training data, and then project it to RGB, creating an RGB

histogram. The bins of the resulting RGB histogram are processed using a 3D median filter

in order that the skin colour cone becomes more solid, and fills any possible gaps (due to

sparse data in the original rg histogram). The main features of the projected RGB histogram

classifier are:

• As fast as the RGB histogram classifier.

• Good resistance to illumination changes (same as the rg histogram classifier).

• Needs less training data than an RGB histogram.

• Large storage requirements.

• The processing of the resulting RGB histogram with a 3D median filter, can account for

some training data sparsity, resulting in better generalisation.
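The projection step can be sketched as follows, under several assumptions: each RGB bin is represented by its centre colour, the function names are hypothetical, and the 3D median filter is omitted. The toy example uses a coarse rg histogram of 20×20 bins so that a single training colour produces a connected cone; with a finer rg histogram and sparse data, gaps appear along the cone, which is the situation the median filter is meant to address.

```python
def rg_key(r, g, b, rg_bins):
    """Quantised (r, g) chromaticity of an RGB triple; None for pure black."""
    total = r + g + b
    if total == 0:
        return None
    return (min(int(rg_bins * r / total), rg_bins - 1),
            min(int(rg_bins * g / total), rg_bins - 1))


def build_rg_histogram(skin_pixels, rg_bins):
    """Count training pixels per rg bin."""
    hist = {}
    for r, g, b in skin_pixels:
        key = rg_key(r, g, b, rg_bins)
        if key is not None:
            hist[key] = hist.get(key, 0) + 1
    return hist


def project_to_rgb(rg_hist, rg_bins, rgb_bins=32):
    """Give every RGB bin the count of the rg bin that its centre falls in."""
    step = 256 // rgb_bins
    projected = {}
    for i in range(rgb_bins):
        for j in range(rgb_bins):
            for k in range(rgb_bins):
                centre = (i * step + step // 2,
                          j * step + step // 2,
                          k * step + step // 2)
                count = rg_hist.get(rg_key(*centre, rg_bins), 0)
                if count:
                    projected[(i, j, k)] = count
    return projected


def is_skin(projected, pixel, rgb_bins=32, threshold=5):
    """Direct RGB lookup, with no per-pixel normalisation."""
    step = 256 // rgb_bins
    r, g, b = pixel
    return projected.get((r // step, g // step, b // step), 0) > threshold


RG_BINS = 20  # coarse on purpose for this toy example
rg_hist = build_rg_histogram([(200, 150, 120)] * 10, RG_BINS)
projected = project_to_rgb(rg_hist, RG_BINS)
print(is_skin(projected, (200, 150, 120)), is_skin(projected, (100, 75, 60)))
```

Classification is a plain RGB lookup, as in Section 5.2.1, yet the darker version of the training colour is now accepted because its bin lies on the projected cone.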


5.3 The Linear Container (LC) classifier

The proposed LC classifier attempts to reproduce and improve the results of the projected

RGB histogram while reducing the storage requirements. The LC classifier uses a polyhedral

cone, constructed from four decision planes, in order to model the cone-like region in RGB

space that results from the projection of a skin colour cluster in the rg-plane to the RGB

space. The LC classifier performs pixel-based segmentation. If an RGB value is inside the

polyhedral cone volume, it is classified as skin; if the RGB value is outside the polyhedral

cone volume then it is classified as non-skin. The definition of the four decision planes is:

\[
\mathit{BGhmin} \cdot G + \mathit{BRmin} \cdot R < B < \mathit{BGhmax} \cdot G + \mathit{BRmax} \cdot R \qquad (5.2)
\]

where BGhmin and BRmin parameterise the lower "horizontal" plane, and BGhmax and

BRmax parameterise the higher horizontal plane. The horizontal planes are illustrated in

Figure 5.4(a). These two planes confine a volume between them by constraining the values

that B can take in relation to R and G. This volume is further constrained by two "vertical"

planes:

\[
\mathit{BGvmin} \cdot B + \mathit{GRmin} \cdot R < G < \mathit{BGvmax} \cdot B + \mathit{GRmax} \cdot R \qquad (5.3)
\]

where BGvmin and GRmin parameterise the left vertical plane, and BGvmax and GRmax

parameterise the right vertical plane. The vertical planes confine a volume between them by

constraining the values that G can take in relation to R and B. The vertical planes are

illustrated in Figure 5.4(b).

Figure 5.4: LC classifier decision planes. (a) Horizontal decision planes. (b) Vertical decision planes.

As the RGB values that are close to the origin carry too little colour information, we need an

additional rule in order to truncate the apex of the polyhedral cone. Possible rules are:


\[
\begin{aligned}
\mathit{Rmin} &< R && \text{illustrated in Figure 5.5(a)} \\
\mathit{RGSum} &< R + G && \text{illustrated in Figure 5.5(b)} \\
\mathit{RGBSum} &< R + G + B && \text{illustrated in Figure 5.5(c)}
\end{aligned}
\qquad (5.4)
\]

Each rule performs better than the previous one, but also requires one more addition than the

previous one. If a colour value satisfies Equations 5.2 and 5.3, and one of the dark-pixel rules

of Equation 5.4, then it is inside the truncated polyhedral cone, and therefore it is classified as

skin colour.

Figure 5.5: Possible decision planes to avoid dark pixels.
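The complete decision rule can be written as a short sketch. The default plane parameters are the a priori values of Table 5.1; the class and function names, and the dark-pixel threshold of 100 on R + G + B, are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class LCParams:
    # A priori values from Table 5.1.
    bgh_min: float = -1.2
    bgh_max: float = -1.2
    br_min: float = 0.973
    br_max: float = 1.55
    bgv_min: float = 0.76
    bgv_max: float = 0.76
    gr_min: float = 0.104
    gr_max: float = 0.476
    # Hypothetical threshold for the third dark-pixel rule of Equation 5.4.
    rgb_sum: float = 100.0


def is_skin(p, r, g, b):
    """True if (r, g, b) lies inside the truncated polyhedral cone."""
    horizontal = (p.bgh_min * g + p.br_min * r
                  < b < p.bgh_max * g + p.br_max * r)   # Equation 5.2
    vertical = (p.bgv_min * b + p.gr_min * r
                < g < p.bgv_max * b + p.gr_max * r)     # Equation 5.3
    bright_enough = p.rgb_sum < r + g + b               # Equation 5.4
    return horizontal and vertical and bright_enough


params = LCParams()
for colour in [(200, 150, 120), (100, 75, 60), (20, 15, 12), (60, 120, 200)]:
    print(colour, is_skin(params, *colour))
```

Because all four planes pass through the origin, scaling a colour by a brightness factor does not move it across them; only the dark-pixel rule depends on the absolute intensity. The classifier stores nine numbers instead of a histogram.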

The LC classifier can be tuned for a specific person, camera, or lighting conditions, in an

initialisation step. For this, an initialisation image is needed. This initialisation image is

composed of two approximately complementary masks; one mask delimits the target skin

colour area, we call this mask SkinMask; and the other mask comprises areas where we do

not expect to find skin colour, we call this mask BackgroundMask. Figure 5.6 shows an

initialisation image segmented by the two masks. The BackgroundMask can be tailored in

order to avoid areas of skin colour in addition to those included in SkinMask, for example,

Figure 5.6(b) avoids the subject's wrist. The two masks can be generated manually, or

automatically by a tracking system.

Figure 5.6: Initialisation image masks. (a) Initialisation image segmented by SkinMask. (b) Initialisation

image segmented by BackgroundMask.


The tuning procedure uses a heuristic method by which the parameters of the decision planes

are changed in sequence. Each time a parameter is changed, the fitness of the LC classifier,

to the detection of skin colour in the SkinMask and to the rejection of skin colour in the

BackgroundMask, is measured using the following equation:

\[
\mathit{fitness} = TI \times \frac{\#\ \text{skin pixels in SkinMask}}{\text{size of SkinMask}} - \frac{\#\ \text{skin pixels in BackgroundMask}}{\text{size of BackgroundMask}} \qquad (5.5)
\]

where TI (Target Importance) is used to control the importance of the target skin colour area

in the fitness. In the experiments of the following sections TI = 2, so that detecting skin in

the SkinMask is given twice the importance of avoiding skin detections in the

BackgroundMask. This parameter allows the classifier to be tuned to favour either true positives or

true negatives.
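Equation 5.5 can be evaluated as in the following sketch. The representation of the image as a dictionary from pixel coordinates to colours, the representation of the masks as coordinate lists, and the stand-in classifier are all assumptions made for the example.

```python
def fitness(classify, image, skin_mask, background_mask, ti=2.0):
    """Equation 5.5: weighted detection rate minus false detection rate.

    image maps (x, y) to a colour; each mask is a list of (x, y) coordinates.
    """
    skin_hits = sum(1 for xy in skin_mask if classify(image[xy]))
    background_hits = sum(1 for xy in background_mask if classify(image[xy]))
    return (ti * skin_hits / len(skin_mask)
            - background_hits / len(background_mask))


# Toy data (hypothetical): two skin pixels and two background pixels.
image = {
    (0, 0): (200, 150, 120),
    (1, 0): (190, 140, 110),
    (0, 1): (180, 180, 180),
    (1, 1): (40, 60, 90),
}
skin_mask = [(0, 0), (1, 0)]
background_mask = [(0, 1), (1, 1)]


def red_dominant(colour):
    """Stand-in classifier used only for this example."""
    return colour[0] > 150


print(fitness(red_dominant, image, skin_mask, background_mask))
```

Here both skin pixels are detected and one of the two background pixels is wrongly accepted, so the fitness is 2 × 1.0 − 0.5 = 1.5.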

The heuristic search, by which the parameters of the decision planes are changed, is

illustrated in Figure 5.7(a). This figure shows a section view of the RGB cube, corresponding

to the B-G-plane with maximum R. Lines 1, 2, 3 and 4 are the intersections of the four

decision planes with the section view. Starting from some a priori values, given in Table 5.1,

the search varies BRmin, then BRmax, GRmin, and finally GRmax; first, reducing their

values, then increasing their values, and measuring the fitness (Equation 5.5) at each step.

The values that produce the best fitness are finally selected. Note that the angle of each

decision plane remains unchanged in this heuristic search.

Figure 5.7: Tuning heuristics. (a) First tuning heuristic. (b) Tuning enhancement.


BGhmin = -1.2 BGhmax = -1.2

BGvmin = 0.76 BGvmax = 0.76

BRmin = 0.973 BRmax = 1.55

GRmin = 0.104 GRmax = 0.476

Table 5.1: LC classifier a priori values.
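The first tuning heuristic can be read as a sequential search over the four offset parameters. The sketch below is one possible interpretation, not the implementation used for the experiments: the step size, the number of steps, the function names, and the synthetic fitness function are assumptions. In practice the fitness function would evaluate Equation 5.5 on the initialisation image.

```python
A_PRIORI = {
    "BGhmin": -1.2, "BGhmax": -1.2, "BGvmin": 0.76, "BGvmax": 0.76,
    "BRmin": 0.973, "BRmax": 1.55, "GRmin": 0.104, "GRmax": 0.476,
}


def tune_offsets(params, fitness_fn, step=0.05, n_steps=5,
                 order=("BRmin", "BRmax", "GRmin", "GRmax")):
    """Vary one offset parameter at a time and keep the best value found."""
    best = dict(params)
    best_fit = fitness_fn(best)
    for name in order:
        start = best[name]
        # First reduce the value, then increase it (step sizes are assumed).
        offsets = ([-k * step for k in range(1, n_steps + 1)]
                   + [k * step for k in range(1, n_steps + 1)])
        for offset in offsets:
            trial = dict(best)
            trial[name] = start + offset
            trial_fit = fitness_fn(trial)
            if trial_fit > best_fit:
                best, best_fit = trial, trial_fit
    return best, best_fit


# Synthetic fitness (hypothetical): peaks at a known set of offsets.
TARGET = {"BRmin": 0.873, "BRmax": 1.65, "GRmin": 0.104, "GRmax": 0.526}


def synthetic_fitness(p):
    return -sum((p[k] - TARGET[k]) ** 2 for k in TARGET)


tuned, tuned_fit = tune_offsets(A_PRIORI, synthetic_fitness)
print(tuned)
```

Only BRmin, BRmax, GRmin, and GRmax are changed; the slope parameters keep their a priori values, which corresponds to the angle of each decision plane remaining unchanged.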

The performance of the tuned LC classifier can be increased by using a further tuning

heuristic. This tuning heuristic, referred to as tuning enhancement, is performed after the first

tuning heuristic is finished. Starting from the model parameters resulting from the first

heuristic search, this tuning enhancement proceeds to vary the eight model parameters of the

LC classifier. First, a rotating pivot is calculated for each decision plane. A rotating pivot is

the midpoint of each of the lines 1, 2, 3, and 4. In Figure 5.7(b) the rotating pivots are labelled a, b,

c, and d. Then each decision plane is rotated around its pivot, sequentially: decision plane 1

around pivot a, decision plane 2 around pivot b, and so on. The rotation of a decision plane

around its pivot involves the two parameters that define the plane; for example, for decision

plane 1, rotation around pivot a involves BRmin and BGhmin. At each step during a rotation

the fitness (Equation 5.5) is calculated. The values that produce the best fitness are finally

selected. A video illustrating the tuning operation is available in Appendix B and on the

supporting webpage as "video_sequence_5.1.avi".
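One geometric reading of the rotation step, for the lower horizontal plane, is sketched below. In the section with maximum R, Equation 5.2 gives the line B = BGhmin·G + BRmin·Rmax; rotating this line about a pivot changes its slope, and the offset is then recomputed so that the line still passes through the pivot. The function name, the use of an angle in degrees, and the choice of pivot coordinate are assumptions made for the example.

```python
import math


def rotate_plane(slope, offset, pivot_g, delta_deg, r_max=255.0):
    """Rotate the line B = slope*G + offset*r_max about its pivot.

    Returns the new (slope, offset) pair, e.g. (BGhmin, BRmin) for plane 1.
    """
    pivot_b = slope * pivot_g + offset * r_max          # pivot lies on the line
    new_slope = math.tan(math.atan(slope) + math.radians(delta_deg))
    new_offset = (pivot_b - new_slope * pivot_g) / r_max
    return new_slope, new_offset


# Rotate decision plane 1 by 5 degrees about a pivot at G = 100 (hypothetical).
new_bgh_min, new_br_min = rotate_plane(-1.2, 0.973, 100.0, 5.0)
print(new_bgh_min, new_br_min)
```

Both parameters of the plane change together, which matches the observation that a rotation around pivot a involves BRmin and BGhmin. A tuning loop would evaluate the fitness of Equation 5.5 for a range of angles and keep the best pair.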

5.4 Performance Results

The LC classifier is tested on video sequences of subjects with four different skin tonalities:

Mediterranean, white Caucasian, black African, and Chinese. The target skin colour area is

the subject's hand. The subjects hold their hand open in front of the camera, and move the

hand towards and away from the camera. An overhead lamp affects the illumination of the

subject's hand. When the subject's hand is closer to the camera, it is under a shadow and

looks darker. When the hand is further away from the camera, it is under the lamp and looks

brighter. The classifier is initialised once, using the first frame of each video sequence, and

using the first tuning heuristic described in Section 5.3.

The skin colour detection performance is calculated for each video sequence, using a ground

truth. The ground truth consists of two masks, which have been manually generated for every

fifth frame of the four video sequences. The ground truth considers the subject's hand as the


target area for skin colour detection. This area is segmented using the SkinTruth mask,

Figure 5.8(b). The background is segmented using the BackgroundTruth mask, Figure 5.8(c).

Note that the BackgroundTruth mask is not the complement of the SkinTruth mask. The

BackgroundTruth mask avoids the target skin colour area, the subject's hand, and any other

skin colour areas in the image; therefore, for each measurement frame, there will be some

areas which will not take part in the counting; these areas correspond to the subject's face and

arms. Both masks are tested for skin colour. Skin colour pixels found in the SkinTruth mask

constitute true-positives. Non-skin colour pixels found in the BackgroundTruth mask

constitute true-negatives. In order to compare detection results between frames, the true-

positives and true-negatives are normalised to the size of SkinTruth and BackgroundTruth

masks respectively. Normalised true-positives are referred to as NTP, and normalised true-

negatives are referred to as NTN.
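The two normalised rates can be computed as in the following sketch, where the detected pixels and the ground truth masks are represented as sets of pixel coordinates. The function name and the toy data are assumptions made for the example.

```python
def normalised_rates(detected, skin_truth, background_truth):
    """Return (NTP, NTN) for one measurement frame.

    detected: set of (x, y) pixels classified as skin colour.
    skin_truth, background_truth: sets of (x, y) pixels in each mask.
    Pixels outside both masks do not take part in the counting.
    """
    true_positives = len(detected & skin_truth)
    true_negatives = len(background_truth - detected)
    return (true_positives / len(skin_truth),
            true_negatives / len(background_truth))


# Toy frame (hypothetical): (4, 4) belongs to neither mask, e.g. the face.
detected = {(0, 0), (1, 0), (4, 4), (5, 5)}
skin_truth = {(0, 0), (1, 0), (2, 0), (3, 0)}
background_truth = {(5, 5), (6, 5), (7, 5), (8, 5), (9, 5)}

ntp, ntn = normalised_rates(detected, skin_truth, background_truth)
print(ntp, ntn)
```

In this toy frame two of the four hand pixels are detected (NTP = 0.5) and four of the five background pixels are rejected (NTN = 0.8); the detection at (4, 4) is ignored because it lies outside both masks.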

Figure 5.8: Ground truth masks. (a) Original frame. (b) SkinTruth mask. (c) BackgroundTruth mask.

We use the skin colour classifiers described in Section 5.2 as a comparison reference. The

RGB skin colour histogram has a size of 32×32×32 bins; the rg histogram has a size of

64×64 bins; and the projected RGB histogram has a size of 100×100 bins for the initial rg

histogram, and 32×32×32 bins for the projected RGB histogram. All the histograms used for

comparison are constructed in an initialisation step at the beginning of each sequence from

the pixels in SkinMask. A pixel is classified as skin colour if its corresponding bin in the

histogram is bigger than a threshold. The choice of the threshold affects the detection rate of

the histogram. In general, if the threshold increases, NTN tends to be higher, but NTP tends

to be lower; if the threshold decreases, NTP tends to be higher, but NTN tends to be lower.

For the tested video sequences the thresholds that produce the best results for each histogram

are: 5 for the RGB and projected RGB histograms, and 25 for the rg histogram.

Figure 5.9 shows the results for the Mediterranean subject. The figure presents plots of the

NTP and NTN against the frame number, and two example frames showing the skin colour

classification. The example frames correspond to points in the video sequence at which the


detection rate is maximum and minimum. The row (b) corresponds to the RGB histogram;

the row (c) corresponds to the rg histogram; the row (d) corresponds to the projected RGB

histogram, and the bottom row (e) corresponds to the LC classifier. Figure 5.10, Figure 5.11,

and Figure 5.12, follow the same layout for the white Caucasian, black African, and Chinese

subjects respectively.

We can see in Figure 5.9 that the skin colour detection results for each classifier are

different. The RGB histogram is the most sensitive to light changes, and its performance is

the worst of all classifiers. The performance of the other three classifiers is more alike,

because all of them present an equivalent resistance to illumination (brightness) changes. However, the

LC classifier exhibits slightly larger NTP and NTN throughout the video sequence than both

the rg histogram classifier and the projected RGB histogram classifier. Tests for the other three ethnic

skin tonalities, Figure 5.10, Figure 5.11, and Figure 5.12, reported similar results: the LC

classifier exhibited the same or larger NTP and NTN than both the rg histogram classifier,

and the projected RGB histogram classifier. Note that the ambient illumination was not

controlled during the recording of the video sequences, and this results in slightly different

ambient illuminations for each test.

The original video sequences, and the output of the classifiers, for each of the test subjects,

are available in Appendix B and on the supporting webpage as:

• "video_sequence_5.2.avi" for the Mediterranean subject video sequence, including the

output of the rg histogram classifier and the LC classifier.

• "video_sequence_5.3.avi" for the White Caucasian subject video sequence, including the

output of the rg histogram classifier and the LC classifier.

• "video_sequence_5.4.avi" for the Black African subject video sequence, including the

output of the rg histogram classifier and the LC classifier.

• "video_sequence_5.5.avi" for the Chinese subject video sequence, including the output of

the rg histogram classifier and the LC classifier.


Figure 5.9: Mediterranean subject test; the example frames are frame 90 and frame 110. (a) Original frames. (b) RGB histogram classifier. (c) rg histogram

classifier. (d) Projected RGB histogram classifier. (e) LC classifier.


Figure 5.10: White Caucasian subject test; the example frames are frame 15 and frame 85. (a) Original frames. (b) RGB histogram classifier. (c) rg histogram

classifier. (d) Projected RGB histogram classifier. (e) LC classifier.



Figure 5.11: Black African subject test; the example frames are frame 10 and frame 70. (a) Original frames. (b) RGB histogram classifier. (c) rg histogram

classifier. (d) Projected RGB histogram classifier. (e) LC classifier.



Figure 5.12: Chinese subject test; the example frames are frame 25 and frame 85. (a) Original frames. (b) RGB histogram classifier. (c) rg histogram classifier.

(d) Projected RGB histogram classifier. (e) LC classifier.


An experiment comparing the computational speed of the LC classifier, against other

classifiers, is also carried out. The experiment measures the time taken for a classifier to

check all the pixels in a 640×480 frame. The experiment is repeated for 100 frames of a

video sequence, containing skin colour, and the times used in each frame are averaged. The

test is carried out on an AMD Athlon 3500+ with 1 GB of RAM. The results are shown in Table

5.2.

The RGB LC classifier uses an extra rule to avoid dark pixels; the other classifiers do not use

this rule. The rg LC classifier is the 2D equivalent to the proposed RGB LC classifier. It

works in the rg-plane by using 4 decision lines instead of 4 decision planes. The skin

detection performance of this classifier is equivalent to the RGB LC classifier. The equations

in the rg LC classifier are simpler than those of the RGB LC classifier; however, the former

is slower because it has to normalise each pixel from RGB to rg. The use of a lookup table

containing all the possible normalisations can speed up the normalisation procedure. However,

even when using a lookup table, the RGB LC classifier is about 1.17 times faster than its rg

LC equivalent. The average times for the rg histograms are worse than those of the rg LC

classifier, with the additional storage cost. The RGB histogram is only included as a speed

reference because it is the fastest classifier; however, its skin detection rates fall far behind

the other classifiers.

Classifier                              Average time per frame    Speed-up of the RGB LC classifier
                                                                  with respect to the other classifiers

RGB LC classifier                       0.0090 secs               -
rg LC classifier                        0.0147 secs               x1.62
rg LC classifier with lookup table      0.0107 secs               x1.17
rg histogram                            0.0235 secs               x2.59
rg histogram with lookup table          0.0204 secs               x2.25
RGB histogram                           0.0022 secs               x0.24

Table 5.2: Execution time results
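The measurement procedure, in which the time to classify every pixel of a frame is averaged over a number of frames, can be sketched as follows. The function names, the stand-in classifier, and the small synthetic frames are assumptions; absolute timings from an interpreted sketch such as this one are not comparable with those of Table 5.2.

```python
import time


def average_time_per_frame(classify, frames):
    """Average time, in seconds, taken to classify all pixels of a frame."""
    start = time.perf_counter()
    for frame in frames:
        for pixel in frame:
            classify(pixel)
    elapsed = time.perf_counter() - start
    return elapsed / len(frames)


def stand_in_classifier(pixel):
    """Placeholder for any of the classifiers being compared."""
    r, g, b = pixel
    return r > g > b


# Ten small synthetic frames (hypothetical); the experiment used 640x480.
frames = [[(200, 150, 120)] * (64 * 48) for _ in range(10)]
average = average_time_per_frame(stand_in_classifier, frames)
print(f"{average:.6f} s per frame")
```

Running the same harness with each classifier on the same frames gives times whose ratios can be compared in the way the speed-up column of the table does.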

5.5 Tuning at Various Resolutions

So far, the LC classifier has been tuned using an initialisation image of the same size as the

video sequence in which it was tested, 640×480 pixels. It was observed that the tuning of the

LC classifier on a decimated version of the initialisation image, results in little degradation

of the classifier's detection performance on the non-decimated video sequence. This is

because the result of the tuning is more dependent upon the range of colours of the pixels in


the initialisation masks than upon the number of pixels. This fact allows us to speed-up the

tuning procedure, because the amount of data to be dealt with is reduced, while still keeping

similar detection performance. The speed-up of the tuning procedure as a result of using a

decimated initialisation image instead of using a non-decimated initialisation image is: ×4 for

a 320×240 resolution, ×16 for 160×120, ×64 for 80×60, ×256 for 40×30, and ×1024 for a

20×15 resolution. Figure 5.13 shows the NTP of the LC classifier, on the video sequence of

the Mediterranean subject, for various resolutions of the initialisation image. The NTN are

not shown as they remain at almost 1 for the six resolutions. Notice that the NTP for an

initialisation image of 320×240 is virtually the same as the NTP for an initialisation image of

640×480.
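The quoted speed-up factors follow directly from the reduction in the number of pixels that the tuning procedure has to test. The sketch below reproduces them and shows one simple way of decimating an image, by keeping every n-th pixel; the decimation method and the function names are assumptions, since the text does not state how the initialisation image was decimated.

```python
def decimate(image, factor):
    """Keep every factor-th pixel in both directions (image is a list of rows)."""
    return [row[::factor] for row in image[::factor]]


def pixel_speed_up(full_size, reduced_size):
    """Ratio between the pixel counts of two resolutions given as (w, h)."""
    return (full_size[0] * full_size[1]) / (reduced_size[0] * reduced_size[1])


FULL = (640, 480)
for size in [(320, 240), (160, 120), (80, 60), (40, 30), (20, 15)]:
    print(size, pixel_speed_up(FULL, size))

small = decimate([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]], 2)
print(small)
```

Each halving of the resolution divides the pixel count, and hence the tuning work, by four.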

When the LC classifier is used as a part of an HCI system, the tuning time becomes

extremely important, as the skin detection system has to work in real time, consuming little

computing power. This is particularly true when the LC classifier is retuned periodically, in

order to cope with new illumination conditions. The speed-up resulting from the use of

decimated initialisation images allows us to meet the real-time requirements of an HCI

system.

Figure 5.13: NTP when tuning at various resolutions.


5.6 HCI Usability Factors

The tuning stage in the experiments of the previous sections was idealised, in that no

background colours appeared in the SkinMask, and no skin colour appeared in the

BackgroundMask. If the LC classifier is used in an HCI system, which could generate the

initialisation masks automatically from a tracking subsystem, it is possible that background

appears in SkinMask, and skin colour appears in BackgroundMask. In this section we study

the robustness of the LC classifier against non-ideal tuning conditions.

The detection performance of the LC classifier is calculated, once more, for the video

sequence of the Mediterranean subject. This time, the tuning is repeated for a misaligned

SkinMask and BackgroundMask. In each repetition SkinMask only contains a percentage of

the target's skin colour area. The skin that is not in the SkinMask is in the BackgroundMask,

which affects the final configuration of the LC parameters found during the tuning stage.

Figure 5.14 shows the NTP for four percentages of skin colour in SkinMask. The NTN is not

shown as it is almost unaffected in all the four cases. We can see that the degradation in NTP

for a 50% skin in SkinMask is small; and even when the amount of skin in SkinMask is as

small as 25%, the NTP along the whole sequence may still be useful for some applications.

However, the model parameters found during the tuning stage, depend on the colours

appearing in each initialisation mask; hence, different results are possible even when

SkinMask contains the same amount of skin. This is illustrated in Figure 5.15, where the

tuning of the LC classifier using two SkinMasks with the same percentage of skin inside the

mask produces different detection performances.

Figure 5.14: NTP chart for four percentages of skin in SkinMask. (a) profile of SkinMask containing 100%

skin, (b) 75% skin, (c) 50% skin, and (d) 25% skin.


Figure 5.15: NTP chart for two different SkinMasks, labelled 25% A and 25% B, each containing 25% of skin colour.

5.7 Target importance selection

The tuning of the LC classifier maximizes the skin colour detection inside the SkinMask. The

maximisation procedure is governed by the fitness Equation 5.5. The target importance

parameter in Equation 5.5, TI, was set by default to 2, so that detecting skin colour inside

SkinMask is given twice the importance of avoiding skin colour detections in BackgroundMask.

This default value for TI worked well in previous experiments; however, depending on the

amount of skin colour present during initialisation, better detection rates can be achieved by

selecting a specific target importance. The selection of the target importance is a compromise

between the desired levels of NTP and NTN. As NTP increases with the target importance,

NTN will decrease. However, they will do this at different rates, depending on the amount of

skin colour in the initialisation image. The NTP and NTN rates will also change along the

video sequence on which the LC classifier is used. For this reason, it is recommended to

have an initialisation image that is representative of the skin colour ratios in the video

sequence.

Figure 5.16 shows how detection rates vary depending on the target importance, for three

tuning situations, each one with an increasing amount of skin colour in the BackgroundMask.

On the left column there are three charts showing the NTP and NTN, for the initialisation

image, plotted against the target importance. On the right column there are the three

initialisation images used in the tuning. In Figure 5.16(a) the initialisation image is ideal,

there is only skin colour in the SkinMask, and no skin colour in the BackgroundMask. From

the chart we can see that, in this situation, the target importance has very little effect on the


NTP, and NTN rates. In Figure 5.16(b) there is some skin colour in the BackgroundMask

during initialisation. Here, NTP, and NTN change with the target importance. We can see

that as the target importance increases, the NTP increase and the NTN decrease. Here, a

target importance of up to 3.5 does not substantially decrease NTN. In Figure 5.16(c) there is

a high amount of skin colour in the BackgroundMask during initialisation. Also, in this

situation, the T-shirts of two of the people in the background could be classified as skin colour

depending on the target importance value. We can see, in this case, that the NTP and

the NTN change faster with the target importance than in the other cases. A target

importance from 2 to 3 would generally be a suitable compromise between increasing NTP

and reducing NTN.
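The compromise can be made concrete by rewriting Equation 5.5 in terms of the rates measured on the initialisation image: the fraction of BackgroundMask pixels classified as skin is 1 − NTN, so the fitness is TI × NTP − (1 − NTN). The sketch below shows how the parameter setting preferred by the fitness changes with TI. The two candidate settings and their rates are hypothetical numbers chosen for the example, not measured results.

```python
def fitness(ntp, ntn, ti):
    """Equation 5.5 expressed with the rates of the initialisation image."""
    return ti * ntp - (1.0 - ntn)


def select(candidates, ti):
    """Name of the candidate parameter setting with the highest fitness."""
    return max(candidates, key=lambda c: fitness(c[1], c[2], ti))[0]


# (name, NTP, NTN) for two hypothetical parameter settings.
CANDIDATES = [
    ("tight", 0.80, 0.99),   # misses some skin, rejects almost all background
    ("loose", 0.95, 0.70),   # detects more skin, accepts more background
]

for ti in (1.0, 3.0):
    print(ti, select(CANDIDATES, ti))
```

With a low target importance the tighter setting wins; raising the target importance makes the gain in NTP outweigh the loss in NTN, and the looser setting is selected.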

5.8 Example of an LC classifier initialisation in HCI

A vision-based HCI system that uses the LC classifier can, in most cases, generate the two

initialisation masks, SkinMask and BackgroundMask, automatically from the tracking

subsystem. In this section, an example of initialisation of the LC classifier in an HCI system is given.

The HCI system tracks the user's hand, using the sweep articulated hand tracker described in

Section 4.7; this hand tracker uses the LC classifier in order to locate the user's hand. The

important part in this example is the initialisation procedure. In order that the LC classifier

can be initialised the hand tracking has to go through three stages:

• First stage, there is no hand tracking. In this stage, a red hand template appears in the

centre of the tracking area.

• Second stage, when the user places their hand on top of the red hand template an initial

tracking of the hand is started. The hand template turns green to indicate this second

stage. During this stage the LC classifier uses its a priori model parameters.

• Third stage, when the location of the hand, during the initial tracking, is good enough, the

initialisation of the LC classifier takes place. The hand contour resulting from the

tracking is used to generate SkinMask and BackgroundMask. The tuning of the LC

classifier is then performed at a reduced resolution of 160×120 in order to speed up the

tuning procedure. The tuning takes place in a fraction of a second, and from this point

the full articulated hand contour tracking is started. The hand contour turns blue to

indicate full tracking.
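The three stages above behave like a small state machine, which can be sketched as follows. The type and function names and the two boolean inputs are assumptions; how the system decides that the hand is on the template, or that the initial tracking is good enough, is not modelled here.

```python
from enum import Enum


class Stage(Enum):
    """Initialisation stages, with the colour shown to the user."""
    WAITING = "red"             # no hand tracking, template displayed
    INITIAL_TRACKING = "green"  # tracking with a priori LC parameters
    FULL_TRACKING = "blue"      # tracking with the tuned LC classifier


def next_stage(stage, hand_on_template, tracking_good_enough):
    """Advance the initialisation sequence by at most one stage."""
    if stage is Stage.WAITING and hand_on_template:
        return Stage.INITIAL_TRACKING
    if stage is Stage.INITIAL_TRACKING and tracking_good_enough:
        # The LC classifier would be tuned here, before full tracking starts.
        return Stage.FULL_TRACKING
    return stage


stage = Stage.WAITING
for hand, good in [(False, False), (True, False), (True, True)]:
    stage = next_stage(stage, hand, good)
    print(stage.name, stage.value)
```

The tuning of the LC classifier belongs to the transition from the second to the third stage, since that is the first moment at which a hand contour is available to generate the two masks.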


Figure 5.16: Target importance for various tuning situations. The horizontal axes on the charts are the target

importance. On the right, there are the initialisation images corresponding to each tuning situation. (a)

Ideal case, no other skin colour in the BackgroundMask. (b) Medium case, some skin colour in the

BackgroundMask. (c) Worst case, there is a high amount of skin colour in the BackgroundMask.

Figure 5.17 illustrates the initialisation sequence. Rows (a) and (b) correspond to the original

video frames and detected skin colour for an initialisation sequence during day illumination

levels. It can be noted in columns 1 and 2 that the LC classifier with a priori parameters can

detect the skin colour of the user; however, at that illumination level it also detects a few

areas of the background as skin colour. When the LC classifier is tuned to the user's hand,

column 3, the detection of the user's hand skin colour is maximized for the day illumination

levels. Figure 5.17 rows (c) and (d) correspond to the original video frames and detected skin


colour for an initialisation sequence during night illumination levels. It can be noted in

columns 1 and 2 that the LC classifier with a priori parameters can detect the skin colour of

the user; however, at this illumination level the detected skin regions contain holes, and some

missing areas. When the LC classifier is tuned to the user's hand, column 3, the detection of

the user's hand skin colour is maximized for the night illumination levels. Videos of the

initialisation sequences shown in Figure 5.17 are available in Appendix B and on the

supporting webpage as "video_sequence_5.6.avi" for the day illumination test, and

"video_sequence_5.7.avi" for the night illumination test.

Figure 5.17: Initialisation sequence. Columns 1, 2, and 3 correspond to the first, second, and third

initialisation stages. Office daylight: row (a) original video frames; row (b) detected skin colour at

each stage. Office night-time: row (c) original video frames; row (d) detected skin colour at

each stage.


5.9 Example of dynamic skin colour modelling during tracking

One of the advantages of the LC skin colour classifier is that the detection results of the

classifier can rapidly adapt to changing illumination conditions by updating just four

parameters: BRmin, BRmax, GRmin, and GRmax. This capability is especially

useful when tracking a skin colour target that undergoes various illumination changes,

because it allows the classifier to be tuned dynamically. This section presents a method for

dynamically tuning an LC classifier while tracking a skin colour object.

Consider a contour tracker similar to the ones described in Chapter 4. This tracker uses the

LC skin colour classifier in order to perform tracking of a skin colour target. If the

illumination conditions of the target change substantially with respect to the moment at

which the LC classifier was tuned (typically at the beginning of the tracking sequence), the

LC classifier will not be able to produce a good segmentation of the target, and the tracking

may be lost. A possible solution to this problem is to tune the classifier continuously

to the new skin tones of the target as its illumination conditions change. This solution is

commonly referred to as dynamic skin colour modelling.

Unfortunately, the procedure presented in Section 5.3 for tuning an LC classifier is too slow to

be performed on every frame of a video sequence. The cause for this tuning being slow is

that the LC classifier has to be evaluated for every pixel of the image each time a parameter

of the classifier is tested – and the tuning procedure as implemented for the trackers of

Chapter 4 involves testing about 360 combinations of parameter values. A much faster

method for testing combinations of LC parameters involves using the cost function of the

contour tracker. The cost function of a contour tracker, such as the ones described in Chapter 4,

only explores the pixel values along certain measurement lines normal to the contour, as

opposed to exploring all the pixels in the image. This makes the evaluation of the cost

function comparatively faster. On the other hand, this cost function uses the LC classifier in

order to determine a score. If the parameters of the LC classifier change, the sensitivity of the

cost function will change too, and so will its score value. Generally, the better the skin colour

segmentation of the target is, the higher the cost function's score will be, and vice versa. This

fact can be used in order to adapt the tracking to changes in the illumination conditions of the

target. The procedure involves producing the best skin colour segmentation of the target at

every time-step of tracking by maximizing the cost function of the tracking output. Note that


during this second maximization the contour hypothesis on which the cost function is

evaluated remains fixed. The cost function is maximized by changing the LC classifier

parameters, as opposed to finding the contour hypothesis that best fits the image features

(first maximization).

The procedure to tune the LC classifier during tracking is constructed by using the same

tuning heuristics of Section 5.3, but substituting the part involved in the calculation of the

fitness, Equation 5.5, for the skin colour based cost function of Section 4.3.2. This new

tuning procedure has to be executed just after the tracker has found the contour hypothesis

with highest weight. Then, the tuning will further increase the cost function of this contour

hypothesis. The new LC parameters are used only when the cost function increases

substantially. The approach can be understood as changing the skin colour perception of the cost

function so that the best contour hypothesis at each time-step is even more highlighted. The

approach works well while the best contour hypothesis is placed near the

configuration of the target, but the results degrade very rapidly (cascade of errors) if the

configuration of the contour separates from the configuration of the target.
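One step of this dynamic tuning can be sketched as follows. The sketch simplifies the search to a single perturbation of each of the four offset parameters, and the step size, the acceptance margin `min_gain`, the function names, and the synthetic cost function are all assumptions. In a tracker, `cost_fn` would evaluate the skin colour based cost function on the fixed best contour hypothesis for a given set of LC parameters.

```python
TUNABLE = ("BRmin", "BRmax", "GRmin", "GRmax")


def neighbours(params, step):
    """Parameter sets obtained by moving one offset up or down by one step."""
    trials = []
    for name in TUNABLE:
        for delta in (-step, step):
            trial = dict(params)
            trial[name] = params[name] + delta
            trials.append(trial)
    return trials


def dynamic_tune(params, cost_fn, step=0.05, min_gain=0.01):
    """Return new LC parameters only if the cost function rises substantially."""
    current = cost_fn(params)
    best = max(neighbours(params, step), key=cost_fn)
    if cost_fn(best) - current > min_gain:
        return best
    return params  # keep the old parameters when the gain is too small


# Synthetic cost (hypothetical): the score peaks when BRmin is 0.9.
def synthetic_cost(p):
    return -abs(p["BRmin"] - 0.9)


start = {"BRmin": 0.973, "BRmax": 1.55, "GRmin": 0.104, "GRmax": 0.476}
accepted = dynamic_tune(start, synthetic_cost, min_gain=0.01)
rejected = dynamic_tune(start, synthetic_cost, min_gain=0.1)
print(accepted["BRmin"], rejected["BRmin"])
```

The acceptance margin reflects the rule that new parameters are used only when the cost function increases substantially; the same margin offers no protection once the contour hypothesis has drifted away from the target, because the cost function is then maximised on the wrong pixels.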

The proposed dynamic tuning of the LC classifier is tested on a hand contour tracker similar

to the ones described in Chapter 4. The target object is a subject's hand, but in this case the

hand is tracked from its back. During the tracking sequence, the hand is most of the time

near the centre of the image, but the background and the illumination conditions change as

the subject moves the hand around. The illumination changes on the hand are so dramatic

that a hand contour tracker whose LC classifier is tuned only once at the beginning of the

sequence loses tracking easily. In practice, the cost function used in this dynamic tuning is

slightly modified in order to better capture the skin colour segmentation of the hand for the

current set of LC parameters. Figure 5.18 shows a hand contour superimposed with the

measurement lines of the modified cost function. This modified cost function has two new

measurement lines longitudinal to each of the fingers and first thumb segment. These new

measurement lines are processed differently to the ones normal to the contour. When the skin

colour segmentation of the hand is good, these measurement lines will retrieve skin colour

pixels only. If these measurement lines retrieve non-skin colour pixels, it means that the skin

colour segmentation of the hand is not good. The new measurement lines are used in

combination with the measurement lines normal to the contour in order to generate a score

for the segmentation of the subject's hand.
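As a rough illustration of how the longitudinal measurement lines could be processed, the sketch below measures the share of non-skin pixels retrieved along them. The function names are hypothetical, and the text does not specify how this quantity is combined with the lines normal to the contour, so no combination is attempted here.

```python
from typing import Sequence


def longitudinal_skin_fraction(is_skin: Sequence[bool]) -> float:
    """Fraction of skin colour pixels along one longitudinal line."""
    if not is_skin:
        return 0.0
    return sum(1 for s in is_skin if s) / len(is_skin)


def segmentation_penalty(lines: Sequence[Sequence[bool]]) -> float:
    """Mean fraction of non-skin pixels over all longitudinal lines.

    0.0 means every sampled pixel was classified as skin colour (good
    segmentation); values near 1.0 indicate a poor segmentation.
    """
    if not lines:
        return 0.0
    return sum(1.0 - longitudinal_skin_fraction(l) for l in lines) / len(lines)
```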


A video sequence showing the results of the test is available in Appendix B and on the

supporting webpage as "video_sequence_5.8.avi". This video sequence shows the tracking

output and skin segmentation of two trackers. The tracker on top uses dynamic tuning; the

one on the bottom uses static tuning (tuned only once at the beginning of the tracking). The

state of the LC decision planes for each frame of tracking is shown in a diagram beside the

respective tracking output. This diagram shows the LC decision planes as they intersect the

B-G plane with maximum R (this type of diagram was used to explain the tuning heuristics,

see Figure 5.7).

Figure 5.18: Modified cost function. Note the measurement lines along the fingers and first segment of the

thumb.

In "video_sequence_5.8.avi" the tracker with dynamic tuning tracks the subject's hand

successfully throughout the sequence. However, the tracker with static tuning first fails to

initialise, and then repeatedly loses the tracking. The procedure to initialise the hand

tracker follows the three stages described in Section 5.8; these stages are indicated with a red

hand contour (waiting for initialisation), green hand contour (partial tracking), and blue hand

contour (initialisation, and full hand tracking). Figure 5.19 shows some example frames of

this video sequence at the moment in which the tracker with dynamic tuning initialises, and

the tracker with static tuning fails to initialise. In frame 66, the tracker with dynamic tuning

starts to adapt the LC classifier in order to detect the subject's hand (this is indicated with a

green hand contour). The tuning at this moment can be seen in the LC decision planes

diagram on the left. In frame 74 the tracker with dynamic tuning is fully initialised (indicated

with a blue hand contour). The tracker with static tuning makes an attempt to initialise

around frame 74, but the score of the hand contour hypothesis does not reach the

initialisation score, resulting in an unsuccessful attempt.

Figure 5.19: Dynamic tuning vs. static tuning (frames 65, 66, 71, 73, 74 and 76). The tracker with static tuning fails to initialise.

Later in frame 376, the tracker with static tuning initialises by chance when the subject's

hand passes just under the initialisation position. However, the tracking is lost again in frame

1182. Figure 5.20 shows some example frames at the moment in which the tracker with

static tuning loses the location of the hand.

Figure 5.20: Dynamic tuning vs. static tuning (frames 1144, 1158, 1175 and 1182). The skin colour detection of the tracker with static tuning

progressively degrades from frame 1144 to frame 1175. Eventually, in frame 1182 the location of

the hand is lost. The tracker with dynamic tuning adapts to the current lighting conditions and the

location of the hand is maintained.


On a more general note, when comparing the skin colour segmentation of both tracker outputs

in "video_sequence_5.8.avi", it is possible to see that the LC decision planes of the tracker

with dynamic tuning change every few frames; this produces radical changes in the skin

colour segmentation. In addition, the segmentation is only good as far as the modified cost

function can tell. This means that areas of the hand where there are no measurement lines are

not guaranteed to appear as skin colour, let alone areas outside the hand. The result is that the

segmented skin colour areas keep changing and sometimes most of the image is classified as

skin colour. This would suggest a worse skin colour segmentation than the tracker with static

tuning; however, the tracker with dynamic tuning can adapt to the changes in illumination

and the location of the subject's hand is not lost at any point in the video sequence.

The output of the tracker with dynamic tuning (but without showing the skin colour

segmentation) is also available in Appendix B and on the supporting webpage as

"video_sequence_8.4.avi".

5.10 Conclusions

This chapter has presented the linear container skin colour classifier. This classifier

constitutes a contribution to the dynamic skin colour modelling methods. Its detection

performance compares well with an rg histogram classifier, resulting in equal or better

detection rates, when using a single training image. Two remarkable qualities of this

classifier are its evaluation speed, and its low storage requirements. The four rules that define

the decision planes, and an extra rule to avoid dark pixels, can be rapidly evaluated, resulting

in a ×2.24 speed-up with respect to a simple rg histogram classifier. As the rules of the

classifier operate in the RGB space, there is no need to spend time normalising pixels to the

rg-plane. However, the LC classifier has a resistance to illumination changes equivalent to

that of a classifier that operates in the rg-plane. The detection performance of the LC

classifier is not greatly impaired when the tuning is performed in a decimated initialisation

image, but the execution time of the tuning is notably reduced. The LC classifier also proved

to be robust to non-ideal initialisations, in which skin colour appears in BackgroundMask,

and background appears in SkinMask. Two example applications of the LC classifier have

also been presented in this chapter. The first application demonstrates the usage of the LC

classifier in an HCI system. The second application shows how hand tracking can improve

when tuning the LC classifier for every time-step of tracking (dynamic tuning), in


comparison to tuning the LC classifier only once at the beginning of the tracking (static

tuning).

A subject of further work is the tuning stage. Different heuristics or maximisation procedures

could be used in order to find a set of parameters for the LC classifier that produces better

detection results. Finally, the LC model itself could be changed. Linear containers are fast to

evaluate, but other types of container could produce a better fit of the skin colour cluster

through scaling of white illumination. Sets of rules such as the ones proposed in Gomez and

Morales (2002) could give better detection results, although the tuning procedure for this

type of rule could be more complex.

6 Using skin colour and edge features

In Section 4.3 a measurement model based entirely on skin colour features was presented. In

this chapter we will compare this measurement model against an edge based measurement

model. The measurement function of the sweep tracker, Section 4.7, is modified in order to

use edge features; then the performance of the tracker is evaluated when using exclusively

edge features, and when combining both skin colour and edge features.

The classical approach to use edge features in a contour tracker is to process a number of

measurement lines normal to the contour. The position of edge features found on these

measurement lines is then used to calculate the contour likelihood, see Section 3.2. Another

method of using edge features was proposed by Isard (1996). He uses a Sobel filter to find a

directed edge strength, which is then convolved with the direction of the measurement point;

the result of the convolution is transformed in order to directly measure a log likelihood for

that measurement point. MacCormick (2000) proposed another type of measurement

function that used edge features. In this measurement function, the measurement lines

formed a grid on the image and were static, as opposed to being placed on top of the tracked contour.

The approach taken in this chapter to use edge features is the classical one: measurement

lines normal to the contour.

6.1 Edge features vs. skin colour features

There are some important differences between using edges and using skin colour as features

for calculating the contour likelihood. This section discusses these differences. In order to

avoid confusion in the following discussion, we will refer to an edge found along a

measurement line as image edge, and we will refer to a skin colour edge found along a

measurement line as skin edge.

In Figure 6.1 we can see three views of a frame from a tracking sequence. Figure 6.1(a)

shows the original frame superimposed with the measurement lines of the current contour

position; Figure 6.1(b) shows the skin colour image superimposed with the measurement

lines; and Figure 6.1(c) shows the edge image superimposed with the measurement lines.

When measuring the contour likelihood both edges and skin colour are only detected along

the measurement lines. However, showing the edges and skin colour for the whole image

allows us to see the potential edge and skin colour features that could be found along any

measurement line. Observing Figure 6.1 we can see that the edges are more spread over the

image than the skin colour is. We call an edge in the image that is selected but does not

belong to the object of interest a wrong edge. Assuming that no other skin colour objects

appear in the image when calculating the contour likelihoods, it would seem, intuitively, that

there is less chance of detecting wrong skin edges than of detecting wrong image edges.

Some understanding about the differences between using edges and skin colour can be

gained by analysing the positions of image edges and skin edges found on a large number of

measurement lines. For this purpose, we analyse the measurement lines in a hand tracking

sequence and construct two histograms: a histogram of the positions of image edges found

along measurement lines; and another histogram of the positions of skin edges found along

measurement lines. The hand tracking sequence uses the first 50 frames of the Mediterranean

subject's video sequence of Chapter 5, namely "video_sequence_5.2.avi". During these first

50 frames of this video sequence, the only skin coloured object in the scene is the subject's

hand (and a small part of the arm). The subject's hand remains still at the centre of the image, over


Figure 6.1: Skin colour vs. edges. (a) shows the original frame superimposed with the measurement lines of

the current contour position. (b) shows the skin colour image superimposed with the measurement

lines. (c) shows the edge image superimposed with the measurement lines.

a moderately cluttered background (see exemplar frame in Figure 6.2). The video sequence is

tracked using the sweep hand tracker described in Chapter 4. The tracker uses 1000 particles

per time-step, and uses both skin colour and edge information. The length of the


measurement lines is 20 pixels. The histograms are constructed from the collected positions

of image edges and skin edges found along the measurement lines of each of the 1000

contour hypotheses, for the 50 frames that the tracking lasts. The total number of

measurement lines taking part in each histogram is about 3.5 million. Because of the nature

of the tracker (a particle filter) only a fraction of the hypothesized contours will fall on the

subject's hand, and therefore, only a fraction of the measurement lines will measure the hand.

However, the hypothesized contours will always be relatively near to the subject's hand,

and so will the measurement lines.

Figure 6.2: Exemplar frame. The image edge and skin edge histograms are constructed from tracking frames

similar to this one. The only skin coloured objects are the hand and part of the subject's arm.

Figure 6.3 shows the histograms of the positions where image edge and skin edge features

were found along measurement lines. The histograms are normalised to the total number of

measurement lines. The horizontal axis represents the distance from the measurement point

to where a feature was found. A distance of 0 means that the feature was found right on the

measurement point, a distance of 1 means that the feature was found 1 position away from

the measurement point, and so on. Bin 10, however, is used when no feature at all was

found along a measurement line.
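The construction of such a histogram can be sketched as follows. This is a minimal illustration, assuming that each measurement line yields either an integer distance between 0 and 9 or no feature at all; the function name and the use of `None` for a missing feature are hypothetical.

```python
from typing import List, Optional, Sequence

NO_FEATURE_BIN = 10  # bin reserved for lines on which no feature was found


def feature_histogram(distances: Sequence[Optional[int]]) -> List[float]:
    """Normalised histogram over bins 0..10.

    Each entry of `distances` is the distance (0..9) from the measurement
    point to the selected feature on one measurement line, or None when
    no feature was found on that line.
    """
    counts = [0] * (NO_FEATURE_BIN + 1)
    for d in distances:
        if d is None:
            counts[NO_FEATURE_BIN] += 1
        elif 0 <= d < NO_FEATURE_BIN:
            counts[d] += 1
        else:
            raise ValueError(f"distance out of range: {d}")
    total = len(distances)
    # Normalise to the total number of measurement lines.
    return [c / total for c in counts] if total else [0.0] * len(counts)
```

Normalising by the number of measurement lines, rather than by the number of detected features, is what makes bin 10 directly comparable between the two histograms.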

Before interpreting the histograms it is important to remember two points: Firstly, that the

only skin coloured object present in the scene is the subject's hand. Secondly, that when image

edges are used, if several edges are found along a measurement line the selected edge is the

closest to the measurement point; the other image edges are ignored. When skin edges

are used, there is at most a single skin edge in a measurement line, and this is at the first

occurrence of two consecutive skin colour pixels, starting from the end of the line exterior to

the contour. The image edges and skin edges counted in the histogram are in fact selected


Figure 6.3: Normalised histograms of feature positions along a measurement line. (a) Features found were

image edges. (b) Features found were skin edges.

image edges and selected skin edges. Having all this in mind, we can interpret the two

histograms. The first thing that can be observed about the two histograms is the large value

of bin 10 on the skin edge histogram, about 0.56, in comparison to the image edge

histogram, about 0.25. This bin corresponds to the number of measurement lines that did not

detect a skin edge or an image edge respectively. Taking into account that the number of

measurement lines is the same in both histograms, this first observation implies that there are

about twice as many detected image edges as detected skin edges. On the other hand, as

the only skin colour object during this tracking sequence is the subject's hand, it is logical to

assume that the detected skin edges are mostly due to the subject's hand contour. Therefore,

the difference between the number of detected skin edges and detected image edges found in

the measurement lines must be because about one half of the detected image edges belong to

the background or features inside the hand, and hence, they are wrong edges.
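The arithmetic behind this comparison can be checked with a short calculation. The two bin values are the ones quoted above; the variable names are illustrative, and the last quantity relies on the assumption stated in the text that the detected skin edges lie on the hand contour.

```python
no_skin_edge = 0.56   # bin 10 of the skin edge histogram (from the text)
no_image_edge = 0.25  # bin 10 of the image edge histogram (from the text)

skin_detected = 1.0 - no_skin_edge     # fraction of lines with a skin edge
image_detected = 1.0 - no_image_edge   # fraction of lines with an image edge

ratio = image_detected / skin_detected
# Share of detected image edges not matched by a skin edge detection,
# assuming every detected skin edge lies on the hand contour.
excess_share = (image_detected - skin_detected) / image_detected

print(f"image/skin detections: {ratio:.2f}")
print(f"unmatched share of image edges: {excess_share:.2f}")
```

With these values the ratio is roughly 1.7 and the unmatched share roughly 0.4, which is in line with the rounded figures of "about twice" and "about one half".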

The second thing to notice in the two histograms is the shape of the bins 0 to 9. The general

progression of values in these bins is the same in both histograms. There is a peak at the bin

1, and then values decrease progressively from bins 2 to 9. However, in the image edge

histogram, the peak is much larger, and the following bin values decrease much faster than in

the skin edge histogram. Bins 8 and 9 in the image edge histogram are zero because in order

to detect an image edge the measurement line is convolved with a kernel of 5 elements, and

the result of the convolution is assigned to the element at the middle of the kernel, so the two positions at each end of the line receive no response. In the image edge

histogram we can see that most of the detected edges are within 2 or 3 pixels of the

measurement point – in contrast with the skin edge histogram where edges are spread rather

evenly from bins 0 to 9. Taking into account that about half of the detected image edges


belong to the background or features inside the hand, the distribution of values in the image

edge histogram gives us an idea of the density of the background clutter in the image.
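The reason why bins 8 and 9 are empty in the image edge histogram can be illustrated with a small sketch. It assumes a 20 pixel measurement line whose measurement point lies between the two central pixels, so that each side of the line covers distances 0 to 9; this indexing and the function names are assumptions made for the illustration.

```python
LINE_LENGTH = 20   # pixels per measurement line (from the text)
KERNEL_SIZE = 5    # elements in the edge detection kernel (from the text)


def distance_bin(index: int, length: int = LINE_LENGTH) -> int:
    """Assumed mapping from pixel index to distance bin 0..9.

    The measurement point is taken to lie between the two central
    pixels, so pixels 9 and 10 are both at distance 0.
    """
    half = length // 2
    return index - half if index >= half else half - 1 - index


def reachable_bins(length: int = LINE_LENGTH, kernel: int = KERNEL_SIZE) -> set:
    """Distance bins at which the kernel can be centred on the line."""
    margin = kernel // 2  # pixels lost at each end of the line
    return {distance_bin(i, length) for i in range(margin, length - margin)}
```

Under these assumptions the kernel can only be centred at distances 0 to 7, which matches the empty bins 8 and 9.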

When skin edges are used and a hypothesized contour has a high likelihood, it is almost

certain to be due to the correct alignment of the hypothesized contour and the target object.

In contrast, when image edges are used and a hypothesized contour has a high likelihood, it is

more probable that the hypothesized contour is not so well aligned with the target object,

because image edges include more wrong edges. This difference between skin edges and image

edges is reduced when the number of measurement lines increases, as a contour hypothesis

with high likelihood requires the simultaneous contribution of all the measurement lines, and

it is unlikely that all of the measurement lines select wrong edges at the same time.

From the performance point of view, skin edges can be computed slightly faster than the

image edges for two reasons: firstly, the LC skin colour classifier used to calculate whether

pixels belong to skin colour or not, can be evaluated very fast as it consists of only 5 simple

inequalities; and secondly, because in the average case only half of the pixels in a

measurement line will need to be processed in order to find a skin edge. The measurement

operation is typically the most computationally expensive operation in a particle filter.

Therefore, any speed increases in this operation result in significant speed increases during

tracking.
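The early exit behind the second reason can be sketched as follows, using the skin edge selection rule described earlier in this section (the first occurrence of two consecutive skin colour pixels, scanning from the end of the line exterior to the contour). The function name and the pixel representation are hypothetical, and the LC classifier is represented by a caller-supplied predicate.

```python
from typing import Callable, Optional, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B)


def find_skin_edge(
    line: Sequence[Pixel],
    is_skin: Callable[[Pixel], bool],
) -> Optional[int]:
    """Index of the skin edge on a measurement line, or None.

    `line[0]` is assumed to be the end exterior to the contour. The
    edge is placed at the first of the first two consecutive skin
    colour pixels. The scan stops there, so the remaining pixels of
    the line are never classified.
    """
    previous_was_skin = False
    for i, pixel in enumerate(line):
        current_is_skin = is_skin(pixel)
        if previous_was_skin and current_is_skin:
            return i - 1
        previous_was_skin = current_is_skin
    return None
```

An image edge search, by comparison, has to evaluate the edge operator along the line before the edge closest to the measurement point can be selected.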

We conclude that the use of skin colour for calculating a contour likelihood is more

attractive than using image edges, provided the skin colour segmentation is good.

6.2 Using only edge features in the measurement function

In the previous section we saw an argument in favour of the use of skin edges over the use of

image edges. In this section we test the performance of the sweep tracker (described in

Section 4.7) when using only image edges in the measurement function. The sweep tracker

has been modified in order to use image edges only. The measurement function uses the

same measurement lines as before, but these lines are now processed using the same edge

detection operator as in Blake and Isard (1998), in order to find image edges on them. This

edge detection operator uses a threshold value, which was set to 80. The method used for the

refinement of the fingers' length (Section 4.6) remains unchanged, as the two measurement


lines involved in it are processed using morphological operations which cannot be

reproduced using edges. Figure 6.4 shows the distance metric results on the test video sequence

of Section 4.9 for both the skin colour based sweep tracker, and the modified image edge

based sweep tracker; both trackers use 250 particles. We can see that the distance metric of

the sweep tracker using image edges is much worse than that of the sweep tracker using only skin

edges. The decrease in performance is related to the frequent mislocations of the fingers,

which produce the frequent peaks in the charts.

Figure 6.4: Distance metric of the skin edge vs. the image edge based sweep tracker. (a), (b) Skin edge based

sweep tracker. (c), (d) Image edge based sweep tracker.

6.3 Combining edge detection and skin colour detection in the measurement function

In the previous section we saw that the performance of the sweep tracker using only image

edges in the measurement function is inferior to the performance of the same tracker using

skin edges in the measurement function. However, if both image edges and skin edges are

used together in the measurement function then the performance of the tracker can be

increased. In fact, there are some situations which cannot be tracked reliably if the edge

information is not available. Two of these situations are illustrated in Figure 6.5. In the left

column of Figure 6.5, there is a hand with the fingers together, Figure 6.5(a). When the fingers

are together the detected skin colour areas in the fingers become merged, making it difficult to

find skin edges between the fingers, mainly for the ring and middle fingers, Figure 6.5(b);


Figure 6.5: Situations in which the use of edges is essential for the correct location of the hand. (a) Hand

with closed fingers. (b) The skin colour areas of the fingers become merged, making it difficult to

find skin edges between the fingers. (c) The image edges between the fingers are more reliable. (d)

The skin colour of the subject's face occludes the skin colour of the thumb. (e) The thumb position

cannot be recognized from the skin colour information. (f) The image edges of the thumb are not

occluded by the subject's face.

however, it is possible to find image edges between the fingers, Figure 6.5(c). On the right

column of Figure 6.5, the thumb of the target hand is in front of the subject's face, Figure


6.5(d). If we look at Figure 6.5(e), we can see that the skin colour of the face partially

occludes the skin colour of the thumb. It is only by using the edge information, Figure 6.5(f),

that the thumb location can be found. The next step is to combine the information gained

from an image edge and a skin edge in order to have a single score that can represent the

measurement point. One possible approach is to use the same Gaussian profile as in the skin

colour based measurement function of Section 4.3 for both the skin edge distance and the

image edge distance and to multiply the results together. The approach taken is exactly this

one, but the products of the two Gaussian profiles are pre-calculated in a combination matrix

for efficiency. The Gaussian profiles are transformed so that when both the image edge

distance and the skin edge distance are zero, the score for that measurement point is 2; and when

both distances are 10, the score for that measurement point is 0.5. Note that when

no image edge or no skin edge is found along the measurement line, the corresponding index into

the combination matrix is 10. Figure 6.6(a) shows the values of the combination matrix, and

Figure 6.6(b) shows the graphical representation of that combination matrix.
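One way of pre-calculating such a combination matrix is sketched below. The text fixes only the two end points of the score (2 when both distances are zero, 0.5 when both are 10); the width `SIGMA` of the Gaussian profile and the particular rescaling used here are assumptions, so the intermediate values will generally differ from those of Figure 6.6(a).

```python
import math
from typing import List, Optional

NO_FEATURE = 10  # index used when no edge was found on the line
SIGMA = 4.0      # hypothetical width of the Gaussian profile


def profile(d: int, sigma: float = SIGMA) -> float:
    """Gaussian profile rescaled so that profile(0)**2 == 2 and
    profile(NO_FEATURE)**2 == 0.5 (one possible transformation)."""
    g = math.exp(-d * d / (2.0 * sigma * sigma))
    g_min = math.exp(-NO_FEATURE * NO_FEATURE / (2.0 * sigma * sigma))
    lo, hi = math.sqrt(0.5), math.sqrt(2.0)
    return lo + (hi - lo) * (g - g_min) / (1.0 - g_min)


def build_combination_matrix() -> List[List[float]]:
    """Pre-compute the product of the two profiles for all distance pairs."""
    size = NO_FEATURE + 1
    return [[profile(i) * profile(s) for s in range(size)] for i in range(size)]


def score(
    matrix: List[List[float]],
    image_dist: Optional[int],
    skin_dist: Optional[int],
) -> float:
    """Look up the score; a missing edge maps to index NO_FEATURE."""
    i = NO_FEATURE if image_dist is None else image_dist
    s = NO_FEATURE if skin_dist is None else skin_dist
    return matrix[i][s]
```

Since the matrix has only 11 by 11 entries, the lookup replaces two exponential evaluations per measurement line with a single table access.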

Figure 6.6: Combination matrix. (a) Values of the combination matrix. (b) Graphical representation of the

combination matrix.


The sweep tracker of Section 4.7 is modified once again in order to find both image edges

and skin edges along the measurement lines, and get a score for each measurement line based

on the combination matrix. The modified sweep tracker is tested on the test video sequence

of Section 4.9. The threshold value of the edge detection operator is set to 100. The

performance results of this test are shown in Figure 6.7. The values for the distance metric in

charts (a) and (b) are, on average, smaller than those of the sweep tracker using only skin

colour in the measurement line. However, there are various peaks in the distance metric

chart, frames 360 to 890, Figure 6.7(b), that did not appear on the equivalent chart for only

skin edges, Figure 6.7(d). These peaks in the distance metric are produced by mislocations of

the fingers. The use of both skin edges and image edges in the measurement function is

beneficial on average. However, in certain isolated circumstances the image edges can

introduce fitting errors to the tracking.

Figure 6.7: Performance of the sweep tracker using edges and skin colour in the measurement function. (a), (b) Skin edge

and image edge based sweep tracker. (c), (d) Skin edge based sweep tracker. (e), (f) Image edge based sweep tracker.


6.4 Conclusions

We have compared the use of image edge features against skin colour features. We argued

that if no other skin colour objects interfere with the target hand, the use of skin colour

features is more reliable (and faster) than the use of edge features. Then, it was shown that

the performance of the sweep tracker when using both image edge features and skin colour

features was better on average than using only skin colour features. Finally, although the

use of edge features can be essential for successfully tracking certain situations in

which the skin colour areas are merged or occluded, the use of edge features alone proved to

be inferior to the use of only skin colour features, and introduced some fitting errors

when combined with skin colour features. The most common error is a peak in the distance

metric. These peaks are the result of mislocations of the fingers. This error is worse than it

appears in the distance metric charts: if the tracker were used in an HCI system, a finger

mislocation would result in a wrong input for the HCI system.

7 Tracking improvements

A typical strategy to increase the accuracy and robustness of contour trackers based on

particle filters is to increase the number of particles used in the filter (Isard (1998); Isard and

Blake (1998a)). This strategy can also be used in the two articulated hand contour trackers

presented in Chapter 4. However, such an increase in the number of particles results in a

slowdown of the tracking, which is undesirable for real-time applications. A trade-off between

the accuracy and the execution speed of the trackers has to be reached by using a certain

number of particles. This chapter presents a number of techniques that can improve the

tracking of the articulated hand trackers presented in Chapter 4, without having to increase

the number of particles. These techniques can improve the following aspects of the tracking:

• Tracking accuracy, by using the techniques of Section 7.1, Section 7.3, and Section 7.4.

• Tracking robustness, by using the techniques of Section 7.3, and Section 7.4.

• Tracking repeatability, by using the technique of Section 7.2.

• Automatic reinitialisation, by using the technique of Section 7.4.

These techniques can be used separately or together, each one bringing an improvement to

the tracking. The techniques are first introduced and tested independently, by just adding the

technique in question to the sweep tracker of Section 4.7. Finally, the techniques are

combined together in a single tracker. This results in an improved tracker that benefits from

the improvements of each technique.

7.1 Switching template fitting methods during articulated tracking

This section describes two methods of fitting deformable templates when tracking

articulated objects using particle filters. One method fits a template to each of the links of an

articulated object in a hierarchical way; the method first fits a template for the base of the

articulated object and then fits a template for each of the links deeper in the hierarchy. The

second method fits the whole articulated object as a rigid object, and then refines the fitting

for each of the links of the articulated object in a hierarchical way, starting from the base.

Advantages and disadvantages of each method are discussed and a way of combining the

best of each method in a single tracker is presented. Results are given for the case of

articulated hand tracking.

7.1.1 Fitting templates to the links of an articulated object

This section discusses two methods of fitting an articulated template model to an articulated

object. It must be taken into account that when we refer to “fitting”, it is in the context of

particle filters, in particular the Condensation filter. This means that a template fitted to a

link is in fact a set of particles (various hypotheses of the template configuration) that

represent the link to some degree of accuracy. For display purposes one of these particles,

normally the one that fits the link best, or a weighted average of these particles is selected.

This means that in general the fitting of the template to the link is not perfect. Due to this

effect, when using particle filters to track an object through a video sequence, the output of

the tracking exhibits a certain degree of jitter. This jitter can depend on many factors, but in

general, the more particles the filter uses, the less jitter the output exhibits. Using other filters

like Kalman or recursive least squares can produce an optimal fit. However, these filters often

cannot handle background clutter well.

The first method of fitting an articulated template to an articulated object involves finding

the configuration of the base link of the articulated object; then finding the configuration of

the second link relative to the first; then the configuration of the third link relative to the

second, and so forth. We will refer to this method as method 1.


Another method of fitting an articulated template to an articulated object is to first try to fit a

previous configuration of all the links of the articulated template as a single rigid template,

which we refer to as the combined template. Then the base link is refined using the position

of the base link in the combined template as an initial position estimate. We then proceed to

refit the second link, with respect to the first link, then the third, and so forth. We will refer

to this method as method 2.
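The control flow of the two methods can be contrasted with a small sketch. The configuration of the articulated object is kept abstract (one number per link, base link first), and the functions `fit_link` and `fit_rigid` are hypothetical placeholders for the particle filter fits; only the order of the steps is meant to be illustrative.

```python
from typing import Callable, List, Sequence, Tuple

# fit_link(i, guess, parents) returns the fitted value of link i, given an
# initial guess and the already fitted links above it in the hierarchy.
FitLink = Callable[[int, float, Tuple[float, ...]], float]
# fit_rigid(previous) fits the previous configuration as one rigid template.
FitRigid = Callable[[Sequence[float]], List[float]]


def fit_hierarchy(initial: Sequence[float], fit_link: FitLink) -> List[float]:
    """Fit the links one by one, starting from the base link."""
    fitted: List[float] = []
    for i, guess in enumerate(initial):
        fitted.append(fit_link(i, guess, tuple(fitted)))
    return fitted


def fit_method_1(previous: Sequence[float], fit_link: FitLink) -> List[float]:
    """Method 1: hierarchical fit, one step per link."""
    return fit_hierarchy(previous, fit_link)


def fit_method_2(
    previous: Sequence[float], fit_rigid: FitRigid, fit_link: FitLink
) -> List[float]:
    """Method 2: fit the combined template as a rigid object, then
    refine each link hierarchically from that initial estimate."""
    combined = fit_rigid(previous)
    return fit_hierarchy(combined, fit_link)
```

For an object with three links, method 1 performs three fitting steps and method 2 performs four, the extra step being the rigid fit of the combined template.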

When tracking an articulated object through a video sequence the fitting procedure, either

method 1 or 2, is repeated for each frame. Figure 7.1 shows a representation of the fitting

procedure for both methods 1 and 2. Templates in their fitted position are shown in grey. The

slight misalignments between the template and the link represent the typical jitter observed

in particle filters.

Method 2 has one more step than method 1. On the other hand, the final fit is often better

than with method 1, which means less jitter. The reduced jitter in method 2 can be explained

assuming that the initial configuration of the combined template is close to the true

configuration, and that there is significant background clutter. In general, the contour of the

combined template is more distinctive than the contour of any individual link. The combined

template is, therefore, less likely to match features on the background, as it is more likely to

be a unique structure in the image.

This suggests that method 2 is better than method 1; however the final fitting of the

articulated template depends on how good the initial fitting of the combined template as a

rigid object is. A potential problem that method 2 could face is illustrated in Figure 7.2. In

this case the second and third links of the object have changed considerably from the last

frame, and there is also background clutter that can distract the fitting procedure. In this

situation, the fitting of the combined template as a rigid object could be far from the real

position, and the following refitting steps for each of the links are not going to be able to find

the right configuration of links (because they are too far from the real configuration). This

effect can be carried on from one frame to the next, stopping the tracker from recovering. On

the other hand, if we use method 1 to fit the new configuration of links, the fit of the base

link is not going to be affected by the clutter that affects the 2nd and 3rd links in method 2. It

is more likely that the fit of the base link will be correct in this case, and consequently, the

rest of links may be fitted more precisely.


Figure 7.1: Fitting methods for an articulated object. On the left, fitting method 1, steps 1 to 3. On the right,

fitting method 2, steps 1 to 4.

Figure 7.2: Potential problem of method 2. Left, fitting of the combined template as a rigid object is wrong

due to the significant change in the configuration of the articulated object and the effect of

distracting features on the image. Right, fitting of the base link alone is more precise.

In summary, the advantages of method 1 are fewer steps and better capability to keep and

recover tracking. On the other hand a particle filter tracker using method 1 exhibits more


jitter than using method 2. The advantage of method 2 is that it has less jitter than method 1;

however, it is more likely to lose track and not recover again. It makes sense to combine both

methods in a single tracker. One way of doing this is to switch from one method to the other

depending on the tracking conditions. This is what has been implemented in the following

articulated hand tracker.

7.1.2 Simplified articulated hand tracker

In order to illustrate the two template fitting methods, we have constructed a simplified

version of the sweep articulated hand contour tracker of Chapter 4. This articulated hand

tracker implements Condensation (Isard and Blake, 1998a) and uses ideas of partition

sampling (MacCormick and Blake, 1999; MacCormick and Isard, 2000).

Each of the particles contains the parameterisation of an articulated hand template such as the one

shown in Figure 7.3. This template is allowed to undergo Euclidean transformations, i.e.

translation, rotation and scale. These transformations are applied with respect to the point

indicated as hand pivot in Figure 7.3. Each of the fingers, including the thumb, can rotate

around finger pivots, also indicated in Figure 7.3, in order to model the abduction/adduction

movements of the fingers. These are the only modelled movements of the hand and fingers,

in total 9 DOF. In comparison with the articulated hand contour model described in Section

4.1, the simplified model used in this section lacks the second thumb joint, and the projected

length of the little, ring, middle and index fingers; therefore, these parameters cannot be

tracked. Seeing the hand template as an articulated object, the palm of the hand would be the

base link, and the fingers would introduce a second level in the hierarchy.

Figure 7.4 shows a flow chart of the hand tracker's operation. The initial position of the hand

template (initial particle) is adjusted manually on top of the hand. When the tracking starts, a

distribution of particles is generated from the initial particle. The particles evolve in time

following the Condensation algorithm. After each Condensation time-step, the fingers’

angles are found for a subset of the particles with the highest likelihood of representing the

configuration of the hand. The fingers’ angles are found using a deterministic search. This

deterministic search involves sweeping a certain range of angles, for each of the selected

particles, and selecting the angle for which the finger template has the highest likelihood of

representing the finger. This differs from the sweep implementation of Section 4.7 in that


there is no particle interpolation. This hand tracker can track a hand moving in a plane

parallel to the image, allowing abduction/adduction movement of the fingers. In addition, the

fact that in each time-step of the Condensation algorithm several particles, or hypotheses, are

propagated to the next step, allows the hand tracker to have a certain degree of resistance to

background clutter.
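The deterministic finger search described above can be illustrated with a minimal sketch. It assumes a hypothetical likelihood callable, a sweep range centred on a previous angle, and a fixed number of steps; these names and values are illustrative only, and the toy likelihood merely stands in for the edge and colour measurement of a finger template.

```python
import math


def sweep_finger_angle(likelihood, centre, half_range, steps):
    """Sweep angles in [centre - half_range, centre + half_range] and
    return the (angle, score) pair with the highest likelihood."""
    best_angle, best_score = centre, likelihood(centre)
    for i in range(steps + 1):
        angle = centre - half_range + (2.0 * half_range) * i / steps
        score = likelihood(angle)
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle, best_score


def toy_likelihood(angle, true_angle=0.30, width=0.05):
    # Hypothetical stand-in for the measurement of a finger template:
    # it peaks when the template angle matches the (assumed) true angle.
    return math.exp(-0.5 * ((angle - true_angle) / width) ** 2)


# Sweep one finger over +/- 0.5 rad in steps of 0.01 rad.
best_angle, best_score = sweep_finger_angle(
    toy_likelihood, centre=0.0, half_range=0.5, steps=100
)
```

In a tracker of this kind the sweep would be repeated for each finger of each selected particle, so the cost grows with the number of steps times the number of selected particles.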

Figure 7.3: Simplified articulated hand contour model with 9 DOF.

Figure 7.4: Flow chart of the simplified hand tracker.

7.1.3 Results with the simplified articulated hand tracker

This section gives results of the articulated hand tracker described in the previous Section

7.1.2, supporting the conclusions about jitter and capability of tracking an articulated object

for method 1, method 2, and the combination of both methods. Finally, we study how the

combination method affects the tracking performance of the hand contour tracker described


in Section 4.7 (sweep implementation). Videos of the experiments are available in Appendix

B and on the supporting webpage as "video_sequence_7.1.avi" for the tracking output using

method 1; "video_sequence_7.2.avi" for the tracking output using method 2; and

"video_sequence_7.3.avi" for the tracking output using the combination of both methods.

In the articulated hand tracker, the palm of the hand is the base link of an articulated object,

and the fingers form a second level in the hierarchy of links. The palm consists of 3 contour

segments, as indicated in Figure 7.3. When fitting the hand template to the hand in an image

in method 1, the 3 contour segments of the palm are fitted first, and then the angle for the

fingers is found. In method 2 the whole hand is fitted as a rigid object and then the fingers

are refitted. When using method 2, the palm, which is the base link, is not refitted. This is

done in such a way because it reduces processing time considerably.

The combination of methods 1 and 2 is based on how good the lock on the fingers is. If the

lock on all the fingers is good, method 2 is used. As soon as the lock on a finger is lost, then

the tracker switches to use method 1. The criterion to decide whether a finger has a good

lock is based on a threshold of the likelihood of a finger template representing a finger on the

image.

All the experiments have been made using a fixed number of 250 particles for the

Condensation tracker. The input is a video sequence of 240 frames containing a hand moving

parallel to the image plane. The sequence includes both motion of the hand as a whole, and

of the fingers in relation to the palm. In this video sequence there are two critical zones.

From frame 66 to frame 80, both methods 1 and 2 almost lose lock on the hand due to a fast

horizontal rigid movement of the hand. Later, from frames 186 to 209, there is a fast

rotational movement of the hand that confuses the tracker using method 2. This results in the

fingers of the hand template being locked on the wrong fingers and not being able to recover

again. Method 1 and the combined method are able to recover from this situation.

Figure 7.5 shows selected frames from the critical zone in the range 186 to 209, for the

methods 1, 2 and the combination of both. Each of the frames from the combination of

methods 1 & 2, contains a number, either 1 or 2. This number indicates the method used for

that frame. At the beginning of the sequence the lock is kept in the three cases. From frame

191 there is a loss of lock in all three cases. Note that in method 2, the wrong lock is kept.


The little and ring finger templates are locked on the ring and middle fingers, leaving the
middle finger template unlocked between the middle and index fingers. Finally, from frame
204, method 1 and the combined method recover the lock on the hand. However, method 2
continues with a wrong lock and does not recover the correct track.

[Figure 7.5 panels: one row per frame (186, 191, 197, 204, 209); columns for Method 1, Method 2 and Combination 1 & 2.]

Figure 7.5: Selected frames from the critical zone. Method 1 and the combination 1 & 2 can track the whole

sequence. Method 2 gets confused and keeps a lock on the wrong fingers.


On the other hand, when comparing the tracking sequences for method 1 and 2, it is possible

to see that method 1, despite recovering from the second critical zone, has more jitter than

method 2. Jitter is the result of small misalignments between the fitting of the fingers and

palm templates and the real fingers and palm positions. These misalignments will be

different from one frame to the next, producing the impression that the output of the tracker

is shaking on top of the target, despite the lock being kept all along. This jittery output is

inherent to particle filters since they generally calculate an approximation to a solution.

When the number of particles used in a particle filter is higher, the approximation to the

solution is better, which means smaller misalignments and therefore less jitter⁵. Figure 7.6

shows examples of the misalignments that jitter produces for the three methods.

A plot of the variance of the parameters controlling the rigid movement of the hand shows

the amount of jitter exhibited in the tracker's output for each frame of the video sequence.

This variance is calculated over the set of particles that propagate from one time-step to the

following one. Figure 7.7 shows the variance (in pixels²) of the x and y coordinates of the

hand pivot for the three methods. Figure 7.8 shows the variance of the rotation angle and

scale factor of the whole hand as a rigid object. The rotation angle and scale have different

units (radians and a dimensionless scaling factor); however, they are shown on the same chart as their

variances are in the same range.
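As a rough illustration of the jitter measure described above, the following sketch computes the variance of the hand pivot coordinates over a set of propagated particles. The dictionary layout of a particle and the use of an unweighted population variance are assumptions made for the example; the text does not specify whether the variance is weighted by particle likelihood.

```python
from statistics import pvariance


def pivot_variance(particles):
    """Return the variance (in pixels squared) of the hand pivot
    x and y coordinates over a particle set."""
    xs = [p["x"] for p in particles]
    ys = [p["y"] for p in particles]
    return pvariance(xs), pvariance(ys)


# Three hypothetical particles; the spread is larger in y than in x.
particles = [
    {"x": 100.0, "y": 50.0},
    {"x": 102.0, "y": 56.0},
    {"x": 98.0, "y": 44.0},
]
var_x, var_y = pivot_variance(particles)
```

The same function could be applied to the rotation angle and scale factor of each particle to obtain the quantities plotted for the rigid-hand parameters.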

Both Figure 7.7 and Figure 7.8 show a peak in the variance near frame 80. This corresponds

to the first critical zone. It can be appreciated that for all four parameters the variance is

smaller in the case of method 2. In the chart for the combined method, there is a signal that

tells when the tracker is using method 1 or method 2. It is possible to see that most of the

time the tracker is in method 2, switching to method 1 for short periods of one or two frames.

This brief switching from method 2 to method 1 is enough in most cases to allow the tracker

to recover a good lock on the fingers. The exception is during the first and second critical

zones, when the tracker is predominantly using method 1. In Figure 7.7 it is also possible to

see that for method 1, the variance for y is bigger than the variance of x. This is because the

three segments that form the hand palm are mostly in a vertical orientation. The
measurement function used to calculate the likelihood of a template representing an object in
the image uses edge and colour information in the same way as MacCormick and Isard
(2000). Having the three segments of the hand palm in a vertical orientation allows the
tracker to be more precise on the horizontal location of the template than on the vertical.

5 Jitter also depends on how well the template models the object. If the template does not model the object properly the fitting will jump between local minima.

Figure 7.6: Various misalignments for each method of fitting the articulated template (panels: Method 1, Method 2, Combination 1 & 2). Frame 150. Method 1 tends to produce more misalignments.


Figure 7.7: X and Y variance of the hand pivot.

Figure 7.8: Variance of the rotation angle and scale factor.


7.1.4 Tracking performance with the sweep implementation

In this section, we study how the combination of fitting methods 1 and 2 affects the tracking

performance of the hand contour tracker described in Section 4.7. This is the sweep

implementation; it originally uses template fitting method 1, but for this experiment we add

the method switching capability. The template fitting method is switched from method 2 to

method 1 when the likelihood of any individual finger is smaller than 0.01 for at least three

consecutive frames; otherwise the fitting method is 2. The hand tracker is run on the test

video sequence of Section 4.9, frame intervals 30-174 and 360-890. Note that the frame

interval 175-359 is not used because the sweep tracker (as described in Section 4.7) cannot

keep tracking during this interval. The tracker is initialised at frames 30 and 360. The

evaluated performance measures are cost function, distance metric, and SNR (these

performance measures are described in Section 4.8). Videos of the tracking output are

available in Appendix B and on the supporting webpage as "video_sequence_7.4.avi" for

frames 30-174, and "video_sequence_7.5.avi" for the frames 360-890.
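The switching rule described above can be sketched as a small state machine. This is only one possible reading of the rule: it assumes that the three consecutive frames are counted separately for each finger, and the class and parameter names are hypothetical.

```python
class FittingMethodSwitch:
    """Choose fitting method 1 or 2 from per-finger likelihoods."""

    def __init__(self, threshold=0.01, min_frames=3, n_fingers=5):
        self.threshold = threshold      # likelihood below this counts as a poor lock
        self.min_frames = min_frames    # consecutive poor frames needed to switch
        self.low_counts = [0] * n_fingers

    def update(self, finger_likelihoods):
        """Update the counters with one frame of likelihoods and
        return the fitting method (1 or 2) to use."""
        for i, lik in enumerate(finger_likelihoods):
            if lik < self.threshold:
                self.low_counts[i] += 1
            else:
                self.low_counts[i] = 0
        poor_lock = any(c >= self.min_frames for c in self.low_counts)
        return 1 if poor_lock else 2


switch = FittingMethodSwitch()
good = [0.5, 0.4, 0.6, 0.3, 0.7]
bad = [0.5, 0.4, 0.001, 0.3, 0.7]   # the middle finger has lost its lock
methods = [switch.update(f) for f in (good, bad, bad, bad, bad, good)]
```

With this reading, the tracker stays in method 2 through the first two poor frames, uses method 1 from the third poor frame, and returns to method 2 as soon as every finger is above the threshold again.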

Figure 7.9 shows the tracking performance results when using the combined template fitting

method. On the left are the frames from 30 until 174; on the right are the frames from 360

until 890. Let us compare these results with the sweep results in Figure 4.13 and Figure 4.17;

we will refer to these figures as the results for fitting method 1. We can see that the cost

function, Figure 7.9(a) and (b), has smaller average values with the combined method than
with method 1. At first glance this suggests worse performance with the combined method;
however, the results for the distance metric, Figure 7.9(c) and (d), are considerably better for
the combined method than for method 1. The most important result, though, is the variance
of the distance metric, which is much smaller with the combined method than with
method 1; this indicates that the tracking is considerably more stable with the combined
method than with method 1, as expected. Figure 7.9(e) and (f) show the SNR results; the
average SNR is only marginally higher with the combined method than with method 1. Finally,
Figure 7.9(g) and (h) show how the template fitting method switches between method 1
and method 2 along the test video sequence.


Figure 7.9: Sweep implementation tracking performance when using the combined template fitting
method. The performance measures are cost function, (a) and (b); contour distance, (c) and (d); and
SNR, (e) and (f). The bottom row shows the switching between fitting methods, (g) and (h).

7.1.5 Conclusions

We have described two methods for fitting templates to the links of an articulated object.

Method 1 can keep track and recover track better than method 2, but it has more jitter than

method 2. Method 2 has less jitter than method 1 but can lose track more easily. We present

a method of combining both methods, by switching between the two, that allows a more

robust tracking and less jitter when the tracking conditions are good.


Though the articulated tracking presented here is based on Blake and Isard’s Condensation

algorithm, and the measurement model described in (Isard and Blake, 1998; MacCormick

and Isard, 2000), there are two major differences. Blake and Isard’s method uses only one

fitting method, similar to the method 1 in this chapter. This means it does not use different

template fitting methods for different tracking conditions. Another difference is in the

implementation of the hand tracker. In the proposed methods, the fitting of the palm or the

hand (depending on whether method 1 or method 2 is in use) follows largely the

Condensation algorithm, but the fitting of the fingers is achieved by a deterministic search

instead of having separate particle distributions for fingers. Another way of making this point

clear is to say that some of the particle parameters, at each time-step, are found using the

Condensation algorithm (parameters for the hand as a rigid object), while others are found by

a deterministic procedure (parameters describing angles between fingers).

Finally, we applied the template switching method to the sweep hand tracker implementation

of Section 4.7. The results were favourable, showing more tracking stability, and even more

accurate tracking, than the original sweep implementation, which uses method 1.

7.2 Quasi random sampling

The dynamical model of the Condensation algorithm, as described in Section 3.3, is

composed of a deterministic part, and a stochastic part. The stochastic part of the dynamics is

represented by the term wB in Equation (3.11); where w is a vector of independent random

normal N(0,1) variates, and B is a matrix that modulates the strength of the stochastic

component for each one of the dimensions of the configuration space. The reason for having

a stochastic component in the dynamics is to allow a random sampling of the configuration

space around a point fixed by the deterministic part of the dynamics.

The random normal variates N(0,1) are typically generated using a uniform pseudo-random

number generator, whose output is then shaped into a Gaussian. A common implementation

uses the system rand() function, which is almost always a linear congruential generator,

as a uniform pseudo-random number generator. These generators, although very fast, have
an inherent weakness: successive calls are not free of sequential correlation. If
one of these pseudo-random number generators is used to generate points in a k-dimensional
space, the points will not fill the space evenly, clumping in some regions and leaving large
gaps in others. These effects tend to worsen as the number of dimensions increases.
Thus the random sampling will be sub-optimal and even inaccurate.

A promising extension to Condensation that addresses the sub-optimality of random

sampling is the incorporation of quasi-Monte Carlo methods (Press, et al., 1996;

Niederreiter, 1992). In such methods, the sampling is not done with random points, but with

a carefully chosen set of quasi-random points that span the sample space so that the points

are maximally far away from each other. Philomin, et al. (2000) used quasi-random sampling
with Condensation tracking; they reported superior tracking performance, in a pedestrian
contour tracking application, when replacing random sampling with quasi-random
sampling. The use of quasi-random sampling constitutes an interesting and straightforward

strategy in order to increase the performance of Condensation tracking. This section explores

how the articulated hand trackers presented in Chapter 4 can benefit from the use of quasi-

random sampling.

7.2.1 Quasi-random sequences

There exist a number of quasi-random sequences which possess beneficial properties for

sampling a configuration space; some examples include Hammersley, Halton, Sobol, Faure

and other sequences (Morokoff, 1994; Niederreiter, 1992; Tezuka, 1992). The values of a

quasi-random sequence are generated in groups for a specific dimension; for example if the

configuration space has 8 dimensions, the values of a quasi-random sequence will be

generated in groups of 8 – forming points in the target configuration space. Intuitively, the

points resulting from a quasi-random sequence must be distributed such that any subvolume

in the space should contain points in proportion to its volume. The difference between this

quantity and the actual number of points in the subvolume is called the discrepancy. Quasi-

random sequences have low discrepancies and are also called low-discrepancy sequences.

Thus, quasi-random sequences can be used to generate samples that fill the configuration

space in a more desirable way. Some of the criteria that define a desirable sampling of a

configuration space are (Lindemann and La Valle, 2003):

• Uniformity: Good covering of the space is obtained without clumping or gaps. This can

be formulated in terms of optimising discrepancy.


• Lattice structure⁶: For any sample, the location of nearby samples can easily be

determined.

• Incremental quality: If the sequence is suddenly terminated, it has a decent coverage.

This is an advantage over a sequence that only provides high-quality coverage for a fixed

n.

Figure 7.10 shows the result of plotting 250 points from (a) a pseudo-random sequence, (b) a

Halton sequence, (c) a Sobol sequence. The Halton and Sobol sequences are generated for a

dimension d=4; then dimensions 1 and 2 are plotted. Notice how the pseudo-random points

clump in some regions, while in other regions there are gaps. The Halton points are better

distributed; they have lower discrepancy than the pseudo-random points. The Sobol points

also have a low discrepancy, and they show a more regular pattern than Halton points, so

they have a better lattice structure. The patterns will look different when different

dimensions are plotted together. The Halton sequence is based on a list of prime numbers

used for each dimension. The Sobol sequence is based on a number of direction numbers,

also specific for each dimension.
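To make the role of the prime numbers concrete, the following is a minimal sketch of a Halton generator for a 4-dimensional space, using the radical inverse of the point index in a different prime base for each dimension. The function names are illustrative, and using the first four primes as bases is a common convention assumed here rather than a detail taken from the thesis implementation.

```python
def radical_inverse(index, base):
    """Reflect the base-`base` digits of `index` about the radix point,
    giving a value in [0, 1)."""
    result = 0.0
    fraction = 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result


PRIMES = (2, 3, 5, 7)  # one prime base per dimension (assumed choice)


def halton_point(index, dims=4):
    """Return the `index`-th Halton point in `dims` dimensions."""
    return [radical_inverse(index, PRIMES[d]) for d in range(dims)]


# 250 points in 4 dimensions; starting at index 1 avoids the origin.
points = [halton_point(i) for i in range(1, 251)]
```

Because each point depends only on its index, the sequence can be stopped at any length and still covers the space reasonably, which is the incremental quality criterion listed above.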

Figure 7.10: Distributions of pseudo-random and quasi-random points. Figures show 250 points generated
using (a) pseudo-random sequence, (b) Halton sequence, and (c) Sobol sequence.

6 The lattice structure criterion is not typically so important in Monte Carlo methods, but it is important in other sampling applications.

7.2.2 Application of quasi-random sequences in Condensation

In a Condensation algorithm, the stochastic part of the dynamics is generated from a vector
of independent random normal N(0,1) variates. The usual way to obtain N(0,1) variates is
to apply the Box-Muller algorithm (Press, 1992) to a uniform (pseudo-random based)
distribution. However, when the uniform distribution is a low-discrepancy sequence, the
Box-Muller algorithm damages the low-discrepancy properties of the sequence, altering its
order or scrambling its uniformity (Moro, 1995; Galanti and Jung, 1997). In
this thesis, the conversion from a uniform quasi-random distribution to a Gaussian
quasi-random distribution is achieved using the Moro transformation (Moro, 1995): a Gaussian
value g is obtained from the uniform value u by applying the following mapping to each of
the dimensions of the configuration space:

$$g = \sqrt{2}\,\operatorname{erf}^{-1}(2u - 1) \qquad (7.1)$$

where $\operatorname{erf}^{-1}$ is the inverse of the error function given by

$$\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\,dt$$

The results of applying this transformation to the uniform distributions of Figure 7.10 are

shown in Figure 7.11. These plots show only two of the dimensions, but similar plots would

result from plotting other dimensions of the configuration space. Notice how even after

transforming the original sequences into Gaussian distributions, those still retain properties

from the original uniform distributions: Figure 7.11(a) shows some clumping and gaps

among the points, Figure 7.11(b) and (c) have a better coverage.
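The mapping of Equation (7.1) can be tried out with a short sketch. Note that this is not Moro's own numerical approximation: the example simply uses the inverse normal cumulative distribution function from Python's standard library, which evaluates the same mapping, and the function name is an assumption made for the example.

```python
import math
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def uniform_to_gaussian(u):
    """Map a uniform value u in (0, 1) to a standard normal value.

    The inverse normal CDF equals sqrt(2) * erfinv(2u - 1), so the
    ordering of the input values is preserved.
    """
    return _STANDARD_NORMAL.inv_cdf(u)


# One value per dimension of a hypothetical 4-dimensional uniform point.
uniform_point = [0.5, 0.25, 0.8, 0.975]
gaussian_point = [uniform_to_gaussian(u) for u in uniform_point]

# Consistency check against Equation (7.1): erf(g / sqrt(2)) should be 2u - 1.
g = uniform_to_gaussian(0.975)
recovered = math.erf(g / math.sqrt(2.0))
```

Because the mapping is applied to each coordinate independently and is monotonic, nearby uniform points stay nearby after the transformation, which is consistent with the coverage seen in Figure 7.11. The input must lie strictly between 0 and 1, since the endpoints map to infinite values.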

Figure 7.11: Gaussian transformation of uniform pseudo-random and quasi-random points. Figures show
250 points transformed into a Gaussian distribution from the uniform distributions of Figure 7.10:
(a) pseudo-random, (b) Halton, and (c) Sobol.

The Gaussian transformed quasi-random sequences can be directly used as the term w in the

particle dynamics, Equation (3.11), of the Condensation algorithm. Each time a particle is

propagated from one time-step to the next one, the seed for the Halton and Sobol sequence

generators is reset. This ensures a coherent sampling for each of the propagated particles.


7.2.3 Results

The performance of the articulated hand contour tracker of Section 4.7, sweep

implementation, was tested using three sampling methods: pseudo-random sampling, Halton

quasi-random sampling, and Sobol quasi-random sampling. The tests were performed on the

test video sequence of Section 4.9, frames 30 to 174, and from 360 to 890. The tracker was

initialised at frames 30 and 360. The recorded performance measure is the contour distance,

thus smaller values mean better performance. The quasi-random sampling always produces

the same performance results when run on the same video sequence, as the sample positions

are always the same. However, the performance using pseudo-random sampling may change

slightly, even in the same video sequence, depending on the particular sequence of numbers

produced by the pseudo-random number generator. Therefore, in order to make a fair

comparison between the quasi-random and pseudo-random samplings, the performance for

pseudo-random sampling is averaged over 100 trials on the same video sequence, and

same frame interval.

Table 7.1 shows the contour distance performance metric for the three sampling methods. For

each sampling method the average, variance, and median of the distance metric are shown.

For frames 30 to 174, we can see that the Halton and Sobol samplings produce better

average and median results than the pseudo-random sampling; the variance for Halton and

Sobol samplings is also smaller than with the pseudo-random sampling. However, when we

look at the results for frames 360 to 890, the pseudo-random sampling has slightly better

average and variance results than the Halton and Sobol samplings; the median, though, is

slightly smaller in the case of Halton and Sobol than in pseudo-random.

7.2.4 Conclusions

From the results shown in the previous section, no significant performance increase is
perceived when using quasi-random sampling instead of pseudo-random sampling. In
the second interval of frames, from 360 to 890, the results even point to slightly inferior
performance for quasi-random sampling. In any case, these differences in performance are
very small, and could well be produced by the small disagreements between the distance
metric and the cost function, see Section 4.11. Previous research by Philomin, et al. (2000)
reported superior performance when using quasi-random sampling with Condensation tracking.
However, their results are presented for a basic Condensation tracker with synthetic
experiments, tracking an ellipse; and for a pedestrian tracking application. The tracking
mechanisms of these trackers differ from the ones used in the articulated hand contour
trackers of this thesis. Partition sampling (Section 3.4.1), particle interpolation (Section 3.7),
finger sweep searches (Section 4.7), and techniques for refinement of the finger length
estimations (Section 4.6) can affect the tracking performance to such a degree that any
improvements quasi-random sampling could introduce would be severely damped in the
distance metric performance measure.

Table 7.1: Contour distance performance metric comparison using three sampling methods.

On the other hand, although no real performance improvement is observed when quasi-random
sampling is used in the articulated hand tracker, the use of quasi-random sampling is still
highly desirable because it gives repeatability to the tracking. When using pseudo-random
sampling in the articulated hand tracker, the performance may vary from one trial to the next,
even if the tracking is run on the same video sequence. When using the hand tracker in
HCI applications such as the VTS interface proposed in Chapter 8, these differences in
performance could mean that exactly the same click event is sometimes detected and
sometimes not. The tracking repeatability gained by using quasi-random sampling
ensures that if a click event on a VTS is detected (or not) once, the same click
event will always be detected (or not).


7.3 Variable Process Noise (VPN)

The dynamical model of the Condensation algorithm, as described in Section 3.3, is

composed of a deterministic part, and a stochastic part. The stochastic part of the dynamics is

represented by the term wB in Equation (3.11); where w is a vector of independent random

normal N(0,1) variates, and B is a matrix that modulates the strength of the stochastic

component for each one of the dimensions of the configuration space. The matrix B is also

known as process noise. Typically, the matrix B is constant along the tracking. In this section

we shall see how tracking can be improved by varying the process noise according to the

current tracking conditions.

In Condensation terminology, the deterministic part of Equation (3.11) is referred to as

prediction, and the stochastic part is referred to as noise. The reason for having a noise

component in the dynamics is to allow a random sampling of the configuration space around

the prediction point. The process noise modulates the extent of the random sampling. When

the tracking performance decreases, because the target exhibits brisk motion, a greater

process noise can prevent the tracker from losing the target; as the extent of the random

sampling will be greater, and more likely to cover far apart states. When the tracking

performance increases, because the target exhibits slower motion, a smaller process noise

can be sufficient to keep a lock on the target, while increasing the resolution of the lock.

Thus, a variable process noise can be exploited in order to have a more accurate tracking

when slow motions occur, and a more robust tracking when brisk motions occur. Blake and
Isard (1998) use this idea in contour tracking with a Kalman filter; they define a
search region whose width is related to the Kalman filter's position covariance, which is a
measure of the tracking performance at each time-step. However, they do not use this idea with
Condensation tracking.

The tracking performance referred to in the paragraph above could be calculated, in

principle, with any of the performance measures of Section 4.8; however, a computationally

inexpensive performance measure can be defined in terms of the distribution of weights in a

particle set. Isard and MacCormick (2000) describe two extreme cases of the distribution of

weights in a particle set: if only one particle, in a particle set, has a high weight and the rest

of particles have a very small weight, there is significant danger that tracking could be lost.

On the other hand, if all the particles have the same weight, the tracking is more likely to


continue. Any particle set lies somewhere between these two extreme cases. This measure of

tracking performance can be used to control the process noise.

We use the relative weight of a particle, inside the particle set, to control the level of process

noise that is used when dynamics are applied to that particle; this only applies to particles

that are propagated from one time-step to the next one. We define three bands: in the first

band the process noise is reduced; in the second band the process noise is unaltered; and in

the third band the process noise is increased. Let S be the relative weight of a particle inside

the particle set; for example if a particle has S = 50%, this particle carries half of the particle

set's total weight. The process noise is controlled as follows (the constants have been found

empirically):

$$\text{New Process Noise} = \begin{cases} \text{Process Noise} \times 0.5 & \text{if } S < 8\% \\ \text{Process Noise} \times 1.75 & \text{if } S > 54\% \\ \text{Process Noise} & \text{otherwise} \end{cases}$$
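The three-band rule above can be written as a small function. This is a minimal sketch: the thresholds and factors are those given in the rule, while the function names and the idea of returning a scale factor to be multiplied into the process noise of each propagated particle are assumptions made for the example.

```python
def relative_weights(weights):
    """Return each particle's share S of the particle set's total weight."""
    total = sum(weights)
    return [w / total for w in weights]


def vpn_scale(relative_weight, low=0.08, high=0.54, shrink=0.5, grow=1.75):
    """Return the factor applied to the process noise of one particle."""
    if relative_weight < low:
        return shrink      # first band: process noise is reduced
    if relative_weight > high:
        return grow        # third band: process noise is increased
    return 1.0             # second band: process noise is unaltered


# Four hypothetical particle weights, one of them dominating the set.
weights = [0.6, 0.3, 0.06, 0.04]
scales = [vpn_scale(s) for s in relative_weights(weights)]
```

In this example the dominant particle, which carries 60% of the total weight, has its process noise multiplied by 1.75, while the two lightest particles have theirs halved.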

These rules were added to the sweep tracker implementation of Section 4.7, and tested with

the test video sequence of Section 4.9. A video of the tracking output is available in

Appendix B and on the supporting webpage as "video_sequence_7.6.avi". The distance

metric performance results are shown in Figure 7.12. The most remarkable result is that the

sweep tracker with VPN is capable of tracking the whole test video sequence, including

frames 175 to 359, which correspond to very brisk rigid-hand motion. Note that with the

exception of the VPN tracker, Section 7.3, and the SCGS tracker, Section 7.4, no other

tracker implementation in this thesis could track this part of the test video sequence.

Figure 7.12: Distance metric performance measure when using VPN. Note that the use of VPN allows

tracking the test video sequence's section with brisk rigid-hand movement, frames 175 to 359.


In order to compare the tracking performance results between the sweep hand tracker with

VPN, and the sweep hand tracker with fixed process noise, we repeat the test initialising the

tracking at frames 30 and 360. Tracking is also initialised at frame 175 in order to have a

separate chart of the tracking performance during the brisk rigid-hand motion section. We

can see that from frames 30 to 174, Figure 7.13(a) and Figure 4.13(d), the version with fixed

process noise has better average distance metric, and smaller variance than the version with

VPN. However, from frames 360 to 890, Figure 7.13(c) and Figure 4.17(d), the version with

VPN has better average distance metric, and smaller variance.

We conclude that the use of VPN makes the hand tracking more robust against brisk rigid-hand
motion, and also more accurate when the rigid-hand motion is slow⁷, as is the case
from frames 360 to 890. However, in the first section of the test video sequence, from frames
30 to 174, when the global hand motion is medium, the tracking performance is slightly
reduced. It could be argued that during this section of the test video sequence the default
process noise is adequate, and the possibility of switching to a larger or smaller process
noise slightly degrades the tracking accuracy.

7.4 Skin Colour Guided Sampling (SCGS)

Isard and Blake (1989b) introduced a tracking technique named ICondensation. This

technique combines the use of a Condensation based contour tracker with a skin colour blob

tracker, in order to improve the robustness and allow for automatic reinitialisation of their

tracker. The blob tracker runs on a low-resolution image; it is fast and robust, but conveys

little information other than the object centroid. However, the information about the centroid

of the object is sufficient to describe what areas should be searched for information about the

object. This information can be introduced in the contour tracker in the form of an

importance function. The importance function g(X) describes which areas of the contour

tracker's state-space contain most information about the posterior. The idea is to concentrate

samples in those areas of state-space by generating samples from g(X) rather than the prior

p(X). The resulting samples are then weighted using a mixture weight which takes into

account both g(X) and p(X). The desired effect is to avoid as far as possible generating

any samples which have low weights, since they provide a negligible contribution to the

posterior.

7 From frame 360 to 890 the hand motion is predominately articulated, while the rigid-hand motion is generally slow or null.

(a)

(b)

(c)

Figure 7.13: Distance metric performance measure when using VPN, separate charts. Tracking is

initialised at the beginning of each chart, in order to compare performances with Figure 4.13, and

Figure 4.17.

In practice, Isard and Blake's (1989b) tracker implementation generates particles in three

different ways:

• Some particles are sampled from the particle distribution's prior; these particles are named

condensation samples.

• For other particles, the translation part of the state is sampled from the importance

function, and the deformation part of the state is sampled from the distribution's prior;

these particles are named importance samples.

• Finally, for the rest of the particles, the translation part of the state is sampled from the

importance function, and the deformation part is sampled from a prior distribution

independent from the tracker's history; these particles are named initialisation particles.
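The sampling idea described above can be sketched in a few lines of Python. This is only an illustration in a one-dimensional state-space, not Isard and Blake's implementation: the function names, the Gaussian form assumed for g(X), the mixture-of-Gaussians form assumed for p(X), and the use of the ratio p(X)/g(X) as the correction that accounts for both densities are all assumptions made for this sketch.

```python
import math
import random


def gaussian_pdf(x, mean, sigma):
    """Density of a 1D Gaussian N(mean, sigma^2) evaluated at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def prior_density(x, prev_particles, prev_weights, process_sigma):
    """p(X): assumed here to be a Gaussian mixture centred on the previous particles."""
    return sum(w * gaussian_pdf(x, s, process_sigma)
               for s, w in zip(prev_particles, prev_weights))


def condensation_sample(prev_particles, prev_weights, process_sigma, likelihood, rng):
    """Sample from the prior p(X); the weight is the measurement likelihood alone."""
    base = rng.choices(prev_particles, weights=prev_weights, k=1)[0]
    x = rng.gauss(base, process_sigma)
    return x, likelihood(x)


def importance_sample(prev_particles, prev_weights, process_sigma,
                      blob_centroid, blob_sigma, likelihood, rng):
    """Sample from g(X) and correct the weight so that p(X) is still respected."""
    x = rng.gauss(blob_centroid, blob_sigma)            # draw from g(X)
    g = gaussian_pdf(x, blob_centroid, blob_sigma)      # g(X) at the sample
    p = prior_density(x, prev_particles, prev_weights, process_sigma)
    return x, (p / g) * likelihood(x)                   # correction times likelihood
```

In this sketch a sample drawn near the blob centroid receives a large weight only if the prior also assigns it some probability, which is why the correction term involves both densities.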


Several methods similar to ICondensation have been used in visual-tracking for both single

and multiple targets (Wu and Huang, 2001; Pérez et al. 2004; Branson and Belongie, 2005).

In this section we propose a method based on ICondensation, which we call Skin Colour

Guided Sampling (SCGS). This method can be used to improve the robustness, and allow for

automatic reinitialisation, of the hand trackers presented in Chapter 4. The technique is based

on ICondensation, as described in (Isard and Blake, 1989b), and shares many of the same

elements. In particular, it shares the concept of condensation particles, importance particles,

and initialisation particles; however, it differs from ICondensation in a number of aspects:

• First, the low-level information comes from a skin coloured blob detection procedure on

the whole image, as opposed to blob tracking.

• Second, only the largest skin coloured blob is considered. This blob is analysed using

moments in order to convey extra information, and a heuristic method is used to calculate

insertion points for the importance samples, and initialisation samples.

• Third, the combination of the low-level information with the contour tracking does not

use an importance function. The insertion points previously calculated are used to

generate particles around them – in the form of importance particles and initialisation

particles. These particles use the extra information about the blob in order to initialise a

larger part of their state.

• Finally, ICondensation uses initialisation particles and importance particles

simultaneously with condensation particles during normal tracking. In skin colour guided

sampling, initialisation particles are used in combination with condensation particles only

when the tracking is considered lost, in order to reinitialise the tracking; importance

particles are used in combination with condensation particles only during normal

tracking, in order to confer robustness against sudden or brisk movements of the

target.

7.4.1 Skin coloured blob detection and analysis

In order to apply SCGS to a video sequence, the first step is to find the skin coloured blobs

for each of the frames of the video sequence. This is achieved using the LC skin colour

classifier, described in Chapter 5, on a decimated frame of size 160x120 pixels. The resulting

skin colour image (binary image) is then processed in order to eliminate small skin coloured

blobs (one iteration of erosion) and connect together nearby skin coloured blobs (two


iterations of dilation). Finally, the connected components of the processed image are found,

and if the largest connected component is larger than a minimum threshold, then it is selected

for analysis. Moment analysis is used in order to find the largest component's centroid.

Figure 7.14(a) shows the skin coloured blobs as white areas; the largest skin coloured blob is

indicated by a green border, and its centroid is indicated by a red circle.
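The detection steps described above (one erosion, two dilations, connected components, centroid from moments) can be sketched as follows. This is a minimal pure-Python illustration, not the thesis implementation: the 3x3 structuring element, the 8-connectivity, the border handling, the `min_area` default, and all function names are assumptions, and the binary input mask is assumed to come from a skin colour classifier that is not shown.

```python
from collections import deque

NEIGHBOURS_8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def _morph(mask, erode):
    """3x3 erosion (erode=True) or dilation (erode=False) of a binary mask.

    Pixels outside the image are ignored (an assumption of this sketch).
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [mask[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                      if 0 <= y + dy < h and 0 <= x + dx < w]
            out[y][x] = int(all(window)) if erode else int(any(window))
    return out


def largest_blob(mask, min_area):
    """Pixels (y, x) of the largest 8-connected component, or None if too small."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue, blob = deque([(y0, x0)]), []
            while queue:                      # breadth-first flood fill
                y, x = queue.popleft()
                blob.append((y, x))
                for dy, dx in NEIGHBOURS_8:
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(blob) > len(best):
                best = blob
    return best if len(best) > min_area else None


def centroid(blob):
    """Centroid (x, y) from the zeroth and first order moments of the blob."""
    m00 = len(blob)
    return (sum(x for _, x in blob) / m00, sum(y for y, _ in blob) / m00)


def detect_hand_blob(skin_mask, min_area=20):
    """One erosion, two dilations, then the largest connected component."""
    cleaned = _morph(skin_mask, erode=True)
    cleaned = _morph(_morph(cleaned, erode=False), erode=False)
    return largest_blob(cleaned, min_area)
```

A library routine would normally replace the explicit loops; they are written out here only to keep the sketch self-contained.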

ICondensation as defined in (Isard and Blake, 1989b; Isard, 1998) follows the same steps

outlined above, with slightly different processing of the input image. However, at this point,

the proposed SCGS proceeds differently. ICondensation uses the centroid of the blobs as the

mean for a two-dimensional Gaussian, which constitutes their importance function. An offset

from the mean of the Gaussian and the covariance of the Gaussian are learnt off-line from

previous tracking sequences. The importance function is then sampled in order to insert

importance samples into the Condensation distribution. This approach works well for their

test environment, and image processing steps; however, the approach presents a problem for

this thesis' hand tracking, because of the different image processing, different tracking

environment, and different tracking possibilities. In Figure 7.14(a) we can see a subject in

short sleeves holding his hand open in front of the camera. The centroid of the largest skin

coloured blob is quite far away from the hand contour position. If this centroid was used as

an insertion point for importance samples, the process noise would have to be increased to an

extent at which particles would be too spread out to be effective.

In order to avoid this problem we use a heuristic method for the calculation of the insertion

points. Firstly, the area, major axis, and eccentricity of the largest skin coloured blob are

calculated using moments (Kilian, 2001). Secondly, two insertion points are calculated along

the major axis of the biggest skin coloured blob, as indicated in Figure 7.14(b). Each

insertion point lies at the following distance from the blob's centroid:

Distance to Top Insertion Point = 5 log(blob's eccentricity + 1)

Distance to Bottom Insertion Point = 7.5 log(blob's eccentricity + 1)

The insertion points will constitute the translation component for the importance samples and

initialisation samples. The insertion points are calculated under the assumption that the

biggest blob corresponds to the user's arm or hand. When the user wears short-sleeves, one

insertion point is on top of the potential hand pivot position, indicated in Figure 7.14(b).

When the user wears long-sleeves, the other insertion point is on top of the potential hand

pivot position, indicated in Figure 7.14(c) and (d). These two situations constitute two


extremes, and the potential hand pivot position will typically fall at some point, on the major

axis of the blob, between these two extremes.
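The heuristic above can be illustrated with a short Python sketch. Several details are not stated in the text and are assumed here: the eccentricity is taken as the ratio of the major to minor axis spreads obtained from second order central moments, the logarithm is taken as natural, distances are in pixels of the decimated image, and the "top" insertion point is taken to be the one lying towards the top of the image. The function names are hypothetical.

```python
import math


def blob_shape(blob):
    """Area, centroid (x, y), major-axis angle and eccentricity of (y, x) pixels."""
    area = len(blob)
    cx = sum(x for _, x in blob) / area
    cy = sum(y for y, _ in blob) / area
    # Second order central moments, normalised by the area.
    mu20 = sum((x - cx) ** 2 for _, x in blob) / area
    mu02 = sum((y - cy) ** 2 for y, _ in blob) / area
    mu11 = sum((x - cx) * (y - cy) for y, x in blob) / area
    angle = 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)   # direction of the major axis
    spread = math.sqrt((mu20 - mu02) ** 2 + 4.0 * mu11 ** 2)
    lam_major = 0.5 * (mu20 + mu02 + spread)
    lam_minor = max(0.5 * (mu20 + mu02 - spread), 1e-9)  # avoid division by zero
    # ASSUMPTION: eccentricity defined as the major/minor axis ratio.
    eccentricity = math.sqrt(lam_major / lam_minor)
    return area, (cx, cy), angle, eccentricity


def insertion_points(centroid, angle, eccentricity):
    """Top and bottom insertion points along the major axis of the blob."""
    cx, cy = centroid
    ux, uy = math.cos(angle), math.sin(angle)
    if uy > 0:                       # image y grows downwards; make (ux, uy) point up
        ux, uy = -ux, -uy
    d_top = 5.0 * math.log(eccentricity + 1.0)       # ASSUMPTION: natural logarithm
    d_bottom = 7.5 * math.log(eccentricity + 1.0)
    top = (cx + d_top * ux, cy + d_top * uy)
    bottom = (cx - d_bottom * ux, cy - d_bottom * uy)
    return top, bottom
```

With a different eccentricity definition or logarithm base the distances change, so the constants 5 and 7.5 should be read together with whichever definition the tracker actually uses.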

(a) (b)

(c) (d)

Figure 7.14: Skin coloured blobs. (a) White areas indicate skin colour; the red dot is the centroid of the largest

skin coloured blob. (b) The major axis of the skin coloured blob is indicated with a red line; to both

sides of the blob's centroid there is an insertion point, indicated with pink circles. When the subject

wears short sleeves, (b), the top insertion point is closer to the real hand pivot. When the subject wears

long sleeves, (c) and (d), the bottom insertion point is closer to the hand pivot.

7.4.2 Combining low-level and high-level information

The insertion points determine the translation component of the initialisation particles. The

angle of the blob's major axis determines the angle of the initialisation particles. The scale of

the initialisation particle is determined using the following formula:

scale = Area2ScaleFactor × Blob's area

where Area2ScaleFactor is the ratio between the largest blob area and the hand scale, at the

first frame of tracking, or at the initialisation frame. The configuration of the fingers is taken

from an initial finger configuration, corresponding to the hand open with splayed fingers.

Figure 7.15(b) illustrates how initialisation samples take the translation, angle and scale from


the blob information. As initialisation samples operate when the lock on the hand is lost, they

make it possible to recover the lock on the hand, for example after the hand disappears and

then reappears in the tracking area.

The insertion points also determine the translation component of the importance samples, but

the scale, angle, and configuration of the fingers are taken from the particle with highest

weight in the previous time-step. Figure 7.15(c) illustrates how importance samples get the

finger configuration of the particle with highest weight from the previous time-step (shown

as a thicker blue contour).
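The way the two kinds of particles are assembled from the blob information can be sketched as below. The `Particle` fields, the function names, and the representation of the finger configuration as a tuple of numbers are hypothetical. The sketch also assumes that Area2ScaleFactor is computed as hand scale divided by blob area at the initialisation frame, so that the scale formula reproduces the hand scale at that frame, and it omits the random spread of particles around the insertion points.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Particle:
    x: float
    y: float
    angle: float
    scale: float
    fingers: Tuple[float, ...]   # hypothetical finger-configuration parameters
    vx: float = 0.0              # velocity terms stand in for the tracking history
    vy: float = 0.0


def area_to_scale_factor(initial_hand_scale, initial_blob_area):
    """ASSUMPTION: factor chosen so that factor * area gives the initial hand scale."""
    return initial_hand_scale / initial_blob_area


def initialisation_particle(insertion_point, blob_angle, blob_area, factor, open_hand):
    """Translation, angle and scale come from the blob; fingers from an open hand."""
    x, y = insertion_point
    return Particle(x, y, blob_angle, factor * blob_area, open_hand)


def importance_particle(insertion_point, best_previous):
    """Only the translation comes from the blob; the rest from the best particle."""
    x, y = insertion_point
    return replace(best_previous, x=x, y=y)
```

An initialisation particle built this way starts with zero velocity, whereas an importance particle inherits everything except its position from the highest-weight particle of the previous time-step.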

(a) (b) (c)

Figure 7.15: Importance samples and initialisation samples. (a) Shows together five importance samples in

magenta and five initialisation samples in light blue. (b) Initialisation samples take the angle and

scale from the blob information. (c) Importance samples take the fingers' configuration from the

particle with highest weight from the previous time-step.

7.4.3 Use of importance and initialisation particles

ICondensation as defined in (Isard and Blake, 1989b; Isard, 1998) uses initialisation and

importance particles simultaneously with condensation particles during normal tracking. It

was found that the use of initialisation particles during tracking can actually reduce the

tracking performance under certain situations. These situations correspond to those in which

the use of the tracker's dynamical model is more important, for example when a fast steady

motion of the user's hand occurs. Initialisation particles are generated in a way that makes

them likely to have a good fit on the target, and therefore be propagated to the following

time-step. However, initialisation particles do not have a tracking history and therefore they

cannot benefit from the dynamical model. Initialisation particles are generated as still

particles, with zero velocity. As a result, the particles generated from one initialisation

particle will be near (as far as the process noise allows them to spread) to the original

initialisation particle. This halts the tracking and degrades the tracking performance. This

halting of the tracking does not apply to importance samples, as those get the tracking history


from the particle with highest weight in the previous time-step. Importance samples do

contribute to strengthen the robustness of tracking when sudden, fast, motions of the target

occur.

In order to avoid the problems above, the initialisation particles are used in combination with

condensation particles only when the tracking is lost; and the importance particles are used in

combination with condensation particles only during normal tracking. The rule to detect

when tracking is lost and when tracking is normal uses a threshold, with time delay and

hysteresis, on the weight of the particle with highest weight in the previous time-step:

• When the weight is below Llimit for T consecutive frames, the tracking state is switched

to "lost".

• When the weight is above Hlimit for T frames, the tracking state is switched to "normal".

The values for Llimit, Hlimit and T are found empirically, and they depend on the

measurement model of the tracker, see Section 4.3. This switching of tracking states allows

the tracker to use initialisation particles and importance particles when they are more needed:

initialisation particles when the tracking is lost and needs to be reinitialised; and importance

particles when the tracking is normal, but sudden, fast, motions of the target may occur.
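The switching rule above amounts to a small state machine with hysteresis, which can be sketched as follows. The class and method names are hypothetical, and the sketch assumes that the T frames must be consecutive in both directions (the text states this explicitly only for the switch to "lost").

```python
class TrackingStateSwitch:
    """Threshold with time delay and hysteresis on the best particle weight."""

    def __init__(self, l_limit, h_limit, t_frames, state="normal"):
        self.l_limit = l_limit
        self.h_limit = h_limit
        self.t_frames = t_frames
        self.state = state
        self._below = 0   # consecutive frames with weight below l_limit
        self._above = 0   # consecutive frames with weight above h_limit

    def update(self, best_weight):
        """Feed the highest particle weight of the previous time-step."""
        self._below = self._below + 1 if best_weight < self.l_limit else 0
        self._above = self._above + 1 if best_weight > self.h_limit else 0
        if self.state == "normal" and self._below >= self.t_frames:
            self.state = "lost"
        elif self.state == "lost" and self._above >= self.t_frames:
            self.state = "normal"
        return self.state

    def extra_particles(self):
        """Particle type used alongside condensation particles in this state."""
        return "initialisation" if self.state == "lost" else "importance"
```

Because the two thresholds differ, a weight that hovers between Llimit and Hlimit leaves the current state unchanged, which prevents rapid toggling between the two states.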

7.4.4 Reinitialisation test

SCGS has been added to the sweep tracker of Section 4.7, and tested on a video sequence of

a user moving their hand rapidly in and out of the tracking area. The state of the tracking is

switched from normal to lost, and back to normal, following the rule described in the

previous section. When the user withdraws their hand from the tracking area, the tracking

state is switched to lost, and initialisation particles begin to be placed on top of the largest

skin coloured blob. When the user introduces their hand into the tracking area, some

initialisation particles are placed on top of the user's hand, until eventually, the tracking is

switched to normal, and importance particles begin to be placed on top of the largest skin

coloured blob. The number of both initialisation particles and importance particles used

during the tracking is 50; and the values for the tracking state switching rule are: Llimit =

0.3; Hlimit = 1000; T = 3.


[Figure panels. Left column: frames 82, 85, 87 and 93. Right column: frames 303, 305, 306 and 307.]

Figure 7.16: Reinitialisation test selected frames.

The output of the tracking sequence for this test is available in Appendix B and on the

supporting webpage as "video_sequence_7.7.avi" for the reinitialisation test showing the


tracking output and the skin colour blobs; and "video_sequence_7.8.avi" for the

reinitialisation test showing the tracking output, the skin colour blobs, the initialisation

particles and the importance particles. Figure 7.16 shows some selected frames from the

tracking sequence. On the left, at frame 82, the user's hand is outside the tracking area; at

frame 85, the user's hand begins to appear on the tracking area; at frame 87, a first lock is

gained on the user's hand; finally, at frame 93, the tracker has a good lock on to the user's

hand. On the right, at frame 303, the hand is outside the tracking area; after just five frames

the tracker manages to get a good lock onto the user's hand, in frame 307.

7.4.5 Robustness test

The robustness test consists in measuring the tracking performance of the sweep tracker,

Section 4.7, including SCGS, on the test video sequence of Section 4.9. The section from

frames 175 to frame 359, which corresponds to very brisk rigid hand movement, could not

be tracked successfully using the sweep tracker. Now, this section of the test video sequence

can be tracked successfully thanks to the use of importance particles. The number of

importance particles used during the tracking is 50; and the values for the tracking state

switching rule are: Llimit = 0.003; Hlimit = 1000; T = 3. Note that the Llimit has been lowered

in order to prevent initialisation particles from appearing during the brisk motion section of

the test video sequence.

A video of the tracking output is available in Appendix B and on the supporting webpage as

"video_sequence_7.9.avi". Figure 7.17 shows the distance metric performance measure of

the tracking output. Comparing Figure 7.17 with Figure 7.12, it is possible to see that the

distance metric for the section of brisk rigid motion has a slightly worse performance in the

SCGS chart than in the VPN chart. This result is to be expected, as the brisk rigid motion

section involves not only fast translations but also fast changes in the contour's angle and

scale. The importance particles only take the translation component from the skin coloured

blobs; the angle and scale of the particle are taken from the particle with highest weight in the

previous time-step. Hence, they can only handle the fast translations, but not the fast

rotations, and fast changes in scale; for this reason, VPN exhibits slightly better performance

in this section. The other sections of the tracking sequence show similar performance in both

the SCGS chart and the VPN chart.


In order to compare the tracking performance results between the sweep hand tracker with

SCGS, and without SCGS, we repeat the test in three sections: frames 30 to 174; frames 175

to 359; and frames 360 to 890. The tracker is initialised at the beginning of each section.

Results can be seen in Figure 7.18. In the first section, frames 30 to 174, the version with

SCGS has slightly higher average distance metric, although the peaks are shorter, and the

variance is smaller than the version without SCGS, Figure 7.18(a) and Figure 4.13(d). If we

compare it with the VPN version, Figure 7.13(a), we see that both the average distance

metric and the variance are smaller in the SCGS version. The second section, frames 175 to

359, which corresponds to brisk rigid hand motion, can only be compared with the

performance results of the VPN, Figure 7.13(b). The latter has slightly smaller average

distance metric and slightly smaller variance than the SCGS version, for the reasons

mentioned above. In the third section, frames 360 to 890, the SCGS version, Figure 7.18(c),

has smaller average distance metric and smaller variance than the version without SCGS,

Figure 4.17(d), and just slightly smaller average and variance than the VPN version, Figure

7.13(c).

Figure 7.17: Distance metric performance measure when using SCGS. Note that the use of SCGS allows

tracking the test video sequence's section with brisk rigid-hand movement, frames 175 to 359.


(a)

(b)

(c)

Figure 7.18: Distance metric performance measure when using SCGS, separate charts. Tracking is

initialised at the beginning of each chart, in order to compare performances with Figure 4.13, and

Figure 4.17.

7.4.6 Conclusions

Skin colour guided sampling as presented in this section enables the automatic reinitialisation

of the tracker when the target hand disappears from and then reappears in the tracking area.

It also improves the tracking robustness when fast translations of the hand occur; although

we have seen that the use of VPN, Section 7.3, produced slightly better performance than

SCGS during the brisk rigid motion section of the test video sequence. The implementation

of SCGS uses the LC skin colour classifier in both the blob detection and in the contour

likelihood measurement function.

On the other hand, there is a limitation of SCGS. The limitation appears when the skin

coloured blob corresponding to the user's hand joins other skin colour blobs in the image; the

resulting blob cannot be used to predict the hand parameters accurately; this is illustrated in


Figure 7.19. This is a bigger problem for the initialisation particles than for the importance

particles, as the dependence of the former on the blob information is stronger. This is

illustrated in Figure 7.19(b). The initialisation particles can get an incorrect hand pivot

position, incorrect rotation angle, and incorrect scale. The importance particles can only get

an incorrect hand pivot position.

(a) (b)

Figure 7.19: Skin coloured blobs mixing. (a) When skin coloured blobs mix, the resulting blob cannot predict

the hand parameters. (b) The position, angle, and scale of the initialisation samples, light blue,

depend entirely on blob information. Only the position of the importance samples, magenta,

depends on the blob information.

7.5 Combining tracking improvements

This section combines, in a single sweep tracker, three of the techniques presented in this

chapter. The result is an improved tracker that produces the best performance results so far,

and is good enough to be used in the applications of Chapter 8. The techniques involved in

this improved tracker are: the template switching methods of Section 7.1, the variable

process noise of Section 7.3, and the skin colour guided sampling of Section 7.4. The

quasi-random sampling of Section 7.2 is not used in this experiment as it does not bring a

performance improvement.

The improved sweep tracker is run on the test video sequence. Figure 7.20 shows the

distance metric performance measure for the test video sequence. The average distance and

variance are the smallest so far, and the peaks are shorter than the sweep tracker with VPN or

SCGS. A video of the tracking output for this experiment is available in Appendix B and on

the supporting webpage as "video_sequence_7.10.avi". In order to compare the tracking

performance between the bare sweep hand tracker and the sweep hand tracker including the


three techniques, the test is repeated in three sections: frames 30 to 174; frames 175 to 359;

and frames 360 to 890. Tracking is initialised at the beginning of each section. Figure 7.21

shows the distance metric results. In all three sections the performance is the best so far.

Figure 7.20: Distance metric for the sweep tracker with combined tracking improvements.

(a)

(b)

(c)

Figure 7.21: Distance metric for the sweep tracker with combined tracking improvements, separate

charts.

8 Virtual touch screen

This chapter presents a vision-based interactive surface which has been named the Virtual

Touch Screen (VTS) interface. The VTS attempts to move beyond the traditional mouse and

computer screen interface by generating in the environment a surface that is active both as a

display and as a touch-sensitive pad. The contents of a VTS can be displayed by either the

use of a projector, which projects the contents on a selected surface, or by the use of a see-

through Head Mounted Display (HMD), which displays the contents on the HMD, but

appears to the user to be floating in their field of view. The VTS can be made touch-sensitive

by visually tracking the user's hand and interpreting their hand position and configuration in

order to determine when and where a user's finger touches the VTS.

There are a number of technical challenges in the realisation of the VTS interface, but the

most difficult and challenging element is the visual tracking of the user's hand. This needs to

be able to track not only the user's hand but also its configuration. Articulated hand tracking

of this sort, without the help of hand markers, or special gloves, is currently a very active

area of research (Nolker and Ritter, 1999; MacCormick and Isard, 2000; Shimada et al.,

2001; Zhou and Huang, 2003; Stefanov, 2005; Stenger et al., 2006). In the absence of a

suitable articulated hand tracker that could satisfy the demanding requirements of the VTS, an

especially tailored articulated hand tracker was developed for this purpose. This articulated

hand tracker was developed, improved, and finally presented in previous chapters (Chapter 3

to 7) of this thesis, making this chapter the goal of this thesis.

This chapter starts by describing the concept of a VTS interface, possible configurations, and

proposed operation, using an idealized visual hand tracking technology. Then it moves on to

describing three implementations of the VTS interface, the last two of which make use of the

articulated hand contour tracking developed in this thesis. Finally, a number of potential

applications for the VTS interface are described, and conclusions are drawn.

8.1 The VTS interface

The concept of a VTS interface is analogous to that of a touch sensitive screen. A user can

see information presented on the screen and can directly interact with this information by

touching the screen. In a touch sensitive screen the information is displayed on the screen by

the relevant screen's technology (CRT, LCD, etc) and this screen is made touch sensitive by

adding a transparent touch-sensitive membrane, or alternative technology. In a VTS the

information is displayed by either using a projector, which projects the information on a

selected surface, or by using a see-through Head Mounted Display (HMD), which displays

the contents in the HMD, but these appear to users to be floating in their field of view.

The VTS is made touch-sensitive by visually tracking the user's hand and interpreting their

hand position and configuration. We will talk about a "Projector based VTS" when the

information in the VTS is displayed by using a projector, and we will talk about "HMD based

VTS" when the information in the VTS is displayed by using a see-through HMD.

Leaving aside the implementation issues related to the development of such an interface, a

technological requirement for the VTS interface to work is that the field of view between the

camera and the user's hand has to be clear. Similarly, if a projector is used in order to display

the VTS contents, the field of view between the projector and the screen has to be clear.

These requirements leave us with a number of possible interface element configurations:


Projector based VTS:

• Use of a front projector and an opaque screen in order to display the VTS information. A

camera placed on the same side of the screen as the projector, captures the user's hand

from behind. The set camera/projector/screen could be tilted to suit the user's

preferences. This configuration (illustrated in Figure 8.1(a)) requires tracking of the back

of the hand.

• Use of a rear projector and a diffuse screen in order to display the VTS information. A

wearable camera (for example, carried by the user on the shoulder) captures the user's

hand from behind. This configuration (illustrated in Figure 8.1(b)) requires tracking of

the back of the hand.

• Use of a rear projector and a diffuse screen in order to display the VTS information. A

camera placed on top of the VTS captures the user's hand from behind. The set

camera/screen/projector could be tilted to suit the user's preferences. This configuration

(illustrated in Figure 8.1(c)) requires tracking of the back of the hand.

• Use of a rear projector and a transparent screen, such as the commercially available DNP

HoloScreen (HoloScreen display material supports video projection and is nearly

transparent to IR and visible light (DNP, 2004)) in order to display the VTS information.

A camera behind the VTS captures the user's hand through the VTS (palm view). This

configuration (illustrated in Figure 8.1(d)) requires tracking of the front of the hand.

HMD based VTS:

• Use of a see-through HMD in order to display the VTS information as floating in the

user's environment. A camera placed behind the VTS captures the user's hand from its

front. This configuration (illustrated in Figure 8.1(e)) requires tracking of the front of the

hand.

• Use of a see-through HMD in order to display the VTS information as floating in the

user's environment. A camera placed on the HMD (allowing the implementation of

video see-through) captures the user's hand from behind. This configuration (illustrated

in Figure 8.1(f)) requires tracking of the back of the hand.


(a) Opaque screen + front projector. (b), (c) Diffuse screen + rear projector. (d) Transparent screen (such as DNP HoloScreen) + rear projector. (e), (f) See-through HMD.

Figure 8.1: Six possible interface element configurations for the VTS. (a), (b), (c) and (d) are projector based

VTSs. (e) and (f) are HMD based VTSs.


8.1.1 Hand tracking

Hand tracking is crucial in the development of the VTS interface. The hand tracking used in

a VTS has to be able to track the hand position and configuration throughout a video

sequence. It also has to be able to identify actions relevant to the VTS use. These actions are

primarily clicking with a finger or dragging a finger on the VTS surface. The accuracy and

repeatability with which these two actions (clicking and dragging) are able to be detected

will greatly determine the overall reliability of the VTS. Another important element of the

hand tracking is its real-time operation: a fast hand tracking will result in a responsive VTS

interaction, while a slow hand tracking will result in a non-responsive, uncomfortable, user

experience.

Full 3D articulated hand tracking is currently an active, and challenging, area of research in

the computer vision community (see literature review in Chapter 2). The challenge is even

bigger when using a single camera. However, the 2D nature of the VTS interaction does not

require full 3D articulated hand tracking. Considering this fact, a 2D articulated hand contour

tracking was proposed and developed in this thesis specially for the VTS. This hand contour

tracking requires the user to maintain their hand mostly parallel to the VTS. However, the

hand contour tracking can be very flexible, allowing real-time tracking of the user's hand in a

range of orientations both from the front (palm view) and from the back. The hand contour

tracking developed in this thesis was specially designed to handle the set of articulated

movements that are expected to happen when a user operates a VTS, in particular, fast

flexion/extension of the fingers, which can correspond to clicks or finger presses on the VTS

(see Section 4.6).

8.1.2 Operation

The operation of a VTS interface is similar to that of a touch sensitive screen. As in a touch

sensitive screen, the VTS user can directly click or drag information elements as they appear

on the display. However, due to the use of visual hand tracking technology, and the lack of

haptic feedback when the user clicks or drags an object on the VTS, the operation of a VTS

interface can be expected to be different to that of a touch sensitive screen.

The hand tracking technology may need an initialisation step before starting the tracking of

the user's hand. The purpose of this initialisation step is to gather information about the user's


hand (while this is in a predetermined configuration) in order to make the hand tracking

more specific to that particular hand. Examples of tracking parameters that could be set

during the initialisation step to match a particular user's hand include: initial hand position

and configuration, particular skin tone of the hand, particular hand shape, current

illumination levels, and estimation of a kinematic model for the hand.

In order to incorporate this initialisation step into the usage of the VTS, the following

operation stages are proposed (each of these stages can be indicated to the user by means of

visual or audio cues):

• Firstly, a hand shape can be displayed on the VTS in order to indicate to the user that the

VTS is ready to be initialised by a particular user. This stage is illustrated in Figure

8.2(a).

• Secondly, the user must place their hand inside the hand shape; at this moment the

system will detect the user's hand and tune a number of tracking parameters to this user.

This stage is illustrated in Figure 8.2(b).

• Thirdly, the hand shape disappears and it is replaced by the VTS display. At this point,

the user can start operating the VTS interface. This stage is illustrated in Figure 8.2(c).

(a) (b) (c)

Figure 8.2: Proposed VTS operation.

Due to the hand tracking technology and the lack of haptic feedback, the VTS operation may

require the user to click or drag objects in a certain way. For example, the user may have to

operate the VTS from a certain distance, and may have to flex their fingers with a certain

speed and intensity in order for clicks to be recognized. A number of visual or audio cues can

be given in order to help with this. For example, the tracked hand can be displayed in a

different colour when it is held at a certain distance from the VTS (indicating the VTS can be

operated from that distance). Then different audio and visual feedback can be given

depending on whether a click or a drag on the VTS is detected.


8.1.3 Usability

The VTS interface has the potential to enable a large number of applications. However, a

question arises about the usability of the VTS interface – the ease with which people can

employ the interface. This includes perception of the interface, learning curve, postural

comfort, etc. There are various points to take into account when evaluating the usability of

the VTS interface. The first point to think about is how comfortable it is for the user to operate

a VTS interface.

Kölsch (2004) studied the postural comfort for HCI systems similar to the projector based

VTS and to the HMD based VTS. The study resulted in the definition and mapping of certain

"comfort zones" for various types of single-handed interaction while standing. The resulting

comfort zone for hand placement is within a half moon-shaped area about 35 to 45

centimetres from the shoulder joint, at an angular range from 70 degrees adduction to 50

degrees abduction (away from the body centre). They did not investigate the wrist angle

within that comfort zone. However, various studies in ergonomics, mainly focused on

preventing Carpal Tunnel Syndrome while typing, using a mouse, or other work related tasks

suggest that the wrist angle should not exceed 20° extended, nor be bent to either side (Bach

et al., 1997; Wellman et al., 2004). On the other hand, Sears (1991) studied the use of touch

screen keyboards at various tilt angles and with various key sizes. Three tilt angles over the

horizontal were studied: 30, 45, and 75 degrees, concluding that 75 degrees resulted in

more fatigue and lower preference ratings, and 30 degrees resulted in less fatigue and higher

preference.

From the number of suggested applications it is possible to see that the VTS interface, either

projector based or HMD based, can be deployed in a large range of orientations, with the

only major restriction being that the field of view between the camera and the user's hand (and

between the projector and the VTS in the case of a projector based VTS) must be clear. One

possible way to guarantee a comfortable interaction with a projector based VTS is to deploy

the screen at the correct height and tilt angle. This can allow a standing user to operate the

VTS with their hand in the comfort zone and with their wrist angle not exceeding 20° of

extension. In the HMD based VTS, if this is HMD stabilized, the user can look slightly

downwards so that the whole field of view is in the comfort zone, although this is the user's

choice (the system can be operated outside the comfort zone if necessary). Alternatives to


this mode of operation in the HMD based VTS are possible by changing the camera's

position and orientation. For example in HandVu (Kölsch, 2004) the camera is mounted on

the top of the HMD and pointed slightly downwards; in this way the user can look forward

and at the same time operate the HandVu system while keeping their hand in the comfort

zone. On the other hand, in a HMD based VTS the interface elements could potentially be

displayed with perspective, and the click and drag detection could be adapted to work on the

new tilted VTS. This new configuration could offer enough flexibility so that the user can

always operate the VTS in their hand comfort zone.

Another point to take into account when evaluating the usability of the VTS interface is the

type of use this interface would have. The use of a VTS interface is analogous to that of a

touch sensitive screen, and touch sensitive screens are better suited to information systems

with limited data entry (Gleeson et al., 2004). A study made by Sears (1991) shows the

average words per minute that a user can input using a touch-screen keyboard (25 WPM), a

mouse activated keyboard (17 WPM), and a physical keyboard (53 WPM). He concludes

that where a large amount of data entry is required a keyboard is necessary. This suggests

that the best use of a VTS would be for interactive retrieval of information, like selecting

icons, items from menus, pointing at and dragging of objects, and input of short strings.

Another point to consider when evaluating the usability of the VTS interface is the learning

curve to use the interface. On the one hand, as the click and drag detection mechanisms of a

VTS are vision based, the detection of a particular click or a drag will depend on the

visibility (from the camera's point of view) of the acting finger at that moment in time. Self-

occlusions between fingers and the hand can hide some legitimate clicks on the VTS. This

will require the user to adapt to a particular way of clicking and dragging. This adaptation

will be related to the hand tracking technology. A given hand tracking technology may

impose harder movement restrictions (palm and fingers orientations and movements) than

another, resulting in a longer learning curve for the use of the interface. On the other hand,

the lack of haptic feedback may result by itself in longer learning curves. The lack of haptic

feedback could be alleviated by maximizing acoustic and visual cues.

Finally, in the case of the HMD based VTS, there is another point to consider when

evaluating its usability – this is simulator sickness (Heider, 1998; Mollenhauer, 2004).

Simulator sickness is a condition where a person exhibits symptoms similar to motion


sickness caused by prolonged use of a HMD. So far, HMD technology has always involved

some degree of simulator sickness due to various reasons. These include visual aspects of the

HMD such as the time lag in the presentation of information, magnification level, field of

view, optics, etc; and other hardware aspects such as size and weight of the HMD, HMD

fitting, etc. However, HMD technology is improving very rapidly, which will result in

higher resolutions, lower weights, better optics, and reduced simulator sickness. Even now,

there are monocular HMD models with resolutions of 680x400 that can be attached to a pair

of glasses and weigh as little as 35 grams (SV-6 PC viewer (MicropOptical, 2005)).

There are even some HMD manufacturers that claim to have eliminated "cybersickness"

(LightVu (Mirage Innovations, 2006)). It is the opinion of this author that in the near future,

improvements in the HMD technology will make HMDs transparent and comfortable to the

users and they will largely eliminate simulator sickness effects. This will result in a

widespread use of HMDs, and consequently, the potential popularity of HMD based VTS

interfaces.

8.2 Implementations

Three VTS generations have been implemented in this thesis. The first generation VTS

demonstrates the VTS concept with the implementation of a VTS based keypad. In this

generation, the user's hand is tracked from the front (palm view) by using simple image

processing techniques, which require the use of a black background. The second generation

VTS uses the articulated hand contour tracking developed in chapters 3 to 7. In this

generation, the user's hand is tracked from its front, on an arbitrary background, and fewer

hand motion restrictions are imposed than in the first generation VTS. Finally, the third

generation VTS uses the same tracker as the second generation one but with some minor

modifications. This generation is a HMD based VTS. The camera that captures the user's

hand is mounted on a HMD, thus the hand is captured from its back. The HMD allows the

user to see the VTS contents floating in their field of view. Next, the three VTS generations

will be described in detail.

8.2.1 Projector based VTS (First Generation)

The first implementation of a projector based VTS was used to demonstrate the possibility of

detecting key-presses or clicks by using a digital camera and image processing. In this early

version of the VTS the background is black in order to simplify the image processing. The


output of the VTS appears on the computer's screen and the user has to type on a frame

(where a VTS keypad is supposed to exist). This frame has a grid of threads that indicates

where the keypad keys are. The set up of the system is illustrated in Figure 8.3.

Figure 8.3: Set up of the first generation VTS.

Figure 8.4: Image processing for the first generation VTS.

Using simple image processing techniques the finger-tips and finger-valleys of the user's

hand are detected (as indicated in Figure 8.4). The projected lengths of the fingers can be

calculated from these hand features. The lengths of the fingers are continuously monitored in

order to detect changes in length that could be identified as key-presses. When the user's

hand is close enough to the VTS and one of these length changes happens, the final position

of the fingertip, before the finger recovers back to the rest position, is checked. If this

position is inside the area of a key, a key-press is recognized. Figure 8.5 shows the key-press

detections from a video sequence of a user typing a telephone number. A video sequence

showing the operation of this first generation VTS is available in Appendix B and on the

supporting webpage as "video_sequence_8.1.avi".


Figure 8.5: Typing a telephone number on the first generation VTS. In the chart the horizontal axis is the

frame number inside the sequence, and the vertical axis is the estimated finger's length.
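
The key-press detection described above can be illustrated with a minimal sketch. It assumes that a key-press shows up as a temporary dip of the projected finger length below a fraction of the rest length; the function name, the `rest_length` argument, and the `drop_ratio` value are hypothetical and are not taken from the first generation implementation.

```python
def detect_key_presses(lengths, rest_length, drop_ratio=0.85):
    """Return the frame index of the last flexed frame of each dip.

    lengths     -- projected finger length per frame
    rest_length -- projected length of the fully extended finger
    drop_ratio  -- assumed fraction of rest_length below which the
                   finger counts as flexed (illustrative value)
    """
    presses = []
    was_flexed = False
    for frame, length in enumerate(lengths):
        is_flexed = length < drop_ratio * rest_length
        # The finger has just recovered: the previous frame holds the
        # final fingertip position before the return to rest.
        if was_flexed and not is_flexed:
            presses.append(frame - 1)
        was_flexed = is_flexed
    return presses
```

The fingertip position at each returned frame would then be checked against the key areas, and only evaluated while the hand is close enough to the VTS.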

This first VTS generation demonstrates the basic ideas in a VTS and, at the same time, it has

to deal with the same functional blocks as future VTS implementations, these are: interface

initialisation, and touch detection. However, in order to implement a more flexible and

reliable working VTS a different hand tracking technology is needed. Articulated hand

contour tracking is the choice of the next VTS generations.

8.2.2 Interface initialisation

In Section 8.1.2 the initialisation sequence of a VTS was introduced. This initialisation

involves the user placing their hand on a hand shaped contour in order for the tracker to

initialise a number of parameters for the hand tracking. This initialisation sequence is needed


because of the hand tracking technology. This section refers to the initialisation of the

articulated hand contour tracker developed in this thesis for the VTS (chapters 3 to 7). The

initialisation sequence for this hand contour tracking technology involves three states which

are indicated to the user by the colour of the hand contour:

• The first stage (equivalent to that of Figure 8.2(a)) is indicated with a red hand contour.

In this state the hand contour is static in the centre of the field of view, and it will remain

in this state until a user places his hand with splayed fingers on top of the hand template.

This state is illustrated in Figure 8.6(a).

• The second stage (equivalent to that of Figure 8.2(b)) is indicated with a green hand

contour. During this state the user's hand contour is tracked. However, only global hand

tracking and abduction/adduction of the fingers is tracked. The flexion/extension of the

fingers is not tracked yet because the projected length of the fingers (when fully

extended) may need to be calculated. The hand tracking will remain in this state until the

location of the hand is considered good enough to perform the hand initialisation. In

order to tell when the hand location is good enough the score of the tracked hand contour

is monitored. When this score is above a certain threshold for 10 consecutive frames the

tracking state switches to the third state. The frame and hand contour configuration that

produced the best score in those 10 consecutive frames is used for the initialisation of

some parameters, such as tuning of the skin colour model, while the frame and hand

contour configuration at the moment of switching to the third state is used for other

parameters, such as the initial tracking position. This state is illustrated in Figure 8.6(b).

• The third stage (equivalent to that of Figure 8.2(c)) is indicated with a blue hand contour.

When the tracking switches to this state, a number of parameters are initialised. During

this state the tracker performs a fully articulated contour tracking of the user's hand. The

tracking will continue in this state for as long as the location of the hand contour is good

enough. If the output of the hand contour tracker has a low fitness for more than three

consecutive frames, the location of the tracked hand is considered lost, at which point the

state of the tracker goes back to the first state (red hand contour) waiting for a new

initialisation. This state is illustrated in Figure 8.6(c).


Figure 8.6: Initialisation states of the VTS hand contour tracker. (a) First tracking state, waiting for

initialisation. (b) Second tracking state, partial articulated tracking. (c) Third tracking state, full

articulated tracking.
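
The three tracking states can be summarised as a small state machine. The following is only a sketch under stated assumptions: the score threshold value, the use of the same threshold for "good" and "low" fitness, and the `hand_on_template` flag (standing in for whatever test detects a hand placed on the red template) are hypothetical. The 10 consecutive good frames and the loss after more than three consecutive low frames follow the description above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class TrackerState(Enum):
    WAITING = "red"    # static template, waiting for a hand
    PARTIAL = "green"  # global motion and abduction/adduction only
    FULL = "blue"      # fully articulated contour tracking


@dataclass
class InitialisationStateMachine:
    score_threshold: float = 0.8   # assumed value, not from the thesis
    frames_to_initialise: int = 10
    max_low_frames: int = 3
    state: TrackerState = TrackerState.WAITING
    best_frame: Optional[int] = None  # frame used e.g. for skin colour tuning
    _good_run: List[Tuple[float, int]] = field(default_factory=list)
    _low_run: int = 0

    def update(self, frame, score, hand_on_template=False):
        if self.state is TrackerState.WAITING:
            if hand_on_template:
                self.state = TrackerState.PARTIAL
                self._good_run = []
        elif self.state is TrackerState.PARTIAL:
            if score > self.score_threshold:
                self._good_run.append((score, frame))
            else:
                self._good_run = []  # the run must be consecutive
            if len(self._good_run) == self.frames_to_initialise:
                # Best-scoring frame of the run; the current frame gives
                # the initial tracking position.
                self.best_frame = max(self._good_run)[1]
                self.state = TrackerState.FULL
                self._low_run = 0
        else:  # TrackerState.FULL
            low = score <= self.score_threshold
            self._low_run = self._low_run + 1 if low else 0
            if self._low_run > self.max_low_frames:
                self.state = TrackerState.WAITING  # hand lost
                self._low_run = 0
        return self.state
```

Calling `update` once per frame returns the current state, whose value names the colour in which the hand contour would be drawn.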

When the tracker switches from state 2 (green hand contour) to state 3 (blue hand contour)

the initialisation of a number of parameters occurs. These parameters are listed next:

Initial tracking position

As the initialisation can only occur if the fitness of the hand contour template is high, the

point at which the initialisation occurs is a good point to start fully articulated tracking. The

configuration of the hand contour at that point is then used as the initial tracking position.

Skin colour tone of the user's hand

The hand contour tracker uses the LC skin colour classifier described in Chapter 5. This skin

colour classifier can be tuned (as described in Section 5.3) to the particular user's skin tone

during the VTS initialisation. The hand contour tracker can generate the masks necessary for

the LC classifier tuning at this point. The configuration of the hand contour that has the best

fitness in the last 10 frames previous to this point is used for the tuning of the LC classifier.

The tuning of the LC classifier in this manner assumes that the lighting conditions during

hand tracking will be more or less the same as during initialisation.

Estimation of a kinematic model

A kinematic model of the user's hand could be used in order to determine the configuration

of the hand from a 2D image, although a kinematic model of just the fingers would be

enough to calculate the separation of a fingertip from the palm plane. The separation of

a finger from the palm plane together with the estimated distance between the palm and the

VTS could be used to detect clicks on the VTS surface. In order to use such a kinematic


model the length of each finger segment needs to be known. If the length of the segments is

not known beforehand, it could be estimated from the hand tracking at the initialisation

stage. A possible simple method produced for this purpose involves estimating the length of

the finger segments from the finger flexion creases in a frontal image (palm view) of the

user's hand taken at the initialisation point. The method involves scanning the pixels of a line

drawn along the finger and finding the flexion creases as minima in the R channel (from an

RGB triad). Once the flexion creases are found in the image, the position of the joints in the

finger can be calculated using certain correction offsets. As the Index, Middle, Ring, and

Little fingers all have three segments and three joints, the procedure is the same for these

four fingers. Figure 8.7 illustrates the procedure to estimate the length of the finger segments

during the initialisation stage. The method has not been fully tested, and it is expected that

illumination changes can affect the detected position for the flexion creases; however, it

produces some initial results that can be used with a finger kinematic model in order to

detect clicks on the VTS surface.

Figure 8.7: Finding finger creases. (top) sampling line along index finger. (bottom) samples of the line

showing three local minimums which correspond to three finger flexion creases.
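
The crease search can be sketched as a search for the deepest local minima of the R channel samples taken along the finger line. This is an illustrative sketch only: it assumes that the samples are ordered from the base of the finger towards the fingertip, that no smoothing is needed, and that the correction offsets are given in samples. The function names and the default offsets are hypothetical.

```python
def find_flexion_creases(red_samples, count=3):
    """Indices of the `count` deepest local minima, ordered along the line."""
    minima = [
        i for i in range(1, len(red_samples) - 1)
        if red_samples[i] < red_samples[i - 1]
        and red_samples[i] <= red_samples[i + 1]
    ]
    deepest = sorted(minima, key=lambda i: red_samples[i])[:count]
    return sorted(deepest)


def segment_lengths(crease_indices, fingertip_index, offsets=(0, 0, 0)):
    """Segment lengths (in samples) between corrected joints and fingertip."""
    # Each joint is taken as its crease position plus a correction offset.
    joints = [c + o for c, o in zip(crease_indices, offsets)]
    bounds = joints + [fingertip_index]
    return [b - a for a, b in zip(bounds, bounds[1:])]
```

On real images the samples would probably need low-pass filtering before the minima search, since illumination changes and skin texture can introduce spurious minima.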


Shape of the hand

The shape of a hand contour may vary from user to user. These variations may produce a

poor fit of the hand template to some users; and a poor fit results in a poor tracking

performance. A possible improvement to the hand tracking would involve tailoring a hand

contour template for each particular user's hand. The best moment to create this template

would be the initialisation stage. A modeless method to find the user's hand contour, such as

Snakes (Kass et al., 1987), could be used, and from the found hand contour a new hand

template could be created. This method could make the hand tracker robust to different users.

8.2.3 Touch detection

The hand contour tracking technology used in the VTS makes it possible to know what the

state of the hand is at every frame of the input video sequence. This hand state has to be

interpreted in order to determine when a user's fingertip touches the VTS surface (this is also

referred to as a finger click on the VTS). Touch detection refers to the method used in order to

determine this event. When considering a touch detection method, it is important to

remember that the hand contour tracking used in the VTS requires the user to keep their hand

approximately parallel to the camera's image plane (the VTS plane). With this in mind, a

number of methods for touch detection are possible:

Kinematic model

One possibility for interpreting the hand contour is to use a kinematic model of the fingers.

In the previous section a method of finding the lengths of the finger segments was suggested.

Once the length of the finger segments is calculated using this method, the kinematic model

of the user's fingers can be used in order to calculate the separation of a fingertip from the

palm plane. This separation, in combination with the estimated distance between the hand

and the VTS, can be used to determine when a finger is touching the VTS. The procedure to

find the separation of the fingertip from the palm plane involves the calculation of the

reverse kinematics of a chain of three links (the finger segments). The input of the procedure

is the 2D projected length of the finger; the output is the angles of the joints for that finger.

This process is fully explained in Appendix A.1. Figure 8.8 shows (on the left) a hand

flexing the middle finger. The calculation of the reverse kinematics allows us to find the

configuration of the finger joints (on the right) and, therefore, the separation of the fingertip

from the palm plane. A video sequence of a hand tracker that both calculates the finger


segment lengths from the finger flexion creases, and then calculates the reverse kinematics is

available in Appendix B and on the supporting webpage as "video_sequence_8.2.avi".

Figure 8.8: Hand undergoing flexion of middle finger. A kinematic model for the fingers allows us to calculate

the stick-out of the finger from the hand palm.

The accuracy of this method depends upon two factors: the accuracy with which lengths of

the finger segments are calculated; and the assumption that the flexion/extension of the

fingers follows a typical profile (as described in Appendix A.1). Ultimately, an uncertainty

band is necessary. If the combination of the finger separation from the palm plane and the

distance between the hand and the VTS is inside this uncertainty band, a finger click on the

VTS is detected.
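
The following sketch illustrates the idea of recovering the finger configuration from its 2D projected length. It does not reproduce the procedure of Appendix A.1. Instead it assumes a deliberately simple flexion profile in which each of the three joints flexes by the same angle theta, so that the projected length decreases monotonically for theta between 0 and 60 degrees and can be inverted by bisection. The function names and the `band` parameter are hypothetical.

```python
import math


def forward(lengths, theta):
    """Projected length and palm-plane separation for equal joint flexion."""
    # Link i is inclined by (i + 1) * theta with respect to the palm plane.
    projected = sum(l * math.cos((i + 1) * theta) for i, l in enumerate(lengths))
    separation = sum(l * math.sin((i + 1) * theta) for i, l in enumerate(lengths))
    return projected, separation


def solve_flexion(lengths, projected_length, max_theta=math.pi / 3, iterations=60):
    """Find theta whose projected length matches the measurement (bisection)."""
    lo, hi = 0.0, max_theta
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if forward(lengths, mid)[0] > projected_length:
            lo = mid  # projection still too long: the finger is flexed more
        else:
            hi = mid
    return 0.5 * (lo + hi)


def is_click(separation, hand_to_vts_distance, band):
    """Click when the fingertip reaches the VTS within the uncertainty band."""
    return abs(hand_to_vts_distance - separation) <= band
```

Given the three segment lengths, `solve_flexion` returns the assumed common joint angle, and `forward` then yields the separation of the fingertip from the palm plane that is compared against the hand-to-VTS distance.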

Thresholds

All the processes required in order to detect a finger click on the VTS using a kinematic

model for the fingers can be greatly simplified by using thresholds. A combination of

multiple thresholds involving the 2D projected length of the fingers for each of the distances

between the hand and the VTS can effectively do the same job as the kinematic model for

the fingers. These thresholds can be calculated from the kinematic model during the

initialisation stage, and then be stored in the form of a lookup table. During tracking the lookup

table with the thresholds is continuously tested in order to detect finger clicks on the VTS.

This is several times faster than calculating the reverse kinematics. A further simplification

using thresholds involves the use of a single threshold for the distance between the hand and

the VTS (triggered when the hand is near enough to the VTS), and a single threshold for the

2D projected length of finger (triggered when the finger is flexed beyond a point).

Contour trackers based on particle filters tend to exhibit a certain degree of jitter in their

output contours. The same is true for the articulated hand contour tracker developed in this


thesis for the VTS. This fact, together with variations in illumination of the fingers, results in a

variable measurement of the 2D projected length of the finger (which corresponds directly to

the finger length parameter in the contour's state), and consequently potentially incorrect

finger click detections on the VTS. One way of making the touch detection more robust to

these finger length variations is to consider the rate of change of the finger length. Thus, a

simple and relatively robust method for touch detection involves three thresholds: one for the

distance between the hand palm and the VTS; another for the finger length; and finally

another for the rate of change of the finger length. This touch detection method relies on the

user touching the VTS surface in a particular way, which is determined by the three

thresholds.
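
A minimal sketch of the three-threshold test is given below. The parameter names are hypothetical, the rate of change is taken as the shortening of the finger length between two consecutive frames, and no threshold values from the thesis are implied.

```python
def three_threshold_click(palm_distance, finger_length, previous_length,
                          max_distance, max_length, min_shortening):
    """Return True when all three threshold conditions hold in this frame.

    palm_distance   -- estimated distance between the hand palm and the VTS
    finger_length   -- 2D projected finger length in the current frame
    previous_length -- 2D projected finger length in the previous frame
    """
    near_enough = palm_distance <= max_distance
    flexed_enough = finger_length <= max_length
    # Rate of change: how much the finger shortened since the last frame.
    fast_enough = (previous_length - finger_length) >= min_shortening
    return near_enough and flexed_enough and fast_enough
```

The rate condition is what rejects slow drifts of the measured length caused by contour jitter or illumination changes, at the cost of requiring the user to flex the finger briskly.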

Moving threshold

A touch detection method that is both more reliable and lessens the constraints imposed on

the way of clicking on the VTS (constraints on the amount of finger flexion and speed of the

flexion) is based on a moving threshold. The proposed method uses an Exponentially

Weighted Moving Average (EWMA) (NIST, 2006) of the finger length, and from this

EWMA a Lower Control Limit (LCL) is calculated. If the length of the finger becomes

smaller than the current LCL, then a click on the VTS is triggered. The EWMA and the LCL

are calculated as follows:

$\mu = \alpha x + (1 - \alpha)\,\mu_{-1}$ (8.1)

$\mathrm{LCL} = \mu - k\sigma^{2}$ (8.2)

where $x$ is the current finger length, $\mu$ and $\mu_{-1}$ are the EWMAs of the finger length for the

current and previous time-steps respectively, $\alpha$ is the degree of filtering, $\sigma^{2}$ is the variance

of the finger length, and $k$ is a constant that modulates the distance between the EWMA and

the LCL. The parameter $\alpha$ can vary from 0 to 1. If $\alpha$ is near 1, the filtering of $x$ is weak. If

$\alpha$ is near 0, the filtering of $x$ is strong. The larger the constant $k$, the larger the amount of

finger flexion required to trigger a finger click, and the longer the duration within which the finger

flexion can be performed (slower finger clicks are possible). The values for $\alpha$, $k$, and $\sigma^{2}$

have been found empirically, resulting in $\alpha = 0.65$, $k = 1.2$, and $\sigma^{2} = 0.1$. Figure 8.9 shows the

finger length, x, and the LCL for the index finger during a video sequence of interaction with

a VTS. The vertical axis is the finger length. The horizontal axis is the frame number.

Arrows indicate the points at which a finger click on the VTS is detected.


Figure 8.9: Touch detection using a moving threshold. The vertical axis is the finger length (parameter in the

hand contour state). The horizontal axis is the frame number from a video sequence of interaction

with the VTS. Arrows indicate the points at which a finger click on the VTS is detected.
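
The moving threshold can be sketched in a few lines. The sketch reads equation (8.1) as the EWMA update, with the current EWMA equal to alpha times x plus (1 - alpha) times the previous EWMA, and equation (8.2) as the LCL being the EWMA minus k times the variance. The class name and the seeding of the EWMA with the first measurement are assumptions; the default parameter values are the empirical ones quoted above.

```python
class MovingThreshold:
    """EWMA of the finger length with a Lower Control Limit (LCL)."""

    def __init__(self, alpha=0.65, k=1.2, variance=0.1):
        self.alpha = alpha
        self.k = k
        self.variance = variance
        self.ewma = None
        self.lcl = None

    def update(self, x):
        """Feed the current finger length; return True if x is below the LCL."""
        if self.ewma is None:
            self.ewma = x  # assumption: seed with the first measurement
        else:
            self.ewma = self.alpha * x + (1 - self.alpha) * self.ewma
        self.lcl = self.ewma - self.k * self.variance
        return x < self.lcl
```

With these values a small dip from 10.0 to 9.8 stays above the LCL, whereas a drop to 9.0 within one frame falls below it, so the threshold follows slow changes of the finger length and reacts only to quick flexions.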

Debouncing

The term debouncing is used here as an analogy with keyboard technology. In keyboard

technology, debouncing refers to the filtering of spurious electric signals just before and after

a key is pressed and released. This filtering prevents the detection of various key presses

when only one is correct. In the VTS the concept refers to the filtering of the finger length

against sudden and brief changes, which could be produced by the contour jitter rather than

from a finger click. The technique involves waiting for two frames before triggering a finger

click, and two frames before releasing the finger click. During these two frames the finger

length has to be either below LCL (for key press) or above LCL (for key release). This same

procedure is used in keyboards for debouncing purposes.
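
A possible sketch of the debouncing logic follows. It assumes that a press (or release) is confirmed once the raw condition has held for `hold_frames` consecutive frames, and it remembers the first frame of that run so that the click position can later be taken from it. The class name and the exact frame counting are assumptions; the implementation used in the VTS may count the waiting frames differently.

```python
class ClickDebouncer:
    """Confirm a press or release only after it persists for several frames."""

    def __init__(self, hold_frames=2):
        self.hold_frames = hold_frames
        self.pressed = False
        self._run = 0           # consecutive frames requesting a change
        self._run_start = None  # first frame of the current run

    def update(self, frame, below_lcl):
        """Return ("press", first_frame), ("release", first_frame) or None."""
        if below_lcl == self.pressed:
            self._run = 0  # nothing to change, or a brief jitter ended
            return None
        if self._run == 0:
            self._run_start = frame
        self._run += 1
        if self._run < self.hold_frames:
            return None
        self.pressed = below_lcl
        self._run = 0
        return ("press" if self.pressed else "release", self._run_start)
```

A single-frame excursion below the LCL is discarded, because the run counter is reset as soon as the raw condition returns to the confirmed state.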

Detecting the position of a finger click

Once a finger click is triggered, the position in the VTS where that event happened can be

calculated from the state of the hand contour. However, there are two points to take into

account when calculating this position. Firstly, the debouncing mechanism triggers a finger

click after the finger length is below LCL for two consecutive frames, but the finger click

position must be calculated using the length of the finger in the first of these two frames.

Secondly, the calculated finger click position has to be corrected with a small vertical offset.


Sears (1991) studied the ergonomics of touch-screen keyboards and the effect of the key size

in their use. He reported that subjects consistently touched below targets. This phenomenon

has also been observed in a VTS based keypad. This is the reason why the calculated finger

click position is corrected with a small vertical offset. The size of this offset depends on the

size of the target and may be different for each particular VTS interface.
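
As an illustration, the corrected click position can be mapped to a key as sketched below. The keypad layout, the key geometry, and the sign convention are assumptions: image coordinates are taken to grow downwards, so a touch that lands below the target is corrected by subtracting the offset from the vertical coordinate.

```python
PHONE_KEYPAD = ("123", "456", "789", "*0#")  # hypothetical layout


def key_at_click(fingertip, vertical_offset, origin, key_size,
                 layout=PHONE_KEYPAD):
    """Return the key under the offset-corrected fingertip, or None."""
    x, y = fingertip
    y -= vertical_offset  # users tend to touch below the target
    col = int((x - origin[0]) // key_size[0])
    row = int((y - origin[1]) // key_size[1])
    if 0 <= row < len(layout) and 0 <= col < len(layout[row]):
        return layout[row][col]
    return None
```

For example, with keys 50 units wide and 100 units tall, a fingertip at (75, 105) would fall on the key '5' without correction and on the key '2' with an offset of 10 units.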

Dragging

We have seen various techniques to detect a finger click on the VTS surface. Another

important interaction that a VTS should support is dragging. Dragging is implemented by

first detecting a finger click, storing the length of the finger at that point, Lclick, and

establishing a threshold for that finger, LengthThreshold = Lclick + Margin (the value of

Margin can be different for each finger). If the finger length is below LengthThreshold the

finger is considered to be dragging on the VTS, and the position of the dragging fingertip

will be reported for every time-step. If the finger length goes above LengthThreshold the

drag operation is considered finished. Thus, every finger drag operation starts with a finger

click operation followed by a translation of the hand contour, while keeping the length of the

finger below LengthThreshold.
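
The dragging rule can be sketched as follows. The class and method names are hypothetical, and a finger length exactly equal to LengthThreshold is treated here as the end of the drag, which the text leaves open.

```python
class DragTracker:
    """Track a drag that starts with a finger click (one instance per finger)."""

    def __init__(self, margin):
        self.margin = margin
        self.length_threshold = None  # None while no drag is active

    def on_click(self, click_length):
        # LengthThreshold = Lclick + Margin
        self.length_threshold = click_length + self.margin

    def update(self, finger_length, fingertip_position):
        """Return the fingertip position while dragging, otherwise None."""
        if self.length_threshold is None:
            return None
        if finger_length < self.length_threshold:
            return fingertip_position
        self.length_threshold = None  # finger extended again: drag finished
        return None
```

Because the threshold sits above the click length by the margin, small length fluctuations during the hand translation do not interrupt the drag.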

8.2.4 Projector based VTS (Second Generation)

The articulated hand contour tracking developed in chapters 3 to 7, together with the

initialisation techniques and touch detection techniques described in the previous two

sections enables the implementation of a second generation VTS. In this implementation the

hand tracking is from the front of the hand (palm view). This is the same point of view as the

hand contour trackers presented in chapters 4, 6, and 7. The VTS contents are designed to

be projected onto a screen such as DNP HoloScreen. This type of screen works as a display

surface when light from a rear projector is incident at a particular angle, but it is transparent

to all other light. This allows us to both project the VTS contents onto it and do hand

tracking through it (as illustrated in Figure 8.1(d)). However, a HoloScreen was not available

while developing this VTS generation. As an alternative, the VTS was tested using a non-

reflective glass, on which the outline of some interfaces such as keypads, slider bars, etc,

were drawn. The VTS interfaces are aligned with the interfaces drawn on the glass; this

allows the user to operate the virtual interfaces while using the drawn ones as a visual aid.

The feedback produced by the VTS interfaces when clicks and drags occur is shown on


the computer screen. Note that with this set up the user's hand is tracked through the drawn

interfaces (camera behind the screen); however, this does not affect the hand contour

tracking because the interfaces are drawn with thin lines. Figure 8.10 shows the set up of the

second generation VTS.

Figure 8.10: Set up of the second generation VTS. The VTS is on a non-reflective glass and the VTS

feedback is displayed on the computer screen. However, this VTS generation is meant to be used in

combination with a projector and a screen such as DNP HoloScreen. This screen works as a

transparent surface from the camera's point of view but as a display surface from the projector's

point of view.

This VTS implementation follows the initialisation stage described in Section 8.2.2. During

this initialisation stage only the initial tracking position and LC skin colour classifier are

initialised. This VTS implementation uses the moving threshold touch detection method; the

finger click debouncing; the finger click position calculation; and the finger dragging

methods as described in Section 8.2.3.

In this VTS implementation when the user's hand is close enough to the VTS (for clicking on

it), a light blue circle appears in the centre of the palm. This indicates to the user that their

hand is at the correct distance to operate the VTS. In fact, touch detection operates only

when this circle appears on the hand contour. The VTS was tested with two types of


interfaces: a keypad, and a slider bar. Figure 8.11 shows a sequence consisting of four

consecutive frames of a user pressing the key '0' on a keypad interface. In frame 150 the

index finger of the user is completely extended. In frame 151, the finger flexes rapidly

stopping on top of the key '0'. This amount of finger flexion and the speed with which it

occurred (1 frame) is enough to trigger a finger click. However, because of the debouncing

mechanism the key press will not be confirmed until frame 153. Effectively, in frame 152 the

key press is not yet confirmed. In frame 153, the key press is confirmed (using as location

for the click that of the fingertip in frame 151).

Figure 8.11: Keypad usage. Key press sequence (frames 150, 151, 152, and 153).

In order to use a slider bar the user needs to follow three steps: firstly, click with a finger

onto the slider bar's cursor; secondly, drag the cursor along the slider bar up to the desired

position; and thirdly, lift the finger from the slider bar's cursor. Figure 8.12 shows four

frames which illustrate the usage of a slider bar. In frame 455, the user clicks on the slider

bar's cursor. In frame 459, the cursor is captured and the dragging can start. In frame 476 the

cursor has been dragged down the slider bar. Note that in order to improve the robustness of

the slider bar interface, once the cursor is captured only the vertical coordinate of the user's


fingertip is used for the dragging. This means that even if the fingertip that captured the

cursor does not move exactly along the slider bar, the cursor will still be dragged to the

fingertip vertical position. Slider bars in common GUI interfaces have this same behaviour.

A video sequence showing the usage of a keypad and two slider bars in the second

generation VTS is available in Appendix B and on the supporting webpage as


Figure 8.12: Slider bar usage. Dragging sequence (frames 452, 455, 459, and 476).
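
The slider bar behaviour described above, where only the vertical coordinate of the fingertip is used once the cursor is captured, can be sketched as follows. The function name, the normalised output range, and the track coordinates are hypothetical.

```python
def slider_value(fingertip_y, track_top, track_bottom):
    """Map the fingertip's vertical coordinate to a slider value in [0, 1].

    The horizontal coordinate is deliberately ignored, so the cursor
    follows the finger even when it strays sideways from the bar.
    """
    clamped = min(max(fingertip_y, track_top), track_bottom)
    return (clamped - track_top) / (track_bottom - track_top)
```

Clamping keeps the cursor at the end of the bar when the fingertip moves past it, which matches the usual behaviour of slider bars in desktop interfaces.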

8.2.5 Tracking from the back of the hand

Leading towards the third generation VTS, a new hand tracking point of view had to be

tested. In the third generation VTS the camera that captures the user's hand is mounted on a

HMD, as shown in Figure 8.15. This involves tracking the user's hand from its back. In

principle, the articulated hand contour tracker developed in chapters 3 to 7 is capable of

tracking the user's hand both from the front view (palm view), and from the back view. The

only requirement is that the camera that captures the user's hand has to be approximately

parallel to the hand palm. However, when the user's hand is tracked using a camera mounted


on a HMD, the hand is typically held just in front of the user (as opposed to holding the hand

in front of the user's shoulder, as in the case of the second generation VTS) so that the hand

stays near the centre of the camera's field of view. This point of view may result in the hand

contour appearing slightly different. Also, in this configuration (camera mounted on a

HMD), if the user turns around while operating the VTS, it is possible that the illumination

conditions of the hand could change substantially (depending on the light sources in the

room). For these reasons, some minor modifications have been made to the hand contour

tracker.

The first modification is in the hand contour template. The template is a horizontally flipped

version of the hand contour template described in Section 4.1. However, in this hand

tracking configuration (camera mounted on a HMD) the user's right hand is normally held

near the centre of the camera's field of view, and the user's arm appears in the lower right

quadrant of the image. In order to better adapt to the way the user's hand is held in this

configuration, the right hand side of the hand contour template is slightly shortened (see

Figure 8.13). On the other hand, the fact that the user's arm normally appears in the lower

right quadrant of the image is taken into account during hand tracking initialisation, when the

LC skin colour classifier is tuned. When the LC classifier is tuned two masks need to be

generated from the hand tracking information. These masks were referred to in Section 5.3 as

SkinMask and BackgroundMask. These masks are meant to segment the skin colour area of

the user's hand (SkinMask), and avoid any obvious skin colour areas in the image

(BackgroundMask). The obvious skin colour areas include the user's hand and arm.

Therefore, when generating the BackgroundMask the potential position of the user's arm is

taken out of the mask. Figure 8.14 shows the two initialisation images used in the LC

classifier tuning. These masks are equivalent to those shown in Figure 5.6 for the new

tracking point of view.

Finally, in this hand tracking configuration it is possible for the illumination conditions of

the hand to change substantially (as the user turns around while operating the VTS). These

illumination changes may result in poor skin colour segmentation of the user's hand, and

consequent loss of tracking. In order to cope with illumination changes the LC classifier is

repeatedly tuned to the current hand skin colour, once for every frame of tracking. The

procedure is referred to as dynamic tuning and it is described in detail in Section 5.9. A

video sequence showing hand tracking from the back of the hand (and which uses dynamic


tuning) is available in Appendix B and on the supporting webpage as


Figure 8.13: New hand contour template.

Figure 8.14: New initialisation image masks. (a) Initialisation image segmented by SkinMask. (b)

Initialisation image segmented by BackgroundMask.

Hand tracking from the back of the hand can also enable other VTS configurations (apart

from the HMD based configuration). The camera that captures the user's hand can be

wearable or be placed somewhere on top of the VTS, so that the hand can be tracked from its

back. Examples of these configurations appear in Figure 8.1(b) and (c).

8.2.6 HMD based VTS (Third Generation)

The third generation VTS is a HMD based VTS. In this configuration the camera that

captures the user's hand is mounted on the HMD, and the user's hand is tracked from its

back. The VTS contents are displayed in the HMD and appear to the user as if they were

floating in their field of view. Figure 8.15 shows the set up of the third generation VTS. The

HMD model used in this set-up is an I-glasses (I-O Display Systems, 2006). The HMD

contains two 800×600 LCD screens, one in front of each eye, although in this model both


screens show the same video output. The camera used in this set-up is a QuickCam Pro 3000

(Logitech, 2006). The camera was disassembled in order to replace the default lens with a

fisheye 2.1mm focal length lens, which provides an approximately 150° field of view. Then,

the camera was placed in a plastic housing and the assembly was attached to the front of the HMD

using Velcro strips. As the lens produces a fisheye image, the camera is first calibrated, and

the video from the camera is undistorted using CalibFilter from OpenCV (OpenCV, 2006).

Figure 8.15: Set up of the third generation VTS. The VTS contents are displayed in the HMD, and appear to

the user as if they were floating in their field of view. A camera is placed on the HMD in order to

capture the user's hand from its back. This allows tracking and interpretation of the user's

hand interaction with the VTS.

The third generation VTS uses the modified hand tracker (tracking from the back of the

hand) described in Section 8.2.5. The initialisation of the VTS interface follows the same

sequence as the second generation VTS (described in Section 8.2.2). A moving threshold and

debouncing (as described in Section 8.2.3) are used for the touch detection. The fingertip

dragging detection is also the same as in the second generation VTS.

In order to demonstrate the operation of the third generation VTS, two virtual interface

elements are implemented: a keypad and a spinning wheel. The keypad (shown in Figure

8.16(a)) has a display area and a dragging bar. When keys are pressed on the keypad they

produce visual and acoustic feedback and the typed numbers appear in the display area.

When the user clicks on the dragging bar, the keypad can be repositioned on the VTS by

dragging it. The keypad is used in the following demonstrations in order to launch the wheel

and control some aspects of the VTS interface by typing in certain codes. The spinning

wheel (shown in Figure 8.16(b)) has a spinning area (area between the outer circle and the

inner circle), and a dragging handle (area in the inner circle). Users can spin the wheel by

clicking and dragging their fingertip around the spinning area. The dragging handle area

allows the users to reposition the wheel on the VTS. The spinning wheel can be used to

control values with a continuous magnitude.
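
One way in which a fingertip dragged around the spinning area could be turned into an angular position is sketched below. This is an illustrative sketch under stated assumptions, not the method used in the VTS: the function names, the radii and the choice of accumulating wrapped angle increments are all hypothetical.

```python
import math


def angle_of(cx, cy, fx, fy):
    """Angle (radians) of the fingertip (fx, fy) around the wheel centre."""
    return math.atan2(fy - cy, fx - cx)


def wrapped_delta(previous, current):
    """Smallest signed rotation from previous to current, in (-pi, pi]."""
    delta = current - previous
    while delta <= -math.pi:
        delta += 2.0 * math.pi
    while delta > math.pi:
        delta -= 2.0 * math.pi
    return delta


def in_spinning_area(cx, cy, fx, fy, inner_radius, outer_radius):
    """True if the fingertip lies between the inner and the outer circle."""
    r = math.hypot(fx - cx, fy - cy)
    return inner_radius < r <= outer_radius


def spin(cx, cy, inner_radius, outer_radius, fingertip_path, start_angle=0.0):
    """Accumulate the wheel rotation produced by a dragged fingertip path."""
    wheel_angle = start_angle
    previous = None
    for fx, fy in fingertip_path:
        if not in_spinning_area(cx, cy, fx, fy, inner_radius, outer_radius):
            previous = None  # fingertip left the spinning area
            continue
        current = angle_of(cx, cy, fx, fy)
        if previous is not None:
            wheel_angle += wrapped_delta(previous, current)
        previous = current
    return wheel_angle


# Fingertip dragged a quarter of a turn at radius 50 around centre (100, 100).
path = [(100 + 50 * math.cos(a), 100 + 50 * math.sin(a))
        for a in (0.0, math.pi / 4, math.pi / 2)]
turned = spin(100, 100, 20, 60, path)
print(round(math.degrees(turned), 1))
```

Accumulating wrapped increments, rather than reading the absolute angle, lets the wheel turn through more than one revolution. In image coordinates the vertical axis points downwards, so positive angles correspond to clockwise rotation on screen.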

Figure 8.16: Example virtual interfaces. (a) Keypad. (b) Spinning wheel.

Despite the fact that the HMD presents 2D information only (with the same video output for

the two HMD screens), an illusion of depth is generated with a combination of colour cues

and occlusions. When the user's hand is far from the VTS (and in front of it), the user's hand

occludes the VTS interfaces, and the tracked hand contour is displayed in a light blue colour

(see Figure 8.17(a)). The occlusion of the VTS interfaces by the user's hand gives the

impression to the user that their hand is in front of the VTS interfaces. At this distance the

user's hand is too far away to operate the VTS. When the user's hand gets nearer to the VTS

(and still in front of it), the user's hand occludes the VTS interfaces, and the colour of the

tracked hand contour changes to dark blue (see Figure 8.17(b)). At this distance the user's

hand is near enough to operate the VTS. When the user's hand goes behind the VTS, the

colour of the tracked hand remains dark blue, but the VTS interfaces occlude the user's hand

(see Figure 8.17(c)). The occlusion of the tracked hand by the VTS interface gives the


impression to the user that their hand is certainly behind the VTS. The VTS cannot be

operated when the user's hand is behind it.

Figure 8.17: Illusion of depth perception. (a) The user's hand is too far away from the VTS to operate it.

This is indicated with a light blue hand contour. (b) The user's hand is near enough to the VTS

to operate it. This is indicated with a dark blue hand contour. (c) The user's hand is behind the VTS

and cannot operate it. This is indicated by occluding the hand with the keypad.


These three states (far away from the VTS, near enough to the VTS, and behind the VTS) are

controlled with the scale of the tracked hand contour. Two thresholds with hysteresis are set

manually on the hand contour scale, so that the switching between these three states

produces convincing depth perception.
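
The switching between the three states can be illustrated with a small state machine. The sketch below is not the thesis implementation: the threshold values, the margin and the class name `DepthState` are hypothetical, and it assumes that a larger contour scale means that the hand is closer to the head-mounted camera (and therefore further in front of the VTS).

```python
FAR, NEAR, BEHIND = "far", "near", "behind"


class DepthState:
    """Three-state switch driven by the scale of the tracked hand contour."""

    def __init__(self, far_near=1.2, near_behind=0.8, margin=0.05):
        self.far_near = far_near        # threshold between FAR and NEAR
        self.near_behind = near_behind  # threshold between NEAR and BEHIND
        self.margin = margin            # half-width of the hysteresis band
        self.state = FAR

    def update(self, scale):
        """Move at most one state per frame; a change requires crossing
        the relevant threshold by more than the margin."""
        if self.state == FAR:
            if scale < self.far_near - self.margin:
                self.state = NEAR
        elif self.state == NEAR:
            if scale > self.far_near + self.margin:
                self.state = FAR
            elif scale < self.near_behind - self.margin:
                self.state = BEHIND
        else:  # BEHIND
            if scale > self.near_behind + self.margin:
                self.state = NEAR
        return self.state


depth = DepthState()
scales = [1.30, 1.18, 1.14, 1.18, 1.22, 1.26, 0.90, 0.74, 0.80, 0.86]
states = [depth.update(s) for s in scales]
print(states)
```

With hysteresis, a scale that jitters around a threshold (1.18 and 1.22 in the example) does not make the state flicker, because a state only changes once the scale has moved past the threshold by more than the margin.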

The occlusion of the VTS interfaces is implemented using the skin colour of the user's hand.

Each of the VTS interfaces is continually testing the image area they occupy for skin colour.

If skin colour is detected in that area, this skin colour is used to create a binary mask. The

mask is morphologically processed (in order to have more compact blobs) and used to

selectively display the pixels of the interface. The pixels inside the interface area where there

is no skin colour (mask values are zero) are displayed normally, but the pixels of the

interface area where there is skin colour (mask values are one) are not displayed, and

therefore the skin colour will appear on the image instead. This gives the impression to the

user that their hand is occluding the interface, and therefore it must be in front of it.
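
The selective display of interface pixels can be sketched as follows. This is a simplified illustration in plain Python, not the code used in the VTS: the binary skin mask is taken as given (in the VTS it would come from the skin colour classifier), the morphological processing is assumed to be a closing with a 3×3 neighbourhood, and all function names are hypothetical.

```python
def _neighbourhood(mask, r, c):
    """Values of the in-bounds pixels in the 3x3 window centred on (r, c)."""
    rows, cols = len(mask), len(mask[0])
    return [mask[i][j]
            for i in range(max(0, r - 1), min(rows, r + 2))
            for j in range(max(0, c - 1), min(cols, c + 2))]


def dilate(mask):
    """A pixel becomes 1 if any pixel in its neighbourhood is 1."""
    return [[1 if any(_neighbourhood(mask, r, c)) else 0
             for c in range(len(mask[0]))] for r in range(len(mask))]


def erode(mask):
    """A pixel stays 1 only if every pixel in its neighbourhood is 1."""
    return [[1 if all(_neighbourhood(mask, r, c)) else 0
             for c in range(len(mask[0]))] for r in range(len(mask))]


def close_mask(mask):
    """Morphological closing (dilation, then erosion): fills small holes."""
    return erode(dilate(mask))


def composite(camera, interface, skin_mask):
    """Show the camera pixel where skin is detected, the interface elsewhere."""
    mask = close_mask(skin_mask)
    return [[camera[r][c] if mask[r][c] else interface[r][c]
             for c in range(len(mask[0]))] for r in range(len(mask))]


# Skin blob with a one-pixel hole, as a noisy segmentation might produce.
skin = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
camera = [["hand"] * 7 for _ in range(7)]
interface = [["key"] * 7 for _ in range(7)]
shown = composite(camera, interface, skin)
for row in shown:
    print(" ".join(row))
```

In this example the hole in the middle of the blob is filled by the closing, so the hand appears as one compact region on top of the interface.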

8.2.7 Third generation VTS experiments

In this section the VTS operation is tested in four experiments. These experiments aim to

demonstrate the capabilities of the third generation VTS interface, and they give an idea of

its potential applications. In all four experiments the VTS contents are HMD stabilised, i.e.

the contents appear to float in the field of view regardless of where the user is looking.

Briefly:

• The first experiment involves testing the VTS operation against a plain background. This

makes the skin colour segmentation easier and the hand tracking more precise.

• The second experiment involves testing the VTS operation against a complex

background. A complex background makes the skin colour segmentation harder and the

hand tracking has to cope with non-ideal hand segmentation.

• The third experiment is similar to the first one but in this experiment the VTS interfaces

can be resized with a hand gesture.

• The fourth experiment uses the VTS to implement a drawing application.


VTS operation with a plain background

In this experiment, a user operates the VTS in front of a white wall. In this situation, the

skin colour of the user's hand can be easily segmented from the background. This makes the

articulated hand contour tracking more precise. The experiment is recorded as a video

sequence which is available in Appendix B and on the supporting webpage as

"video_sequence_8.5.avi". Some example frames from this video sequence are shown in

Figure 8.18.

Frame 232 Frame 272 Frame 347

Frame 1269 Frame 1284 Frame 1304

Figure 8.18: Example frames of the actions occurring during the first experiment (third generation VTS).


The experiment starts with the user initialising the hand contour tracker by placing their hand

on the floating red hand contour. Once the hand tracking starts (blue hand contour), a keypad

appears in the centre of the field of view. Then the user proceeds to sequentially type on the

keypad from the top row to the bottom row. During the first two rows the user types with

their hand parallel to the VTS (example frame in Figure 8.18 Frame 223). For the third row

the user types with their hand tilted forward (example in Frame 272). And on the last row

the user types with their hand tilted to the left (example in Frame 347). These three ways of

typing illustrate the freedom with which the user can type on a VTS. After this, the user

drags the keypad to one side (Frame 425), and types the code "123" and enter "E" (Frame

524). This code launches a spinning wheel interface (Frame 591). The spinning wheel is

dragged to a new position (Frame 757) and it is spun using the index finger (Frame 591),

then the middle finger (Frame 815), and finally the ring finger (Frame 886). Finally, the user

types the code "1" in the keypad. This code makes the spinning wheel control the brightness

of the input video with the angular position of the wheel (Frame 1269, Frame 1284, and

Frame 1304). Note that throughout the whole video sequence the shadow of the user's hand can

be seen on the wall; this shows that the user is operating a non-physical interface.

VTS operation with a complex background

In this experiment, a user operates the VTS against a complex background. A complex

background makes the skin colour segmentation harder as some colours in the background

can be misclassified as skin. This makes the tracking of the

user's hand more difficult. The experiment is recorded as a video sequence which is available in Appendix

B and on the supporting webpage as "video_sequence_8.6.avi". Some example frames from

this video sequence are shown in Figure 8.19.

This experiment is very similar to the first one, the only difference being the use of a complex

background. The experiment proceeds as follows: Firstly, the VTS is initialised, and a

keypad appears on the centre of the screen. Then the user types on the keypad (example

frame on Figure 8.19 Frame 412), launches the spinning wheel (typing code "123E"), and

drags the keypad to one side (Frame 595). Then, the user spins the wheel (Frame 834) and

activates the brightness control feature (typing code "1E") (Frame 1014). This allows the

user to control the brightness of the input video with the angular position of the wheel

(Frame 1149, and Frame 1158).


Frame 412 Frame 595

Frame 834 Frame 1014

Frame 1149 Frame 1158

Figure 8.19: Example frames of the actions occurring during the second experiment (third generation

VTS).

VTS operation with resizable interfaces

This experiment tests two qualities of the VTS interface: firstly, the possibility of using hand

gestures to control aspects of the VTS interface; and secondly, the range of interface sizes

that the user can comfortably operate. In this experiment the size of the VTS interface


elements can be controlled with the position of the thumb. The procedure is as follows:

firstly, the user has to click on the dragging handle of the interface element and hold the

finger down as if the interface was going to be dragged (optionally the interface can be

dragged); then, the thumb of the user's hand has to be completely flexed in order to activate

the resizing mechanism; from that moment on, when the thumb is flexed (past a threshold)

the size of the interface increases; conversely, when the thumb is completely extended

(past a threshold) the size of the interface decreases. The experiment is recorded as a video

sequence which is available in Appendix B and on the supporting webpage as

"video_sequence_8.7.avi". Some example frames from this video sequence are shown in

Figure 8.20.
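
The resizing procedure described above can be summarised as a small state machine. The sketch below is only an illustration under stated assumptions: thumb flexion is represented by a hypothetical normalised value between 0 (completely extended) and 1 (completely flexed), the same flexion threshold is used both to activate the mechanism and to increase the size, and the thresholds, step and class name `ResizeGesture` are invented for the example.

```python
class ResizeGesture:
    """Resize an interface element with the thumb while it is being held."""

    def __init__(self, size=100, flexed=0.8, extended=0.2, step=5, min_size=20):
        self.size = size          # current size of the element (pixels)
        self.flexed = flexed      # flexion at or above this counts as flexed
        self.extended = extended  # flexion at or below this counts as extended
        self.step = step          # size change per frame (pixels)
        self.min_size = min_size
        self.active = False       # True once the resizing mechanism is on

    def update(self, holding, thumb_flexion):
        """Process one frame and return the current size."""
        if not holding:
            # The finger left the dragging handle: resizing is switched off.
            self.active = False
            return self.size
        if not self.active:
            # Flexing the thumb while holding activates the mechanism.
            if thumb_flexion >= self.flexed:
                self.active = True
            return self.size
        if thumb_flexion >= self.flexed:
            self.size += self.step
        elif thumb_flexion <= self.extended:
            self.size = max(self.min_size, self.size - self.step)
        return self.size


gesture = ResizeGesture()
frames = [(True, 0.1), (True, 0.9), (True, 0.9), (True, 0.9),
          (True, 0.5), (True, 0.1), (False, 0.1), (True, 0.1)]
sizes = [gesture.update(holding, flexion) for holding, flexion in frames]
print(sizes)
```

Requiring the thumb to be flexed before any resizing takes place means that an extended thumb during ordinary dragging does not shrink the interface by accident.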

The experiment starts by initialising the interface. Then after typing on the keypad, the user

clicks and holds their finger on the keypad's dragging bar (example frame in Figure 8.20

Frame 291). At this point, the user flexes their right hand's thumb (Frame 306) and the

keypad's resizing mechanism starts to operate. As the user keeps their thumb flexed the size

of the keypad increases (Frame 319). After typing on the resized keypad, the user proceeds

to decrease the size of the keypad. For that, the user clicks on the keypad's dragging bar and

while holding their finger on the dragging bar, flexes the thumb in order to activate the

keypad's resizing mechanism (Frame 510). Then, the user extends their thumb completely

(Frame 532) and the size of the keypad starts to decrease (Frame 554). The user types the

code "123E" on the resized keypad (Frame 686). This launches the spinning wheel, the size

of which is first decreased (Frame 949, and Frame 1023), and then increased (Frame 1315).

Finally, the resized spinning wheel is operated. As the wheel is now large, the user operates

it using three fingers (Frame 1589, and Frame 1596).





Frame 1315 Frame 1589 Frame 1596

Figure 8.20: Example frames of the actions occurring during the third experiment (third generation

VTS).

VTS based drawing application

As a last experiment using the third generation VTS, a drawing application was developed.

In this application the user can draw strokes on the VTS by clicking and dragging their finger

on the virtual surface. The drawing application has a toolbar consisting of a number of

buttons, see Figure 8.21. From left to right, the first group of 6 buttons allows the user to

select the colour with which the strokes are drawn. The current drawing colour is indicated

by a red border in the corresponding button. The following button in the toolbar allows the


user's hand to occlude the drawings or vice versa. When the state of this button is "_" the

drawings occlude the hand, when the state of the button is "^" the hand occludes the

drawings. The next button in the toolbar allows the user to delete a previously drawn stroke.

When the user clicks on the button, it changes from "D" to "X". When the state of this

button is "X" the user can click on a previously drawn stroke, and it will disappear

from the screen. The last button on the toolbar is "C". When the user clicks on this button the

drawing area is cleared.

Figure 8.21: Drawing application toolbar (third generation VTS).

A video sequence showing the operation of this drawing application is available in Appendix

B and on the supporting webpage as "video_sequence_8.8.avi". The first thing to observe in

this video sequence is that the user withdraws their hand and reintroduces it into the VTS's field

of view without going through the initialisation sequence. The hand contour tracking for this

drawing application uses the SCGS (Skin Colour Guided Sampling) technique as described

in Section 7.4. This technique allows the VTS to automatically initialise the hand tracking as

soon as the user's hand appears in the field of view (typically within 5 frames). As a result,

the user can easily withdraw their hand from the field of view (and keep it in a rest position),

and then introduce it again into the field of view when interaction with the VTS is required.

Figure 8.22 shows some frames from "video_sequence_8.8.avi", which illustrate how the

hand tracking is automatically initialised within 5 frames from the moment at which the hand

starts to appear in the field of view.




Frame 95 Frame 96

Figure 8.22: Automatic hand tracking reinitialisation feature.

The video sequence "video_sequence_8.8.avi" continues with the user operating the drawing

interface in order to draw a simple scene with a house, a tree, a tractor, clouds and a sun.

Figure 8.23 shows some example frames from this video sequence, which illustrate the type

of actions occurring during the operation of the drawing application. By frame 2405 the user

has managed to draw most of the intended scene, having had to change the drawing colour

several times. Note that the strokes are smoothed out as they are drawn. This makes the

stroke appear slightly delayed as the user drags their hand on the VTS, but it prevents the

potential jitter of the hand contour from being reflected in the stroke. Also note that the

tracked hand contour is always dark blue; this means that the user can draw strokes in the

VTS regardless of the distance between the hand and the drawing surface (this new

behaviour does not apply to the toolbar). Frame 2515 illustrates how the user's hand can

occlude the drawings (by switching the "_" button to "^") creating the illusion that the hand

is on top of the drawing. Note that in previous frames the drawings were on top of the hand

(occluding the hand). Frames 2769 and 2792 illustrate how the user can delete a stroke (by


switching the "D" button to "X" and clicking on the desired stroke). The user deletes the

stroke corresponding to the tree and redraws it with a different shape (Frame 3044). Finally,

the user clears the drawing area by clicking on the "C" button (Frame 3355).
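
The trade-off between smoothness and delay mentioned above can be illustrated with a simple filter. The smoothing method used in the drawing application is not specified here, so the sketch below assumes an exponential moving average purely for illustration; the function name and the value of `alpha` are hypothetical.

```python
def smooth_stroke(points, alpha=0.5):
    """Exponentially smooth a sequence of (x, y) fingertip positions.

    alpha close to 1 follows the fingertip closely (little smoothing);
    alpha close to 0 smooths heavily but lags further behind.
    """
    smoothed = []
    for x, y in points:
        if not smoothed:
            smoothed.append((float(x), float(y)))
        else:
            px, py = smoothed[-1]
            smoothed.append((px + alpha * (x - px), py + alpha * (y - py)))
    return smoothed


# The fingertip jumps from x = 0 to x = 10 and then stays still.
raw = [(0, 0), (10, 0), (10, 0), (10, 0)]
stroke = smooth_stroke(raw)
print(stroke)
```

The smoothed stroke approaches the fingertip position gradually (5.0, 7.5, 8.75 along the horizontal axis), which is the kind of delay noted in the text; the same averaging attenuates frame-to-frame jitter in the tracked fingertip.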




Frame 3044 Frame 3117 Frame 3355

Figure 8.23: Example frames of the actions occurring during the drawing application experiment (third

generation VTS).

The drawing application allows the user to draw using either the index finger, the middle

finger, or the ring finger. It also allows the user to draw using multiple fingers at the same

time (multiple points of input). This results in drawing multiple strokes at the same time. A

video sequence illustrating drawing with multiple fingers at the same time is available in


Appendix B and on the supporting webpage as "video_sequence_8.9.avi". Some example

frames from this video sequence are shown in Figure 8.24. Frame 182 shows how the user

clicked and dragged their index and middle fingers on the VTS, which produced two parallel

strokes. Frame 390 shows how the user simultaneously draws three horizontal strokes

representing a sea. Frame 1063 shows how the user simultaneously draws the two eyes of a

sun. Finally, frame 1192 shows the finished drawing.

Frame 182 Frame 390

Frame 1063 Frame 1192

Figure 8.24: Example frames showing drawing with multiple fingers at the same time (multiple points of

input).

8.3 Applications

The unique characteristics of the VTS interface lead to a number of possible applications that

go beyond emulating traditional touch sensitive screens. Section 8.2 gave a basic idea of

the potential of the VTS interface as implemented in three VTS generations. This section

describes a number of potential applications of the VTS interface. Some of the suggested

applications may involve other technologies too, but the VTS remains central in the


application. Depending on whether the VTS is projector based or HMD based the potential

applications vary slightly.

Possible applications for a projector based VTS are:

• An alternative to touch sensitive panels. This alternative is particularly attractive when

large touch sensitive panels are required, because these are difficult to construct and thus

quite expensive. This would enable applications such as large desktops, interactive

drafting boards, and large interactive information points.

• VTSs make possible not only traditional Windows, Icons, Menu, Pointer (WIMP)

interaction but also direct manipulation of objects in the VTS, and gesture recognition.

For example, objects displayed in the VTS could be directly relocated, resized, and

reoriented using a particular gesture, or just by clicking on them and dragging them in a

particular way (similar to direct manipulation of objects on screen using a mouse). It is

possible to detect two or more fingers clicking at the same time on the VTS (multiple

points of input). Finally, it is also possible to use two identical hand trackers (one for

each hand) and use both hands in the interaction. This would enable applications such as:

direct manipulation of maps in GIS software, direct drawing media, or the simulation of

musical instruments such as a virtual piano, virtual DJ desks, etc.

• The VTS is especially suited to applications in which physical contact is not desired (as

clicks and drags can happen at a certain distance from the screen). These types of

applications can be divided into two scenarios:

• In the first scenario, it is essential to avoid contamination of the user's hand. For

example, the VTS could provide a sterile interface for use in hospitals, operating

theatres, or clean rooms.

• In the second scenario, the interfaces are subjected to extreme wear, for example:

heavy use in public places, hazardous environments with high humidity, dust, or

even underwater. Mechanical interfaces could stop working under these

conditions. A VTS could keep operating as long as the user's hand can be tracked.

If the screen where the VTS is projected is transparent (such as the DNP HoloScreen (DNP,

2004)) a number of additional applications are possible:


• Video conference with gaze awareness and direct manipulation of virtual objects. The

concept is similar to that of ClearBoard (Ishii and Kobayashi, 1992). This is a video

conferencing application where the camera is situated behind the screen, in front of

which the user stands. When the user stands in front of the screen's centre, the camera

points directly to the user's eyes. This makes not only direct eye contact possible but also

awareness of the other collaborator's gaze. In ClearBoard the user needs a stylus in order

to draw or manipulate virtual objects. In a VTS the user could use both of their hands.

• Spatial displays. As the VTS in this category is transparent, it is possible to see the scene

behind it. This can allow pointing out objects in the field of view, as in DigiScope

(Frescha and Keller, 2003), or sending areas in the field of view to a recognition engine,

as in HandVu (Kölsch, 2004). In combination with direct manipulation it would be

possible to place virtual objects on the scene, as seen through the VTS. This could have

applications in surveillance, modelling, prototyping, maintenance, and training.

• Use in museums, information points and shop displays. Information relevant to the user

could be projected on the shop display, and users could interact directly with this

information using their hands (clicking and dragging items). They could browse articles,

request information, or even try on virtual clothes in a virtual mirror.

• 'Minority report' interfaces. As previously suggested in a similar system, TouchLight

(Wilson, 2004), this type of interface allows filmmakers to cleanly put the interaction

system and the actor's face in the same shot.

If the VTS uses a see-through HMD in order to visualize the VTS contents, a number of

extra applications are possible (note that most of the projector based VTS applications can

also be effectively implemented using a HMD based VTS):

• A cheap and flexible alternative to handheld keypads, controls, or pointing devices in

HMD based AR environments. It would be possible to implement it using cheap USB

cameras. This alternative has the extra advantage that the user does not need to have their

hand occupied by interface devices, and consequently they could have their hands free to

touch or grab any physical object.


• Direct manipulation of virtual objects in the field of view. The VTS does not need to be

2D when using a see-through HMD (the 2D operation is an artificial constraint). Virtual

objects could be relocated, resized, and reoriented not only on a 2D virtual surface, but

also above and under it. Occlusion of the virtual objects by either the user's hand or other

virtual objects could give a perception of depth.

• As the user could see the VTS superimposed in the real world, it is possible to point out

an object in the field of view, and select areas in the field of view for processing or

recognition, as in the case of HandVu (Kölsch, 2004). In combination with direct

manipulation it would be possible to place virtual objects on the scene, as seen through

the VTS. This could have applications in surveillance, hybrid modelling, prototyping,

maintenance, and training.

• Interface for mobile computing. A VTS could become a flexible alternative interface for

PDAs, or other mobile computing devices.

• The VTS would have a large field of application in the video game industry. For

example, two users wearing video see-through HMDs could see each other and pass a virtual

ball to each other, as in a sort of virtual tennis. Also, taking into account that input from

various fingers and also from both hands is possible, it would be possible to reproduce

for leisure a number of musical instruments such as pianos, DJ decks, etc.

• The VTS is especially suited to applications in which physical contact is not desired (as

clicks and drags happen in the air). These types of applications can be divided into two

scenarios:

• In the first scenario, it is essential to avoid contamination of the user's hand. For

example the VTS could provide a sterile interface for use in hospitals, operating

theatres, or clean rooms.

• In the second scenario, the interfaces are subjected to extreme wear, for example:

heavy use in public places, hazardous environments with high humidity, dust, or

even underwater. Mechanical interfaces could stop working under these

conditions. A VTS could keep operating as long as the user's hand can be tracked.

In the applications involving a see-through HMD, the VTS is HMD stabilised. That is,

wherever the user looks, the VTS interface elements will be in the field of view. The VTS

could be made world stabilized by using fiducial markers such as the ones provided by

ARToolkit (ARToolkit, 2006) or ARTag (Fiala, 2004). For example, a HMD could house a


camera in order to implement video see-through, track the user's hand, and recognize fiducial

markers. When the system recognizes a certain fiducial marker this could be used in order to

visualise a VTS interface in a location and orientation relative to the marker. The possibility

of making the VTS world stabilised opens up a set of new applications:

• Distributed VTS interfaces which could be activated when the user looks at them. The

idea is that a user wearing a video see-through HMD could move inside a building or

other area, and each time the camera captures a fiducial marker a VTS interface can

appear in that place. That VTS interface could enable the user to perform some task

relevant to that position, for example operating a nearby machine or gaining access to a

nearby door.

• The use of multiple fiducial markers arranged on a wall could make it possible to

implement a large VTS whose size is bigger than the video see-through field of view.

This arrangement could enable large continuous desktop surfaces. For example a large

window manager desktop could be directly displayed, and be active for interaction, over

a whole wall – or even over the four walls, ceiling and floor of the room. The only

requirement of having to arrange a number of fiducial markers over the wall would make

this large desktop approach much cheaper than any other technology.

• Interactive textbooks. These textbooks contain fiducial markers in their pages. The

markers could be recognized by the system and the desired information rendered on

top of the page (as if it were printed on the page). The page could become a VTS surface.

This would allow the user to click and drag on the page in order to trigger new contents.

The concept is similar to MagicBook (Billinghurst et al., 2001) or EnhancedDesk (Koike

et al., 2001).

8.4 Conclusions

This chapter has described the concept of a Virtual Touch Screen (VTS) interface. A VTS is

an interface analogous to a touch screen. In a VTS the sensing technology is based on visual

tracking of the user's hand, and the display technology is based on either a projector and a

screen, or a HMD. A number of possible VTS configurations, the proposed operation of a

VTS, and usability factors have also been described and discussed in the first part of this

chapter. The second part of this chapter presented the current implementations of the VTS

interface. Three generations of VTS implementations have been developed in this thesis. The


first generation was intended to demonstrate the VTS concept, but its hand tracking was too

restricted and only worked against a black background. These constraints reduce the usability

of this first generation VTS. The research presented in chapters 3 to 7 of this thesis resulted

in an articulated hand contour tracker that is specifically designed for use in the VTS. This hand

tracker is used in the second generation VTS, and with minor modifications, in the third

generation VTS. The second generation VTS is a projector based VTS and the hand tracking

is from the front of the hand (palm view). The third generation VTS is a HMD based VTS,

and the hand tracking is from the back of the hand. The operation of the second and third

generation VTS was demonstrated through a number of experiments which involved

operating buttons, keypads, slider bars, and spinning wheels. Finally, a drawing application

was developed for the third generation VTS. This drawing application allows the user to

draw on the VTS by using clicks and drags of a finger.

The potential applications for the VTS range from an alternative to touch screens, handheld

keypads, controls, and pointers, to spatial displays, collaboration environments, and

the entertainment industry (see Section 8.3 for a detailed list of potential applications). The VTS

interface also has potential applications in environments where physical contact is not

desired. This includes sterile environments where contamination of the user's hand needs to

be prevented (hospitals, or clean rooms), and environments where extreme wear would make

touch screens, keyboards, and mice unfeasible (heavy use, high humidity or dust, or

underwater).

The hand tracking used in the second and third generation VTS follows the initialisation

sequence described in Section 8.2.2. During this initialisation sequence, the skin colour

model used in the tracking is tuned to every new user that operates the VTS, but the hand

contour template remains unchanged. This may result in an incorrect fitting of the hand

contour template to different users. Thus, a subject of future work is to create a mechanism

that can adapt the hand contour shape to that of the current user's hand. Isard and

MacCormick had some success in making a hand contour tracker robust to different users.

They used two trackers that operated simultaneously; one tracked the rigid movement of a

hand, and the other tracked changes in the shape of the contour. The shape of the contour

was only allowed to change within a space of deformations calculated from examples and

reduced with PCA (Isard, 1998). This technique is a possible candidate in order to make the

VTS fully multi-user. However, the hand contour they track is not articulated and is simpler


than the one used for the VTS. Their hand contour consists of an index finger, thumb and

hand, whereas the hand contour used in the VTS is more complex and is articulated (14

DOF). This raises doubts about the success of this method on the hand contour used for the

VTS. An alternative method could involve using another tracker, or mechanism such as

snakes (Kass et al. 1987), in order to find the shape of the user's hand during initialisation,

and then create an articulated hand contour template from that shape.
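The shape-space idea mentioned above can be sketched as follows, assuming each example contour is stored as a flattened vector of control point coordinates. The function names `learn_shape_space` and `project_to_shape_space`, and the data layout, are illustrative assumptions and not the implementation of Isard (1998). PCA is computed here with a plain NumPy SVD.

```python
import numpy as np

def learn_shape_space(example_contours, n_modes):
    """Learn a low-dimensional deformation space from example contours.

    example_contours: (n_examples, 2 * n_points) array-like of flattened
    control point coordinates (an assumed layout).
    Returns the mean contour, the first n_modes principal deformation
    modes (one per row), and the variance captured by each mode.
    """
    X = np.asarray(example_contours, dtype=float)
    mean = X.mean(axis=0)
    # SVD of the centred data: the rows of vt are the principal modes.
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    modes = vt[:n_modes]
    variances = (s[:n_modes] ** 2) / (len(X) - 1)
    return mean, modes, variances

def project_to_shape_space(contour, mean, modes):
    """Return the closest contour that lies inside the learned space."""
    weights = modes @ (np.asarray(contour, dtype=float) - mean)
    return mean + modes.T @ weights
```

Restricting a tracked contour to this space amounts to calling `project_to_shape_space` on it, which discards any deformation not seen in the examples. Whether a space of this kind would remain adequate for the 14 DOF articulated contour is the open question raised in the text.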

Questions about the usability (ease of use) of the VTS interface arose and were discussed in

Section 8.1.3. The flexibility of the VTS interface makes it possible to deploy it in a large

range of configurations and orientations (the only real restriction is that the field of view

between the camera and the user's hand must be clear). This allows the VTS to be deployed

in such a way that allows the VTS operation within the user's comfort area (as defined by

Kölsch (2004)), and within the wrist angle range suggested by Bach et al. (1997) and

Wellman et al. (2004). The only unavoidable shortcoming of the VTS interface is the lack of

haptic feedback. This problem can be alleviated by maximizing audio and visual feedback.

Finally, as regards the usage scope of a VTS, it must be noted that, as in the case of other

interactive surfaces, such as touch screens, VTS interfaces are better suited to information

systems with limited data entry.

9 Closing Discussion

This thesis has developed a specific visual articulated hand tracking system which enables

the creation of a novel vision-based interactive surface, referred to as the Virtual Touch

Screen (VTS). The thesis further develops and improves existing contour tracking (Blake

and Isard, 1998) and partition sampling (MacCormick and Isard, 2000) algorithms for the

creation of a robust hand tracker for the VTS. Although the existing tracking algorithms have

the potential to track complex objects in real-time against cluttered backgrounds, they do not

satisfy the demanding hand tracking requirements of the VTS. As a result a number of novel

techniques for articulated hand contour tracking have been developed and presented. The

final visual articulated hand tracking system is used for the creation of the VTS interface.

9.1 Summary

This thesis has developed a visual articulated hand tracking system capable of satisfying the

hand tracking requirements of the VTS interface. The development of this tracking system

has been possible thanks to the combination of a number of other developments:

• A novel technique, referred to as particle interpolation, which makes it possible to

improve the efficiency of particle propagation between time-steps in tracking tree-

structured articulated objects using particle filters and partition sampling.

• Development of a novel skin colour classifier, referred to as the Linear Container (LC)

classifier and testing of the classifier under various conditions for use in hand tracking

for HCI. The classifier is robust to illumination (brightness) changes, requires less

storage, and is significantly faster than existing classifiers.

• A novel measurement function based on skin colour that is both faster and more reliable

than existing edge based measurement functions provided no other skin colour objects

appear in the background.

• A novel skin colour based importance sampling, referred to as Skin Colour Guided

Sampling (SCGS), that allows the position, scale, and angle of the hand contour to be

estimated from low-level information, for users wearing either long or short sleeves.

• A novel contour fitting method for articulated contour trackers that improves tracking

agility and reduces jitter on the tracking output.

• A novel method for particle filter based contour trackers, referred to as Variable Process

Noise (VPN), which varies the size of the contour's search region in order to cope with

brisk target movements.

The final visual articulated hand tracking system has been used to create the VTS interface.

However, the tracking system is just a part of the VTS, and the full VTS implementation has

required the development of other techniques dealing with interface initialisation, touch

detection, and occlusion of interface elements by the user's hand in order to simulate a depth

feeling.

The capabilities of the VTS interface have been demonstrated through a number of

experiments. These experiments have involved the operation of various interface elements,

such as keypads, slider bars, control wheels, and buttons, against plain and cluttered

backgrounds. Finally, the capability of the VTS interface to complete a task has been

demonstrated through a hand drawing application.

The original goal of the project has been achieved. This is the implementation of a visual

articulated hand tracking system that can enable the creation of the VTS. However, questions

about usability of the VTS interface arise. The first question involves the posture of a user

operating a VTS. The VTS may be deployed in a position that forces the user to pose

uncomfortably. However, this same problem can occur with a keyboard, a touch screen, or any


other hardware interface that is not properly deployed. The solution to this problem is to

deploy the VTS in such a way that allows the VTS operation within the user's comfort area

(as defined by Kölsch (2004)), and within the wrist angle range suggested by Bach et al.

(1997) and Wellman et al. (2004). This is easy to achieve in a projector based VTS, as the

location of screen and camera force the user to operate it holding their hand in a given way.

However, this is not so easy to achieve in a HMD based VTS where the contents are always

in the HMD's field of view. In this case it is up to the user to operate the VTS in a

comfortable position. Another question about the VTS usability is the quality of the visual

articulated hand tracking system. This quality could always be better, and indeed the current

visual hand tracking system can be easily fooled if desired. Operation of the VTS within the

visual articulated hand tracking system capabilities is a skill that the current VTS user has to

acquire with practice. Further improvements to the tracking quality are part of the future

work of this thesis.

The only unavoidable shortcoming of the VTS interface, even with an ideal hand tracking, is

the lack of haptic feedback. This problem could be alleviated to some extent by maximizing

audio and visual feedback.

Finally, it is worth remembering the usage scope of a VTS. As in the case of other interactive

surfaces, such as touch screens, VTS interfaces are better suited to information systems with

limited data entry.

9.2 Future work

The first part of the future work is related to improving the hand tracking system and the

VTS interface in order to reach a commercial quality system. The second part is concerned

with the extension of the VTS interface to support new applications.

The visual articulated hand tracking system developed in this thesis is fast, accurate, and

robust enough to enable the creation of the VTS. However, errors in the hand tracking may

occur, and these errors could result in incorrect interface actions. This demands a very high

quality tracking that can guarantee correct tracking in a broad range of conditions. The first

aspect of the current articulated hand tracking system liable to improvement is the skin

colour detection. The LC skin colour classifier is most effective when the skin colour


dispersion of the user's hand is small. When the skin colour dispersion is high (for example

one side of the hand is dark while the other side is bright) the performance of the LC

classifier decreases. This could be overcome by dynamic skin colour modelling together with

multiple LC classifiers. Each LC classifier could deal with a fraction of the hand template.

Then, while tracking, each one of the LC classifiers could get updated to the particular skin

colour in that area of the hand template. In this way, the skin colour dispersion supported by

the hand tracker could be higher, which in turn would improve the tracking performance.

Another aspect for improvement in the current articulated hand tracker system is its multi-

user capability. At the moment, a single deformable template is used for hand tracking. The

template can transform within a space of Euclidean similarities in order to adapt to different

hands. However, if the hand shape is different there will always be matching errors with only

one template. One possibility to solve this problem could be to create a template specific for

each user. The template could be created during the initialisation step. The user would place

their hand over the hand contour and then using a mechanism such as snakes (Kass et al.

1987), the shape of the hand initialising the tracker could be found. From that shape a new

template specific to that user could be generated, and used later for the tracking. As the

template used in tracking is specific to the current user, the match should be more accurate.

This in turn improves the tracking performance. Once the articulated hand tracking system is

prepared to work with multiple users, aspects such as the usability of the VTS interface could

be evaluated for a representative group of users.

The VTS interaction experience is closely linked to the quality of the visual hand tracking

system employed. However, there are some aspects of the VTS that could be improved

independently of the hand tracking system in order to make the VTS interaction easier, and

to reduce learning time. One of these aspects is the click and drag detection mechanisms.

Another aspect is to provide mechanisms that reduce the impact of the lack of haptic

feedback, for example, maximizing audio and visual feedback, or imposing simulated

surface constraints. Lindeman et al. (2001) reported that the imposition of simulated surface

constraints (such as clamping) can compensate to some extent for the decrease in performance

produced by the lack of haptic feedback in virtual work surfaces. The notion of clamping

could easily be implemented in a VTS by drawing the hand contour with a fixed minimum

scale, each time the user stretches their hand beyond the virtual surface (no occlusion of the


user's hand by the interface). During this state (scale of the hand clamped to a minimum) the

VTS could still be operated.
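A minimal sketch of the clamping idea, under two assumptions that are not stated in the text: the tracked contour scale decreases as the hand moves away from the camera, and the scale at which the hand reaches the virtual surface is known. The names `clamp_contour_scale` and `surface_scale` are hypothetical.

```python
def clamp_contour_scale(tracked_scale, surface_scale):
    """Return the scale at which to draw the hand contour and a clamped flag.

    tracked_scale: contour scale reported by the hand tracker.
    surface_scale: assumed scale of the contour when the hand is exactly
                   on the virtual surface (the fixed minimum).
    """
    if tracked_scale < surface_scale:
        # The hand has gone beyond the virtual surface: draw it as if it
        # were resting on the surface, and report the clamped state.
        return surface_scale, True
    return tracked_scale, False
```

The clamped flag could be used to keep the interface operable while the drawn contour is held at the minimum scale, as the text suggests.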

A second part of the proposed future work is concerned with the extension of the VTS

interface to support new applications. One interesting extension would be to make the HMD

based VTS world stabilised. In the applications involving a see-through HMD, the VTS

is HMD stabilised. That is, wherever the user looks, the VTS interface elements will be in

the field of view. The VTS could be made world stabilised by using fiducial markers such as

the ones provided by ARToolkit (ARToolkit, 2006) or ARTag (Fiala, 2004). For example a

HMD could house a camera in order to implement video see-through, track the user's hand,

and recognize fiducial markers. When the system recognizes a certain fiducial marker this

could be used in order to visualise a VTS interface in a location and orientation relative to

the marker. The possibility of making the VTS world stabilised opens up a set of new

applications (these are detailed at the end of Section 8.3):

• Distributed VTS interfaces which could be activated when the user looks at them.

• The use of multiple fiducial markers arranged on a wall could make it possible to

implement a large VTS whose size is bigger than the video see-through field of view.

• Interactive textbooks.
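Placing a VTS at a location and orientation relative to a recognised marker, as proposed above, comes down to composing two rigid transforms. The sketch below assumes the marker tracker supplies the marker pose in camera coordinates as a 4x4 homogeneous matrix. It does not use the ARToolkit or ARTag APIs, and all function names are hypothetical.

```python
import numpy as np

def translation(tx, ty, tz):
    """4x4 homogeneous translation matrix."""
    t = np.eye(4)
    t[:3, 3] = [tx, ty, tz]
    return t

def rotation_z(theta):
    """4x4 homogeneous rotation about the z axis (theta in radians)."""
    c, s = np.cos(theta), np.sin(theta)
    r = np.eye(4)
    r[:2, :2] = [[c, -s], [s, c]]
    return r

def vts_pose_in_camera(camera_from_marker, marker_from_vts):
    """Pose of the VTS in camera coordinates.

    camera_from_marker: marker pose reported by the marker tracker (assumed input).
    marker_from_vts:    fixed offset of the VTS relative to the marker.
    """
    return camera_from_marker @ marker_from_vts
```

Because `marker_from_vts` is fixed, the VTS stays attached to the marker as the head moves, which is what makes the interface world stabilised rather than HMD stabilised.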

One final piece of future work is the creation of a library that could give easy access to the

functions involved in a VTS, from hand tracking, VTS initialisation, click and drag detection

mechanisms, to the inclusion of new interface elements. This library would enable

developers of AR projects to easily incorporate and customise a VTS interface into their

systems.

Appendix A

A.1 Reverse kinematics of a chain of three links

The three segments that form a finger can be represented as a kinematic chain of three links.

The position of the last link's end (fingertip) can be calculated from the angles of the three

joints, and the lengths of the three links. The process to calculate this is called direct

kinematics (or simply kinematics). The reverse process (reverse kinematics) returns the

angles of the three joints given the lengths of the three links and the position of the last link's

end. The problem can be represented as in Figure A.1.

Figure A.1: Kinematic chain representing a finger.

In Figure A.1, the lengths of the finger segments are indicated as: $L_{PP}$ for the length of the

Proximal phalanx segment; $L_{MP}$ for the length of the Middle phalanx segment; and $L_{DP}$ for

the length of the Distal phalanx segment. The angles of the joints are indicated as: $\alpha_{PP}$ for the

Metacarpophalangeal joint angle; $\alpha_{MP}$ for the Proximal interphalangeal joint angle; and $\alpha_{DP}$

for the Distal interphalangeal joint angle. When the finger is fully extended the angles of the

joints are all zero, the value of y equals the length of the finger, and the value of x is zero. As

the finger flexes the value of y decreases, and the value of x increases. For touch detection

purposes, the 2D projected length of the finger is indicated as the distance B in Figure A.1

(this is a value measured from the hand configuration), and the wanted value is x (the

separation between the fingertip and the palm plane).

The direct kinematics for the chain of three links of Figure A.1 is:

$$y = L_{PP}\cos(\alpha_{PP}) + L_{MP}\cos(\alpha_{PP} + \alpha_{MP}) + L_{DP}\cos(\alpha_{PP} + \alpha_{MP} + \alpha_{DP}) \qquad \text{(A.1.1)}$$

In order to find the reverse kinematics from the direct kinematics two constraints are used:

• $\alpha_{DP} = \tfrac{2}{3}\alpha_{MP}$, this is an anatomic movement constraint due to the fact that the distal

interphalangeal and proximal interphalangeal joints share the same tendon.

• $\alpha_{MP} = 2\alpha_{PP}$, this is an artificial constraint. It is used to simplify the reverse kinematics

considering only a typified finger flexion. This finger flexion relates to a typified

trajectory of a finger when typing on a keyboard.

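Substituting the constraints in Equation A.1.1 yields:

$$y = L_{PP}\cos(\alpha_{PP}) + L_{MP}\cos(3\alpha_{PP}) + L_{DP}\cos(\tfrac{13}{3}\alpha_{PP}) \qquad \text{(A.1.2)}$$

$$x = L_{PP}\sin(\alpha_{PP}) + L_{MP}\sin(3\alpha_{PP}) + L_{DP}\sin(\tfrac{13}{3}\alpha_{PP}) \qquad \text{(A.1.3)}$$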
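As a check on the arguments of the trigonometric terms, the substitution can be written out step by step using only the two constraints, $\alpha_{MP} = 2\alpha_{PP}$ and $\alpha_{DP} = \tfrac{2}{3}\alpha_{MP}$:

```latex
\begin{align*}
\alpha_{DP} &= \tfrac{2}{3}\alpha_{MP} = \tfrac{2}{3}\,(2\alpha_{PP}) = \tfrac{4}{3}\alpha_{PP} \\
\alpha_{PP} + \alpha_{MP} &= \alpha_{PP} + 2\alpha_{PP} = 3\alpha_{PP} \\
\alpha_{PP} + \alpha_{MP} + \alpha_{DP} &= 3\alpha_{PP} + \tfrac{4}{3}\alpha_{PP} = \tfrac{13}{3}\alpha_{PP}
\end{align*}
```

The whole chain is therefore described by the single unknown $\alpha_{PP}$.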

The lengths of the finger segments are assumed to be known. In order to find the separation between

the fingertip and the palm plane (x) from the distance B in Figure A.1, equations A.1.2 and

A.1.3 need to be solved. However, a closed-form expression that gives $\alpha_{PP}$ for a given y is not easy

to find, as there is no expression that gives $\alpha_{PP}$ from $L_{DP}\cos(\tfrac{13}{3}\alpha_{PP})$. Three alternative

solutions are suggested:

• Use a polynomial approximation of the cos function and substitute it in Equation A.1.2.

Then $\alpha_{PP}$ can be isolated and used in Equation A.1.3 in order to find x.

• Use an iterative approach. The value of $\alpha_{PP}$ can be changed in small steps, evaluating

Equation A.1.2 at each step. When the resulting y is near enough to the measured B the

iteration stops and the current $\alpha_{PP}$ is used in Equation A.1.3.

• Calculate a number of combinations of (B, x) before the tracking starts (for example

during the tracking initialisation), and store them in a lookup table. These combinations can

be calculated for a range of valid $\alpha_{PP}$ values by applying Equation A.1.2 and Equation

A.1.3.
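The iterative and lookup table alternatives can be sketched as below. This is an illustration under stated assumptions, not the thesis implementation: the function names, the step size, the tolerance, and the angle range of 0 to 0.4 rad are all hypothetical choices. The range is chosen so that $\tfrac{13}{3}\alpha_{PP}$ stays below $\pi$, which keeps y decreasing as the finger flexes.

```python
import math

def fingertip_position(alpha_pp, l_pp, l_mp, l_dp):
    """Evaluate Equations A.1.3 and A.1.2; returns (x, y) for alpha_PP in radians."""
    x = (l_pp * math.sin(alpha_pp) + l_mp * math.sin(3.0 * alpha_pp)
         + l_dp * math.sin(13.0 / 3.0 * alpha_pp))
    y = (l_pp * math.cos(alpha_pp) + l_mp * math.cos(3.0 * alpha_pp)
         + l_dp * math.cos(13.0 / 3.0 * alpha_pp))
    return x, y

def solve_iteratively(b, l_pp, l_mp, l_dp, step=1e-3, tol=0.1, alpha_max=0.4):
    """Increase alpha_PP in small steps until y is near enough to the measured B.

    Over the assumed range [0, alpha_max] y only decreases, so the first angle
    at which y has dropped to within tol of B is accepted. Returns
    (alpha_PP, x), or None if B is not reached inside the range.
    """
    n_steps = int(round(alpha_max / step))
    for i in range(n_steps + 1):
        alpha = i * step
        x, y = fingertip_position(alpha, l_pp, l_mp, l_dp)
        if y <= b + tol:
            return alpha, x
    return None

def build_lookup_table(l_pp, l_mp, l_dp, n_entries=401, alpha_max=0.4):
    """Precompute (B, x) pairs for evenly spaced alpha_PP values."""
    table = []
    for i in range(n_entries):
        alpha = alpha_max * i / (n_entries - 1)
        x, y = fingertip_position(alpha, l_pp, l_mp, l_dp)
        table.append((y, x))
    return table

def lookup_x(table, b):
    """Return x from the table entry whose B is closest to the measured value."""
    return min(table, key=lambda entry: abs(entry[0] - b))[1]
```

The table would be built once during tracking initialisation and queried with `lookup_x` for each frame, trading a little memory for the per-frame cost of the iteration. The segment lengths passed in (for example 45, 25 and 20 mm) are placeholders and would need to be measured for the user's finger.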

Appendix B

This appendix comprises a number of video sequences which illustrate the results of various

experiments throughout the thesis. The video sequences are available in the attached CD and

in the supporting webpage at: http://www.cs.nott.ac.uk/~mtb/thesis

The video sequences contained in the CD and in the supporting webpage are listed next:

Chapter 4

video_sequence_4.1.avi Test video sequence.

ground_truth.txt Ground truth for the test video sequence.

video_sequence_4.2.avi Particle-set implementation; tracking output on test video

sequence frames 30-174.

video_sequence_4.3.avi Particle-set implementation; tracking output on test video

sequence frames 360-890.

video_sequence_4.4.avi Sweep implementation; tracking output on test video sequence

frames 30-174.

video_sequence_4.5.avi Sweep implementation; tracking output on test video sequence

frames 360-890.

Chapter 5

video_sequence_5.1.avi Video illustrating the tuning operation of the LC skin colour

classifier

video_sequence_5.2.avi Mediterranean subject video sequence, including the output of

the RGB histogram classifier, the rg histogram classifier and

the LC classifier.

video_sequence_5.3.avi White Caucasian subject video sequence, including the output

of the RGB histogram classifier, the rg histogram classifier and

the LC classifier.

video_sequence_5.4.avi Black African subject video sequence, including the output of

the RGB histogram classifier, the rg histogram classifier and

the LC classifier.

video_sequence_5.5.avi Chinese subject video sequence, including the output of the

RGB histogram classifier, the rg histogram classifier and the

LC classifier.

video_sequence_5.6.avi Example of a LC classifier initialisation in a HCI system.

Office illumination during the day.

video_sequence_5.7.avi Example of a LC classifier initialisation in a HCI system.

Office illumination during the night.

video_sequence_5.8.avi Dynamic tuning vs. Static tuning of the LC classifier.

Chapter 7

video_sequence_7.1.avi Tracking output using template fitting method 1.

video_sequence_7.2.avi Tracking output using template fitting method 2.

video_sequence_7.3.avi Tracking output using the combination of template fitting

method 1 and 2.

video_sequence_7.4.avi Tracking output of the test video sequence using the

combination of method 1 and method 2. Frames 30-174.

video_sequence_7.5.avi Tracking output of the test video sequence using the

combination of method 1 and method 2. Frames 360-890.

video_sequence_7.6.avi Tracking output of the test video sequence using variable

process noise. Frames 0-890.

video_sequence_7.7.avi Skin colour guided sampling reinitialisation test showing the

tracking output and the skin colour blobs.

video_sequence_7.8.avi Skin colour guided sampling reinitialisation test showing the

tracking output, the skin colour blobs, the initialisation

particles and the importance particles.

video_sequence_7.9.avi Skin colour guided sampling robustness test on the test video

sequence. Frames 0-890.

video_sequence_7.10.avi Tracking output using template fitting methods 1&2, variable

process noise, and skin colour guided sampling.

Chapter 8

video_sequence_8.1.avi First generation VTS. The video sequence shows a VTS user

typing a telephone number on a virtual keypad. Black

background.

video_sequence_8.2.avi A finger kinematic model makes it possible to find the finger's

joint angles. Black background.

video_sequence_8.3.avi Second generation VTS usage demo. The video sequence

shows the VTS user typing on a keypad and using slider bars.

video_sequence_8.4.avi Hand tracking from the back of the hand (Camera mounted on

a HMD).

video_sequence_8.5.avi Third generation VTS, first experiment. VTS operation against

a plain background.

video_sequence_8.6.avi Third generation VTS, second experiment. VTS operation

against a complex background.

video_sequence_8.7.avi Third generation VTS, third experiment. VTS operation with

interfaces of various sizes. Thumb gesture can control the size

of the interfaces.

video_sequence_8.8.avi Third generation VTS. Drawing application demo.

video_sequence_8.9.avi Third generation VTS. Multiple points of input demo for the

drawing application.

Bibliography

Abe, K., Saito, H. and Ozawa, S. (2000). 3-D Drawing System via Hand Motion Recognition

from Two Cameras. In Proceeding of the 6th Korea-Japan Joint Workshop on Computer

Vision, pp. 138-143.

Ahlberg, J. (1999). A system for face localization and facial feature extraction. Tech. Rep.

LiTH-ISY-R-2172, Linkoping University.

Assan, M. and Grobel, K. (1997). Video Based Sign language Recognition using Hidden

Markov Models. Gesture and Sign Language in Human-Computer Interaction, Intl. In Proc.

of Gesture Workshop, vol. 1371 of Lecture Notes in Computer Science, pp. 97-110.

ARToolkit. (2006). http://www.hitl.washington.edu/artoolkit

Bach, J., Honan, M., and Rempel, D. (1997). Carpal tunnel pressure while typing with the

wrist at different postures. In Proceedings of the Marconi Research Conference (San

Francisco: University of California, San Francisco and Center for Ergonomics), Paper 17.

Billinghurst, M., Kato, H., Poupyrev, I. (2001). The MagicBook: A Transitional AR

Interface. Computers and Graphics, pp. 745-753.

Blake, A. and Isard, M. (1998). Active contours. Springer.

Bowden, R. (1999). Learning Non-linear Models of Shape and Motion. PhD thesis, Brunel

University.

Bowden, R., Heap, A., and Hart, C. (1996). Virtual Datagloves: Interacting with Virtual

Environments Through Computer Vision. In Proc. 3rd UK VR-Sig Conference, DeMontfort

University, Leicester, UK, July 1996.

Bradski, G. (1998). Computer Vision Face Tracking for Use in a Perceptual User Interface.

Intel Technology Journal, 2(2):12-21.

Brand, J. and Mason, J. (2000). A comparative assessment of three approaches to pixel level

human skin-detection. In Proc. of the ICPR, vol. 1, 1056-1059.

Branson, K. and Belongie, S. (2005). Tracking Multiple Mouse Contours (without Too Many

Samples). Proceedings of the IEEE Computer Society Conference on Computer Vision and

Pattern Recognition (CVPR'05), 1, 1039 - 1046.

Brown, D., Craw, I., and Lewthwaite, J. (2001). A SOM based approach to skin detection

with application in real time systems. In Proc. of the British Machine Vision Conference.

Chai, D., and Bouzerdoum, A. (2000). A Bayesian approach to skin color classification in

YCbCr color space. In Proc. of IEEE Region Ten Conference (TENCON’2000), vol. 2, 421-

424.

Chen, Q., Wu, H., and Yachida, M. (1995). Face detection by fuzzy pattern matching. In

Proc. of the ICCV, 591-597.

Cootes, T., Taylor, C., Cooper, D., and Graham, J. (1995). Active shape models - their

training and application. Computer Vision and Image Understanding, 61, 1, 38-59.

Crowley, J., Bérard, F. and Coutaz, J. (1995). Finger tracking as an input device for

augmented reality. In Proc. Workshop Automatic Face and Gesture Recognition, pp. 195-

200.

Cui, Y. and Weng, J. (1996). Hand sign recognition from intensity image sequences with

complex background. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp.

195-200.

Davis, J. and Shah, M. (1994). Visual gesture recognition. Vision, Image, and Signal

Processing, vol. 141, pp. 101-106.

Dietz, P. and Leigh, D. (2001). DiamondTouch: A Multi-User Touch Technology.

Proceedings of the 14th annual ACM Symposium on User Interface Software and

Technology (UIST), ISBN: 1-58113-438-X, pp. 219-226.

DirectShow. (2005). Official DirectShow SDK documentation from MSDN. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directshow/htm/directshow.asp

Drugan, M.M., and Thierens, D. (2004). Evolutionary Markov Chain Monte Carlo. In P.

Liardet (Ed.), Proceedings of the Sixth International Conference on Artificial Evolution - EA

2003 (pp. 63-76). Springer.

Du, H., Oggier, T., Lustenberger, F. and Charbon, E. (2005). A Virtual Keyboard Based on

True-3D Optical Ranging. British Machine Vision Conference, Vol. 1, pp. 220-229.

Fiala, M. (2004). Artag revision 1, a fiducial marker system using digital techniques. In

National Research Council Publication 47419/ERB-1117, November 2004.

Fleck, M., Forsyth, D. A., and Bregler, C. (1996). Finding naked people. In Proc. of the

ECCV, vol. 2, 592-602.

Frescha, A., and Keller, M. (2003). DigiScope: An Invisible Worlds Window. In Adjunct

Proceedings, UbiComp2003, 261-264.

Fukumoto, M. and Tonomura, Y. (1997). "Body Coupled FingeRing": Wireless Wearable

Keyboard. In ACM CHI '97, pp. 147-154.

Galanti, S., and Jung, A. (1997). Low-Discrepancy Sequences: Monte Carlo Simulation of

Option Prices. Journal of Derivatives, 63-83.

Gelb, A., editor (1974). Applied Optimal Estimation. MIT Press, Cambridge, MA.

Gleeson, M., Stanger, N., and Ferguson, E. (2004). Design strategies for GUI items with

touch screen based information systems: assessing the ability of a touch screen overlay as a

selection device. Discussion Paper 2004/02. Department of Information Science, University

of Otago, Dunedin, New Zealand. Available from http://www.business.otago.ac.nz/infosci/pubs/papers/papers/dp2004-02.pdf

Gomez, G. (2002). On selecting colour components for skin detection. In Proc. of the ICPR,

vol. 2, 961-964.

Gomez, G., and Morales, E. (2002). Automatic feature construction and a simple rule

induction algorithm for skin detection. In Proc. of the ICML Workshop on Machine Learning

in Computer Vision, 31-38.

Heap, T. and Samaria, F. (1995). Real-time hand tracking and gesture recognition using

smart snakes. In Proceedings of Interface to Real and Virtual Worlds, pp. 261-271.

Heap, T. and Hogg, D. (1996). Towards 3D hand tracking using a deformable model. In

Proc. IEEE Int. Conf. Automatic Face and Gesture Recognition, pp. 140-145.

Heap, T. and Hogg, D. (1998). Wormholes in shape space: Tracking through discontinuous

changes in shape. In Proc. 6th Int. Conf. on Computer Vision.

Heider, M. (1998). The Adaptive Effects Of Virtual Interfaces: Vestibulo-Ocular Reflex and

Simulator Sickness. PhD thesis, University of Washington.

DNP. (2004). HoloScreen. http://www.en.dnp.dk/data/675421/5948/HOLO_A4_092004.pdf

I-O Display Systems. (2006). I-glasses PC/SGVA. http://www.i-glassesstore.com/iglasses-pc-hr.html

Isard, M. (1998). Visual Motion Analysis by Probabilistic Propagation of Conditional

Density. PhD thesis, University of Oxford.

Isard, M. and Blake, A. (1998a). Condensation - conditional density propagation for visual

tracking. Int. J. Computer Vision, 28, 1, 5-28.

Isard, M. and Blake, A. (1998b). ICONDENSATION: unifying low-level and high-level

tracking in a stochastic framework. In Proc. European Conf. on Computer Vision, Freiburg,

Germany. vol. 1, 893-908.

Isard, M. and MacCormick, J. (2000). Hand tracking for vision-based drawing. Technical

report, Visual Dynamics Group, Dept. Eng. Science, University of Oxford. Available from http://www.robots.ox.ac.uk/~vdg

Ishii, H. and Kobayashi, M. (1992). ClearBoard: A Seamless Media for Shared Drawing and

Conversation with Eye-Contact. In Conference on Human Factor in Computing Systems

(CHI), 525-532.

Jones, M. J. and Rehg, J. M. (1999). Statistical color models with application to skin

detection. In Proc. of the CVPR ’99, vol. 1, 274-280.

Jorda, L., Perrone, M., Costeira, J., and Santos-Victor, J. (1999). Active face and feature

tracking. In Proc. of the 10th International Conference on Image Analysis and Processing,

572-577.

Kass, M., Witkin, A., and Terzopoulos, D. (1987). Snakes: Active contour models. In Proc.

1st Int. Conf. on Computer Vision, 259-268.

Kilian, J. (2001). Simple Image Analysis By Moments. OpenCV library documentation.

Kim, Y., Soh, B., and Lee, S. (2005). A New Wearable Input Device: SCURRY. In IEEE

Transactions on industrial electronics, vol. 52, no. 6, pp. 1490-1499.

Kitagawa, G. (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear state

space models. Journal of Computational and Graphical Statistics, 5, 1, 1-25.

Koike, H., Sato, Y., and Kobayashi, Y. (2001). Integrating paper and digital information on

EnhancedDesk. ACM TOCHI, 8 (4), 307-322.

Kölsch, M. (2004). Vision Based Hand Gesture Interfaces for Wearable Computing and

Virtual Environments. PhD thesis, University of California, Santa Barbara.

Kölsch, M. and Turk, M. (2005). Hand tracking with Flocks of Features. Computer Vision

and Pattern Recognition, CVPR, vol. 2, 20-25.

Kuch, J., and Huang, T. (1995). Vision-based hand modeling and tracking for virtual

teleconferencing and telecollaboration. In Proc. IEEE Int. Conf. Computer Vision, pp 666-

671.

Kurata, T., Okuma, T., Kourogi, M., and Sakaue, K. (2001). The Hand Mouse: GMM Hand-

color Classification and Mean Shift Tracking. In Second Intl. Workshop on Recognition,

Analysis and Tracking of Faces and Gestures in Realtime Systems, pp. 119-124.

Kurata, T., Kato, T., Kourogi, M., Keechul, J., and Endo, K. (2002). A functionally-

distributed hand tracking method for wearable visual interfaces and its applications. In IAPR

Workshop on Machine Vision Applications, pp. 84-89.

Kwok, N. M., Zhou, W., Dissanayake, G., and Fang, G. (2005). Evolutionary Particle Filter:

Re-sampling from the Genetic Algorithm Perspective. IEEE/RSJ International Conference

on IROS.

Lee, J. and Kunii, T. (1995). Model-based analysis of hand posture. IEEE Comput. Graph.

Appl., vol 15. no. 5, pp. 77-86.

Lee, J. Y., and Yoo, S. I. (2002). An elliptical boundary model for skin color detection. In

Proc. of the 2002 International Conference on Imaging Science, Systems, and Technology.

Lee, M. and Woo, W. (2003). ARKB: 3D vision-based Augmented Reality Keyboard.

International Conference on Artificial Reality and Telexistence (ICAT03), paper ISSN 1345-

1278, pp. 54-57.

Lin, J., Wu, Y., and Huang, T. (2000). Modelling the constraints of human hand motion.

Workshop on Human Motion, pp. 121-126.

Lindeman, R., Sibert, J. and Templeman, J. (2001). The Effect of 3D Widget Representation

and Simulated Surface Constraints on Interaction in Virtual Environments. In Proc. of IEEE

Virtual Reality 2001, pp. 141-148.

Lindemann, S. and LaValle, S. (2003). Incremental Low-Discrepancy Lattice Methods for

Motion Planning. ICRA, 2920-2927.

Logitech. (2006). http://www.logitech.com/

MacCormick, J. and Blake, A. (1999). A probabilistic exclusion principle for tracking

multiple objects. In Proc. 7th International Conf. Computer Vision, 572-578.

MacCormick, J. and Isard, M. (2000). Partitioned sampling, articulated objects, and

interface-quality hand tracking. In European Conf. Computer Vision.

MacCormick, J. (2000). Probabilistic modelling and stochastic algorithms for visual

localisation and tracking. PhD thesis, University of Oxford.

Matsushita, N. and Rekimoto, J. (1997). HoloWall: Designing a Finger, Hand, Body, and

Object Sensitive Wall. In Proc. of the ACM UIST'97 Symposium on User Interface Software

and Technology, pp. 209-210.

Malik, S. and Laszlo, J. (2004). Visual touchpad: a two-handed gestural input device. In

Proceedings of ICMI '04, pp. 289-296.

Menser, B., and Wien, M. (2000). Segmentation and tracking of facial regions in color image

sequences. In Proc. Visual Communications and Image Processing, SPIE, 731-740.

MicroOptical. (2005). SV-6 PC viewer. http://www.microopticalcorp.com/DOCS/sV6mobile_MK-0061A.pdf

Mirage Innovations. (2006). LightVu. http://www.mirageinnovations.com

Mollenhauer, M. (2004). Simulator Adaptation Syndrome Literature Review. Realtime

Technologies Technical Report.

Morokoff, W., and Caflisch, R. (1994). Quasi-Random Sequences and Their Discrepancies.

SIAM J. Sci. Comput. 15:6, 1251-1279.

Moro, B. (1995). The full Monte. Risk, 8(2), 53-57.

Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods.

SIAM, Philadelphia, PA.

NIST. (2006). EWMA Control Charts. Engineering Statistics Handbook. http://www.itl.nist.gov/div898/handbook/pmc/section3/pmc324.htm

Nölker, C. and Ritter, H. (1997). Detection of fingertips in human hand movement

sequences. In Proc. of the International Gesture Workshop on Gesture and Sign Language in

Human-Computer Interaction. pp. 209-218.

Nölker, C. and Ritter, H. (1999). GREFIT: Visual Recognition of Hand Postures. In Proc. of

the International Gesture Workshop, 61-72.

Oliver, N., Pentland, A., and Berard, F. (1997). Lafter: Lips and face real time tracker. In

Proc. Computer Vision and Pattern Recognition, 123-129.

OpenCV. (2006). http://sourceforge.net/projects/opencvlibrary

OTL. (1999). http://www.robots.ox.ac.uk/~vdg/Darling.html

Pantrigo, J. J., Sánchez, Á., Gianikellis, K., Montemayor, A. (2005). Combining Particle

Filter and Population-based Metaheuristics for Visual Articulated Motion Tracking.

Electronic Letters on Computer Vision and Image Analysis, (5), No. 3, 68-83.

Pavlovic, V., Sharma, R., and Huang, T. (1997). Visual Interpretation of Hand Gestures for

Human-Computer Interaction: A Review. IEEE Transactions on Pattern Analysis and

Machine Intelligence, vol. 19, No. 7, 677-695.

Peer, P., Kovac, J., and Solina, F. (2003). Human skin colour clustering for face detection. In

International Conference on Computer as a Tool, EUROCON, The IEEE region 8, vol 2,

144-148.

Pérez, P., Vermaak, J., and Blake, A. (2004). Data fusion for visual tracking with particles.

Proceedings of the IEEE, 92(3):495-513.

Pingali, G., Pinhanez, C., Levas, A., Kjeldsen, R., Podlaseck, M., Chen, H. and Sukaviriya,

N. (2003). Steerable Interfaces for Pervasive Computing Spaces. In Proceedings of the First

IEEE International Conference on Pervasive Computing and Communications, pp. 315-322.

Philomin, V., Duraiswami, R., and Davis, L. (2000). Quasi-Random Sampling for

Condensation. ECCV, (2), 134-149.

Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. (1996). Numerical

Recipes: The Art of Scientific Computing. 2nd Edition, Cambridge University Press.

Press, W. H., Flannery, B. P., and Teukolsky, S. A. (1992). Numerical Recipes in C: The Art

of Scientific Computing. Cambridge University Press.

Quek, F. (1995). Eyes in the interface. Image and Vision Computing, vol. 13(6), pp. 511-525.

Rehg, J. (1995). Visual Analysis of High DOF Articulated Objects with Application to Hand

Tracking. PhD thesis, Electrical and Computer Eng., Carnegie Mellon University.

Rehg, J., and Morris, D. (1997). Singularities in articulated object tracking with 2-d and 3-d

models. Tech. rep., Digital Equipment Corporation, Cambridge Research Lab.

Rehg, J., and Kanade, T. (1995). Model-based tracking of self-occluding articulated objects.

In Proc. IEEE Int. Conf. Computer Vision, pp. 460-475.

Rekimoto, J. and Matsushita, N. (1997). Perceptual surfaces: Towards a human and object

sensitive interactive display. In Workshop on Perceptual User Interfaces (PUI-97), pp. 30-

32.

Roeber, H., Bacus, J. and Tomasi, C. (2003). Typing in thin air: the Canesta projection

keyboard -- A new method of interaction with electronic devices. Proceedings of the

Conference on Human Factors in Computing Systems (CHI 2003), pp. 712-713.

Rosales, R. and Sclaroff, S. (2000). Inferring body pose without tracking body parts. In Proc.

IEEE Conf. Computer Vision and Pattern Recognition, vol II, pp. 721-727.

Schumeyer, R., and Barner, K. (1998). A color-based classifier for region identification in

video. In Proc. Visual Communications and Image Processing, SPIE, vol 3309, 189-200.

Sears, A. (1991). Improving Touchscreen Keyboards: Design Issues and a Comparison with

Other Devices. Interacting with Computers, vol. 3, 253-269.

Segen, J. and Kumar, S. (1998). GestureVR: Vision-Based 3D Hand Interface for Spatial

Interaction. In ACM Multimedia Conference, Bell Laboratories, pp. 455-464.

Senseboard. (2000). http://www.senseboard.com

Shimada, N., Shirai, Y., Kuno, Y., and Miura, J. (1998). Hand Gesture Estimation and Model

Refinement using Monocular Camera -- Ambiguity Limitation by Inequality Constraints. In

Proc. of The 3rd Int. Conf. on Automatic Face and Gesture Recognition, pp. 268-273.

Shimada, N., Kimura, K., and Shirai, Y. (2001). Real-time 3-D hand posture estimation

based on 2-D appearance retrieval using monocular camera. In Proc. Int. WS RATFG-RTS,

pp. 23-30.

Sigal, L., Sclaroff, S., and Athitsos, V. (2000). Estimation and prediction of evolving color

distributions for skin segmentation under varying illumination. In Proc. IEEE Conf. on

Computer Vision and Pattern Recognition, vol. 2, 152–159.

Skarbek, W., and Koschan, A. (1994). Colour image segmentation – a survey –. Tech. Rep.

Institute for Technical Informatics, Technical University of Berlin, October.

Soriano, M., Martinkauppi, B., Huovinen, S., and Laaksonen, M. (2000). Using the skin

locus to cope with changing illumination conditions in color-based face tracking. In Proc. of

the IEEE Nordic Signal Processing Symposium, pp. 383-386.

Stafford, Q. and Robinson, P. (1996). BrightBoard: A Video-Augmented Environment. In

Proc. of the CHI96, pp. 134-141.

Starner, T. and Pentland, A. (1995). Real-Time American Sign Language Recognition From

Video Using Hidden Markov Models. In International Symposium on Computer Vision, vol.

5B Systems and Applications, pp. 265-270.

Stefanov, N., Galata, A. and Hubbold, R. (2005). Real-time hand tracking with Variable-

length Markov Models of behaviour. In IEEE Int. Workshop on Vision for Human-Computer

Interaction (V4HCI), in conjunction with CVPR 2005.

Stenger, B., Mendonca, P., and Cipolla, R. (2001). Model-Based 3D Tracking of an

Articulated Hand. In CVPR, Volume II, pp. 310-315.

Stenger, B., Thayananthan, A., Torr, P., and Cipolla, R. (2006). Model-Based Hand Tracking

Using a Hierarchical Bayesian Filter. In PAMI, vol. 28, No. 9, pp. 1372-1384.

Stern, H., and Efros, B. (2002). Adaptive color space switching for face tracking in multi-

colored lighting environments. In Proc. of the International Conference on Automatic Face

and Gesture Recognition, 249-255.

Stoll, P. and Ohya, J. (1995). Application of HMM modelling to recognizing human gestures

in image sequences for a man-machine interface. In Proc. IEEE Int. Workshop on Robot and

Human Communication. pp. 129-134.

Sturman, D. J. (1992). Whole-Hand Input. PhD thesis, Media Arts and Science Laboratory,

Massachusetts Institute of Technology, Cambridge, MA, USA.

Terrillon, J. C., Shirazi, M. N., Fukamachi, H., and Akamatsu, S. (2000). Comparative

performance of different skin chrominance models and chrominance spaces for the automatic

detection of human faces in color images. In Proc. of the International Conference on Face

and Gesture Recognition, 54-61.

Tezuka, A. (1995). Uniform Random Numbers: Theory and Practice. Kluwer Academic

Publishers.

Tissainayagam, P. and Suter, D. (2002). Performance measures for assessing contour

trackers. Int. Journal of Image and Graphics, 2, 343-359.

Triesch, J. and Malsburg, C. (1996). Robust classification of hand postures against complex

background. In Proc. Int. Conf. Automatic Face and Gesture Recognition, pp. 170-175.

Uosaki, K., Kimura, Y., and Hatanaka, T. (2004). Evolution strategies based particle filters

for state and parameter estimation of nonlinear models. Congress of Evolutionary

Computation, 884-890.

Vezhnevets, V., Sazonov, V., and Andreeva, A. (2003). A Survey on Pixel-Based Skin Color

Detection Techniques. In Proc. Graphicon, 85-92.

Vogler, C. and Metaxas, D. (1998). ASL Recognition Based on a Coupling Between HMMs

and 3D Motion Analysis. In Proc. International Conference on Computer Vision. Mumbai,

India. pp. 363-369.

Welch, G., and Bishop, G. (2002). An introduction to the Kalman filter. Technical Report

95-041, University of North Carolina at Chapel Hill, Department of Computer Science.

Wellman, H., Davis, L., Punnett, L., and Dewey, R. (2004). Work-related carpal tunnel

syndrome (WR-CTS) in Massachusetts, 1992-1997: source of WR-CTS, outcomes, and

employer intervention practices. American Journal of Industrial Medicine, 45, 139-152.

Wilson, A. (2004). TouchLight: An Imaging Touch Screen and Display for Gesture-Based

Interaction. International Conference on Multimodal Interfaces.

Wilson, A. (2005). PlayAnywhere: A Compact Tabletop Computer Vision System. Proc.

UIST '05, ACM Press, pp. 83-92.

Wellner, P. (1993). Interacting with paper on the DigitalDesk. Communications of the ACM,

36(7), pp. 87-96.

Wren, C., Azarbayejani, A., Darrell, T., and Pentland, A. (1997). Pfinder: Real-time tracking

of the human body. In IEEE Transactions on Pattern Analysis and Machine Intelligence,

19(7), pp. 780-785.

Wu, Y. and Huang, T. (1999). Capturing articulated human hand motion: A divide-and-

conquer approach. In Proc. IEEE Int. Conf. Computer Vision, pp. 606-611.

Wu, Y. and Huang, T. (2000). View-independent recognition of hand postures. In Proc.

IEEE Int. Conf. Computer Vision and Pattern Recognition, vol. II, pp. 88-94.

Wu, Y. and Huang, T. (2001). A Co-inference Approach to Robust Visual Tracking. In Proc.

IEEE ICCV, Vol. II, 26-33.

Wu, Y., Lin, J., and Huang, T. (2001). Capturing natural hand articulation. In ICCV, volume

2, 426-432.

Yang, J., Xu, Y., and Chen, C. (1994). Gesture interface: Modelling and learning. In Proc.

IEEE Int. Conf. Robotics and Automation, vol. 2, pp. 1747-1752.

Yang, M. H., and Ahuja, N. (1999). Gaussian mixture model for human skin color and its

applications in image and video databases. In Proc. of the SPIE: Conf. On Storage and

Retrieval for Image and Video Databases, vol. 3656, 458-466.

Yang, M., and Ahuja, N. (1998). Detecting human faces in color images. In Proc. of ICIP,

vol. 1, 127-130.

Yuille, A. and Hallinan, P. (1992). Deformable templates. In Blake, A. and Yuille, A.,

editors, Active Vision, 20-38. MIT.

Zhang, Z., Wu, Y., Shan, Y. and Shafer, S. (2001). Visual panel: Virtual mouse, keyboard

and 3d controller with an ordinary piece of paper. In Workshop on Perceptive User

Interfaces. ACM Digital Library, ISBN 1-58113-448-7.

Zarit, B. D., Super, B. J. and Quek, F. K. H. (1999). Comparison of five color models in skin

pixel classification. In ICCV’99 Int’l Workshop on recognition, analysis and tracking of

faces and gestures in Real-Time systems, 58-63.

Zhou, H. and Huang, T. (2003). Tracking articulated hand motion with eigen-dynamics

analysis. In Proc. 9th Int. Conf. on Computer Vision, vol. 2, pp. 1102-1109.
