Visual Articulated Hand Tracking
for Interactive Surfaces
Martin Tosas
Thesis submitted to the University of Nottingham
for the degree of Doctor of Philosophy
December 2006
Abstract
As computer systems become increasingly embedded in our environment, the ability to
interact with them without special equipment is very attractive. Vision-based
Human Computer Interaction (HCI) has the potential to make this possible in a form that is
both easy and natural for people to use. However, creating robust algorithms for
vision-based HCI systems poses great technical challenges. One strategy to overcome these
challenges is to create vision algorithms that are more specific to a particular
application. This thesis develops visual articulated hand tracking algorithms for use in
interactive surfaces.
Using visual articulated hand tracking as the sensing technology for interactive
surfaces is attractive compared with other sensing technologies because it is flexible,
cheap, and avoids the need for cumbersome hardware. A small number of attempts at
developing vision-based interactive surfaces have been made, but there have been no
previous attempts at using articulated hand tracking as the sensing technology for
interactive surfaces. This is because visual articulated hand tracking is difficult to
achieve. Fortunately, the 2D nature of interactive surfaces means that full 3D articulated
hand tracking is not required. This makes it possible to develop a specific but robust
visual articulated hand tracker that is constrained to a particular viewpoint, in which
the hand is approximately parallel to the interactive surface.
This thesis develops a specific visual articulated hand tracking system which enables the
creation of a novel vision-based interactive surface, referred to as the Virtual Touch Screen
(VTS). The contents of a VTS are displayed using a projector or a head-mounted display.
The VTS is made touch-sensitive by using visual hand tracking. The VTS can potentially be
used as, for example, an alternative to touch screens, an interface to mobile computing
devices, an interactive surface for information points, shop displays, and video games, and
a sterile interface for use in hospitals and clean rooms.
List of publications
Published papers:
Tosas, M., Li, B. (2004). Virtual Touch Screen for Mixed Reality. Lecture Notes in
Computer Science, Proc. ECCV 2004 Workshop on HCI, pp. 48-59.
Tosas, M., Li, B., Mills, S. (2005). Switching Template Fitting Methods During Articulated
Object Tracking. In Proc. IEE Int. Conf. Visual Information Engineering, VIE 2005,
pp. 243-249.
Tosas, M., Li, B. (2007). Tracking Tree-Structured Articulated Objects Using Particle
Interpolation. Accepted for publication in the proceedings of CGIM2007.
Tosas, M., Li, B. (2007). Virtual Touch Screen: A Vision-Based Interactive Surface.
Accepted for publication in the proceedings of CGIM2007.
Tosas, M., Li, B., Mills, S. (2007). Fast Adaptable Skin Colour Detection in RGB Space.
Submitted to VISAPP 2007.
Acknowledgements
I would like to thank my first supervisor, Bai Li, for giving me the opportunity to start this
PhD, for having confidence in my abilities and letting me pursue my ideas about vision-based
interactive surfaces as the topic of this thesis, and for her invaluable help and
supervision during the course of this PhD. I would like to thank my second supervisor, Steven
Mills, for always being available and willing to give me help and advice, both on
technical subjects and on my English writing. Other people in the department whom I would
like to thank are William Armitage (who helped me to compile the OTL library) and Holger
Schnadelbach (who helped me build a wooden frame for the second generation VTS).
I would like to thank my family in Barcelona, who have always given me their support despite
my being far away. I would also like to thank several friends who were important to me and
whose friendship often helped me to remain sane before and during my PhD. Thanks to
Sara Cathie, Adam Cox, Pablo Nogueira, Dora Tothfalussy, and Franca Supran.
My time at the University of Nottingham has been an exciting one. I have not just completed
a PhD but have also learned and experienced many things that have enriched me. My
experience as a hall tutor in Sherwood Hall has enabled me to meet lots of people and to
experience university life from a different point of view. Often my dear tutor colleagues
made me feel as if I had a second family in the hall; thanks to you all. While in Nottingham I
have also become a salsa-dancing fan, which has helped a lot to keep my social life
active during the otherwise rather solitary PhD experience. Thanks to salsa!
Table of Contents
1 INTRODUCTION
  1.1 VISUAL ARTICULATED HAND TRACKING
  1.2 VISUAL HAND TRACKING BASED INTERACTIVE SURFACES
  1.3 CONTRIBUTIONS
  1.4 ROADMAP OF THE THESIS
2 HAND TRACKING AND HCI: A LITERATURE REVIEW
  2.1 HUMAN HAND: ANATOMY, MOTION AND MODELLING
  2.2 VISUAL HAND TRACKING
  2.3 HAND GESTURES
  2.4 VISUAL HAND TRACKING IN HCI
  2.5 INTERACTIVE SURFACES
  2.6 SUMMARY
3 HAND CONTOUR TRACKING USING PARTICLE FILTERS AND DEFORMABLE TEMPLATES
  3.1 DEFORMABLE TEMPLATES
  3.2 MEASUREMENT MODEL
  3.3 THE CONDENSATION ALGORITHM APPLIED TO VISUAL CONTOUR TRACKING
      Resampling
      Prediction
      Measurement
  3.4 ARTICULATED TRACKING
    3.4.1 Partition sampling
    3.4.2 Incomplete particles in a chain of links
  3.5 TREE-STRUCTURED ARTICULATED OBJECTS
  3.6 INCOMPLETE PARTICLES IN TREE-STRUCTURED ARTICULATED OBJECTS
  3.7 PARTICLE INTERPOLATION
    3.7.1 Generating interpolated particles
    3.7.2 Differences from Condensation
4 IMPLEMENTATIONS AND RESULTS
  4.1 ARTICULATED HAND CONTOUR MODEL
  4.2 DYNAMICAL MODEL
  4.3 MEASUREMENT MODEL
    4.3.1 Measurement lines
    4.3.2 Skin colour based measurement
  4.4 RESAMPLING SCHEME
  4.5 PARTICLE-SET IMPLEMENTATION
  4.6 REFINING FINGER LENGTH ESTIMATES
  4.7 SWEEP IMPLEMENTATION
  4.8 PERFORMANCE MEASURES FOR ARTICULATED CONTOUR TRACKERS
    4.8.1 Cost function
    4.8.2 Contour distances
    4.8.3 Signal to Noise Ratio (SNR)
    4.8.4 Distance between model points
    4.8.5 Distance between model parameters
  4.9 TEST VIDEO SEQUENCE
  4.10 RESULTS AND COMPARISONS
  4.11 RELATIONSHIP BETWEEN PERFORMANCE MEASURES
  4.12 TECHNOLOGIES EMPLOYED IN THE IMPLEMENTATIONS
  4.13 CONCLUSIONS
5 A SKIN COLOUR CLASSIFIER FOR HCI
  5.1 PREVIOUS WORK ON SKIN COLOUR DETECTION
  5.2 DEVELOPMENT OF THE LC CLASSIFIER
    5.2.1 RGB histogram classifier
    5.2.2 Normalised RGB histogram classifier
    5.2.3 Projected RGB histogram
  5.3 THE LINEAR CONTAINER (LC) CLASSIFIER
  5.4 PERFORMANCE RESULTS
  5.5 TUNING AT VARIOUS RESOLUTIONS
  5.6 HCI USABILITY FACTORS
  5.7 TARGET IMPORTANCE SELECTION
  5.8 EXAMPLE OF A LC CLASSIFIER INITIALISATION IN HCI
  5.9 EXAMPLE OF DYNAMIC SKIN COLOUR MODELLING DURING TRACKING
  5.10 CONCLUSIONS
6 USING SKIN COLOUR AND EDGE FEATURES
  6.1 EDGE FEATURES VS. SKIN COLOUR FEATURES
  6.2 USING ONLY EDGE FEATURES IN THE MEASUREMENT FUNCTION
  6.3 COMBINING EDGE DETECTION AND SKIN COLOUR DETECTION IN THE MEASUREMENT FUNCTION
  6.4 CONCLUSIONS
7 TRACKING IMPROVEMENTS
  7.1 SWITCHING TEMPLATE FITTING METHODS DURING ARTICULATED TRACKING
    7.1.1 Fitting templates to the links of an articulated object
    7.1.2 Simplified articulated hand tracker
    7.1.3 Results with the simplified articulated hand tracker
    7.1.4 Tracking performance with the sweep implementation
    7.1.5 Conclusions
  7.2 QUASI RANDOM SAMPLING
    7.2.1 Quasi-random sequences
    7.2.2 Application of quasi-random sequences in Condensation
    7.2.3 Results
    7.2.4 Conclusions
  7.3 VARIABLE PROCESS NOISE (VPN)
  7.4 SKIN COLOUR GUIDED SAMPLING (SCGS)
    7.4.1 Skin coloured blob detection and analysis
    7.4.2 Combining low-level and high-level information
    7.4.3 Use of importance and initialisation particles
    7.4.4 Reinitialisation test
    7.4.5 Robustness test
    7.4.6 Conclusions
  7.5 COMBINING TRACKING IMPROVEMENTS
8 VIRTUAL TOUCH SCREEN
  8.1 THE VTS INTERFACE
    8.1.1 Hand tracking
    8.1.2 Operation
    8.1.3 Usability
  8.2 IMPLEMENTATIONS
    8.2.1 Projector based VTS (First Generation)
    8.2.2 Interface initialisation
      Initial tracking position
      Skin colour tone of the user's hand
      Estimation of a kinematic model
      Shape of the hand
    8.2.3 Touch detection
      Kinematic model
      Thresholds
      Moving threshold
      Debouncing
      Detecting the position of a finger click
      Dragging
    8.2.4 Projector based VTS (Second Generation)
    8.2.5 Tracking from the back of the hand
    8.2.6 HMD based VTS (Third Generation)
    8.2.7 Third generation VTS experiments
      VTS operation with a plain background
      VTS operation with a complex background
      VTS operation with resizable interfaces
      VTS based drawing application
  8.3 APPLICATIONS
  8.4 CONCLUSIONS
9 CLOSING DISCUSSION
  9.1 SUMMARY
  9.2 FUTURE WORK
APPENDIX A
APPENDIX B
BIBLIOGRAPHY
List of Figures Figure 1.1: Articulated hand contour tracking. ....................................................................5 Figure 1.2: VTS implementations...........................................................................................7 Figure 1.3: Drawing application. ............................................................................................8 Figure 2.1: Human hand anatomy and degrees of freedom of each joint.........................14 Figure 2.2: DigitEyes..............................................................................................................16 Figure 2.3: Stenger's hand tracker tracking an out-of-image-plane rotation. .................17 Figure 2.4: MacCormick and Isard's articulated hand contour tracker..........................18 Figure 2.5: Hand Mouse. .......................................................................................................22 Figure 2.6: HandVu. ..............................................................................................................23 Figure 2.7: VisualPanel..........................................................................................................24 Figure 2.8: Visual Touchpad.................................................................................................24 Figure 2.9: Steerable interfaces. ...........................................................................................25 Figure 2.10: ARKB. ...............................................................................................................26 Figure 2.11: HoloWall............................................................................................................27 Figure 2.12: TouchLight........................................................................................................28 Figure 2.13: PlayAnywhere. 
..................................................................................................28 Figure 2.14: Canesta keyboard. ............................................................................................29 Figure 2.15: Virtual keyboard based on true-3D optical ranging. ....................................29 Figure 2.16: DiamondTouch. ................................................................................................30 Figure 2.17: FingeRing. .........................................................................................................31 Figure 2.18: SCURRY. ..........................................................................................................31 Figure 2.19: Senseboard. .......................................................................................................31 Figure 3.1: A B-spline contour fitted to the middle finger of a hand. ...............................36 Figure 3.2: Measurement lines distributed along a contour. .............................................37 Figure 3.3: Measurement line normal to a hypothesized contour. ....................................38 Figure 3.4: Weighted particle set approximation of a probability density.......................42 Figure 3.5: Graphical representation of three particles from a particle set.....................42 Figure 3.6: One time-step in the Condensation algorithm. ................................................45 Figure 3.7: An intuitive partition sampling example..........................................................47 Figure 3.8: Articulated object with three links forming a chain. ......................................48 Figure 3.9: Algorithm for one time-step of partition sampling on the chain of links of
Figure 3.8. ............................................................................................................49 Figure 3.10: Particle set diagram showing two fictitious time-steps of partition sampling.
..............................................................................................................................51 Figure 3.11: Tree-structured articulated object..................................................................53 Figure 3.12: Algorithm for one time-step of partition sampling on the tree-structured
articulated object of Figure 3.11(a).................................................................55 Figure 3.13: Particle set diagram showing two fictitious time-steps of partition sampling
for the example articulated hand.. .................................................................57 Figure 3.14: Particle set diagram showing the particle interpolation process. ................59 Figure 3.15: Algorithm for one time-step of partition sampling, and particle
interpolation. ....................................................................................................59 Figure 3.16: Graphical representation of the interpolation process using (rule 1)..........61 Figure 4.1: Hand contour model...........................................................................................67 Figure 4.2: Measurement lines used in the articulated hand contour...............................69 Figure 4.3: Skin colour image with the measurement lines on top....................................70 Figure 4.4: Score look-up table. ............................................................................................71
ix
Figure 4.5: Algorithm to calculate the contour's score.......................................................72 Figure 4.6: One time-step of tracking for the particle-set implementation. .....................75 Figure 4.7: Algorithm for one time-step of tracking with the particle-set
implementation...................................................................................................76 Figure 4.8: Procedure to refine finger length estimations..................................................78 Figure 4.9: Angle sweep pattern. ..........................................................................................80 Figure 4.10: Sweep hand tracker implementation diagram for one time-step.................81 Figure 4.11: Algorithm for one time-step of tracking with the sweep implementation...82 Figure 4.12: Test video sequence structure. ........................................................................87 Figure 4.13: Performance comparison from frame 30 until 174. ......................................89 Figure 4.14: Example frames of the particle-set tracker output from frame 30 to frame
174. ....................................................................................................................90 Figure 4.15: Example frames of the sweep tracker output from frame 30 to frame 174.90 Figure 4.16: Performance comparison from frame 175 until 359. ....................................91 Figure 4.17: Performance comparison from frame 360 until 890. ....................................92 Figure 4.18: Example frames of the particle-set tracker output from frame 360 to frame
890. ....................................................................................................................93 Figure 4.19: Example frames of the sweep tracker output from frame 360 to frame 890.
...........................................................................................................................93 Figure 4.20: Relationship between cost function and distance metric. .............................96 Figure 5.1: Skin colour RGB histogram. ...........................................................................103 Figure 5.2: Normalised RGB histogram. ...........................................................................104 Figure 5.3: Projection from rg to RGB. .............................................................................105 Figure 5.4: LC classifier decision planes............................................................................106 Figure 5.5: Possible decision planes to avoid dark pixels. ................................................107 Figure 5.6: Initialisation image masks. ..............................................................................107 Figure 5.7: Tuning heuristics. .............................................................................................108 Figure 5.8: Ground truth masks. ........................................................................................110 Figure 5.9: Mediterranean subject test. .............................................................................112 Figure 5.10: White Caucasian subject test.........................................................................113 Figure 5.11: Black African subject test.. ............................................................................114 Figure 5.12: Chinese subject test. .......................................................................................115 Figure 5.13: NTP when tuning at various resolutions. .....................................................117 Figure 5.14: NTP chart for four percentages of skin in SkinMask. ................................118 Figure 5.15: NTP chart for two different SkinMask containing 25% of skin colour. 
...119 Figure 5.16: Target importance for various tuning situations.........................................121 Figure 5.17: Initialisation sequence. ...................................................................................122 Figure 5.18: Modified cost function....................................................................................125 Figure 5.19: Dynamic tuning vs. static tuning...................................................................126 Figure 5.20: Dynamic tuning vs. static tuning...................................................................127 Figure 6.1: Skin colour vs. edges. .......................................................................................132 Figure 6.2: Exemplar frame. ...............................................................................................133 Figure 6.3: Normalised histograms of feature positions along a measurement line. .....134 Figure 6.4: Distance metric of the skin edge vs. the image edge based sweep tracker. .136 Figure 6.5: Situations in which the use of edges is essential for the correct location of the
hand. ..................................................................................................................137 Figure 6.6: Combination matrix. ........................................................................................138 Figure 6.7: Performance of the sweep tracker using edges and skin colour in the
measurement function. ....................................................................................139 Figure 7.1: Fitting methods for an articulated object.......................................................144
x
Figure 7.2: Potential problem of method 2. .......................................................................144 Figure 7.3: Simplified articulated hand contour model with 9 DOF. .............................146 Figure 7.4: Flow chart of the simplified hand tracker......................................................146 Figure 7.5: Selected frames from the critical zone............................................................148 Figure 7.6: Various misalignments for each method of fitting the articulated template.
............................................................................................................................150 Figure 7.7: X and Y variance of the hand pivot. ...............................................................151 Figure 7.8: Variance of the rotation angle and scale factor. ............................................151 Figure 7.9: Sweep implementation tracking performance when using the combined
template fitting method. ..................................................................................153 Figure 7.10: Distributions of pseudo-random and quasi-random points. ......................156 Figure 7.11: Gaussian transformation of uniform pseudo-random and quasi-random
points. ................................................................................................................157 Figure 7.12: Distance metric performance measure when using VPN. ..........................161 Figure 7.13: Distance metric performance measure when using VPN, separate charts.
............................................................................................................................163 Figure 7.14: Skin coloured blobs. .......................................................................................166 Figure 7.15: Importance samples and initialisation samples. ..........................................167 Figure 7.16: Reinitialisation test selected frames. .............................................................169 Figure 7.17: Distance metric performance measure when using SCGS. ........................171 Figure 7.18: Distance metric performance measure when using SCGS, separate charts.
Figure 7.19: Skin coloured blobs mixing.
Figure 7.20: Distance metric for the sweep tracker with combined tracking improvements.
Figure 7.21: Distance metric for the sweep tracker with combined tracking improvements, separate charts.
Figure 8.1: Six possible interface element configurations for the VTS.
Figure 8.2: Proposed VTS operation.
Figure 8.3: Set up of the first generation VTS.
Figure 8.4: Image processing for the first generation VTS.
Figure 8.5: Typing a telephone number on the first generation VTS.
Figure 8.6: Initialisation states of the VTS hand contour tracker.
Figure 8.7: Finding finger creases.
Figure 8.8: Hand undergoing flexion of middle finger.
Figure 8.9: Touch detection using a moving threshold.
Figure 8.10: Set up of the second generation VTS.
Figure 8.11: Keypad usage. Key press sequence.
Figure 8.12: Slider bar usage. Dragging sequence.
Figure 8.13: New hand contour template.
Figure 8.14: New initialisation image masks.
Figure 8.15: Set up of the third generation VTS.
Figure 8.16: Example virtual interfaces.
Figure 8.17: Illusion of depth perception.
Figure 8.18: Example frames of the actions occurring during the first experiment (third generation VTS).
Figure 8.19: Example frames of the actions occurring during the second experiment (third generation VTS).
Figure 8.20: Example frames of the actions occurring during the third experiment (third generation VTS).
Figure 8.21: Drawing application toolbar (third generation VTS).
Figure 8.22: Automatic hand tracking reinitialisation feature.
Figure 8.23: Example frames of the actions occurring during the drawing application experiment (third generation VTS).
Figure 8.24: Example frames showing drawing with multiple fingers at the same time (multiple points of input).
List of Tables
Table 4.1: Parameter values for the hand tracker dynamical model.
Table 4.2: Finger angle constrains.
Table 5.1: LC classifier priori values.
Table 5.2: Execution time results.
Table 7.1: Contour distance performance metric comparison using three sampling methods.
1 Introduction
As computer systems become more and more embedded into our environment, the ability to
interact with them without the need for special equipment is very attractive. Vision-based
Human Computer Interaction (HCI) has the potential of making this possible in a form that is
both easy and natural for people.
Vision-based HCI core technologies are generally based on visual tracking and visual
template recognition algorithms. These algorithms can be designed to track and recognize
faces (for identity recognition or verification), recognize facial expressions (for mood or
attention detection), detect position and pose of human arms, legs, and body (for gesture
driven interfaces), and track the position and configuration of hands and fingers (for 3D hand
pointing, 3D mouse control, sign language recognition, etc.). However, despite the wide
range of potential applications, the creation of these algorithms still presents great technical
challenges, and often, some form of constraint needs to be put in place for these algorithms
to operate correctly. One strategy to overcome the technical problems is to make these
algorithms more application specific. Following this strategy, this thesis develops visual
articulated hand tracking algorithms aimed at interactive surfaces.
Interactive surfaces are surfaces that display information and allow users to interact with this
information by touching the surface either directly with their hands, using a stylus, or using
some form of wearable hardware. The concept includes traditional touch screens, but it goes
beyond them. An interactive surface can be presented on a table or a desk, on a wall, on a
book or on a piece of paper, on a shop display, or even on a virtual surface floating in the air.
The technologies used for both displaying information on the surface, and making the
surface sensitive vary considerably. The possibility of using visual hand tracking as the
sensing technology for interactive surfaces is very attractive with respect to other sensing
technologies because it is flexible (it could be deployed in various configurations), cheap (it
could operate with simple USB cameras), and it avoids the need for cumbersome hardware
(no special gloves or hardware need to be attached to the user's hands). A small number of
attempts at developing vision-based interactive surfaces have been made. These are typically
based on detecting reflected infrared light near the surface, tracking single fingers, or
tracking adorned hands. There have, however, been no previous attempts at using visual
articulated hand tracking as the sensing technology for interactive surfaces. This is partly
because full 3D visual articulated hand tracking is difficult to achieve.
Visual articulated hand tracking is currently an active and challenging area of research in the
computer vision community. Visual articulated hand tracking has a great potential in HCI,
especially in Virtual Reality (VR) and Augmented Reality (AR) environments. However, the
high number of degrees of freedom (DOF) of the hand models, the self-occlusion of the fingers, and the
kinematic singularities in the fingers' articulated motion make visual articulated hand
tracking very difficult. The challenge is even greater when a single camera, unadorned
hands, unconstrained background, and unconstrained illumination levels are required.
Fortunately, the 2D nature of the interactive surfaces does not require full 3D visual
articulated hand tracking. This makes it possible to develop a specific but robust visual
articulated hand tracker that is constrained to a particular viewpoint, where the hand is
approximately parallel to the interactive surface.
This thesis develops a visual articulated hand tracking system which enables the creation of a
novel vision-based interactive surface, referred to as the Virtual Touch Screen (VTS). The
tracking algorithm is based on the contour tracking framework proposed by Blake and Isard
(1998), with a considerable extension to improve the efficiency of particle propagation
between time-steps in tracking tree-structured articulated objects, such as human hands. This
allows the creation of a novel 14 DOF articulated hand contour tracker, which is capable of
tracking in real-time the articulated contour of a hand with the palm approximately parallel
to the camera's image plane. The tracker uses a robust skin colour classifier to track the hand
contour, enabling the tracking of unadorned hands against cluttered backgrounds. The
tracker is specifically designed to track the finger motions of a hand operating a VTS. The
contents of the VTS can be displayed using either a projector or a see-through
Head Mounted Display (HMD). The VTS is made touch-sensitive by using the
visual hand tracking system developed in this thesis to determine when and where
a user's finger touches the VTS. The VTS interface can be used as a multi-point
touch-sensitive surface, with the added ability to also detect hand gestures hovering above the VTS
surface. This enables a large number of promising applications including alternatives to
touch-sensitive panels, alternative interfaces for mobile computing devices, contactless
interactive surfaces for museums, information points, shop displays, and video games, and
sterile interfaces for use in hospitals and clean rooms.
The major contribution of this thesis is the development of a visual articulated hand tracking
system that could enable the creation of a vision-based interactive surface. The creation of
the VTS interface itself and related experiments constitute a comparatively small part of the
thesis. In other words, this thesis is focused on computer vision rather than HCI and
therefore the proposed VTS interface has not been tested with a representative group of users
in order to evaluate its usability in various situations and configurations. This is left as a part
of the future work of this thesis.
1.1 Visual articulated hand tracking
This thesis develops a visual articulated hand tracking system which enables the creation of a
novel vision-based interactive surface, referred to as Virtual Touch Screen (VTS). The
intended use of the tracking system sets a number of demanding requirements on it. The
tracking system has to be able to perform robust articulated tracking of an unadorned hand,
using a single camera, against arbitrary backgrounds, and under a wide range of lighting
conditions. The tracking of the fingers has to be accurate enough to enable detection of
click events on the interactive surface. Finally, the visual articulated hand tracking has to
work in real-time. A visual articulated hand tracker meeting all of these requirements presents great
technical difficulties and has not yet been achieved in the computer vision community.
However, the 2D nature of the interactive surfaces does not require full 3D visual
articulated hand tracking. A visual articulated hand tracker constrained to a particular
viewpoint greatly reduces the technical difficulties, and can be suitable for use in
interactive surfaces. Following this strategy, this thesis develops a 2D hand tracking system
that can both satisfy the above-mentioned requirements and track a hand in an orientation
approximately parallel to the camera's image plane.
The 2D hand tracking in this thesis is based on contours. A contour is a curve that defines the
2D boundary of an object as it appears in an image. In this thesis, the contour is that of a
hand approximately parallel to the camera's image plane. Hand contour tracking involves
matching a deformable hand template to the 2D contour of this hand as it moves within the
camera's field of view. In order to track the articulations of the fingers, the deformable hand
template has to match the changing contour of the fingers too. Existing contour tracking
algorithms based on particle filters (Blake and Isard, 1998) and partition sampling
(MacCormick and Isard, 2000) have the potential of tracking articulated objects in real-time
against cluttered backgrounds. However, they do not satisfy the demanding hand tracking
requirements of the VTS interface. This thesis further develops and improves these tracking
algorithms aiming at the creation of a suitable articulated hand contour tracking system for
the VTS. The improvements to these tracking algorithms are the following:
• A novel technique, referred to as particle interpolation, which makes it possible to
improve the efficiency of particle propagation between time-steps in tracking tree-
structured articulated objects using particle filters and partition sampling.
• A novel measurement function based on skin colour that is both faster and more reliable
than existing edge-based measurement functions, provided no other skin-coloured objects
appear in the background.
• A novel skin colour based importance sampling implementation, referred to as Skin
Colour Guided Sampling (SCGS), that allows the position, scale, and angle of
the hand contour to be estimated from low-level information, for users wearing either long or
short sleeves.
• A novel contour fitting method for articulated contour trackers, which improves tracking
agility and reduces jitter on the tracking output.
• A novel method for particle filter based contour trackers, referred to as Variable Process
Noise (VPN), which varies the size of the contour's search region in order to cope with
brisk target movements.
These techniques are used in the development of a novel 14 DOF articulated hand contour
tracking system that can track the contour of a hand approximately parallel to the camera's
image plane, from either its front (palm) or back views. Figure 1.1 shows two snapshots of
the tracking output (blue hand contour) of this articulated hand contour tracking system.
Figure 1.1(a) shows a front view (palm). Figure 1.1(b) shows a back view.
Figure 1.1: Articulated hand contour tracking. (a) Front view. (b) Back view.
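For readers unfamiliar with particle filters, the resample, predict, and measure cycle that underlies the contour trackers cited in this section can be sketched for a scalar state. This is a generic illustration and not the 14 DOF tracker developed in the thesis: the random-walk dynamics, the Gaussian likelihood, the noise values, and all function names are assumptions made for the example.

```python
import math
import random


def condensation_step(particles, weights, observation, rng,
                      process_noise=0.5, measurement_noise=1.0):
    """One resample / predict / measure cycle of a basic particle filter."""
    # Resample: draw particles in proportion to their previous weights.
    chosen = rng.choices(particles, weights=weights, k=len(particles))
    # Predict: random-walk dynamics (an assumption) plus process noise.
    predicted = [p + rng.gauss(0.0, process_noise) for p in chosen]
    # Measure: weight each hypothesis by a Gaussian observation likelihood.
    raw = [math.exp(-0.5 * ((observation - p) / measurement_noise) ** 2)
           for p in predicted]
    total = sum(raw)
    return predicted, [w / total for w in raw]


def estimate(particles, weights):
    """Weighted mean of the particle set."""
    return sum(p * w for p, w in zip(particles, weights))


rng = random.Random(1)
particles = [rng.uniform(-10.0, 10.0) for _ in range(300)]
weights = [1.0 / len(particles)] * len(particles)
for t in range(30):
    true_position = 0.2 * t  # hypothetical target drifting slowly
    observation = true_position + rng.gauss(0.0, 0.3)
    particles, weights = condensation_step(particles, weights, observation, rng)
print(abs(estimate(particles, weights) - true_position))
```

In this sketch the `process_noise` parameter controls how widely the hypotheses spread between time-steps, which is the quantity that determines the size of the region searched around the previous estimate.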
1.2 Visual hand tracking based interactive surfaces
A novel form of HCI, referred to as interactive surfaces, has emerged in recent years. An
interactive surface refers to a surface that can display information and can allow users to
interact with this information by touching the surface either directly with their hands, using a
stylus, or using some form of wearable hardware. The concept includes the traditional touch
screens but it goes beyond them. An interactive surface can be presented on a table or a desk,
on a wall, on a book or on a piece of paper, on a shop display, or even on a virtual surface
floating in the air. The technologies used for both displaying information on the surface and
making the surface interactive vary considerably. The use of computer vision in interactive
surfaces is attractive with respect to other sensing technologies because it is flexible and
avoids the need for expensive and cumbersome hardware.
This thesis develops a visual articulated hand tracking system which enables the creation of a
novel vision-based interactive surface, referred to as the Virtual Touch Screen (VTS). In a
VTS the information is displayed either by a projector, which projects the information
onto a selected surface, or by a see-through Head Mounted Display (HMD), which
displays the contents so that they appear to users to be floating in their field of
view. The VTS is made touch-sensitive by visually tracking the user's hand and interpreting
their hand position and configuration. This interpretation results in the detection of click and
drag actions on the VTS. Figure 1.2 shows a projector based VTS, Figure 1.2(a), and two
examples of its operation: operating a keypad, Figure 1.2(b), and operating a sliderbar,
Figure 1.2(c). In the right column, Figure 1.2 shows an HMD based VTS, Figure 1.2(d), and
two examples of its operation: operating a keypad against a cluttered background, Figure
1.2(e), and resizing interface elements with a thumb gesture, Figure 1.2(f).
The VTS interface can be used as a multi-point touch-sensitive surface, with the added
ability to also detect hand gestures hovering above the VTS surface. This enables a large
number of applications for the VTS interface. A VTS could constitute an alternative to touch
sensitive panels (especially attractive for large panels). A VTS could become a flexible
interface for PDAs, or other mobile computing devices. A VTS could be used in museums,
information points, and shop displays in order to show interactive information relevant to
users. A VTS could constitute a cheap and flexible alternative to handheld keypads, controls
or pointing devices for existing HMD based AR environments. The entertainment industry
could also benefit from VTSs for video games. Finally, as the visual hand tracking system
used in the VTS does not require a physical surface to operate, the VTS could have
applications in scenarios where physical contact is not desired, for example, sterile interfaces
for use in hospitals, or clean rooms.
This thesis proposes the VTS interface, its possible configurations, possible operation, and
potential applications regardless of the particular visual hand tracking technology in use.
Then, these ideas are implemented using the visual articulated hand
tracking system developed in this thesis. This results in three VTS generations, whose capabilities are tested with a
number of experiments. The final experiment is a VTS based drawing application. Figure 1.3
shows two snapshots of the drawing application in action.
Figure 1.2: VTS implementations. Left column shows a projector based VTS, (a), and two examples of
interface operation: operating a keypad, (b), and operating a sliderbar, (c). Right column shows an
HMD based VTS, (d), and two examples of interface operation: operating a keypad against a
cluttered background, (e), and resizing interface elements with a thumb gesture, (f).
Figure 1.3: Drawing application. (a) Drawing demo. (b) Multiple points of input demo.
1.3 Contributions
The main contributions of this thesis are:
• Critical evaluation of existing visual tracking algorithms, with special focus on the use of
particle filters in the implementation of articulated contour trackers, and analysis of the
efficiency of particle propagation between time-steps in tracking tree-structured
articulated objects.
• Development of a novel technique, referred to as particle interpolation, which makes it
possible to improve the efficiency of particle propagation between time-steps in tracking
tree-structured articulated objects using particle filters and partition sampling.
• Development of a novel skin colour classifier, referred to as the Linear Container (LC)
classifier, and testing of the classifier under various conditions for use in hand tracking
for HCI. The classifier is robust to illumination (brightness) changes, requires less
storage, and is significantly faster than existing classifiers.
• Implementation of skin colour based importance sampling for the hand contour trackers
presented in this thesis. This implementation, referred to as Skin Colour Guided Sampling (SCGS),
allows the position, scale and angle of the hand contour to be estimated from low-level
information.
• Analysis of contour fitting methods for articulated contour trackers and development of
an improved contour fitting method that improves tracking agility and reduces jitter on
the tracking output.
• Implementation of variable process noise in a particle filter, in order to improve its
tracking agility. The concept of variable process noise has been used before in mono-modal
trackers, such as Kalman filters, but it has not previously been used in a particle
filter scenario.
• Development of a novel 14 DOF articulated hand contour tracker. The tracker uses only
skin colour to track the hand contour, which enables real-time tracking of unadorned
hands against cluttered backgrounds.
• Implementation of a novel vision-based HCI interface called the Virtual Touch Screen
(VTS), and demonstration of its capabilities through a number of experiments.
1.4 Roadmap of the thesis
This chapter has given the background information needed for the remainder of the thesis.
The rest of the thesis is organized as follows:
Chapter 2 gives an overall view of the current visual hand tracking technologies and their
applications in HCI. It also reviews several examples of interactive surfaces, using both
vision and non-vision based technologies.
Chapter 3 reviews and expands the contour tracking framework developed by Blake and
Isard (1998). Within this framework, the use of partition sampling (MacCormick and Isard,
2000) for tracking tree-structured articulated objects is analysed, and efficiency problems in
the propagation of particles between time-steps are identified. Finally, a novel technique,
referred to as particle interpolation, that overcomes these efficiency problems is proposed.
Chapter 4 develops two novel 14 DOF articulated hand contour trackers using the particle
interpolation technique presented in Chapter 3. One of the trackers is entirely based on
stochastic processes, Section 4.5, while the other one uses a combination of stochastic and
deterministic processes, Section 4.7. Both trackers are designed to be used in HCI and
therefore are tested with tracking sequences that simulate HCI. The results of the tests are
analysed and the performance of the trackers is assessed in Section 4.10. The proposed hand
trackers use skin colour as the only cue for tracking. Thus, the skin colour detection needs to
be very robust.
Chapter 5 presents a novel skin colour classifier, referred to as the Linear Container (LC)
classifier, which is robust to illumination (brightness) changes, requires little storage, is
significantly faster than existing classifiers, and can be tuned to a particular skin tone. The
evaluation speed of the LC classifier guarantees that the hand trackers proposed in Chapter 4
can operate in real-time. The LC classifier is tested and compared with existing classifiers.
Finally, a method for rapidly adapting the LC classifier parameters to illumination changes
during tracking is presented in Section 5.9; this method enables the implementation of
dynamic skin colour modelling.
Chapter 6 analyses and compares the use of skin colour information with the use of edge
information in contour trackers. Typically, contour trackers use edge information. However,
the proposed articulated hand contour trackers use skin colour information only. The chapter
argues that the use of skin colour information alone in contour tracking is more attractive
than the use of edge information alone, because it can be faster to evaluate and is more
accurate, provided no other skin-coloured objects appear in the background.
Chapter 7 describes a number of techniques that improve the tracking performance of
articulated hand contour trackers:
• The order in which the segments of an articulated template are fitted to the corresponding
segments of an articulated object can affect the tracking agility and the level of jitter in
the tracking output. Section 7.1 analyses this phenomenon and proposes a fitting method
that improves tracking agility and reduces jitter levels in the tracking output.
• The use of stochastic processes in the hand tracking has the drawback that tracking
is not repeatable. Section 7.2 studies the use of quasi-random sampling in order to
improve the repeatability of the tracking.
• When a target object moves briskly the tracker may lose the location of its contour. This
situation can be prevented by controlling the size of the region in which this contour is
searched. Section 7.3 describes and tests a method, referred to as Variable Process Noise
(VPN), which varies the size of the contour's search region in order to cope with brisk
target movements.
• When the tracked hand exits the camera's field of view, tracking is lost. A method to
automatically regain tracking on the user's hand once it re-enters the camera's field
of view is presented in Section 7.4. This method, referred to as Skin Colour Guided
Sampling (SCGS), allows the position, scale and angle of the hand contour to be estimated
from low-level information.
• Section 7.5 shows that the separate techniques described in Chapter 7 can be combined
in a single articulated hand tracker resulting in improved tracking performance.
Chapter 8 presents a novel vision-based interactive surface, referred to as Virtual Touch
Screen (VTS). The chapter describes the VTS interface, its possible configurations, its
proposed operation, and a discussion of potential VTS applications. Then, using the visual
articulated hand tracking system developed in previous chapters, a number of experiments
with various VTS implementations are presented. Chapter 8 finishes with the presentation of
a VTS based drawing application which illustrates the use of VTS interfaces to complete a
task.
Chapter 9 presents the conclusions of the thesis and future work directions.
The manuscript also contains two appendices. Appendix A shows the calculation of the
reverse kinematics of a chain of links. Appendix B contains a number of video sequences
that illustrate the results of various experiments throughout the thesis. The appendix is
available both on the enclosed CD and on the supporting webpage at:
http://www.cs.nott.ac.uk/~mtb/thesis
2 Hand tracking and HCI: a literature review
The most important use of visual hand tracking is its application for HCI. Visual hand
tracking, and in particular visual articulated hand tracking, can enable new ways of
interaction with computers, which can be more natural to people, less intrusive, and more
flexible than other ways of interaction based on hardware devices. However, the
implementation of a full 3D visual articulated hand tracking is a difficult and challenging
problem. The potential benefits of visual articulated hand tracking in HCI, and the difficulty
of the problem, have attracted a great deal of research on this topic during the last decade.
This chapter places this thesis into a bigger context with regard to visual articulated hand
tracking and its applications for HCI. The chapter starts by reviewing the human hand anatomy,
its motion, and common ways of modelling it. Then it places the visual articulated hand
tracking developed in this thesis into context among the existing visual hand tracking
techniques. Finally, it places the proposed Virtual Touch Screen (VTS) interface into context
inside an emerging field in HCI. This field is referred to as interactive surfaces.
2.1 Human hand: anatomy, motion and modelling
Before the current hand tracking technologies can be addressed in the next section, it is
important to review some basic facts about human hand anatomy, motion and modelling.
The human hand skeleton is composed of 27 bones. These bones can be divided into three
groups:
• Eight carpals
• Five metacarpals
• Fourteen phalanges (finger bones)
The carpals are found in the wrist, the metacarpals in the palm, and the phalanges in the
fingers. Joints between these bones have different numbers of degrees of freedom
(DOF). Figure 2.1 shows the human hand anatomy and degrees of freedom of each joint. For
the little, ring, middle, and index fingers, the Distal Interphalangeal (DIP) and
Proximal Interphalangeal (PIP) joints each have a single DOF. The Metacarpophalangeal (MCP)
joint has 2 DOF. The joints between carpals and metacarpals have 1 DOF more. The thumb
is a special case having 1 DOF for the Interphalangeal (IP) joint, 1 DOF for the
Metacarpophalangeal (MP) joint, and 3 DOF for the Trapeziometacarpal (TM) joint.
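As a quick check on the numbers in this paragraph, the per-joint values can be tallied. The sketch below counts only the finger and thumb joints listed above and leaves out the joints between carpals and metacarpals; the dictionary and function names are illustrative and not part of any hand model used in the thesis.

```python
# Degrees of freedom per joint, as listed in the paragraph above.
# The joints between carpals and metacarpals are left out of this tally.
FINGER_JOINT_DOF = {"DIP": 1, "PIP": 1, "MCP": 2}
THUMB_JOINT_DOF = {"IP": 1, "MP": 1, "TM": 3}
FINGERS = ("little", "ring", "middle", "index")


def hand_joint_dof():
    """Sum the joint DOF of the four fingers and the thumb."""
    finger_total = len(FINGERS) * sum(FINGER_JOINT_DOF.values())
    thumb_total = sum(THUMB_JOINT_DOF.values())
    return finger_total + thumb_total


print(hand_joint_dof())  # 4 * (1 + 1 + 2) + (1 + 1 + 3) = 21
```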
The motion of the phalanges is described using a specific terminology:
• Flexion, refers to the movement of the fingers toward the palm.
• Extension, refers to the movement of the fingers away from the palm.
• Abduction, refers to the movement of the fingers away from the plane that divides the
hand between the middle and ring fingers.
• Adduction, refers to the movement of the fingers towards this plane.
Flexion and extension are exhibited by all the phalanges. Abduction and adduction are
exhibited only by the Metacarpophalangeal (MCP) joint.
A common approach to represent the anatomical structure of the human hand is by means of
a 3D kinematic hand model. These kinematic models are based on a simplified skeleton of
the human hand and they can represent the state of the hand over time (updating them as
hand motion occurs). They typically comprise the lengths of the finger segments, the joint
angles for each of the articulations, and constraints in the motion of the finger segments.
Figure 2.1: Human hand anatomy and degrees of freedom of each joint. (Figure reproduced from Sturman
(1992).)
An anatomically correct kinematic hand model has 26 DOF (and 6 DOF more if the 3D
position and orientation are considered). However, the Metacarpocarpal joints (situated inside
the palm) are generally disregarded. The hand can then be modelled with 2 DOF in the
MCP and TM joints, and 1 DOF for all the other joints, therefore simplifying the kinematic
model to 21 DOF.
The dimensionality of the hand state can be considerably reduced by recording the range of
valid movements (for example using a data glove to gather empirical data of a hand moving
through all the possible configurations) and then analysing the data using Principal
Component Analysis (PCA). Using this technique, Wu, et al. (2001) managed to
reduce a hand kinematic model from 20 DOF to 7 DOF.
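The idea behind this reduction can be illustrated on a much smaller problem. The sketch below is not the procedure of Wu, et al. (2001): it uses synthetic data for a single finger whose three flexion angles are all driven by one hypothetical "curl" parameter, and it finds only the first principal component by power iteration so that no external library is needed. The gains (1.2, 1.6, 1.0) and the noise level are invented for the example.

```python
import math
import random


def covariance(rows):
    """Sample covariance matrix of a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        c = [r[j] - mean[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += c[a] * c[b] / (n - 1)
    return cov


def top_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
    return v, sum(v[a] * cv[a] for a in range(d))


# Synthetic joint angles (radians): MCP, PIP and DIP flexion of one finger,
# all driven by a single curl parameter plus small measurement noise.
rng = random.Random(0)
samples = []
for _ in range(500):
    curl = rng.uniform(0.0, 1.0)
    samples.append([1.2 * curl + rng.gauss(0.0, 0.02),
                    1.6 * curl + rng.gauss(0.0, 0.02),
                    1.0 * curl + rng.gauss(0.0, 0.02)])

cov = covariance(samples)
direction, variance = top_component(cov)
explained = variance / sum(cov[i][i] for i in range(3))
print(direction, explained)
```

Because the three angles move together, a single component accounts for nearly all of the variance, so this toy 3 DOF state could be replaced by one coordinate along `direction` with little loss.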
The use of hand motion constraints can also reduce the state space considerably; this
generally does not reduce the DOF but reduces the range of variation in the state space.
There are two sets of constraints that can be placed on the joint angle movements:
• Static constraints. The range of valid joint angles for a given joint.
• Dynamic constraints. The dependencies between joints due to sharing the same tendons.
There is a third category of more subtle constraints proposed by Lin, et al. (2000). These
constraints have nothing to do with limitations in the hand anatomy, but rather are a result of
common and natural movements. These constraints are mostly used in simulation of natural
movement of a hand.
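The static and dynamic constraints listed above can be sketched for the flexion angles of a single finger. The angle limits below are hypothetical values chosen for illustration, and the coupling of DIP flexion to two thirds of PIP flexion is an approximation often quoted in the hand-modelling literature rather than a value taken from this thesis.

```python
import math

# Hypothetical static limits (radians) for one finger's flexion angles.
STATIC_LIMITS = {
    "MCP": (math.radians(-15.0), math.radians(90.0)),
    "PIP": (0.0, math.radians(110.0)),
    "DIP": (0.0, math.radians(90.0)),
}


def apply_static(angles):
    """Static constraint: clamp each joint angle to its valid range."""
    clamped = {}
    for joint, value in angles.items():
        low, high = STATIC_LIMITS[joint]
        clamped[joint] = min(max(value, low), high)
    return clamped


def apply_dynamic(angles):
    """Dynamic constraint: couple DIP flexion to PIP flexion (assumed 2/3 ratio)."""
    coupled = dict(angles)
    coupled["DIP"] = (2.0 / 3.0) * coupled["PIP"]
    return coupled


def constrain(angles):
    """Apply the joint dependency first, then the range limits."""
    return apply_static(apply_dynamic(angles))


raw = {"MCP": math.radians(120.0), "PIP": math.radians(60.0), "DIP": math.radians(10.0)}
valid = constrain(raw)
print({joint: round(math.degrees(value), 1) for joint, value in valid.items()})
```

Here the out-of-range MCP angle is pulled back to its upper limit, and the DIP angle is replaced by the value implied by the PIP angle.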
2.2 Visual hand tracking
During the last decade, visual hand tracking has attracted a great deal of research in the
computer vision community. In 1995, Rehg's seminal work in articulated hand tracking
established what is now a classical approach to model based tracking of articulated objects
through video sequences. The classical approach to articulated hand tracking uses a
kinematic hand model, which also represents the volume of each of its segments and the palm
(for example using truncated cylinders, conic sections, or truncated quadrics). The initial
configuration of the model is known, and a fast frame rate is assumed (i.e. small differences
in the hand's configuration between two consecutive frames). The projection of the hand
model onto an adequate plane is then compared to the hand appearing in each frame of a
video sequence. The discrepancies between the hand model projection and the hand features
in the frame are then minimized. The minimization process makes small changes in the state
of the hand model, until a match between the hand model and the hand features is reached.
This procedure allows the state of the hand to be tracked throughout the video sequence. The
matching process has been formulated as a constrained nonlinear optimisation problem.
However, the high DOF of the kinematic hand models, self-occlusions of the fingers, and
kinematic singularities in the articulated motion often cause the optimisation process to become
trapped in local minima. These problems make articulated hand tracking very difficult.
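The project, compare, and minimise loop described above can be illustrated with a toy problem. This is not Rehg's formulation: the sketch assumes a planar two-link finger, uses the positions of the middle joint and fingertip as the only image features, and replaces the constrained nonlinear solver with plain gradient descent on a finite-difference gradient. Link lengths, step size, and function names are assumptions.

```python
import math

LINK_LENGTHS = (4.0, 3.0)  # hypothetical segment lengths, arbitrary units


def project(state):
    """Image positions of the middle joint and fingertip of a planar 2-link finger."""
    t1, t2 = state
    l1, l2 = LINK_LENGTHS
    joint = (l1 * math.cos(t1), l1 * math.sin(t1))
    tip = (joint[0] + l2 * math.cos(t1 + t2), joint[1] + l2 * math.sin(t1 + t2))
    return [joint, tip]


def discrepancy(state, features):
    """Sum of squared distances between projected model points and image features."""
    return sum((px - fx) ** 2 + (py - fy) ** 2
               for (px, py), (fx, fy) in zip(project(state), features))


def fit(state, features, steps=500, rate=0.005, eps=1e-6):
    """Refine the state with small gradient-descent updates."""
    state = list(state)
    for _ in range(steps):
        base = discrepancy(state, features)
        grad = []
        for i in range(len(state)):
            nudged = list(state)
            nudged[i] += eps
            grad.append((discrepancy(nudged, features) - base) / eps)
        state = [s - rate * g for s, g in zip(state, grad)]
    return state


true_state = (0.6, 0.5)       # joint angles in the current frame (radians)
previous_state = (0.5, 0.4)   # estimate carried over from the previous frame
features = project(true_state)
print(fit(previous_state, features))
```

The search starts from the previous frame's state, which is why a fast frame rate matters: a local method of this kind only finds the correct configuration when the starting point is already close to it.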
Rehg (1995) built a system called DigitEyes where a hand could be tracked against a black
background, in real-time (up to 10 Hz), either using two cameras and a 27 DOF hand model,
or a single camera and a simplified 6 DOF hand model (which was applied in a 3D mouse
user-interface trial). The kinematic models that Rehg used provided geometric constraints on
the position of the hand features. In addition, self-occlusions of the fingers were handled
using layered templates, whose order was inferred from the current state of the kinematic
model. However, finger occlusions were only tracked off-line. One of the main weaknesses
of Rehg's system is the adaptation of the hand model to a new user. This process takes about
4 hours of interactive work where measurements of the length and the breadth of all the links
are made. Figure 2.2 shows the experimental test bed for the DigitEyes system and its 3D
hand model. In a similar system, Kuch and Huang (1995) simplified the model adaptation to
a new user with only three snapshots of the user's hand in three predefined configurations,
plus an interactive selection. A few years later, Rehg and Morris (1997) developed a method
to capture 3D motion using a 2D model. They used this method to register the 3D motion of
a person dancing. Afterwards, the system would recover the 3D motion from the 2D
registration using a 3D kinematic model of the person's body.
Figure 2.2: DigitEyes. Left: experimental test bed for the DigitEyes system. Right top: hand image with
kinematic model overlaid. Right bottom: 3D view of the hand model. (Figure reproduced from
Rehg (1995).)
Following the same model based hand tracking approach, Stenger et al. (2001) constructed
an anatomically accurate hand model using truncated quadrics. Their method uses elegant
tools from projective geometry in order to generate 2D profiles of the model and handle self-
occlusions. The pose of the hand is estimated with an Unscented Kalman filter (UKF) using
one or more cameras. The work of Stenger et al. (2006) evolved into the discretisation of their
hand model state space, and subsequent organization of the discretised state space into a
hierarchy of hand templates. The hierarchy of templates contains all the possible (or allowed)
hand configurations. The approach is similar to hierarchical object detection. Areas of the
state space, which are unlikely to contain the current hand configuration, are rejected early
on at the top of the template hierarchy. A search down the template hierarchy refines further
the fitting of the hand model to the hand in the image. The search in the template hierarchy is
aided by a dynamic model, which sets only a weak prior assumption about the motion
continuity. The method produces good results, and it is capable of handling out-of-image-
plane rotations, fast motion, and automatic recovery of tracking. However, the system has
large memory requirements, and it does not yet run in real time (it takes a few seconds to
process each frame of a tracking sequence). Figure 2.3 shows Stenger's hand tracker tracking
an out-of-image-plane hand rotation.
Figure 2.3: Stenger's hand tracker tracking an out-of-image-plane rotation. (Figure reproduced from
(Stenger et al., 2006).)
A different approach to the model based hand tracking involves tracking the 2D contour of
the hand (as it appears in an image). Heap and Samaria (1995) used this approach and
introduced "smart snakes" in order to track and recognize hand gestures. Some time later,
Blake and Isard (1998) established a framework for tracking deformable 2D contours. One of
the most salient features of their work was the introduction of the Condensation algorithm
for real-time tracking of contours against cluttered backgrounds. They demonstrated the
robustness of their tracking framework with a number of experiments (Isard, 1998; Isard and
Blake, 1998a; Isard and Blake, 1998b), such as: tracking the fast motion of a leaf on a
cluttered background, tracking people's profiles, tracking cars, tracking facial expressions,
and even tracking of articulated objects. However, tracking of articulated objects using this
framework alone was inefficient (as the number of particles required to track an articulated
object grows exponentially with the DOF of the object). MacCormick and Blake (1999)
introduced a new technique, called partition sampling, which makes it possible to avoid the
high cost of particle filters when tracking more than one object. Later, this technique was
used by MacCormick and Isard (2000) to implement a vision based articulated hand tracker.
Their hand tracker could track position, rotation and scale of the user's hand while in a
pointing configuration. In addition, the thumb had 2 DOF and the index finger had 1 DOF.
Figure 2.4 shows some snapshots of MacCormick and Isard's articulated hand contour
tracker. Partition sampling makes it possible to deal with large configuration spaces,
provided that certain conditions such as the ones found in articulated object tracking are met.
The articulated hand contour tracking developed in this thesis is based on partition sampling,
and Blake and Isard's framework. This thesis contributes to these techniques by enabling
efficient tracking of tree-structured articulated objects while using the hierarchical
structure of the object in the matching process.
Figure 2.4: MacCormick and Isard's articulated hand contour tracker. (Figure reproduced from
MacCormick and Isard (2000).)
A different approach to realise articulated hand tracking is that of Nolker and Ritter (1997).
They find the fingertips of a hand in a grey-scale image by means of a hierarchical neural
network. As the hand movement is highly constrained, the fingertip positions are enough to
roughly infer the 3D state of the hand. This allows them to update a 3D hand model from
only the fingertip positions (Nolker and Ritter, 1999). Recent research by Stefanov (2005)
uses a rather different approach to hand tracking. He combines Hough transform features
(circles) from the input image with behaviour knowledge (from structured interaction) in
order to guide and achieve robust hand tracking.
Skin colour is an important source of information in hand tracking. Tracking methods such
as Camshift (Bradski, 1998) rely entirely on skin colour. As colour is a low level feature,
skin colour based trackers are generally fast, and allow real-time operation. However, they
do not generally allow articulated tracking (the articulated hand tracking presented in this
thesis is an exception as it allows fast articulated hand tracking and is entirely based on skin
colour). Other image features are typically combined with skin colour in order to improve
hand tracking. For example, MacCormick and Blake (1999) use skin colour and edge
information; and Kolsch and Turk (2005) use a technique called flocks of features which
uses skin colour and KLT features to track a hand through rapid deformations.
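To illustrate why purely colour-based trackers are fast, the following is a minimal sketch of the mean shift iteration that underlies Camshift-style tracking. It is not Bradski's implementation: the skin-probability map, the fixed square window and the function name `mean_shift` are assumptions made for illustration, and Camshift additionally adapts the window size and orientation.

```python
def mean_shift(prob, center, half, max_iter=20):
    """Move a square window towards the centroid of a skin-probability map.

    prob   : 2D list of per-pixel skin probabilities (hypothetical input)
    center : (row, col) starting position of the window centre
    half   : half-width of the fixed-size search window
    """
    rows, cols = len(prob), len(prob[0])
    cr, cc = center
    for _ in range(max_iter):
        m00 = m_r = m_c = 0.0
        for r in range(max(0, cr - half), min(rows, cr + half + 1)):
            for c in range(max(0, cc - half), min(cols, cc + half + 1)):
                w = prob[r][c]
                m00 += w
                m_r += w * r
                m_c += w * c
        if m00 == 0.0:
            break  # no skin-coloured pixels inside the window
        nr, nc = int(round(m_r / m00)), int(round(m_c / m00))
        if (nr, nc) == (cr, cc):
            break  # converged: the window is centred on the local centroid
        cr, cc = nr, nc
    return cr, cc


# Toy 20x20 probability map with a 5x5 "hand" blob centred at (12, 14).
prob = [[0.0] * 20 for _ in range(20)]
for r in range(10, 15):
    for c in range(12, 17):
        prob[r][c] = 1.0

tracked = mean_shift(prob, center=(8, 10), half=4)
```

Each iteration only sums pixel weights inside a small window, which is why such trackers run in real time, but the result is a single window position rather than an articulated hand configuration.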
2.3 Hand gestures
Hand gestures are a natural and powerful way of communicating actions or states to a
computer system. If those hand gestures are combined with pose and position hand tracking,
many parameters of an application can be controlled at the same time. Pavlovic et al. (1997)
define a gesture as a trajectory in parameter space. A gesture can be divided into three
stages: preparation, peak or stroke, and retraction. Quek (1995) suggested a taxonomy of
hand/arm gestures for HCI that divides gestures into Manipulative and Communicative
gestures. Communicative hand gestures inherently communicate an idea, which can be used
as a command to a system. Examples of communicative hand gestures are: the O.K. symbol,
thumbs up, waving the index finger (indicating a negation), and stop (showing the palm and
fingers extended). Manipulative hand gestures are the hand movements resulting from acting
on objects in the environment (object movement, rotation, etc). Examples of manipulative
hand gestures are: grasping a tool, drawing with a pen, or typing on a keyboard.
Computer systems that can recognize hand gestures from visual data typically follow one of
two approaches. The first approach is the use of 3D hand models and tracking (Davis and
Shah, 1994; Heap and Hogg, 1996; Kuch and Huang, 1995; Lee and Kunii, 1995; Rehg and
Kanade, 1995; Shimada et al., 1998; Wu and Huang, 1999). In this approach the
configuration of the hand model is estimated over time. The second approach is the use of
appearance-based models (Cui and Weng, 1996; Rosales and Sclaroff, 2000; Triesch and
Malsburg, 1996; Wu and Huang, 2000). These models aim to characterize the mapping from
the image feature space to the possible hand configuration space directly from a set of
training data. This approach often involves learning techniques. In either approach (3D hand
models or appearance-based models) Hidden Markov Models (HMMs) and their variations are
the most important techniques employed in modelling, learning, and recognition of hand
gestures (Yang et al., 1994; Stoll and Ohya, 1995; Starner and Pentland, 1995; Assan and
Grobel, 1997; Vogler and Metaxas, 1998).
2.4 Visual hand tracking in HCI
The most important area of application of visual hand tracking is HCI. Visual hand tracking,
often combined with hand gesture recognition, provides a natural way of interaction with
computer systems. Visual hand tracking is especially suited to VR and AR environments
because it allows users to interact with virtual objects in an intuitive and flexible way,
without requiring cumbersome hardware.
Visual hand tracking has often been used in order to implement interaction with Digital
Desks. The concept of a Digital Desk is to augment a physical desk by adding electronic
features to it, and allow users to interact with these electronic features on the desk, moving
information from the desk to the computer and vice versa. This is achieved by means of two
cameras situated above the work-surface (for tracking the user's hands) and a projector
(to project the electronic features on top of the desk). Wellner (1993) introduced many of the
original Digital Desk ideas. He presented a number of example applications for the Digital
Desk. The first application was a calculator projected on the desk, which could be operated
directly with the finger. This application allowed the user to get numbers directly from some
document on the desk and feed them into the calculator's display. Another application was
called PaperPaint. This application allowed the user to draw on the desk and copy/move and
paste sections of the drawing. Finally, he presented the idea of double-desk. This idea
involves projecting the actions happening in two separate Digital Desks at the same time, on
the same desk.
Using a different methodology but also following the Digital Desk concept, Crowley et al.
(1995) implemented a finger drawing application. The system tracked the user’s fingertip
using a template of the fingertip and correlating it with the desk’s image. The correlation was
done only over the area surrounding the last detected fingertip position. If tracking was lost,
the user needed to put his finger on a square situated in one of the corners of the desk. More
recently in the EnhancedDesk (Koike et al., 2001), paper and digital information are
integrated on the desk. The system uses template matching and infrared cameras in order to
track the user's hands and fingers. Also related to the Digital Desk, Stafford and Robinson
(1996) presented BrightBoard. BrightBoard is a system that uses a video camera and audio
feedback to enhance the facilities of an ordinary whiteboard, allowing a user to control a
computer through simple marks made on the board.
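The windowed template search described above for the fingertip tracker of Crowley et al. (1995) can be sketched as follows. This is an illustrative sketch only: it uses the sum of squared differences as a stand-in for the correlation measure, and the function names, the toy image and the search radius are assumptions rather than details of the original system.

```python
def ssd(image, template, top, left):
    """Sum of squared differences between the template and an image patch."""
    total = 0
    for r, row in enumerate(template):
        for c, value in enumerate(row):
            diff = image[top + r][left + c] - value
            total += diff * diff
    return total


def track_in_window(image, template, last_pos, radius):
    """Search only the neighbourhood of the last detected position."""
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    last_r, last_c = last_pos
    best_score, best_pos = None, None
    for r in range(max(0, last_r - radius), min(ih - th, last_r + radius) + 1):
        for c in range(max(0, last_c - radius), min(iw - tw, last_c + radius) + 1):
            score = ssd(image, template, r, c)
            if best_score is None or score < best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score


# Toy 8x8 image: a bright 2x2 "fingertip" whose top-left corner is at (4, 5).
image = [[0] * 8 for _ in range(8)]
for r in (4, 5):
    for c in (5, 6):
        image[r][c] = 9
template = [[9, 9], [9, 9]]

position, score = track_in_window(image, template, last_pos=(3, 4), radius=2)
```

Restricting the search to a small window keeps the cost per frame low, but it also explains why tracking cannot recover by itself once the fingertip leaves the window: a poor best score could be used to trigger the reinitialisation step.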
Fish tank VR applications also benefit from the use of visual hand tracking. Bowden et al.
(1996) presented an application where the user could drive 3D engines on a computer screen
using their bare hands. Segen and Kumar (1998) described a multi-dimensional hand-gesture
interface system and its use in interactive spatial applications. The system acquires input data
from two cameras that look at the user's hand, recognizes three gestures and tracks the hand in
3D space. Five spatial parameters (position and orientation in 3D) are computed for the index
finger and the thumb, which gives the user simultaneous control of up to ten parameters of
an application. They demonstrated the capabilities of the system with some example
applications: video game control, piloting of a virtual fly-through over terrain by hand
pointing, interaction with 3D objects in a scene editor by grasping and moving objects in
space, and partial control of a human hand. Abe et al. (2000) described a system that tracks
the user’s hand using two cameras, one from the top-view, and another from the side-view.
The system allows drawing figures in 3D space and handling of 3D virtual objects on the
computer screen.
As an application for wearable computing, Kurata et al. (2001) described a system, named
Hand Mouse, where the user’s hand is tracked using a wearable camera. Users wear a HMD
that allows them to see tags and information relevant to the user’s environment. A colour
based mean shift algorithm is used in order to track the user’s hand. This allows users to use
their hand as a pointing device. A soft floating keyboard operated with a single finger is
shown as an example application. In a later work, Kurata et al. (2002) improved the system
using a Condensation algorithm with the lowest possible number of samples in order to
coarsely but rapidly track the user’s hand. Three promising applications were described: a
virtual universal remote control, a secure password input, and a real world OCR. The system
allowed selection of areas, and selection of points in space. The selection used the index and
thumb fingers in order to select the area between them (as seen from the wearable camera).
To select a point in space the user had to bring the index finger and thumb together until they
touch each other, at which point a selection was made. Figure 2.5 shows some snapshots
from Hand Mouse applications.
Figure 2.5: Hand Mouse. (a) - (d) Selecting a rectangle by dragging. (e) Universal remote control application.
(f) Secure password input application. (g), (h) OCR application detects text regions automatically.
(Figure reproduced from (Kurata et al., 2002).)
Also using a wearable camera and a video see-through HMD,
Kölsch (2004) constructed a system called HandVu. The system can track the user's hand
through highly articulated hand motions by using "flocks of features". He presented a GUI
which allowed the user to complete multiple tasks, such as interacting with virtual objects
(using ARToolKit fiducial markers), operating buttons, selecting areas from the field of
view, and recognizing gestures. Figure 2.6 shows the HandVu set-up, some snapshots of
hand tracking through highly articulated motions, and some snapshots of the HandVu GUI.
2.5 Interactive surfaces
In recent years, the concept of interactive surfaces has attracted a great deal of research. An
interactive surface can display information and users can interact with this information by
touching the surface either directly with their hands, using a stylus, or using some form of
wearable hardware. The concept includes the traditional touch screens but it goes beyond
them. An interactive surface can be presented on a table or a desk, on a wall, on a book or on
a piece of paper, on a shop display, or even on a virtual surface floating in thin air. The
technologies used for both displaying information on the surface and making the surface
interactive vary considerably. Examples of display technologies include front or rear
projectors, HMD, or simply a computer screen. Some sensing technologies are based on
image processing, either using visible-light cameras, infrared cameras and infrared
illumination, or 3D range cameras. Other sensing technologies use special surfaces which
include touch sensitive elements (e.g. antennas, or pressure sensors).
Figure 2.6: HandVu. (a) System set-up. (b) Tracking of highly articulated hand motions using flocks of
features. (c) Selecting a region with both hands. (d) Interacting with virtual objects and ARToolKit.
(e) Virtual keypad application. (Figure reproduced from (Kölsch, 2004).)
Finally, other sensing technologies use wearable hardware such as data gloves, pressure sensors, gyros,
accelerometers, etc. Interactive surfaces are sometimes more specifically referred to as
interactive tabletops, virtual keyboards, or touch panels. In any case, they always lie on a
surface, and thus they are generally referred to as interactive surfaces. A short survey on a
number of recent interactive surface developments is presented next.
Image processing is the technology used in a number of interactive surface implementations.
Zhang et al. (2001) presented Visual Panel. Visual Panel is a vision-based interface which
employs an arbitrary quadrangle-shaped panel (e.g. an ordinary piece of paper) and a tip
pointer (e.g. fingertip) as a wireless and mobile input device. The user's fingertip is tracked
and operations such as click and drag on the visual panel can be detected. The system
simulates clicking and pressing by holding the tip pointer in the same position for a short
period of time. This makes it possible to simulate a keyboard or a mouse on the visual panel.
The position of the panel itself is visually tracked and can provide 3D information, serving as
a virtual joystick or 3D mouse. The quadrangle panel is tracked using Hough transform, and
the user's fingertip is tracked using a Kalman filter and fitting a conic to the fingertip.
Background subtraction is also used for reinitialisation. Figure 2.7 shows the Visual Panel set-up
and two applications.
Figure 2.7: Visual Panel. (a) Tracked panel and tracked fingertip. (b) Virtual keyboard. (c) Finger painting.
(Figure reproduced from (Zhang et al., 2001).)
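The dwell-based click described above for Visual Panel (holding the tip pointer in the same position for a short period of time) can be sketched as a small state machine. This is only an illustrative sketch: the function name, the pixel radius and the dwell time are assumptions, not values reported by Zhang et al. (2001).

```python
import math


def detect_dwell_clicks(samples, radius=5.0, dwell_ms=1000):
    """Return one click per period in which the pointer stays still.

    samples  : list of (time_ms, x, y) fingertip positions, sorted by time
    radius   : maximum drift (in pixels) still counted as "the same position"
    dwell_ms : how long the pointer must stay within the radius to click
    """
    clicks = []
    anchor = None   # (time_ms, x, y) where the current dwell started
    fired = False   # True once the current dwell has produced its click
    for t, x, y in samples:
        if anchor is None or math.hypot(x - anchor[1], y - anchor[2]) > radius:
            anchor = (t, x, y)   # pointer moved: restart the dwell timer
            fired = False
        elif not fired and t - anchor[0] >= dwell_ms:
            clicks.append((t, anchor[1], anchor[2]))
            fired = True
    return clicks


# Hypothetical fingertip track sampled every 100 ms: the finger moves,
# holds still at (50, 50) for 1.4 s, then moves away again.
samples = [(100 * i, 10.0 * i, 10.0 * i) for i in range(5)]
samples += [(500 + 100 * i, 50.0, 50.0) for i in range(15)]
samples += [(2000 + 100 * i, 50.0 + 10.0 * i, 50.0) for i in range(1, 5)]

clicks = detect_dwell_clicks(samples)
```

The `fired` flag ensures that a long hold produces a single click rather than one per frame; the trade-off of this style of interaction is that every click costs at least the dwell time.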
Malik and Laszlo (2004) presented Visual Touchpad. Visual Touchpad is a low-cost vision-
based input device that allows fluid two-handed interactions with a planar surface. Two
downward-pointing cameras are attached above a planar surface, and a stereo hand tracking
system provides the 3D positions of a user's fingertips on and above the plane. Thus, the
planar surface can be used as a multi-point touch-sensitive device, but with the added ability
to also detect hand gestures hovering above the surface. Figure 2.8 shows three example
configurations for the Visual Touchpad.
Figure 2.8: Visual Touchpad. (a) Desktop set-up. (b) Laptop setup. (c) Hand-held setup. (Figure reproduced
from (Malik and Laszlo, 2004).)
In a slightly broader context but also using image processing, Pingali et al. (2003) proposed
the concept of steerable interfaces for pervasive computing spaces. This type of interface
can be displayed (using projectors) on a wall, or on any surface near to the user, and it can
be displayed when the user needs it. Users' hands and heads are tracked using cameras, which
allows them to interact with the displayed interfaces by touching them. They built a 3D
environment designer that allows defining the geometry of the environment, indicating the
available surfaces (where interfaces could be displayed), and positioning cameras and projectors
inside the environment. With this data a geometrical model of the environment is built,
which makes it possible to work out the best location to display an interface depending on the
position and orientation of the user’s head. They described the technologies needed to realize this
concept, which include projecting images at will on different surfaces, visually tracking the
orientation and position of users' heads, tracking users' hands, and directing sound to different
places in the environment.
Figure 2.9: Steerable interfaces. (a) shows the everywhere display projector. (b) shows the office set-up.
(Figure reproduced from (Pingali et al., 2003).)
A system which presents some similarities to the HMD based VTS presented in Chapter 8 of
this thesis is ARKB (Lee and Woo, 2003). ARKB uses a video see-through HMD which
allows users to see a virtual keyboard laid horizontally (rendered on top of a fiducial
marker). Users can type on the keyboard with both hands. For that, users need to wear
markers on their fingers, which are recorded by two cameras that form part of the video see-through
HMD. Using stereovision, the system can detect when one of the markers is inside
the volume of a virtual key and consider the key as pressed. Figure 2.10 shows the user's
hand with the required markers, and the ARKB operation.
Figure 2.10: ARKB. (a) User's hand with markers. (b) User operating the ARKB. (Figure reproduced from (Lee
and Woo, 2003).)
Developments on interactive surfaces often use infrared cameras and infrared illumination
combined with image processing. Rekimoto and Matsushita (1997) presented HoloWall.
HoloWall uses an infrared camera located behind the wall and infrared illumination.
Information is projected on the wall (which is opaque) using a rear-projector (with IR-cut
filter). The user's shape or any objects in front of the wall are invisible to the camera.
However, when the user's hand or other objects get near enough to the wall, they reflect
infrared light allowing them to be detected. The system can detect when the user touches the
wall with either fingers, hands, their body, or even physical objects. Figure 2.11 shows the
HoloWall configuration, and two example applications.
Wilson (2004) developed TouchLight. TouchLight is a vision-based touch screen technology
which uses stereo image processing techniques to combine the output of two infrared
cameras placed behind a semi-transparent screen in front of the user. The image processing
allows the system to determine when an object is near the surface of the screen. A rear
projector (with IR-cut filter) projects information on the screen. Because of the exclusive
properties of the semi-transparent screen, TouchLight enables various unique applications,
such as video conferencing with gaze awareness and various forms of spatial displays. Figure
2.12 shows the TouchLight configuration and a TouchLight prototype. One year later,
Wilson (2005) developed PlayAnywhere. PlayAnywhere is a front-projected computer
vision-based interactive table. The system uses a projector to display information on a table.
An infrared camera together with infrared illumination is used to capture the user's hands.
Visual analysis of the hands' shadows results in the detection of clicks and drags on the
surface. Click and drag detection is demonstrated with the manipulation of various virtual
objects. Optical flow techniques are also used in order to enable panning, rotating, and
zooming of high-resolution maps. Figure 2.13 shows a PlayAnywhere prototype and an
example application.
Figure 2.11: HoloWall. (a) Configuration of the HoloWall. (b) A simple two-handed interface: a user
simultaneously manipulates two control points (and directions) on a Bézier curve. (c) A map
browser on a curtain: When a user touches a map projected on a curtain with one or two hands and
moves hands in the same direction, the map moves according to the hand movement. A user can
also expand or shrink the map by controlling the distance between two hands. (Figure reproduced
from (Rekimoto and Matsushita, 1997).)
Figure 2.12: TouchLight. (a) TouchLight physical configuration: DNP HoloScreen with two IR cameras and
IR illuminant behind screen. (b) TouchLight prototype. (Figure reproduced from (Wilson, 2004).)
Figure 2.13: PlayAnywhere. (a) PlayAnywhere prototype. (b) Flow field-based manipulation of objects applied
to panning, rotating, and zooming a high resolution map. (Figure reproduced from (Wilson, 2005).)
There are a number of "virtual keyboards" based on infrared cameras that have made it into
commercial products. One of them is the Canesta keyboard (Roeber et al., 2003). The
Canesta keyboard projects the image of a QWERTY keyboard onto any flat surface and
allows the user to input text by typing on the projected keys. The Canesta keyboard emits a
plane of infrared light slightly above the typing surface. A sensor module detects the
intersection of fingers with the infrared light. The data generated by the sensor module is
interpreted into mouse and keyboard events. Figure 2.14 shows an illustration of the Canesta
keyboard. Another virtual keyboard (this one still in the research phase) is the one proposed by
Du et al. (2005). This virtual keyboard is projected onto a flat surface in a similar way to the
Canesta keyboard, but it can project other information too. Key press detection is
achieved by analysing the depth maps from a 3D optical range camera. Figure 2.15 shows a
prototype of this 3D optical ranging virtual keyboard.
Figure 2.14: Canesta keyboard. (Figure reproduced from (Roeber et al., 2003).)
Figure 2.15: Virtual keyboard based on true-3D optical ranging. (Figure reproduced from (Du et al., 2005).)
Other interactive surface technologies are based on sensing elements deployed on the
interactive surface, as in the case of touch screens. One of these technologies is
DiamondTouch (Dietz and Leigh, 2001). DiamondTouch is an interactive tabletop which uses
a set of antennas embedded in it. A transmitter is connected to the table and a receiver is
connected to each user (typically through the chair on which they sit). When a user touches the
tabletop, the signal from the transmitter can travel from the tabletop, through the user, and
to the receiver attached to the user's chair. The position of contact can be found by processing
the receiver's signal. The system can detect simultaneous, multiple-point inputs from
various users and identify each user. This enables new applications for group
collaboration on a common surface. A projector above the tabletop is used to display
feedback information onto the table. Figure 2.16 shows DiamondTouch set-up and a
collaborative work environment.
Figure 2.16: DiamondTouch. (a) System set-up. (b) Collaborative work environment implemented with
DiamondTouch. (Figure reproduced from (Dietz and Leigh, 2001).)
Finally, there have been some attempts at creating interactive surfaces which are based on
wearable hardware. Those are generally concerned with the sensing of clicks and/or drags,
while the presentation of information and feedback is generally handled by the screen of a common
computer, PDA, or mobile phone. Osawa and Sugimoto (2002) use a data-glove in
order to recognize the user's hand movements. They implemented a VR keyboard
(immersive VR) with which they input Japanese characters. Fukumoto and Tonomura
(1997) created FingeRing. FingeRing uses accelerometers on each finger of the user's hand
in order to detect surface impacts. This allows symbols to be input by tapping various fingers
on a table, knee, or other surface. They implemented a chord keyboard. Figure 2.17 shows an
illustration of FingeRing. Kim et al. (2005) presented SCURRY. SCURRY is a wearable
input device developed by Samsung Advanced Institute of Technology. Based on inertial
sensors, this device allows a human operator to select a specified character (from a virtual
keyboard), an event, or operation through both hand motion and finger clicking. SCURRY
can also be used as a mouse, using the index, middle, and ring fingers as mouse buttons.
Figure 2.18 shows an illustration of SCURRY. Senseboard (2000) is another commercial
wearable hardware keyboard developed by Senseboard Technologies AB. It consists of two
rubber pads that slip onto the user's hands. Muscle movements in the palm are sensed and
translated into keystrokes with pattern recognition methods. Figure 2.19 shows an illustration
of Senseboard.
Figure 2.17: FingeRing. (Figure reproduced from (Fukumoto and Tonomura, 1997).)
Figure 2.18: SCURRY. (Figure reproduced from (Kim et al., 2005).)
Figure 2.19: Senseboard. (Figure reproduced from (Senseboard, 2000).)
2.6 Summary
In this chapter the basics about human hand anatomy and its modelling in computer systems
have been reviewed. The most important techniques for visual hand tracking have also been
reviewed and a number of their applications in HCI have been described. The starting point for
the visual hand tracking developed in this thesis is the work of Blake and Isard (1998) and
MacCormick and Isard (2000). They set up a framework for visual tracking of 2D contours.
From this starting point, this thesis further develops their work by enabling
efficient tracking of tree-structured articulated objects while using the hierarchical structure
of the object in the matching process. Skin colour is a common visual cue used in hand
trackers. It is generally fast to process and this allows real-time tracking. Typically, hand
trackers that are entirely based on skin colour information do not allow articulated tracking
(Bradski, 1998; Kolsch and Turk, 2005). In contrast, the visual hand tracking developed in
this thesis is entirely based on skin colour, which makes it very fast, and it is fully
articulated. The visual hand tracking developed in this thesis is designed for HCI, and in
particular for use in interactive surfaces.
This chapter has described the concept of interactive surface and several examples of
interactive surfaces have been reviewed. The vision-based interactive surfaces are attractive
with respect to other sensing technologies because they are flexible and avoid the need of
expensive and cumbersome hardware. The interactive surface proposed in Chapter 8,
referred to as Virtual Touch Screen (VTS), uses the visual hand tracking developed in this
thesis. This sensing technology is novel in comparison with the sensing technologies for
interactive surfaces seen in this chapter. Most of the vision based interactive surfaces
reviewed in this chapter are based on detecting (rather than tracking) when the user's hand or
fingers are near or inside the interactive surface. They often rely on the interaction of light
and the user's hand or fingers when these are in the proximity of the surface (shadows,
reflected IR light, etc). In contrast, the visual hand tracking used in the VTS does not rely on
such light interactions in the proximity of the surface, for it does not require a physical
surface to operate – it only requires a single video stream containing a hand mostly parallel
to the camera's image plane (which itself is parallel to the VTS). Probably ARKB (Lee and
Woo, 2003) is the most similar work to the VTS (when a HMD is used), although their
sensing technology employs stereovision and requires adorned hands. When the VTS's
display technology is based on a screen and a projector, TouchLight (Wilson, 2004) or
PlayAnywhere (Wilson, 2005) are probably the most similar works to the VTS, again using
different sensing technologies.
3 Hand contour tracking using particle filters and deformable templates
Particle filters, also known as Sequential Monte Carlo methods (SMC), are sophisticated
model estimation techniques based on simulation. Particle filters in conjunction with
deformable templates have been widely used in the past in order to track hand contours
(Cootes et al., 1995; Heap and Hogg, 1998; Bowden, 1999; Blake and Isard, 1998;
MacCormick and Isard, 2000). Blake and Isard established a framework to perform contour
tracking by using deformable templates together with a particle filter, known as the
Condensation filter. One of the strongest features of this framework is its ability to track
object contours against cluttered backgrounds, while still being able to keep the
computational requirements within a given resource.
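As a rough illustration of how a particle filter keeps its cost bounded, the sketch below runs one select, predict and measure iteration over a fixed-size particle set. It is a minimal sketch under stated assumptions: the state is a single scalar rather than a contour state vector, and the drift, diffusion and Gaussian observation model (`gaussian_likelihood`) are hypothetical stand-ins, not the models used in this thesis.

```python
import math
import random


def condensation_step(particles, weights, observe, rng, drift=0.0, noise=1.0):
    """One select/predict/measure iteration over a fixed-size particle set."""
    n = len(particles)
    # Select: resample n particles with probability proportional to weight.
    selected = rng.choices(particles, weights=weights, k=n)
    # Predict: assumed dynamics, a deterministic drift plus random diffusion.
    predicted = [p + drift + rng.gauss(0.0, noise) for p in selected]
    # Measure: re-weight each particle by the observation likelihood.
    new_weights = [observe(p) for p in predicted]
    total = sum(new_weights)
    return predicted, [w / total for w in new_weights]


def gaussian_likelihood(measurement, sigma):
    """Hypothetical observation model; the small constant avoids zero weights."""
    return lambda p: math.exp(-0.5 * ((p - measurement) / sigma) ** 2) + 1e-12


rng = random.Random(0)
particles = [rng.uniform(-10.0, 10.0) for _ in range(200)]
weights = [1.0 / 200] * 200

for _ in range(10):
    particles, weights = condensation_step(
        particles, weights, gaussian_likelihood(3.0, 0.5), rng)

# Weighted mean of the particle set as a point estimate of the state.
estimate = sum(p * w for p, w in zip(particles, weights))
```

The number of particles stays constant across iterations, so the computation per frame is fixed in advance; the price is that accuracy depends on how well that fixed number of particles covers the state space.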
This chapter presents this framework, emphasising its application to hand contour tracking.
Firstly, the basic elements of the framework are presented, and their use within the
Condensation filter explained. Secondly, we will see how this framework can be expanded in
order to track articulated objects. Thirdly, we investigate the problems that arise when
tracking tree-structured articulated objects. Finally, particle interpolation is proposed as a
method to overcome these problems.
3.1 Deformable templates
The tracking techniques used in this thesis are based on modelling the contour of the target
object by using B-spline curves (Blake and Isard, 1998). These B-spline curves are handled
in a specific way by using a configuration or state vector of relatively few degrees of
freedom. In this context a B-spline curve is referred to as a contour, referring to its shape, or as
a deformable template, often referring to both its shape and its capability to change
according to a state vector. Blake and Isard presented a framework for contour tracking,
which relies upon the use of deformable templates. This section presents a brief introduction
to this part of Blake and Isard's framework.
A B-spline curve is a parametric curve that allows representation of a smooth, natural-looking
shape by specifying only a small number of "control points". If the coordinates of the
control points are $(x_1, y_1), \ldots, (x_n, y_n)$ then the B-spline is a curve
$(x(s), y(s))^T$ parameterised by a real variable $s$ on an interval of the real line:

$$\begin{pmatrix} x(s) \\ y(s) \end{pmatrix} = B(s) \begin{pmatrix} \vec{x} \\ \vec{y} \end{pmatrix} \qquad (3.1)$$

where $B(s)$ is a $2 \times 2n$ matrix, called the metric matrix, whose entries are B-spline basis
functions (polynomials in $s$) and $\vec{x}, \vec{y}$ are $n \times 1$ column vectors containing the x- and
y-coordinates of the control points respectively. By convention, we will call such a B-spline
curve a contour. Figure 3.1 shows the contour of a hand's middle finger; the contour uses 8
control points, indicated by blue crosses.
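As an illustration of Equation (3.1), the following minimal sketch evaluates a point on a contour from its control points. It assumes a closed, uniform cubic B-spline; the thesis does not state the spline order or knot placement at this point, so these choices, and the names `cubic_bspline_row` and `contour_point`, are hypothetical.

```python
import math

def cubic_bspline_row(s, n):
    """Basis weights at parameter s for a closed uniform cubic B-spline (assumed form).

    The returned list plays the role of one row of B(s): its dot product with the
    control-point x-coordinates gives x(s), and likewise for the y-coordinates.
    """
    span = math.floor(s)            # index of the polynomial segment containing s
    u = s - span                    # local parameter in [0, 1)
    blend = [
        (1 - u) ** 3 / 6,
        (3 * u ** 3 - 6 * u ** 2 + 4) / 6,
        (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6,
        u ** 3 / 6,
    ]
    row = [0.0] * n
    for k, weight in enumerate(blend):
        row[(span + k) % n] += weight   # wrap around because the contour is closed
    return row

def contour_point(s, xs, ys):
    """Evaluate (x(s), y(s)) from control-point coordinate lists xs and ys."""
    row = cubic_bspline_row(s, len(xs))
    x = sum(b * cx for b, cx in zip(row, xs))
    y = sum(b * cy for b, cy in zip(row, ys))
    return x, y
```

Only four basis weights are non-zero for any value of s, which is why moving one control point changes the contour only locally.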
A contour can use a large number, n, of control points to define its shape. The vector space
of such a contour will have an undesirably large number (2n) of dimensions. In order to
manage the contour more easily, a vector subspace, termed shape space and denoted by X, is
defined. An element $\mathbf{x} \in X$ is related to the control point coordinates $\vec{x}, \vec{y}$ by a linear
transformation with a fixed offset:

$$\begin{pmatrix} \vec{x} \\ \vec{y} \end{pmatrix} = W\mathbf{x} + Q_0 \qquad (3.2)$$

If the shape space has d dimensions, then $W$ is the $2n \times d$ shape matrix, $\mathbf{x}$ is a $d \times 1$ column
vector generally referred to as the configuration or state of the contour, and $Q_0$ is a $2n \times 1$ vector
called the template for this object. Given a template for a rigid planar object, it is easy to
define a shape space X corresponding to translations, 2D Euclidean similarities, or affine
transformations of the template $Q_0$. The final configuration of the contour, defined by the
control point coordinates $\vec{x}, \vec{y}$, can be controlled by changing the state vector $\mathbf{x}$.
For example, a shape matrix that represents a 2D affine transformation of the template $Q_0$
can be defined as:

$$W = \begin{pmatrix} \mathbf{1} & \mathbf{0} & Q_0^x & \mathbf{0} & \mathbf{0} & Q_0^y \\ \mathbf{0} & \mathbf{1} & \mathbf{0} & Q_0^y & Q_0^x & \mathbf{0} \end{pmatrix} \qquad (3.3)$$

where $\mathbf{0} = (0, 0, \ldots, 0)^T$, $\mathbf{1} = (1, 1, \ldots, 1)^T$, and the template $Q_0$ decomposes as $(Q_0^x, Q_0^y)^T$,
$Q_0^x$ being the x-coordinates of the template control points, and $Q_0^y$ the y-coordinates of the
template control points. Using the shape matrix in Equation (3.3), some examples of
transformations are:
1. $\mathbf{x} = (0, 0, 0, 0, 0, 0)^T$ represents the original template shape $Q_0$.
2. $\mathbf{x} = (1, 0, 0, 0, 0, 0)^T$ represents the template translated 1 unit to the right.
3. $\mathbf{x} = (0, 0, 1, 1, 0, 0)^T$ represents the template doubled in size.
4. $\mathbf{x} = (0, 0, \cos\theta - 1, \cos\theta - 1, -\sin\theta, \sin\theta)^T$ represents the template rotated through angle θ.
5. $\mathbf{x} = (0, 0, 1, 0, 0, 0)^T$ represents the template doubled in width.
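The examples above can be checked numerically. The sketch below applies Equation (3.2) with the planar affine shape matrix written out row by row, assuming the six state components are ordered as (x-translation, y-translation, and the four coefficients that multiply the template coordinates). The function name `apply_shape` and the unit-square template are hypothetical.

```python
import math

def apply_shape(template_x, template_y, state):
    """Control-point coordinates W @ state + Q0 for a planar affine shape space.

    The first row block of W is (1, 0, Q0x, 0, 0, Q0y) and the second is
    (0, 1, 0, Q0y, Q0x, 0), so each product is expanded by hand here.
    """
    tx, ty, a, b, c, d = state
    xs = [tx + (1 + a) * qx + d * qy for qx, qy in zip(template_x, template_y)]
    ys = [ty + (1 + b) * qy + c * qx for qx, qy in zip(template_x, template_y)]
    return xs, ys

# Hypothetical template: the four corners of a unit square.
square_x = [0, 1, 1, 0]
square_y = [0, 0, 1, 1]

translated = apply_shape(square_x, square_y, (1, 0, 0, 0, 0, 0))
doubled = apply_shape(square_x, square_y, (0, 0, 1, 1, 0, 0))
widened = apply_shape(square_x, square_y, (0, 0, 1, 0, 0, 0))

theta = math.pi / 2
rotated = apply_shape(
    square_x, square_y,
    (0, 0, math.cos(theta) - 1, math.cos(theta) - 1, -math.sin(theta), math.sin(theta)),
)
```

Because the mapping is linear in the state, a zero state vector always reproduces the template itself.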
Contours of this type, together with their capability to change, are also known as deformable
templates. The deformable templates described so far are linearly parameterised by their state
vector. This linear parameterisation is useful later in contour tracking because it simplifies
the fitting algorithms and avoids problems with local minima. However, there exist other types
of deformable templates that are not linearly parameterised. Yuille and Hallinan use a
geometrical parameterisation of their contours (Yuille and Hallinan, 1992), and MacCormick and
Isard use a non-linear parameterisation of their articulated hand contour, which includes joint
angles (MacCormick and Isard, 2000; Isard and MacCormick, 2000). The hand contour
model used in this thesis uses the latter non-linear parameterisation.
Figure 3.1: A B-spline contour fitted to the middle finger of a hand. The contour uses 8 control points,
shown here as blue crosses.
3.2 Measurement model
The tracking algorithms presented in the following sections need to measure how well a hand
contour fits the hand in an image. Within the tracking framework presented by Blake and
Isard, the fitness of a hypothesised contour configuration to the image features is expressed
as a conditional probability, referred to as the contour likelihood. If $\mathbf{x}_t$ is the state of a
contour (modelled object) at time t, and $\mathbf{Z}_t$ is the set of image features at time t, then the
likelihood of that contour representing the true object configuration is the conditional
probability $p(\mathbf{Z}_t|\mathbf{x}_t)$. This contour likelihood can be calculated in several ways, and a
detailed discussion about various ways of calculating it is given by MacCormick (2000).
Blake and Isard's approach to calculating a contour likelihood uses a set of line segments
normal to the contour at several informative points, the measurement points. These line
segments, termed measurement lines, are processed in order to find image features along
them. Each of the features found along the measurement lines has a measurable contribution
towards the contour likelihood. Figure 3.2 shows a hypothesised mouse-shaped contour and
several normal measurement lines. Along the measurement lines there are some black dots,
which represent the found features. This method of calculating a contour likelihood
represents a large saving of processing time in comparison to the original implementations of
active contours, or "snakes" (Kass et al., 1987), in which processing is performed on the
entire image, and the resulting edge-map is used as an energy surface across which the
contour moves.
In Blake and Isard's approach, the contribution towards the contour likelihood of each of the
features found along a measurement line is calculated using a Gaussian centred on the
measurement point. Figure 3.3 depicts the measurement process for a measurement line. The
measurement line is normal to the hypothesised contour at the measurement point. The three
features found along the measurement line, black dots, correspond (from left to right) to: a
feature outside the target object due to cluttered background; the real contour position; and a
feature internal to the target object, due to it having a non-homogeneous texture. The feature
corresponding to the real contour has the highest Gaussian value, out of the three features,
therefore this feature will constitute the largest contribution to the measurement point
likelihood.
Figure 3.2: Measurement lines distributed along a contour. The thick white line represents a hypothesised
configuration of a mouse-shaped contour. The thin lines are measurement lines. The black dots
represent the detected features along the measurement line. (Figure reproduced from (MacCormick,
2000).)
Figure 3.3: Measurement line normal to a hypothesized contour.
The measurement point likelihood as formulated by Blake and Isard has the form $p(z|x)$,
where $z$ is the set of features found along the measurement line¹, and $x$ is
the position of the hypothesised contour on the measurement line:

$$p(z|x) \propto 1 + \frac{1}{\sqrt{2\pi}\,\sigma\alpha} \sum_m \exp\left(-\frac{\nu_m^2}{2\sigma^2}\right) \qquad (3.4)$$

where σ is the standard deviation of the Gaussian; $\alpha = q\lambda$, where q is the probability that the
contour of the target object is not visible and λ is the spatial density of the background clutter
(following a Poisson process along the line); and $\nu_m = z_m - x$ is the distance between the m-th
feature found on the measurement line and the position x of the hypothesised contour on the
measurement line. Generally x is at the midpoint of the measurement line.
In practice, considerable economy can be applied when evaluating (3.4); it is not necessary
to include those features among $z_1, \ldots, z_m$ for which $\nu_m$ results in:

$$\frac{1}{\sqrt{2\pi}\,\sigma\alpha} \exp\left(-\frac{\nu_m^2}{2\sigma^2}\right) \ll 1 \qquad (3.5)$$
1 The specific image processing techniques employed to process the measurement lines and detect image features are described in (Blake and Isard, 1998).
These features bring a negligible contribution to the measurement point likelihood. Then
(3.4) can be simplified as:

$$p(z|x) \propto \exp\left(-\frac{1}{2\sigma^2} f(\nu_1; \mu)\right) \qquad (3.6)$$

where $f(\nu; \mu) = \min(\nu^2, \mu^2)$, $\mu = \sqrt{2}\,\sigma \sqrt{\log\left(1 / (\sqrt{2\pi}\,\alpha\sigma)\right)}$ is a spatial scale constant, and $\nu_1$ is
the $\nu_m$ lying closest to the hypothesised position x.
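To make the relationship between Equations (3.4) and (3.6) concrete, here is a minimal sketch that evaluates both unnormalised forms for one measurement line. The function names are hypothetical, positions are scalar offsets along the line, and treating a line with no detected features as saturated at μ is an assumption of this sketch rather than something stated in the text.

```python
import math

def clutter_scale(sigma, alpha):
    """Spatial scale mu: the distance at which the Gaussian term of (3.4) falls to 1."""
    k = 1.0 / (math.sqrt(2.0 * math.pi) * sigma * alpha)
    return math.sqrt(2.0) * sigma * math.sqrt(math.log(k))

def full_likelihood(features, x, sigma, alpha):
    """Unnormalised measurement-point likelihood of Equation (3.4)."""
    k = 1.0 / (math.sqrt(2.0 * math.pi) * sigma * alpha)
    return 1.0 + k * sum(math.exp(-((z - x) ** 2) / (2.0 * sigma ** 2)) for z in features)

def truncated_likelihood(features, x, sigma, alpha):
    """Unnormalised simplified likelihood of Equation (3.6), using the nearest feature only."""
    mu = clutter_scale(sigma, alpha)
    # Assumption: with no features on the line, the offset saturates at mu.
    nu1 = min((abs(z - x) for z in features), default=mu)
    return math.exp(-min(nu1 ** 2, mu ** 2) / (2.0 * sigma ** 2))
```

Features further than μ from the hypothesised position all yield the same value, so distant clutter cannot pull the likelihood down any further.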
In order to calculate the likelihood of the entire contour two assumptions have to be made:
first, the observations $z_m$ are assumed to be mutually independent; second, the contour's
likelihood depends only on the object's configuration at the current time-step. Then, the
likelihood of the entire contour at time t is the product of the likelihoods of the
M measurement points:

$$p(\mathbf{Z}_t|\mathbf{x}_t) = \prod_{m=1}^{M} p(z_m|x_m) \qquad (3.7)$$
This can be computed using the following simplified form:

$$p(\mathbf{Z}_t|\mathbf{x}_t) \propto \exp\left(-\frac{1}{2\sigma^2} \sum_{m=1}^{M} f(z_1(s_m) - r(s_m); \mu)\right) \qquad (3.8)$$

where $s_m$ is the m-th measurement point; $z_1(s)$ is the closest feature to the hypothesised
contour along the measurement line s; and $r(s)$ is the hypothesised contour position along
the measurement line s. According to Equation (3.8), the parameters that control the
computation of the contour likelihood are μ, σ and M. μ controls the clutter-resistance of
the tracker: if an object is expected to lie in a clutter-free environment then µ can be set
quite large, and as clutter density increases its value should decrease accordingly. σ should
be set according to the accuracy of the shape model. If the expected object appearance is
very well modelled by the shape space then a small value of σ can be used since features can
be expected to be found very close to the predicted curve. If however the shape model is
inaccurate, a larger value of σ will permit tracking of shapes that are not exactly within the
modelled space, while increasing the risk of distraction by clutter. The time spent in
calculating the contour likelihood depends largely on the number of measurement points
along the contour, so a large value of M will slow down the calculation. Judicious
positioning of the measurement points $s_m$ at informative points along the contour, rather than
spacing the $s_m$ evenly, can allow a smaller M for an equivalent performance.
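A minimal sketch of Equation (3.8) follows, assuming the offset between the nearest feature and the hypothesised contour has already been measured on each of the M lines. Working in log space and the `None` convention for a line without features are choices made for this sketch, and the function names are hypothetical.

```python
import math

def contour_log_likelihood(nearest_offsets, sigma, mu):
    """Logarithm of Equation (3.8), up to an additive constant.

    nearest_offsets holds one entry per measurement line: the signed offset
    z_1(s_m) - r(s_m), or None when no feature was found on that line.
    """
    total = 0.0
    for offset in nearest_offsets:
        if offset is None:
            total += mu * mu                  # assumption: saturate when nothing is found
        else:
            total += min(offset * offset, mu * mu)
    return -total / (2.0 * sigma ** 2)

def normalised_weights(log_likelihoods):
    """Turn per-contour log-likelihoods into weights that sum to 1."""
    highest = max(log_likelihoods)
    raw = [math.exp(value - highest) for value in log_likelihoods]   # avoids underflow
    total = sum(raw)
    return [value / total for value in raw]
```

The cost is linear in M, which matches the remark above that a large number of measurement points slows the calculation.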
3.3 The Condensation algorithm applied to visual contour tracking
At the core of the tracking techniques used in this thesis there is a particle filter known as
Condensation (Isard and Blake, 1998a). Particle filters have been used in a diverse range of
applied sciences, and the basic algorithms were discovered independently by researchers in
several of these disciplines. In the field of computer vision, and in particular contour
tracking, Isard and Blake made an important contribution by introducing the Condensation
algorithm. This section briefly describes this algorithm.
In visual contour tracking the task is to find the contour configuration of the target object
throughout T frames of a video sequence. The contour configuration of the target object at
frame t is denoted as $\mathbf{x}_t$, with $t = 1, \ldots, T$. In order to find information about the target object a
number of measurements are made at each frame, calculating the likelihoods for the
hypothesised contours; the measurements at frame t are denoted as $\mathbf{Z}_t$. The measurements
acquired up to frame t are denoted as $\mathcal{Z}_t$:

$$\mathcal{Z}_t = \{\mathbf{Z}_1, \ldots, \mathbf{Z}_t\}$$
The information of interest for the location of the target object is expressed as a conditional
probability $p_t(\mathbf{x}_t|\mathcal{Z}_t)$; this is the probability of a hypothesised contour given the history of
measurements. However, in general it is difficult to calculate $p_t(\mathbf{x}_t|\mathcal{Z}_t)$ directly. For this
reason Bayes' theorem is applied at each time-step, obtaining a posterior
$p_t(\mathbf{x}_t|\mathcal{Z}_t)$ based on all available information:

$$p_t(\mathbf{x}_t|\mathcal{Z}_t) = \frac{p_t(\mathbf{Z}_t|\mathbf{x}_t)\, p_{t-1}(\mathbf{x}_t|\mathcal{Z}_{t-1})}{p_t(\mathbf{Z}_t)} \qquad (3.9)$$

where $p_{t-1}(\mathbf{x}_t|\mathcal{Z}_{t-1})$ is called the prior, and $p_t(\mathbf{Z}_t|\mathbf{x}_t)$ is the observation density. As usual in
filtering theory, a model for the expected motion between time-steps is adopted. This takes
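the form of a conditional probability distribution $p_t(\mathbf{x}_t|\mathbf{x}_{t-1})$ termed the dynamics. Using the
dynamics, (3.9) can be re-written as:

$$p_t(\mathbf{x}_t|\mathcal{Z}_t) = \frac{p_t(\mathbf{Z}_t|\mathbf{x}_t) \int_{\mathbf{x}_{t-1}} p_t(\mathbf{x}_t|\mathbf{x}_{t-1})\, p_{t-1}(\mathbf{x}_{t-1}|\mathcal{Z}_{t-1})\, d\mathbf{x}_{t-1}}{p_t(\mathbf{Z}_t)} \qquad (3.10)$$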
This is the equation that a filter must calculate or approximate. It suggests a
recursive implementation: the conditional probability of the target configuration at time t can
be approximated as a sum of the previous conditional probabilities of the target configuration
multiplied by the dynamics, all weighted by the observation density. $p_t(\mathbf{Z}_t)$ is generally a
constant independent of $\mathbf{x}_t$ for a given image, so it can be neglected where
only relative likelihoods need to be considered. This is what filters such as the Kalman filter do
(Gelb, 1974; Welch and Bishop, 2002). In a Kalman filter the observation density is assumed
Gaussian, and the target configuration evolves as a Gaussian pulse throughout the tracking
task. However, it is an empirical fact that the observation densities occurring in visual
tracking problems are not at all Gaussian; this was the original motivation for Isard and
Blake to introduce the Condensation algorithm.
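The recursion in Equation (3.10) can be illustrated on a small finite state space, where the integral becomes a sum. This is only a sketch of the equation itself, not of Condensation: the three-state example and the name `bayes_filter_step` are hypothetical.

```python
def bayes_filter_step(prior, dynamics, likelihood):
    """One recursion of Equation (3.10) on a finite state space.

    prior[j]       : posterior at the previous time-step for state j
    dynamics[j][i] : probability of moving from state j to state i
    likelihood[i]  : observation density of the current image given state i
    """
    n = len(prior)
    # The integral over the previous state becomes a sum over j.
    predicted = [sum(dynamics[j][i] * prior[j] for j in range(n)) for i in range(n)]
    unnormalised = [likelihood[i] * predicted[i] for i in range(n)]
    evidence = sum(unnormalised)        # plays the role of the denominator
    return [value / evidence for value in unnormalised]

# Hypothetical example: the target either stays put or moves one state to the right.
prior = [1.0, 0.0, 0.0]
dynamics = [[0.5, 0.5, 0.0],
            [0.0, 0.5, 0.5],
            [0.5, 0.0, 0.5]]
likelihood = [0.1, 0.9, 0.5]
posterior = bayes_filter_step(prior, dynamics, likelihood)
```

Since the denominator only rescales the result, the unnormalised products already rank the hypotheses correctly.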
The fundamental idea behind Condensation is simply to simulate Equation (3.10). The
simulation uses the idea of a weighted particle set: this is a list of n pairs $(\mathbf{x}_i, \pi_i)$, $i = 1, \ldots, n$,
where $\mathbf{x}_i \in X$ (the configuration space) and $\pi_i \in [0, 1]$ is a weight with $\sum_{i=1}^{n} \pi_i = 1$. The
weighted particle set approximates the conditional density $p_t(\mathbf{x}_t|\mathcal{Z}_t)$ at time t. Figure 3.4
shows a weighted particle set approximating a probability density. The particles are the grey
ellipses and their weight is proportional to their area. Picking one of the particles is
approximately the same as drawing randomly from the continuous probability function
shown. One of the strengths of this weighted particle set representation is that it allows the
representation of multimodal distributions.
Figure 3.4: Weighted particle set approximation of a probability density. (Figure reproduced from (Isard
and Blake, 1998a).)
In the context of contour tracking, each particle constitutes a hypothesised contour
configuration $\mathbf{x}_i$, and its weight $\pi_i$ is the likelihood of the contour representing the real
target object. Figure 3.5 illustrates this using a hand as the target object. The figure shows
three hand contours, in different colours, which are the graphical representation of three
particles from a particle set. The thickness of each contour is proportional to the weight of
the particle.
Figure 3.5: Graphical representation of three particles from a particle set. The thickness of the hand
contours is proportional to the weight of the particle.
The application of certain operations on the particle set for each time-step allows simulating
the evolution in time of the conditional density $p_t(\mathbf{x}_t|\mathcal{Z}_t)$. The repetition of these operations
at each time-step is what constitutes the Condensation algorithm. These operations on the
particle set are called resampling, prediction, and measurement. Each of these
operations is briefly described next:
Resampling
The first operation on the particle set is to sample (with replacement) n times. The particles
are chosen with probability equal to their weight $\pi_i$. This can be done efficiently with the
use of cumulative probabilities. Some particles, especially those with high weights, may be
chosen several times, leading to identical copies of elements in the new set. Others with
relatively low weights may not be chosen at all. After the resampling operation, the resulting
particles do not have a weight. The particles are endowed with a new weight at the
measurement step.
There are various methods to perform this sampling operation. Kitagawa (1996) used
deterministic and stratified sampling methods; and in the visual tracking field, Blake and
Isard experimented with both random and deterministic sampling methods. Independently of
the method used for the resampling, an important fact is that resampling should not affect the
distribution represented by the particle set. This fact allowed MacCormick (2000)
to complete the proof that Condensation is correct.
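The resampling operation with cumulative probabilities can be sketched as follows. This is the plain random (multinomial) variant, not the deterministic or stratified schemes mentioned above, and the function name `resample` is hypothetical.

```python
import bisect
import itertools
import random

def resample(particles, weights, rng=random):
    """Draw len(particles) particles with replacement, each with probability equal to its weight."""
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]
    chosen = []
    for _ in particles:
        u = rng.random() * total
        # First index whose cumulative weight exceeds u; a binary search, so O(log n).
        index = bisect.bisect_right(cumulative, u)
        chosen.append(particles[min(index, len(particles) - 1)])
    return chosen
```

The returned particles carry no weights; new weights are assigned later, in the measurement step.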
Prediction
Each of the particles of the new resampled particle set (hypotheses of contour
configurations) evolves from time-step to time-step following certain dynamics. Applying
these dynamics to a particle is referred to as prediction. In this thesis the dynamics used for
the particles follow the second-order auto-regressive processes (ARPs) described in (Blake
and Isard, 1998). A second-order ARP model expresses the state $\mathbf{x}_t$ at time t as a linear
combination of the previous two states and some Gaussian noise:

$$\mathbf{x}_t = A_2 \mathbf{x}_{t-2} + A_1 \mathbf{x}_{t-1} + B\mathbf{w} \qquad (3.11)$$

where $A_1, A_2$ are fixed $d \times d$ matrices which represent the deterministic components of the
dynamics; $B$ is also a fixed $d \times d$ matrix that represents the stochastic component of the
dynamics; $\mathbf{w}$ is a $d \times 1$ vector of independent random normal $N(0, 1)$ variates. The values of
$A_1, A_2$, and $B$ in this thesis have been found empirically.
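A minimal sketch of the prediction step of Equation (3.11) is given below. The matrix values in the usage lines are illustrative only (a constant-velocity model with the noise switched off); the thesis states that its own values were found empirically and does not list them here. The helper names are hypothetical.

```python
import random

def mat_vec(matrix, vector):
    """Multiply a d x d matrix (list of rows) by a length-d vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def arp_predict(x_prev2, x_prev1, a2, a1, b, rng=random):
    """One second-order ARP step: A2 x_{t-2} + A1 x_{t-1} + B w, with w ~ N(0, I)."""
    w = [rng.gauss(0.0, 1.0) for _ in x_prev1]
    drift = [p + q for p, q in zip(mat_vec(a2, x_prev2), mat_vec(a1, x_prev1))]
    diffusion = mat_vec(b, w)
    return [d + n for d, n in zip(drift, diffusion)]

# Illustrative values: A1 = 2I and A2 = -I extrapolate the last displacement.
a1 = [[2, 0], [0, 2]]
a2 = [[-1, 0], [0, -1]]
no_noise = [[0.0, 0.0], [0.0, 0.0]]
predicted = arp_predict([0, 3], [1, 5], a2, a1, no_noise)
```

With a non-zero B, repeated calls from the same two previous states spread the particles around the drifted position.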
Looking at the particle set as a whole, the prediction operation can be interpreted as
convolving the particle set with the dynamics $p_t(\mathbf{x}_t|\mathbf{x}_{t-1})$ in order to produce a new particle
set (MacCormick, 2000).
Measurement
At this point, the particle set is composed of n new particles whose configuration has been
predicted, according to their dynamics, from the original particles. However, the particles do
not have a weight yet. At this step, the image features are analysed in order to calculate the
likelihood of each particle (hypothesised contour) representing the target object. The
likelihood of a particle will constitute the new weight for that particle. The contours'
likelihood can be calculated as described in Section 3.2. Once the weights for all the particles
are generated, these weights are normalised so that their sum equals 1.
Looking at the particle set as a whole, the measurement operation can be interpreted as
multiplying the particle set by the observation density $p_t(\mathbf{Z}_t|\mathbf{x}_t)$ in order to produce a new
particle set (MacCormick, 2000).
Figure 3.6 depicts one time-step in the Condensation algorithm. The time-step begins with a
weighted particle set of size n from the previous time-step t-1, Figure 3.6(a). This particle set
is resampled in order to produce another set of n particles, and dynamics are applied to each
of the new particles, Figure 3.6(b). Finally, the contour likelihood is calculated for each of
the new particles. This is equivalent to multiplying the particle set by the observation density
$p_t(\mathbf{Z}_t|\mathbf{x}_t)$, Figure 3.6(c). The result is a new weighted particle set that will be used as a
starting point in the following time-step.
After any time-step of the Condensation algorithm, it is possible to "report" on the current
state, for example by evaluating some moment of the state density. In this thesis, the reported
state at each time-step will be that of the particle with highest weight.
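The three operations and the reporting rule can be put together in a short sketch of one time-step. To stay small it tracks a scalar state with first-order dynamics (a drift function plus Gaussian diffusion) instead of the second-order ARP on contour states used in the thesis; the function names and the `measure` callback are hypothetical.

```python
import math
import random

def condensation_step(particles, weights, measure, drift, diffusion, rng):
    """One time-step: resampling, prediction, then measurement with normalisation."""
    n = len(particles)
    # Resampling: draw n particles with replacement, in proportion to their weights.
    resampled = rng.choices(particles, weights=weights, k=n)
    # Prediction: deterministic drift plus Gaussian diffusion (simplified dynamics).
    predicted = [drift(x) + diffusion * rng.gauss(0.0, 1.0) for x in resampled]
    # Measurement: the likelihood of each hypothesis becomes its new weight.
    raw = [measure(x) for x in predicted]
    total = sum(raw)
    return predicted, [value / total for value in raw]

def report(particles, weights):
    """Reported state: the particle with the highest weight."""
    best = max(range(len(weights)), key=weights.__getitem__)
    return particles[best]

# Hypothetical usage: a likelihood peaked at 3.0 with a small floor for clutter.
rng = random.Random(0)
particles = [float(i) for i in range(10)]
weights = [0.1] * 10
measure = lambda x: 1e-3 + math.exp(-((x - 3.0) ** 2) / (2.0 * 0.5 ** 2))
particles, weights = condensation_step(particles, weights, measure, lambda x: x, 0.2, rng)
estimate = report(particles, weights)
```

The size of the particle set fixes the cost of a time-step, since each operation visits every particle once.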
A remarkable quality of the Condensation algorithm is its simplicity, in comparison to other
algorithms such as the Kalman filter, despite its generality. The Condensation algorithm is
capable of tracking a multimodal distribution, and capable of doing this within a given
computational resource, determined by the size of the particle set. An interesting advantage of
the Condensation algorithm over other algorithms such as the Kalman filter is that Condensation
avoids overshooting, as it is just testing hypotheses, and in real life there are no overshoots.
Figure 3.6: One time-step in the Condensation algorithm. (Figure adapted from (Isard, 1998)).
3.4 Articulated tracking
It is possible to use a Condensation filter to visually track articulated objects. In this case, a
particle from the filter has to contain the configuration of a suitable articulated contour. This
configuration vector will include the deformation parameters for each of the links plus the
parameters that define the relationship between links. Then, a Condensation filter could be
used to track the contour of that articulated object.
However, the size of the configuration vector for many articulated objects of interest, such as
the human body or the human hand, is too large (20-40 degrees of freedom) to be dealt
with directly by a Condensation filter. As the dimension of a particle's configuration space in a
Condensation filter increases, the number of particles needed to explore this configuration
space increases exponentially for a given level of performance, rendering the Condensation
filter ineffective. One possible way to limit this increase in dimensionality is to use
PCA to reduce the number of dimensions needed, as described by Isard (1998). However,
even after this reduction in dimensions the configuration space may still be too big to be
explored using Condensation. Fortunately, there exists a technique that can make the tracking
of articulated objects tractable: partition sampling.
3.4.1 Partition sampling
A technique called partition sampling was introduced by MacCormick and Blake (1999) for
avoiding the high cost of particle filters when tracking more than one object. Later, this
technique was used by MacCormick and Isard (2000) to implement a vision based articulated
hand tracker. This technique makes it possible to deal with larger configuration spaces,
provided that certain conditions are met, such as the ones found in articulated object
tracking.
Partition sampling is based on a hierarchical decomposition of the problem's configuration
space X. The problem's configuration space is divided into a number of smaller
configuration spaces called partitions $X_1, \ldots, X_k$, where the posterior in a partition $X_j$
determines the search scope of the following partition $X_{j+1}$. Each partition $X_j$ has an
associated particle set $S_j$, the size of which is related to the number of dimensions of that
partition: fewer dimensions means far fewer particles for a given level of coverage. As a
result, the sum of the sizes of the particle sets of each partition is much smaller than the size
of an equivalent particle set that could cover the original configuration space, so far fewer
particles are needed to achieve the same level of performance.
From a general point of view, the objective of partition sampling is to use one's intuition
about the problem to choose a beneficial decomposition of the configuration space, the
dynamics, and the measurement function.
An intuitive idea of how partition sampling operates can be gained by considering the
example in Figure 3.7. The figure shows a two-dimensional configuration space (x,y), in
which there is a peak of a 2D likelihood function. Here, the original two-dimensional
configuration space has been decomposed into two partitions, one for the x coordinate, and
another for the y coordinate. Therefore, in order to locate this peak the search is split into two
stages. In the first stage, the x coordinate is explored. As a result, an area of high likelihood
in the x coordinate, the grey shaded area in Figure 3.7, is located. In the second stage, the y
coordinate is explored, but only for the particles that constituted the area of high likelihood
in the x coordinate. As a result, an area of high likelihood in the y coordinate, the hatched
area in Figure 3.7, is located. In this way the peak is located in two stages, and the combined
number of particles required to explore the two partitions is smaller than the number of
particles that would be required to directly explore the original configuration space (x, y).
Figure 3.7: An intuitive partition sampling example. The two-dimensional configuration space is divided into
two partitions, one for the x coordinate, and another for the y coordinate.
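The two-stage search of Figure 3.7 can be mimicked with a small sketch. A deterministic grid stands in for the particles so that the result is reproducible, and the likelihood, its peak position, and the x-only weighting function are all hypothetical choices for this illustration.

```python
import math

PEAK = (0.65, 0.35)      # hypothetical location of the likelihood peak
WIDTH = 0.05

def likelihood(x, y):
    """Hypothetical 2D likelihood with a single Gaussian peak."""
    return math.exp(-((x - PEAK[0]) ** 2 + (y - PEAK[1]) ** 2) / (2.0 * WIDTH ** 2))

def weight_x(x):
    """Hypothetical weighting function that scores the x coordinate alone."""
    return math.exp(-((x - PEAK[0]) ** 2) / (2.0 * WIDTH ** 2))

def grid(n):
    return [(i + 0.5) / n for i in range(n)]

def two_stage_search(n, keep):
    """Explore x first, then explore y only for the best `keep` x values."""
    best_xs = sorted(grid(n), key=weight_x, reverse=True)[:keep]     # first stage
    candidates = [(x, y) for x in best_xs for y in grid(n)]          # second stage
    best = max(candidates, key=lambda p: likelihood(*p))
    evaluations = n + keep * n
    return best, evaluations

best, evaluations = two_stage_search(10, 2)
```

With these settings the peak is found after 30 evaluations, whereas covering the same two-dimensional grid directly would take 100.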
Partition sampling can be used in a problem if the following three conditions hold
(MacCormick and Isard, 2000):
• The configuration space, X, can be partitioned as a Cartesian product $X = X_1 \times \ldots \times X_k$.
• The dynamics, h, can be decomposed as $h = h_1 \ast \ldots \ast h_k$ with each $h_j$ acting on
$X_j \times \ldots \times X_k$. The symbol $\ast$ denotes convolution.
• There are weighting functions $g_1, g_2, \ldots, g_{k-1}$ with each $g_j$ peaked in the same region as
the posterior restricted to $X_j$.
These conditions hold for articulated tracking. An example is used to describe how partition
sampling operates with articulated objects. For a rigorous description refer to (MacCormick
and Isard, 2000).
Let us consider, as an example, the articulated object of Figure 3.8. This articulated object is
composed of three links, of fixed length, connected by two-dimensional hinges, i.e. a chain
of three links. The whole object can translate (parameters x, y), rotate around the base link
centroid (parameter r), and scale (parameter s). The hinges that join the links have angle
parameters $\alpha_1$ and $\alpha_2$. If we wanted to visually track this object using a Condensation filter,
as described in Section 3.3, we would need to define a number of things: first, an articulated
contour that matches the articulations of the object and shares the same six parameters;
second, the dynamics h of the articulated object; and third, a measurement function g capable
of calculating a contour likelihood. In order to use partition sampling for this example we
can still use the same articulated contour as the one defined for Condensation, but the
configuration space, dynamics, and measurement function need to be decomposed for each
partition. A convenient decomposition of this articulated object's configuration space is:
• first partition $X_1$, parameters (x, y, r, s), corresponding to the base link.
• second partition $X_2$, parameter $\alpha_1$, corresponding to the L1 link.
• third partition $X_3$, parameter $\alpha_2$, corresponding to the L2 link.
Figure 3.8: Articulated object with three links forming a chain.
Each partition has a particle set; $X_1$ has a particle set $S_1$, $X_2$ has a particle set $S_2$, and $X_3$ has
a particle set $S_3$. The particles in each set are associated hierarchically, so that each particle
in $S_3$ will have an associated particle in $S_2$ (its parent), and each particle in $S_2$ will have
an associated particle in $S_1$. However, only a few selected particles in $S_1$ will have associated
particles in $S_2$, and only a few selected particles in $S_2$ will have associated particles in $S_3$.
These particle sets are complementary; thus, by putting together the associated particles
from each of the sets $S_1$, $S_2$, and $S_3$, it is possible to have a valid configuration for the
whole articulated object.
The dynamics have to decompose as $h = h_1 \ast h_2 \ast h_3$; with $h_1$ acting on $X_1$,
$X_2$, and $X_3$; $h_2$ acting on $X_2$ and $X_3$; and $h_3$ acting on $X_3$. The measurement functions
are specific for each partition; $g_1$, $g_2$, and $g_3$ will measure the likelihood of the contour
segments for the base, L1, and L2 links respectively.
Figure 3.9 shows one time-step of partition sampling for the example articulated object.
Before starting the first time-step, there will be one particle pre-selected in each particle set:
one particle in $S_1$, which will be associated to the pre-selected particle in $S_2$, and this one
will in turn be associated to another pre-selected particle in $S_3$. These three particles will
contain a known initial configuration of the articulated object's base, L1, and L2 links.
1. First partition $X_1$:
   1.1. From the selected particles in $S_1$ (of the previous time-step) generate new particles for $S_1$.
   1.2. Apply dynamics $h_1$ to each of the particles in $S_1$.
   1.3. Weight particles in $S_1$ using the measurement function $g_1$.
   1.4. Select particles from $S_1$.
2. Second partition $X_2$:
   2.1. From the selected particles in $S_1$ generate new particles for $S_2$.
   2.2. Apply dynamics $h_2$ to each of the particles in $S_2$.
   2.3. Weight particles in $S_2$ using the measurement function $g_2$.
   2.4. Select particles from $S_2$.
3. Third partition $X_3$:
   3.1. From the selected particles in $S_2$ generate new particles for $S_3$.
   3.2. Apply dynamics $h_3$ to each of the particles in $S_3$.
   3.3. Weight particles in $S_3$ using the measurement function $g_3$.
   3.4. Select particles from $S_3$.
4. Reorganize particles from $S_1$, $S_2$, and $S_3$ for the next time-step.
Figure 3.9: Algorithm for one time-step of partition sampling on the chain of links of Figure 3.8.
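The algorithm of Figure 3.9 can be sketched in code under several simplifying assumptions, none of which come from the thesis: every particle carries the full six-parameter configuration (so the associations between sets are implicit and step 4 needs no separate code), the dynamics are pure diffusion on the parameters owned by the current partition, "selecting" means keeping the highest-weighted particles, and the measurement functions are Gaussian scores against a known target instead of image measurements. All names are hypothetical.

```python
import math
import random

# Indices of the state (x, y, r, s, alpha1, alpha2) owned by each partition.
PARTITIONS = [range(0, 4), range(4, 5), range(5, 6)]

def make_weighting(target, upto, sigma=0.2):
    """Hypothetical g_j: scores only the parameters of partitions 1..j."""
    def g(config):
        d2 = sum((config[i] - target[i]) ** 2 for i in range(upto))
        return 1e-12 + math.exp(-d2 / (2.0 * sigma ** 2))   # floor keeps total weight positive
    return g

def partition_stage(parents, owned, g, n_particles, n_select, noise, rng):
    """Steps x.1 to x.4 for one partition; parents is a list of (config, weight) pairs."""
    configs = [config for config, _ in parents]
    weights = [weight for _, weight in parents]
    # x.1: each parent produces children in proportion to its weight.
    children = [list(c) for c in rng.choices(configs, weights=weights, k=n_particles)]
    # x.2: dynamics, simplified here to diffusion on the owned parameters only.
    for child in children:
        for i in owned:
            child[i] += rng.gauss(0.0, noise)
    # x.3: weight each child with this partition's measurement function.
    weighted = [(child, g(child)) for child in children]
    # x.4: select the particles with the highest weights.
    weighted.sort(key=lambda pair: pair[1], reverse=True)
    return weighted[:n_select]

def partition_sampling_step(selected, target, sizes, n_select, noise, rng):
    """One time-step over the three partitions of the chain of links."""
    for owned, size in zip(PARTITIONS, sizes):
        g = make_weighting(target, owned.stop)
        selected = partition_stage(selected, owned, g, size, n_select, noise, rng)
    return selected
```

Because each stage only perturbs its own parameters, a child inherits the rest of its configuration unchanged from its parent, which is how the search in one link starts from the likely configurations found for the previous one.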
The first operation for each partition, [1.1], [2.1], and [3.1], uses the particles selected in the
previous partition; for [1.1] the previous partition becomes the same partition, but from the
previous time-step. A selected particle in the previous partition, referred to as parent, will
produce a number of particles in the new partition proportional to the weight of the parent
particle. The configuration of the child particles will be taken from the parent's associated
particles of the previous time-step.
The second operation for each partition, [1.2], [2.2], and [3.2], applies the relevant dynamics
to the particles of each partition. These dynamics consist of a deterministic drift plus a
stochastic diffusion.
The third operation for each partition, [1.3], [2.3], and [3.3], calculates a weight for each
particle in the respective set. Note that to measure the contour likelihood of a particle in $S_2$,
the measurement function $g_2$ needs to consider the particle's configuration together with its
associated parent's configuration in $S_1$. Likewise, to measure the contour likelihood of a
particle in $S_3$, the measurement function $g_3$ needs to consider the particle's configuration
together with its associated parents' configurations in $S_2$ and $S_1$. In other words,
in order to measure a contour hypothesis for the L2 link, the configuration of the base
and L1 links needs to be known.
The fourth operation for each partition, [1.4], [2.4], [3.4], is to select the particles from a
particle set that constitute peaks of weight in the set.
At the end of the time-step [4.] the particles selected in each partition need to be reorganized
so that a particle in $S_1$ has exactly one particle associated in $S_2$, and this particle in $S_2$ has
exactly one particle associated in $S_3$. For that it may be necessary to replicate some particles
in $S_1$ or $S_2$.
Note that the sizes of the particle sets $S_1$, $S_2$, and $S_3$ can be different. The size of a particle
set is typically a trade-off between the number of particles required to reach a certain degree
of coverage, and thus accuracy, in a particular configuration space, and the desired amount
of allocated resources.
In this example the number of parameters of the articulated object is only six. For a
configuration space of six dimensions a plain Condensation filter would probably be enough
in order to track a chain of three links (though not without using a considerable number of
particles). However, if the chain were longer, a plain Condensation filter would not be the
best choice.
When tracking a chain of N links, partition sampling greatly reduces the search in the
configuration space. In each partition the most likely configurations for a link are found,
using a Condensation time-step, and these likely configurations are used as starting points for
the search in the following link. At the end of the time-step, there will be a set of likely
configurations for each partition, all of them associated with their parents forming a series of
tree structures.
3.4.2 Incomplete particles in a chain of links
Continuing with the example of the previous section, at the end of a time-step of partition
sampling there are three sets of selected particles, one for each partition. Recall that the
selected particles in a particle set are the particles that constitute peaks of weight in the set.
These selected particles are associated forming a series of tree structures. Before the next
time-step takes place, these particles need to be grouped in order to form a set of complete
particles. The term complete particle refers to three selected particles (one from each
partition) that are associated one to another, and that together form a valid configuration of
the articulated object. On the other hand, the term incomplete particle refers to a selected
particle in a partition whose associated particles in the following partition are not selected.
These two terms can be better understood by looking at Figure 3.10.
Figure 3.10: Particle set diagram showing two fictitious time-steps of partition sampling. Complete
particles are encircled in red.
Figure 3.10 shows two fictitious time-steps of partition sampling, for t=0, and t=1. Each of
the horizontal lines represents a particle set. There are three particle sets per time-step, one
for each partition: base, L1, and L2. The lengths of the horizontal lines are the same for all the
particle sets; however, this is just a representation and the particle sets could actually have
different sizes, for example: the Base particle set could have size 400 particles, and the L1,
and L2 particle sets could have 100 particles each. The black dots represent the selected
particles in each particle set. At t=0 there are three selected particles in the base particle set.
This diagram is for illustrative purposes; in practice, there could be a large number of
selected particles in each particle set. Each selected particle gives rise to a number of
particles, in the following particle set, proportional to its weight. This is illustrated in the
diagram by arrows coming out of the selected particles and pointing to a section, indicated
with horizontal braces, of the following particle set. The longer the brace, the more particles
this brace involves in the particle set.
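The proportional allocation described above can be sketched as follows. This is only an illustration: the function name `allocate_children` and the largest-remainder rounding are assumptions made here, since the text only states that the number of child particles is proportional to the weight of the selected parent.

```python
def allocate_children(parent_weights, child_set_size):
    """Split `child_set_size` slots among parents in proportion to their weights."""
    total = sum(parent_weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    # Ideal (fractional) share of the child particle set for each selected parent.
    exact = [w * child_set_size / total for w in parent_weights]
    counts = [int(share) for share in exact]  # floor, since shares are non-negative
    # Hand the slots lost to rounding to the largest fractional remainders.
    leftover = child_set_size - sum(counts)
    by_remainder = sorted(range(len(exact)),
                          key=lambda i: exact[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


# Three selected parents sharing a child particle set of 100 particles.
subset_sizes = allocate_children([5, 3, 2], 100)
```

Each entry of `subset_sizes` corresponds to the length of one brace in the diagram: the heavier the selected parent, the larger its portion of the following particle set.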
At the end of the t=0 time-step, it is possible to form two complete particles, encircled in red
on the diagram. These complete particles will be propagated to the following time-step t=1.
On the other hand, there are some selected particles that do not have any child selected
particles associated to them. These constitute incomplete particles.
Note that which particles get selected in a particle set depends on a few factors:
• The method of selection in use.
• The number and extension of the likelihood peaks in the configuration space represented
by that particle set.
• The level of coverage of a configuration space by a particle set. A sparse coverage may
result in likelihood peaks going unselected.
As a result, it is possible to find a selected particle with a large weight in its particle set, but
that produces no other selected particles in the following particle sets down the hierarchy.
This is an unfortunate waste of resources, as some particles with high weight may not be
propagated to the following time-step, just because they did not belong to a complete
particle.
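The grouping of selected particles into complete particles for a chain of three links can be illustrated with a short sketch. The data layout (index lists and parent arrays) and all names are hypothetical and chosen only for illustration; the numbers do not reproduce Figure 3.10.

```python
def complete_particles_chain(selected_base, selected_l1, selected_l2,
                             parent_of_l1, parent_of_l2):
    """Return (base, L1, L2) index triples whose three members are all selected.

    parent_of_l1[j] is the index of the base particle that generated L1 particle j;
    parent_of_l2[k] is the index of the L1 particle that generated L2 particle k.
    """
    base_set, l1_set = set(selected_base), set(selected_l1)
    complete = []
    for k in selected_l2:
        j = parent_of_l2[k]
        if j not in l1_set:
            continue  # the parent in L1 was not selected
        i = parent_of_l1[j]
        if i in base_set:
            complete.append((i, j, k))
    return complete


# Hypothetical time-step: three selected base particles, two in L1, two in L2.
selected_base = [0, 1, 2]
parent_of_l1 = [0, 0, 1, 1, 2, 2]   # six L1 particles and their base parents
selected_l1 = [0, 3]
parent_of_l2 = [0, 0, 3, 3, 3, 5]   # six L2 particles and their L1 parents
selected_l2 = [1, 4]

complete = complete_particles_chain(selected_base, selected_l1, selected_l2,
                                    parent_of_l1, parent_of_l2)
# Selected base particles that take part in no complete particle are incomplete.
incomplete_base = [i for i in selected_base if i not in {c[0] for c in complete}]
```

In this made-up example two complete particles can be formed, while base particle 2 is incomplete and would not be propagated, whatever its weight.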
3.5 Tree-structured articulated objects
The previous section discussed how articulated objects could be tracked using partition
sampling; however, it only addressed chains of links. Other articulated objects of greater
interest in this thesis, such as the human body, or a human hand, have tree structures.
Fortunately, it is possible to adapt partition sampling in order to track a tree-structured
articulated object. This section proposes a natural extension of partition sampling for
tracking of tree-structured articulated objects. This extension, however, as proposed here,
has some drawbacks, which will be resolved later in Section 3.7.
Let us consider the tree-structured articulated object of Figure 3.11(a). This is a simplified
articulated hand contour model, whose only purpose is illustrative. The hand contour model
assumes that the hand palm is always parallel to the camera's image plane, although allowing
for independent finger and thumb movements. The hand contour model is made of a hand
palm and five fingers, named L, R, M, I, T after Little, Ring, Middle, Index, and Thumb.
Each finger can rotate around its finger pivot (the black dot at the base of the finger) in order
to represent the abduction/adduction movements of the fingers. The finger pivots are situated
approximately where the Metacarpophalangeal (MCP) joint of each finger would be
expected to be. The length of the fingers can change in order to represent the 2D projection
of a finger's flexion/extension movement. The thumb is treated as if it were a finger; it is
modelled by only an angle and a length. The angle and length of a finger are denoted by
α and L respectively, with the name of the finger added as a subscript. The whole hand
is allowed to translate (x, y), rotate around the hand palm pivot (r), and scale (s). In total the
articulated hand contour has 14 parameters.
Figure 3.11: Tree-structured articulated object. (a) Simplified hand contour model with 14 DOF. (b) Particle
set tree for the articulated hand model; Palm is the parent particle set, and L, R, M, I and T are the
child particle sets.
The first step in order to use partition sampling with this object is to choose a convenient
decomposition of the configuration space, for example: one partition for the hand palm,
parameters (x, y, r, s); and one partition for each of the five fingers, each one with parameters
(α, L). The following steps will be to define: 6 particle sets, 6 motion models, and 6
measurement functions, related to each of the 6 partitions.
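The decomposition above can be written down as a small data structure. The class and field names below are illustrative assumptions, not part of the thesis; the sketch only shows that one palm partition with four parameters plus five finger partitions with two parameters each gives the 14 parameters of the model.

```python
from dataclasses import dataclass, fields

FINGERS = ("L", "R", "M", "I", "T")  # Little, Ring, Middle, Index, Thumb


@dataclass
class PalmState:
    x: float   # translation in x
    y: float   # translation in y
    r: float   # rotation around the hand palm pivot
    s: float   # scale


@dataclass
class FingerState:
    alpha: float = 0.0   # angle around the finger pivot
    length: float = 1.0  # length relative to the template finger


@dataclass
class CompleteParticle:
    palm: PalmState
    fingers: dict        # finger name -> FingerState


def parameter_count():
    """Number of parameters over all six partitions."""
    return len(fields(PalmState)) + len(FINGERS) * len(fields(FingerState))


hand = CompleteParticle(palm=PalmState(0.0, 0.0, 0.0, 1.0),
                        fingers={name: FingerState() for name in FINGERS})
```

A particle in the palm particle set would hold only a `PalmState`, and a particle in a finger particle set only a `FingerState` plus a link to its parent; `CompleteParticle` is the grouping that is propagated between time-steps.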
In the chain of links example, a parent partition was related to just one child partition. Here,
the hand palm partition is related to all the finger partitions at the same time, forming a two-
level tree of partitions. The particle sets associated to each partition form an equivalent two-
level tree. Figure 3.11(b) shows the particle set tree. A method for handling this tree structure
is to proceed as in the chain of links example for the case of only two levels: one level for the
hand palm particle set, and another level made by all the finger particle sets. In the first level,
hand palm, a selected particle will produce a number of particles, proportional to the selected
particle's weight, in each one of the finger particle sets. Then, after dynamics and
measurements are applied to each one of the finger sets, new particles are selected in each
one of the finger sets. Finally, selected particles from the hand palm and fingers particle sets
are grouped to form complete particles, and those are propagated to the following time-step.
The process for one time-step is described in Figure 3.12.
Unfortunately, the process of forming complete particles is more difficult with a tree-
structured articulated object than it is with a chain of links articulated object. The next
section will explain these difficulties.
1. For the parent particle set (Palm) do:
   1.1. Use the complete particles of the previous time-step to generate new particles for Palm.
   1.2. Apply dynamics to each of the particles in Palm.
   1.3. Weight particles in Palm.
   1.4. Select particles from Palm that constitute peaks of weight in the set.
2. For each of the child particle sets (L, R, M, I, and T) do:
   2.1. For each of the selected particles in Palm, generate a number of new particles in the finger particle set, proportional to the weight of the selected particle in Palm.
   2.2. Apply dynamics to each of the particles in the finger particle set.
   2.3. Weight particles in the finger particle set.
   2.4. Select particles, from the finger particle set, that constitute peaks of weight in the set.
3. Form complete particles for the next time-step.
Figure 3.12: Algorithm for one time-step of partition sampling on the tree-structured articulated object of
Figure 3.11(a).
3.6 Incomplete particles in tree-structured articulated objects
Continuing with the previous section's example, Figure 3.13 shows a particle set diagram
representing two fictitious time-steps of partition sampling for the simplified articulated hand
of Figure 3.11(a). The horizontal lines represent particle sets, as in Figure 3.10. However, in
this diagram there are only two levels of hierarchy in each time-step; at the first level there is
the hand palm particle set, referred to as "Palm", and at the second level there are five
particle sets, one for each finger, referred to as L, R, M, I, and T. The selected particles are
indicated with black dots on the particle sets. The weight of the selected particles in the palm
particle set determines how big the portion of associated particles in the fingers particle sets
is. These portions of the fingers particle sets that are associated to the same parent particle
will be referred to as subsets, and in Figure 3.13 are separated by vertical dashed lines, four
subsets in the time-step t=0, and three subsets in the time-step t=1. The reader is reminded
that this type of particle set diagram only represents the relationships between particle sets.
In practice, the particle sets can have different sizes, and the number of selected particles
in each set can be much larger.
Each of the time-steps consists of four operations: first, the selected particles in the palm
particle set are used to generate new subsets in the finger particle sets; second, dynamics are
applied to the finger particle sets; third, measurements are applied to the finger particle sets; and
fourth, new particles are selected from the finger particle sets. At the end of each time-step
the selected particles have to be grouped to form complete particles. In the current context a
complete particle is defined as: the combination of a selected particle in the palm particle set,
and from its associated subset, a selected particle in each of the finger particle sets.
We can see that at the end of time-step t=0 there is only one complete particle which
propagates to t=1. However, at the end of the time-step t=1 it is not possible to form any
complete particles. Figure 3.13 illustrates this potential situation. In practice, the more
links the tree-structured articulated object has, the more frequently incomplete particles
appear. This situation worsens if the finger particle sets have child particle sets of their own.
In the best case these incomplete particles are a waste of resources, in the worst case the
tracking cannot continue.
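The definition of a complete particle for the tree-structured case can be checked mechanically. The sketch below uses made-up data (it does not reproduce Figure 3.13) and hypothetical names: for each finger it records which palm subsets contain at least one selected particle, and a selected palm particle is complete only if every finger has a selected particle in its subset.

```python
def complete_palm_particles(selected_palm, subsets_with_selection):
    """Return the selected palm particles that can form a complete particle.

    subsets_with_selection maps a finger name to the set of palm indices whose
    subset contains at least one selected particle for that finger.
    """
    return [p for p in selected_palm
            if all(p in parents for parents in subsets_with_selection.values())]


# Hypothetical time-step t=0: four selected palm particles, hence four subsets.
t0 = {"L": {0, 1, 3}, "R": {1, 2}, "M": {0, 1}, "I": {1, 3}, "T": {1, 2, 3}}
complete_t0 = complete_palm_particles([0, 1, 2, 3], t0)

# Hypothetical time-step t=1: three subsets, each missing at least one finger.
t1 = {"L": {0, 1}, "R": {1, 2}, "M": {0, 2}, "I": {0, 1, 2}, "T": {2}}
complete_t1 = complete_palm_particles([0, 1, 2], t1)
```

As a rough guide only, and under the strong assumption that each finger subset independently contains a selected particle with probability q, a palm particle with five fingers is complete with probability q^5; for q = 0.8 this is about 0.33, which suggests why incomplete particles become more frequent as the number of links grows.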
One way of avoiding incomplete particles is to force at least one selected particle in each
subset for each finger particle set, for example, the particle with highest weight in the subset.
However, this could lead to the selection of particles with very low weight, i.e. particles that
do not properly represent the relevant link; and this would probably constitute a waste of
particles in the child's particle set. Another possible solution is to combine various selected
particles in the same particle set, and then generate new particles so that each subset has at least one
selected particle. This idea leads to the following section.
3.7 Particle interpolation
The idea of particle interpolation in partition sampling involves creating new particles in a
particle set using the combined data of other particles, from the same or other particle sets,
and following certain creation constraints in order to form useful complete particles. The idea
of combining particles to form new ones, as opposed to the particle filter's primary method of
generating new particle sets, is not new in a broad sense. Various works on hybridization of
particle filters (PF) and genetic algorithms (GA) use crossover operators between particles
(Uosaki et al., 2004; Drugan and Thierens, 2004), to the extreme that in Kwok et al. (2005)
the resampling operation of a PF is completely replaced by crossovers and other techniques
popular in GAs. In the context of articulated visual tracking Pantrigo et al. (2005) used two
metaheuristics, known as path relinking, and scatter search, in order to create new particles,
from a few selected ones, that efficiently cover a search space. However, particle
interpolation is the first attempt at interpolating particles in a partition sampling scheme with
the capability of producing particles that maintain coherence between partitions, and have
the highest possible weights. The ultimate aim of this particle interpolation is to avoid
incomplete particles. In order to describe particle interpolation we will continue with the
tree-structured articulated object example of Figure 3.11. Recall the fourth operation on
each finger particle set (see Figure 3.12). In this operation the particles that constitute peaks
of weight in the particle set are selected; however, there is no guarantee that each of the
subsets will contain at least one selected particle, which results in potential incomplete
particles. For this reason, the first aim of particle interpolation is to generate a new particle
for each subset, and the second aim is that the generated particle has the highest possible
weight.
Figure 3.13: Particle set diagram showing two fictitious time-steps of partition sampling for the example
articulated hand. At the end of the time-step t=0 there is one complete particle. At the end of the
time-step t=1 there are no complete particles.
Let us consider the particle with highest weight in a finger particle set. This particle has the
highest weight because the finger contour it represents matches the image features, a finger,
better than the others. This particle gives us information about where the real finger in the
image is. Particle interpolation consists of generating a particle, for each subset, that shares
some of the image features of the particle with highest weight, and is therefore expected to
achieve a high weight too, but that is expressed relative to the subset's parent particle from the palm
particle set. This process is illustrated in the particle set diagram of Figure 3.14. In this
particle set diagram there is a particle set for the palm, and a particle set for each of the five
fingers, for a time-step t=0. The palm particle set has four selected particles, and therefore
there are four subsets in the finger particle sets. The large black dots in each of the finger
particle sets represent the particle with highest weight in that set. The smaller red dots in the
finger particle sets represent the interpolated particles, one for each subset. The interpolated
particles in each particle set are calculated by combining data from the particle with highest
weight in the particle set and the parent of each one of the subsets; this is represented with
red arrows in the diagram, coming from the particle with highest weight and going to the
interpolated particles in each of the subsets.
We can see from the diagram in Figure 3.14 that at the end of the time-step there will be as
many complete particles as selected particles in the palm particle set. However, although the
interpolated particles are generated in a way that makes them likely to have a high weight, the
exact weight is not known. In order to be able to form complete particles with a known
weight, the interpolated particles need to be weighted.
Figure 3.14: Particle set diagram showing the particle interpolation process. The big black dots in the
finger particle sets are the particles with highest weight in the set. The smaller red dots in the finger
particle sets are the interpolated particles, one for each subset.
1. For the Palm particle set do:
   1.1. Use the complete particles of the previous time-step to generate new particles for Palm.
   1.2. Apply dynamics to each of the particles in Palm.
   1.3. Weight particles in Palm.
   1.4. Select particles from Palm that constitute peaks of weight in the set.
2. For each of the finger particle sets, i.e. L, R, M, I, and T, do:
   2.1. For each of the selected particles in Palm, generate a number of new particles in the finger particle set, proportional to the weight of the selected particle in Palm (the finger particles that come from the same parent particle in Palm form a subset).
   2.2. Apply dynamics to each of the particles in the finger particle set.
   2.3. Weight particles in the finger particle set.
   2.4. Select the particle with highest weight in the finger particle set.
   2.5. Generate a new interpolated particle for each subset, based on the particle selected in step 2.4.
   2.6. Weight the interpolated particles.
3. Form complete particles for the next time-step.
Figure 3.15: Algorithm for one time-step of partition sampling, and particle interpolation. The algorithm is
based on the articulated hand model of Figure 3.11.
Figure 3.15 shows the algorithm for one time-step of partition sampling, including particle
interpolation, for the example articulated hand model. The algorithm with particle
interpolation differs from that of Figure 3.12 in steps [2.4], [2.5], [2.6], and [3]. In step
[2.4] the particle with highest weight in the finger particle set is selected; this is a simpler
step than [2.4] without particle interpolation. Steps [2.5] and [2.6] are new. Finally, step
[3] is simpler than the equivalent step without particle interpolation, since in this case there is
exactly one complete particle for each subset.
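To make the flow of steps [2.1] to [3] concrete, the following is a deliberately simplified one-dimensional sketch, not the tracker itself. Everything in it is an assumption made for illustration: the palm state is a single position, a finger state is an offset from its palm, the "image" is a target fingertip position per finger, the likelihood is a Gaussian, and all subsets have the same size rather than a size proportional to the parent's weight. The palm steps [1.1] to [1.4] are assumed to have already produced `selected_palm`.

```python
import math
import random


def likelihood(tip, target, sigma=1.0):
    """Toy measurement function: high when the fingertip is near the target."""
    return math.exp(-0.5 * ((tip - target) / sigma) ** 2)


def finger_steps(selected_palm, finger_targets, per_subset=5, rng=None):
    """Steps 2.1 to 3 for a 1D toy hand; returns one complete particle per palm."""
    rng = rng or random.Random(0)
    complete = [(palm, {}) for palm in selected_palm]
    for name, target in finger_targets.items():
        # 2.1 and 2.2: one subset per selected palm particle, diffused by noise.
        subsets = [[rng.gauss(0.0, 2.0) for _ in range(per_subset)]
                   for _ in selected_palm]
        # 2.3 and 2.4: weight every finger particle and keep the best fingertip.
        best_tip, best_weight = None, -1.0
        for palm, subset in zip(selected_palm, subsets):
            for offset in subset:
                weight = likelihood(palm + offset, target)
                if weight > best_weight:
                    best_tip, best_weight = palm + offset, weight
        # 2.5 and 2.6: one interpolated particle per subset, sharing the best
        # fingertip but expressed relative to that subset's own palm particle.
        for palm, (_, fingers) in zip(selected_palm, complete):
            offset = best_tip - palm
            fingers[name] = (offset, likelihood(palm + offset, target))
    # 3: every selected palm particle now has exactly one particle per finger.
    return complete


particles = finger_steps([0.0, 1.5, -2.0], {"I": 3.0, "M": 1.0})
```

The point of the toy is structural: whatever the random draws, the number of complete particles equals the number of selected palm particles, which is the property the algorithm of Figure 3.15 is designed to guarantee.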
3.7.1 Generating interpolated particles
This section describes the criteria used to generate interpolated particles. These criteria may
appear as an "ad hoc" solution because they use specific knowledge about the problem, i.e.
visually tracking a hand that moves in a plane parallel to the image plane, using the
simplified contour model of Figure 3.11. However, similar criteria could be applied in other
partition sampling applications by interpreting the relationships between particle parameters.
We start by considering the particle with highest weight in a finger particle set. In the
previous section, we reasoned that this particle has the highest weight because the finger
contour it represents matches the image features, a finger, better than the others. Particle
interpolation assumes that the generated particle will also have high weight because it is
constructed in a way that shares some of those image features. The proposed method to
generate an interpolated particle makes use of this assumption with an additional constraint,
that the generated particle has to be consistent with its parent particle.
Let us consider a simplified version of the articulated hand model of Figure 3.11. This
version only includes the palm and the little finger. Let us consider two particles A, and B,
which use this simplified model, see Figure 3.16. Particle A is formed by two partitions: Ap
corresponding to the palm, and dependent on this one, Af corresponding to the finger.
Similarly, particle B is formed by Bp, and Bf. Now, let us assume that Af is the particle with
highest weight in the finger particle set, and Bf is a particle with low weight in the same
particle set. The interpolation procedure finds new parameters (length, and angle) for Bf in
order that it can share some image features with Af. The goal of this operation is to maximize
Bf's weight, while taking into account that the two particles come from different parents: Ap,
and Bp. Some example rules that attempt to maximize Bf's weight in this manner could be:
(rule 1) Bf maximizes its weight if its fingertip coordinates are the same as Af's
fingertip coordinates.
(rule 2) Bf maximizes its weight if both Bf and Af share a common point along
their respective major axes.
(rule 3) Bf maximizes its weight if it also maximizes its area overlap with Af.
(rule 4) Bf maximizes its weight if it has the same angle as Af.
Other rules are possible; however, after some experimentation (rule 1) produced the best
results and was adopted in the implementations of Chapter 4. The implementation of this rule
is detailed next, with the aid of Figure 3.16.
Figure 3.16: Graphical representation of the interpolation process using (rule 1). (a) Particle A's finger, Af,
has the highest weight. (b) Bf's parameters (α_B, L_B) are updated so that Bf's fingertip
coordinates match those of Af's fingertip.
In Figure 3.16(a) we can see a graphical representation of particle A's state, and particle B's
state. (cenX_A, cenY_A) and (cenX_B, cenY_B) are the palm pivots of Ap and Bp respectively.
(pivX_A, pivY_A) and (pivX_B, pivY_B) are the finger pivots of Af and Bf respectively. These
finger pivots can be calculated from the palm pivots and the palm's state, i.e. translation,
rotation, and scale. (tipX_A, tipY_A) and (tipX_B, tipY_B) are the fingertips of Af and Bf
respectively. These fingertips can be calculated from the finger pivots and the finger's state,
i.e. angle and length. We assume that Af has the highest weight, and Bf has a low weight. In
order for Bf to maximize its weight its fingertip coordinates must be the same as Af's
fingertip coordinates, Figure 3.16(b). This can be achieved by updating Bf's parameters
(α_B, L_B) in the following manner:
dx = tipX_A - pivX_B
dy = tipY_A - pivY_B
AngleF = \tan^{-1}(dx / dy)    (3.12)
LengthF = \sqrt{dx^2 + dy^2}    (3.13)
\alpha'_B = AngleF - OriginalFingerAngle - angle    (3.14)
L'_B = \frac{LengthF}{OriginalFingerLength \cdot scale}    (3.15)
Equations (3.12) and (3.13) calculate the angle and Euclidean distance between Bf's pivot
and Af's fingertip. Equations (3.14) and (3.15) apply a normalisation to AngleF and LengthF
so that α'_B is relative to Bp's rotation, angle, and L'_B is a number between 0 and 1. In
these equations OriginalFingerAngle and OriginalFingerLength are the angle and length of
the finger in the template position, i.e. for α = 0 and L = 1; and angle and scale are the
rotation and scale parameters of Bp's state.
Following this rule we can generate new finger particles for any palm particles, and the
weight of these new particles is likely to be high.
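A minimal sketch of (rule 1) is given below, under stated assumptions. The forward model `finger_tip` is not given in the text and is assumed here: angles are measured from the y-axis, so that the tangent of the finger direction is dx/dy, and the finger reach is the relative length times the template length times the palm scale. The sketch also uses `math.atan2` in place of a plain inverse tangent, so that the quadrant is preserved and dy = 0 is handled. Function and argument names are hypothetical.

```python
import math


def finger_tip(pivot, alpha, length, palm_angle, palm_scale,
               template_angle, template_length):
    """Assumed forward model: fingertip position from pivot and finger state."""
    direction = alpha + template_angle + palm_angle   # measured from the y-axis
    reach = length * template_length * palm_scale
    return (pivot[0] + reach * math.sin(direction),
            pivot[1] + reach * math.cos(direction))


def interpolate_finger(tip_a, pivot_b, palm_angle_b, palm_scale_b,
                       template_angle, template_length):
    """New (alpha, length) for Bf so that its fingertip coincides with Af's."""
    dx = tip_a[0] - pivot_b[0]
    dy = tip_a[1] - pivot_b[1]
    angle_f = math.atan2(dx, dy)       # angle from Bf's pivot to Af's fingertip
    length_f = math.hypot(dx, dy)      # Euclidean distance between them
    alpha_b = angle_f - template_angle - palm_angle_b        # relative to Bp's rotation
    length_b = length_f / (template_length * palm_scale_b)   # relative to the template
    return alpha_b, length_b


# Af: the finger particle with highest weight, attached to palm particle Ap.
tip_a = finger_tip((100.0, 120.0), 0.2, 0.9, 0.10, 1.0, -0.3, 80.0)
# Bf: re-parameterised relative to a different palm particle Bp.
pivot_b, angle_b, scale_b = (110.0, 115.0), 0.15, 1.1
alpha_b, length_b = interpolate_finger(tip_a, pivot_b, angle_b, scale_b, -0.3, 80.0)
tip_b = finger_tip(pivot_b, alpha_b, length_b, angle_b, scale_b, -0.3, 80.0)
```

Recomputing the fingertip of the interpolated particle returns Af's fingertip, which is exactly the condition that (rule 1) imposes; the interpolated particle still has to be weighted, since its contour differs from Af's everywhere except at the tip.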
3.7.2 Differences from Condensation
We have seen how particle interpolation can be used with partition sampling, and tree-
structured articulated objects. In this context, particle interpolation is essential in order to
prevent tracking from terminating, due to inability to form complete particles. At the same
time, particle interpolation provides an efficient solution in the sense of propagating as many
particles as possible from one time-step to the next one, and ensuring that these particles
have high weights. Particle interpolation makes it possible that, for each selected particle in
the palm particle set, a complete particle is formed and propagated to the next time-step.
However, the particle selection policies in partition sampling differ slightly from the
resampling operation in Condensation.
In the Condensation algorithm, ideally, the resampling operation does not affect the
distribution represented by the sample set. On the one hand, the resampling operation avoids
the unbounded growth of the sample set as it evolves from one time-step to the next one. On
the other hand, the resampling operation introduces noise in the sample set, from one time-
step to the next one, in order to prevent the sample set from degenerating into a few particles
that represent only the peaks of the distribution. It turns out that a particle set represents a
distribution more efficiently when most of the particles have equal weights (MacCormick,
2000).
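One common way to quantify how evenly the weights of a particle set are spread is the reciprocal of the sum of squared normalised weights, often called the effective sample size. The sketch below is a generic illustration of that measure and is not taken from the thesis; the function name is an assumption.

```python
def effective_sample_size(weights):
    """Reciprocal of the sum of squared normalised weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    normalised = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalised)


uniform = effective_sample_size([1.0] * 100)          # equal weights
degenerate = effective_sample_size([1.0, 0.0, 0.0])   # one dominant particle
```

With equal weights the measure equals the number of particles, and it falls towards 1 as the weight concentrates on a single particle, which matches the observation that equally weighted particle sets represent a distribution more efficiently.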
As opposed to the idea of resampling in Condensation, partition sampling produces a particle
set that represents only the peaks of the distribution. The benefit of partition sampling is that
we can explore configuration spaces of a high number of dimensions using much fewer
particles than an equivalent Condensation filter would require. This is possible thanks to the
hierarchy between partitions. The search results in parent partitions allow the search in
the child partitions to be focused, so that each child partition has a smaller space to search. Particle
interpolation adds extra focusing to the search space of the child partition, by generating
particles with high weight, and ensuring that the same number of particles, that generated the
particle set, is propagated to the child's particle set.
The main reason for using a Condensation filter, instead of a Kalman filter, for visual
tracking, is that Condensation's capability of keeping several hypotheses of the target
makes it resistant to background clutter and partial occlusion. In these situations some
hypotheses with medium weight will be carried from one time-step to the next one, because
they could later turn into high weight hypotheses. The fact that partition sampling, together
with particle interpolation, focuses the particles on the peaks of the underlying distribution,
may seem to reduce the tracking resistance against background clutter. However, despite the
chance that the clutter resistance of partition sampling and particle interpolation may be
reduced in comparison to an equivalent Condensation filter, it is not eliminated, as a number of
different hypotheses are always propagated between time-steps. On the other hand, partition
sampling and particle interpolation offer a robust, versatile, and time-efficient, solution that
Condensation alone would, otherwise, not be able to offer.
Finally, as a point of comparison, it is worth mentioning that the articulated tracking in this
thesis is based on the work of MacCormick and Isard (2000). They presented an interface-quality
hand tracker, which involved a tree-structured articulated hand contour and partition
sampling. The hand's configuration space was divided into 4 partitions: hand palm, index
finger, and two partitions for the thumb. However, the approach taken in their implementation
was to deal with each of the partitions in sequence: the particles selected in partition 1
generate the particles in partition 2, the particles selected in partition 2 generate the particles
in partition 3, and so on. Finally, the particles selected in partition 4 generate the particles in
partition 1, for the next time-step. This strategy avoids the task of forming complete
particles. However, in their solution, each time an extra partition is involved in the tracking,
the number of particles required to keep a fixed level of performance grows faster than the
number of particles required in our solution for each extra partition. They point out that their
approach is valid mathematically but an approach that takes into account the tree structure
would be more appropriate and possibly more efficient.
4 Implementations and results
Chapter 3 presented Blake and Isard's framework for contour tracking. The chapter reviewed
the framework elements and the Condensation algorithm; and it was shown how partition
sampling can be used in order to track articulated objects, and tree-structured articulated
objects such as a human hand. Finally, particle interpolation was presented as a novel
solution to the incomplete particles problem in tree-structured articulated objects.
In this chapter, these techniques are put into practice with the implementation of two
articulated hand trackers. The first hand tracker, which we will refer to as particle-set
implementation, implements exactly the concepts of Chapter 3. The second hand tracker,
which we will refer to as sweep implementation, implements the same general ideas, but
using a deterministic search for the fingers. Both hand trackers use the same articulated hand
contour model, and a skin colour based measurement model, which provides various
advantages in comparison to the measurement models used by Blake and Isard.
There are numerous possible improvements to the two articulated hand trackers presented in
this chapter; however, these improvements will not yet be used in this chapter; they will be
covered later in Chapter 7.
4.1 Articulated hand contour model
The hand tracking techniques used in this thesis use the assumption that the user's hand palm
is always approximately parallel to the camera's image plane, although allowing for
independent finger and thumb movements. In order to support tracking of a hand in such a
configuration we built a hand contour model, shown in Figure 4.1(a). The hand contour
model is an articulated BSpline template constructed from 50 control points. Each finger can
rotate around its finger pivot in order to represent the abduction/adduction movements of the
fingers. The finger pivots are situated approximately where the Metacarpophalangeal (MCP)
joint of each finger would be expected. The length of the fingers can change in order to
represent the 2D projection of the finger's flexion/extension movement as it would be
perceived from the camera's point of view. Finally, the thumb consists of two segments
which can rotate around their pivots but, unlike the fingers, the thumb segments
have a constant length (as the thumb is assumed to flex only in the same plane as the palm),
see Figure 4.1(b). Altogether the articulated hand contour has 14 DOF which are represented
by the following state vector:
(x, y, α, λ, θ_0, l_0, θ_1, l_1, θ_2, l_2, θ_3, l_3, θ_4, θ_5)
where
x is the x-coordinate of the centroid of the hand.
y is the y-coordinate of the centroid of the hand.
α is the rotation angle of the whole hand.
λ is the scale of the whole hand.
θ_0, θ_1, θ_2, θ_3 are the angles with respect to the hand palm for the little, ring, middle,
and index finger respectively.
l_0, l_1, l_2, l_3 are the lengths of the little, ring, middle, and index finger
respectively, as perceived from the camera's point of view.
θ_4 is the angle of the first segment of the thumb with respect to the hand
palm.
θ_5 is the angle of the second segment of the thumb with respect to the
first segment of the thumb.
The first four parameters (x, y, α, λ) are a non-linear representation of a Euclidean similarity
transform applied to the whole BSpline template. The finger and thumb angles are 0º when
they are in the template position. The angles of the fingers and thumb are only allowed to
change within a range of valid angles. Note that in Figure 4.1(b) the two thumb pivots are off
the major axis of their corresponding thumb segments. This offset allows for a better
mapping of the thumb's articulated contour movement to the real thumb joints movement, as
seen in a 2D projection. The length of the fingers is relative to the template's finger lengths;
it is 1 when they have the same length as in the template, smaller than 1 when shorter, and
greater than 1 when longer. The two thumb segments have a fixed length of 1.
Figure 4.1: Hand contour model. (a) Hand contour showing its 50 control points. (b) Articulated hand contour
showing its joint parameters.
4.2 Dynamical model
The dynamical model used in the proposed hand tracker consists of one-dimensional
oscillators, one for each parameter of the articulated hand model, as described in (Blake and
Isard, 1998). Each oscillator is defined by three parameters: a damping constant β , a natural
frequency f, and a root-mean-square average displacement ρ . Table 4.1 shows the
parameter values used in the dynamical model. These values were found empirically to suit
how the user's hand moves. Note that the natural frequencies of all the oscillators are zero.
This means that the parameters of the articulated hand model are not expected to oscillate,
although the dynamical model can represent oscillations if needed.
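As an illustration of how one such oscillator can be simulated, the following sketch uses the discrete second-order autoregressive form commonly associated with the oscillator parameterisation of (Blake and Isard, 1998). The frame period `tau`, the function names, and the exact mapping from (β, f, ρ) to the coefficients are assumptions of this sketch; they are not taken from the tracker's implementation.

```python
import math

def oscillator_coefficients(beta, f, rho, tau):
    """Map (beta [1/s], f [Hz], rho) to AR(2) coefficients for a time-step of tau seconds."""
    a1 = 2.0 * math.exp(-beta * tau) * math.cos(2.0 * math.pi * f * tau)
    a2 = -math.exp(-2.0 * beta * tau)
    # Noise gain chosen so that the steady-state RMS displacement equals rho.
    b0 = rho * math.sqrt(1.0 - a1 ** 2 - a2 ** 2 - 2.0 * a1 ** 2 * a2 / (1.0 - a2))
    return a1, a2, b0

def next_displacement(d_prev, d_prev2, coeffs, noise):
    """One AR(2) step on the displacement from the rest value.

    `noise` is a standard normal sample supplied by the caller
    (for example random.gauss(0.0, 1.0)).
    """
    a1, a2, b0 = coeffs
    return a1 * d_prev + a2 * d_prev2 + b0 * noise
```

With f = 0, as in Table 4.1, the cosine factor is 1 and the oscillator is critically damped, so the displacement decays towards its rest value without oscillating.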
4.3 Measurement model
The two hand contour tracker implementations presented in this chapter use a measurement
model based on that described in Section 3.2. This section proposes a measurement model
that uses measurement lines too (as in Section 3.2), but the processing of these lines differs
from that of Section 3.2 in that only skin colour features are used.
Parameter    $\beta$ (s$^{-1}$)   $f$ (Hz)   $\rho$
$x$          6                    0          50 pixels
$y$          6                    0          45 pixels
$\alpha$     6                    0          0.3 rad
$\lambda$    6                    0          0.1
$\theta_0$   8                    0          0.2 rad
$l_0$        10                   0          0.2 pixels
$\theta_1$   8                    0          0.2 rad
$l_1$        10                   0          0.2 pixels
$\theta_2$   8                    0          0.2 rad
$l_2$        10                   0          0.2 pixels
$\theta_3$   8                    0          0.2 rad
$l_3$        10                   0          0.2 pixels
$\theta_4$   8                    0          0.2 rad
$\theta_5$   8                    0          0.2 rad
Table 4.1: Parameter values for the hand tracker dynamical model.
4.3.1 Measurement lines
Figure 4.2 shows the location of the measurement lines used in the articulated hand contour
of Section 4.1. There are 70 measurement lines, 19 on the palm, 10 on each finger, 5 on the
first thumb segment, and 6 on the second thumb segment. The measurement lines on the
hand palm and thumb are normal to the contour. However, the measurement lines in the
fingers are normal to the finger's axis, for the 8 closest lines to the hand, and parallel to the
finger's axis, for the 2 lines on the fingertip. The average² length of a measurement line is 20
pixels.
Blake and Isard suggest using an anti-aliasing scheme, such as bilinear interpolation, when
sampling their measurement lines. On the other hand, the approach adopted in this thesis is
to retrieve the image pixels along a measurement line drawn using the Bresenham line algorithm.
This approach may seem less accurate; however, if an anti-aliasing scheme were to be used on
the measurement lines, other inaccuracies present in the hand contour model³ would obscure
this gain in accuracy. As we will soon see, the pixels along the measurement line will be
classified as being skin colour or non-skin colour. In this context, interpolation of the pixel
values could interfere with the skin colour classification. Finally, non-anti-aliased
measurement lines can be sampled faster. This is an important fact to consider for real-time
performance, since the measurement lines are heavily used in the tracking algorithms.
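The sampling of a measurement line without anti-aliasing can be sketched as follows. This is a minimal illustration using the standard integer Bresenham algorithm; the function names and the `image[y][x]` indexing convention are assumptions of this sketch.

```python
def bresenham(x0, y0, x1, y1):
    """Return the integer pixel coordinates on the line from (x0, y0) to (x1, y1)."""
    points = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:   # step along x
            err += dy
            x0 += sx
        if e2 <= dx:   # step along y
            err += dx
            y0 += sy
    return points

def sample_line(image, p0, p1):
    """Read the pixel values along a measurement line; image is indexed as image[y][x]."""
    return [image[y][x] for x, y in bresenham(p0[0], p0[1], p1[0], p1[1])]
```

Each pixel is read exactly once and no interpolation is performed, so the values passed to the skin colour classifier are unmodified image pixels.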
Figure 4.2: Measurement lines used in the articulated hand contour.
² When rotating a line of length L, drawn using a Bresenham algorithm, the resulting line can sometimes involve a number of pixels slightly different from L. These are the pixels that later will be scanned to find features.
³ The hand contour model can only undergo Euclidean similarity transformations. It is assumed that the user will keep the hand parallel to the image plane of the camera; however, this will often not be the case, resulting in the hand contour not fitting the tracked hand perfectly.
4.3.2 Skin colour based measurement
In Section 3.2, we saw how the measurement lines are processed in order to find image
features. Typically these image features refer to edges and valleys. MacCormick
and Isard (2000), however, have used edges in combination with skin colour for their hand contour
tracking: they first calculate a likelihood based on the edges found along the measurement
line, and then the likelihood is increased or decreased after testing that the correct end of the
measurement line is in a skin colour area (the hand interior) and the other end is in a
non-skin colour area (the exterior of the hand). The hand trackers presented in this chapter use
skin colour in a rather different way.
In Chapter 5 a skin colour classifier, called the LC classifier, is presented. This skin colour
classifier possesses a number of interesting features that make it especially suitable for use in
HCI. Some of these features are: its computational time and storage efficiency; its tuning to a
specific person's skin colour; its resistance to illumination (brightness) changes; and that the
detected skin colour areas tend to be solid and clearly defined without requiring post-
processing. This classifier makes it possible to detect the edge of an object along a
measurement line, based entirely on the skin colour information. The edge of an object found
in this manner is referred to as skin edge. This is the approach used in all the hand tracking
experiments of this thesis, unless otherwise specified. Figure 4.3 shows the output of the LC
skin colour classifier applied to the whole image, with the measurement lines drawn on top
in order to see where the skin edges of the hand would be detected.
Figure 4.3: Skin colour image with the measurement lines on top.
The position of the skin edges on each measurement line is used to calculate the contour
likelihood. However, the procedure used in this thesis to calculate the contour likelihood
differs slightly from that explained in Section 3.2. In Section 3.2 a simplified expression to
calculate the contour likelihood is given in Equation 3.8. This expression involves the
calculation of a log-likelihood for each measurement point. These log-likelihoods are then
added and an exponential is taken over the total in order to obtain a probability. The
approach used in this thesis to calculate the contour likelihood does not use probabilities; it
uses a score, although, effectively, this score works as a likelihood.
The procedure to calculate the contour likelihood is as follows. Each measurement line is
scanned, starting from the outer part of the line, until two consecutive pixels are classified as
skin colour; this point is called SkinEdge. The distance from this position to the midpoint
of the line is then used to access a look-up table of scores. The scores obtained in this way, one for
each measurement line, are multiplied together, resulting in the final contour score. Figure 4.4 shows
the scores look-up table. The look-up table is made from Gaussian values taken at integer
distances. The Gaussian's standard deviation was found empirically and the mean is situated
at the origin. The Gaussian is transformed so that its maximum value is 2, and minimum
value 0.5. This means that each measurement point can potentially double or halve the score.
If no SkinEdge is found along the measurement line, the contribution of the line is to halve
the score. Notice that with this measurement function the score for the whole hand, 70
measurement lines, could potentially go from $2^{70} \approx 1.18 \times 10^{21}$ to $0.5^{70} \approx 8.47 \times 10^{-22}$.
Figure 4.5 shows the procedure to calculate the contour score.
Figure 4.4: Score look-up table.
1. Score = 1
2. For each of the measurement lines in a contour repeat:
   2.1. For each of the pixels along the measurement line, starting from the end of the line exterior to the contour, use the LC skin colour classifier to determine whether the pixels are skin colour or not.
   2.2. Once two consecutive pixels are classified as skin colour, stop scanning the line and store in i the position just before that point.
   2.3. Calculate the distance between i and the midpoint of the measurement line. This is the SkinEdge value.
   2.4. Then use SkinEdge to access the look-up table in Figure 4.4: Score = Score * Look-upTable[SkinEdge]
3. End.
Figure 4.5: Algorithm to calculate the contour's score.
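The algorithm in Figure 4.5 can be sketched in code as follows. This is only an illustration under stated assumptions: the standard deviation `sigma` is a free parameter (the thesis value was found empirically and is not reproduced here), the Gaussian is rescaled linearly into the range 0.5 to 2, the skin edge is taken to be the first pixel of the skin pair, and a line that already starts in skin colour is treated as having no valid edge. All function names are hypothetical.

```python
import math

def build_score_table(max_distance, sigma):
    """Gaussian centred at 0, rescaled so that its peak is 2 and its tail tends to 0.5."""
    return [0.5 + 1.5 * math.exp(-(d * d) / (2.0 * sigma * sigma))
            for d in range(max_distance + 1)]

def find_skin_edge(is_skin):
    """Index of the first of two consecutive skin pixels, or None.

    `is_skin` holds one boolean per pixel, ordered from the exterior end of
    the measurement line to the interior end.
    """
    for k in range(len(is_skin) - 1):
        if is_skin[k] and is_skin[k + 1]:
            # Assumption: an edge at the very start means the line does not
            # follow the non-skin/skin pattern, so it is rejected.
            return k if k > 0 else None
    return None

def contour_score(lines, table):
    """Multiply the per-line scores; lines without a valid skin edge halve the score."""
    score = 1.0
    for is_skin in lines:
        edge = find_skin_edge(is_skin)
        if edge is None:
            score *= 0.5
            continue
        distance = abs(edge - len(is_skin) // 2)
        score *= table[min(distance, len(table) - 1)]
    return score
```

Only multiplications and table look-ups are performed per measurement line; the exponentials are evaluated once, when the table is built.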
This approach has a number of advantages with respect to that of Section 3.2:
• Its implementation only requires multiplication and access to a look-up table (avoiding
the use of exponentials for each measurement point). This approach results in a faster
calculation of the contour likelihood than that of Section 3.2. This fact is important to
achieve real-time operation (as the measurement function is heavily used in the tracking
algorithm).
• It allows penalising the score for measurement lines that do not follow the right
non-skin/skin pattern. If a measurement line does not contain any skin colour pixels, its
contribution is to halve the score. If a measurement line only contains skin colour pixels,
its contribution is to halve the score. If a measurement line contains first skin colour and
then non-skin colour, its contribution is to halve the score. Only when a measurement
line follows the pattern first non-skin colour and then skin colour can its contribution be
bigger than 0.5. This results in a very sensitive measurement function, in the sense that
the contour hypotheses that are very close to the real contour will have a very large score,
while the rest of the contour hypotheses will have on average a score around 1. This makes the
measurement function very resistant to outliers.
One disadvantage to this measurement function is that as it does not use probabilities, the
Condensation algorithm cannot be interpreted anymore in terms of propagation of
conditional probabilities. However, in practical terms, this does not affect the tracking
performance.
4.4 Resampling scheme
Section 3.3 described the resampling operation, within the Condensation algorithm, as an
operation that does not affect the distribution represented by the sample set. However, when
using partition sampling this changes slightly, for efficiency reasons; the resampling
operation on a particle set will generate a new particle set containing only the peaks of
weight of the first particle set, as discussed in Section 3.7.2.
There are various ways of implementing this resampling operation for partition sampling. In
(Isard and MacCormick, 2000) a threshold is used. All the particles with weight above the
threshold are selected, in order to generate the new particle set. The threshold is calculated,
for each time-step, from the particle with largest weight minus a constant offset. In this
method of resampling, there are two extreme situations which can potentially arise: in the
first situation only one particle is selected; and in the second situation all the particles are
selected; the number of particles selected depends on the weight distribution of the particle
set. This method of resampling may be suitable for their partition sampling implementation,
in which partitions are processed in a circular sequence, see final note in Section 3.7.2.
However, in the implementation of partition sampling for tree-structured articulated objects
described in Section 3.5, the threshold-based resampling scheme would cause the
number of particles required, and consequently the time required, to process the partitions
to vary widely. For this reason, the chosen method of resampling is one in which the
number of selected particles is fixed. In each resampling operation, the N highest weight
particles in the particle set are selected. This guarantees that each time-step has a fixed
duration, resulting in a stable frame rate throughout the tracking. A selection of the highest
weighted 10% of the particles in the particle set was found to produce good results. This
percentage of particles was also observed by Isard and Blake (1998b), as the percentage of
particles that may have high enough weight to be used as a base, that is to be used to
generate particles in the following particle set.
One feature added to the resampling scheme is that the particle with highest weight in the
original particle set is always copied to the new particle set. The particle with highest weight
in a particle set is used for display purposes. Hence, if there is no change at all in the
configuration of the tracked object, from one time-step to the next, the output of the tracking
will fit the object at least as well as in the previous time-step. This technique helps the
tracker to display a more stable output.
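A minimal sketch of this fixed-size selection is given below. The function name and the representation of a particle set as two parallel lists are assumptions of this sketch; the default fraction of 10% is the value reported above.

```python
def select_top_particles(particles, weights, fraction=0.1):
    """Return (particle, weight) pairs for the highest-weight particles, best first.

    The number of selected particles depends only on the size of the set,
    so every time-step processes the same number of particles.
    """
    if len(particles) != len(weights):
        raise ValueError("particles and weights must have the same length")
    count = max(1, round(len(particles) * fraction))
    order = sorted(range(len(particles)), key=lambda i: weights[i], reverse=True)
    return [(particles[i], weights[i]) for i in order[:count]]
```

Because the result is sorted by weight, its first element is the particle with highest weight, which is the one carried over unchanged and used for display.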
4.5 Particle-set implementation
Here we present an articulated hand tracker capable of tracking the contour of a hand through
a video sequence. The tracker can handle the rigid movement of the hand, and the
independent movement of each finger, according to the hand contour model described in
Section 4.1. This implementation is referred to as the particle-set version, because each of
the partitions is covered by a particle set, exactly as described in Chapter 3. This hand tracker
uses the following features:
• Partition sampling and particle interpolation as described in Chapter 3.
• The articulated hand contour model described in Section 4.1.
• Dynamics as described in Section 4.2.
• The measurement model described in Section 4.3.
• The resampling scheme described in Section 4.4.
The articulated hand model has 14 parameters, $(x, y, \alpha, \lambda, \theta_0, l_0, \theta_1, l_1, \theta_2, l_2, \theta_3, l_3, \theta_4, \theta_5)$,
which we decompose into 7 partitions, as follows:
• One partition for the hand palm, comprising the parameters $(x, y, \alpha, \lambda)$, whose
associated particle set we will refer to as Palm.
• Four partitions, one for each finger, comprising the parameters $(\theta_0, l_0), \ldots, (\theta_3, l_3)$,
whose associated particle sets we will refer to as L, R, M, and I.
• Two partitions for the thumb, one for the first thumb segment, comprising the parameter
$\theta_4$, and another for the second thumb segment, comprising $\theta_5$, whose associated particle
sets we will refer to as T1 and T2 respectively.
A particle set diagram representing one time-step of the particle-set hand tracker version is
shown in Figure 4.6. This diagram is very similar to the one shown in Figure 3.14, with the
addition of two segments for the thumb. Each of the thumb segments has a unique parameter
for its angle (the length of the segment is constant). The second thumb segment depends
hierarchically on the first thumb segment; therefore, the subset sizes in the T2 particle set
are determined by the weights of the selected particles in the T1 particle set.
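One way of turning the weights of the selected parent particles into subset sizes is sketched below. The largest-remainder rounding rule and the function name are assumptions of this sketch; the text only states that subset sizes are proportional to the parent weights.

```python
def subset_sizes(parent_weights, total_children):
    """Split `total_children` particles among parents in proportion to their weights."""
    total_weight = sum(parent_weights)
    if total_weight <= 0:
        raise ValueError("at least one parent must have positive weight")
    quotas = [w * total_children / total_weight for w in parent_weights]
    sizes = [int(q) for q in quotas]
    # Hand out the particles lost to rounding, largest fractional part first.
    leftovers = total_children - sum(sizes)
    by_fraction = sorted(range(len(quotas)),
                         key=lambda i: quotas[i] - sizes[i], reverse=True)
    for i in by_fraction[:leftovers]:
        sizes[i] += 1
    return sizes
```

For example, three selected T1 particles with weights 5, 3 and 2 and a T2 particle set of 100 particles would produce subsets of 50, 30 and 20 particles.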
Figure 4.6: One time-step of tracking for the particle-set implementation.
An algorithm showing the specific operations for one time-step of the particle-set hand
tracker is shown in Figure 4.7. The algorithm is very similar to the one described in Figure
3.15, but differs from it in that there is one extra level in the hand tree structure, the second
thumb segment.
Some comments about the particle-set hand tracker are:
• The number of particles used in each particle set is set as follows: 250 particles for the
Palm particle set; and 100 particles for each of the finger particle sets, including the two
thumb segment particle sets.
• The contour templates are fit in the following order: first, the hand palm; second, the
fingers and first thumb segment; finally, the second thumb segment. Alternatives to this
order will be discussed in Section 7.1.
1. For the Palm particle set do:
   1.1. Use the complete particles of the previous time-step to generate new particles for Palm.
   1.2. Apply dynamics to each of the particles in Palm.
   1.3. Weight particles in Palm.
   1.4. Select particles from Palm that constitute peaks of weight in the set. (Resampling)
2. For each of the finger particle sets, i.e. L, R, M, I, and the first thumb segment particle set, T1, do:
   2.1. For each of the selected particles in Palm, generate a number of new particles in the finger particle set, proportional to the weight of the selected particle in Palm (the particles, in the finger particle set, that come from the same parent particle in Palm form a subset).
   2.2. Apply dynamics to each of the particles in the finger particle set.
   2.3. Weight particles in the finger particle set.
   2.4. Select the particle with highest weight in the finger particle set.
   2.5. Generate a new interpolated particle for each subset, based on the particle selected in step 2.4. Select the new interpolated particles.
   2.6. Weight the interpolated particles.
3. For the second thumb segment particle set, T2, do:
   3.1. For each of the selected particles in T1, generate a number of new particles in the T2 particle set, proportional to the weight of the selected particle in T1 (the particles, in the T2 particle set, that come from the same parent particle in T1 form a subset).
   3.2. Apply dynamics to each of the particles in the T2 particle set.
   3.3. Weight particles in the T2 particle set.
   3.4. Select the particle with highest weight in the T2 particle set.
   3.5. Generate a new interpolated particle for each subset, based on the particle selected in step 3.4. Select the new interpolated particles.
   3.6. Weight the interpolated particles.
4. Form complete particles for the next time-step.
Figure 4.7: Algorithm for one time-step of tracking with the particle-set implementation.
• The finger lengths and finger angles are constrained. The minimum allowed length is
0.15, and the maximum allowed length is 1.2. Recall that a length of 1 is the length
of the finger in the original finger contour template, see Section 4.1. The finger angles
are constrained specifically for each finger according to Table 4.2. The values for these
angle limits have been found empirically, in order that they suit the expected finger
movements.
• In order to bootstrap the tracking an initial complete particle is needed. This initial
complete particle contains a known configuration for the articulated hand contour that
matches the configuration of the target hand in the first frame of the tracking sequence.
Finger             Minimum angle (rad)   Maximum angle (rad)
Little             -0.6                  0.4
Ring               -0.25                 0.15
Middle             -0.25                 0.15
Index              -0.45                 0.4
Thumb segment 1    -0.9                  0.9
Thumb segment 2    -1.3                  0.35
Table 4.2: Finger angle constraints.
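The constraints can be applied to a candidate finger configuration as sketched below. The limits are those of Table 4.2 and of the finger length range given above; enforcing them by clamping, as well as the dictionary keys and function names, are assumptions of this sketch.

```python
# Angle limits in radians, from Table 4.2.
ANGLE_LIMITS = {
    "little": (-0.6, 0.4),
    "ring": (-0.25, 0.15),
    "middle": (-0.25, 0.15),
    "index": (-0.45, 0.4),
    "thumb1": (-0.9, 0.9),
    "thumb2": (-1.3, 0.35),
}

# Finger lengths relative to the template (1 = template length).
LENGTH_LIMITS = (0.15, 1.2)

def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))

def constrain_angle(part, angle):
    """Clamp the angle of a finger or thumb segment to its valid range."""
    low, high = ANGLE_LIMITS[part]
    return clamp(angle, low, high)

def constrain_length(length):
    """Clamp a relative finger length to its valid range."""
    return clamp(length, LENGTH_LIMITS[0], LENGTH_LIMITS[1])
```

The thumb segments have a fixed length of 1, so only their angles need to be constrained.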
4.6 Refining finger length estimates
The visual hand tracking in this thesis was designed as a solution to the hand tracking
requirements of the VTS application. In this application the flexion/extension of the fingers
could be very fast, for example when typing on a virtual keypad, or clicking on a virtual
button. The flexion/extension movements of the fingers are modelled, in the articulated
contour model, as changes in the length of the finger contour. The parameter that governs the
dynamics of the finger contour length could be tuned in order to suit these fast changes;
however, the required value would also make the measured finger contour length less stable.
On the other hand, the measured finger contour length needs to be as accurate as possible in
order that the inferred flexion/extension of the fingers is accurate enough for the VTS. For
these reasons, a mechanism, additional to the described contour tracking, is needed in order
to guarantee this extra accuracy.
This section introduces a technique that allows previously found finger length estimations to
be refined. The technique consists in using two parallel measurement lines placed along the
finger contour. The length of these lines is slightly longer than the maximum possible finger
length. The pixels along these measurement lines are tested for skin colour, starting from the
base of the finger, in order to find a skin colour edge, from skin colour to non-skin colour.
The position of the skin colour edge is used to correct the length of the finger. The procedure
is illustrated in Figure 4.8.
(a)
(b)
(c)
Figure 4.8: Procedure to refine finger length estimations. (a) the middle finger length is initially estimated by
the articulated hand contour tracking. (b) two measurement lines are used in order to find a more
accurate length of the middle finger. (c) the length of the middle finger is updated.
For this technique to produce good results, it needs to be able to handle noise along the
measurement lines, and potentially, skin colour from adjacent fingers. In order to achieve
this, the measurement lines are processed in the following way:
1. The pixels of the measurement line are tested for skin colour using the Linear Container
skin-colour classifier described in Chapter 5. The results are stored in an array: if the
pixel is skin coloured it is stored as 1; if it is not skin coloured it is stored as 0.
2. A one-dimensional morphological erosion is applied to the array in order to remove
noise; one pixel erosion considering 1 as foreground, and 0 as background.
3. Morphological dilation is applied to the array in order to fill any skin holes inside the
finger; 25 pixel dilation.
4. Morphological erosion is applied to the array in order to return the skin-colour edge to
its original location inside the array; 24 pixels erosion.
5. Finally, the array is scanned in order to locate a skin-colour edge.
This processing of the measurement lines makes the location of a skin-colour edge, on the
measurement line, more accurate. In addition, two parallel measurement lines are used in
order to combine their skin-colour edge positions, and, in this way, obtain a more reliable
length for the finger. If the distance between the skin-colour edges of the two measurement
lines is smaller than 0.3 (remember, the length of the fingers goes from 0 to 1), the resulting
fingertip position is calculated as an average of the two skin-colour edge positions;
otherwise, the fingertip position is calculated from the skin-colour edge that is further away
from the base of the finger.
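The processing of the two measurement lines can be sketched as follows. This is an illustration under stated assumptions: an n-pixel erosion or dilation is interpreted as a window reaching n samples to each side, samples outside the array are ignored, and the edge positions passed to `combine_edges` are already expressed as relative finger lengths. All function names are hypothetical.

```python
def erode(bits, n):
    """n-pixel 1-D erosion: a sample stays 1 only if every sample within n of it is 1."""
    return [min(bits[max(0, i - n):i + n + 1]) for i in range(len(bits))]

def dilate(bits, n):
    """n-pixel 1-D dilation: a sample becomes 1 if any sample within n of it is 1."""
    return [max(bits[max(0, i - n):i + n + 1]) for i in range(len(bits))]

def clean_line(bits):
    """Steps 2 to 4: remove noise, fill holes, then restore the edge position."""
    return erode(dilate(erode(bits, 1), 25), 24)

def skin_to_background_edge(bits):
    """Index of the first 0 that follows a 1, scanning from the finger base, or None."""
    for i in range(1, len(bits)):
        if bits[i - 1] == 1 and bits[i] == 0:
            return i
    return None

def combine_edges(edge_a, edge_b, max_gap=0.3):
    """Average two nearby edge positions; otherwise keep the one further from the base."""
    if abs(edge_a - edge_b) < max_gap:
        return (edge_a + edge_b) / 2.0
    return max(edge_a, edge_b)
```

The erosions remove 1 + 24 = 25 pixels in total, which cancels the 25-pixel dilation, so a clean skin-colour edge ends up at its original position in the array.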
The procedure presented here, for refining a previously estimated finger length, is applied
directly after step 2.4 (only for the fingers, not for the first thumb segment) of the
particle-set version algorithm in Figure 4.7. This step corresponds to selecting the particle
with highest weight in the particle set. After this particle is selected, the procedure for
refining the finger length estimation is applied. The resulting particle, with the refined finger
length, will be used in the following step, in order to generate interpolated particles for each
of the subsets of the particle set.
4.7 Sweep implementation
Section 4.5 described an articulated hand tracker implementation based on the
techniques presented in Chapter 3. In this section a new articulated hand tracker
implementation is described. This implementation is also based on the techniques presented
in Chapter 3; however, the particle sets for the fingers are replaced by a deterministic
search of the fingers' positions. The approach has two major advantages with respect to the
particle-set implementation. These advantages are: increased accuracy in locating the fingers,
and as a consequence increased tracking accuracy; and faster computation time.
This hand tracker works mostly in the same way as the particle-set implementation, it uses
the same hand contour model, the same dynamical model, the same measurement model, and
the same resampling scheme. The only difference is the way in which finger positions are
found. In the particle-set implementation, the particles in the finger particle sets correspond
to many potential finger positions, distributed according to the finger dynamics. The reported
finger position is determined by the particle with highest weight in this finger particle set.
The accuracy with which the finger position is found depends on the number of particles in
the finger particle set; more particles means better accuracy.
The hand tracker presented in this section replaces the fingers' particle sets with a
deterministic search, as a means of finding the finger positions. This search involves
measuring the fitness of each finger contour for a specific range of angles. The angle whose
finger contour has the highest fitness is selected for that finger. The angle is swept in a
progressive pattern; it starts with the finger angle, and length, of the previous time-step, then
the angle changes gradually with increasing steps: first with positive increments and then, after
returning to the initial angle, with negative increments. Note that, with this approach, no dynamics
are applied to the finger angle. Figure 4.9 illustrates this angle sweep pattern. Once the finger
angle is found in this way, the length of the finger is found using the finger length refinement
procedure of Section 4.6.
Figure 4.9: Angle sweep pattern. (a) First, the finger angle changes with positive increasing steps. (b) Then,
the finger angle changes with negative increasing steps.
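The sweep can be sketched as follows. The growth rule for the step size (the k-th step is k times a base step), the value of the base step, and the function names are assumptions of this sketch; the text only states that the steps increase progressively. With 7 steps per side the sweep evaluates 15 angle positions.

```python
def sweep_offsets(base_step, steps_per_side=7):
    """Angle offsets: 0 first, then growing positive steps, then growing negative steps."""
    positive, offset = [], 0.0
    for k in range(1, steps_per_side + 1):
        offset += k * base_step  # assumption: the k-th step is k times the base step
        positive.append(offset)
    return [0.0] + positive + [-p for p in positive]

def sweep_angle(previous_angle, fitness, base_step, limits, steps_per_side=7):
    """Return the swept angle with the highest fitness, or None if none is valid.

    `fitness` is a callable that scores the finger contour at a given angle
    (the contour score of Section 4.3 in the tracker). Candidate angles
    outside `limits` are skipped.
    """
    low, high = limits
    best_angle, best_fitness = None, None
    for offset in sweep_offsets(base_step, steps_per_side):
        angle = previous_angle + offset
        if not low <= angle <= high:
            continue
        value = fitness(angle)
        if best_fitness is None or value > best_fitness:
            best_angle, best_fitness = angle, value
    return best_angle
```

The candidates are densest around the angle of the previous time-step, where the finger is most likely to be found, and become sparser further away.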
It turns out that this sweep procedure requires fewer measurements, of the finger contour
fitness, than the particle-set implementation does. Using the presented progressive angle
sweep pattern, 15 angle positions are enough to estimate the finger angle with better
accuracy, on average, than using the particle set method with 100 particles. This is the reason
why the sweep implementation is faster than the particle-set implementation – it essentially
involves fewer measurement steps.
Conceptually, this algorithm uses partition sampling, as the configuration space is partitioned
exactly as in the particle-set version, and the searches in each of the partitions are performed
hierarchically. The finger position is found in two stages, first the finger angle and then the
finger length. This is as if we had partitioned the configuration space one level more, one
partition for the finger angles, and another partition for the finger lengths.
Figure 4.10 represents one time-step for the sweep articulated hand tracker implementation.
The top horizontal line is the Palm particle set. Dots on this particle set represent selected
particles, and the bigger dot is the particle with highest weight. From this particle, five
branches come out, one for each finger, and one for the first thumb segment: L for little, R
for ring, M for middle, I for index, and T1 for the first thumb segment. Each of the finger
branches undergoes, first, a sweep procedure (triangles with S on them) and then, a finger
length refinement (boxes with R on them). Once the angle and length of fingers are found,
these are used in order to generate interpolated particles, one for each of the selected
particles in the Palm particle set. This stage is represented in Figure 4.10 as a smaller
horizontal line with four particles (dots) on it. The thumb segments follow a slightly different
path. They only require angle estimation, which then can be used to generate interpolated
particles. The angle of the second thumb segment is estimated based on the angle of the first thumb segment. At
the end of the time-step it will be possible to form as many complete particles as particles
were selected in the Palm particle set. Figure 4.11 shows the algorithm for one time-step of
tracking with the sweep implementation.
Figure 4.10: Sweep hand tracker implementation diagram for one time-step. The triangles, with an S on
them, represent the sweep procedure. The boxes, with an R on them, represent the finger length
refinement procedure. The horizontal lines at the bottom represent particle interpolation; the red
dots are interpolated particles from the black dot, the particle with highest weight.
1. For the Palm particle set do:
   1.1. Use the complete particles from the previous time-step to generate new particles for Palm.
   1.2. Apply dynamics to each of the particles in Palm.
   1.3. Weight particles in Palm.
   1.4. Select particles from Palm that constitute peaks of weight in the set. (Resampling)
2. For each of the fingers, i.e. L, R, M, I, do:
   2.1. Apply the sweep pattern, using as starting position the finger angle of the previous time-step.
   2.2. Apply the finger length refinement.
   2.3. Use the angle and length found in the previous two steps in order to generate a new interpolated particle for each of the particles selected in the Palm particle set.
   2.4. Weight the interpolated particles.
3. For the first thumb segment do:
   3.1. Apply the sweep pattern, using as starting position the first thumb angle of the previous time-step.
   3.2. Use the angle found in the previous step in order to generate an interpolated particle for each of the selected particles in the Palm particle set.
   3.3. Weight the interpolated particles.
4. For the second thumb segment do:
   4.1. Apply the sweep pattern, using as starting position the second thumb angle of the previous time-step.
   4.2. Use the angle found in the previous step in order to generate an interpolated particle for each of the interpolated particles in the first thumb segment.
   4.3. Weight the interpolated particles.
5. Form complete particles for the next time-step.
Figure 4.11: Algorithm for one time-step of tracking with the sweep implementation.
4.8 Performance measures for articulated contour trackers
In the previous sections, we have presented two articulated hand contour tracker
implementations. Both implementations make it possible to track the user's hand contour
through a video sequence. However, we may ask: how accurate is the tracking in each of
these implementations? In order to answer this question, we need to define suitable
performance measures for articulated contour trackers. The performance measurement of an
articulated contour tracker is not an easy task. One problem is that the accuracy of the global
tracking and the accuracy of the articulated tracking interact with each other, which makes
it difficult to give a general benchmark. In particular, it is not very meaningful to give a single
value to explain the quality of a particular tracker on a particular video-sequence. For this
reason, the performance measures treated in this section will refer to a single time-step of
tracking. Hence, for a tracking sequence, the results will be in the form of a sequence of
values, one for each time-step of tracking.
This section presents five performance measures: cost function, contour distance, SNR,
distance between model points, and distance between model parameters. The first one is
based on a value internal to the tracker, the cost function; the rest are based on a ground
truth. The ground truth for the articulated hand contour tracker requires placing, for each
frame of a test video sequence, an articulated hand contour exactly on top of the real hand.
This process is done manually, and this means that it is costly, and not completely accurate.
4.8.1 Cost function
In Section 4.3, we presented a measurement model. Within this measurement model we
define a function whose return value describes the degree of fit between an articulated hand
contour and a real hand in an image. This function is referred to as a weighting function
when dealing with particle filters, and as a fitness function, when referring to template
matching. Here, in a more general sense, we refer to it as a cost function. Most trackers use a
cost function in their tracking mechanisms as a way of evaluating how good any particular
tracking state is, at a given time-step. Generally, the objective of a tracker is to maximize (or
minimize) the value of the cost function by changing the values of some model variables
from one time-step to the next one. This is an important point to have in mind when
comparing various tracker implementations which use the same cost function; one tracker
implementation will be better than another one in terms of which one best maximizes (or
minimizes) the value of the cost function.
However, the original design of a cost function is not as a performance measure. The cost
function may be a simplification and have a limited accuracy itself, because of computation
speed requirements, or because of measurement model limitations. In fact, as we will see in
Section 4.11, when the proposed skin colour based cost function is compared, as a
performance measure, with an external performance measure based on a ground truth, we
can sometimes observe a disagreement between the two performance measures.
We finish this section by noting that, since most trackers already provide the
cost function, its use as a performance measure is very attractive; however, a proper
benchmarking of the tracker should use other performance measures too.
4.8.2 Contour distances
A performance measure for contour trackers can be constructed by calculating, for every
frame of a tracking sequence, a distance metric between the tracker's output contour and a
ground truth contour. A simple distance metric between two contours could be defined in
terms of their control points. Let $\mathbf{x} = \{(x_i, y_i)\}$ and $\mathbf{x}' = \{(x'_i, y'_i)\}$, $i = 0, \dots, N-1$, be two sets of control points,
for the tracker output contour and the ground truth contour respectively. A distance metric
can be given by the norm of their difference:
\[ \|\mathbf{x} - \mathbf{x}'\| = \left( \sum_{i=0}^{N-1} (x_i - x'_i)^2 + (y_i - y'_i)^2 \right)^{1/2} \tag{4.1} \]
However, the contours used in this thesis are represented by cubic B-spline parametric
curves, and (4.1) does not take into account the distances between corresponding points on
each curve. A better distance metric can be formulated by including the B-spline metric
matrix as given in (Blake and Isard, 1998). Given two cubic B-splines $\mathbf{P}(s)$ and $\mathbf{P}'(s)$
defined by their control points $(x_i, y_i)$ and $(x'_i, y'_i)$, a more accurate distance metric, d,
measures the difference between corresponding points on each spline, sampled densely and
uniformly over the parametric curves. The distance metric is given by
\[ d(\mathbf{x}, \mathbf{x}') = \left( \int_0^N \left| \mathbf{P}(s) - \mathbf{P}'(s) \right|^2 ds \right)^{1/2} = \left( \int_0^N \Big( \sum_{i=0}^{N-1} (x_i - x'_i) B_i(s) \Big)^2 ds + \int_0^N \Big( \sum_{i=0}^{N-1} (y_i - y'_i) B_i(s) \Big)^2 ds \right)^{1/2} \tag{4.2} \]
where $B_i(s)$ is the i-th cubic B-spline basis function.
A performance measure based on this distance metric can be convenient, and gives useful
information about how similar, or different, the output of a contour tracker is to a ground
truth contour.
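As an illustration of the two metrics above, the following Python sketch evaluates (4.1) directly and approximates (4.2) numerically by sampling both splines densely. It is a minimal sketch, not the implementation used in this thesis: it assumes a closed, uniform cubic B-spline (so that N control points give N spans, with s in [0, N)), and all function names are hypothetical. The closed-form alternative based on the B-spline metric matrix is not shown.

```python
import math

def control_point_distance(q, q_ref):
    """Equation (4.1): Euclidean norm of the control-point differences."""
    return math.sqrt(sum((x - xr) ** 2 + (y - yr) ** 2
                         for (x, y), (xr, yr) in zip(q, q_ref)))

def bspline_point(q, s):
    """Point at parameter s in [0, N) on a closed uniform cubic B-spline
    (closed/uniform is an assumption made for this sketch)."""
    n = len(q)
    k = int(math.floor(s)) % n          # index of the current span
    t = s - math.floor(s)               # local parameter in [0, 1)
    # Standard uniform cubic B-spline blending functions (they sum to 1).
    b = ((1 - t) ** 3 / 6.0,
         (3 * t ** 3 - 6 * t ** 2 + 4) / 6.0,
         (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6.0,
         t ** 3 / 6.0)
    x = sum(b[j] * q[(k + j) % n][0] for j in range(4))
    y = sum(b[j] * q[(k + j) % n][1] for j in range(4))
    return x, y

def spline_distance(q, q_ref, samples_per_span=50):
    """Approximate (4.2) with a midpoint Riemann sum over s in [0, N)."""
    n = len(q)
    ds = 1.0 / samples_per_span
    total = 0.0
    for m in range(n * samples_per_span):
        s = (m + 0.5) * ds
        x, y = bspline_point(q, s)
        xr, yr = bspline_point(q_ref, s)
        total += ((x - xr) ** 2 + (y - yr) ** 2) * ds
    return math.sqrt(total)

# Hypothetical control points: a square, a translated copy, and a copy
# in which only one control point has moved.
square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
shifted = [(x + 3.0, y + 4.0) for x, y in square]
one_moved = [(5.0, 0.0)] + square[1:]
```

For a pure translation the two metrics coincide, because the basis functions sum to one. When a single control point moves, (4.2) is smaller than (4.1), because that control point only influences part of the curve.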
4.8.3 Signal to Noise Ratio (SNR)
A popular measure for evaluating the performance of a tracker is the SNR. The SNR is an
image processing based measurement (Tissainayagam and Suter, 2002); thus, it is
independent of the parameterisation of the contour representation. The SNR is defined in
terms of the overlap between the output contour of the tracking and the ground truth contour.
The output SNR (in dB), denoted $SNR_{out}$, for a single frame, is calculated as follows:
\[ \mathit{signal} = 2 \sum_{x,y} I_{ref}(x, y)^2 \]
\[ \mathit{noise} = \sum_{x,y} \big( I_{ref}(x, y) - I_{track}(x, y) \big)^2 \]
\[ SNR_{out}\,(\mathrm{dB}) = 10 \log_{10} \frac{\mathit{signal}}{\mathit{noise}} \tag{4.3} \]
where $I_{ref}$ and $I_{track}$ are binary images. $I_{ref}$ is 1 for points inside the ground truth contour,
and 0 outside. $I_{track}$ is similar, but uses the tracked contour. The scale factor 2 in the signal
value was chosen so that an SNR of 0 (i.e. signal = noise) would occur if the tracker silhouette
consisted of a shape of the same area as the ground truth shape, but inaccurately placed so
that there is no overlap between the two. This is the worst-case scenario where the tracker
has completely failed to track the object.
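The behaviour of (4.3) can be checked with a small Python sketch. This is only an illustration under stated assumptions: the binary images are given as nested lists of 0/1 values, the function name is hypothetical, and the handling of a perfect overlap (zero noise, returned as infinity) is a choice made here, not something specified in the text.

```python
import math

def output_snr_db(ref_mask, track_mask):
    """Equation (4.3) for two binary masks of equal size (lists of rows)."""
    signal = 2 * sum(v * v for row in ref_mask for v in row)
    noise = sum((r - t) ** 2
                for ref_row, track_row in zip(ref_mask, track_mask)
                for r, t in zip(ref_row, track_row))
    if signal == 0:
        raise ValueError("the ground truth mask is empty")
    if noise == 0:
        return math.inf  # perfect overlap (convention chosen for this sketch)
    return 10.0 * math.log10(signal / noise)

# Hypothetical one-row masks.
ref = [[1, 1, 0, 0]]
no_overlap = [[0, 0, 1, 1]]    # same area as ref, placed elsewhere
half_overlap = [[0, 1, 1, 0]]  # same area as ref, half of it overlapping
```

With these masks, the non-overlapping silhouette of equal area gives signal = noise = 4, and therefore 0 dB, which is the worst case described above.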
4.8.4 Distance between model points
The previous performance measures refer to the whole contour as a single entity; however, in
articulated tracking we may want to know how good the tracking of individual
articulations is. One easy way of calculating this is by measuring the 2D Euclidean distance
between two corresponding points, situated on some articulation of the tracking output
contour and a ground truth contour. Some candidate points are: the origin of the articulated
contour, any of the finger pivots, and the fingertips. The measurement of the tracking
accuracy of the individual fingertips, against a ground truth, is of special importance in the
applications of Chapter 8, where the movement and position of the fingertips determine the
location of the 'touch' on an interactive surface.
4.8.5 Distance between model parameters
As the proposed articulated hand contour tracking is model based, an alternative benchmark
is the discrepancy, for each time-step of tracking, between the model parameters and the
ground truth model parameters. This performance measure allows examining individual
model parameters but relies completely on the ground truth. This makes the measure a bit
inflexible, as there could be situations in which a combination of model parameters could
differ from the ground truth, but the resulting contour still fits correctly in some areas of
interest, for example the fingertips.
4.9 Test video sequence
In order to assess the performance of the hand tracking implementations presented in this
thesis, a test video sequence was created. As the purpose of the articulated hand tracking is to
be used in the VTS applications of Chapter 8, this video sequence was designed for testing
each of the VTS hand tracking requirements. The video sequence consists of a VTS user
holding their hand, open, in front of a camera, hand parallel to the camera's image plane,
exactly as if the user were to interact with a VTS. Then, the user starts moving their hand
following certain patterns of rigid and articulated movement. Figure 4.12 shows a diagram of
the test video sequence structure. The video sequence has a total of 890 frames (although
tracking starts at frame 30) and it is composed of the following five sections:
• From frame 30 to frame 174, the hand moves rigidly with splayed fingers. This section
tests the tracker's capability for rigid tracking.
• From frame 175 to frame 359, the hand moves rigidly with splayed fingers again; this
time with very brisk movement. The movement is so intense, sometimes more than 50
pixels of swing from one frame to the next one, that this section of the test video
sequence can only be successfully tracked by using some of the techniques introduced in
Chapter 7. Note that the brisk motion not only involves fast translations but also fast
rotation, and fast changes in scale. This section tests the tracker's capability for brisk
rigid tracking.
• From frame 360 to frame 402, the hand remains static but fingers flex and extend, as if
the user was typing on a specific area of a VTS. This section tests the tracker's capability
for articulated finger tracking, isolating it from global hand tracking.
• From frame 403 to frame 572, the hand moves rigidly at the same time that fingers flex
and extend, as if the user was typing on a wider area of a VTS. This section tests the
tracker's capability for simultaneous global and articulated tracking.
• From frame 573 to frame 890, the hand moves rigidly while a finger is kept flexed, as if
the user was dragging an object on the VTS. The dragging of the finger describes a
square on the VTS; this is first done for the index finger, then for the middle finger, and
finally for the ring finger. This section tests the tracker's capability for tracking the rigid
movement of the hand while a finger remains flexed.
In addition to the main movement pattern of each section, at the end of the first, second, and
last sections the thumb is flexed. This allows testing the tracking for the two joints of the
thumb.
Figure 4.12: Test video sequence structure.
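Since every performance measure is reported per frame, it can be convenient to group results by the five sections listed above. The following Python sketch encodes the frame ranges given in the text; the short section labels, the function names, and the idea of averaging per section are illustrative assumptions and not part of the thesis software.

```python
# Frame ranges of the test video sequence, as listed in the text
# (both ends inclusive). The labels are shorthand chosen for this sketch.
SECTIONS = [
    (30, 174, "rigid"),
    (175, 359, "brisk rigid"),
    (360, 402, "static typing"),
    (403, 572, "dynamic typing"),
    (573, 890, "dragging"),
]

def section_of(frame):
    """Return the label of the section containing a frame, or None."""
    for first, last, label in SECTIONS:
        if first <= frame <= last:
            return label
    return None  # frames before 30 are not tracked

def mean_by_section(values_by_frame):
    """Average a per-frame measure (dict: frame -> value) per section."""
    sums, counts = {}, {}
    for frame, value in values_by_frame.items():
        label = section_of(frame)
        if label is None:
            continue
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}
```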
A ground truth for this test video sequence was created in the following way: for each frame
of the video sequence an articulated template, like the one in Figure 4.1(b), was manually
adjusted in order to match the hand configuration in that frame. Once the template was
adjusted for each frame, the parameters of the articulated template were stored. This allows
the recreation of the tracking for the whole test video sequence. The test video sequence is
available in Appendix B and on the supporting webpage as "video_sequence_4.1.avi". The
ground truth for the test video sequence is available in Appendix B and on the supporting
webpage as "ground_truth.txt".
4.10 Results and comparisons
This section presents tracking results of the two articulated hand contour tracking
implementations: the particle-set implementation, and the sweep implementation. The
particle-set implementation uses 250 particles for the Palm particle set, and 100 particles for
each of the finger and two thumb partitions. The sweep implementation uses 250 particles
for the Palm particle set; the finger and thumb positions are found using the deterministic
search described in Section 4.7. Both hand tracker implementations use the resampling
scheme described in Section 4.4, which involves propagating 10% of the particles from one
time-step to the next.
Both tracker implementations are tested with the video sequence described in the previous
section. However, the section between frames 175 and 359, corresponding to brisk rigid
hand motion, cannot be successfully tracked by either of the two trackers. For this reason the
test is performed in three sections, initialising the tracker at the beginning of each one. The
first section goes from frames 30 to 174, and corresponds to the rigid hand motion. The
second section goes from frames 175 to 359, and corresponds to the brisk rigid hand motion.
During this section both trackers fail to keep a lock on the target. The third section goes from
frames 360 to 890, and corresponds to static typing, dynamic typing, and dragging. The two
tracker implementations calculate three performance measures for each frame of the
tracking: cost function, contour distance, and SNR. The results are presented in
the form of charts; the vertical axis is specific to each performance measure, and the horizontal
axis is the frame number. The cost function is shown on a logarithmic scale because of the
large range its values can span: these could potentially go from about
8.47E-22 to 1.180E+21. This large range is due to the way the cost
function is calculated (see Section 4.3.2 for further information). Each chart displays the
mean and variance of its values, with the exception of the cost function chart, which
shows the log variance (see footnote 4).
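The range quoted above, from about 8.47E-22 to 1.180E+21, spans roughly 42 orders of magnitude, which is why the variance of the raw cost function values would be dominated by a few very large values. The Python sketch below shows the log variance computation described in footnote 4. It assumes the population variance (the text does not say whether the population or the sample variance is used), and the function name is hypothetical.

```python
import math
import statistics

def log_variance(values):
    """Variance of the base-10 logarithms of strictly positive values.

    Uses the population variance; this is an assumption of the sketch.
    """
    return statistics.pvariance([math.log10(v) for v in values])

# Hypothetical cost function values for four frames.
costs = [1e-2, 1e2, 1e-2, 1e2]
```

For these values the logarithms are -2 and 2, so the log variance is 4, whereas the variance of the raw values is close to 2500 and reflects only the two large values.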
Videos for the tracking output of both trackers are available in Appendix B and on the
supporting webpage as:
"video_sequence_4.2.avi" for particle-set implementation frames 30-174;
"video_sequence_4.3.avi" for particle-set implementation frames 360-890;
"video_sequence_4.4.avi" for sweep implementation frames 30-174;
"video_sequence_4.5.avi" for sweep implementation frames 360-890.
Figure 4.13 shows a comparison between the particle-set results, left column, and the sweep
results, right column, for frames 30 to 174. Figure 4.13(a) and (b) show the cost function
charts for the particle-set and sweep implementations respectively. We can observe that the
average of the cost function values for the sweep implementation is larger, and the log
variance is smaller, than for the particle-set implementation. A larger value for the cost
function indicates that the fit between the hand contour resulting from the tracking, and the
real hand on the image, is better. A smaller value for the log variance indicates that the
tracking is more stable. Figure 4.13(c) and (d) show the contour distance charts for the
particle-set and sweep implementations respectively. The contour distance is calculated
using the distance metric of Section 4.8.2. A smaller value of the metric indicates
that the hand contour in the tracking is closer to the hand contour in the ground truth. We can
see that, for this section of the tracking, the metric values for the sweep implementation are
considerably smaller than the values for the particle-set implementation. The variance of the
metric is also smaller in the sweep implementation than in the particle-set implementation.
This indicates that the sweep implementation tracking is more stable. Finally, in Figure
4.13(e) and (f) we can see the SNR charts for the particle-set and sweep implementations
respectively. A larger SNR value indicates that the overlap between the hand contour in the
tracking and the hand contour in the ground truth is greater. The SNR in both charts is quite
similar; however, if we look at the average and variance values, we can see that the sweep
implementation has a slightly larger average SNR, about 1 dB more, and less variance.
Footnote 4: We calculate the log variance by taking the base-10 logarithm of the data, and then calculating its variance.
Figure 4.13: Performance comparison from frame 30 until 174. Left column: particle-set results; right column: sweep results. Panels (a), (b): cost function; (c), (d): contour distance; (e), (f): SNR.
Figure 4.14 and Figure 4.15 show four example frames for the particle-set tracker output and
the sweep tracker output respectively. These frames illustrate the type of rigid motions and
common fitting errors for each tracker between frames 30 and 174.
Figure 4.14: Example frames (50, 72, 130 and 158) of the particle-set tracker output from frame 30 to frame 174.
Figure 4.15: Example frames (50, 72, 130 and 158) of the sweep tracker output from frame 30 to frame 174.
Figure 4.16 shows a comparison between the particle-set results, left column, and the sweep
results, right column, for frames 175 to 359. During this section the motions of the target
hand are so intense that the tracking is lost for both trackers. Note that both trackers have a
stopping mechanism, which stops the tracking from continuing if the cost function goes
below 1E-4. If this mechanism did not exist the tracking would continue for longer; however,
the lock could be on the wrong object.
Figure 4.16: Performance comparison from frame 175 until 359. Left column: particle-set results; right column: sweep results. Both trackers fail to keep a lock on the target hand.
Figure 4.17 shows a comparison between the particle-set results, left column, and the sweep
results, right column, for frames 360 to 890. Figure 4.17(a) and (b) show the cost function
charts. We can see that the cost function values for the sweep implementation are again
larger on average than the cost function values for the particle-set version. The log variance
of the cost function for both implementations is approximately the same. Figure 4.17(c) and
(d) show the distance metric charts for the particle-set and sweep implementations
respectively. We can see that both the average and the variance of the metric for the sweep
implementation are considerably smaller than the average and variance for the particle-set
implementation. Finally, in Figure 4.17(e) and (f) we can see the SNR charts for the particle-
set and sweep implementations respectively. Both charts look quite similar; however, if we
look at the average SNR value, the sweep implementation has a slightly larger one than the
particle-set implementation. Also, the SNR variance in the sweep implementation is slightly
smaller, which indicates that the tracking for this implementation is slightly more stable.
Figure 4.18 and Figure 4.19 show four example frames for the particle-set tracker output and
the sweep tracker output respectively. These frames illustrate the type of rigid motions and
common fitting errors for each tracker between frames 360 and 890.
Figure 4.17: Performance comparison from frame 360 until 890. Left column: particle-set results; right column: sweep results. Panels (a), (b): cost function; (c), (d): distance metric; (e), (f): SNR.
Figure 4.18: Example frames (523, 584, 627 and 817) of the particle-set tracker output from frame 360 to frame 890.
Figure 4.19: Example frames (523, 584, 627 and 817) of the sweep tracker output from frame 360 to frame 890.
There is a certain variability among the three performance measure results for each
implementation; this is due to the different ways in which the performance measures are
calculated. Despite this variability in the results, we can clearly conclude that the sweep hand
tracker implementation produces a more accurate output, and is slightly more stable, than
the particle-set implementation. Finally, another advantage of the sweep implementation is
that it is about 1.42 times faster than the particle-set implementation. This difference is
important in real-time systems as it can increase the responsiveness of the system.
4.11 Relationship between performance measures
In the previous section we used three indicators, cost function, contour distance, and SNR, in
order to assess the performance of the two hand tracking implementations. If we look at the
three charts for the sweep implementation, we can see that, on average, the three indicators
agree: when one indicator shows good performance the other two tend to show good
performance too, and when one indicator shows bad performance the other two tend to show
bad performance too. However, if we look at the results for individual frames we can often
see small disagreements. For example, for a particular frame the cost function could rise,
indicating a better fit of the hand model to the image, and the distance metric could rise too,
indicating a larger distance from the ground truth. The same is true between the cost function
and the SNR; and even, to a lesser degree and for different reasons, between the distance
metric and the SNR.
The main cause of these disagreements is that the distance metric and the SNR are based on
a ground truth, but the cost function is based on a measurement model, as discussed in
Section 4.3. On the one hand, the ground truth is created manually and can potentially
involve human errors. On the other hand, the measurement model used in the cost function
explores only certain areas of the image, the measurement lines, in order to evaluate the
fitness between a contour model and the image features. This is an efficient measurement
model in the sense of not having to explore the whole image in order to assess the contour
likelihood. However, the measurement lines can pick up noise and other non-target-related
image features that could result in an incorrect fitness value for a contour. Finally, the contours of
the hand in the image (used to establish the ground truth) may differ slightly from the
contours of the hand in the skin colour image (used to calculate the fitness of the contour).
Depending on what type of features are searched along the measurement lines, the cost
function can be more or less prone to report incorrect contour likelihoods. In Section 6.1 we
will see that, under certain assumptions, the skin colour based measurement model is less
prone to report incorrect likelihoods than an equivalent edge detection based measurement
model.
In order to study the relationship between the cost function and a ground truth based
measurement function, the following experiment is performed:
• We run the sweep hand tracker implementation on the rigid tracking section of the test
video sequence; while tracking is locked onto the target, we collect 30000 hand contour
hypotheses, which are particles from the Palm particle set. Some of these hypotheses will
be well aligned on top of the hand, as it appears on the image, and therefore will have a
high weight. Other hypotheses will be poorly aligned or placed away from the hand and
therefore will have a low weight. We evaluate the cost function and distance metric for
all of them.
The results of this experiment are collected in a chart containing 30000 points, one for each
hand contour hypothesis. Figure 4.20(a) shows this chart using logarithmic axes for both the
cost function and distance metric. In this chart it is possible to see a relationship between the
cost function and the distance metric: as the distance metric increases, the cost function
decreases. The strength of this relationship can be calculated by performing a Spearman's rank
correlation on the dataset. The resulting correlation coefficient is -0.6941. This indicates a fairly
strong negative correlation between the cost function and the distance metric. In Figure
4.20(b) we can see the same data using linear axes. This chart shows that the cost function is
very low at large distances and grows extremely rapidly once the distance metric drops below 200. Figure
4.20(c) shows a close-up view, in which it is possible to see that the cost function peaks at a
distance metric of around 20. These observations give us a better insight into the distance
metric performance charts of Section 4.10. One thing to note is that even for an idealized
tracker (that is, a tracker that could find the hypothesis with the largest possible weight for
every frame of a tracking sequence) a chart of the distance metric for some tracking sequence
would show an average distance metric of about 20. Another thing to note is that a hypothesis
with a large weight is certainly going to be near the real contour. As the weight of a
hypothesis decreases, it can be said that, over a small interval, the distance to the real contour
increases. However, a low weight does not tell us anything about how
near the hypothesis is to the real contour. In general it would be far from the real contour, but it could also
be near.
Figure 4.20: Relationship between cost function and distance metric. (a) Both axes in logarithmic scale, (b) both axes in linear scale, (c) both axes in linear scale, close-up view.
4.12 Technologies employed in the implementations
The development of the hand contour trackers presented in this chapter has involved a
number of third-party libraries:
• Oxford Tracking Library (OTL, 1999). Developed by the Oxford Visual Dynamics
Group, this library provides a framework for contour tracking. This library was useful for
understanding the implementation of contour trackers, and was used in early
experiments. From there, a new library was specifically written to suit the requirements
of the hand trackers developed in this thesis. The only part of OTL that the trackers in this
thesis still use is the B-spline drawing code.
• OpenCV (OpenCV, 2006). OpenCV is an open source computer vision library developed
by Intel. Several functions from this library are used in the implementation of the hand
trackers in this thesis.
• DirectShow SDK (DirectShow, 2005). This library is used for video streaming under
Microsoft Windows platforms.
4.13 Conclusions
This chapter has presented two articulated hand contour tracker implementations based on
the techniques described in Chapter 3. These are named the particle-set and sweep
implementations. Both implementations were benchmarked using the same test video
sequence, and three performance measures were evaluated for each implementation. The
sweep implementation produced the best results in both tracking accuracy and computational
speed. Note that an increase in the number of particles used in any of this chapter's hand
tracking implementations would also increase the tracking accuracy. However, all the hand
tracking implementations in this thesis use only 250 particles – unless otherwise specified.
This number of particles is chosen because it produces a reasonably good tracking performance
while allowing the tracker to run in real-time.
The two hand trackers presented in this chapter are novel for a number of reasons:
• The two trackers use partition sampling in combination with particle interpolation. The
combination of these two techniques makes it possible to track a 14 DOF articulated
hand template using only 250 particles. An equivalent hand contour tracker using only
Condensation would be unfeasible, as it would require nearly 250 million particles in
order to have a similar tracking accuracy.
• The measurement model is exclusively based on skin colour.
• The resampling scheme uses only the 10% of the particles with highest weight.
• The particular point of view from which the hand is tracked: palm parallel to camera's
image plane. In practice, small variations from the plane parallel to the camera's image
plane are still tracked, especially variations on the tilt of the hand with respect to the
vertical.
• The articulated hand model used is novel on its own, especially for the representation of
the finger flexion/extension as a change in the perceived finger length.
The measurement model used in this chapter uses a skin colour classifier that is especially
well suited for use in HCI. This classifier is presented in the following chapter.
5 A skin colour classifier for HCI
This chapter proposes a skin colour classifier, which is designed to detect skin colour in
tracking applications, such as the ones found in Human Computer Interaction (HCI) systems;
in particular, this classifier is used in the measurement function, Section 4.3, of the
articulated hand trackers presented in this thesis. The classifier, named the Linear Container
(LC) classifier, uses four decision planes in order to define a volume of the RGB space
where skin colours are likely to appear. The main features of the LC classifier are:
• Rapid evaluation.
• Minimal storage requirements.
• Resistance to illumination (brightness) changes equivalent to that of classifiers that work
in normalised RGB.
The classifier needs an initialisation step, where a single training image, with marked skin
and background areas, is analysed in order to find the model parameters of the classifier.
Other features of the LC classifier related to the initialisation step are:
• It can be tuned to maximize the detection of skin colour for a specific person, and for
specific illumination conditions.
• Fast calculation of the model parameters, which can allow the model to be updated
dynamically.
• Even when the classifier is initialised under non-ideal conditions, the skin
detection rates are still high.
• Good generalisation of training data.
The LC classifier is tested under various illuminations and on various skin colour tonalities, and the
results are then compared with other classifiers. The time the LC classifier spends in the
initialisation step can be greatly reduced by reducing the resolution of the training image;
the performance of the LC classifier is therefore studied when the initialisation step is repeated at
various decimated resolutions. Some HCI usability factors, related to non-ideal
initialisations, are also explored. Finally, the use of the LC classifier is illustrated with two example
applications: firstly, an initialisation example of the LC classifier in an HCI application;
secondly, a dynamic tuning procedure for the LC classifier, presented in the context of
contour tracking.
5.1 Previous Work on Skin Colour Detection
Skin colour provides an important source of information for computer vision systems that
monitor people. The skin colour cue is widely used in face detection and recognition
systems, various types of surveillance, vision-based biometric systems, and vision-based
HCI systems. All these application areas use skin colour to track, locate and interpret people,
with relatively efficient, fast, low-level methods.
The goal of skin colour detection is to build a decision rule that can discriminate between the
skin and non-skin colour pixels of an image. Because of the importance of skin colour
detection there have been numerous approaches to solve this task. The various approaches
can be grouped into the following four categories: non-parametric skin distribution
modelling, parametric skin distribution modelling, explicitly defined skin region modelling,
and dynamic skin colour modelling (Vezhnevets et al., 2003).
Non-parametric skin distribution modelling uses training data to estimate a skin colour
distribution. This estimation process is sometimes referred to as the construction of a Skin
Probability Map (SPM) assigning a probability value to each point of a discretized colour
space (Jones and Rehg, 1999; Brand and Mason, 2000; Gomez, 2002). A SPM can be
implemented by a colour histogram, and such approaches normally use the chrominance
plane of some colour space in order to offer resistance to illumination changes (Jones and
Rehg, 1999; Chen, et al., 1995; Zarit, et al., 1999; Schumeyer and Barner, 1998). SPMs can
use a Bayes classification rule in order to improve their performance; in this case two colour
histograms are required: one for the probability of skin colour, and another for the
probability of non-skin colour (Jones and Rehg, 1999; Zarit, et al., 1999; Chai and
Bouzerdoum, 2000). The main disadvantages of SPMs are the high storage requirements and
the fact that their performance directly depends on the representativeness of the training
images.
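The two-histogram Bayes rule described above can be sketched in a few lines of Python. This is an illustrative sketch rather than any of the cited implementations: it assumes the normalised rg chromaticity plane (r = R/(R+G+B), g = G/(R+G+B)) as the chrominance space, 32 bins per axis, a likelihood-ratio threshold of 1, and hypothetical function names and training pixels.

```python
from collections import Counter

BINS = 32  # bins per chromaticity axis (an illustrative choice)

def chroma_bin(rgb, bins=BINS):
    """Map an (R, G, B) pixel to a bin of the normalised rg plane."""
    r, g, b = rgb
    total = r + g + b
    if total == 0:
        return (bins // 3, bins // 3)  # black pixel: treated as grey (assumption)
    return (min(int(bins * r / total), bins - 1),
            min(int(bins * g / total), bins - 1))

def build_histogram(pixels):
    """Count how many training pixels fall into each chromaticity bin."""
    return Counter(chroma_bin(p) for p in pixels)

def likelihood_ratio(rgb, skin_hist, nonskin_hist, eps=1e-9):
    """P(colour | skin) / P(colour | non-skin), estimated from the histograms.

    The totals are recomputed on every call for brevity; a real classifier
    would normalise the histograms once after training.
    """
    idx = chroma_bin(rgb)
    p_skin = skin_hist[idx] / max(sum(skin_hist.values()), 1)
    p_nonskin = nonskin_hist[idx] / max(sum(nonskin_hist.values()), 1)
    return p_skin / (p_nonskin + eps)

def is_skin(rgb, skin_hist, nonskin_hist, threshold=1.0):
    """Bayes decision: label as skin when the likelihood ratio is large."""
    return likelihood_ratio(rgb, skin_hist, nonskin_hist) > threshold

# Hypothetical training pixels.
skin_hist = build_histogram([(200, 120, 100), (190, 115, 95), (210, 125, 105)])
nonskin_hist = build_histogram([(50, 60, 200), (60, 60, 60), (30, 200, 40)])
```

Because the chromaticity coordinates discard brightness, the darker pixel (100, 60, 50) falls in the same bin as the training pixel (200, 120, 100) and is classified as skin. This is the resistance to illumination changes that motivates working in a chrominance plane.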
Parametric skin distribution modelling can represent skin colour in a more compact form.
Common examples of parametric modelling represent a skin colour distribution using a single
Gaussian (Menser and Wien, 2000; Terrillon, et al., 2000; Ahlberg, 1999), or a mixture of
Gaussians (Jones and Rehg, 1999; Terrillon, et al., 2000; Yang and Ahuja, 1999).
Expectation Maximization (EM) algorithms are used on training data to find the model
parameters that produce the best fit. The goodness of fit, and therefore the performance of
the model, depends on the shape of the chosen model and the chosen colour space. This
dependency of performance on the colour space is stronger in the case of parametric
modelling than it is in the case of non-parametric modelling (Brand and Mason, 2000; Lee
and Yoo, 2002).
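For the single Gaussian case, the maximum-likelihood parameters are simply the sample mean and covariance of the training colours, so no iterative EM step is needed; EM becomes necessary for mixtures. The Python sketch below fits a 2D Gaussian to chrominance samples and scores a colour by its squared Mahalanobis distance. The choice of a 2D chrominance space, the function names, and the example points are assumptions made for illustration.

```python
def fit_gaussian_2d(points):
    """Sample mean and (population) covariance of 2D chrominance samples."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance of a 2D point to the fitted Gaussian."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("degenerate covariance; more varied samples are needed")
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # d^T C^-1 d, with the inverse of the 2x2 covariance written out explicitly.
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

def is_skin(point, mean, cov, max_dist_sq=4.0):
    """Label as skin when the colour lies within the chosen distance.

    The threshold of 4.0 (two standard deviations) is an arbitrary example.
    """
    return mahalanobis_sq(point, mean, cov) <= max_dist_sq

# Hypothetical chrominance samples with unit variance on each axis.
mean, cov = fit_gaussian_2d([(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)])
```

The whole model is stored in five numbers (two for the mean, three for the symmetric covariance), which illustrates the compactness of parametric models compared with an SPM.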
Another way to build a skin colour classifier is to define explicitly, through a number of
rules, the boundaries of a skin cluster in some colour space; this is called explicitly defined
region modelling. The obvious advantage of this method is its computational simplicity,
which has attracted many researchers (Ahlberg, 1999; Peer, et al., 2003; Fleck, et al., 1996;
Jorda, et al., 1999), as it leads to the construction of a very rapid classifier. However, in order
to achieve high recognition rates both a suitable colour space and adequate decision rules
need to be found empirically. Gomez and Morales (2002) proposed a method that can build a
set of rules automatically by using machine learning algorithms on training data. They
reported results comparable to the Bayes SPM classifier in RGB space for their data set.
Finally, we have dynamic skin colour modelling. This category of skin modelling methods is
designed for skin detection during tracking. Skin detection in this category is different from
static image analysis in a number of aspects. First, in principle, the skin models in this
category can be less general – i.e. tuned for a specific person, camera, or lighting. Second, an
initialisation stage is possible, when the skin region of interest is segmented from the
background by a different classifier or manually; this makes it possible to obtain a skin
classification model that is optimal for the given conditions. Finally, this category of skin
models can update themselves in order to match changes in lighting conditions.
Some of the methods in this category use Gaussian distribution adaptation (Yang and Ahuja,
1998), or dynamic histograms (Soriano, et al., 2000; Stern and Efros, 2002; Sigal, et al.,
2000). In (Soriano, et al., 2000) a skin locus, in rg space, is constructed beforehand from
training data. Then, during tracking, their dynamic skin colour histogram is updated with
pixels from the bounding box of the target, provided these pixels belong to the skin locus.
This makes the dynamic histogram less likely to adapt to colour distributions other than that
of skin.
The proposed LC classifier belongs to the last two categories. The classifier is implemented
using rules similar to those of the explicitly defined skin region models; however, these rules
are parameterised in order that they can be tuned to specific conditions, during an
initialisation stage. The parameters of the LC classifier can also be recalculated rapidly in
order to adapt to changing illumination conditions.
5.2 Development of the LC classifier
The LC classifier was designed to overcome some of the shortcomings of three other skin colour
classifiers. This section briefly describes these three skin colour classifiers, which led to
the construction of the final LC classifier.
5.2.1 RGB histogram classifier
One of the simplest skin colour classifiers is an SPM implemented with an RGB histogram
(Jones and Regh, 1999; Chen, et al., 1995; Zarit, et al., 1999; Schumeyer and Barner, 1998).
The RGB colour space is quantised into a number of bins, for example 256×256×256 bins.
Each bin, defined by a triad of values, stores the number of times this particular colour
occurred in the training skin images. After the training stage, a pixel can be tested as being
skin colour or not by using the RGB components of the pixel to form the address of a bin in
the histogram. The main features of the RGB histogram classifier are:
• Very fast.
• Large storage requirements.
• Poor resistance to illumination changes.
• Need of a large data set for training.
• Larger bin size can reduce storage requirements, and account for training data sparsity;
however, then, the false positives can increase.
Jones and Regh (1999) reported the 32×32×32 quantisation as being the best compromise
regarding storage, generalisation, and detection rates. Figure 5.1 shows a skin colour
histogram made from a single skin colour sample. The quantisation is 32×32×32, and only
the bins with a count greater than 5 are shown.
Figure 5.1: Skin colour RGB histogram.
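The training and lookup steps described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the thesis: the function names are hypothetical, the histogram is stored sparsely in a Counter instead of a dense 32×32×32 array, and the threshold of 5 is the value reported for the RGB histogram in Section 5.4.

```python
from collections import Counter

SHIFT = 3  # 256 / 32 = 8 = 2**3 colour values per bin (32x32x32 quantisation)

def bin_index(r, g, b):
    """Map an 8-bit RGB triple to the address of its histogram bin."""
    return (r >> SHIFT, g >> SHIFT, b >> SHIFT)

def train_histogram(skin_pixels):
    """Count how often each quantised colour occurs in the training skin pixels."""
    return Counter(bin_index(r, g, b) for r, g, b in skin_pixels)

def is_skin(hist, pixel, threshold=5):
    """A pixel is skin colour if the count in its bin exceeds the threshold."""
    return hist[bin_index(*pixel)] > threshold

# Toy training set: one frequent skin colour and one rare colour.
training = [(200, 150, 120)] * 10 + [(90, 60, 50)] * 2
hist = train_histogram(training)

print(is_skin(hist, (203, 148, 125)))  # falls in the bin of the frequent colour
print(is_skin(hist, (90, 60, 50)))     # seen only twice, not above the threshold
print(is_skin(hist, (20, 200, 30)))    # never seen during training
```

A dense array would match the "large storage requirements" noted above more literally; the sparse Counter is used here only to keep the sketch short.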
5.2.2 Normalised RGB histogram classifier
Normalised RGB can be easily obtained from the RGB values by the following
normalisation procedure:
$$r = \frac{R}{R+G+B}, \qquad g = \frac{G}{R+G+B}, \qquad b = \frac{B}{R+G+B} \qquad (5.1)$$
As the sum of the three normalised components is 1, the third component does not hold
significant information and can thus be omitted. The remaining components, often called
“pure colours”, have a diminished dependence on brightness. A property of this
representation is that for matte surfaces normalized RGB is invariant (under certain
assumptions) to changes of surface orientation relative to the light source (Skarbek and
Koschan, 1994). This, together with the transformation simplicity has made this colour space
a popular choice among several researchers (Zarit, et al., 1999; Lee and Yoo, 2002; Peer, et
al., 2003; Stern and Efros, 2002; Yang and Ahuja, 1998; Brown, et al., 2001; Soriano, 2000;
Oliver, et al., 1997). A skin colour histogram using the r and g components is therefore going
to be more resistant to illumination (brightness) changes than a bare skin colour RGB
histogram. The main features of the rg histogram classifier are:
• Fast, but slower than an RGB histogram. This is because every pixel has to be
normalised before accessing the histogram.
• Good resistance to illumination (brightness) changes.
• Needs less training data than an RGB histogram.
Figure 5.2 shows an rg histogram made from a single skin colour sample.
Figure 5.2: Normalised RGB histogram.
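The brightness resistance of the rg representation can be illustrated with a short sketch of Equation 5.1. The function names and the handling of pure black are assumptions made for this example; the 64×64 bin count is the one used for the rg histogram in Section 5.4.

```python
def normalise_rgb(r, g, b):
    """Return the (r, g) pair of Equation 5.1; b = 1 - r - g is omitted."""
    total = r + g + b
    if total == 0:
        return (0.0, 0.0)  # black has no chromaticity; this convention is an assumption
    return (r / total, g / total)

def rg_bin(r, g, b, bins=64):
    """Address of the rg histogram bin for an RGB pixel."""
    rn, gn = normalise_rgb(r, g, b)
    return (min(int(rn * bins), bins - 1), min(int(gn * bins), bins - 1))

bright = (200, 150, 120)
dark = (100, 75, 60)  # the same colour at half the brightness

print(normalise_rgb(*bright))
print(rg_bin(*bright) == rg_bin(*dark))  # both pixels address the same bin
```

The per-pixel division is the extra cost that makes this classifier slower than a plain RGB histogram lookup.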
5.2.3 Projected RGB histogram
The projection from RGB (3D space) to normalised RGB (2D space) corresponds with a
cone in the original 3D RGB space, in that each point in the rg-plane corresponds to a 3D
line of colour values in the original RGB space. These lines meet at (0, 0, 0), and points
along the lines correspond to scaling of white illumination. Therefore, a skin colour cluster in
the rg-plane corresponds to a cone-like cluster in RGB space. This is illustrated in Figure 5.3.
Figure 5.3: Projection from rg to RGB. (a) Skin colour cluster, in the rg-plane, from a single sample. (b) The
rg-plane skin colour cluster projected to RGB space; each point in the rg-plane becomes a line in
RGB space.
A new skin colour classifier is proposed: the projected RGB histogram classifier. This
classifier tries to combine the evaluation speed of the RGB histogram and the illumination
independence of the rg histogram. The procedure to construct this projected histogram is to
create an rg histogram from the training data, and then project it to RGB, creating an RGB
histogram. The bins of the resulting RGB histogram are processed using a 3D median filter
in order that the skin colour cone becomes more solid, and fills any possible gaps (due to
sparse data in the original rg histogram). The main features of the projected RGB histogram
classifier are:
• As fast as the RGB histogram classifier.
• Good resistance to illumination changes (same as the rg histogram classifier).
• Needs less training data than an RGB histogram.
• Large storage requirements.
• The processing of the resulting RGB histogram with a 3D median filter, can account for
some training data sparsity, resulting in better generalisation.
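One possible reading of the construction above is sketched below. It is an illustration under stated assumptions, not the thesis implementation: each RGB bin receives the count of the rg bin that its centre colour falls into, the median filter uses a 3×3×3 neighbourhood with bins outside the cube counted as empty, and all function names are hypothetical. The 100×100 and 32×32×32 sizes are those given in Section 5.4.

```python
from collections import Counter
from statistics import median

OFFSETS = [(i, j, k) for i in (-1, 0, 1) for j in (-1, 0, 1) for k in (-1, 0, 1)]

def rg_bin(r, g, b, rg_bins):
    """rg histogram bin of an RGB colour, or None for black (no chromaticity)."""
    total = r + g + b
    if total == 0:
        return None
    return (min(r * rg_bins // total, rg_bins - 1),
            min(g * rg_bins // total, rg_bins - 1))

def project_to_rgb(rg_hist, rg_bins=100, rgb_bins=32):
    """Give every RGB bin the count of the rg bin its centre colour maps to."""
    width = 256 // rgb_bins
    rgb_hist = {}
    for i in range(rgb_bins):
        for j in range(rgb_bins):
            for k in range(rgb_bins):
                centre = tuple(n * width + width // 2 for n in (i, j, k))
                count = rg_hist.get(rg_bin(*centre, rg_bins), 0)
                if count:
                    rgb_hist[(i, j, k)] = count
    return rgb_hist

def median_filter_3d(hist, bins):
    """3x3x3 median over a sparse histogram; missing bins count as 0."""
    candidates = {(i + di, j + dj, k + dk)
                  for (i, j, k) in hist for di, dj, dk in OFFSETS}
    filtered = {}
    for i, j, k in candidates:
        if not all(0 <= c < bins for c in (i, j, k)):
            continue
        m = median(hist.get((i + di, j + dj, k + dk), 0) for di, dj, dk in OFFSETS)
        if m > 0:
            filtered[(i, j, k)] = m
    return filtered

# Projection: a single training colour becomes a line of RGB bins.
training = [(204, 148, 124)] * 20
rg_hist = Counter(rg_bin(r, g, b, 100) for r, g, b in training)
rgb_hist = project_to_rgb(rg_hist)
print(len(rgb_hist), "RGB bins lie on the projected line")

# Gap filling: a solid 3x3x3 block of bins whose centre bin is empty.
block = {(i + 1, j + 1, k + 1): 10 for (i, j, k) in OFFSETS if (i, j, k) != (0, 0, 0)}
print(median_filter_3d(block, bins=32).get((1, 1, 1)))  # the empty centre is filled
```

Note that a median filter also erodes thin structures, so the filter is only useful once the rg cluster is wide enough for the projected cone to be several bins thick.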
5.3 The Linear Container (LC) classifier
The proposed LC classifier attempts to reproduce and improve the results of the projected
RGB histogram while reducing the storage requirements. The LC classifier uses a polyhedral
cone, constructed from four decision planes, in order to model the cone-like region in RGB
space that results from the projection of a skin colour cluster in the rg-plane to the RGB
space. The LC classifier performs pixel-based segmentation. If an RGB value is inside the
polyhedral cone volume, it is classified as skin; if the RGB value is outside the polyhedral
cone volume then it is classified as non-skin. The definition of the four decision planes is:
$$\mathit{BGhmin} \cdot G + \mathit{BRmin} \cdot R < B < \mathit{BGhmax} \cdot G + \mathit{BRmax} \cdot R \qquad (5.2)$$
where BGhmin and BRmin parameterise the lower "horizontal" plane, and BGhmax and
BRmax parameterise the higher horizontal plane. The horizontal planes are illustrated in
Figure 5.4(a). These two planes confine a volume between them by constraining the values
that B can take in relation to R and G. This volume is further constrained by two "vertical"
planes:
$$\mathit{BGvmin} \cdot B + \mathit{GRmin} \cdot R < G < \mathit{BGvmax} \cdot B + \mathit{GRmax} \cdot R \qquad (5.3)$$
where BGvmin and GRmin parameterise the left vertical plane, and BGvmax and GRmax
parameterise the right vertical plane. The vertical planes confine a volume between them by
constraining the values that G can take in relation to R and B. The vertical planes are
illustrated in Figure 5.4(b).
Figure 5.4: LC classifier decision planes. (a) Horizontal decision planes. (b) Vertical decision planes.
As the RGB values that are close to the origin carry too little colour information, we need an
additional rule in order to truncate the apex of the polyhedral cone. Possible rules are:
$$\mathit{Rmin} < R \qquad \text{illustrated in Figure 5.5(a)} \qquad (5.4)$$
$$\mathit{RGSum} < R + G \qquad \text{illustrated in Figure 5.5(b)}$$
$$\mathit{RGBSum} < R + G + B \qquad \text{illustrated in Figure 5.5(c)}$$
Each rule is better than the previous one, but it also requires one more addition than the
previous one. If a colour value satisfies Equations 5.2 and 5.3, and one of the dark-pixel rules
of Equation 5.4, then it is inside the truncated polyhedral cone, and therefore it is classified as
skin colour.
Figure 5.5: Possible decision planes to avoid dark pixels, shown in panels (a), (b), and (c).
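The decision rules of Equations 5.2 to 5.4 can be written as a single per-pixel test. The sketch below is illustrative: the class and field names are hypothetical, the default plane parameters are the a priori values of Table 5.1, and the RGBSum threshold of 100 is an assumed value chosen only for this example (the text does not give one).

```python
from dataclasses import dataclass

@dataclass
class LCParams:
    # A priori values from Table 5.1.
    bgh_min: float = -1.2
    bgh_max: float = -1.2
    br_min: float = 0.973
    br_max: float = 1.55
    bgv_min: float = 0.76
    bgv_max: float = 0.76
    gr_min: float = 0.104
    gr_max: float = 0.476
    rgb_sum: float = 100.0  # dark-pixel threshold: assumed, not taken from the text

def is_skin(r, g, b, p):
    """True if (r, g, b) lies inside the truncated polyhedral cone."""
    horizontal = p.bgh_min * g + p.br_min * r < b < p.bgh_max * g + p.br_max * r  # Eq. 5.2
    vertical = p.bgv_min * b + p.gr_min * r < g < p.bgv_max * b + p.gr_max * r    # Eq. 5.3
    not_dark = p.rgb_sum < r + g + b                                              # Eq. 5.4, third rule
    return horizontal and vertical and not_dark

params = LCParams()
print(is_skin(200, 150, 120, params))  # a typical skin tone
print(is_skin(50, 50, 200, params))    # saturated blue, outside the horizontal planes
print(is_skin(20, 15, 12, params))     # same hue as skin but too dark
```

Each test costs a handful of multiplications and comparisons and needs only nine stored numbers, which is the storage advantage over the projected RGB histogram.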
The LC classifier can be tuned for a specific person, camera, or lighting conditions, in an
initialisation step. For this, an initialisation image is needed. This initialisation image is
composed of two approximately complementary masks; one mask delimits the target skin
colour area, we call this mask SkinMask; and the other mask comprises areas where we do
not expect to find skin colour, we call this mask BackgroundMask. Figure 5.6 shows an
initialisation image segmented by the two masks. The BackgroundMask can be tailored in
order to avoid areas of skin colour in addition to those included in SkinMask, for example,
Figure 5.6(b) avoids the subject's wrist. The two masks can be generated manually, or
automatically by a tracking system.
Figure 5.6: Initialisation image masks. (a) Initialisation image segmented by SkinMask. (b) Initialisation
image segmented by BackgroundMask.
The tuning procedure uses a heuristic method by which the parameters of the decision planes
are changed in sequence. Each time a parameter is changed, the fitness of the LC classifier,
to the detection of skin colour in the SkinMask and to the rejection of skin colour in the
BackgroundMask, is measured using the following equation:
$$\mathit{fitness} = \mathit{TI} \times \frac{\#\,\text{skin pixels in SkinMask}}{\text{size of SkinMask}} - \frac{\#\,\text{skin pixels in BackgroundMask}}{\text{size of BackgroundMask}} \qquad (5.5)$$
where TI (Target Importance) is used to control the importance of the target skin colour area
in the fitness. In the experiments of the following sections TI = 2, so that detecting skin in
the SkinMask is given twice the importance of avoiding skin detections in the
BackgroundMask. This parameter allows the classifier to be tuned to favour either true positives or
true negatives.
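As a small worked example of Equation 5.5, the sketch below computes the fitness for any per-pixel classifier. The function name and the toy classifier (a pixel counts as skin when R > G > B) are stand-ins invented for this illustration; only the formula itself comes from the text.

```python
def fitness(classify, skin_mask_pixels, background_mask_pixels, ti=2.0):
    """Equation 5.5: TI times the skin rate in SkinMask minus the skin rate in BackgroundMask."""
    skin_hits = sum(1 for px in skin_mask_pixels if classify(*px))
    background_hits = sum(1 for px in background_mask_pixels if classify(*px))
    return (ti * skin_hits / len(skin_mask_pixels)
            - background_hits / len(background_mask_pixels))

def toy_classifier(r, g, b):
    """Stand-in classifier used only for this example."""
    return r > g > b

skin_mask = [(200, 150, 120), (190, 140, 110), (90, 100, 80), (210, 160, 130)]
background_mask = [(30, 30, 30), (180, 140, 100), (40, 90, 200), (10, 10, 10), (60, 60, 60)]

# 3 of 4 SkinMask pixels and 1 of 5 BackgroundMask pixels are classified as skin:
# fitness = 2 * 0.75 - 0.2 = 1.3
print(fitness(toy_classifier, skin_mask, background_mask))
```

With TI = 2 the fitness ranges from -1 (no skin found in SkinMask, everything in BackgroundMask flagged) to 2 (perfect detection and rejection).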
The heuristic search, by which the parameters of the decision planes are changed, is
illustrated in Figure 5.7(a). This figure shows a section view of the RGB cube, corresponding
to the B-G-plane with maximum R. Lines 1, 2, 3 and 4 are the intersections of the four
decision planes with the section view. Starting from some a priori values, given in Table 5.1,
the search varies BRmin, then BRmax, GRmin, and finally GRmax; first, reducing their
values, then increasing their values, and measuring the fitness (Equation 5.5) at each step.
The values that produce the best fitness are finally selected. Note that the angle of each
decision plane remains unchanged in this heuristic search.
Figure 5.7: Tuning heuristics. (a) First tuning heuristic. (b) Tuning enhancement.
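The first tuning heuristic can be sketched as a sequential one-parameter search. This is a simplified illustration under assumptions: the step size, the number of steps, and the function names are not given in the text and are chosen here only for the example, and the dark-pixel rule is left out for brevity. The starting values are those of Table 5.1.

```python
A_PRIORI = {"BGhmin": -1.2, "BGhmax": -1.2, "BGvmin": 0.76, "BGvmax": 0.76,
            "BRmin": 0.973, "BRmax": 1.55, "GRmin": 0.104, "GRmax": 0.476}

def is_skin(r, g, b, p):
    """Equations 5.2 and 5.3 (dark-pixel rule omitted in this sketch)."""
    return (p["BGhmin"] * g + p["BRmin"] * r < b < p["BGhmax"] * g + p["BRmax"] * r
            and p["BGvmin"] * b + p["GRmin"] * r < g < p["BGvmax"] * b + p["GRmax"] * r)

def fitness(p, skin, background, ti=2.0):
    """Equation 5.5."""
    skin_rate = sum(is_skin(*px, p) for px in skin) / len(skin)
    background_rate = sum(is_skin(*px, p) for px in background) / len(background)
    return ti * skin_rate - background_rate

def tune(p, skin, background, step=0.02, n_steps=15):
    """Vary BRmin, BRmax, GRmin, GRmax in turn; the plane angles stay unchanged."""
    best, best_fit = dict(p), fitness(p, skin, background)
    for name in ("BRmin", "BRmax", "GRmin", "GRmax"):
        start = best[name]
        for direction in (-1, +1):          # first reduce, then increase
            for k in range(1, n_steps + 1):
                trial = dict(best, **{name: start + direction * k * step})
                fit = fitness(trial, skin, background)
                if fit > best_fit:          # keep the value with the best fitness
                    best, best_fit = trial, fit
    return best, best_fit

skin = [(200, 150, 120), (200, 150, 135)]   # second pixel lies just outside the a priori cone
background = [(40, 90, 200), (30, 30, 30), (220, 220, 220)]

print(fitness(A_PRIORI, skin, background))  # before tuning
tuned, tuned_fit = tune(A_PRIORI, skin, background)
print(tuned["BRmax"], tuned_fit)            # BRmax is raised until both skin pixels are inside
```

Each trial re-evaluates the classifier over all mask pixels, which is why the cost of tuning grows with the size of the initialisation image.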
Parameter   Value      Parameter   Value
BGhmin      -1.2       BGhmax      -1.2
BGvmin      0.76       BGvmax      0.76
BRmin       0.973      BRmax       1.55
GRmin       0.104      GRmax       0.476
Table 5.1: LC classifier a priori values.
The performance of the tuned LC classifier can be increased by using a further tuning
heuristic. This tuning heuristic, referred to as tuning enhancement, is performed after the first
tuning heuristic has finished. Starting from the model parameters resulting from the first
heuristic search, this tuning enhancement proceeds to vary the eight model parameters of the
LC classifier. First, a rotating pivot is calculated for each decision plane. A rotating pivot is
the midpoint of each of the lines 1, 2, 3, and 4. In Figure 5.7(b) the rotating pivots are labelled a, b,
c, and d. Then each decision plane is rotated around its pivot, sequentially, decision plane 1
around pivot a, decision plane 2 around pivot b, and so on. The rotation of a decision plane
around its pivot involves the two parameters that define the plane, for example for decision
plane 1, rotation around pivot a involves BRmin, and BGhmin. At each step during a rotation
the fitness (Equation 5.5) is calculated. The values that produce the best fitness are finally
selected. A video illustrating the tuning operation is available in Appendix B and on the
supporting webpage as "video_sequence_5.1.avi".
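The coupling between the two parameters of a plane during a rotation can be made explicit. The following derivation is one reading of the procedure, assuming that the rotation is carried out in the section view of Figure 5.7 at $R = R_{max}$ and that $(G_a, B_a)$ denotes the coordinates of pivot a in that section; these symbols are introduced here for illustration only. In the section view, decision plane 1 of Equation 5.2 appears as the line

$$B = \mathit{BGhmin} \cdot G + \mathit{BRmin} \cdot R_{max}$$

If the slope is changed to a new value $\mathit{BGhmin}'$, the line still passes through the pivot only if

$$B_a = \mathit{BGhmin}' \cdot G_a + \mathit{BRmin}' \cdot R_{max} \quad\Longrightarrow\quad \mathit{BRmin}' = \frac{B_a - \mathit{BGhmin}' \cdot G_a}{R_{max}}$$

so each rotation step selects a new slope and recomputes the matching offset parameter, which is why both BRmin and BGhmin are involved. The same relation holds for the other three planes with their respective parameter pairs.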
5.4 Performance Results
The LC classifier is tested on video sequences of subjects with four different skin tonalities:
Mediterranean, white Caucasian, black African, and Chinese. The target skin colour area is
the subject's hand. The subjects hold their hand open in front of the camera, and move the
hand towards and away from the camera. An overhead lamp affects the illumination of the
subject's hand. When the subject's hand is closer to the camera, it is under a shadow and
looks darker. When the hand is further away from the camera, it is under the lamp and looks
brighter. The classifier is initialised once, using the first frame of each video sequence, and
using the first tuning heuristic described in Section 5.3.
The skin colour detection performance is calculated for each video sequence, using a ground
truth. The ground truth consists of two masks, which have been manually generated for every
fifth frame of the four video sequences. The ground truth considers the subject's hand as the
target area for skin colour detection. This area is segmented using the SkinTruth mask,
Figure 5.8(b). The background is segmented using the BackgroundTruth mask, Figure 5.8(c).
Note that the BackgroundTruth mask is not the complement of the SkinTruth mask. The
BackgroundTruth mask avoids the target skin colour area, the subject's hand, and any other
skin colour areas in the image; therefore, for each measurement frame, there will be some
areas which will not take part in the counting; these areas correspond to the subject's face and
arms. Both masks are tested for skin colour. Skin colour pixels found in the SkinTruth mask
constitute true-positives. Non-skin colour pixels found in the BackgroundTruth mask
constitute true-negatives. In order to compare detection results between frames, the true-
positives and true-negatives are normalised to the size of SkinTruth and BackgroundTruth
masks respectively. Normalised true-positives are referred to as NTP, and normalised true-
negatives are referred to as NTN.
Figure 5.8: Ground truth masks. (a) Original frame. (b) SkinTruth mask. (c) BackgroundTruth mask.
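The NTP and NTN measures can be computed as in the sketch below. The function name and the representation of images as nested lists of 0/1 values are assumptions made for this example. Pixels that belong to neither mask are ignored, as described above.

```python
def normalised_rates(classified, skin_truth, background_truth):
    """Return (NTP, NTN) for one frame; all arguments are equally sized 2D lists of 0/1."""
    true_pos = true_neg = skin_size = background_size = 0
    for row_c, row_s, row_b in zip(classified, skin_truth, background_truth):
        for c, s, b in zip(row_c, row_s, row_b):
            if s:
                skin_size += 1
                true_pos += 1 if c else 0
            elif b:
                background_size += 1
                true_neg += 0 if c else 1
    return true_pos / skin_size, true_neg / background_size

classified       = [[1, 1, 0, 0],
                    [1, 0, 0, 1]]
skin_truth       = [[1, 1, 0, 0],
                    [1, 1, 0, 0]]
background_truth = [[0, 0, 1, 1],
                    [0, 0, 0, 1]]  # the pixel at row 1, column 2 is in neither mask

ntp, ntn = normalised_rates(classified, skin_truth, background_truth)
print(ntp, ntn)  # 3 of 4 skin pixels detected, 2 of 3 background pixels rejected
```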
We use the skin colour classifiers described in Section 5.2 as a comparison reference. The
RGB skin colour histogram has a size of 32×32×32 bins; the rg histogram has a size of
64×64 bins; and the projected RGB histogram has a size of 100×100 bins for the initial rg
histogram, and 32×32×32 bins for the projected RGB histogram. All the histograms used for
comparison are constructed in an initialisation step at the beginning of each sequence from
the pixels in SkinMask. A pixel is classified as skin colour if its corresponding bin in the
histogram is bigger than a threshold. The choice of the threshold affects the detection rate of
the histogram. In general, if the threshold increases, NTN tends to be higher, but NTP tends
to be lower; if the threshold decreases, NTP tends to be higher, but NTN tends to be lower.
For the tested video sequences the thresholds that produce the best results for each histogram
are: 5 for the RGB and projected RGB histograms, and 25 for the rg histogram.
Figure 5.9 shows the results for the Mediterranean subject. The figure presents plots of the
NTP and NTN against the frame number, and two example frames showing the skin colour
classification. The example frames correspond to points in the video sequence at which the
detection rate is maximum and minimum. The row (b) corresponds to the RGB histogram;
the row (c) corresponds to the rg histogram; the row (d) corresponds to the projected RGB
histogram, and the bottom row (e) corresponds to the LC classifier. Figure 5.10, Figure 5.11,
and Figure 5.12, follow the same layout for the white Caucasian, black African, and Chinese
subjects respectively.
We can see in Figure 5.9 that the skin colour detection results for each classifier are
different. The RGB histogram is the most sensitive to light changes, and its performance is
the worst of all classifiers. The performance of the last three classifiers is more alike, because
all of them have an equivalent resistance to illumination (brightness) changes. However, the
LC classifier exhibits slightly larger NTP and NTN throughout the video sequence than both
the rg histogram classifier and the projected RGB histogram classifier. Tests for the other three ethnic
skin tonalities, Figure 5.10, Figure 5.11, and Figure 5.12, reported similar results: the LC
classifier exhibited the same or larger NTP and NTN than both the rg histogram classifier,
and the projected RGB histogram classifier. Note that the ambient illumination was not
controlled during the recording of the video sequences, and this results in slightly different
ambient illuminations for each test.
The original video sequences, and the output of the classifiers, for each of the test subjects,
are available in Appendix B and on the supporting webpage as:
• "video_sequence_5.2.avi" for the Mediterranean subject video sequence, including the
output of the rg histogram classifier and the LC classifier.
• "video_sequence_5.3.avi" for the White Caucasian subject video sequence, including the
output of the rg histogram classifier and the LC classifier.
• "video_sequence_5.4.avi" for the Black African subject video sequence, including the
output of the rg histogram classifier and the LC classifier.
• "video_sequence_5.5.avi" for the Chinese subject video sequence, including the output of
the rg histogram classifier and the LC classifier.
Figure 5.9: Mediterranean subject test, example frames 90 and 110. (a) Original frames. (b) RGB histogram
classifier. (c) rg histogram classifier. (d) Projected RGB histogram classifier. (e) LC classifier.
Figure 5.10: White Caucasian subject test, example frames 15 and 85. (a) Original frames. (b) RGB histogram
classifier. (c) rg histogram classifier. (d) Projected RGB histogram classifier. (e) LC classifier.
Figure 5.11: Black African subject test, example frames 10 and 70. (a) Original frames. (b) RGB histogram
classifier. (c) rg histogram classifier. (d) Projected RGB histogram classifier. (e) LC classifier.
Figure 5.12: Chinese subject test, example frames 25 and 85. (a) Original frames. (b) RGB histogram
classifier. (c) rg histogram classifier. (d) Projected RGB histogram classifier. (e) LC classifier.
An experiment comparing the computational speed of the LC classifier against other
classifiers was also carried out. The experiment measures the time taken for a classifier to
check all the pixels in a 640×480 frame. The experiment is repeated for 100 frames of a
video sequence containing skin colour, and the times for each frame are averaged. The
test was carried out on an AMD Athlon 3500+ with 1 GB of RAM. The results are shown in Table
5.2.
The RGB LC classifier uses an extra rule to avoid dark pixels; the other classifiers do not use
this rule. The rg LC classifier is the 2D equivalent to the proposed RGB LC classifier. It
works in the rg-plane by using 4 decision lines instead of 4 decision planes. The skin
detection performance of this classifier is equivalent to the RGB LC classifier. The equations
in the rg LC classifier are simpler than those of the RGB LC classifier; however, the former
is slower because it has to normalise each pixel from RGB to rg. The use of a lookup table
containing all the possible normalisations can speed up the normalisation procedure. However,
even when using a lookup table, the RGB LC classifier is about 1.17 times faster than its rg
LC equivalent. The average times for the rg histograms are worse than those of the rg LC
classifier, and the histograms carry an additional storage cost. The RGB histogram is only included as a speed
reference because it is the fastest classifier; however, its skin detection rates fall far behind
those of the other classifiers.
Classifier                            Average time per frame (s)   Speed-up of the RGB LC classifier with respect to the other classifiers
RGB LC classifier                     0.0090
rg LC classifier                      0.0147                       ×1.62
rg LC classifier with lookup table    0.0107                       ×1.17
rg histogram                          0.0235                       ×2.59
rg histogram with lookup table        0.0204                       ×2.25
RGB histogram                         0.0022                       ×0.24
Table 5.2: Execution time results
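A lookup table for the rg normalisation, as mentioned above, could be organised as in the following sketch. The table layout (indexed by the channel sum and the channel value, storing the histogram bin directly) is an assumption made for this example, since the text does not describe how the table was built; the 64 bins per axis match the rg histogram of this section.

```python
BINS = 64          # rg histogram resolution used in this section
MAX_SUM = 3 * 255  # largest possible value of R + G + B

# LUT[total][value] holds the bin of value / total, so no division is needed per pixel.
LUT = [[0] * 256 for _ in range(MAX_SUM + 1)]
for total in range(1, MAX_SUM + 1):
    for value in range(min(total, 255) + 1):
        LUT[total][value] = min(value * BINS // total, BINS - 1)

def rg_bin(r, g, b):
    """Return the (r, g) histogram bin of an 8-bit RGB pixel using table lookups only."""
    total = r + g + b
    return LUT[total][r], LUT[total][g]

print(rg_bin(200, 150, 120))
print(rg_bin(100, 75, 60))  # same colour at half the brightness
```

The table trades roughly 196,000 stored entries for the removal of two divisions per pixel, which is consistent with the modest gain reported in Table 5.2.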
5.5 Tuning at Various Resolutions
So far, the LC classifier has been tuned using an initialisation image of the same size as the
video sequence in which it was tested, 640×480 pixels. It was observed that tuning the
LC classifier on a decimated version of the initialisation image results in little degradation
of the classifier's detection performance on the non-decimated video sequence. This is
because the result of the tuning is more dependent upon the range of colours of the pixels in
the initialisation masks than upon the number of pixels. This fact allows us to speed up the
tuning procedure, because the amount of data to be dealt with is reduced, while still keeping
similar detection performance. The speed-up of the tuning procedure as a result of using a
decimated initialisation image instead of using a non-decimated initialisation image is: ×4 for
a 320×240 resolution, ×16 for 160×120, ×64 for 80×60, ×256 for 40×30, and ×1024 for a
20×15 resolution. Figure 5.13 shows the NTP of the LC classifier, on the video sequence of
the Mediterranean subject, for various resolutions of the initialisation image. The NTN are
not shown as they remain at almost 1 for the six resolutions. Notice that the NTP for an
initialisation image of 320×240 is virtually the same as the NTP for an initialisation image of
640×480.
When the LC classifier is used as a part of an HCI system, the tuning time becomes
extremely important, as the skin detection system has to work in real time while consuming little
computing power. This is particularly true when the LC classifier is retuned periodically, in
order to cope with new illumination conditions. The speed-up resulting from the use of
decimated initialisation images allows us to meet the real-time requirements of an HCI
system.
Figure 5.13: NTP when tuning at various resolutions.
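The speed-up factors quoted above follow directly from the reduction in pixel count, as the sketch below shows. The decimation method (keeping every n-th pixel without filtering) and the function names are assumptions for this example; the text does not state how the initialisation image was decimated.

```python
def decimate(image, factor):
    """Keep every factor-th pixel in each direction (plain subsampling, no filtering)."""
    return [row[::factor] for row in image[::factor]]

def tuning_speed_up(full_size, factor):
    """Ratio between the pixel counts of the full and the decimated image."""
    width, height = full_size
    return (width * height) // ((width // factor) * (height // factor))

for factor in (2, 4, 8, 16, 32):
    print(f"{640 // factor}x{480 // factor}: x{tuning_speed_up((640, 480), factor)}")

tiny = [[(x, y) for x in range(4)] for y in range(4)]
print(decimate(tiny, 2))  # 4x4 image reduced to 2x2
```

Because the tuning evaluates the classifier on every mask pixel for each trial parameter value, its cost scales with the pixel count, hence the quadratic gain in the decimation factor.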
5.6 HCI Usability Factors
The tuning stage in the experiments of the previous sections was idealised, in that no
background colours appeared in the SkinMask, and no skin colour appeared in the
BackgroundMask. If the LC classifier is used in an HCI system, which could generate the
initialisation masks automatically from a tracking subsystem, it is possible that background
appears in SkinMask, and skin colour appears in BackgroundMask. In this section we study
the robustness of the LC classifier against non-ideal tuning conditions.
The detection performance of the LC classifier is calculated, once more, for the video
sequence of the Mediterranean subject. This time, the tuning is repeated for a misaligned
SkinMask and BackgroundMask. In each repetition SkinMask only contains a percentage of
the target's skin colour area. The skin that is not in the SkinMask is in the BackgroundMask;
this affects the final configuration of the LC parameters found during the tuning stage.
Figure 5.14 shows the NTP for four percentages of skin colour in SkinMask. The NTN is not
shown as it is almost unaffected in all the four cases. We can see that the degradation in NTP
for a 50% skin in SkinMask is small; and even when the amount of skin in SkinMask is as
small as 25%, the NTP along the whole sequence may still be useful for some applications.
However, the model parameters found during the tuning stage depend on the colours
appearing in each initialisation mask; hence, different results are possible even when
SkinMask contains the same amount of skin. This is illustrated in Figure 5.15, where
tuning the LC classifier using two SkinMasks with the same percentage of skin inside the
mask produces different detection performances.
Figure 5.14: NTP chart for four percentages of skin in SkinMask. (a) profile of SkinMask containing 100%
skin, (b) 75% skin, (c) 50% skin, and (d) 25% skin.
Figure 5.15: NTP chart for two different SkinMasks, labelled 25% A and 25% B, each containing 25% of skin colour.
5.7 Target importance selection
The tuning of the LC classifier maximizes the skin colour detection inside the SkinMask. The
maximisation procedure is governed by the fitness Equation 5.5. The target importance
parameter in Equation 5.5, TI, was set by default to 2, so that detecting skin colour inside
SkinMask is given twice the importance of avoiding skin colour detections in BackgroundMask.
This default value for TI worked well in previous experiments; however, depending on the
amount of skin colour present during initialisation, better detection rates can be achieved by
selecting a specific target importance. The selection of the target importance is a compromise
between the desired levels of NTP and NTN. As NTP increases with the target importance,
NTN will decrease. However, they will do this at different rates, depending on the amount of
skin colour in the initialisation image. The NTP and NTN rates will also change along the
video sequence on which the LC classifier is used. For this reason, it is recommended to
have an initialisation image that is representative of the skin colour ratios in the video
sequence.
Figure 5.16 shows how detection rates vary depending on the target importance, for three
tuning situations, each one with an increasing amount of skin colour in the BackgroundMask.
On the left column there are three charts showing the NTP and NTN, for the initialisation
image, plotted against the target importance. On the right column there are the three
initialisation images used in the tuning. In Figure 5.16(a) the initialisation image is ideal,
there is only skin colour in the SkinMask, and no skin colour in the BackgroundMask. From
the chart we can see that, in this situation, the target importance has very little effect on the
NTP and NTN rates. In Figure 5.16(b) there is some skin colour in the BackgroundMask
during initialisation. Here, NTP and NTN change with the target importance. We can see
that as the target importance increases, the NTP increases and the NTN decreases. Here, a
target importance of up to 3.5 does not substantially decrease NTN. In Figure 5.16(c) there is
a high amount of skin colour in the BackgroundMask during initialisation. Also, in this
situation, the T-shirts of two of the persons in the background could be taken as skin colour
depending on the target importance value. We can see, in this case, that the NTP and
the NTN change faster with the target importance than in the other cases. A target
importance from 2 to 3 would generally be a suitable compromise between increasing NTP
and reducing NTN.
5.8 Example of an LC classifier initialisation in HCI
A vision-based HCI system that uses the LC classifier can, in most cases, generate the two
initialisation masks, SkinMask and BackgroundMask, automatically from the tracking
subsystem. In this section, an example of the initialisation of the LC classifier in an HCI system is given.
The HCI system tracks the user's hand, using the sweep articulated hand tracker described in
Section 4.7; this hand tracker uses the LC classifier in order to locate the user's hand. The
important part in this example is the initialisation procedure. For the LC classifier
to be initialised, the hand tracking has to go through three stages:
• First stage, there is no hand tracking. In this stage, a red hand template appears in the
centre of the tracking area.
• Second stage, when the user places their hand on top of the red hand template an initial
tracking of the hand is started. The hand template turns green to indicate this second
stage. During this stage the LC classifier uses its a priori model parameters.
• Third stage, when the location of the hand, during the initial tracking, is good enough the
initialisation of the LC classifier takes place. The hand contour resulting from the
tracking is used to generate SkinMask and BackgroundMask. The tuning of the LC
classifier is then performed at a reduced resolution of 160×120 in order to speed up the
tuning procedure. The tuning takes place in a fraction of a second, and from this point
the full articulated hand contour tracking is started. The hand contour turns blue to
indicate full tracking.
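The three stages above can be viewed as a small state machine. The sketch below is only an illustration of that reading: the enumeration, the function name, and the two boolean inputs are hypothetical, and the criterion for the tracking being "good enough" is left to the caller because it is not specified in the text.

```python
from enum import Enum

class Stage(Enum):
    WAITING = "red"              # first stage: red hand template, no tracking
    INITIAL_TRACKING = "green"   # second stage: tracking with a priori LC parameters
    FULL_TRACKING = "blue"       # third stage: LC classifier tuned, full contour tracking

def next_stage(stage, hand_on_template, tracking_good_enough):
    """Advance the initialisation by at most one stage per call."""
    if stage is Stage.WAITING and hand_on_template:
        return Stage.INITIAL_TRACKING
    if stage is Stage.INITIAL_TRACKING and tracking_good_enough:
        # At this transition SkinMask and BackgroundMask are generated from the
        # tracked contour and the LC classifier is tuned at 160x120.
        return Stage.FULL_TRACKING
    return stage

stage = Stage.WAITING
stage = next_stage(stage, hand_on_template=False, tracking_good_enough=False)
print(stage.name, stage.value)
stage = next_stage(stage, hand_on_template=True, tracking_good_enough=False)
print(stage.name, stage.value)
stage = next_stage(stage, hand_on_template=True, tracking_good_enough=True)
print(stage.name, stage.value)
```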
Figure 5.16: Target importance for various tuning situations. The horizontal axes on the charts are the target
importance. On the right are the initialisation images corresponding to each tuning situation. (a)
Ideal case, no other skin colour in the BackgroundMask. (b) Medium case, some skin colour in the
BackgroundMask. (c) Worst case, there is a high amount of skin colour in the BackgroundMask.
Figure 5.17 illustrates the initialisation sequence. Rows (a) and (b) correspond to the original
video frames and detected skin colour for an initialisation sequence during day illumination
levels. It can be noted in columns 1 and 2 that the LC classifier with a priori parameters can
detect the skin colour of the user; however, at that illumination level it also detects a few
areas of the background as skin colour. When the LC classifier is tuned to the user's hand,
column 3, the detection of the user's hand skin colour is maximized for the day illumination
levels. Figure 5.17 rows (c) and (d) correspond to the original video frames and detected skin
colour for an initialisation sequence during night illumination levels. It can be noted in
columns 1 and 2 that the LC classifier with a priori parameters can detect the skin colour of
the user; however, at this illumination level the detected skin regions contain holes and some
missing areas. When the LC classifier is tuned to the user's hand, column 3, the detection of
the user's hand skin colour is maximized for the night illumination levels. Videos of the
initialisation sequences shown in Figure 5.17 are available in Appendix B and on the
supporting webpage as "video_sequence_5.6.avi" for the day illumination test, and
"video_sequence_5.7.avi" for the night illumination test.
Figure 5.17: Initialisation sequence. Columns 1, 2, and 3 correspond to the first, second, and third
initialisation stages. Office daylight: row (a) original video frames; row (b) detected skin colour at
each stage. Office nighttime: row (c) original video frames; row (d) detected skin colour at
each stage.
5.9 Example of dynamic skin colour modelling during tracking
One of the advantages of the LC skin colour classifier is that its detection results
can rapidly adapt to changing illumination conditions by updating just four
parameters: BRmin, BRmax, GRmin, and GRmax. This capability is especially
useful when tracking a skin colour target that undergoes various illumination changes,
because it allows the classifier to be tuned dynamically. This section presents a method for
dynamically tuning an LC classifier while tracking a skin colour object.
Consider a contour tracker similar to the ones described in Chapter 4. This tracker uses the
LC skin colour classifier in order to perform tracking of a skin colour target. If the
illumination conditions of the target change substantially with respect to the moment at
which the LC classifier was tuned (typically at the beginning of the tracking sequence), the
LC classifier will not be able to produce a good segmentation of the target, and the tracking
may be lost. A possible solution to this problem is to continuously tune the classifier
to the new skin tones of the target as its illumination conditions change. This solution is
commonly referred to as dynamic skin colour modelling.
Unfortunately, the procedure presented in Section 5.3 for tuning an LC classifier is too slow to
be performed on every frame of a video sequence. The tuning is slow because
the LC classifier has to be evaluated for every pixel of the image each time a parameter
of the classifier is tested, and the tuning procedure as implemented for the trackers of
Chapter 4 involves testing about 360 combinations of parameter values. A much faster
method for testing combinations of LC parameters involves using the cost function of the
contour tracker. The cost function of a contour tracker, such as the ones described in Chapter
4, only explores the pixel values along certain measurement lines normal to the contour, as
opposed to exploring all the pixels in the image. This makes the evaluation of the cost
function comparatively faster. On the other hand, this cost function uses the LC classifier in
order to determine a score. If the parameters of the LC classifier change, the sensitivity of the
cost function will change too, and so will its score value. Generally, the better the skin colour
segmentation of the target is, the higher the cost function's score will be, and vice versa. This
fact can be used in order to adapt the tracking to changes in the illumination conditions of the
target. The procedure involves producing the best skin colour segmentation of the target at
every time-step of tracking by maximizing the cost function of the tracking output. Note that
5 A skin colour classifier for HCI 124
during this second maximization the contour hypothesis on which the cost function is
evaluated, remains fix. The cost function is maximized by changing the LC classifier
parameters, as opposed to finding the contour hypothesis that better fits the image features
(first maximization).
The procedure to tune the LC classifier during tracking is constructed by using the same
tuning heuristics of Section 5.3, but replacing the part involved in the calculation of the
fitness, Equation 5.5, with the skin colour based cost function of Section 4.3.2. This new
tuning procedure has to be executed just after the tracker has found the contour hypothesis
with highest weight. The tuning then further increases the cost function of this contour
hypothesis. The new LC parameters are used only when the cost function increases
substantially. The approach can be understood as changing the skin colour perception of the cost
function so that the best contour hypothesis at each time-step is highlighted even more. The
approach works well while the best contour hypothesis is placed near the
configuration of the target, but the results degrade very rapidly (cascade of errors) if the
configuration of the contour separates from the configuration of the target.
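The second maximization described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in the thesis: the candidate parameter sets are simply passed in as a list (the thesis generates them with the heuristics of Section 5.3), `score_fn` is a hypothetical callable that evaluates the cost function on the fixed best contour hypothesis, scores are assumed to be positive, and `min_gain` is a hypothetical ratio standing in for a "substantial" increase.

```python
def retune(current_params, candidates, score_fn, min_gain=1.2):
    """Pick LC parameters for the current frame with the contour held fixed.

    score_fn(params) is assumed to evaluate the tracker's cost function on the
    best contour hypothesis using the given LC parameters (positive scores).
    min_gain is a hypothetical ratio that defines a "substantial" increase.
    """
    base = score_fn(current_params)
    best_params, best_score = current_params, base
    for params in candidates:
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    # Keep the old parameters unless the improvement is substantial.
    if best_score >= min_gain * base:
        return best_params
    return current_params


# Toy usage: parameter sets are represented by labels with made-up scores.
toy_scores = {"old": 1.0, "slightly better": 1.1, "much better": 1.5}
print(retune("old", ["slightly better"], toy_scores.get))
print(retune("old", ["slightly better", "much better"], toy_scores.get))
```

In the toy usage the first call keeps the old parameters, because a 10% gain is below the assumed threshold, and the second call switches to the parameter set with the much higher score.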
The proposed dynamic tuning of the LC classifier is tested on a hand contour tracker similar
to the ones described in Chapter 4. The target object is a subject's hand, but in this case the
hand is tracked from its back. During the tracking sequence, the hand is near the centre of the
image most of the time, but the background and the illumination conditions change as
the subject moves the hand around. The illumination changes on the hand are so dramatic
that a hand contour tracker whose LC classifier is tuned only once at the beginning of the
sequence loses tracking easily. In practice, the cost function used in this dynamic tuning is
slightly modified in order to better capture the skin colour segmentation of the hand for the
current set of LC parameters. Figure 5.18 shows a hand contour superimposed with the
measurement lines of the modified cost function. This modified cost function has two new
measurement lines longitudinal to each of the fingers and to the first thumb segment. These new
measurement lines are processed differently from the ones normal to the contour. When the skin
colour segmentation of the hand is good, these measurement lines will retrieve skin colour
pixels only. If these measurement lines retrieve non-skin colour pixels, the skin
colour segmentation of the hand is not good. The new measurement lines are used in
combination with the measurement lines normal to the contour in order to generate a score
for the segmentation of the subject's hand.
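The text does not specify how the longitudinal lines are combined with the normal lines, so the following sketch is only one possible reading. It assumes that the longitudinal lines contribute the fraction of their pixels classified as skin, and that this fraction scales the score obtained from the normal lines. Both function names and the multiplicative combination are assumptions made for illustration.

```python
def longitudinal_skin_fraction(lines):
    """Fraction of skin pixels over all longitudinal measurement lines.

    Each line is a sequence of booleans (True = pixel classified as skin).
    """
    total = sum(len(line) for line in lines)
    skin = sum(sum(1 for flag in line if flag) for line in lines)
    return skin / total if total else 0.0


def segmentation_score(normal_score, longitudinal_lines):
    # Assumed combination: scale the normal-line score by the skin fraction,
    # so non-skin pixels found inside the fingers lower the overall score.
    return normal_score * longitudinal_skin_fraction(longitudinal_lines)


print(segmentation_score(2.0, [[True, True], [True, False]]))
```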
A video sequence showing the results of the test is available in Appendix B and on the
supporting webpage as "video_sequence_5.8.avi". This video sequence shows the tracking
output and skin segmentation of two trackers. The tracker on top uses dynamic tuning; the
one on the bottom uses static tuning (tuned only once at the beginning of the tracking). The
state of the LC decision planes for each frame of tracking is shown in a diagram beside the
respective tracking output. This diagram shows the LC decision planes as they intersect the
B-G plane with maximum R (this type of diagram was used to explain the tuning heuristics,
see Figure 5.7).
Figure 5.18: Modified cost function. Note the measurement lines along the fingers and first segment of the
thumb.
In "video_sequence_5.8.avi" the tracker with dynamic tuning tracks the subject's hand
successfully throughout the sequence. However, the tracker with static tuning first fails to
initialise, and then loses the tracking consistently. The procedure to initialise the hand
tracker follows the three stages described in Section 5.8; these stages are indicated with a red
hand contour (waiting for initialisation), green hand contour (partial tracking), and blue hand
contour (initialisation, and full hand tracking). Figure 5.19 shows some example frames of
this video sequence at the moment in which the tracker with dynamic tuning initialises, and
the tracker with static tuning fails to initialise. In frame 66, the tracker with dynamic tuning
starts to adapt the LC classifier in order to detect the subject's hand (this is indicated with a
green hand contour). The tuning at this moment can be seen in the LC decision planes
diagram on the left. In frame 74 the tracker with dynamic tuning is fully initialised (indicated
with a blue hand contour). The tracker with static tuning makes an attempt to initialise
around frame 74, but the score of the hand contour hypothesis does not reach the
initialisation score, resulting in an unsuccessful attempt.
Figure 5.19: Dynamic tuning vs. static tuning (frames 65, 66, 71, 73, 74, and 76). The tracker with static
tuning fails to initialise.
Later in frame 376, the tracker with static tuning initialises by chance when the subject's
hand passes just under the initialisation position. However, the tracking is lost again in frame
1182. Figure 5.20 shows some example frames at the moment in which the tracker with
static tuning loses the location of the hand.
Figure 5.20: Dynamic tuning vs. static tuning (frames 1144, 1158, 1175, and 1182). The skin colour detection
of the tracker with static tuning progressively degrades from frame 1144 to frame 1175. Eventually, in
frame 1182, the location of the hand is lost. The tracker with dynamic tuning adapts to the current lighting
conditions and the location of the hand is maintained.
On a more general note, when comparing the skin colour segmentation of both tracker outputs
in "video_sequence_5.8.avi", it is possible to see that the LC decision planes of the tracker
with dynamic tuning change every few frames, which produces radical changes in the skin
colour segmentation. In addition, the segmentation is only good as far as the modified cost
function can tell. This means that areas of the hand where there are no measurement lines are
not guaranteed to appear as skin colour, let alone areas outside the hand. The result is that the
segmented skin colour areas keep changing and sometimes most of the image is classified as
skin colour. This would suggest a worse skin colour segmentation than the tracker with static
tuning; however, the tracker with dynamic tuning can adapt to the changes in illumination
and the location of the subject's hand is not lost at any point in the video sequence.
The output of the tracker with dynamic tuning (but without showing the skin colour
segmentation) is also available in Appendix B and on the supporting webpage as
"video_sequence_8.4.avi".
5.10 Conclusions
This chapter has presented the linear container skin colour classifier. This classifier
constitutes a contribution to the dynamic skin colour modelling methods. Its detection
performance compares well with an rg histogram classifier, resulting in equal or better
detection rates, when using a single training image. Two remarkable qualities of this
classifier are its evaluation speed, and its low storage requirements. The four rules that define
the decision planes, and an extra rule to avoid dark pixels, can be rapidly evaluated, resulting
in a ×2.24 speed-up with respect to a simple rg histogram classifier. As the rules of the
classifier operate in the RGB space, there is no need to spend time normalising pixels to the
rg-plane. However, the LC classifier has a resistance to illumination changes equivalent to
that of a classifier that operates in the rg-plane. The detection performance of the LC
classifier is not greatly impaired when the tuning is performed in a decimated initialisation
image, but the execution time of the tuning is notably reduced. The LC classifier also proved
to be robust to non-ideal initialisations, in which skin colour appears in BackgroundMask,
and background appears in SkinMask. Two example applications of the LC classifier have
also been presented in this chapter. The first application demonstrates the usage of the LC
classifier in an HCI system. The second application shows how hand tracking can improve
when tuning the LC classifier for every time-step of tracking (dynamic tuning), in
comparison to tuning the LC classifier only once at the beginning of the tracking (static
tuning).
A subject of further work is the tuning stage. Different heuristics or maximisation procedures
could be used in order to find a set of parameters for the LC classifier that produce better
detection results. Finally, the LC model itself could be changed. Linear containers are fast to
evaluate, but other types of containers could produce a better fit of the skin colour cluster
through scaling of white illumination. Sets of rules such as the ones proposed in Gomez and
Morales (2002) could give better detection results, although the tuning procedure for this
type of rule could be more complex.
6 Using skin colour and edge features
In Section 4.3 a measurement model based entirely on skin colour features was presented. In
this chapter we will compare this measurement model against an edge based measurement
model. The measurement function of the sweep tracker, Section 4.7, is modified in order to
use edge features; then the performance of the tracker is evaluated when using exclusively
edge features, and when combining both skin colour and edge features.
The classical approach to using edge features in a contour tracker is to process a number of
measurement lines normal to the contour. The positions of edge features found on these
measurement lines are then used to calculate the contour likelihood, see Section 3.2. Another
method of using edge features was proposed by Isard (1996). He uses a Sobel filter to find a
directed edge strength, which is then convolved with the direction of the measurement point;
the result of the convolution is transformed in order to directly measure a log likelihood for
that measurement point. MacCormick (2000) proposed another type of measurement
function that used edge features. In this measurement function, the measurement lines
formed a grid on the image and were static, as opposed to being on top of the tracked contour.
The approach taken in this chapter to using edge features is the classical one: measurement
lines normal to the contour.
6.1 Edge features vs. skin colour features
There are some important differences between using edges and using skin colour as features
for calculating the contour likelihood. This section discusses these differences. In order to
avoid confusion in the following discussion, we will refer to an edge found along a
measurement line as an image edge, and we will refer to a skin colour edge found along a
measurement line as a skin edge.
In Figure 6.1 we can see three views of a frame from a tracking sequence. Figure 6.1(a)
shows the original frame superimposed with the measurement lines of the current contour
position; Figure 6.1(b) shows the skin colour image superimposed with the measurement
lines; and Figure 6.1(c) shows the edge image superimposed with the measurement lines.
When measuring the contour likelihood both edges and skin colour are only detected along
the measurement lines. However, showing the edges and skin colour for the whole image
allows us to see the potential edge and skin colour features that could be found along any
measurement line. Observing Figure 6.1 we can see that the edges are more spread over the
image than the skin colour is. We call an edge in the image that is selected but does not
belong to the object of interest a wrong edge. Assuming that no other skin colour objects
appear in the image when calculating the contour likelihoods, it would seem, intuitively, that
there is less chance of detecting wrong skin edges than of detecting wrong image edges.
Some understanding about the differences between using edges and skin colour can be
gained by analysing the positions of image edges and skin edges found on a large number of
measurement lines. For this purpose, we analyse the measurement lines in a hand tracking
sequence and construct two histograms: a histogram of the positions of image edges found
along measurement lines; and another histogram of the positions of skin edges found along
measurement lines. The hand tracking sequence uses the first 50 frames of the Mediterranean
subject's video sequence of Chapter 5, namely "video_sequence_5.2.avi". During the first
50 frames of this video sequence, the only skin coloured object in the scene is the subject's
hand (and a small part of the arm). The subject's hand is still at the centre of the image, over
Figure 6.1: Skin colour vs. edges. (a) shows the original frame superimposed with the measurement lines of
the current contour position. (b) shows the skin colour image superimposed with the measurement
lines. (c) shows the edge image superimposed with the measurement lines.
an average cluttered background (see exemplar frame in Figure 6.2). The video sequence is
tracked using the sweep hand tracker described in Chapter 4. The tracker uses 1000 particles
per time-step, and uses both skin colour and edge information. The length of the
measurement lines is 20 pixels. The histograms are constructed from the collected positions
of image edges and skin edges found along the measurement lines of each of the 1000
contour hypotheses, for the 50 frames that the tracking lasts. The total number of
measurement lines taking part in each histogram is about 3.5 million. Because of the nature
of the tracker (a particle filter) only a fraction of the hypothesized contours will fall on the
subject's hand, and therefore only a fraction of the measurement lines will measure the hand.
However, the hypothesized contours will always be relatively near the subject's hand,
and so will the measurement lines.
Figure 6.2: Exemplar frame. The image edge and skin edge histograms are constructed from tracking frames
similar to this one. The only skin coloured objects are the hand and part of the subject's arm.
Figure 6.3 shows the histograms of the positions where image edge and skin edge features
were found along measurement lines. The histograms are normalised to the total number of
measurement lines. The horizontal axis represents the distance from the measurement point
to where a feature was found. A distance of 0 means that the feature was found right on the
measurement point, a distance of 1 means that the feature was found 1 position away from
the measurement point, and so on. Bin 10, however, is used when no feature at all was
found along a measurement line.
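The construction of such a histogram can be sketched as follows. The sketch assumes one recorded value per measurement line, either the distance (0 to 9) to the selected feature or `None` when no feature was found; the function name and this encoding are illustrative choices rather than details from the thesis.

```python
NO_FEATURE_BIN = 10  # bin reserved for lines on which no feature was found


def normalised_histogram(distances):
    """Build the 11-bin histogram from one entry per measurement line.

    Each entry is the distance (0..9) from the measurement point to the
    selected feature, or None when the line produced no feature.
    """
    counts = [0] * 11
    for d in distances:
        counts[NO_FEATURE_BIN if d is None else d] += 1
    n = len(distances)
    # Normalise by the total number of measurement lines.
    return [c / n for c in counts] if n else counts


hist = normalised_histogram([0, 1, 1, None])
print(hist[1], hist[NO_FEATURE_BIN])
```

Because every measurement line contributes exactly one count, the bins of the normalised histogram sum to one, which is what allows the two histograms to be compared directly.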
Before interpreting the histograms it is important to remember two points. Firstly, the
only skin coloured object present in the scene is the subject's hand. Secondly, when image
edges are used, if various edges are found along a measurement line the selected edge is the
closest to the measurement point, and the other image edges are ignored. When skin edges
are used, there is at most a single skin edge in a measurement line, and this is at the first
occurrence of two consecutive skin colour pixels, starting from the end of the line exterior to
the contour. The image edges and skin edges counted in the histogram are in fact selected
Figure 6.3: Normalised histograms of feature positions along a measurement line. (a) Features found were
image edges. (b) Features found were skin edges.
image edges and selected skin edges. With all this in mind, we can interpret the two
histograms. The first thing that can be observed about the two histograms is the large value
of bin 10 in the skin edge histogram, about 0.56, in comparison to the image edge
histogram, about 0.25. This bin corresponds to the number of measurement lines that did not
detect a skin edge or an image edge respectively. Taking into account that the number of
measurement lines is the same in both histograms, this first observation implies that there are
about twice as many detected image edges as detected skin edges (fractions of roughly 0.75
and 0.44 of the measurement lines, respectively). On the other hand, as
the only skin colour object during this tracking sequence is the subject's hand, it is logical to
assume that the detected skin edges are mostly due to the subject's hand contour. Therefore,
the difference between the number of detected skin edges and detected image edges found in
the measurement lines must be because about one half of the detected image edges belong to
the background or features inside the hand, and hence, they are wrong edges.
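The arithmetic behind this argument can be checked with a few lines of Python using the two bin values quoted above. The only assumption added here is the one already made in the text: that every detected skin edge corresponds to a correct detection of the hand contour, so the surplus of image edge detections is attributed to wrong edges.

```python
no_skin_edge = 0.56   # bin 10 of the skin edge histogram
no_image_edge = 0.25  # bin 10 of the image edge histogram

detected_skin = 1.0 - no_skin_edge     # fraction of lines with a skin edge
detected_image = 1.0 - no_image_edge   # fraction of lines with an image edge

ratio = detected_image / detected_skin
excess = (detected_image - detected_skin) / detected_image

print(f"image/skin detections: {ratio:.2f}")
print(f"image edges without a matching skin edge: {excess:.2f}")
```

The two values come out at roughly 1.70 and 0.41, which the discussion rounds to "about twice" and "about one half".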
The second thing to notice in the two histograms is the shape of the bins 0 to 9. The general
progression of values in these bins is the same in both histograms. There is a peak at the bin
1, and then values decrease progressively from bins 2 to 9. However, in the image edge
histogram, the peak is much larger, and the following bin values decrease much faster than in
the skin edge histogram. Bins 8 and 9 in the image edge histogram are zero because in order
to detect an image edge the measurement line is convolved with a kernel of 5 elements, and
the result of the convolution is assigned to the element at the middle of the kernel. In the image edge
histogram we can see that most of the detected edges are within 2 or 3 pixels of the
measurement point – in contrast with the skin edge histogram where edges are spread rather
evenly from bins 0 to 9. Taking into account that about half of the detected image edges
belong to the background or features inside the hand, the distribution of values in the image
edge histogram gives us an idea of the density of the background clutter in the image.
When skin edges are used and a hypothesized contour has a high likelihood, it is almost
certain to be due to the correct alignment of the hypothesized contour and the target object.
In contrast, when image edges are used and a hypothesized contour has a high likelihood, it is
more probable that the hypothesized contour is not so well aligned with the target object,
because image edges involve more wrong edges. This difference between skin edges and image
edges is reduced when the number of measurement lines increases, as a contour hypothesis
with high likelihood requires the simultaneous contribution of all the measurement lines, and
it is unlikely that all of the measurement lines detect wrong edges at the same time.
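The last point can be illustrated numerically. The sketch below assumes, purely for illustration, that each measurement line selects a wrong edge independently with the same probability; real measurement lines on a contour are correlated, so the numbers only indicate the trend. The value 0.4 is an illustrative probability, not a measured one.

```python
def prob_all_wrong(p_wrong, n_lines):
    """Probability that every one of n_lines independent lines selects a wrong edge."""
    return p_wrong ** n_lines


for n in (1, 5, 20):
    print(n, prob_all_wrong(0.4, n))
```

Under this simplification the probability of a fully spurious match falls geometrically with the number of measurement lines, which is consistent with the reduced difference between skin edges and image edges for contours with many lines.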
From the performance point of view, skin edges can be computed slightly faster than the
image edges for two reasons: firstly, the LC skin colour classifier used to calculate whether
pixels belong to skin colour or not, can be evaluated very fast as it consists of only 5 simple
inequalities; and secondly, because in the average case only half of the pixels in a
measurement line will need to be processed in order to find a skin edge. The measurement
operation is typically the most computationally expensive operation in a particle filter.
Therefore, any speed increases in this operation result in significant speed increases during
tracking.
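The early termination that makes the skin edge search cheap can be sketched as follows. The search rule (first occurrence of two consecutive skin colour pixels, starting from the exterior end of the line) is the one stated in Section 6.1; the function name, the `is_skin` callable, and the returned index convention are assumptions of this sketch. Conversion of the returned index into a distance from the measurement point is omitted.

```python
def find_skin_edge(pixels, is_skin):
    """Return the index of the skin edge on one measurement line, or None.

    pixels are ordered from the end of the line exterior to the contour
    towards the interior end. is_skin is called lazily, so pixels beyond
    the edge are never classified.
    """
    previous = False
    for i, pixel in enumerate(pixels):
        current = is_skin(pixel)
        if previous and current:
            return i - 1  # first of the two consecutive skin pixels
        previous = current
    return None


# Toy usage: pixels are already 0/1 flags, so bool serves as the classifier.
print(find_skin_edge([0, 0, 1, 0, 1, 1, 1], bool))
```

In the toy usage the isolated skin pixel at index 2 is skipped and the edge is reported at index 4, after which the remaining pixel is never examined.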
We conclude that the use of skin colour for calculating a contour likelihood is more
attractive than using image edges, provided the skin colour segmentation is good.
6.2 Using only edge features in the measurement function
In the previous section we saw an argument in favour of the use of skin edges over the use of
image edges. In this section we test the performance of the sweep tracker (described in
Section 4.7) when using only image edges in the measurement function. The sweep tracker
has been modified in order to use image edges only. The measurement function uses the
same measurement lines as before, but these lines are now processed using the same edge
detection operator as in Blake and Isard (1998), in order to find image edges on them. This
edge detection operator uses a threshold value, which was set to 80. The method used for the
refinement of the fingers' length (Section 4.6) remains unchanged, as the two measurement
lines involved in it are processed using morphological operations which cannot be
reproduced using edges. Figure 6.4 shows the distance metric results on the test video sequence
of Section 4.9 for both the skin colour based sweep tracker, and the modified image edge
based sweep tracker; both trackers use 250 particles. We can see that the distance metric of
the sweep tracker using image edges is much worse than that of the sweep tracker using only skin
edges. The decrease in performance is related to the frequent mislocations of the fingers,
which produce the frequent peaks in the charts.
Figure 6.4: Distance metric of the skin edge vs. the image edge based sweep tracker. (a), (b) Skin edge based
sweep tracker. (c), (d) Image edge based sweep tracker.
6.3 Combining edge detection and skin colour detection in the measurement function
In the previous section we saw that the performance of the sweep tracker using only image
edges in the measurement function is inferior to the performance of the same tracker using
skin edges in the measurement function. However, if both image edges and skin edges are
used together in the measurement function then the performance of the tracker can be
increased. In fact, there are some situations which cannot be tracked reliably if the edge
information is not available. Two of these situations are illustrated in Figure 6.5. In the left
column of Figure 6.5, there is a hand with the fingers together, Figure 6.5(a). When the fingers
are together the detected skin colour areas in the fingers become merged, making it difficult to
find skin edges between the fingers, mainly for the ring and middle fingers, Figure 6.5(b);
Figure 6.5: Situations in which the use of edges is essential for the correct location of the hand. (a) Hand
with closed fingers. (b) The skin colour areas of the fingers become merged, making it difficult to
find skin edges between the fingers. (c) The image edges between the fingers are more reliable. (d)
The skin colour of the subject's face occludes the skin colour of the thumb. (e) The thumb position
cannot be recognized from the skin colour information. (f) The image edges of the thumb are not
occluded by the subject's face.
however, it is possible to find image edges between the fingers, Figure 6.5(c). In the right
column of Figure 6.5, the thumb of the target hand is in front of the subject's face, Figure
6.5(d). If we look at Figure 6.5(e), we can see that the skin colour of the face partially
occludes the skin colour of the thumb. It is only by using the edge information, Figure 6.5(f),
that the thumb location can be found.
The next step is to combine the information gained
from an image edge and a skin edge in order to have a single score that can represent the
measurement point. One possible approach is to use the same Gaussian profile as in the skin
colour based measurement function of Section 4.3 for both the skin edge distance and the
image edge distance and to multiply the results together. The approach taken is exactly this
one, but the products of the two Gaussian profiles are pre-calculated in a combination matrix
for efficiency. The Gaussian profiles are transformed so that when both the image edge distance
and the skin edge distance are zero, the score for that measurement point is 2; and when both the
image edge distance and the skin edge distance are 10, the score for that measurement point is 0.5.
Note that when either no image edge or no skin edge is found along the measurement line, the
distance value used for that feature in the combination matrix is 10. Figure 6.6(a) shows the values
of the combination matrix, and
Figure 6.6(b) shows the graphical representation of that combination matrix.
Figure 6.6: Combination matrix. (a) Values of the combination matrix. (b) Graphical representation of the
combination matrix.
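A combination matrix with the two stated end values can be sketched as follows. The thesis does not give the transformation applied to the Gaussian profiles in this section, so the sketch assumes an affine rescaling of a Gaussian such that each profile equals the square root of 2 at distance 0 and the square root of 0.5 at distance 10; their product then gives 2 and 0.5 at the two corners. The value of `sigma`, the function names, and the use of `None` for a missing feature are hypothetical.

```python
import math


def profile(d, sigma=5.0, at_zero=math.sqrt(2.0), at_ten=math.sqrt(0.5)):
    """Gaussian profile rescaled so that profile(0)=at_zero and profile(10)=at_ten."""
    def gauss(x):
        return math.exp(-(x * x) / (2.0 * sigma * sigma))

    t = (gauss(d) - gauss(10)) / (gauss(0) - gauss(10))  # 1 at d=0, 0 at d=10
    return at_ten + (at_zero - at_ten) * t


# Entry [i][s]: score for image edge distance i and skin edge distance s (0..10).
combination = [[profile(i) * profile(s) for s in range(11)] for i in range(11)]


def score(image_distance, skin_distance):
    """Look up the pre-calculated score; None means no feature was found (index 10)."""
    i = 10 if image_distance is None else image_distance
    s = 10 if skin_distance is None else skin_distance
    return combination[i][s]


print(round(score(0, 0), 3), round(score(None, None), 3))
```

Pre-calculating the 11 by 11 table means that scoring a measurement line during tracking reduces to a single lookup, which matters because the measurement step dominates the cost of the particle filter.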
The sweep tracker of Section 4.7 is modified once again in order to find both image edges
and skin edges along the measurement lines, and get a score for each measurement line based
on the combination matrix. The modified sweep tracker is tested on the test video sequence
of Section 4.9. The threshold value of the edge detection operator is set to 100. The
performance results of this test are shown in Figure 6.7. The values for the distance metric in
charts (a) and (b) are, on average, smaller than those of the sweep tracker using only skin
colour in the measurement function. However, there are various peaks in the distance metric
chart, frames 360 to 890, Figure 6.7(b), that did not appear on the equivalent chart for only
skin edges, Figure 6.7(d). These peaks in the distance metric are produced by mislocations of
the fingers. The use of both skin edges and image edges in the measurement function is
beneficial on average. However, under certain specific circumstances the image edges can
introduce fitting errors to the tracking.
Figure 6.7: Performance of the sweep tracker using edges and skin colour in the measurement function.
(a), (b) Skin edge and image edge based sweep tracker. (c), (d) Skin edge based sweep tracker. (e), (f) Image
edge based sweep tracker.
6.4 Conclusions
We have compared the use of image edge features against skin colour features. We argued
that if no other skin colour objects interfere with the target hand, the use of skin colour
features is more reliable (and faster) than the use of edge features. Then, it was shown that
the performance of the sweep tracker when using both image edge features and skin colour
features was better on average than using only skin colour features. Finally, although the
use of edge features could be essential in order to successfully track certain situations in
which the skin colour areas are merged or occluded, the use of edge features alone proved to
be inferior to the use of only skin colour features, and introduced some fitting errors
when combined with skin colour features. The most common error consists of peaks in the distance
metric. These peaks are the result of mislocations of the fingers. This is a worse error than it
appears in the distance metric charts because, if the tracker were to be used in an HCI system, a
finger mislocation would result in a wrong input for the HCI system.
7 Tracking improvements
A typical strategy to increase the accuracy and robustness of contour trackers based on
particle filters is to increase the number of particles used in the filter (Isard (1998); Isard and
Blake (1998a)). This strategy can also be used in the two articulated hand contour trackers
presented in Chapter 4. However, such an increase in the number of particles results in a
slowdown of the tracking, which is not good for real-time applications. A trade-off between
the accuracy and the execution speed of the trackers has to be reached by using a certain
number of particles. This chapter presents a number of techniques that can improve the
tracking of the articulated hand trackers presented in Chapter 4, without having to increase
the number of particles. These techniques can improve the following aspects of the tracking:
• Tracking accuracy, by using the techniques of Section 7.1, Section 7.3, and Section 7.4.
• Tracking robustness, by using the techniques of Section 7.3, and Section 7.4.
• Tracking repeatability, by using the technique of Section 7.2.
• Automatic reinitialisation, by using the technique of Section 7.4.
These techniques can be used separately or together, each one bringing an improvement to
the tracking. The techniques are first introduced and tested independently, by just adding the
technique in question to the sweep tracker of Section 4.7. Finally, the techniques are
combined together in a single tracker. This results in an improved tracker that benefits from
the improvements of each technique.
7.1 Switching template fitting methods during articulated tracking
This section describes two methods of fitting deformable templates when tracking
articulated objects using particle filters. One method fits a template to each of the links of an
articulated object in a hierarchical way; the method first fits a template for the base of the
articulated object and then fits a template for each of the links deeper in the hierarchy. The
second method fits the whole articulated object as a rigid object, and then refines the fitting
for each of the links of the articulated object in a hierarchical way, starting from the base.
Advantages and disadvantages of each method are discussed and a way of combining the
best of each method in a single tracker is presented. Results are given for the case of
articulated hand tracking.
7.1.1 Fitting templates to the links of an articulated object
This section discusses two methods of fitting an articulated template model to an articulated
object. It must be taken into account that when we refer to “fitting”, it is in the context of
particle filters, in particular the Condensation filter. This means that a template fitted to a
link is in fact a set of particles (various hypotheses of the template configuration) that
represent the link to some degree of accuracy. For display purposes one of these particles,
normally the one that fits the link best, or a weighted average of these particles is selected.
This means that in general the fitting of the template to the link is not perfect. Due to this
effect, when using particle filters to track an object through a video sequence, the output of
the tracking exhibits a certain degree of jitter. This jitter can depend on many factors, but in
general, the more particles the filter uses, the less jitter the output exhibits. Using other filters
like Kalman or recursive least squares can produce an optimal fit. However, these filters often
cannot handle background clutter well.
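The two ways of reducing a particle set to a single displayed configuration can be sketched as follows. The sketch assumes that each particle is a plain list of state values with an associated non-negative weight; the function names are illustrative. It averages each state component linearly, which would need care for angular components that wrap around.

```python
def best_particle(particles, weights):
    """Return the state of the particle with the highest weight."""
    index = max(range(len(weights)), key=weights.__getitem__)
    return particles[index]


def weighted_mean(particles, weights):
    """Return the weighted average of the particle state vectors."""
    total = sum(weights)
    dims = len(particles[0])
    return [
        sum(w * p[k] for p, w in zip(particles, weights)) / total
        for k in range(dims)
    ]


particles = [[0.0, 0.0], [2.0, 4.0]]
weights = [1.0, 3.0]
print(best_particle(particles, weights))
print(weighted_mean(particles, weights))
```

Either output is only an estimate drawn from a finite sample, which is one way to see why the displayed template jitters from frame to frame and why the jitter shrinks as the number of particles grows.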
The first method of fitting an articulated template to an articulated object involves finding
the configuration of the base link of the articulated object; then finding the configuration of
the second link relative to the first; then the configuration of the third link relative to the
second, and so forth. We will refer to this method as method 1.
Another method of fitting an articulated template to an articulated object is to first try to fit a
previous configuration of all the links of the articulated template as a single rigid template,
which we refer to as the combined template. Then the base link is refined using the position
of the base link in the combined template as an initial position estimate. We then proceed to
refit the second link, with respect to the first link, then the third, and so forth. We will refer
to this method as method 2.
When tracking an articulated object through a video sequence the fitting procedure, either
method 1 or 2, is repeated for each frame. Figure 7.1 shows a representation of the fitting
procedure for both methods 1 and 2. Templates in their fitted position are shown in grey. The
slight misalignments between the template and the link represent the typical jitter observed
in particle filters.
Method 2 has one more step than method 1. On the other hand, the final fit is often better
than with method 1, which means less jitter. The reduced jitter in method 2 can be explained
assuming that the initial configuration of the combined template is close to the true
configuration, and that there is significant background clutter. In general, the contour of the
combined template is more distinctive than the contour of any individual link. The combined
template is, therefore, less likely to match features on the background, as it is more likely to
be a unique structure in the image.
This suggests that method 2 is better than method 1; however, the final fitting of the
articulated template depends on how good the initial fitting of the combined template as a
rigid object is. A potential problem that method 2 could face is illustrated in Figure 7.2. In
this case the second and third links of the object have changed considerably from the last
frame, and there is also background clutter that can distract the fitting procedure. In this
situation, the fitting of the combined template as a rigid object could be far from the real
position, and the following refitting steps for each of the links are not going to be able to find
the right configuration of links (because they are too far from the real configuration). This
effect can be carried on from one frame to the next, stopping the tracker from recovering. On
the other hand, if we use method 1 to fit the new configuration of links, the fit of the base
link is not going to be affected by the clutter that affects the 2nd and 3rd links in method 2. It
is more likely that the fit of the base link will be correct in this case, and consequently, the
rest of links may be fitted more precisely.
Figure 7.1: Fitting methods for an articulated object. On the left, fitting method 1, steps 1 to 3. On the right
fitting method 2, steps 1 to 4.
Figure 7.2: Potential problem of method 2. Left, fitting of the combined template as a rigid object is wrong
due to the significant change on the configuration of the articulated object and the effect of
distracting features on the image. Right, fitting of the base link alone is more precise.
In summary, the advantages of method 1 are fewer steps and better capability to keep and
recover tracking. On the other hand a particle filter tracker using method 1 exhibits more
jitter than using method 2. The advantage of method 2 is that it has less jitter than method 1;
however, it is more likely to lose track and not recover again. It makes sense to combine both
methods in a single tracker. One way of doing this is to switch from one method to the other
depending on the tracking conditions. This is what has been implemented in the following
articulated hand tracker.
7.1.2 Simplified articulated hand tracker
In order to illustrate the two template fitting methods, we have constructed a simplified
version of the sweep articulated hand contour tracker of Chapter 4. This articulated hand
tracker implements Condensation (Isard and Blake, 1998a) and uses ideas of partition
sampling (MacCormick and Blake, 1999; MacCormick and Isard, 2000).
Each of the particles contains the parameterisation of an articulated hand template such as the one
shown in Figure 7.3. This template is allowed to undergo Euclidean similarity transformations, i.e.
translation, rotation and scale. These transformations are applied with respect to the point
indicated as hand pivot in Figure 7.3. Each of the fingers, including the thumb, can rotate
around finger pivots, also indicated in Figure 7.3, in order to model the abduction/adduction
movements of the fingers. These are the only modelled movements of the hand and fingers,
in total 9 DOF. In comparison with the articulated hand contour model described in Section
4.1, the simplified model used in this section lacks the second thumb joint, and the projected
length of the little, ring, middle and index fingers; therefore, these parameters cannot be
tracked. Seeing the hand template as an articulated object, the palm of the hand would be the
base link, and the fingers would introduce a second level in the hierarchy.
Figure 7.4 shows a flow chart of the hand tracker's operation. The initial position of the hand
template (initial particle) is adjusted manually on top of the hand. When the tracking starts, a
distribution of particles is generated from the initial particle. The particles evolve in time
following the Condensation algorithm. After each Condensation time-step, the fingers’
angles are found for a subset of the particles with the highest likelihood of representing the
configuration of the hand. The fingers’ angles are found using a deterministic search. This
deterministic search involves sweeping a certain range of angles, for each of the selected
particles, and selecting the angle for which the finger template has the highest likelihood of
representing the finger. This differs from the sweep implementation of Section 4.7 in that
there is no particle interpolation. This hand tracker can track a hand moving in a plane
parallel to the image, allowing abduction/adduction movement of the fingers. In addition, the
fact that in each time-step of the Condensation algorithm several particles, or hypotheses, are
propagated to the next step, allows the hand tracker to have a certain degree of resistance to
background clutter.
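The deterministic finger search described above can be sketched as follows. This is a minimal illustration under stated assumptions: the likelihood callable, the sweep range and the number of steps are hypothetical placeholders, and the real tracker evaluates a contour measurement function rather than the toy function used in the usage note.

```python
def sweep_finger_angle(likelihood, centre, half_range, steps):
    """Sweep angles in [centre - half_range, centre + half_range] and keep the best.

    likelihood: callable mapping a finger angle to a likelihood score
                (a stand-in for the finger template measurement).
    Returns (best_angle, best_score).
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    best_angle, best_score = centre, float("-inf")
    for i in range(steps):
        # Evenly spaced candidate angles across the sweep range.
        angle = centre - half_range + 2.0 * half_range * i / (steps - 1)
        score = likelihood(angle)
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle, best_score
```

For example, `sweep_finger_angle(lambda a: -(a - 0.1) ** 2, 0.0, 0.5, 11)` returns an angle close to 0.1. In the tracker this search would be repeated for each finger of each selected particle.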
Figure 7.3: Simplified articulated hand contour model with 9 DOF.
Figure 7.4: Flow chart of the simplified hand tracker.
7.1.3 Results with the simplified articulated hand tracker
This section gives results of the articulated hand tracker described in the previous Section
7.1.2, supporting the conclusions about jitter and capability of tracking an articulated object
for method 1, method 2, and the combination of both methods. Finally, we study how the
combination method affects the tracking performance of the hand contour tracker described
in Section 4.7 (sweep implementation). Videos of the experiments are available in Appendix
B and on the supporting webpage as "video_sequence_7.1.avi" for the tracking output using
method 1; "video_sequence_7.2.avi" for the tracking output using method 2; and
"video_sequence_7.3.avi" for the tracking output using the combination of both methods.
In the articulated hand tracker, the palm of the hand is the base link of an articulated object,
and the fingers form a second level in the hierarchy of links. The palm consists of 3 contour
segments, as indicated in Figure 7.3. When fitting the hand template to the hand in an image
in method 1, the 3 contour segments of the palm are fitted first, and then the angle for the
fingers is found. In method 2 the whole hand is fitted as a rigid object and then the fingers
are refitted. When using method 2, the palm, which is the base link, is not refitted. This is
done in such a way because it reduces processing time considerably.
The combination of methods 1 and 2 is based on how good the lock on the fingers is. If the
lock on all the fingers is good, method 2 is used. As soon as the lock on a finger is lost, then
the tracker switches to use method 1. The criterion to decide whether a finger has a good
lock is based on a threshold of the likelihood of a finger template representing a finger on the
image.
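The switching criterion just described can be sketched in a few lines. The function name and the threshold argument are assumptions for illustration; the text does not give the threshold value used in this experiment.

```python
def choose_fitting_method(finger_likelihoods, lock_threshold):
    """Return 2 while every finger has a good lock, otherwise fall back to 1.

    finger_likelihoods: one likelihood per finger template.
    lock_threshold:     hypothetical likelihood below which a lock counts as lost.
    """
    all_locked = all(value >= lock_threshold for value in finger_likelihoods)
    return 2 if all_locked else 1
```

With this rule a single poorly fitted finger is enough to send the next frame through method 1.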
All the experiments have been made using a fixed number of 250 particles for the
Condensation tracker. The input is a video sequence of 240 frames containing a hand moving
parallel to the image plane. The sequence includes both motion of the hand as a whole, and
of the fingers in relation to the palm. In this video sequence there are two critical zones.
From frame 66 to frame 80, both methods 1 and 2 almost lose lock on the hand due to a fast
horizontal rigid movement of the hand. Later, from frames 186 to 209, there is a fast
rotational movement of the hand that confuses the tracker using method 2. This results in the
fingers of the hand template being locked on the wrong fingers and not being able to recover
again. Method 1 and the combined method are able to recover from this situation.
Figure 7.5 shows selected frames from the critical zone in the range 186 to 209, for the
methods 1, 2 and the combination of both. Each of the frames from the combination of
methods 1 & 2, contains a number, either 1 or 2. This number indicates the method used for
that frame. At the beginning of the sequence the lock is kept in the three cases. From frame
191 there is a loss of lock in all three cases. Note that in method 2, the wrong lock is kept.
The little and ring finger templates are locked on the ring and middle fingers, leaving the
middle finger template unlocked between the middle and index fingers. Finally, from frame
204, method 1 and the combined method recover the lock on the hand. However, method 2
continues with a wrong lock and does not recover the correct track.
Figure 7.5 layout: rows are frames 186, 191, 197, 204 and 209; columns are Method 1, Method 2 and Combination 1 & 2.
Figure 7.5: Selected frames from the critical zone. Method 1 and the combination 1 & 2 can track the whole
sequence. Method 2 gets confused and keeps a lock on the wrong fingers.
On the other hand, when comparing the tracking sequences for method 1 and 2, it is possible
to see that method 1, despite recovering from the second critical zone, has more jitter than
method 2. Jitter is the result of small misalignments between the fitting of the fingers and
palm templates and the real fingers and palm positions. These misalignments will be
different from one frame to the next, producing the impression that the output of the tracker
is shaking on top of the target, despite the lock being kept all along. This jittery output is
inherent to particle filters since they generally calculate an approximation to a solution.
When the number of particles used in a particle filter is higher, the approximation to the
solution is better, which means smaller misalignments and therefore less jitter (footnote 5). Figure 7.6
shows examples of the misalignments that jitter produces for the three methods.
A plot of the variance of the parameters controlling the rigid movement of the hand shows
the amount of jitter exhibited in the tracker's output for each frame of the video sequence.
This variance is calculated over the set of particles that propagate from one time-step to the
following one. Figure 7.7 shows the variance (in pixels²) of the x and y coordinates of the
hand pivot for the three methods. Figure 7.8 shows the variance of the rotation angle and
scale factor of the whole hand as a rigid object. The rotation angle and scale have different
units (radians and a dimensionless scaling factor); however, they are shown on the same chart as their
variances are in the same range.
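The jitter measure used in these plots can be sketched as a per-parameter variance over the propagated particles. The sketch assumes an unweighted population variance and a list-of-lists particle layout; both are assumptions made for illustration, since the text does not state whether the variance is weighted.

```python
def parameter_variances(particles):
    """Population variance of each parameter over a particle set.

    particles: list of equal-length parameter lists, for example
               [x, y, rotation, scale] for the rigid hand motion.
    """
    n = len(particles)
    dims = len(particles[0])
    means = [sum(p[k] for p in particles) / n for k in range(dims)]
    return [sum((p[k] - means[k]) ** 2 for p in particles) / n
            for k in range(dims)]
```

Calling this once per frame and plotting each component against the frame number gives curves of the kind described for Figure 7.7 and Figure 7.8.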
Both Figure 7.7 and Figure 7.8 show a peak in the variance near frame 80. This corresponds
to the first critical zone. It can be appreciated that for all four parameters the variance is
smaller in the case of method 2. In the chart for the combined method, there is a signal that
tells when the tracker is using method 1 or method 2. It is possible to see that most of the
time the tracker is in method 2, switching to method 1 for short periods of one or two frames.
This brief switching from method 2 to method 1 is enough in most cases to allow the tracker
to recover a good lock on the fingers. The exception is during the first and second critical
zones, when the tracker is predominantly using method 1. In Figure 7.7 it is also possible to
see that for method 1, the variance for y is bigger than the variance of x. This is because the
three segments that form the hand palm are mostly in a vertical orientation. The
measurement function used to calculate the likelihood of a template representing an object in
the image uses edge and colour information in the same way as MacCormick and Isard
(2000). Having the three segments of the hand palm in a vertical orientation allows the
tracker to be more precise on the horizontal location of the template than on the vertical.
5 Jitter also depends on how well the template models the object. If the template does not model the object properly the fitting will jump between local minima.
Figure 7.6: Various misalignments for each method of fitting the articulated template (panels: Method 1, Method 2, Combination 1 & 2). Frame 150. Method
1 tends to produce more misalignments.
Figure 7.7: X and Y variance of the hand pivot.
Figure 7.8: Variance of the rotation angle and scale factor.
7.1.4 Tracking performance with the sweep implementation
In this section, we study how the combination of fitting methods 1 and 2 affects the tracking
performance of the hand contour tracker described in Section 4.7. This is the sweep
implementation; it originally uses template fitting method 1, but for this experiment we add
the method switching capability. The template fitting method is switched from method 2 to
method 1 when the likelihood of any individual finger is smaller than 0.01 for at least three
consecutive frames; otherwise the fitting method is 2. The hand tracker is run on the test
video sequence of Section 4.9, frame intervals 30-174 and 360-890. Note that the frame
interval 175-359 is not used because the sweep tracker (as described in Section 4.7) cannot
keep tracking during this interval. The tracker is initialised at frames 30 and 360. The
evaluated performance measures are cost function, distance metric, and SNR (these
performance measures are described in Section 4.8). Videos of the tracking output are
available in Appendix B and on the supporting webpage as "video_sequence_7.4.avi" for
frames 30-174, and "video_sequence_7.5.avi" for the frames 360-890.
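The switching rule used in this experiment (likelihood below 0.01 for at least three consecutive frames) can be sketched as a small state machine. The class name is hypothetical, and the sketch assumes that the three-frame count applies to frames in which any finger is below the threshold, not necessarily the same finger each time; the text leaves this detail open.

```python
class MethodSwitcher:
    """Selects fitting method 1 or 2 from per-frame finger likelihoods."""

    def __init__(self, threshold=0.01, frames_required=3):
        self.threshold = threshold
        self.frames_required = frames_required
        self.low_run = 0  # consecutive frames with at least one weak finger

    def update(self, finger_likelihoods):
        """Feed one frame of finger likelihoods and return the method to use."""
        if min(finger_likelihoods) < self.threshold:
            self.low_run += 1
        else:
            self.low_run = 0
        return 1 if self.low_run >= self.frames_required else 2
```

Requiring several consecutive weak frames avoids switching to method 1 on a single noisy measurement, which is consistent with the short method 1 intervals reported for the combined tracker.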
Figure 7.9 shows the tracking performance results when using the combined template fitting
method. On the left are the frames from 30 until 174; on the right are the frames from 360
until 890. Let us compare these results with the sweep results in Figure 4.13 and Figure 4.17;
we will refer to these figures as the results for fitting method 1. We can see that the cost
function, Figure 7.9(a) and (b), in the combined method has smaller average values than in
method 1. At first glance this suggests worse performance with the combined method;
however, the results for the distance metric, Figure 7.9(c) and (d), are considerably better for
the combined method than for method 1. The most important data, though, are the variance
results for the distance metric, which are much smaller in the combined method than in
method 1; this indicates that the tracking is considerably more stable with the combined
method than with method 1, as expected. Figure 7.9(e) and (f) show the SNR results; the
average SNR is only marginally higher in the combined method than in method 1. Finally,
Figure 7.9(g) and (h) show how the template fitting method switches between method 1
and method 2 along the test video sequence.
Figure 7.9: Sweep implementation tracking performance when using the combined template fitting
method. The performance measures are cost function, (a) and (b); contour distance, (c) and (d); and
SNR, (e) and (f). The bottom row shows the switching between fitting methods, (g) and (h).
7.1.5 Conclusions
We have described two methods for fitting templates to the links of an articulated object.
Method 1 can keep track and recover track better than method 2, but it has more jitter than
method 2. Method 2 has less jitter than method 1 but can lose track more easily. We present
a method of combining both methods, by switching between the two, that allows a more
robust tracking and less jitter when the tracking conditions are good.
Though the articulated tracking presented here is based on Blake and Isard’s Condensation
algorithm, and the measurement model described in (Isard and Blake, 1998; MacCormick
and Isard, 2000), there are two major differences. Blake and Isard’s method uses only one
fitting method, similar to the method 1 in this chapter. This means it does not use different
template fitting methods for different tracking conditions. Another difference is in the
implementation of the hand tracker. In the proposed methods, the fitting of the palm or the
hand (depending on whether method 1 or method 2 is in use) follows largely the
Condensation algorithm, but the fitting of the fingers is achieved by a deterministic search
instead of having separate particle distributions for fingers. Another way of making this point
clear is to say that some of the particle parameters, at each time-step, are found using the
Condensation algorithm (parameters for the hand as a rigid object), while others are found by
a deterministic procedure (parameters describing angles between fingers).
Finally, we applied the template switching method to the sweep hand tracker implementation
of Section 4.7. The results were favourable, showing more tracking stability, and even more
accurate tracking, than the original sweep implementation, which uses method 1.
7.2 Quasi random sampling
The dynamical model of the Condensation algorithm, as described in Section 3.3, is
composed of a deterministic part, and a stochastic part. The stochastic part of the dynamics is
represented by the term wB in Equation (3.11); where w is a vector of independent random
normal N(0,1) variates, and B is a matrix that modulates the strength of the stochastic
component for each one of the dimensions of the configuration space. The reason for having
a stochastic component in the dynamics is to allow a random sampling of the configuration
space around a point fixed by the deterministic part of the dynamics.
The random normal variates N(0,1) are typically generated using a uniform pseudo-random
number generator, whose output is then shaped into a Gaussian. A common implementation
uses the system rand() function, which is almost always a linear congruential generator,
as a uniform pseudo-random number generator. These generators, although very fast, have
an inherent weakness: they are not free of sequential correlation on successive calls. If
one of these pseudo-random number generators is used to generate points in a k-dimensional
space, the points will not fill up the space evenly, clumping in some regions and leaving large
gaps in others. These effects tend to worsen as the number of dimensions increases.
Thus the random sampling will be sub-optimal and even inaccurate.
A promising extension to Condensation that addresses the sub-optimality of random
sampling is the incorporation of quasi-Monte Carlo methods (Press, et al., 1996;
Niederreiter, 1992). In such methods, the sampling is not done with random points, but with
a carefully chosen set of quasi-random points that span the sample space so that the points
are maximally far away from each other. Philomin, et al. (2000) used quasi-random sampling
with Condensation tracking; they reported superior tracking performance, in a pedestrian
contour tracking application, when replacing random sampling with quasi-random
sampling. The use of quasi-random sampling constitutes an interesting and straightforward
strategy in order to increase the performance of Condensation tracking. This section explores
how the articulated hand trackers presented in Chapter 4 can benefit from the use of quasi-
random sampling.
7.2.1 Quasi-random sequences
There exist a number of quasi-random sequences which possess beneficial properties for
sampling a configuration space; some examples include Hammersley, Halton, Sobol, Faure
and other sequences (Morokoff, 1994; Niederreiter, 1992; Tezuka, 1992). The values of a
quasi-random sequence are generated in groups for a specific dimension; for example if the
configuration space has 8 dimensions, the values of a quasi-random sequence will be
generated in groups of 8 – forming points in the target configuration space. Intuitively, the
points resulting from a quasi-random sequence must be distributed such that any subvolume
in the space should contain points in proportion to its volume. The difference between this
quantity and the actual number of points in the subvolume is called the discrepancy. Quasi-
random sequences have low discrepancies and are also called low-discrepancy sequences.
Thus, quasi-random sequences can be used to generate samples that fill the configuration
space in a more desirable way. Some of the criteria that define a desirable sampling of a
configuration space are (Lindemann and LaValle, 2003):
• Uniformity: Good covering of the space is obtained without clumping or gaps. This can
be formulated in terms of optimising discrepancy.
• Lattice structure (footnote 6): For any sample, the location of nearby samples can easily be
determined.
• Incremental quality: If the sequence is suddenly terminated, it has a decent coverage.
This is an advantage over a sequence that only provides high-quality coverage for a fixed
n.
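The notion of discrepancy introduced before the list of criteria can be illustrated with a minimal sketch for a single box. This example makes two assumptions that go beyond the text: the box is anchored at the origin of the unit hypercube, and the comparison uses the fraction of points rather than the raw count. The full discrepancy of a point set takes the worst case over all such boxes, which this sketch does not attempt.

```python
def box_discrepancy(points, upper):
    """Absolute difference between the fraction of points that fall in the
    box [0, upper[0]) x ... x [0, upper[d-1]) and the volume of that box."""
    inside = sum(1 for p in points
                 if all(x < u for x, u in zip(p, upper)))
    volume = 1.0
    for u in upper:
        volume *= u
    return abs(inside / len(points) - volume)
```

A well-spread set such as the four points (0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75) gives zero for the box with upper corner (0.5, 0.5), whereas a clumped set gives a large value.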
Figure 7.10 shows the result of plotting 250 points from (a) a pseudo-random sequence, (b) a
Halton sequence, (c) a Sobol sequence. The Halton and Sobol sequences are generated for a
dimension d=4; then dimensions 1 and 2 are plotted. Notice how the pseudo-random points
clump in some regions, while in other regions there are gaps. The Halton points are better
distributed; they have lower discrepancy than the pseudo-random points. The Sobol points
also have a low discrepancy, and they show a more regular pattern than Halton points, so
they have a better lattice structure. The patterns will look different when different
dimensions are plotted together. The Halton sequence uses a different prime number as the
base for each dimension. The Sobol sequence is based on a set of direction numbers,
also specific to each dimension.
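To make the role of the prime numbers concrete, the following sketch generates Halton points with the standard radical-inverse construction. It is a textbook formulation offered for illustration, not necessarily the generator used for Figure 7.10; the function names and the choice of the first four primes for d=4 are assumptions.

```python
def radical_inverse(index, base):
    """Reflect the base-`base` digits of `index` about the radix point."""
    result = 0.0
    fraction = 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def halton_point(index, bases=(2, 3, 5, 7)):
    """Return the `index`-th Halton point, one coordinate per prime base."""
    return [radical_inverse(index, b) for b in bases]
```

In base 2 the indices 1, 2, 3 map to 0.5, 0.25, 0.75, so each new point falls into the largest gap left by the previous ones. Index 0 maps to the origin and is usually skipped.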
Figure 7.10: Distributions of pseudo-random and quasi-random points. Figures show 250 points generated
using (a) pseudo-random sequence, (b) Halton sequence, and (c) Sobol sequence.
7.2.2 Application of quasi-random sequences in Condensation
In a Condensation algorithm, the stochastic part of the dynamics is generated from a vector
of independent random normal N(0,1) variates. The usual way to obtain N(0,1) variates is
to apply the Box-Muller algorithm (Press, 1992) to a uniform (pseudo-random)
distribution. However, when the uniform distribution is a low-discrepancy sequence, the Box-Muller
algorithm damages the low-discrepancy properties of the sequence, altering the order of the
sequence, or scrambling the sequence uniformity (Moro, 1995; Galanti and Jung, 1997). In
this thesis, the conversion from a uniform quasi-random distribution to a Gaussian
quasi-random distribution is achieved using the Moro transformation (Moro, 1995): a Gaussian
value g is obtained from the uniform value u by applying the following mapping to each of
the dimensions of the configuration space:
$$g = \sqrt{2}\,\operatorname{erf}^{-1}(2u - 1) \tag{7.1}$$
where $\operatorname{erf}^{-1}$ is the inverse of the error function given by
$$\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\,dt$$
6 The lattice structure criterion is not typically so important in Monte Carlo methods, but it is important in other sampling applications.
The results of applying this transformation to the uniform distributions of Figure 7.10 are
shown in Figure 7.11. These plots show only two of the dimensions, but similar plots would
result from plotting other dimensions of the configuration space. Notice how even after
transforming the original sequences into Gaussian distributions, those still retain properties
from the original uniform distributions: Figure 7.11(a) shows some clumping and gaps
among the points, Figure 7.11(b) and (c) have a better coverage.
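The mapping of Equation (7.1) is the inverse cumulative distribution function of the standard normal. The sketch below uses the inverse CDF from the Python standard library as a stand-in; it is not Moro's specific numerical approximation, and the function names are assumptions for this example.

```python
import math
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def uniform_to_gaussian(u):
    """Map u in the open interval (0, 1) to a standard normal value.

    Mathematically this equals sqrt(2) * erfinv(2u - 1). The library call
    raises an error for u <= 0 or u >= 1, so a quasi-random point with a
    coordinate of exactly 0 must be skipped before the mapping is applied.
    """
    return _STANDARD_NORMAL.inv_cdf(u)


def gaussian_to_uniform(g):
    """Forward mapping built from math.erf, used only to check the inverse."""
    return 0.5 * (1.0 + math.erf(g / math.sqrt(2.0)))
```

Because the mapping is monotonic and is applied to each coordinate independently, the relative ordering of the points along every axis is preserved, which is why the transformed Halton and Sobol points keep their even coverage.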
Figure 7.11: Gaussian transformation of uniform pseudo-random and quasi-random points. Figures show
250 points transformed into a Gaussian distribution from the uniform distributions of Figure 7.10, for (a) pseudo-random, (b) Halton, and (c) Sobol sequences.
The Gaussian transformed quasi-random sequences can be directly used as the term w in the
particle dynamics, Equation (3.11), of the Condensation algorithm. Each time a particle is
propagated from one time-step to the next one, the seed for the Halton and Sobol sequence
generators is reset. This ensures a coherent sampling for each of the propagated particles.
7.2.3 Results
The performance of the articulated hand contour tracker of Section 4.7, sweep
implementation, was tested using three sampling methods: pseudo-random sampling, Halton
quasi-random sampling, and Sobol quasi-random sampling. The tests were performed on the
test video sequence of Section 4.9, frames 30 to 174, and from 360 to 890. The tracker was
initialised at frames 30 and 360. The recorded performance measure is the contour distance,
thus smaller values mean better performance. The quasi-random sampling always produces
the same performance results when run on the same video sequence, as the sample positions
are always the same. However, the performance using pseudo-random sampling may change
slightly, even in the same video sequence, depending on the particular sequence of numbers
produced by the pseudo-random number generator. Therefore, in order to make a fair
comparison between the quasi-random and pseudo-random samplings, the performance for
pseudo-random sampling is averaged over 100 trials on the same video sequence and
the same frame interval.
Table 7.1 shows the contour distance performance metric for the three sampling methods. For
each sampling method the average, variance, and median of the distance metric is shown.
For frames 30 to 174, we can see that the Halton, and Sobol samplings produce better
average and median results than the pseudo-random sampling; the variance for Halton and
Sobol samplings is also smaller than with the pseudo-random sampling. However, when we
look at the results for frames 360 to 890, the pseudo-random sampling has slightly better
average and variance results than the Halton and Sobol samplings; the median, though, is
slightly smaller in the case of Halton and Sobol than in pseudo-random.
7.2.4 Conclusions
From the results shown in the previous section, no significant performance increase is
perceived when using quasi-random sampling instead of pseudo-random sampling. Indeed, in
the second interval of frames, from 360 to 890, the results point to slightly inferior
performance for quasi-random sampling. In any case, these differences in performance are very small,
and could well be produced by the small disagreements between the distance metric and the
cost function, see Section 4.11. Previous research by Philomin, et al. (2000) reported
superior performance when using quasi-random sampling with Condensation tracking.
Table 7.1: Contour distance performance metric comparison using three sampling methods.
However, their results are presented for a basic Condensation tracker with synthetic
experiments, tracking an ellipse; and for a pedestrian tracking application. The tracking
mechanisms of these trackers differ from the ones used in the articulated hand contour
trackers of this thesis. Partition-sampling (Section 3.4.1), particle interpolation (Section 3.7),
finger sweep searches (Section 4.7), and techniques for refinement of the finger length
estimations (Section 4.6), can affect the tracking performance to such a degree that any
improvements a quasi-random sampling could introduce would be severely dampened in the
distance metric performance measure.
On the other hand, although no real performance improvement is observed when quasi-random
sampling is used in the articulated hand tracker, the use of quasi-random sampling is still
highly desirable because it gives repeatability to the tracking. When using pseudo-random
sampling in the articulated hand tracker, the performance may vary from one trial to the next
one, even if the tracking is run on the same video sequence. When using the hand tracker in
HCI applications such as the VTS interface proposed in Chapter 8, these differences in
performance could mean that exactly the same click event could sometimes be detected, and
other times not. The tracking repeatability gained by using quasi-random sampling
ensures that if a click event on a VTS is detected (or not) once, the same click
event will always be detected (or not).
7.3 Variable Process Noise (VPN)
The dynamical model of the Condensation algorithm, as described in Section 3.3, is
composed of a deterministic part, and a stochastic part. The stochastic part of the dynamics is
represented by the term wB in Equation (3.11); where w is a vector of independent random
normal N(0,1) variates, and B is a matrix that modulates the strength of the stochastic
component for each one of the dimensions of the configuration space. The matrix B is also
known as process noise. Typically, the matrix B is constant along the tracking. In this section
we shall see how tracking can be improved by varying the process noise according to the
current tracking conditions.
In Condensation terminology, the deterministic part of Equation (3.11) is referred to as
prediction, and the stochastic part is referred to as noise. The reason for having a noise
component in the dynamics is to allow a random sampling of the configuration space around
the prediction point. The process noise modulates the extent of the random sampling. When
the tracking performance decreases, because the target exhibits brisk motion, a greater
process noise can prevent the tracker from losing the target; as the extent of the random
sampling will be greater, and more likely to cover far apart states. When the tracking
performance increases, because the target exhibits slower motion, a smaller process noise
can be sufficient to keep a lock on the target, while increasing the resolution of the lock.
Thus, a variable process noise can be exploited in order to have a more accurate tracking
when slow motions occur, and a more robust tracking when brisk motions occur. Blake and
Isard (1998) use this idea in contour tracking with a Kalman filter; they define a
search region whose width is related to the Kalman filter's position covariance, which is a measure of
the tracking performance at each time-step. However, they do not use this idea with
Condensation tracking.
The tracking performance referred to in the paragraph above could be calculated, in
principle, with any of the performance measures of Section 4.8; however, a computationally
inexpensive performance measure can be defined in terms of the distribution of weights in a
particle set. Isard and MacCormick (2000) describe two extreme cases of the distribution of
weights in a particle set: if only one particle, in a particle set, has a high weight and the rest
of particles have a very small weight, there is significant danger that tracking could be lost.
On the other hand, if all the particles have the same weight, the tracking is more likely to
7 Tracking improvements 161
continue. Any particle set lies somewhere between these two extreme cases. This measure of
tracking performance can be used to control the process noise.
We use the relative weight of a particle, inside the particle set, to control the level of process
noise that is used when dynamics are applied to that particle; this only applies to particles
that are propagated from one time-step to the next one. We define three bands: in the first
band the process noise is reduced; in the second band the process noise is unaltered; and in
the third band the process noise is increased. Let S be the relative weight of a particle inside
the particle set; for example if a particle has S = 50%, this particle carries half of the particle
set's total weight. The process noise is controlled as follows (the constants have been found
empirically):
New Process Noise = Process Noise × 0.5, if S < 8%
New Process Noise = Process Noise × 1.75, if S > 54%
New Process Noise = Process Noise, otherwise
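The three-band rule can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes the process noise is summarised by a single standard deviation, omits the deterministic prediction, and the function names are hypothetical.

```python
import random

REDUCE_BELOW = 0.08    # S < 8%: process noise is halved
INCREASE_ABOVE = 0.54  # S > 54%: process noise is multiplied by 1.75


def relative_weights(weights):
    """Return S for every particle: its share of the particle set's total weight."""
    total = sum(weights)
    return [w / total for w in weights]


def noise_scale(s):
    """Map a relative weight S (0..1) to the process-noise multiplier."""
    if s < REDUCE_BELOW:
        return 0.5
    if s > INCREASE_ABOVE:
        return 1.75
    return 1.0


def add_process_noise(state, s, base_std, rng):
    """Add Gaussian noise to a propagated state (assumption: scalar std for all dimensions)."""
    std = base_std * noise_scale(s)
    return [x + rng.gauss(0.0, std) for x in state]


if __name__ == "__main__":
    rng = random.Random(0)
    shares = relative_weights([1.0, 1.0, 18.0])
    print([noise_scale(s) for s in shares])
    print(add_process_noise([0.0, 0.0], shares[2], base_std=1.0, rng=rng))
```

Computing S only requires the weights already held in the particle set, which is why this measure is inexpensive compared with the performance measures of Section 4.8.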
These rules were added to the sweep tracker implementation of Section 4.7, and tested with
the test video sequence of Section 4.9. A video of the tracking output is available in
Appendix B and on the supporting webpage as "video_sequence_7.6.avi". The distance
metric performance results are shown in Figure 7.12. The most remarkable result is that the
sweep tracker with VPN is capable of tracking the whole test video sequence, including
frames 175 to 359, which correspond to very brisk rigid-hand motion. Note that with the
exception of the VPN tracker, Section 7.3, and the SCGS tracker, Section 7.4, no other
tracker implementation in this thesis could track this part of the test video sequence.
Figure 7.12: Distance metric performance measure when using VPN. Note that the use of VPN allows
tracking the test video sequence's section with brisk rigid-hand movement, frames 175 to 359.
7 Tracking improvements 162
In order to compare the tracking performance results between the sweep hand tracker with
VPN, and the sweep hand tracker with fixed process noise, we repeat the test initialising the
tracking at frames 30 and 360. Tracking is also initialised at frame 175 in order to have a
separate chart of the tracking performance during the brisk rigid-hand motion section. We
can see that from frames 30 to 174, Figure 7.13(a) and Figure 4.13(d), the version with fixed
process noise has better average distance metric, and smaller variance than the version with
VPN. However, from frames 360 to 890, Figure 7.13(c) and Figure 4.17(d), the version with
VPN has better average distance metric, and smaller variance.
We conclude that the use of VPN makes the hand tracking more robust against brisk rigid-hand
motion, and also more accurate when the rigid-hand motion is slow (see footnote 7), as is the case
from frames 360 to 890. However, in the first section of the test video sequence, from frames
30 to 174, when the global hand motion is of medium speed, the tracking performance is slightly
reduced. It could be argued that during this section of the test video sequence the default
process noise is adequate, and the possibility of switching to a larger or smaller process
noise slightly degrades the tracking accuracy.
7.4 Skin Colour Guided Sampling (SCGS)
Isard and Blake (1989b) introduced a tracking technique named ICondensation. This
technique combines the use of a Condensation based contour tracker with a skin colour blob
tracker, in order to improve the robustness and allow for automatic reinitialisation of their
tracker. The blob tracker runs on a low-resolution image; it is fast and robust, but conveys
little information other than the object centroid. However, the information about the centroid
of the object is sufficient to describe what areas should be searched for information about the
object. This information can be introduced in the contour tracker in the form of an
importance function. The importance function g(X) describes which areas of the contour
tracker's state-space contain most information about the posterior. The idea is to concentrate
samples in those areas of state-space by generating samples from g(X) rather than the prior
p(X). The resulting samples are then weighted using a mixture weight which takes into
account both g(X) and p(X). The desired effect is to avoid as far as possible generating
any samples which have low weights, since they provide a negligible contribution to the
posterior.
Footnote 7: From frame 360 to 890 the hand motion is predominately articulated, while the rigid-hand motion is generally slow or null.
Figure 7.13 (panels (a), (b) and (c)): Distance metric performance measure when using VPN, separate charts. Tracking is
initialised at the beginning of each chart, in order to compare performances with Figure 4.13, and
Figure 4.17.
In practice, Isard and Blake's (1989b) tracker implementation generates particles in three
different ways:
• Some particles are sampled from the particle distribution's prior; these particles are named
condensation samples.
• For other particles, the translation part of the state is sampled from the importance
function, and the deformation part of the state is sampled from the distribution's prior;
these particles are named importance samples.
• Finally, for the rest of the particles, the translation part of the state is sampled from the
importance function, and the deformation part is sampled from a prior distribution
independent from the tracker's history; these particles are named initialisation particles.
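The weighting of a sample drawn from g(X) instead of p(X) can be illustrated with the standard importance-sampling correction p(X)/g(X). The sketch below is one-dimensional and assumes Gaussian forms for both densities; it is not necessarily the exact mixture weighting used in ICondensation, and all names are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_sample(prior_mean, prior_std, g_mean, g_std, likelihood, rng):
    """Draw x from g(X) and weight it so that it still represents the posterior.

    Assumption: p(X) and g(X) are Gaussians; `likelihood` is any
    non-negative measurement function supplied by the caller.
    """
    x = rng.gauss(g_mean, g_std)  # sample where g says the target is
    correction = gaussian_pdf(x, prior_mean, prior_std) / gaussian_pdf(x, g_mean, g_std)
    return x, correction * likelihood(x)


if __name__ == "__main__":
    rng = random.Random(0)
    # The blob detector suggests the target is near 5.0; the prior is centred on 0.0.
    x, w = importance_sample(0.0, 3.0, 5.0, 1.0,
                             lambda v: math.exp(-0.5 * (v - 5.0) ** 2), rng)
    print(x, w)
```

When g(X) equals p(X) the correction is 1 and the scheme reduces to ordinary Condensation sampling from the prior.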
Several methods similar to ICondensation have been used in visual-tracking for both single
and multiple targets (Wu and Huang, 2001; Pérez et al., 2004; Branson and Belongie, 2005).
In this section we propose a method based on ICondensation, which we call Skin Colour
Guided Sampling (SCGS). This method can be used to improve the robustness, and allow for
automatic reinitialisation, of the hand trackers presented in Chapter 4. The technique is based
on ICondensation, as described in (Isard and Blake, 1989b), and shares many of the same
elements. In particular, it shares the concept of condensation particles, importance particles,
and initialisation particles; however, it differs from ICondensation in a number of aspects:
• First, the low-level information comes from a skin coloured blob detection procedure on
the whole image, as opposed to blob tracking.
• Second, only the largest skin coloured blob is considered. This blob is analysed using
moments in order to convey extra information, and a heuristic method is used to calculate
insertion points for the importance samples, and initialisation samples.
• Third, the combination of the low-level information with the contour tracking does not
use an importance function. The insertion points previously calculated are used to
generate particles around them – in form of importance particles and initialisation
particles. These particles use the extra information about the blob in order to initialise a
larger part of their state.
• Finally, ICondensation uses initialisation particles and importance particles
simultaneously with condensation particles during normal tracking. In skin colour guided
sampling, initialisation particles are used in combination with condensation particles only
when the tracking is considered lost – in order to reinitialise the tracking. Importance
particles are used in combination with condensation particles only during
normal tracking – in order to confer robustness against sudden or brisk movements of the
target.
7.4.1 Skin coloured blob detection and analysis
In order to apply SCGS to a video sequence, the first step is to find the skin coloured blobs
for each of the frames of the video sequence. This is achieved using the LC skin colour
classifier, described in Chapter 5, on a decimated frame of size 160x120 pixels. The resulting
skin colour image (binary image) is then processed in order to eliminate small skin coloured
blobs (one iteration of erosion) and connect together nearby skin coloured blobs (two
iterations of dilation). Finally, the connected components of the processed image are found,
and if the largest connected component is larger than a minimum threshold, then it is selected
for analysis. Moment analysis is used in order to find the largest component's centroid.
Figure 7.14(a) shows the skin coloured blobs as white areas; the largest skin coloured blob is
indicated by a green border, and its centroid is indicated by a red circle.
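The clean-up and selection steps above can be sketched in plain Python on a binary skin mask. This is an illustrative sketch only: it assumes a 3x3 structuring element and 8-connectivity (the thesis does not state either), it starts from an already classified mask rather than running the LC classifier, and the function names are hypothetical.

```python
from collections import deque

NEIGHBOURS_8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def _morph(mask, erode):
    """One 3x3 erosion (erode=True) or dilation (erode=False) of a 0/1 grid."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [mask[y + dy][x + dx] if 0 <= y + dy < h and 0 <= x + dx < w else 0
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = int(all(window)) if erode else int(any(window))
    return out


def largest_blob(mask, min_area):
    """Erode once, dilate twice, then return the largest connected component.

    Returns None when the largest component is not larger than min_area.
    The centroid is the first-order moment divided by the area.
    """
    cleaned = _morph(mask, erode=True)
    for _ in range(2):
        cleaned = _morph(cleaned, erode=False)
    h, w = len(cleaned), len(cleaned[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for y in range(h):
        for x in range(w):
            if not cleaned[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue, blob = deque([(y, x)]), []
            while queue:  # breadth-first flood fill of one component
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for dy, dx in NEIGHBOURS_8:
                    ny, nx = cy + dy, cx + dx
                    if 0 <= ny < h and 0 <= nx < w and cleaned[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(blob) > len(best):
                best = blob
    if len(best) <= min_area:
        return None
    area = len(best)
    centroid = (sum(p[1] for p in best) / area, sum(p[0] for p in best) / area)
    return {"area": area, "centroid": centroid, "pixels": best}
```

On a 160x120 mask the pure-Python loops are slow; an image-processing library would normally be used, but the sequence of operations is the same.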
ICondensation as defined in (Isard and Blake, 1989b; Isard, 1998) follows the same steps
outlined above, with slightly different processing of the input image. However, at this point,
the proposed SCGS proceeds differently. ICondensation uses the centroid of the blobs as the
mean for a two-dimensional Gaussian, which constitutes their importance function. An offset
from the mean of the Gaussian and the covariance of the Gaussian are learnt off-line from
previous tracking sequences. The importance function is then sampled in order to insert
importance samples into the Condensation distribution. This approach works well for their
test environment, and image processing steps; however, the approach presents a problem for
this thesis' hand tracking, because of the different image processing, different tracking
environment, and different tracking possibilities. In Figure 7.14(a) we can see a subject in
short sleeves holding his hand open in front of the camera. The centroid of the largest skin
coloured blob is quite far away from the hand contour position. If this centroid were used as
an insertion point for importance samples, the process noise would have to be increased to an
extent at which particles would be too spread out to be effective.
In order to avoid this problem we use a heuristic method for the calculation of the insertion
points. Firstly, the area, major axis, and eccentricity of the largest skin coloured blob are
calculated using moments (Kilian, 2001). Secondly, two insertion points are calculated along
the major axis of the biggest skin coloured blob, as indicated in Figure 7.14(b). Each
insertion point lies at the following distance from the blob's centroid:
Distance to Top Insertion Point = 5 log(blob's eccentricity + 1)
Distance to Bottom Insertion Point = 7.5 log(blob's eccentricity + 1)
The insertion points will constitute the translation component for the importance samples and
initialisation samples. The insertion points are calculated under the assumption that the
biggest blob corresponds to the user's arm or hand. When the user wears short sleeves, one
insertion point (the top one) lies close to the potential hand pivot position, as indicated in
Figure 7.14(b). When the user wears long sleeves, the other (bottom) insertion point lies close
to the potential hand pivot position, as indicated in Figure 7.14(c) and (d). These two situations constitute two
extremes, and the potential hand pivot position will typically fall at some point, on the major
axis of the blob, between these two extremes.
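The insertion point calculation can be sketched as below. Several details are assumptions rather than facts from the thesis: the logarithm is taken as the natural logarithm, the eccentricity is passed in as a precomputed value because its exact definition (Kilian, 2001) is not reproduced here, the major axis angle uses the common second-order central moment formula, and the direction of the "top" end along the axis is chosen by convention. Function names are hypothetical.

```python
import math


def major_axis_angle(pixels):
    """Orientation of the major axis from second-order central moments.

    `pixels` is a list of (x, y) blob coordinates.
    """
    n = len(pixels)
    cx = sum(p[0] for p in pixels) / n
    cy = sum(p[1] for p in pixels) / n
    mu20 = sum((p[0] - cx) ** 2 for p in pixels)
    mu02 = sum((p[1] - cy) ** 2 for p in pixels)
    mu11 = sum((p[0] - cx) * (p[1] - cy) for p in pixels)
    return 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)


def insertion_points(centroid, axis_angle, eccentricity):
    """Return (top, bottom) insertion points along the blob's major axis.

    Distances follow the two formulas above:
    5 log(eccentricity + 1) and 7.5 log(eccentricity + 1).
    """
    cx, cy = centroid
    spread = math.log(eccentricity + 1.0)  # assumption: natural logarithm
    ux, uy = math.cos(axis_angle), math.sin(axis_angle)  # assumption: points to the "top" end
    top = (cx + 5.0 * spread * ux, cy + 5.0 * spread * uy)
    bottom = (cx - 7.5 * spread * ux, cy - 7.5 * spread * uy)
    return top, bottom


if __name__ == "__main__":
    blob = [(x, 0) for x in range(20)]
    print(insertion_points((9.5, 0.0), major_axis_angle(blob), eccentricity=3.0))
```

Because both distances grow with the eccentricity, an elongated blob (arm plus hand) spreads the two insertion points further from the centroid than a compact blob (hand only).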
Figure 7.14 (panels (a) to (d)): Skin coloured blobs. (a) White areas indicate skin colour; the red dot is the centroid of the largest
skin coloured blob. (b) The major axis of the skin coloured blob is indicated with a red line; on either
side of the blob's centroid there is an insertion point, indicated with pink circles. When the subject
wears short sleeves (b), the top insertion point is closer to the real hand pivot. When the subject wears
long sleeves, (c) and (d), the bottom insertion point is closer to the hand pivot.
7.4.2 Combining low-level and high-level information
The insertion points determine the translation component of the initialisation particles. The
angle of the blob's major axis determines the angle of the initialisation particles. The scale of
the initialisation particle is determined using the following formula:
scale = Area2ScaleFactor × Blob's area
where Area2ScaleFactor is the ratio of the hand scale to the largest blob's area, measured at the
first frame of tracking, or at the initialisation frame. The configuration of the fingers is taken
from an initial finger configuration, corresponding to the hand open with splayed fingers.
Figure 7.15(b) illustrates how initialisation samples take the translation, angle and scale from
the blob information. As initialisation samples operate when the lock on the hand is lost, they
make it possible to recover the lock on the hand, for example after the hand disappears and
then reappears in the tracking area.
The insertion points also determine the translation component of the importance samples, but
the scale, angle, and configuration of the fingers are taken from the particle with highest
weight in the previous time-step. Figure 7.15(c) illustrates how importance samples get the
finger configuration of the particle with highest weight from the previous time-step (shown
as a thicker blue contour).
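The difference between the two particle types can be made concrete with a small sketch. The state layout below (translation, angle, scale, finger configuration) is a simplification assumed for illustration, the random spread around the insertion point is omitted, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Particle:
    translation: Tuple[float, float]
    angle: float
    scale: float
    fingers: Tuple[float, ...]  # assumption: one parameter per finger


def initialisation_particle(insertion_point, blob_angle, blob_area,
                            area2scale_factor, open_hand_fingers):
    """Translation, angle and scale come from the blob; fingers from the open-hand pose."""
    return Particle(translation=insertion_point,
                    angle=blob_angle,
                    scale=area2scale_factor * blob_area,
                    fingers=tuple(open_hand_fingers))


def importance_particle(insertion_point, best_previous):
    """Only the translation comes from the blob; the rest is copied from the best particle."""
    return replace(best_previous, translation=insertion_point)


if __name__ == "__main__":
    best = Particle((40.0, 60.0), 0.2, 1.1, (0.0, 0.3, 0.0, 0.0, 0.1))
    print(importance_particle((80.0, 30.0), best))
    print(initialisation_particle((80.0, 30.0), 0.5, 2000.0, 0.0005, (0.0,) * 5))
```

The importance particle keeps everything the tracker has learnt about the hand except its position, whereas the initialisation particle discards the tracking history entirely.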
Figure 7.15 (panels (a), (b) and (c)): Importance samples and initialisation samples. (a) Five importance samples in
magenta, shown together with five initialisation samples in light blue. (b) Initialisation samples take the angle and
scale from the blob information. (c) Importance samples take the fingers' configuration from the
particle with highest weight from the previous time-step.
7.4.3 Use of importance and initialisation particles
ICondensation as defined in (Isard and Blake, 1989b; Isard, 1998) uses initialisation and
importance particles simultaneously with condensation particles during normal tracking. It
was found that the use of initialisation particles during tracking can actually reduce the
tracking performance under certain situations. These situations correspond to those in which
the use of the tracker's dynamical model is more important, for example when a fast steady
motion of the user's hand occurs. Initialisation particles are generated in a way that makes
them likely to have a good fit on the target, and therefore be propagated to the following
time-step. However, initialisation particles do not have a tracking history and therefore they
cannot benefit from the dynamical model. Initialisation particles are generated as still
particles, with zero velocity. As a result, the particles generated from one initialisation
particle will be near (as far as the process noise allows them to spread) to the original
initialisation particle. This halts the tracking and degrades the tracking performance. This
halting of the tracking does not apply to importance samples, as those get the tracking history
from the particle with highest weight in the previous time-step. Importance samples do
contribute to the robustness of tracking when sudden, fast motions of the target
occur.
In order to avoid the problems above, the initialisation particles are used in combination with
condensation particles only when the tracking is lost; and the importance particles are used in
combination with condensation particles only during normal tracking. The rule to detect
when tracking is lost and when tracking is normal uses a threshold, with time delay and
hysteresis, on the weight of the particle with highest weight in the previous time-step:
• When the weight is below Llimit for T consecutive frames, the tracking state is switched
to "lost".
• When the weight is above Hlimit for T frames, the tracking state is switched to "normal".
The values for Llimit, Hlimit and T are found empirically, and they depend on the
measurement model of the tracker, see Section 4.3. This switching of tracking states allows
the tracker to use initialisation particles and importance particles when they are more needed:
initialisation particles when the tracking is lost and needs to be reinitialised; and importance
particles when the tracking is normal, but sudden, fast motions of the target may occur.
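The switching rule can be sketched as a small state machine. The sketch assumes that both conditions require T consecutive frames and that tracking starts in the "normal" state; the class name is hypothetical.

```python
class TrackingStateSwitch:
    """Threshold with time delay and hysteresis on the best particle's weight."""

    def __init__(self, l_limit, h_limit, delay, state="normal"):
        self.l_limit = l_limit
        self.h_limit = h_limit
        self.delay = delay      # T, in frames
        self.state = state
        self.below = 0          # consecutive frames with weight < Llimit
        self.above = 0          # consecutive frames with weight > Hlimit

    def update(self, best_weight):
        """Feed the highest particle weight of the previous time-step; return the state."""
        self.below = self.below + 1 if best_weight < self.l_limit else 0
        self.above = self.above + 1 if best_weight > self.h_limit else 0
        if self.state == "normal" and self.below >= self.delay:
            self.state = "lost"      # start inserting initialisation particles
        elif self.state == "lost" and self.above >= self.delay:
            self.state = "normal"    # go back to inserting importance particles
        return self.state


if __name__ == "__main__":
    switch = TrackingStateSwitch(l_limit=0.3, h_limit=1000.0, delay=3)
    for weight in [1500.0, 0.1, 0.2, 0.1, 50.0, 1200.0, 1300.0, 1400.0]:
        print(weight, switch.update(weight))
```

Weights between Llimit and Hlimit leave the state unchanged, which is what prevents the tracker from oscillating between the two states on borderline frames.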
7.4.4 Reinitialisation test
SCGS has been added to the sweep tracker of Section 4.7, and tested on a video sequence of
a user moving their hand rapidly in and out of the tracking area. The state of the tracking is
switched from normal to lost, and back to normal, following the rule described in the
previous section. When the user withdraws their hand from the tracking area, the tracking
state is switched to lost, and initialisation particles begin to be placed on top of the largest
skin coloured blob. When the user introduces their hand into the tracking area, some
initialisation particles are placed on top of the user's hand, until eventually, the tracking is
switched to normal, and importance particles begin to be placed on top of the largest skin
coloured blob. The number of both initialisation particles and importance particles used
during the tracking is 50; and the values for the tracking state switching rule are: Llimit =
0.3; Hlimit = 1000; T = 3.
Figure 7.16: Reinitialisation test selected frames. Left column: frames 82, 85, 87 and 93; right column: frames 303, 305, 306 and 307.
The output of the tracking sequence for this test is available in Appendix B and on the
supporting webpage as "video_sequence_7.7.avi" for the reinitialisation test showing the
tracking output and the skin colour blobs; and "video_sequence_7.8.avi" for the
reinitialisation test showing the tracking output, the skin colour blobs, the initialisation
particles and the importance particles. Figure 7.16 shows some selected frames from the
tracking sequence. On the left, at frame 82, the user's hand is outside the tracking area; at
frame 85, the user's hand begins to appear in the tracking area; at frame 87, a first lock is
gained on the user's hand; finally, at frame 93, the tracker has a good lock on the user's
hand. On the right, at frame 303, the hand is outside the tracking area; after just five frames
the tracker manages to get a good lock on the user's hand, in frame 307.
7.4.5 Robustness test
The robustness test consists of measuring the tracking performance of the sweep tracker,
Section 4.7, including SCGS, on the test video sequence of Section 4.9. The section from
frames 175 to 359, which corresponds to very brisk rigid hand movement, could not
be tracked successfully using the sweep tracker. Now, this section of the test video sequence
can be tracked successfully thanks to the use of importance particles. The number of
importance particles used during the tracking is 50; and the values for the tracking state
switching rule are: Llimit = 0.003; Hlimit = 1000; T = 3. Note that Llimit has been lowered
in order to prevent initialisation particles from appearing during the brisk motion section of
the test video sequence.
A video of the tracking output is available in Appendix B and on the supporting webpage as
"video_sequence_7.9.avi". Figure 7.17 shows the distance metric performance measure of
the tracking output. Comparing Figure 7.17 with Figure 7.12, it is possible to see that the
distance metric for the section of brisk rigid motion has a slightly worse performance in the
SCGS chart than in the VPN chart. This result is to be expected, as the brisk rigid motion
section involves not only fast translations but also fast changes in the contour's angle and
scale. The importance particles only take the translation component from the skin coloured
blobs; the angle and scale of the particle are taken from the particle with highest weight in the
previous time-step. Hence, they can only handle the fast translations, but not the fast
rotations, and fast changes in scale; for this reason, VPN exhibits slightly better performance
in this section. The other sections of the tracking sequence show similar performance in both
the SCGS chart and the VPN chart.
In order to compare the tracking performance results between the sweep hand tracker with
SCGS, and without SCGS, we repeat the test in three sections: frames 30 to 174; frames 175
to 359; and frames 360 to 890. The tracker is initialised at the beginning of each section.
Results can be seen in Figure 7.18. In the first section, frames 30 to 174, the version with
SCGS has slightly higher average distance metric, although the peaks are shorter, and the
variance is smaller than the version without SCGS, Figure 7.18(a) and Figure 4.13(d). If we
compare it with the VPN version, Figure 7.13(a), we see that both the average distance
metric and the variance are smaller in the SCGS version. The second section, frames 175 to
359, which corresponds to brisk rigid hand motion, can only be compared with the
performance results of the VPN, Figure 7.13(b). The latter has slightly smaller average
distance metric and slightly smaller variance than the SCGS version, for the reasons
mentioned above. In the third section, frames 360 to 890, the SCGS version, Figure 7.18(c),
has smaller average distance metric and smaller variance than the version without SCGS,
Figure 4.17(d), and just slightly smaller average and variance than the VPN version, Figure
7.13(c).
Figure 7.17: Distance metric performance measure when using SCGS. Note that the use of SCGS allows
tracking the test video sequence's section with brisk rigid-hand movement, frames 175 to 359.
Figure 7.18 (panels (a), (b) and (c)): Distance metric performance measure when using SCGS, separate charts. Tracking is
initialised at the beginning of each chart, in order to compare performances with Figure 4.13, and
Figure 4.17.
7.4.6 Conclusions
Skin colour guided sampling as presented in this section enables the automatic reinitialisation
of the tracker when the target hand disappears from, and then reappears in, the tracking area.
It also improves the tracking robustness when fast translations of the hand occur; although
we have seen that the use of VPN, Section 7.3, produced slightly better performance than
SCGS during the brisk rigid motion section of the test video sequence. The implementation
of SCGS uses the LC skin colour classifier in both the blob detection and in the contour
likelihood measurement function.
On the other hand, there is a limitation of SCGS. The limitation appears when the skin
coloured blob corresponding to the user's hand joins other skin colour blobs in the image; the
resulting blob cannot be used to predict the hand parameters accurately; this is illustrated in
Figure 7.19. This is a bigger problem for the initialisation particles than for the importance
particles, as the dependence of the former on the blob information is stronger. This is
illustrated in Figure 7.19(b). The initialisation particles can get an incorrect hand pivot
position, incorrect rotation angle, and incorrect scale. The importance particles can only get
an incorrect hand pivot position.
Figure 7.19 (panels (a) and (b)): Skin coloured blobs mixing. (a) When skin coloured blobs mix, the resulting blob cannot predict
the hand parameters. (b) The position, angle, and scale of the initialisation samples, light blue,
depend entirely on blob information. Only the position of the importance samples, magenta,
depends on the blob information.
7.5 Combining tracking improvements
This section combines, in a single sweep tracker, three of the techniques presented in this
chapter. The result is an improved tracker that produces the best performance results so far,
and is good enough to be used in the applications of Chapter 8. The techniques involved in
this improved tracker are: the template switching methods of Section 7.1, the variable
process noise of Section 7.3, and the skin colour guided sampling of Section 7.4. The
Quasi-random sampling, Section 7.2, is not used in this experiment as it does not bring a
performance improvement.
The improved sweep tracker is run on the test video sequence. Figure 7.20 shows the
distance metric performance measure for the test video sequence. The average distance and
variance are the smallest so far, and the peaks are shorter than those of the sweep tracker with VPN or
SCGS. A video of the tracking output for this experiment is available in Appendix B and on
the supporting webpage as "video_sequence_7.10.avi". In order to compare the tracking
performance between the bare sweep hand tracker and the sweep hand tracker including the
three techniques, the test is repeated in three sections: frames 30 to 174; frames 175 to 359;
and frames 360 to 890. Tracking is initialised at the beginning of each section. Figure 7.21
shows the distance metric results. In all three sections the performance is the best so far.
Figure 7.20: Distance metric for the sweep tracker with combined tracking improvements.
Figure 7.21 (panels (a), (b) and (c)): Distance metric for the sweep tracker with combined tracking improvements, separate
charts.
8 Virtual touch screen
This chapter presents a vision-based interactive surface which has been named the Virtual
Touch Screen (VTS) interface. The VTS attempts to move beyond the traditional mouse and
computer screen interface by generating in the environment a surface that is both active as a
display and as a touch-sensitive pad. The contents of a VTS can be displayed by either the
use of a projector, which projects the contents on a selected surface, or by the use of a see-
through Head Mounted Display (HMD), which displays the contents on the HMD, but
appears to the user to be floating in their field of view. The VTS can be made touch-sensitive
by visually tracking the user's hand and interpreting their hand position and configuration in
order to determine when and where a user's finger touches the VTS.
There are a number of technical challenges in the realisation of the VTS interface, but the
most difficult and challenging element is the visual tracking of the user's hand. This needs to
be able to track not only the user's hand but also its configuration. Articulated hand tracking
of this sort, without the help of hand markers, or special gloves, is currently a very active
area of research (Nolker and Ritter, 1999; MacCormick and Isard, 2000; Shimada et al.,
2001; Zhou and Huang, 2003; Stefanov, 2005; Stenger et al., 2006). In the absence of a
suitable articulated hand tracker that could satisfy the demanding requirements of the VTS, a
specially tailored articulated hand tracker was developed for this purpose. This articulated
hand tracker was developed, improved, and finally presented in previous chapters (Chapters 3
to 7) of this thesis, making this chapter the goal of this thesis.
This chapter starts by describing the concept of a VTS interface, possible configurations, and
proposed operation, using an idealized visual hand tracking technology. Then it moves on to
describing three implementations of the VTS interface, the last two of which make use of the
articulated hand contour tracking developed in this thesis. Finally, a number of potential
applications for the VTS interface are described, and conclusions are drawn.
8.1 The VTS interface
The concept of a VTS interface is analogous to that of a touch sensitive screen. A user can
see information presented on the screen and can directly interact with this information by
touching the screen. In a touch sensitive screen the information is displayed on the screen by
the relevant screen's technology (CRT, LCD, etc) and this screen is made touch sensitive by
adding a transparent touch-sensitive membrane, or alternative technology. In a VTS the
information is displayed by either using a projector, which projects the information on a
selected surface, or by using a see-through Head Mounted Display (HMD), which displays
the contents in the HMD, although these appear to the user to be floating in their field of view.
The VTS is made touch-sensitive by visually tracking the user's hand and interpreting their
hand position and configuration. We will talk about a "Projector based VTS" when the
information in VTS is displayed by using a projector, and we will talk about "HMD based
VTS" when the information in the VTS is displayed by using a see-through HMD.
Leaving aside the implementation issues related to the development of such an interface, a
technological requirement for the VTS interface to work is that the field of view between the
camera and the user's hand has to be clear. Similarly, if a projector is used in order to display
the VTS contents, the field of view between the projector and the screen has to be clear.
These requirements leave us with a number of possible interface element configurations:
Projector based VTS:
• Use of a front projector and an opaque screen in order to display the VTS information. A
camera placed on the same side of the screen as the projector, captures the user's hand
from behind. The set camera/projector/screen could be tilted to suit the user's
preferences. This configuration (illustrated in Figure 8.1(a)) requires tracking of the back
of the hand.
• Use of a rear projector and a diffuse screen in order to display the VTS information. A
wearable camera (for example, carried by the user on the shoulder) captures the user's
hand from behind. This configuration (illustrated in Figure 8.1(b)) requires tracking of
the back of the hand.
• Use of a rear projector and a diffuse screen in order to display the VTS information. A
camera placed on top of the VTS captures the user's hand from behind. The set
camera/screen/projector could be tilted to suit the user's preferences. This configuration
(illustrated in Figure 8.1(c)) requires tracking of the back of the hand.
• Use of a rear projector and a transparent screen, such as the commercially available DNP
HoloScreen (HoloScreen display material supports video projection and is nearly
transparent to IR and visible light (DNP, 2004)) in order to display the VTS information.
A camera behind the VTS captures the user's hand through the VTS (palm view). This
configuration (illustrated in Figure 8.1(d)) requires tracking of the front of the hand.
HMD based VTS:
• Use of a see-through HMD in order to display the VTS information as floating in the
user's environment. A camera placed behind the VTS captures the user's hand from its
front. This configuration (illustrated in Figure 8.1(e)) requires tracking of the front of the
hand.
• Use of a see-through HMD in order to display the VTS information as floating in the
user's environment. A camera placed on the HMD (allowing the implementation of
video see-through) captures the user's hand from behind. This configuration (illustrated
in Figure 8.1(f)) requires tracking of the back of the hand.
Figure 8.1: Six possible interface element configurations for the VTS. (a) Opaque screen + front projector. (b), (c) Diffuse
screen + rear projector. (d) Transparent screen (such as DNP HoloScreen) + rear projector. (e), (f) See-through HMD.
(a), (b), (c) and (d) are projector based VTSs. (e) and (f) are HMD based VTSs.
8.1.1 Hand tracking
Hand tracking is crucial in the development of the VTS interface. The hand tracking used in
a VTS has to be able to track the hand position and configuration throughout a video
sequence. It also has to be able to identify actions relevant to the VTS use. These actions are
primarily clicking with a finger or dragging a finger on the VTS surface. The accuracy and
repeatability with which these two actions (clicking and dragging) can be detected
will greatly determine the overall reliability of the VTS. Another important element of the
hand tracking is its real-time operation: fast hand tracking will result in a responsive VTS
interaction, while slow hand tracking will result in a non-responsive, uncomfortable user
experience.
Full 3D articulated hand tracking is currently an active, and challenging, area of research in
the computer vision community (see literature review in Chapter 2). The challenge is even
bigger when using a single camera. However, the 2D nature of the VTS interaction does not
require full 3D articulated hand tracking. Considering this fact, a 2D articulated hand contour
tracking was proposed and developed in this thesis specially for the VTS. This hand contour
tracking requires the user to maintain their hand mostly parallel to the VTS. However, the
hand contour tracking can be very flexible, allowing real-time tracking of the user's hand in a
range of orientations both from the front (palm view) and from the back. The hand contour
tracking developed in this thesis was specially designed to handle the set of articulated
movements that are expected to happen when a user operates a VTS, in particular, fast
flexion/extension of the fingers, which can correspond to clicks or finger presses on the VTS
(see Section 4.6).
8.1.2 Operation
The operation of a VTS interface is similar to that of a touch sensitive screen. As in a touch
sensitive screen, the VTS user can directly click or drag information elements as they appear
on the display. However, due to the use of visual hand tracking technology, and the lack of
haptic feedback when the user clicks or drags an object on the VTS, the operation of a VTS
interface can be expected to be different to that of a touch sensitive screen.
The hand tracking technology may need an initialisation step before starting the tracking of
the user's hand. The purpose of this initialisation step is to gather information about the user's
hand (while this is in a predetermined configuration) in order to make the hand tracking
more specific to that particular hand. Examples of tracking parameters that could be set
during the initialisation step to match a particular user's hand include: initial hand position
and configuration, particular skin tone of the hand, particular hand shape, current
illumination levels, and estimation of a kinematic model for the hand.
In order to incorporate this initialisation step into the usage of the VTS, the following
operation stages are proposed: (each of these stages can be indicated to the user by means of
visual or audio cues)
• Firstly, a hand shape can be displayed on the VTS in order to indicate to the user that the
VTS is ready to be initialised by a particular user. This stage is illustrated in Figure
8.2(a).
• Secondly, the user must place their hand inside the hand shape. At this moment the
system will detect the user's hand and tune a number of tracking parameters to this user.
This stage is illustrated in Figure 8.2(b).
• Thirdly, the hand shape disappears and it is replaced by the VTS display. At this point,
the user can start operating the VTS interface. This stage is illustrated in Figure 8.2(c).
Figure 8.2: Proposed VTS operation, stages (a), (b) and (c).
Due to the hand tracking technology and the lack of haptic feedback, the VTS operation may
require the user to click or drag objects in a certain way. For example, the user may have to
operate the VTS from a certain distance, and may have to flex their fingers with a certain
speed and intensity in order for clicks to be recognized. A number of visual or audio cues can
be given in order to help with this. For example, the tracked hand can be displayed in a
different colour when it is held at a certain distance from the VTS (indicating the VTS can be
operated from that distance). Then different audio and visual feedback can be given
depending on whether a click or a drag on the VTS is detected.
8.1.3 Usability
The VTS interface has the potential to enable a large number of applications. However, a
question arises about the usability of the VTS interface – the ease with which people can
employ the interface. This includes perception of the interface, learning curve, postural
comfort, etc. There are various points to take into account when evaluating the usability of
the VTS interface. The first point to think about is how comfortable it is for the user to operate
a VTS interface.
Kölsch (2004) studied the postural comfort for HCI systems similar to the projector based
VTS and to the HMD based VTS. The study resulted in the definition and mapping of certain
"comfort zones" for various types of single-handed interaction while standing. The resulting
comfort zone for hand placement is within a half moon-shaped area about 35 to 45
centimetres from the shoulder joint, at an angular range from 70 degrees adduction to 50
degrees abduction (away from the body centre). The study did not investigate the wrist angle
within that comfort zone. However, various studies in ergonomics, mainly focused on
preventing Carpal Tunnel Syndrome while typing, using a mouse, or performing other work
related tasks, suggest that the wrist angle should not exceed 20° extended, nor be bent to either
side (Bach et al., 1997; Wellman et al., 2004). On the other hand, Sears (1991) studied the use of touch
screen keyboards at various tilt angles and with various key sizes. Three tilt angles over the
horizontal were studied: 30, 45, and 75 degrees. The study concluded that 75 degrees resulted in
more fatigue and lower preference ratings, and 30 degrees resulted in less fatigue and higher
preference.
From the number of suggested applications it is possible to see that the VTS interface, either
projector based or HMD based, can be deployed in a large range of orientations. The
only major restriction is that the field of view between the camera and the user's hand (and
between the projector and the VTS in the case of a projector based VTS) must be clear. One
possible way to guarantee a comfortable interaction with a projector based VTS is to deploy
the screen at the correct height and tilt angle. This can allow a standing user to operate the
VTS with their hand in the comfort zone and with their wrist angle not exceeding 20° of
extension. In the HMD based VTS, if the VTS is HMD stabilized, the user can look slightly
downwards so that the whole field of view is in the comfort zone, although this is the user's
choice (the system can be operated outside the comfort zone if necessary). Alternatives to
this mode of operation in the HMD based VTS are possible by changing the camera's
position and orientation. For example, in HandVu (Kölsch, 2004) the camera is mounted on
the top of the HMD and pointed slightly downwards; in this way the user can look forward
and at the same time operate the HandVu system while keeping their hand in the comfort
zone. On the other hand, in a HMD based VTS the interface elements could potentially be
displayed with perspective, and the click and drag detection could be adapted to work on the
new tilted VTS. This new configuration could offer enough flexibility so that the user can
always operate the VTS in their hand comfort zone.
Another point to take into account when evaluating the usability of the VTS interface is the
type of use this interface would have. The use of a VTS interface is analogous to that of a
touch sensitive screen, and touch sensitive screens are better suited to information systems
with limited data entry (Gleeson et al., 2004). A study made by Sears (1991) shows the
average words per minute that a user can input using a touch-screen keyboard (25 WPM), a
mouse activated keyboard (17 WPM), and a physical keyboard (53 WPM). He concludes
that where a large amount of data entry is required a keyboard is necessary. This suggests
that the best use of a VTS would be for interactive retrieval of information, like selecting
icons, items from menus, pointing at and dragging of objects, and input of short strings.
Another point to consider when evaluating the usability of the VTS interface is the learning
curve to use the interface. On the one hand, as the click and drag detection mechanisms of a
VTS are vision based, the detection of a particular click or a drag will depend on the
visibility (from the camera's point of view) of the acting finger at that moment in time. Self-
occlusions between fingers and the hand can hide some legitimate clicks on the VTS. This
will require the user to adapt to a particular way of clicking and dragging. This adaptation
will be related to the hand tracking technology. A given hand tracking technology may
impose harder movement restrictions (palm and fingers orientations and movements) than
another, resulting in a longer learning curve for the use of the interface. On the other hand,
the lack of haptic feedback may result by itself in longer learning curves. The lack of haptic
feedback could be alleviated by maximizing acoustic and visual cues.
Finally, in the case of the HMD based VTS, there is another point to consider when
evaluating its usability: simulator sickness (Heider, 1998; Mollenhauer, 2004).
Simulator sickness is a condition where a person exhibits symptoms similar to motion
sickness caused by prolonged use of a HMD. So far, HMD technology has always involved
some degree of simulator sickness due to various reasons. These include visual aspects of the
HMD such as the time lag in the presentation of information, magnification level, field of
view, optics, etc.; and other hardware aspects such as size and weight of the HMD, HMD
fitting, etc. However, HMD technology is improving very rapidly, and this will result in
higher resolutions, lower weights, better optics, and reduced simulator sickness. Even now,
there are monocular HMD models with resolutions of 680x400 that can be attached to a pair
of glasses, and they weigh as little as 35 grams (SV-6 PC viewer (MicropOptical, 2005)).
There are even some HMD manufacturers that claim to have eliminated "cybersickness"
(LightVu (Mirage Innovations, 2006)). It is the opinion of this author that in the near future,
improvements in HMD technology will make HMDs transparent and comfortable to the
users and will largely eliminate simulator sickness effects. This will result in a
widespread use of HMDs and, consequently, in the potential popularity of HMD based VTS
interfaces.
8.2 Implementations
Three VTS generations have been implemented in this thesis. The first generation VTS
demonstrates the VTS concept with the implementation of a VTS based keypad. In this
generation, the user's hand is tracked from the front (palm view) by using simple image
processing techniques, which require the use of a black background. The second generation
VTS uses the articulated hand contour tracking developed in chapters 3 to 7. In this
generation, the user's hand is tracked from its front, on an arbitrary background, and fewer
hand motion restrictions are imposed than in the first generation VTS. Finally, the third
generation VTS uses the same tracker as the second generation one but with some minor
modifications. This generation is a HMD based VTS. The camera that captures the user's
hand is mounted on a HMD, thus the hand is captured from its back. The HMD allows the
user to see the VTS contents floating in their field of view. Next, the three VTS generations
will be described in detail.
8.2.1 Projector based VTS (First Generation)
The first implementation of a projector based VTS was used to demonstrate the possibility of
detecting key-presses or clicks by using a digital camera and image processing. In this early
version of the VTS the background is black in order to simplify the image processing. The
output of the VTS appears on the computer's screen and the user has to type on a frame
(where a VTS keypad is supposed to exist). This frame has a grid of threads that indicates
where the keypad keys are. The set up of the system is illustrated in Figure 8.3.
Figure 8.3: Set up of the first generation VTS.
Figure 8.4: Image processing for the first generation VTS.
Using simple image processing techniques the finger-tips and finger-valleys of the user's
hand are detected (as indicated in Figure 8.4). The projected lengths of the fingers can be
calculated from these hand features. The lengths of the fingers are continuously monitored in
order to detect changes in length that could be identified as key-presses. When the user's
hand is close enough to the VTS and one of these length changes happens, the final position
of the fingertip, before the finger recovers back to the rest position, is checked. If this
position is inside the area of a key, a key-press is recognized. Figure 8.5 shows the key-press
detections from a video sequence of a user typing a telephone number. A video sequence
showing the operation of this first generation VTS is available in Appendix B and on the
supporting webpage as "video_sequence_8.1.avi".
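The key-press logic described above can be sketched as follows. This is only an illustrative sketch, not the thesis implementation: the keypad layout, the `Key` and `make_keypad` helpers, the `rest_length` argument and the `min_shortening` value are all hypothetical, and the check that the hand is close enough to the VTS is omitted. The frame of greatest flexion is taken as the fingertip position "before the finger recovers back to the rest position".

```python
from dataclasses import dataclass


@dataclass
class Key:
    label: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x, y):
        return self.x0 <= x < self.x1 and self.y0 <= y < self.y1


def make_keypad(key_size=40.0):
    """Hypothetical 3x4 telephone keypad laid out in image coordinates."""
    keys = []
    for i, label in enumerate("123456789*0#"):
        row, col = divmod(i, 3)
        x0, y0 = col * key_size, row * key_size
        keys.append(Key(label, x0, y0, x0 + key_size, y0 + key_size))
    return keys


def detect_key_press(lengths, tips, keys, rest_length, min_shortening=0.2):
    """Return the label of the pressed key, or None.

    lengths: projected finger length in each frame of one flex/recover gesture
    tips:    fingertip (x, y) in each frame
    """
    # Frame of greatest flexion, used here as the last frame before recovery.
    frame = min(range(len(lengths)), key=lambda i: lengths[i])
    if (rest_length - lengths[frame]) / rest_length < min_shortening:
        return None  # change in length too small to count as a key-press
    x, y = tips[frame]
    for key in keys:
        if key.contains(x, y):
            return key.label
    return None
```

In a real system the thresholds would have to be tuned to the camera set up and to the units of the measured finger length.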
Figure 8.5: Typing a telephone number on the first generation VTS. In the chart the horizontal axis is the
frame number inside the sequence, and the vertical axis is the estimated finger's length.
This first VTS generation demonstrates the basic ideas of a VTS and, at the same time,
has to deal with the same functional blocks as future VTS implementations, namely interface
initialisation and touch detection. However, in order to implement a more flexible and
reliable working VTS a different hand tracking technology is needed. Articulated hand
contour tracking is the choice for the next VTS generations.
8.2.2 Interface initialisation
In Section 8.1.2 the initialisation sequence of a VTS was introduced. This initialisation
involves the user placing their hand on a hand shaped contour in order for the tracker to
initialise a number of parameters for the hand tracking. This initialisation sequence is needed
because of the hand tracking technology. This section refers to the initialisation of the
articulated hand contour tracker developed in this thesis for the VTS (chapters 3 to 7). The
initialisation sequence for this hand contour tracking technology involves three states which
are indicated to the user by the colour of the hand contour:
• The first state (equivalent to the stage of Figure 8.2(a)) is indicated with a red hand contour.
In this state the hand contour is static in the centre of the field of view, and it will remain
in this state until a user places their hand with splayed fingers on top of the hand template.
This state is illustrated in Figure 8.6(a).
• The second state (equivalent to the stage of Figure 8.2(b)) is indicated with a green hand
contour. During this state the user's hand contour is tracked. However, only the global hand
motion and the abduction/adduction of the fingers are tracked. The flexion/extension of the
fingers is not tracked yet because the projected length of the fingers (when fully
extended) may need to be calculated. The hand tracking will remain in this state until the
location of the hand is considered good enough to perform the hand initialisation. In
order to tell when the hand location is good enough, the score of the tracked hand contour
is monitored. When this score is above a certain threshold for 10 consecutive frames the
tracking state switches to the third state. The frame and hand contour configuration that
produced the best score in those 10 consecutive frames is used for the initialisation of
some parameters, such as tuning of the skin colour model, while the frame and hand
contour configuration at the moment of switching to the third state is used for other
parameters, such as the initial tracking position. This state is illustrated in Figure 8.6(b).
• The third state (equivalent to the stage of Figure 8.2(c)) is indicated with a blue hand contour.
When the tracking switches to this state, a number of parameters are initialised. During
this state the tracker performs a fully articulated contour tracking of the user's hand. The
tracking will continue in this state for as long as the location of the hand contour is good
enough. If the output of the hand contour tracker has a low fitness for more than three
consecutive frames, the location of the tracked hand is considered lost, at which point the
state of the tracker goes back to the first state (red hand contour) waiting for a new
initialisation. This state is illustrated in Figure 8.6(c).
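The three tracking states listed above behave like a small state machine, which can be sketched as follows. This is a minimal sketch under stated assumptions: the class and attribute names, the `hand_on_template` input and the score threshold value are hypothetical, and the same threshold is assumed for "good enough" during initialisation and "low fitness" during full tracking, which the text does not specify.

```python
from enum import Enum


class TrackerState(Enum):
    WAITING = "red"    # static template, waiting for a hand
    PARTIAL = "green"  # global motion and abduction/adduction only
    FULL = "blue"      # fully articulated contour tracking


class InitialisationStateMachine:
    def __init__(self, score_threshold=0.7, good_frames_needed=10,
                 lost_frames_allowed=3):
        self.score_threshold = score_threshold      # assumed value
        self.good_frames_needed = good_frames_needed
        self.lost_frames_allowed = lost_frames_allowed
        self.state = TrackerState.WAITING
        self.frame_index = -1
        self.good_run = []       # (score, frame) for the current run of good frames
        self.low_run = 0         # consecutive low-fitness frames in FULL state
        self.best_frame = None   # frame used, e.g., to tune the skin colour model
        self.start_frame = None  # frame used for the initial tracking position

    def update(self, hand_on_template, score):
        self.frame_index += 1
        if self.state is TrackerState.WAITING:
            if hand_on_template:
                self.state = TrackerState.PARTIAL
                self.good_run = []
        elif self.state is TrackerState.PARTIAL:
            if score > self.score_threshold:
                self.good_run.append((score, self.frame_index))
            else:
                self.good_run = []  # the run of good frames is broken
            if len(self.good_run) >= self.good_frames_needed:
                self.best_frame = max(self.good_run)[1]
                self.start_frame = self.frame_index
                self.low_run = 0
                self.state = TrackerState.FULL
        else:  # TrackerState.FULL
            self.low_run = 0 if score > self.score_threshold else self.low_run + 1
            if self.low_run > self.lost_frames_allowed:
                self.state = TrackerState.WAITING  # hand lost, re-initialise
        return self.state
```

The tracker would call `update` once per frame and colour the displayed contour according to the returned state.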
Figure 8.6: Initialisation states of the VTS hand contour tracker. (a) First tracking state, waiting for
initialisation. (b) Second tracking state, partial articulated tracking. (c) Third tracking state, full
articulated tracking.
When the tracker switches from state 2 (green hand contour) to state 3 (blue hand contour)
the initialisation of a number of parameters occurs. These parameters are listed next:
Initial tracking position
As the initialisation can only occur if the fitness of the hand contour template is high, the
point at which the initialisation occurs is a good point to start fully articulated tracking. The
configuration of the hand contour at that point is then used as the initial tracking position.
Skin colour tone of the user's hand
The hand contour tracker uses the LC skin colour classifier described in Chapter 5. This skin
colour classifier can be tuned (as described in Section 5.3) to the particular user's skin tone
during the VTS initialisation. The hand contour tracker can generate the masks necessary for
the LC classifier tuning at this point. The configuration of the hand contour that has best
fitness in the last 10 frames previous to this point is used for the tuning of the LC classifier.
The tuning of the LC classifier in this manner assumes that the lighting conditions during
hand tracking will be more or less the same as during initialisation.
Estimation of a kinematic model
A kinematic model of the user's hand could be used in order to determine the configuration
of the hand from a 2D image, although a kinematic model of just the fingers would be
enough to calculate the separation of a fingertip from the palm plane. The separation of
a finger from the palm plane together with the estimated distance between the palm and the
VTS could be used to detect clicks on the VTS surface. In order to use such a kinematic
model the length of each finger segment needs to be known. If the length of the segments is
not known beforehand, it could be estimated from the hand tracking at the initialisation
stage. A possible simple method produced for this purpose involves estimating the length of
the finger segments from the finger flexion creases in a frontal image (palm view) of the
user's hand taken at the initialisation point. The method involves scanning the pixels of a line
drawn along the finger and finding the flexion creases as minima in the R channel (from an
RGB triad). Once the flexion creases are found in the image, the position of the joints in the
finger can be calculated using certain correction offsets. As the Index, Middle, Ring, and
Little fingers all have three segments and three joints, the procedure is the same for these
four fingers. Figure 8.7 illustrates the procedure to estimate the length of the finger segments
during the initialisation stage. The method has not been fully tested, and it is expected that
illumination changes can affect the detected position for the flexion creases; however, it
produces some initial results that can be used with a finger kinematic model in order to
detect clicks on the VTS surface.
Figure 8.7: Finding finger creases. (top) Sampling line along the index finger. (bottom) Samples of the line
showing three local minima which correspond to three finger flexion creases.
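The crease search can be illustrated with a short sketch that samples the red channel along a line and keeps the deepest local minima. The function names, the image layout (`image[y][x]` holding an `(r, g, b)` tuple) and the `min_separation` parameter are assumptions made for this example; the correction offsets mentioned in the text are not modelled, and a real image would most likely need smoothing before the search.

```python
def sample_line(image, start, end, n_samples):
    """Red-channel values along a straight line from start to end (x, y)."""
    (x0, y0), (x1, y1) = start, end
    samples = []
    for i in range(n_samples):
        t = i / (n_samples - 1)
        x = round(x0 + t * (x1 - x0))
        y = round(y0 + t * (y1 - y0))
        samples.append(image[y][x][0])  # index 0 is the R channel
    return samples


def find_creases(red_profile, n_creases=3, min_separation=5):
    """Indices of the n deepest local minima, ordered along the line."""
    p = red_profile
    minima = [i for i in range(1, len(p) - 1)
              if p[i] < p[i - 1] and p[i] <= p[i + 1]]
    minima.sort(key=lambda i: p[i])  # darkest (deepest) first
    chosen = []
    for i in minima:
        # Keep a minimum only if it is far enough from those already kept.
        if all(abs(i - j) >= min_separation for j in chosen):
            chosen.append(i)
        if len(chosen) == n_creases:
            break
    return sorted(chosen)


def segment_lengths(crease_indices):
    """Distances (in samples) between consecutive creases."""
    return [b - a for a, b in zip(crease_indices, crease_indices[1:])]
```

The distances returned by `segment_lengths` are in samples along the line and would still have to be converted to image units and corrected to obtain joint positions.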
Shape of the hand
The shape of a hand contour may vary from user to user. These variations may produce a
poor fit of the hand template to some users; and a poor fit results in a poor tracking
performance. A possible improvement to the hand tracking would involve tailoring a hand
contour template for each particular user's hand. The best moment to create this template
would be the initialisation stage. A model-free method to find the user's hand contour, such as
Snakes (Kass et al., 1987), could be used, and from the found hand contour a new hand
template could be created. This method could make the hand tracker robust to different users.
8.2.3 Touch detection
The hand contour tracking technology used in the VTS makes it possible to know what the
state of the hand is at every frame of the input video sequence. This hand state has to be
interpreted in order to determine when a user's fingertip touches the VTS surface (this is also
referred to as a finger click on the VTS). Touch detection refers to the method used in order to
determine this event. When considering a touch detection method, it is important to
remember that the hand contour tracking used in the VTS requires the user to keep their hand
approximately parallel to the camera's image plane (the VTS plane). With this in mind, a
number of methods for touch detection are possible:
Kinematic model
One possibility for interpreting the hand contour is to use a kinematic model of the fingers.
In the previous section a method of finding the lengths of the finger segments was suggested.
Once the length of the finger segments is calculated using this method, the kinematic model
of the user's fingers can be used in order to calculate the separation of a fingertip from the
palm plane. This separation, in combination with the estimated distance between the hand
and the VTS, can be used to determine when a finger is touching the VTS. The procedure to
find the separation of the fingertip from the palm plane involves the calculation of the
reverse kinematics of a chain of three links (the finger segments). The input of the procedure
is the 2D projected length of the finger; the output is the angles of the joints for that finger.
This process is fully explained in Appendix A.1. Figure 8.8 shows (on the left) a hand
flexing the middle finger. The calculation of the reverse kinematics allows us to find the
configuration of the finger joints (on the right) and, therefore, the separation of the fingertip
from the palm plane. A video sequence of a hand tracker that both calculates the finger
segment lengths from the finger flexion creases, and then calculates the reverse kinematics is
available in Appendix B and on the supporting webpage as "video_sequence_8.2.avi".
Figure 8.8: Hand undergoing flexion of middle finger. A kinematic model for the fingers allows the calculation of
the stick-out of the finger from the palm of the hand.
The accuracy of this method depends upon two factors: the accuracy with which the lengths of
the finger segments are calculated; and the assumption that the flexion/extension of the
fingers follows a typical profile (as described in Appendix A.1). Ultimately an uncertainty
band is necessary. If the combination of the finger separation from the palm plane and the
distance between the hand and the VTS is inside this uncertainty band, a finger click on the
VTS is detected.
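As an illustration of the idea, the sketch below recovers joint angles and fingertip stick-out from a 2D projected finger length. It does not reproduce the procedure of Appendix A.1: it assumes a hypothetical flexion profile in which the three joint angles are fixed multiples (`ratios`) of a single flexion parameter, and solves for that parameter by bisection. The segment lengths, ratios and uncertainty band are illustrative values only.

```python
import math


def forward(segments, ratios, t):
    """Projected length and stick-out of a three-link finger for flexion t."""
    angle = proj = out = 0.0
    for length, ratio in zip(segments, ratios):
        angle += ratio * t                # cumulative flexion angle (radians)
        proj += length * math.cos(angle)  # component in the palm plane
        out += length * math.sin(angle)   # component out of the palm plane
    return proj, out


def solve_flexion(segments, projected_length, ratios=(0.5, 1.0, 2 / 3),
                  iterations=50):
    """Return (joint_angles, stick_out) matching the projected length."""
    lo = 0.0
    # Keep cumulative angles within [0, pi] so the projection is monotonic.
    hi = min(math.pi / 2, math.pi / sum(ratios))
    min_proj = forward(segments, ratios, hi)[0]
    target = max(min_proj, min(sum(segments), projected_length))
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if forward(segments, ratios, mid)[0] > target:
            lo = mid  # finger not flexed enough yet
        else:
            hi = mid
    t = (lo + hi) / 2
    return [r * t for r in ratios], forward(segments, ratios, t)[1]


def is_click(stick_out, palm_to_vts_distance, band):
    """Click if the fingertip reaches the VTS within an uncertainty band."""
    return abs(palm_to_vts_distance - stick_out) <= band
```

For example, `solve_flexion((40.0, 25.0, 20.0), 70.0)` returns the three assumed joint angles and the corresponding stick-out, which `is_click` then compares with the estimated palm-to-VTS distance.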
Thresholds
All the processes required in order to detect a finger click on the VTS using a kinematic
model for the fingers can be greatly simplified by using thresholds. A combination of
multiple thresholds involving the 2D projected length of the fingers for each of the distances
between the hand and the VTS can effectively do the same job as the kinematic model for
the fingers. These thresholds can be calculated from the kinematic model during the
initialisation stage, and then be stored in the form of a lookup table. During tracking the lookup
table with the thresholds is continuously tested in order to detect finger clicks on the VTS.
This is several times faster than calculating the reverse kinematics. A further simplification
using thresholds involves the use of a single threshold for the distance between the hand and
the VTS (triggered when the hand is near enough to the VTS), and a single threshold for the
2D projected length of the finger (triggered when the finger is flexed beyond a point).
Contour trackers based on particle filters tend to exhibit a certain degree of jitter in their
output contours. The same is true for the articulated hand contour tracker developed in this
thesis for the VTS. This fact, together with variations in illumination of the fingers, results in a
variable measurement of the 2D projected length of the finger (which corresponds directly to
the finger length parameter in the contour's state), and consequently potentially incorrect
finger click detections on the VTS. One way of making the touch detection more robust to
these finger length variations is to consider the rate of change of the finger length. Thus, a
simple and relatively robust method for touch detection involves three thresholds: one for the
distance between the hand palm and the VTS; another for the finger length; and finally
another for the rate of change of the finger length. This touch detection method relies on the
user touching the VTS surface in a particular way, which is determined by the three
thresholds.
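The three-threshold test can be written as a single predicate. The sketch below is only illustrative: the function and parameter names are hypothetical, the rate of change is approximated by the shortening between two consecutive frames, and the threshold values would have to be found for a particular set up.

```python
def threshold_click(palm_distance, finger_length, prev_finger_length,
                    max_palm_distance, max_finger_length, min_shortening_rate):
    """Three-threshold touch test on one finger for one frame."""
    shortening_rate = prev_finger_length - finger_length  # per frame
    return (palm_distance <= max_palm_distance            # hand near the VTS
            and finger_length <= max_finger_length        # finger flexed enough
            and shortening_rate >= min_shortening_rate)   # and flexed quickly
```

A slow flexion that crosses the length threshold, or jitter that shortens the finger only slightly, fails at least one of the three conditions and does not trigger a click.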
Moving threshold
A touch detection method that is both more reliable and lessens the constraints imposed on
the way of clicking on the VTS (constraints on the amount of finger flexion and speed of the
flexion) is based on a moving threshold. The proposed method uses an Exponentially
Weighted Moving Average (EWMA) (NIST, 2006) of the finger length, and from this
EWMA a Lower Control Limit (LCL) is calculated. If the length of the finger becomes
smaller than the current LCL, then a click on the VTS is triggered. The EWMA and the LCL
are calculated as follows:
$$\mu = \alpha x + (1 - \alpha)\,\mu_{-1} \tag{8.1}$$
$$\mathrm{LCL} = \mu - k\,\sigma^{2} \tag{8.2}$$
where $x$ is the current finger length, $\mu$ and $\mu_{-1}$ are the EWMAs of the finger length for the
current and previous time-steps respectively, $\alpha$ is the degree of filtering, $\sigma^{2}$ is the variance
of the finger length, and $k$ is a constant that modulates the distance between the EWMA and
the LCL. The parameter $\alpha$ can vary from 0 to 1. If $\alpha$ is near 1, the filtering of $x$ is weak. If
$\alpha$ is near 0, the filtering of $x$ is strong. The larger the constant $k$, the larger the amount of
finger flexion required to trigger a finger click, and the longer the duration within which a finger
flexion has to be performed (slower finger clicks are possible). The values for $\alpha$, $k$, and $\sigma^{2}$
have been found empirically, resulting in $\alpha = 0.65$, $k = 1.2$, and $\sigma^{2} = 0.1$. Figure 8.9 shows the
finger length, x, and the LCL for the index finger during a video sequence of interaction with
a VTS. The vertical axis is the finger length. The horizontal axis is the frame number.
Arrows indicate the points at which a finger click on the VTS is detected.
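Equations (8.1) and (8.2) translate directly into a few lines of code. In the sketch below the class name and the choice of starting the average at the first sample are assumptions, and the comparison is made against the LCL computed after the EWMA has been updated with the current sample, which is one possible reading of "the current LCL".

```python
class MovingThreshold:
    """EWMA of the finger length with a lower control limit (LCL)."""

    def __init__(self, alpha=0.65, k=1.2, variance=0.1):
        self.alpha = alpha        # degree of filtering, between 0 and 1
        self.k = k                # distance between the EWMA and the LCL
        self.variance = variance  # variance of the finger length
        self.mu = None            # EWMA of the finger length

    def update(self, x):
        """Feed the current finger length; return (lcl, below_lcl)."""
        if self.mu is None:
            self.mu = x  # assumed: start the average at the first sample
        else:
            self.mu = self.alpha * x + (1.0 - self.alpha) * self.mu  # (8.1)
        lcl = self.mu - self.k * self.variance                       # (8.2)
        return lcl, x < lcl
```

With the empirical values from the text, a finger length that drops from 1.0 to 0.5 in one frame gives an EWMA of 0.675 and an LCL of 0.555, so the sample falls below the LCL; a drop from 1.0 to 0.98 does not.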
Figure 8.9: Touch detection using a moving threshold. The vertical axis is the finger length (parameter in the
hand contour state). The horizontal axis is the frame number from a video sequence of interaction
with the VTS. Arrows indicate the points at which a finger click on the VTS is detected.
Debouncing
The term debouncing is used here as an analogy with keyboard technology. In keyboard
technology, debouncing refers to the filtering of spurious electric signals just before and after
a key is pressed and released. This filtering prevents the detection of multiple key presses
when only one has occurred. In the VTS the concept refers to the filtering of the finger length
against sudden and brief changes, which could be produced by the contour jitter rather than
from a finger click. The technique involves waiting for two frames before triggering a finger
click, and two frames before releasing the finger click. During these two frames the finger
length has to be either below LCL (for key press) or above LCL (for key release). This same
procedure is used in keyboards for debouncing purposes.
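A debouncer of this kind can be sketched as a small class that only changes state after the input has disagreed with the current state for a number of consecutive frames. The class name and interface are hypothetical, and "two frames" is interpreted here as two consecutive frames in the new condition. The first frame of each run is recorded because it is the natural frame from which to read the fingertip position.

```python
class Debouncer:
    """Confirm a press or release only after `frames` consecutive frames."""

    def __init__(self, frames=2):
        self.frames = frames
        self.pressed = False
        self.run = 0             # consecutive frames disagreeing with the state
        self.first_frame = None  # frame at which the current run started

    def update(self, below_lcl, frame_index):
        """Return 'press', 'release' or None for this frame."""
        if below_lcl == self.pressed:
            self.run = 0  # input agrees with the current state, nothing pending
            return None
        if self.run == 0:
            self.first_frame = frame_index
        self.run += 1
        if self.run >= self.frames:
            self.pressed = below_lcl
            self.run = 0
            return "press" if self.pressed else "release"
        return None
```

A single frame below the LCL caused by contour jitter resets the run on the next frame and never produces a press event.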
Detecting the position of a finger click
Once a finger click is triggered, the position in the VTS where that event happened can be
calculated from the state of the hand contour. However, there are two points to take into
account when calculating this position. Firstly, the debouncing mechanism triggers a finger
click after the finger length is below LCL for two consecutive frames, but the finger click
position must be calculated using the length of the finger in the first of these two frames.
Secondly, the calculated finger click position has to be corrected with a small vertical offset.
Sears (1991) studied the ergonomics of touch-screen keyboards and the effect of the key size
in their use. He reported that subjects consistently touched below targets. This phenomenon
has also been observed in a VTS based keypad. This is the reason why the calculated finger
click position is corrected with a small vertical offset. The size of this offset depends on the
size of the target and may be different for each particular VTS interface.
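The two corrections can be combined in one helper. The sketch assumes VTS coordinates in which y increases downwards, so that a positive offset moves the reported position upwards to compensate for users touching below targets; the function name, the `tip_history` mapping and the offset value are hypothetical.

```python
def click_position(tip_history, first_below_frame, vertical_offset):
    """Position of a confirmed click.

    tip_history:       mapping from frame number to fingertip (x, y)
    first_below_frame: first frame in which the finger length fell below LCL
    vertical_offset:   correction for touching below the target (same units)
    """
    x, y = tip_history[first_below_frame]
    return (x, y - vertical_offset)
```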
Dragging
We have seen various techniques to detect a finger click on the VTS surface. Another
important interaction that a VTS should support is dragging. Dragging is implemented by
first detecting a finger click, storing the length of the finger at that point, Lclick, and
establishing a threshold for that finger, LengthThreshold = Lclick + Margin (the value of
Margin can be different for each finger). If the finger length is below LengthThreshold the
finger is considered to be dragging on the VTS, and the position of the dragging fingertip
will be reported for every time-step. If the finger length goes above LengthThreshold the
drag operation is considered finished. Thus, every finger drag operation starts with a finger
click operation followed by a translation of the hand contour, while keeping the length of the
finger below LengthThreshold.
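The drag logic can be sketched as follows. The class and method names are hypothetical; `margin` corresponds to Margin and `length_threshold` to LengthThreshold in the description above, and a finger length equal to the threshold is treated as the end of the drag.

```python
class DragTracker:
    """Track one finger's drag after a confirmed click."""

    def __init__(self, margin):
        self.margin = margin           # may differ for each finger
        self.length_threshold = None   # None while no drag is in progress

    def start(self, click_length):
        """Call when a finger click is confirmed; click_length is Lclick."""
        self.length_threshold = click_length + self.margin

    def update(self, finger_length, tip):
        """Return the fingertip position while dragging, otherwise None."""
        if self.length_threshold is None:
            return None
        if finger_length < self.length_threshold:
            return tip  # still dragging: report the fingertip position
        self.length_threshold = None  # finger extended again: drag finished
        return None
```

After a drag has finished, `update` keeps returning None until `start` is called again for a new click.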
8.2.4 Projector based VTS (Second Generation)
The articulated hand contour tracking developed in chapters 3 to 7, together with the
initialisation techniques and touch detection techniques described in the previous two
sections, enables the implementation of a second generation VTS. In this implementation the
hand tracking is from the front of the hand (palm view). This is the same point of view as the
hand contour trackers presented in chapters 4, 6, and 7. The VTS contents are designed to
be projected onto a screen such as DNP HoloScreen. This type of screen works as display
surface when light from a rear projector is incident at a particular angle, but it is transparent
to all other light. This allows us to both project the VTS contents onto it and do hand
tracking through it (as illustrated in Figure 8.1(d)). However, a HoloScreen was not available
while developing this VTS generation. As an alternative, the VTS was tested using a non-
reflective glass, on which the outlines of some interfaces such as keypads, slider bars, etc.,
were drawn. The VTS interfaces are aligned with the interfaces drawn on the glass; this
allows the user to operate the virtual interfaces while using the drawn ones as a visual aid.
The feedback produced by the VTS interfaces when clicks and drags occur is shown on
the computer screen. Note that with this set up the user's hand is tracked through the drawn
interfaces (camera behind the screen); however, this does not affect the hand contour
tracking because the interfaces are drawn with thin lines. Figure 8.10 shows the set up of the
second generation VTS.
Figure 8.10: Set up of the second generation VTS. The VTS is on a non-reflective glass and the VTS
feedback is displayed on the computer screen. However, this VTS generation is meant to be used in
combination with a projector and a screen such as DNP HoloScreen. This screen works as a
transparent surface from the camera's point of view but as a display surface from the projector's
point of view.
This VTS implementation follows the initialisation stage described in Section 8.2.2. During
this initialisation stage only the initial tracking position and LC skin colour classifier are
initialised. This VTS implementation uses the moving threshold touch detection method; the
finger click debouncing; the finger click position calculation; and the finger dragging
methods as described in Section 8.2.3.
In this VTS implementation when the user's hand is close enough to the VTS (for clicking on
it), a light blue circle appears in the centre of the palm. This indicates to the user that their
hand is at the correct distance to operate the VTS. In fact, touch detection operates only
when this circle appears on the hand contour. The VTS was tested with two types of
interfaces: a keypad, and a slider bar. Figure 8.11 shows a sequence consisting of four
consecutive frames of a user pressing the key '0' on a keypad interface. In frame 150 the
index finger of the user is completely extended. In frame 151, the finger flexes rapidly
stopping on top of the key '0'. This amount of finger flexion and the speed with which it
occurred (1 frame) is enough to trigger a finger click. However, because of the debouncing
mechanism the key press will not be confirmed until frame 153. Effectively, in frame 152 the
key press is not yet confirmed. In frame 153, the key press is confirmed (using as location
for the click that of the fingertip in frame 151).
Figure 8.11: Keypad usage. Key press sequence (frames 150, 151, 152, and 153).
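The key press sequence above can be sketched as a small debouncing routine. This is only an illustration and not the implementation of Section 8.2.3: the class name ClickDebouncer, the two-frame confirmation delay (inferred from the frame 151 to frame 153 example), and the rule that an unconfirmed click is discarded if the finger extends again are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[float, float]


@dataclass
class ClickDebouncer:
    """Confirms a finger click only after it has persisted for a few frames."""
    confirm_delay: int = 2                 # assumed: frames between trigger and confirmation
    _pending_frame: Optional[int] = None   # frame at which the click was triggered
    _pending_pos: Optional[Point] = None   # fingertip position at trigger time

    def update(self, frame: int, flexed: bool, fingertip: Point) -> Optional[Point]:
        """Feed one frame; return the click position when a click is confirmed."""
        if not flexed:
            # Finger extended again: discard any unconfirmed click (assumed rule).
            self._pending_frame = None
            self._pending_pos = None
            return None
        if self._pending_frame is None:
            # First frame of the flexion: remember where the fingertip landed.
            self._pending_frame = frame
            self._pending_pos = fingertip
            return None
        if frame - self._pending_frame == self.confirm_delay:
            # Confirm once, using the fingertip location stored at trigger time.
            return self._pending_pos
        return None
```

With the frames of Figure 8.11, a flexion first seen in frame 151 returns no click in frames 151 and 152, and returns the fingertip position stored in frame 151 when frame 153 is processed.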
In order to use a slider bar the user needs to follow three steps: firstly, click with a finger
onto the slider bar's cursor; secondly, drag the cursor along the slider bar up to the desired
position; and thirdly, lift the finger from the slider bar's cursor. Figure 8.12 shows four
frames which illustrate the usage of a slider bar. In frame 455, the user clicks on the slider
bar's cursor. In frame 459, the cursor is captured and the dragging can start. In frame 476 the
cursor has been dragged down the slider bar. Note that in order to improve the robustness of
the slider bar interface, once the cursor is captured only the vertical coordinate of the user's
fingertip is used for the dragging. This means that even if the fingertip that captured the
cursor does not move exactly along the slider bar, the cursor will still be dragged to the
fingertip's vertical position. Slider bars in common GUIs have this same behaviour.
A video sequence showing the usage of a keypad and two slider bars in the second
generation VTS is available in Appendix B and on the supporting webpage as
"video_sequence_8.3.avi".
Figure 8.12: Slider bar usage. Dragging sequence (frames 452, 455, 459, and 476).
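The vertical-only dragging behaviour of the slider bar can be illustrated with a short sketch. The SliderBar class, its field names, and the clamping of the cursor to the ends of the bar are assumptions made for illustration; only the rule that the horizontal fingertip coordinate is ignored once the cursor is captured comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class SliderBar:
    """Vertical slider bar whose cursor follows the fingertip's vertical position."""
    y_top: float          # image row of the top end of the bar
    y_bottom: float       # image row of the bottom end of the bar
    cursor_y: float       # current cursor position along the bar
    captured: bool = False

    def capture(self) -> None:
        """Called when a confirmed click lands on the cursor."""
        self.captured = True

    def release(self) -> None:
        """Called when the finger is lifted from the cursor."""
        self.captured = False

    def drag(self, fingertip_x: float, fingertip_y: float) -> float:
        """Move the captured cursor; fingertip_x is deliberately ignored."""
        if self.captured:
            # Clamp to the bar ends (assumed behaviour).
            self.cursor_y = min(max(fingertip_y, self.y_top), self.y_bottom)
        return self.cursor_y

    def value(self) -> float:
        """Cursor position normalised to the range 0.0 (top) to 1.0 (bottom)."""
        return (self.cursor_y - self.y_top) / (self.y_bottom - self.y_top)
```

Because drag() never reads fingertip_x, a fingertip that drifts sideways away from the bar still moves the cursor to the fingertip's vertical position.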
8.2.5 Tracking from the back of the hand
Leading towards the third generation VTS, a new hand tracking point of view had to be
tested. In the third generation VTS the camera that captures the user's hand is mounted on a
HMD, as shown in Figure 8.15. This involves tracking the user's hand from its back. In
principle, the articulated hand contour tracker developed in chapters 3 to 7 is capable of
tracking the user's hand both from the front view (palm view), and from the back view. The
only requirement is that the camera that captures the user's hand has to be approximately
parallel to the palm of the hand. However, when the user's hand is tracked using a camera mounted
on a HMD, the hand is typically held just in front of the user (as opposed to holding the hand
in front of the user's shoulder, as in the case of the second generation VTS) so that the hand
stays near the centre of the camera's field of view. This point of view may result in the hand
contour appearing slightly different. Also, in this configuration (camera mounted on a
HMD), if the user turns around while operating the VTS, it is possible that the illumination
conditions of the hand could change substantially (depending on the light sources in the
room). For these reasons, some minor modifications have been made to the hand contour
tracker.
The first modification is in the hand contour template. The template is a horizontally flipped
version of the hand contour template described in Section 4.1. However, in this hand
tracking configuration (camera mounted on a HMD) the user's right hand is normally held
near the centre of the camera's field of view, and the user's arm appears in the lower right
quadrant of the image. In order to better adapt to the way the user's hand is held in this
configuration, the right hand side of the hand contour template is slightly shortened (see
Figure 8.13). In addition, the fact that the user's arm normally appears in the lower
right quadrant of the image is taken into account during hand tracking initialisation, when the
LC skin colour classifier is tuned. When the LC classifier is tuned, two masks need to be
generated from the hand tracking information. These masks were referred to in Section 5.3 as
SkinMask and BackgroundMask. These masks are meant to segment the skin colour area of
the user's hand (SkinMask), and avoid any obvious skin colour areas in the image
(BackgroundMask). The obvious skin colour areas include the user's hand and arm.
Therefore, when generating the BackgroundMask the potential position of the user's arm is
taken out of the mask. Figure 8.14 shows the two initialisation images used in the LC
classifier tuning. These masks are equivalent to those shown in Figure 5.6 for the new
tracking point of view.
Finally, in this hand tracking configuration it is possible for the illumination conditions of
the hand to change substantially (as the user turns around while operating the VTS). These
illumination changes may result in a poor skin colour segmentation of the user's hand, and
consequent loss of tracking. In order to cope with illumination changes the LC classifier is
repeatedly tuned to the current hand skin colour, once for every frame of tracking. The
procedure is referred to as dynamic tuning and it is described in detail in Section 5.9. A
video sequence showing hand tracking from the back of the hand (and which uses dynamic
tuning) is available in Appendix B and on the supporting webpage as
"video_sequence_8.4.avi".
Figure 8.13: New hand contour template.
Figure 8.14: New initialisation image masks. (a) Initialisation image segmented by SkinMask. (b)
Initialisation image segmented by BackgroundMask.
Hand tracking from the back of the hand can also enable other VTS configurations (apart
from the HMD based configuration). The camera that captures the user's hand can be
wearable or be placed somewhere on top of the VTS, so that the hand can be tracked from its
back. Examples of these configurations appear in Figure 8.1(b) and (c).
8.2.6 HMD based VTS (Third Generation)
The third generation VTS is a HMD based VTS. In this configuration the camera that
captures the user's hand is mounted on the HMD, and the user's hand is tracked from its
back. The VTS contents are displayed in the HMD and appear to the user as if they were
floating in their field of view. Figure 8.15 shows the set up of the third generation VTS. The
HMD model used in this set-up is an I-glasses (I-O Display Systems, 2006). The HMD
contains two 800×600 LCD screens, one in front of each eye, although in this model both
screens show the same video output. The camera used in this set-up is a QuickCam Pro 3000
(Logitech, 2006). The camera was disassembled in order to replace the default lens with a
2.1 mm focal length fisheye lens, which provides an approximately 150° field of view. Then,
the camera was placed in a plastic housing and the assembly was attached to the front of the HMD
using Velcro strips. As the lens produces a fisheye image, the camera is first calibrated, and
the video from the camera is undistorted using CalibFilter from OpenCV (OpenCV, 2006).
Figure 8.15: Set up of the third generation VTS. The VTS contents are displayed in the HMD, and appear to
the user as if they were floating in their field of view. A camera is placed on the HMD in order to
capture the user's hand from its back. This allows tracking and interpretation of the user's
hand interaction with the VTS.
The third generation VTS uses the modified hand tracker (tracking from the back of the
hand) described in Section 8.2.5. The initialisation of the VTS interface follows the same
sequence as the second generation VTS (described in Section 8.2.2). A moving threshold and
debouncing (as described in Section 8.2.3) are used for the touch detection. The fingertip
dragging detection is also the same as in the second generation VTS.
In order to demonstrate the operation of the third generation VTS, two virtual interface
elements are implemented: a keypad and a spinning wheel. The keypad (shown in Figure
8.16(a)) has a display area and a dragging bar. When keys are pressed on the keypad, they
produce visual and acoustic feedback and the typed numbers appear in the display area.
When the user clicks on the dragging bar, the keypad can be repositioned on the VTS by
dragging it. The keypad is used in the following demonstrations in order to launch the wheel
and control some aspects of the VTS interface by typing in certain codes. The spinning
wheel (shown in Figure 8.16(b)) has a spinning area (area between the outer circle and the
inner circle), and a dragging handle (area in the inner circle). Users can spin the wheel by
clicking and dragging their fingertip around the spinning area. The dragging handle area
allows the user to reposition the wheel on the VTS. The spinning wheel can be used to
control values with a continuous magnitude.
Figure 8.16: Example virtual interfaces. (a) Keypad. (b) Spinning wheel.
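The two regions of the spinning wheel and the spinning action described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names wheel_region and angle_step, and the use of atan2 on the fingertip position relative to the wheel centre to measure rotation, are not taken from the thesis.

```python
import math


def wheel_region(px, py, cx, cy, r_inner, r_outer):
    """Classify a click at (px, py) for a wheel centred at (cx, cy)."""
    d = math.hypot(px - cx, py - cy)
    if d <= r_inner:
        return "handle"    # inner circle: repositions the wheel
    if d <= r_outer:
        return "spin"      # ring between inner and outer circle: spins the wheel
    return "outside"


def angle_step(prev, curr, centre):
    """Signed rotation (radians) of the fingertip about the centre between two frames.

    In image coordinates (y pointing down) a positive value is a clockwise
    rotation as seen on screen.
    """
    a0 = math.atan2(prev[1] - centre[1], prev[0] - centre[0])
    a1 = math.atan2(curr[1] - centre[1], curr[0] - centre[0])
    d = a1 - a0
    # Wrap the difference into [-pi, pi] so that crossing the +/-180 degree
    # line does not produce a jump of a full turn.
    return math.atan2(math.sin(d), math.cos(d))
```

Summing angle_step over the frames of a drag gives the angular position of the wheel, which could then be mapped to a continuous value such as brightness.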
Despite the fact that the HMD presents 2D information only (with the same video output for
the two HMD screens), an illusion of depth is generated with a combination of colour cues
and occlusions. When the user's hand is far from the VTS (and in front of it), the user's hand
occludes the VTS interfaces, and the tracked hand contour is displayed in a light blue colour
(see Figure 8.17(a)). The occlusion of the VTS interfaces by the user's hand gives the
impression to the user that their hand is in front of the VTS interfaces. At this distance the
user's hand is too far away to operate the VTS. When the user's hand gets nearer to the VTS
(and still in front of it), the user's hand occludes the VTS interfaces, and the colour of the
tracked hand contour changes to dark blue (see Figure 8.17(b)). At this distance the user's
hand is near enough to operate the VTS. When the user's hand goes behind the VTS, the
colour of the tracked hand remains dark blue, but the VTS interfaces occlude the user's hand
(see Figure 8.17(c)). The occlusion of the tracked hand by the VTS interface gives the
impression to the user that their hand is certainly behind the VTS. The VTS cannot be
operated when the user's hand is behind it.
Figure 8.17: Illusion of depth perception. (a) The user's hand is too far away from the VTS to operate it.
This is indicated with a light blue hand contour. (b) The user's hand is near enough to the VTS
to operate it. This is indicated with a dark blue hand contour. (c) The user's hand is behind the VTS
and cannot operate it. This is indicated by occluding the hand with the keypad.
These three states (far away from the VTS, near enough to the VTS, and behind the VTS) are
controlled with the scale of the tracked hand contour. Two thresholds with hysteresis are set
manually on the hand contour scale, so that the switching between these three states
produces convincing depth perception.
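The three-state switching with hysteresis can be sketched as a small state machine. The threshold values, the hysteresis margin, and the class name DepthState are hypothetical. The sketch also assumes that the contour scale decreases as the hand moves away from the HMD camera towards and beyond the VTS, which is an inference from the camera placement rather than something stated above.

```python
FAR, NEAR, BEHIND = "far", "near", "behind"


class DepthState:
    """Switches between far / near / behind using two thresholds with hysteresis."""

    def __init__(self, t_far_near=1.0, t_near_behind=0.8, margin=0.05):
        self.t_far_near = t_far_near        # hypothetical scale threshold far <-> near
        self.t_near_behind = t_near_behind  # hypothetical scale threshold near <-> behind
        self.margin = margin                # half-width of the hysteresis band
        self.state = FAR

    def update(self, scale):
        """Feed the hand contour scale for one frame and return the current state."""
        if self.state == FAR:
            if scale < self.t_far_near - self.margin:
                self.state = NEAR
        elif self.state == NEAR:
            if scale > self.t_far_near + self.margin:
                self.state = FAR
            elif scale < self.t_near_behind - self.margin:
                self.state = BEHIND
        else:  # BEHIND
            if scale > self.t_near_behind + self.margin:
                self.state = NEAR
        return self.state
```

A scale that jitters inside the band around a threshold does not change the state, which avoids flickering between contour colours or occlusion modes.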
The occlusion of the VTS interfaces is implemented using the skin colour of the user's hand.
Each VTS interface continually tests the image area it occupies for skin colour.
If skin colour is detected in that area, this skin colour is used to create a binary mask. The
mask is morphologically processed (in order to have more compact blobs) and used to
selectively display the pixels of the interface. The pixels inside the interface area where there
is no skin colour (mask values are zero) are displayed normally, but the pixels of the
interface area where there is skin colour (mask values are one) are not displayed, and
therefore the skin colour will appear on the image instead. This gives the impression to the
user that their hand is occluding the interface, and therefore it must be in front of it.
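The masking procedure above can be sketched in plain Python on small nested lists. This is an illustration only: the thesis does not state which morphological operation is used, so the 3x3 closing (dilation followed by erosion) is an assumption, as are the function names close_mask and composite.

```python
def _morph(mask, op):
    """Apply a 3x3 binary dilation (op=max) or erosion (op=min).

    Windows are truncated at the image border.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [mask[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = op(window)
    return out


def close_mask(mask):
    """Morphological closing: fills small holes so skin blobs are more compact."""
    return _morph(_morph(mask, max), min)


def composite(frame, interface, skin_mask):
    """Show the interface except where skin was detected.

    frame, interface and skin_mask cover the interface area and have the same
    size; skin_mask holds 1 for skin pixels and 0 otherwise.
    """
    mask = close_mask(skin_mask)
    return [[frame[y][x] if mask[y][x] else interface[y][x]
             for x in range(len(frame[0]))]
            for y in range(len(frame))]
```

Where the processed mask is one the camera pixel is kept, so the hand appears in front of the interface; elsewhere the interface pixel is drawn.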
8.2.7 Third generation VTS experiments
In this section the VTS operation is tested in four experiments. These experiments aim to
demonstrate the capabilities of the third generation VTS interface, and they give an idea of
its potential applications. In all four experiments the VTS contents are HMD stabilised, i.e.
the contents appear to float in the field of view regardless of where the user is looking.
Briefly:
• The first experiment involves testing the VTS operation against a plain background. This
makes the skin colour segmentation easier and the hand tracking more precise.
• The second experiment involves testing the VTS operation against a complex
background. A complex background makes the skin colour segmentation harder and the
hand tracking has to cope with non-ideal hand segmentation.
• The third experiment is similar to the first one but in this experiment the VTS interfaces
can be resized with a hand gesture.
• The fourth experiment uses the VTS to implement a drawing application.
VTS operation with a plain background
In this experiment, a user operates the VTS in front of a white wall. In this situation, the
skin colour of the user's hand can be easily segmented from the background. This makes the
articulated hand contour tracking more precise. The experiment is recorded as a video
sequence which is available in Appendix B and on the supporting webpage as
"video_sequence_8.5.avi". Some example frames from this video sequence are shown in
Figure 8.18.
Figure 8.18: Example frames of the actions occurring during the first experiment (third generation VTS):
frames 232, 272, 347, 425, 524, 591, 757, 815, 886, 1269, 1284, and 1304.
The experiment starts with the user initialising the hand contour tracker by placing their hand
on the floating red hand contour. Once the hand tracking starts (blue hand contour), a keypad
appears in the centre of the field of view. Then the user proceeds to sequentially type on the
keypad from the top row to the bottom row. During the first two rows the user types with
their hand parallel to the VTS (example frame on Figure 8.18 Frame 232). For the third row
the user types with their hand tilted forward (example on Frame 272). And on the last row
the user types with their hand tilted to the left (example on Frame 347). These three ways of
typing illustrate the freedom with which the user can type on a VTS. After this, the user
drags the keypad to one side (Frame 425), and types the code "123" and enter "E" (Frame
524). This code launches a spinning wheel interface (Frame 591). The spinning wheel is
dragged to a new position (Frame 757) and it is spun using the index finger (Frame 591),
then the middle finger (Frame 815), and finally the ring finger (Frame 886). Finally, the user
types the code "1" in the keypad. This code makes the spinning wheel control the brightness
of the input video with the angular position of the wheel (Frame 1269, Frame 1284, and
Frame 1304). Note that through the whole video sequence the shadow of the user's hand can
be seen on the wall; this shows that the user is operating a non-physical interface.
VTS operation with a complex background
In this experiment, a user operates the VTS against a complex background. A complex
background makes the skin colour segmentation harder as some colours in the background
can be misclassified as skin when they are not. This makes the tracking of the
user's hand more difficult. The experiment is recorded as a video sequence which is available in Appendix
B and on the supporting webpage as "video_sequence_8.6.avi". Some example frames from
this video sequence are shown in Figure 8.19.
This experiment is very similar to the first one, the only difference being the complex
background. The experiment proceeds as follows: firstly, the VTS is initialised, and a
keypad appears in the centre of the screen. Then the user types on the keypad (example
frame on Figure 8.19 Frame 412), launches the spinning wheel (typing code "123E"), and
drags the keypad to one side (Frame 595). Then, the user spins the wheel (Frame 834) and
activates the brightness control feature (typing code "1E") (Frame 1014). This allows the
user to control the brightness of the input video with the angular position of the wheel
(Frame 1149, and Frame 1158).
Figure 8.19: Example frames of the actions occurring during the second experiment (third generation
VTS): frames 412, 595, 834, 1014, 1149, and 1158.
VTS operation with resizable interfaces
This experiment tests two qualities of the VTS interface: firstly, the possibility of using hand
gestures to control aspects of the VTS interface; and secondly, the range of interface sizes
that the user can comfortably operate. In this experiment the size of the VTS interface
elements can be controlled with the position of the thumb. The procedure is as follows:
firstly, the user has to click on the dragging handle of the interface element and hold the
finger down as if the interface was going to be dragged (optionally the interface can be
dragged); then, the thumb of the user's hand has to be completely flexed in order to activate
the resizing mechanism; from that moment on, when the thumb is flexed (past a threshold)
the size of the interface increases; conversely, when the thumb is completely extended
(past a threshold) the size of the interface decreases. The experiment is recorded as a video
sequence which is available in Appendix B and on the supporting webpage as
"video_sequence_8.7.avi". Some example frames from this video sequence are shown in
Figure 8.20.
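The resizing procedure described above can be sketched as follows. The ThumbResizer class, the normalised flexion measure (0.0 for a fully extended thumb, 1.0 for a fully flexed one), the threshold values, the per-frame step, and the size limits are all hypothetical. The sketch also assumes that the frame in which the mechanism is activated does not itself change the size.

```python
from dataclasses import dataclass


@dataclass
class ThumbResizer:
    """Grows or shrinks an interface element according to thumb flexion."""
    size: float = 1.0
    flex_threshold: float = 0.8     # hypothetical: flexion at or above this grows the element
    extend_threshold: float = 0.2   # hypothetical: flexion at or below this shrinks it
    step: float = 0.05              # hypothetical size change per frame
    min_size: float = 0.5
    max_size: float = 3.0
    active: bool = False

    def update(self, holding_handle: bool, thumb_flexion: float) -> float:
        """Process one frame and return the current size of the element."""
        if not holding_handle:
            # Lifting the finger from the dragging handle ends the resizing.
            self.active = False
            return self.size
        if not self.active:
            # The mechanism is activated by flexing the thumb completely.
            self.active = thumb_flexion >= self.flex_threshold
            return self.size
        if thumb_flexion >= self.flex_threshold:
            self.size = min(self.size + self.step, self.max_size)
        elif thumb_flexion <= self.extend_threshold:
            self.size = max(self.size - self.step, self.min_size)
        return self.size
```

Flexion values between the two thresholds leave the size unchanged, so the user can pause the resizing while still holding the handle.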
The experiment starts by initialising the interface. Then after typing on the keypad, the user
clicks and holds their finger on the keypad's dragging bar (example frame on Figure 8.20
Frame 291). At this point, the user flexes their right hand's thumb (Frame 306) and the
keypad's resizing mechanism starts to operate. As the user keeps their thumb flexed the size
of the keypad increases (Frame 319). After typing on the resized keypad, the user proceeds
to decrease the size of the keypad. For that, the user clicks on the keypad's dragging bar and
while holding their finger on the dragging bar, flexes the thumb in order to activate the
keypad's resizing mechanism (Frame 510). Then, the user extends their thumb completely
(Frame 532) and the size of the keypad starts to decrease (Frame 554). The user types the
code "123E" on the resized keypad (Frame 686). This launches the spinning wheel, the size
of which is first decreased (Frame 949, and Frame 1023), and then increased (Frame 1315).
Finally, the resized spinning wheel is operated. As the wheel is now large, the user
operates it using three fingers (Frame 1589, and Frame 1596).
Figure 8.20: Example frames of the actions occurring during the third experiment (third generation
VTS): frames 291, 306, 319, 510, 532, 554, 686, 949, 1023, 1315, 1589, and 1596.
VTS based drawing application
As a last experiment using the third generation VTS, a drawing application was developed.
In this application the user can draw strokes on the VTS by clicking and dragging their finger
on the virtual surface. The drawing application has a toolbar consisting of a number of
buttons (see Figure 8.21). From left to right, the first group of six buttons allows the user to
select the colour with which the strokes are drawn. The current drawing colour is indicated
by a red border on the corresponding button. The next button in the toolbar determines whether the
user's hand occludes the drawings or vice versa. When the state of this button is "_" the
drawings occlude the hand; when the state of the button is "^" the hand occludes the
drawings. The next button in the toolbar allows the user to delete a previously drawn stroke.
When the user clicks on this button, its state changes from "D" to "X". When the state of this
button is "X" the user can click on a previously drawn stroke, which will then disappear
from the screen. The last button on the toolbar is "C". When the user clicks on this button the
drawing area is cleared.
Figure 8.21: Drawing application toolbar (third generation VTS).
A video sequence showing the operation of this drawing application is available in Appendix
B and on the supporting webpage as "video_sequence_8.8.avi". The first thing to observe in
this video sequence is that the user withdraws their hand from the VTS's field of view and
reintroduces it without going through the initialisation sequence. The hand contour tracking for this
drawing application uses the SCGS (Skin Colour Guided Sampling) technique as described
in Section 7.4. This technique allows the VTS to automatically initialise the hand tracking as
soon as the user's hand appears in the field of view (typically within 5 frames). As a result,
the user can easily withdraw their hand from the field of view (and keep it in a rest position),
and then introduce it again into the field of view when interaction with the VTS is required.
Figure 8.22 shows some frames from "video_sequence_8.8.avi", which illustrate how the
hand tracking is automatically initialised within 5 frames from the moment at which the hand
starts to appear in the field of view.
Figure 8.22: Automatic hand tracking reinitialisation feature (frames 77, 90, 91, 92, 93, 94, 95, and 96).
The video sequence "video_sequence_8.8.avi" continues with the user operating the drawing
interface in order to draw a simple scene with a house, a tree, a tractor, clouds and a sun.
Figure 8.23 shows some example frames from this video sequence, which illustrate the type
of actions occurring during the operation of the drawing application. By frame 2405 the user
has managed to draw most of the intended scene, having had to change the drawing colour
several times. Note that the strokes are smoothed out as they are drawn. This makes the
stroke appear slightly delayed as the user drags their hand on the VTS, but on the other hand
it avoids reflecting in the stroke the potential jitter of the hand contour. Also note that the
tracked hand contour is always dark blue; this means that the user can draw strokes in the
VTS regardless of the distance between the hand and the drawing surface (this new
behaviour does not apply to the toolbar). Frame 2515 illustrates how the user's hand can
occlude the drawings (by switching the "_" button to "^") creating the illusion that the hand
is on top of the drawing. Note that in previous frames the drawings were on top of the hand
(occluding the hand). Frames 2769 and 2792 illustrate how the user can delete a stroke (by
switching the "D" button to "X" and clicking on the desired stroke). The user deletes the
stroke corresponding to the tree and redraws it with a different shape (Frame 3044). Finally,
the user clears the drawing area by clicking on the "C" button (Frame 3355).
Figure 8.23: Example frames of the actions occurring during the drawing application experiment (third
generation VTS): frames 234, 589, 652, 957, 2405, 2515, 2678, 2769, 2792, 3044, 3117, and 3355.
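The trade-off between stroke delay and jitter mentioned above can be illustrated with a simple smoothing filter. The thesis does not specify which filter the drawing application uses; the exponential moving average below, the function name smooth_stroke, and the parameter alpha are assumptions chosen only to show the effect.

```python
def smooth_stroke(points, alpha=0.5):
    """Exponentially smooth a sequence of (x, y) fingertip positions.

    alpha close to 1.0 follows the fingertip closely (little delay, more jitter);
    alpha close to 0.0 smooths heavily (less jitter, more delay).
    """
    smoothed = []
    sx = sy = None
    for x, y in points:
        if sx is None:
            # The first point of the stroke is used as is.
            sx, sy = float(x), float(y)
        else:
            # Blend the new measurement with the previous smoothed point.
            sx = alpha * x + (1 - alpha) * sx
            sy = alpha * y + (1 - alpha) * sy
        smoothed.append((sx, sy))
    return smoothed
```

For example, with alpha set to 0.5 a single-frame vertical jump of 4 pixels in the fingertip position appears in the drawn stroke as a jump of only 2 pixels, while a sustained movement is reached gradually over the following frames.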
The drawing application allows the user to draw using either the index finger, the middle
finger, or the ring finger. It also allows the user to draw using multiple fingers at the same
time (multiple points of input). This results in drawing multiple strokes at the same time. A
video sequence illustrating drawing with multiple fingers at the same time is available in
Appendix B and on the supporting webpage as "video_sequence_8.9.avi". Some example
frames from this video sequence are shown in Figure 8.24. Frame 182 shows how the user
clicked and dragged their index and middle fingers on the VTS, which produced two parallel
strokes. Frame 390 shows how the user simultaneously draws three horizontal strokes
representing a sea. Frame 1063 shows how the user simultaneously draws the two eyes of a
sun. Finally, frame 1192 shows the finished drawing.
Figure 8.24: Example frames showing drawing with multiple fingers at the same time (multiple points of
input): frames 182, 390, 1063, and 1192.
8.3 Applications
The unique characteristics of the VTS interface lead to a number of possible applications that
go beyond emulating traditional touch sensitive screens. Section 8.2 gave a basic idea of
the potential of the VTS interface as implemented in three VTS generations. This section
describes a number of potential applications of the VTS interface. Some of the suggested
applications may involve other technologies too, but the VTS remains central in the
application. Depending on whether the VTS is projector based or HMD based the potential
applications vary slightly.
Possible applications for a projector based VTS are:
• An alternative to touch sensitive panels. This alternative is particularly attractive when
large touch sensitive panels are required, because these are difficult to construct and thus
quite expensive. This would enable applications such as large desktops, interactive
drafting boards, and large interactive information points.
• VTSs make possible not only traditional Windows, Icons, Menu, Pointer (WIMP)
interaction but also direct manipulation of objects in the VTS, and gesture recognition.
For example, objects displayed in the VTS could be directly relocated, resized, and
reoriented using a particular gesture, or just by clicking on them and dragging them in a
particular way (similar to direct manipulation of objects on screen using a mouse). It is
possible to detect two or more fingers clicking at the same time on the VTS (multiple
points of input). Finally, it is also possible to use two identical hand trackers (one for
each hand) and use both hands in the interaction. This would enable applications such as:
direct manipulation of maps in GIS software, direct drawing media, or the simulation of
musical instruments such as a virtual piano, virtual DJ desks, etc.
• The VTS is especially suited to applications in which physical contact is not desired (as
clicks and drags can happen at a certain distance from the screen). These types of
applications can be divided into two scenarios:
• In the first scenario, it is essential to avoid contamination of the user's hand. For
example, the VTS could provide a sterile interface for use in hospitals, operating
theatres, or clean rooms.
• In the second scenario, the interfaces are subjected to extreme wear, for example:
heavy use in public places, hazardous environments with high humidity, dust, or
even underwater. Mechanical interfaces could stop working under these
conditions. A VTS could keep operating as long as the user's hand can be tracked.
If the screen where the VTS is projected is transparent (such as the DNP HoloScreen (DNP,
2004)) a number of additional applications are possible:
• Video conference with gaze awareness and direct manipulation of virtual objects. The
concept is similar to that of ClearBoard (Ishii and Kobayashi, 1992). This is a video
conferencing application where the camera is situated behind the screen, in front of
which the user stands. When the user stands in front of the screen's centre, the camera
points directly to the user's eyes. This makes not only direct eye contact possible but also
awareness of the other collaborator's gaze. In ClearBoard the user needs a stylus in order
to draw or manipulate virtual objects. In a VTS the user could use both of their hands.
• Spatial displays. As the VTS in this category is transparent, it is possible to see the scene
behind it. This can allow pointing out objects in the field of view, as in DigiScope
(Ferscha and Keller, 2003), or sending areas in the field of view to a recognition engine,
as in HandVu (Kölsch, 2004). In combination with direct manipulation it would be
possible to place virtual objects on the scene, as seen through the VTS. This could have
applications in surveillance, modelling, prototyping, maintenance, and training.
• Use in museums, information points and shop displays. Information relevant to the user
could be projected on the shop display, and users could interact directly with this
information using their hands (clicking and dragging items). They could browse articles,
request information, or even try on virtual clothes in a virtual mirror.
• 'Minority Report' interfaces. As previously suggested for a similar system, TouchLight
(Wilson, 2004), this type of interface allows filmmakers to cleanly put the interaction
system and the actor's face in the same shot.
If the VTS uses a see-through HMD in order to visualize the VTS contents, a number of
extra applications are possible (note that most of the projector based VTS applications can
also be effectively implemented using a HMD based VTS):
• A cheap and flexible alternative to handheld keypads, controls, or pointing devices in
HMD based AR environments. It would be possible to implement it using cheap USB
cameras. This alternative has the extra advantage that the user does not need to have their
hand occupied by interface devices, and consequently they could have their hands free to
touch or grab any physical object.
• Direct manipulation of virtual objects in the field of view. The VTS does not need to be
2D when using a see-through HMD (the 2D operation is an artificial constraint). Virtual
objects could be relocated, resized, and reoriented not only on a 2D virtual surface, but
also above and under it. Occlusion of the virtual objects by either the user's hand or other
virtual objects could give a perception of depth.
• As the user could see the VTS superimposed in the real world, it is possible to point out
an object in the field of view, and select areas in the field of view for processing or
recognition, as in the case of HandVu (Kölsch, 2004). In combination with direct
manipulation it would be possible to place virtual objects on the scene, as seen through
the VTS. This could have applications in surveillance, hybrid modelling, prototyping,
maintenance, and training.
• Interface for mobile computing. A VTS could become a flexible alternative interface for
PDAs, or other mobile computing devices.
• The VTS would have a large field of application in the video game industry. For
example, two users wearing video see-through HMDs could see each other and pass a
virtual ball to each other, in a sort of virtual tennis. Also, taking into account that input from
various fingers and also from both hands is possible, it would be possible to reproduce
a number of musical instruments for leisure, such as pianos, DJ decks, etc.
• The VTS is especially suited to applications in which physical contact is not desired (as
clicks and drags happen in the air). These types of applications can be divided into two
scenarios:
• In the first scenario, it is essential to avoid contamination of the user's hand. For
example, the VTS could provide a sterile interface for use in hospitals, operating
theatres, or clean rooms.
• In the second scenario, the interfaces are subjected to extreme wear, for example:
heavy use in public places, hazardous environments with high humidity, dust, or
even underwater. Mechanical interfaces could stop working under these
conditions. A VTS could keep operating as long as the user's hand can be tracked.
In the applications involving a see-through HMD, the VTS is HMD stabilised. That is,
wherever the user looks, the VTS interface elements will be in the field of view. The VTS
could be made world stabilised by using fiducial markers such as the ones provided by
ARToolkit (ARToolkit, 2006) or ARTag (Fiala, 2004). For example, a HMD could house a
camera in order to implement video see-through, track the user's hand, and recognize fiducial
markers. When the system recognizes a certain fiducial marker this could be used in order to
visualise a VTS interface in a location and orientation relative to the marker. The possibility
of making the VTS world stabilised opens up a set of new applications:
• Distributed VTS interfaces which could be activated when the user looks at them. The
idea is that a user wearing a video see-through HMD could move inside a building or
other area, and each time the camera captures a fiducial marker a VTS interface can
appear in that place. That VTS interface could enable the user to perform some task
relevant to that position, for example, operation of a nearby machine, get access to a
nearby door, etc.
• The use of multiple fiducial markers arranged on a wall could make it possible to
implement a large VTS whose size is bigger than the video see-through field of view.
This arrangement could enable large continuous desktop surfaces. For example a large
window manager desktop could be directly displayed, and be active for interaction, over
a whole wall, or even over the four walls, ceiling and floor of the room. Since the only
requirement is to arrange a number of fiducial markers over the wall, this large desktop
approach could be much cheaper than other technologies.
• Interactive textbooks. These textbooks would contain fiducial markers in their pages. The
markers could be recognized by the system and the desired information rendered on
top of the page (as if it were printed on the page). The page could then become a VTS surface.
This would allow the user to click and drag on the page in order to trigger new contents.
The concept is similar to MagicBook (Billinghurst et al., 2001) or EnhancedDesk (Koike
et al., 2001).
8.4 Conclusions
This chapter has described the concept of a Virtual Touch Screen (VTS) interface. A VTS is
an interface analogous to a touch screen. In a VTS the sensing technology is based on visual
tracking of the user's hand, and the display technology is based on either a projector and a
screen, or a HMD. A number of possible VTS configurations, the proposed operation of a
VTS, and usability factors have also been described and discussed in the first part of this
chapter. The second part of this chapter presented the current implementations of the VTS
interface. Three generations of VTS implementations have been developed in this thesis. The
first generation was intended to demonstrate the VTS concept, but its hand tracking was too
restricted and only worked against a black background. These constraints reduce the usability
of this first generation VTS. The research presented in chapters 3 to 7 of this thesis resulted
in an articulated hand contour tracker that is specially designed for VTS use. This hand
tracker is used in the second generation VTS, and with minor modifications, in the third
generation VTS. The second generation VTS is a projector based VTS and the hand tracking
is from the front of the hand (palm view). The third generation VTS is a HMD based VTS,
and the hand tracking is from the back of the hand. The operation of the second and third
generation VTS was demonstrated through a number of experiments which involved
operating buttons, keypads, slider bars, and spinning wheels. Finally, a drawing application
was developed for the third generation VTS. This drawing application allows the user to
draw on the VTS by using clicks and drags of a finger.
The potential applications for the VTS range from an alternative to touch screens, handheld
keypads, controls, and pointers, to spatial displays, collaboration environments, and
the entertainment industry (see Section 8.3 for a detailed list of potential applications). The VTS
interface also has potential applications in environments where physical contact is not
desired. This includes sterile environments where contamination of the user's hand needs to
be prevented (hospitals, or clean rooms), and environments where extreme wear would make
touch screens, keyboards, and mice impractical (heavy use, high humidity or dust, or
underwater).
The hand tracking used in the second and third generation VTS follows the initialisation
sequence described in Section 8.2.2. During this initialisation sequence, the skin colour
model used in the tracking is tuned to every new user that operates the VTS, but the hand
contour template remains unchanged. This may result in an incorrect fitting of the hand
contour template to different users. Thus, a subject of future work is to create a mechanism
that can adapt the hand contour shape to that of the current user's hand. Isard and
MacCormick had some success in making a hand contour tracker robust to different users.
They used two trackers that operated simultaneously; one tracked the rigid movement of a
hand, and the other tracked changes in the shape of the contour. The shape of the contour
was only allowed to change within a space of deformations calculated from examples and
reduced with PCA (Isard, 1998). This technique is a possible candidate for making the
VTS fully multi-user. However, the hand contour they track is not articulated and is simpler
than the one used for the VTS. Their hand contour consists of an index finger, thumb and
hand, whereas the hand contour used in the VTS is more complex and is articulated (14
DOF). This raises doubts about the success of this method on the hand contour used for the
VTS. An alternative method could involve using another tracker, or a mechanism such as
snakes (Kass et al., 1987), in order to find the shape of the user's hand during initialisation,
and then create an articulated hand contour template from that shape.
Questions about the usability (ease of use) of the VTS interface arose and were discussed in
Section 8.1.3. The flexibility of the VTS interface makes it possible to deploy it in a large
range of configurations and orientations (the only real restriction is that the field of view
between the camera and the user's hand must be clear). This allows the VTS to be deployed
in such a way that allows the VTS operation within the user's comfort area (as defined by
Kölsch (2004)), and within the wrist angle range suggested by Bach et al. (1997) and
Wellman et al. (2004). The only unavoidable shortcoming of the VTS interface is the lack of
haptic feedback. This problem can be alleviated by maximizing audio and visual feedback.
Finally, as regards the usage scope of a VTS, it must be noted that as in the case of other
interactive surfaces, such as touch screens, VTS interfaces are better suited to information
systems with limited data entry.
9 Closing Discussion
This thesis has developed a specific visual articulated hand tracking system which enables
the creation of a novel vision-based interactive surface, referred to as the Virtual Touch
Screen (VTS). The thesis further develops and improves existing contour tracking (Blake
and Isard, 1998) and partition sampling (MacCormick and Isard, 2000) algorithms for the
creation of a robust hand tracker for the VTS. Although the existing tracking algorithms have
the potential to track complex objects in real-time against cluttered backgrounds, they do not
satisfy the demanding hand tracking requirements of the VTS. As a result a number of novel
techniques for articulated hand contour tracking have been developed and presented. The
final visual articulated hand tracking system is used for the creation of the VTS interface.
9.1 Summary
This thesis has developed a visual articulated hand tracking system capable of satisfying the
hand tracking requirements of the VTS interface. The development of this tracking system
has been possible thanks to the combination of a number of other developments:
• A novel technique, referred to as particle interpolation, which makes it possible to
improve the efficiency of particle propagation between time-steps in tracking tree-
structured articulated objects using particle filters and partition sampling.
• Development of a novel skin colour classifier, referred to as the Linear Container (LC)
classifier and testing of the classifier under various conditions for use in hand tracking
for HCI. The classifier is robust to illumination (brightness) changes, requires less
storage, and is significantly faster than existing classifiers.
• A novel measurement function based on skin colour that is both faster and more reliable
than existing edge based measurement functions provided no other skin colour objects
appear in the background.
• A novel skin colour based importance sampling, referred to as Skin Colour Guided
Sampling (SCGS), that allows the position, scale, and angle of the hand contour to be
estimated from low-level information, for users wearing either long or short sleeves.
• A novel contour fitting method for articulated contour trackers that improves tracking
agility and reduces jitter on the tracking output.
• A novel method for particle filter based contour trackers, referred to as Variable Process
Noise (VPN), which varies the size of the contour's search region in order to cope with
brisk target movements.
The final visual articulated hand tracking system has been used to create the VTS interface.
However, the tracking system is just a part of the VTS, and the full VTS implementation has
required the development of other techniques dealing with interface initialisation, touch
detection, and occlusion of interface elements by the user's hand in order to simulate a
feeling of depth.
The capabilities of the VTS interface have been demonstrated through a number of
experiments. These experiments have involved the operation of various interface elements,
such as keypads, sliderbars, control wheels, and buttons, against plain and cluttered
backgrounds. Finally, the capability of the VTS interface to complete a task has been
demonstrated through a hand drawing application.
The original goal of the project has been achieved. This is the implementation of a visual
articulated hand tracking system that can enable the creation of the VTS. However, questions
about usability of the VTS interface arise. The first question involves the posture of a user
operating a VTS. The VTS may be deployed in a position that forces the user to pose
uncomfortably. However, this same problem can occur with a keyboard, a touch screen, or any
other hardware interface that is not properly deployed. The solution to this problem is to
deploy the VTS in such a way that allows the VTS operation within the user's comfort area
(as defined by Kölsch (2004)), and within the wrist angle range suggested by Bach et al.
(1997) and Wellman et al. (2004). This is easy to achieve in a projector based VTS, as the
locations of the screen and camera force the user to operate it holding their hand in a given way.
However, this is not so easy to achieve in a HMD based VTS where the contents are always
in the HMD's field of view. In this case it is up to the user to operate the VTS in a
comfortable position. Another question about the VTS usability is the quality of the visual
articulated hand tracking system. This quality could always be better, and indeed the current
visual hand tracking system can be easily fooled if desired. Operation of the VTS within the
visual articulated hand tracking system capabilities is a skill that the current VTS user has to
acquire with practice. Further improvements to the tracking quality are part of the future
work of this thesis.
The only unavoidable shortcoming of the VTS interface, even with ideal hand tracking, is
the lack of haptic feedback. This problem could be alleviated to some extent by maximizing
audio and visual feedback.
Finally, it is worth remembering the usage scope of a VTS. As in the case of other interactive
surfaces, such as touch screens, VTS interfaces are better suited to information systems with
limited data entry.
9.2 Future work
The first part of the future work is related to improving the hand tracking system and the
VTS interface in order to reach a commercial quality system. The second part is concerned
with the extension of the VTS interface to support new applications.
The visual articulated hand tracking system developed in this thesis is fast, accurate, and
robust enough to enable the creation of the VTS. However, errors in the hand tracking may
occur, and these errors could result in incorrect interface actions. This demands a very high
quality tracking that can guarantee correct tracking in a broad range of conditions. The first
aspect of the current articulated hand tracking system liable to improvement is the skin
colour detection. The LC skin colour classifier is most effective when the skin colour
dispersion of the user's hand is small. When the skin colour dispersion is high (for example
one side of the hand is dark while the other side is bright) the performance of the LC
classifier decreases. This could be overcome by dynamic skin colour modelling together with
multiple LC classifiers. Each LC classifier could deal with a fraction of the hand template.
Then, while tracking, each one of the LC classifiers could get updated to the particular skin
colour in that area of the hand template. In this way, the skin colour dispersion supported by
the hand tracker could be higher, which in turn would improve the tracking performance.
Another aspect for improvement in the current articulated hand tracker system is its multi-
user capability. At the moment, a single deformable template is used for hand tracking. The
template can transform within a space of Euclidean similarities in order to adapt to different
hands. However, if the hand shape is different there will always be matching errors with only
one template. One possibility to solve this problem could be to create a template specific for
each user. The template could be created during the initialisation step. The user would place
their hand over the hand contour and then, using a mechanism such as snakes (Kass et al.,
1987), the shape of the hand initialising the tracker could be found. From that shape a new
template specific to that user could be generated, and used later for the tracking. As the
template used in tracking would be specific to the current user, the match should be more accurate.
This in turn should improve the tracking performance. Once the articulated hand tracking system is
prepared to work with multiple users, aspects such as the usability of the VTS interface could
be evaluated for a representative group of users.
The VTS interaction experience is closely linked to the quality of the visual hand tracking
system employed. However, there are some aspects of the VTS that could be improved
independently of the hand tracking system in order to make the VTS interaction easier, and
to reduce learning time. One of these aspects is the click and drag detection mechanisms.
Another aspect is to provide mechanisms that reduce the impact of the lack of haptic
feedback, for example, maximizing audio and visual feedback, or imposing simulated
surface constraints. Lindeman et al. (2001) reported that the imposition of simulated surface
constraints (such as clamping) can compensate to some extent for the decrease in performance
produced by the lack of haptic feedback in virtual work surfaces. The notion of clamping
could easily be implemented in a VTS by drawing the hand contour with a fixed minimum
scale, each time the user stretches their hand beyond the virtual surface (no occlusion of the
user's hand by the interface). During this state (scale of the hand clamped to a minimum) the
VTS could still be operated.
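The clamping idea described above can be sketched in a few lines. This is only an illustrative sketch, not the thesis implementation: the names `ClampedScale`, `clamp_hand_scale`, and `surface_scale` are hypothetical, and it assumes a camera on the user's side of the virtual surface, so that the tracked hand contour shrinks as the hand moves away and passes through the surface.

```python
from dataclasses import dataclass


@dataclass
class ClampedScale:
    display_scale: float  # scale used to draw the hand contour
    clamped: bool         # True while the hand is beyond the virtual surface


def clamp_hand_scale(tracked_scale: float, surface_scale: float) -> ClampedScale:
    """Never draw the contour smaller than its scale at the virtual surface.

    Assumption (not from the thesis text): a tracked scale below
    surface_scale means the hand has stretched beyond the virtual surface.
    """
    if tracked_scale < surface_scale:
        return ClampedScale(display_scale=surface_scale, clamped=True)
    return ClampedScale(display_scale=tracked_scale, clamped=False)
```

The `clamped` flag lets the interface keep accepting input while the drawn contour stays at the minimum scale, which matches the behaviour described in the paragraph above.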
A second part of the proposed future work is concerned with the extension of the VTS
interface to support new applications. One interesting extension would be to make the HMD
based VTS world stabilised. In the applications involving a see-through HMD, the VTS
is HMD stabilised. That is, wherever the user looks, the VTS interface elements will be in
the field of view. The VTS could be made world stabilised by using fiducial markers such as
the ones provided by ARToolkit (ARToolkit, 2006) or ARTag (Fiala, 2004). For example a
HMD could house a camera in order to implement video see-through, track the user's hand,
and recognize fiducial markers. When the system recognizes a certain fiducial marker this
could be used in order to visualise a VTS interface in a location and orientation relative to
the marker. The possibility of making the VTS world stabilised opens up a set of new
applications (these are detailed at the end of Section 8.3):
• Distributed VTS interfaces which could be activated when the user looks at them.
• The use of multiple fiducial markers arranged on a wall could make it possible to
implement a large VTS whose size is bigger than the video see-through field of view.
• Interactive textbooks.
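Placing a VTS interface "in a location and orientation relative to the marker" amounts to composing two rigid transforms. The sketch below illustrates this with plain 4x4 homogeneous matrices. It is an assumption-laden illustration: `camera_T_marker` stands for whatever marker pose a fiducial tracking library reports, `marker_T_vts` is a hypothetical fixed offset chosen by the interface designer, and none of these names belong to ARToolkit or ARTag.

```python
def matmul4(a, b):
    """Multiply two 4x4 homogeneous transforms given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(tx, ty, tz):
    """Pure translation as a 4x4 homogeneous transform."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]


def vts_pose_in_camera(camera_T_marker, marker_T_vts):
    """Pose of the VTS in camera coordinates: marker pose composed with the offset."""
    return matmul4(camera_T_marker, marker_T_vts)


# Hypothetical marker pose: 500 units in front of the camera,
# rotated 90 degrees about the camera z axis.
camera_T_marker = [[0, -1, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, 1, 500],
                   [0, 0, 0, 1]]
# Hypothetical design choice: VTS origin 100 units along the marker's x axis.
marker_T_vts = translation(100, 0, 0)
camera_T_vts = vts_pose_in_camera(camera_T_marker, marker_T_vts)
```

Because the offset is expressed in the marker's frame, the interface stays attached to the marker as the user's head (and therefore the camera) moves, which is what makes the VTS world stabilised rather than HMD stabilised.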
One final piece of future work is the creation of a library that could give easy access to the
functions involved in a VTS, from hand tracking, VTS initialisation, click and drag detection
mechanisms, to the inclusion of new interface elements. This library would enable
developers of AR projects to easily incorporate and customise a VTS interface into their
systems.
Appendix A
A.1 Reverse kinematics of a chain of three links
The three segments that form a finger can be represented as a kinematic chain of three links.
The position of the last link's end (fingertip) can be calculated from the angles of the three
joints, and the lengths of the three links. The process to calculate this is called direct
kinematics (or simply kinematics). The reverse process (reverse kinematics) returns the
angles of the three joints given the lengths of the three links and the position of the last link's
end. The problem can be represented as in Figure A.1.
Figure A.1: Kinematic chain representing a finger.
In Figure A.1, the lengths of the finger segments are indicated as: $L_{PP}$ for the length of the
proximal phalanx segment; $L_{MP}$ for the length of the middle phalanx segment; and $L_{DP}$ for
the length of the distal phalanx segment. The angles of the joints are indicated as: $\alpha_{PP}$ for the
metacarpophalangeal joint angle; $\alpha_{MP}$ for the proximal interphalangeal joint angle; and $\alpha_{DP}$
for the distal interphalangeal joint angle. When the finger is fully extended the angles of the
joints are all zero, the value of y equals the length of the finger, and the value of x is zero. As
the finger flexes the value of y decreases, and the value of x increases. For touch detection
purposes, the 2D projected length of the finger is indicated as the distance B in Figure A.1
(this is a value measured from the hand configuration), and the wanted value is x (the
separation between the fingertip and the palm plane).
The direct kinematics for the chain of three links of Figure A.1 is:
$$y = L_{PP}\cos(\alpha_{PP}) + L_{MP}\cos(\alpha_{PP} + \alpha_{MP}) + L_{DP}\cos(\alpha_{PP} + \alpha_{MP} + \alpha_{DP}) \tag{A.1.1}$$
In order to find the reverse kinematics from the direct kinematics two constraints are used:
• $\alpha_{DP} = \tfrac{2}{3}\alpha_{MP}$: this is an anatomical movement constraint, due to the fact that the distal
interphalangeal and proximal interphalangeal joints share the same tendon.
• $\alpha_{MP} = 2\alpha_{PP}$: this is an artificial constraint. It is used to simplify the reverse kinematics
by considering only a typified finger flexion. This finger flexion relates to a typified
trajectory of a finger when typing on a keyboard.
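Substituting the constraints into Equation A.1.1, and into the analogous expression for x, yields:
$$y = L_{PP}\cos(\alpha_{PP}) + L_{MP}\cos(3\alpha_{PP}) + L_{DP}\cos\left(\tfrac{13}{3}\alpha_{PP}\right) \tag{A.1.2}$$
$$x = L_{PP}\sin(\alpha_{PP}) + L_{MP}\sin(3\alpha_{PP}) + L_{DP}\sin\left(\tfrac{13}{3}\alpha_{PP}\right) \tag{A.1.3}$$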
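For reference, the arguments $3\alpha_{PP}$ and $\tfrac{13}{3}\alpha_{PP}$ in Equations A.1.2 and A.1.3 follow from the two constraints. The intermediate step, which the text does not spell out, is:

```latex
\begin{align*}
\alpha_{MP} &= 2\alpha_{PP}, \\
\alpha_{DP} &= \tfrac{2}{3}\alpha_{MP} = \tfrac{4}{3}\alpha_{PP}, \\
\alpha_{PP} + \alpha_{MP} &= 3\alpha_{PP}, \\
\alpha_{PP} + \alpha_{MP} + \alpha_{DP} &= 3\alpha_{PP} + \tfrac{4}{3}\alpha_{PP} = \tfrac{13}{3}\alpha_{PP}.
\end{align*}
```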
The lengths of the finger segments are assumed to be known. In order to find the separation between
the fingertip and the palm plane (x) from the distance B in Figure A.1, equations A.1.2 and
A.1.3 need to be solved. However, a closed-form expression that gives $\alpha_{PP}$ for a given y is not easy
to find, as there is no simple expression that recovers $\alpha_{PP}$ from $L_{DP}\cos(\tfrac{13}{3}\alpha_{PP})$. Three alternative
solutions are suggested:
• Use a polynomial approximation of the cos function and substitute it in Equation A.1.2.
Then $\alpha_{PP}$ can be isolated and used in Equation A.1.3 in order to find x.
• Use an iterative approach. The value of $\alpha_{PP}$ can be changed in small steps, evaluating
Equation A.1.2 at each step. When the resulting y is near enough to the measured B the
iteration stops and the current $\alpha_{PP}$ is used in Equation A.1.3.
• Calculate a number of (B, x) combinations before the tracking starts (for example
during the tracking initialisation), and store them in a lookup table. These combinations can
be calculated for a range of valid $\alpha_{PP}$ values by simply applying Equation A.1.2 and Equation
A.1.3.
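The iterative and lookup-table alternatives can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis. All function names are hypothetical and the segment lengths in the usage example are illustrative. `ALPHA_MAX` is an assumed upper limit of 3π/13 for the metacarpophalangeal angle, chosen because every term of Equation A.1.2 decreases with the angle on that interval, so each measured B corresponds to a single angle.

```python
import bisect
import math

# Assumed flexion limit (radians): y from Equation A.1.2 is monotonically
# decreasing on [0, ALPHA_MAX], so a measured B maps to exactly one angle.
ALPHA_MAX = 3.0 * math.pi / 13.0


def fingertip_yx(alpha_pp, l_pp, l_mp, l_dp):
    """Evaluate Equations A.1.2 (y) and A.1.3 (x) for a given angle."""
    y = (l_pp * math.cos(alpha_pp) + l_mp * math.cos(3.0 * alpha_pp)
         + l_dp * math.cos(13.0 / 3.0 * alpha_pp))
    x = (l_pp * math.sin(alpha_pp) + l_mp * math.sin(3.0 * alpha_pp)
         + l_dp * math.sin(13.0 / 3.0 * alpha_pp))
    return y, x


def separation_iterative(b, l_pp, l_mp, l_dp, step=1e-3):
    """Increase the angle in small steps until y drops to the measured B."""
    alpha = 0.0
    y, x = fingertip_yx(alpha, l_pp, l_mp, l_dp)
    while y > b and alpha < ALPHA_MAX:
        alpha = min(alpha + step, ALPHA_MAX)
        y, x = fingertip_yx(alpha, l_pp, l_mp, l_dp)
    return x


def build_lookup(l_pp, l_mp, l_dp, n=256):
    """Precompute (B, x) pairs over the valid angle range, sorted by B."""
    table = [fingertip_yx(ALPHA_MAX * i / (n - 1), l_pp, l_mp, l_dp)
             for i in range(n)]
    table.sort()
    return table


def separation_lookup(b, table):
    """Linearly interpolate x between the two table entries that bracket B."""
    ys = [y for y, _ in table]
    i = bisect.bisect_left(ys, b)
    if i == 0:
        return table[0][1]       # B below the table range: most flexed entry
    if i == len(table):
        return table[-1][1]      # B above the table range: fully extended
    (y0, x0), (y1, x1) = table[i - 1], table[i]
    return x0 + (b - y0) / (y1 - y0) * (x1 - x0)
```

A possible usage, with illustrative segment lengths in millimetres: `table = build_lookup(45.0, 25.0, 20.0)` once during initialisation, then `separation_lookup(B, table)` per frame. The lookup variant trades a small amount of memory for a constant per-frame cost, whereas the iterative variant needs no precomputation but its cost grows with the flexion angle.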
Appendix B
This appendix comprises a number of video sequences which illustrate the results of various
experiments throughout the thesis. The video sequences are available in the attached CD and
in the supporting webpage at: http://www.cs.nott.ac.uk/~mtb/thesis
The video sequences contained in the CD and in the supporting webpage are listed next:
Chapter 4
video_sequence_4.1.avi Test video sequence.
ground_truth.txt Ground truth for the test video sequence.
video_sequence_4.2.avi Particle-set implementation; tracking output on test video
sequence frames 30-174.
video_sequence_4.3.avi Particle-set implementation; tracking output on test video
sequence frames 360-890.
video_sequence_4.4.avi Sweep implementation; tracking output on test video sequence
frames 30-174.
video_sequence_4.5.avi Sweep implementation; tracking output on test video sequence
frames 360-890.
Chapter 5
video_sequence_5.1.avi Video illustrating the tuning operation of the LC skin colour
classifier.
video_sequence_5.2.avi Mediterranean subject video sequence, including the output of
the RGB histogram classifier, the rg histogram classifier and
the LC classifier.
video_sequence_5.3.avi White Caucasian subject video sequence, including the output
of the RGB histogram classifier, the rg histogram classifier and
the LC classifier.
video_sequence_5.4.avi Black African subject video sequence, including the output of
the RGB histogram classifier, the rg histogram classifier and
the LC classifier.
video_sequence_5.5.avi Chinese subject video sequence, including the output of the
RGB histogram classifier, the rg histogram classifier and the
LC classifier.
video_sequence_5.6.avi Example of a LC classifier initialisation in a HCI system.
Office illumination during the day.
video_sequence_5.7.avi Example of a LC classifier initialisation in a HCI system.
Office illumination during the night.
vide_sequence_5.8.avi Dynamic tuning vs. Static tuning of the LC classifier.
Chapter 7
video_sequence_7.1.avi Tracking output using template fitting method 1.
video_sequence_7.2.avi Tracking output using template fitting method 2.
video_sequence_7.3.avi Tracking output using the combination of template fitting
method 1 and 2.
video_sequence_7.4.avi Tracking output of the test video sequence using the
combination of method 1 and method 2. Frames 30-174.
video_sequence_7.5.avi Tracking output of the test video sequence using the
combination of method 1 and method 2. Frames 360-890.
video_sequence_7.6.avi Tracking output of the test video sequence using variable
process noise. Frames 0-890.
video_sequence_7.7.avi Skin colour guided sampling reinitialisation test showing the
tracking output and the skin colour blobs.
video_sequence_7.8.avi Skin colour guided sampling reinitialisation test showing the
tracking output, the skin colour blobs, the initialisation
particles and the importance particles.
video_sequence_7.9.avi Skin colour guided sampling robustness test on the test video
sequence. Frames 0-890.
video_sequence_7.10.avi Tracking output using template fitting methods 1&2, variable
process noise, and skin colour guided sampling.
Chapter 8
video_sequence_8.1.avi First generation VTS. The video sequence shows a VTS user
typing a telephone number on a virtual keypad. Black
background.
video_sequence_8.2.avi A finger kinematic model makes it possible to find the finger's
joint angles. Black background.
video_sequence_8.3.avi Second generation VTS usage demo. The video sequence
shows the VTS user typing on a keypad and using slider bars.
video_sequence_8.4.avi Hand tracking from the back of the hand (Camera mounted on
a HMD).
video_sequence_8.5.avi Third generation VTS, first experiment. VTS operation against
a plain background.
video_sequence_8.6.avi Third generation VTS, second experiment. VTS operation
against a complex background.
video_sequence_8.7.avi Third generation VTS, third experiment. VTS operation with
interfaces of various sizes. Thumb gesture can control the size
of the interfaces.
video_sequence_8.8.avi Third generation VTS. Drawing application demo.
video_sequence_8.9.avi Third generation VTS. Multiple points of input demo for the
drawing application.
Bibliography
Abe, K., Saito, H. and Ozawa, S. (2000). 3-D Drawing System via Hand Motion Recognition
from Two Cameras. In Proceeding of the 6th Korea-Japan Joint Workshop on Computer
Vision, pp. 138-143.
Ahlberg, J. (1999). A system for face localization and facial feature extraction. Tech. Rep.
LiTH-ISY-R-2172, Linkoping University.
Assan, M. and Grobel, K. (1997). Video Based Sign language Recognition using Hidden
Markov Models. Gesture and Sign Language in Human-Computer Interaction, Intl. In Proc.
of Gesture Workshop, vol. 1371 of Lecture Notes in Computer Science, pp. 97-110.
ARToolkit. (2006). http://www.hitl.washington.edu/artoolkit
Bach, J., Honan, M., and Rempel, D. (1997). Carpal tunnel pressure while typing with the
wrist at different postures. In Proceedings of the Marconi Research Conference (San
Francisco: University of California, San Francisco and Center for Ergonomics), Paper 17.
Billinghurst, M., Kato, H., Poupyrev, I. (2001). The MagicBook: A Transitional AR
Interface. Computers and Graphics, pp. 745-753.
Blake, A. and Isard, M. (1998). Active contours. Springer.
Bowden, R. (1999). Learning Non-linear Models of Shape and Motion. PhD thesis, Brunel
University.
Bowden, R., Heap, A., and Hart, C. (1996). Virtual Datagloves: Interacting with Virtual
Environments Through Computer Vision. In Proc. 3rd UK VR-Sig Conference, DeMontfort
University, Leicester, UK, July 1996.
Bradski, G. (1998). Computer Vision Face Tracking for Use in a Perceptual User Interface.
Intel Technology Journal, 2(2):12-21.
Brand, J. and Mason, J. (2000). A comparative assessment of three approaches to pixel level
human skin-detection. In Proc. of the ICPR, vol. 1, 1056-1059.
Branson, K. and Belongie, S. (2005). Tracking Multiple Mouse Contours (without Too Many
Samples). Proceedings of the IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR'05), 1, 1039 - 1046.
Brown, D., Craw, I., and Lewthwaite, J. (2001). A SOM based approach to skin detection
with application in real time systems. In Proc. of the British Machine Vision Conference.
Chai, D. and Bouzerdoum, A. (2000). A Bayesian approach to skin color classification in
YCbCr color space. In Proc. of IEEE Region Ten Conference (TENCON’2000), vol. 2, 421-424.
Chen, Q., Wu, H., and Yachida, M. (1995). Face detection by fuzzy pattern matching. In
Proc. of the ICCV, 591-597.
Cootes, T., Taylor, C., Cooper, D., and Graham, J. (1995). Active shape models - their
training and application. Computer Vision and Image Understanding, 61, 1, 38-59.
Crowley, J., Bérard, F. and Coutaz, J. (1995). Finger tracking as an input device for
augmented reality. In Proc. Workshop Automatic Face and Gesture Recognition, pp. 195-
200.
Cui, Y. and Weng, J. (1996). Hand sign recognition from intensity image sequences with
complex background. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp.
195-200.
Davis, J. and Shah, M. (1994). Visual gesture recognition. Vision, Image, and Signal
Processing, vol. 141, pp. 101-106.
Dietz, P. and Leigh, D. (2001). DiamondTouch: A Multi-User Touch Technology.
Proceedings of the 14th annual ACM Symposium on User Interface Software and
Technology (UIST), ISBN: 1-58113-438-X, pp. 219-226.
DirectShow. (2005). Official DirectShow SDK documentation from MSDN. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directshow/htm/directshow.asp
Drugan, M.M., and Thierens, D. (2004). Evolutionary Markov Chain Monte Carlo. In P.
Liardet (Ed.), Proceedings of the Sixth International Conference on Artificial Evolution - EA
2003 (pp. 63-76). Springer.
Du, H., Oggier, T., Lustenberger, F. and Charbon, E. (2005). A Virtual Keyboard Based on
True-3D Optical Ranging. British Machine Vision Conference, Vol. 1, pp. 220-229.
Fiala, M. (2004). Artag revision 1, a fiducial marker system using digital techniques. In
National Research Council Publication 47419/ERB-1117, November 2004.
Fleck, M., Forsyth, D. A., and Bregler, C. (1996). Finding naked people. In Proc. of the
ECCV, vol. 2, 592-602.
Frescha, A., and Keller, M. (2003). DigiScope: An Invisible Worlds Window. In Adjunct
Proceedings, UbiComp2003, 261-264.
Fukumoto, M. and Tonomura, Y. (1997). "Body Coupled FingeRing": Wireless Wearable
Keyboard. In ACM CHI '97, pp. 147-154.
Galanti, S., and Jung, A. (1997). Low-Discrepancy Sequences: Monte Carlo Simulation of
Option Prices. Journal of Derivatives, 63-83.
Gelb, A., editor (1974). Applied Optimal Estimation. MIT Press, Cambridge, MA.
Gleeson, M., Stanger, N., and Ferguson, E. (2004). Design strategies for GUI items with
touch screen based information systems: assessing the ability of a touch screen overlay as a
selection device. Discussion Paper 2004/02. Department of Information Science, University
of Otago, Dunedin, New Zealand. Available from http://www.business.otago.ac.nz/infosci/pubs/papers/papers/dp2004-02.pdf
Gomez, G. (2002). On selecting colour components for skin detection. In Proc. of the ICPR,
vol. 2, 961-964.
Gomez, G., and Morales, E. (2002). Automatic feature construction and a simple rule
induction algorithm for skin detection. In Proc. of the ICML Workshop on Machine Learning
in Computer Vision, 31-38.
Heap, T. and Samaria, F. (1995). Real-time hand tracking and gesture recognition using
smart snakes. In Proceedings of Interface to Real and Virtual Worlds, pp. 261-271.
Heap, T. and Hogg, D. (1996). Towards 3D hand tracking using a deformable model. In
Proc. IEEE Int. Conf. Automatic Face and Gesture Recognition, pp. 140-145.
Heap, T. and Hogg, D. (1998). Wormholes in shape space: Tracking through discontinuous
changes in shape. In Proc. 6th Int. Conf. on Computer Vision.
Heider, M. (1998). The Adaptive Effects Of Virtual Interfaces: Vestibulo-Ocular Reflex and
Simulator Sickness. PhD thesis, University of Washington.
DNP. (2004). HoloScreen. http://www.en.dnp.dk/data/675421/5948/HOLO_A4_092004.pdf
I-O Display Systems. (2006). I-glasses PC/SGVA. http://www.i-glassesstore.com/iglasses-pc-hr.html
Isard, M. (1998). Visual Motion Analysis by Probabilistic Propagation of Conditional
Density. PhD thesis, University of Oxford.
Isard, M. and Blake, A. (1998a). Condensation - conditional density propagation for visual
tracking. Int. J. Computer Vision, 28, 1, 5-28.
Isard, M. and Blake, A. (1998b). ICONDENSATION: unifying low-level and high-level
tracking in a stochastic framework. In Proc. European Conf. on Computer Vision, Freiburg,
Germany. vol. 1, 893-908.
Isard, M. and MacCormick, J. (2000). Hand tracking for vision-based drawing. Technical
report, Visual Dynamics Group, Dept. Eng. Science, University of Oxford. Available from http://www.robots.ox.ac.uk/~vdg
Ishii, H. and Kobayashi, M. (1992). ClearBoard: A Seamless Media for Shared Drawing and
Conversation with Eye-Contact. In Conference on Human Factors in Computing Systems
(CHI), 525-532.
Jones, M. J. and Rehg, J. M. (1999). Statistical color models with application to skin
detection. In Proc. of the CVPR ’99, vol. 1, 274-280.
Jorda, L., Perrone, M., Costeira, J., and Santos-Victor, J. (1999). Active face and feature
tracking. In Proc. of the 10th International Conference on Image Analysis and Processing,
572-577.
Kass, M., Witkin, A., and Terzopoulos, D. (1987). Snakes: Active contour models. In Proc.
1st Int. Conf. on Computer Vision, 259-268.
Kilian, J. (2001). Simple Image Analysis By Moments. OpenCV library documentation.
Kim, Y., Soh, B., and Lee, S. (2005). A New Wearable Input Device: SCURRY. In IEEE
Transactions on industrial electronics, vol. 52, no. 6, pp. 1490-1499.
Kitagawa, G. (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear state
space models. Journal of Computational and Graphical Statistics, 5, 1, 1-25.
Koike, H., Sato, Y., and Kobayashi, Y. (2001). Integrating paper and digital information on
EnhancedDesk. ACM TOCHI, 8 (4), 307-322.
Kölsch, M. (2004). Vision Based Hand Gesture Interfaces for Wearable Computing and
Virtual Environments. PhD thesis, University of California, Santa Barbara.
Kölsch, M. and Turk, M. (2005). Hand tracking with Flocks of Features. Computer Vision
and Pattern Recognition, CVPR, vol. 2, 20-25.
Kuch, J., and Huang, T. (1995). Vision-based hand modeling and tracking for virtual
teleconferencing and telecollaboration. In Proc. IEEE Int. Conf. Computer Vision, pp. 666-671.
Kurata, T., Okuma, T., Kourogi, M., and Sakaue, K. (2001). The Hand Mouse: GMM Hand-
color Classification and Mean Shift Tracking. In Second Intl. Workshop on Recognition,
Analysis and Tracking of Faces and Gestures in Realtime Systems, pp. 119-124.
Kurata, T., Kato, T., Kourogi, M., Jung, K., and Endo, K. (2002). A functionally-distributed
hand tracking method for wearable visual interfaces and its applications. In IAPR
Workshop on Machine Vision Applications, pp. 84-89.
Kwok, N. M., Zhou, W., Dissanayake, G., and Fang, G. (2005). Evolutionary Particle Filter:
Re-sampling from the Genetic Algorithm Perspective. IEEE/RSJ International Conference
on IROS.
Lee, J. and Kunii, T. (1995). Model-based analysis of hand posture. IEEE Comput. Graph.
Appl., vol. 15, no. 5, pp. 77-86.
Lee, J. Y., and Yoo, S. I. (2002). An elliptical boundary model for skin color detection. In
Proc. of the 2002 International Conference on Imaging Science, Systems, and Technology.
Lee, M. and Woo, W. (2003). ARKB: 3D vision-based Augmented Reality Keyboard.
International Conference on Artificial Reality and Telexistence (ICAT03), paper ISSN 1345-1278,
pp. 54-57.
Lin, J., Wu, Y., and Huang, T. (2000). Modelling the constraints of human hand motion.
Workshop on Human Motion, pp. 121-126.
Lindeman, R., Sibert, J. and Templeman, J. (2001). The Effect of 3D Widget Representation
and Simulated Surface Constraints on Interaction in Virtual Environments. In Proc. of IEEE
Virtual Reality 2001, pp. 141-148.
Lindemann, S. and LaValle, S. (2003). Incremental Low-Discrepancy Lattice Methods for
Motion Planning. ICRA, 2920-2927.
Logitech. (2006). http://www.logitech.com/
MacCormick, J. and Blake, A. (1999). A probabilistic exclusion principle for tracking
multiple objects. In Proc. 7th International Conf. Computer Vision, 572-578.
MacCormick, J. and Isard, M. (2000). Partitioned sampling, articulated objects, and
interface-quality hand tracking. In European Conf. Computer Vision.
MacCormick, J. (2000). Probabilistic modelling and stochastic algorithms for visual
localisation and tracking. PhD thesis, University of Oxford.
Matsushita, N. and Rekimoto, J. (1997). HoloWall: Designing a Finger, Hand, Body, and
Object Sensitive Wall. In Proc. of the ACM UIST'97 Symposium on User Interface Software
and Technology, pp. 209-210.
Malik, S. and Laszlo, J. (2004). Visual touchpad: a two-handed gestural input device. In
Proceedings of ICMI '04, pp. 289-296.
Menser, B., and Wien, M. (2000). Segmentation and tracking of facial regions in color image
sequences. In Proc. Visual Communications and Image Processing, SPIE, 731-740.
MicroOptical. (2005). SV-6 PC viewer. http://www.microopticalcorp.com/DOCS/sV6mobile_MK-0061A.pdf
Mirage Innovations. (2006). LightVu. http://www.mirageinnovations.com
Mollenhauer, M. (2004). Simulator Adaptation Syndrome Literature Review. Realtime
Technologies Technical Report.
Morokoff, W., and Caflisch, R. (1994). Quasi-Random Sequences and Their Discrepancies.
SIAM J. Sci. Comput. 15:6, 1251-1279.
Moro, B. (1995). The full Monte. Risk, 8(2), 53-57.
Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods.
SIAM, Philadelphia, PA.
NIST. (2006). EWMA Control Charts. Engineering Statistics Handbook. http://www.itl.nist.gov/div898/handbook/pmc/section3/pmc324.htm
Nolker, C. and Ritter, H. (1997). Detection of fingertips in human hand movement
sequences. In Proc. of the International Gesture Workshop on Gesture and Sign Language in
Human-Computer Interaction. pp. 209-218.
Nolker, C. and Ritter, H. (1999). GREFIT: Visual Recognition of Hand Postures. In Proc. of
the International Gesture Workshop, 61-72.
Oliver, N., Pentland, A., and Berard, F. (1997). Lafter: Lips and face real time tracker. In
Proc. Computer Vision and Pattern Recognition, 123-129.
OpenCV. (2006). http://sourceforge.net/projects/opencvlibrary
OTL. (1999). http://www.robots.ox.ac.uk/~vdg/Darling.html
Pantrigo, J. J., Sánchez, Á., Gianikellis, K., Montemayor, A. (2005). Combining Particle
Filter and Population-based Metaheuristics for Visual Articulated Motion Tracking.
Electronic Letters on Computer Vision and Image Analysis, (5), No. 3, 68-83.
Pavlovic, V., Sharma, R., and Huang, T. (1997). Visual Interpretation of Hand Gestures for
Human-Computer Interaction: A Review. IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 19, No. 7, 677-695.
Peer, P., Kovac, J., and Solina, F. (2003). Human skin colour clustering for face detection. In
International Conference on Computer as a Tool, EUROCON, The IEEE Region 8, vol. 2,
144-148.
Pérez, P., Vermaak, J., and Blake, A. (2004). Data fusion for visual tracking with particles.
Proceedings of the IEEE, 92(3):495-513.
Pingali, G., Pinhanez, C., Levas, A., Kjeldsen, R., Podlaseck, M., Chen, H. and Sukaviriya,
N. (2003). Steerable Interfaces for Pervasive Computing Spaces. In Proceedings of the First
IEEE International Conference on Pervasive Computing and Communications, pp. 315-322.
Philomin, V., Duraiswami, R., and Davis, L. (2000). Quasi-Random Sampling for
Condensation. ECCV, (2), 134-149.
Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. (1996). Numerical
Recipes: The Art of Scientific Computing. 2nd Edition, Cambridge University Press.
Press, W. H., Flannery, B. P., and Teukolsky, S. A. (1992). Numerical Recipes in C: The Art
of Scientific Computing. Cambridge University Press.
Quek, F. (1995). Eyes in the interface. Image Vision Computing, vol. 13(6), pp. 511-525.
Rehg, J. (1995). Visual Analysis of High DOF Articulated Objects with Application to Hand
Tracking. PhD thesis, Electrical and Computer Eng., Carnegie Mellon University.
Rehg, J., and Morris, D. (1997). Singularities in articulated object tracking with 2-d and 3-d
models. Tech. rep., Digital Equipment Corporation, Cambridge Research Lab.
Rehg, J., and Kanade, T. (1995). Model-based tracking of self-occluding articulated objects.
In Proc. IEEE Int. Conf. Computer Vision, pp. 460-475.
Rekimoto, J. and Matsushita, N. (1997). Perceptual surfaces: Towards a human and object
sensitive interactive display. In Workshop on Perceptual User Interfaces (PUI-97), pp. 30-32.
Roeber, H., Bacus, J. and Tomasi, C. (2003). Typing in thin air: the Canesta projection
keyboard -- A new method of interaction with electronic devices. Proceedings of the
Conference on Human Factors in Computing Systems (CHI 2003), pp. 712-713.
Rosales, R. and Sclaroff, S. (2000). Inferring body pose without tracking body parts. In Proc.
IEEE Conf. Computer Vision and Pattern Recognition, vol II, pp. 721-727.
Schumeyer, R., and Barner, K. (1998). A color-based classifier for region identification in
video. In Proc. Visual Communications and Image Processing, SPIE, vol. 3309, 189-200.
Sears, A. (1991). Improving Touchscreen Keyboards: Design Issues and a Comparison with
Other Devices. Interacting with Computers, vol. 3, 253-269.
Segen, J. and Kumar, S. (1998). GestureVR: Vision-Based 3D Hand Interface for Spatial
Interaction. In ACM Multimedia Conference, Bell Laboratories, pp. 455-464.
Senseboard. (2000). http://www.senseboard.com
Shimada, N., Shirai, Y., Kuno, Y., and Miura, J. (1998). Hand Gesture Estimation and Model
Refinement using Monocular Camera -- Ambiguity Limitation by Inequality Constraints. In
Proc. of The 3rd Int. Conf. on Automatic Face and Gesture Recognition, pp. 268-273.
Shimada, N., Kimura, K., and Shirai, Y. (2001). Real-time 3-D hand posture estimation
based on 2-D appearance retrieval using monocular camera. In Proc. Int. WS RAFTFG-RTS,
pp. 23-30.
Sigal, L., Sclaroff, S., and Athitsos, V. (2000). Estimation and prediction of evolving color
distributions for skin segmentation under varying illumination. In Proc. IEEE Conf. on
Computer Vision and Pattern Recognition, vol. 2, 152–159.
Skarbek, W., and Koschan, A. (1994). Colour image segmentation – a survey –. Tech. Rep.
Institute for Technical Informatics, Technical University of Berlin, October.
Soriano, M., Martinkauppi, B., Huovinen, S., and Laaksonen, M. (2000). Using the skin
locus to cope with changing illumination conditions in color-based face tracking. In Proc. of
the IEEE Nordic Signal Processing Symposium, pp. 383-386.
Stafford-Fraser, Q. and Robinson, P. (1996). BrightBoard: A Video-Augmented Environment. In
Proc. of the CHI96, pp. 134-141.
Starner, T. and Pentland, A. (1995). Real-Time American Sign Language Recognition From
Video Using Hidden Markov Models. In International Symposium on Computer Vision, vol.
5B Systems and Applications, pp. 265-270.
Stefanov, N., Galata, A. and Hubbold, R. (2005). Real-time hand tracking with Variable-length
Markov Models of behaviour. In IEEE Int. Workshop on Vision for Human-Computer
Interaction (V4HCI), in conjunction with CVPR 2005.
Stenger, B., Mendonca, P., and Cipolla, R. (2001). Model-Based 3D Tracking of an
Articulated Hand. In CVPR, Volume II, pp. 310-315.
Stenger, B., Thayananthan, A., Torr, P., and Cipolla, R. (2006). Model-Based Hand Tracking
Using a Hierarchical Bayesian Filter. In PAMI, vol. 28, No. 9, pp. 1372-1384.
Stern, H., and Efros, B. (2002). Adaptive color space switching for face tracking in multi-colored
lighting environments. In Proc. of the International Conference on Automatic Face
and Gesture Recognition, 249-255.
Stoll, P. and Ohya, J. (1995). Application of HMM modelling to recognizing human gestures
in image sequences for a man-machine interface. In Proc. IEEE Int. Workshop on Robot and
Human Communication. pp. 129-134.
Sturman, D. J. (1992). Whole-Hand Input. PhD thesis, Media Arts and Science Laboratory,
Massachusetts Institute of Technology, Cambridge, MA, USA.
Terrillon, J. C., Shirazi, M. N., Fukamachi, H., and Akamatsu, S. (2000). Comparative
performance of different skin chrominance models and chrominance spaces for the automatic
detection of human faces in color images. In Proc. of the International Conference on Face
and Gesture Recognition, 54-61.
Tezuka, A. (1995). Uniform Random Numbers: Theory and Practice. Kluwer Academic
Publishers.
Tissainayagam, P. and Suter, D. (2002). Performance measures for assessing contour
trackers. Int. Journal of Image and Graphics, 2, 343-359.
Triesch, J. and Malsburg, C. (1996). Robust classification of hand postures against complex
background. In Proc. Int. Conf. Automatic Face and Gesture Recognition, pp. 170-175.
Uosaki, K., Kimura, Y., and Hatanaka, T. (2004). Evolution strategies based particle filters
for state and parameter estimation of nonlinear models. Congress of Evolutionary
Computation, 884-890.
Vezhnevets, V., Sazonov, V., and Andreeva, A. (2003). A Survey on Pixel-Based Skin Color
Detection Techniques. In Proc. Graphicon, 85-92.
Vogler, C. and Metaxas, D. (1998). ASL Recognition Based on a Coupling Between HMMs
and 3D Motion Analysis. In Proc. International Conference on Computer Vision. Mumbai,
India. pp. 363-369.
Welch, G., and Bishop, G. (2002). An introduction to the Kalman filter. Technical Report 95-041,
University of North Carolina at Chapel Hill, Department of Computer Science.
Wellman, H., Davis, L., Punnett, L., and Dewey, R. (2004). Work-related carpal tunnel
syndrome (WR-CTS) in Massachusetts, 1992-1997: source of WR-CTS, outcomes, and
employer intervention practices. American Journal of Industrial Medicine, 45, 139-152.
Wilson, A. (2004). TouchLight: An Imaging Touch Screen and Display for Gesture-Based
Interaction. International Conference on Multimodal Interfaces.
Wilson, A. (2005). PlayAnywhere: A Compact Tabletop Computer Vision System. Proc.
UIST '05, ACM Press, pp. 83-92.
Wellner, P. (1993). Interacting with paper on the DigitalDesk. Communications of the ACM,
36(7), pp. 87-96.
Wren, C., Azarbayejani, A., Darrell, T., and Pentland, A. (1997). Pfinder: Real-time tracking
of the human body. In IEEE Transactions on Pattern Analysis and Machine Intelligence,
19(7), pp. 780-785.
Wu, Y. and Huang, T. (1999). Capturing articulated human hand motion: A divide-and-conquer
approach. In Proc. IEEE Int. Conf. Computer Vision, pp. 606-611.
Wu, Y. and Huang, T. (2000). View-independent recognition of hand postures. In Proc.
IEEE Int. Conf. Computer Vision and Pattern Recognition, vol. II, pp. 88-94.
Wu, Y. and Huang, T. (2001). A Co-inference Approach to Robust Visual Tracking. In Proc.
IEEE ICCV, Vol. II, 26-33.
Wu, Y., Lin, J., and Huang, T. (2001). Capturing natural hand articulation. In ICCV, volume
2, 426-432.
Yang, J., Xu, Y., and Chen, C. (1994). Gesture interface: Modelling and learning. In Proc.
IEEE Int. Conf. Robotics and Automation, vol. 2, pp. 1747-1752.
Yang, M. H., and Ahuja, N. (1999). Gaussian mixture model for human skin color and its
applications in image and video databases. In Proc. of the SPIE: Conf. On Storage and
Retrieval for Image and Video Databases, vol. 3656, 458-466.
Yang, M., and Ahuja, N. (1998). Detecting human faces in color images. In Proc. of ICIP,
vol. 1, 127-130.
Yuille, A. and Hallinan, P. (1992). Deformable templates. In Blake, A. and Yuille, A.,
editors, Active Vision, 20-38. MIT.
Zhang, Z., Wu, Y., Shan, Y. and Shafer, S. (2001). Visual panel: Virtual mouse, keyboard
and 3d controller with an ordinary piece of paper. In Workshop on Perceptive User
Interfaces. ACM Digital Library, ISBN 1-58113-448-7.
Zarit, B. D., Super, B. J. and Quek, F. K. H. (1999). Comparison of five color models in skin
pixel classification. In ICCV’99 Int’l Workshop on recognition, analysis and tracking of
faces and gestures in Real-Time systems, 58-63.
Zhou, H. and Huang, T. (2003). Tracking articulated hand motion with eigen-dynamics
analysis. In Proc. 9th Int. Conf. on Computer Vision, vol. 2, pp. 1102-1109.