
Real-time head pose classification in uncontrolled environments with Spatio-Temporal Active Appearance Models

Miguel Reyes∗ and Sergio Escalera+ and Petia Radeva+

∗ Matematica Aplicada i Analisi, Universitat de Barcelona, Barcelona, Spain. E-mail: [email protected]

+ Matematica Aplicada i Analisi, Universitat de Barcelona, Barcelona, Spain. E-mail: [email protected]

+ Matematica Aplicada i Analisi, Universitat de Barcelona, Barcelona, Spain. E-mail: [email protected]

Abstract

In this paper, we present a fully automatic real-time system for recognizing head pose in uncontrolled environments over a continuous spatio-temporal behavior of the subject. The method is based on tracking facial features through Active Appearance Models. To differentiate and identify the different head poses, we use a multi-classifier composed of several binary Support Vector Machines. Finally, we propose a continuous solution to the problem using the Tait-Bryan angles, addressing the head pose as an object that performs a rotary motion in three-dimensional space.

1 Introduction

Recognition of head pose has been a challenging problem in recent years, mainly due to the technology industry's interest in developing new interfaces to interact with the new generation of gadgets and applications. When we speak about head pose, we refer to its spatial location, treating the head like a point in a coordinate space, which can perform positional and rotational movements in any of the axes of the space where it moves. In most articles, the problem is addressed in a two-dimensional space, where a subject can make head movements on the horizontal and vertical axes. In a majority of published works [1], a two-stage sequential approach is adopted, aiming at locating the regions of interest and verifying the hypotheses on the head pose's presence (detection stage), and subsequently determining the type of the detected head pose (recognition stage).

In this paper, we perform a detection of a set of facial features. For both detection and tracking, we rely on Active Appearance Models. The face pose recovery is obtained by analyzing image sequences from uncontrolled environments, performing detection and tracking based on continuous spatio-temporal constraints.

2 Method

Since the method should work in uncontrolled environments, we do not have information about the localization of the face. Therefore, we apply a windowing strategy using frontal and profile Viola-Jones [4] detectors, which provide positive bounding boxes for the initialization of the facial feature locations.
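As an illustration of this windowing step, the following minimal sketch (Python with OpenCV, which ships trained frontal and profile Viola-Jones cascades; the cascade file names and scanning parameters are assumptions for illustration, not taken from the paper) collects candidate face bounding boxes that could initialize the facial-feature mesh.

```python
import cv2

# Assumed cascade files; these trained Viola-Jones detectors ship with OpenCV.
frontal = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
profile = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_profileface.xml")

def detect_face_boxes(frame):
    """Return candidate (x, y, w, h) boxes from frontal and profile detectors."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = list(frontal.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5))
    boxes += list(profile.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5))
    # The profile cascade covers only one side; flip the image to catch the other.
    flipped = cv2.flip(gray, 1)
    for (x, y, w, h) in profile.detectMultiScale(flipped, scaleFactor=1.1, minNeighbors=5):
        boxes.append((gray.shape[1] - x - w, y, w, h))  # map back to original coordinates
    return boxes
```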


2.1 Detection and tracking of facial features

Once we have reduced the original image information into a region of interest where facial features are present, the next step is to fit all of them in a contextual model by means of a mesh using Active Appearance Models [2] (AAM). AAM benefits from Active Shape Models [3], combining shape and texture information. An Active Appearance Model is generated by combining a model of shape and texture variation. First, a set of points is marked on the face of the training images, which are aligned, and a statistical shape model is built. Each training image is warped so that the points match those of the mean shape. The result is raster scanned into a texture vector, $g$, which is normalized by applying a linear transformation, $g \rightarrow (g - \mu_g \mathbf{1})/\sigma_g$, where $\mathbf{1}$ is a vector of ones, and $\mu_g$ and $\sigma_g^2$ are the mean and variance of the elements of $g$. After normalization, $g^T \mathbf{1} = 0$ and $|g| = 1$. Then, eigenanalysis is applied to build a texture model. Finally, the correlations between shape and texture are learnt to generate a combined appearance model. The appearance model has a parameter $c$ controlling the shape and texture according to:

$$x = \bar{x} + Q_s c$$

$$g = \bar{g} + Q_g c$$

where $\bar{x}$ is the mean shape, $\bar{g}$ the mean texture in a mean-shaped patch, and $Q_s$, $Q_g$ are matrices describing the modes of variation derived from the training set. A shape $X$ in the image frame can be generated by applying a suitable transformation to the points $x$: $X = S_t(x)$. Typically, $S_t$ will be a similarity transformation described by a scaling $s$, an in-plane rotation $\theta$, and a translation $(t_x, t_y)$. Once the AAM is constructed, it is deformed on the image to detect and segment the face appearance as follows. During matching, we sample the pixels in the region of interest, $g_{im}$, apply the texture transformation $T_u(g) = (u_1 + 1)g + u_2 \mathbf{1}$, where $u$ is the vector of transformation parameters, and project into the texture model frame, $g_s = T_u^{-1}(g_{im})$. The current model texture is given by $g_m = \bar{g} + Q_g c$. The current difference between model and image (measured in the normalized texture frame) is as follows:

$$r(p) = g_s - g_m$$

Given the error $E = |r|^2$, we compute the predicted displacements $\delta p = -R\,r(p)$, where $R = \left(\frac{\partial r}{\partial p}^T \frac{\partial r}{\partial p}\right)^{-1} \frac{\partial r}{\partial p}^T$. The model parameters are updated as $p \mapsto p + \kappa\,\delta p$, where initially $\kappa = 1$. The new points $X'$ and model-frame texture $g'_m$ are estimated, and the image is sampled at the new points to obtain $g'_{im}$, giving the new error vector $r' = T_{u'}^{-1}(g'_{im}) - g'_m$. A final condition guides the end of each iteration: if $|r'|^2 < E$, we accept the new estimate; otherwise, we set $\kappa = 0.5$, then $\kappa = 0.25$, and so on. The procedure is repeated until no improvement is made to the error. Finally, we obtain a description of the facial features through the shape information ($X'$) and the texture ($g_m$). Regarding tracking, the shape and texture features at time $t+1$ are a slight variation of the facial features at the previous time $t$; therefore, the cost of convergence to the new pose is low, provided the subject's behavior satisfies spatio-temporal continuity.
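A minimal numerical sketch of this matching loop is given below (Python/NumPy). The sampling and model-texture routines and the precomputed gradient matrix R are placeholders standing in for the trained AAM, so the code only illustrates the update-and-backtrack scheme, not a full implementation.

```python
import numpy as np

def aam_fit(p0, sample_texture, model_texture, R, max_iters=30):
    """Refine AAM parameters p by the predicted-displacement rule
    delta_p = -R r(p), halving the step kappa whenever the error does not drop.

    sample_texture(p) -> normalized image texture g_s for parameters p (placeholder)
    model_texture(p)  -> model texture g_m for parameters p (placeholder)
    R                 -> precomputed gradient matrix (placeholder)
    """
    p = p0.copy()
    r = sample_texture(p) - model_texture(p)        # residual r(p) = g_s - g_m
    error = np.dot(r, r)                            # E = |r|^2
    for _ in range(max_iters):
        delta_p = -R @ r                            # predicted parameter displacement
        improved = False
        for kappa in (1.0, 0.5, 0.25, 0.125):       # backtracking on the step size
            p_new = p + kappa * delta_p
            r_new = sample_texture(p_new) - model_texture(p_new)
            e_new = np.dot(r_new, r_new)
            if e_new < error:                       # accept only if the error decreases
                p, r, error, improved = p_new, r_new, e_new, True
                break
        if not improved:                            # no step size helped: stop iterating
            break
    return p, error
```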

2.2 Head pose recovery

The description of facial features through AAM provides a shape descriptor vector consisting of 21 points which form the silhouette of a mesh. The target is to discretize the different mesh configurations that can be formed, which correspond to different head poses. We use a training set, labeling each point structure, and then perform discrete classification. The outputs of the classifier are as follows: right, middle-right, frontal, middle-left, and left. In order to obtain the discrete classifier, we establish a training set with different structures of points relating to the head pose. These structures of points are aligned. The Support Vector Machine classifier [6] is used in a one-against-all design in order to perform five-class classification. The final choice of the discrete output is done by majority voting. Taking into account the discontinuity that appears when a face moves from frontal to profile view, we use three different AAMs corresponding to three meshes: frontal view $\Im_F$, right lateral view $\Im_R$, and left lateral view $\Im_L$. In order to include temporal and spatial coherence, meshes at frame $t+1$ are initialized by the fitted mesh points at frame $t$. Additionally, we include a temporal change-mesh control procedure, as follows:

$$\Im_{t+1} = \arg\min_{\Im_{t+1}} \{E_{\Im_F}, E_{\Im_R}, E_{\Im_L}\}, \qquad \Im_{t+1} \in \nu(\Im_t)$$

where $\nu(\Im_t)$ corresponds to the meshes contiguous to the mesh $\Im_t$ fitted at time $t$ (including the same mesh). This constraint avoids false jumps and imposes smoothness in the temporal face behavior (e.g. a jump from right to left profile view is not allowed).
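As a sketch of the discrete classification stage, the snippet below trains one-against-all binary SVMs on aligned 21-point shape vectors (Python with scikit-learn). The feature layout, the use of LinearSVC, and the choice of the largest decision value as a stand-in for the paper's majority voting are assumptions for illustration, not the authors' exact configuration.

```python
import numpy as np
from sklearn.svm import LinearSVC

POSES = ["left", "middle-left", "frontal", "middle-right", "right"]

def train_one_vs_all(X, y):
    """X: (n_samples, 42) aligned shape vectors (21 points, x/y concatenated).
    y: integer labels indexing POSES. Returns one binary SVM per pose."""
    return [LinearSVC(C=1.0, max_iter=10000).fit(X, (y == k).astype(int))
            for k in range(len(POSES))]

def classify(models, shape_vec):
    """Pick the pose whose binary SVM gives the largest decision value
    (a margin-based stand-in for the majority voting described above)."""
    scores = [m.decision_function(shape_vec.reshape(1, -1))[0] for m in models]
    return POSES[int(np.argmax(scores))]
```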

In order to obtain a continuous output, the goal is to extract the angles of the pitch and yaw movement between two consecutive frames. The angles are extracted from the following transformations:

$$R_{y,\theta} = \begin{pmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{pmatrix}, \qquad R_{x,\psi} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\psi & -\sin\psi \\ 0 & \sin\psi & \cos\psi \end{pmatrix}$$

The transformation that causes the motion is $R = R_{y,\theta} R_{x,\psi}$. Collecting in $V_i$ the shape information of frame $t$ (obtained through AAM), and in $V_f$ that of frame $t+1$, the $\theta$ and $\psi$ angles are extracted by solving the following trigonometric equation:

$$R V_i = V_f$$
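One way to recover θ and ψ numerically is a least-squares fit of the rotation to corresponding shape points. The sketch below (Python with SciPy) assumes Vi and Vf are 3×N arrays of corresponding points; it is only an illustration of the idea, not the authors' solver.

```python
import numpy as np
from scipy.optimize import least_squares

def rotation(theta, psi):
    """R = R_y(theta) R_x(psi), as defined in the text."""
    Ry = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
                   [ 0.0,           1.0, 0.0          ],
                   [-np.sin(theta), 0.0, np.cos(theta)]])
    Rx = np.array([[1.0, 0.0,          0.0         ],
                   [0.0, np.cos(psi), -np.sin(psi)],
                   [0.0, np.sin(psi),  np.cos(psi)]])
    return Ry @ Rx

def recover_angles(Vi, Vf):
    """Find (theta, psi) minimizing ||R(theta, psi) Vi - Vf|| over corresponding points."""
    def residuals(angles):
        theta, psi = angles
        return (rotation(theta, psi) @ Vi - Vf).ravel()
    sol = least_squares(residuals, x0=np.zeros(2))
    return sol.x  # yaw (theta) and pitch (psi) in radians
```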

3 Results

First, we describe the data used for evaluation. Regarding the creation of the training set for the detection of facial features, we used the database called Labeled Faces in the Wild [5]. This data set is composed of several samples from frontal, right, and left profiles. In Figure 2, shape characteristics of each of the classes are shown, using 98.45% of the principal components after manual labeling.

Figure 1: Discrete samples of the classifier output

Figure 2: Shape projected on three PCA components
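A sketch of projecting the labeled shape vectors onto three principal components for such a visualization is shown below (Python with scikit-learn; purely illustrative, as the exact preprocessing is not described in the paper).

```python
import numpy as np
from sklearn.decomposition import PCA

def project_shapes_3d(shapes):
    """Project aligned shape vectors onto the first three principal components,
    as used for the class-wise visualization (Figure 2)."""
    X = np.asarray(shapes, dtype=float)   # (n_samples, 42) for 21 (x, y) points
    pca = PCA(n_components=3)
    return pca.fit_transform(X), pca.explained_variance_ratio_
```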

On the other hand, we obtain a representation of the texture features as shown in Figure 3. Unlike the previous case, the projection of the texture characteristics has a similar trend for all three classes. This is because the texture features are based mainly on skin color.

The next phase of testing is to evaluate the method on a video sequence. The results are shown in the following table.

Head pose classification on a sequence of 7269 frames:

Head pose       Occurrence (fraction)   System success (%)
Left            0.1300                  87
Middle-left     0.1470                  73
Frontal         0.2940                  95
Middle-right    0.1650                  76
Right           0.2340                  89

Finally, we performed different tests to observe the continuous system output for the same problem (Figure 4).


Figure 3: Texture projected on three PCA components

Figure 4: Initial frame with φ = 0 and θ = 0, and final frame with φ = 18.44 and θ = −6.0385

4 Conclusion

It is noted that AAM is able to detect a high number of facial features from different views in a robust and reliable way. Another quality of AAM is its ability to work successfully in situations arising in uncontrolled environments, such as partial occlusion or noise. It has also been observed that the robustness and reliability of the facial feature extraction is very dependent on the number of points of the model: a greater number of points is more difficult to detect and track. Another important factor regarding the number of points that form a mesh is its computational cost, since this is entirely dependent on the density of points forming the pattern. This is of paramount importance for real-time applications, since a high number of points can make the system unsustainable. In our work, a number of points greater than 21 can lead to slowdowns on existing home computers.

Regarding the discrete outputs obtained by the classifier, we observed that the results were successful when the head pose was near-fully frontal or profile. For the continuous output of the system, its success rate is totally dependent on, and sensitive to, the tracking carried out by the AAM. Currently, the computational cost of calculating the angles makes it difficult for the system to work in real time.

References

[1] Lisa M. Brown and Ying-Li Tian, "Comparative Study of Coarse Head Pose Estimation", IBM T.J. Watson Research Center, Hawthorne, NY 10532.

[2] T. Cootes, J. Edwards, and C. Taylor, "Active Appearance Models", IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681-685.

[3] T. Cootes, C. Taylor, D. Cooper, and J. Graham, "Active Shape Models - their training and application", Computer Vision and Image Understanding, 61(1):38-59.

[4] Paul Viola and Michael Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features", Conference on Computer Vision and Pattern Recognition, 2001.

[5] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller, "Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments", University of Massachusetts, Amherst, 2007.

[6] David Meyer, Friedrich Leisch, and Kurt Hornik, "The support vector machine under test", Neurocomputing, 55(1-2):169-186, 2003.