
    Face Alignment Robust to Occlusion

Myung-Cheol Roh, Takaharu Oguri, and Takeo Kanade
Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, USA
{myungroh, oguri, takeo.kanade}@cs.cmu.edu

Abstract: In this paper we present an approach to robustly align facial features to a face image even when the face is partially occluded. Previous methods are vulnerable to partial occlusion of the face, since it is assumed, explicitly or implicitly, that there is no significant occlusion. In order to cope with this difficulty, our approach relies on two schemes: one is explicit multi-modal representation of the response from each of the face feature detectors, and the other is RANSAC hypothesize-and-test search for the correct alignment over subset samplings of those feature response modes. We evaluated the proposed method on a large number of facial images, occluded and non-occluded. The results demonstrated that the alignment is accurate and stable over a wide range of degrees and variations of occlusion.

    I. INTRODUCTION

In this paper we present a new approach to robustly align facial features to a face image even when the face is partially occluded. Previous face alignment methods [1], [2], [3], [4], [5] find the optimal fit of a regularized face-shape model (such as a PCA-regularized model) using all of the observable local features, typically by iterative gradient search around the initial position obtained by a face detector. Although certain safety measures are often combined, such as adjustable weights on different features depending on feature strength, these methods are vulnerable to partial occlusion of the face, since it is assumed, explicitly or implicitly, that there is no significant occlusion and that the given initial position is close to the correct position.

Fig. 1 shows examples of face alignments on non-occluded and occluded facial images: Fig. 1(a) shows face detection results by a Haar-like feature-based face detector [6], and Fig. 1(b) shows alignment results by the Bayesian Tangent Shape Model (BTSM) [2]. For the left image in Fig. 1(b), facial features in the upper part of the face were aligned, but those in the lower face, including the mouth and jaw, were not perfectly aligned because of confusion by a mustache and beard. In the right image of Fig. 1(b), which contains heavy occlusion, many of the visible features were not aligned correctly. It is worth noting that the alignments of the upper-face features, though completely visible themselves, are affected by the lower-face occlusion.

In fact, the difficulty that partial occlusion causes is a cyclic dilemma: occluded or erroneous features should not participate in alignment, but one cannot tell whether a feature is occluded unless the correct alignment is known. In order to cope with this difficulty, our approach relies on two schemes: one is explicit multi-modal representation of the response from each of the face feature detectors, and the other is RANSAC (random sample consensus) hypothesize-and-test search for the correct alignment over subset samplings of those response modes.

Fig. 1. (a) Face detection by a Haar-like feature-based face detector [6]; (b) face alignment results by BTSM [2].

Fig. 2 shows an overview of the proposed method. Given an image, a large number of detectors, each trained for an individual facial landmark feature, run and produce their response maps. Each of the red boxes in Fig. 2(a) represents the region where each feature detector is applied.¹ In each response map, all the response peaks are identified and localized by the Mean-Shift method, and each peak is described by a 2-D Gaussian model (Fig. 2(b)). The correct alignment we seek is the combination of response peaks, one selected from each of the visible facial features and none from the occluded features, whose locations match well with the face shape model. Since we do not know beforehand which features are visible, we employ the RANSAC strategy. A subset of features is randomly chosen and assumed to be inliers (visible), a hypothesis of face alignment is generated from them, and its agreement with the other remaining features is tested in terms of the median of the mismatch degrees (Fig. 2(c)). This is repeated until an alignment with an acceptable degree of mismatch is found. The alignment is slightly adjusted by using only the responses of those features whose degree of mismatch is less than the median (that is, those that are identified as visible). Finally, the shape is refined with additional inliers that are also estimated as visible (Fig. 2(d)). There are two ideas in the RANSAC-based strategy. The first is the use of feature subsets for hypothesis generation, which increases the chance that the generated hypotheses will include the correct one. The second is the use of the median (instead of the mean) of the degree of mismatch for evaluating hypotheses; it favors a hypothesis in which a majority of features match very well and some do not, over one in which all of the features match relatively well on average.

¹ Setting search regions is not necessary in our method; for the sake of efficiency, however, we use search regions whose size and location are found by a face detector [6].
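To make the hypothesize-and-test loop concrete, here is a minimal Python sketch of the strategy just described. The `hallucinate` and `residual` callables are hypothetical stand-ins for the closed-form fitting of eq. (5) and the residual of eq. (7) presented later; only the control flow, subset sampling, and median scoring follow the text.

```python
import random
import statistics

def ransac_align(feature_points, hallucinate, residual, m,
                 max_iters=400, threshold=1.0):
    """Hypothesize-and-test loop from Fig. 2.

    feature_points: dict mapping each feature ID to its list of candidate peaks
    hallucinate:    hypothetical helper fitting a shape to sampled points (eq. (5))
    residual:       hypothetical helper scoring one feature against a shape (eq. (7))
    m:              minimum number of feature IDs needed to hallucinate a shape
    """
    best_shape, best_median = None, float("inf")
    ids = list(feature_points)
    for _ in range(max_iters):
        # Hypothesize: sample a subset of feature IDs assumed visible,
        # then one candidate peak for each sampled ID.
        subset = random.sample(ids, m)
        sample = {i: random.choice(feature_points[i]) for i in subset}
        shape = hallucinate(sample)
        # Test with the median (not the mean) of the mismatch, favoring a
        # hypothesis where a majority of features match very well.
        med = statistics.median(
            min(residual(shape, i, pt) for pt in feature_points[i]) for i in ids)
        if med < best_median:
            best_shape, best_median = shape, med
        if best_median < threshold:
            break
    return best_shape
```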


Fig. 2. An overview of the proposed method. (a) Search regions of three examples in red boxes. (b) Feature response maps of the right-eye-corner (top), nose-part (middle), and cheek-part (bottom) detectors; red circles represent the estimated distributions, and in the top response map no feature point was detected. (c) Hypotheses of face alignments in blue lines. (d) The red line shows the final alignment after refinement; yellow circles represent feature points identified as visible.

We evaluated the proposed method on a large number of facial images, occluded and non-occluded. The results have demonstrated that the method produces accurate and stable alignments, coping with a wide range of degrees and variations of occlusion and confusing appearances.

    II. FACE MODEL

We define two terms as follows: a feature ID is the name of a point in the PDM (Point Distribution Model), and a feature point is an instance of a feature ID, namely a response peak described by a 2-D Gaussian model.

In this paper, the shape variation of a face is represented by the PDM in the Constrained Local Model (CLM) framework. The non-rigid shape function can be described as

$$S = \bar{X} + \Phi p \qquad (1)$$

where $\bar{X} = [\bar{x}_1^t, \ldots, \bar{x}_{N_I}^t]$ is the mean shape, $p$ is a shape parameter vector, and $\Phi$ is the matrix of concatenated eigenvectors. The first 4 eigenvectors of $\Phi$ are forced to correspond to similarity variations. In order to get $p$, we employ the CLM as convex quadratic fitting in [1].
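As a small illustration, generating a shape from the PDM of eq. (1) is a single affine operation; the sketch below assumes the mean shape and eigenvector matrix are given as NumPy arrays.

```python
import numpy as np

def pdm_shape(x_bar, phi, p):
    """Eq. (1): generate a shape S = X_bar + Phi * p.

    x_bar: (2*N_I,) mean shape, landmark coordinates stacked as [x1, y1, x2, y2, ...]
    phi:   (2*N_I, K) matrix of concatenated eigenvectors (first 4 columns:
           similarity variations)
    p:     (K,) shape parameter vector
    """
    return x_bar + phi @ p
```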

    A. Learning Local Feature Detectors

In the CLM framework, a local feature detector must be trained for each feature ID. In this paper, linear Support Vector Machines (SVMs) are employed as local feature detectors because of their computational efficiency [7]. To learn local patches, the images in the training data are normalized based on the ground truth: rotations are normalized by Generalized Procrustes Analysis (GPA) [8], and scales are normalized on the basis of the size of the face region given by a face detector. Positive patches are obtained by centering on the ground truth, while negative examples are obtained by sampling patches shifted away from the ground truth. The output of the feature detector is obtained by fitting a logistic regression function to the score of the SVM and the label {not aligned (-1), aligned (1)} [1].

Fig. 3. Distributions of feature IDs in the training data: (a) feature distributions of all feature IDs; (b) examples of feature distributions: left eye corner (top), right nostril (middle), and left chin (bottom).
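A minimal sketch of this training recipe, assuming patches have already been extracted and vectorized; it uses scikit-learn's LinearSVC with a logistic regression fit to the SVM scores, which is one common way to realize the calibration described here (not necessarily the exact implementation of [1]).

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

def train_feature_detector(pos_patches, neg_patches):
    """Train one local detector: a linear SVM whose score is calibrated by
    logistic regression into an 'aligned' probability.

    pos_patches: (n_pos, d) patches centered on the ground-truth landmark
    neg_patches: (n_neg, d) patches shifted away from the ground truth
    """
    X = np.vstack([pos_patches, neg_patches])
    y = np.hstack([np.ones(len(pos_patches)), -np.ones(len(neg_patches))])
    svm = LinearSVC(C=1.0).fit(X, y)
    # Calibrate: fit a logistic regression to the SVM scores vs. the +/-1 labels.
    scores = svm.decision_function(X).reshape(-1, 1)
    calib = LogisticRegression().fit(scores, y)

    def respond(patch):
        s = svm.decision_function(patch.reshape(1, -1)).reshape(-1, 1)
        return calib.predict_proba(s)[0, 1]  # probability of 'aligned' (+1)

    return respond
```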

    B. Learning Property of Feature ID

In [1], for the sake of simplicity, diagonal matrices were used to represent feature distributions. Fig. 3 shows the distributions of feature responses for several feature IDs. The distribution shapes for the nostril and eye corner (top and middle images in Fig. 3(b)) are very different from that for the cheek region (bottom image in Fig. 3(b)). The eye corner and nostril can be localized accurately, since they have clear peaks in their feature responses; the feature ID in the cheek region cannot, since it has a long elliptical distribution. In order to reflect this property, we use a full matrix to represent each distribution. This property is also used later, in the shape refinement stage.

To learn the property of the $k$th feature ID, feature response maps of local images around the ground truth are obtained from the training data, the mean of the feature response maps is calculated, and the distribution of the mean is described by a full covariance matrix as follows:

$$\Sigma_k = \begin{bmatrix} \mu_{20}/\mu_{00} & \mu_{11}/\mu_{00} \\ \mu_{11}/\mu_{00} & \mu_{02}/\mu_{00} \end{bmatrix} \qquad (2)$$

where $\mu_{pq} = \sum_x \sum_y (x - \bar{x})^p (y - \bar{y})^q F_k(x, y)$, $F_k(x, y)$ is the feature response value at $(x, y)$, and $(\bar{x}, \bar{y})$ is the centroid position of the feature response.
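The covariance of eq. (2) is a pair of second-order image moments of the mean response map; a sketch, assuming the map is a non-negative 2-D NumPy array:

```python
import numpy as np

def response_covariance(F):
    """Eq. (2): full covariance of a feature response map F (2-D array)
    from its image moments mu_pq."""
    ys, xs = np.mgrid[0:F.shape[0], 0:F.shape[1]]
    m00 = F.sum()
    xc, yc = (xs * F).sum() / m00, (ys * F).sum() / m00  # response centroid
    mu = lambda p, q: ((xs - xc) ** p * (ys - yc) ** q * F).sum()
    return np.array([[mu(2, 0), mu(1, 1)],
                     [mu(1, 1), mu(0, 2)]]) / m00
```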

III. FACE ALIGNMENT WITH OCCLUSION

    A. Feature Response Extraction

Unlike conventional alignment methods, which search small areas around a given initial shape [1], [2], [3], the proposed method does not require an initial shape. However, for the sake of efficiency, a search window for each feature ID is set according to the face position given by a face detector. Within the search windows, feature responses are calculated and candidate feature points are obtained.

To obtain multiple candidate feature points from a multi-modal distribution of feature response, the feature response is partitioned into multiple segment regions by the Mean-Shift segmentation algorithm [9]. Then the distributions in the segmented regions are approximated through convex quadratic functions as follows:

$$\arg\min_{A_k^l,\, b_k^l,\, c_k^l} \sum_{x \in \Omega_k^l} \left\| E_k\{Y(x_k^l + x)\} - \left( x^T A_k^l x + 2\, b_k^{lT} x + c_k^l \right) \right\|^2 \quad \text{subject to } A_k^l \succ 0 \qquad (3)$$

where $\Omega_k^l$ is the $l$th segment region of the $k$th feature ID, $x_k^l$ is the centroid of $\Omega_k^l$, $E_k(\cdot)$ is the inverted match-score function obtained by applying the $k$th feature detector to the source image $Y$, $A_k^l = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}$, and $b_k^l = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}$.
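A sketch of the per-segment fitting in eq. (3), assuming the segment's sample offsets and inverted match scores are given. The paper constrains $A_k^l$ to be positive definite; the eigenvalue clipping below is a simple stand-in for that constraint, not necessarily the solver used in [1].

```python
import numpy as np

def fit_convex_quadratic(offsets, e_values):
    """Least-squares fit of eq. (3): approximate inverted match scores
    E_k{Y(x_k^l + x)} by x^T A x + 2 b^T x + c over one Mean-Shift segment.

    offsets:  (n, 2) positions x relative to the segment centroid x_k^l
    e_values: (n,) inverted match scores at those positions
    """
    u, v = offsets[:, 0], offsets[:, 1]
    D = np.column_stack([u * u, 2 * u * v, v * v, 2 * u, 2 * v, np.ones_like(u)])
    a11, a12, a22, b1, b2, c = np.linalg.lstsq(D, e_values, rcond=None)[0]
    A = np.array([[a11, a12], [a12, a22]])
    w, V = np.linalg.eigh(A)
    A = V @ np.diag(np.clip(w, 1e-6, None)) @ V.T  # project A onto the PSD cone
    return A, np.array([b1, b2]), c
```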

    B. Shape Hallucination

In this paper, it is assumed that at least half of the feature IDs are not occluded. Let the number of candidate feature points for the $i$th feature ID be $N_{P_i}$ ($1 \le i \le N_I$), where $N_I$ is the number of feature IDs. Then, among the $\sum_{i=1}^{N_I} N_{P_i}$ feature points in total, at least $N_I/2$ and at most $N_I$ feature points exist that support an input facial image. Therefore, the correct combination of feature points should be found. In order to find it, RANSAC [10], [11] is employed.

1) Selecting proposal feature points: In using the RANSAC strategy, there are two steps of random sampling in our method. The first is to select $M$ feature IDs; this proposes a set of non-occluded feature IDs, where $M$ is the minimum number of feature IDs required to hallucinate a face shape. The second is to select one feature point among the multiple feature points of each selected feature ID; this proposes true feature points among the multiple feature responses for those feature IDs.

The sampling is iterated until all the samples are inliers, and the maximum number of iterations $k$ in RANSAC can be determined as follows [10]:

$$k = \frac{\log(1 - p)}{\log(1 - w)} \qquad (4)$$

where $p$ is the probability that RANSAC selects only inliers within $k$ iterations and $w$ is the probability that all $M$ sampled points are inliers. In our case, $w = \prod_{i=0}^{M-1} \frac{N_I/2 - i}{N_I - i} \cdot \frac{1}{N_{P_j}}$, where $N_{P_j}$ is the number of feature points for the $j$th selected feature ID. For a simple example, let $N_{P_j} = 2$ for all $j$, and let $p$, $M$, and $N_I$ be 0.99, 10, and 83, respectively. Then the maximum number of iterations $k$ is larger than $0.9 \times 10^6$, which is not feasible. In our experiments, a proper set of feature points could be selected within 400 iterations in many cases, but in some other cases it could not be done even after 2,000 iterations.
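The iteration bound of eq. (4) and the worked example from the text can be checked directly:

```python
import math

def ransac_iterations(p, w):
    """Eq. (4): iterations needed so that, with probability p, at least one
    all-inlier sample is drawn, given per-sample inlier probability w."""
    return math.log(1 - p) / math.log(1 - w)

# Numbers from the text: N_P_j = 2 for all j, p = 0.99, M = 10, N_I = 83.
M, N_I = 10, 83
w = 1.0
for i in range(M):
    w *= (N_I / 2 - i) / (N_I - i) * 0.5  # pick a visible ID, then its true peak
print(f"w = {w:.1e}, k = {ransac_iterations(0.99, w):.1e}")  # k well above 0.9e6
```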

In order to reduce the number of iterations, we propose a method for coarse filtering of non-candidate feature points. The method uses RANSAC based on the concept of the generalized Hough transform. Two feature points that do not belong to the same feature ID are randomly selected, and scale and rotation are calculated. Then a centroid of the face shape is estimated from the two points, and its agreement with the centroids obtained from the other remaining feature points, with respect to the scale and rotation, is tested. The details of the proposed filtering method are as follows (a simplified sketch in code is given after the list):

1) Select two random feature points $y_i$, $y_j$ that do not belong to the same feature ID.

2) Estimate scale $S_l$ and rotation $R_l$ parameters from the two feature points $y_i$, $y_j$ in the $l$th iteration.

3) Calculate the distribution of the estimated shape centroid over all feature points: $F(x) = \sum_{k=1}^{T} f(x \mid y_k, S_l, R_l)$, where, for given $S_l$ and $R_l$, $f$ is a normal distribution function of the centroid for feature point $y_k$, $T = \sum_{i=1}^{N_I} N_{P_i}$, and $x$ represents a position in the image.

4) Get the maximum value $C^l_{S_l, R_l} = \max F(x)$ and its position $x^l_{\max} = \arg\max_x F(x)$.

5) Repeat 1 to 4 until the number of iterations reaches a given number or $C^l_{S_l, R_l}$ is larger than a given threshold.

6) Get $S_L$, $R_L$, and $x^L_{\max}$, where $L = \arg\max_l C^l_{S_l, R_l}$.

7) Given $S_L$ and $R_L$, calculate the Mahalanobis distance between $x^L_{\max}$ and each feature point with distribution $f(x^L_{\max} \mid y_k, S_L, R_L)$.

8) Take the feature point that has the minimum distance.

9) Repeat 6 to 8 until we get a given number of feature IDs, such that the number is at least larger than $M$.

As a result of the coarse filtering, almost all selected feature points are inliers. In our experiments, we could get a proper inlier set within 5 iterations. Thus, the number of iterations for selecting $M$ correct feature points could be reduced from several hundred to less than five.
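A simplified sketch of this voting scheme. It replaces the sum of Gaussians $F(x)$ with a coarse histogram of predicted centroids, and it assumes that mean per-ID offsets from the shape centroid in a normalized training frame are available; both are stand-ins for the paper's exact formulation.

```python
import numpy as np

def coarse_filter(points, ids, offsets, n_keep, n_trials=5, grid_step=4.0):
    """Generalized-Hough coarse filter, simplified from steps 1-9.

    points:  (T, 2) candidate feature points from all feature IDs
    ids:     (T,) integer feature ID index of each candidate
    offsets: (N_I, 2) mean offset of each feature ID from the shape centroid
             in a normalized (unit-scale, zero-rotation) training frame
    """
    rng = np.random.default_rng(0)
    best_score, best_idx = -np.inf, None
    for _ in range(n_trials):
        i, j = rng.choice(len(points), size=2, replace=False)
        if ids[i] == ids[j]:
            continue  # step 1: the pair must come from different feature IDs
        # Step 2: hypothesize scale and rotation from the pair.
        d_img = points[j] - points[i]
        d_mod = offsets[ids[j]] - offsets[ids[i]]
        s = np.linalg.norm(d_img) / np.linalg.norm(d_mod)
        ang = np.arctan2(d_img[1], d_img[0]) - np.arctan2(d_mod[1], d_mod[0])
        R = np.array([[np.cos(ang), -np.sin(ang)], [np.sin(ang), np.cos(ang)]])
        # Step 3: every candidate predicts the face centroid under (s, R).
        centroids = points - (s * (R @ offsets[ids].T)).T
        # Step 4: histogram vote as a stand-in for the sum of Gaussians F(x).
        _, inv, counts = np.unique(np.round(centroids / grid_step).astype(int),
                                   axis=0, return_inverse=True, return_counts=True)
        peak = counts.argmax()
        if counts[peak] > best_score:
            center = centroids[inv == peak].mean(axis=0)
            # Steps 7-8: keep the candidates whose predicted centroid is closest.
            order = np.argsort(np.linalg.norm(centroids - center, axis=1))
            best_score, best_idx = counts[peak], order[:n_keep]
    return best_idx  # indices of coarsely filtered (likely inlier) candidates
```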

2) Hallucinating shapes: From the coarsely filtered feature points and feature IDs, $M$ feature IDs are selected, and then the $M$ feature points associated with the selected feature IDs are selected. The parameter $p$ is then calculated explicitly as follows:

$$p = (\Phi^t A \Phi)^{-1} \Phi^t b \qquad (5)$$

where

$$A = \begin{bmatrix} A_{sel_1}^{sel_1} & & 0 \\ & \ddots & \\ 0 & & A_{sel_M}^{sel_M} \end{bmatrix}, \qquad b = \begin{bmatrix} b_{sel_1}^{sel_1} \\ \vdots \\ b_{sel_M}^{sel_M} \end{bmatrix} \qquad (6)$$

The $A_{sel_k}^{sel_j}$ and $b_{sel_k}^{sel_j}$ are obtained by (3), where $sel_k$ and $sel_j$ are the $k$th and $j$th elements among the selected feature IDs and feature points above, and $1 \le r \le N_{P_{sel_k}}$.

In [1], this fitting method is applied iteratively: since the distribution of a feature response is represented by one quadratic fitting function (a unimodal distribution) even where there could be multiple strong feature responses (multi-modal distributions), the search region must be reduced iteratively until all quadratic fitting functions localize the correct positions, with smaller search regions at each iteration. In our method, however, the feature response for each feature ID is explicitly described by multi-modal distributions; thus, given a set of selected feature points, a facial shape can be hallucinated in closed form.
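A sketch of the closed-form solve of eqs. (5)-(6), assuming the per-feature quadratic terms from eq. (3) have been computed; the sign convention follows the reconstruction of eq. (5) above, which may differ from the original by an overall sign.

```python
import numpy as np
from scipy.linalg import block_diag

def hallucinate_shape(phi_sel, A_blocks, b_blocks):
    """Eqs. (5)-(6): closed-form shape parameters from the selected features.

    phi_sel:  (2*M, K) rows of Phi for the M selected feature IDs
    A_blocks: list of M (2, 2) quadratic terms A from eq. (3)
    b_blocks: list of M (2,) linear terms b from eq. (3)
    """
    A = block_diag(*A_blocks)     # (2M, 2M), block diagonal as in eq. (6)
    b = np.concatenate(b_blocks)  # (2M,)
    p = np.linalg.solve(phi_sel.T @ A @ phi_sel, phi_sel.T @ b)
    return p
```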


3) Testing Hallucinated Shape: Since we assume that at least half of all feature IDs are not occluded, we can always accept the $N_I/2$ feature points for which the distances between the positions of the hallucinated feature points and those of the feature responses are smaller than the median of all the residuals.

The residual between the position $x$ of a feature ID of the shape and the corresponding feature response position $\hat{x}$ of that feature ID can be calculated as follows:

$$R(x, \hat{x}) = (x - \hat{x})^t \left[ G(\Sigma_m) + \Sigma_f \right]^{-1} (x - \hat{x}) \qquad (7)$$

where $\Sigma_m$ and $\Sigma_f$ are covariance matrices of the shape model and the feature response, and $G$ is a geometric transformation function estimated from the hallucinated shape. The $N_I/2$ feature points whose residuals $R(x_i, \hat{x}_i)$ are smaller than the median of the residuals over all feature points are accepted as inliers. Using the inliers, the shape is updated.
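A sketch of the residual and median rule of eq. (7), assuming the combined covariance $G(\Sigma_m) + \Sigma_f$ has already been assembled per feature ID:

```python
import numpy as np

def median_inlier_test(shape_pts, response_pts, covs):
    """Eq. (7) plus the median rule: accept the half of the feature IDs
    whose residuals fall below the median of all residuals.

    shape_pts:    (N_I, 2) hallucinated positions x
    response_pts: (N_I, 2) best matching response peaks x_hat
    covs:         (N_I, 2, 2) combined covariances G(Sigma_m) + Sigma_f
    """
    diffs = shape_pts - response_pts
    residuals = np.einsum("ni,nij,nj->n", diffs, np.linalg.inv(covs), diffs)
    inliers = residuals < np.median(residuals)
    return inliers, np.median(residuals)
```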

The shape is tested in terms of the median of the residuals [12]. The proposal feature point selection, the hallucination, and the test are repeated until the median is smaller than a given threshold or the number of iterations reaches a given limit. The hallucinated shape with the minimum error then becomes the aligned shape, built from only the inlier feature points.

C. Shape refinement

In the previous step, at least $N_I/2$ feature points were selected safely. However, there can be more unoccluded feature points. For example, in the case of a non-occluded facial image, almost all feature IDs are visible, but only half of the feature points will be selected through the previous steps; there can be more visible inliers than half. In the shape refinement, more feature points are chosen as inliers by measuring the Mahalanobis distance between each hallucinated feature point and a peak of the feature response around it.

In order to pick a peak point around a hallucinated feature point, strong feature responses are searched for along the line perpendicular to the tangent line at that feature point. A picked position with a strong feature response can then be accepted as an inlier. There are two ways of searching for strong feature points along the perpendicular lines.

The first is to search for the strongest feature response along the perpendicular line, as is done in [2]. However, if a true strong feature response point is slightly off the perpendicular line, this approach fails to pick up the point.

The second is to look not only at the feature response on the perpendicular line but also at its neighbors, according to the properties of the feature IDs. As mentioned, each feature ID has a different feature response distribution. As shown in Fig. 2(b) and Fig. 3, the feature response of the cheek region is clearly distributed along the edge of the cheek, which is almost linear, while that of the nostril is distributed in an almost circular shape. By examining neighboring responses, even when the feature response on the perpendicular line itself is weak, there is still a chance to pick up the point. If the picked point has a strong feature response, it is selected as an inlier and contributes to refining the shape.

Fig. 4. Distance metrics for feature IDs on the eye corner and cheek. The three red dots in each figure represent the current feature point and its neighbors.

The strength of the feature response $S$ along the perpendicular line is defined as follows:

$$S_i(k) = \sum_x N_i(x \mid k, \Sigma_i)\, \phi(x) \qquad (8)$$

where $\phi(x)$ is the feature response at $x$ and $N_i(\cdot \mid k, \Sigma_i)$ is the normal probability density function; $i$ is the feature ID corresponding to this feature point and $k$ is a position on the perpendicular line.
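A sketch of eq. (8) with an isotropic Gaussian weight; the paper weights by the learned per-ID distribution, so the scalar variance here is an assumption for illustration:

```python
import numpy as np

def response_strength(positions, responses, k, sigma2):
    """Eq. (8): Gaussian-weighted response strength S_i(k) around position k
    on the perpendicular search line, so that a strong peak slightly off the
    line still contributes.

    positions: (n, 2) sample positions x near the line
    responses: (n,) feature response phi(x) at each position
    k:         (2,) position on the perpendicular line being scored
    sigma2:    isotropic variance of the weighting Gaussian (assumption)
    """
    d2 = ((positions - k) ** 2).sum(axis=1)
    weights = np.exp(-0.5 * d2 / sigma2) / (2 * np.pi * sigma2)
    return (weights * responses).sum()
```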

IV. ERROR MEASUREMENT

The properties of feature IDs shown in Fig. 3 suggest an interesting point of view on how to measure errors between aligned and ground-truth shapes. Conventionally, error is measured by Euclidean distance, which ignores the feature properties. Unlike the feature points of eye corners and lip corners, the feature points of cheek regions and the middles of eyebrows are hard to pin to exact positions in an image; many positions along the edge are valid answers. The conventional Euclidean distance cannot handle such answers.

We propose a new metric based on shape properties, which considers the geometric properties of feature IDs, as follows:

1) If a feature ID is positioned on a sharp point, such as a corner of an eye, the metric is essentially the Euclidean distance between the target point and the associated ground-truth point.

2) If a feature ID is positioned on a middle line, such as a cheek, the metric is the distance between the edge line and the target point.

It is defined as follows:

$$D(i, x) = \min\left( d(y_{i-1}, v_x),\; d(y_{i+1}, v_x) \right) \qquad (9)$$

$$d(y_k, v_x) = \sqrt{ a(y_k, v_x)^2 \sin^2\theta + b(y_k, v_x)^2 } \qquad (10)$$

where

$$a(y_k, v_x) = \frac{v_x \cdot y_k}{|y_k|}, \quad b(y_k, v_x) = \left| c(y_k, v_x) - a(y_k, v_x) \right|, \quad c(y_k, v_x) = |v_x - y_k|, \quad \sin\theta = \sqrt{ 1 - \left( \frac{y_k \cdot v_h}{|y_k|\,|v_h|} \right)^2 }.$$


Here $i$ is the index of the $i$th feature point of the ground truth, and $v_h$ is the tangent line at the $i$th feature point. The $v_x$ and $y_k$ are $g_i - x$ and $g_i - g_k$, respectively, where $g_k$ is the ground-truth position of the $k$th feature ID.

Fig. 4 shows examples of the distance metric for feature IDs on an eye corner and a cheek. This measurement allows a point on the cheek to move along the edge lines.
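A sketch of the metric as reconstructed in eqs. (9)-(10). Because the source text for $b$ and $d$ is garbled, the exact forms below are one plausible reading rather than a definitive transcription; `i` is assumed to be an interior landmark index so that both neighbors exist.

```python
import numpy as np

def shape_distance(g, i, x, v_h):
    """Eqs. (9)-(10), one plausible reading of the garbled source.

    g:   (N, 2) ground-truth landmark positions
    i:   index of the landmark being scored (assumed interior)
    x:   (2,) aligned target point
    v_h: (2,) tangent direction at g[i]
    """
    v_x = g[i] - x

    def d(y_k):
        a = np.dot(v_x, y_k) / np.linalg.norm(y_k)  # along-edge component
        c = np.linalg.norm(v_x - y_k)
        b = abs(c - a)
        cos_t = np.dot(y_k, v_h) / (np.linalg.norm(y_k) * np.linalg.norm(v_h))
        sin_t = np.sqrt(max(0.0, 1.0 - cos_t ** 2))
        # Along-edge error is discounted by sin(theta); off-edge error counts fully.
        return np.sqrt(a * a * sin_t * sin_t + b * b)

    return min(d(g[i] - g[i - 1]), d(g[i] - g[i + 1]))
```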

    V. EXPERIMENTS AND ANALYSIS

In order to show the robustness of the proposed method, we conducted two experiments, measuring accuracy and stability. Accuracy represents how precisely the face alignment is localized, and stability represents how stable the face alignment is under occlusion.

For experiments and analysis, two data sets were used: the AR database (ARDB) [13] and images in which occlusions were generated artificially (AGO). The ARDB includes images of three types: (1) non-occlusion, (2) upper-part occlusion with sunglasses, and (3) lower-part occlusion with a scarf. For the AGO images, occlusions were artificially generated from non-occluded face images.

In order to measure the error of an alignment shape, we introduce two error measures: ($e_n$) the error between ground truth and the associated alignment points in the non-occluded part, and ($e_o$) the error between those in the occluded part. The $e_n$ represents how well the shape is aligned regardless of occlusion; $e_o$ represents how well the occluded points are hallucinated. In this paper, $e_n$ is our main concern.

125 images in the ARDB were used for learning the feature detectors and the PDM. For the measurement, each image was scaled according to the ground truth of the eye positions so that the inter-ocular distance was 50 pixels.

A. Accuracy test

In the accuracy test, localization accuracy was measured using $e_n$ on the ARDB. Note that the test images include occluded as well as non-occluded images.

Fig. 5 shows the result of the accuracy test. The test result using the BTSM [2] is also given for comparison. The x-axis represents images and the y-axis the mean errors of the alignment shapes. The images at the bottom of Fig. 5 show face detection (black line) and the alignment results by the BTSM (blue line) and the proposed method (red line). The proposed method (red circles) outperformed the BTSM (blue rectangles) for almost all of the images, regardless of occlusion. While the errors of the BTSM alignments increase significantly from the 27th image on, those of the proposed method do not. Many alignment results by the BTSM had large errors due to the initial position problem as well as occlusions.

Fig. 5. The accuracy test. The x-axis represents images and the y-axis the errors of the alignment shapes. The images are ordered by the errors of the BTSM results. Blue rectangles and red circles represent the BTSM and the proposed face alignment method, respectively.

More examples are shown in Fig. 6. The first row shows face detection results, and the second and third rows show alignment results by the BTSM and the proposed method, respectively. Feature points identified as visible are shown as yellow circles in the third row. The proposed method showed very good performance on occluded as well as non-occluded images.

Fig. 6. Comparison of face alignment results. The first row shows input images and their face detection results, the second row the alignment results by the BTSM, and the third row the alignment results by the proposed method. The first to fourth columns are images from the ARDB, and the fifth and sixth are images in which occlusions were artificially generated.

    B. Stability test

The stability test measures the robustness of face alignment in the presence of various partial occlusions. For this test, AGO images were generated as follows: a non-occluded facial image from the ARDB was partitioned so that each partition covers 6% to 7% of the facial features in the ground truth. Up to 9 partitions were randomly chosen and filled with black, so that, at most, 63% of the facial features were occluded. For each degree of occlusion (the percentage of occluded facial features), 5 occlusions of 5 subjects were generated. Some images are shown at the bottom of Fig. 7.

In this stability test, the alignment results on the original non-occluded images were used as the ground truth.

Fig. 7 shows the test results. The y-axis represents the errors measured by (9), and the x-axis the degree of occlusion. The center line represents the mean errors, with maximum (above) and minimum (below) errors. The images at the bottom of Fig. 7 show examples of alignment results on images with 1, 3, 5, and 7 degrees of occlusion; feature points identified as visible are shown as yellow circles.

Fig. 7. The stability test. The x-axis represents the degree of occlusion and the y-axis the errors. The center line represents the mean errors, with maximum and minimum errors. The images at the bottom show examples of alignment results at several degrees of occlusion: 7, 22, 36, and 50%.

In the figure, with up to 7 degrees of occlusion the error does not increase significantly, which means the proposed method is stable as long as at least half of the facial features are visible. From 8 degrees of occlusion on, the error increases drastically, because more than half of the facial features are occluded.

    VI. CONCLUSIONS

In this paper we proposed an approach to face alignment that is robust to partial occlusion. In the proposed approach, the feature responses from face feature detectors are explicitly represented by multi-modal distributions, hallucinated facial shapes are generated, and a correct alignment is selected by RANSAC hypothesize-and-test. A coarse inlier feature point selection method, which reduces the number of RANSAC iterations, and a new error measurement metric based on shape properties are also introduced.

We evaluated the proposed method on the ARDB and on images in which occlusions were generated artificially. The experimental results demonstrated that the alignment is accurate and stable over a wide range of degrees and variations of occlusion. They also show that it is better to use partial information that is not corrupted (visible) than to use all information, some of which may be corrupted (invisible).

Acknowledgment: This research was supported by ONR MURI Grant N00014-07-1-0747.

REFERENCES

[1] Y. Wang, S. Lucey, and J. F. Cohn, "Enforcing Convexity for Improved Alignment with Constrained Local Models," in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 2008.

[2] Y. Zhou, L. Gu, and H. Zhang, "Bayesian Tangent Shape Model: Estimating Shape and Pose Parameters via Bayesian Inference," in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, Wisconsin, 2003, pp. 109-116.

[3] T. F. Cootes and C. J. Taylor, "Statistical Models of Appearance for Computer Vision," Technical Report, University of Manchester, 2000.

[4] L. Gu and T. Kanade, "A Generative Shape Regularization Model for Robust Face Alignment," in Proc. of the 10th European Conference on Computer Vision, 2008.

[5] D. Cristinacce and T. F. Cootes, "Feature Detection and Tracking with Constrained Local Models," in British Machine Vision Conference, 2006, pp. 929-938.

[6] P. Viola and M. Jones, "Robust Real-Time Face Detection," Int. Journal of Computer Vision, vol. 57, no. 2, 2004, pp. 137-154.

[7] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

[8] A. Ross, "Procrustes Analysis," Technical Report, Department of Computer Science and Engineering, University of South Carolina, SC 29208.

[9] D. Comaniciu and P. Meer, "Mean Shift: A Robust Approach Toward Feature Space Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, 2002, pp. 603-619.

[10] M. A. Fischler and R. C. Bolles, "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography," Comm. of the ACM, vol. 24, 1981, pp. 381-395.

[11] Y. Li, L. Gu, and T. Kanade, "A Robust Shape Model for Multi-View Car Alignment," in Proc. of the IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2009.

[12] P. J. Rousseeuw, Robust Regression and Outlier Detection, Wiley, New York, 1987.

[13] A. M. Martinez and R. Benavente, "The AR Face Database," CVC Tech. Report #24, 1998.
