
J. Tao, T. Tan, and R.W. Picard (Eds.): ACII 2005, LNCS 3784, pp. 40–47, 2005. © Springer-Verlag Berlin Heidelberg 2005

Face Alignment Under Various Poses and Expressions

Shengjun Xin and Haizhou Ai

Computer Science and Technology Department, Tsinghua University, Beijing 100084, China

[email protected]

Abstract. In this paper, we present a face alignment system that deals with various poses and expressions. In addition to the global shape model, we use component shape models, such as a mouth shape model and a contour shape model, to achieve a more powerful representation of face components under complex pose and expression variations. Unlike the 1-D profile texture feature of classical ASM, we use a 2-D local texture feature for higher accuracy; to achieve robustness and fast speed, it is represented by Haar-wavelet features as in [5]. Extensive experiments are reported to show the effectiveness of the system.

1 Introduction

Face alignment, whose goal is to locate facial feature points such as the eyebrows, eyes, nose, mouth and contour, is very important in face information processing, including face recognition, face modeling, and facial expression recognition and analysis. Since face information is critical in human-to-human interaction, enabling machines to process it is a key technology for a natural form of human-to-machine interaction. In many complex face information processing tasks, such as facial expression analysis, face alignment is a fundamental preprocessing step used to collect and align data, and it is therefore required to work under various poses and expressions. This is the problem discussed in this paper.

In the literature, Cootes et al. [1][2] proposed two important methods for face alignment: the Active Shape Model (ASM) and the Active Appearance Model (AAM). Both methods use the Point Distribution Model (PDM) to constrain a face shape and parameterize the shape by PCA, but their feature models differ. In ASM, the feature model is a 1-D profile texture feature around every feature point, which is used to search for an appropriate candidate location of that point. In AAM, a global appearance model is introduced to drive the optimization of the shape parameters. Generally speaking, ASM outperforms AAM in shape localization accuracy and is more robust to illumination, but suffers from local minima; AAM is sensitive to illumination and noisy backgrounds, but can obtain an optimal global texture. In this paper, we focus our work on ASM. In recent years, many derivative methods have been proposed: ASM-based methods such as TC-ASM [3], W-ASM [4], and Haar-wavelet ASM [5], and AAM-based methods such as DAM [6] and AWN [7]. However, the problem remains unsolved for practical applications, since the performance of these methods is very sensitive to large variations in face pose and especially in facial expression, although they usually achieve good results on neutral faces. This limitation may be caused by the global shape model, which is not powerful enough to represent changes in face components under complex pose and expression variations.

As mentioned above, classical ASM uses a 1-D profile texture feature perpendicular to the feature point contour as its local texture model. However, this local texture model covers only a small area and is not sufficient to distinguish a feature point from its neighbors, so ASM often suffers from local minima in the local search stage. To overcome this problem, we follow the approach in [5] and use a 2-D local texture feature represented by Haar-wavelet features for robustness and speed.

In this paper, we extend the work of [5] to multi-view faces with expression variations, and we use component shape models, such as a mouth shape model and a contour shape model, in addition to the global shape model to achieve a more powerful representation of face components under complex pose and expression variations. The approach is developed over a very large data set, and the algorithm is implemented in a hierarchical structure as in [8] for efficiency.

This paper is organized as follows. In Section 2, an overview of the system framework and the pose-based face alignment algorithm is given. In Section 3, experiments are reported. Finally, conclusions are given in Section 4.

2 Overview of the System

The designed system consists of four modules: multi-view face detection (MVFD) [10], facial landmark extraction [11], pose estimation [12], and pose-based alignment, as illustrated in Fig. 1 and Fig. 2 (the first two pictures are from FERET [14]). In this paper, the pose-based alignment module is described in detail.

Fig. 1. Framework of the system
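To make the data flow of Fig. 1 concrete, the following is a minimal sketch of how the four modules could be chained. It is illustrative only: the callables (detect_faces, extract_landmarks, estimate_pose, the per-pose aligners) are hypothetical placeholders supplied by the caller, not the authors' actual interfaces.

```python
from typing import Callable, Dict, List, Tuple

import numpy as np

Box = Tuple[int, int, int, int]          # face bounding box (x, y, w, h)
PoseRange = Tuple[int, int]              # yaw interval in degrees, e.g. (-15, 15)


def align_faces(
    image: np.ndarray,
    detect_faces: Callable[[np.ndarray], List[Box]],
    extract_landmarks: Callable[[np.ndarray, Box], np.ndarray],
    estimate_pose: Callable[[np.ndarray, Box], PoseRange],
    aligners_by_pose: Dict[PoseRange, Callable[[np.ndarray, np.ndarray], np.ndarray]],
) -> List[Tuple[Box, PoseRange, np.ndarray]]:
    """Chain the four modules of Fig. 1: multi-view face detection [10],
    facial landmark extraction [11], pose estimation [12], and pose-based
    alignment with the pose-specific models of Section 2."""
    results = []
    for box in detect_faces(image):                        # one entry per detected face
        landmarks = extract_landmarks(image, box)          # a few key facial points
        pose = estimate_pose(image, box)                   # one of the nine yaw intervals
        shape = aligners_by_pose[pose](image, landmarks)   # pose-based ASM alignment
        results.append((box, pose, shape))
    return results
```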

2.1 Pose-Based Shape Models

Considering face pose changes off the image plane from full profile to frontal (without loss of generality, here from right full profile to frontal), five types of global shape for the Point Distribution Model (PDM) are defined, as shown in Fig. 3: 37 points for [-90°, -75°), 50 points for [-75°, -60°), 59 points for [-60°, -45°), and 88 points for both [-45°, -15°) and [-15°, +15°]. Over the corresponding training sets, five PDMs are thus set up as pose-based shape models.
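Each pose-specific PDM is a standard point-distribution model: the labeled shapes of that pose range are aligned to a common frame and PCA is applied to the coordinate vectors. The sketch below shows the conventional construction; the fraction of retained variance and the ±3σ mode limits are common ASM defaults and are assumptions here, as the paper does not state them.

```python
import numpy as np


def build_pdm(shapes: np.ndarray, var_kept: float = 0.95):
    """Build a point-distribution model from aligned training shapes.

    shapes: (N, 2K) array, each row the x/y coordinates of K landmarks
            after Procrustes alignment to a common frame.
    Returns the mean shape, the retained eigenvectors and eigenvalues.
    """
    mean = shapes.mean(axis=0)
    cov = np.cov(shapes - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]                      # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    k = np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), var_kept) + 1
    return mean, eigvecs[:, :k], eigvals[:k]


def constrain(shape_vec, mean, P, eigvals, n_sigma: float = 3.0):
    """Project a shape into the model subspace and clip each mode to +-3 sigma."""
    b = P.T @ (shape_vec - mean)
    b = np.clip(b, -n_sigma * np.sqrt(eigvals), n_sigma * np.sqrt(eigvals))
    return mean + P @ b
```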


In addition to the global shape models above, component shape models for local shape representation are introduced in order to capture accurate shape changes due to large variations in pose and expression, as shown in Fig. 4. The reason is that the global shape model constraint is too strong to represent local shape changes. Take a face with an open mouth as an example (the picture is from AR [13]), shown in Fig. 5a: many of the mouth feature points do reach their correct positions in the local search stage, but because their contribution at the global level is too small to have a significant effect on the final shape, they drift away from those positions under the global shape model constraint, as shown in Fig. 5e. However, if a component shape model, in this case the mouth shape model, is used, their contribution is large enough to change the final shape, as shown in Fig. 5f.

a) [-90°, -75°)   b) [-75°, -60°)   c) [-60°, -45°)   d) [-45°, -15°)   e) [-15°, +15°]

Fig. 3. Pose-based shape model (mean shape) from right full profile to frontal

Fig. 4. Pose-based face alignment

Fig. 2. Component shape model for frontal face (mean contour and mean mouth)


In summary, face alignment consists of two-stage processing: the first stage uses the global ASM model, and the second stage uses the component ASM models, initialized from the result of the first stage; see Fig. 5 for an example. In this way, the accuracy is improved significantly.
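A sketch of this two-stage arrangement is given below, assuming ASM objects with a search(image, shape) method as sketched in Section 2.3. The class interface and the landmark index ranges are illustrative assumptions, not the authors' code or point numbering.

```python
import numpy as np

# Hypothetical landmark index sets for the 88-point frontal model;
# the exact index ranges are assumptions for illustration only.
CONTOUR_IDX = np.arange(0, 17)      # face contour feature points
MOUTH_IDX = np.arange(48, 68)       # mouth feature points


def two_stage_align(image, global_asm, contour_asm, mouth_asm, init_shape):
    """Stage 1: global ASM over all points.
    Stage 2: component ASMs (contour, mouth) refine only their own points,
    initialized from the stage-1 result (cf. Fig. 5b-d)."""
    shape = global_asm.search(image, init_shape)                     # global constraint
    shape[CONTOUR_IDX] = contour_asm.search(image, shape[CONTOUR_IDX])
    shape[MOUTH_IDX] = mouth_asm.search(image, shape[MOUTH_IDX])
    return shape
```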

2.2 Local Texture Model

The 2-D local texture feature represented by Haar-wavelet features proposed in [5], as illustrated in Fig. 6 (the picture is from AR [13]), is adopted. For each feature point, the features collected over the training set are clustered by K-means into several representative templates.
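The texture model can be approximated by the sketch below: a few Haar-like filter responses are computed on a patch around each landmark, and the feature vectors gathered over the training set are clustered into representative templates with a plain K-means loop. The patch size, the particular filters, and the number of clusters are assumptions; they are not the exact feature set of [5].

```python
import numpy as np


def haar_features(patch: np.ndarray) -> np.ndarray:
    """A few Haar-wavelet-like responses on a square grayscale patch
    (an illustrative subset, not the exact features of [5])."""
    p = patch.astype(np.float64)
    h, w = p.shape
    left, right = p[:, : w // 2].sum(), p[:, w // 2:].sum()
    top, bottom = p[: h // 2, :].sum(), p[h // 2:, :].sum()
    centre = p[h // 4: 3 * h // 4, w // 4: 3 * w // 4].sum()
    return np.array([
        left - right,             # 2-rectangle filter, vertical edge
        top - bottom,             # 2-rectangle filter, horizontal edge
        p.sum() - 2.0 * centre,   # centre-surround filter
    ])


def build_templates(patches: np.ndarray, n_clusters: int = 4, n_iter: int = 20):
    """Cluster the training feature vectors of one landmark into
    representative templates with a simple K-means loop."""
    feats = np.stack([haar_features(p) for p in patches])
    rng = np.random.default_rng(0)
    centres = feats[rng.choice(len(feats), n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(
            np.linalg.norm(feats[:, None, :] - centres[None, :, :], axis=2), axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centres[k] = feats[labels == k].mean(axis=0)
    return centres                 # one set of templates per landmark
```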

2.3 Alignment

In the hierarchical alignment algorithm shown in Fig. 7, for a given face image, we first assume that several facial landmark points are known (for example, from manual labeling); a regression method is used to initialize a full shape from those given points to start the ASM algorithm. Second, the Haar-wavelet features of every feature point and its neighbors (a 3×3 area) are computed as described in Section 2.2, and the current candidate point is selected based on the Euclidean distance between the current Haar-wavelet feature and the trained templates.

Fig. 6. Haar-wavelet feature extraction

Fig. 5. Face alignment using global & component shape models: a) source image; b) face alignment result; c) refined by the contour shape model; d) refined by the mouth shape model; e) mouth feature points before refinement by the mouth shape model; f) mouth feature points after refinement by the mouth shape model


Fig. 7. Flowchart of the hierarchical alignment algorithm

Third, the candidate points are projected onto the shape space to update the shape and pose parameters. The process repeats from the second step until the shape converges in the current layer; if this layer is the last one the algorithm stops, otherwise it moves to the next layer.
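A condensed sketch of one layer of this search loop is given below, reusing the PDM quantities of Section 2.1 and the per-landmark templates of Section 2.2. The function signature, the convergence threshold, and the ±3σ clipping are assumptions, and the pose (similarity transform) update is omitted for brevity.

```python
import numpy as np


def asm_search_layer(image, shape, templates, mean, P, eigvals,
                     extract_patch, haar_features,
                     max_iter: int = 30, tol: float = 0.5):
    """One layer of the hierarchical ASM search (pose update omitted).

    shape:      (K, 2) current landmark coordinates.
    templates:  list of (M, D) arrays, trained Haar-wavelet templates per landmark.
    mean, P, eigvals: PDM (mean shape, eigenvectors, eigenvalues) of this layer.
    extract_patch(image, x, y): returns the local patch centred at (x, y).
    haar_features(patch): returns the D-dimensional Haar-wavelet feature.
    """
    for _ in range(max_iter):
        candidates = shape.copy()
        for i, (x, y) in enumerate(shape):
            best, best_dist = (x, y), np.inf
            for dx in (-1, 0, 1):                          # 3x3 neighbourhood search
                for dy in (-1, 0, 1):
                    f = haar_features(extract_patch(image, x + dx, y + dy))
                    d = np.linalg.norm(f - templates[i], axis=1).min()
                    if d < best_dist:                      # nearest trained template wins
                        best, best_dist = (x + dx, y + dy), d
            candidates[i] = best
        # project the candidates into the shape space and clip each mode to +-3 sigma
        b = P.T @ (candidates.reshape(-1) - mean)
        b = np.clip(b, -3 * np.sqrt(eigvals), 3 * np.sqrt(eigvals))
        new_shape = (mean + P @ b).reshape(-1, 2)
        if np.abs(new_shape - shape).max() < tol:          # converged in this layer
            return new_shape
        shape = new_shape
    return shape
```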

3 Experiments

3.1 Training and Testing Data Set

Different from the view ranges presented in [6], that is, [-90°, -55°), [-55°, -15°), [-15°, 15°], [15°, 55°], [55°, 90°], we divide the full range of multi-view face poses into the following intervals based on the visibility of facial feature points and the fine modes of shape variation: [-90°, -75°), [-75°, -60°), [-60°, -45°), [-45°, -15°), [-15°, +15°], (15°, 45°], (45°, 60°], (60°, 75°], (75°, 90°]. The view [-15°, +15°] corresponds to frontal. The experiments are conducted on a very large data set. For the frontal view, the data set consists of 2000 images of male and female subjects ranging from children to old people, many of which contain exaggerated expressions such as open mouths or closed eyes, or have ambiguous contours, especially for old people. The average face size is about 180×180 pixels. We randomly chose 1600 images for training and the remaining 400 for testing. For the other views, we labeled the feature points of 300 images of one side, such as [-90°, -75°), with a semi-automatic labeling tool as their ground truth data for training, and used the 300 mirrored images of the symmetric view, such as (75°, 90°], for testing.

In the system illustrated in Fig. 1, 'Facial landmark extraction' [11] is currently implemented only for frontal faces, and 'Pose estimation' [12] can only be used for the views [-45°, -15°), [-15°, +15°], and (15°, 45°]. For the other views, manually picking several points and selecting the corresponding pose interval are necessary to start the experiments.
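Selecting the pose interval from an estimated yaw angle is a simple interval lookup. The sketch below follows the nine intervals listed above; the function itself and its name are illustrative, not part of the described system.

```python
import bisect

# Yaw boundaries of the nine pose intervals (degrees; negative = right profile).
# Negative-side intervals are closed on the left, positive-side intervals on the
# right, and the frontal interval [-15, +15] is closed at both ends.
BOUNDS = [-90, -75, -60, -45, -15, 15, 45, 60, 75, 90]


def pose_interval(yaw: float) -> tuple:
    """Return the (lo, hi) pose interval an estimated yaw angle falls in."""
    if not -90 <= yaw <= 90:
        raise ValueError(f"yaw {yaw} deg outside [-90, 90]")
    if -15 <= yaw <= 15:
        return (-15, 15)                           # frontal view
    if yaw < -15:                                  # right-profile side: [lo, hi)
        i = bisect.bisect_right(BOUNDS, yaw) - 1
    else:                                          # left-profile side: (lo, hi]
        i = bisect.bisect_left(BOUNDS, yaw) - 1
    return BOUNDS[i], BOUNDS[i + 1]
```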


3.2 Performance Evaluation

The accuracy is measured with the relative pt-pt error, which is the point-to-point distance between the alignment result and the ground truth divided by the distance between the two eyes (if the face is not frontal, we instead use the distance between the visible eye corner and mouth corner). The feature points were initialized by a linear regression from 4 eye corner points and 2 mouth corner points of the ground truth. After the alignment procedure, the errors were measured.
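The error measure can be written compactly as follows; this is a sketch under the normalization described above, with the two reference landmarks passed in by index.

```python
import numpy as np


def relative_pt_pt_error(result: np.ndarray, truth: np.ndarray,
                         norm_points: tuple) -> float:
    """Mean point-to-point distance between an alignment result and the
    ground truth, divided by a reference distance: the inter-eye distance
    for frontal faces, or the visible eye-corner-to-mouth-corner distance
    for non-frontal faces. norm_points gives the two landmark indices."""
    i, j = norm_points
    ref = np.linalg.norm(truth[i] - truth[j])
    return float(np.linalg.norm(result - truth, axis=1).mean() / ref)
```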

In Fig. 8a, the distribution of the overall average error is compared with classical ASM [1], Gabor ASM [4], and Haar-wavelet ASM [5]. It shows that the presented method, Haar-wavelet ASM with component models, is better than the other three. In Fig. 8b, the average errors of the 88 feature points are compared. The distributions of the overall average errors of the four non-frontal views are compared in Fig. 9, and the average error of each feature point for these views is shown in Fig. 10. The average execution time per iteration is listed in Table 1.

Fig. 8. Comparison of classical ASM, Gabor ASM, Haar-wavelet ASM and Haar-wavelet ASM with component model

a) Distribution of relative average pt-pt error

b) Relative average pt-pt error for each feature point

Fig. 10. Relative average pt-pt error for each feature point of multi-view

Fig. 9. Distribution of relative average pt-pt error of multi-view


Some experimental results on images from FERET [14], AR [13], and the Internet, which are independent of the training/testing sets and contain large pose and expression variations, are shown in Fig. 11, Fig. 12, and Fig. 13.

Table 1. The average execution time per iteration

Algorithm                                             Execution time (per iteration)
Classical ASM                                         2 ms
Gabor ASM                                             576 ms
Haar-wavelet ASM                                      30-70 ms
Haar-wavelet ASM with component model (this paper)
    Frontal                                           53 ms
    -45° to -15°                                      58 ms
    -60° to -45°                                      54 ms
    -75° to -60°                                      45 ms
    -90° to -75°                                      35 ms

Fig. 11. Multi-view face alignment results

Fig. 12. Some results on face database of AR [13]

Fig. 13. Some results on face database of FERET [14] and Internet pictures

4 Conclusions

In this paper, we extend the work of [5] to multi-view faces with expression variations using component shape models, such as a mouth shape model and a contour shape model, in addition to the global shape model. A semi-automatic multi-view face alignment system is presented that combines face detection, facial landmark extraction, pose estimation and pose-based face alignment into a uniform coarse-to-fine hierarchical structure based on Haar-wavelet features. With component shape models, we can deal with faces with large expression variations and ambiguous contours. Extensive experiments show that the implemented system is fast, yet robust against variations in illumination, expression and pose. It could be very useful in facial expression recognition approaches, for example to collect shape data.

Acknowledgements

This work is supported by NSF of China grant No.60332010.

References

1. T. Cootes, D. Cooper, C. Taylor, and J. Graham, Active shape models - their training and application. Computer Vision and Image Understanding, 61(1):38-59, 1995.

2. T. Cootes, G. Edwards, and C. Taylor, Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681-685, 2001.

3. Shuicheng Yan, Ce Liu, Stan Z. Li, Hongjiang Zhang, Heung-Yeung Shum, Qiansheng Cheng, Texture-Constrained Active Shape Models.

4. Feng Jiao, Stan Li, Heung-Yeung Shum, Dale Schuurmans, Face Alignment Using Statistical Models and Wavelet Features, Proceedings of IEEE Conference on CVPR, pp. 321-327, 2003.

5. Fei Zuo, Peter H.N. de With, Fast facial feature extraction using a deformable shape model with Haar-wavelet based local texture attributes, Proceedings of IEEE Conference on ICIP, pp. 1425-1428, 2004.

6. S. Z. Li, S. C. Yan, H. J. Zhang, Q. S. Cheng, Multi-View Face Alignment Using Direct Appearance Models, In Proceedings of the 5th International Conference on Automatic Face and Gesture Recognition, Washington, DC, USA, 2002.

7. C. Hu, R. Feris, and M. Turk, Active Wavelet Networks for Face Alignment, In British Machine Vision Conference, East Anglia, Norwich, UK, 2003.

8. Ce Liu, Heung-Yeung Shum, and Changshui Zhang, Hierarchical Shape Modeling for Automatic Face Localization, Proceedings of ECCV, pp. 687-703, 2002.

9. P. Viola and M. Jones, Rapid object detection using a boosted cascade of simple features, in Proc. CVPR, 2001, pp. 511-518.

10. Bo Wu, Haizhou Ai, Chang Huang, Shihong Lao, Fast Rotation Invariant Multi-View Face Detection Based on Real Adaboost, In Proc. the 6th IEEE Conf. on Automatic Face and Gesture Recognition (FG 2004), Seoul, Korea, May 17-19, 2004.

11. Tong Wang, Haizhou Ai, Gaofeng Huang, A Two-Stage Approach to Automatic Face Alignment, in Proceedings of SPIE Vol. 5286, pp. 558-563, 2003.

12. Zhiguang Yang, Haizhou Ai, et al., Multi-View Face Pose Classification by Tree-Structured Classifier, The IEEE Inter. Conf. on Image Processing (ICIP-05), Genoa, Italy, September 11-14, 2005.

13. AR face database: http://rvl1.ecn.purdue.edu/~aleix/aleix_face_db.html

14. P. J. Phillips, H. Wechsler, J. Huang, and P. Rauss, "The FERET database and evaluation procedure for face recognition algorithms", Image and Vision Computing J., Vol. 16, No. 5, pp. 295-306, 1998.