fast and robust face recognition via coding residual map ...cslzhang/paper/frfr_pr.pdf · meng...

1

Fast and Robust Face Recognition via Coding Residual

Map Learning based Adaptive Masking

Meng Yang*, ZhizhaoFeng*, Simon C. K. Shiu, Lei Zhang1

Dept. of Computing, The Hong Kong Polytechnic University, Hong Kong, China

Abstract: Robust face recognition (FR) is an active topic in computer vision and biometrics, while

face occlusion is one of the most challenging problems for robust FR. Recently, the representation (or

coding) based FR schemes with sparse coding coefficients and coding residual have demonstrated

good robustness to face occlusion; however, the high complexity of l1-minimization makes them less

useful in practical applications. In this paper we propose a novel coding residual map learning

scheme for fast and robust FR based on the fact that occluded pixels usually have higher coding resi-

duals when representing an occluded face image over the non-occluded training samples. A dictio-

nary is learned to code the training samples, and the distribution of coding residuals is computed.

Consequently, a residual map is learned to detect the occlusions by adaptive thresholding. Finally the

face image is identified by masking the detected occlusion pixels from face representation. Experi-

ments on benchmark databases show that the proposed scheme has much lower time complexity but

comparable FR accuracy with other popular approaches.

Keywords: Robust face recognition, coding residual map, adaptive masking

*The first two authors contribute equally to this work. 1Corresponding author. Email: [email protected].

2

1 Introduction

Face recognition (FR) has been an active research topic in the fields of computer vision and biome-

trics for more than three decades [1-7, 26-28]. Early FR methods analyze the geometric features of

facial images, such as the location of nose, eyes, and mouth, [1-2]. However, these methods are very

sensitive to the changes in illumination and facial expression. To solve this problem, the appearance

based FR methods extract some holistic features from the original face image vectors for matching.

Many subspace learning based holistic feature extraction methods have been developed, including

Eigenfaces [29], Fisherfaces [36], Local Preserving Projection (LPP) [4], 2D-PCA [31], Independent

component analysis (ICA) [30], Sparsity Preserving Projections [5], etc.

The subspace learning based FR methods are simple to implement, fast, and work well for face im-

ages without occlusion. However, in many practical FR applications, the face images are occluded

(e.g., with sunglasses and scarf), and the conventional subspace based methods [4-6, 29-31, 36] can-

not deal with occlusion well. Representation based face classification methods have been recently

proposed [7, 9-11, 24-25] for robust FR. One typical example is the sparse representation based clas-

sification (SRC) scheme [7], which hypothesizes that the multiple training images of a subject could

well reconstruct other samples from the same subject. In SRC, the query image (or its holistic feature

vector) is sparsely coded over the training images (or their holistic feature vectors), and the identity of

the query sample is assigned to the class which yields the minimum coding residual. In the coding

process of SRC, the l1-norm sparsity is imposed on the coding coefficients to increase discrimination.

Furthermore, to increase the robustness to occlusions, in SRC the l1-norm is also used to characterize

the coding residual, which makes SRC more time consuming.

Apart from SRC, other methods which are robust to face occlusions have also been developed, such

as Eigenimages [37-38], probabilistic local approach [39], auto-associative network [40], and occlu-

sion estimation via partial filters [41]. In [40] an auto-associative network was used to detect outliers

and the occluded face was then completed by replacing the occluded pixels by the outputs of the auto-

associative network. Compared to these methods [37-41], SRC could handle more general types of

3

occlusions, including real disguise, continuous occlusion, pixel-wise corruption with unknown loca-

tion and unknown intensity, etc.

FR methods based on local features (e.g., local binary pattern [42], line edge map [44], parabola

edge map [45], and oriented edge magnitude feature [46]) have also been proposed. For example, Gao

and Leung [44] proposed to represent face images by line edge maps and recognize face images by

minimizing the line segment Hausdorff distance. In order to encode not only local structure but also

global structure of a face image, Chen and Gao [43] proposed the Stringface based on face edge/line

features, and transformed FR as a string-to-string matching problem. Although edge feature such as

Stringface [43] is robust to some types of face variations (e.g., lighting changes) to some extent, it

cannot handle facial occlusion with random block and random pixel corruption, where accurate edge

detection is very difficult to achieve [47].

SRC based FR has been attracting much attention from researchers due to its promising results to

recognize occluded face images. Many related works have been developed, such as the l1-graph for

image classification [8], kernel based SRC [9], Gabor feature based SRC [10, 48], robust sparse cod-

ing (RSC) [24], robust alignment with sparse and low rank decomposition [25], joint dimension re-

duction and dictionary learning [49], face and ear multimodal biometric system [50], etc. In particular,

the RSC method [24] has shown excellent results in FR with various occlusions. From the viewpoint

of maximal likelihood estimation, the l1-norm or l2-norm characterization of the representation resi-

dual is only optimal when the residual follows Laplacian or Gaussian distribution, thus Yang et al.

[24] used a robust regression function to measure the representation residual. Although the RSC me-

thod has high recognition accuracy, it is also very time-consuming, like SRC.

Very recently, Zhang et al. [12] verified that it is not the sparse representation (i.e., the l1-norm

sparsity imposed on the coding coefficients) but the collaborative representation (i.e., using the train-

ing samples from all classes to collaboratively represent the query face image) that plays the key role

in SRC for face classification. Zhang et al. then proposed to use the l2-norm to regularize the repre-

sentation coefficients, and the so-called collaborative representation based classification with regula-

rized least square (CRC_RLS) method achieves similar accuracy to SRC but with significantly less

computational cost [12].

4

In CRC_RLS, the coding residual is modeled by l2-norm. Although it achieves reasonable perfor-

mance in FR with real disguise (e.g., sunglass and scarf) on the AR database, it cannot robustly deal

with other types of face occlusions (e.g., random corruption, random occlusion). If we use l1-norm to

model the representation residual for robustness to occlusion, this will increase much the computa-

tional cost. It is therefore very important for us to develop a robust but fast FR scheme to handle face

occlusion. This is the motivation of our work.

When a query face image is represented as a linear combination of non-occluded face images, the

representation residual actually can be used to detect many face occlusion pixels because those oc-

cluded pixels tend to have larger reconstruction errors than the normal non-occluded pixels. There-

fore, if we can detect and remove these pixels from the face representation, more robust FR results

can be obtained. However, different face features, such as eyes, nose, mouth and cheek, will have

different representation accuracy and thus we cannot use a single threshold to detect the face occlu-

sion pixels. In this paper, we propose to learn a face coding residual distribution map from the training

samples by coding them over a dictionary, which is also learned from the training samples. Then for a

given query image, it is coded over the learned dictionary, and is then adaptively masked based on its

coding residual and the pre-learned coding residual map. By removing the detected occlusions, the

query face image can then be robustly represented and classified. Although the idea of our occlusion

detection method is quite simple, our experimental results showed that the proposed scheme can

achieve competitive results with other well-known and robust FR methods in terms of both speed and

accuracy.

The rest of this paper is organized as follows. Section 2 briefly introduces the main procedures of

SRC and CRC. The proposed error map learning method is presented in detail in Section 3. Section 4

introduces the spatially adaptive masking. Section 5 conducts extensive experiments to test the pro-

posed method and compare it with other well-known methods. Finally, Section 6 concludes this paper.

5

2 SRC and CRC

In [7], Wright et al. proposed a sparse representation based classification (SRC) scheme for face rec-

ognition. Let A = [A1, A2, …, Ac] be the set of training samples from all the c classes, where Ai is the

subset of the training samples from class i. Denote by y a query sample. In SRC, y is sparsely coded

over A via l1-minimization

{ }2

2 1ˆ arg min γ= − +y Aαα α α (1)

where γ is a scalar constant. The classification is done by

( ) { }identity argmini ie=y (2)

where 2

2ˆi i ie = −y Aα , ˆ =α [ 1α̂ ; …; ˆcα ] and ˆiα is the coefficient vector associated with class i.

To make SRC robust to face occlusion, an identity matrix I is introduced as a dictionary to code the

occluded pixels [7]:

[ ] [ ] [ ], 1ˆ arg min ; s.t. , ;⋅β β y = A I βαα = α α (3)

Actually Eq. (3) is essentially equivalent to:

1 1ˆ arg min { }= − +y Aαα α γ α (4)

Since both the coding coefficient and coding residual are modeled by l1-norm, the complexity of SRC

is very high.

Though the role of sparse representation in robust FR is much emphasized in [7], Zhang et al.[12]

verified that it is the collaborative representation mechanism (i.e., representing the query image colla-

boratively using all training samples) in SRC that affects much FR. The l1-norm sparsity on the cod-

ing coefficients is not that crucial to FR. Zhang et al. proposed to simply use the regularized least

square to code the query image, and the so-called CRC_RLS scheme represents y as [12]

{ }2 2

2 2ˆ arg min γ= − +y Aαα α α

(5)

The classification in CRC_RLS is similar to that in SRC.

6

The complexity of CRC_RLS is significantly lower than SRC because α̂ can be simply calculated

as ˆ = ⋅P yα , where ( ) 1T Tγ−

= + ⋅P A A Ι A can be pre-calculated and it is independent of y. However,

CRC_RLS has no special settings to deal with occlusion and it is less robust to occluded faces.

3 Coding residual map learning

3.1 Dictionary learning

In representation based FR, the recognition is performed by coding the query image over a dictionary.

One straightforward way is to use the original training samples as the dictionary, such as in SRC [7]

and CRC_RLS [12]. Since the training samples are generally non-occluded face images, the occluded

pixels in a query image usually cannot be well reconstructed by the non-occluded samples and thus

they will have big reconstruction errors. Intuitively, we can use the coding residual to detect the oc-

cluded pixels (for example, by setting a detection threshold), and then mask the detected occlusion

pixels from the face coding to achieve robust FR.

However, the face has various features, e.g., eyes, nose, mouth and cheek, which will have different

variances of coding residuals. Therefore, it is hard to use a global threshold to effectively detect the

occlusions in different facial areas. In order to make the occlusion detection more accurate, knowing

the coding residual variances of different facial features is important such that spatially adaptive oc-

clusion detection can be carried out.

To achieve the above objective, we can learn a dictionary from the training samples, and use this

dictionary to code the training samples. The variances of coding residuals can then be computed to

build the coding residual map. By coding a query image over this dictionary and with the learned cod-

ing residual map, the occlusion pixels can then be adaptively detected. In addition, compared with

using the original training samples as the naïve dictionary for face representation, dictionary learning

can also bring advantages such as removing noise and trivial structures for more accurate face repre-

sentation, as well as making the representation more discriminative.

Many dictionary learning methods have been proposed for image processing [13-14, 32-33] and

pattern recognition [15-16, 34, 49] in the past. In [16], a Fisher discrimination dictionary learning

7

(FDDL) method was proposed for sparse representation based image recognition. Inspired by FDDL

and considering that the sparsity on the coding coefficients is not that critical for FR [12], we propose

a simpler dictionary learning model. Denote by D= [D1, D2, …, Dc] the dictionary to be learned,

where Di is the class-specified sub-dictionary associated with class i. The dictionary D is learned from

the training dataset A. In general, we require that each column of the dictionary Di is a unit vector, and

the number of atoms in Di is no bigger than the number of training samples in Ai.

We denote by Xi the coding coefficient matrix of Ai over D, and Xi = [Xi1; …; Xi

j; …; Xic], where Xi

j

is the coding matrix of Ai over the sub-dictionary Dj. We propose to learn the dictionary D by optimiz-

ing the following objective function:

( ){ }22( , ) 1 21

( , )arg min c

i i i iFi FJ R λ λ

== + + −∑D X

D XD X X X

(6)

where

2 221( )

ci jji i i i i i j iF F Fj i

R =≠

= − + − +∑D A DX A D X D X

(7)

and iX is the column mean matrix of Xi, i.e., every column of iX is the mean vector mi of all the col-

umns in Xi. The parameters λ1 and λ2 are positive scalar numbers to balance the F-norm terms in Eq.

(6).

From Eq. (7), one can see that the term Ri(D) ensures that the training samples from class i (i.e., Ai)

can be well reconstructed by the learned sub-dictionary Di, while they have small representation coef-

ficients on the other sub-dictionaries Dj, j≠i. Therefore, the learned whole dictionary D will be discri-

minative in terms of reconstruction. On the other hand, the term 2

i i F−X X in Eq. (6) will make the

representation of samples from the same class close to each other, reducing the intra-class variations.

Finally, we use the F-norm, instead of the l1-norm, to regularize the coding coefficients X in Eq. (6),

and this significantly reduces the complexity of optimizing Eq. (6).

The objective function in Eq. (6) is a joint optimization problem of D and X, and it is convex to D

or X when the other is fixed. Like in many multi-variable optimization problems, we could solve

8

Eq.(6) by optimizing D and X alternatively from some initialization. Since in each step, the optimiza-

tion is convex and all the terms involved are of F-norm, the optimization can be easily solved. The

learned dictionary D will be different by using different settings of parameters λ1 and λ2. Our experi-

mental results show that the final FR rates are not sensitive to λ1 and λ2 in a wide range. In our expe-

riments, we set them as λ1=0.001 and λ2=0.002 for all the databases by experience.

3.2 Residual map learning

Once the dictionary D is learned from the training dataset A, it can be used to code a given query

sample y by

{ }2 2

2 2ˆ arg min γ= − +α y Dα α α

(8)

The solution α̂ can be easily calculated as ˆ = ⋅α P y , where the projection matrix

( ) 1T Tγ−

= + ⋅P D D Ι D can be pre-computed. We can then calculate the coding residual ˆ−ye = y Dα .

When there are occlusions in the query image y, its coding residual at the occluded locations will

probably exceed the “normal range”. Therefore, if we know the “normal range”, more specifically the

standard deviation, of each element of ey, we can adaptively detect the occlusions in y.

Obviously, the deviation of coding residual will vary with different facial areas. In most cases, the

areas such as eyes and mouth will have bigger residual than the areas such as cheek because they have

more edge structures which are more difficult to reconstruct. This coding residual deviation map can

be learned by coding the training samples in A over the dictionary D. There is

{ }2 2ˆ arg minF F

γ= − +ΛΛ A DΛ Λ

(9)

Clearly, ˆ = ⋅Λ P A , and the coding residual matrix is ˆ= −E A DΛ .

Each row of E, denoted by ek, contains the coding residuals of all face samples at the same location

k, and thus its standard deviation can be used to define the normal range of the coding residual at this

location. Denote by

9

σk =std(ek) (10)

the standard deviation of ek, and then all the σk together will build a coding residual map, which indi-

cates the normal range of coding residual at each location. Certainly, from Eq. (9) we know that the

residual map depends on the parameter γ, and Fig. 1 shows several residual maps calculated on the

AR database [17] with different values of γ. It can be seen that the different face structures will have

different residual deviation, while the residual map varies with γ. Therefore, a suitableγ must be de-

termined for robust FR.

Fig. 1. Examples (by the AR database) of the learned coding residual map with different settings of γ. From left to right, γ=0.1, 1, 2, respectively.

We determineγ by checking which value of γ can make the coding in Eq. (9) optimal. It can be em-

pirically found that the coding residuals in E and the coding coefficients in Λ are nearly Gaussian

distributed. We assume that the residuals in E and the coefficients in Λ follow i.i.d. Gaussian distribu-

tions, respectively. Based on the maximum a posterior (MAP) principle, the desired coefficients Λ

should make the probability P(Λ|A) maximized. According to the Bayesian formula and after some

straightforward derivations, the parameter γ which could lead to the MAP solution of Λ should satisfy

2 2

opt F Fγ = E Λ

(11)

In implementation, we use a set of different values of γ to code A by Eq. (9), and the corresponding

coefficient Λ and residual E can be obtained. Then we could check if the used γ is close enough to the

associated γopt in Eq. (11). The γ which is close to its associated γopt the most is then used to compute

the residual map elements σk.

10

4 Recognition with adaptive masking

4.1 Detecting the occlusion pixels

Once the residual map is learned, we can use it to detect the occluded outlier pixels in the query image

y based on its coding residual ey. Intuitively, at location k, if |ey(k)| is much bigger than σk in the resi-

dual map, it is highly possible that pixel k in y is occluded. It is empirically found that the coding resi-

dual at location k, i.e., ek, approximately follows zero-mean Gaussian distribution, while the shape of

the Gaussian distribution is controlled by σk. Fig. 2 plots the histograms of ek at three different types

of areas, eye, nose and cheek, respectively. We can see that the distributions are Gaussian like, and

the eye and nose regions have much higher standard deviation values than the smooth cheek area.

Fig. 2. Histograms of the coding residuals at regions of eye (in brown), nose (in red) and cheek (in blue), respec-tively.

It is known that most of the values of a Gaussian distribution will fall into the interval bounded by

several times of its standard deviation. Therefore, if a pixel at location k is occluded, the coding resi-

dual at this location will often exceed the normal range, and |ey(k)|>cσk is very likely to happen, where

c is a constant. In this paper, we use the following simple rule

pixel k is occluded if |ey(k)|>cσk (12)

11

to detect the occlusions. This occlusion detection rule is rough, however, it makes the detection effi-

cient and it is able to remove a large portion of occluded pixels and improve the recognition rate sig-

nificantly, as will be seen in our experimental results.

4.2 Masking and coding

After detecting the occluded pixels in y, we can partition the query image y into two parts: y=[ync;yoc],

where ync denotes the non-occluded part and ync denotes the occluded part. Since each pixel in y has a

corresponding row in the learned dictionary D, we can accordingly partition the dictionary D into two

parts, i.e., D=[Dnc; Doc], where Dnc is the sub-dictionary for ync and Doc is for yoc.

Since the occlusion in the query image will reduce the FR accuracy, obviously we can exclude yoc

from coding and use only ync to recognize the identity of y. Therefore, the coding after masking is

performed as:

{ }2 2

2 2ˆ arg minnc nc nc γ= − +α y Dα α α

(13)

The solution of Eq. (13) is ˆnc nc nc= ⋅α P y with ( ) 1T Tnc nc nc ncγ

−= + ⋅P D D Ι D . Since Pnc depends on the

input query sample y and it cannot be pre-computed, the calculation of ˆncα is the most time-consuming

step of our proposed scheme. The detailed complexity analysis will be made in Section 5.5, and the

running time comparison will demonstrate that the proposed scheme is much faster than state-of-the-

art robust FR methods with comparable FR accuracy.

4.3 Classification

After ˆncα is obtained by solving Eq. (13), the coding coefficient and class-specific coding residual can

be used to determine the identity of query image y. The class-specific coding residual can be calcu-

lated as _ _ 2ˆi nc nc i nc ie = −y D α , where Dnc_i and _ˆnc iα are the sub-dictionary and sub-coding vector

associated with class i, respectively. Recall that in the dictionary learning stage in Section 3.1, we

have also learned the mean coding vector mi of each class. Denote by mnc_i the corresponding mean

12

coding vector to the non-occluded face vector ync. The distance between _ˆnc iα and mnc_i can also help

classifying y. Let _ 2ˆi nc nc ig = −α m . Finally, we could fuse ei and gi for decision making:

fi =ei + w⋅gi (14)

where weight w is a constant. The identity of the query image is then determined by: identity(y) =

argmini{fi}.

5 Experimental results

In this section, we perform extensive experiments on benchmark face databases to demonstrate the

performance of our proposed algorithm. We first discuss the parameter selection in Section 5.1; in

Section 5.2 we test the proposed algorithm on two databases (AR [17] and MPIE [18]) without dis-

guise and occlusion; in Sections 5.3 and 5.4, we test the proposed method on three databases, AR,

Extended Yale B and MPIE, with disguise and occlusion; finally, we will discuss the computational

efficiency of the proposed method in Section 5.5.

5.1 Parameter selection

There are five parameters in our algorithm: λ1 and λ2 in Eq. (6), γ in Eq. (9) and Eq. (13), the constant

c in Eq. (12) and the weight w in Eq. (14). Among these parameters, λ1 and λ2 are related to dictionary

learning, and w is related to the distance measurement. Based on our experimental experience, we

simply fix λ1=0.001, λ2=0.002, and w=0.1 on all datasets. As discussed in Section 3.2, the value of γ is

automatically determined by Eq. (11) in the training phase.

The value of c is critical in detecting the occlusion points. If c is too small, too many points will be

regarded as outlier; if c is too large, fewer outliers will be detected. Fig. 3 shows some detection ex-

amples. If no specific instruction, in our following experiments, c is set as 2, 1 and 1 in the AR, Ex-

tended Yale B and MPIE databases, respectively.

13

(a) (b) (c) (d) (e) (f)

Fig. 3. Example of occlusion point detection. (a) is the original test image. From (b) to (f): the occlusion detec-tion results by letting c=12,10,6,4,2, respectively.

5.2 Recognition without occlusion

Although our algorithm is mainly designed to handle occlusion, it can also be applied to recognize

normal face images. In clean face images without occlusion or disguise, there are still some pixels

which may lead to recognition error, and they can be viewed as outliers. We can detect these pixels by

using our proposed method and exclude them from recognition. In this section, we evaluate our algo-

rithm in the popular databases AR and MPIE.

a) AR database: The setting in our experiments is the same as in [7]. A subset in AR [17] that con-

tains 50 males and 50 females with only illumination and expression variances is used. For each indi-

vidual, the seven images from Section 1 are used as training samples while the other seven images

from Section 2 are used for testing. The original face image is downsampled to 36×22 in our method.

The best results of competing methods, including the nearest neighbor (NN) classifier, SRC [7],

CRC_RLS [12], RSC [24] and the proposed method, are presented in Table 1. In addition, to more

clearly show the advantage brought by the learned residual map, we also present the result of

CRC_RLS coupled with an outlier detector by global thresholding. That is, we first remove the pixels

whose representation residuals are larger than a predefined global threshold, and then classify the face

image using the remaining part by CRC_RLS. We call this global thresholding based scheme

CRC_RLS_GT. In the following experiments, we report the best results of CRC_RLS_GT by choos-

ing a suitable threshold.

CRC_RLS, CRC_RLS_GT, and the proposed method have similar reconstruction strategy, while

the proposed method achieves higher recognition rate. This demonstrates that the proposed outlier

pixel detection method can remove some insignificant (even negative) pixels in the face images, and

14

hence improve the recognition rate. Using a global threshold to remove the outlier pixels is not help-

ful, and CRC_RLS_GT has the same accuracy as CRC_RLS. Compared with NN and SRC, the pro-

posed method also achieves about 23.8% and 1.8% higher recognition rate. The accuracy of the pro-

posed algorithm is only slightly lower than RSC, whose complexity is much higher than our method

(please refer to Section 5.5. for the running time comparison).

Table 1. Recognition rates on the AR database by different methods.

NN SRC CRC_RLS CRC_RLS_GT RSC Proposed 71.3% 93.3% 93.7% 93.7% 96.0% 95.1%

b) Multi PIE database: In this experiment, all the 249 subjects in Session 1 in CMU Multi-PIE

[18] are used. We use 7 frontal images with extreme illuminations {0, 1, 7, 13, 14, 16, 18} and neutral

expression of each subject as the training set. For testing set, 4 images of illuminations taken with

smile expressions from each person in the same session are used. The face images are directly down-

sampled to 25×20. Table 2 shows the recognition rates of different methods. Again, the proposed me-

thod achieves the second best recognition rate, just after the RSC scheme.

Table 2.Recognition rates on the MPIE database by different methods.

NN SRC CRC_RLS CRC_RLS_GT RSC Proposed 86.4% 93.9% 94.1% 94.1% 97.8% 94.4%

5.3 Recognition with real disguise

As in [7], a subset from the AR database, consisting of 50 male and 50 female subjects, is used here.

800 images (about 8 samples per subject) of non-occluded frontal views with various facial expres-

sions in both sessions are used for training, while the samples with sun glasses and scarves (1 sample

per subject) in both sessions are used for testing. Fig. 4 shows some example query images with dis-

guise. The images are directly down-sampled to 42×30 with normalization.

15

Fig. 4. Testing samples with sunglasses and scarves in the AR database.

The occlusion pixels detected by the proposed method are illustrated in Fig. 5. It can be seen that

the proposed algorithm detects many outlier points, and also makes some wrong judgments. Fortu-

nately, occluded parts are detected as outliers more often than non-occluded parts. For smaller ratio of

occlusion (e.g., sunglasses disguise), the detection result is more accurate (see Figs. 5(a) and 5(c)); but

when the occlusion ratio is large (e.g., the scarf disguise), the detection result is less accurate (see

Figs. 5(b) and 5(d)). It should be stressed that our goal is to introduce a robust and efficient FR

scheme instead of occlusion detection.

The recognition rates by competing methods are listed in Table 4. Although the occlusion detection

is not accurate enough, the proposed scheme can still obtain much better results than all the competing

methods except for RSC. By detecting outlier pixels with a global threshold, the CRC_RLS_GT me-

thod could improve the performance in some case (e.g., FR with sunglasses), while the proposed

adaptive thresholding method can achieve 15% higher accuracy than CRC_RLS_GT in the sunglass

case. Again, although the RSC method has the highest accuracy, we would like to emphasize that it

has much higher complexity than the proposed method.

(a) (b) (c) (d)

Fig. 5. The occlusion detection results of some testing samples in the AR database.

16

Table 3. Recognition results by different methods on the AR database with sunglasses and scarves disguise.

Algorithm Sun- Scarves NN 70.0% 12.0%

SRC [7] 87.0% 59.5% CRC_RLS [12] 68.5% 90.5% CRC_RLS_GT 77.5% 90.5%

RSC [24] 99.0% 97.0% Proposed 93.0% 90.5%

5.4 Recognition with random block occlusion

In this section, we test the robustness of our algorithm to random block occlusion. As in [7], Subsets 1

and 2 of the Extended Yale B database [6] are used for training and Subset 3 for testing. Each testing

sample will be inserted an unrelated image as block occlusion, and the blocking ratio is from 10% to

50% as illustrated in Fig 6. The images are cropped and down-sampled to 48×42.

All training and testing samples are normalized to reduce the effect of illumination variance. The

occlusion detection results of the example images in Fig. 6 are shown in Fig. 7. The recognition rates

by the competing methods are listed in Table 4. We can see that when the block occlusion ratio is low,

all methods can achieve good recognition accuracy; when the block occlusion ratio increases, the ac-

curacy of NN, CRC_RLS and SRC will decrease rapidly, while RSC and the proposed method can

still have good results. Meanwhile, the proposed method performs much better than the global thre-

sholding based CRC_RLC_GT (e.g., over 5% and 7% improvements in 40% and 50% block occlu-

sion, respectively). When the occlusion ratio is 50%, the proposed method surpasses SRC more than

10%, while being only about 6% lower than RSC.

Fig. 6. Examples of random block occlusion in the Extended Yale B database. From left to right: occlusion ratio is 20%, 40%, 50%, respectively.

17

Fig. 7. Occlusion detection results of the samples in Fig. 6.

Table 4. Recognition results by different methods on the Extended Yale B database with various random occlusion ratios.

Occlusion ratio 10% 20% 30% 40% 50% NN 90.1% 85.2% 74.2% 63.8% 48.1%

SRC [7] 100% 99.8% 98.5% 90.3% 65.3% CRC_RLS [12] 99.8% 93.6% 82.6% 70.0% 52.3% CRC_RLS_GT 100% 99.8% 96.9% 88.1% 70.9%

RSC [24] 100% 100% 99.8% 96.9% 83.9% Proposed 100% 99.8% 98.5% 93.6% 77.9%

5.5 Recognition with random pixel corruption

In this section, we evaluate the robustness of the proposed method to random pixel corruption. As in

[7], we still use Subsets 1 and 2 (717 images, normal-to-moderate lighting conditions) of the Extended

Yale B database [6] for training, and use Subset 3 (453 images, more extreme lighting conditions) for

testing. The images were resized to 96×84 pixels. For each testing image, we replaced a certain per-

centage of its pixels by uniformly distributed random values within [0, 255]. The corrupted pixels

were randomly chosen for each test image and the locations are unknown to the algorithm. Some cor-

rupted face images are shown in Fig. 8, from which we can observe that it is difficult to recognize the

corrupted face images even for humans.

The recognition rates of all competing methods under different corruption ratios (from 0% to 70%)

are listed in Table 5. Here we set c as 6 in all tests. The proposed method, SRC and RSC all achieve

100% FR rates when the corruption ratio is from 0% to 40%. When the corruption ratio is 50% or

60%, the performance of the proposed method is still very close to RSC and SRC. Meanwhile, the

proposed method clearly outperforms NN, CRC_RLS, and CRC_RLS_GT.

18

Fig. 8. Example face images with random pixel corruption in the Extended Yale B database. From left to right: corruption ratio is 40%, 50%, 60%, respectively.

Table 5. Recognition results by different methods on the Extended Yale B database with various random pixel corruption ratios.

Corruption 0% 10% 20% 30% 40% 50% 60% 70% NN 94.0% 96.2% 96.5% 94.7% 82.3% 65.6% 40.2% 25.8% SRC [7] 100% 100% 100% 100% 100% 100% 99.3% 90.7% CRC_RLS[12] 100% 100% 100% 99.8% 98.9% 96.4% 79.9% 45.7% CRC_RLS_GT 100% 100% 100% 100% 100% 98.5% 88.3% 58.3% RSC[24] 100% 100% 100% 100% 100% 100% 100% 99.3% Proposed 100% 100% 100% 100% 100% 98.9% 91.4% 65.3%

5.6 Complexity analysis

From the experimental results in Sections 5.2~5.5, we see that the proposed method’s recognition

accuracy is only slightly lower than RSC but much higher than SRC (in most cases) and CRC_RLS,

which are among the state-of-the-art methods. Let’s then compare the time complexity and running

time between our method and SRC, CRC_RLS and RSC. (Note that the complexity of CRC_RLS_GT

is the same as the proposed method.)

The computational cost in our method mainly comes from solving Eq.(13). The solution of Eq. (13)

can be written as

( ) ˆT Tnc nc nc nc ncγ⋅ = + ⋅D y D D Ι α

(15)

Eq. (15) can be viewed as a linear equation system, where the coefficient ˆncα is to be computed. Here

we use the Conjugate Gradient Method (CGM) [19-20] to solve this linear equation system, which is

much more efficient than solving the inverse problem in Eq. (15). Suppose that Dnc is an N×M matrix.

In FR problems, usually we have N>M. The solution of Eq. (15) is a vector of length M. The compu-

tational complexity of CGM is O(M) for each iteration, so the complexity of solving Eq. (15) is

O(KM) if there are totally K iterations. Since we do not need a very accurate solution to Eq. (15), we

19

set K=30 as the maximum number of iterations in our experiments. The complexity for calculating

Tnc ncD D is O(NM2). Thus the total time complexity of our method is O(NM2+KM).

The following Tables 6~8 show the running time of CRC_RLS [12], RSC [24], SRC with l1_ls [21]

and SRC with fast homotopy [22] (for other fast l1-norm minimization methods please refer to [23]),

and our proposed algorithm in three experiments, which are conducted under MATLAB programming

environment on a PC with Intel R Core 2 1.86 GHz CPU and 2.99 GB RAM. The settings are the

same as those in previous sections, i.e., AR with sunglass disguise, Extended Yale B with 50% occlu-

sion, and MPIE without occlusion. All samples are directly down-sampled from the original face im-

ages. The reported running time is the average time consumed by each testing sample.

Table 6. Recognition rates and running time on the AR database with sunglass disguise.

Algorithm Recognition Rate Running Time SRC(l1_ls) 87.0% 34.50s

SRC(homotopy) 86.4% 0.120s CRC_RLS 68.5% 0.017s

RSC 99.0% 46.91s Proposed 93.0% 0.141s

Table 7. Recognition rates and running time on the Extended YaleB database with 50% block occlusion.



RSC 83.9% 73.36s Proposed 77.9% 0.278s

Table 8. Recognition rates and running time on the MPIE database without occlusion.



RSC 97.8% 120s Proposed 94.4% 0.858s

20

From Tables 6~8, we can make the following conclusions. First, in all the experiments, RSC al-

ways achieves the best recognition rates, while the proposed method always has the second best rec-

ognition rates. However, the speed of our method is hundreds of times faster than RSC. Second,

CRC_RLS is the fastest algorithm among all the competing methods. It has similar recognition rate to

our proposed method for FR without occlusion (refer to Table 8), but has much worse recognition

rates than ours for FR with occlusion. Third, SRC implemented by homotopy techniques has similar

running time to our method, whereas its recognition rates are lower than our method, especially for

FR with occlusion. Finally, SRC implemented by l1_ls is very slow without improving much the rec-

ognition rates compared to SRC implemented by homotopy. In summary, the proposed method

achieves a very good balance between robustness and efficiency. In practical FR applications, the

database can be of large scale, and our method could lead to desirable recognition accuracy with ac-

ceptable time consumption.

6 Conclusion

In this paper, we proposed a simple yet robust and efficient FR scheme by coding the query sample

over a dictionary learned from the training samples. To make the FR robust to occlusions and dis-

guise, a coding residual map was first learned from the training samples, and then it is used to detect

adaptively the outlier points in the query sample. The detected outliers were then excluded from the

coding of the query sample to improve the robustness of FR to occluded samples. Our extensive expe-

rimental results on benchmark face databases show that the proposed scheme is very competitive with

state-of-the-art methods in terms of accuracy, and it is much faster than the methods such as SRC [7]

and RSC [24]. Overall, the proposed method has both robust FR performance and efficiency. It is a

very good candidate for real-time face recognition applications.

21

7 References

1. I. Cox, I, J. Ghosn, and P. Yianil, Feature-base face recognition using mixture distance. In ICCV, 1996.

2. L. Wiskott, J.M. Fellous, N. Kuiger, C. Malsburg. Face recognition by elastic bunch graph matching. IEEE

Trans. Pattern Analysis and Machine Intelligence, 19(7): 775-779, 1997.

3. J. H. Wang, J. You, Q. Li, and Y. Xu. Orthogonal discriminant vector for face recognition across pose.

Pattern Recognition, 45(12): 4069-4079, 2012.

4. X. He and P. Niyogi, Locality Preserving Projections. In NIPS, 2003.

5. L. S. Qiao, S. C. Chen, and X. Y. Tan. Sparsity preserving projections with applications to face recognition.

Pattern Recognition, 42: 331-341, 2010.

6. K. Lee, J. Ho, and D. Kriegman. Acquiring linear subspaces for face recognition under variable lighting.

IEEE Trans. Pattern Analysis and Machine Intelligence, 27(5): 684–698, 2005.

7. J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation.

IEEE Trans. Pattern Analysis and Machine Intelligence, 31(2): 210–227, 2009.

8. B. Cheng, J. Yang, S. Yan, Y. Fu, and T. Huang. Learning with l1-graph for image analysis. IEEE Trans.

Image Processing, 19(4): 858-866, 2010.

9. S. Gao, I. Tsang, L. Chia, Kernel sparse representation for image classification and face recognition. In

ECCV 2010.

10. M. Yang and L. Zhang. Gabor feature based sparse representation for face recognition with Gabor occlu-

sion dictionary. In ECCV2010.

11. J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, and S. Yan. Sparse representation for computer vision and

pattern recognition. Proceedings of IEEE, 98(6): 1031-1044, 2010.

12. L. Zhang, M. Yang, X. Feng. Sparse representation or collaborative representation: Which helps face rec-

ognition? In ICCV, 2011.

13. M. Aharon, M. Elad, and A.M. Bruckstein. The K-SVD: an algorithm for designing of overcomplete dictio-

naries for sparse representation. IEEE Trans. Signal Processing, 54(11): 4311-4322, 2006.

14. A.M. Bruckstein, M. Elad, Dictionaries for Sparse Representation Modeling. Proceedings of the IEEE ,

98(6): 1045-1057,2010

15. H. R. Wang, C. F. Yuan, W. M. Hu, and C. Y. Sun. Supervised class-specific dictionary learning for sparse

modeling in action recognition. Pattern Recognition, 45(11): 3902-3911, 2012.

22

16. M. Yang, L. Zhang, X. Feng, D. Zhang, Fisher Discrimination Dictionary Learning for Sparse Representa-

tion. In ICCV 2011.

17. A.M. Martinez and R. Benavente. The AR face database. CVC Technical Report No. 24, 1998.

18. R. Gross, I. Matthews. J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 28:807–

813, 2010.

19. T. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern

recognition. IEEE Trans. on Electronic Computers, 14(3): 326-334, 1965.

20. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research

of the National Bureau of Standards 49(6), 1952

21. S.J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky. A interior-point method for large-scale l1-

regularized least squares. IEEE Journal on Selected Topics in Signal Processing, 1(4): 606–617, 2007.

22. D. Malioutove, M. Cetin, and A. Willsky. Homotopy continuation for sparse signal representation, in

ICASS 2005.

23. A. Yang, A. Ganesh, Z. Zhou, S. Sastry, and Y. Ma. Fast l1-minimization algorithms and application in

robust face recognition. UC Berkeley, Technique Report.

24. M. Yang, L. Zhang, J. Yang, D. Zhang. Robust sparse coding for face recognition. In CVPR, 2011.

25. Y. G. Peng, A. Ganesh, J. Wright, W. L. Xu, and Y. Ma. RASL: robust alignment by sparse and low-rank

decomposition for linearly correlated images. In CVPR, 2010.

26. W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Com-

puting Survey, 35(4): 399-458, 2003.

27. S. Z. Li and A. K. Jain. Handbook of Face Recognition (Second Edition). Springer, 2011.

28. R. Jafri and H. R. Abrania. A survey of face recognition techniques. Journal of Information Processing

Systems, 5(2): 41-68, 2009.

29. M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1): 71-86, 1991.

30. P. Comon. Independent component analysis — a new concept? Signal Process, 36(3): 287-314.

31. J. Yang, D. Zhang, A. Frangi, and J. Yang. Two-dimensional PCA: A new approach to appearance-based

face representation and recognition. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(1): 131-

137, 2004.

32. M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictiona-

ries. IEEE Trans. Image Processing, 15(12): 3736-3745, 2006.

23

33. J. Marial, M. Elad, and G. Sapiro. Sparse representation for color image restoration. IEEE Trans. Image

Processing, 17(1): 53-69, 2008.

34. J. Mairal, F. Bach, J. Ponce, G. Sapiro, andA. Zisserman, Supervised dictionary learning. In NIPS, 21:

1033–10402009.

35. W. H. Deng, J. N. Hu, J. Guo, W. D. Cai, and D. G. Feng. Robust, accurate and efficient face recognition

from a single training image: A uniform pursuit approach. Pattern Recognition, 43(5): 1748-1762, 2010.

36. P. Belhumeur, J. Hespanda, and D. Kriegman, Eigenfaces versus Fisherfaces: Recognition Using Class

Specific Linear Projection. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(7): 711-720, 1997.

37. A. Leonardis and H. Bischof, Robust recognition using eigenimages, Computer Vision and Image Under-

standing, 78(1): 99-118, 2000.

38. S. Chen, T. Shan, and B.C. Lovell, “Robust face recognition in rotated eigenspaces,” Proc. Int’l Conf. Im-

age and Vision Computing New Zealand, 2007.

39. A.M. Martinez, Recognizing Imprecisely localized, partially occluded, and expression variant faces from a

single sample per class, IEEE Trans. Pattern Analysis and Machine Intelligence, 24(6): 748-763, 2002.

40. T. Takahashi and T. Kurita, A robust classifier combined with an auto-associative network for completing

partly occluded images, Neural Networks, 18: 958-966, 2005.

41. K. Hotta, Adaptive weighting of local classifiers by particle filters for robust tracking, Patter Recognition,

42: 619-628, 2009.

42. T. Ahonen, A. Hadid, and M. Pietikainen, Face recognition with local binary pattern: application to face

recognition, IEEE Trans. Pattern Analysis and Machine Intelligence, 28(12):2037-2041, 2006.

43. W.P. Chen and Y.S. Gao, Recognizing partially occluded faces from a single sample per class using string-

based matching, In ECCV 2010.

44. Y.S. Gao and M.K.H. Leung, Face recognition using line edge map, IEEE Trans. Pattern Analysis and Ma-

chine Intelligence, 24(6):764-779, 2002.

45. F. Deboeverie, P. Veelaert, K. Teelen, and W. Philips, Face recognition using parabola edge map, In

ACIVS 2008.

46. N.-S. Vu and A. Caplier, Enhanced patterns of oriented edge magnitudes for face recognition and image

matching, IEEE Trans. Image Processing, 21(3):1352-1365, 2012.

47. K.N. Le, A mathematical approach to edge detection in hyperbolic-distributed and Gaussian-distributed

pixel-intensity images using hyperbolic and Gaussian masks, Digital Signal Processing, 21:162-181, 2011.

24

48. M. Yang, L. Zhang, Simon C.K. Shiu, and D. Zhang, Gabor Feature based Robust Representation and Clas-

sification for Face Recognition with Gabor Occlusion Dictionary, Pattern Recognition, 46(7):1865-1878,

2013.

49. Z.Z. Feng, M. Yang, L. Zhang, Y. Liu, and D. Zhang, Joint Discriminative Dimensionality Reduction and

Dictionary Learning for Face Recognition, Pattern Recognition, 46(8):2134-2143, 2013.

50. Z.X. Huang, Y.G. Liu, C.G. Li, M.L. Yang, and L.P. Chen, A robust face and ear based multimodal biome-

tric system using sparse representation, Pattern Recognition, 46(8): 2156-2168, 2013.

fast and robust face recognition via coding residual map ...cslzhang/paper/frfr_pr.pdf · meng...

Documents