
Convolutional Neural Networks for Attribute-based Active Authentication On Mobile Devices

Pouya Samangouei, University of Maryland, College Park, MD, USA ([email protected])

Rama Chellappa, University of Maryland, College Park, MD, USA (rama@umiacs.umd.edu)

Abstract

We present a Deep Convolutional Neural Network (DCNN) architecture for the task of continuous authentication on mobile devices. To deal with the limited resources of these devices, we reduce the complexity of the networks by learning intermediate features such as gender and hair color instead of identities. We present a multi-task, part-based DCNN architecture for attribute detection that performs better than the state-of-the-art methods in terms of accuracy. As a byproduct of the proposed architecture, we are able to explore the embedding space of the attributes extracted from different facial parts, such as mouth and eyes, to discover new attributes. Furthermore, through extensive experimentation, we show that the attribute features extracted by our method outperform the previously presented attribute-based method and a baseline LBP method for the task of active authentication. Lastly, we demonstrate the effectiveness of the proposed architecture in terms of speed and power consumption by deploying it on an actual mobile device.

Index terms: attributes, face, active authentication, smartphones, mobile, deep networks

1. Introduction

Mobile devices, such as cellphones, tablets, and smart watches, have become inseparable parts of people's lives. Users often store important information, such as bank account details or credentials for their sensitive accounts, on their mobile phones. According to the survey in [15], nearly half of the users do not use any form of authentication mechanism for their phones because of the frustration caused by these methods. Even when they do, the initial password-based authentication can be compromised, and thus it cannot continuously protect the personal information of the users.

Figure 1: The overview of our method. To authenticate users, we extract facial attributes by extracting the face parts and feeding them to the Convolutional Neural Network for facial Attribute-based Active authentication (CNNAA), which is an ensemble of efficient multi-task DCNNs. The predicted attributes (e.g. Male, Eyeglasses, Blond, Mustache, Beard, Chubby) are compared with the enrolled attributes; if they do not match, the phone is locked, otherwise access continues.

To mitigate this issue and make mobile devices more secure, different active authentication (AA) methods have been proposed over the past five years to continuously authenticate the user after he/she unlocks the device. [12], [11], [38], and [3] proposed to continuously authenticate users based on their touch gestures or swipes. Gait, as well as device movement patterns measured by the smartphone accelerometer, was used in [8], [41], [27] for continuous authentication. Stylometry, GPS location, web browsing behavior, and application usage patterns were used in [13] for active authentication. Face-based continuous user authentication on mobile devices has also been proposed in [14], [10], [23], and [28, 29]. Different modalities such as speech [23], gait [7], and touch [37] have been fused with faces.

State-of-the-art methods for face recognition employ Deep Convolutional Neural Networks (DCNNs) [34], [25], [31], [32]. The previously proposed architectures have many parameters to account for the huge variability of facial identities in different conditions. Thus they are not efficient enough to be used on mobile devices. For instance, it has been shown in [30] that a DCNN with an architecture similar to AlexNet [18] can drain the battery very fast.

Facial attributes are also referred to as "soft biometrics" in the literature [16], and a few of them were used in [24] to boost continuous authentication performance. The recent method of Attribute-based Continuous Authentication (ACA) [28, 29] shows that a large number of attributes can give good authentication results on mobile phones on their own. Attributes are semantic features that are easier to learn than facial identities. Also, if they are learned on facial parts, they become even less complex. By leveraging these two qualities, we design efficient CNN architectures suitable for mobile devices. As for many other tasks, CNNs have proved to be very accurate for attribute detection [21], [40], [20]. The overview of our method is presented in Figure 1.

The main contribution of this paper is the design of feasible deep architectures for continuous authentication on mobile devices. We also obtain state-of-the-art results with the proposed multi-task, part-based deep architecture for the task of facial attribute extraction. We show that unsupervised subspace clustering of the shared embeddings gathers semantically related attributes in the same cluster. In addition, we show improvements in attribute-based AA on two publicly available mobile datasets using the attributes from our approach and also the "discovered" attributes obtained by clustering. Finally, we demonstrate the feasibility of the proposed method for mobile devices by measuring its speed and power usage on a commercial mobile device.

The rest of this paper is organized as follows. We present the attribute detection models in Section 2. In Section 3, we show that the learned attribute models are effective for active authentication. In Section 4, we evaluate the performance of the proposed architecture using a generic Android device.

2. Attributes

In the mobile setting, there is a trade-off between hardware constraints, such as battery life, and the accuracy of the models. We design our models with the goal of balancing this trade-off. To do so, we move from computationally expensive but specialized models to a computationally cheaper but adequately accurate model.

We train and test four different sets of DCNNs, 100 in total, for the task of attribute classification on a set of face regions. We crop the functional face regions using landmarks detected by [4]. These face regions can be seen in Table 3. For each part, we first find the maximum size of its window over the training dataset, and then we crop the regions by putting the center of the face part at the center of the cropped window, which avoids any scaling.
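For concreteness, a minimal Python sketch of this cropping strategy, assuming the landmarks are given as (x, y) arrays; the function names, the landmark format, and the clamping at image borders are our own illustrative choices rather than the original pipeline.

```python
import numpy as np

def max_part_window(all_landmarks, part_indices):
    """Find the maximum bounding-box size of a face part over the training set.

    all_landmarks: list of (num_landmarks, 2) arrays of (x, y) points, one per image.
    part_indices: indices of the landmarks that make up this part (e.g. the mouth).
    """
    widths, heights = [], []
    for pts in all_landmarks:
        part = pts[part_indices]
        widths.append(part[:, 0].max() - part[:, 0].min())
        heights.append(part[:, 1].max() - part[:, 1].min())
    return int(np.ceil(max(widths))), int(np.ceil(max(heights)))

def crop_part(image, landmarks, part_indices, window_wh):
    """Crop a fixed-size window centered on the part center, so no rescaling is needed."""
    w, h = window_wh
    cx, cy = landmarks[part_indices].mean(axis=0)
    x0, y0 = int(cx - w // 2), int(cy - h // 2)
    # Clamp to the image borders instead of scaling the crop.
    x0 = max(0, min(x0, image.shape[1] - w))
    y0 = max(0, min(y0, image.shape[0] - h))
    return image[y0:y0 + h, x0:x0 + w]
```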

Attribute | MultiDeep-CNNAA | MultiWide-CNNAA | BinaryDeep-CNNAA | BinaryWide-CNNAA | FaceTracer | PANDA | LNet+ANet
5 o'Clock Shadow | 93 | 93 | 91 | 89 | 85 | 88 | 91
Arched Eyebrows | 81 | 82 | 82 | 83 | 76 | 78 | 79
Attractive | 81 | 81 | 81 | 82 | 78 | 81 | 81
Bags Under Eyes | 83 | 84 | 83 | 82 | 76 | 79 | 79
Bald | 99 | 99 | 96 | 98 | 89 | 96 | 98
Bangs | 95 | 95 | 94 | 94 | 88 | 92 | 95
Big Lips | 67 | 70 | 69 | 67 | 64 | 67 | 68
Big Nose | 82 | 83 | 78 | 78 | 74 | 75 | 78
Black Hair | 86 | 86 | 88 | 87 | 70 | 85 | 88
Blond Hair | 95 | 95 | 94 | 94 | 80 | 93 | 95
Blurry | 95 | 95 | 92 | 80 | 81 | 86 | 84
Brown Hair | 86 | 86 | 86 | 84 | 60 | 77 | 80
Bushy Eyebrows | 92 | 92 | 89 | 89 | 80 | 86 | 90
Chubby | 95 | 95 | 87 | 91 | 86 | 86 | 91
Double Chin | 96 | 96 | 89 | 93 | 88 | 88 | 92
Eyeglasses | 99 | 99 | 99 | 99 | 98 | 98 | 99
Goatee | 97 | 97 | 93 | 96 | 93 | 93 | 95
Gray Hair | 98 | 98 | 92 | 97 | 90 | 94 | 97
Heavy Makeup | 90 | 90 | 90 | 91 | 85 | 90 | 90
High Cheekbones | 86 | 85 | 87 | 87 | 84 | 86 | 87
Male | 98 | 97 | 97 | 98 | 91 | 97 | 98
Mouth Slightly Open | 93 | 93 | 94 | 94 | 87 | 78 | 92
Mustache | 97 | 96 | 88 | 95 | 91 | 87 | 95
Narrow Eyes | 87 | 87 | 83 | 81 | 82 | 73 | 81
No Beard | 95 | 95 | 95 | 96 | 90 | 75 | 95
Oval Face | 72 | 73 | 73 | 70 | 64 | 72 | 66
Pale Skin | 97 | 97 | 93 | 94 | 83 | 84 | 91
Pointy Nose | 75 | 73 | 75 | 74 | 68 | 76 | 72
Receding Hairline | 92 | 92 | 88 | 90 | 76 | 84 | 89
Rosy Cheeks | 94 | 94 | 87 | 91 | 84 | 73 | 90
Sideburns | 95 | 95 | 95 | 96 | 94 | 76 | 96
Smiling | 92 | 92 | 92 | 92 | 89 | 89 | 92
Straight Hair | 79 | 79 | 78 | 79 | 63 | 73 | 73
Wavy Hair | 71 | 73 | 82 | 81 | 73 | 75 | 80
Wearing Earrings | 83 | 84 | 86 | 79 | 73 | 92 | 82
Wearing Hat | 98 | 98 | 98 | 98 | 89 | 82 | 99
Wearing Lipstick | 92 | 92 | 93 | 93 | 89 | 93 | 93
Wearing Necklace | 86 | 86 | 71 | 71 | 68 | 86 | 71
Wearing Necktie | 95 | 96 | 93 | 95 | 86 | 79 | 93
Young | 87 | 87 | 87 | 88 | 80 | 82 | 87
Average | 89.4 | 89.5 | 87.7 | 87.9 | 81.1 | 83.6 | 87.3

Table 1: The accuracy comparison of attribute detection methods. Our multi-task, part-based architectures perform better than previously proposed methods and also than single-task networks.

2.1. Network architecture

The two proposed architectures, the Deep Convolutional Neural Network for facial Attribute-based Active authentication (Deep-CNNAA) and Wide-CNNAA, can be found in Table 2. The four sets of models compared are BinaryDeep-CNNAA and BinaryWide-CNNAA, which are single-task networks, and MultiDeep-CNNAA and MultiWide-CNNAA, which are multi-task networks.


*Wide-CNNAA: input w × h × 3 → conv+relu 7×7×128 → maxpool 3×3 (stride 2) → conv+relu 5×5×128 → maxpool 3×3 (stride 2) → conv+relu 3×3×128 → maxpool 3×3 (stride 2) → FC+relu dim×128 → FC+relu 128×128 → logits (Num Attr × 2) → softmax loss

*Deep-CNNAA: input w × h × 3 → conv+relu 7×7×32 → maxpool 3×3 (stride 2) → 3 × (conv+relu 5×5×32) → maxpool 3×3 (stride 2) → 4 × (conv+relu 3×3×32) → maxpool 3×3 (stride 2) → FC+relu dim×64 → FC+relu 64×32 → logits (Num Attr × 2) → softmax loss

Table 2: The architectures of our networks (*Wide-CNNAA and *Deep-CNNAA). The number of parameters depends on the face region that they operate on and can be found in Table 6.
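For illustration, a sketch of the Deep-CNNAA column of Table 2 in Keras-style TensorFlow. The table only specifies kernel sizes, channel counts, and the pooling/FC layout; padding, layer names, and how the (Num Attr × 2) logits are grouped are assumptions of this sketch.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def deep_cnnaa(input_shape, num_attrs):
    """Sketch of the *Deep-CNNAA column of Table 2 (32-channel convolutions)."""
    def conv(x, k, filters):
        return layers.Conv2D(filters, k, padding="same", activation="relu")(x)

    inputs = layers.Input(shape=input_shape)            # w x h x 3 face-part crop
    x = conv(inputs, 7, 32)
    x = layers.MaxPooling2D(3, strides=2)(x)
    for _ in range(3):                                  # three 5x5x32 conv+relu layers
        x = conv(x, 5, 32)
    x = layers.MaxPooling2D(3, strides=2)(x)
    for _ in range(4):                                  # four 3x3x32 conv+relu layers
        x = conv(x, 3, 32)
    x = layers.MaxPooling2D(3, strides=2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    embedding = layers.Dense(32, activation="relu")(x)  # shared attribute embedding
    # One 2-way softmax (present / absent) per attribute assigned to this part.
    logits = layers.Dense(num_attrs * 2)(embedding)
    outputs = layers.Reshape((num_attrs, 2))(logits)
    return models.Model(inputs, outputs)

# e.g. the "both eyes" crop (115 x 41 pixels) with 10 assigned attributes
model = deep_cnnaa((41, 115, 3), num_attrs=10)
```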

First, we describe the shared configuration that is used to train these networks, and then the settings that are specific to each type.

Shared configuration. All of these 100 networks, 20 Multi*-CNNAA (2 networks per 10 parts) and 80 Binary*-CNNAA (2 networks per attribute on the full face), are trained on the publicly available CelebA [21] dataset. It has 200 thousand images of 10 thousand identities, each with 40 attribute labels. It is divided into 160k training, 20k development, and 20k test images. The DCNNs are trained using the recently released TensorFlow [1], which also has a mobile implementation. All of the networks are initialized with random weights and are trained with the same policy. The Adam optimizer is used to train all of these networks, since it incorporates an adaptive learning rate update step and performs well without careful fine-tuning of the learning parameters [17]. Subsequent fine-tuning can give better results. Early stopping [26], using the accuracy on the development set, is used to select the final model for each network. The inputs are color images that are randomly flipped, and their contrast and gamma are randomly changed to augment the data and prevent over-fitting.

Due to the nature of the attributes, most of them have an unequal number of positive and negative labels. Extra care has been taken to make sure the networks are not biased toward one class, with the help of data augmentation and stochastic optimization.

Binary*-CNNAA. The binary networks are single-task and are trained with the labels of one single attribute. The input face images are aligned to a canonical coordinate frame. To balance the training data, the class with the smaller number of training images is distorted and added to the input queue so that the number of images for each class is equal. Then the data is shuffled and fed in batches to the training algorithm. The softmax cross-entropy loss l_B in (1) is used to train these binary networks:

l_B(w) = -\frac{1}{N} \sum_{i=1}^{N} \Big[ (1 - y_i) \log p(y_i = 0 \mid w) + y_i \log p(y_i = 1 \mid w) \Big]    (1)

where N is the batch size, y_i ∈ {0, 1} is the attribute presence label, and p(y = j \mid w) = \frac{\exp(f_j^w(x))}{\sum_{i=0}^{1} \exp(f_i^w(x))}, where f_i^w(x) is the logit of the i-th output neuron of the network with weights w.

Multi*-CNNAA. The Multi* networks have the same complexity as the binary models but predict multiple attributes at once. The face parts and the number of attributes assigned to them can be found in Table 3. For each part, the corresponding network has an output layer that contains neurons for each attribute assigned to that face part. We use the softmax cross-entropy loss for part q as specified below:

l_q(w) = -\frac{1}{N_q} \sum_{a=1}^{N_q} \frac{1}{n_a^q} \sum_{i=1}^{n_a^q} \Big[ (1 - y_i^a) \log p(y_i^a = 0 \mid w) + y_i^a \log p(y_i^a = 1 \mid w) \Big]    (2)

where N_q is the number of attributes assigned to part q, n_a^q is the number of images with the a-th attribute of part q in the current batch, and y_i^a ∈ {0, 1} is 1 if the i-th image has the a-th attribute. p(y_i^a = 1 \mid w) is defined as in Eq. (1).
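As a rough illustration, the per-part loss could be computed as below. This is a simplified sketch that averages a two-way softmax cross-entropy over the batch for each attribute and then over the attributes of the part, rather than reproducing the exact per-attribute normalization n_a^q of Eq. (2).

```python
import tensorflow as tf

def part_multitask_loss(logits, labels):
    """Simplified sketch of the per-part loss in Eq. (2).

    logits: [batch, num_attrs, 2] network outputs for one face part.
    labels: [batch, num_attrs] 0/1 attribute-presence labels.
    Averages a two-way softmax cross-entropy over the batch for each attribute,
    then over all attributes assigned to this part.
    """
    per_example = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=tf.cast(labels, tf.int32), logits=logits)   # [batch, num_attrs]
    per_attribute = tf.reduce_mean(per_example, axis=0)     # [num_attrs]
    return tf.reduce_mean(per_attribute)
```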

To deal with the class-ratio imbalance of the attributes, we shuffle the training data in a way that lets the network see the rare class of each attribute frequently. For example, for the attribute "Mustache", the positive class is the rare one, since most of the 202k images do not have this attribute. To handle this imbalance, a queue is created for each attribute, and the images that show its rare class are added to that queue. A queue is also created for images whose attributes all belong to the majority class. Then all of the queues are shuffled. We treat each queue as a circular buffer, so the training batches are created by sampling with replacement from one of these queues chosen at random; a sketch of this procedure is given below. Also, each time the images are distorted differently.
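A simplified Python sketch of this queue-based rebalancing; the data structures and the sampling policy here are illustrative, not the exact training-pipeline code.

```python
import random
from collections import defaultdict

def build_queues(examples, attr_names, labels):
    """One queue per attribute holding the images of its rare class, plus one queue
    for images that only show majority-class labels.
    labels[i] is a dict mapping attribute name -> 0/1 for examples[i]."""
    pos_counts = {a: sum(lab[a] for lab in labels) for a in attr_names}
    rare_class = {a: 1 if pos_counts[a] < len(labels) / 2 else 0 for a in attr_names}

    queues = defaultdict(list)
    for ex, lab in zip(examples, labels):
        rare_attrs = [a for a in attr_names if lab[a] == rare_class[a]]
        if rare_attrs:
            for a in rare_attrs:
                queues[a].append(ex)
        else:
            queues["_majority"].append(ex)
    for q in queues.values():
        random.shuffle(q)
    return queues

def sample_batch(queues, batch_size):
    """Treat each queue as a circular buffer and sample with replacement from a
    randomly chosen queue per example, so rare classes are seen often."""
    keys = list(queues)
    return [random.choice(queues[random.choice(keys)]) for _ in range(batch_size)]
```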

Face part | Crop size | No. of Attributes
Eye | 53×39 | 10
BothEyes | 115×41 | 10
Mouth | 65×38 | 10
Nose | 40×56 | 7
EyesNose | 90×62 | 16
NoseMouth | 55×82 | 15
EyesNoseMouth | 115×107 | 21
UpperHead | 128×52 | 15
MouthChin | 128×45 | 15
Ear | 62×100 | 14

Table 3: The face regions that are extracted by cropping around the landmark points and their corresponding numbers of attributes (part names as in Table 6). A Multi*-CNNAA that operates on a face crop has "No. of Attributes" tasks.

After training all of the networks separately, we train a single linear Support Vector Machine (SVM) classifier per attribute over the embeddings of these networks. For each attribute, we only take the embeddings of the relevant parts. For instance, for the attribute "Mustache" in MultiDeep-CNNAA, the 32-dimensional embeddings of the parts mouth, mouth-and-nose, and mouth-and-chin are taken and concatenated together. The SVMs are trained using the training set of CelebA [21] and fine-tuned on its development set.
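A short sketch of this per-attribute SVM stage, using scikit-learn's LinearSVC in place of LIBSVM for brevity; the attribute-to-part mapping is assumed to be given.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_attribute_svm(part_embeddings, labels, relevant_parts):
    """part_embeddings: dict part_name -> [num_images, 32] embeddings from the
    Multi*-CNNAA of that part; labels: [num_images] 0/1 labels for one attribute;
    relevant_parts: parts associated with this attribute (e.g. mouth, mouth-and-nose)."""
    features = np.concatenate([part_embeddings[p] for p in relevant_parts], axis=1)
    clf = LinearSVC(C=1.0)
    clf.fit(features, labels)
    return clf
```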

2.2. Comparison of attribute detection methods

We compare our proposed networks with the FaceTracer [19], PANDA [40], and CelebA [21] attribute networks. These models capture a broad spectrum of possible automatic attribute detection models.

FaceTracer [19] attribute classifiers are trained by extracting traditional low-level features, such as HOG and color histograms, from aligned face parts, incrementally finding the best set of features, and training SVMs on the selected features and parts for attribute detection. The face crops are extracted from the ground-truth landmarks.

PANDA employs multiple CNNs for the face parts, concatenates the outputs of the last layer, and trains SVMs for each attribute. There are two differences between our network architecture and the PANDA networks. First, in PANDA, all of the attributes are associated with all of the parts. Second, in our Multi*-CNNAA networks, the last layer is shared among the softmax losses of all attributes, whereas in PANDA there are two fully connected layers after the shared fully connected layer for each attribute. As a result, in our network, the different attributes associated with one network lie in the same Euclidean space of the last fully connected layer of the network. We exploit this feature in Section 2.3.

CelebA takes a different approach by pre-training a network with face identities of CelebFaces [33] for both face verification and identification. Features are then extracted from multiple overlapping patches of the face, and SVMs are trained for each cropped region and each attribute. To predict an attribute, the scores of the SVMs are averaged.

We also follow [20] and train a single-task network for each attribute in Binary*-CNNAA on the full face, to compare with the most specialized model for each attribute. Table 1 shows the accuracy of each of these methods.

As can be seen, our Multi*-CNNAA networks give equal or better results than the rest. The MultiWide-CNNAA architectures perform slightly better than MultiDeep-CNNAA in attribute prediction. However, they are slower and consume more energy, as shown in Section 4.

2.3. Attribute discovery

As mentioned in the previous section, our Multi*-CNNAA networks transform the input face regions into a shared Euclidean space for the attributes associated with that part. To further explore this Euclidean space, we perform Sparse Subspace Clustering (SSC) [9] on 10000 points selected from the training portion of the CelebA dataset. The intuition behind this clustering step is that data points with the same set of attributes lie on the same side of the learned hyperplanes defined by the weights of the last layer of each network. Thus they can be represented as a sparse combination of their neighboring points. SSC uses this fact to find the clusters. The clustering problem is formulated as

\underset{C \in \mathbb{R}^{n \times n}}{\text{minimize}} \;\; \|C\|_1 + \|D - DC\|_F^2    (3)

\text{subject to} \;\; \operatorname{diag}(C) = 0    (4)

where D ∈ R^{d×n} is the data matrix containing n points of dimension d and C ∈ R^{n×n} is the affinity matrix. To enforce the constraint, the authors of [9] find the sparse code of each data point in a dictionary of all the points except that point. To get the clusters, they perform spectral clustering on C.
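A minimal sketch of this step, approximating the l1 program in (3)-(4) with a per-point Lasso over the remaining points and then spectral clustering on the symmetrized coefficients; the regularization weight is an arbitrary assumption, not a value from the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def sparse_subspace_clustering(D, n_clusters=10, alpha=0.01):
    """D: [d, n] data matrix of part embeddings (columns are points)."""
    d, n = D.shape
    C = np.zeros((n, n))
    for i in range(n):
        # Code point i over a dictionary of all other points (enforces diag(C) = 0).
        others = np.delete(D, i, axis=1)
        lasso = Lasso(alpha=alpha, max_iter=5000)
        lasso.fit(others, D[:, i])
        C[np.arange(n) != i, i] = lasso.coef_
    # Symmetric affinity matrix, then spectral clustering as in SSC.
    W = np.abs(C) + np.abs(C).T
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(W)
```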

We find 10 clusters per face region. The clusters corresponding to the "Hair-Forehead" region of the face and the "Eyes" region can be seen in Figure 2.

Figure 2: Sample images from subspace clustering of face-part embeddings in attribute space, for the "Forehead and hair" and "Mouth and chin" regions (clusters (a) through (e) per region).

As illustrated, the "discovered" attributes mostly overlap with the labels we had at training time; however, some attributes are divided into finer categories. For example, cluster (c) of the "Hair-Forehead" region contains male images with short hair, which was not among the labels.

3. Active Authentication

We evaluate the performance of CNNAA for the task of active authentication using two publicly available datasets, MOBIO [23] and AA01 [10]. These datasets contain videos of users interacting with cell phones. We compare the authentication performance of our DCNN attribute detectors and discovered attributes against a baseline using Local Binary Patterns (LBP) [2] and against ACA [28, 29], which is the only attribute-based approach for this task on mobile phones. We follow the same protocol as ACA to extract facial parts and video features; that is, we average the extracted attribute outputs over the video frames to get the video descriptors.

We cast the problem of continuous authentication as a face verification problem, in which a pair of videos is given and we determine whether they contain the same identity or not. To compare the performance of the algorithms, the receiver operating characteristic (ROC) curve is used; many other measures of performance can be readily extracted from it. The ROC curve plots the relationship between false acceptance rates (FARs) and true acceptance rates (TARs) and can be computed from a similarity matrix S between gallery and probe videos. We also report the EER, the value at which the false rejection rate (FRR) and the FAR are equal. The EER gives a good idea of the shape of the ROC curve, since the point (EER, 1 − EER) is where the line x + y − 1 = 0 meets the ROC curve. Thus, the better the algorithm, the lower its EER value.
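A small sketch of how the EER can be read off a set of gallery-probe similarity scores; the helper name and the nearest-crossing approximation are our own.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(similarities, same_identity):
    """similarities: flattened gallery-probe similarity scores.
    same_identity: 1 if the pair shares an identity, else 0.
    The EER is the operating point where the false acceptance rate equals
    the false rejection rate (1 - TAR)."""
    far, tar, _ = roc_curve(same_identity, similarities)
    frr = 1.0 - tar
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0
```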

We give each video frame to the CNNAA networks and predict the attributes with linear SVMs.

Figure 3: Sample images of the three sessions of the AA01 dataset.

For the learned attributes, we use the probabilistic output of the SVMs, which are trained with LIBSVM [5], as the final attribute feature. Since the attribute outputs of our models are probability values, we compute the similarity value s_{i,j} = ⟨e_i, t_j⟩, where e_i is the feature vector of the i-th enrollment video and t_j is the feature vector of the j-th test video.
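A sketch of the video descriptor and similarity computation described above; the names are illustrative.

```python
import numpy as np

def video_descriptor(frame_attribute_probs):
    """frame_attribute_probs: [num_frames, num_attrs] per-frame SVM probability outputs.
    The video descriptor is the average attribute vector over the video's frames."""
    return np.mean(frame_attribute_probs, axis=0)

def similarity(enroll_desc, test_desc):
    """s_{i,j} = <e_i, t_j>: inner product between enrollment and test descriptors."""
    return float(np.dot(enroll_desc, test_desc))
```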

To use the discovered attributes (DiscAttrs) for authentication, we extract attribute features with an approach similar to Sparse Representation Classification [36]. Each face crop from a video frame is embedded into the attribute space of MultiDeep-CNNAA, giving an embedding vector e. It is then represented over the dictionary D used in Section 2.3, for which we know the cluster assignment of each atom. We normalize all of the dictionary atoms and the embedding. Each feature value is then obtained by a softmax over the representation energy contributed by each cluster in the attribute space. To do so, we first solve

\underset{f \in \mathbb{R}^{n}}{\text{minimize}} \;\; \|f\|_1 + \|e - Df\|_2^2    (5)


to get the sparse representation f of the face crop of that video frame. Then we set the i-th feature for that face crop to p(c = i \mid D), which is calculated by

p(c = i \mid D) = \frac{\exp(\|D_{:,i} f_i\|)}{\sum_{k=1}^{10} \exp(\|D_{:,k} f_k\|)}    (6)

where D_{:,i} denotes the dictionary atoms of cluster i and f_i the coefficients corresponding to those atoms. Thus, if f lies in the subspace spanned by the points in D that belong to cluster i, its energy concentrates in the non-zero coefficients of those atoms. To solve (5) we use the Orthogonal Matching Pursuit [35] algorithm with sparsity 20. The only reason for choosing 20 is that it is less than 32, the dimension of the embedding space; this parameter could be fine-tuned further to get better results. We then concatenate all of these probability values for the different face parts to create the final representation. The similarity matrix is then created in the same way as when the attributes are used as features.

e → t | ACA | LBP | MD-CNNAA | MW-CNNAA | DiscAttrs
1 → 1 | 0.14 | 0.13 | 0.11 | 0.14 | 0.16
2 → 2 | 0.19 | 0.31 | 0.18 | 0.22 | 0.17
3 → 3 | 0.16 | 0.20 | 0.10 | 0.10 | 0.13
1 → 2, 3 | 0.38 | 0.38 | 0.18 | 0.25 | 0.23
2 → 1, 3 | 0.31 | 0.33 | 0.26 | 0.30 | 0.31
3 → 1, 2 | 0.31 | 0.38 | 0.19 | 0.24 | 0.25
Altogether | 0.30 | 0.34 | 0.20 | 0.25 | 0.25

Table 4: The EER values for the different experiments on the AA01 [10] dataset. The session numbers are: 1. office light, 2. low light, 3. natural light. The DiscAttrs column contains the EER values obtained using the discovered attributes.
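For concreteness, a sketch of the discovered-attribute feature of Eqs. (5)-(6) for one face part, using scikit-learn's OMP with 20 non-zero coefficients; the cluster-assignment array is assumed to come from the clustering of Section 2.3, and the function name is our own.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def discovered_attribute_feature(embedding, D, cluster_ids, n_clusters=10):
    """embedding: [32] normalized Multi*-CNNAA embedding of one face crop.
    D: [32, n] normalized dictionary of training embeddings.
    cluster_ids: [n] array with the SSC cluster index of each dictionary atom."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=20, fit_intercept=False)
    omp.fit(D, embedding)
    f = omp.coef_                                   # sparse representation, Eq. (5)
    # Energy of the representation inside each cluster, then a softmax, Eq. (6).
    energies = np.array([np.linalg.norm(D[:, cluster_ids == i] @ f[cluster_ids == i])
                         for i in range(n_clusters)])
    exp_e = np.exp(energies)
    return exp_e / exp_e.sum()
```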

3.1. Results

To plot the ROC curves and evaluate our method, in each dataset, for each person, the videos of one session are considered as the enrollment videos and the other videos as the test videos. The similarity matrix is then generated by computing the pairwise similarity between the enrollment and the test videos. The corresponding ROC curve is plotted for each experiment.

AA01. AA01 is a mobile dataset with 750 videos of 50 subjects. Each subject has three sets of videos under three different lighting conditions. Each user is asked to perform a set of actions on the phone while the front camera records video. The videos are captured with an iPhone 4 camera. The three lighting conditions are office light, low light, and natural light.

Figure 5: Sample images from three sessions of the MOBIO dataset. The first-row images are from different sites; the second row shows pairs with the same identity in two different sessions.

The sample images of this dataset in Figure 3 show the three different illuminations of the sessions. Figure 3 also shows some partial faces present in the dataset. Each person has five videos in which they perform five different tasks on the phone. There is a designated enrollment video for each person. Three different experiments have been conducted on this dataset.

First, the enrollment and the test videos for all of the 50 subjects are taken from the session with the same lighting condition. The EER values of this experiment can be found in the first three rows of Table 4. It can be seen that our MultiDeep-CNNAA has the lowest EER in all cases. This experiment reveals the discriminative power of the features when the surrounding environment is the same. In the second experiment, all of the enrollment videos are taken from one illumination session and the test videos from another. The EER values corresponding to this experiment are given in the next three rows of Table 4. The performance drop of our method is 0.08 on average, while ACA suffers a drop of 0.17 and LBP of 0.15. The reason is that the ACA attribute classifiers use low-level features that are sensitive to illumination changes, whereas CNNAA is trained on a large-scale unconstrained dataset containing many variations and thus gives more robust features.

In the last experiment, the enrollment videos of all three sessions are put in the gallery and all of the test videos in the probe set to get the similarity matrix. The ROC curve corresponding to this general experiment is plotted in Figure 4a. It can be seen that MultiDeep-CNNAA performs best, and MultiWide-CNNAA and the discovered attributes are tied as second best.

One explanation for the lower performance of MultiWide-CNNAA compared to MultiDeep-CNNAA is that it has many more parameters than MultiDeep-CNNAA, as shown in Table 6, and has overfitted to the distribution of celebrity face images.

MOBIO. MOBIO [23] is a dataset of 152 subjects. The videos are taken at six different universities across Europe. For most subjects, twelve sessions of video are captured.

Figure 4: ROC curves of the different experiments on the AA01 [10] and MOBIO [23] datasets. (a) AA01 with all of the sessions together in gallery and probe. (b) MOBIO with all of the mobile sessions together, with the last session's videos as gallery and the videos of the rest as probe. (c) The cross-device experiment. Each plot compares ACA, LBP, MultiDeep-CNNAA, MultiWide-CNNAA, and DiscAttrs.

Site | ACA | LBP | MD-CNNAA | MW-CNNAA | DiscAttrs
but | 0.26 | 0.36 | 0.19 | 0.20 | 0.23
idiap | 0.25 | 0.35 | 0.27 | 0.25 | 0.24
lia | 0.24 | 0.34 | 0.17 | 0.15 | 0.16
uman | 0.27 | 0.33 | 0.18 | 0.20 | 0.21
unis | 0.2 | 0.27 | 0.07 | 0.1 | 0.1
uoulu | 0.18 | 0.23 | 0.14 | 0.14 | 0.19
Altogether | 0.22 | 0.28 | 0.17 | 0.18 | 0.19
Mobile-PC | 0.27 | 0.38 | 0.19 | 0.21 | 0.2

Table 5: The EER values corresponding to the MOBIO dataset experiments.

All of the mobile videos are captured with a Nokia N93i. The first session's videos are also recorded with a 2008 MacBook laptop. We perform two experiments on this dataset. We take the 12th-session videos as our enrollment videos, since they are the most widely available videos across the dataset.

In the first experiment, we only consider the videos taken with the mobile device. We report the EER values for the mobile videos of the subjects within each site, as well as for all of the videos together, in Table 5. This experiment is similar to the "Altogether" experiment on the AA01 dataset, since the environmental conditions of the enrollment and test videos can be the same or different. The ROC curve for this experiment is plotted in Figure 4b.

The second experiment concerns cross-sensor authentication, in which the user enrolls on one device and is tested on another. To see how much a sensor change affects low-level features, one can look at the performance drop of the LBP features between this experiment and the previous one in Table 5. The decrease is 0.10 for the LBP features and 0.05 for ACA, which also depends on low-level features, while the CNNAA methods only drop by 0.01 in EER value. The ROC curve for this experiment is presented in Figure 4c. Again, this is because CNNAA has seen more variations in its large training set. Our method can also handle partial face verification if a partial face detector [22] is available.

4. Mobile performance

There is a trade-off among power consumption, authentication speed, and accuracy of the model for the task of active authentication on mobile devices. The response time is important, since we do not want to freeze other running processes and create an unpleasant user experience while authenticating. Power consumption is also important, because frequent demands to charge the battery can be annoying.

To show the effectiveness of our approach, we measure the attribute prediction speed of our networks and the battery consumption on an LG Nexus 5 device. The results are shown in Table 6. This mobile device has a quad-core Qualcomm Snapdragon 800 clocked at 2.26 GHz and 2 GB of RAM. This specification is about average compared to current smartphones. We use the TensorFlow [1] implementation of CNNs on Android devices.

We follow ACA [29] for the performance analysis on the phone. We take one shot with the smartphone camera, feed it to the network 200 times, and measure the prediction speed as the average duration per frame. To measure power usage we use PowerTutor [39], which registers the energy usage per running application and also in total. We do not use the camera continuously, because doing so would bias the response time and power usage of the network.


Input size | Network | Parameters | Prediction time | Network | Parameters | Prediction time
128×52 | D-UpperHead | 275,360 | 0.15s | W-UpperHead | 1,825,664 | 0.26s
115×41 | D-BothEyes | 227,936 | 0.11s | W-BothEyes | 1,447,552 | 0.19s
90×62 | D-EyesNose | 244,704 | 0.13s | W-EyesNose | 1,580,160 | 0.22s
40×56 | D-Nose | 170,400 | 0.06s | W-Nose | 988,032 | 0.1s
55×82 | D-NoseMouth | 232,352 | 0.10s | W-NoseMouth | 1,481,600 | 0.18s
65×38 | D-Mouth | 164,448 | 0.06s | W-Mouth | 939,648 | 0.11s
115×107 | D-EyesNoseMouth | 441,632 | 0.28s | W-EyesNoseMouth | 3,154,304 | 0.48s
128×45 | D-MouthChin | 244,640 | 0.13s | W-MouthChin | 1,579,904 | 0.23s
62×100 | D-Ear | 256,864 | 0.14s | W-Ear | 1,677,952 | 0.25s
53×39 | D-Eye | 162,400 | 0.06s | W-Eye | 923,264 | 0.08s
Overall | MultiDeep-CNNAA | 2.4M | 1.22s | MultiWide-CNNAA | 15.6M | 2.10s
128×128 | BinaryDeep-Full | 584,160 | 0.36s | BinaryWide-Full | 4,289,664 | 0.637s

Table 6: Network size and prediction speed of the networks. D-* denotes the MultiDeep-CNNAA architecture and W-* the MultiWide-CNNAA architecture. The Binary*-CNNAA prediction times are for one attribute only; for all of them together it will be 40 times this value.

We take the image once, and the application works in the background. The default Android processes are the only other processes running besides our application, which runs the networks.

According to Table 6, all of the attributes are detected in 1.22s with MultiDeep-CNNAA running on the CPU in the background, without blocking other applications. MultiWide-CNNAA takes 2.10s. BinaryDeep-CNNAA takes 14.4s and BinaryWide-CNNAA 25.5s.

The MultiDeep-CNNAA architecture consumes 780mW of power on average, and MultiWide-CNNAA drains 1100mW of battery power. The average battery usage of Android when it is not running the CNNAA networks is 600mW, according to PowerTutor. To see how this affects the battery life, suppose the battery capacity is C Watt-hours (Wh). Then

d = \frac{C}{P_n + \beta \alpha P_d}    (7)

where d is the mobile device's battery life, P_n is the power consumption in normal use, P_d is the power usage of the attribute detection algorithm, β is the fraction of time that the mobile device is being used, and α is the authentication ratio constant. α expresses how often we want to authenticate the user given the prediction time of the algorithm, i.e. we authenticate every T_a/α seconds, where T_a is the prediction time of the model. For instance, if α = 0.5 we authenticate every 2.44s using MultiDeep-CNNAA and every 4.2s using MultiWide-CNNAA.

To make the feasibility of CNNAA clearer, suppose we authenticate the user with the MultiDeep-CNNAA architecture on the Nexus 5 device. We choose MultiDeep-CNNAA since it performs well in the authentication task, as discussed in Section 3, and also has better runtime and power usage. The Nexus 5 has a 2300mAh battery at 3.8V, so C = 8.74Wh. P_n = 0.6W in the "normal usage" state, in which only Android 5 and the default applications are running. This gives 14.5 hours of battery life. Now, if α = 1, meaning we want to authenticate at the highest possible rate, and we are using the phone all the time (β = 1), then the battery life is reduced to 6.3 hours in the worst case. In a realistic setting with β = 0.2 and α = 0.5 it becomes 12.85 hours, which is reasonable. Also, if a GPU implementation of CNNs on Android [30] is used, attribute prediction can happen much faster with less energy consumption.
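A quick numeric check of Eq. (7) with the measurements reported above:

```python
def battery_life_hours(capacity_wh, p_normal_w, p_detect_w, beta, alpha):
    """Eq. (7): battery life when a fraction beta of the time the device is in use
    and the user is re-authenticated every T_a / alpha seconds."""
    return capacity_wh / (p_normal_w + beta * alpha * p_detect_w)

C = 2.3 * 3.8  # 2300 mAh at 3.8 V -> about 8.74 Wh
print(battery_life_hours(C, 0.6, 0.78, beta=0.0, alpha=0.0))  # ~14.6 h, no authentication
print(battery_life_hours(C, 0.6, 0.78, beta=1.0, alpha=1.0))  # ~6.3 h, worst case
print(battery_life_hours(C, 0.6, 0.78, beta=0.2, alpha=0.5))  # ~12.9 h, realistic setting
```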

5. Discussion and future direction

We proposed a feasible multi-task DCNN architecture to extract accurate and describable facial attributes on mobile devices. Each network predicts multiple facial attributes from a given face component by mapping it to a shared embedding space. We showed that our attribute prediction performance is comparable to the state of the art. We explored the embedding space and illustrated that we can extract new attributes by looking at subspace clusters of this space. We also showed that our networks perform attribute-based authentication better than the previously proposed method [28, 29]. Finally, we analyzed the feasibility of our method by performing battery usage and prediction speed experiments on an actual mobile device.

In the future, we plan to jointly train this ensemble of networks for the tasks of face verification and attribute prediction to obtain a more discriminative embedding space and thus better authentication performance.

Acknowledgement

This work was supported by cooperative agreement FA8750-13-2-0279 from DARPA.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2037–2041, 2006.

[3] M. Antal and L. Z. Szabo. Biometric authentication based on touchscreen swipe patterns. Procedia Technology, 22:862–869, 2016.

[4] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Robust discriminative response map fitting with constrained local models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3444–3451, 2013.

[5] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

[6] N. L. Clarke and S. M. Furnell. Authenticating mobile phone users using keystroke analysis. International Journal of Information Security, 6(1):1–14, 2007.

[7] D. Crouse, H. Han, D. Chandra, B. Barbello, and A. K. Jain. Continuous authentication of mobile user: Fusion of face image and inertial measurement unit data. In International Conference on Biometrics, 2015.

[8] M. Derawi, C. Nickel, P. Bours, and C. Busch. Unobtrusive user-authentication on mobile phones using biometric gait recognition. In International Conference on Intelligent Information Hiding and Multimedia Signal Processing, pages 306–311, Oct 2010.

[9] E. Elhamifar and R. Vidal. Sparse subspace clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2790–2797, 2009.

[10] M. E. Fathy, V. M. Patel, and R. Chellappa. Face-based active authentication on mobile devices. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2015.

[11] T. Feng, Z. Liu, K.-A. Kwon, W. Shi, B. Carbunar, Y. Jiang, and N. Nguyen. Continuous mobile authentication using touchscreen gestures. In IEEE Conference on Technologies for Homeland Security, pages 451–456, Nov 2012.

[12] M. Frank, R. Biedert, E. Ma, I. Martinovic, and D. Song. Touchalytics: On the applicability of touchscreen input as a behavioral biometric for continuous authentication. IEEE Transactions on Information Forensics and Security, 8(1):136–148, Jan 2013.

[13] L. Fridman, S. Weber, R. Greenstadt, and M. Kam. Active authentication on mobile devices via stylometry, GPS location, web browsing behavior, and application usage patterns. IEEE Systems Journal, 2015.

[14] A. Hadid, J. Heikkila, O. Silven, and M. Pietikainen. Face and eye detection for person authentication in mobile phones. In ACM/IEEE International Conference on Distributed Smart Cameras, pages 101–108, Sept 2007.

[15] N. M. Inc. Nearly one in three consumers who have lost their mobile devices still do not lock them, new survey shows. http://www.prnewswire.com/news-releases/nearly-one-in-three-consumers-who-have-lost-their-mobile-devices-still-do-not-lock-them-new-survey-shows-200410151.html, 2013.

[16] A. K. Jain, S. C. Dass, and K. Nandakumar. Can soft biometric traits assist user recognition? In Defense and Security, pages 561–572. International Society for Optics and Photonics, 2004.

[17] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[19] N. Kumar, P. N. Belhumeur, and S. K. Nayar. FaceTracer: A search engine for large collections of images with faces. In European Conference on Computer Vision (ECCV), pages 340–353, Oct 2008.

[20] G. Levi and T. Hassner. Age and gender classification using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2015.

[21] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.

[22] U. Mahbub, V. M. Patel, D. Chandra, B. Barbello, and R. Chellappa. Partial face detection for continuous authentication. arXiv preprint arXiv:1603.09364, 2016.

[23] C. McCool, S. Marcel, A. Hadid, M. Pietikainen, P. Matejka, J. Cernocky, N. Poh, J. Kittler, A. Larcher, C. Levy, D. Matrouf, J.-F. Bonastre, P. Tresadern, and T. Cootes. Bi-modal person recognition on a mobile phone: using mobile phone data. In IEEE ICME Workshop on Hot Topics in Mobile Multimedia, July 2012.

[24] K. Niinuma, U. Park, and A. K. Jain. Soft biometric traits for continuous user authentication. IEEE Transactions on Information Forensics and Security, 5(4):771–780, 2010.

[25] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.

[26] L. Prechelt. Automatic early stopping using cross validation: quantifying the criteria. Neural Networks, 11(4):761–767, 1998.

[27] A. Primo, V. Phoha, R. Kumar, and A. Serwadda. Context-aware active authentication using smartphone accelerometer measurements. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 98–105, June 2014.

[28] P. Samangouei, V. M. Patel, and R. Chellappa. Attribute-based continuous user authentication on mobile devices. In IEEE 7th International Conference on Biometrics Theory, Applications and Systems (BTAS), pages 1–8, Sept 2015.


[29] P. Samangouei, V. M. Patel, and R. Chellappa. Facial attributes for active authentication on mobile devices. Image and Vision Computing, 2016.

[30] S. Sarkar, V. M. Patel, and R. Chellappa. Deep feature-based face detection on mobile devices. arXiv preprint arXiv:1602.04868, 2016.

[31] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.

[32] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, pages 1988–1996, 2014.

[33] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, pages 1988–1996, 2014.

[34] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1708, 2014.

[35] J. A. Tropp and A. C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, 2007.

[36] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210–227, 2009.

[37] H. Zhang, V. M. Patel, and R. Chellappa. Robust multimodal recognition via multitask multivariate low-rank representations. In IEEE International Conference on Automatic Face and Gesture Recognition, 2015.

[38] H. Zhang, V. M. Patel, M. E. Fathy, and R. Chellappa. Touch gesture-based active user authentication using dictionaries. In IEEE Winter Conference on Applications of Computer Vision, 2015.

[39] L. Zhang, B. Tiwana, Z. Qian, Z. Wang, R. P. Dick, Z. M. Mao, and L. Yang. Accurate online power estimation and automatic battery behavior based power model generation for smartphones. In Proceedings of the Eighth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pages 105–114. ACM, 2010.

[40] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev. PANDA: Pose aligned networks for deep attribute modeling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1637–1644, 2014.

[41] Y. Zhong and Y. Deng. Sensor orientation invariant mobile gait biometrics. In IEEE International Joint Conference on Biometrics (IJCB), pages 1–8, 2014.