Transcript
Page 1: A Joint Discriminative Generative Model for Deformable ......A Joint Discriminative Generative Model for Deformable Model Construction and Classification Ioannis Marras 1, Symeon

A Joint Discriminative Generative Model for Deformable ModelConstruction and Classification

Ioannis Marras1, Symeon Nikitidis2, Stefanos Zafeiriou1 and Maja Pantic1,31 Department of Computing, Imperial College London, UK2 Yoti Ltd, London, UK, e-mail: [email protected]

3 Faculty of Electrical Engineering, Mathematics and Computer Science, University of Twente, TheNetherlands

Abstract— Discriminative classification models have been suc-cessfully applied for various computer vision tasks such asobject and face detection and recognition. However, deforma-tions can change objects coordinate space and perturb robustsimilarity measurement, which is the essence of all classificationalgorithms. The common approach to deal with deformations iseither to seek for deformation invariant features or to developmodels that describe objects deformations. However, the formerapproach requires a huge amount of data and a good amountof engineering to be properly trained, while the latter requireconsiderable human effort in the form of carefully annotateddata. In this paper, we propose a method that jointly learns withminimal human intervention a generative deformable modelusing only a simple shape model of the object and imagesautomatically downloaded from the Internet, and also extractsfeatures appropriate for classification. The proposed algorithmis applied on various classification problems such as “in-the-wild” face recognition, gender classification and eye glassesdetection on data retrieved by querying into a web imagesearch engine. We demonstrate that not only it outperformsother automatic methods by large margins, but also performscomparably with supervised methods trained on thousands ofmanually annotated data.

I. INTRODUCTION

Arguably, one of the problems that distinguishes computervision discipline from machine learning is visual objectsdeformations modelling. Object deformations involve variousshape and texture variations within an object class (e.g.,the differences between parts of human faces), as well as,variations due to rigid or non-rigid movement of differentobject parts (e.g., head pose and facial expressions).

Objects deformations render the use of standard distancemeasures (such as the Euclidean distance), or kernels (suchas Gaussian Radial Basis Function (GRBF) and polynomialkernels) not robust for measuring the similarity betweenthe deformed data samples. Hence, on recognition problemsinvolving visual data captured in unconstrained (also referredto as “in-the-wild”) conditions, the direct application ofpopular distance based classifiers, such as Support VectorMachines (SVMs) [21], [19] and Relevance Vector Machines(RVMs) [1], results in a significantly degraded performance.Thus, an additional data pre-processing step is requiredto deal with objects deformations before the classificationalgorithm is applied.

Dealing with objects deformations has created a wealthof research which can be roughly divided into two cate-

Retrieved

JDGM MODEL

WEBsQUERIESTAGS=uuMansFaceuuandsuuWomansFaceuu

MeansFacesShapesModel

ImagesDeformationsParameterssLearningJointsDiscriminativesGenerative

ModelsTraining

CrudeDetector

Fig. 1. Given a set of unconstrained facial images retrieved by performingthe queries “Man Face” and ”Woman Face” into a web image searchengine, the proposed joint model using only the bounding boxes and amean face shape model, iteratively aligns the retrieved images by leaningthe deformation parameters, while extracts discriminant features and buildsa classifier.

gories: (1) methods that identify appropriate deformationinvariant features of the objects and use these in orderto train the classification algorithms and (2) methods thatdefine statistical or physical deformable models of the visualobjects at hand, in order to align the deformed objectsprior to feature extraction and classification. In the firstline of research fall methods that construct kernels invariantto certain rigid image transformations [14], as well as thefamily of deep learning and representation algorithms, suchas Deep Convolutional Neural Networks (DCNN) [16], [12],[25] and Invariant Scattering Convolution Networks (ISCN)[5]. However, invariant kernels and ISCNs can only modelsimple objects transformations such as those caused by viewangle rotation [14], while DCNNs in order to be robustlytrained require huge amounts of data, and a good portionof engineering to handle the high computational complexity[25]. In addition, as it has been recently shown in [25], inchallenging settings such as “in-the-wild” face recognition,DCNNs trained on millions of annotated data are unable toreach satisfactory performance and meticulously designedface alignment algorithms have to be applied to deal withface deformations. Finally, most of the above deformation978-1-5090-4023-0/17/$31.00 c©2017 IEEE

Page 2: A Joint Discriminative Generative Model for Deformable ......A Joint Discriminative Generative Model for Deformable Model Construction and Classification Ioannis Marras 1, Symeon

invariant feature extraction algorithms are commonly com-bined with an SVM algorithm for classification.

In the second line of research fall methods that explic-itly model objects deformations by defining a statisticaldeformable model. Notable examples include Active Appear-ance Models [7], Active Shape Models [24], Elastic GraphMatching [30] and Pictorial Structures [11], [32]. Thesemodels are applied in order to first align the data whichare subsequently fed to the classifier. However, learningvisual deformable models requires considerable human effortand still remains an expensive, tedious, labour intensiveand prone to human errors procedure. More precisely, itrequires both a careful selection of an image corpora thatcan efficiently exhibit the vast amount of objects variabilityand also the careful annotation of this corpora in terms ofobjects meaningful parts.

In this paper, we propose a methodology that is ableto jointly learn with minimal supervision and annotationeffort, classifiers and objects deformable models using dataautomatically collected from the Internet without requiringtheir detailed semantic annotation. Let us consider the fol-lowing Internet image based application, where we wantto train a facial gender classifier using the abundance ofimages available in the Internet which can be retrieved byapplying a textual query (e.g., “man face” and “womanface”) into a web image search engine such as Googleand Bing, or photo sharing websites such as Flickr. Ourapplication is further motivated by Figure 1. In this task wewant to automatically and simultaneously learn (a) a facialdeformable model associated with the image collection and(b) extract appropriate features and build a classifier for facialgender discrimination. If we merely use the obtained imagesto directly extract discriminant features and train a classifier,such as SVMs, the existence of facial deformations andmisalignment, is guaranteed to lead to inaccurate results. Toalleviate this problem we propose a joint methodology thathas an interplay between a generative statistical model anda discriminative one, for facial image alignment and featureextraction/classification, on the latent space of the first. Inparticular, we aim to recover the deformation parameters(e.g., via the use of object shape parameters) in order toalign the retrieved images using a generative model (i.e.Principal Component Analysis (PCA)) and at the same timeto pursue a discriminative decomposition of the aligned databy coupling the objective of the generative model with adiscriminative one (i.e. SVMs) that aims to separate data withmaximum margin. This can be also interpreted as developingan SVM classifier with generative regularization terms anddeformations parameters which enables the creation of thestatistical deformable models.

The line of research that is closely related to the proposedwork is that of image congealing [17] which refers to theproblem of simultaneous aligning a set of images. However,the majority of the rigid and non-rigid image congealingalgorithms use pure generative subspace models, such asPCA [9] or Robust PCA (RPCA) alternatives [4], [20], [23],[2]. On the other hand, we propose to the best of our knowl-

edge the first methodology which simultaneously learns asubspace model that can be jointly used for deformablemodels construction, discriminant features extraction andclassification 1.

Summarizing the novel contributions of the paper are thefollowing:• We propose a joint discriminative generative compo-

nent analysis method which learns a subspace thatnot only best reconstructs the data but also provideslow-dimensional features which can separate classesby maximum margin. Joint models are currently quitepopular in computer vision. For instance, [18] proposesa joint model that learns a maximum margin classifierand extracts discriminant features in a single framework,while [11] jointly learns the statistical deformable modelparameters and the parts locations in a discriminativemanner. In addition, deep multi-task learning networks[31], [22] are among the state-of-the-art algorithms onvarious face analysis tasks.

• We show that is feasible to incorporate a deformationfield in the proposed model, via a simple shape model,and, hence, to jointly align the set of images whilelearning the model parameters. The parameters of themodel can be used to fit a deformable model on testimages and at the same time perform classification. Tothe best of our knowledge such a model has not beenpresented before.

• We applied the proposed method on various recogni-tion problems using unconstrained data, such as facerecognition on LFW database [15] or facial genderclassification and eye glass detection using data directlycollected from the Internet using Google image searchengine. We demonstrate that the proposed joint modeloutperforms the traditional ones that require the separateapplication of various algorithms for facial landmarklocalization, image alignment, feature extraction andclassification.

II. PROBLEM FORMULATION

In this section we present the proposed joint discriminativegenerative component analysis method that combines optimaldata reconstruction with classes separation by maximummargin. We term this model Joint Discriminative Genera-tive Model (JDGM) and develop an alternative optimizationalgorithm with close form solutions for its optimization. Tofacilitate our deformable model construction framework inthe following we shall assume that our data are vectorizedimages. However, the presented component analysis algo-rithm can be applied on any kind of high dimensional data.

A. Joint Discriminative Generative Model

Given a set X = {(x1, y1), ..., (xN , yN )} of N trainingdata pairs, where xi ∈ RF , i = 1, ..., N are the F -dimensional input feature vectors each assigned a class label

1Our model can be also viewed as an application of the Occam’s razorprinciple, which in simple terms requires to learn a model tailored to thetask at hand and not a more general one, as those learnt by PCA or RPCA.

Page 3: A Joint Discriminative Generative Model for Deformable ......A Joint Discriminative Generative Model for Deformable Model Construction and Classification Ioannis Marras 1, Symeon

yi ∈ {1, . . . ,K} with K denoting the total number ofclasses, a multiclass SVM classifier [8] attempts to determinea set of K separating hyperplanes W = {w1,w2, . . . ,wK}where wp ∈ RF , p = 1, ...,K is the normal vector of thep-th hyperplane that separates the training vectors of the p-thclass from all the others with maximum margin by solvingthe following constraint optimization problem [8]:

minwp,ξi

1

2

K∑p=1

wTp wp + C

N∑i=1

ξi,

s.t.: wTyixi −wT

p xi ≥ bpi − ξi, i = 1, . . . , N. (1)

where ξ = [ξ1, . . . , ξN ]T are the slack variables, each oneassociated with a training sample, C is the term that penalizesthe training error and b is the bias vector defined as bpi =1− δpyi , where δpyi is the Kronecker delta function.

PCA is one of the most popular components analysistechniques that aims to find a set of orthonormal projectionbases, stacked in the columns of matrix U ∈ <F×M , suchthat the data reconstruction error is minimized or equivalentlythe variance of the projected samples is maximized:

Uo = arg minU

∑Ni=1 ||xi −UUTxi||22

= arg minU ||X−UUTX||22= arg maxU ||UTX||22, s.t. UTU = IM ,

(2)

where X ∈ RF×N is the data matrix created by stackingsamples xi column-wise.

JDGM model aims to learn a subspace that best recon-structs the data, while providing appropriate for classificationlow dimensional features that can discriminate classes bymaximum margin. Thus, we combine the above constrainedoptimization problems into a single objective function andderive the following minimax optimization problem:

minwp,ξi

maxU

1

2

K∑p=1

wTp wp + C

N∑i=1

ξi +

N∑i=1

||UTxi||22

s.t.: wTyiU

Txi −wTp UTxi ≥ bpi − ξi, i = 1, . . . , N

UTU = IM

where the separating hyperplane normal vector wp ∈ RM isM -dimensional, since it separates samples in the subspace ofU, determined through the linear data projection x́i = UTxiand IM is an M ×M identity matrix.

To solve the minimax optimization problem in (3) weconsider an alternative optimization framework where wefirst compute the optimal decision hyperplanes for an initial-ized projection matrix U and subsequently, solve (3) for Uwhile keeping the optimal normal vectors wp,o fixed. Thus,to identify the optimal wp we introduce positive Lagrangemultipliers αpi and Λ ∈ RM×M = [Λi,j ] each associatedwith one inequality or orthonormality constraint, respectively

and formulate the Lagrangian function L(wp, ξ,U,α,Λ):

L(wp, ξ,U,α,Λ) =1

2

K∑p=1

wTp wp + C

N∑i=1

ξi

+ Tr[XTUUTX]− Tr[Λ(UTU− IM ]

−N∑i=1

K∑p=1

αpi[(

wTyi −wT

p

)UTxi + ξi − bpi

], (3)

where Tr[.] is the matrix trace operator. To find the minimumover the primal variables wp and ξ we require the partialderivatives of L(wp, ξ,U,α,Λ) with respect to ξ and wp

to vanish, which yields the following equalities:

∂L∂ξi

= 0⇒k∑p=1

αpi = C, (4)

∂L∂wp

= 0⇒ wp,o =

N∑i=1

(αpi −

K∑p=1

αpi δpyi

)UTxi. (5)

Substituting terms from (4),(5) into (3), expressing theLagrange multipliers in a vector form and performing thesubstitution ni = C1yi−αi, (where 1yi is a K-dimensionalvector with all its components equal to zero except of theyi-th, which is equal to one) the minimax problem in (3) isequivalent to the following dual problem:

minn

maxU

1

2

N∑i,j

xTi UUTxjnTi nj +

N∑i=1

nTi bi

+ Tr[XTUUTX]− Tr[Λ(UTU− IM )],

s.t.:K∑p=1

npi = 0, npi ≤{

0 , if yi 6= pC , if yi = p

∀ i = 1, . . . , N , p = 1, . . . ,K. (6)

Solving the above quadratic programming problem with thelinear kernel functionK(xi,xj) = xTi UUTxj for n, we subsequently obtain theseparating hyperplanes from (5).

To optimize for U we remove from (6) term∑Ni=1 nTi bi,

since it is independent of the optimized variable and solvethe resulting trace optimization problem:

maxU

Tr[UTN∑i,j

nTi njxixTj U] + Tr[XTUUTX] (7)

− Tr[Λ(UTU− IM )].

Computing and setting the derivative with respect to Uequal to zero we derive the following generalized eigenvalueproblem: ( N∑

i,j

nTi njxixTj + XXT

)U = UΛ. (8)

Thus, the projection bases of U correspond to the (N − 1)eigenvectors of matrix

∑Ni,j nTi njxix

Tj + XXT associated

with the largest eigenvalues. Matrix∑Ni,j nTi njxix

Tj has a

form similar to the Linear Discriminant Analysis (LDA)

Page 4: A Joint Discriminative Generative Model for Deformable ......A Joint Discriminative Generative Model for Deformable Model Construction and Classification Ioannis Marras 1, Symeon

between class covariance matrix, since it can be writtenas∑Ni,j nTi njxix

Tj = XLLTXT = AAT , where A ∈

RF×K and L = [n1, . . . ,nK ]T ∈ RN×K is created bystacking the vectors of the optimal Lagrange multipliers foreach training sample row-wise. Since non-zero Lagrangemultipliers correspond to support vectors which are thesamples that reside closest to the decision boundary, matrix∑Ni,j nTi njxix

Tj is a robust estimator of the between class

scatter evaluated using only the support vectors weightedby their associated Lagrange multiplier value. On the otherhand, matrix St = XXT is the data covariance matrix, thustheir combination enables the interplay between extractingsimultaneously generative and discriminant basis vectors.

III. JDGM WITH GENERALIZED ORTHOGONALITYCONSTRAINTS

On deformable models construction it is usually advan-tageous to whiten or “sphere” the data. In our model thiscan be directly incorporated in the optimization problemby the additional constraint uTk Stuk = 1, where uk isthe k-th projection base of matrix U = [u1,u2, . . . ,uM ].Thus, to evaluate the data transformation matrix U, whileincorporating the additional orthogonality constraint, we de-velop an optimization algorithm motivated by [6] where wesequentially derive each projection base uk, k = 1, . . . ,Mso that it improves the objective function and at the sametime is orthogonal to the (k − 1) previously derived bases.More precisely, to evaluate the k-th basis vector we considerthe following minimax optimization problem:

minwp,ξi

maxuk

1

2

K∑p=1

wTp wp + C

N∑i=1

ξi,

s.t.: wTyiu

Tk xi −wT

p uTk xi ≥ bpi − ξi,uTk u1 = uTk u2 = · · · = uTk uk−1 = 0,

uTk Stuk = 1. (9)

Deriving the dual formulation of (9) with respect to theprimal variables wp and ξi as in (6) and setting Sb =∑Ni,j nTi njxix

Tj , we obtain:

minn

maxuk

1

2uTk Sbuk +

N∑i=1

nTi bi

s.t.:K∑p=1

npi = 0, npi ≤{

0 , if yi 6= pC , if yi = p

∀ i = 1, . . . , N , p = 1, . . . ,K.

uTk u1 = uTk u2 = · · · = uTk uk−1 = 0,

uTk Stuk = 1. (10)

The solution for the optimal decision hyperplanes is com-puted by fixing the transformation bases and solving theresulting QP problem for n as in (6). To compute the k-th orthogonal basis uk we fix the optimal normal vectors

wp,o and consider the maximization problem:

uk = arg maxu

1

2uTk Sbuk +

N∑i=1

nTi bi

s.t.: uTk u1 = uTk u2 = · · · = uTk uk−1 = 0,

uTk Stuk = 1. (11)

Incorporating the constraints we develop the following La-grangian function:

L(uk, λ,µ) =1

2uTk Sbuk − λ(uTk Stuk − 1)

− µ1uTk u1 − µ2u

Tk u2 − · · · − µk−1uTk uk−1, (12)

where computing and setting its derivative with respect touk equal to zero we derive the following equality:

Sbuk−2λStuk−µ1u1−µ2u2−· · ·−µk−1uk−1 = 0. (13)

Multiplying (13) from the left hand side by uTk and solvingfor λ we derive:

λ =uTk Sbuk2uTk Stuk

, (14)

while multiplying the left hand side of (13) by u1S−1t ,

u2S−1t , . . . ,uk−1S

−1t , summing up the resulting equa-

tions and setting U(k−1) = [u1, . . . ,uk−1], µ(k−1) =[µ1, . . . , µk−1] and Θ(k−1) = [U(k−1)]TS−1t U(k−1) wederive:

µ(k−1) = [Θ(k−1)]−1[U(k−1)]TS−1t Sbuk. (15)

Finally, multiplying the left hand side of (13) by S−1t andsubstituting terms according to (15) we derive the general-ized eigenvalue problem:

1

2

(IF − S−1t U(k−1)[Θ(k−1)]−1[U(k−1)]T

)S−1t Sbuk = λuk

(16)We compute u1 as the eigenvector of matrix S−1t Sbassociated with the largest eigenvalue and evaluate se-quentially the remaining (N − 1) orthogonal bases ukas the eigenvectors with the largest eigenvalues of12

(IF − S−1t U(k−1)[Θ(k−1)]−1[U(k−1)]T

)S−1t Sb.

IV. DEFORMATION PARAMETERS LEARNING AND IMAGEALIGNMENT

The proposed framework is a general method for auto-matic deformable model construction and thus can be appliedon any deformable object. Here we explicitly focus on facialimages and on building facial deformable models, sincevarious annotated in-the-wild facial databases, shape modelsand algorithms for facial landmark localization are availablewhich can facilitate our quantitative evaluation.

To automatically learn the shape deformation parametersfor image alignment we only use a shape prior and theinitialized bounding boxes. More formally, given a set of Ntraining images {Ii}, i = 1, . . . , N and a statistical shapemodel {s,B} where s = {(x1, y1), . . . , (xL, yL)} is themean shape model and (xi, yi) are the coordinates of the i-th landmark point (we consider an L = 68 landmark points

Page 5: A Joint Discriminative Generative Model for Deformable ......A Joint Discriminative Generative Model for Deformable Model Construction and Classification Ioannis Marras 1, Symeon

facial shape model) and matrix B ∈ <2L×(4+Ns) storesthe shape eigenvectors where the first four correspond tothe global similarity transform, while the rest are computedperforming PCA on an available set of facial shapes.

A new shape instance is generated as a perturbation of themean shape s by linearly combining the eigenvectors in Bweighted by the parameters p = [p1, . . . , p4+NS

]T as sp =s + Bp. Moreover, we define a motion model as the warpfunction W (x,p), (which for simplicity we denote as W (p))that maps each point within the mean shape (x ∈ s) to itscorresponding location in a shape instance. Thus, each imagewarping to the mean shape given a shape estimate of thedisplayed face ({si}, i = 1, . . . , N), returns N appearancevectors {Ii(W (pi))}, i = 1, . . . , N of size F×1, where F isthe total number of pixels that lie inside the mean shape. Wealso denote as P = [p1,p2, . . . ,pN ] the (4+Ns)×N matrixcontaining the shape parameters for each image. To estimatethe motion parameters pi, i = 1, . . . , N we consider thefollowing optimization problem where we aim to minimizethe `22 norm fitting error over all images using the generativediscriminative bases U derived by the JDGM model:

min{pi, i=1,...,N}

N∑i=1

||xi(W (pi))− (Uci + µ) ||22 (17)

where ci ∈ <M×1, i = 1, . . . , N are the linear combinationweights and µ ∈ <F×1 is the data mean vector. Theoptimization problem in (17) is solved in an alternative opti-mization manner as in [27], with ci = UT (xi(W (pi))−µ),while the update of pi is performed as in [26].

The proposed algorithm optimizes the proposed cost func-tion using a standard block coordinate optimization ap-proach. Even though it is not feasible to provide a mathemati-cal proof of convergence, we observed empirical convergencein all cases. In all experiments, JDGM converged after a fewtens of iterations (around 30) without facing any convergenceissues.

V. EXPERIMENTAL RESULTS

In the experimental comparisons we demonstrate that theproposed joint model is able to (a) correctly locate the faciallandmarks from a crude face detector so as to perform imagealignment (b) extract appropriate features for classificationby performing a discriminative decomposition of the aligneddata and (c) build a maximum margin classifier on theidentified latent space. We compare the automatic JDGMmodel against the current standard approach on recogni-tion applications involving in-the-wild facial images wherevarious algorithms for facial landmarks localization, imagealignment, feature extraction and finally classification areindependently treated and considered as different modulesin the pipeline of the general recognition application. Todo so, we consider three different recognition problemson unconstrained facial data namely, face recognition onLFW dataset and eye glasses detection and facial genderclassification on facial images collected from the Internet byperforming textual queries in Flickr and Google web image

search engine. The detailed descriptions of each dataset aregiven in subsections V-A, V-B and V-C.

In our experiments the facial bounding boxes of all imageswere derived by applying the Viola-Jones object detectionalgorithm [29], while we employed a shape model trainedon 50 shapes of Multi-Pie database [13]. This shape modelcontains Ns = 10 eigenvectors and the mean shape has aresolution of 86 × 87 pixels, thus the initial dimensionalityof all our data is F = 7, 482.

We evaluate JDGM ability to automatically build a faceshape model (i.e. facial landmark localization accuracy)and its feature extraction and classification performance,comparing against three different algorithms. More precisely,for the baseline algorithm in our comparison we directlyexploited the detected facial regions (without aligning theimages) which were anisotropically scaled to a fixed size of86× 87 pixels, performed PCA and LDA for dimensionalityreduction and the extracted features were subsequently fedinto a linear SVM algorithm [10] for classification. A purelyautomatic algorithm that we also considered in our com-parison is the state-of-the-art image congealing “TIP-CA”algorithm in [9] that combines PCA with AAM fitting whichwe appropriately modified so as to learn the shape deforma-tion parameters instead of the image affine transform. Onthe aligned facial images derived by TIP-CA we used onlythe pixels inside the resulting reference shapes, performedPCA and LDA for dimensionality reduction and used thesame linear SVM algorithm for classification. For the thirdalgorithm in the comparison which we term “DRMF+SVM”,we aligned the facial images exploiting the facial landmarkslocalizations derived by applying the state-of-the-art RobustDiscriminative Response Map Fitting (DRMF) algorithm [3],trained on thousands of manually annotated facial images andsimilarly used only the pixels inside the shapes to performPCA and LDA and fed the low dimensional discriminantrepresentations to a linear SVM algorithm for classification.Finally, except of the purely automatic JDGM model wheredeformable shape fitting is initialized using the mean shape,we also initialized our model using the landmark local-izations derived by DRMF developing an algorithm whichwe term “JDGM+DRMF”. Moreover, in all experimentswe considered the image intensity values and the ImageGradient Orientations (IGOs) for robust subspace analysisas demonstrated in [28].

A. Face Recognition on LFW Dataset

LFW image dataset realistically simulates the variabilityevident to in-the-wild face recognition problems and isthe standard benchmark on face verification. In total LFWconsists of 13, 233 facial images depicting 5, 749 differentindividuals of which 4, 069 have a single image in thedatabase. Thus, in order to perform face recognition, weemployed a subset of the available data considering onlythe face images of those individuals that have more than 30samples, sufficient to contribute both to the training and testsets. The dataset we considered consists in total of 2, 310face images from 32 individuals where for each subject a

Page 6: A Joint Discriminative Generative Model for Deformable ......A Joint Discriminative Generative Model for Deformable Model Construction and Classification Ioannis Marras 1, Symeon

different number of samples were available varying from 30to 530 images. To form our testing set we randomly selectedhalf of the images available for each individual for trainingand the remaining for testing.

To measure the landmark localization accuracy we com-pute the point-to-point RMSE normalized with respect tothe face size. Thus, the RMSE between the fitted shape sf

and the groundtruth shape sg is evaluated as: RMSE =∑Ni=1

√(xf

i−xgi )

2+(yfi −ygi )

2

Ns∗d , where d = (maxx sg−minx sg+maxy sg −miny sg). Figure 2 shows the normalized RMSEfitting curves for both appearance representation featureswhere the proposed automatic joint model is comparedagainst the TIP-CA and the state-of-the-art DRMF algorithmwhich however, is trained on manually annotated data. As canbe seen, the proposed automatic JDGM method outperformsTIP-CA method, while it achieves comparable performancewith DRMF algorithm. It is also important to highlightthat JDGM+DRMF algorithm which exploits the landmarklocalizations derived by DRMF in order to initialize our de-formable shape fitting process, achieved the best performanceas it was able to further improve landmarks localization inimages where the provided fittings were not accurate.

Table I summarizes the highest face recognition ratesachieved by each method in the comparison. The proposedJDGM automatic model outperforms all other automaticalgorithms by large margins, while the highest performancewas achieved by JDGM+DRMF algorithm which also ob-tained the best fitting accuracy in the database.

TABLE IFACE RECOGNITION ACCURACY RATES IN LFW IMAGE DATASET.

Automatic Methods RequiringMethods Landmarks Annotation

Methods Baseline TIP-CA JDGM DRMF+SVM JDGM+DRMFIntensities 71.8% 74.6% 86.9% 88.5% 90.4%

IGOs 79.9% 84.1% 91.1% 93.3% 95.8%

B. Eye Glasses DetectionNext we consider a recognition problem using uncon-

strained images automatically collected from the Internet,where we aim to recognize whether the depicted subjectswear glasses or not. To do so, we created an image datasetdepicting 20 different celebrities who usually appear wearingglasses (we considered both eye and sun glasses) and aim toautomatically build a deformable face shape model for imagealignment and also build a classifier and extract discriminantfeatures appropriate for the eye glasses classification prob-lem. To create our database we performed textual queriessuch as “Samuel Jackson wearing glasses” or “SamuelJackson without glasses” in Google images and collected foreach celerity 50 retrieved images containing approximatelyequal number of samples for each class. Thus, our celebritiesimage collection consists in total of 1, 000 images whichwere randomly split into half to form our training and testsets. Figure 3 shows some of the collected images wherethe facial landmarks have been localized using the proposedautomatic JDGM model.

Fig. 3. Sample images of the collected celebrities dataset for the eye glassesdetection problem. Facial landmarks have been localized by the proposedautomatic JDGM model.

To further investigate the effect of the joint generativediscriminative component analysis model in AAM fitting wecomputed the normalized RMSE considering only the faciallandmarks around the eyes and the eyebrows which are thefacial regions discriminant for the eye glasses classificationproblem. Figure 4 plots the fitting curves for both appear-ance features where as can be seen the proposed methodachieved smaller fittings error. It is important to highlightthat by directly comparing the fitting errors derived by TIP-CA algorithm that combines PCA with AAM fitting andthe proposed JDGM model we verify that the discriminantinformation incorporated in the learnt subspace by JDGMsignificantly assists Gauss-Newton optimization algorithmto reach a better local minimum. Table II also summarizesthe obtained classification results. JDGM+DRMF algorithmachieved the best recognition performance, while the purelyautomatic JDGM model also attained competitive perfor-mance outperforming TIP-CA algorithm by large margins.

TABLE IIEYE GLASSES DETECTION ACCURACY RATES IN THE COLLECTED

CELEBRITIES DATABASE.

Automatic Methods RequiringMethods Landmarks Annotation

Methods Baseline TIP-CA JDGM DRMF+SVM JDGM+DRMFIntensities 65.2% 69.3% 82.6% 84.8% 85.2%

IGOs 73.3% 77.0% 89.6% 92.2% 93.7%

C. Facial Gender Classification in-the-wild

For our final set of experiments we considered the problemof facial gender classification also in a set of in-the-wildfacial images collected from the Internet. In order to form ourimage collection we used Flickr image sharing website wherewe applied the textual queries “Man face” and “Womanface” and collected 500 of the retrieved for each queryimages. Thus, in total our dataset comprises of 1, 000 facialimages which were randomly split into half to form thetraining and test sets.

Page 7: A Joint Discriminative Generative Model for Deformable ......A Joint Discriminative Generative Model for Deformable Model Construction and Classification Ioannis Marras 1, Symeon

0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.090

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Normalized RMSE

Perc

en

tag

e o

f Im

ag

es

JDGM (IGOs)

JDGM (Intensities)

TIP-CA (IGOs)

TIP-CA (Intensities)

Initial Error

JDGM+DRMF

DRMF

Fig. 2. Comparison of the normalized RMSE fitting curves in LFW image dataset.

0 0.01 0.02 0.03 0.04 0.05 0.060

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Normalized RMSE

Perc

en

tag

e o

f Im

ag

es

JDGM (IGOs)

JDGM (Intensities)

TIP-CA (IGOs)

TIP-CA (Intensities)

Initial Error

JDGM+DRMF

DRMF

Fig. 4. Comparison of the normalized RMSE fitting curves in the collected celebrities dataset considering only the facial landmarks that correspond inthe discriminant facial areas around the eyes and the eyebrows.

Table III summarizes the obtained classification resultswhich are consistent with those obtained on the otherdatasets. JDGM model not only outperformed the otherautomatic methods but also attained competitive performanceagainst models trained on manual annotations which we thinkis remarkable.

TABLE IIIFACIAL GENDER CLASSIFICATION ACCURACY IN THE COLLECTED

DATASET.

Automatic Methods RequiringMethods Landmarks Annotation

Methods Baseline TIP-CA JDGM DRMF+SVM JDGM+DRMFIntensities 68.5% 71.1% 88.4% 89.6% 90.7%

IGOs 72.6% 77.4% 89.1% 91.5% 93.9%

VI. CONCLUSION

The current standard approach on recognition applicationsinvolving in-the-wild images of deformable objects requiresthe combination of various algorithms for landmarks local-ization, image alignment, feature extraction and classificationwhich however, are independently treated and considered asdifferent modules in the pipeline of the general recognitionapplication. Contrary to that, we propose a model that jointlylearns a generative deformable model using only a simpleshape model of the object and a crude detector, and alsoextracts features appropriate for classification. Experimentson unconstrained facial data, some of which were directlycollected from the Internet by performing textual queries inweb image search engines, demonstrated that the proposedautomatic model not only outperforms other automatic mod-els by large margins but also was able to achieve comparableperformance with models trained on thousands of manuallyannotated data.

Page 8: A Joint Discriminative Generative Model for Deformable ......A Joint Discriminative Generative Model for Deformable Model Construction and Classification Ioannis Marras 1, Symeon

VII. ACKNOWLEDGEMENTS

The work of S. Zafeiriou and M. Pantic was partiallyfunded by EPSRC project FACER2VM (EP/N007743/1).The work of S. Nikitidis and I. Marras was funded byEPSRC 4D Facial Behaviour Analysis for Security (4DFAB,EP/J017787/1).

REFERENCES

[1] A. Agarwal and B. Triggs. 3d human pose from silhouettes byrelevance vector regression. In CVPR, volume 2, pages 882–888.IEEE, 2004.

[2] E. Antonakos and S. Zafeiriou. Automatic construction of deformablemodels in-the-wild. In CVPR, pages 1813–1820. IEEE, 2014.

[3] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Robust discrimi-native response map fitting with constrained local models. In CVPR,pages 3444–3451. IEEE, 2013.

[4] S. Baker, I. Matthews, and J. Schneider. Automatic construction ofactive appearance models as an image coding problem. IEEE TPAMI,26(10):1380–1384, 2004.

[5] J. Bruna and S. Mallat. Invariant scattering convolution networks.IEEE TPAMI, 35(8):1872–1886, 2013.

[6] D. Cai, X. He, J. Han, and H.-J. Zhang. Orthogonal laplacianfaces forface recognition. IEEE TIP, 15(11):3608–3614, 2006.

[7] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearancemodels. IEEE TPAMI, 23(6):681–685, 2001.

[8] K. Crammer and Y. Singer. On the algorithmic implementation ofmulticlass kernel-based vector machines. JMLR, 2:265–292, 2001.

[9] W. Deng, J. Hu, J. Lu, and J. Guo. Transform-invariant pca: Aunified approach to fully automatic face alignment, representation, andrecognition. TPAMI, 36(6):1275–1284, 2013.

[10] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. LIBLINEAR: Alibrary for large linear classification. JMLR, 9:1871–1874, 2008.

[11] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan.Object detection with discriminatively trained part-based models.IEEE TPAMI, 32(9):1627–1645, 2010.

[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich featurehierarchies for accurate object detection and semantic segmentation.In CVPR, pages 580–587. IEEE, 2014.

[13] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-pie.Image and Vision Computing (JIVC), 28(5):807–813, 2010.

[14] O. C. Hamsici and A. M. Martinez. Rotation invariant kernels andtheir application to shape analysis. IEEE TPAMI, 31(11):1985–1999,2009.

[15] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeledfaces in the wild: A database for studying face recognition in un-constrained environments. Technical report, Technical Report 07-49,University of Massachusetts, Amherst, 2007.

[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classificationwith deep convolutional neural networks. In NIPS, pages 1097–1105,2012.

[17] E. G. Learned-Miller. Data driven image models through continuousjoint alignment. IEEE TPAMI, 28(2):236–250, 2006.

[18] S. Nikitidis, S. Zafeiriou, and M. Pantic. Merging svms with lineardiscriminant analysis: a combined model. In Proceedings of the IEEEConference on Computer Vision and Pattern Recognition, pages 1067–1074, 2014.

[19] E. Osuna, R. Freund, and F. Girosi. Training support vector machines:an application to face detection. In CVPR, pages 130–136. IEEE, 1997.

[20] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma. Rasl: Robustalignment by sparse and low-rank decomposition for linearly corre-lated images. IEEE TPAMI, 34(11):2233–2246, 2012.

[21] M. Pontil and A. Verri. Support vector machines for 3d objectrecognition. IEEE TPAMI, 20(6):637–646, 1998.

[22] R. Ranjan, V. M. Patel, and R. Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, poseestimation, and gender recognition. arXiv preprint arXiv:1603.01249,2016.

[23] C. Sagonas, Y. Panagakis, S. Zafeiriou, and M. Pantic. Raps: Robustand efficient automatic construction of person-specific deformablemodels. In CVPR, pages 1789–1796. IEEE, 2014.

[24] J. M. Saragih, S. Lucey, and J. F. Cohn. Deformable model fitting byregularized landmark mean-shift. IJCV, 91(2):200–215, 2011.

[25] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closingthe gap to human-level performance in face verification. In CVPR,pages 1701–1708. IEEE, 2014.

[26] G. Tzimiropoulos, J. Alabort-i Medina, S. Zafeiriou, and M. Pantic.Generic active appearance models revisited. In ACCV, pages 650–663.Springer, 2013.

[27] G. Tzimiropoulos and M. Pantic. Optimization problems for fast aamfitting in-the-wild. In ICCV, pages 593–600. IEEE, 2013.

[28] G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. Subspace learningfrom image gradient orientations. IEEE TPAMI, 34(12):2454–2466,2012.

[29] P. Viola and M. Jones. Rapid object detection using a boosted cascadeof simple features. In CVPR, volume 1, pages 511–518. IEEE, 2001.

[30] L. Wiskott, J.-M. Fellous, N. Kuiger, and C. Von Der Malsburg. Facerecognition by elastic bunch graph matching. IEEE TPAMI, 19(7):775–779, 1997.

[31] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detectionby deep multi-task learning. In European Conference on ComputerVision, pages 94–108. Springer, 2014.

[32] X. Zhu and D. Ramanan. Face detection, pose estimation, andlandmark localization in the wild. In CVPR, pages 2879–2886. IEEE,2012.


Top Related