

24 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 25, NO. 1, JANUARY 2016

Multimodal Task-Driven Dictionary Learning for Image Classification

Soheil Bahrampour, Member, IEEE, Nasser M. Nasrabadi, Fellow, IEEE, Asok Ray, Fellow, IEEE, and William Kenneth Jenkins, Life Fellow, IEEE

Abstract— Dictionary learning algorithms have been successfully used for both reconstructive and discriminative tasks, where an input signal is represented with a sparse linear combination of dictionary atoms. While these methods are mostly developed for single-modality scenarios, recent studies have demonstrated the advantages of feature-level fusion based on the joint sparse representation of the multimodal inputs. In this paper, we propose a multimodal task-driven dictionary learning algorithm under the joint sparsity constraint (prior) to enforce collaborations among multiple homogeneous/heterogeneous sources of information. In this task-driven formulation, the multimodal dictionaries are learned simultaneously with their corresponding classifiers. The resulting multimodal dictionaries can generate discriminative latent features (sparse codes) from the data that are optimized for a given task such as binary or multiclass classification. Moreover, we present an extension of the proposed formulation using a mixed joint and independent sparsity prior, which facilitates more flexible fusion of the modalities at the feature level. The efficacy of the proposed algorithms for multimodal classification is illustrated on four different applications—multimodal face recognition, multi-view face recognition, multi-view action recognition, and multimodal biometric recognition. It is also shown that, compared with the counterpart reconstructive-based dictionary learning algorithms, the task-driven formulations are more computationally efficient in the sense that they can be equipped with more compact dictionaries and still achieve superior performance.

Index Terms— Dictionary learning, multimodal classification, sparse representation, feature fusion.

I. INTRODUCTION

IT IS well established that information fusion using multiple sensors can generally result in an improved recognition performance [1]. It provides a framework to combine local information from different perspectives which is more tolerant to the errors of individual sources [2], [3].

Manuscript received July 18, 2014; revised December 12, 2014, April 26, 2015, and October 1, 2015; accepted October 16, 2015. Date of publication October 30, 2015; date of current version November 18, 2015. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. David Clausi.

S. Bahrampour was with the Department of Electrical Engineering, Pennsylvania State University, University Park, PA 16802 USA. He is now with the Bosch Research and Technology Center, Palo Alto, CA 94304 USA (e-mail: [email protected]).

N. M. Nasrabadi was with the Army Research Laboratory, Adelphi, MD 20783 USA. He is now with the Computer Science and Electrical Engineering Department, West Virginia University, Morgantown, WV 26506 USA (e-mail: [email protected]).

A. Ray and W. K. Jenkins are with the Department of Electrical Engineering, Pennsylvania State University, University Park, PA 16802 USA (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2015.2496275

Fusion methods for classification are generally categorized into feature fusion [4] and classifier fusion [5] algorithms. Feature fusion methods aggregate the extracted features from different sources into a single feature set, which is then used for classification. Classifier fusion algorithms, on the other hand, combine decisions from individual classifiers, each of which is trained using a separate source. While classifier fusion is a well-studied topic, fewer studies have addressed feature fusion, mainly due to the incompatibility of the feature sets [6]. A naive way of feature fusion is to stack the features into a longer feature vector [7]. However, this approach usually suffers from the curse of dimensionality due to the limited number of training samples [4]. Even in scenarios with abundant training samples, concatenation of feature vectors does not take into account the relationships among the different sources, and the stacked vector may contain noisy or redundant data that degrade the performance of the classifier [6]. However, if these limitations are mitigated, feature fusion can potentially result in improved classification performance [8], [9].

Sparse representation classification, in which the input signal is approximated with a linear combination of a few dictionary atoms [10], has recently attracted the interest of many researchers and has been successfully applied to several problems such as robust face recognition [10], visual tracking [11], and transient acoustic signal classification [12]. In this approach, a structured dictionary is usually constructed by stacking all the training samples from the different classes. The method has also been extended for efficient feature-level fusion, which is usually referred to as multi-task learning [13]–[16]. Among the different proposed sparsity constraints (priors), joint sparse representation has shown significant performance improvement in several multi-task learning applications such as target classification, biometric recognition, and multiview face recognition [12], [14], [17], [18]. The underlying assumption is that the multimodal test input can be simultaneously represented by a few dictionary atoms, or training samples, from a multimodal dictionary that represents all the modalities, and therefore the resulting sparse coefficients should have the same sparsity pattern. However, a dictionary constructed from the collection of the training samples suffers from two limitations. First, as the number of training samples increases, the resulting optimization problem becomes more computationally demanding. Second, a dictionary constructed this way is optimal neither for reconstructive tasks [19] nor for discriminative tasks [20].

1057-7149 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Recently it has been shown that learning the dictionary can overcome the above limitations and significantly improve the performance in several applications including image restoration [21], face recognition [22], and object recognition [23], [24]. The learned dictionaries are usually more compact and have fewer dictionary atoms than the number of training samples [25], [26]. Dictionary learning algorithms can generally be categorized into two groups: unsupervised and supervised. Unsupervised dictionary learning algorithms such as the method of optimal directions [27] and K-SVD [25] aim at finding a dictionary that yields minimum error when adapted to reconstruction tasks such as signal denoising [28] and image inpainting [19]. Although unsupervised dictionary learning has also been used for classification [22], it has been shown that better performance can be achieved by learning dictionaries that are adapted to a specific task rather than just the data set [29], [30]. These methods are called supervised, or task-driven, dictionary learning algorithms. For the classification task, for example, it is more meaningful to utilize the labeled data to minimize the misclassification error rather than the reconstruction error [31]. Adding a discriminative term to the reconstruction error and minimizing a trade-off between them has been proposed in several formulations [20], [24], [32], [33]. The incoherent dictionary learning algorithm proposed in [34] is another supervised formulation which trains class-specific dictionaries to minimize atom sharing between different classes and uses sparse representation for classification. In [35], a Fisher criterion is proposed to learn structured dictionaries such that the sparse coefficients have small within-class and large between-class scatter. While unsupervised dictionary learning can be reformulated as a large-scale matrix factorization problem and solved efficiently [19], supervised dictionary learning is usually more difficult to optimize. More recently, it has been shown that better optimization tools can be used to tackle supervised dictionary learning [30], [36]. This is achieved by formulating it as a bilevel optimization problem [37], [38]. In particular, a stochastic gradient descent algorithm has been proposed in [29] which efficiently solves the dictionary learning problem in a unified framework for different tasks, such as classification, nonlinear image mapping, and compressive sensing.

The majority of the existing dictionary learning algorithms, including the task-driven dictionary learning of [29], are only applicable to a single source of data. In [39], a set of view-specific dictionaries and a common dictionary are learned for the application of multi-view action recognition. The view-specific dictionaries are trained to exploit view-level correspondences while the common dictionary is trained to capture common patterns shared among the different views. That formulation belongs to the class of dictionary learning algorithms that leverage the labeled samples to learn class-specific atoms while minimizing the reconstruction error. Moreover, it cannot be used for fusion of heterogeneous modalities. In [40], a generative multimodal dictionary learning algorithm is proposed to extract typical templates of multimodal features. The templates represent synchronous transient structures between modalities which can be used for localization applications. More recently, a multimodal dictionary learning algorithm with a joint sparsity prior was proposed in [41] for multimodal retrieval, where the task is to find relevant samples from other modalities for a given unimodal query. However, that formulation cannot be readily applied to information fusion, in which the task is to find the label of a given multimodal query. Moreover, the joint sparsity prior is used in [41] to couple similarly labeled samples within each modality and is not utilized to extract cross-modality information, which is essential for information fusion [12]. Furthermore, the dictionaries in [41] are learned to be generative by minimizing the reconstruction error of the data across modalities and are therefore not necessarily optimal for discriminative tasks [31].

Fig. 1. Multimodal task-driven dictionary learning scheme.

This paper focuses on learning discriminative multimodal dictionaries. The major contributions of the paper are as follows:

• Formulation of the Multimodal Dictionary Learning Algorithms: A multimodal task-driven dictionary learning algorithm is proposed for classification using homogeneous or heterogeneous sources of information. Information from the different modalities is fused both at the feature level, by using the joint sparse representation, and at the decision level, by combining the scores of the modal-based classifiers. The proposed formulation simultaneously trains the multimodal dictionaries and classifiers under the joint sparsity prior in order to enforce collaborations among the modalities and obtain the latent sparse codes as the optimized features for different tasks such as binary and multiclass classification. Fig. 1 presents an overview of the proposed framework. An unsupervised multimodal dictionary learning algorithm is also presented as a by-product of the supervised version.

• Differentiability of the Bi-Level Optimization Problem: The main difficulty in proposing such a formulation is that the solution of the corresponding joint sparse coding problem is not differentiable with respect to the dictionaries. While the joint sparse coding has a non-smooth cost function, it is shown here that it is locally differentiable and that the resulting bi-level optimization for task-driven multimodal dictionary learning is smooth and can be solved using a stochastic gradient descent algorithm.¹

¹The source code of the proposed algorithm is released here: https://github.com/soheilb/multimodal_dictionary_learning

• Flexible Feature-Level Fusion: An extension of the proposed framework is presented which facilitates more flexible fusion of the modalities at the feature level by allowing the modalities to have different sparsity patterns. This extension provides a framework to tune the trade-off between independent sparse representation and joint sparse representation among the modalities.

• Improved Performance for Multimodal Classification: The proposed methods achieve state-of-the-art performance in a range of different multimodal classification tasks. In particular, we have provided extensive performance comparisons between the proposed algorithms and some of the competing methods from the literature for four different tasks: multimodal face recognition, multi-view face recognition, multimodal biometric recognition, and multi-view action recognition. The experimental results on these datasets have demonstrated the usefulness of the proposed formulation, showing that the proposed algorithms can be readily applied to several different application domains.

• Improved Efficiency for Sparse-Representation-Based Classification: It is shown here that, compared to the counterpart sparse representation classification algorithms, the proposed algorithms are more computationally efficient in the sense that they can be equipped with more compact dictionaries and still achieve superior performance.

A. Paper Organization

The rest of the paper is organized as follows. In Section II, unsupervised and supervised dictionary learning algorithms for a single source of information are reviewed. Joint sparse representation for multimodal classification is also reviewed in this section. Section III proposes the task-driven multimodal dictionary learning algorithms. Comparative studies on several benchmarks and concluding remarks are presented in Section IV and Section V, respectively.

B. Notation

Vectors are denoted by bold lower-case letters and matrices by bold upper-case letters. For a given vector $\mathbf{x}$, $x_i$ is its $i$th element. For a given finite set of indices $\gamma$, $\mathbf{x}_\gamma$ is the vector formed with those elements of $\mathbf{x}$ indexed in $\gamma$. The symbol $\rightarrow$ is used to distinguish row vectors from column vectors, i.e., for a given matrix $\mathbf{X}$, the $i$th row and $j$th column of the matrix are represented as $\mathbf{x}_{i\rightarrow}$ and $\mathbf{x}_j$, respectively. For a given finite set of indices $\gamma$, $\mathbf{X}_\gamma$ is the matrix formed with those columns of $\mathbf{X}$ indexed in $\gamma$, and $\mathbf{X}_{\gamma\rightarrow}$ is the matrix formed with those rows of $\mathbf{X}$ indexed in $\gamma$. Similarly, for given finite sets of indices $\gamma$ and $\psi$, $\mathbf{X}_{\gamma\rightarrow,\psi}$ is the matrix formed with those rows and columns of $\mathbf{X}$ indexed in $\gamma$ and $\psi$, respectively. $x_{ij}$ is the element of $\mathbf{X}$ at row $i$ and column $j$. The $\ell_q$ norm, $q \geq 1$, of a vector $\mathbf{x} \in \mathbb{R}^m$ is defined as $\|\mathbf{x}\|_{\ell_q} = \big(\sum_{j=1}^{m} |x_j|^q\big)^{1/q}$. The Frobenius norm and $\ell_{1q}$ norm, $q \geq 1$, of a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ are defined as $\|\mathbf{X}\|_F = \big(\sum_{i=1}^{m}\sum_{j=1}^{n} x_{ij}^2\big)^{1/2}$ and $\|\mathbf{X}\|_{\ell_{1q}} = \sum_{i=1}^{m} \|\mathbf{x}_{i\rightarrow}\|_{\ell_q}$, respectively. The collection $\{\mathbf{x}^i \mid i \in \gamma\}$ is denoted shortly as $\{\mathbf{x}^i\}$.

II. BACKGROUND

A. Dictionary Learning

Dictionary learning has been widely used in various tasks such as reconstruction, classification, and compressive sensing [29], [33], [42], [43]. In contrast to principal component analysis (PCA) and its variants, dictionary learning algorithms generally do not impose an orthogonality condition and are more flexible, allowing them to be well-tuned to the training data. Let $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N] \in \mathbb{R}^{n \times N}$ be the collection of $N$ (normalized) training samples that are assumed to be statistically independent. The dictionary $\mathbf{D} \in \mathbb{R}^{n \times d}$ can then be obtained as the minimizer of the following empirical cost [22]:

$$g_N(\mathbf{D}) \triangleq \frac{1}{N} \sum_{i=1}^{N} l_u(\mathbf{x}_i, \mathbf{D}) \tag{1}$$

over the regularizing convex set $\mathcal{D} \triangleq \{\mathbf{D} \in \mathbb{R}^{n \times d} \mid \|\mathbf{d}_k\|_{\ell_2} \leq 1, \forall k = 1, \ldots, d\}$, where $\mathbf{d}_k$ is the $k$th column, or atom, of the dictionary and the unsupervised loss $l_u$ is defined as

$$l_u(\mathbf{x}, \mathbf{D}) \triangleq \min_{\boldsymbol{\alpha} \in \mathbb{R}^d} \|\mathbf{x} - \mathbf{D}\boldsymbol{\alpha}\|_{\ell_2}^2 + \lambda_1 \|\boldsymbol{\alpha}\|_{\ell_1} + \lambda_2 \|\boldsymbol{\alpha}\|_{\ell_2}^2, \tag{2}$$

which is the optimal value of the sparse coding problem, with $\lambda_1$ and $\lambda_2$ being the regularizing parameters. While $\lambda_2$ is usually set to zero to exploit sparsity, using $\lambda_2 > 0$ makes the optimization problem in Eq. (2) strongly convex, resulting in a differentiable cost function [29]. The index $u$ of $l_u$ is used to emphasize that the above dictionary learning formulation is an unsupervised method. It is well known that one is often interested in minimizing an expected risk, rather than the perfect minimization of the empirical cost [44]. An efficient online algorithm is proposed in [19] to find the dictionary $\mathbf{D}$ as the minimizer of the following stochastic cost over the convex set $\mathcal{D}$:

$$g(\mathbf{D}) \triangleq \mathbb{E}_{\mathbf{x}}\big[l_u(\mathbf{x}, \mathbf{D})\big], \tag{3}$$

where it is assumed that the data $\mathbf{x}$ is drawn from a finite probability distribution $p(\mathbf{x})$, which is usually unknown, and $\mathbb{E}_{\mathbf{x}}[\cdot]$ is the expectation operator with respect to the distribution $p(\mathbf{x})$.
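For concreteness, the elastic-net sparse coding problem of Eq. (2) can be solved with a plain proximal-gradient (ISTA) loop. The following is a minimal Python/NumPy sketch for illustration only; it is not the authors' released implementation, and all names are chosen for this example:

```python
import numpy as np

def sparse_code(x, D, lam1=0.1, lam2=0.0, n_iter=200):
    """Eq. (2): min_a ||x - D a||_2^2 + lam1*||a||_1 + lam2*||a||_2^2,
    solved by ISTA with a step size from the Lipschitz constant."""
    a = np.zeros(D.shape[1])
    # Lipschitz constant of the gradient of the smooth part
    L = 2.0 * (np.linalg.norm(D, 2) ** 2 + lam2)
    for _ in range(n_iter):
        grad = 2.0 * (D.T @ (D @ a - x) + lam2 * a)
        z = a - grad / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam1 / L, 0.0)  # soft threshold
    return a
```

With $\lambda_2 > 0$ the objective is strongly convex, so the loop converges to the unique minimizer $\boldsymbol{\alpha}^\star(\mathbf{x}, \mathbf{D})$ used throughout the paper.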

The trained dictionary can then be used to (sparsely) reconstruct the input. The reconstruction error has been shown to be a robust measure for classification tasks [10], [45]. Another use of a given trained dictionary is for feature extraction, where the sparse code $\boldsymbol{\alpha}^\star(\mathbf{x}, \mathbf{D})$, obtained as a solution of (2), is used as a feature vector representing the input signal $\mathbf{x}$ in the classical expected risk optimization for training a classifier [29]:

$$\min_{\mathbf{w} \in \mathcal{W}} \mathbb{E}_{y,\mathbf{x}}\big[l\big(y, \mathbf{w}, \boldsymbol{\alpha}^\star(\mathbf{x}, \mathbf{D})\big)\big] + \frac{\nu}{2} \|\mathbf{w}\|_{\ell_2}^2, \tag{4}$$

where $y$ is the ground-truth class label associated with the input $\mathbf{x}$, $\mathbf{w}$ is the model (classifier) parameters, $\nu$ is a regularizing parameter, and $l$ is a convex loss function that measures how well one can predict $y$ given the feature vector $\boldsymbol{\alpha}^\star$ and classifier parameters $\mathbf{w}$. The expectation $\mathbb{E}_{y,\mathbf{x}}$ is taken with respect to the probability distribution $p(y, \mathbf{x})$ of the labeled data. Note that in Eq. (4), the dictionary $\mathbf{D}$ is fixed and independent of the given task and class label $y$. In task-driven dictionary learning, on the other hand, a supervised formulation is used which finds the optimal dictionary and classifier parameters jointly by solving the following optimization problem [29]:

$$\min_{\mathbf{D} \in \mathcal{D}, \mathbf{w} \in \mathcal{W}} \mathbb{E}_{y,\mathbf{x}}\big[l_{su}\big(y, \mathbf{w}, \boldsymbol{\alpha}^\star(\mathbf{x}, \mathbf{D})\big)\big] + \frac{\nu}{2} \|\mathbf{w}\|_{\ell_2}^2. \tag{5}$$

The index $su$ of the convex loss function $l_{su}$ is used to emphasize that the above dictionary learning formulation is supervised. The learned task-driven dictionary has been shown to result in superior performance compared to the unsupervised setting [29]. In this setting, the sparse codes are indeed the optimized latent features for the classifier.

B. Multimodal Joint Sparse Representation

Joint sparse representation provides an efficient tool for feature-level fusion of sources of information [12], [14], [46]. Let $\mathcal{S} \triangleq \{1, \ldots, S\}$ be a finite set of available modalities and let $\mathbf{x}^s \in \mathbb{R}^{n_s}$, $s \in \mathcal{S}$, be the feature vector for the $s$th modality. Also let $\mathbf{D}^s \in \mathbb{R}^{n_s \times d}$ be the corresponding dictionary for the $s$th modality. For now, it is assumed that the multimodal dictionaries are constructed by collections of the training samples from the different modalities, i.e., the $j$th atom of dictionary $\mathbf{D}^s$ is the $j$th training sample from the $s$th modality. Given a multimodal input $\{\mathbf{x}^s \mid s \in \mathcal{S}\}$, shortly denoted as $\{\mathbf{x}^s\}$, an optimal sparse matrix $\mathbf{A}^\star \in \mathbb{R}^{d \times S}$ is obtained by solving the following $\ell_{12}$-regularized reconstruction problem:

$$\operatorname*{argmin}_{\mathbf{A} = [\boldsymbol{\alpha}^1 \ldots \boldsymbol{\alpha}^S]} \frac{1}{2} \sum_{s=1}^{S} \|\mathbf{x}^s - \mathbf{D}^s \boldsymbol{\alpha}^s\|_{\ell_2}^2 + \lambda \|\mathbf{A}\|_{\ell_{12}}, \tag{6}$$

where $\lambda$ is a regularization parameter. Here $\boldsymbol{\alpha}^s$ is the $s$th column of $\mathbf{A}$, which corresponds to the sparse representation for the $s$th modality. Different algorithms have been proposed to solve the above optimization problem [47], [48]. We use the efficient alternating direction method of multipliers (ADMM) [49] to find $\mathbf{A}^\star$. The $\ell_{12}$ prior encourages row sparsity in $\mathbf{A}^\star$, i.e., it encourages collaboration among all the modalities by enforcing the same dictionary atoms from the different modalities, which represent the same event, to be used for reconstructing the inputs $\{\mathbf{x}^s\}$. An $\ell_{11}$ term can also be added to the above cost function to extend it to a more general framework where sparsity can also be sought within the rows, as will be discussed in Section III-D. It has been shown that joint sparse representation can result in superior performance in fusing multimodal sources of information compared to other information fusion techniques [45]. We are interested in learning multimodal dictionaries under the joint sparsity prior. This has several advantages over a fixed dictionary consisting of training data. Most importantly, it can potentially remove redundant and noisy information by representing the training data in a more compact form. Also, using the supervised formulation, one expects to find dictionaries that are well-adapted to the discriminative tasks.
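The ADMM solver referenced above repeatedly applies the proximal operator of the $\ell_{12}$ penalty, which has a closed form: row-wise group soft thresholding. A minimal sketch (illustrative, not the paper's released code):

```python
import numpy as np

def prox_l12(A, tau):
    """Proximal operator of tau * ||A||_{l12}: shrink each row of A by its
    l2 norm; rows whose norm falls below tau are zeroed entirely, which is
    exactly the row-sparsity (joint sparsity) effect described above."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * A
```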

III. MULTIMODAL DICTIONARY LEARNING

In this section, online algorithms for unsupervised and supervised multimodal dictionary learning are proposed.

A. Multimodal Unsupervised Dictionary Learning

Unsupervised multimodal dictionary learning is derived by extending the optimization problem characterized in Eq. (3) and using the joint sparse representation of (6) to enforce collaborations among the modalities. Let the minimum cost $l'_u(\{\mathbf{x}^s, \mathbf{D}^s\})$ of the joint sparse coding be defined as

$$\min_{\mathbf{A}} \frac{1}{2} \sum_{s=1}^{S} \|\mathbf{x}^s - \mathbf{D}^s \boldsymbol{\alpha}^s\|_{\ell_2}^2 + \lambda_1 \|\mathbf{A}\|_{\ell_{12}} + \frac{\lambda_2}{2} \|\mathbf{A}\|_F^2, \tag{7}$$

where $\lambda_1$ and $\lambda_2$ are the regularizing parameters. The additional Frobenius norm $\|\cdot\|_F$ compared to Eq. (6) guarantees a unique solution for the joint sparse optimization problem. In the special case when $S = 1$, optimization (7) reduces to the well-studied elastic-net optimization [50]. By natural extension of the optimization problem (3), the unsupervised multimodal dictionaries are obtained by:

$$\mathbf{D}^{s\star} = \operatorname*{argmin}_{\mathbf{D}^s \in \mathcal{D}^s} \mathbb{E}_{\mathbf{x}^s}\big[l'_u(\{\mathbf{x}^s, \mathbf{D}^s\})\big], \quad \forall s \in \mathcal{S}, \tag{8}$$

where the convex set $\mathcal{D}^s$ is defined as

$$\mathcal{D}^s \triangleq \{\mathbf{D} \in \mathbb{R}^{n_s \times d} \mid \|\mathbf{d}_k\|_{\ell_2} \leq 1, \forall k = 1, \ldots, d\}. \tag{9}$$

It is assumed that the data $\mathbf{x}^s$ is drawn from a finite (unknown) probability distribution $p(\mathbf{x}^s)$. The above optimization problem can be solved using the classical projected stochastic gradient algorithm [51], which consists of a sequence of updates as follows:

$$\mathbf{D}^s \leftarrow \Pi_{\mathcal{D}^s}\big[\mathbf{D}^s - \rho_t \nabla_{\mathbf{D}^s} l'_u(\{\mathbf{x}^s_t, \mathbf{D}^s\})\big], \tag{10}$$

where $\rho_t$ is the gradient step at time $t$ and $\Pi_{\mathcal{D}}$ is the orthogonal projector onto the set $\mathcal{D}$. The algorithm converges to a stationary point for a decreasing sequence of $\rho_t$ [51], [52]. A typical choice of $\rho_t$ is shown in the next section. This problem can also be solved using the online matrix factorization algorithm [26]. It should be noted that while the stochastic gradient descent does converge, it is not guaranteed to converge to a global minimum due to the non-convexity of the optimization problem [26], [44]. However, such a stationary point is empirically found to be sufficiently good for practical applications [21], [28].
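A minimal sketch of the projected update of Eq. (10) for one modality is given below. With $\mathbf{A}^\star$ held fixed from the joint sparse coding step, the gradient of the data term of $l'_u$ with respect to $\mathbf{D}^s$ is $-(\mathbf{x}^s - \mathbf{D}^s\boldsymbol{\alpha}^{s\star})\boldsymbol{\alpha}^{s\star T}$; the projection rescales any atom whose $\ell_2$ norm exceeds one. Names are chosen for illustration:

```python
import numpy as np

def project_unit_columns(D):
    """Orthogonal projector onto the set of Eq. (9): rescale every atom
    (column) whose l2 norm exceeds one back onto the unit ball."""
    norms = np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1.0)
    return D / norms

def unsupervised_update(D, x, alpha, rho_t):
    """One projected stochastic gradient step of Eq. (10)."""
    grad = -np.outer(x - D @ alpha, alpha)   # gradient of the data term
    return project_unit_columns(D - rho_t * grad)
```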

B. Multimodal Task-Driven Dictionary Learning

As discussed in Section II, the unsupervised setting does not take into account the labels of the training data, and the dictionaries are obtained by minimizing the reconstruction error. However, for classification tasks, the minimum reconstruction error does not necessarily result in discriminative dictionaries. In this section, a multimodal task-driven dictionary learning algorithm is proposed that enforces collaboration among the modalities both at the feature level, using the joint sparse representation, and at the decision level, using a sum of the decision scores. We propose to learn the dictionaries $\mathbf{D}^{s\star}, \forall s \in \mathcal{S}$, and the classifier parameters $\mathbf{w}^{s\star}, \forall s \in \mathcal{S}$, shortly denoted as the set $\{\mathbf{D}^{s\star}, \mathbf{w}^{s\star}\}$, jointly as the solution of the following optimization problem:

$$\min_{\{\mathbf{D}^s \in \mathcal{D}^s, \mathbf{w}^s \in \mathcal{W}^s\}} f(\{\mathbf{D}^s, \mathbf{w}^s\}) + \frac{\nu}{2} \sum_{s=1}^{S} \|\mathbf{w}^s\|_{\ell_2}^2, \tag{11}$$

where $f$ is defined as the expected cumulative cost:

$$f(\{\mathbf{D}^s, \mathbf{w}^s\}) = \mathbb{E}\Bigg[\sum_{s=1}^{S} l_{su}(y, \mathbf{w}^s, \boldsymbol{\alpha}^{s\star})\Bigg], \tag{12}$$

where $\boldsymbol{\alpha}^{s\star}$ is the $s$th column of the minimizer $\mathbf{A}^\star(\{\mathbf{x}^s, \mathbf{D}^s\})$ of the optimization problem (7) and $l_{su}(y, \mathbf{w}, \boldsymbol{\alpha})$ is a convex loss function that measures how well the classifier parametrized by $\mathbf{w}$ can predict $y$ by observing $\boldsymbol{\alpha}$. The expectation is taken with respect to the joint probability distribution of the multimodal inputs $\{\mathbf{x}^s\}$ and label $y$. Note that $\boldsymbol{\alpha}^{s\star}$ acts as a hidden/latent feature vector, corresponding to the input $\mathbf{x}^s$, which is generated by the learned discriminative dictionary $\mathbf{D}^{s\star}$. In general, $l_{su}$ can be chosen as any convex function such that $l_{su}(y, \cdot, \cdot)$ is twice continuously differentiable for all possible values of $y$. A few examples are given below for binary and multiclass classification tasks.

1) Binary Classification: In a binary classification task where the label $y$ belongs to the set $\{-1, 1\}$, $l_{su}$ can be naturally chosen as the logistic regression loss

$$l_{su}(y, \mathbf{w}, \boldsymbol{\alpha}^\star) = \log\big(1 + e^{-y \mathbf{w}^T \boldsymbol{\alpha}^\star}\big), \tag{13}$$

where $\mathbf{w} \in \mathbb{R}^d$ is the classifier parameters. Once the optimal $\{\mathbf{D}^s, \mathbf{w}^s\}$ are obtained, a new multimodal sample $\{\mathbf{x}^s\}$ is classified according to the sign of $\sum_{s=1}^{S} \mathbf{w}^{sT} \boldsymbol{\alpha}^{s\star}$ due to the uniform monotonicity of $\sum_{s=1}^{S} l_{su}$. For simplicity, the intercept term for the linear model is omitted here, but it can be easily added. One can also use a bilinear model where, instead of a set of vectors $\{\mathbf{w}^s\}$, a set of matrices $\{\mathbf{W}^s\}$ is learned and a new multimodal sample is classified according to the sign of $\sum_{s=1}^{S} \mathbf{x}^{sT} \mathbf{W}^s \boldsymbol{\alpha}^{s\star}$. Accordingly, the $\ell_2$-norm regularization of Eq. (11) needs to be replaced with the matrix Frobenius norm. The bilinear model is richer than the linear model and can sometimes result in better classification performance but needs more careful training to avoid over-fitting.

2) Multiclass Classification: Multiclass classification can be formulated using a collection of (independently learned) binary classifiers in a one-vs-one or one-vs-all setting. Multiclass classification can also be handled in an all-vs-all setting using the softmax regression loss function. In this scheme, the label $y$ belongs to the set $\{1, \ldots, K\}$ and the softmax regression loss is defined as

$$l_{su}(y, \mathbf{W}, \boldsymbol{\alpha}^\star) = -\sum_{k=1}^{K} \mathbb{1}_{\{y=k\}} \log\Bigg(\frac{e^{\mathbf{w}_k^T \boldsymbol{\alpha}^\star}}{\sum_{l=1}^{K} e^{\mathbf{w}_l^T \boldsymbol{\alpha}^\star}}\Bigg), \tag{14}$$

where $\mathbf{W} = [\mathbf{w}_1 \ldots \mathbf{w}_K] \in \mathbb{R}^{d \times K}$ and $\mathbb{1}_{\{\cdot\}}$ is the indicator function. Once the optimal $\{\mathbf{D}^s, \mathbf{W}^s\}$ are obtained, a new multimodal sample $\{\mathbf{x}^s\}$ is classified as

$$\operatorname*{argmax}_{k \in \{1, \ldots, K\}} \sum_{s=1}^{S} \Bigg(\frac{e^{\mathbf{w}_k^{sT} \boldsymbol{\alpha}^{s\star}}}{\sum_{l=1}^{K} e^{\mathbf{w}_l^{sT} \boldsymbol{\alpha}^{s\star}}}\Bigg). \tag{15}$$

In yet another all-vs-all setting, the multiclass classification task can be turned into a regression task in which the scalar label $y$ is changed to a binary vector $\mathbf{y} \in \mathbb{R}^K$, where the $k$th coordinate, corresponding to the label of $\{\mathbf{x}^s\}$, is set to one and the rest of the coordinates are set to zero. In this setting, $l_{su}$ is defined as

$$l_{su}(\mathbf{y}, \mathbf{W}, \boldsymbol{\alpha}^\star) = \frac{1}{2} \|\mathbf{y} - \mathbf{W}\boldsymbol{\alpha}^\star\|_{\ell_2}^2, \tag{16}$$

where $\mathbf{W} \in \mathbb{R}^{K \times d}$. Having obtained the optimal $\{\mathbf{D}^s, \mathbf{W}^s\}$, the test sample $\{\mathbf{x}^s\}$ is then classified as

$$\operatorname*{argmin}_{k \in \{1, \ldots, K\}} \sum_{s=1}^{S} \|\mathbf{q}_k - \mathbf{W}^s \boldsymbol{\alpha}^{s\star}\|_{\ell_2}^2, \tag{17}$$

where $\mathbf{q}_k$ is a binary vector whose $k$th coordinate is one and whose remaining coordinates are zero.
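As a concrete illustration of the decision rule in Eq. (17), the following sketch (hypothetical variable names, quadratic-loss setting) sums the one-hot regression residuals over the modalities:

```python
import numpy as np

def classify_quadratic(alphas, Ws):
    """Eq. (17): pick the class k whose one-hot code q_k is best
    approximated by W^s alpha^s, summed over the S modalities.
    alphas: list of S sparse codes (d,); Ws: list of S matrices (K, d)."""
    K = Ws[0].shape[0]
    Q = np.eye(K)                      # rows are the one-hot vectors q_k
    scores = np.zeros(K)
    for alpha, W in zip(alphas, Ws):
        pred = W @ alpha               # (K,) prediction for this modality
        scores += np.sum((Q - pred[None, :]) ** 2, axis=1)
    return int(np.argmin(scores))
```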

In choosing between the one-vs-all setting, in which independent multimodal dictionaries are trained for each class, and the multiclass formulation, in which the multimodal dictionaries are shared between the classes, a few points should be considered. In the one-vs-all setting, the total number of dictionary atoms is equal to $dSK$ in the $K$-class classification, while in the multiclass setting the number is equal to $dS$. It should be noted that in the multiclass setting a larger dictionary is generally required to capture the variations among all the classes and achieve the same level of performance. However, it is generally observed that the size of the dictionaries in the multiclass setting is not required to grow linearly as the number of classes increases, due to atom sharing among the different classes. Another point to consider is that the class-specific dictionaries of the one-vs-all approach are independent and can be obtained in parallel. In this paper, the multiclass formulation is used to allow feature sharing among the classes.

C. Optimization

The main challenge in optimizing (11) is the non-differentiability of $\mathbf{A}^\star(\{\mathbf{x}^s, \mathbf{D}^s\})$. However, it can be shown that although the sparse coefficients $\mathbf{A}^\star$ are obtained by solving a non-differentiable optimization problem, the function $f(\{\mathbf{D}^s, \mathbf{w}^s\})$, defined in Eq. (12), is differentiable on $\mathcal{D}^1 \times \cdots \times \mathcal{D}^S \times \mathcal{W}^1 \times \cdots \times \mathcal{W}^S$, and therefore its gradients are computable. To find the gradient of $f$ with respect to $\mathbf{D}^s$, one can find the optimality condition of the optimization (7) or use the fixed point differentiation [36], [38] and show that $\mathbf{A}^\star$ is differentiable over its non-zero rows. Without loss of generality, we assume that the label $y$ admits a finite set of values, as in Eqs. (13) and (14). The same algorithm can be derived for the scenario when $\mathbf{y}$ belongs to a compact subset of a finite-dimensional real vector space, as in Eq. (16). A couple of mild assumptions are required to prove the differentiability of $f$; they are direct generalizations of those required for the single-modal scenario [29] and are listed below:

Assumption (A): The multimodal data $(y, \{\mathbf{x}^s\})$ admit a probability density $p$ with compact support.

Assumption (B): For all possible values of $y$, $p(y, \cdot)$ is continuous and $l_{su}(y, \cdot)$ is twice continuously differentiable.


The first assumption is reasonable when dealing with signal/image processing applications where the acquired values obtained by the sensors are bounded. Also, all the given examples for $l_{su}$ in the previous section satisfy the second assumption. Before stating the main proposition of this paper below, the term active set is defined.

Definition 1 (Active Set): The active set $\Lambda$ of the solution $\mathbf{A}^\star$ of the joint sparse coding problem (7) is defined to be

$$\Lambda = \{j \in \{1, \ldots, d\} : \|\mathbf{a}^\star_{j\rightarrow}\|_{\ell_2} \neq 0\}, \tag{18}$$

where $\mathbf{a}^\star_{j\rightarrow}$ is the $j$th row of $\mathbf{A}^\star$.

Proposition 2 (Differentiability and Gradients of $f$): Let $\lambda_2 > 0$ and let the assumptions (A) and (B) hold. Let $\Upsilon = \cup_{j \in \Lambda} \Upsilon_j$, where $\Upsilon_j = \{j, j + d, \ldots, j + (S-1)d\}$. Let the matrix $\hat{\mathbf{D}} \in \mathbb{R}^{n \times |\Upsilon|}$ be defined as

$$\hat{\mathbf{D}} = \big[\hat{\mathbf{D}}_1 \ldots \hat{\mathbf{D}}_{|\Lambda|}\big], \tag{19}$$

where $\hat{\mathbf{D}}_j = \text{blkdiag}(\mathbf{d}^1_j, \ldots, \mathbf{d}^S_j) \in \mathbb{R}^{n \times S}$, $\forall j \in \Lambda$, is the collection of the $j$th active atoms of the multimodal dictionaries, $\mathbf{d}^s_j$ is the $j$th active atom of $\mathbf{D}^s$, blkdiag is the block diagonalization operator, and $n = \sum_{s \in \mathcal{S}} n_s$. Also let the matrix $\boldsymbol{\Delta} \in \mathbb{R}^{|\Upsilon| \times |\Upsilon|}$ be defined as

$$\boldsymbol{\Delta} = \text{blkdiag}(\boldsymbol{\Delta}_1, \ldots, \boldsymbol{\Delta}_{|\Lambda|}), \tag{20}$$

where $\boldsymbol{\Delta}_j = \frac{1}{\|\mathbf{a}^\star_{j\rightarrow}\|_{\ell_2}} \mathbf{I} - \frac{1}{\|\mathbf{a}^\star_{j\rightarrow}\|_{\ell_2}^3} \mathbf{a}^{\star T}_{j\rightarrow} \mathbf{a}^\star_{j\rightarrow} \in \mathbb{R}^{S \times S}$, $\forall j \in \Lambda$, and $\mathbf{I}$ is the identity matrix. Then the function $f$ defined in Eq. (12) is differentiable and, $\forall s \in \mathcal{S}$,

$$\nabla_{\mathbf{w}^s} f = \mathbb{E}\big[\nabla_{\mathbf{w}^s} l_{su}(y, \mathbf{w}^s, \boldsymbol{\alpha}^{s\star})\big],$$
$$\nabla_{\mathbf{D}^s} f = \mathbb{E}\big[(\mathbf{x}^s - \mathbf{D}^s \boldsymbol{\alpha}^{s\star}) \boldsymbol{\beta}_{\tilde{s}}^T - \mathbf{D}^s \boldsymbol{\beta}_{\tilde{s}} \boldsymbol{\alpha}^{s\star T}\big], \tag{21}$$

where $\tilde{s} = \{s, s + S, \ldots, s + (d-1)S\}$ and $\boldsymbol{\beta} \in \mathbb{R}^{dS}$ is defined as

$$\boldsymbol{\beta}_{\Upsilon^c} = \mathbf{0}, \quad \boldsymbol{\beta}_{\Upsilon} = \big(\hat{\mathbf{D}}^T \hat{\mathbf{D}} + \lambda_1 \boldsymbol{\Delta} + \lambda_2 \mathbf{I}\big)^{-1} \mathbf{g}, \tag{22}$$

in which $\mathbf{g} = \text{vec}\big(\nabla_{\mathbf{A}^{\star T}_{\Lambda\rightarrow}} \sum_{s=1}^{S} l_{su}(y, \mathbf{w}^s, \boldsymbol{\alpha}^{s\star})\big)$, $\Upsilon^c = \{1, \ldots, dS\} \setminus \Upsilon$, $\boldsymbol{\beta}_{\Upsilon} \in \mathbb{R}^{|\Upsilon|}$ is formed of those rows of $\boldsymbol{\beta}$ indexed by $\Upsilon$, and $\text{vec}(\cdot)$ is the vectorization operator.

The proof of this proposition is given in the Appendix.

A stochastic gradient descent algorithm to find the optimal dictionaries $\{\mathbf{D}^{s\star}\}$ and classifiers $\{\mathbf{w}^{s\star}\}$ is described in Algorithm 1. The stochastic gradient descent algorithm is guaranteed to converge under a few assumptions that are mildly stricter than those in this paper (requiring three-times differentiability) [53]. To further improve the convergence of the proposed stochastic gradient descent algorithm, a classic mini-batch strategy is used in which a small batch of the training data is sampled in each iteration, instead of a single sample, and the parameters are updated using the averaged updates of the batch. This has the additional advantage that $\hat{\mathbf{D}}^T\hat{\mathbf{D}}$ and the corresponding factorization of the ADMM for solving the sparse coding problem can be computed once for the whole batch. For the special case when $S = 1$, the proposed algorithm reduces to the single-modal task-driven dictionary learning algorithm in [29].

Algorithm 1: Stochastic Gradient Descent Algorithm for Multimodal Task-Driven Dictionary Learning

Selecting $\lambda_2$ in Eq. (7) to be strictly positive guarantees that the linear equations of (22) have a unique solution. In other words, it is easy to show that the matrix $(\hat{\mathbf{D}}^T\hat{\mathbf{D}} + \lambda_1\boldsymbol{\Delta} + \lambda_2\mathbf{I})$ is positive definite given $\lambda_1 \geq 0$, $\lambda_2 > 0$. However, in practice it is observed that the solution of the joint sparse representation problem is numerically stable, since $\hat{\mathbf{D}}$ becomes full-column-rank when sparsity is sought with a sufficiently large $\lambda_1$, and $\lambda_2$ can be set to zero. It should be noted that the assumption of $\hat{\mathbf{D}}$ being a full-column-rank matrix is a common assumption in sparse linear regression [26]. As in any non-convex optimization algorithm, if the algorithm is not initialized properly, it may yield poor performance. Similar to [29], the dictionaries $\{\mathbf{D}^s\}$ are initialized by the solution of the unsupervised multimodal dictionary learning algorithm. Upon assignment of the initial dictionaries, the parameters $\{\mathbf{w}^s\}$ of the classifiers are set by solving (11) only with respect to $\{\mathbf{w}^s\}$, which is a convex optimization problem.
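A minimal sketch of the per-sample gradient computation implied by Proposition 2 and Eqs. (21)-(22) is given below, under the quadratic loss of Eq. (16) and the atom-major $\text{vec}(\mathbf{A}^T)$ index ordering used in the proposition. The joint sparse codes are assumed to come from a solver for Eq. (7), such as the ADMM/prox routines sketched earlier; all names are chosen for illustration and the active set is assumed non-empty:

```python
import numpy as np

def supervised_gradients(xs, y, Ds, Ws, A, lam1, lam2):
    """Per-sample gradients of Eqs. (21)-(22) under the quadratic loss (16).
    xs: list of S inputs (n_s,); y: one-hot label (K,);
    Ds: list of S dictionaries (n_s, d); Ws: list of S classifiers (K, d);
    A: joint sparse codes (d, S) solving Eq. (7)."""
    d, S = A.shape
    act = [j for j in range(d) if np.linalg.norm(A[j]) > 0]  # active set
    offs = np.cumsum([0] + [D.shape[0] for D in Ds])
    # D_hat of Eq. (19): an S-column block-diagonal block per active atom
    Dh = np.zeros((offs[-1], S * len(act)))
    for c, j in enumerate(act):
        for s in range(S):
            Dh[offs[s]:offs[s + 1], c * S + s] = Ds[s][:, j]
    # Delta of Eq. (20), block-diagonal over the active atoms
    Delta = np.zeros((S * len(act), S * len(act)))
    for c, j in enumerate(act):
        a, nrm = A[j], np.linalg.norm(A[j])
        Delta[c*S:(c+1)*S, c*S:(c+1)*S] = np.eye(S)/nrm - np.outer(a, a)/nrm**3
    # g: loss gradient w.r.t. the active rows of A, atom-major (vec(A^T))
    G = np.stack([Ws[s].T @ (Ws[s] @ A[:, s] - y) for s in range(S)], axis=1)
    g = G[act].reshape(-1)
    beta = np.zeros((d, S))
    M = Dh.T @ Dh + lam1 * Delta + lam2 * np.eye(S * len(act))
    beta[act] = np.linalg.solve(M, g).reshape(len(act), S)   # Eq. (22)
    grads_D, grads_W = [], []
    for s in range(S):
        bs, als = beta[:, s], A[:, s]
        grads_D.append(np.outer(xs[s] - Ds[s] @ als, bs)
                       - Ds[s] @ np.outer(bs, als))          # Eq. (21)
        grads_W.append(np.outer(Ws[s] @ als - y, als))       # quadratic loss
    return grads_D, grads_W
```

The stochastic step of Algorithm 1 would then update each $\mathbf{w}^s$ (together with its $\nu$-regularization term) and project $\mathbf{D}^s - \rho_t \nabla_{\mathbf{D}^s}$ back onto the unit-norm atom set of Eq. (9).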

D. Extension

We now present an extension of the proposed algorithm with a more flexible structure on the sparse codes. Joint sparse representation relies on the fact that all the modalities share the same sparsity pattern, in which, if a multimodal training sample is selected to reconstruct the input, then all the modalities within that training sample are active. However, this group sparsity constraint, imposed by the $\ell_{12}$ norm, may be too stringent for some applications [45], [54], for example in scenarios where the modalities have different noise levels or when the heterogeneity of the modalities imposes different sparsity levels for the reconstruction task. A natural relaxation of the joint sparsity prior is to let the multimodal inputs not share the full active set, which can be achieved by replacing the $\ell_{12}$ norm with a combination of the $\ell_{12}$ and $\ell_{11}$ norms ($\ell_{12}-\ell_{11}$ norm). Following the same formulation as in Section III-B, let $\mathbf{A}^\star(\{\mathbf{x}^s, \mathbf{D}^s\})$ in Eq. (11) be the minimizer of the following optimization problem:

$$\min_{\mathbf{A}} \frac{1}{2} \sum_{s=1}^{S} \|\mathbf{x}^s - \mathbf{D}^s \boldsymbol{\alpha}^s\|_{\ell_2}^2 + \lambda_1 \|\mathbf{A}\|_{\ell_{12}} + \lambda'_1 \|\mathbf{A}\|_{\ell_{11}} + \frac{\lambda_2}{2} \|\mathbf{A}\|_F^2, \tag{23}$$

where $\lambda'_1$ is the regularization parameter for the added $\ell_{11}$ norm and the other terms are the same as those in Eq. (7). The selection of $\lambda_1$ and $\lambda'_1$ influences the sparsity pattern of $\mathbf{A}^\star$. Intuitively, as $\lambda_1/\lambda'_1$ increases, the group constraint becomes dominant and more collaboration is enforced among the modalities. On the other hand, small values of $\lambda_1/\lambda'_1$ encourage independent reconstructions across the modalities. In the extreme case of $\lambda_1$ being set to zero, the above optimization problem is separable across the modalities. The above formulation brings added flexibility at the cost of one additional design parameter, which is obtained in this paper using cross-validation.
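The joint sparse coding step then uses the proximal operator of the combined penalty $\lambda_1\|\cdot\|_{\ell_{12}} + \lambda'_1\|\cdot\|_{\ell_{11}}$, which, as for the standard sparse-group-lasso penalty, factors into entrywise soft thresholding followed by row-wise group soft thresholding. A minimal sketch:

```python
import numpy as np

def prox_l12_l11(A, tau_group, tau_l1):
    """Prox of tau_group*||A||_{l12} + tau_l1*||A||_{l11}: entrywise
    shrinkage (within-row sparsity), then row-wise group shrinkage
    (shared sparsity pattern across the S modalities)."""
    B = np.sign(A) * np.maximum(np.abs(A) - tau_l1, 0.0)
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau_group / np.maximum(norms, 1e-12), 0.0)
    return scale * B
```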

Here we present how Algorithm 1 should be modified to solve the supervised multimodal dictionary learning problem under the mixed $\ell_{12}-\ell_{11}$ constraint. The proof for obtaining the algorithm is similar to the one for the $\ell_{12}$ norm and is briefly discussed in the Appendix. In Algorithm 1, let $\mathbf{A}^\star$ be the solution of the optimization problem (23) and let $\Lambda$ be the set of its active rows. Let $\Omega \subseteq \{1, \ldots, S|\Lambda|\}$ be the set of indices with non-zero entries in $\text{vec}(\mathbf{A}^{\star T}_{\Lambda\rightarrow})$, i.e., it consists of the non-zero entries of the active rows of $\mathbf{A}^\star$. Let $\hat{\mathbf{D}}$, $\boldsymbol{\Delta}$, and $\mathbf{g}$ be the same as those defined in Algorithm 1. Then, $\boldsymbol{\beta} \in \mathbb{R}^{dS}$ is updated as

$$\boldsymbol{\beta}_{\Upsilon^c} = \mathbf{0}, \quad \boldsymbol{\beta}_{\Upsilon} = \big(\hat{\mathbf{D}}_\Omega^T \hat{\mathbf{D}}_\Omega + \lambda_1 \boldsymbol{\Delta}_{\Omega\rightarrow,\Omega} + \lambda_2 \mathbf{I}\big)^{-1} \mathbf{g}_\Omega,$$

where $\Upsilon$ is the set of indices with non-zero entries in $\text{vec}(\mathbf{A}^{\star T})$ and $\Upsilon^c = \{1, \ldots, dS\} \setminus \Upsilon$. Note that $\Upsilon$ is defined over the entire matrix $\mathbf{A}^\star$ while $\Omega$ is defined over its active rows. The rest of the algorithm remains unchanged.

IV. RESULTS AND DISCUSSION

The performance of the proposed multimodal dictionary learning algorithms is evaluated on the AR face database [55], the CMU Multi-PIE dataset [56], the IXMAS action recognition dataset [57], and the WVU multimodal dataset [58]. For these algorithms, $l_{su}$ is chosen to be the quadratic loss of Eq. (16) to handle multiclass classification. In our experiments, it is observed that the multiclass formulation achieves classification performance similar to that of the logistic loss formulation of Eq. (13) in the one-vs-all setting. The regularization parameters $\lambda_1$ and $\nu$ are selected using cross-validation in the sets $\{0.01 + 0.005k \mid k \in \{-3, \ldots, 3\}\}$ and $\{10^{-2}, \ldots, 10^{-9}\}$, respectively. It is observed that when the number of dictionary atoms is kept small compared to the number of training samples, $\nu$ can be arbitrarily set to a small value, e.g., $\nu = 10^{-8}$, for the normalized inputs.

Fig. 2. Extracted modalities from a sample in the AR dataset.

TABLE I: CORRECT CLASSIFICATION RATES OBTAINED USING THE WHOLE FACE MODALITY FOR THE AR DATABASE

When the mixed $\ell_{12}-\ell_{11}$ norm is used, the regularization parameters $\lambda_1$ and $\lambda'_1$ are selected by cross-validation in the set $\{0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05\}$. The parameter $\lambda_2$ is set to zero in most of the experiments, except when using the $\ell_{11}$ prior in Section IV-B1, where a small positive value for $\lambda_2$ was required for convergence. The learning rate $\rho_t$ is selected according to the heuristic proposed in [29], i.e., $\rho_t = \min\big(\rho, \rho \frac{t_0}{t}\big)$, where $\rho$ and $t_0$ are constants. This results in a constant learning rate during the first $t_0$ iterations and an annealing strategy of $1/t$ for the rest of the iterations. It is observed that choosing $t_0 = T/10$, where $T$ is the total number of iterations over the whole training set, works well for all of our experiments. Different values of $\rho$ are tried during the first few iterations, and the one that results in minimum error on a small validation set is retained. $T$ is set equal to 20 in all the experiments. We observed empirically that the selection of these parameters is quite robust and small variations in their values do not considerably affect the obtained results. We also used a mini-batch size of 100 in all our experiments. It should also be noted that the design parameters for the competing algorithms are also selected using cross-validation for a fair comparison.
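The step-size heuristic is simple enough to state directly in code (a sketch; parameter names are illustrative):

```python
def rho_schedule(t, rho, t0):
    """rho_t = min(rho, rho * t0 / t): a constant step for the first t0
    iterations, then 1/t annealing, as proposed in [29]."""
    return min(rho, rho * t0 / max(t, 1))
```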

A. AR Face Recognition

The AR dataset consists of faces under different pose, illumination, and expression conditions, captured in two sessions. A set of 100 users is used, each consisting of seven images from the first session as training samples and seven images from the second session as test samples. A small randomly selected portion of the training set, 50 out of 700, is used as a validation set for optimizing the design parameters. Fusion is performed on five modalities: the left and right periocular, nose, mouth, and whole face modalities, similar to the setup in [14] and [45]. A test sample from the AR dataset and the extracted modalities are shown in Fig. 2. Raw pixels are first PCA-transformed and then normalized to have zero mean and unit $\ell_2$ norm. The dictionary size for the dictionary learning algorithms is chosen to be four atoms per class, resulting in dictionaries of 400 atoms overall.

1) Classification Using the Whole Face Modality: The classification results using the whole face modality are shown in Table I. The results are obtained using a linear support vector machine (SVM) [60], multiple kernel learning (MKL) [59], logistic regression (LR) [60], sparse representation classification (SRC) [10], and the unsupervised and supervised dictionary learning algorithms (UDL and SDL) [29]. For the MKL algorithm, linear, polynomial, and RBF kernels are used. The UDL and SDL are equipped with the quadratic classifier (16). The SDL results in the best performance.

TABLE II: COMPARISON OF THE ℓ11 AND ℓ12 PRIORS FOR MULTIMODAL CLASSIFICATION. MODALITIES INCLUDE 1. LEFT PERIOCULAR, 2. RIGHT PERIOCULAR, 3. NOSE, 4. MOUTH, AND 5. FACE

2) $\ell_{11}$ vs $\ell_{12}$ Sparse Priors for Multimodal Classification: A straightforward way of utilizing the single-modal dictionary learning algorithms, namely UDL and SDL, for multimodal classification is to train independent dictionaries and classifiers for each modality and then combine the individual scores for a fused decision. This way of fusion is equivalent to using the $\ell_{11}$ norm on $\mathbf{A}$, instead of the $\ell_{12}$ norm, in Eq. (7) (or setting $\lambda_1$ to zero in Eq. (23)), which does not enforce row sparsity in the sparse coefficients. We denote the corresponding unsupervised and supervised multimodal dictionary learning algorithms using only the $\ell_{11}$ norm as UMDL$_{\ell_{11}}$ and SMDL$_{\ell_{11}}$, respectively. Similarly, the proposed unsupervised and supervised multimodal dictionary learning algorithms using the $\ell_{12}$ norm are denoted as UMDL$_{\ell_{12}}$ and SMDL$_{\ell_{12}}$. Table II compares the performance of the multimodal dictionary learning algorithms under the two priors. As shown, the proposed algorithms with the $\ell_{12}$ prior, which enforces collaborations among the modalities, have better fusion performance than those with the $\ell_{11}$ prior. In particular, SMDL$_{\ell_{12}}$ has significantly better performance than SMDL$_{\ell_{11}}$ for the fusion of the first and second (left and right periocular) modalities. This agrees with the intuition that these modalities are highly correlated and that learning the multimodal dictionaries jointly indeed improves the recognition performance.

3) Comparison With Other Fusion Methods: The performances of the proposed fusion algorithms under different sparsity priors are compared with those of several state-of-the-art decision-level and feature-level fusion algorithms. In addition to the $\ell_{11}$ and $\ell_{12}$ priors, we evaluate the proposed supervised multimodal dictionary learning algorithm with the mixed $\ell_{12}-\ell_{11}$ norm, denoted as SMDL$_{\ell_{12}-\ell_{11}}$. One way to achieve decision-level fusion is to train independent classifiers for each modality and aggregate the outputs, either by adding the corresponding scores of each modality to come up with the fused decision, or by using majority voting among the independent decisions obtained from the different modalities. These approaches are abbreviated Sum and Maj, respectively, and are used with the SVM and LR classifiers for decision-level fusion. The proposed methods are also compared with feature-level fusion methods including the joint sparse representation classifier (JSRC) [14], the joint dynamic sparse representation classifier (JDSRC) [7], and MKL. For JSRC and JDSRC, the dictionary consists of all the training samples. Table III compares the performance of our proposed algorithms with the other fusion algorithms for the AR dataset. As expected, multimodal fusion results in significant performance improvement compared to using only the whole face modality. Moreover, the proposed SMDL$_{\ell_{12}}$ and SMDL$_{\ell_{12}-\ell_{11}}$ achieve the superior performances.

4) Reconstructive vs Discriminative Formulation With Joint Sparsity Prior: Comparison of the algorithms with joint sparsity priors in Table III indicates that the proposed SMDL$_{\ell_{12}}$ algorithm, equipped with dictionaries of size 400, achieves relatively better results than JSRC, which uses dictionaries of size 700. The results confirm the idea that by using the supervised formulation, compared to using the reconstruction error, one can achieve better classification performance even with more compact dictionaries. For further comparison, an experiment is performed in which the correct classification rates of the reconstructive and discriminative formulations are compared when their dictionary sizes are kept equal. For a given number of dictionary atoms per class $d$, the dictionaries of JSRC are thus constructed by random selection of $d$ training samples from the different classes. This is different from the standard JSRC, utilized for the results in Table III, in which all the training samples are used to construct the dictionaries [14]. Moreover, to utilize all the available training samples for the reconstructive approach and make a more meaningful comparison, we use the unsupervised multimodal dictionary learning algorithm of Eq. (8) to train class-specific sub-dictionaries that minimize the reconstruction error in approximating the training samples for a given class. These sub-dictionaries are then stacked to construct the final dictionaries, similar to the approach in [22]. We call this algorithm JSRC-UDL to indicate that the dictionaries are indeed learned by the reconstructive formulation. Table IV summarizes the recognition performance of JSRC and JSRC-UDL in comparison to the proposed SMDL$_{\ell_{12}}$, which enjoys a discriminative formulation, for different numbers of dictionary atoms per class. As seen, SMDL$_{\ell_{12}}$ outperforms the reconstructive approaches, especially when the number of dictionary atoms is chosen to be relatively small. This is the main advantage of SMDL$_{\ell_{12}}$ compared to the reconstructive approaches: more compact dictionaries can be used for the recognition task, which is important for real-time applications. It is clear that the reconstructive model can only result in comparable performance when the dictionary size is chosen to be relatively large. On the other hand, the SMDL$_{\ell_{12}}$ algorithm may get over-fitted with a large number of dictionary atoms. In terms of computational expense at test time, as discussed in [14], the time required to solve the optimization problem (7) is expected to be linear in the dictionary size using the efficient ADMM, if the required matrix factorization is cached beforehand. The typical computational time to solve (7) for a given multimodal test sample is shown in Fig. 3 for different dictionary sizes. As expected, it increases linearly as the size of the dictionary increases. This illustrates the advantage of the SMDL$_{\ell_{12}}$ algorithm, which results in state-of-the-art performance with more compact dictionaries.

TABLE III: MULTIMODAL CLASSIFICATION RESULTS OBTAINED FOR THE AR DATASETS

TABLE IV: COMPARISON OF THE RECONSTRUCTIVE-BASED (JSRC AND JSRC-UDL) AND THE PROPOSED DISCRIMINATIVE-BASED (SMDLℓ12) CLASSIFICATION ALGORITHMS OBTAINED USING THE JOINT SPARSITY PRIOR FOR DIFFERENT NUMBERS OF DICTIONARY ATOMS PER CLASS ON THE AR DATASET

Fig. 3. Computational time required to solve the optimization problem (7) for a given test sample.

TABLE V: COMPARISON OF THE SUPERVISED MULTIMODAL DICTIONARY LEARNING ALGORITHMS WITH DIFFERENT SPARSITY PRIORS FOR FACE RECOGNITION UNDER OCCLUSION ON THE AR DATASET

5) Classification in the Presence of Disguise: The AR dataset also contains 600 occluded samples per session, 1200 images overall, where the faces are disguised using sunglasses or a scarf. Here we use these additional images to evaluate the robustness of the proposed algorithms. Similar to the previous experiments, images from session 1 are used as training samples and images from session 2 are used as test data. The classification performance under different sparsity priors is shown in Table V and, as expected, SMDL$_{\ell_{12}-\ell_{11}}$ achieves the best performance. In the presence of occlusion, some of the modalities are less coupled, and the joint sparsity prior among all the modalities may be too stringent, as is also reflected in the results.

B. Multi-View Recognition

1) Multi-View Face Recognition: In this section, the performance of the proposed algorithm is evaluated for multi-view face recognition using the CMU Multi-PIE dataset [56]. The dataset consists of a large number of face images under different illuminations, viewpoints, and expressions, recorded in four sessions over the span of several months. Subjects were imaged using 13 cameras at different view-angles of {0°, ±15°, ±30°, ±45°, ±60°, ±75°, ±90°} at head height. Illustrations of the multiple-camera configuration, as well as sample multi-view images, are shown in Fig. 4. We use the multi-view face images of the 129 subjects that are present in all sessions. The face regions for all the poses are extracted manually and resized to 10 × 8. Similar to the protocol used in [61], images from session 1 at views {0°, ±30°, ±60°, ±90°} are used as training samples. Test images are obtained from all available view angles from session 2 to have a more realistic scenario in which not all the testing poses are available in the training set. To handle multi-view recognition using the multimodal formulation, we divide the available views into the three sets {−90°, −75°, −60°, −45°}, {−30°, −15°, 0°, 15°, 30°}, and {45°, 60°, 75°, 90°}, each of which forms a modality. A test sample is then constructed by randomly selecting an image from each modality. Two thousand test samples are generated in this way. The dictionary size for the dictionary learning algorithms is chosen to have two atoms per class.

Fig. 4. Configurations of the cameras and sample multi-view images from the CMU Multi-PIE dataset.

The classification results obtained using individual modalities are shown in Table VI. As expected, better classification performance is obtained using the frontal view. Results of multi-view face recognition are shown in Table VII. The proposed supervised dictionary learning algorithms outperform the corresponding unsupervised methods and the other fusion algorithms. The SMDL$_{\ell_{12}-\ell_{11}}$ results in the state-of-the-art performance. It is consistently observed in all the studied applications that the multimodal dictionary learning algorithm with the mixed prior results in better performance than those with the individual $\ell_{12}$ or $\ell_{11}$ prior. However, it requires one additional regularization parameter to be tuned. For the rest of the paper, the performance of the proposed dictionary learning algorithms is only reported under the individual priors.


TABLE VI

CORRECT CLASSIFICATION RATES OBTAINED USING INDIVIDUAL MODALITIES IN THE CMU MULTI-PIE DATABASE

TABLE VII

CORRECT CLASSIFICATION RATES (CCR) OBTAINED USING MULTI-VIEW IMAGES ON THE CMU MULTI-PIE DATABASE

Fig. 5. Sample frames of the IXMAS dataset from 5 different views.

2) Multi-View Action Recognition: This section presents the results of the proposed algorithm for the purpose of multi-view action recognition using the IXMAS dataset [57]. Each action is recorded simultaneously by cameras from five different viewpoints, which are considered as modalities in this experiment. A multimodal sample of the IXMAS dataset is shown in Fig. 5. The dataset contains 11 action classes, where each action is repeated three times by each of the ten actors, resulting in 330 sequences per view. The dataset includes actions such as check watch, cross arms, and scratch head. Similar to the work in [57], [62], and [63], leave-one-actor-out cross-validation is performed and samples from all five views are used for training and testing.

We use dense trajectories as features, which are generated using the publicly available code [63]; a 2000-word codebook is generated from a random subset of these trajectories using k-means clustering, as in [64]. Note that Wang et al. [63] used HOG, HOF, and MBH descriptors in addition to the dense trajectories; here, only the dense trajectory descriptors are used. The number of dictionary atoms for the proposed dictionary learning algorithms is chosen to be 4 atoms per class, resulting in a dictionary of 44 atoms per view. The five dictionaries for JSRC are constructed using all the training samples; thus each dictionary, corresponding to a different view, has 297 atoms.
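The codebook construction described above (a 2000-word k-means vocabulary over a random subset of trajectory descriptors, followed by bag-of-words encoding) can be sketched as follows. This is a plausible reading of the setup, not the released code; the function names and the subset size are our assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, n_words=2000, subset=100_000, seed=0):
    """Fit a k-means vocabulary on a random subset of trajectory descriptors."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(descriptors), size=min(subset, len(descriptors)), replace=False)
    return KMeans(n_clusters=n_words, n_init=1, random_state=seed).fit(descriptors[idx])

def encode(codebook, descriptors):
    """Hard-assign descriptors to words and return a normalized histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```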

Table VIII shows the average accuracies over all classes obtained using the existing algorithms and the state-of-the-art algorithms. The Wang et al. 1 [63] algorithm uses only the dense trajectories as features, similar to our setup. The Wang et al. 2 [63] algorithm, however, uses the HOG, HOF, and MBH descriptors and the spatio-temporal pyramids in addition to the trajectory descriptor. The results show that the proposed

TABLE VIII

CORRECT CLASSIFICATION RATES (CCR) OBTAINED FOR MULTI-VIEW ACTION RECOGNITION ON THE IXMAS DATABASE

Fig. 6. The confusion matrix obtained by the SMDL$_{\ell_{12}}$ algorithm on the IXMAS dataset. The actions are 1: check watch, 2: cross arms, 3: scratch head, 4: sit down, 5: get up, 6: turn around, 7: walk, 8: wave, 9: punch, 10: kick, and 11: pick up.

SMDL$_{\ell_{12}}$ algorithm achieves the superior performance, while the SMDL$_{\ell_{11}}$ algorithm achieves the second-best performance. This indicates that the sparse coefficients generated by the trained dictionaries are indeed more discriminative than the engineered features. The resulting confusion matrix of the SMDL$_{\ell_{12}}$ algorithm is shown in Fig. 6.

C. Multimodal Biometric Recognition

The WVU dataset consists of different biometric modalities such as fingerprint, iris, palmprint, hand geometry, and voice from subjects of different age, gender, and ethnicity. It is a challenging dataset, as many of the samples are corrupted with blur, occlusion, and sensor noise. In this paper, two irises (left and right) and four fingerprint modalities are used. The evaluation is done on a subset of 202 subjects which have more than four samples in all modalities. Samples from the different modalities are shown in Fig. 1. The training set is formed by randomly selecting four samples from each subject, 808 samples overall. The remaining 509 samples are used for testing. The features used here are those described in [14], which are further PCA-transformed. The dimensions of the input data after preprocessing are 178 and 550 for the fingerprint and iris modalities, respectively. All inputs are normalized to have zero mean and unit $\ell_2$ norm. The number of dictionary atoms for the dictionary learning algorithms is chosen to be 2 per class, resulting in dictionaries of 404 atoms overall. The dictionaries for JSRC and JDSRC are constructed using all the training samples.
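A minimal sketch of this preprocessing, under our reading of the text (PCA to the stated dimensions, then zero mean and unit $\ell_2$ norm; the per-sample normalization axis is our assumption, since the text does not specify it):

```python
import numpy as np
from sklearn.decomposition import PCA

def preprocess(X_train, X_test, n_components):
    """PCA-transform, then zero-mean and unit-l2-normalize each sample (row)."""
    pca = PCA(n_components=n_components).fit(X_train)
    def transform(X):
        Z = pca.transform(X)
        Z = Z - Z.mean(axis=1, keepdims=True)                # zero mean per sample
        return Z / np.linalg.norm(Z, axis=1, keepdims=True)  # unit l2 norm per sample
    return transform(X_train), transform(X_test)

# e.g., preprocess(train_iris, test_iris, n_components=550) for an iris modality
```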


TABLE IX

CORRECT CLASSIFICATION RATES OBTAINED USING INDIVIDUAL MODALITIES IN THE WVU DATABASE

TABLE X

MULTIMODAL CLASSIFICATION RESULTS OBTAINED FOR THE WVU DATASET

The classification results obtained using individual modalities on 5 different splits of the data into training and test samples are shown in Table IX. As shown, finger 2 is the strongest modality for the recognition task. The SRC and SDL algorithms achieve the best results. It should be noted that the dictionary size of SRC is twice that of SDL.

For multimodal classification, we consider the fusion of the fingerprints, the fusion of the irises, and the fusion of all the modalities. Table X summarizes the correct classification rates of several fusion algorithms using the 4 fingerprints, the 2 irises, and all the modalities, obtained on 5 different training and test splits. Fig. 7 shows the corresponding cumulative match characteristic (CMC) curves for the competitive methods. The CMC is a performance measure, similar to the ROC, which was originally proposed for biometric recognition systems [67]. As seen, the SMDL$_{\ell_{12}}$ algorithm outperforms the competitive algorithms and achieves the state-of-the-art performance using the irises and all modalities, with rank-one recognition rates of 83.77% and 99.10%, respectively. Using the fingerprints, the performance of the SMDL$_{\ell_{12}}$ is close to that of the best performing algorithm, which is JSRC. The results suggest that using the joint sparsity prior indeed improves the multimodal classification performance by extracting the coupled information among the modalities.
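For reference, a CMC curve can be computed from a probe-by-class score matrix as below; this is a generic sketch of the metric (names are ours), not tied to any particular algorithm in Table X:

```python
import numpy as np

def cmc_curve(scores, true_labels, max_rank=20):
    """scores: (n_probes, n_classes), higher is better; returns CMC(1..max_rank)."""
    ranks = []
    for s, y in zip(scores, true_labels):
        order = np.argsort(-s)                             # classes sorted by score
        ranks.append(int(np.where(order == y)[0][0]) + 1)  # rank of the true class
    ranks = np.asarray(ranks)
    return np.array([(ranks <= k).mean() for k in range(1, max_rank + 1)])
```

The rank-one recognition rate quoted above is simply the first entry of this curve.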

Comparison of the algorithms with the joint sparsity prior indicates that the proposed SMDL$_{\ell_{12}}$ algorithm equipped with dictionaries of size 404 achieves comparable, and mostly better, results than JSRC, which uses dictionaries of size 808. Similar to the experiment in Section IV-A, we compared the reconstructive and discriminative algorithms that are based on the joint sparsity prior when the number of dictionary atoms per class is kept equal. Fig. 8 summarizes the results of the different fusion scenarios. As seen, SMDL$_{\ell_{12}}$ significantly

Fig. 7. CMC plots obtained by fusing the irises (top), fingerprints (middle), and all modalities (bottom) on the WVU dataset.

outperforms JSRC and JSRC-UDL when the number of dictionary atoms per class is chosen to be 1 or 2. The results are consistent with those of Table IV for the AR dataset, indicating that the proposed supervised formulation equipped with more compact dictionaries achieves superior performance compared with the reconstructive formulation for the studied biometric recognition applications.


Fig. 8. Comparison of the reconstructive-based (JSRC and JSRC-UDL) and the proposed discriminative-based (SMDL$_{\ell_{12}}$) classification algorithms obtained using the joint sparsity prior for different numbers of dictionary atoms per class on the WVU dataset.

V. CONCLUSIONS AND FUTURE WORK

The problem of multimodal classification using sparsity models was studied, and a task-driven formulation was proposed to jointly find the optimal dictionaries and classifiers under the joint sparsity prior. It was shown that the resulting bi-level optimization problem is smooth, and a stochastic gradient descent algorithm was proposed to solve the corresponding optimization problem. The algorithm was then extended to a more general scenario in which the sparsity prior is a combination of the joint and independent sparsity constraints. The simulation results on the studied image classification applications suggest that while the unsupervised dictionaries can be used for feature learning, the sparse coefficients generated by the proposed multimodal task-driven dictionary learning algorithms are usually more discriminative and can therefore result in improved multimodal classification performance. It was also shown that, compared to the sparse-representation classification algorithms (JSRC, JDSRC, and JSRC-UDL), the proposed algorithms can achieve significantly better performance when compact dictionaries are utilized.

In the proposed dictionary learning framework, which utilizes the stochastic gradient algorithm, the learning rate should be carefully chosen for convergence of the algorithm. In our experiments, a heuristic was used to control the learning rate. Topics of future research include developing better optimization tools with fast convergence guarantees in this non-convex setting. Moreover, developing task-driven dictionary learning algorithms under other proposed structured sparsity priors for multimodal fusion, such as the tree-structured sparsity prior [45], [68], is another future research topic. Future research will also include adapting the proposed algorithms to other multimodal tasks such as multimodal retrieval, multimodal action recognition using Kinect data, and image super-resolution.
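The heuristic itself is not spelled out here; a schedule commonly used for task-driven dictionary learning (cf. [29]) keeps the rate constant for a warm-up period and then decays it as $O(1/t)$, and is shown below purely as a plausible choice, not the paper's setting:

```python
def learning_rate(t, rho=1e-3, t0=1000):
    """Constant rate rho for the first t0 iterations, then O(1/t) decay.
    The values of rho and t0 are illustrative, not the paper's settings."""
    return min(rho, rho * t0 / max(t, 1))
```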

APPENDIX

The proof of Proposition 2 is presented using the following two results.

Lemma 3 (Optimality Condition): The matrix $A^\star = [\alpha^{1\star} \ldots \alpha^{S\star}] = [a^{\star T}_{1\to} \ldots a^{\star T}_{d\to}]^T \in \mathbb{R}^{d\times S}$ is a minimizer of (7) if and only if, $\forall j \in \{1, \ldots, d\}$,
$$\begin{cases} \left[d_j^{1T}\left(x^1 - D^1\alpha^{1\star}\right) \ldots d_j^{ST}\left(x^S - D^S\alpha^{S\star}\right)\right] - \lambda_2 a^\star_{j\to} = \lambda_1 \dfrac{a^\star_{j\to}}{\|a^\star_{j\to}\|_{\ell_2}}, & \text{if } \|a^\star_{j\to}\|_{\ell_2} \neq 0,\\[2mm] \left\|\left[d_j^{1T}\left(x^1 - D^1\alpha^{1\star}\right) \ldots d_j^{ST}\left(x^S - D^S\alpha^{S\star}\right)\right] - \lambda_2 a^\star_{j\to}\right\|_{\ell_2} \leq \lambda_1, & \text{otherwise.} \end{cases} \tag{24}$$

Proof: The proof follows directly from the subgradient optimality condition of (7), i.e.,
$$0 \in \left\{\left[D^{1T}\left(D^1\alpha^{1\star} - x^1\right) \ldots D^{ST}\left(D^S\alpha^{S\star} - x^S\right)\right] + \lambda_2 A^\star + \lambda_1 P : P \in \partial\|A^\star\|_{\ell_{12}}\right\},$$
where $\partial\|A^\star\|_{\ell_{12}}$ denotes the subgradient of the $\ell_{12}$ norm evaluated at $A^\star$. As shown in [69], the subgradient is characterized, for all $j \in \{1, \ldots, d\}$, by $p_{j\to} = \frac{a_{j\to}}{\|a_{j\to}\|_{\ell_2}}$ if $\|a_{j\to}\|_{\ell_2} > 0$, and $\|p_{j\to}\|_{\ell_2} \leq 1$ otherwise.
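The two cases of (24) mirror the row-wise shrinkage performed by the proximal operator of the $\ell_{12}$ norm: rows whose residual correlation stays below $\lambda_1$ are driven exactly to zero (inactive rows), and the remaining rows are scaled. A minimal sketch of this operator (our own code, for illustration only):

```python
import numpy as np

def prox_l12(A, lam1):
    """Row-wise group soft-thresholding: prox of lam1 * sum_j ||a_{j->}||_2."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam1 / np.maximum(norms, 1e-12), 0.0)
    return scale * A  # rows with norm <= lam1 become exactly zero
```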

Before proceeding to the next proposition, we need to define the term transition point. For a given $\{x^s\}$, let $\Lambda_\lambda$ be the active set of the solution $A^\star$ of (7) when $\lambda_1 = \lambda$. Then $\lambda$ is defined to be a transition point of $\{x^s\}$ if $\Lambda_{\lambda+\varepsilon} \neq \Lambda_{\lambda-\varepsilon}, \forall \varepsilon > 0$.

Proposition 4 (Regularity of $A^\star$): Let $\lambda_2 > 0$ and assumption (A) hold. Then,

Part 1: $A^\star(\{x^s, D^s\})$ is a continuous function of $\{x^s\}$ and $\{D^s\}$.

Part 2: If $\lambda_1$ is not a transition point of $\{x^s\}$, then the active set of $A^\star(\{x^s, D^s\})$ is locally constant with respect to both $\{x^s\}$ and $\{D^s\}$. Moreover, $A^\star(\{x^s, D^s\})$ is locally differentiable with respect to $\{D^s\}$.

Part 3: $\forall \lambda_1 > 0$, there exists a set $N_{\lambda_1}$ of measure zero such that $\forall \{x^s\} \in \{\mathbb{R}^{n^s}\} \setminus N_{\lambda_1}$, $\lambda_1$ is not a transition point of $\{x^s\}$.

Proof: Part 1. In the special case of $S = 1$, which is equivalent to an elastic net problem, this has already been shown [19], [70]. Our proof follows similar steps. Assumption (A) guarantees that $A^\star$ is bounded. Therefore, we can restrict the optimization problem (7) to a compact subset of $\mathbb{R}^{d\times S}$. Since $A^\star$ is unique (imposed by $\lambda_2 > 0$), the cost function of (7) is continuous in $A$, and each element of the set $\{x^s, D^s\}$ is defined over a compact set, $A^\star(\{x^s, D^s\})$ is a continuous function of $\{x^s\}$ and $\{D^s\}$.

Part 2 and Part 3. These statements are proved here by converting the optimization problem (7) into an equivalent group lasso problem [71] and using some recent results on it. Let the matrix $D'_j = \mathrm{blkdiag}(d^1_j, \ldots, d^S_j) \in \mathbb{R}^{n\times S}$, $\forall j \in \{1, \ldots, d\}$, be the block-diagonal collection of the $j$th atoms of the dictionaries. Also let $D' = [D'_1 \ldots D'_d] \in \mathbb{R}^{n\times Sd}$, $x' = [x^{1T} \ldots x^{ST}]^T \in \mathbb{R}^n$, and $a' = [a_{1\to} \ldots a_{d\to}]^T \in \mathbb{R}^{Sd}$. Then (7) can be rewritten as
$$\min_A \frac{1}{2}\left\|x' - D'a'\right\|^2_{\ell_2} + \lambda_1 \sum_{j=1}^d \|a_{j\to}\|_{\ell_2} + \frac{\lambda_2}{2}\|A\|_F^2. \tag{25}$$
This can be further converted into the standard group lasso:
$$\min_A \frac{1}{2}\left\|x'' - D''a'\right\|^2_{\ell_2} + \lambda_1 \sum_{j=1}^d \|a_{j\to}\|_{\ell_2}, \tag{26}$$


where $x'' = \left[x'^T\ 0^T\right]^T \in \mathbb{R}^{n+Sd}$ and $D'' = \left[D'^T\ \sqrt{\lambda_2}\, I\right]^T \in \mathbb{R}^{(n+Sd)\times Sd}$. It is clear that the matrix $D''$ is full column rank. The rest of the proof follows directly from the results in [72].
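The construction of $D''$ and $x''$ can be written out directly; the sketch below is our own illustration (using scipy's `block_diag`), building the stacked quantities from given per-modality dictionaries `Ds` and signals `xs`:

```python
import numpy as np
from scipy.linalg import block_diag

def group_lasso_form(Ds, xs, lam2):
    """Build D'' and x'' of (26) from modality dictionaries Ds and signals xs.
    Columns are grouped by atom index j, so each group matches a row a_{j->}."""
    d = Ds[0].shape[1]
    Dp = np.hstack([block_diag(*[D[:, [j]] for D in Ds]) for j in range(d)])  # D', n x Sd
    xp = np.concatenate(xs)                                                   # x'
    Dpp = np.vstack([Dp, np.sqrt(lam2) * np.eye(Dp.shape[1])])                # D''
    xpp = np.concatenate([xp, np.zeros(Dp.shape[1])])                         # x''
    return Dpp, xpp
```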

Proof of Proposition 2: The above proposition implies that $A^\star$ is differentiable almost everywhere. We now prove Proposition 2. It is easy to show that $f$ is differentiable with respect to $w^s$ due to assumption (A) and the fact that $l^s_u$ is twice differentiable. $f$ is also differentiable with respect to $D^s$ given assumption (A), the twice differentiability of $l^s_u$, and the fact that $A^\star$ is differentiable everywhere except on a set of measure zero (Proposition 4). We obtain the derivative of $f$ with respect to $D^s$ using the chain rule. The steps are similar to those taken for $\ell_1$-related optimization in [36], though a bit more involved. Since the active set is locally constant, using the optimality condition (24), we can implicitly differentiate $A^\star(\{x^s, D^s\})$ with respect to $D^s$. For the non-active rows of $A^\star$, the differential is zero. On the active set $\Lambda$, (24) can be rewritten as
$$\left[D_\Lambda^{1T}\left(x^1 - D^1\alpha^{1\star}\right) \ldots D_\Lambda^{ST}\left(x^S - D^S\alpha^{S\star}\right)\right] - \lambda_2 A^\star_{\Lambda\to} = \lambda_1 \left[\frac{a^{\star T}_{1\to}}{\|a^\star_{1\to}\|_{\ell_2}} \ldots \frac{a^{\star T}_{N\to}}{\|a^\star_{N\to}\|_{\ell_2}}\right]^T, \tag{27}$$

where $N$ is the cardinality of $\Lambda$, and $D_\Lambda$ and $A^\star_{\Lambda\to}$ are the matrices consisting of the active columns of $D$ and the active rows of $A^\star$, respectively. For the rest of the proof, we only work on the active set, and the symbols $\Lambda$ and $\star$ are dropped for ease of notation. Taking the partial derivative of both sides of (27) with respect to $d^s_{ij}$, the element in the $i$th row and $j$th column of $D^s$, and taking its transpose, we have
$$\lambda_2 \frac{\partial A^T}{\partial d^s_{ij}} + \begin{bmatrix} 0 \\ \left(D^s\alpha^s - x^s\right)^T E^s_{ij} + \alpha^{sT} E^{sT}_{ij} D^s \\ 0 \end{bmatrix} + \begin{bmatrix} \frac{\partial \alpha^{1T}}{\partial d^s_{ij}} D^{1T} D^1 \\ \vdots \\ \frac{\partial \alpha^{ST}}{\partial d^s_{ij}} D^{ST} D^S \end{bmatrix} = -\lambda_1 \left[\Delta_1 \frac{\partial a^T_{1\to}}{\partial d^s_{ij}} \; \ldots \; \Delta_N \frac{\partial a^T_{N\to}}{\partial d^s_{ij}}\right],$$
where $E^s_{ij} \in \mathbb{R}^{n^s\times N}$ is a matrix with zero elements except the element in the $i$th row and $j$th column, which is one, and
$$\Delta_k = \frac{1}{\|a_{k\to}\|_{\ell_2}}\left(I - \frac{a^T_{k\to} a_{k\to}}{\|a_{k\to}\|^2_{\ell_2}}\right) \in \mathbb{R}^{S\times S}, \quad \forall k \in \{1, \ldots, N\}.$$
It is easy to check that $\Delta_k \succeq 0$. Vectorizing both sides and factorizing results in

$$\mathrm{vec}\left(\frac{\partial A^T}{\partial d^s_{ij}}\right) = P\begin{bmatrix} 0 \\ \left(x^s - D^s\alpha^s\right)^T e^s_{ij,1} - \alpha^{sT} E^{sT}_{ij} d^s_1 \\ \vdots \\ \left(x^s - D^s\alpha^s\right)^T e^s_{ij,N} - \alpha^{sT} E^{sT}_{ij} d^s_N \\ 0 \end{bmatrix}, \tag{28}$$

where $e^s_{ij,k}$ is the $k$th column of $E^s_{ij}$, $P = \left(\hat{D}^T\hat{D} + \lambda_1\Delta + \lambda_2 I\right)^{-1}$, and $\hat{D}$ and $\Delta$ are defined in Eqs. (19) and (20), respectively. Further simplifying Eq. (28) yields
$$\mathrm{vec}\left(\frac{\partial A^T}{\partial d^s_{ij}}\right) = P_{\tilde{s}}\, E^{sT}_{ij}\left(x^s - D^s\alpha^s\right) - P_{\tilde{s}}\, d^{sT}_{i\to}\,\alpha^s_j,$$

where $\tilde{s}$ is defined in Eq. (21). Using the chain rule, we have
$$\frac{\partial f}{\partial d^s_{ij}} = \mathbb{E}\left[g^T \mathrm{vec}\left(\frac{\partial A^T}{\partial d^s_{ij}}\right)\right],$$
where $g = \mathrm{vec}\left(\frac{\partial \sum_{s=1}^S l^s_u}{\partial A^T}\right)$. Therefore, the derivative with respect to the active columns of dictionary $D^s$ is
$$\frac{\partial f}{\partial D^s} = \mathbb{E}\begin{bmatrix} g^T P_{\tilde{s}}\left(E^{sT}_{11}\left(x^s - D^s\alpha^s\right) - d^{sT}_{1\to}\alpha^s_1\right) & \cdots & g^T P_{\tilde{s}}\left(E^{sT}_{1N}\left(x^s - D^s\alpha^s\right) - d^{sT}_{1\to}\alpha^s_N\right) \\ \vdots & \ddots & \vdots \\ g^T P_{\tilde{s}}\left(E^{sT}_{n^s 1}\left(x^s - D^s\alpha^s\right) - d^{sT}_{n^s\to}\alpha^s_1\right) & \cdots & g^T P_{\tilde{s}}\left(E^{sT}_{n^s N}\left(x^s - D^s\alpha^s\right) - d^{sT}_{n^s\to}\alpha^s_N\right) \end{bmatrix}$$
$$= \mathbb{E}\left[\left(x^s - D^s\alpha^s\right) g^T P_{\tilde{s}} - D^s P^T_{\tilde{s}}\, g\, \alpha^{sT}\right].$$
Setting $\beta = P^T g \in \mathbb{R}^{NS}$ and noting that $\beta_{\tilde{s}} = P^T_{\tilde{s}} g$ completes the proof. $\blacksquare$
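In implementation terms, the closed form above means the dictionary gradient for one sample costs a single linear solve for $\beta$ plus two rank-structured products. A sketch for one modality, restricted to the active set (names are ours; `s_tilde` stands for the index set of Eq. (21)):

```python
import numpy as np

def dictionary_gradient(Ds_act, alpha_s, x_s, P, g, s_tilde):
    """Evaluate (x^s - D^s alpha^s) beta_s^T - D^s beta_s alpha^{sT} for one sample.
    P is the (NS x NS) matrix defined above and g = vec(d loss / d A^T)."""
    beta = P.T @ g                      # beta = P^T g
    beta_s = beta[s_tilde]              # entries of beta belonging to modality s
    residual = x_s - Ds_act @ alpha_s
    return np.outer(residual, beta_s) - Ds_act @ np.outer(beta_s, alpha_s)
```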

The derivation of the algorithm with the mixed $\ell_{12}-\ell_{11}$ prior can be obtained similarly. Consider the solution of the optimization problem (23) with the mixed prior. For each active row $j \in \Lambda$ of $A^\star$, let $\Omega_j \subseteq \{1, \ldots, S\}$ be the set of active modalities which have non-zero entries. Then the optimality condition for the active row $j$ is
$$\left[d^{1T}_j\left(x^1 - D^1\alpha^{1\star}\right) \ldots d^{ST}_j\left(x^S - D^S\alpha^{S\star}\right)\right]_{\Omega_j} - \lambda_2 a^\star_{j\to} = \lambda_1 \frac{a^\star_{j\to,\Omega_j}}{\|a^\star_{j\to}\|_{\ell_2}} + \lambda'_1\, \mathrm{sign}\left(a^\star_{j\to,\Omega_j}\right).$$
The algorithm for the mixed prior can then be obtained by differentiating this optimality condition, following similar steps as shown for the $\ell_{12}$ prior.
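For completeness, the mixed prior corresponds to a sparse-group-lasso penalty, whose proximal operator is known to be entrywise soft-thresholding followed by row-wise group shrinkage (cf. the collaborative hierarchical model in [54]). A sketch of that composed operator, as an illustration only and not the paper's code:

```python
import numpy as np

def prox_l12_l11(A, lam1, lam1p):
    """Prox of lam1 * ||A||_{l12} + lam1p * ||A||_{l11}:
    entrywise soft-threshold, then row-wise group soft-threshold."""
    B = np.sign(A) * np.maximum(np.abs(A) - lam1p, 0.0)                # l11 part
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    return np.maximum(1.0 - lam1 / np.maximum(norms, 1e-12), 0.0) * B  # l12 part
```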

REFERENCES

[1] D. L. Hall and J. Llinas, "An introduction to multisensor data fusion," Proc. IEEE, vol. 85, no. 1, pp. 6–23, Jan. 1997.

[2] P. K. Varshney, "Multisensor data fusion," in Intelligent Problem Solving: Methodologies and Approaches. Berlin, Germany: Springer-Verlag, 2000.

[3] H. Wu, M. Siegel, R. Stiefelhagen, and J. Yang, "Sensor fusion using Dempster–Shafer theory [for context-aware HCI]," in Proc. 19th IEEE Instrum. Meas. Technol. Conf. (IMTC), 2002, pp. 7–12.

[4] A. A. Ross and R. Govindarajan, "Feature level fusion of hand and face biometrics," Proc. SPIE, vol. 5779, pp. 196–204, Apr. 2005.

[5] D. Ruta and B. Gabrys, "An overview of classifier fusion methods," Comput. Inf. Syst., vol. 7, no. 1, pp. 1–10, 2000.

[6] A. Rattani, D. R. Kisku, M. Bicego, and M. Tistarelli, "Feature level fusion of face and fingerprint biometrics," in Proc. 1st IEEE Int. Conf. Biometrics, Theory, Appl., Syst., Sep. 2007, pp. 1–6.

[7] H. Zhang, N. M. Nasrabadi, Y. Zhang, and T. S. Huang, "Multi-observation visual recognition via joint dynamic sparse representation," in Proc. IEEE Conf. Comput. Vis. (ICCV), Nov. 2011, pp. 595–602.


[8] A. Klausner, A. Tengg, and B. Rinner, "Vehicle classification on multi-sensor smart cameras using feature- and decision-fusion," in Proc. 1st ACM/IEEE Int. Conf. Distrib. Smart Cameras, Sep. 2007, pp. 67–74.

[9] A. Rattani and M. Tistarelli, "Robust multi-modal and multi-unit feature level fusion of face and iris biometrics," in Advances in Biometrics. Berlin, Germany: Springer-Verlag, 2009, pp. 960–969.

[10] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.

[11] X. Mei and H. Ling, "Robust visual tracking and vehicle classification via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 11, pp. 2259–2272, Nov. 2011.

[12] H. Zhang, Y. Zhang, N. M. Nasrabadi, and T. S. Huang, "Joint-structured-sparsity-based classification for multiple-measurement transient acoustic signals," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 6, pp. 1586–1598, Dec. 2012.

[13] R. Caruana, Multitask Learning. New York, NY, USA: Springer-Verlag, 1998.

[14] S. Shekhar, V. M. Patel, N. M. Nasrabadi, and R. Chellappa, "Joint sparse representation for robust multimodal biometrics recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 1, pp. 113–126, Jan. 2013.

[15] U. Srinivas, H. S. Mousavi, V. Monga, A. Hattel, and B. Jayarao, "Simultaneous sparsity model for histopathological image representation and classification," IEEE Trans. Med. Imag., vol. 33, no. 5, pp. 1163–1179, May 2014.

[16] H. S. Mousavi, U. Srinivas, V. Monga, Y. Suo, M. Dao, and T. D. Tran, "Multi-task image classification via collaborative, hierarchical spike-and-slab priors," in Proc. IEEE Int. Conf. Image Process. (ICIP), Oct. 2014, pp. 4236–4240.

[17] N. H. Nguyen, N. M. Nasrabadi, and T. D. Tran, "Robust multi-sensor classification via joint sparse representation," in Proc. 14th Int. Conf. Inf. Fusion (FUSION), Jul. 2011, pp. 1–8.

[18] M. Yang, L. Zhang, D. Zhang, and S. Wang, "Relaxed collaborative representation for pattern classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 2224–2231.

[19] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online dictionary learning for sparse coding," in Proc. 26th Annu. Int. Conf. Mach. Learn. (ICML), 2009, pp. 689–696.

[20] J. Mairal, F. Bach, A. Zisserman, and G. Sapiro, "Supervised dictionary learning," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2008, pp. 1033–1040.

[21] J. Mairal, M. Elad, and G. Sapiro, "Sparse representation for color image restoration," IEEE Trans. Image Process., vol. 17, no. 1, pp. 53–69, Jan. 2008.

[22] M. Yang, L. Zhang, J. Yang, and D. Zhang, "Metaface learning for sparse representation based face recognition," in Proc. IEEE Conf. Image Process. (ICIP), Sep. 2010, pp. 1601–1604.

[23] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce, "Learning mid-level features for recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 2559–2566.

[24] Z. Jiang, Z. Lin, and L. S. Davis, "Label consistent K-SVD: Learning a discriminative dictionary for recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, pp. 2651–2664, Nov. 2013.

[25] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.

[26] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online learning for matrix factorization and sparse coding," J. Mach. Learn. Res., vol. 11, pp. 19–60, Jan. 2010.

[27] K. Engan, S. O. Aase, and J. Hakon Husoy, "Method of optimal directions for frame design," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., vol. 5, Mar. 1999, pp. 2443–2446.

[28] M. Elad and M. Aharon, "Image denoising via sparse and redundant representations over learned dictionaries," IEEE Trans. Image Process., vol. 15, no. 12, pp. 3736–3745, Dec. 2006.

[29] J. Mairal, F. Bach, and J. Ponce, "Task-driven dictionary learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 791–804, Apr. 2012.

[30] J. Yang, Z. Wang, Z. Lin, X. Shu, and T. Huang, "Bilevel sparse coding for coupled feature spaces," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 2360–2367.

[31] K. Shu and W. Donghui. (2012). "A brief summary of dictionary learning based approach for classification." [Online]. Available: http://arxiv.org/abs/1205.6391

[32] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Discriminative learned dictionaries for local image analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2008, pp. 1–8.

[33] Q. Zhang and B. Li, "Discriminative K-SVD for dictionary learning in face recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 2691–2698.

[34] I. Ramirez, P. Sprechmann, and G. Sapiro, "Classification and clustering via dictionary learning with structured incoherence and shared features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 3501–3508.

[35] M. Yang, L. Zhang, X. Feng, and D. Zhang, "Fisher discrimination dictionary learning for sparse representation," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Nov. 2011, pp. 543–550.

[36] J. Yang, K. Yu, and T. Huang, "Supervised translation-invariant sparse coding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 3517–3524.

[37] B. Colson, P. Marcotte, and G. Savard, "An overview of bilevel optimization," Ann. Oper. Res., vol. 153, no. 1, pp. 235–256, Sep. 2007.

[38] D. M. Bradley and J. A. Bagnell, "Differentiable sparse coding," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2009, pp. 113–120.

[39] J. Zheng and Z. Jiang, "Learning view-invariant sparse representations for cross-view action recognition," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2013, pp. 3176–3183.

[40] G. Monaci, P. Jost, P. Vandergheynst, B. Mailhé, S. Lesage, and R. Gribonval, "Learning multimodal dictionaries," IEEE Trans. Image Process., vol. 16, no. 9, pp. 2272–2283, Sep. 2007.

[41] Y. T. Zhuang, Y. F. Wang, F. Wu, Y. Zhang, and W. Lu, "Supervised coupled dictionary learning with group structures for multi-modal retrieval," in Proc. 27th Conf. Artif. Intell., 2013, pp. 1070–1076.

[42] H. Zhang, Y. Zhang, and T. S. Huang, "Simultaneous discriminative projection and dictionary learning for sparse representation based classification," Pattern Recognit., vol. 46, no. 1, pp. 346–354, Jan. 2013.

[43] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1?" Vis. Res., vol. 37, no. 23, pp. 3311–3325, Dec. 1997.

[44] L. Bottou and O. Bousquet, "The tradeoffs of large scale learning," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2007, pp. 161–168.

[45] S. Bahrampour, A. Ray, N. M. Nasrabadi, and K. W. Jenkins, "Quality-based multimodal classification using tree-structured sparsity," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 4114–4121.

[46] S. F. Cotter, B. D. Rao, K. Engan, and K. Kreutz-Delgado, "Sparse solutions to linear inverse problems with multiple measurement vectors," IEEE Trans. Signal Process., vol. 53, no. 7, pp. 2477–2488, Jul. 2005.

[47] J. A. Tropp, "Algorithms for simultaneous sparse approximation. Part II: Convex relaxation," Signal Process., vol. 86, no. 3, pp. 589–602, Mar. 2006.

[48] A. Rakotomamonjy, "Surveying and comparing simultaneous sparse approximation (or group-lasso) algorithms," Signal Process., vol. 91, no. 7, pp. 1505–1526, Jul. 2011.

[49] N. Parikh and S. Boyd, "Proximal algorithms," Found. Trends Optim., vol. 1, no. 3, pp. 123–231, 2013.

[50] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. Roy. Statist. Soc. Ser. B, vol. 67, no. 2, pp. 301–320, Apr. 2005.

[51] M. Aharon and M. Elad, "Sparse and redundant modeling of image content using an image-signature-dictionary," SIAM J. Imag. Sci., vol. 1, no. 3, pp. 228–247, 2008.

[52] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proc. 19th Int. Conf. COMPSTAT, 2010, pp. 177–186.

[53] L. Bottou, "Online learning and stochastic approximations," Online Learn. Neural Netw., vol. 17, p. 9, Oct. 2012.

[54] P. Sprechmann, I. Ramirez, G. Sapiro, and Y. C. Eldar, "C-HiLasso: A collaborative hierarchical sparse modeling framework," IEEE Trans. Signal Process., vol. 59, no. 9, pp. 4183–4198, Sep. 2011.

[55] A. Martínez and R. Benavente, "The AR face database," CVC Tech. Rep. 24, 1998.

[56] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, "Multi-PIE," Image Vis. Comput., vol. 28, no. 5, pp. 807–813, May 2010.

[57] D. Weinland, E. Boyer, and R. Ronfard, "Action recognition from arbitrary views using 3D exemplars," in Proc. IEEE 11th Int. Conf. Comput. Vis. (ICCV), Oct. 2007, pp. 1–7.

[58] S. Crihalmeanu, A. Ross, and L. Hornak, "A protocol for multibiometric data acquisition, storage and dissemination," Dept. Comput. Sci. Elect. Eng., West Virginia Univ., Morgantown, WV, USA, Tech. Rep., 2007.

[59] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," J. Mach. Learn. Res., vol. 9, no. 11, pp. 2491–2521, 2008.


[60] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer-Verlag, 2006.

[61] H. Zhang, N. M. Nasrabadi, Y. Zhang, and T. S. Huang, "Joint dynamic sparse representation for multi-view face recognition," Pattern Recognit., vol. 45, no. 4, pp. 1290–1298, Apr. 2012.

[62] D. Tran and A. Sorokin, "Human activity recognition with metric learning," in Computer Vision—ECCV. New York, NY, USA: Springer-Verlag, 2008, pp. 548–561.

[63] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu, "Dense trajectories and motion boundary descriptors for action recognition," Int. J. Comput. Vis., vol. 103, no. 1, pp. 60–79, May 2013.

[64] A. Gupta, A. Shafaei, J. J. Little, and R. J. Woodham, "Unlabelled 3D motion examples improve cross-view action recognition," in Proc. Brit. Mach. Vis. Conf., 2014, pp. 2458–2466.

[65] I. N. Junejo, E. Dexter, I. Laptev, and P. Pérez, "View-independent action recognition from temporal self-similarities," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 172–185, Jan. 2011.

[66] X. Wu, D. Xu, L. Duan, and J. Luo, "Action recognition using context and appearance distribution features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 489–496.

[67] R. M. Bolle, J. H. Connell, S. Pankanti, N. K. Ratha, and A. W. Senior, "The relation between the ROC curve and the CMC," in Proc. 4th IEEE Workshop Automat. Identificat. Adv. Technol., Oct. 2005, pp. 15–20.

[68] J. Mairal, R. Jenatton, G. Obozinski, and F. Bach, "Learning hierarchical and topographic dictionaries with structured sparsity," Proc. SPIE, vol. 8138, p. 81381P, Sep. 2011.

[69] E. van den Berg and M. P. Friedlander, "Theoretical and empirical results for recovery from multiple measurements," IEEE Trans. Inf. Theory, vol. 56, no. 5, pp. 2516–2527, May 2010.

[70] H. Zou, T. Hastie, and R. Tibshirani, "On the 'degrees of freedom' of the lasso," Ann. Statist., vol. 35, no. 5, pp. 2173–2192, 2007.

[71] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," J. Roy. Statist. Soc. Ser. B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, Feb. 2006.

[72] S. Vaiter, C. Deledalle, G. Peyré, J. Fadili, and C. Dossal. (2012). "The degrees of freedom of the group lasso." [Online]. Available: http://arxiv.org/abs/1205.1481

Soheil Bahrampour received the M.Sc. degree in electrical engineering from the University of Tehran, Iran, in 2009, and the M.Sc. degree in mechanical engineering and the Ph.D. degree in electrical engineering under the supervision of A. Ray and W. K. Jenkins from The Pennsylvania State University, University Park, PA, in 2013 and 2015, respectively. He is currently a Research Scientist with the Bosch Research and Technology Center, Palo Alto, CA. His research interests include machine learning, data mining, signal processing, and computer vision.

Nasser M. Nasrabadi (S'80–M'84–SM'92–F'01) received the B.Sc. (Eng.) and Ph.D. degrees in electrical engineering from the Imperial College of Science and Technology (University of London), London, U.K., in 1980 and 1984, respectively. In 1984, he was with IBM (U.K.) as a Senior Programmer. From 1985 to 1986, he was with the Philips Research Laboratory, NY, as a member of the Technical Staff. From 1986 to 1991, he was an Assistant Professor with the Department of Electrical Engineering, Worcester Polytechnic Institute, Worcester, MA. From 1991 to 1996, he was an Associate Professor with the Department of Electrical and Computer Engineering, State University of New York at Buffalo, Buffalo, NY. From 1996 to 2015, he was a Senior Research Scientist with the U.S. Army Research Laboratory (ARL). He is currently a Professor with the Lane Department of Computer Science and Electrical Engineering, West Virginia University. His current research interests are in image processing, computer vision, biometrics, statistical machine learning theory, sparsity, robotics, and neural networks applications to image processing. He is also a fellow of ARL and SPIE. He has served as an Associate Editor of the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON CIRCUITS, SYSTEMS AND VIDEO TECHNOLOGY, and the IEEE TRANSACTIONS ON NEURAL NETWORKS.

Asok Ray (SM'83–F'02) received the Ph.D. degree in mechanical engineering from Northeastern University, Boston, MA, and master's degrees in the disciplines of electrical engineering, mathematics, and computer science. He joined The Pennsylvania State University, University Park, PA, in 1985, and is currently a Distinguished Professor of Mechanical Engineering and Mathematics, a Graduate Faculty of Electrical Engineering, and a Graduate Faculty of Nuclear Engineering. He held research and academic positions with the Massachusetts Institute of Technology, Cambridge, MA, and Carnegie-Mellon University, Pittsburgh, PA, as well as management and research positions with the GTE Strategic Systems Division, Westborough, MA, the Charles Stark Draper Laboratory, Cambridge, and the MITRE Corporation, Bedford, MA. He has authored or co-authored over 550 research publications, including over 285 scholarly articles in refereed journals and research monographs. He is also a fellow of the American Society of Mechanical Engineers and the World Innovative Foundation. He had been a Senior Research Fellow with the NASA Glenn Research Center under a National Academy of Sciences Award.

William Kenneth Jenkins (S'72–M'74–SM'80–F'85–LF'13) received the B.S.E.E. degree from Lehigh University, and the M.S.E.E. and Ph.D. degrees from Purdue University. From 1974 to 1977, he was a Research Scientist Associate with the Communication Sciences Laboratory, Lockheed Research Laboratory, Palo Alto, CA. In 1977, he joined the University of Illinois at Urbana–Champaign, where he was a Faculty Member in Electrical and Computer Engineering from 1977 to 1999. From 1986 to 1999, he was the Director of the Coordinated Science Laboratory. From 1999 to 2011, he served as a Professor and the Head of Electrical Engineering with The Pennsylvania State University. In 2011, he returned to the rank of Professor of Electrical Engineering. His current research interests include fault tolerant DSP for highly scaled VLSI systems, adaptive signal processing, multidimensional array processing, computer imaging, and bio-inspired optimization algorithms for intelligent signal processing. He has co-authored the book entitled Advanced Concepts in Adaptive Signal Processing (Kluwer, 1996). He was an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, and the President of the CAS Society (1985). He served as the General Chairman of the 1988 Midwest Symposium on Circuits and Systems and the Thirty-Second Annual Asilomar Conference on Signals and Systems. From 2002 to 2007, he served on the Board of Directors of the Electrical and Computer Engineering Department Heads Association (ECEDHA), and was the President of ECEDHA in 2005. Since 2011, he has served as a member of the IEEE-HKN Board of Governors. He was a recipient of the 1990 Distinguished Service Award of the IEEE Circuits and Systems Society. In 2000, he received a Golden Jubilee Medal from the IEEE Circuits and Systems Society and a 2000 Millennium Award from the IEEE. In 2000, he was also named a co-winner of the 2000 International Award of the George Montefiore Foundation (Belgium) for outstanding career contributions in the field of electrical engineering and electrical science. In 2002, he received the Shaler Area High School Distinguished Alumnus Award, and in 2013 the ECEDHA Robert M. Janowiak Outstanding Leadership and Service Award. In 2007, he was honored with an IEEE Midwest Symposium on Circuits and Systems 50th Anniversary Award.