
Truncated Cauchy Non-Negative Matrix Factorization

Naiyang Guan, Tongliang Liu, Yangmuzi Zhang, Dacheng Tao, Fellow, IEEE, and Larry S. Davis, Fellow, IEEE

Abstract—Non-negative matrix factorization (NMF) minimizes the Euclidean distance between the data matrix and its low-rank approximation, and it fails when applied to corrupted data because the loss function is sensitive to outliers. In this paper, we propose a Truncated Cauchy loss that handles outliers by truncating large errors, and develop a Truncated CauchyNMF model to robustly learn the subspace on noisy datasets contaminated by outliers. We theoretically analyze the robustness of Truncated CauchyNMF in comparison with competing models and prove that Truncated CauchyNMF has a generalization bound which converges at a rate of order $O(\sqrt{\ln n / n})$, where $n$ is the sample size. We evaluate Truncated CauchyNMF by image clustering on both simulated and real datasets. The experimental results on datasets containing gross corruptions validate the effectiveness and robustness of Truncated CauchyNMF for learning robust subspaces.

Index Terms—Non-negative matrix factorization, truncated Cauchy loss, robust statistics, half-quadratic programming


1 INTRODUCTION

NON-NEGATIVE matrix factorization (NMF, [16]) explores the non-negativity property of data and has received considerable attention in many fields, such as text mining [25], hyper-spectral imaging [26], and gene expression clustering [38]. It decomposes a data matrix into the product of two lower dimensional non-negative factor matrices by minimizing the Euclidean distance between their product and the original data matrix. Since NMF only allows additive, non-subtractive combinations, it obtains a natural parts-based representation of the data. NMF is optimal when the dataset contains additive Gaussian noise, and so it fails on grossly corrupted datasets, e.g., the AR database [22] where face images are partially occluded by sunglasses or scarves. This is because the corruptions or outliers seriously violate the noise assumption.

Many models have been proposed to improve the robustness of NMF. Hamza and Brady [12] proposed a hypersurface cost based NMF (HCNMF) which minimizes the hypersurface cost function (see footnote 1) between the data matrix and its approximation. HCNMF is a significant contribution for improving the robustness of NMF, but its optimization algorithm is time-consuming because the Armijo's rule based line search that it employs is complex. Lam [15] proposed L1-NMF (see footnote 2) to model the noise in a data matrix by a Laplace distribution. Although L1-NMF is less sensitive to outliers than NMF, its optimization is expensive because the L1-norm based loss function is non-smooth. This problem is largely reduced by Manhattan NMF (MahNMF, [11]), which solves L1-NMF by approximating the non-smooth loss function with a smooth one and minimizing the approximated loss function with Nesterov's method [36]. Zhang et al. [29] proposed an L1-norm regularized Robust NMF (RNMF-L1) to recover the uncorrupted data matrix by subtracting a sparse error matrix from the corrupted data matrix. Kong et al. [14] proposed L2,1-NMF to minimize the L2,1-norm of an error matrix to prevent noise of large magnitude from dominating the objective function. Gao et al. [47] further proposed robust capped norm NMF (RCNMF) to filter out the effect of outlier samples by limiting their proportions in the objective function. However, the iterative algorithms utilized in L2,1-NMF and RCNMF converge slowly because they involve a successive use of the power method [1]. Recently, Bhattacharyya et al. [48] proposed an important robust variant of convex NMF which only requires the average L1-norm of noise over large subsets of columns to be small; Pan et al. [49] proposed an L1-norm based robust dictionary learning model; and Gillis and Luce [50] proposed a

• N. Guan is with the College of Computer, National University of Defense Technology, Changsha 410073, and the National Institute of Defense Technology Innovation, Beijing 100010, China. E-mail: [email protected].
• T. Liu and D. Tao are with the UBTECH Sydney Artificial Intelligence Centre and the School of Information Technologies, Faculty of Engineering and Information Technologies, University of Sydney, 6 Cleveland St, Darlington, NSW 2008, Australia. E-mail: {tongliang.liu, dacheng.tao}@sydney.edu.au.
• Y. Zhang is with Google, Mountain View, CA 94043. E-mail: [email protected].
• L.S. Davis is with the University of Maryland Institute for Advanced Computer Studies (UMIACS), University of Maryland at College Park, MD 20742. E-mail: [email protected].

Manuscript received 30 Apr. 2017; revised 12 Oct. 2017; accepted 16 Nov. 2017. Date of publication 28 Nov. 2017; date of current version 12 Dec. 2018. Recommended for acceptance by H. Xu. Digital Object Identifier no. 10.1109/TPAMI.2017.2777841.

1. The hypersurface cost function is defined as $h(x) = \sqrt{1 + x^2} - 1$, which is quadratic when its argument is small and linear when its argument is large.

2. When the noise is modeled by a Laplace distribution, the maximum likelihood estimation yields an L1-norm based objective function. We therefore term the method in [15] L1-NMF.


robust near-separable NMF which can determine the low rank, avoid normalizing data, and filter out outliers. HCNMF, L1-NMF, RNMF-L1, L2,1-NMF, RCNMF, [48], [49], and [50] share a common drawback, i.e., they all fail when the dataset is contaminated by serious corruptions because the breakdown point of the L1-norm based models is determined by the dimensionality of the data [7].

In this paper, we propose a Truncated Cauchy non-negative matrix factorization (Truncated CauchyNMF) model to learn a subspace on a dataset contaminated by large-magnitude noise or corruption. In particular, we propose a Truncated Cauchy loss that simultaneously and appropriately models moderate outliers (because the loss corresponds to a fat-tailed distribution in-between the truncation points) and extreme outliers (because the truncation directly cuts off large errors). Based on the proposed loss function, we develop a novel Truncated CauchyNMF model. We theoretically analyze the robustness of Truncated CauchyNMF and show that Truncated CauchyNMF is more robust than a family of NMF models, and we derive a theoretical guarantee for its generalization ability and show that Truncated CauchyNMF converges at a rate of order $O(\sqrt{\ln n / n})$, where $n$ is the sample size.

Truncated CauchyNMF is difficult to optimize because the loss function includes a nonlinear logarithmic function. To address this, we optimize Truncated CauchyNMF by half-quadratic (HQ) programming based on the theory of convex conjugation. HQ introduces a weight for each entry of the data matrix and alternately and analytically updates the weight and updates both factor matrices by solving a weighted non-negative least squares problem with Nesterov's method [23]. Intuitively, the introduced weight reflects the magnitude of the error. The heavier the corruption, the smaller the weight, and the less an entry contributes to learning the subspace. By performing truncation on the magnitudes of errors, we prove that HQ introduces zero weights for entries with extreme outliers, and thus HQ is able to learn the intrinsic subspace on the inlier entries.

In summary, the contributions of this paper are three-fold: (1) we propose a robust subspace learning framework called Truncated CauchyNMF, and develop a Nesterov-based HQ algorithm to solve it; (2) we theoretically analyze the robustness of Truncated CauchyNMF in comparison with a family of NMF models, and provide insight as to why Truncated CauchyNMF is the most robust method; and (3) we theoretically analyze the generalization ability of Truncated CauchyNMF, and provide performance guarantees for the proposed model. We evaluate Truncated CauchyNMF by image clustering on both simulated and real datasets. The experimental results on datasets containing gross corruptions validate the effectiveness and robustness of Truncated CauchyNMF for learning the subspace.

The rest of this paper is organized as follows: Section 2 describes the proposed Truncated CauchyNMF, and Section 3 develops the Nesterov-based half-quadratic programming algorithm for solving Truncated CauchyNMF. Section 4 surveys the related work and Section 5 verifies Truncated CauchyNMF on simulated and real datasets. Section 6 concludes this paper. All proofs are given in the supplementary material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2017.2777841.

2 TRUNCATED CAUCHY NON-NEGATIVE MATRIX FACTORIZATION

Classical NMF [16] is not robust because its loss function $e_2(x) = x^2$ is sensitive to outliers, considering that errors of large magnitude dominate the loss function. Although some robust loss functions, such as $e_1(x) = |x|$ for L1-NMF [15], the hypersurface cost $e_h(x) = \sqrt{1 + x^2} - 1$ [12], and the Cauchy loss $e_c(x; \gamma) = \ln\big(1 + (x/\gamma)^2\big)$, are less sensitive to outliers, they introduce infinite energy for infinitely large noise in the extreme case. To remedy this problem, we propose a Truncated Cauchy loss that truncates the magnitudes of large errors to limit the effect of extreme outliers, i.e.,

$$ e_t(x; \gamma, \varepsilon) = \begin{cases} \ln\big(1 + (x/\gamma)^2\big), & |x| \le \varepsilon \\ \ln\big(1 + (\varepsilon/\gamma)^2\big), & |x| > \varepsilon, \end{cases} \quad (1) $$

where $\gamma$ is the scale parameter of the Cauchy distribution and $\varepsilon$ is a constant.
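To make the truncation behavior concrete, the following minimal NumPy sketch evaluates $e_t(x; \gamma, \varepsilon)$ as defined in (1); the function name and vectorized form are illustrative rather than code from the paper.

```python
import numpy as np

def truncated_cauchy_loss(x, gamma=1.0, eps=5.0):
    """Truncated Cauchy loss e_t(x; gamma, eps) from Eq. (1).

    Equals ln(1 + (x / gamma)**2) for |x| <= eps and stays at the constant
    ln(1 + (eps / gamma)**2) for |x| > eps, so extreme errors contribute no
    additional energy.
    """
    x = np.asarray(x, dtype=float)
    capped = np.minimum(np.abs(x), eps)   # truncate large errors
    return np.log1p((capped / gamma) ** 2)

# e_t(x; 1, 5): the values for x = 5 and x = 100 coincide.
print(truncated_cauchy_loss([0.5, 3.0, 5.0, 100.0]))
```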

To study the behavior of the Truncated Cauchy loss, we compare the loss functions $e_2(x)$, $e_1(x)$, $e_c(x; 1)$, $e_t(x; 1, 5)$, and the loss function of the L0-norm, i.e., $e_0(x) = 1$ if $x \ne 0$ and $e_0(x) = 0$ if $x = 0$, in Fig. 1, because the L0-norm induces robust models. Fig. 1a shows that when the error is moderately large, e.g., $|x| \le 5$, $e_t(x; 1, 5)$ shifts from $e_2(x)$ to $e_1(x)$ and corresponds to a fat-tailed distribution, implying that the Truncated Cauchy loss can model moderate outliers well, while $e_2(x)$ cannot because it lets the outliers dominate the objective function. When the error gets larger and larger, $e_t(x; 1, 5)$ departs from $e_1(x)$ and behaves like $e_0(x)$: it stays constant once the error exceeds a threshold, e.g., $|x| > 5$, implying that the Truncated Cauchy loss can model extreme outliers, whereas neither $e_1(x)$ nor $e_c(x; 1)$ can because they assign infinite energy to infinitely large errors. Intuitively, the Truncated Cauchy loss can model both moderate and extreme outliers well. Fig. 1b plots the curves of both $e_t(x; \gamma, \varepsilon)$ and $e_0(x)$ with $\gamma$ varying from 0.001 to 1 and, accordingly, $\varepsilon$ varying from 25 to 200. It shows that $e_t(x; \gamma, \varepsilon)$ behaves more and more like $e_0(x)$ as $\gamma$ approaches zero. By comparing the behaviors of these loss functions, we believe that the Truncated Cauchy loss can induce a robust NMF model.

Fig. 1. The comparison of loss functions: (a) $e_2(x)$, $e_1(x)$, $e_c(x; 1)$, $e_t(x; 1, 5)$, and $e_0(x)$; and (b) $e_t(x; \gamma, \varepsilon)$ when $(\gamma, \varepsilon) = (1, 200), (0.1, 100), (0.01, 50), (0.001, 25)$, and $e_0(x)$.

Given $n$ high-dimensional samples arranged in a non-negative matrix $V = [v_1, \dots, v_n] \in \mathbb{R}^{m \times n}_+$, Truncated Cauchy non-negative matrix factorization (Truncated CauchyNMF) approximately decomposes $V$ into the product of two lower dimensional non-negative matrices, i.e., $V = WH + E$, where $W \in \mathbb{R}^{m \times r}_+$ signifies the basis, $H = [h_1, \dots, h_n] \in \mathbb{R}^{r \times n}_+$ signifies the coefficients, and $E \in \mathbb{R}^{m \times n}$ signifies the error matrix, which is measured by using the proposed Truncated Cauchy loss. The objective function of Truncated CauchyNMF can be written as

$$ \min_{W \ge 0, H \ge 0} \; \frac{1}{2} \sum_{ij} g\!\left( \Big( \frac{V - WH}{\gamma} \Big)^2_{ij} \right), \quad (2) $$

where $g(x) = \begin{cases} \ln(1 + x), & 0 \le x \le \sigma \\ \ln(1 + \sigma), & x > \sigma \end{cases}$ is utilized for the convenience of derivation, $\sigma$ is a truncation parameter, and $\gamma$ is the scale parameter. We will next show that the truncation parameter $\sigma$ can be implicitly determined by robust statistics and the scale parameter $\gamma$ can be estimated by the Nagy algorithm [32]. It is not hard to see that Truncated CauchyNMF includes CauchyNMF as a special case when $\sigma = +\infty$. Since (2) assigns fixed energy to any large error whose magnitude exceeds $\gamma\sqrt{\sigma}$, Truncated CauchyNMF can filter out any extreme outliers.

To illustrate the ability of Truncated CauchyNMF to model outliers, Fig. 2 gives an illustrative example that demonstrates its application to corrupted face images. In this example, we select 26 frontal face images of an individual in two sessions from the Purdue AR database [22] (see all face images in Fig. 2a). In each session, there are 13 frontal face images with different facial expressions, captured under different illumination conditions, with sunglasses, and with a scarf. Each image is cropped into a 165 x 120-dimensional pixel array and reshaped into a 19800-dimensional vector. The face images together compose a 19800 x 26-dimensional non-negative matrix because the pixel values are non-negative. In this experiment, we aim at learning the intrinsically clean face images from the contaminated images. This task is quite challenging because more than half the images are contaminated. Since these images were taken in two sessions, we set the dimensionality low ($r = 2$) to learn two basis images. Fig. 2b shows that Truncated CauchyNMF robustly recovers all face images even when they are contaminated by a variety of facial expressions, illumination, and occlusion. Fig. 2c presents the reconstruction errors and Fig. 2d shows the basis images, which confirms that Truncated CauchyNMF is able to learn clean basis images with the outliers filtered out.

In the following sections, we will analyze the generalization ability and robustness of Truncated CauchyNMF. Before that, we introduce Lemma 1, which states that the new representations generated by Truncated CauchyNMF are bounded if the input observations are bounded. This lemma will be utilized in the following analysis with the only assumption that each basis vector is a unit vector. Such an assumption is typical in NMF because the bases $W$ are usually normalized to limit the variance of the local minimizers. We use $\| \cdot \|_p$ to represent the $L_p$-norm and $\| \cdot \|$ to represent the Euclidean norm.

Lemma 1. Assume $\|W_{\cdot i}\| = 1$, $i = 1, \dots, r$, and that the input observations are bounded, i.e., $\|v\| \le a$ for some $a > 0$. Then the new representations are also bounded, i.e., $\|h\| \le 2a + (\sigma a)/(\sqrt{2}\,\gamma)$.

Although Truncated CauchyNMF (2) has a differentiable objective function, solving it is difficult because the natural logarithmic function is nonlinear. Section 3 will present a half-quadratic programming algorithm for solving Truncated CauchyNMF.

2.1 Generalization Ability

To analyze the generalization ability of Truncated CauchyNMF, we further assume that the samples $[v_1, \dots, v_n]$ are independent and identically distributed and drawn from a space $\mathcal{V}$ with a Borel measure $\rho$. We use $A_{\cdot j}$ and $A_{ij}$ to denote the $j$th column and the $(i,j)$th entry of a matrix $A$, respectively, and $a_i$ is the $i$th entry of a vector $a$.

For any $W \in \mathbb{R}^{m \times r}_+$, we define the reconstruction error of a sample $v$ as follows:

$$ f_W(v) = \min_{h \in \mathbb{R}^r_+} \sum_j g\!\left( \Big( \frac{v - Wh}{\gamma} \Big)^2_j \right). \quad (3) $$

Therefore, the objective function of Truncated CauchyNMF (2) can be written as

$$ \min_{W \ge 0, H \ge 0} \; \frac{1}{2} \sum_{ij} g\!\left( \Big( \frac{V - WH}{\gamma} \Big)^2_{ij} \right) = \min_{W \ge 0} \; \frac{1}{2} \sum_i f_W(v_i). \quad (4) $$

Let us define the empirical reconstruction error of Truncated CauchyNMF as $R_n(f_W) = \frac{1}{n} \sum_{i=1}^{n} f_W(v_i)$, and the expected reconstruction error of Truncated CauchyNMF as $R(f_W) = \mathbb{E}_v \frac{1}{n} \sum_{i=1}^{n} f_W(v_i)$. Intuitively, we want to learn

$$ W^* = \arg\min_{W \ge 0} R(f_W). \quad (5) $$

However, since the distribution of $v$ is unknown, we cannot minimize $R(f_W)$ directly. Instead, we use the empirical risk minimization (ERM, [2]) algorithm to learn $W_n$ to approximate $W^*$, as follows:

$$ W_n = \arg\min_{W \ge 0} R_n(f_W). \quad (6) $$

We are interested in the difference between $W_n$ and $W^*$. If the distance is small, we can say that $W_n$ is a good approximation of $W^*$. Here, we measure the distance by their reduced expected reconstruction error as follows:

$$ R(f_{W_n}) - R(f_{W^*}) \le 2 \sup_{f_W \in \mathcal{F}_W} \big| R(f_W) - R_n(f_W) \big|, $$

where $\mathcal{F}_W = \{ f_W \mid W \in \mathcal{W} = \mathbb{R}^{m \times r}_+ \}$. The right hand side is known as the generalization error. Note that since NMF is convex with respect to either $W$ or $H$ but not both, the minimizer $f_{W_n}$ is hard to obtain. In practice, a local minimizer is used as an approximation. Measuring the distance between the local minimizer and the global minimizer is also an interesting and challenging problem.

Fig. 2. The illustrative example: (a) frontal face images from the AR database, (b) face images reconstructed by Truncated CauchyNMF, (c) error images, and (d) the learned basis images.


By analyzing the covering number [30] of the function class $\mathcal{F}_W$ and Lemma 1, we derive a generalization error bound for Truncated CauchyNMF as follows:

Theorem 1. Let $\|W_{\cdot i}\| = 1$, $i = 1, \dots, r$, and $\mathcal{F}_W = \{ f_W \mid W \in \mathcal{W} = \mathbb{R}^{m \times r}_+ \}$. Assume that $\|v\| \le a$. For any $\delta > 0$, with probability at least $1 - \delta$, Equation (7) holds, where $\Gamma(\frac{1}{2}) = \sqrt{\pi}$, $\Gamma(1) = 1$, and $\Gamma(x + 1) = x\Gamma(x)$:

$$
\begin{aligned}
R(f_{W_n}) - R(f_{W^*}) &\le \sup_{f_W \in \mathcal{F}_W} \left| \mathbb{E}_v \frac{1}{2n} \sum_{ij} g\!\left( \Big( \frac{V - WH}{\gamma} \Big)^2_{ij} \right) - \frac{1}{2n} \sum_{ij} g\!\left( \Big( \frac{V - WH}{\gamma} \Big)^2_{ij} \right) \right| \\
&\le \min_{\epsilon} \left\{ 2\epsilon + \frac{a^2}{\gamma^2} \sqrt{ \left[ mr \ln\!\left( \frac{4}{m\pi^{1/2}} \Big( 8ra^2 + \frac{2a^2}{r^2} + \frac{\sigma a^2}{\sqrt{2}\, r^3} + \frac{2r\sigma^2 a^2}{r^4} \Big)^{mr} \Big/ \Gamma\Big(\frac{m}{2}\Big)^{\frac{1}{m}} 2\epsilon \right) + \ln \frac{2}{\delta} \right] \Big/ 2n } \right\} \\
&\le \frac{2}{n} + \frac{a^2}{\gamma^2} \sqrt{ \left[ mr \ln\!\left( \frac{4}{m\pi^{1/2}} \Big( 8ra^2 + \frac{2a^2}{r^2} + \frac{\sigma a^2}{\sqrt{2}\, r^3} + \frac{2r\sigma^2 a^2}{r^4} \Big)^{mr} n \Big/ \Gamma\Big(\frac{m}{2}\Big)^{\frac{1}{m}} 2 \right) + \ln \frac{2}{\delta} \right] \Big/ 2n }. \quad (7)
\end{aligned}
$$

Remark 1. Theorem 1 shows that under the setting of our proposed Truncated CauchyNMF, the expected reconstruction error $R(f_{W_n})$ will converge to $R(f_{W^*})$ with a fast rate of order $O(\sqrt{\ln n / n})$, which means that when the sample size $n$ is large, the distance between $R(f_{W_n})$ and $R(f_{W^*})$ will be small. Moreover, if $n$ is large and a local minimizer $W$ (obtained by optimizing the non-convex objective of Truncated CauchyNMF) is close to the global minimizer $W_n$, the local minimizer will also be close to the optimal $W^*$.

Remark 2. Theorem 1 also implies that for any $W$ learned from (2), the corresponding empirical reconstruction error $R_n(f_W)$ will converge to its expectation with a specific rate guarantee, which means our proposed Truncated CauchyNMF can generalize well to unseen data.

Note that noise sampled from the Cauchy distribution is not necessarily bounded because the Cauchy distribution is heavy-tailed, and bounded observations always imply bounded noise. However, Theorem 1 keeps the boundedness assumption on the observations for two reasons: (1) the truncated loss function indicates that the observations corresponding to unbounded noise are discarded, and (2) in real applications, the energy of observations should be bounded, which means their L2-norms are bounded.

2.2 Robustness Analysis

We next compare the robustness of Truncated CauchyNMF with those of other NMF models by using a sample-weighted procedure interpretation [20]. The sample-weighted procedure compares the robustness of different algorithms from the optimization viewpoint.

Let $F(WH)$ denote the objective function of any NMF problem and $f(t) = F(tWH)$, where $t \in \mathbb{R}$. We can verify that the NMF problem is equivalent to finding a pair $(W, H)$ such that $f'(1) = 0$ (see footnote 3), where $f'(t)$ denotes the derivative of $f(t)$. Let $c(V_{ij}, WH) = (V - WH)_{ij}(-WH)_{ij}$ be the contribution of the $j$th entry of the $i$th training example to the optimization procedure and $e(V_{ij}, WH) = |V - WH|_{ij}$ be an error function. Note that we choose $c(V_{ij}, WH)$ as the basis of contribution because we choose NMF, which aims to find a pair $(W, H)$ such that $\sum_{ij} c(V_{ij}, WH) = 0$ and is sensitive to noise, as the baseline for comparing robustness. Also note that $e(V_{ij}, WH)$ represents the noise added to the $(i,j)$th entry of $V$. The sample-weighted procedure interprets the optimization procedure as being contribution-weighted with respect to the noise.

We compare $f'(1)$ of a family of NMF models in Table 1. Note that since multiplying $f'(1)$ by a constant will not change its zero points, we can normalize the weights of the different NMF models to unity when the noise is equal to zero. During the optimization procedure, robust algorithms should assign a small weight to an entry of the training set with large noise. Therefore, by comparing the derivative $f'(1)$, we can easily make the following statements: (1) L1-NMF (see footnote 4) is more robust to noise and outliers than NMF, and Huber-NMF combines the ideas of NMF and L1-NMF; (2) HCNMF, L2,1-NMF, RCNMF, and RNMF-L1 work similarly to L1-NMF because their weights are of order $O(1/e(V_{ij}, WH))$ with respect to the noise. It also becomes clear that HCNMF, L2,1-NMF, and RCNMF exploit some data structure information because their weights include the neighborhood information of $e(V_{ij}, WH)$, and that RNMF-L1 is less sensitive to noise because it employs a sparse matrix $S$ to adjust the weights; (3) the sample-weighted procedure interpretation also illustrates why CIM-NMF works well for heavy noise: its weights decrease exponentially when the noise is large; and (4) for the proposed Truncated CauchyNMF, when the noise is larger than a threshold, its weights drop directly to zero, which decreases far faster than the weights of CIM-NMF, and thus Truncated CauchyNMF is very robust to extreme outliers. Finally, we conclude that Truncated CauchyNMF is more robust than the other NMF models with respect to extreme outliers because it has the power to assign smaller weights to corrupted entries.

3 HALF-QUADRATIC PROGRAMMING ALGORITHM FOR TRUNCATED CAUCHYNMF

Note that Truncated CauchyNMF (2) cannot be solved directly because the energy function $g(x)$ is non-quadratic. We present a half-quadratic (HQ) programming algorithm based on conjugate function theory [9]. To adopt the HQ algorithm, we transform (2) into the following maximization form:


3. When minimizing $F(WH)$, the low-rank matrices $W$ and $H$ are updated alternately. Fixing one of them and optimizing the other implies that $f'(1) = 0$. In other words, if $f'(1) \ne 0$, neither $W$ nor $H$ can be a minimizer.

4. For the soundness of defining the subgradient of the L1-norm, we state that $\frac{0}{0}$ can be any value in $[-1, 1]$.


$$ \max_{W \ge 0, H \ge 0} \; \frac{1}{2} \sum_{ij} f\!\left( \Big( \frac{V - WH}{\gamma} \Big)^2_{ij} \right), \quad (8) $$

where $f(x) = -g(x)$ is the core function utilized in HQ. Since the negative logarithmic function is convex, $f(x)$ is also convex.

3.1 HQ-Based Alternating Optimization

Generally speaking, the half-quadratic programming algorithm [9] reformulates the non-quadratic loss function as an augmented loss function in an enlarged parameter space by introducing an additional auxiliary variable based on convex conjugation theory [3]. HQ is equivalent to the quasi-Newton method [24] and has been widely applied in non-quadratic optimization.

Note that the function $f(x): \mathbb{R}_+ \to \mathbb{R}$ is continuous, and according to [3], its conjugate $f^*(y): \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ is defined as

$$ f^*(y) = \max_{x \in \mathbb{R}_+} \{ xy - f(x) \}. $$

Since $f(x)$ is convex and closed (although the domain $\mathbb{R}_+$ is open, $f(x)$ is closed; see Section A.3.3 in [3]), the conjugate of its conjugate function is itself [3], i.e., $f^{**} = f$. We then have:

Theorem 2. The core function $f(x)$ and its conjugate $f^*(y)$ satisfy

$$ f(x) = \max_{y} \{ yx - f^*(y) \}, \quad x \in \mathbb{R}_+, \quad (9) $$

and the maximizer is $y^* = \begin{cases} -1/(1 + x), & 0 \le x \le \sigma \\ 0, & x > \sigma. \end{cases}$
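As a quick check on the maximizer stated in Theorem 2 (a sketch of the standard conjugate-duality argument, not the paper's proof), note that the supremum in (9) is attained at a (sub)gradient of $f$:

```latex
% Envelope/conjugacy argument: for a closed convex f, the maximizer of
% y x - f^*(y) over y lies in the subdifferential of f at x.
% On 0 <= x <= sigma the core function is f(x) = -ln(1 + x), hence
\[
  y^* = f'(x) = -\frac{1}{1+x}, \qquad 0 \le x \le \sigma,
\]
% while for x > sigma the truncated core function is the constant
% f(x) = -ln(1 + sigma), so its derivative vanishes and
\[
  y^* = 0, \qquad x > \sigma,
\]
% matching the maximizer given in Theorem 2.
```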

By substituting $x = \big( \frac{V - WH}{\gamma} \big)^2_{ij}$ into (9), we obtain the augmented loss function

$$ f\!\left( \Big( \frac{V - WH}{\gamma} \Big)^2_{ij} \right) = \max_{Y_{ij}} \left\{ Y_{ij} \Big( \frac{V - WH}{\gamma} \Big)^2_{ij} - f^*(Y_{ij}) \right\}, \quad (10) $$

where $Y_{ij}$ is the auxiliary variable introduced by HQ for $\big( \frac{V - WH}{\gamma} \big)^2_{ij}$. By substituting (10) into (8), we obtain the objective function in an enlarged parameter space

$$ \max_{W \ge 0, H \ge 0} \; \frac{1}{2} \sum_{ij} \max_{Y_{ij}} \left\{ Y_{ij} \Big( \frac{V - WH}{\gamma} \Big)^2_{ij} - f^*(Y_{ij}) \right\} = \max_{W \ge 0, H \ge 0, Y} \; \frac{1}{2} \sum_{ij} \left\{ Y_{ij} \Big( \frac{V - WH}{\gamma} \Big)^2_{ij} - f^*(Y_{ij}) \right\}, \quad (11) $$

where the equality comes from the separability of the optimization problems with respect to $Y_{ij}$.

Although the objective function in (8) is non-quadratic, its equivalent problem (11) is essentially a quadratic optimization. In this paper, HQ solves (11) based on the block coordinate descent framework. In particular, HQ recursively optimizes the following three problems at the $t$th iteration:

$$ Y^{t+1}: \; \max_{Y} \; \frac{1}{2} \sum_{ij} \left( Y_{ij} \Big( \frac{V - W^t H^t}{\gamma} \Big)^2_{ij} - f^*(Y_{ij}) \right), \quad (12) $$

$$ H^{t+1}: \; \max_{H \ge 0} \; \frac{1}{2} \sum_{ij} Y^{t+1}_{ij} \Big( \frac{V - W^t H}{\gamma} \Big)^2_{ij}, \quad (13) $$

$$ W^{t+1}: \; \max_{W \ge 0} \; \frac{1}{2} \sum_{ij} Y^{t+1}_{ij} \Big( \frac{V - W H^{t+1}}{\gamma} \Big)^2_{ij}. \quad (14) $$

TABLE 1. Comparison of the robustness of Truncated CauchyNMF with those of other NMF models.

NMF: objective $F(WH) = \|V - WH\|_F^2$; derivative $f'(1) = \sum_{ij} 2\, c(V_{ij}, WH)$.

HCNMF: objective $\sum_{ij} \big( \sqrt{1 + (V - WH)^2_{ij}} - 1 \big)$; derivative $\sum_{ij} \frac{1}{\sqrt{1 + (V - WH)^2_{ij}}}\, c(V_{ij}, WH)$.

L2,1-NMF: objective $\|V - WH\|_{2,1}$; derivative $\sum_{ij} \frac{1}{\sqrt{\sum_l (V - WH)^2_{lj}}}\, c(V_{ij}, WH)$.

RCNMF: objective $\sum_{j=1}^{n} \min\{ \|V_{\cdot j} - WH_{\cdot j}\|, \theta \}$; derivative $\sum_{j=1}^{n} \sum_i \frac{1}{\sqrt{\sum_l (V - WH)^2_{lj}}}\, c(V_{ij}, WH)$ if $\|V_{\cdot j} - WH_{\cdot j}\| \le \theta$, and $0$ if $\|V_{\cdot j} - WH_{\cdot j}\| > \theta$.

RNMF-L1: objective $\|V - WH - S\|_F^2 + \lambda \|S\|_1$; derivative $\sum_{ij} 2\big( 1 - S_{ij}/(V - WH)_{ij} \big)\, c(V_{ij}, WH)$.

L1-NMF: objective $\|V - WH\|_1$; derivative $\sum_{ij} \frac{1}{|V - WH|_{ij}}\, c(V_{ij}, WH)$.

Huber-NMF: objective $\sum_{i=1}^{m} \sum_{j=1}^{n} l((V - WH)_{ij}, \sigma)$, where $l(x, \sigma) = x^2$ if $|x| \le \sigma$ and $l(x, \sigma) = 2\sigma|x| - \sigma^2$ if $|x| > \sigma$; derivative $\sum_{ij} 2\, c(V_{ij}, WH)$ if $|V - WH|_{ij} \le \sigma$, and $\sum_{ij} \frac{2\sigma}{|V - WH|_{ij}}\, c(V_{ij}, WH)$ if $|V - WH|_{ij} > \sigma$.

CIM-NMF: objective $\sum_{i=1}^{m} \sum_{j=1}^{n} \big( 1 - \frac{1}{\sqrt{2\pi}\sigma} e^{-(V - WH)^2_{ij}/2\sigma^2} \big)$; derivative $\sum_{ij} \frac{1}{\sqrt{2\pi}\sigma^3} e^{-(V - WH)^2_{ij}/2\sigma^2}\, c(V_{ij}, WH)$.

CauchyNMF: objective $\sum_{ij} \ln\big( 1 + \big( \frac{V - WH}{\gamma} \big)^2_{ij} \big)$; derivative $\sum_{ij} \frac{2}{\gamma^2 + (V - WH)^2_{ij}}\, c(V_{ij}, WH)$.

Truncated CauchyNMF: objective $\sum_{ij} g\big( \big( \frac{V - WH}{\gamma} \big)^2_{ij} \big)$, where $g(x) = \ln(1 + x)$ if $0 \le x \le \sigma$ and $g(x) = \ln(1 + \sigma)$ if $x > \sigma$; derivative $\sum_{ij} \frac{2}{\gamma^2 + (V - WH)^2_{ij}}\, c(V_{ij}, WH)$ if $|V - WH|_{ij} \le \gamma\sqrt{\sigma}$, and $0 \cdot c(V_{ij}, WH)$ if $|V - WH|_{ij} > \gamma\sqrt{\sigma}$.

Using Theorem 2, we know that the solution of (12) can be expressed analytically as

$$ Y^{t+1}_{ij} = \begin{cases} -\dfrac{1}{1 + \big( \frac{V - W^t H^t}{\gamma} \big)^2_{ij}}, & \text{if } |(V - W^t H^t)_{ij}| \le \gamma\sqrt{\sigma} \\ 0, & \text{if } |(V - W^t H^t)_{ij}| > \gamma\sqrt{\sigma}. \end{cases} $$

Since (13) and (14) are symmetric and are intrinsically weighted non-negative least squares (WNLS) problems, they can be optimized in the same way using Nesterov's method [10]. Taking (13) as an example, the procedure of its Nesterov-based optimization is summarized in Algorithm 1, and its derivative is derived in the supplementary material, available online. Considering that (13) is a constrained optimization problem, similar to [18], we use the following projected gradient-based criterion to check the stationarity of the search point, i.e., $\nabla^P_j(h^k) = 0$, where

$$ \nabla^P_j(h^k)_l = \begin{cases} \nabla_j(h^k)_l, & (h^k)_l > 0 \\ \min\{0, \nabla_j(h^k)_l\}, & (h^k)_l = 0. \end{cases} $$

Since the above stopping criterion makes OGM run unnecessarily long, similar to [18], we use a relaxed version

$$ \| \nabla^P_j(h^k) \|_F \le \max\{\epsilon_1, 10^{-3}\} \cdot \| \nabla^P_j(h^0) \|_F, \quad (15) $$

where $\epsilon_1$ is a tolerance that controls how far the search point is from a stationary point.

Algorithm 1. Optimal Gradient Method (OGM) for WNLS

Input: $V_{\cdot j} \in \mathbb{R}^m_+$, $W^t \in \mathbb{R}^{m \times r}_+$, $H^t_{\cdot j} \in \mathbb{R}^r_+$, $D^{t+1}_j$.
Output: $H^{t+1}_{\cdot j}$.
1: Initialize $z^0 = H^t_{\cdot j}$, $h^0 = H^t_{\cdot j}$, $\alpha_0 = 1$, $k = 0$.
2: Calculate $L_j = \| W^{tT} D^{t+1}_j W^t \|_2$.
repeat
3: $\nabla_j(z^k) = W^{tT} D^{t+1}_j W^t z^k - W^{tT} D^{t+1}_j V_{\cdot j}$.
4: $h^{k+1} = P_+\big( z^k - \nabla_j(z^k)/L_j \big)$.
5: $\alpha_{k+1} = \big( 1 + \sqrt{4\alpha_k^2 + 1} \big)/2$.
6: $z^{k+1} = h^{k+1} + \frac{\alpha_k - 1}{\alpha_{k+1}} (h^{k+1} - h^k)$.
7: $k \leftarrow k + 1$.
until the stopping criterion (15) is satisfied.
8: $H^{t+1}_{\cdot j} = z^k$.
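For readers who prefer executable code to pseudocode, the following NumPy sketch mirrors Algorithm 1 for one column subproblem of (13); the fixed iteration budget and the returned iterate (the feasible point $h^k$ rather than $z^k$) are our simplifications, not part of the original algorithm.

```python
import numpy as np

def ogm_wnls(V_col, W, h_init, d_weights, tol=1e-3, max_iter=500):
    """Nesterov-style optimal gradient method for one column of (13).

    Solves min_{h >= 0} 0.5 * sum_i d_i * (V_col - W h)_i**2, i.e. the
    weighted non-negative least squares subproblem with D = diag(d_weights).
    """
    WtDW = W.T @ (d_weights[:, None] * W)        # W^T D W
    WtDv = W.T @ (d_weights * V_col)             # W^T D V_col
    L = max(np.linalg.norm(WtDW, 2), 1e-12)      # Lipschitz constant L_j

    def proj_grad(h):
        # Projected gradient used by the stopping criterion (15).
        g = WtDW @ h - WtDv
        return np.where(h > 0, g, np.minimum(g, 0.0))

    z = h_init.astype(float).copy()
    h = h_init.astype(float).copy()
    alpha = 1.0
    g0 = np.linalg.norm(proj_grad(h)) + 1e-12
    for _ in range(max_iter):
        grad = WtDW @ z - WtDv                   # gradient at the search point
        h_new = np.maximum(z - grad / L, 0.0)    # projected gradient step
        alpha_new = (1.0 + np.sqrt(4.0 * alpha ** 2 + 1.0)) / 2.0
        z = h_new + ((alpha - 1.0) / alpha_new) * (h_new - h)
        h, alpha = h_new, alpha_new
        if np.linalg.norm(proj_grad(h)) <= max(tol, 1e-3) * g0:
            break
    return h
```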

The complete procedure of the HQ algorithm is summarized in Algorithm 2. The weights of the entries and the factor matrices are updated recursively until the objective function does not change. We use the following stopping criterion to check convergence in Algorithm 2:

$$ \frac{ | F(W^t, H^t) - F(W^*, H^*) | }{ | F(W^0, H^0) - F(W^t, H^t) | } \le \epsilon_2, \quad (16) $$

where $\epsilon_2$ signifies the tolerance, $F(W, H)$ signifies the objective function of (8), and $(W^*, H^*)$ signifies a local minimizer (see footnote 5). The stopping criterion (16) implies that HQ stops when the search point is sufficiently close to the minimizer and sufficiently far from the initial point. Line 3 of Algorithm 2 updates the scale parameter by the Nagy algorithm and will be further presented in Section 3.2. Line 4 detects outliers by robust statistics and will be presented in Section 3.3.

Algorithm 2. Half-Quadratic (HQ) Programming Algorithm for Truncated CauchyNMF

Input: $V \in \mathbb{R}^{m \times n}_+$, $r \ll \min\{m, n\}$.
Output: $W$, $H$.
1: Initialize $W^0 \in \mathbb{R}^{m \times r}_+$, $H^0 \in \mathbb{R}^{r \times n}_+$, $t = 0$.
repeat
2: Calculate $E^t = V - W^t H^t$ and $Q^{t+1} = \frac{1}{1 + (E^t/\gamma)^2}$.
3: Update the scale parameter $\gamma$ based on $E^t$.
4: Detect the indices $\Omega(t)$ of outliers and set $Q^{t+1}_{\Omega(t)} = 0$.
for $j = 1, \dots, n$ do
5: Calculate $D^{t+1}_j = \mathrm{diag}(Q^{t+1}_{\cdot j})$.
6: Update $H^{t+1}_{\cdot j}$ by Algorithm 1.
end for
7: Calculate $E^t = V - W^t H^{t+1}$ and $Q^{t+1} = \frac{1}{1 + (E^t/\gamma)^2}$.
for $i = 1, \dots, m$ do
8: Calculate $D^{t+1}_i = \mathrm{diag}(Q^{t+1}_{i \cdot})$.
9: Update $W^{t+1}_{i \cdot}$ by Algorithm 1.
end for
10: $t \leftarrow t + 1$.
until the stopping criterion (16) is satisfied.
11: $W = W^t$, $H = H^t$.

The main time cost of Algorithm 2 is incurred on lines 2, 4, 5, 6, 7, 8, and 9. The time complexities of lines 2 and 7 are both $O(mnr)$. According to Algorithm 1, the time complexities of lines 6 and 9 are $O(mr^2)$ and $O(nr^2)$, respectively. Since line 4 introduces a median operator, its time complexity is $O(mn\ln(mn))$. In summary, the total complexity of Algorithm 2 is $O(mn\ln(mn) + mnr^2)$.
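To show how the pieces of Algorithm 2 fit together, here is an illustrative end-to-end sketch. It is not the paper's implementation: the per-column WNLS subproblems are solved with SciPy's NNLS on square-root-reweighted data instead of Algorithm 1, the scale parameter $\gamma$ is kept fixed rather than re-estimated by the Nagy algorithm of Section 3.2, outliers are flagged by a three-sigma rule in the spirit of Section 3.3, and the outer loop runs for a fixed number of iterations instead of criterion (16).

```python
import numpy as np
from scipy.optimize import nnls

def truncated_cauchy_nmf(V, r, gamma=1.0, n_outer=50, seed=0):
    """Illustrative sketch of the HQ loop in Algorithm 2 (simplified)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + 1e-3
    H = rng.random((r, n)) + 1e-3

    def hq_weights(E):
        # HQ weights with three-sigma outlier rejection (Section 3.3 spirit).
        Q = 1.0 / (1.0 + (E / gamma) ** 2)
        mags = np.abs(E).ravel()
        lower = mags[mags <= np.median(mags)]    # roughly Gaussian half
        Q[np.abs(E) > lower.mean() + 3.0 * lower.std()] = 0.0
        return Q

    def wnls_columns(A, B, Q):
        # min_{X >= 0} sum_ij Q_ij (B - A X)_ij^2, solved column by column.
        X = np.zeros((A.shape[1], B.shape[1]))
        for j in range(B.shape[1]):
            sw = np.sqrt(Q[:, j])
            X[:, j], _ = nnls(sw[:, None] * A, sw * B[:, j])
        return X

    for _ in range(n_outer):
        H = wnls_columns(W, V, hq_weights(V - W @ H))           # coefficients
        W = wnls_columns(H.T, V.T, hq_weights(V - W @ H).T).T   # basis
    return W, H
```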

3.2 Scale Estimation

The parameter estimation problem for the Cauchy distribution has been studied for several decades [32], [33], [34]. Nagy [32] proposed an I-divergence based method, termed the Nagy algorithm for short, to simultaneously estimate location and scale parameters. The Nagy algorithm minimizes the discrimination information (see footnote 6) between the empirical distribution of the data points and the prior Cauchy distribution with respect to the parameters. In our Truncated CauchyNMF model (2), the location parameter of the Cauchy distribution is assumed to be zero, and thus we only need to estimate the scale parameter $\gamma$.

Here we employ the Nagy algorithm to estimate the scale parameter based on all the residual errors of the data. According to [32], supposing there exist a large number of residual errors, the scale-parameter estimation problem can be formulated as

$$ \min_{\gamma} D(h_n \mid f_{0,\gamma}) = \min_{\gamma} \int_{-\infty}^{+\infty} \ln \frac{1}{f_{0,\gamma}(x)} \, dF_n(x) = \min_{\gamma} \sum_{k=1}^{N} \frac{1}{N} \ln \frac{1}{f_{0,\gamma}(x_k)}, \quad (17) $$

5. Since any local minimizer is unknown beforehand, we instead utilize $(W^{t-1}, H^{t-1})$ in our experiments.

6. The discrimination information of a random variable $\xi_1$ given a random variable $\xi_2$ is defined as $D(\xi_1 \mid \xi_2) = \int_{-\infty}^{+\infty} \ln \frac{f_1(x)}{f_2(x)} \, dF_1(x)$, where $f_1$ and $f_2$ are the PDFs of $\xi_1$ and $\xi_2$, and $F_1$ is the distribution function of $\xi_1$.


where $D(\cdot \mid \cdot)$ denotes the discrimination information, the first equality is due to the independence of $h_n$ and $\gamma$, and the second equality is due to the law of large numbers. By substituting the probability density function $f_{0,\gamma}$ of the Cauchy distribution (see footnote 7) into (17) and replacing $\{x_k\}$ with $\{E_{ij}\}$, we can rewrite (17) as follows: $\min_{\gamma} \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{1}{mn} \ln\big\{ \pi\gamma \big( 1 + (\frac{E_{ij}}{\gamma})^2 \big) \big\}$. To solve this problem, Nagy [32] proposed an efficient iterative algorithm, i.e.,

$$ \gamma_{k+1} = \gamma_k \sqrt{ 1/e'_k - 1 }, \quad k = 0, 1, 2, \dots, \quad (18) $$

where $\gamma_0 > 0$ and $e'_k = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{1}{1 + (E_{ij}/\gamma_k)^2}$. In [32], Nagy proved that the algorithm (18) converges to a fixed point assuming the number of data points is large enough; this assumption is reasonable in Truncated CauchyNMF.

7. The probability density function (PDF) of the Cauchy distribution is $f_{x_0, \gamma}(x) = 1 / \big( \pi\gamma \big( 1 + (\frac{x - x_0}{\gamma})^2 \big) \big)$, where $x_0$ is the location parameter, specifying the location of the peak of the distribution, and $\gamma$ is the scale parameter, specifying the half-width at half-maximum.
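The fixed-point iteration (18) can be transcribed directly; in the sketch below the residual matrix $E$ is assumed to be a NumPy array, and the iteration cap, tolerance, and numerical floor are added safeguards.

```python
import numpy as np

def estimate_cauchy_scale(E, gamma0=1.0, n_iter=100, tol=1e-6):
    """Nagy fixed-point iteration (18) for the Cauchy scale parameter.

    gamma_{k+1} = gamma_k * sqrt(1 / e_k - 1), where e_k is the mean over
    all entries of 1 / (1 + (E_ij / gamma_k)^2).
    """
    gamma = float(gamma0)
    for _ in range(n_iter):
        e_k = np.mean(1.0 / (1.0 + (E / gamma) ** 2))
        gamma_new = gamma * np.sqrt(max(1.0 / e_k - 1.0, 0.0))
        gamma_new = max(gamma_new, 1e-12)        # numerical floor
        converged = abs(gamma_new - gamma) <= tol * gamma
        gamma = gamma_new
        if converged:
            break
    return gamma
```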

3.3 Outlier Rejection

Looking more carefully at (12), (13), and (14), with both factor matrices $H^{t+1}$ and $W^{t+1}$ fixed, HQ intrinsically assigns a weight to each entry of $V$, i.e.,

$$ Q^{t+1}_{ij} = \begin{cases} \dfrac{1}{1 + (E^t/\gamma)^2_{ij}}, & \text{if } |E^t_{ij}| \le \gamma\sqrt{\sigma} \\ 0, & \text{if } |E^t_{ij}| > \gamma\sqrt{\sigma}, \end{cases} $$

where $E^t$ denotes the error matrix at the $t$th iteration. The larger the magnitude of the error of a particular entry, the smaller the weight assigned to it by HQ. Intuitively, a corrupted entry contributes less to learning the intrinsic subspace. If the magnitude of the error exceeds the threshold $\gamma\sqrt{\sigma}$, Truncated CauchyNMF assigns zero weight to the corrupted entry to inhibit its contribution to the learned subspace. That is how Truncated CauchyNMF filters out extreme outliers.

However, it is non-trivial to estimate the threshold $\gamma\sqrt{\sigma}$. Here, we introduce a robust statistics-based method to explicitly detect the support of the outliers instead of estimating the threshold. Since the energy function of Truncated CauchyNMF gets close to that of NMF as the error tends towards zero, i.e., $\lim_{x \to 0} (\ln(1 + x^2) - x^2) = 0$, Truncated CauchyNMF encourages the small-magnitude errors to follow a Gaussian distribution. Let $Q^t$ denote the set of magnitudes of errors at the $t$th iteration of HQ, i.e., $Q^t = \{ |E^t_{ij}| : 1 \le i \le m, 1 \le j \le n \}$, where $E^t = V - W^t H^t$. It is reasonable to believe that a subset of $Q^t$, i.e., $G^t = \{ u \in Q^t : u \le \mathrm{med}\{Q^t\} \}$, obeys a Gaussian distribution, where $\mathrm{med}\{Q^t\}$ signifies the median of $Q^t$. Since $|G^t| = \lfloor \frac{mn}{2} \rfloor$, it suffices to estimate both the mean $\mu_t$ and the standard deviation $\delta_t$ from $G^t$. According to the three-sigma rule, we detect the outliers as $O^t = \{ u \in G^t : |u - \mu_t| > 3\delta_t \}$ and output their indices $\Omega(t)$.

To illustrate the effect of outlier rejection, Fig. 3 presents a sequence of weighting matrices generated by HQ for the motivating example described in Fig. 2. It shows that HQ correctly assigns zero weights to the corrupted entries in only a few iterations and finally detects almost all outliers, including illumination, sunglasses, and scarves (see the last column in Fig. 3).

Fig. 3. An illustrative example of the sequence of weights generated by the HQ algorithm.

4 RELATED WORK

Before evaluating the effectiveness and robustness of Truncated CauchyNMF, we briefly review the state of the art of non-negative matrix factorization (NMF) and its robustified variants. We thoroughly compare the robustness of the proposed Truncated CauchyNMF with all the listed related works.

4.1 NMF

Traditional NMF [17] assumes that noise obeys a Gaussian distribution and derives the following squared L2-norm based objective function: $\min_{W \ge 0, H \ge 0} \|V - WH\|_F^2$, where $\|X\|_F = \sqrt{\sum_{ij} X^2_{ij}}$ signifies the matrix Frobenius norm. It is commonly known that NMF can be solved by using the multiplicative update rule (MUR, [17]). Because of the nice mathematical properties of the squared L2-norm and the efficiency of MUR, NMF has been extended to various applications [4], [6], [28]. However, NMF and its extensions are non-robust because the L2-norm is sensitive to outliers.
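For reference, a minimal sketch of the standard multiplicative update rule for the Frobenius-norm objective is given below; the small constant added to the denominators is a common numerical safeguard, not part of the original rule.

```python
import numpy as np

def nmf_mur(V, r, n_iter=200, eps=1e-10, seed=0):
    """Multiplicative update rule for min_{W,H >= 0} ||V - WH||_F^2."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update coefficients
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update basis
    return W, H
```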

4.2 Hypersurface Cost Based NMF

Hamza and Brady [12] proposed a hypersurface cost based NMF (HCNMF) by minimizing the summation of hypersurface costs of errors, i.e., $\min_{W \ge 0, H \ge 0} \sum_{ij} \delta\big((V - WH)_{ij}\big)$, where $\delta(x) = \sqrt{1 + x^2} - 1$ is the hypersurface cost function. According to [12], the hypersurface cost function has a differentiable and bounded influence function. Since the hypersurface cost function is differentiable, HCNMF can be directly solved by using the projected gradient method. However, the optimization of HCNMF is difficult because Armijo's rule based line search is time consuming [12].

4.3 L1-Norm Based NMF

To improve the robustness of NMF, Lam [15] assumed that noise is independent and identically distributed from a Laplace distribution and proposed L1-NMF as follows: $\min_{W \ge 0, H \ge 0} \|V - WH\|_1$, where $\|X\|_1 = \sum_{ij} |X_{ij}|$ and $|\cdot|$ signifies the absolute value function. Since the L1-norm based loss function is non-smooth, the optimization algorithm in [15] is not scalable on large-scale datasets. Manhattan NMF (MahNMF, [11]) remedies this problem by approximating the loss function of L1-NMF with a smooth function and minimizing the approximated loss function using Nesterov's method. Although L1-NMF is less sensitive to outliers than NMF, it is not sufficiently robust because its breakdown point is related to the dimensionality of the data [7].

4.4 L1-Norm Regularized Robust NMF

Zhang et al. [29] assumed that the dataset contains both Laplace distributed noise and Gaussian distributed noise and proposed an L1-norm regularized Robust NMF (RNMF-L1) as follows: $\min_{W \ge 0, H \ge 0, S} \big\{ \|V - WH - S\|_F^2 + \lambda \|S\|_1 \big\}$, where $\lambda$ is a positive constant that trades off the sparsity of $S$. Similar to L1-NMF, RNMF-L1 is also less sensitive to outliers than NMF, but both are non-robust to large numbers of outliers because the L1-minimization model has a low breakdown point. Moreover, it is non-trivial to determine the tradeoff parameter $\lambda$.

4.5 L2,1-Norm Based NMF

Since NMF is essentially a summation of the squared L2-norms of the errors, large-magnitude errors dominate the objective function and cause NMF to be non-robust. To solve this problem, Kong et al. [14] proposed the L2,1-norm based NMF (L2,1-NMF), which minimizes the L2,1-norm of the error matrix, i.e., $\min_{W \ge 0, H \ge 0} \|V - WH\|_{2,1}$, where the L2,1-norm is defined as $\|E\|_{2,1} = \sum_{j=1}^{n} \|E_{\cdot j}\|_2$. In contrast to NMF, L2,1-NMF is more robust because the influence of noisy examples is inhibited in learning the subspace.

4.6 Robust Capped Norm NMF

Gao et al. [47] proposed a robust capped norm NMF (RCNMF) to completely filter out the effect of outliers by instead minimizing the following objective function: $\min_{W \ge 0, H \ge 0} \sum_{j=1}^{n} \min\{ \|V_{\cdot j} - WH_{\cdot j}\|, \theta \}$, where $\theta$ is a threshold that chooses the outlier samples. RCNMF cannot be applied in practical applications because it is non-trivial to determine the pre-defined threshold, and the iterative algorithms utilized in both [14] and [47] converge slowly with the successive use of the power method [1].

4.7 Correntropy Induced Metric Based NMF

The most closely-related work is the half-quadratic algorithm for optimizing robust NMF, which includes the Correntropy-Induced Metric (CIM) based NMF (CIM-NMF) and Huber-NMF by Du et al. [8]. CIM-NMF measures the approximation errors by using the CIM [19], i.e., $\min_{W \ge 0, H \ge 0} \sum_{i=1}^{m} \sum_{j=1}^{n} \rho\big((V - WH)_{ij}, \delta\big)$, where $\rho(x, \delta) = 1 - \frac{1}{\sqrt{2\pi}\delta} e^{-x^2/2\delta^2}$. Since the energy function $\rho(x, \delta)$ increases slowly as the error increases, CIM-NMF is insensitive to outliers. In a similar way, Huber-NMF [8] measures the approximation errors by using the Huber function, i.e., $\min_{W \ge 0, H \ge 0} \sum_{i=1}^{m} \sum_{j=1}^{n} l\big((V - WH)_{ij}, c\big)$, where $l(x, c) = \begin{cases} x^2, & |x| \le c \\ 2c|x| - c^2, & |x| > c \end{cases}$ and the cutoff $c$ is automatically determined by $c = \mathrm{med}\{ |(V - WH)_{ij}| \}$.

Truncated CauchyNMF is different from both CIM-NMF and Huber-NMF in four aspects: (1) Truncated CauchyNMF is derived from the proposed Truncated Cauchy loss, which can model both moderate and extreme outliers, whereas neither CIM-NMF nor Huber-NMF can do that; (2) Truncated CauchyNMF demonstrates strong evidence of both robustness and generalization ability, whereas neither CIM-NMF nor Huber-NMF demonstrates evidence of either; (3) Truncated CauchyNMF iteratively detects outliers by robust statistics on the magnitude of errors, and thus performs more robustly than CIM-NMF and Huber-NMF in practice; and (4) Truncated CauchyNMF obtains the optimum for each factor in each iteration round by solving weighted non-negative least squares (WNLS) problems, whereas the multiplicative update rules for CIM-NMF and Huber-NMF do not.

5 EXPERIMENTAL VERIFICATION

We explore both the robustness and the effectiveness of Truncated CauchyNMF on two popular face image datasets, ORL [27] and AR [22], and one object image dataset, Caltech 101 [44], by comparing with six typical NMF models: (1) L2-NMF [16] optimized by NeNMF [10]; (2) L1-NMF [15] optimized by MahNMF [11]; (3) RNMF-L1 [29]; (4) L2,1-NMF [14]; (5) CIM-NMF [8]; and (6) Huber-NMF [8]. We first present a toy example to intuitively show the robustness of Truncated CauchyNMF and several clustering experiments on the contaminated ORL dataset to confirm its robustness. We then analyze the effectiveness of Truncated CauchyNMF by clustering and recognizing face images in the AR dataset, and by clustering object images in the Caltech 101 dataset.

5.1 An Illustrative Study

To illustrate Truncated CauchyNMF’s ability to learn a sub-space, we apply Truncated CauchyNMF on a synthetic data-set composed of 180 two-dimensional data points (seeFig. 4a). All data points are distributed in a one-dimensionalsubspace, i.e., a straight line (y ¼ 0:2x). Both L2-NMF andL1-NMF are applied on this synthetic dataset for comparison.

Fig. 4a shows that all methods learn the intrinsic subspace correctly on the clean dataset. Figs. 4b, 4c, and 4d demonstrate the robustness of Truncated CauchyNMF on a noisy dataset. First, we randomly select 20 data points and contaminate their x-coordinates, with their y-coordinates retained, to simulate outliers. Fig. 4b shows that L2-NMF fails to recover the subspace in the presence of 1/9 outliers, while both Truncated CauchyNMF and L1-NMF perform robustly in this case. However, the robustness of L1-NMF decreases as the outliers increase. To study this point, we randomly select another 20 data points and contaminate their x-coordinates. Fig. 4c shows that both L2-NMF and L1-NMF fail to recover the subspace, but Truncated CauchyNMF succeeds. To study the robustness of Truncated CauchyNMF on seriously corrupted datasets, we randomly select an additional 40 data points as outliers. We contaminate their y-coordinates while keeping their x-coordinates unchanged. Fig. 4d shows that Truncated CauchyNMF still recovers the intrinsic subspace in the presence of 4/9 outliers, while both L2-NMF and L1-NMF fail in this case. In other words, the breakdown point of Truncated CauchyNMF is greater than 44.4 percent, which is quite close to the highest possible breakdown point of 50 percent.
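A similar synthetic setting can be generated along the following lines; the sampling range and corruption magnitudes below are placeholders, since the text does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# 180 non-negative 2-D points lying on the line y = 0.2 x (the 1-D subspace);
# the sampling range is an assumption.
x = rng.uniform(0.0, 10.0, size=180)
data = np.column_stack([x, 0.2 * x])

# Outliers: corrupt the x-coordinates of 20 random points (magnitudes assumed).
idx = rng.choice(180, size=20, replace=False)
data[idx, 0] += rng.uniform(5.0, 20.0, size=20)

V = data.T   # 2 x 180 non-negative data matrix with samples as columns
```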


5.2 Simulated Corruption

We first evaluate Truncated CauchyNMF's robustness to simulated corruptions. To this end, we add three typical corruptions, i.e., Laplace noise, Salt & Pepper noise, and randomly positioned blocks, to frontal face images from the Cambridge ORL database and compare the clustering performance of our method with that of other methods on these contaminated images. Fig. 5 shows example face images contaminated by these corruptions.

The Cambridge ORL database [27] contains 400 frontal face photos of 40 individuals. There are 10 photos of each individual with a variety of lighting, facial expressions, and facial details (with glasses or without glasses). All photos were taken against the same dark background and each photo was cropped to a 32 x 32 pixel array and normalized to a long vector. The clustering performance is evaluated by two metrics, namely accuracy and normalized mutual information [22]. The number of clusters is set equal to the number of individuals, i.e., 40. Intuitively, the better a model clusters contaminated images, the more robust it is for learning the subspace. In this experiment, we utilize K-means [21] as a baseline. To quantify the robustness of all NMF models, we compare their relative reconstruction errors, i.e., $\|\hat{V} - WH\|_F / \|\hat{V}\|_F$, where $\hat{V}$ denotes the clean dataset, and $W$ and $H$ signify the factorization results on the contaminated dataset.

5.2.1 Laplace Noise

Laplace noise exists in many types of observations, e.g., gradient-based image features such as SIFT [31], but classical NMF cannot deal with such data because the noise distribution violates its assumption. In this experiment, we study Truncated CauchyNMF's capacity to deal with Laplace noisy data. We simulate Laplace noise by adding random noise to each pixel of each face image from ORL, where the noise obeys a Laplace distribution $\mathrm{Laplace}(0, \delta)$. For the purpose of verifying the robustness of Truncated CauchyNMF, we vary the deviation $\delta$ from 40 to 280 because the maximum pixel value is 255. Fig. 5a gives an example face image and its seven noisy versions obtained by adding Laplace noise. Figs. 6a and 6b present the means and standard deviations of the accuracy and normalized mutual information of Truncated CauchyNMF and the representative models.
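This corruption can be reproduced with a short helper along these lines; clipping to [0, 255] is our assumption, made only to keep the corrupted data non-negative for NMF.

```python
import numpy as np

def add_laplace_noise(images, d, rng=None):
    """Add pixel-wise Laplace(0, d) noise to an (n_pixels, n_images) array."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = images + rng.laplace(loc=0.0, scale=d, size=images.shape)
    return np.clip(noisy, 0.0, 255.0)   # assumed clipping to keep data non-negative

# Example usage with one of the deviation levels from the experiments:
# noisy_images = add_laplace_noise(clean_images, d=120)
```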

Fig. 6 confirms that the NMF models outperform K-means in terms of accuracy and normalized mutual information. L1-NMF outperforms L2-NMF and L2,1-NMF because L1-NMF models Laplace noise better. L1-NMF outperforms RNMF-L1 because L1-NMF assigns smaller weights to large noise than RNMF-L1. CIM-NMF and Huber-NMF perform comparably with L1-NMF when the deviation of the Laplace noise is moderate. However, as the deviation increases, their performance is dramatically reduced because large-magnitude outliers seriously influence the factorization results. In contrast, Truncated CauchyNMF outperforms all the representative NMF models and remains stable as the deviation varies.

The clustering performance in Fig. 6 confirms Truncated CauchyNMF's effectiveness in learning the subspace on the ORL dataset contaminated by Laplace noise. Table 2 compares the relative reconstruction errors of Truncated CauchyNMF and the representative algorithms. It shows that Truncated CauchyNMF performs the most robustly in all situations. That is because Truncated CauchyNMF can not only model the simulated Laplace noise but also the underlying outliers, e.g., glasses, in the ORL dataset.

Fig. 5. Example face images from the ORL database: (a) an example face image and its noisy versions with Laplace noise of deviation $\delta = 40, 80, 120, 160, 200, 240, 280$; (b) an example face image and its noisy versions, where $p\%$ of pixels are contaminated by Salt & Pepper noise and $p = 5, 10, 20, 30, 40, 50, 60$; (c) an example face image and its occluded versions with $b \times b$ blocks, $b = 10, 12, 14, 16, 18, 20, 22$.

Fig. 4. Learning the one-dimensional subspace (i.e., a straight line) from 180 synthetic two-dimensional data points by L2-NMF, L1-NMF, and Truncated CauchyNMF in four cases: (a) clean dataset, (b) 20 points contaminated in the x-direction, (c) 40 points contaminated in the x-direction, and (d) 80 points contaminated in both directions.

Fig. 6. Evaluation on frontal face images of the ORL database contaminated by Laplace noise: (a) average accuracy and standard deviation of K-means, L2-NMF, L2,1-NMF, RNMF-L1, L1-NMF, Huber-NMF, CIM-NMF, and Truncated CauchyNMF; (b) comparison of average normalized mutual information and standard deviation.


5.2.2 Salt & Pepper Noise

Salt & Pepper noise is a common type of corruption inimages. The removal of Salt & Pepper noise is a challengingtask in computer vision since this type of noise contami-nates each pixel by zero or the maximum pixel value, andthe noise distribution violates the noise assumption of tradi-tional learning models. In this experiment, we verify Trun-cated CauchyNMF’s capacity to handle Salt & Peppernoises. We add Salt & Pepper noise to each frontal faceimage of the ORL dataset (see Fig. 5b for the contaminatedface images of a certain individual) and compare the cluster-ing performance of Truncated CauchyNMF on the contami-nated dataset with that of the representative algorithms. Todemonstrate the robustness of Truncated CauchyNMF, wevary the percentage of corrupted pixels from 5 to 60 percent.For each case of additive Salt & Pepper noise, we repeat theclustering test 10 times and report the average accuracy andaverage normalized mutual information to eliminate theeffect of initial points.

Fig. 7 shows that all models perform satisfactorily when 5 percent of the pixels of each image are corrupted. As the number of corrupted pixels increases, the classical L2-NMF is seriously influenced by the Salt & Pepper noise and its performance is dramatically reduced. Although L1-NMF, Huber-NMF, and CIM-NMF perform more robustly than L2-NMF, their performance also degrades when more than 40 percent of the pixels are corrupted. Truncated CauchyNMF performs quite stably even when 40 percent of the pixels are corrupted and outperforms all the representative models in most cases. All the models fail when 60 percent of the pixels are corrupted, because it is difficult to distinguish inliers from outliers in this case.

Table 3 gives a comparison of Truncated CauchyNMF and the representative algorithms in terms of relative reconstruction error. It shows that L1-NMF, Huber-NMF, CIM-NMF, and Truncated CauchyNMF perform comparably when less than 20 percent of the pixels are corrupted, but the robustness of L1-NMF, Huber-NMF, and CIM-NMF becomes unstable as the percentage of corrupted pixels increases. Truncated CauchyNMF performs stably when 30 to 50 percent of the pixels are corrupted by Salt & Pepper noise. This confirms the robustness of Truncated CauchyNMF.

5.2.3 Contiguous Occlusion

The removal of contiguous segments of an object due to occlusion is a challenging problem in computer vision. Many techniques, such as L1-norm minimization and nuclear norm minimization, are unable to handle this problem. In this experiment, we utilize contiguous occlusion to simulate extreme outliers. Specifically, we randomly position a b × b block on each face image of the ORL dataset and fill each block with a pixel array whose pixel values equal 550. To verify the effectiveness of subspace learning, we apply both K-means and all NMF models to the contaminated dataset and compare the clustering performance in terms of both accuracy and normalized mutual information. This task is quite challenging because large numbers of outliers with large magnitudes must be ignored to learn a clean subspace. To study the influence of outliers, we vary the block size b from 10 to 22, where the minimum and maximum block sizes correspond to 10 and 50 percent outliers, respectively. Fig. 5c shows the occluded face images of a certain individual.
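
The occlusion procedure can be sketched as follows; the image height and width are passed in as parameters since the cropped ORL image size is not restated here, and the fill value 550 follows the description above:

    import numpy as np

    def occlude_with_block(images, image_shape, b, fill_value=550, rng=None):
        # Place one randomly positioned b-by-b block filled with `fill_value`
        # on each image; images are rows of flattened pixel arrays.
        rng = np.random.default_rng() if rng is None else rng
        h, w = image_shape
        occluded = images.copy().astype(float)
        for i in range(occluded.shape[0]):
            top = rng.integers(0, h - b + 1)
            left = rng.integers(0, w - b + 1)
            img = occluded[i].reshape(h, w)
            img[top:top + b, left:left + b] = fill_value
            occluded[i] = img.ravel()
        return occluded

    # Hypothetical usage (image_shape is whatever size the ORL faces are stored at):
    # V_occluded = occlude_with_block(V_clean, image_shape=(56, 46), b=16)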

Table 4 shows that K-means, L2-NMF, L2,1-NMF, RNMF-L1, L1-NMF, Huber-NMF, and CauchyNMF8 are seriously deteriorated by the added contiguous occlusions. Although CIM-NMF performs robustly when the percentage of outliers is moderate, i.e., 10 percent (corresponding to b = 10) and 14 percent (corresponding to b = 12), its performance is unstable when the percentage of outliers reaches 20 percent (corresponding to b = 14). This is because CIM-NMF keeps energies for extreme outliers, allowing a large number of extreme outliers to dominate the objective function. By contrast, Truncated CauchyNMF reduces the energies of extreme outliers to zero, and thus performs robustly when the percentage of outliers is less than 40 percent (corresponding to b = 20).

TABLE 2
Relative Reconstruction Error (%) of L2-NMF, L2,1-NMF, RNMF-L1, L1-NMF, Huber-NMF, CIM-NMF, and Truncated CauchyNMF on the ORL Dataset Contaminated by Laplace Noise with Deviation Varying from 40 to 280

d     L2-NMF        L2,1-NMF      RNMF-L1       L1-NMF        Huber-NMF     CIM-NMF       Truncated CauchyNMF
40    14.78(0.01)   16.68(0.04)   17.18(0.09)   13.56(0.04)   14.05(0.07)   15.93(0.08)   13.41(0.04)
80    24.91(0.02)   25.39(0.03)   21.30(0.11)   17.18(0.05)   17.53(0.04)   16.27(0.06)   14.70(0.06)
120   36.30(0.02)   35.65(0.07)   24.66(0.08)   21.33(0.06)   21.87(0.07)   18.95(0.05)   15.94(0.07)
160   47.48(0.03)   46.08(0.04)   27.49(0.06)   25.38(0.06)   26.47(0.08)   22.11(0.05)   16.88(0.13)
200   59.18(0.04)   57.35(0.04)   30.27(0.11)   29.73(0.10)   31.72(0.16)   25.84(0.07)   18.10(0.13)
240   70.70(0.03)   68.52(0.07)   32.96(0.15)   33.98(0.17)   37.12(0.55)   29.68(0.11)   19.88(0.55)
280   82.06(0.04)   79.78(0.09)   35.81(0.17)   38.13(0.37)   43.07(0.62)   33.67(0.12)   27.23(4.06)

Fig. 7. Evaluation on frontal face images of the ORL database contaminated by Salt & Pepper noise: (a) average accuracy and standard deviation of K-means, L2-NMF, L2,1-NMF, RNMF-L1, L1-NMF, Huber-NMF, CIM-NMF, and Truncated CauchyNMF, (b) comparison of average normalized mutual information and standard deviation.

8. In this experiment, we compare with CauchyNMF to show the effect of truncation. For CauchyNMF, we set s = +∞ and adopt the proposed HQ algorithm to solve it.


5.3 Real-Life Corruption

The previous section evaluated the robustness of Truncated CauchyNMF under several types of synthetic outliers, including Laplace noise, Salt & Pepper noise, and contiguous occlusion. The experimental results show that our method consistently learns the subspace even when half the pixels in each image are corrupted, while the other NMF models fail under this extreme condition. In this section, we evaluate Truncated CauchyNMF's ability to learn the subspace under natural sources of corruption, e.g., contiguous disguise in the AR dataset and object variations in the Caltech-101 dataset.

5.3.1 Contiguous Disguise

The Purdue AR dataset [22] contains 2,600 frontal face images of 100 individuals, 50 males and 50 females, taken in two sessions. There are 13 images per individual in each session, including one normal image, three images depicting different facial expressions, three images under varying illumination conditions, three images with sunglasses, and three images with a scarf. Each image is cropped into a 55 × 40 pixel array and reshaped into a 2,200-dimensional vector. Fig. 8 gives 20 example images of two individuals and shows that the images with disguises, i.e., sunglasses and a scarf, are seriously contaminated by outliers. It is therefore quite challenging to correctly group these contaminated images, e.g., the 4th, 5th, 9th, and 10th columns in Fig. 8, with the clean images, e.g., the 1st and 6th columns in Fig. 8. Since the results in Section 5.2.3 show that Truncated CauchyNMF handles contiguous occlusions with extreme outliers well, we demonstrate its effectiveness on this task.

To evaluate the effectiveness of Truncated CauchyNMF in clustering, we randomly select between two and ten images of each individual to comprise the dataset. By concatenating all the long vectors, we obtain an image intensity matrix denoted as V. We then apply NMF to V to learn the subspace, i.e., V ≈ WH, where the rank of W and H equals the number of clusters. Lastly, we output the cluster labels by performing K-means on H. To eliminate the influence of randomness, we repeat this trial 50 times and report the averaged accuracy and averaged normalized mutual information for comparison.
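
This evaluation pipeline can be sketched as below, with scikit-learn's Euclidean NMF standing in for the factorization step (the Truncated CauchyNMF solver itself is not reproduced here); the cluster labels are obtained by running K-means on the coefficient matrix H:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import NMF
    from sklearn.metrics import normalized_mutual_info_score

    def cluster_by_nmf(V, n_clusters, random_state=0):
        # V is the (n_features x n_samples) image intensity matrix; returns the
        # cluster labels obtained by running K-means on the coefficients H.
        model = NMF(n_components=n_clusters, init='nndsvda',
                    max_iter=500, random_state=random_state)
        W = model.fit_transform(V)    # basis,        (n_features, n_clusters)
        H = model.components_         # coefficients, (n_clusters, n_samples)
        kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
        return kmeans.fit_predict(H.T)

    # Hypothetical usage:
    # labels = cluster_by_nmf(V, n_clusters=10)
    # nmi = normalized_mutual_info_score(true_labels, labels)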

Fig. 9 gives both the average accuracy and the average normalized mutual information in relation to the number of clusters for Truncated CauchyNMF and the other NMF models. It shows that Truncated CauchyNMF consistently achieves the highest clustering performance on the AR dataset. This result confirms that Truncated CauchyNMF learns the subspace more effectively than the other NMF models, even when the images are contaminated by contiguous disguises such as sunglasses and a scarf.

We further conduct a face recognition experiment on the AR dataset to evaluate the effectiveness of Truncated CauchyNMF. In this experiment, we treat the images taken in the first session as the training set and the images taken in the second session as the test set. This task is challenging because (1) the distribution of the training set is different from that of the test set, and (2) both training and test sets

TABLE 3
Relative Reconstruction Error (%) of L2-NMF, L2,1-NMF, RNMF-L1, L1-NMF, Huber-NMF, CIM-NMF, and Truncated CauchyNMF on the ORL Dataset Contaminated by Salt & Pepper Noise with the Percentage of Corrupted Pixels Varying from 5 to 60 Percent

p     L2-NMF        L2,1-NMF      RNMF-L1       L1-NMF        Huber-NMF     CIM-NMF       Truncated CauchyNMF
5     12.51(0.03)   14.50(0.05)   14.36(0.10)   11.33(0.04)   12.00(0.06)   13.05(0.09)   12.37(0.05)
10    15.36(0.02)   16.44(0.04)   14.93(0.12)   11.50(0.05)   12.03(0.07)   12.25(0.11)   12.27(0.06)
20    20.30(0.03)   19.99(0.07)   15.97(0.08)   11.98(0.04)   12.29(0.04)   12.18(0.08)   12.00(0.06)
30    24.44(0.03)   23.33(0.08)   17.47(0.11)   13.25(0.09)   13.24(0.06)   13.86(0.09)   11.80(0.04)
40    28.30(0.03)   26.59(0.06)   19.11(0.06)   16.75(0.08)   17.23(0.11)   18.94(0.13)   12.35(0.06)
50    31.51(0.04)   29.46(0.06)   21.71(0.11)   22.49(0.67)   26.82(0.33)   28.54(0.22)   22.97(0.28)
60    24.28(0.03)   31.92(0.04)   26.33(0.13)   29.62(0.17)   34.30(0.22)   39.10(0.63)   35.26(0.13)

TABLE 4
Average Accuracy (%) and Average Normalized Mutual Information (%) of K-Means, L2-NMF, L2,1-NMF, RNMF-L1, L1-NMF, Huber-NMF, CauchyNMF, CIM-NMF, and Truncated CauchyNMF on the Occluded ORL Dataset with Block Size b Varying from 10 to 22 with Step Size 2

b    K-means        L2-NMF         L2,1-NMF       RNMF-L1        L1-NMF         Huber-NMF      CauchyNMF      CIM-NMF        Truncated CauchyNMF
10   17.20(39.77)   17.10(39.80)   17.20(39.71)   17.50(39.47)   17.65(38.87)   17.68(38.95)   19.27(39.27)   58.48(75.41)   57.80(73.94)
12   17.65(39.95)   17.38(39.43)   17.10(39.90)   17.33(39.46)   17.33(38.79)   17.73(38.96)   18.57(38.96)   56.05(73.36)   58.23(74.31)
14   17.63(40.10)   17.25(39.43)   17.45(39.92)   17.35(39.11)   17.70(39.18)   17.65(38.77)   19.03(39.18)   26.88(46.63)   55.38(71.94)
16   17.55(39.95)   17.25(39.72)   17.20(39.82)   17.25(39.37)   17.43(38.90)   17.35(38.80)   18.36(19.01)   21.18(42.40)   47.30(65.39)
18   16.78(39.09)   17.00(39.06)   16.90(39.73)   16.95(38.67)   16.88(38.20)   17.00(38.41)   17.90(38.48)   23.93(45.21)   42.93(61.84)
20   17.35(39.40)   17.15(39.25)   17.20(39.56)   17.15(38.59)   17.08(38.52)   17.00(38.45)   17.40(38.08)   22.33(43.64)   37.48(57.57)
22   17.15(39.38)   16.75(38.82)   16.88(39.14)   16.90(38.45)   16.95(38.59)   17.10(38.67)   17.73(38.69)   25.38(46.39)   30.05(50.98)

Fig. 8. Face image examples of two individuals in the AR dataset, with 10 images per individual.


are seriously contaminated by outliers. We first learn a subspace by conducting Truncated CauchyNMF on the whole dataset and then classify each test image by applying the sparse representation classification method (SRC) [45] to the coefficients of both training and test images in the learned subspace. Since there are 100 individuals in total and the images of each individual were taken in two sessions, we set the reduced dimensionality of Truncated CauchyNMF to 200. We also run the other NMF variants with the same setting for comparison. To filter out the influence of contiguous occlusions in face recognition, Zhou et al. [46] proposed a sparse error correction method (SEC) which labels each pixel of a test image as occluded or non-occluded using a Markov random field (MRF) and learns a representation of each test image on the non-occluded pixels. Although SEC succeeds in filtering out the contiguous occlusions in the test set, it cannot handle outliers in the training set. By contrast, Truncated CauchyNMF can remove the occlusions from both training and test images, and thus boosts the performance of the subsequent classification.
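
A simplified sketch of this recognition step is given below, using an l1-regularized regression (Lasso) as a stand-in for the SRC solver of [45]; the names H_train, y_train, and h_test are hypothetical and denote the subspace coefficients of the training images, their identity labels, and the coefficients of one test image:

    import numpy as np
    from sklearn.linear_model import Lasso

    def src_classify(H_train, y_train, h_test, alpha=0.01):
        # H_train: (dim, n_train) subspace coefficients of the training images.
        # y_train: (n_train,) identity labels; h_test: (dim,) coefficients of
        # one test image. Lasso stands in for the l1 solver used by SRC.
        y_train = np.asarray(y_train)
        lasso = Lasso(alpha=alpha, max_iter=5000)
        lasso.fit(H_train, h_test)
        x = lasso.coef_
        # Assign the identity whose training coefficients reconstruct h_test best.
        best_class, best_residual = None, np.inf
        for c in np.unique(y_train):
            mask = (y_train == c)
            residual = np.linalg.norm(h_test - H_train[:, mask] @ x[mask])
            if residual < best_residual:
                best_class, best_residual = c, residual
        return best_class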

Table 5 shows the face recognition accuracies of the NMF variants and SEC. In the AR dataset, each individual has one normal image and twelve contaminated images taken under different conditions, including varying facial expressions, varying illumination, wearing sunglasses, and wearing a scarf. In this experiment, we report results not only on the whole test set but also separately on the test images taken under each condition. Table 5 shows that Truncated CauchyNMF performs best in most cases; in particular, it performs almost perfectly on the normal images. This validates that Truncated CauchyNMF can learn an effective subspace from

the contaminated data. In most situations, SEC performs excellently, but the last two columns indicate that the contaminated training images seriously weaken SEC. Truncated CauchyNMF performs well in these situations because it effectively removes the influence of outliers in the subspace learning stage.

5.3.2 Object Variation

The Caltech 101 dataset [44] contains pictures of objects from 101 categories. The number of pictures per category varies from 40 to 800. Fig. 10 shows example images from six categories: dolphin, butterfly, sunflower, watch, pizza, and cougar_body. We extract a convolutional neural network (CNN) feature for each image using the Caffe framework [39] and an AlexNet model [42] pre-trained on ImageNet. Since objects from the same category may vary in shape, color, and size, and the pictures are taken from different viewpoints, clustering objects of the same category together is a very challenging task. We show the performance of Truncated CauchyNMF compared to the other methods, namely CIM-NMF, Huber-NMF, L2,1-NMF, RNMF-L1, L2-NMF, L1-NMF, and K-means.

Following a protocol similar to that in Section 5.3.1, we demonstrate the effectiveness of Truncated CauchyNMF in clustering objects. We test with 2 to 10 randomly selected categories. The image feature matrix is denoted as V. The NMF models are applied to V to compute the subspace, i.e., V ≈ WH, where the rank of W and H equals the number of clusters. Cluster labels are obtained by performing K-means on H. We repeat this trial 50 times and compute the averaged accuracy and normalized mutual information over all trials for comparison.
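
For completeness, clustering accuracy is obtained by matching predicted cluster indices to the ground-truth categories; the sketch below assumes the commonly used Hungarian matching on the confusion matrix, while normalized mutual information is available directly in scikit-learn:

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.metrics import normalized_mutual_info_score

    def clustering_accuracy(y_true, y_pred):
        # Best-match clustering accuracy: align predicted cluster indices with
        # the ground-truth categories via the Hungarian algorithm, then count
        # the matched samples.
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        classes, clusters = np.unique(y_true), np.unique(y_pred)
        cost = np.zeros((len(clusters), len(classes)), dtype=int)
        for i, k in enumerate(clusters):
            for j, c in enumerate(classes):
                cost[i, j] = np.sum((y_pred == k) & (y_true == c))
        rows, cols = linear_sum_assignment(-cost)   # maximize the total overlap
        return cost[rows, cols].sum() / len(y_true)

    # acc = clustering_accuracy(labels_true, labels_pred)
    # nmi = normalized_mutual_info_score(labels_true, labels_pred)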

Fig. 11 presents the accuracy and normalized mutual information versus the number of clusters for the different NMF models. Truncated CauchyNMF significantly outperforms the other approaches. As the number of categories increases, the accuracy achieved by the other NMF models decreases quickly, while Truncated CauchyNMF maintains a strong subspace

Fig. 9. Clustering performance in terms of average accuracy and average normalized mutual information of Truncated CauchyNMF, CIM-NMF, Huber-NMF, L2,1-NMF, RNMF-L1, L2-NMF, and L1-NMF on the AR dataset, with the number of clusters varying between 2 and 10: (a) average accuracy versus number of clusters, and (b) average normalized mutual information versus number of clusters.

TABLE 5
Face Recognition Accuracies (%) of SEC and NMF Models on the AR Dataset, with the Reduced Dimensionalities of NMF Models Set to 200 and the Test Images Classified by SRC in the Subspaces Learned by the NMF Methods

Methods                    Total   Normal   Expressions   Illuminations   Scarves   Sunglasses
Truncated CauchyNMF+SRC    90      99       96.67         92.67           90.67     77
CIM-NMF+SRC                82.54   96       92.33         90.67           82.33     60.33
L1-NMF+SRC                 87.15   94       90            95.33           84.67     76.33
L2-NMF+SRC                 80.38   90       85            95.33           79.33     58.67
Huber-NMF+SRC              48.85   70       59.33         51.33           48.33     29.33
L2,1-NMF+SRC               75.23   87       83.33         88              72        53.67
RNMF-L1+SRC                49.92   67       55.33         55.33           52        31.33
SEC                        78.92   95       93.33         90.67           74.67     51.67

Fig. 10. Example images in the Caltech 101 dataset from six categories, with four images per category.


learning ability. We can see from the figure that Truncated CauchyNMF is more robust to object variations compared to the other models.

Note that, in all the above experiments, Truncated CauchyNMF and the other NMF models were optimized with different types of algorithms; however, the high performance of Truncated CauchyNMF is not due to its optimization algorithm. To verify this, we applied the Nesterov-based HQ algorithm to optimize the representative NMF models and compared their clustering performance on the AR dataset. The results show that Truncated CauchyNMF consistently outperforms the other NMF models. See the supplementary materials, available online, for a detailed discussion.

6 CONCLUSION

This paper proposes a Truncated CauchyNMF framework for learning subspaces from corrupted data. We propose a Truncated Cauchy loss which can simultaneously and appropriately model both moderate and extreme outliers, and develop a novel Truncated CauchyNMF model. We theoretically analyze the robustness of Truncated CauchyNMF by comparing it with a family of NMF models, and provide performance guarantees for Truncated CauchyNMF. Considering that the objective function is neither convex nor quadratic, we optimize Truncated CauchyNMF using half-quadratic programming and alternately updating both factor matrices. We experimentally verify the robustness and effectiveness of our method on both synthetic and natural datasets and confirm that Truncated CauchyNMF is robust for subspace learning even when half the data points are contaminated.
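
As a rough illustration of the truncation idea (a schematic form only, not necessarily the exact parameterization used in the paper), the loss can be viewed as the Cauchy loss ln(1 + (e/γ)^2) capped at a threshold, so that residuals beyond the threshold contribute a constant value and hence no longer influence the factorization:

    import numpy as np

    def truncated_cauchy_loss(E, gamma, epsilon):
        # Element-wise truncated Cauchy loss (schematic form): the Cauchy loss
        # log(1 + (e / gamma)^2) is capped at its value for |e| = epsilon, so
        # residuals beyond the threshold contribute a constant.
        cauchy = np.log1p((E / gamma) ** 2)
        cap = np.log1p((epsilon / gamma) ** 2)
        return np.where(np.abs(E) <= epsilon, cauchy, cap)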

ACKNOWLEDGMENTS

This work was supported in part by the Australian Research Council Projects FL-170100117, DP-180103424, DP-140102164, and LP-150100671, and by the National Natural Science Foundation of China under Grant 61502515. Naiyang Guan and Tongliang Liu contributed equally to this work.

REFERENCES

[1] A. Baccini, P. Besse, and A. Falguerolles, "A L1-norm PCA and a heuristic approach," Ordinal and Symbolic Data Analysis. Berlin, Germany: Springer, 1996, pp. 359–368.

[2] P. L. Bartlett and S. Mendelson, "Rademacher and Gaussian complexities: Risk bounds and structural results," J. Mach. Learn. Res., vol. 3, pp. 463–482, 2003.

[3] S. P. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.

[4] D. Cai, X. He, and J. Han, "Graph regularized non-negative matrix factorization for data representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1548–1560, Aug. 2011.

[5] A. L. Cauchy, "On the pressure or tension in a solid body," Exercices de Mathématiques, vol. 2, 1827, Art. no. 42.

[6] C. Ding, T. Li, and M. I. Jordan, "Convex and semi-non-negative matrix factorizations," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 1, pp. 45–55, Jan. 2010.

[7] D. Donoho, "Breakdown properties of multivariate location estimators," Qualifying paper, Dept. of Statist., Harvard University, USA, 1982.

[8] L. Du, X. Li, and Y. D. Shen, "Robust non-negative matrix factorization via half-quadratic minimization," in Proc. IEEE 12th Int. Conf. Data Mining, 2012, pp. 201–210.

[9] D. Geman and C. Yang, "Nonlinear image recovery with half-quadratic regularization," IEEE Trans. Image Process., vol. 4, no. 7, pp. 932–946, Jul. 1995.

[10] N. Guan, D. Tao, Z. Luo, and B. Yuan, "NeNMF: An optimal gradient method for non-negative matrix factorization," IEEE Trans. Signal Process., vol. 60, no. 6, pp. 2882–2898, Jun. 2012.

[11] N. Guan, D. Tao, Z. Luo, and J. Shawe-Taylor, "MahNMF: Manhattan non-negative matrix factorization," arXiv:1207.3438v1, 2012.

[12] A. B. Hamza and D. J. Brady, "Reconstruction of reflectance spectra using robust non-negative matrix factorization," IEEE Trans. Signal Process., vol. 54, no. 9, pp. 3637–3642, Sep. 2006.

[13] H. Hotelling, "Analysis of a complex of statistical variables into principal components," J. Educational Psychology, vol. 24, pp. 417–441, 1933.

[14] D. Kong, C. Ding, and H. Huang, "Robust non-negative matrix factorization using L2,1-norm," in Proc. 20th ACM Int. Conf. Inf. Knowl. Manage., 2011, pp. 673–682.

[15] E. Y. Lam, "Non-negative matrix factorization for images with Laplacian noise," in Proc. IEEE Asia Pacific Conf. Circuits Syst., 2008, pp. 798–801.

[16] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, Oct. 1999.

[17] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Proc. Int. Conf. Neural Inf. Process. Syst., 2001, pp. 556–562.

[18] C. J. Lin, "Projected gradient methods for non-negative matrix factorization," Neural Comput., vol. 19, no. 10, pp. 2756–2779, Oct. 2007.

[19] W. Liu, P. P. Pokharel, and J. C. Principe, "Correntropy: Properties and applications in non-Gaussian signal processing," IEEE Trans. Signal Process., vol. 55, no. 11, pp. 5286–5298, Nov. 2007.

[20] T. Liu and D. Tao, "On the robustness and generalization of Cauchy regression," in Proc. IEEE Int. Conf. Inf. Sci. Technol., Apr. 26–28, 2014, pp. 100–105.

[21] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp. Math. Statist. Probability, 1967, Art. no. 14.

[22] A. Martinez and R. Benavente, "The AR face database," Universitat Autònoma de Barcelona, Barcelona, Spain, CVC Tech. Rep. 24, 1998.

[23] Y. E. Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k^2)," Soviet Math. Doklady, vol. 27, no. 2, pp. 372–376, 1983.

[24] M. Nikolova and R. H. Chan, "The equivalence of half-quadratic minimization and the gradient linearization iteration," IEEE Trans. Image Process., vol. 16, no. 6, pp. 1623–1627, Jun. 2007.

[25] V. P. Pauca, F. Shahnaz, M. W. Berry, and R. J. Plemmons, "Text mining using non-negative matrix factorization," in Proc. 4th SIAM Int. Conf. Data Mining, 2004, pp. 452–456.

[26] V. Pauca, J. Piper, and R. Plemmons, "Non-negative matrix factorization for spectral data analysis," Linear Algebra Appl., vol. 416, no. 1, pp. 29–47, Jul. 2006.

[27] F. Samaria and A. Harter, "Parameterisation of a stochastic model for human face identification," in Proc. IEEE Workshop Appl. Comput. Vis., 1994, pp. 138–142.

[28] R. Sandler and M. Lindenbaum, "Non-negative matrix factorization with earth mover's distance metric for image analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1590–1602, Jan. 2011.

[29] L. Zhang, Z. Chen, M. Zheng, and X. He, "Robust non-negative matrix factorization," Frontiers Elect. Electron. Eng. China, vol. 6, no. 2, pp. 192–200, Feb. 2011.

Fig. 11. Clustering performance, in terms of both accuracy and normalized mutual information, of Truncated CauchyNMF, CIM-NMF, Huber-NMF, L2,1-NMF, RNMF-L1, L2-NMF, L1-NMF, and K-means on the Caltech 101 dataset with the number of clusters varying from 2 to 10.


[30] T. Zhang, "Covering number bounds of certain regularized linear function classes," J. Mach. Learn. Res., vol. 2, pp. 527–550, 2002.

[31] Y. Jia and T. Darrell, "Heavy-tailed distances for gradient based image descriptors," in Proc. Int. Conf. Neural Inf. Process. Syst., 2011, pp. 397–405.

[32] F. Nagy, "Parameter estimation of the Cauchy distribution in information theory approach," J. Universal Comput. Sci., vol. 12, no. 9, pp. 1332–1344, 2006.

[33] L. K. Chan, "Linear estimation of the location and scale parameters of the Cauchy distribution based on sample quantiles," J. Amer. Statistical Assoc., vol. 65, no. 330, pp. 851–859, 1970.

[34] G. Cane, "Linear estimation of parameters of the Cauchy distribution based on sample quantiles," J. Amer. Statistical Assoc., vol. 69, no. 345, pp. 243–245, 1974.

[35] P. Chen, N. Wang, N. L. Zhang, and D. Y. Yeung, "Bayesian adaptive matrix factorization with automatic model selection," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 7–12.

[36] Y. Nesterov, "Smooth minimization of non-smooth functions," Math. Program., vol. 103, no. 1, pp. 127–152, Dec. 2005.

[37] G. H. Golub and C. Reinsch, "Singular value decomposition and least squares solutions," Numerische Mathematik, vol. 14, no. 5, pp. 403–420, 1970.

[38] J. P. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov, "Metagenes and molecular pattern discovery using matrix factorization," Proc. Nat. Academy Sci. United States America, vol. 101, no. 12, pp. 4164–4169, 2004.

[39] Y. Jia, et al., "Caffe: Convolutional architecture for fast feature embedding," in Proc. ACM Int. Conf. Multimedia, 2014, pp. 675–678.

[40] N. K. Logothetis and D. L. Sheinberg, "Visual object recognition," Annu. Rev. Neurosci., vol. 19, pp. 577–621, 1996.

[41] E. Wachsmuth, M. W. Oram, and D. I. Perrett, "Recognition of objects and their component parts: Responses of single units in the temporal cortex of the macaque," Cerebral Cortex, vol. 4, pp. 509–522, 1994.

[42] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2012, pp. 1097–1105.

[43] P. J. Huber, Robust Statistics. Berlin, Germany: Springer, 2011.

[44] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," Comput. Vis. Image Understanding, vol. 106, no. 1, pp. 59–70, 2007.

[45] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.

[46] Z. Zhou, A. Wagner, H. Mobahi, J. Wright, and Y. Ma, "Face recognition with contiguous occlusion using Markov random fields," in Proc. IEEE Int. Conf. Comput. Vis., 2009, pp. 1050–1057.

[47] H. Gao, F. Nie, W. Cai, and H. Huang, "Robust capped norm non-negative matrix factorization," in Proc. 24th ACM Int. Conf. Inf. Knowl. Manage., Oct. 19–23, 2015, pp. 871–880.

[48] C. Bhattacharyya, N. Goyal, R. Kannan, and J. Pani, "Non-negative matrix factorization under heavy noise," in Proc. 33rd Int. Conf. Mach. Learn., 2016, pp. 1426–1434.

[49] Q. Pan, D. Kong, C. Ding, and B. Luo, "Robust non-negative dictionary learning," in Proc. 28th AAAI Conf. Artif. Intell., 2014, pp. 2027–2033.

[50] N. Gillis and R. Luce, "Robust near-separable nonnegative matrix factorization using linear optimization," J. Mach. Learn. Res., vol. 15, pp. 1249–1280, 2014.

Naiyang Guan received the BS, MS, and PhD degrees from the National University of Defense Technology. He is currently an associate professor with the College of Computer, National University of Defense Technology and the National Institute of Defense Technology Innovation of China. His research interests include machine learning, computer vision, and data mining. He has authored and co-authored more than 20 research papers, including in the IEEE Transactions on Neural Networks and Learning Systems, the IEEE Transactions on Image Processing, the IEEE Transactions on Signal Processing, ICDM, and IJCAI.

Tongliang Liu received the BEng degree in electronic engineering and information science from the University of Science and Technology of China and the PhD degree from the University of Technology Sydney. He is currently a lecturer in the School of Information Technologies and the Faculty of Engineering and Information Technologies, and a core member of the UBTECH Sydney AI Centre, University of Sydney. His research interests include statistical learning theory, computer vision, and optimisation. He has authored and co-authored more than 30 research papers, including in the IEEE Transactions on Pattern Analysis and Machine Intelligence, the IEEE Transactions on Neural Networks and Learning Systems, the IEEE Transactions on Image Processing, ICML, CVPR, and KDD.

Yangmuzi Zhang received the BS degree in electronics and information engineering from Huazhong University of Science and Technology in 2011 and the PhD degree in computer vision from the University of Maryland at College Park in 2016. She is currently working with Google. Her research interests include computer vision and machine learning.

Dacheng Tao (F'15) is a professor of computer science and ARC Laureate Fellow in the School of Information Technologies and the Faculty of Engineering and Information Technologies, and the inaugural director of the UBTECH Sydney Artificial Intelligence Centre, University of Sydney. He mainly applies statistics and mathematics to artificial intelligence and data science. His research interests spread across computer vision, data science, image processing, machine learning, and video surveillance. His research results have been expounded in one monograph and more than 500 publications in prestigious journals and prominent conferences, such as the IEEE Transactions on Pattern Analysis and Machine Intelligence, the IEEE Transactions on Neural Networks and Learning Systems, the IEEE Transactions on Image Processing, the Journal of Machine Learning Research, the International Journal of Computer Vision, NIPS, ICML, CVPR, ICCV, ECCV, ICDM, and ACM SIGKDD, with several best paper awards, such as the best theory/algorithm paper runner-up award at IEEE ICDM'07, the best student paper award at IEEE ICDM'13, the distinguished student paper award at the 2017 IJCAI, the 2014 ICDM 10-year highest-impact paper award, and the 2017 IEEE Signal Processing Society Best Paper Award. He received the 2015 Australian Scopus-Eureka Prize, the 2015 ACS Gold Disruptor Award, and the 2015 UTS Vice-Chancellor's Medal for Exceptional Research. He is a fellow of the IEEE, OSA, IAPR, and SPIE.

Larry S. Davis received the BA degree from Colgate University in 1970 and the MS and PhD degrees in computer science from the University of Maryland in 1974 and 1976, respectively. From 1977 to 1981, he was an assistant professor in the Department of Computer Science, University of Texas, Austin. He returned to the University of Maryland as an associate professor in 1981. From 1985 to 1994, he was the director of the University of Maryland Institute for Advanced Computer Studies. He was chair of the Department of Computer Science from 1999 to 2012. He is currently a professor in the Institute and the Computer Science Department, as well as director of the Center for Automation Research. He was named a fellow of the IEEE in 1997 and of the ACM in 2013.