

Tangent Space Based Alternating Projections for Nonnegative Low Rank Matrix Approximation

arXiv:2009.03998v1 [cs.LG] 2 Sep 2020

Guangjing Song, Michael K. Ng, and Tai-Xiang Jiang

Abstract—In this paper, we develop a new alternating projection method to compute nonnegative low rank matrix approximations for nonnegative matrices. In the nonnegative low rank matrix approximation method, the projection onto the manifold of fixed rank matrices can be expensive as the singular value decomposition is required. We propose to use the tangent space of a point in the manifold to approximate the projection onto the manifold in order to reduce the computational cost. We show that the sequence generated by alternating projections onto the tangent spaces of the fixed rank matrices manifold and the nonnegative matrix manifold converges linearly to a point in the intersection of the two manifolds, where the convergent point is sufficiently close to optimal solutions. This convergence result based on inexact projections onto the manifold is new and has not been studied in the literature. Numerical examples in data clustering, pattern recognition and hyperspectral data analysis are given to demonstrate that the performance of the proposed method is better than that of nonnegative matrix factorization methods in terms of computational time and accuracy.

Index Terms—Alternating projection method, manifolds, tangent spaces, nonnegative matrices, low rank, nonnegativity.


1 INTRODUCTION

NONNEGATIVE data matrices appear in many data analysis applications. For instance, in image analysis, image pixel values are nonnegative and the associated nonnegative image data matrices can be formed for clustering and recognition [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]. In text mining, the frequencies of terms in documents are nonnegative and the resulting nonnegative term-to-document data matrices can be constructed for clustering [13], [14], [15], [16]. In bioinformatics, nonnegative gene expression values are studied and nonnegative gene expression data matrices are generated for disease and gene classification [17], [18], [19], [20], [21]. Low rank matrix approximation for nonnegative matrices plays a key role in all these applications. Its main purpose is to identify a latent feature space for object representation. The classification, clustering or recognition analysis can then be done by using these latent features.

Nonnegative Matrix Factorization (NMF) emerged in 1994 with the work of Paatero and Tapper [22] on environmental data analysis. The purpose of NMF is to decompose an input m-by-n nonnegative matrix A ∈ R^{m×n}_+ into an m-by-r nonnegative matrix B ∈ R^{m×r}_+ and an r-by-n nonnegative matrix C ∈ R^{r×n}_+ such that A ≈ BC, and more precisely

min_{B,C ≥ 0} ‖A − BC‖_F²,  (1)

• Guangjing Song is with the School of Mathematics and Information Sciences, Weifang University, Weifang 261061, P.R. China (e-mail: [email protected]).

• M. K. Ng is with the Department of Mathematics, The University of Hong Kong, Pokfulam, Hong Kong (e-mail: [email protected]). M. Ng's research is supported in part by the HKRGC GRF 12306616, 12200317, 12300218, 12300519 and 17201020.

• T.-X. Jiang is with the FinTech Innovation Center, School of Economic Information Engineering, Southwestern University of Finance and Economics, Chengdu, Sichuan, P.R. China (e-mail: [email protected], [email protected]). T.-X. Jiang's research is supported in part by the Fundamental Research Funds for the Central Universities (JBK2001011, JBK2001035). T.-X. Jiang is the corresponding author.

where B, C ≥ 0 means that each entry of B and C is nonnegative, ‖·‖_F is the Frobenius norm of a matrix, and r (the low rank value) is smaller than m and n. Several researchers have proposed and developed algorithms for determining such a nonnegative matrix factorization in the literature [4], [8], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32]. Lee and Seung [8] proposed and developed NMF algorithms, and demonstrated that NMF yields a part-based representation which can be used for intuitive perceptual interpretation. For the development of NMF, we refer to the recently edited book [33].
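To make this baseline concrete, the following is a minimal sketch of the classical multiplicative updates for (1) in the spirit of Lee and Seung [8], assuming a NumPy environment; the function name `nmf_mu`, the iteration count, and the small constant `eps` guarding against division by zero are illustrative choices, not part of the cited works.

```python
import numpy as np

def nmf_mu(A, r, n_iter=200, eps=1e-10, seed=0):
    """Multiplicative updates for min ||A - BC||_F^2 subject to B, C >= 0."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    B = rng.random((m, r))
    C = rng.random((r, n))
    for _ in range(n_iter):
        C *= (B.T @ A) / (B.T @ B @ C + eps)   # update C with B fixed
        B *= (A @ C.T) / (B @ C @ C.T + eps)   # update B with C fixed
    return B, C
```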

In [34], Song and Ng proposed a new algorithm for computing a Nonnegative Low Rank Matrix (NLRM) approximation for nonnegative matrices. The approach is completely different from NMF, which has been studied for more than twenty-five years. The new approach aims to find a nonnegative low rank matrix X such that the difference between A and X is as small as possible. Mathematically, it can be formulated as the following optimization problem:

min_{rank(X)=r, X ≥ 0} ‖A − X‖_F².  (2)

The convergence of the proposed algorithm is also proved, and experimental results show that the distance minimized by the NLRM method can be smaller than that by the NMF method. Moreover, according to the ordering of singular values, the proposed method can identify important singular basis vectors, while this information may not be obtained by classical NMF.

1.1 The Contribution

The algorithm proposed in [34] for computing the nonnegative low rank matrix approximation is based on alternating projections onto the fixed-rank matrices manifold and the nonnegative matrices manifold. Note that the computational cost of the above alternating projection method is dominated by the calculation of the singular value decomposition at each iteration. The computational workload of the singular value decomposition can be large, especially when the size of the matrix is large. In this paper, we propose to use the tangent space of a point in the manifold to approximate the projection onto the manifold in order to reduce the computational cost. We show that the sequence generated by alternating projections onto the tangent spaces of the fixed rank matrices manifold and the nonnegative matrix manifold converges linearly to a point in the intersection of the two manifolds, where the convergent point is sufficiently close to optimal solutions. Numerical examples will be presented to show that the computational time of the proposed tangent space based method is less than that of the original alternating projection method. Moreover, experimental results in data clustering, pattern recognition and hyperspectral data analysis are given to demonstrate that the performance of the proposed method is better than that of other nonnegative matrix factorization methods in terms of computational time and accuracy.

The rest of this paper is organized as follows. In Section 2, we propose the tangent space based alternating projection method. In Section 3, we show the convergence of the proposed method. In Section 4, numerical examples are given to show the advantages of the proposed method. Finally, some concluding remarks are given in Section 5.

2 NONNEGATIVE LOW RANK MATRIX APPROXIMATION

In this paper, we are interested in the m × n fixed-rank matrices manifold

Mr := {X ∈ R^{m×n} : rank(X) = r},  (3)

the m × n nonnegative matrices manifold

Mn := {X ∈ R^{m×n} : X_{ij} ≥ 0, i = 1, ..., m, j = 1, ..., n},  (4)

and the m × n nonnegative fixed rank matrices manifold

Mrn := Mr ∩ Mn = {X ∈ R^{m×n} : rank(X) = r, X_{ij} ≥ 0, i = 1, ..., m, j = 1, ..., n}.  (5)

The proof that Mrn is a manifold can be found in [34]. Let X ∈ R^{m×n} be an arbitrary matrix in the manifold Mr. We write the singular value decomposition of X as X = UΣV^T, where U ∈ R^{m×r}, Σ ∈ R^{r×r}, and V ∈ R^{n×r}. It follows from Proposition 2.1 in [35] that the tangent space of Mr at X can be expressed as

T_{Mr}(X) = {UW^T + ZV^T : W ∈ R^{n×r}, Z ∈ R^{m×r} arbitrary},  (6)

where ·^T denotes the transpose of a matrix. For a given m-by-n matrix Y, the orthogonal projection of Y onto the subspace T_{Mr}(X) can be written as

P_{T_{Mr}(X)}(Y) = UU^T Y + YVV^T − UU^T YVV^T.  (7)

The alternating projection method studied in [34] is based on projecting the given nonnegative matrix onto the m × n fixed-rank matrices manifold Mr and the nonnegative matrices manifold Mn iteratively. The projection onto the fixed rank matrix set Mr is derived from the Eckart-Young-Mirsky theorem [36] and can be expressed as follows:

π1(X) = Σ_{i=1}^{r} σi(X) ui(X) vi(X)^T,  (8)

where σi(X) are the r largest singular values of X, and ui(X), vi(X) are the corresponding singular vectors. The projection onto the nonnegative matrix set Mn is given entrywise by

[π2(X)]_{ij} = X_{ij} if X_{ij} ≥ 0, and 0 if X_{ij} < 0.  (9)

Moreover, Mrn refers to the nonnegative fixed rank matrices manifold given in (5), and π(X) refers to the closest matrix in Mrn to the given nonnegative matrix X, i.e.,

π(X) = argmin_{Y ∈ Mrn} ‖X − Y‖_F².  (10)

2.1 Projections Based on Tangent Spaces

Note that it can be expensive to project a matrix onto the fixed rank manifold by using the singular value decomposition. In this paper, we make use of tangent spaces and construct a Tangent space based Alternating Projection (TAP) method to find a nonnegative low rank matrix approximation such that the computational cost can be reduced compared with the original alternating projection method in [34]. In Figure 1 and Figure 2, we illustrate the proposed TAP method. In the method, the given nonnegative matrix X0 = A is first projected onto the manifold Mr to get a point X1 by π1, and then X2 is derived by projecting X1 onto the manifold Mn by π2. The first two steps are the same as in the original alternating projection method [34]. In the third step, the point X2 is first projected onto the tangent space of the manifold Mr at X1 by the orthogonal projection P_{T_{Mr}(X1)}, and then the derived point is projected from the tangent space onto the manifold Mr to get X3. Thus the sequence is given as follows:

X0 = A, X1 = π1(X0), X2 = π2(X1),
X3 = π1(P_{T_{Mr}(X1)}(X2)), X4 = π2(X3), ...,
X_{2k+1} = π1(P_{T_{Mr}(X_{2k−1})}(X_{2k})), X_{2k+2} = π2(X_{2k+1}), ...,

where P_{T_{Mr}(X_{2k−1})}(X_{2k}) denotes the orthogonal projection of X_{2k} onto the tangent space of Mr at X_{2k−1}. The algorithm is summarized in Algorithm 1.

Algorithm 1 Tangent Space Based Alternating Projection (TAP) Method
Input: a nonnegative matrix A ∈ R^{m×n}; this algorithm computes a nearest rank-r nonnegative matrix.
1: Initialize X0 = A;
2: X1 = π1(X0) and Y1 = π2(X1);
3: for k = 1, 2, ...
4:   X_{k+1} = π1(P_{T_{Mr}(Xk)}(Yk));
5:   Y_{k+1} = π2(X_{k+1});
6: end
Output: Xk when the stopping criterion is satisfied.
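A minimal NumPy sketch of Algorithm 1 may help fix ideas. For clarity it realizes π1 with a plain truncated SVD (the efficient tangent-space variant is derived below); all function names and the fixed iteration count are illustrative, not taken from the authors' code.

```python
import numpy as np

def pi1(X, r):
    """pi_1: projection onto M_r, the best rank-r approximation (Eckart-Young-Mirsky)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r], U[:, :r], Vt[:r].T

def pi2(X):
    """pi_2: projection onto M_n, setting negative entries to zero."""
    return np.maximum(X, 0.0)

def proj_tangent(Y, U, V):
    """Orthogonal projection (7) of Y onto the tangent space at X = U S V^T."""
    UtY = U.T @ Y
    return U @ UtY + (Y @ V) @ V.T - U @ (UtY @ V) @ V.T

def tap(A, r, n_iter=100):
    """Algorithm 1 (TAP), with pi_1 realized here by a full truncated SVD."""
    X, U, V = pi1(A, r)          # X1 = pi1(X0)
    Y = pi2(X)                   # Y1 = pi2(X1)
    for _ in range(n_iter):
        X, U, V = pi1(proj_tangent(Y, U, V), r)  # X_{k+1}
        Y = pi2(X)                               # Y_{k+1}
    return X
```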

Fig. 1. The comparison between (a) the original alternating projection method and (b) the proposed TAP method.

Fig. 2. The zoomed region in Figure 1(b).

Let us analyze the computational cost of the TAP method in each iteration. Let Xk = Uk Σk Vk^T be the skinny SVD of Xk. By (6), the tangent space of Mr at Xk can be expressed as

T_{Mr}(Xk) = {Uk W^T + Z Vk^T : W ∈ R^{n×r}, Z ∈ R^{m×r} arbitrary}.

According to (7), the orthogonal projection of Yk onto the subspace T_{Mr}(Xk) can be written as follows:

P_{T_{Mr}(Xk)}(Yk) = Uk Uk^T Yk + Yk Vk Vk^T − Uk Uk^T Yk Vk Vk^T.

Now it is required to compute the SVD of the projected matrix P_{T_{Mr}(Xk)}(Yk), which has a smaller effective size. Suppose that the QR decompositions of (I − Uk Uk^T) Yk Vk and (I − Vk Vk^T) Yk^T Uk are given as follows:

(I − Uk Uk^T) Yk Vk = Qk Rk  and  (I − Vk Vk^T) Yk^T Uk = Q̂k R̂k,

respectively. Recall that Uk^T Qk = Vk^T Q̂k = 0, and then by a direct computation we have

P_{T_{Mr}(Xk)}(Yk)
= Uk Uk^T Yk (I − Vk Vk^T) + (I − Uk Uk^T) Yk Vk Vk^T + Uk Uk^T Yk Vk Vk^T
= Uk R̂k^T Q̂k^T + Qk Rk Vk^T + Uk Uk^T Yk Vk Vk^T
= (Uk  Qk) [ Uk^T Yk Vk   R̂k^T ;  Rk   0 ] (Vk  Q̂k)^T
:= (Uk  Qk) Mk (Vk  Q̂k)^T.

Let Mk = Ψk Γk Φk^T be the SVD of Mk, which can be computed using O(r³) flops since Mk is a 2r × 2r matrix. Note that (Uk, Qk) and (Vk, Q̂k) are orthogonal, so the SVD of P_{T_{Mr}(Xk)}(Yk) = Ωk Θk Υk^T can be computed via

Ωk = (Uk, Qk) Ψk,  Θk = Γk  and  Υk = (Vk, Q̂k) Φk.

It follows that the overall computational cost of π1(P_{T_{Mr}(Xk)}(Yk)) can be expressed as two matrix-matrix multiplications. In addition, the calculation procedure involves the QR decomposition of two matrices of sizes m × r and n × r, and the SVD of a matrix of size 2r × 2r. The total cost per iteration is 4mnr + O(r²m + r²n + r³). In contrast, the computation of the best rank-r approximation of an unstructured m × n matrix costs O(mnr) + mn flops, where the constant in front of mnr can be very large. In practice, the cost per iteration of the proposed TAP method is less than that of the original alternating projection method. In Section 4, numerical examples will be given to demonstrate that the total computational time of the proposed TAP method is less than that of the original alternating projection method.

3 THE CONVERGENCE ANALYSIS

In this section, we study the convergence of the proposed TAP method in Algorithm 1. We note that the convergence result of the original alternating projection method has been established in [37]. The key concept is the angle of a point in the intersection of two manifolds. In our setting, the angle α(B) of B ∈ Mrn is

α(B) = cos⁻¹(σ(B)),  (11)

where

σ(B) = lim_{ξ→0} sup_{B1 ∈ F1^ξ(B), B2 ∈ F2^ξ(B)} ⟨B1 − B, B2 − B⟩ / (‖B1 − B‖_F ‖B2 − B‖_F),

with

F1^ξ(B) = {B1 | B1 ∈ Mr\{B}, ‖B1 − B‖_F ≤ ξ, B1 − B ⊥ T_{Mr∩Mn}(B)},
F2^ξ(B) = {B2 | B2 ∈ Mn\{B}, ‖B2 − B‖_F ≤ ξ, B2 − B ⊥ T_{Mr∩Mn}(B)},

and T_{Mr∩Mn}(B) is the tangent space of Mr ∩ Mn at the point B. The angle is calculated based on two points belonging to Mr and Mn, respectively. A point B in Mrn is nontangential if α(B) is a positive angle, i.e., 0 ≤ σ(B) < 1.


In the following, we present the main convergence result for Algorithm 1.

Theorem 3.1. Let Mr, Mn and Mrn be given as in (3), (4) and (5), and let the projections onto Mr and Mn be given as in (8)-(9), respectively. Suppose that P ∈ Mrn is a nontangential intersection point. Then for any given ε > 0 and 1 > c > σ(P), there exists a ξ > 0 such that for any A ∈ Ball(P, ξ) (the ball neighborhood of P with radius ξ containing the given nonnegative matrix A), the sequence {Xk} generated by Algorithm 1 converges to a point X∞ ∈ Mrn and satisfies
(1) ‖X∞ − π(A)‖_F ≤ ε‖A − π(A)‖_F,
(2) ‖X∞ − Xk‖_F ≤ const · c^k ‖A − π(A)‖_F,
where π(A) is defined in (10).

Here we first present and establish some preliminary results and then give the proof of Theorem 3.1. These results are used to estimate the approximation error when tangent spaces are used in the projections.

Lemma 3.2 (Proposition 4.3 and Theorem 4.1 in [37]). Let P ∈ Mr be given and π1, π be defined as in (8) and (10). For each 0 < ε < 3/5, there exist an s(ε) > 0 and an ε(ε) > 0 such that for any given Z ∈ Ball(P, s(ε)),

‖π1(Z) − P_{T_{Mr}(π(Z))}(Z)‖_F < 4√ε ‖Z − π(Z)‖_F,  (12)

and

‖π(π1(Z)) − π(Z)‖_F < ε(ε) ‖Z − π(Z)‖_F.  (13)

Lemma 3.3 (Proposition 2.4 in [37]). Let P ∈ Mr be given. For each ε > 0, there exists s > 0 such that for all C ∈ Ball(P, s) ∩ Mr, we have:
(i) min_{D′ ∈ T_{Mr}(C)} ‖D − D′‖_F ≤ ε‖D − C‖_F, ∀ D ∈ Ball(P, s) ∩ Mr;
(ii) ‖D − π1(D)‖_F ≤ ε‖D − C‖_F, ∀ D ∈ Ball(P, s) ∩ T_{Mr}(C).

Next we estimate the distance with respect to another point in Mr instead of using π(Z). The following results, which are proved in Appendix A.1, are needed.

Lemma 3.4. Let P ∈ Mr be given, and let P_{T_{Mr}} and π1 be given as in (7) and (8). For each 0 < ε < 3/5, there exist an s(ε) > 0 and a point Q ∈ Ball(P, s(ε)) ∩ Mr such that for any given Z ∈ Ball(P, s(ε)), we have

‖π1(Z) − P_{T_{Mr}(Q)}(Z)‖_F < 4√ε ‖Z − Q‖_F.  (14)

Lemma 3.5 (Theorem 4.5 in [37]). Suppose P is a nontangential point with σ(P) < c. Then there exists an s > 0 such that for all Z ∈ Mn ∩ Ball(P, s), we have

‖π1(Z) − π(Z)‖_F < c‖Z − π(Z)‖_F.  (15)

Next we would like to estimate the distance between π(π1(P_{T_{Mr}(Q)}(Z))) and π(Z), both of which lie on the manifold Mrn.

Lemma 3.6. Let P ∈ Mrn be given. For each 0 < ε < 3/5, there exist ε1(ε) > 0, ε2(ε) > 0 and s1(ε) > 0 such that for all Z ∈ Ball(P, s1(ε)),

‖π(π1(P_{T_{Mr}(Q)}(Z))) − π(Z)‖_F ≤ ε1(ε)‖Z − π(Z)‖_F + ε2(ε)‖Q − π(Z)‖_F,

where Q ∈ Mr ∩ Ball(P, s1(ε)).

The proof of Lemma 3.6 can be found in Appendix A.2. According to Lemma 3.6, if Z = π2(Q), then

‖Z − Q‖_F = ‖Q − π2(Q)‖_F ≤ ‖Q − π(Z)‖_F,

by noting that π(Z) ∈ Mn, and

‖π(π1(P_{T_{Mr}(Q)}(Z))) − π(Z)‖_F ≤ 8α√ε ‖Z − Q‖_F + ε(ε)‖Z − π(Z)‖_F ≤ ε1(ε)‖Z − π(Z)‖_F + ε2(ε)‖Q − π(Z)‖_F,

where ε1(ε) = ε(ε) and ε2(ε) = 8α√ε.

In order to prove the convergence of Algorithm 1, we need to estimate the distance between π1(P_{T_{Mr}(Q)}(Z)) and π(Z). The proof can be found in Appendix A.3.

Lemma 3.7. Suppose P is a nontangential point in Mrn with σ(P) < c, and Q ∈ Mr. Then there exists an s > 0 such that when Z = π2(Q) ∈ Mn ∩ Ball(P, s) and P_{T_{Mr}(Q)}(Z) ∈ T_{Mr}(Q) ∩ Ball(P, s), we have

‖π1(P_{T_{Mr}(Q)}(Z)) − π(Z)‖_F < c‖Z − π(Z)‖_F.  (16)

With the above preliminaries, we give the proof of Theorem 3.1.

Proof of Theorem 3.1. Suppose that ε < 1, set σ(P) < c < 1, and set

ε = ((1 − c)/(2(3 − c))) ε,  ε2(ε) = ((1 − c)/(2 + 2α)) ε,

where α is a constant given as in (30). It follows from Lemmas 3.5-3.7 that there exist some possibly distinct radii that guarantee (15)-(16) are satisfied. Let s denote the minimum of these possible radii and pick r < s(1 − ε)/(4(2 + ε)), so that π(Ball(P, r)) ⊆ Ball(P, s/4). Then ‖π(A) − P‖_F < s/4 follows from the latter condition. Denote l = ‖A − π(A)‖_F and note that

l = ‖A − P + P − π(A)‖_F ≤ ‖A − P‖_F + ‖P − π(A)‖_F ≤ r + s/4.

As π(A) ∈ Mrn and noting that X1 = π1(A), we have

‖X1 − A‖_F = ‖π1(A) − A‖_F ≤ ‖π(A) − A‖_F = l

and

‖X1 − π(X1)‖_F ≤ ‖X1 − π(A)‖_F ≤ ‖X1 − A‖_F + ‖A − π(A)‖_F ≤ 2l.

In order to prove that {Xk} derived by Algorithm 1 is convergent, we need to prove that {Xk} is a Cauchy sequence. By Lemma 3.7, there exists a c1 such that

‖X_{2k+1} − π(X_{2k+1})‖_F ≤ ‖X_{2k+1} − π(X_{2k})‖_F ≤ c1‖X_{2k} − π(X_{2k})‖_F.  (17)

In addition, by Lemma 3.5, there exists a c2 such that

‖X_{2k} − π(X_{2k})‖_F ≤ ‖X_{2k} − π(X_{2k−1})‖_F ≤ c2‖X_{2k−1} − π(X_{2k−1})‖_F.  (18)

Set c = max{c1, c2}; combining (17) and (18) gives

‖Xk − π(Xk)‖_F ≤ c‖X_{k−1} − π(X_{k−1})‖_F.  (19)

Then {Xk} is a Cauchy sequence if and only if

{Xk}_{k=1}^∞ ⊆ Ball(P, s)  (20)


is satisfied. The remaining task is to show that (20) holds, by induction. For k = 1,

‖X1 − P‖_F ≤ ‖X1 − A‖_F + ‖A − P‖_F ≤ l + r/2 ≤ 2r + s/4 ≤ s(1 − ε)/(2(2 + ε)) + s/4 < s.

Assume that (20) is satisfied when n = k; then it follows from (19) that

‖Xk − π(Xk)‖_F ≤ c^k‖X1 − π(X1)‖_F ≤ 2lc^k.  (21)

For an arbitrary k and i = 1 or 2, we have

‖X_{k−2} − π(X_{k−1})‖_F
= ‖X_{k−2} − π(πi(X_{k−2}))‖_F
= ‖X_{k−2} − π(X_{k−2}) + π(X_{k−2}) − π(πi(X_{k−2}))‖_F
≤ ‖X_{k−2} − π(X_{k−2})‖_F + ‖π(X_{k−2}) − π(πi(X_{k−2}))‖_F
≤ ‖X_{k−2} − π(X_{k−2})‖_F + α‖X_{k−2} − πi(X_{k−2})‖_F
≤ (1 + α)‖X_{k−2} − π(X_{k−2})‖_F.

The second part of the second inequality follows from the continuity of π, and the third inequality follows from

‖X_{k−2} − πi(X_{k−2})‖_F ≤ ‖X_{k−2} − π(X_{k−2})‖_F, i = 1, 2.

In addition, when k is even, by Lemma 3.2 we have

‖π(Xk) − π(X_{k−1})‖_F < ε(ε)‖X_{k−1} − π(X_{k−1})‖_F.  (22)

When k is odd, applying Lemma 3.6 gives

‖π(Xk) − π(X_{k−1})‖_F
< ε1(ε)‖X_{k−1} − π(X_{k−1})‖_F + ε2(ε)‖X_{k−2} − π(X_{k−1})‖_F
< ε1(ε)‖X_{k−1} − π(X_{k−1})‖_F + ε2(ε)(1 + α)‖X_{k−2} − π(X_{k−2})‖_F
≤ 2ε1(ε)c^{k−1}l + 2ε2(ε)(1 + α)c^{k−2}l
= (ε1(ε)c + ε2(ε)(1 + α))2c^{k−2}l.

Set ε = max{ε(ε), ε1(ε)}; then for an arbitrary k, we have

‖π(Xk) − π(X_{k−1})‖_F ≤ (εc + ε2(ε)(1 + α))2c^{k−2}l.  (23)

By combining (23) and using Lemma 3.2, we obtain

‖π(Xk) − π(A)‖_F
≤ ‖π(A) − π(X1)‖_F + ‖π(X2) − π(X1)‖_F + Σ_{j=3}^{k} ‖π(Xj) − π(X_{j−1})‖_F
≤ εl + 2εl + Σ_{j=3}^{k} (ε1(ε)c + ε2(ε)(1 + α))2c^{j−2}l
≤ 3εl + 2(ε1(ε)c + ε2(ε)(1 + α))l/(1 − c)
= ((3ε(1 − c) + 2εc + (1 + α)ε2(ε))/(1 − c)) l ≤ εl.  (24)

Thus,

‖P − Xk‖_F ≤ ‖P − π(A)‖_F + ‖π(A) − π(Xk)‖_F + ‖π(Xk) − Xk‖_F ≤ s/4 + εl + 2l < s,

which shows that (20) is satisfied.

It follows from (23) that the sequence (π(Xk))_{k=1}^∞ is a Cauchy sequence which converges to a point Z∞. Since (21) is satisfied, the sequence (Xk)_{k=1}^∞ also converges. In addition, Z∞ = π(Z∞) can be derived by noting that the projection is locally continuous. Moreover, by taking the limit of (24) we get (i). For (ii), note that

‖π(Xk) − X∞‖_F ≤ Σ_{j=k+1}^{∞} ‖π(Xj) − π(X_{j−1})‖_F ≤ 2lεc^k/(1 − c) + 2(1 + α)lε2(ε)c^{k−1}/(1 − c),

and combining with (21), we get

‖Xk − X∞‖_F ≤ ‖Xk − π(Xk)‖_F + ‖π(Xk) − X∞‖_F ≤ (2l + 2lε/(1 − c) + 2(1 + α)lε2(ε)/(1 − c)) c^k = βc^k l,

with a constant β as desired. □

4 EXPERIMENTAL RESULTS

The main aim of this section is to demonstrate that (i) the computational time required by the proposed TAP method is less than that of the original alternating projection (AP) method with about the same approximation quality; and (ii) the performance of the proposed TAP method is better than that of nonnegative matrix factorization methods in terms of computational time and accuracy for examples in data clustering, pattern recognition and hyperspectral data analysis.

The experiments in Subsection 4.1 are performed under Windows 7 and MATLAB R2018a running on a desktop (Intel Core i7 @ 3.40GHz, 8.00GB RAM), and the experiments in Subsections 4.2-4.5 are performed under Windows 10 and MATLAB R2020a running on a desktop (AMD Ryzen 9 3950 @ 3.49GHz, 64.00GB RAM).

4.1 The First Experiment

In the first experiment, we randomly generated n-by-n nonnegative matrices A whose entries follow a uniform distribution between 0 and 1. We employed the proposed TAP method and the original alternating projection (AP) method [34] to test the relative approximation error ‖A − Xc‖_F/‖A‖_F, where Xc is the computed rank-r solution of each method. For comparison, we also list the results of the nonnegative matrix factorization algorithms A-MU [25], A-HALS [25] and A-PG [29].
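A sketch of this experimental setup, reusing the illustrative `tap` function from the sketch in Section 2.1, might look as follows; the seed and iteration count are arbitrary choices of ours, not the paper's.

```python
import numpy as np

# Hypothetical driver for the first experiment: a random uniform nonnegative
# matrix and the relative Frobenius error of its rank-r TAP approximation.
rng = np.random.default_rng(0)
n, r = 200, 10
A = rng.random((n, n))                 # entries uniform in [0, 1]
X = tap(A, r, n_iter=100)              # `tap` from the earlier sketch
rel_err = np.linalg.norm(A - X) / np.linalg.norm(A)
print(f"relative approximation error: {rel_err:.4f}")
```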

Table 1 shows the relative approximation errors of the computed solutions from the proposed TAP method and the other testing methods for synthetic data sets of sizes 200-by-200, 400-by-400 and 800-by-800. Note that there is no guarantee that the other testing NMF algorithms can determine the underlying nonnegative low rank factorization. In the table, it is clear that the testing NMF algorithms cannot obtain the underlying low rank factorization. One of the reasons may be that NMF algorithms can be sensitive to initial guesses. In the table, we illustrate this phenomenon by displaying the mean relative approximation error and the range containing both the minimum and the maximum


TABLE 1
The relative approximation error and computation time (in seconds) on the synthetic data sets. The best values are highlighted by bold fonts in the original.

200-by-200 matrix
| Method | Error, r = 10 | Error, r = 20 | Error, r = 40 | Time, r = 10 | Time, r = 20 | Time, r = 40 |
| --- | --- | --- | --- | --- | --- | --- |
| TAP | 0.4576 | 0.4161 | 0.3247 | 0.42 | 0.48 | 0.38 |
| AP | 0.4576 | 0.4161 | 0.3247 | 0.66 | 0.66 | 0.42 |
| A-MU: mean | 0.4592 | 0.4249 | 0.3733 | 8.32 | 9.54 | 15.34 |
| A-MU: range | [0.4591, 0.4593] | [0.4246, 0.4251] | [0.3729, 0.3737] | [8.00, 8.81] | [9.41, 9.61] | [14.72, 15.75] |
| A-HALS: mean | 0.4591 | 0.4246 | 0.3717 | 1.09 | 1.95 | 4.01 |
| A-HALS: range | [0.4590, 0.4593] | [0.4244, 0.4247] | [0.3714, 0.3719] | [0.98, 1.22] | [1.86, 2.05] | [3.86, 4.13] |
| A-PG: mean | 0.4591 | 0.4244 | 0.3717 | 14.77 | 16.24 | 21.52 |
| A-PG: range | [0.4590, 0.4592] | [0.4243, 0.4246] | [0.3715, 0.3719] | [14.50, 15.03] | [15.81, 16.55] | [21.02, 21.77] |

400-by-400 matrix
| Method | Error, r = 20 | Error, r = 40 | Error, r = 80 | Time, r = 20 | Time, r = 40 | Time, r = 80 |
| --- | --- | --- | --- | --- | --- | --- |
| TAP | 0.4573 | 0.4161 | 0.3421 | 1.55 | 1.32 | 1.10 |
| AP | 0.4573 | 0.4161 | 0.3421 | 2.95 | 2.47 | 1.68 |
| A-MU: mean | 0.4606 | 0.4301 | 0.3857 | 37.80 | 38.72 | 46.41 |
| A-MU: range | [0.4605, 0.4607] | [0.4300, 0.4302] | [0.3856, 0.3860] | [36.67, 39.03] | [38.21, 39.18] | [45.87, 48.28] |
| A-HALS: mean | 0.4604 | 0.4295 | 0.3836 | 3.10 | 7.40 | 19.67 |
| A-HALS: range | [0.4603, 0.4605] | [0.4294, 0.4296] | [0.3833, 0.3838] | [3.03, 3.25] | [7.12, 7.60] | [19.04, 20.61] |
| A-PG: mean | 0.4604 | 0.4297 | 0.3850 | 51.68 | 60.80 | 61.95 |
| A-PG: range | [0.4604, 0.4605] | [0.4296, 0.4298] | [0.3847, 0.3853] | [51.04, 52.26] | [60.62, 61.01] | [61.34, 62.64] |

800-by-800 matrix
| Method | Error, r = 40 | Error, r = 80 | Error, r = 160 | Time, r = 40 | Time, r = 80 | Time, r = 160 |
| --- | --- | --- | --- | --- | --- | --- |
| TAP | 0.4550 | 0.4144 | 0.3412 | 7.14 | 4.84 | 4.80 |
| AP | 0.4550 | 0.4144 | 0.3412 | 15.84 | 9.55 | 7.11 |
| A-MU: mean | 0.4608 | 0.4350 | 0.3984 | 60.29 | 60.96 | 61.31 |
| A-MU: range | [0.4607, 0.4609] | [0.4349, 0.4351] | [0.3982, 0.3986] | [60.03, 60.65] | [60.61, 61.58] | [60.72, 61.94] |
| A-HALS: mean | 0.4605 | 0.4336 | 0.3984 | 18.54 | 47.70 | 61.30 |
| A-HALS: range | [0.4604, 0.4605] | [0.4335, 0.4336] | [0.3982, 0.3986] | [17.76, 19.48] | [43.33, 52.75] | [60.71, 61.91] |
| A-PG: mean | 0.4606 | 0.4343 | 0.4007 | 60.38 | 61.26 | 61.76 |
| A-PG: range | [0.4606, 0.4607] | [0.4342, 0.4344] | [0.4005, 0.4012] | [60.12, 60.79] | [60.78, 61.81] | [61.29, 62.48] |

relative approximation errors over ten randomly generated initial guesses. We find in the table that the relative approximation errors computed by the TAP method are the same as those by the AP method. This implies that the proposed TAP method can achieve the same accuracy as the classical alternating projection method. According to the table, the relative approximation errors by both the TAP and AP methods are always smaller than the minimum relative approximation errors by the testing NMF algorithms. In addition, we report the computational time (in seconds, measured by MATLAB) in the table. We see that the computational time required by the proposed TAP method is less than that required by the AP method.

4.2 The Second Experiment

4.2.1 Face Data

In this subsection, we consider two frequently-used face data sets, i.e., the ORL face data set¹ and the Yale B face data set². The ORL face data set contains images from 40 individuals, each providing 10 different images of size 112 × 92. In the Yale B face data set, we take a subset which consists of 38 people and 64 facial images with different illuminations for each individual. Each testing image is reshaped into a vector, and all the image vectors are combined together to form a nonnegative matrix. Here we perform NMF algorithms and the TAP algorithm to obtain low rank approximations with a predefined rank r. Several NMF algorithms are compared, namely multiplicative updates (MU) [8], [30], accelerated MU (A-MU) [25], the hierarchical alternating least squares (HALS) algorithm [23], accelerated HALS (A-HALS) [25], the projected gradient (PG) method [29], and accelerated PG (A-PG) [29].

Firstly, we compare the low rank approximation results of the different methods with respect to different predefined ranks r. We report the relative approximation errors in Table 2. For the ORL data set, we set r to 10 and 40 because the face data contains 40 individuals and each individual has 10 different images. Similarly, r is set to 38 and 64 for the Yale B data set.

1. http://www.uk.research.att.com/facedatabase.html
2. http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html


TABLE 2
The relative approximation error on the Yale B data set and the ORL data set. The best values and the second best values are respectively highlighted by bold fonts and underlines in the original.

| Dataset | Rank | MU | A-MU | HALS | A-HALS | PG | A-PG | AP | TAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Yale B | r = 38 | 0.186 | 0.182 | 0.181 | 0.181 | 0.187 | 0.184 | 0.166 | 0.166 |
| Yale B | r = 64 | 0.160 | 0.157 | 0.152 | 0.152 | 0.159 | 0.159 | 0.133 | 0.133 |
| ORL | r = 10 | 0.206 | 0.206 | 0.205 | 0.205 | 0.206 | 0.206 | 0.204 | 0.204 |
| ORL | r = 40 | 0.159 | 0.156 | 0.155 | 0.155 | 0.160 | 0.158 | 0.147 | 0.147 |

Fig. 3. Relative approximation errors on the ORL data set (top) and the Yale B data set (bottom), with respect to different ranks r.

In the numerical results, we compare the relative approximation error ‖Xc − A‖_F/‖A‖_F. For the TAP and AP methods, the nonnegative low rank approximation is directly computed, while for the NMF methods we multiply the factor matrices. We can see from the table that the relative approximation errors by the TAP and AP methods are lower than those by the NMF methods.

The relative approximation errors on these two face data sets with respect to different ranks r are plotted in Figure 3. We can see that as r increases, the gap in relative approximation errors between the TAP (or AP) method and the NMF methods becomes larger. The total computational time required by the proposed TAP method (2.84 seconds) is less than that required by the AP method (17.44 seconds). The proposed TAP method is more efficient than the AP method.

Fig. 4. The recognition accuracy (%) on the Yale B dataset with respect to rank r.

Fig. 5. The first 20 singular vectors of the results by the TAP (or AP) method and the columns of the left factor matrices produced by the NMF methods (MU, HALS, PG, A-MU, A-HALS, A-PG, AP, TAP) when the rank r = 20. These vectors are reshaped to the size of the facial images and their values are adaptively normalized.

Next, we test the face recognition performance with respect to TAP approximations and NMF approximations. We use the k-fold cross-validation strategy. For each data set, the data is split into k groups (k = 10 for the ORL data set and k = 64 for the Yale B data set), and each group contains one facial image of each individual. For instance, the ORL data set is split into k = 10 groups and each group contains 40 facial images. Then, we circularly take one group as the test data set and the remaining groups as the training data set until every group has been selected as the test data. Given the original training data Atrain of size m × n, where n indicates the number of pixels of each face image and m is the number of training samples, we first perform the NMF and TAP (or AP) algorithms to obtain nonnegative low rank approximations Atrain ≈ B_NMFtrain C_NMFtrain and Atrain ≈ U_TAPtrain Σ_TAPtrain V_TAPtrain^T, respectively, with rank r. The new representations of Atrain are given by U_NMFtrain^T Atrain and U_TAPtrain^T Atrain for the NMF methods and the TAP (or AP) method, respectively. The nearest neighbor (NN) classifier is adopted to recognize the testing data based on the distance between their representations and the projected training data.
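A simple sketch of this recognition step is given below, under the assumption that images are stored as columns of the data matrices and that `U` holds the leading left singular vectors (or an orthonormalized NMF basis) computed from the training matrix; all names are illustrative.

```python
import numpy as np

def nn_classify(U, train_imgs, train_labels, test_imgs):
    """1-NN in the reduced space: each image (a column x) is represented by U^T x."""
    F_train = U.T @ train_imgs             # r-by-#train feature matrix
    F_test = U.T @ test_imgs               # r-by-#test feature matrix
    preds = []
    for f in F_test.T:
        d = np.linalg.norm(F_train - f[:, None], axis=0)  # distances to training set
        preds.append(train_labels[np.argmin(d)])
    return np.array(preds)
```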

The face recognition results are exhibited in Table 3. From this table, we can see that the accuracies based on TAP approximations are higher than those based on NMF approximations. To further investigate how the rank r affects the recognition results, we plot the recognition accuracy on the Yale B data set with respect to r in Figure 4. It can be found that the recognition accuracy based on TAP and AP approximations


TABLE 3
The recognition accuracy on the Yale B dataset and the ORL dataset. The best values and the second best values are respectively highlighted by bold fonts and underlines in the original.

| Dataset | Parameter | MU | A-MU | HALS | A-HALS | PG | A-PG | AP | TAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Yale B | r = 38 | 61.061% | 61.143% | 61.637% | 62.253% | 58.306% | 60.074% | 66.776% | 67.681% |
| Yale B | r = 64 | 69.942% | 70.477% | 72.821% | 72.821% | 65.502% | 68.586% | 76.563% | 76.809% |
| ORL | r = 10 | 95.750% | 96.250% | 96.250% | 96.250% | 96.500% | 96.500% | 96.750% | 96.750% |
| ORL | r = 40 | 98.250% | 98.000% | 98.250% | 98.500% | 79.250% | 98.250% | 98.500% | 98.500% |

TABLE 4
The accuracy and NMI values of the document clustering results on the TDT2 data set.

| Metric | MU | A-MU | HALS | A-HALS | PG | A-PG | AP | TAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 52.800% | 50.724% | 54.322% | 53.108% | 54.205% | 51.661% | 61.294% | 61.326% |
| NMI | 0.674 | 0.651 | 0.663 | 0.643 | 0.681 | 0.661 | 0.728 | 0.728 |

is always better than that based on NMF approximations. Meanwhile, to see the features learned by the different methods, we exhibit the column vectors of B_NMFtrain and the singular vectors of U_TAPtrain in Figure 5. These vectors are reshaped to the same size as the facial images and their values are normalized to [0, 255] for display purposes. We see that the nonnegative low rank matrix approximation methods do not give part-based representations, but provide different important facial representations for recognition.

4.2.2 Document Data

In this subsection, we use the NIST Topic Detection and Tracking (TDT2) corpus as the document data. The TDT2 corpus consists of data collected during the first half of 1998 and taken from 6 sources, including 2 newswires (APW, NYT), 2 radio programs (VOA, PRI) and 2 television programs (CNN, ABC). It consists of 11201 on-topic documents which are classified into 96 semantic categories. In this experiment, the documents appearing in two or more categories were removed, and only the largest 30 categories were kept, leaving 9394 documents in total. Then, each document is represented by a weighted term-frequency vector [16], and all the documents are gathered into a matrix Adoc of size 9394 × 36771. By using the procedure given in [16], we compute the projected results U_TAP^T A_TAP = Σ_TAP V_TAP^T, and then use the k-means clustering method and the Kuhn-Munkres algorithm to find the best mapping which maps each cluster label to the equivalent label from the document corpus. For the NMF methods, we scale each column of B_NMF such that its ℓ2 norm is equal to 1, and the correspondingly scaled C_NMF is used for clustering and label assignment. To quantitatively evaluate the clustering performance of each method, we selected two metrics, i.e., the accuracy and the normalized mutual information (NMI) (we refer to [38] for a detailed discussion). According to Table 4, it is clear that nonnegative low rank matrix approximation can provide more effective latent features (U_TAP^T A_TAP = Σ_TAP V_TAP^T) for the document clustering task. Note that the computational time required by the proposed TAP method (309.22 seconds) is less than that required by the AP method (3417.33 seconds). Again the results demonstrate that the proposed TAP method is more efficient than the AP method.

4.3 Separable Nonnegative Matrices

In this subsection, we compare the performance of the nonnegative low rank matrix approximation method and separable NMF algorithms. Here we generate two kinds of synthetic separable nonnegative matrices.

• (Separable) The first case A = BC + N is generated in the same way as in [26], in which B ∈ R^{200×20} is uniformly distributed and C = [I_20, H′] ∈ R^{20×210}, with H′ containing all possible combinations of two nonzero entries equal to 0.5 at different positions. The columns of BH′ are thus all the middle points of pairs of columns of B. Meanwhile, the i-th column of N, denoted as ni, obeys ni = σ(mi − w) for 21 ≤ i ≤ 210, where σ > 0 is the noise level, mi is the i-th column of BC, and w denotes the average of the columns of B. This means that we move these columns of A toward the outside of the convex hull of the columns of B (a generation sketch is given after this list).

• (Generalized separable) The second case is generated almost in the same way as the first case, but simultaneously considering the separability of rows, known as generalized separable NMF [31]. For this case, the size of A is set as 78 × 55 with column-rank 10 and row-rank 12, the same as in [31].
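A sketch of the first (separable) construction, assuming NumPy, could be:

```python
import numpy as np
from itertools import combinations

def make_separable(noise_level, m=200, r=20, seed=0):
    """Case 1: A = BC + N with C = [I_r, H'], where each column of H' has
    two nonzero entries equal to 0.5 at different positions."""
    rng = np.random.default_rng(seed)
    B = rng.random((m, r))
    H = np.zeros((r, r * (r - 1) // 2))
    for j, (p, q) in enumerate(combinations(range(r), 2)):
        H[p, j] = H[q, j] = 0.5        # columns of BH are midpoints of columns of B
    C = np.hstack([np.eye(r), H])
    A = B @ C
    w = B.mean(axis=1)                  # average of the columns of B
    for i in range(r, A.shape[1]):      # push the midpoint columns outward
        A[:, i] += noise_level * (A[:, i] - w)
    return A, B, C
```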

Firstly, we test the approximation ability of the TAP and AP methods, the NMF methods, and the successive projection algorithm (SPA) [26], [39] for separable NMF on the synthetic separable data. For the generalized separable case, we compare the TAP (or AP) method with SPA, the generalized SPA (GSPA) [31], and the generalized separable NMF with a fast gradient method (GS-FGM) [31]. Note that when we apply SPA to a generalized separable matrix, we run it first to identify the important columns and then on the transpose of the input to identify the important rows; this variant is referred to as SPA*. The noise level σ is logarithmically spaced in the interval [10⁻³, 1]. For each noise level, we independently generate 25 matrices for the separable and generalized separable cases, respectively. We report the averaged approximation error in Figures 6 and 7. It can be found that the TAP and AP methods achieve the lowest average errors in the testing examples.

Fig. 6. Average relative approximation error on separable matrices (Case 1), with respect to different values of σ.

Fig. 7. Average relative approximation error on generalized separable matrices (Case 2), with respect to different values of σ.

Fig. 8. Average accuracy (left) and distance to ground truth (right) for the different algorithms on generalized separable matrices (Case 2), with respect to different σs.

The approximation errors of the TAP and AP methods are much lower than those of the separable and generalized separable NMF methods when the noise level is high. Note that the average computational time required by the proposed TAP method (0.0064 seconds) is less than that required by the AP method (0.0165 seconds). It is interesting to ask whether a better nonnegative low rank matrix approximation could contribute to a better separable (or generalized separable) NMF result. To further investigate whether nonnegative low rank matrix approximation could help separable and generalized separable NMF methods, we conduct experiments in which the nonnegative low rank approximation is used as the input to the separable and generalized separable NMF methods. We adopt the accuracy and the distance to ground truth defined in Eqs. (16) and (17) of [31] as the quantitative metrics. The accuracy reports the proportion of correctly identified row and column indices, while the distance to ground truth reports the relative errors between the identified important rows (columns) and the ground truth important rows (columns). We present the computational results in Figure 8. When the noise level is between 0.1 and 1, the nonnegative low rank matrix approximation by our TAP method obviously enhances the accuracy and decreases the distance between the identified rows (columns) and the ground truth.

4.4 Symmetric Nonnegative Matrices for Graph Clustering

In this subsection, we test the TAP and AP methods on symmetric matrices. It can readily be seen that the output of the TAP and AP algorithms is symmetric whenever the input matrix is symmetric, since the projection onto the nonnegative matrix manifold or the low rank matrix manifold never affects symmetry. The symmetric NMF methods compared here are the coordinate descent algorithm (denoted as "CD-symNMF") [40], the Newton-like algorithm (denoted as "Newton-symNMF") [27], and the alternating least squares algorithm (denoted as "ALS-symNMF") [27].

We perform experiments using the symmetric NMF methods and the TAP and AP methods on synthetic graph data, which is reproduced from [41] with six different cases. The data points in 2-dimensional space are displayed in the first row of Figure 9. Each case contains clear cluster structures. Following the procedures in [27], [41], a similarity matrix A ∈ R^{n×n}, where n represents the number of data points, is constructed to characterize the similarity between each pair of data points. Each data point is assumed to be connected only to its nearest nine neighbors. Given a specific pair of the i-th and j-th data points xi and xj, we first construct the distance matrix D ∈ R^{n×n} with Dij = Dji = ‖xi − xj‖₂². Then, the similarity matrix is given as

Aij = 0 if i = j, and Aij = exp(−Dij/(σi σj)) if i ≠ j,  (25)

where σi is the Euclidean distance between the i-th data point xi and its 9-th neighbor. Then, we perform the NMF, TAP and AP methods on A.
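A sketch of this similarity construction, assuming NumPy and data points stored as the rows of X, could be:

```python
import numpy as np

def similarity_matrix(X, n_neighbors=9):
    """Similarity matrix (25): A_ij = exp(-D_ij / (sigma_i sigma_j)), A_ii = 0,
    keeping only each point's nearest nine neighbors; sigma_i is the Euclidean
    distance from x_i to its 9th nearest neighbor."""
    n = X.shape[0]
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # squared distances
    dist = np.sqrt(D)
    order = np.argsort(dist, axis=1)                   # order[:, 0] is the point itself
    sigma = dist[np.arange(n), order[:, n_neighbors]]  # distance to the 9th neighbor
    A = np.exp(-D / (sigma[:, None] * sigma[None, :]))
    np.fill_diagonal(A, 0.0)
    mask = np.zeros_like(A, dtype=bool)                # sparsify to the 9-NN graph
    rows = np.repeat(np.arange(n), n_neighbors)
    mask[rows, order[:, 1:n_neighbors + 1].ravel()] = True
    return A * (mask | mask.T)                         # symmetrized neighborhood
```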

The clustering results of the symmetric NMF methods and the nonnegative low rank matrix approximation methods are obtained by applying the k-means method to B and U, respectively. The clustering results are shown in Figure 9. The CD-symNMF method fails in most examples except the one in the second column. Both the Newton-symNMF and ALS-symNMF methods fail in the example in the fifth column. However, the TAP and AP methods perform well for all the examples. The average computational time required by the proposed TAP method (0.0321 seconds) is less than that required by the AP method (0.1035 seconds). The proposed TAP method is faster than the AP method.

4.5 Orthogonal Decomposable Nonnegative Matrices

In this subsection, we test the TAP and AP methods and orthogonal NMF (ONMF) methods [4], [24] on the approximation of synthetic data and on the unmixing of hyperspectral images. The first orthogonal NMF method is the multiplicative updating algorithm proposed by Ding et al. [4], which we refer to as Ding-Li-Peng-Park (DTPP)-ONMF. A multiplicative updating algorithm utilizing the true gradient on the Stiefel manifold was proposed in [24]; we refer to it as SM-ONMF.

Fig. 9. The graph clustering results by the TAP (or AP) method and the symmetric NMF methods on 6 cases of synthetic graph data (with 3, 3, 3, 5, 4 and 3 clusters, respectively); rows from top to bottom: original data, CD-symNMF, Newton-symNMF, ALS-symNMF, AP, TAP. Different colors represent different clusters.

We construct an orthogonal nonnegative matrix B ∈ R^{100×10}, whose transpose is shown in Figure 10. A matrix C ∈ R^{10×30} is then generated with entries uniformly distributed in [0, 1], and we obtain an orthogonal decomposable matrix A = BC ∈ R^{100×30}. Next, a noise matrix generated by the MATLAB command σ × rand(100, 30) is added to A. We set σ = 0, 0.02, 0.04, ..., 0.1. The relative approximation errors of the results by the different methods are shown in Table 5. We can see that the approximation errors of the TAP and AP methods are the lowest among the testing examples.

Fig. 10. An illustration of the generated B^T.

TABLE 5
The relative approximation errors (×100) on the orthogonal decomposable matrix data. The best values and the second best values are respectively highlighted by bold fonts and underlines in the original.

| σ | 0 | 0.02 | 0.04 | 0.06 | 0.08 | 0.1 |
| --- | --- | --- | --- | --- | --- | --- |
| DTPP-ONMF | 0.022 | 2.730 | 5.231 | 7.567 | 9.465 | 11.232 |
| SM-ONMF | 0.016 | 2.741 | 5.169 | 7.533 | 9.424 | 14.180 |
| AP | 0.000 | 2.364 | 4.471 | 6.529 | 8.215 | 9.700 |
| TAP | 0.000 | 2.364 | 4.471 | 6.529 | 8.215 | 9.700 |

As a real-world application of ONMF, hyperspectral image unmixing aims at factoring the observed hyperspectral image in matrix format into an endmember matrix and an abundance matrix. The abundance matrix is indeed the classification of the pixels into different clusters, each corresponding to a material (endmember). In this part, we use a sub-image of the Samson data set [42], consisting of 95 × 95 = 9025 spatial pixels and 156 spectral bands. We form a matrix A of size 9025 × 156 to represent this sub-image. Three different materials, i.e., "Tree", "Rock", and "Water", are present in this sub-image, and we set the rank r to 3. The factor matrices B ∈ R^{9025×3} and C ∈ R^{3×156} can be obtained by the orthogonal NMF methods. We use k-means to do hard clustering on B to obtain the abundance matrix, and we obtain the i-th feature image by reshaping its i-th column into a 95 × 95 matrix. Each row of C represents the spectral reflectance of one material ("Tree", "Rock", or "Water"). As for the TAP and AP methods, we apply the singular value decomposition to the approximated nonnegative low rank matrices to obtain the left singular matrices, which contain the first 3 left singular vectors. Then, we use k-means to do hard clustering on the left singular matrices to cluster the three materials and obtain the abundance matrices and endmember matrices.

TABLE 6
The quantitative metrics of the unmixing results on the hyperspectral image Samson. The best values and the second best values are respectively highlighted by bold fonts and underlines in the original.

| Metric | DTPP-ONMF | SM-ONMF | AP | TAP |
| --- | --- | --- | --- | --- |
| SAD | 0.3490 | 0.4389 | 0.0765 | 0.0765 |
| Similarity | 0.5887 | 0.5640 | 0.9383 | 0.9383 |

Fig. 11. Left: Rock, Tree, Water; Right: Reflectance of Rock, Tree, Water. From top to bottom: groundtruth, DTPP-ONMF, SM-ONMF, AP, TAP.

To quantitatively evaluate the unmixing results, we employ two metrics. The first one is the spectral angle distance (SAD):

SAD = (1/r) Σ_{i=1}^{r} arccos( ŝi^T si / (‖ŝi‖₂ ‖si‖₂) ),

where {ŝi}_{i=1}^{r} are the estimated spectral reflectances (rows of the endmember matrix) and {si}_{i=1}^{r} are the groundtruth spectral reflectances. The second one is the similarity of the abundance feature images [43]:

Similarity = (1/r) Σ_{i=1}^{r} âi^T ai / (‖âi‖₂ ‖ai‖₂),

where {âi}_{i=1}^{r} are the estimated abundance features (columns of the abundance matrix) and {ai}_{i=1}^{r} are the groundtruth ones. We note that a larger Similarity and a smaller SAD indicate a better unmixing result. We exhibit the quantitative metrics in Table 6. We can evidently see that the proposed TAP and AP methods obtain the best metrics. Meanwhile, we illustrate the estimated spectral reflectances and abundance feature images in Figure 11. It can be found from the second row that DTPP-ONMF and SM-ONMF perform well for the materials "Rock" and "Tree" but poorly on "Water". The TAP and AP methods unmix these three materials well, and the proposed TAP method (computational time = 0.1492 seconds) is faster than the AP method (computational time = 0.3738 seconds).
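For completeness, a sketch of the two metrics above, assuming NumPy and that the estimated endmembers and abundances have already been matched to the groundtruth ordering, could be:

```python
import numpy as np

def sad(S_est, S_true):
    """Mean spectral angle distance between estimated and groundtruth spectra (rows)."""
    cos = np.sum(S_est * S_true, axis=1) / (
        np.linalg.norm(S_est, axis=1) * np.linalg.norm(S_true, axis=1))
    return np.mean(np.arccos(np.clip(cos, -1.0, 1.0)))

def abundance_similarity(A_est, A_true):
    """Mean cosine similarity between estimated and groundtruth abundance columns."""
    cos = np.sum(A_est * A_true, axis=0) / (
        np.linalg.norm(A_est, axis=0) * np.linalg.norm(A_true, axis=0))
    return np.mean(cos)
```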

5 CONCLUSION

In this paper, we have proposed a new alternating projection method to compute nonnegative low rank matrix approximations for nonnegative matrices. Our main idea is to use the tangent space of a point in the fixed-rank matrix manifold to approximate the projection onto the manifold in order to reduce the computational cost. Numerical examples in data clustering, pattern recognition and hyperspectral data analysis have shown that the proposed alternating projection method performs better than nonnegative matrix factorization methods in terms of accuracy, and that the computational time required by the proposed alternating projection method is less than that required by the original alternating projection method.

Moreover, we have shown that the sequence generated by alternating projections onto the tangent spaces of the fixed rank matrices manifold and the nonnegative matrix manifold converges linearly to a point in the intersection of the two manifolds, where the convergent point is sufficiently close to optimal solutions. Our theoretical convergence results are new and have not been studied in the literature. We remark that Andersson and Carlsson [37] assumed exact projections onto each manifold and then obtained the convergence result of the alternating projection method. Because our proposed projection onto each manifold is inexact, our proof can be extended to show that a sequence generated by alternating projections on one or two nontangential manifolds based on tangent spaces converges linearly to a point in the intersection of the two manifolds.

As future research work, it would be interesting to study (i) convergence results when inexact projections onto several manifolds are employed, and (ii) applications where other norms (such as the l1 norm) are used in data fitting instead of the Frobenius norm. It is necessary to develop the related algorithms for such manifold optimization problems.

APPENDIX A
PROOFS OF LEMMAS 3.4, 3.6 AND 3.7

A.1 Proof of Lemma 3.4

Proof. For a given ε > 0, there exists an s1(ε) such that Lemma 3.3 applies to Mr ∩ Ball(P, s1(ε)). Since π1(·) and P_{T_{Mr}(·)}(·) are continuous, there exists an s(ε) < s1(ε) such that the image of Ball(P, s(ε)) under π1 and P_{T_{Mr}} is included in Ball(P, s1(ε)). Now we can choose a point Q ∈ Ball(P, s(ε)) ∩ Mr. For any given Z ∈ Ball(P, s(ε)), there are two cases: Z ∈ Mr and Z ∉ Mr. When Z ∈ Mr, using Lemma 3.3 (i) with C = Q and D = Z = π1(Z), (14) follows.

Next we consider Z ∉ Mr. In this case, we set C = Q and D = P_{T_{Mr}(Q)}(Z). Note that Z ∈ Ball(P, s(ε)) and P_{T_{Mr}(Q)}(Z) ∈ Ball(P, s1(ε)). By using Lemma 3.3 (ii), we have

‖P_{T_{Mr}(Q)}(Z) − π1(P_{T_{Mr}(Q)}(Z))‖_F ≤ ε‖P_{T_{Mr}(Q)}(Z) − Q‖_F.

It implies that the set

Mr ∩ Ball(P_{T_{Mr}(Q)}(Z), ε‖P_{T_{Mr}(Q)}(Z) − Q‖_F)


is not empty and is included in Ball(Z, ‖Z − Z1‖_F + ε‖Z1 − Q‖_F) with Z1 = P_{T_{Mr}(Q)}(Z). Note that π1(Z) is on the manifold Mr and is included in Ball(P, s1(ε)). By using Lemma 3.3 (i) (with C = Q and D = π1(Z)), we have

‖π1(Z) − P_{T_{Mr}(Q)}(π1(Z))‖_F ≤ ε‖π1(Z) − Q‖_F.  (26)

Here the three points π1(Z), P_{T_{Mr}(Q)}(π1(Z)) and Q form a right triangle, so we have

‖π1(Z) − P_{T_{Mr}(Q)}(π1(Z))‖_F² + ‖P_{T_{Mr}(Q)}(π1(Z)) − Q‖_F² = ‖π1(Z) − Q‖_F².

By using this equality in (26), we obtain

‖π1(Z) − P_{T_{Mr}(Q)}(π1(Z))‖_F = √(‖π1(Z) − Q‖_F² − ‖P_{T_{Mr}(Q)}(π1(Z)) − Q‖_F²) ≤ φ(ε)‖P_{T_{Mr}(Q)}(π1(Z)) − Q‖_F,  (27)

with φ(ε) = ε/√(1 − ε²). As Z1 = P_{T_{Mr}(Q)}(Z), we know that (Z − Z1) ⊥ T_{Mr}(Q). By using P_{T_{Mr}(Q)}(π1(Z)) ∈ T_{Mr}(Q), we find that P_{T_{Mr}(Q)}(π1(Z)), Z1 and Z form a right-angled triangle which is included in Ball(Z, ‖Z − Z1‖_F + ε‖Z1 − Q‖_F). This implies that

‖P_{T_{Mr}(Q)}(π1(Z)) − Z1‖_F ≤ ‖Z − P_{T_{Mr}(Q)}(π1(Z))‖_F ≤ ‖Z − Z1‖_F + ε‖Z1 − Q‖_F,

and

‖P_{T_{Mr}(Q)}(π1(Z)) − Q‖_F = ‖P_{T_{Mr}(Q)}(π1(Z)) − Z1 + Z1 − Q‖_F ≤ ‖P_{T_{Mr}(Q)}(π1(Z)) − Z1‖_F + ‖Z1 − Q‖_F < ‖Z − Z1‖_F + ε‖Z1 − Q‖_F + ‖Z1 − Q‖_F.  (28)

By combining (27) and (28) with 0 < ε < 3/5, we derive

‖π1(Z) − P_{T_{Mr}(Q)}(π1(Z))‖_F < φ(ε)((1 + ε)‖Z1 − Q‖_F + ‖Z − Z1‖_F) < 2ε(‖Z1 − Q‖_F + ‖Z − Z1‖_F).

Now we set Z2 as the reflection point of π1(Z) with respect to the tangent space T_{Mr}(Q). It is clear that

‖π1(Z) − P_{T_{Mr}(Q)}(π1(Z))‖_F = ‖P_{T_{Mr}(Q)}(π1(Z)) − Z2‖_F.

Along the direction of Z − Z1, we can find a point Z3 such that

‖Z1 − Z3‖_F = ‖π1(Z) − P_{T_{Mr}(Q)}(π1(Z))‖_F.

Thus we can estimate the distance between P_{T_{Mr}(Q)}(Z) and P_{T_{Mr}(Q)}(π1(Z)) as follows:

‖P_{T_{Mr}(Q)}(Z) − P_{T_{Mr}(Q)}(π1(Z))‖_F²
= ‖Z − Z2‖_F² − ‖Z − Z3‖_F²
≤ (‖Z − Z1‖_F + ε‖Z1 − Q‖_F)² − (‖Z − Z1‖_F − 2ε(‖Z1 − Q‖_F + ‖Z − Z1‖_F))²
= (‖Z − Z1‖_F + ε‖Z1 − Q‖_F)² − ((1 − 2ε)‖Z − Z1‖_F − 2ε‖Z1 − Q‖_F)².

With the above inequalities, for any given Z ∈ Ball(P, s(ε)), we have

‖π1(Z) − P_{T_{Mr}(Q)}(Z)‖_F²
= ‖P_{T_{Mr}(Q)}(Z) − P_{T_{Mr}(Q)}(π1(Z))‖_F² + ‖π1(Z) − P_{T_{Mr}(Q)}(π1(Z))‖_F²
≤ (‖Z − Z1‖_F + ε‖Z1 − Q‖_F)² − ((1 − 2ε)‖Z − Z1‖_F − 2ε‖Z1 − Q‖_F)² + 4ε²(‖Z1‖_F + ‖Z2‖_F)²
= 2ε‖Z − Z1‖_F ‖Z1 − Q‖_F + ε²‖Z1 − Q‖_F² + 4ε(‖Z − Z1‖_F² + ‖Z1 − Q‖_F ‖Z − Z1‖_F)
< 16ε‖Z − Q‖_F².

The second inequality is derived by using the facts that ‖Z1 − Q‖_F < ‖Z − Q‖_F, ‖Z2 − Q‖_F < ‖Z − Q‖_F and ε² < ε. Hence the result follows. □

A.2 Proof of Lemma 3.6

Proof. Note that π1(P_{T_{Mr}(Q)}(Z)) and π1(Z) are on the manifold Mr, and π1(P_{T_{Mr}(Q)}(Z)) is the closest point to P_{T_{Mr}(Q)}(Z) on the manifold Mr. Therefore, we have

‖P_{T_{Mr}(Q)}(Z) − π1(P_{T_{Mr}(Q)}(Z))‖_F ≤ ‖P_{T_{Mr}(Q)}(Z) − π1(Z)‖_F.  (29)

We remark that Mrn is a smooth manifold [34] and P ∈ Mrn. Thus there exists an s′ such that π is continuous on Ball(P, s′). In other words, we can find a constant α > 0 such that

‖π(X) − π(X′)‖_F ≤ α‖X − X′‖_F,  ∀ X, X′ ∈ Ball(P, s′).  (30)

Now we choose s1(ε) to be the minimum of s′ and the s(ε) in Lemmas 3.2 and 3.4. For all Z ∈ Ball(P, s1(ε)), we have

‖π(π1(P_{T_{Mr}(Q)}(Z))) − π(Z)‖_F
= ‖π(π1(P_{T_{Mr}(Q)}(Z))) − π(π1(Z)) + π(π1(Z)) − π(Z)‖_F
≤ ‖π(π1(P_{T_{Mr}(Q)}(Z))) − π(π1(Z))‖_F + ‖π(π1(Z)) − π(Z)‖_F
≤ α‖π1(P_{T_{Mr}(Q)}(Z)) − π1(Z)‖_F + ε(ε)‖Z − π(Z)‖_F
≤ α‖π1(P_{T_{Mr}(Q)}(Z)) − P_{T_{Mr}(Q)}(Z)‖_F + α‖P_{T_{Mr}(Q)}(Z) − π1(Z)‖_F + ε(ε)‖Z − π(Z)‖_F
≤ 2α‖P_{T_{Mr}(Q)}(Z) − π1(Z)‖_F + ε(ε)‖Z − π(Z)‖_F
≤ 8α√ε‖Z − Q‖_F + ε(ε)‖Z − π(Z)‖_F
≤ (ε(ε) + 8α√ε)‖Z − π(Z)‖_F + 8α√ε‖Q − π(Z)‖_F.

The second inequality is derived from (30) and (13), the fourth inequality from (29), and the fifth from (14). We choose ε1(ε) = ε(ε) + 8α√ε and ε2(ε) = 8α√ε. The result follows. □

A.3 Proof of Lemma 3.7

Proof. Without loss of generality, we can assume π(Z) = 0; thus it is sufficient to prove that

‖π1(P_{T_{Mr}(Q)}(Z))‖_F < c‖Z‖_F, i.e., ‖π1(P_{T_{Mr}(Q)}(Z))‖_F / ‖Z‖_F < c,

is satisfied. By the definition of σ(P), we can find a constant c1 such that σ(P) < c1 < c. Note that σ(·) is


a locally continuous function, which implies that there exists a constant s1 such that σ(S) < c1 is satisfied for every point S ∈ Mrn ∩ Ball(P, s1). Let s < s1. Since π(·) is a locally continuous function and π(Z) ∈ Mrn ∩ Ball(P, s), we have σ(π(Z)) = σ(0) < c1. Now we set D = π1(Z), D′ = P_{T_{Mr}(π(Z))}(Z) and E = π1(P_{T_{Mr}(Q)}(Z)); then we have

‖π1(P_{T_{Mr}(Q)}(Z))‖_F / ‖Z‖_F = ‖E‖_F / ‖Z‖_F = (‖E‖_F / ‖D′‖_F)(‖D′‖_F / ‖Z‖_F).

The remaining task is to estimate the values of ‖E‖_F/‖D′‖_F and ‖D′‖_F/‖Z‖_F. Recall that Mn is a linear affine manifold and Z is on Mn, which implies that Z = π2(Z) = P_{T_{Mn}(π(Z))}(Z). Therefore,

‖D′‖_F = ‖P_{T_{Mr}(π(Z))}(Z)‖_F = ‖P_{T_{Mr}(π(Z))}(Z) − π(Z)‖_F = σ(π(Z))‖P_{T_{Mn}(π(Z))}(Z)‖_F,

and thus ‖D′‖_F/‖Z‖_F = σ(π(Z)) = σ(0) < c1.

In order to estimate ‖E‖_F/‖D′‖_F, we consider ‖E − D′‖_F, which can be bounded by the following inequality:

‖E − D′‖_F = ‖E − D + D − D′‖_F ≤ ‖E − D‖_F + ‖D − D′‖_F
= ‖E − P_{T_{Mr}(Q)}(Z) + P_{T_{Mr}(Q)}(Z) − D‖_F + ‖D − D′‖_F
≤ ‖E − P_{T_{Mr}(Q)}(Z)‖_F + ‖P_{T_{Mr}(Q)}(Z) − D‖_F + ‖D − D′‖_F
= ‖π1(P_{T_{Mr}(Q)}(Z)) − P_{T_{Mr}(Q)}(Z)‖_F + ‖P_{T_{Mr}(Q)}(Z) − π1(Z)‖_F + ‖D − D′‖_F.  (31)

By using Lemma 3.4 and the fact that Z = π2(Q) is the closest point to Q with respect to π2(·), we have

‖P_{T_{Mr}(Q)}(Z) − π1(Z)‖F < 4√ε‖Z − Q‖F ≤ 4√ε‖Q − π(Z)‖F . (32)

By using the fact that π1(P_{T_{Mr}(Q)}(Z)) is the closest point to P_{T_{Mr}(Q)}(Z) with respect to π1(·), together with (32), we have

‖π1(P_{T_{Mr}(Q)}(Z)) − P_{T_{Mr}(Q)}(Z)‖F ≤ ‖P_{T_{Mr}(Q)}(Z) − π1(Z)‖F ≤ 4√ε‖Q − π(Z)‖F . (33)

By applying Lemma 3.2 to ‖D − D′‖F , we get

‖D − D′‖F = ‖P_{T_{Mr}(π(Z))}(Z) − π1(Z)‖F ≤ 4√ε‖Z‖F . (34)

By putting (32), (33) and (34) into (31), we obtain the following estimate:

‖E − D′‖F < 4√ε‖Q − π(Z)‖F + 4√ε‖Q − π(Z)‖F + 4√ε‖Z − π(Z)‖F
= 8√ε‖Q − π(Z)‖F + 4√ε‖Z − Q + Q − π(Z)‖F
≤ 8√ε‖Q − π(Z)‖F + 4√ε(‖Z − Q‖F + ‖Q − π(Z)‖F)
< 8√ε‖Q − π(Z)‖F + 8√ε‖Q − π(Z)‖F
= 16√ε‖Q − π(Z)‖F . (35)

Now we choose c2 > 1 such that c2c1 < c, and also a sufficiently small ε ∈ (0, 3/5) such that

16√ε (c2/(c2 − 1)) ‖Q‖F < c‖Z‖F (36)

is satisfied. For the value of ‖E‖F / ‖D′‖F , there are two cases to be considered: ‖E‖F / ‖D′‖F ≤ c2 or ‖E‖F / ‖D′‖F > c2. For the first case, it is easy to check that

‖E‖F / ‖Z‖F = (‖E‖F / ‖D′‖F)(‖D′‖F / ‖Z‖F) < c2c1 < c.

For the second case, ‖E‖F / ‖D′‖F > c2. By using (35), we derive

‖E‖F − ‖D′‖F ≤ ‖E − D′‖F < 16√ε‖Q‖F ,

i.e., ‖D′‖F > ‖E‖F − 16√ε‖Q‖F . It implies that

c2 < ‖E‖F / ‖D′‖F < ‖E‖F / (‖E‖F − 16√ε‖Q‖F).

By using (36), we have

‖E‖F / ‖Z‖F < (c2/(c2 − 1)) 16√ε‖Q‖F / ‖Z‖F < c.

The result follows.
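Taken together, the lemmas above justify replacing the expensive exact projection π1 by the tangent-space projection inside the alternating projection loop. The following schematic sketch (again our illustration under simplifying assumptions, reusing numpy and the hypothetical helpers pi1 and tangent_projection from the sketch in A.1; the paper's actual algorithm, stopping rule and parameter choices may differ) alternates the nonnegative projection π2, taken here as the entrywise maximum with zero, with an inexact projection toward the rank-r manifold Mr.

def tangent_alternating_projections(A, r, iters=200):
    # Alternating projections between the nonnegative matrices and the
    # rank-r manifold, with the latter projection computed through the
    # tangent space at the current iterate.
    X = pi1(A, r)                        # start on the rank-r manifold
    for _ in range(iters):
        Y = np.maximum(X, 0.0)           # pi2: entrywise nonnegative projection
        T = tangent_projection(Y, X, r)  # project onto the tangent space at X
        X = pi1(T, r)                    # retract the tangent element to rank r
    return X

Since a tangent element at a rank-r point has rank at most 2r, the retraction pi1(T, r) only needs the SVD of a rank-2r matrix, which is what makes the tangent-space step cheap; the plain call to pi1 above is for clarity only.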

REFERENCES

[1] K. Chen, Matrix preconditioning techniques and applications. Cambridge University Press, 2005, vol. 19.

[2] M. Chen, W.-S. Chen, B. Chen, and B. Pan, “Non-negative sparse representation based on block NMF for face recognition,” in Chinese Conference on Biometric Recognition. Springer, 2013, pp. 26–33.

[3] C. Ding, X. He, and H. D. Simon, “On the equivalence of nonnegative matrix factorization and spectral clustering,” in Proceedings of the 2005 SIAM International Conference on Data Mining. SIAM, 2005, pp. 606–610.

[4] C. Ding, T. Li, W. Peng, and H. Park, “Orthogonal nonnegative matrix t-factorizations for clustering,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006, pp. 126–135.

[5] D. Guillamet and J. Vitria, “Non-negative matrix factorization for face recognition,” in Catalonian Conference on Artificial Intelligence. Springer, 2002, pp. 336–344.

[6] D. Guillamet, J. Vitria, and B. Schiele, “Introducing a weighted non-negative matrix factorization for image classification,” Pattern Recognition Letters, vol. 24, no. 14, pp. 2447–2454, 2003.

[7] L. Jing, J. Yu, T. Zeng, and Y. Zhu, “Semi-supervised clustering via constrained symmetric non-negative matrix factorization,” in International Conference on Brain Informatics. Springer, 2012, pp. 309–319.

[8] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[9] J. Liu, Z. Wu, Z. Wei, L. Xiao, and L. Sun, “A novel sparsity constrained nonnegative matrix factorization for hyperspectral unmixing,” in 2012 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2012, pp. 1389–1392.

[10] Y. Liu, X.-Z. Pan, R.-J. Shi, Y.-L. Li, C.-K. Wang, and Z.-T. Li, “Predicting soil salt content over partially vegetated surfaces using non-negative matrix factorization,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 8, no. 11, pp. 5305–5316, 2015.

[11] Y. Wang, Y. Jia, C. Hu, and M. Turk, “Non-negative matrix factorization framework for face recognition,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 19, no. 4, pp. 495–511, 2005.

[12] D. Zhang, S. Chen, and Z.-H. Zhou, “Two-dimensional non-negative matrix factorization for face representation and recognition,” in International Workshop on Analysis and Modeling of Faces and Gestures. Springer, 2005, pp. 350–363.

[13] M. W. Berry and J. Kogan, Text mining: applications and theory. John Wiley & Sons, 2010.

[14] T. Li and C. Ding, “The relationships among various nonnegative matrix factorization methods for clustering,” in Sixth International Conference on Data Mining (ICDM’06). IEEE, 2006, pp. 362–371.


[15] V. P. Pauca, F. Shahnaz, M. W. Berry, and R. J. Plemmons, “Text mining using non-negative matrix factorizations,” in Proceedings of the 2004 SIAM International Conference on Data Mining. SIAM, 2004, pp. 452–456.

[16] W. Xu, X. Liu, and Y. Gong, “Document clustering based on non-negative matrix factorization,” in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, pp. 267–273.

[17] A. Cichocki, R. Zdunek, A. H. Phan, and S.-i. Amari, Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. John Wiley & Sons, 2009.

[18] P. M. Kim and B. Tidor, “Subsystem identification through dimensionality reduction of large-scale gene expression data,” Genome Research, vol. 13, no. 7, pp. 1706–1718, 2003.

[19] H. Kim and H. Park, “Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis,” Bioinformatics, vol. 23, no. 12, pp. 1495–1502, 2007.

[20] A. Pascual-Montano, J. M. Carazo, K. Kochi, D. Lehmann, and R. D. Pascual-Marqui, “Nonsmooth nonnegative matrix factorization (nsNMF),” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 3, pp. 403–415, 2006.

[21] G. Wang, A. V. Kossenkov, and M. F. Ochs, “LS-NMF: a modified non-negative matrix factorization algorithm utilizing uncertainty estimates,” BMC Bioinformatics, vol. 7, no. 1, p. 175, 2006.

[22] P. Paatero and U. Tapper, “Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values,” Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.

[23] A. Cichocki, R. Zdunek, and S.-i. Amari, “Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization,” in International Conference on Independent Component Analysis and Signal Separation. Springer, 2007, pp. 169–176.

[24] S. Choi, “Algorithms for orthogonal nonnegative matrix factorization,” in 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). IEEE, 2008, pp. 1828–1832.

[25] N. Gillis and F. Glineur, “Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization,” Neural Computation, vol. 24, no. 4, pp. 1085–1105, 2012.

[26] N. Gillis and S. A. Vavasis, “Fast and robust recursive algorithms for separable nonnegative matrix factorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 698–714, 2013.

[27] D. Kuang, C. Ding, and H. Park, “Symmetric nonnegative matrix factorization for graph clustering,” in Proceedings of the 2012 SIAM International Conference on Data Mining. SIAM, 2012, pp. 106–117.

[28] D. Kuang, S. Yun, and H. Park, “SymNMF: nonnegative low-rank approximation of a similarity matrix for graph clustering,” Journal of Global Optimization, vol. 62, no. 3, pp. 545–574, 2015.

[29] C.-J. Lin, “Projected gradient methods for nonnegative matrix factorization,” Neural Computation, vol. 19, no. 10, pp. 2756–2779, 2007.

[30] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Advances in Neural Information Processing Systems, 2001, pp. 556–562.

[31] J. Pan and N. Gillis, “Generalized separable nonnegative matrix factorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

[32] Z. Yuan and E. Oja, “Projective nonnegative matrix factorization for image compression and feature extraction,” in Scandinavian Conference on Image Analysis. Springer, 2005, pp. 333–342.

[33] G. R. Naik, Non-negative matrix factorization techniques. Springer, 2016.

[34] G. Song and M. K. Ng, “Nonnegative low rank matrix approximation for nonnegative matrices,” Applied Mathematics Letters, vol. 105, p. 106300, 2020.

[35] B. Vandereycken, “Low-rank matrix completion by Riemannian optimization,” SIAM Journal on Optimization, vol. 23, no. 2, pp. 1214–1236, 2013.

[36] G. H. Golub and C. F. Van Loan, Matrix computations. JHU Press, 2012, vol. 3.

[37] F. Andersson and M. Carlsson, “Alternating projections on nontangential manifolds,” Constructive Approximation, vol. 38, no. 3, pp. 489–525, 2013.

[38] D. Cai, X. He, and J. Han, “Document clustering using locality preserving indexing,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 12, pp. 1624–1637, 2005.

[39] M. C. U. Araujo, T. C. B. Saldanha, R. K. H. Galvao, T. Yoneyama, H. C. Chame, and V. Visani, “The successive projections algorithm for variable selection in spectroscopic multicomponent analysis,” Chemometrics and Intelligent Laboratory Systems, vol. 57, no. 2, pp. 65–73, 2001.

[40] A. Vandaele, N. Gillis, Q. Lei, K. Zhong, and I. Dhillon, “Coordinate descent methods for symmetric nonnegative matrix factorization,” arXiv preprint arXiv:1509.01404, 2015.

[41] L. Zelnik-Manor and P. Perona, “Self-tuning spectral clustering,” in Advances in Neural Information Processing Systems, 2005, pp. 1601–1608.

[42] F. Zhu, Y. Wang, B. Fan, S. Xiang, G. Meng, and C. Pan, “Spectral unmixing via data-guided sparsity,” IEEE Transactions on Image Processing, vol. 23, no. 12, pp. 5412–5427, 2014.

[43] J. Pan, M. K. Ng, Y. Liu, X. Zhang, and H. Yan, “Orthogonal nonnegative Tucker decomposition,” arXiv preprint arXiv:1912.06836, 2019.

Guang-Jing Song received the Ph.D. degree in mathematics from Shanghai University, Shanghai, China, in 2010. He is currently a professor at the School of Mathematics and Information Sciences, Weifang University. His research interests include numerical linear algebra, sparse and low-rank modeling, tensor decomposition and multi-dimensional image processing.

Michael K. Ng is the Director of the Research Division for Mathematical and Statistical Science, and Chair Professor of the Department of Mathematics, The University of Hong Kong, and Chairman of the HKU-TCL Joint Research Center for AI. His research areas are data science, scientific computing, and numerical linear algebra.

Tai-Xiang Jiang received the B.S. and Ph.D. degrees in mathematics and applied mathematics from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2013. He is currently a lecturer with the School of Economic Information Engineering, Southwestern University of Finance and Economics. His research interests include sparse and low-rank modeling, tensor decomposition and multi-dimensional image processing. https://sites.google.com/view/taixiangjiang/