
Lazy Householder Decomposition of Sparse Matrices ∗

G.W. Howell
North Carolina State University
Raleigh, North Carolina

August 26, 2010

Abstract

This paper describes Householder reduction of a rectangular sparse matrix to small band upper triangular form Bk+1. Bk+1 is upper triangular with nonzero entries only on the diagonal and on the nearest k superdiagonals.

The algorithm is similar to the Householder reduction used as part of the standard dense SVD computation. For the sparse "lazy" algorithm, matrix updates are deferred until a row or column block is eliminated. The original sparse matrix is accessed only for sparse matrix dense matrix (SMDM) multiplications and to extract row and column blocks. For a triangular bandwidth of k + 1, the SMDM operations are of the sparse matrix by dense matrices consisting of the k rows or columns of a block Householder transformation. Block Householder transformations are reliably orthogonal, computationally efficient, and have good potential for parallelization.

Numeric results presented here indicate that using an initial random block Householder transformation allows computation of a collection of largest singular values. Some potential applications are in finding low rank matrix approximations and in solving least squares problems.

1 Introduction

In 1965, Golub and Kahan proposed Householder bidiagonalization A = UB2V as a first step in determining the singular values of dense matrices. For sparse matrices they proposed Lanczos bidiagonalization as a means of determining a few singular values [8].

∗ Supported by NIH Molecular Libraries Roadmap for Medical Research, Grant 1 P20 Hg003900-01.

Rearranging the order of computation to avoid filling a sparse matrix allows a natural extension of the use of Householder decomposition to the sparse case. Householder transformations are scalably stable and, if blocked, the reduction algorithm is almost entirely BLAS-3, efficient on a variety of computer architectures. Given a sparse matrix A, applications include solving Ax = b and finding x to minimize ‖Ax − b‖2. In this paper, we apply the algorithm to finding singular values of A.

Key points are

• Algorithm stability is desirable for reliable solution of large problems. Householder and block Householder transformations are very nearly orthogonal when implemented with rounding arithmetic, enabling simple run-time convergence tests (see Section 4).

• Householder reductions can be applied to sparse matrices by deferring updates of blocks of the original matrix. Updates are not performed until the step on which a column or row block is to be eliminated. Multiplications are accomplished by expressing the updated matrix as a sum of the original sparse matrix and a low rank update1. The work here is essentially an extension of the dense Grosser and Lang algorithm [10], [16] to apply in the sparse case.

• Block Householder transformations are BLAS-3. On current computer architectures, whether cache-based multicore, GPU or other hardware accelerators connected to general processors, or distributed parallel computing, dense BLAS-3 matrix matrix multiplies are significantly faster than BLAS-2 matrix vector multiplications. Section 3 discusses NUMA (shared memory Non-Uniform Memory Access) performance for "wide or tall" BLAS-3.

• Similarly, multiplying a sparse matrix by a dense matrix (SMDM) is faster than multiplying a sparse matrix by a dense vector. If the reduction is to an upper triangular matrix Bk+1 with nonzero entries on the diagonal and nearest k superdiagonals, then the SMDM operations AX and A^T Y entail dense matrices X and Y with k columns.

• With 64 GBytes of RAM, with 48 GBytes allocated to basis storage, then for matrices of size up to about a million square (as illustrated by testing against matrices from Tim Davis's UF collection of matrices [4]), the UBk+1V algorithm usually determined many singular values (see Section 6). For comparison, if one dense 20K by 20K matrix can be stored on a given processor (3.2 GBytes of storage), then 2500 processors would be needed for a dense ScaLAPACK computation of singular values for a one million square matrix.

1 For reduction to Hessenberg form or upper triangular form, the idea of deferring updates has been repeatedly used. Kaufman implemented deferral of updates in sparse Householder QR factorization [15]. For other applications of this idea, see for example ARPACK [19], Sosonkina, Allison, and Watson [25], and Dubrulle [6]. For Householder reduction to bidiagonal form, the idea of deferring updates is implicit in the LAPACK reduction GEBRD (for the dense case) [3], with the extension to the sparse case explicitly outlined in Howell, Demmel, Fulton, Hammarling, and Marmol [11]. The "lazy" functional language Haskell defers updates.

Section 2 compares basis orthogonality and storage requirements for several methods of finding singular values of sparse matrices. Section 3 gives some numeric results justifying the assertion that AX, A^T Y (sparse matrix dense matrix, or SMDM) operations are likely to be faster than Ax, A^T y, and discusses shared memory parallelization of "wide or tall" BLAS-3 operations. Section 4 summarizes some theory, justifying some run-time error estimates. Section 5 is an explicit presentation of the algorithm, provides comparisons of the sparse and dense algorithms, and shows how implicit fill can be useful. Section 6 describes numeric experiments with the Davis UF collection [4] of sparse matrices.

2 Comparison to Other Sparse Methods

The sparse and dense UBk+1V Householder based decompositions are BLAS-3 algorithms with U and V scalably orthogonal.

For comparison, consider Lanczos bidiagonalization for finding a few singular values of a sparse matrix, proposed by Golub and Kahan as a sparse alternative to Householder bidiagonalization [8]. Lanczos bidiagonalization can proceed without storing multipliers, relying on a three term recursion, so that only the last few left and right multipliers are needed. Storage requirements are minimal. In exact arithmetic the Lanczos bases would be orthogonal. In rounding arithmetic, there is a rapid loss of orthogonality, and even of linear independence, as illustrated in Figure 1. A matrix of size a few hundred was randomly generated in the program octave, the left and right multipliers were saved, and the numeric rank of the right multiplier basis was calculated as its number of nonzero singular values.
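The experiment is easy to reproduce. The sketch below (in Python rather than octave; the matrix size, seed, and step count are illustrative, not those used for Figure 1) runs Golub-Kahan Lanczos bidiagonalization without reorthogonalization and tracks the numeric rank of the right multiplier basis, which typically falls below the number of columns once extreme singular values have converged.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, steps = 300, 200, 80
    A = rng.standard_normal((m, n))          # stand-in for the octave test matrix

    V = np.zeros((n, steps))
    v = rng.standard_normal(n); v /= np.linalg.norm(v)
    beta, u_prev = 0.0, np.zeros(m)
    for j in range(steps):
        V[:, j] = v
        u = A @ v - beta * u_prev             # left Lanczos vector
        alpha = np.linalg.norm(u); u /= alpha
        v = A.T @ u - alpha * v               # right Lanczos vector
        beta = np.linalg.norm(v); v /= beta
        u_prev = u
        if (j + 1) % 20 == 0:
            # numeric rank of the "orthogonal" right basis built so far
            print(j + 1, "columns, numeric rank",
                  np.linalg.matrix_rank(V[:, :j + 1]))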

As memory per available node grows, using the memory to get better stability becomes feasible. In order to preserve orthogonality, multiplier vectors are frequently stored, and in a Lanczos algorithm, reorthogonalized. Table 1 compares L2 condition numbers of various methods for constructing bases for the columns of Hilbert matrices, computed as the ratio of largest to smallest singular values of the basis.

Figure 1: Lanczos bases suffer loss of numeric rank. (Plot: "Loss of Linear Independence of 'Orthogonal' Lanczos Basis"; rank deficiency versus number of columns, 0 to 80.)

Sizes            10      20      30      40      50
Householder QR   1.0     1.0     1.0     1.0     1.0
L in LU          9.70    18.7    46.5    53.0    72.2
MGS QR           1.001   98.5    170.    122.5   3631

Table 1: L2 Condition Numbers of Bases. Here we factored Hilbert matrices of various sizes and compared condition numbers of Q for QR factorization or L for LU decomposition.
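The Table 1 comparison can be reproduced along the following lines. This is a sketch using SciPy; the exact values depend on the QR, LU, and Gram-Schmidt implementations used, so the numbers will only qualitatively match the table.

    import numpy as np
    from scipy.linalg import hilbert, lu

    def cond(B):
        # ratio of largest to smallest singular value of the basis
        s = np.linalg.svd(B, compute_uv=False)
        return s[0] / s[-1]

    def mgs(A):
        # modified Gram-Schmidt orthogonalization of the columns of A
        Q = np.array(A, dtype=float)
        for j in range(Q.shape[1]):
            Q[:, j] /= np.linalg.norm(Q[:, j])
            for i in range(j + 1, Q.shape[1]):
                Q[:, i] -= (Q[:, j] @ Q[:, i]) * Q[:, j]
        return Q

    for n in (10, 20, 30, 40, 50):
        H = hilbert(n)
        Q, _ = np.linalg.qr(H)        # Householder QR
        _, L, _ = lu(H)               # unit lower triangular L from LU
        print(n, cond(Q), cond(L), cond(mgs(H)))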

If modified Gram-Schmidt as opposed to Householder orthogonalization is used, the number of flops can be halved. Since modified Gram-Schmidt is primarily a BLAS-1 algorithm, use of BLAS-3 block Householder transformations is typically faster, as well as more nearly orthogonal. Alternatively, Jalby and Philippe [13], Vanderstraeten [28], Stewart [26], Giraud and Langou [7] and others have designed block Gram-Schmidt algorithms to be of comparable stability to modified Gram-Schmidt, which gives a well-conditioned but not orthogonal basis for an ill-conditioned set of vectors such as those obtained by a Lanczos method. Using block Gram-Schmidt in combination with block Lanczos methods, such as that proposed by Golub, Luk, and Overton [9], would be another possible means of obtaining a stable BLAS-3 algorithm.

More usually, Lanczos bidiagonalization with reorthogonalization is used, as in SVDPACK [2] and PROPACK [17], with theoretical development in Larsen's thesis [18] and work by Simon and Zha [24]. Each of these methods is appropriate for finding some singular values. Sparse MATLAB instead uses ARPACK [19], an Arnoldi method based on BLAS-2 non-blocked Householder transformations. Table 2 compares some sparse decompositions in terms of required storage and level of BLAS.

Figure 2: GFlop rates for "tall or wide" BLAS-3 with 16 cores; peak speed is 140 GFlops. The long matrix dimension is fixed at 480000 and the x-axis is the smallest dimension (tall*small, small*wide, and wide*tall products). If the OpenMP loop used during initialization imitates the loop used during computation, parallelization is relatively good, presumably because of data locality. Best performance is about 62% of peak. For matrices of "practical" width of 8, performance is only about 10% of peak.


3 Sparse Matrix Dense Matrix (SMDM) and "Wide or Tall" BLAS-3

Sparse matrix algorithms often use multiplications of the sparse matrix times a dense vector and BLAS-1 or BLAS-2 operations. On cache based computer architectures these execute orders of magnitude slower than the peak machine speed, slowed by repeated fetches of the sparse matrix from RAM. Block algorithms replace sparse matrix dense vector multiplications by sparse matrix dense matrix multiplications, and replace BLAS-1 inner products and daxpys by "wide or tall" BLAS-3 operations. Compared to sparse matrix dense vector and BLAS-1 operations, SMDM multiplications and "wide or tall" BLAS-3 allow more floating point operations for each fetch of a floating point number from RAM.

        Lanczos     PROPACK      ARPACK GMRES
        UB2V        UB2V         UHU^T           UBk+1V
Vecs    O(1)        2N           2N              4N
        Loss of     Uses         Keeps           Keeps
        Rank        Re-orthog    Orthog          Orthog
BLAS    BLAS-1      BLAS-1       BLAS-2          BLAS-3
flops   4Nnz        4Nnz         4Nnz            4Nnz
        + O(N)      + 4N²n       + 4N²n          + 4(n + m)N²

Table 2: Summary Chart Comparing Sparse Decompositions.


3.1 Shared Memory Parallelization for "Wide or Tall" BLAS-3

When dense matrices A and B are too large to fit in fast memory and have smallest dimension smaller than about 100, our tests on multi-core processors indicate that the matrix multiplication AB has computational rate roughly proportional to the smallest dimension.

For a few shared memory cores (or multi-core processors), we get good parallel speedups with a few OpenMP calls – or merely by using a multi-threaded BLAS library. For more than about four cores, a more careful parallelization is needed for the "wide or tall" BLAS-3 operations which are the predominant calculation in the UBk+1V decomposition. Three special cases were parallelized using the OpenMP library. These were Wide, Tall, and WideTall, which respectively parallelize the cases of small × wide, tall × small, and wide × tall.

The 4 socket, 16 core architecture is NUMA (Non Uniform Memory Access). Each socket has faster access to its own RAM than to the RAM associated with the other sockets. The computational rates illustrated in Figure 2 were obtained by using the same OpenMP loops for matrix initialization as for computation, thereby improving data locality. This numeric experiment was on a four motherboard Opteron running in Linux. The same code also produces good data locality and performance for Intel chips.2

2 Using the same OpenMP loops for matrix initialization and computation may fail to produce data locality on other architectures and operating systems. Lack of explicit control over data locality may limit the portability of OpenMP NUMA parallelism.

3.2 Sparse Matrix Dense Matrix Products

Many classic iterative schemes for solving systems of sparse linear equations rely on multiplications Ax, y^T A, and BLAS-1 (vector vector) operations.

Accessing A to perform multiplications AX gives significantly better performance. Figure 3 indicates the relative effects of block size vs. matrix storage in speeding sparse matrix multiplications.

Figure 3: Blocking speeds sparse matrix dense matrix multiplies, giving roughly a 20X speedup (speed in megaflops per second versus number of column blocks, for A*X and A^T*X with 16 vectors in X and with a single vector x). The plot shows speeding multiplication of a randomly generated sparse matrix; blocking the matrix and multiplying by multiple vectors reduce cache misses. The matrix here is 100K by 100K with 500 randomly distributed nonzero entries per row. The computation was with a 64 bit 2.4 GHz two processor Pentium with 512 MByte L2 cache, compiled with an Intel Fortran compiler. Parallelization is provided with one OpenMP parallel loop for AX, Ax and one also for X^T A, x^T A.

Column blocking is effective in improving performance when nonzero entries are uniformly distributed. For other sparse matrices, different matrix storages can improve performance of the AX kernel. For example, Toledo [27], Angeli et al. [1], and Im's Ph.D. dissertation [12] offer some guidance in arranging storage of A to speed the computation Ax. The OSKI package (Vuduc, Demmel, and Yelick [29]) automates the process of choosing storage of A. Nishtala, Vuduc, Demmel, and Yelick [22] offer some guidance as to when OSKI is likely to be effective.
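A rough way to see the SMDM advantage on any machine is to time one sparse-matrix dense-matrix product against the equivalent sequence of sparse matrix-vector products. The sketch below uses SciPy; the matrix size, density, and block size k = 16 are arbitrary choices, not the settings behind Figure 3.

    import time
    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(0)
    m = n = 100_000
    k = 16
    A = sp.random(m, n, density=5e-4, format="csr", random_state=0)
    X = rng.standard_normal((n, k))

    t0 = time.perf_counter()
    Y1 = np.column_stack([A @ X[:, j] for j in range(k)])   # k separate SpMV calls
    t1 = time.perf_counter()
    Y2 = A @ X                                              # one SMDM call
    t2 = time.perf_counter()

    print("k SpMV:", t1 - t0, "s;  SMDM:", t2 - t1, "s;  equal:", np.allclose(Y1, Y2))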

For the UBk+1V decomposition, access to A is only for the multiplications AX, Y^T A and extractions of blocks of A. Almost all other operations are BLAS-3 with minimal dimension k.

4 Some Theory

Using block Householder transformations, we expect the overall condition number of transformations to be very near one, and expect that if we compute Bk+1 to satisfy A = UBk+1V, then the singular values of Bk+1 and A will be closely matched.

For a practical sparse algorithm only a partial decomposition is made, i.e., for A of m rows and n columns we compute only the first N rows and columns of Bk+1, N < n. The following subsections discuss interlacing of singular values and approximation of A by UN Bk+1,N VN in the Frobenius norm.

4.1 Interlacing of Singular Values

In the dense case, singular values are typically found by reducing an m × n matrix A to a condensed m × n matrix B (upper triangular, or with banded structure such as bidiagonal) with the same singular values, then finding the singular values of the condensed form by some iterative procedure.

For extending reduction to small band (or upper triangular) form to the sparse case, obtaining an m × n reduced matrix is impractical if the transformations are stored, and unstable if they are not stored. Transformations must be stored to maintain orthogonality and linear independence.

It's natural to try to use the singular values of an N × N reduced matrix Bk+1,N obtained after "eliminating" N columns as approximate singular values of the original matrix A. The singular values of Bk+1,N are sometimes called Ritz values of A. Cauchy's interlacing property relates the Ritz values to the singular values of A.

Cauchy's Interlace Theorem [20]: Let C be a Hermitian matrix partitioned as

    C = [ H  B*
          B  U  ],

where C is n × n and H is N × N. Let C have eigenvalues α1 ≤ α2 ≤ ... ≤ αn and H have eigenvalues θ1 ≤ θ2 ≤ ... ≤ θN. Then for j = 1, ..., N,

    αj ≤ θj ≤ αj+n−N    (4.1)

and for l = 1, 2, ..., n,

    θl−n+N ≤ αl ≤ θl.    (4.2)

Suppose A has been transformed to

    AN = [ RN  TN
           0   Bk+1,N ].

Then the Hermitian matrix

    AN^T AN = [ RN^T RN    RN^T TN
                TN^T RN    TN^T TN + Bk+1,N^T Bk+1,N ]

has the same eigenvalues as A^T A. For the symmetric matrix AN^T AN, Cauchy's interlacing theorem implies that the eigenvalues of the symmetric matrix RN^T RN interlace with those of A^T A. Since the singular values σi of A have the same ordering in size as the eigenvalues λi = σi² of A^T A, Cauchy's interlacing theorem interlaces the singular values of A and RN.

As an example of interlacing, consider singular values of the 4 by 4 upper triangular matrix

    T = [ 4 3 2 0
          0 3 2 1
          0 0 2 1
          0 0 0 1 ].

Let T1, T2, T3 be the upper left 1 × 1, 2 × 2, 3 × 3 matrices respectively. The singular values of T1, T2, T3, T are respectively

    T1:  4
    T2:  5.3890    2.2267
    T3:  6.0959    2.5356    1.5527
    T:   6.13336   2.8331    1.62131   .85183

Actually, we can do a bit more. When reducing to a banded upper triangular form, we get AN of the form

    AN = [ RN  LN  0
           0   B   C
           0   D   E ]    (4.3)

and we naturally wonder whether singular values of

    R = [ RN  LN  0 ]

are related to those of A. The following result of Kahan from p. 196 of [20] is applicable.

The Residual Interlace Theorem. Let F be a Hermitian matrix of the form

    F = [ H  C*  0*
          C  V   Z*
          0  Z   W ]

where H is N × N, V is j × j, and F is n × n. Define

    M(X) = [ H  C*
             C  X  ]    (4.4)

where V − X is assumed to be invertible. Denote the eigenvalues of M(X) as

    µ1 ≤ µ2 ≤ ... ≤ µj+N.

Then each interval [µi, µi+N], i = 1, ..., j, contains a different eigenvalue of F. Also, outside each open interval (µl, µl+j), l = 1, ..., N, there is a different eigenvalue of F.

The residual interlace theorem applies to AN^T AN, as it is of the form (suppressing the N subscripts)

    A^T A = [ R^T R    R^T L                    0
              L^T R    L^T L + B^T B + D^T D    B^T C + D^T E
              0        C^T B + E^T D            C^T C + E^T E ].

Taking X = L^T L gives V − X = B^T B + D^T D. The theorem will apply if V − X is nonsingular, which will be the case when either the columns of B or the columns of D are linearly independent.

We conclude that the j + N singular values αi of R, taken as the square roots of the eigenvalues of M(L^T L), are lower bounds for the top j + N singular values of A. In particular, if αi, i ≤ N, is the ith largest singular value of R, then αi < σi, where σi is the ith largest singular value of A. Applying the interlacing theorem to R and RN, the ith of the N largest singular values of R is larger than the ith singular value ηi of RN.

Since ηi ≤ αi ≤ σi, the singular values αi of R are better estimates of the singular values of A than are the singular values ηi of RN.

For example, take

    A = [ 4  3  2  0
          0  3  2  1
          0  0  2  1
          0  0 −1  1 ].

Let R1, R2, R3 be the upper left 2 × 2, 2 × 3, 2 × 4 matrices respectively. The first two singular values of R1, R2, R3, A are respectively

    R1:  5.3890   2.2267
    R2:  6.0220   2.3949
    R3:  6.0422   2.5479
    A:   6.1452   2.7981

4.2 Approximation of A by Jkl = Ukl Bk+1 Vkl^T

Suppose that Akl is related to A by the orthogonal transformations Ukl and Vkl as

    Akl = Ukl^T A Vkl.

Due to the orthogonality of Ukl and Vkl, we have

    ‖Akl‖F = ‖A‖F.

For the algorithm described in the next section, Akl has the form

    Akl = [ Bk+1  Ck  0
            0     Âkl    ]    (4.5)

where Bk+1 is kl × kl, Ck is kl × k, and Âkl denotes the trailing block. In our instance, Ck has nonzero entries only in its lower triangular k × k block.

We're interested in the case that Âkl is not computed, as it would be dense and large and likely to overflow the RAM. Since

    ‖A‖F² = ‖Akl‖F² = ‖Bk+1‖F² + ‖Ck‖F² + ‖Âkl‖F²,

we have

    ‖Âkl‖F² = ‖A‖F² − ‖Bk+1‖F² − ‖Ck‖F².    (4.6)

Take B̄k+1 = [Bk+1 | Ck] as a kl × k(l + 1) matrix and Jkl = Ukl B̄k+1 Vkl^T as a rank kl approximation to A. The approximation is good if ‖Âkl‖F is small, with the quantities on the right hand side of (4.6) easily computable during a computation.
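A small dense illustration of the bookkeeping in (4.6): after kl columns have been eliminated, the Frobenius norm of the never-formed trailing block can be recovered from ‖A‖F and the norm of the computed top rows. The sizes below and the use of a one-sided QR factorization are only stand-ins for the two-sided reduction.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, kl = 30, 20, 8
    A = rng.standard_normal((m, n))

    # A full left orthogonal factor from QR of the first kl columns zeroes
    # A(kl:, :kl), mimicking the state after kl eliminations.
    Q, _ = np.linalg.qr(A[:, :kl], mode="complete")
    Akl = Q.T @ A

    top = Akl[:kl, :]          # plays the role of [ Bk+1 | Ck | 0 ]
    trailing = Akl[kl:, kl:]   # the block the sparse algorithm never forms

    # runtime estimate (4.6): ||trailing||_F^2 = ||A||_F^2 - ||top||_F^2
    est = np.sqrt(np.linalg.norm(A, "fro")**2 - np.linalg.norm(top, "fro")**2)
    print(np.isclose(est, np.linalg.norm(trailing, "fro")))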

5 The lazy UBk+1V partial decomposition

We adapt the BLAS-3 algorithm for reduction to bandwidth k + 1 using Householder reductions of block size k. Dense implementations were by Grosser and Lang [10], [16]. Using deferred updates to convert a dense to a sparse algorithm for k = 1 (the bidiagonal case) is discussed in Howell, Demmel, Fulton, and Marmol [11].

5.1 Notes on Lazy 2-Sided Block Householder Reduction

The pseudo-code below3 should allow the reader to verify the following points.

• Entries of Aold are not changed. The only accesses to Aold are for SMDM multiplications and extractions of matrix sub-blocks. If l pairs of row and column blocks of size k are eliminated, then Aold is accessed for 2l SMDM operations consisting of multiplication of Aold by blocks of k vectors.

• Block Householder transformations can be used for the reduction. The implementations of qrl and qlt used, but not specifically detailed, use the algorithms due to Schreiber and Van Loan [23]. Alternately, the method proposed by Joffrain, Low, Quintana-Orti, Van de Geijn, and Van Zee [14] or Puglisi [21] could be used. Householder transformations are reliably orthogonal. Blocking the transformations enables use of BLAS-3.

• Operations updating column and row blocks and forming the update matrices are BLAS-3. BLAS-2 operations are only in initializations and copies, and in the qrl and qlt formation of block Householder transformations. If the qrl and qlt operations are BLAS-2, the total number of BLAS-2 flops is O(mnk) for elimination of all columns, O((m + n)k²l) for elimination of l k-sized blocks. In comparison (see (5.14)), there are 6(lk)²(m + n) − 8(lk)³ BLAS-3 flops for eliminating l blocks of k rows and columns.

As presented, the block Householder reduction "runs to completion", useful in that the returned matrix Bk+1 can be observed to have very nearly the same singular values as the input matrix, enabling a test for correct implementation.
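The following is a minimal unblocked sketch of such a correctness test (plain Householder reflections applied to a small dense matrix, not the lazy block algorithm of the pseudo-code below): the reduced matrix is upper banded with at most k superdiagonals and has the same singular values as the input, up to roundoff.

    import numpy as np

    def householder(x):
        # return v, beta with (I - beta v v^T) x = (+-||x||, 0, ..., 0)^T
        v = np.array(x, dtype=float)
        normx = np.linalg.norm(v)
        if normx == 0.0:
            return v, 0.0
        v[0] += normx if v[0] >= 0 else -normx
        return v, 2.0 / (v @ v)

    def band_reduce(A, k=1):
        # alternate left/right reflections; unblocked, for illustration only
        B = np.array(A, dtype=float)
        m, n = B.shape
        for i in range(n):
            v, beta = householder(B[i:, i])              # zero column i below diagonal
            B[i:, i:] -= beta * np.outer(v, v @ B[i:, i:])
            if i + k < n - 1:                            # zero row i beyond column i+k
                v, beta = householder(B[i, i + k:])
                B[:, i + k:] -= beta * np.outer(B[:, i + k:] @ v, v)
        return B

    rng = np.random.default_rng(0)
    A = rng.standard_normal((10, 8))
    B = band_reduce(A, k=3)
    print(np.allclose(np.linalg.svd(A, compute_uv=False),
                      np.linalg.svd(B, compute_uv=False)))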

More usually, for a large sparse matrix, the returned matrix Bk+1 has dimension kl × kl, kl << n, where l blocks have been eliminated. l is chosen either a priori as an input or from a convergence criterion, and the algorithm is ended at the "! End of loop on blocks" comment. The returned matrix Bk+1 then has nonzero entries confined to the diagonal and k superdiagonals, with Bk+1 satisfying

3 As presented, and given appropriate qrl and qlt functions, the algorithm closely follows an octave script implementation.

    [ Bk+1   0    0
      Ck     0
      0    Aupdated ]  =  [ ∏_{i=1}^{l} (I − Ui Li Ui^T) ] (Aold + E) [ ∏_{i=1}^{l} (I − Vi^T Ti Vi) ]    (5.1)

In (5.1), Ck is a lower triangular k × k matrix. Denote ǫ as the largest number satisfying 1 = fl(1 + ǫ). Due to the use of block Householder transformations, E satisfies

    ‖E‖/‖A‖ = O(ǫ),    (5.2)

i.e., the UBk+1V decomposition is backward stable. When the algorithm is not run to completion, so that the (not actually computed) Aupdated in (5.1) has size (m − kl) × (n − kl), then we typically assume the ‖E‖ term in (5.1) to be negligible compared to ‖Aupdated‖. As indicated by (4.6), runtime estimates of ‖Aupdated‖F enable estimates of ‖E‖F.

5.2 Pseudo-code for lazy UBk+1V

As given here, the code runs to "completion", returning an upper triangular banded matrix with bandwidth k + 1 (see (5.5) below). In exact arithmetic, the returned matrix has the same singular values as the original matrix and is related to the original matrix by

    Areturn = [ ∏_{i=1}^{l} (I − Ui Li Ui^T) ] Aorig [ ∏_{i=1}^{l−1} (I − Vi^T Ti Vi) ]    (5.3)

or by

    Areturn = (I − Ul Ll Ul^T)(Aorig − UZ − WV)    (5.4)

where (I − Ui Li Ui^T) and (I − Vi^T Ti Vi) are block Householder transformations. On return the blocks Ui are stored in the ith block of k columns in U (lower triangular) and the blocks Vi are stored in the ith block of k rows in V (upper triangular). Similarly, W is lower triangular and Z upper triangular.4

The algorithm proceeds by alternately eliminating blocks of k columns and k rows. When no more blocks of size k can be made, then the rest of the columns are eliminated as one block. Hence the last block of L can be up to twice as large as the others.

4 The block letters U, V, W, Z refer to matrices used in the algorithm. U, V refer to generic orthogonal matrices. In exact arithmetic, A − UZ − WV is an orthogonal transformation of A, but none of the matrices U, Z, W, V are orthogonal.

For A = UB4V, m = 10, n = 8, k = 3, the returned B4 has the following form

    B4 = [ x x x x 0 0 0 0
           0 x x x x 0 0 0
           0 0 x x x x 0 0
           0 0 0 x x x x x
           0 0 0 0 x x x x
           0 0 0 0 0 x x x
           0 0 0 0 0 0 x x
           0 0 0 0 0 0 0 x
           0 0 0 0 0 0 0 0
           0 0 0 0 0 0 0 0 ]    (5.5)

Capital letters are used below to indicate that variables are matrices (as opposed to vectors). Accesses to the original sparse matrix are commented as either extractions of blocks or as SMDM operations.

Pseudo-Code for UBk+1V

Function [B, U, W, V, Z, L, Ltemp, T] ← band(m, n, k, Aold)

! Assume m > n
! Input Variables
!   m    – number of rows
!   n    – number of columns
!   k    – number of superdiagonals in returned matrix
!          (also the block size for multiplications by Aold)
!   Aold – input matrix in sparse storage

! Output Variables
!   B is an m by n matrix with upper bandwidth k+1
!   U (m x n), W (m x n), V (n x n), Z (n x n)
!   L (m x k), T (k x n), Ltemp (2k x 2k)
!   where (compare to (5.4), (5.3)) the extra term here is from eliminating
!   all remaining columns as a final block; for a large sparse problem,
!   this final block will not be eliminated:
!     B = (I − Ulast Ltemp Ulast^T) ∏i (I − Ui Li Ui^T) Aold ∏i (I − Vi^T Ti Vi)
!       = (I − Ulast Ltemp Ulast^T) (Aold − UZ − WV)
!   where W = [W1|W2|...|Wl], U = [U1|U2|...|Ul],
!   V^T = [V1^T|V2^T|...|Vl^T], Z^T = [Z1^T|Z2^T|...|Zl^T],
!   where each of the blocks Ui, Wi has k columns,
!   each block Vi, Zi has k rows,
!   and Ulast may have up to 2k columns.

! Initializations
B ← 0(m,n) ; W ← 0(m,n) ; U ← 0(m,n) ;
V ← 0(n,n) ; Z ← 0(n,n) ; L ← 0(m,k) ; T ← 0(n,k) ;

blks ← floor((n − k)/k) ;
for i = 1:blks,
    il(i) ← (i−1)k + 1 ; ih(i) ← ik ;
end
if ( k blks != n )
    il(blks+1) ← k blks + 1 ; ih(blks+1) ← n ;
end

mnow ← m ; nnow ← k ;
Atemp ← Aold( : , 1:k ) ;                 ! Extract first column block of Aold

ilow ← 1 ; ihi ← k ; ihp1 ← k+1 ;
C ← Aold( : , 1:k ) ;                     ! Extract first row block of Aold

! – qrl returns the QR factorization of the first column block of Aold,
!   where R = (I − Utemp Ltemp Utemp^T) C
[ Utemp, R, Ltemp ] ← qrl( m, k, C ) ;
U( ilow:m , ilow:ihi ) ← Utemp ; L( ilow:ihi , : ) ← Ltemp ;
B( ilow:m , ilow:ihi ) ← R ;

! – C will be the update of the first row block of Aold
Lua ← Ltemp (Utemp^T) Aold( ilow:m , ihp1:n ) ;    ! SMDM multiplication with Aold
C ← Aold( ilow:ihi , ihp1:n ) ;                    ! Extract a row block of Aold
C ← C − U( 1:k , 1:k ) Lua ;

! – qlt performs the QL factorization of C so that
!   Ttemp = C ( I − Vtemp Lr Vtemp^T )
[ Vtemp, Lr, Ttemp ] ← qlt( k, n−k, C ) ;
B( 1:k , k+1:n ) ← Lr ; T( 1:k , : ) ← Ttemp ;

! – Get the first blocks for U, V, Z, W
Temp ← Ttemp Vtemp ;

Z( 1:k , k+1:n ) ← Lua − (Lua Vtemp^T) Temp ;
Temp2 ← Vtemp^T Ttemp ;
W( : , 1:k ) ← Aold( : , k+1:n ) Temp2 ;           ! SMDM with Aold
V( 1:k , k+1:n ) ← Vtemp ;

! – Now loop through all but the end block.
! – In the usual application to large sparse matrices, the number of loops
!   blks is constrained by available RAM or by satisfaction of a
!   convergence requirement.
for i = 2:blks,
    ilow ← il(i) ; ihi ← ih(i) ; ihp1 ← ih(i) + 1 ;

    ! To proceed with a reduction to banded form, we need to multiply
    ! the updated A
    !     Aupdated ← Aold − U Z − W V
    ! by a block of vectors X. Since Aupdated is presumed dense,
    ! Aupdated X is accomplished as
    !     Aold X − U (Z X) − W (V X)

    ! Update the current column block of A
    C ← Aold( ilow:m , ilow:ihi ) ;                ! Extract a column block of Aold
    C ← C − U( ilow:m , 1:ilow−1 ) Z( 1:ilow−1 , ilow:ihi ) ;
    C ← C − W( ilow:m , 1:ilow−1 ) V( 1:ilow−1 , ilow:ihi ) ;

    ! – qrl performs the QR factorization of the current column block.
    [ Utemp, R, Ltemp ] ← qrl( m−ilow+1, k, C ) ;
    U( ilow:m , ilow:ihi ) ← Utemp ; L( ilow:ihi , : ) ← Ltemp ;

    ! Multiply (Li Ui^T) Aupdated
    B( ilow:m , ilow:ihi ) ← R ;
    Lup ← Ltemp Utemp^T ;
    Lua ← Lup Aold( ilow:m , ihp1:n ) ;            ! SMDM with Aold
    Lua ← Lua − (Lup U( ilow:m , 1:ilow−1 )) Z( 1:ilow−1 , ihp1:n ) ;
    Lua ← Lua − (Lup W( ilow:m , 1:ilow−1 )) V( 1:ilow−1 , ihp1:n ) ;

    ! Update the current row block of B
    C ← Aold( ilow:ihi , ihp1:n ) ;                ! Extract row block of Aold
    C ← C − U( ilow:ihi , 1:ilow−1 ) Z( 1:ilow−1 , ihp1:n ) ;
    C ← C − W( ilow:ihi , 1:ilow−1 ) V( 1:ilow−1 , ihp1:n ) ;
    ! The row block also needs the update from the current column block
    C ← C − U( ilow:ihi , ilow:ihi ) Lua ;

    ! Having updated the current row block, get its QL factorization
    ! by calling qlt
    [ Vtemp, Lr, Ttemp ] ← qlt( k, n−ihi, C ) ;
    B( ilow:ihi , ihp1:n ) ← Lr ; T( ilow:ihi , : ) ← Ttemp ;

    ! Get the next blocks for the U, V, Z, W matrices
    Temp ← Ttemp Vtemp ;
    Z( ilow:ihi , ihp1:n ) ← Lua − (Lua Vtemp^T) Temp ;
    Temp2 ← Vtemp^T Ttemp ;
    Temp3 ← Aold( ilow:m , ihp1:n ) Temp2 ;        ! SMDM with Aold
    Temp3 ← Temp3 − U( ilow:m , 1:ilow−1 ) ( Z( 1:ilow−1 , ihp1:n ) Temp2 ) ;
    Temp3 ← Temp3 − W( ilow:m , 1:ilow−1 ) ( V( 1:ilow−1 , ihp1:n ) Temp2 ) ;
    W( ilow:m , ilow:ihi ) ← Temp3 ;
    V( ilow:ihi , ihp1:n ) ← Vtemp ;
end   ! End of loop on blocks

! We've eliminated all the row blocks of width "k".
! Eliminate the rest of the columns as one block
ilow ← ih(blks) + 1 ; ihi ← n ;

! Update the current column block from Aold
C ← Aold( ilow:m , ilow:ihi ) ;                    ! Extract block of Aold
C ← C − U( ilow:m , 1:ilow−1 ) Z( 1:ilow−1 , ilow:ihi ) ;
C ← C − W( ilow:m , 1:ilow−1 ) V( 1:ilow−1 , ilow:ihi ) ;

! – qrl for QR factorization of the last column block.
[ Utemp, R, Ltemp ] ← qrl( m−ilow+1, n−ilow+1, C ) ;
U( ilow:m , ilow:ihi ) ← Utemp ;
L2 ← Ltemp ;
B( ilow:m , ilow:ihi ) ← R ;

Endfunction

5.3 Comparison to Dense 2-Sided Block Householder Reduction

This section makes explicit some differences between the sparse algorithm presented in the pseudo-code and the more usual dense algorithm.

The dense algorithm proceeds by alternately eliminating column and row blocks. Consider a partitioning of the original matrix

    A = [ A11  A12  A13
          A21  A22  A23
          A31  A32  A33 ].    (5.6)

For the sparse algorithm, elimination of a block of columns corresponding to A21 and A31 and an initial row corresponding to A12 and A13 has changed no entries of A. A11 corresponds to an upper triangular matrix B11. A21, A31, A13, and the upper triangular part of A12 would have been eliminated in the dense algorithm. The dense algorithm would update the trailing matrix

    ( A22  A23
      A32  A33 ).    (5.7)

5.3.1 Multiplying AX

In the case of a large sparse matrix, we can't actually form the updated

    A = A + UZ + WV,

as it would be dense and exhaust RAM. Instead compute

    AX = AX + U(ZX) + W(VX).    (5.8)

Where the dense algorithm would have an already updated block to eliminate, the sparse algorithm extracts the corresponding block of the original matrix and performs a "just in time" update. For example, for the block A32:

• Perform the sequence of block Householder eliminations which had already been made to eliminate A21, A11, A12:

      A32 ← A32 − U(3, :) Z(:, 2) − W(3, :) V(:, 2).    (5.9)

  These are all BLAS-3.

• Perform a QR factorization of A32.

New blocks of W and Z are formed by multiplying the currently produced blocks of dense vectors by the sparse A, e.g.

    W(:, 3) ← A V(3, :)^T.
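A minimal sketch of the deferred product, using the sign convention of the pseudo-code comments (Aupdated = Aold − UZ − WV) and randomly generated stand-ins for the accumulated blocks; the sizes below are arbitrary.

    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(1)
    m, n, r, k = 2000, 1500, 40, 8     # r rows/columns already accumulated, block size k
    A = sp.random(m, n, density=1e-3, format="csr", random_state=1)
    U, Z = rng.standard_normal((m, r)), rng.standard_normal((r, n))
    W, V = rng.standard_normal((m, r)), rng.standard_normal((r, n))
    X = rng.standard_normal((n, k))

    # forming the update explicitly would densify A (O(mn) storage) ...
    A_updated = A.toarray() - U @ Z - W @ V
    # ... while the lazy product needs only one SMDM plus skinny BLAS-3 products
    Y_lazy = A @ X - U @ (Z @ X) - W @ (V @ X)

    print(np.allclose(A_updated @ X, Y_lazy))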


5.3.2 Storage and Flop Comparisons

Reducing an m × n, m ≥ n matrix to upper bandwidth k + 1 by Householder transformations requires 4mn² − 4/3 n³ flops. For the ith elimination of row and column blocks of size k, the dense algorithm requires 4(m − ik)(n − ik)k flops for updates and 4(m − ik)(n − ik)k flops for multiplications of A by blocks of k row and column multipliers, so that the ith block requires

    8k(m − ik)(n − ik)    (5.10)

flops for elimination.

The sparse algorithm as illustrated in the pseudo-code differs in that, instead of multiplications of the form A Ui, Vi^T A with A dense, we perform multiplications

    Aupdated Ui = Aold Ui + U(ik+1:m, 1:ik) Z(1:ik, ik+1:n) Ui
                          + W(ik+1:m, 1:ik) V(1:ik, ik+1:n) Ui    (5.11)

and

    Vi^T Aupdated = Vi^T Aold + Vi^T U(ik+1:m, 1:ik) Z(1:ik, ik+1:n)
                              + Vi^T W(ik+1:m, 1:ik) V(1:ik, ik+1:n).    (5.12)

Neglecting the sparse matrix dense matrix flops, the flop count for a completed reduction would be 6mn² − 2n³, with the incremental number of flops for the ith pair of row-column blocks being

    12k(ik)[m + n − 2ik],    (5.13)

requiring a total of

    6(lk)²(m + n) − 8(lk)³    (5.14)

flops to eliminate l row and column blocks of size k.

For the dense algorithm, required storage is independent of the number of row-column pairs eliminated. As seen in (5.10), initial row-column eliminations require more flops than later ones.

Conversely, for the sparse algorithm, the number of flops for the next block eliminated is proportional to i (when ik << n + m), so that the flop count is proportional to the square of the total number l of eliminated blocks. For m = n, the incremental flop counts for the sparse and dense algorithms are equal for ik = n/4, so that at n/4 the difference in required flops is maximal. For ik > n/4, the dense algorithm becomes more competitive in terms of required flops.

In the dense serial algorithm, the size of matrix which can be reduced to small band form (on a single processor) depends on how large a dense matrix will fit in available RAM.
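The crossover described above can be tabulated directly from (5.10) and (5.13); a small sketch, with the matrix size chosen so that a dense copy would need roughly 2 GBytes, as in Figure 4:

    def dense_flops(m, n, k, l):
        # cumulative dense-algorithm flops, per (5.10), for the first l blocks
        return sum(8 * k * (m - i * k) * (n - i * k) for i in range(1, l + 1))

    def sparse_flops(m, n, k, l):
        # cumulative sparse-algorithm flops, per (5.13), for the first l blocks
        return sum(12 * k * (i * k) * (m + n - 2 * i * k) for i in range(1, l + 1))

    m = n = 16_000
    k = 8
    for cols in (2000, 4000, 8000):    # incremental costs cross at ik = n/4 = 4000
        l = cols // k
        print(cols, f"dense {dense_flops(m, n, k, l):.2e}",
              f"sparse {sparse_flops(m, n, k, l):.2e}")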

Figure 4: Sparse and dense flops versus columns eliminated, for a matrix for which the dense algorithm requires 2 GBytes of storage (flops to eliminate l row-column pairs, dense algorithm versus sparse algorithm).

In the sparse serial algorithm, available storage limits the number of blocks that can be eliminated. Neglecting the storage of A, eliminating l rows and columns requires 2(m + n)l double precision numbers stored in W, U, V and Z. When l = n/4, the total storage for U, V, W, and Z is nm/2 + n²/2, comparable to the total storage required for the dense algorithm.

For a double precision "in-core" serial dense computation, the largest matrix we can expect to reduce with 2 GBytes of RAM is at most 16K square.5

For the sparse matrix with a fixed quantity of RAM, the number l of eliminated rows and columns is inversely proportional to m + n. For example, with 2 GBytes allocated for storage of U, V, W, and Z, then with m + n = 100000 (one hundred thousand), at most 1250 row-column pairs can be eliminated "in-core"; for m + n = 1000000 (one million), at most 125.6
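The same arithmetic as a two-line helper (the function name and defaults are illustrative, not from the paper's code):

    def max_pairs(m_plus_n, ram_bytes=2 * 1024**3, bytes_per_double=8):
        # storage for U, V, W, Z is about 2*(m+n)*l double precision numbers
        return ram_bytes // (2 * m_plus_n * bytes_per_double)

    print(max_pairs(100_000))    # on the order of 1250 row-column pairs
    print(max_pairs(1_000_000))  # on the order of 125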

5 K here means 2^10; 16K * 16K * 8 bytes = 2 GBytes, with 8 bytes per double precision number. This assumes that only U, V are stored, overwriting part of A.

6 For both the sparse and dense case, this discussion neglects the RAM needed for system requirements, which might typically reduce RAM available for storage to three quarters of installed RAM. When storage use exceeds available RAM, execution times may markedly increase as data must be written to and read from hard disk drives.

Figure 5: Number of rows and columns that can be eliminated with 2 GBytes of array storage (number of row-column pairs that can be eliminated versus total rows plus columns).

5.4 Apply a preliminary random block Householder transformation

As so far discussed, sparse block Householder reduction may perform wasted block eliminations. For example, if A is already of banded upper triangular form, Householder elimination of l rows and columns merely extracts the upper left l × l matrix, singular values of which may not be representative of the singular values of A. Taking a preliminary random Householder transformation ensures that the SMDM operations AX and A^T Y are made with X and Y dense, so that the products are impacted by all nonzero entries of A.

Modification of the UBk+1V algorithm is straightforward. The only access to the original sparse matrix A was for block extraction and multiplications by blocks of dense vectors (SMDMs).

For the SMDM operation, denote

    A0 = A − U0 Z0 − W0 V0.

Then

    X ← A0 X = AX − U0(Z0 X) − W0(V0 X)

and Equation (5.8) becomes

    AX = X + U(ZX) + W(VX).

Each extraction of a row or column block of A is replaced by an extraction and a preliminary update. So instead of the column extraction

    C ← A(ilow:m , ilow:ihi)

Figure 6: Converged singular values for bandwidth 2 (number of converged singular values versus matrix size, 10K to 400K, 249 matrices; values compared, values converged with 2 GBytes of storage, and values converged with 1 GByte of storage). For each of 261 matrices, some singular values were determined. When 2 GBytes of storage could be used without "relocation error" or "segmentation faults", at least 20 singular values were found.

we have

    C ← A(ilow:m , ilow:ihi) − U0(ilow:m , 1:ilow−1) Z0(1:ilow−1 , ilow:ihi)
                             − W0(ilow:m , 1:ilow−1) V0(1:ilow−1 , ilow:ihi).    (5.15)

The preliminary random Householder transformation was used in the numeric experiments discussed in the next section.

6 Numerical Experiments

Numerical tests were with matrices from the Davis UF Sparse Collection [4]. As with Householder implementations of GMRES and ARPACK, UBk+1V is a stable algorithm. Limiting the computation to use 2 GBytes of dimensioned space, then for matrices of size up to around 3000 × 3000, singular values of Bk+1 (all columns eliminated) and of A can both be computed by a 32 bit version of the standard LAPACK program dgesvd. As expected, LAPACK dgesvd gives the same singular values for A and Bk+1 to high accuracy.7

7 Our implementation of lazy Householder computation reduces to upper banded form, and can be run to completion only for matrices with at least as many rows as columns.

Figure 7: Converged singular values for bandwidth 6 (number of converged singular values versus matrix size, 10K to 400K, 307 matrices; values compared, values converged with 2 GBytes of storage, and values converged with 1 GByte of storage). For one case of 307, no singular values were determined. In several other cases, only a few singular values were determined.

For matrices A large enough that 32 bit LAPACK cannot be used, the UBk+1V algorithm cannot be run to completion, as the storage requirements would be too high. For these larger matrices, singular values of Bk+1,N were computed by two calls to LAPACK dgesvd. A first call to dgesvd was for the entire N × N matrix. A second dgesvd call was for the upper left square submatrix B1 of dimension min(N − k, N − 6). The largest L = 2√N + 10 (lk = N ∝ 1/(m + n)) singular values were compared to one another. Let σ1 ≥ σ2 ≥ ... ≥ σL be the largest L singular values of Bk+1,N, and σ′1 ≥ σ′2 ≥ ... ≥ σ′L the largest singular values of B1. σi was said to be converged if

    |σj − σ′j| / |σj| < 10^−8,  for all j, 1 ≤ j ≤ i.
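A sketch of this test as a small helper, assuming the two sets of singular values have already been computed (e.g. by LAPACK dgesvd or numpy.linalg.svd):

    import numpy as np

    def num_converged(sigma_full, sigma_sub, tol=1e-8):
        # count leading singular values of the kl x kl banded matrix that
        # agree, to relative tolerance tol, with those of its leading submatrix
        count = 0
        for s, s1 in zip(np.sort(sigma_full)[::-1], np.sort(sigma_sub)[::-1]):
            if abs(s - s1) / abs(s) < tol:
                count += 1
            else:
                break
        return count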

Converged singular values for different bandwidths agreed to high accuracy.

Figures 6, 7, and 8 were tested on the June 2008 collection. These tests used only 2 GBytes of RAM, compiled with 32 bit integers. For bandwidths 2 (Figure 6), 6 (Figure 7), and 12 (Figure 8) we computed singular values for all the unsymmetric or rectangular matrices with 10^4 < (m + n)/2 < 4 × 10^5. The results for bandwidths 6 and 12 include integer valued matrices. In each case, the number of steps lk = N is calculated so that the 2N(m + n) double precision numbers allocated for W, U, V, Z plus the nz elements of the sparse matrix A require less than 2 GBytes of storage (taking 8 bytes of storage per double precision number).

Figure 8: Converged singular values for bandwidth 12 (number of converged singular values versus matrix size, 10K to 400K, 315 matrices). Using block size 12 for matrices larger than 100 thousand was not effective with only 1 GByte of storage. For these matrices, the number of multiplications by A was relatively small.

For each matrix the code was recompiled to reset parameters for matrix dimensioning. For some of the larger matrices, Fortran code compiled with the g77 compiler suffers "relocation" errors at compile time, or run time "segmentation faults". These instances were recompiled to use 1 GByte for matrix storage, and rerun. For each bandwidth k = 2, 6, 12, around 300 matrices successfully ran. In each instance, the Frobenius norm of Bk+1,N was less than or equal to the Frobenius norm of A, with near equality in some cases. The algorithm used a preliminary random block Householder transformation of block size k.

Figures 6, 7 and 8 plot the number of singular values converged for matrices of sizes 10000 < (m + n)/2 < 400000. The legend "." represents the number of singular values compared, "o" represents the number converged for 2 GBytes of storage, and "x" represents the number of converged singular values for 1 GByte of storage. For the largest matrices represented, only about 40 right and left basis vectors could be computed. The maximal number of computed basis vectors was 1500 (representing the flat part of the plots).

When the "o" encloses a "." or an "x", all compared singular values converged. The isolated "o"s and "x"s indicate instances for which fewer singular values converged than were compared.

Figure 9: Converged singular values for bandwidth 8 with a 48 GByte basis (number of converged singular values versus matrix size, 10 thousand to 10 million, 420 matrices; values converged with 64 GBytes of storage versus values compared). Singular values were determined for 420 matrices of size 10 thousand to 10 million. 7 matrices of size less than 1 million failed to return at least a dozen converged singular values. 64 GBytes of RAM were not always enough to find multiple singular values for matrices of size greater than a million.

For a high proportion of the test matrices, all the L = 2√N + 10 singular values converged.

• For bandwidth 2, for 238 of 250 matrices, all the compared singular values converged. The minimal number of converged singular values was 10 of 30 compared.

• For bandwidth 6, for 278 of 308 matrices, all compared singular values converged. In one instance there were no converged singular values of 36 compared. Two other instances of poor convergence were 2 of 36 and 3 of 28. All the worst cases were when only 1 GByte of storage was used. Also, these cases tended to be the matrices of higher dimension (for which the size N of B7 was relatively small). The next worst was 9 of 53.

• For a bandwidth of 12, singular values were computed for 315 matrices. 32 of these had suffered "relocation" errors or runtime segmentation faults with 2 GBytes of storage, so were rerun allowing 1 GByte for storage. For 259 of 315 matrices, all the compared singular values converged. There were many instances of no converged singular values, especially for large matrices and in the case that only 1 GByte of storage could be used.

Figure 10: Banded matrix norms as a fraction of the original matrix norm (proportion of compared singular values that converged versus FrobeniusNorm(Abanded)/FrobeniusNorm(A), 462 matrices). If the Frobenius norm of the banded matrix is of the same order as the norm of the original matrix, compared singular values are likely to be converged. The two circles at the lower right are exceptional cases.


6.1 64 GBytes of RAM

The preceding runs were tested against the Davis collection of June 2008. In November of 2009, additional large matrices, several dozen of size greater than a million, had entered the collection. At the same time, 16 core (4 quad core Opteron) blades had become available in the NC State blade center.8

Using 48 GBytes of RAM for matrix storage, then for test matrices of size up to about a million, the UBk+1V algorithm with k = 8 determined some singular values in all but a few cases. Figure 9 plots the number of converged singular values vs. the matrix size. For this plot, the number of compared singular values is taken as L = 2√N + 10. For matrices of size greater than a million, fewer rows and columns can be eliminated, and fewer singular values are determined.

8 These blades have 64 GBytes of RAM. OpenMP BLAS performance on these machines was plotted in Figure 2. Using 8 byte integers and the ACML 3.6.1 8 byte integer BLAS library with a PGI Fortran compiler, segmentation faults did not occur.

Figure 11: Clustered singular values slow convergence (fraction of compared singular values that converged versus (smallest converged singular value)/(2nd largest singular value), 441 matrices). If the 2nd largest singular value is nearly equal to the smallest converged singular value, few singular values may converge. A small ratio σmin/σ2 was a good predictor that most compared singular values would converge. There was one exceptional case with a small ratio for which only about 1/5 of the compared singular values converged.

Table 3 shows some timing results. Times ranged from about 40 seconds for a matrix of size 50 thousand to about 500 seconds for a matrix of size 322 thousand. For these matrices and the matrix of size 160 thousand, 1250 rows and columns were eliminated, so that singular values were determined from a triangular matrix of size 1250 with bandwidth 8. For the matrix of size 1.96 million, about 200 rows and columns were eliminated. Though the BLAS-3 operations are reasonably fast, the Sparse Matrix Dense Matrix (SMDM) multiplications and other computations (largely skinny QR) are a significant proportion of the 16 core time. Getting good parallel performance for more than 16 processors will require parallelization of skinny QR (see Demmel, Grigori, Hoemmen, and Langou [5] for a successful approach) and more work on the SMDM operations.

6.2 Observations on Convergence

It's natural to expect that the number of converged singular values tends to increase with the basis size and the number of multiplications by the sparse matrix.

GFlop rates with 16 cores
Size   BLAS3   BLAS3   SMDM   SMDM    Other
in K   Secs    Gflop   Secs   Gflop   Secs
1961   117     5.85    54     .09     110
322    334     16.3    79.4   .133    92.5
160    171     16.4    54.1   .191    49.5
50     47.3    18.4    14.2   .138    14.7

Table 3: Runs of Spar3Bnd – Bandwidth 8.

We would also expect the number of converged singular values to increase with the fraction of the original matrix Frobenius norm captured in the Frobenius norm of the banded matrix. Conversely, if many large singular values are nearly equal in size, then convergence is likely to be slow, so that the number of converged singular values will tend to be less. These tendencies were evident in experiments with the Davis matrix collection. Figures 10 and 11 are from the same test (48 GBytes of basis vectors and the Dec. 2009 Davis collection) as Figure 9.

• The number of converged singular values increases with the size of the Householder basis. Since the number of basis vectors (rows and columns eliminated) is inversely proportional to matrix size (see Figure 5), a decrease in computed singular values with increased matrix size is expected. See Figures 6, 7 and 8.

• For a fixed number of rows and columns eliminated (fixed usage of RAM), the number of multiplications AX and Y^T A is inversely proportional to the bandwidth k. For a fixed allocation of storage, the number of computed singular values decreases somewhat as k increases. Again, see Figures 6, 7 and 8. Conversely, increasing k increases the speed of the computation (see Figure 2).

• When the Frobenius norm of the reduced matrix Bk+1 in Equation (5.1) is near that of the original matrix A, i.e., when

      R = ‖Bk+1,N‖F / ‖A‖F ≈ 1,

  convergence of a significant fraction of singular values is likely. Figure 10 plots the proportion of compared singular values that converged vs. R.

• When the largest singular values are nearly equal, relatively few singular values may converge. Figure 11 plots the proportion of converged to compared singular values vs. the ratio σmin/σ2, where σmin is the smallest converged singular value and σ2 is the next to largest converged singular value,

      σmin ≤ ... ≤ σ2 ≤ σ1.

7 Conclusions and Acknowledgements

We report good success in using the lazy UBk+1V decomposition to compute a collection of largest singular values for sparse matrices. Ongoing work is in

• computing singular vectors and low rank approximations

• comparing performance to other methods of computing sparse matrixsingular values

• simplifying and modernizing the code

• improving multi-core performance

Some current work is in using a UBk+1V decomposition for solving a sparse least squares problem.

The author wishes to offer thanks for advice and encouragement from Gene Golub and Jim Demmel. He is grateful to Franc Brglez for aid in automating numerical experiments over a fairly large collection of matrices and to Noura Howell for help in editing the manuscript.

References

[1] J. Angeli, O. Basset, C. Fulton, G. Howell, R. Hsu, A. Sawetprawhickal, M. Schuster, D. Richardson, H. Thompson, and S. Wilberscheid. Some issues in efficient implementation of a vector based model for document retrieval, June 2001. http://www.ncsu.edu/itd/hpc.

[2] M. Berry, T. Do, G. O'Brien, V. Krishna, and S. Varadhan. SVDPACKC: Version 1.0 user's guide. Technical Report CS-93-194, University of Tennessee, Knoxville, TN, October 1993.

[3] J. Choi, J. Dongarra, and D. Walker. The design of a parallel dense linear algebra software library: Reduction to Hessenberg, tridiagonal, and bidiagonal form; Cholesky factorization routines. Num. Alg., 10:379–399, 1995. LAPACK Working Note #92.

[4] T. Davis. University of Florida sparse matrix collection, 2008. http://www.cise.ufl.edu/research/sparse/matrices/.

[5] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-optimal parallel and sequential QR and LU factorizations. Technical Report UCB/EECS-2008-89, lawn204, University of California, http://www.netlib.org/lapack/lawns/downloads/, August 2008.

[6] A. A. Dubrulle. On block Householder algorithms for the reduction of a matrix to Hessenberg form. Supercomputing 88, Vol. II: Science and Applications, Proceedings, IEEE Explore, 2:129–140, Nov. 1988.

[7] L. Giraud and J. Langou. Robust selective Gram-Schmidt reorthogonalization. Technical Report TR/PA/02/52, CERFACS, Toulouse, FR, 2002.

[8] G. Golub and W. Kahan. Calculating the singular values and pseudo-inverse of a matrix. SIAM J. Num. Anal., 2:205–224, 1965.

[9] G. Golub, F. Luk, and M. Overton. A block Lanczos method for computing the singular values and corresponding singular vectors of a matrix. ACM Trans. Math. Soft., 7:147–169, 1981.

[10] B. Grosser and B. Lang. Efficient parallel reduction to bidiagonal form, 1998. Preprint BUGHW-SC 98/2 (available from http://www.math.uni-wuppertal/).

[11] G. Howell, J. Demmel, C. Fulton, S. Hammarling, and K. Marmol. BLAS 2.5 Householder bidiagonalization. ACM Transactions on Mathematical Software, 34(3):13–46, May 2008.

[12] E. Im. Optimizing the Performance of Sparse Matrix-Vector Multiplication. PhD thesis, University of California, Berkeley, 2000.

[13] W. Jalby and B. Philippe. Stability analysis and improvement of the block Gram-Schmidt algorithm. SIAM J. Sci. Stat. Comput., 12(5):1058–1073, 1991.

[14] T. Joffrain, T. M. Low, E. S. Quintana-Orti, R. Van de Geijn, and F. G. Van Zee. Accumulating Householder transformations, revisited. ACM Trans. on Math. Software, 32(2):169–179, 2006.

[15] L. Kaufman. Application of dense Householder transformation to a sparse matrix. ACM Trans. on Math. Software, 5(4):442–450, 1979.

[16] B. Lang. Parallel reduction of banded matrices to bidiagonal form. Parallel Comput., 22:1–18, 1996.

[17] R. Larsen. PROPACK, software package for sparse SVD. Available from http://soi.stanford.edu/ rmunk/PROPACK/.

[18] R.-M. Larsen. Lanczos bidiagonalization with partial reorthogonalization. PhD thesis, Dept. Computer Sci., University of Aarhus, 1998.

[19] R. Lehoucq, D. Sorensen, and C. Yang. ARPACK User's Guide: solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods. SIAM, 1998.

[20] B. Parlett. The Symmetric Eigenvalue Problem. Prentice-Hall, 1980.

[21] C. Puglisi. Modification of the Householder method based on the compact WY representation. SIAM J. Sci. Statist. Comput., 13(3):723–726, 1992.

[22] R. Nishtala, R. W. Vuduc, J. W. Demmel, and K. Yelick. When cache blocking sparse matrix vector multiply works and why. Applicable Algebra in Engineering, Communication and Computing, 18:297–311, March 2007.

[23] R. Schreiber and C. F. Van Loan. A storage-efficient WY representation for products of Householder transformations. SIAM J. Scientific and Statistical Computing, 10:53–57, 1989.

[24] D. Simon and H. Zha. Low-rank matrix approximation using the Lanczos bidiagonalization process with applications. SIAM J. Sci. Computing, 21(6):2257–2275, 2000.

[25] M. Sosonkina, D. C. S. Allison, and L. T. Watson. Scalable parallel implementations of the GMRES algorithm via Householder reflections. In Proc. Intern. Conf. on Parallel Processing, pages 396–404. IEEE Explore, 10–14 Aug. 1998.

[26] G. W. Stewart. The Gram-Schmidt algorithm and its variations. Technical Report TR-4642, Department of Computer Science, University of Maryland, December 2004.

[27] S. Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. IBM Journal of Research and Development, 41(6), 1997.

[28] D. Vanderstraeten. A stable and efficient parallel block Gram-Schmidt algorithm. In Euro-Par'99, Lecture Notes in Computer Science, No. 1685, pages 1128–1135. Springer-Verlag, 1999.

[29] R. Vuduc, J. Demmel, and K. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proceedings of SciDAC 2005, Journal of Physics: Conference Series. SciDAC, June 2005.
