lu factorization with panel rank revealing pivoting and its … · 2012-08-14 · with partial...

ISS

N02

49-6

399

ISR

NIN

RIA

/RR

--78

67--

FR+E

NG

RESEARCHREPORTN° 7867Août 2012

Project-Teams Grand-Large

LU factorization withpanel rank revealingpivoting and itscommunication avoidingversionAmal Khabou ([email protected]), James W. Demmel([email protected]), Laura Grigori([email protected]), Ming Gu([email protected]).ar

Xiv

:120

8.24

51v1

[cs

.NA

] 1

2 A

ug 2

012

RESEARCH CENTRESACLAY – ÎLE-DE-FRANCE

Parc Orsay Université4 rue Jacques Monod91893 Orsay Cedex

LU factorization with panel rank revealingpivoting and its communication avoiding

version

Amal Khabou ∗ ([email protected]), James W. Demmel†

([email protected]), Laura Grigori‡

([email protected]), Ming Gu §

([email protected]).

Equipes-Projets Grand-Large

Rapport de recherche n° 7867 — Aout 2012 — 53 pages

Resume : On presente dans ce rapport une factorisation LU qui est plus stable que l’eliminationde Gauss avec pivotage partiel (GEPP) en terme de facteur de croissace. Cette factorisation permetde resoudre des matrices pathologiques ou GEPP echoue (Wilkinson, Foster). On presente aussila version minimisant la communication de ce nouvel algorithme.

Mots-cles : Factorisation LU, stabilite numerique, minimisation de la communication, factori-sation QR

∗ Laboratoire de Recherche en Informatique, Universite Paris-Sud 11, INRIA Saclay - Ilede France† Computer Science Division and Mathematics Department, UC Berkeley, CA 94720-1776,

USA. This work has been supported in part by Microsoft (Award #024263) and Intel (Award#024894) funding and by matching funding by U.C. Discovery (Award #DIG07-10227). Ad-ditional support comes from Par Lab affiliates National Instruments, Nokia, NVIDIA, Oracle,and Samsung. This work has also been supported by DOE grants de-sc0003959, de-sc0004938,and DE-AC02-05CH11231‡ INRIA Saclay - Ile de France, Laboratoire de Recherche en Informatique, Universite

Paris-Sud 11, France. This work has been supported in part by the French National ResearchAgency (ANR) through COSINUS program (project PETALH no ANR-10-COSI-013).§ Mathematics Department, UC Berkeley, CA 94720-1776, USA

http://arxiv.org/abs/de-sc/0003959

http://arxiv.org/abs/de-sc/0004938

Abstract: We present the LU decomposition with panel rank revealing pivot-ing (LU PRRP), an LU factorization algorithm based on strong rank revealingQR panel factorization. LU PRRP is more stable than Gaussian eliminationwith partial pivoting (GEPP), with a theoretical upper bound of the growthfactor of (1+ τb)

nb , where b is the size of the panel used during the block factor-

ization, τ is a parameter of the strong rank revealing QR factorization, and nis the number of columns of the matrix. For example, if the size of the panel isb = 64, and τ = 2, then (1+2b)n/b = (1.079)n 2n−1, where 2n−1 is the upperbound of the growth factor of GEPP. Our extensive numerical experiments showthat the new factorization scheme is as numerically stable as GEPP in practice,but it is more resistant to pathological cases and easily solves the Wilkinsonmatrix and the Foster matrix. The LU PRRP factorization does only O(n2b)additional floating point operations compared to GEPP.

We also present CALU PRRP, a communication avoiding version of LU PRRPthat minimizes communication. CALU PRRP is based on tournament pivoting,with the selection of the pivots at each step of the tournament being performedvia strong rank revealing QR factorization. CALU PRRP is more stable thanCALU, the communication avoiding version of GEPP, with a theoretical upperbound of the growth factor of (1 + τb)

nb (H+1)−1, where b is the size of the panel

used during the factorization, τ is a parameter of the strong rank revealing QRfactorization, n is the number of columns of the matrix, and H is the heightof the reduction tree used during tournament pivoting. The upper bound ofthe growth factor of CALU is 2n(H+1)−1. CALU PRRP is also more stable inpractice and is resistant to pathological cases on which GEPP and CALU fail.

Key-words: LU factorization, numerical stability, communication avoiding,strong rank revealing QR factorization

LU PRRP and CALU PRRP 3

1 Introduction

The LU factorization is an important operation in numerical linear algebrasince it is widely used for solving linear systems of equations, computing thedeterminant of a matrix, or as a building block of other operations. It consists ofthe decomposition of a matrix A into the product A = ΠLU , where L is a lowertriangular matrix, U is an upper triangular matrix, and Π a permutation matrix.The performance of the LU decomposition is critical for many applications, andit has received a significant attention over the years. Recently large efforts havebeen invested in optimizing this linear algebra kernel, in terms of both numericalstability and performance on emerging parallel architectures.

The LU decomposition can be computed using Gaussian elimination withpartial pivoting, a very stable operation in practice, except for several patho-logical cases, such as the Wilkinson matrix [21, 14], the Foster matrix [7], orthe Wright matrix [23]. Many papers [20, 17, 19] discuss the stability of theGaussian elimination, and it is known [14, 9, 8] that the pivoting strategy used,such as complete pivoting, partial pivoting, or rook pivoting, has an importantimpact on the numerical stability of this method, which depends on a quantityreferred to as the growth factor. However, in terms of performance, these piv-oting strategies represent a limitation, since they require asympotically morecommunication than established lower bounds on communication indicate isnecessary [4, 1].

Technological trends show that computing floating point operations is be-coming exponentially faster than moving data from the memory where they arestored to the place where the computation occurs. Due to this, the communica-tion becomes in many cases a dominant factor of the runtime of an algorithm,that leads to a loss of its efficiency. This is a problem for both a sequentialalgorithm, where data needs to be moved between different levels of the mem-ory hierarchy, and a parallel algorithm, where data needs to be communicatedbetween processors.

This challenging problem has prompted research on algorithms that reducethe communication to a minimum, while being numerically as stable as classicalgorithms, and without increasing significantly the number of floating pointoperations performed [4, 11]. We refer to these algorithms as communicationavoiding. One of the first such algorithms is the communication avoiding LUfactorization (CALU) [11, 10]. This algorithm is optimal in terms of communi-cation, that is it performs only polylogarithmic factors more than the theoreticallower bounds on communication require [4, 1]. Thus, it brings considerable im-provements to the performance of the LU factorization compared to the classicroutines that perform the LU decomposition such as the PDGETRF routineof ScaLAPACK, thanks to a novel pivoting strategy referred to as tournamentpivoting. It was shown that CALU is faster in practice than the correspondingroutine PDGETRF implemented in libraries as ScaLAPACK or vendor libraries,on both distributed [11] and shared memory computers [5]. While in practiceCALU is as stable as GEPP, in theory the upper bound of its growth factor isworse than that obtained with GEPP. One of our goals is to design an algo-rithm that minimizes communication and that has a smaller upper bound of itsgrowth factor than CALU.

In the first part of this paper we present the LU PRRP factorization, anovel LU decomposition algorithm based on that we call panel rank revealing

RR n° 7867


pivoting (PRRP). The LU PRRP factorization is based on a block algorithmthat computes the LU decomposition as follows. At each step of the blockfactorization, a block of columns (panel) is factored by computing the strongrank revealing QR (RRQR) factorization [12] of its transpose. The permutationreturned by the panel rank revealing factorization is applied on the rows of theinput matrix, and the L factor of the panel is computed based on the R factor ofthe strong RRQR factorization. Then the trailing matrix is updated. In exactarithmetic, the LU PRRP factorization computes a block LU decompositionbased on a different pivoting strategy, the panel rank revealing pivoting. Thefactors obtained from this decomposition can be stored in place, and so theLU PRRP factorization has the same memory requirements as standard LUand can easily replace it in any application.

We show that LU PRRP is more stable than GEPP. Its growth factor isupper bounded by (1 + τb)

nb , where b is the size of the panel, n is the num-

ber of columns of the input matrix, and τ is a parameter of the panel strongRRQR factorization. This bound is smaller than 2n−1, the upper bound of thegrowth factor for GEPP. For example, if the size of the panel is b = 64, then(1 + 2b)n/b = (1.079)n 2n−1. In terms of cost, it performs only O(n2b) morefloating point operations than GEPP. In addition, our extensive numerical ex-periments on random matrices and on a set of special matrices show that theLU PRRP factorization is very stable in practice and leads to modest growthfactors, smaller than those obtained with GEPP. It also solves easily pathologi-cal cases, as the Wilkinson matrix and the Foster matrix, on which GEPP fails.While the Wilkinson matrix is a matrix constructed such that GEPP has anexponential growth factor, the Foster matrix [8] arises from a real application.

We also discuss the backward stability of LU PRRP using three metrics,the relative error ‖PA − LU‖/‖A‖, the normwise backward error (7), and thecomponentwise backward error (8). For the matrices in our set, the relative erroris at most 5.26×10−14, the normwise backward error is at most 1.09×10−14, andthe componentwise backward error is at most 3.3 × 10−14 (with the exceptionof three matrices, sprandn, compan, and Demmel, for which the componentwisebackward error is 1.3× 10−13, 6.9× 10−12, and 1.16× 10−8 respectively). Laterin this paper, figure 2 displays the ratios of these errors versus the errors ofGEPP, obtained by dividing the maximum of the backward errors of LU PRRPand the machine epsilon (2−53) by the maximum of those of GEPP and themachine epsilon. For all the matrices in our set, the growth factor of LU PRRPis always smaller than that of GEPP (with the exception of one matrix, thecompar matrix). For random matrices, the relative error of the factorization ofLU PRRP is always smaller than that of GEPP. However, for the normwise andthe componentwise backward errors, GEPP is slightly better, with a ratio of atmost 2 between the two. For the set of special matrices, the ratio of the relativeerror is at most 1 in over 75% of cases, that is LU PRRP is more stable thanGEPP. For the rest of the 25% of the cases, the ratio is at most 3, except forone matrix (hadamard) for which the ratio is 23 and the backward error is onthe order of 10−15. The ratio of the normwise backward errors is at most 1 inover 75% of cases, and always 3.4 or smaller. The ratio of the componentwisebackward errors is at most 2 in over 81% of cases, and always 3 or smaller (exceptfor one matrix, the compan matrix, for which the componentwise backward erroris 6.9× 10−12 for LU PRRP and 6.2× 10−13 for GEPP).

In the second part of the paper we introduce the CALU PRRP factorization,

RR n° 7867


the communication avoiding version of LU PRRP. It is based on tournamentpivoting, a strategy introduced in [10] in the context of CALU, a communicationavoiding version of GEPP. With tournament pivoting, the panel factorizationis performed in two steps. The first step selects b pivot rows from the entirepanel at a minimum communication cost. For this, sets of b candidate rows areselected from blocks of the panel, which are then combined together through areduction-like procedure, until a set of b pivot rows are chosen. CALU PRRPuses the strong RRQR factorization to select b rows at each step of the reductionoperation, while CALU is based on GEPP. In the second step of the panelfactorization, the pivot rows are permuted to the diagonal positions, and the QRfactorization with no pivoting of the transpose of the panel is computed. Thenthe algorithm proceeds as the LU PRRP factorization. Note that the usageof the strong RRQR factorization ensures that bounds are respected locally ateach step of the reduction operation, but it does not ensure that the growthfactor is bounded globally as in LU PRRP.

To address the numerical stability of the communication avoiding factor-ization, we show that performing the CALU PRRP factorization of a matrixA is equivalent to performing the LU PRRP factorization of a larger matrix,formed by blocks of A and zeros. This equivalence suggests that CALU PRRPwill behave as LU PRRP in practice and it will be stable. The dimension andthe sparsity structure of the larger matrix also allows us to upper bound thegrowth factor of CALU PRRP by (1 + τb)

nb (H+1)−1, where in addition to the

parameters n, b, and τ previously defined, H is the height of the reduction treeused during tournament pivoting.

This algorithm has two significant advantages over other classic factoriza-tion algorithms. First, it minimizes communication, and hence it will be moreefficient than LU PRRP and GEPP on architectures where communication isexpensive. Here communication refers to both latency and bandwidth costs ofmoving data between levels of the memory hierarchy in the sequential case, andthe cost of moving data between processors in the parallel case. Second, it ismore stable than CALU. Theoretically, the upper bound of the growth factor ofCALU PRRP is smaller than that of CALU, for a reduction tree with a sameheight. More importantly, there are cases of interest for which it is smallerthan that of GEPP as well. Given a reduction tree of height H = logP , whereP is the number of processors on which the algorithm is executed, the panelsize b and the parameter τ can be chosen such that the upper bound of thegrowth factor is smaller than 2n−1. Extensive experimental results show thatCALU PRRP is as stable as LU PRRP, GEPP, and CALU on random matricesand a set of special matrices. Its growth factor is slightly smaller than that ofCALU. In addition, it is also stable for matrices on which GEPP fails.

As for the LU PRRP factorization, we discuss the stability of CALU PRRPusing three metrics. For the matrices in our set, the relative error is at most9.14 × 10−14, the normwise backward error is at most 1.37 × 10−14, and thecomponentwise backward error is at most 1.14×10−8 for Demmel matrix. Figure5 displays the ratios of the errors with respect to those of GEPP, obtained bydividing the maximum of the backward errors of CALU PRRP and the machineepsilon by the maximum of those of GEPP and the machine epsilon. For randommatrices, all the backward error ratios are at most 2.4. For the set of specialmatrices, the ratios of the relative error are at most 1 in over 62% of cases, andalways smaller than 2, except for 8% of cases, where the ratios are between 2.4

RR n° 7867


and 24.2. The ratios of the normwise backward errors are at most 1 in over75% of cases, and always 3.9 or smaller. The ratios of componentwise backwarderrors are at most 1 in over 47% of cases, and always 3 or smaller, except for 7ratios which have values up to 74.

We also discuss a different version of LU PRRP that minimizes commu-nication, but can be less stable than CALU PRRP, our method of choice forreducing communication. In this different version, the panel factorization isperformed only once, during which its off-diagonal blocks are annihilated usinga reduce-like operation, with the strong RRQR factorization being the operatorused at each step of the reduction. Every such factorization of a block of rows ofthe panel leads to the update of a block of rows of the trailing matrix. Indepen-dently of the shape of the reduction tree, the upper bound of the growth factorof this method is the same as that of LU PRRP. This is because at every stepof the algorithm, a row of the current trailing matrix is updated only once. Werefer to the version based on a binary reduction tree as block parallel LU PRRP,and to the version based on a flat tree as block pairwise LU PRRP. There aresimilarities between these two algorithms, the LU factorization based on blockparallel pivoting (an unstable factorization), and the LU factorization based onblock pairwise pivoting (whose stability is still under investigation) [18, 20, 2].All these methods perform the panel factorization as a reduction operation, andthe factorization performed at every step of the reduction leads to an updateof the trailing matrix. However, in block parallel pivoting and block pairwisepivoting, GEPP is used at every step of the reduction, and hence U factors arecombined together during the reduction phase. While in the block parallel andblock pairwise LU PRRP, the reduction operates always on original rows of thecurrent panel.

Despite having better bounds, the block parallel LU PRRP based on a binaryreduction tree of height H = logP is unstable for certain values of the panelsize b and the number of processors P . The block pairwise LU PRRP based ona flat tree of height H = n

b appears to be more stable. The growth factor islarger than that of CALU PRRP, but it is smaller than n for the sizes of thematrices in our test set. Hence, potentially this version can be more stable thanblock pairwise pivoting, but requires further investigation.

The remainder of the paper is organized as follows. Section 2 presents thealgebra of the LU PRRP factorization, discusses its stability, and compares itwith that of GEPP. It also presents experimental results showing that LU PRRPis more stable than GEPP in terms of worst case growth factor, and it is moreresistant to pathological matrices on which GEPP fails. Section 3 presents thealgebra of CALU PRRP, a communication avoiding version of LU PRRP. Itdescribes similarities between CALU PRRP and LU PRRP and it discusses itsstability. The communication optimality of CALU PRRP is shown in section 4,where we also compare its performance model with that of the CALU algorithm.Section 5 discusses two alternative algorithms that can also reduce communi-cation, but can be less stable in practice. Section 6 concludes and presents ourfuture work.

RR n° 7867


2 LU PRRP Method

In this section we introduce the LU PRRP factorization, an LU decompositionalgorithm based on panel rank revealing pivoting strategy. It is based on ablock algorithm, that factors at each step a block of columns (a panel), andthen it updates the trailing matrix. The main difference between LU PRRPand GEPP resides in the panel factorization. In GEPP the panel factorizationis computed using LU with partial pivoting, while in LU PRRP it is computedby performing a strong RRQR factorization of its transpose. This leads to adifferent selection of pivot rows, and the obtained R factor is used to computethe block L factor of the panel. In exact arithmetic, LU PRRP performs a blockLU decomposition with a different pivoting scheme, which aims at improving thenumerical stability of the factorization by bounding more efficiently the growthof the elements. We also discuss the numerical stability of LU PRRP, and weshow that both in theory and in practice, LU PRRP is more stable than GEPP.

2.1 The algebra

LU PRRP is based on a block algorithm that factors the input matrix A of sizem × n by traversing blocks of columns of size b. Consider the first step of thefactorization, with the matrix A having the following partition,

A =

[A11 A12

A21 A22

], (1)

where A11 is of size b × b, A21 is of size (m − b) × b, A12 is of size b × (n − b),and A22 is of size (m− b)× (n− b).

The main idea of the LU PRRP factorization is to eliminate the elementsbelow the b× b diagonal block such that the multipliers used during the updateof the trailing matrix are bounded by a given threshold τ . For this, we performa strong RRQR factorization on the transpose of the first panel of size m× b toidentify a permutation matrix Π, that is b pivot rows,[A11

A21

]TΠ =

[A11

A21

]T= Q

[R(1 : b, 1 : b) R(1 : b, b+ 1 : m)

]= Q

[R11 R12

],

where A denotes the permuted matrix A. The strong RRQR factorization en-sures that the quantity RT12(R−111 )T is bounded by a given threshold τ in themax norm. The strong RRQR factorization, as described in Algorithm 2 in Ap-pendix A, computes first the QR factorization with column pivoting, followedby additional swaps of the columns of the R factor and updates of the QRfactorization, so that ‖RT12(R−111 )T ‖max ≤ τ .

After the panel factorization, the transpose of the computed permutation Πis applied on the input matrix A, and then the update of the trailing matrix isperformed,

A = ΠTA =

[IbL21 Im−b

] [A11 A12

As22

], (2)

where

As22 = A22 − L21A12. (3)

RR n° 7867


Note that in exact arithmetic, we have L21 = A21A−111 = RT12(R−111 )T . Hence

the factorization in equation (2) is equivalent to the factorization

A = ΠTA =

[Ib

A21A−111 Im−b

] [A11 A12

As22

],

where

As22 = A22 − A21A−111 A12, (4)

and A21A−111 was computed in a numerically stable way such that it is bounded

in max norm by τ .Since the block size is in general b ≥ 2, performing LU PRRP on a given

matrix A first leads to a block LU factorization, with the diagonal blocks Aiibeing square of size b × b. An additional Gaussian elimination with partialpivoting is performed on the b × b diagonal block A11 as well as the update ofthe corresponding trailing matrix A12. Then the decomposition obtained afterthe elimination of the first panel of the input matrix A is

A = ΠTA =

[IbL21 Im−b

] [A11 A12

As22

]=

[IbL21 Im−b

] [L11

Im−b

] [U11 U12

As22

],

where L11 is a lower triangular b × b matrix with unit diagonal and U11 is anupper triangular b × b matrix. We show in section 2.2 that this step does notaffect the stability of the LU PRRP factorization. Note that the factors L andU can be stored in place, and so LU PRRP has the same memory requirementsas the standard LU decomposition and can easily replace it in any application.

Algorithm 1 presents the LU PRRP factorization of a matrix A of size n×npartitioned into n

b panels. The number of floating-point operations performedby this algorithm is

#flops =2

3n3 +O(n2b),

which is only O(n2b) more floating point operations than GEPP. The detailedcounts are presented in Appendix C. When the QR factorization with columnpivoting is sufficient to obtain the desired bound for each panel factorization,and no additional swaps are performed, the total cost is

#flops =2

3n3 +

3

2n2b.

2.2 Numerical stability

In this section we discuss the numerical stability of the LU PRRP factorization.The stability of an LU decomposition depends on the growth factor. In hisbackward error analysis [21], Wilkinson proved that the computed solution xof the linear system Ax = b, where A is of size n × n, obtained by Gaussianelimination with partial pivoting or complete pivoting satisfies

(A+ ∆A)x = b, ‖∆A‖∞ ≤ p(n)gWu‖A‖∞.

In the formula, p(n) is a cubic polynomial, u is the machine precision, and gWis the growth factor defined by

RR n° 7867


Algorithm 1 LU PRRP factorization of a matrix A of size n× n1: for j from 1 to n

b do2: Let Aj be the current panel Aj = A((j − 1)b+ 1 : n, (j − 1)b+ 1 : jb).3: Compute panel factorization ATj Πj := QjRj using strong RRQR factor-

ization,4: L2j := (Rj(1 : b, 1 : b)

−1Rj(1 : b, b+ 1 : n− (j − 1)b))T .

5: Pivot by applying the permutation matrix ΠTj on the entire matrix,

A = ΠTj A.

6: Update the trailing matrix,7: A(jb+ 1 : n, jb+ 1 : n)− = L2jA((j − 1)b+ 1 : jb, jb+ 1 : n).8: Let Ajj be the current b× b diagonal block,9: Ajj = A((j − 1)b+ 1 : jb, (j − 1)b+ 1 : jb).

10: Compute Ajj = ΠjjLjjUjj using GEPP.11: Compute U((j−1)b+1 : jb, jb+1 : n) = L−1jj ΠT

jjA((j−1)b+1 : jb, jb+1 :n).

12: end for

gW =maxi,j,k|a(k)

i,j |maxi,j |ai,j | ,

where a(k)i,j denotes the entry in position (i, j) obtained after k steps of elim-

ination. Thus the growth factor measures the growth of the elements dur-ing the elimination. The LU factorization is backward stable if gW is of or-der O(1) (in practice the method is stable if the growth factor is a slowlygrowing function of n). Lemma 9.6 of [13] (section 9.3) states a more gen-eral result, showing that the LU factorization without pivoting of A is back-ward stable if the growth factor is small. Wilkinson [21] showed that forpartial pivoting, the growth factor gW ≤ 2n−1, and this bound is attain-able. He also showed that for complete pivoting, the upper bound satisfiesgW ≤ n1/2(2.31/2......n1/(n−1))1/2 ∼ cn1/2n1/4 logn. In practice the growth fac-tors are much smaller than the upper bounds.

In the following, we derive the upper bound of the growth factor for theLU PRRP factorization. We use the same notation as in the previous sectionand we assume without loss of generality that the permutation matrix is theidentity. It is easy to see that the growth factor obtained after the eliminationof the first panel is bounded by (1 + τb). At the k-th step of the block factor-ization, the active matrix A(k) is of size (m− (k− 1)b)× (n− (k− 1)b), and thedecomposition performed at this step can be written as

A(k) =

[A

(k)11 A

(k)12

A(k)21 A

(k)22

]=

[Ib

L(k)21 Im−(k+1)b

] [A

(k)11 A

(k)12

A(k)s22

].

The active matrix at the (k+1)-th step is A(k+1)22 = A

(k)s22 = A

(k)22 − L

(k)21 A

(k)12 .

Then maxi,j |a(k+1)i,j | ≤ maxi,j |a(k)i,j |(1 + τb) with maxi,j |L(k)

21 (i, j)| ≤ τ and wehave

gW(k+1) ≤ gW (k)(1 + τb). (5)

RR n° 7867


Induction on equation (5) leads to a growth factor of LU PRRP performed onthe n

b panels of the matrix A that satisfies

gW ≤ (1 + τb)n/b. (6)

As explained in the algebra section, the LU PRRP factorization leads first to ablock LU factorization, which is completed with additional GEPP factorizationsof the diagonal b× b blocks and updates of corresponding blocks of rows of thetrailing matrix. These additional factorizations lead to a growth factor boundedby 2b on the trailing blocks of rows. Since we choose b n, we conclude thatthe growth factor of the entire factorization is still bounded by (1 + τb)n/b.

The improvement of the upper bound of the growth factor of LU PRRP withrespect to GEPP is illustrated in Table 1, where the panel size varies from 8 to128, and the parameter τ is equal to 2. The worst case growth factor becomesarbitrarily smaller than for GEPP, for b ≥ 64.

Table 1: Upper bounds of the growth factor gW obtained from factoring a matrixof size m× n using LU PRRP with different panel sizes and τ = 2.

b gW8 (1.425)n−1

16 (1.244)n−1

32 (1.139)n−1

64 (1.078)n−1

128 (1.044)n−1

Despite the complexity of our algorithm in pivot selection, we still computean LU factorization, only with different pivots. Consequently, the rounding erroranalysis for LU factorization still applies (see, for example, [3]), which indicatesthat element growth is the only factor controlling the numerical stability of ouralgorithm.

2.3 Experimental results

We measure the stability of the LU PRRP factorization experimentally on alarge set of test matrices by using several metrics, as the growth factor, thenormwise backward stability, and the componentwise backward stability. Thetests are performed in Matlab. In the tests, in most of the cases, the panel fac-torization is performed by using the QR with column pivoting factorization in-stead of the strong RRQR factorization. This is because in practice RT12(R−111 )T

is already well bounded after performing the RRQR factorization with columnpivoting (‖RT12(R−111 )T ‖max is rarely bigger than 3). Hence no additional swapsare needed to ensure that the elements are well bounded. However, for the ill-conditionned special matrices (condition number ≥ 1014), to get small growthfactors, we perform the panel factorization by using the strong RRQR factor-ization. In fact, for these cases, QR with column pivoting does not ensure asmall bound for RT12(R−111 )T .

We use a collection of matrices that includes random matrices, a set ofspecial matrices described in Table 15, and several pathological matrices onwhich Gaussian elimination with partial pivoting fails because of large growthfactors. The set of special matrices includes ill-conditioned matrices as well as

RR n° 7867


sparse matrices. The pathological matrices considered are the Wilkinson matrixand two matrices arising from practical applications, presented by Foster [7] andWright [23], for which the growth factor of GEPP grows exponentially. TheWilkinson matrix was constructed to attain the upper bound of the growthfactor of GEPP [21, 14], and a general layout of such a matrix is

A = diag(±1)

1 0 0 · · · 0 1−1 1 0 ... 0 1

−1 −1 1. . .

......

......

. . .. . . 0 1

−1 −1 · · · −1 1 1−1 −1 · · · −1 −1 1

×

0

T...0

0 · · · 0 θ

,

where T is an (n − 1) × (n − 1) non-singular upper triangular matrix and θ =max|aij |. We also test a generalized Wilkinson matrix, the general form of sucha matrix is

A =

1 0 0 · · · 0 10 1 0 ... 0 1

1. . .

......

.... . .

. . . 0 11 1

0 · · · 0 1

+ TT ,

where T is an n × n upper triangular matrix with zero entries on the maindiagonal. The matlab code of the matrix A is detailed in Appendix F.

The Foster matrix represents a concrete physical example that arises fromusing the quadrature method to solve a certain Volterra integral equation andit is of the form

A =

1 0 0 · · · 0 − 1c

−kh2 1− kh2 0 ... 0 − 1

c

−kh2 −kh 1− kh2

. . ....

......

.... . . 0 − 1

c

−kh2 −kh · · · −kh 1− kh2 − 1

c

−kh2 −kh · · · −kh −kh 1− 1c −

kh2

.

Wright [23] discusses two-point boundary value problems for which standardsolution techniques give rise to matrices with exponential growth factor whenGaussian elimination with partial pivoting is used. This kind of problems arisefor example from the multiple shooting algorithm. A particular example of thisproblem is presented by the following matrix,

A =

I I−eMh I 0

−eMh I...

. . .. . . 0−eMh I

,

RR n° 7867


where eMh = I +Mh+O(h2).The experimental results show that the LU PRRP factorization is very sta-

ble. Figure 1 displays the growth factor of LU PRRP for random matrices ofsize varying from 1024 to 8192 and for sizes of the panel varying from 8 to 128.We observe that the smaller the size of the panel is, the bigger the elementgrowth is. In fact, for a smaller size of the panel, the number of panels and thenumber of updates on the trailing matrix is bigger, and this leads to a largergrowth factor. But for all panel sizes, the growth factor of LU PRRP is smallerthan the growth factor of GEPP. For example, for a random matrix of size 4096and a panel of size 64, the growth factor is only about 19, which is smaller thanthe growth factor obtained by GEPP, and as expected, much smaller than thetheoretical upper bound of (1.078)4095.

Figure 1: Growth factor gW of the LU PRRP factorization of random matrices.

Tables 16 and 18 in Appendix B present more detailed results showing thestability of the LU PRRP factorization for random matrices and a set of specialmatrices. There, we include different metrics, such as the norm of the factors,the value of their maximum element and the backward error of the LU factoriza-tion. We evaluate the normwise backward stability by computing three accuracytests as performed in the HPL (High-Performance Linpack) benchmark [6], anddenoted as HPL1, HPL2 and HPL3.

HPL1 = ||Ax− b||∞/(ε||A||1 ∗N),

HPL2 = ||Ax− b||∞/(ε||A||1||x||1),

HPL3 = ||Ax− b||∞/(ε||A||∞||x||∞ ∗N).

In HPL, the method is considered to be accurate if the values of the threequantities are smaller than 16. More generally, the values should be of orderO(1). For the LU PRRP factorization HPL1 is at most 8.09, HPL2 is at most8.04 × 10−2 and HPL3 is at most 1.60 × 10−2. We also display the normwisebackward error, using the 1-norm,

η :=||r||

||A|| ||x||+ ||b||, (7)

RR n° 7867


and the componentwise backward error

w := maxi

|ri|(|A| |x|+ |b|)i

, (8)

where the computed residual is r = b−Ax. For our tests residuals are computedwith double-working precision.

Figure 2 summarizes all our stability results for LU PRRP. This figure dis-plays the ratio of the maximum between the backward error and machine epsilonof LU PRRP versus GEPP. The backward error is measured using three met-rics, the relative error ‖PA − LU‖/‖A‖, the normwise backward error η, andthe componentwise backward error w of LU PRRP versus GEPP, and the ma-chine epsilon. We take the maximum of the computed error with epsilon sincesmaller values are mostly roundoff error, and so taking ratios can lead to ex-treme values with little reliability. Results for all the matrices in our test set arepresented, that is 20 random matrices for which results are presented in Table16, and 37 special matrices for which results are presented in Tables 17 and 18.This figure shows that for random matrices, almost all ratios are between 0.5and 2. For special matrices, there are few outliers, up to 23.71 (GEPP is morestable) for the backward error ratio of the special matrix hadamard and downto 2.12 × 10−2 (LU PRRP is more stable) for the backward error ratio of thespecial matrix moler.

Figure 2: A summary of all our experimental data, showing the ratio betweenmax(LU PRRP’s backward error, machine epsilon) and max(GEPP’s backwarderror, machine epsilon) for all the test matrices in our set. Each vertical barrepresents such a ratio for one test matrix. Bars above 100 = 1 mean thatLU PRRP’s backward error is larger, and bars below 1 mean that GEPP’sbackward error is larger. For each matrix and algorithm, the backward error ismeasured 3 ways. For the first third of the bars, labeled ‖PA− LU‖/‖A‖, themetric is the backward error, using the Frobenius norm. For the middle thirdof the bars, labeled “normwise backward error”, the metric is η in equation(7). For the last third of the bars, labeled “componentwise backward error”,the metric is w in equation (8). The test matrices are further labeled either as“randn”, which are randomly generated, or “special”, listed in Table 15.

We consider now pathological matrices on which GEPP fails. Table 2 presentsresults for the linear solver using the LU PRRP factorization for a Wilkinsonmatrix [22] of size 2048 with a size of the panel varying from 8 to 128. The

RR n° 7867


growth factor is 1 and the relative error ||PA−LU ||||A|| is on the order of 10−19. Ta-

ble 3 presents results for the linear solver using the LU PRRP algorithm for ageneralized Wilkinson matrix of size 2048 with a size of the panel varying from8 to 128.

Table 2: Stability of the LU PRRP factorization of a Wilkinson matrix on whichGEPP fails.

n b gW ||U ||1 ||U−1||1 ||L||1 ||L−1||1 ||PA−LU ||F||A||F

2048

128 1 1.02e+03 6.09e+00 1 1.95e+00 4.25e-2064 1 1.02e+03 6.09e+00 1 1.95e+00 5.29e-2032 1 1.02e+03 6.09e+00 1 1.95e+00 8.63e-2016 1 1.02e+03 6.09e+00 1 1.95e+00 1.13e-198 1 1.02e+03 6.09e+00 1 1.95e+00 1.57e-19

Table 3: Stability of the LU PRRP factorization of a generalized Wilkinsonmatrix on which GEPP fails.


2048

128 2.69 1.23e+03 1.39e+02 1.21e+03 1.17e+03 1.05e-1564 2.61 9.09e+02 1.12e+02 1.36e+03 1.15e+03 9.43e-1632 2.41 8.20e+02 1.28e+02 1.39e+03 9.77e+02 5.53e-1616 4.08 1.27e+03 2.79e+02 1.41e+03 1.19e+03 7.92e-168 3.35 1.36e+03 2.19e+02 1.41e+03 1.73e+03 1.02e-15

For the Foster matrix, it was shown that when c = 1 and kh = 23 , the growth

factor of GEPP is (23 )(2n−1 − 1), which is close to the maximum theoretical

growth factor of GEPP of 2n−1. Table 4 presents results for the linear solverusing the LU PRRP factorization for a Foster matrix of size 2048 with a sizeof the panel varying from 8 to 128 ( c = 1, h = 1 and k = 2

3 ). Accordingto the obtained results, LU PRRP gives a modest growth factor of 2.66 forthis practical matrix, while GEPP has a growth factor of 1018 for the sameparameters.

Table 4: Stability of the LU PRRP factorization of a practical matrix (Foster)on which GEPP fails.


2048


For matrices arising from the two-point boundary value problems describedby Wright, it was shown that when h is chosen small enough such that allelements of eMh are less than 1 in magnitude, the growth factor obtained using

GEPP is exponential. For our experiment the matrix M =

[− 1

6 11 − 1

6

], that

RR n° 7867


is eMh ≈[

1− h6 h

h 1− h6

], and h = 0.3. Table 5 presents results for the linear

solver using the LU PRRP factorization for a Wright matrix of size 2048 with asize of the panel varying from 8 to 128. According to the obtained results, againLU PRRP gives minimum possible pivot growth 1 for this practical matrix,compared to the GEPP method which leads to a growth factor of 1095 usingthe same parameters.

Table 5: Stability of the LU PRRP factorization on a practical matrix (Wright)on which GEPP fails.


2048

128 1 3.25e+00 8.00e+00 2.00e+00 2.00e+00 4.08e-1764 1 3.25e+00 8.00e+00 2.00e+00 2.00e+00 4.08e-1732 1 3.25e+00 8.00e+00 2.05e+00 2.07e+00 6.65e-1716 1 3.25e+00 8.00e+00 2.32e+00 2.44e+00 1.04e-168 1 3.40e+00 8.00e+00 2.62e+00 3.65e+00 1.26e-16

All the previous tests show that the LU PRRP factorization is very stable forrandom, and for more special matrices, and it also gives modest growth factorfor the pathological matrices on which GEPP fails. We note that we were notable to find matrices for which LU PRRP attains the upper bound of (1 + τb)

nb

for the growth factor.

3 Communication avoiding LU PRRP

In this section we present a communication avoiding version of the LU PRRPalgorithm, that is an algorithm that minimizes communication, and so it will bemore efficient than LU PRRP and GEPP on architectures where communicationis expensive. We show in this section that this algorithm is more stable thanCALU, an existing communication avoiding algorithm for computing the LUfactorization [10]. More importantly, its parallel version is also more stablethan GEPP (under certain conditions).

3.1 Matrix algebra

CALU PRRP is a block algorithm that uses tournament pivoting, a strategyintroduced in [10] that allows to minimize communication. As in LU PRRP,at each step the factorization of the current panel is computed, and then thetrailing matrix is updated. However, in CALU PRRP the panel factorizationis performed in two steps. The first step, which is a preprocessing step, uses areduction operation to identify b pivot rows with a minimum amount of com-munication. The strong RRQR factorization is the operator used at each nodeof the reduction tree to select a new set of b candidate rows from the candidaterows selected at previous stages of the reduction. The b pivot rows are permutedinto the diagonal positions, and then the QR factorization with no pivoting ofthe transpose of the entire panel is computed.

In the following we illustrate tournament pivoting on the first panel, with theinput matrix A partitioned as in equation (1). Tournament pivoting considers

RR n° 7867


that the first panel is partitioned into P = 4 blocks of rows,

A(:, 1 : b) =

A00

A10

A20

A30

.The preprocessing step uses a binary reduction tree in our example, and wenumber the levels of the reduction tree starting from 0. At the leaves of thereduction tree, a set of b candidate rows are selected from each block of rows Ai0by performing the strong RRQR factorization on the transpose of each blockAi0. This gives the following decomposition,

AT00Π00

AT10Π10

AT20Π20

AT30Π30

=

Q00R00

Q10R10

Q20R20

Q30R30

,which can be written as

A(:, 1 : b)T Π0 = A(:, 1 : b)T

Π00

Π10

Π20

Π30

=

Q00R00

Q10R10

Q20R20

Q30R30

,

where Π0 is an m×m permutation matrix with diagonal blocks of size mP ×

mP ,

Qi0 is an orthogonal matrix of size b× b, and each factor Ri0 is an b× mP upper

triangular matrix.There are now P = 4 sets of candidate rows. At the first level of the binary

tree, two matrices A01 and A11 are formed by combining together two sets ofcandidate rows,

A01 =

[(AT00Π00)(:, 1 : b)(AT10Π10)(:, 1 : b)

]A11 =

[(AT20Π20)(:, 1 : b)(AT30Π30)(:, 1 : b)

].

Two new sets of candidate rows are identified by performing the strong RRQRfactorization of each matrix A01 and A11,

AT01Π01 = Q01R01,

AT11Π11 = Q11R11,

where Π10, Π11 are permutation matrices of size 2b×2b, Q01, Q11 are orthogonalmatrices of size b× b, and R01, R11 are upper triangular factors of size b× 2b.

The final b pivot rows are obtained by performing one last strong RRQRfactorization on the transpose of the following b× 2b matrix :

A02 =

[(AT01Π01)(:, 1 : b)(AT11Π11)(:, 1 : b)

],

RR n° 7867


that is

AT02Π02 = Q02R02,

where Π02 is a permutation matrix of size 2b× 2b, Q02 is an orthogonal matrixof size b×b, and R02 is an upper triangular matrix of size b×2b. This operationis performed at the root of the binary reduction tree, and this ends the first stepof the panel factorization. In the second step, the final pivot rows identified bytournament pivoting are permuted to the diagonal positions of A,

A = ΠTA = ΠT2 ΠT

1 ΠT0 A,

where the matrices Πi are obtained by extending the matrices Π to the dimensionm×m, that is

Π1 =

[Π01

Π11

],

with Πi1, for i = 0, 1 formed as

Πi1 =

Πi1(1 : b, 1 : b) Πi1(1 : b, b+ 1 : 2b)

ImP −b

Πi1(b+ 1 : 2b, 1 : b) Πi1(b+ 1 : 2b, b+ 1 : 2b)Im

P −b

,and

Π2 =

Π02(1 : b, 1 : b) Π02(1 : b, b+ 1 : 2b)

I2mP −b

Π02(b+ 1 : 2b, 1 : b) Π02(b+ 1 : 2b, b+ 1 : 2b)I2m

P −b

.Once the pivot rows are in the diagonal positions, the QR factorization with

no pivoting is performed on the transpose of the first panel,

AT (1 : b, :) = QR =[R11 R12

].

This factorization is used to update the trailing matrix, and the elimination ofthe first panel leads to the following decomposition (we use the same notationas in section 2),

A =

[Ib

A21A−111 Im−b

] [A11 A12

As22

],

where

As22 = A22 − A21A−111 A12.

As in the LU PRRP factorization, the CALU PRRP factorization computes ablock LU factorization of the input matrix A. To obtain the full LU factoriza-tion, an additional GEPP is performed on the diagonal block A11, followed bythe update of the block row A12. The CALU PRRP factorization continues thesame procedure on the trailing matrix As22.

RR n° 7867


Note that the factors L and U obtained by the CALU PRRP factoriza-tion are different from the factors obtained by the LU PRRP factorization.The two algorithms use different pivot rows, and in particular the factor L ofCALU PRRP is no longer bounded by a given threshold τ as in LU PRRP.This leads to a different worst case growth factor for CALU PRRP, that we willdiscuss in the following section.

The following figure displays the binary tree based tournament pivoting per-formed on the first panel using an arrow notation (as in [10]). The functionf(Aij) computes a strong RRQR of the matrix ATij to select a set of b candi-date rows. At each node of the reduction tree, two sets of b candidate rows aremerged together and form a matrix Aij , the function f is applied on Aij , andanother set of b candidate rows is selected. While in this section we focused onbinary trees, tournament pivoting can use any reduction tree, and this allowsthe algorithm to adapt on different architectures. Later in the paper we willconsider also a flat reduction tree.

A30

A20

A10

A00

→→→→

f(A30)

f(A20)

f(A10)

f(A00)

f(A11)

f(A01)

f(A02)

3.2 Numerical Stability of CALU PRRP

In this section we discuss the stability of the CALU PRRP factorization andwe identify similarities with the LU PRRP factorization. We also discuss thegrowth factor of the CALU PRRP factorization, and we show that its upperbound depends on the height of the reduction tree. For the same reduction tree,this upper bound is smaller than that obtained with CALU. More importantly,for cases of interest, the upper bound of the growth factor of CALU PRRP isalso smaller than that obtained with GEPP.

To address the numerical stability of CALU PRRP, we show that perform-ing CALU PRRP on a matrix A is equivalent to performing LU PRRP on alarger matrix ALU PRRP , which is formed by blocks of A (sometimes slightlyperturbed) and blocks of zeros. This reasoning is also used in [10] to show thesame equivalence between CALU and GEPP. While this similarity is explainedin detail in [10], here we focus only on the first step of the CALU PRRP factor-ization. We explain the construction of the larger matrix ALU PRRP to exposethe equivalence between the first step of the CALU PRRP factorization of Aand the LU PRRP factorization of ALU PRRP .

Consider a nonsingular matrix A of size m × n and the first step of itsCALU PRRP factorization using a general reduction tree of height H. Tour-nament pivoting selects b candidate rows at each node of the reduction tree byusing the strong RRQR factorization. Each such factorization leads to an Lfactor which is bounded locally by a given threshold τ . However this bound isnot guaranteed globally. When the factorization of the first panel is computedusing the b pivot rows selected by tournament pivoting, the L factor will notbe bounded by τ . This results in a larger growth factor than the one obtainedwith the LU PRRP factorization. Recall that in LU PRRP, the strong RRQRfactorization is performed on the transpose of the whole panel, and so every

RR n° 7867


entry of the obtained lower triangular factor L is bounded by τ .However, we show now that the growth factor obtained after the first step

of the CALU PRRP factorization is bounded by (1 + τb)H+1. Consider a rowj, and let As(j, b + 1 : n) be the updated row obtained after the first step ofelimination of CALU PRRP. Suppose that row j of A is a candidate row atlevel k − 1 of the reduction tree, and so it participates in the strong RRQRfactorization computed at a node sk at level k of the reduction tree, but it isnot selected as a candidate row by this factorization. We refer to the matrixformed by the candidate rows at node sk as Ak. Hence, row j is not used to formthe matrix Ak. Similarly, for every node i on the path from node sk to the rootof the reduction tree of height H, we refer to the matrix formed by the candidaterows selected by strong RRQR as Ai. Note that in practice it can happen thatone of the blocks of the panel is singular, while the entire panel is nonsingular.In this case strong RRQR will select less than b linearly independent rows thatwill be passed along the reduction tree. However, for simplicity, we assume inthe following that the matrices Ai are nonsingular. For a more general solution,the reader can consult [10].

Let Π be the permutation returned by the tournament pivoting strategyperformed on the first panel, that is the permutation that puts the matrix AHon diagonal. The following equation is satisfied,(

AH AH

A(j, 1 : b) A(j, b+ 1 : n)

)=

(Ib

A(j, 1 : b)A−1H 1

)·(AH AH

As(j, b+ 1 : n)

), (9)

where

AH = (ΠA)(1 : b, 1 : b),

AH = (ΠA)(1 : b, b+ 1 : n).

The updated rowAs(j, b+1 : n) can be also obtained by performing LU PRRPon a larger matrix ALU PRRP of dimension ((H−k+1)b+1)×((H−k+1)b+1),

ALU PRRP =

AH AH

AH−1 AH−1

AH−2 AH−2

. . .. . .

Ak Ak

(−1)H−kA(j, 1 : b) A(j, b+ 1 : n)

=

IbAH−1A

−1H Ib

AH−2A−1H−1 Ib

. . .. . .

AkA−1k+1 Ib

(−1)H−kA(j, 1 : b)A−1k 1

·

AH AH

AH−1 AH−1

AH−2 AH−2

. . ....

Ak Ak

As(j, b+ 1 : n)

, (10)

RR n° 7867


where

AH−i =

AH if i = 0,

−AH−iA−1H−i+1AH−i+1 if 0 < i ≤ H − k. (11)

Equation (10) can be easily verified, since

As(j, b+ 1 : n) = A(j, b+ 1 : n)− (−1)H−kA(j, 1 : b)A−1k (−1)H−kAk

= A(j, b+ 1 : n)−A(j, 1 : b)A−1k AkA−1k+1 . . . AH−2A

−1H−1AH−1A

−1H AH

= A(j, b+ 1 : n)−A(j, 1 : b)A−1H AH .

Equations (9) and (10) show that the Schur complement obtained after eachstep of performing the CALU PRRP factorization of a matrix A is equivalentto the Schur complement obtained after performing the LU PRRP factoriza-tion of a larger matrix ALU PRRP , formed by blocks of A (sometimes slightlyperturbed) and blocks of zeros. More generally, this implies that the entireCALU PRRP factorization of A is equivalent to the LU PRRP factorization ofa larger and very sparse matrix, formed by blocks of A and blocks of zeros (weomit the proofs here, since they are similar with the proofs presented in [10]).

Equation (10) is used to derive the upper bound of the growth factor ofCALU PRRP from the upper bound of the growth factor of LU PRRP. Theelimination of each row of the first panel using CALU PRRP can be obtained byperforming LU PRRP on a matrix of maximum dimension m×b(H+1). Hencethe upper bound of the growth factor obtained after one step of CALU PRRPis (1 + τb)H+1. This leads to an upper bound of (1 + τb)

nb (H+1)−1 for a matrix

of size m× n.Table 6 summarizes the bounds of the growth factor of CALU PRRP derived

in this section, and also recalls the bounds of LU PRRP, GEPP, and CALUthe communication avoiding version of GEPP. It considers the growth factorobtained after the elimination of b columns of a matrix of size m × (b + 1),and also the general case of a matrix of size m × n. As discussed in section 2already, LU PRRP is more stable than GEPP in terms of worst case growthfactor. From Table 6, it can be seen that for a reduction tree of a same height,CALU PRRP is more stable than CALU.

In the following we show that CALU PRRP can be more stable than GEPPin terms of worst case growth factor. Consider a parallel version of CALU PRRPbased on a binary reduction tree of height H = log(P ), where P is the number of

processors. The upper bound of the growth factor becomes (1 + τb)n(logP+1)

b −1,which is smaller than 2n(logP+1)−1, the upper bound of the growth factor ofCALU. For example, if the threshold is τ = 2, the panel size is b = 64, andthe number of processors is P = 128 = 27, then gWCALU PRRP = (1.7)n. Thisquantity is much smaller than 27n the upper bound of CALU, and even smallerthan the worst case growth factor of GEPP of 2n−1. In general, the upperbound of CALU PRRP can be smaller than the one of GEPP, if the differentparameters τ , H, and b are chosen such that the condition

H ≤ b

(log b+ log τ)(12)

RR n° 7867


is satisfied. For a binary tree of heightH = logP , it becomes logP ≤ b(log b+log τ) .

This is a condition which can be satisfied in practice, by choosing b and τ ap-propriately for a given number of processors P . For example, when P ≤ 512,b = 64, and τ = 2, the condition (12) is satisfied, and the worst case growthfactor of CALU PRRP is smaller than the one of GEPP.

However, for a sequential version of CALU PRRP using a flat tree of height

H = n/b, the condition to be satisfied becomes n ≤ b2

(log b+log τ) , which is more

restrictive. In practice, the size of b is chosen depending on the size of thememory, and it might be the case that it will not satisfy the condition in equation(12).

Table 6: Bounds for the growth factor gW obtained from factoring a matrix ofsize m × (b + 1) and a matrix of size m × n using CALU PRRP, LU PRRP,CALU, and GEPP. CALU PRRP and CALU use a reduction tree of height H.The strong RRQR used in LU PRRP and CALU PRRP is based on a thresholdτ . For the matrix of size m× (b+1), the result corresponds to the growth factorobtained after eliminating b columns.

matrix of size m× (b+ 1)TSLU(b,H) LU PRRP(b,H) GEPP LU PRRP

gW upper bound 2b(H+1) (1 + τb)H+1 2b 1 + τb

matrix of size m× nCALU CALU PRRP GEPP LU PRRP

gW upper bound 2n(H+1)−1 (1 + τb)n(H+1)

b−1 2n−1 (1 + τb)

nb

3.3 Experimental results

In this section we present experimental results and show that CALU PRRPis stable in practice and compare them with those obtained from CALU andGEPP in [10]. We present results for both the binary tree scheme and the flattree scheme.

As in section 2, we perform our tests on matrices whose elements follow anormal distribution. In Matlab notation, the test matrix is A = randn(n, n),and the right hand side is b = randn(n, 1). The size of the matrix is chosen suchthat n is a power of 2, that is n = 2k, and the sample size is 10 if k < 13 and 3if k ≥ 13. To measure the stability of CALU PRRP, we discuss several metrics,that concern the LU decomposition and the linear solver using it, such as thegrowth factor, normwise and componentwise backward errors. We also performtests on several special matrices including sparse matrices, they are describedin Appendix B.

Figure 3 displays the values of the growth factor gW of the binary tree basedCALU PRRP, for different block sizes b and different number of processors P .As explained in section 3.1, the block size determines the size of the panel,while the number of processors determines the number of block rows in whichthe panel is partitioned. This corresponds to the number of leaves of the binarytree. We observe that the growth factor of binary tree based CALU PRRP isin the most of the cases better than GEPP. The curves of the growth factor liebetween 1

2n1/2 and 3

4n1/2 in our tests on random matrices. These results show

that binary tree based CALU PRRP is stable and the growth factor values

RR n° 7867


obtained for the different layouts are better than those obtained with binarytree based CALU. The figure 3 includes also the growth factor of the LU PRRPmethod with a panel of size b = 64. We note that that results are better thanthose of binary tree based CALU PRRP.

Figure 3: Growth factor gW of binary tree based CALU PRRP for randommatrices.

Figure 4 displays the values of the growth factor gW for flat tree basedCALU PRRP with a block size b varying from 8 to 128. The growth factor gWis decreasing with increasing the panel size b. We note that the curves of thegrowth factor lie between 1

4n1/2 and 3

4n1/2 in our tests on random matrices. We

also note that the results obtained with the LU PRRP method with a panel ofsize b = 64 are better than those of flat tree based CALU PRRP.

The growth factors of both binary tree based and flat tree based CALU PRRPhave similar (sometimes better) behavior than the growth factors of GEPP.

Table 19 in Appendix B presents results for the linear solver using binarytree based CALU PRRP, together with binary tree based CALU and GEPP forcomparaison and Table 20 in Appendix B presents results for the linear solverusing flat tree based CALU PRRP, together with flat tree based CALU andGEPP for comparaison. We note that for the binary tree based CALU PRRP,when m

P = b, for the algorithm we only use P1 = m(b+1) processors, since to

perform a Strong RRQR on a given block, the number of its rows should beat least the number of its columns +1. Tables 19 and 20 also include resultsobtained by iterative refinement used to improve the accuracy of the solution.For this, the componentwise backward error in equation (8) is used. In theprevious tables, wb denotes the componentwise backward error before iterativerefinement and NIR denotes the number of steps of iterative refinement. NIR isnot always an integer since it represents an average. For all the matrices testedCALU PRRP leads to results as accurate as the results obtained with CALU

RR n° 7867


Figure 4: Growth factor gW of flat tree based CALU PRRP for random matri-ces.

and GEPP.In Appendix B we present more detailed results. There we include some

other metrics, such as the norm of the factors, the norm of the inverse of the fac-tors, their conditioning, the value of their maximum element, and the backwarderror of the LU factorization. Through the results detailed in this section and inAppendix B we show that binary tree based and flat tree based CALU PRRPare stable, have the same behavior as GEPP for random matrices, and are morestable than binary tree based and flat tree based CALU in terms of growthfactor.

Figure 5 summarizes all our stability results for the CALU PRRP factoriza-tion based on both binary tree and flat tree schemes. As figure 2, this figuredisplays the ratio of the maximum between the backward error and machine ep-silon of LU PRRP versus GEPP. The backward error is measured as the relativeerror ‖PA−LU‖/‖A‖, the normwise backward error η, and the componentwisebackward error w. Results for all the matrices in our test set are presented,that is 25 random matrices for binary tree base CALU PRRP from Table 23, 20random matrices for flat tree based CALU PRRP from Table 21, and 37 specialmatrices from Tables 22 and 24. As it can be seen, nearly all ratios are between0.5 and 2.5 for random matrices. However there are few outliers, for examplethe relative error ratio has values between 24.2 for the special matrix hadamard(GEPP is more stable than binary tree based CALU PRRP), and 5.8×10−3 forthe special matrix moler (binary tree based CALU PRRP is more stable thanGEPP).

We consider now the same set of pathological matrices as in section 2 onwhich GEPP fails. For the Wilkinson matrix, both CALU and CALU PRRPbased on flat and binary tree give modest element growth. For the generalizedWilkinson matrix, the Foster matrix, and Wright matrix, CALU fails with both

RR n° 7867


Figure 5: A summary of all our experimental data, showing the ratio ofmax(CALU PRRP’s backward error, machine epsilon) to max(GEPP’s back-ward error, machine epsilon) for all the test matrices in our set. Each verticalbar represents such a ratio for one test matrix. Bars above 100 = 1 mean thatCALU PRRP’s backward error is larger, and bars below 1 mean that GEPP’sbackward error is larger. For each matrix and algorithm, the backward erroris measured using three different metrics. For the last third of the bars, la-beled “componentwise backward error”, the metric is w in equation (8). Thetest matrices are further labeled either as “randn”, which are randomly gener-ated, or “special”, listed in Table 15. Finally, each test matrix is factored usingCALU PRRP with a binary reduction tree (labeled BCALU for BCALU PRRP)and with a flat reduction tree (labeled FCALU for FCALU PRRP).

flat tree and binary tree reduction schemes.Tables 7 and 8 present the results obtained for the linear solver using the

CALU PRRP factorization based on flat and binary tree schemes for a gener-alized Wilkinson matrix of size 2048 with a size of the panel varying from 8 to128 for the flat tree scheme and a number of processors varying from 128 to 32for the binary tree scheme. The growth factor is of order 1 and the quantity||PA−LU ||||A|| is on the order of 10−16. Both flat tree based CALU and binary tree

based CALU fail on this pathological matrix. In fact for a generalized Wilkinsonmatrix of size 1024 and a panel of size b = 128, the growth factor obtained withflat tree based CALU is of size 10234.

Table 7: Stability of the flat tree based CALU PRRP factorization of a gener-alized Wilkinson matrix on which GEPP fails.


2048


For the Foster matrix, we have seen in the section 2 that LU PRRP givesmodest pivot growth, whereas GEPP fails. Both flat tree based CALU andbinary tree based CALU fail on the Foster matrix. However flat tree based

RR n° 7867


Table 8: Stability of the binary tree based CALU PRRP factorization of ageneralized Wilkinson matrix on which GEPP fails.

n P b gW ||U ||1 ||U−1||1 ||L||1 ||L−1||1 ||PA−LU ||F||A||F

2048

128 8 2.10e+00 1.33e+03 1.29e+02 1.34e+03 1.33e+03 1.08e-15

6416 2.04e+00 6.85e+02 1.30e+02 1.30e+03 6.85e+02 7.85e-168 8.78e+01 1.21e+03 1.60e+02 1.33e+03 1.01e+03 9.54e-16

3232 2.08e+00 9.47e+02 1.58e+02 1.41e+03 9.36e+02 5.95e-1616 2.08e+00 1.24e+03 1.32e+02 1.35e+03 1.24e+03 1.01e-158 1.45e+02 1.03e+03 1.54e+02 1.37e+03 6.61e+02 6.91e-16

CALU PRRP and binary tree based CALU PRRP solve easily this pathologicalmatrix.

Tables 9 and 10 present results for the linear solver using the CALU PRRPfactorization based on the flat tree scheme and the binary tree scheme, respec-tively. We test a Foster matrix of size 2048 with a panel size varying from 8to 128 for the flat tree based CALU PRRP and a number of processors vary-ing from 128 to 32 for the binary tree based CALU PRRP. We use the sameparameters as in section 2, that is c = 1, h = 1 and k = 2

3 . According to theobtained results, CALU PRRP gives a modest growth factor of 1.33 for thispractical matrix.

Table 9: Stability of the flat tree based CALU PRRP factorization of a practicalmatrix (Foster) on which GEPP fails.


2048


Table 10: Stability of the binary tree based CALU PRRP factorization of apractical matrix (Foster) on which GEPP fails.


2048

128 8 1.33 1.13e+01 1.87e+00 2.04e+03 9.00e+00 6.07e-17

6416 1.33 2.20e+01 1.87e+00 2.03e+03 1.70e+01 4.80e-178 1.33 1.13e+01 1.87e+00 2.04e+03 9.00e+00 6.07e-17

3232 1.33 4.33e+01 1.87e+00 2.01e+03 3.300e+01 2.91e-1716 1.33 2.20e+01 1.87e+00 2.03e+03 1.70e+01 4.80e-178 1.33 1.13e+01 1.87e+00 2.04e+03 9.00e+00 6.07e-17

As GEPP, both the flat tree based and the binary tree based CALU fail onthe Wright matrix. In fact for a matrix of size 2048, a parameter h = 0.3, witha panel of size b = 128, the flat tree based CALU gives a growth factor of 1098.With a number of processors P = 64 and a panel of size b = 16, the binarytree based CALU also gives a growth factor of 1098. Tables 11 and 12 present

RR n° 7867


results for the linear solver using the CALU PRRP factorization for a Wrightmatrix of size 2048. For the flat tree based CALU PRRP, the size of the panelis varying from 8 to 128. For the binary tree based CALU PRRP, the numberof processors is varying from 32 to 128 and the size of the panel from 16 to64 such that the number of rows in the leaf nodes is equal or bigger than twotimes the size of the panel. The obtained results, show that CALU PRRP givesa modest growth factor of 1 for this practical matrix, compared to the CALUmethod.

Table 11: Stability of the flat tree based CALU PRRP factorization of a prac-tical matrix (Wright) on which GEPP fails.


2048

128 1 3.25e+00 8.00e+00 2.00e+00 2.00e+00 4.08e-1764 1 3.25e+00 8.00e+00 2.00e+00 2.00e+00 4.08e-1732 1 3.25e+00 8.00e+00 2.05e+00 2.02e+00 6.65e-1716 1 3.25e+00 8.00e+00 2.32e+00 2.18e+00 1.04e-168 1 3.40e+00 8.00e+00 2.62e+00 2.47e+00 1.26e-16

Table 12: Stability of the binary tree based CALU PRRP factorization of apractical matrix (Wright) on which GEPP fails.


2048

128 8 1 3.40e+00 8.00e+00 2.62e+00 2.47e+00 1.26e-16

6416 1 3.25e+00 8.00e+00 2.32e+00 2.18e+00 1.04e-168 1 3.40e+00 8.00e+00 2.62e+00 2.47e+00 1.26e-16

3232 1 3.25e+00 8.00e+00 2.05e+00 2.02e+00 6.65e-1716 1 3.25e+00 8.00e+00 2.32e+00 2.18e+00 1.04e-168 1 3.40e+00 8.00e+00 2.62e+00 2.47e+00 1.26e-16

All the previous tests show that the CALU PRRP factorization is very stablefor random and more special matrices, and it also gives modest growth factorfor the pathological matrices on which CALU fails, this is for both binary treeand flat tree based CALU PRRP.

4 Lower bounds on communication

In this section we focus on the parallel CALU PRRP algorithm based on a bi-nary reduction tree, and we show that it minimizes the communication betweendifferent processors of a parallel computer. For this, we use known lower boundson the communication performed during the LU factorization of a dense matrixof size n× n, which are

# words moved = Ω

(n3√M

), (13)

# messages = Ω

(n3

M32

), (14)

RR n° 7867


where # words moved refers to the volume of communication, # messages refersto the number of messages exchanged, and M refers to the size of the memory(the fast memory in the case of a sequential algorithm, or the memory perprocessor in the case of a parallel algorithm). These lower bounds were firstintroduced for dense matrix multiplication [15], [16], generalized later to LUfactorization [4], and then to almost all direct linear algebra [1]. Note thatthese lower bounds apply to algorithms based on orthogonal transformationsunder certain conditions [1]. However, this is not relevant to our case, sinceCALU PRRP uses orthogonal transformations only to select pivot rows, whilethe update of the trailing matrix is still performed as in the classic LU factor-ization algorithm. Hence the lower bounds from equations (13) and, (14) arevalid for CALU PRRP.

We estimate now the cost of computing in parallel the CALU PRRP fac-torization of a matrix A of size m × n. The matrix is distributed on a grid ofP = Pr × Pc processors using a two-dimensional (2D) block cyclic layout. Weuse the following performance model. Let γ be the cost of performing a floatingpoint operation, and let α+βw be the cost of sending a message of size w words,where α is the latency cost and β is the inverse of the bandwidth. Then, thetotal running time of an algorithm is estimated to be

α · (# messages) + β · (# words moved) + γ · (# flops),

where #messages, #words moved, and #flops are counted along the criticalpath of the algorithm.

Table 13 displays the performance of parallel CALU PRRP (a detailed es-timation of the counts is presented in Appendix D). It also recalls the perfor-mance of two existing algorithms, the PDGETRF routine from ScaLAPACKwhich implements GEPP, and the CALU factorization. All three algorithmshave the same volume of communication, since it is known that PDGETRFalready minimizes the volume of communication. However, the number of mes-sages of both CALU PRRP and CALU is smaller by a factor of the order ofb than the number of messages of PDGETRF. This improvement is achievedthanks to tournament pivoting. In fact, partial pivoting, as used in the routinePDGETRF, leads to an O(n logP ) number of messages, and because of this,GEPP cannot minimize the number of messages.

Compared to CALU, CALU PRRP sends a small factor of less messages

(depending on Pr and Pc) and performs 1Pr

(2mn− n2

)b + nb2

3 (5 log2 Pr + 1)more flops (which represents a lower order term). This is because CALU PRRPuses the strong RRQR factorization at every node of the reduction tree of everypanel factorization, while CALU uses GEPP.

Despite this additional communication cost, we show now that CALU PRRPis optimal in terms of communication. We choose optimal values of the param-eters Pr, Pc, and b, as used in CAQR [4] and CALU [10], that is,

Pr =

√mP

n, Pc =

√nP

mand b =

1

4log−2

(√mP

n

)·√mn

P= log−2

(mP

n

)·√mn

P.

For a square matrix of size n× n, the optimal parameters are,

Pr =√P , Pc =

√P and b =

1

4log−2

(√P)· n√

P= log−2 (P ) · n√

P.

RR n° 7867


Table 13: Performance estimation of parallel (binary tree based) CALU PRRP,parallel CALU, and PDGETRF routine when factoring an m×n matrix, m ≥ n.The input matrix is distributed using a 2D block cyclic layout on a Pr×Pc gridof processors. Some lower order terms are omitted.

Parallel CALU PRRP

# messages 3nb

log2 Pr + 2nb

log2 Pc

# words(nb+ 3

2n2

Pc

)log2 Pr + 1

Pr

(mn− n2

2

)log2 Pc

# flops 1P

(mn2 − n3

3

)+ 2

Pr

(2mn− n2

)b+ n2b

2Pc+ 10nb2

3log2 Pr

Parallel CALU

# messages 3nb

log2 Pr + 3nb

log2 Pc

# words(nb+ 3n2

2Pc

)log2 Pr + 1

Pr

(mn− n2

2

)log2 Pc

# flops 1P

(mn2 − n3

3

)+ 1

Pr

(2mn− n2

)b+ n2b

2Pc+ nb2

3(5 log2 Pr − 1)

PDGETRF

# messages 2n(1 + 2

b

)log2 Pr + 3n

blog2 Pc

# words(

nb2

+ 3n2

2Pc

)log2 Pr + log2 Pc

1Pr

(mn− n2

2

)# flops 1

P

(mn2 − n3

3

)+ 1

Pr

(mn− n2

2

)b+ n2b

2Pc

Table 14 presents the performance estimation of parallel CALU PRRP andparallel CALU when using the optimal layout. It also recalls the lower boundson communication from equations (13) and (14) when the size of the memoryper processor is on the order of n2/P . Both CALU PRRP and CALU attainthe lower bounds on the number of words and on the number of messages,modulo polylogarithmic factors. Note that the optimal layout allows to reducecommunication, while keeping the number of extra floating point operationsperformed due to tournament pivoting as a lower order term. While in thissection we focused on minimizing communication between the processors of aparallel computer, it is straightforward to show that the usage of a flat treeduring tournament pivoting allows CALU PRRP to minimize communicationbetween different levels of the memory hierarchy of a sequential computer.

Table 14: Performance estimation of parallel (binary tree based) CALU PRRPand CALU with an optimal layout. The matrix factored is of size n× n. Somelower-order terms are omitted.

Parallel CALU PRRP with optimal layout Lower bound

# messages 52

√P log3 P Ω(

√P )

# words n2√

P

(12

log−1 P + logP)

Ω( n2√P

)

# flops 1P

2n3

3+ 5n3

2P log2 P+ 5n3

3P log3 P1P

2n3

3

Parallel CALU with optimal layout Lower bound

# messages 3√P log3 P Ω(

√P )

# words n2√

P

(12

log−1 P + logP)

Ω( n2√P

)

# flops 1P

2n3

3+ 3n3

2P log2 P+ 5n3

6P log3 P1P

2n3

3

RR n° 7867


5 Less stable factorizations that can also mini-mize communication

In this section, we present briefly two alternative algorithms that are basedon panel strong RRQR pivoting and that are conceived such that they canminimize communication. But we will see that they can be unstable in practice.These algorithms are also based on block algorithms, that factor the inputmatrix by traversing panels of size b. The main difference between them andCALU PRRP is the panel factorization, which is performed only once in thealternative algorithms.

We present first a parallel alternative algorithm, which we refer to as blockparallel LU PRRP. At each step of the block factorization, the panel is parti-tioned into P block-rows [A0;A1; . . . ;AP−1]. The blocks below the diagonal b×bblock of the current panel are eliminated by performing a binary tree of strongRRQR factorizations. At the leaves of the tree, the elements below the diagonalblock of each block Ai are eliminated using strong RRQR. The elimination ofeach such block row is followed by the update of the corresponding block row ofthe trailing matrix. The algorithm continues by performing the strong RRQRfactorization of pairs of b×b blocks stacked atop one another, until all the blocksbelow the diagonal block are eliminated and the corresponding trailing matricesare updated. The algebra of the block parallel LU PRRP algorithm is detailedin Appendix E, while in figure 6 we illustrate one step of the factorization byusing an arrow notation, where the function g(Aij) computes a strong RRQRon the matrix ATij and updates the trailing matrix in the same step.

g(A30)

g(A20)

g(A10)

g(A00)

g(A11)

g(A01)

g(A02)

Figure 6: Block parallel LU PRRP

A sequential version of the algorithm is based on the usage of a flat tree,and we refer to this algorithm as block pairwise LU PRRP. Using the arrownotation, the figure 7 illustrates the elimination of one panel.

A30

A20

A10

A00

:

:

:-g(A00)-g(A01)- g(A02)- g(A03)

Figure 7: Block pairwise LU PRRP

The block parallel LU PRRP and the block pairwise LU PRRP algorithmshave similarities with the block parallel pivoting and the block pairwise pivotingalgorithms. These two latter algorithms were shown to be potentially unstable

RR n° 7867


in [10]. There is a main difference between all these alternative algorithms andalgorithms that compute a classic LU factorization as GEPP, LU PRRP, andtheir communication avoiding variants. The alternative algorithms compute afactorization in the form of a product of lower triangular factors and an uppertriangular factor. And the elimination of each column leads to a rank update ofthe trailing matrix larger than one. It is thought in [20] that the rank-1 updateproperty of algorithms that compute an LU factorization inhibits potential ele-ment growth during the factorization, while a large rank update might lead toan unstable factorization.

Note however that at each step of the factorization, block parallel and blockpairwise LU PRRP use at each level of the reduction tree original rows of theactive matrix. Block parallel pivoting and block pairwise pivoting algorithmsuse U factors previously computed to achieve the factorization, and this couldpotentially lead to a faster propagation of ill-conditioning.

Figure 8: Growth factor of block parallel LU PRRP for varying block size band number of processors P .

The upper bound of the growth factor of both block parallel and block pair-wise LU PRRP is (1+τb)

nb , since for every panel factorization, a row is updated

only once. Hence they have the same bounds as the LU PRRP factorization,and smaller than that of the CALU PRRP factorization. Despite this, theyare less stable than the CALU PRRP factorization. Figures 8 and 9 displaythe growth factor of block parallel LU PRRP and block pairwise LU PRRP formatrices following a normal distribution. In figure 8, the number of processorsP on which each panel is partitioned is varying from 16 to 32, and the block sizeb is varying from 2 to 16. The matrix size varies from 64 to 2048, but we haveobserved the same behavior for matrices of size up to 8192. When the numberof processors P is equal to 1, the block parallel LU PRRP corresponds to theLU PRRP factorization. The results show that there are values of P and b forwhich this method can be very unstable. For the sizes of matrices tested, whenb is chosen such that the blocks at the leaves of the reduction tree have morethan 2b rows, the number of processors P has an important impact, the growthfactor increases with increasing P , and the method is unstable.

In Figure 9, the matrix size varies from 1024 to 8192. For a given matrix

RR n° 7867


size, the growth factor increases with decreasing the size of the panel b, as onecould expect. We note that the growth factor of block pairwise LU PRRP islarger than that obtained with the CALU PRRP factorization based on a flattree scheme presented in Table 21. But it stays smaller than the size of thematrix n for different panel sizes. Hence this method is more stable than blockparallel LU PRRP. Further investigation is required to conclude on the stabilityof these methods.

Figure 9: Growth factor of block pairwise LU PRRP for varying matrix sizeand varying block size b.

RR n° 7867


6 Conclusions

This paper introduces LU PRRP, an LU factorization algorithm based on panelrank revealing pivoting. This algorithm is more stable than GEPP in terms ofworst case growth factor. It is also very stable in practice for various classes ofmatrices, including pathological cases on which GEPP fails.

Its communication avoiding version, CALU PRRP, is also more stable interms of worst case growth factor than CALU, the communication avoidingversion of GEPP. More importantly, there are cases of interest for which theupper bound of the growth factor of CALU PRRP is smaller than that of GEPPfor several cases of interest. Extensive experiments show that CALU PRRP isvery stable in practice and leads to results of the same order of magnitude asGEPP, sometimes even better.

Our future work focuses on two main directions. The first direction investi-gates the design of a communication avoiding algorithm that has smaller boundson the growth factor than that of GEPP in general. The second direction focuseson estimating the performance of CALU PRRP on parallel machines based onmulticore processors, and comparing it with the performance of CALU.

References

[1] G. Ballard, J. Demmel, O. Holtz, and O. Schwartz, Minimizingcommunication in linear algebra, SIMAX, 32 (2011), pp. 866–90.

[2] D. W. Barron and H. P. F. Swinnerton-Dyer, Solution of Simulta-neous Linear Equations using a Magnetic-Tape Store, Computer Journal,3 (1960), pp. 28–33.

[3] J. Demmel, Applied Numerical Linear Algebra, SIAM, Philadelphia, 1997.

[4] J. W. Demmel, L. Grigori, M. Hoemmen, and J. Langou,Communication-optimal parallel and sequential QR and LU factoriza-tions, SIAM Journal on Scientific Computing, (2011). short version ofUCB/EECS-2008-89.

[5] S. Donfack, L. Grigori, and A. Gupta, Adapting Communication-avoinding LU and QR factorizations to multicore architectures, Proceedingsof the IPDPS Conference, (2010).

[6] J. Dongarra, P. Luszczek, and A. Petitet, The LINPACK Bench-mark: Past, Present and Future, Concurrency: Practice and Experience,15 (2003), pp. 803–820.

[7] L. V. Foster, Gaussian Elimination with Partial Pivoting Can Fail inPractice, SIAM J. Matrix Anal. Appl., 15 (1994), pp. 1354–1362.

[8] L. V. Foster, The growth factor and efficiency of Gaussian eliminationwith rook pivoting, J. Comput. Appl. Math., 86 (1997), pp. 177–194.

[9] N. I. M. Gould, On growth in Gaussian elimination with complete pivot-ing, SIAM J. Matrix Anal. Appl., 12 (1991), pp. 354–361.

RR n° 7867


[10] L. Grigori, J. Demmel, and H. Xiang, CALU: A communication op-timal LU factorization algorithm, SIAM Journal on Matrix Analysis andApplications, (2011).

[11] L. Grigori, J. W. Demmel, and H. Xiang, Communication avoidingGaussian elimination, Proceedings of the ACM/IEEE SC08 Conference,(2008).

[12] M. Gu and S. C. Eisenstat, Efficient Algorithms For Computing AStrong Rank Reaviling QR Factorization, SIAM J. SCI.COMPUT., 17(1996), pp. 848–896.

[13] N. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, sec-ond ed., 2002.

[14] N. Higham and D. J. Higham, Large Growth Factors in Gaussian Elim-ination with Pivoting, SIMAX, 10 (1989), pp. 155–164.

[15] J.-W. Hong and H. T. Kung, I/O complexity: The Red-Blue PebbleGame, in STOC ’81: Proceedings of the Thirteenth Annual ACM Sympo-sium on Theory of Computing, New York, NY, USA, 1981, ACM, pp. 326–333.

[16] D. Irony, S. Toledo, and A. Tiskin, Communication lower bounds fordistributed-memory matrix multiplication, J. Parallel Distrib. Comput., 64(2004), pp. 1017–1026.

[17] R. D. Skeel, Iterative refinement implies numerical stability for Gaussianelimination, Math. Comp., 35 (1980), pp. 817–831.

[18] D. C. Sorensen, Analysis of pairwise pivoting in Gaussian elimination,IEEE Transactions on Computers, 3 (1985), p. 274278.

[19] G. W. Stewart, Gauss, statistics, and Gaussian elimination, J. Comput.Graph. Statist., 4 (1995).

[20] L. N. Trefethen and R. S. Schreiber, Average-case stability of Gaus-sian elimination, SIAM J. Matrix Anal. Appl., 11 (1990), pp. 335–360.

[21] J. H. Wilkinson, Error analysis of direct methods of matrix inversion, J.Assoc. Comput. Mach., 8 (1961), pp. 281–330.

[22] , The algebric eigenvalue problem, Oxford University Press, (1985).

[23] S. J. Wright, A collection of problems for which Gaussian eliminationwith partial pivoting is unstable, SIAM J. SCI. STATIST. COMPUT., 14(1993), pp. 231–238.

RR n° 7867


Appendix A

We describe briefly strong RRQR introduced by M. Gu and S. Eisenstat in [12].This factorization will be used in our new LU decomposition algorithm, whichaims to obtain an upper bound of the growth factor smaller than GEPP (seeSection 2.) Consider a given threshold τ > 1 and an h×p matrix B with p > h,a Strong RRQR factorization on a matrix B gives (with an empty (2, 2) block)

BTΠ = QR = Q[R11 R12

],

where ‖R−111 R12‖max ≤ τ , with ‖ . ‖max being the biggest entry of a given ma-trix in absolute value. This factorization can be computed by a classical QRfactorization with column pivoting followed by a limited number of additionalswaps and QR updates if necessary.

Algorithm 2 Strong RRQR

1: Compute BTΠ = QR using the classical RRQR with column pivoting2: while there exist i and j such that |(R−111 R12)ij | > τ do

3: Set Π = ΠΠij and compute the QR factorization of R Πij (QR updates)4: end while

Ensure: BTΠ = QR with ‖R−111 R12‖max ≤ τ

The while loop in Algorithm 2 interchanges any pairs of columns that canincrease |det(R11)| by at least a factor τ . At most O(logτ n) such interchangesare necessary before Algorithm 2 finds a strong RRQR factorization. The QRfactorization of BTΠ can be computed numerically via efficient and numericallystable QR updating procedures. See [12] for details.

Appendix B

We present experimental results for the LU PRRP factorization, the binary treebased CALU PRRP, and the flat tree based CALU PRRP. We show resultsobtained for the LU decomposition and the linear solver. Tables 16, 21, and 23display the results obtained for random matrices. They show the growth factor,the norm of the factor L and U and their inverses, and the relative error of thedecomposition.Tables 17, 18, 22, and 24 display the results obtained for the special matricespresented in Table 15. The size of the tested matrices is n = 4096. For LU PRRPand flat tree based CALU PRRP, the size of the panel is b = 8. For binary treebased CALU PRRP we use P = 64 and b = 8, this means that the size of thematrices used at the leaves of the reduction tree is 64× 8.Tables 19 and 20 present results for the linear solver using binary tree based andflat tree based CALU PRRP, together with CALU and GEPP for comparison.The tables are presented in the following order.

• Table 16: Stability of the LU decomposition for LU PRRP and GEPP onrandom matrices.

• Table 17: Stability of the LU decomposition for GEPP on special matrices.

RR n° 7867


• Table 18: Stability of the LU decomposition for LU PRRP on specialmatrices.

• Table 19: Stability of the linear solver using binary tree based CALU PRRP,binary tree based CALU, and GEPP.

• Table 20: Stability of the linear solver using flat tree based CALU PRRP,flat tree based CALU, and GEPP.

• Table 21: Stability of the LU decomposition for flat tree based CALU PRRPand GEPP on random matrices.

• Table 22: Stability of the LU decomposition for flat tree based CALU PRRPon special matrices.

• Table 23: Stability of the LU decomposition for binary tree based CALU PRRPand GEPP on random matrices.

• Table 24: Stability of the LU decomposition for binary tree based CALU PRRPon special matrices.

Table 15: Special matrices in our test set.

No. Matrix Remarks1 hadamard Hadamard matrix, hadamard(n), where n, n/12, or n/20 is power of

2.2 house Householder matrix, A = eye(n) − β ∗ v ∗ v′, where [v, β, s] =

gallery(’house’, randn(n, 1)).3 parter Parter matrix, a Toeplitz matrix with most of singular values near π.

gallery(’parter’, n), or A(i, j) = 1/(i− j + 0.5).4 ris Ris matrix, matrix with elements A(i, j) = 0.5/(n− i− j + 1.5). The

eigenvalues cluster around −π/2 and π/2. gallery(’ris’, n).5 kms Kac-Murdock-Szego Toeplitz matrix. Its inverse is tridiagonal.

gallery(’kms’, n) or gallery(’kms’, n, rand).6 toeppen Pentadiagonal Toeplitz matrix (sparse).7 condex Counter-example matrix to condition estimators. gallery(’condex’, n).8 moler Moler matrix, a symmetric positive definite (spd) matrix.

gallery(’moler’, n).9 circul Circulant matrix, gallery(’circul’, randn(n, 1)).10 randcorr Random n × n correlation matrix with random eigenvalues from

a uniform distribution, a symmetric positive semi-definite matrix.gallery(’randcorr’, n).

11 poisson Block tridiagonal matrix from Poisson’s equation (sparse), A =gallery(’poisson’,sqrt(n)).

12 hankel Hankel matrix, A = hankel(c, r), where c=randn(n, 1), r=randn(n, 1),and c(n) = r(1).

13 jordbloc Jordan block matrix (sparse).14 compan Companion matrix (sparse), A = compan(randn(n+1,1)).15 pei Pei matrix, a symmetric matrix. gallery(’pei’, n) or gallery(’pei’, n,

randn).16 randcolu Random matrix with normalized cols and specified singular values.

gallery(’randcolu’, n).17 sprandn Sparse normally distributed random matrix, A = sprandn(n, n,0.02).18 riemann Matrix associated with the Riemann hypothesis. gallery(’riemann’, n).

RR n° 7867


19 compar Comparison matrix, gallery(’compar’, randn(n), unidrnd(2)−1).20 tridiag Tridiagonal matrix (sparse).21 chebspec Chebyshev spectral differentiation matrix, gallery(’chebspec’, n, 1).22 lehmer Lehmer matrix, a symmetric positive definite matrix such that

A(i, j) = i/j for j ≥ i. Its inverse is tridiagonal. gallery(’lehmer’,n).

23 toeppd Symmetric positive semi-definite Toeplitz matrix. gallery(’toeppd’, n).24 minij Symmetric positive definite matrix with A(i, j) = min(i, j).

gallery(’minij’, n).25 randsvd Random matrix with preassigned singular values and specified band-

width. gallery(’randsvd’, n).26 forsythe Forsythe matrix, a perturbed Jordan block matrix (sparse).27 fiedler Fiedler matrix, gallery(’fiedler’, n), or gallery(’fiedler’, randn(n, 1)).28 dorr Dorr matrix, a diagonally dominant, ill-conditioned, tridiagonal matrix

(sparse).

29 demmel A = D∗(eye(n) + 10−7∗rand(n)), where D = diag(1014∗(0:n−1)/n) [3].30 chebvand Chebyshev Vandermonde matrix based on n equally spaced points on

the interval [0, 1]. gallery(’chebvand’, n).31 invhess A=gallery(’invhess’, n, rand(n−1, 1)). Its inverse is an upper Hessen-

berg matrix.32 prolate Prolate matrix, a spd ill-conditioned Toeplitz matrix. gallery(’prolate’,

n).33 frank Frank matrix, an upper Hessenberg matrix with ill-conditioned eigen-

values.34 cauchy Cauchy matrix, gallery(’cauchy’, randn(n, 1), randn(n, 1)).35 hilb Hilbert matrix with elements 1/(i+ j − 1). A =hilb(n).36 lotkin Lotkin matrix, the Hilbert matrix with its first row altered to all ones.

gallery(’lotkin’, n).37 kahan Kahan matrix, an upper trapezoidal matrix.

RR n° 7867


Tab

le16

:S

tab

ilit

yof

the

LU

dec

om

posi

tion

for

LU

PR

RP

an

dG

EP

Pon

ran

dom

matr

ices

.L

UP

RR

P

nb

g W||U|| 1

||U−1|| 1

||PA−LU|| F

||A|| F

HP

L1

HP

L2

HP

L3

8192

128

2.38

E+

019.

94E

+04

3.3

4E

+02

5.2

6E

-14

4.1

3E

-02

2.3

5E

-02

4.6

8E

-03

642.

76E

+01

1.07

E+

05

1.0

6E

+03

5.2

4E

-14

5.3

8E

-02

2.0

8E

-02

4.3

6E

-03

323.

15E

+01

1.20

E+

05

3.3

8E

+02

5.1

4E

-14

3.9

5E

-02

2.2

7E

-02

4.5

4E

-03

163.

63E

+01

1.32

E+

05

3.2

0E

+02

4.9

1E

-14

3.2

9E

-02

1.8

9E

-02

3.7

9E

-03

83.

94E

+01

1.41

E+

05

2.6

5E

+02

4.8

4E

-14

2.9

5E

-02

1.6

9E

-02

3.3

7E

-03

4096

128

1.75

E+

013.

74E

+04

4.3

4E

+02

2.6

9E

-14

3.2

8E

-02

2.2

1E

-02

4.5

8E

-03

641.

96E

+01

4.09

E+

04

1.0

2E

+03

2.6

2E

-14

2.1

0E

-01

2.1

7E

-02

4.6

4E

-03

322.

33E

+01

4.51

E+

04

2.2

2E

+03

2.5

6E

-14

5.7

8E

-22.1

0E

-02

4.5

1E

-03

162.

61E

+01

4.89

E+

04

1.1

5E

+03

2.5

7E

-14

2.9

8E

-01

1.9

9E

-02

4.0

8E

-03

82.

91E

+01

5.36

E+

04

1.9

3E

+03

2.6

0E

-14

2.3

6E

-01

1.8

2E

-02

3.9

0E

-03

2048

128

1.26

E+

011.

39E

+04

5.6

1E

+02

1.4

6E

-14

4.3

7E

-02

2.2

5E

-02

5.1

6E

-03

641.

46E

+01

1.52

E+

04

1.8

6E

+02

1.4

2E

-14

4.2

5E

-02

2.2

4E

-02

5.0

5E

-03

321.

56E

+01

1.67

E+

04

3.4

8E

+03

1.3

8E

-14

7.5

3E

-12.1

5E

-02

4.6

5E

-03

161.

75E

+01

1.80

E+

04

2.5

9E

+02

1.3

8E

-14

2.5

1E

-02

2.0

3E

-02

4.5

9E

-03

82.

06E

+01

2.00

E+

04

5.0

0E

+02

1.4

0E

-14

4.4

6E

-02

1.9

3E

-02

4.3

3E

-03

1024

128

8.04

E+

005.

19E

+03

5.2

8E

+02

7.7

1E

-15

9.9

2E

-02

2.2

6E

-02

5.4

5E

-03

649.

93E

+00

5.68

E+

03

5.2

8E

+02

7.7

3E

-15

4.3

6E

-02

2.1

3E

-02

4.8

8E

-03

321.

13E

+01

6.29

E+

03

3.7

5E

+02

7.5

1E

-15

7.9

0E

-02

2.0

9E

-02

4.7

4E

-03

161.

31E

+01

6.91

E+

03

1.5

1E

+02

7.5

5E

-15

2.1

0E

-02

1.9

3E

-02

4.0

8E

-03

81.

44E

+01

7.54

E+

03

4.1

6E

+03

7.5

5E

-15

1.3

4E

+00

1.8

4E

-02

4.2

5E

-03

GE

PP

8192

-5.

47E

+01

8.71

E+

03

6.0

3E

+02

7.2

3E

-14

1.3

4E

-02

1.3

6E

-02

2.8

4E

-03

4096

-3.

61E

+01

1.01

E+

03

1.8

7E

+02

3.8

8E

-14

1.8

3E

-02

1.4

3E

-02

2.9

0E

-03

2048

-2.

63E

+01

5.46

E+

02

1.8

1E

+02

2.0

5E

-14

2.8

8E

-02

1.5

4E

-02

3.3

7E

-03

1024

-1.

81E

+01

2.81

E+

02

4.3

1E

+02

1.0

6E

-14

5.7

8E

-02

1.6

2E

-02

3.7

4E

-03

RR n° 7867


Tab

le17

:S

tab

ilit

yof

the

LU

dec

om

posi

tion

for

GE

PP

on

spec

ial

matr

ices

.

matr

ixco

nd

(A,2

)g W

||L|| 1

||L−1|| 1

max

ij|Uij|

minkk|Ukk|

con

d(U

,1)

||PA−LU|| F

||A|| F

ηwb

NIR

well-conditioned

had

am

ard

1.0E

+0

4.1

E+

34.1

E+

34.

1E+

34.

1E+

31.

0E+

05.

3E+

50.0

E+

03.3

E-1

64.6

E-1

52

hou

se1.

0E

+0

5.1

E+

08.9

E+

22.

6E+

25.

1E+

05.7

1.4E

+4

2.0

E-1

55.6

E-1

76.3

E-1

53

part

er4.

8E

+0

1.6

E+

04.8

E+

12.

0E+

03.

1E+

02.

0E+

02.

3E+

22.3

E-1

58.3

E-1

64.4

E-1

53

ris

4.8E

+0

1.6

E+

04.8

E+

12.

0E+

01.

6E+

01.

0E+

02.

3E+

22.3

E-1

57.1

E-1

64.7

E-1

52

km

s9.

1E

+0

1.0

E+

02.0

E+

01.

5E+

01.

0E+

07.

5E-1

3.0E

+0

2.0

E-1

61.1

E-1

66.7

E-1

61

toep

pen

1.0E

+1

1.1

E+

02.1

E+

09.

0E+

01.

1E+

11.

0E+

13.

3E+

11.1

E-1

77.2

E-1

73.0

E-1

51

con

dex

1.0E

+2

1.0

E+

02.0

E+

05.

6E+

01.

0E+

21.

0E+

07.

8E+

21.8

E-1

59.7

E-1

66.8

E-1

53

mole

r1.

9E

+2

1.0

E+

02.2

E+

12.

0E+

01.

0E+

01.

0E+

04.

4E+

13.8

E-1

42.6

E-1

61.7

E-1

52

circ

ul

3.7E

+2

1.8

E+

21.0

E+

31.

4E+

36.

4E+

23.

4E+

01.

2E+

64.3

E-1

42.1

E-1

51.2

E-1

41

ran

dco

rr1.

4E

+3

1.0

E+

03.1

E+

15.

7E+

11.

0E+

02.

3E-1

5.0E

+4

1.6

E-1

57.8

E-1

78.0

E-1

61

pois

son

1.7E

+3

1.0

E+

02.0

E+

03.

4E+

14.

0E+

03.

2E+

07.

8E+

12.8

E-1

61.4

E-1

67.5

E-1

61

han

kel

2.9E

+3

6.2

E+

19.8

E+

21.

5E+

32.

4E+

24.

5E+

02.

0E+

64.2

E-1

42.5

E-1

51.6

E-1

42

jord

blo

c5.

2E

+3

1.0

E+

01.0

E+

01.

0E+

01.

0E+

01.

0E+

08.

2E+

30.0

E+

02.0

E-1

78.3

E-1

70

com

pan

7.5E

+3

1.0

E+

02.0

E+

04.

0E+

07.

9E+

02.

6E-1

7.8E

+1

0.0

E+

02.0

E-1

76.2

E-1

31

pei

1.0E

+4

1.0

E+

04.1

E+

39.

8E+

01.

0E+

03.

9E-1

2.5E

+1

7.0

E-1

66.6

E-1

82.3

E-1

70

ran

dco

lu1.

5E

+4

4.6

E+

19.9

E+

21.

4E+

33.

2E+

05.

6E-0

21.

1E+

74.0

E-1

42.3

E-1

51.4

E-1

41

spra

nd

n1.

6E

+4

7.4

E+

07.4

E+

21.

5E+

32.

9E+

11.

7E+

01.

3E+

73.4

E-1

48.5

E-1

59.3

E-1

42

riem

an

n1.

9E

+4

1.0

E+

04.1

E+

33.

5E+

04.

1E+

31.

0E+

02.

6E+

65.7

E-1

92.0

E-1

61.7

E-1

52

com

par

1.8E

+6

2.3

E+

19.8

E+

21.

4E+

31.

1E+

23.

1E+

02.

7E+

72.3

E-1

41.2

E-1

58.8

E-1

51

trid

iag

6.8E

+6

1.0

E+

02.0

E+

01.

5E+

32.

0E+

01.

0E+

05.

1E+

31.4

E-1

82.6

E-1

71.2

E-1

60

cheb

spec

1.3E

+7

1.0

E+

05.4

E+

19.

2E+

07.

1E+

61.

5E+

34.

2E+

71.8

E-1

52.9

E-1

81.6

E-1

51

leh

mer

1.8E

+7

1.0

E+

01.5

E+

32.

0E+

01.

0E+

04.

9E-4

8.2E

+3

1.5

E-1

52.8

E-1

71.7

E-1

60

toep

pd

2.1E

+7

1.0

E+

04.2

E+

19.

8E+

22.

0E+

32.

9E+

21.

3E+

61.5

E-1

55.0

E-1

73.3

E-1

61

min

ij2.

7E

+7

1.0

E+

04.1

E+

32.

0E+

01.

0E+

01.

0E+

08.

2E+

30.0

E+

07.8

E-1

94.2

E-1

80

ran

dsv

d6.

7E

+7

4.7

E+

09.9

E+

21.

4E+

36.4

E-0

23.

6E-7

1.4

E+

105.6

E-1

53.4

E-1

62.0

E-1

52

fors

yth

e6.

7E

+7

1.0

E+

01.0

E+

01.

0E+

01.

0E+

01.

5E-8

6.7E

+7

0.0

E+

00.0

E+

00.

0E

+0

0

ill-conditioned

fied

ler

2.5

E+

10

1.0

E+

01.7

E+

31.

5E+

17.

9E+

04.

1E-7

2.9E

+8

1.6

E-1

63.3

E-1

71.0

E-1

51

dorr

7.4

E+

10

1.0

E+

02.0

E+

03.

1E+

23.

4E+

51.

3E+

01.7

E+

116.0

E-1

82.3

E-1

72.2

E-1

51

dem

mel

1.0

E+

14

2.5

E+

01.2

E+

21.

4E+

21.

6E+

147.

8E+

31.7

E+

173.7

E-1

57.1

E-2

14.8

E-9

2ch

ebva

nd

3.8

E+

19

2.0

E+

22.2

E+

33.

1E+

31.

8E+

29.

0E-1

04.8

E+

225.1

E-1

43.3

E-1

72.6

E-1

61

invh

ess

4.1

E+

19

2.0

E+

04.1

E+

32.

0E+

05.

4E+

04.

9E-4

3.0

E+

481.2

E-1

41.7

E-1

72.4

E-1

4(1

)p

rola

te1.4

E+

20

1.2

E+

11.4

E+

34.

6E+

35.

3E+

05.

9E-1

34.7

E+

231.6

E-1

44.7

E-1

66.3

E-1

5(1

)fr

ank

1.7

E+

20

1.0

E+

02.0

E+

02.

0E+

04.

1E+

35.

9E-2

41.9

E+

302.2

E-1

84.9

E-2

71.2

E-2

30

cau

chy

5.5

E+

21

1.0

E+

03.1

E+

21.

9E+

21.

0E+

72.

3E-1

52.1

E+

241.4

E-1

56.1

E-1

95.2

E-1

5(1

)h

ilb

8.0

E+

21

1.0

E+

03.1

E+

31.

3E+

31.

0E+

04.

2E-2

02.2

E+

222.2

E-1

66.0

E-1

92.0

E-1

70

lotk

in5.4

E+

22

1.0

E+

02.6

E+

31.

3E+

31.

0E+

03.

6E-1

92.3

E+

228.0

E-1

73.0

E-1

82.3

E-1

5(1

)ka

han

1.1

E+

28

1.0

E+

01.0

E+

01.

0E+

01.

0E+

02.

2E-1

34.1

E+

530.0

E+

09.7

E-1

84.3

E-1

61

RR n° 7867


Tab

le18

:S

tab

ilit

yof

the

LU

dec

om

posi

tion

for

LU

PR

RP

on

spec

ial

matr

ices

.

mat

rix

con

d(A

,2)

g W||L|| 1

||L−1|| 1

||U|| 1

||U−1|| 1

||PA−LU|| F

||A|| F

ηwb

NIR

well-conditioned

had

amar

d1.

0E+

05.

1E+

02

5.1E

+02

2.5

E+

165.6

E+

06.

0E+

00

5.2E

-15

1.1E

-15

9.0

E-1

52

hou

se1.

0E+

07.

4E+

00

8.3E

+02

7.9

E+

023.

7E+

025.

0E+

01

5.8E

-16

4.2E

-17

4.8

E-1

5(2

)p

art

er4.

8E+

01.

5E+

00

4.7E

+01

3.7

E+

001.

4E+

013.

6E+

01

8.5E

-16

6.8E

-16

4.4

E-1

53

ris

4.8E

+0

1.5E

+00

4.7E

+01

1.1

E+

017.

3E+

007.

2E+

01

8.5E

-16

7.6E

-16

4.6

E-1

52

km

s9.

1E+

01.

0E+

00

6.7E

+00

4.0

E+

004.

1E+

001.

2E+

01

1.4E

-16

6.8E

-17

3.7

E-1

61

toep

pen

1.0E

+1

1.3E

+00

2.3E

+00

3.4

E+

003.

6E+

011.

0E+

00

1.3E

-16

8.3E

-17

2.2

E-1

51

con

dex

1.0E

+2

1.0E

+00

1.9E

+00

5.5

E+

005.

9E+

021.

2E+

00

6.5E

-16

7.4E

-16

5.1

E-1

5(2

)m

oler

1.9E

+2

1.0E

+00

2.7E

+03

3.9

E+

021.

3E+

053.

0E+

15

8.1E

-16

3.9E

-19

3.4

E-1

70

circ

ul

3.7E

+2

1.4E

+02

1.0E

+03

6.7

E+

177.

4E+

041.

3E+

01

3.7E

-14

3.2E

-15

2.1

E-1

42

ran

dco

rr1.

4E+

31.

0E+

00

3.0E

+01

3.7

E+

013.

5E+

012.

4E+

04

5.8E

-16

5.9E

-17

5.5

E-1

61

poi

sson

1.7E

+3

1.0E

+00

1.9E

+00

2.0

E+

017.

32E

+00

2.0E

+01

1.6E

-16

9.1E

-17

1.8

E-1

51

han

kel

2.9E

+3

5.0E

+01

1.0E

+03

1.8

E+

175.

7E+

043.

3E+

01

3.5E

-14

2.8E

-15

1.7

E-1

42

jord

blo

c5.

2E+

31.

0E+

00

1.0E

+00

1.0

E+

002.

0E+

004.

0E+

03

0.0E

+00

1.9E

-17

6.1

E-1

70

com

pan

7.5E

+3

1.0E

+00

2.9E

+00

3.9

E+

016.

8E+

024.

1E+

00

6.7E

-16

5.5E

-16

6.9

E-1

21

pei

1.0E

+4

4.9E

+00

2.8E

+03

1.6

E+

011.

7E+

011.

7E+

01

6.5E

-16

1.4E

-17

3.2

E-1

70

ran

dco

lu1.

5E+

43.

3E+

01

1.1E

+03

1.5

E+

178.

8E+

021.

3E+

04

3.5E

-14

2.9E

-15

1.8

E-1

42

spra

nd

n1.

6E+

44.

9E+

00

7.8E

+02

1.4

E+

178.

8E+

035.

4E+

03

3.1E

-14

1.0E

-14

1.3

E-1

32

riem

ann

1.9E

+4

1.0E

+00

4.0E

+03

1.0

E+

033.

1E+

061.

4E+

02

4.5E

-16

1.5E

-16

2.0

E-1

51

com

par

1.8E

+6

9.0E

+02

1.7E

+03

9.8

E+

169.

0E+

058.

0E+

02

1.4E

-14

1.1E

-15

7.7

E-1

51

trid

iag

6.8E

+6

1.0E

+00

1.9E

+00

2.0

E+

024.

0E+

001.

6E+

04

3.2E

-16

2.8E

-17

2.4

E-1

60

cheb

spec

1.3E

+7

1.0E

+00

5.2E

+01

1.6

E+

249.

7E+

061.

7E+

00

5.9E

-16

1.0E

-18

5.0

E-1

51

leh

mer

1.8E

+7

1.0E

+00

1.5E

+03

9.5

E+

008.

90E

+00

8.1E

+03

5.6E

-16

1.00

E-1

76.1

E-1

70

toep

pd

2.1E

+7

1.0E

+00

4.2E

+01

1.2

E+

026.

69E

+04

7.2E

+01

5.8E

-16

2.6E

-17

1.6

E-1

60

min

ij2.

7E+

71.

0E+

00

4.0E

+03

1.1

E+

031.

7E+

064.

0E+

00

5.3E

-16

1.1E

-18

1.3

E-1

60

ran

dsv

d6.

7E+

73.

9E+

00

1.1E

+03

3.2

E+

175.

4E+

002.

5E+

09

5.0E

-15

4.6E

-16

2.9

E-1

52

fors

yth

e6.

7E+

71.

0E+

00

1.0E

+00

1.0

E+

001.

0E+

006.

7E+

07

0.0E

+00

0.0E

+00

0.0E

+00

0

ill-conditioned

fied

ler

2.5E

+10

1.0E

+00

9.3E

+02

3.06

E+

036.

5E+

011.

5E+

07

2.4E

-16

9.9E

-17

2.0

E-1

51

dorr

7.4E

+10

1.0E

+00

2.0E

+00

5.4

E+

016.

7E+

052.

5E+

05

1.6E

-16

3.4E

-17

4.0

E-1

51

dem

mel

1.0E

+14

1.3E

+00

1.1E

+02

1.7

E+

165.

0E+

152.

5E+

01

3.2E

-15

9.1E

-21

1.1

E-0

82

cheb

van

d3.

8E+

19

1.8E

+02

1.4E

+03

1.2

E+

177.

3E+

044.

3E+

19

5.3E

-14

4.0E

-17

2.3

E-1

60

invh

ess

4.1E

+19

2.1E

+00

4.0E

+03

7.4

E+

035.

8E+

031.

0E+

22

1.4E

-15

2.5E

-17

3.3

E-1

41

pro

late

1.4E

+20

8.9E

+00

1.3E

+03

1.4

E+

181.

4E+

039.

7E+

20

1.9E

-14

1.0E

-15

2.6

E-1

41

fran

k1.

7E+

20

1.0E

+00

1.9E

+00

1.9

E+

004.

1E+

062.

3E+

16

1.6E

-17

1.8E

-20

6.2

E-1

70

cau

chy

5.5E

+21

1.0E

+00

3.3E

+02

2.0

E+

066.

9E+

071.

8E+

20

6.0E

-16

3.9E

-19

1.5

E-1

41

hil

b8.

0E+

21

1.0E

+00

3.0E

+03

1.1

E+

182.

4E+

003.

1E+

22

3.2E

-16

6.6E

-19

2.3

E-1

70

lotk

in5.

4E+

22

1.0E

+00

2.6E

+03

2.0

E+

182.

4E+

008.

1E+

22

6.6E

-17

2.3E

-18

1.0

E-1

50

kah

an1.

1E+

28

1.0E

+00

1.0E

+00

1.0

E+

005.

3E+

007.

6E+

52

0E

+00

1.1E

-17

5.2

E-1

61

RR n° 7867


Table 19: Stability of the linear solver using binary tree based CALU PRRP,binary tree based CALU, and GEPP.

n P b η wb NIR HPL1 HPL2 HPL3Binary tree based CALU PRRP

8192

25632 7.5E-15 4.4E-14 2 6.2E-02 2.3E-02 4.4E-0316 6.7E-15 4.1E-14 2 3.4E-02 2.2E-02 4.6E-03

12864 7.6E-15 4.7E-14 2 2.6E-02 2.5E-02 4.9E-0332 7.5E-15 4.9E-14 2 6.9E-02 2.6E-02 4.9E-0316 7.3E-15 5.1E-14 2 4.2E-02 2.7E-02 5.7E-03

64

128 7.6E-15 5.0E-14 2 2.8E-02 2.6E-02 5.2E-0364 7.9E-15 5.3E-14 2 6.2E-02 2.8E-02 5.9E-0332 7.8E-15 5.0E-14 2 7.0E-02 2.6E-02 5.0E-0316 6.7E-15 5.0E-14 2 4.2E-02 2.7E-02 5.7E-03

4096

256 16 3.5E-15 2.2E-14 2 5.8E-02 2.3E-02 5.1E-03

12832 3.8E-15 2.3E-14 2 5.3E-02 2.4E-02 5.1E-0316 3.6E-15 2.2E-14 1.6 1.3E-02 2.4E-02 5.1E-03

6464 4.0E-15 2.3E-14 2 3.8E-02 2.4E-02 4.9E-0332 3.9E-15 2.4E-14 2 1.2 E-02 2.5E-02 5.7E-0316 3.8E-15 2.4E-14 1.6 2.3E-02 2.5E-02 5.2E-03

2048128 16 1.8E-15 1.0E-14 2 8.1E-02 2.2E-02 4.8E-03

6432 1.8E-15 1.2E-14 2 9.7E-02 2.5E-02 5.6E-0316 1.9E-15 1.2E-14 1.8 3.4E-02 2.5E-02 5.4E-03

1024 64 16 1.0E-15 6.3E-15 1.3 7.2E-02 2.5E-02 6.1E-03Binary tree based CALU

8192

25632 6.2E-15 4.1E-14 2 3.6E-02 2.2E-02 4.5E-0316 5.8E-15 3.9E-14 2 4.5E-02 2.1E-02 4.1E-03

12864 6.1E-15 4.2E-14 2 5.0E-02 2.2E-02 4.6E-0332 6.3E-15 4.0E-14 2 2.5E-02 2.1E-02 4.4E-0316 5.8E-15 4.0E-14 2 3.8E-02 2.1E-02 4.3E-03

64

128 5.8E-15 3.6E-14 2 8.3E-02 1.9E-02 3.9E-0364 6.2E-15 4.3E-14 2 3.2E-02 2.3E-02 4.4E-0332 6.3E-15 4.1E-14 2 4.4E-02 2.2E-02 4.5E-0316 6.0E-15 4.1E-14 2 3.4E-02 2.2E-02 4.2E-03

4096

256 16 3.1E-15 2.1E-14 1.7 3.0E-02 2.2E-02 4.4E-03

12832 3.2E-15 2.3E-14 2 3.7E-02 2.4E-02 5.1E-0316 3.1E-15 1.8E-14 2 5.8E-02 1.9E-02 4.0E-03

6464 3.2E-15 2.1E-14 1.7 3.1E-02 2.2E-02 4.6E-0332 3.2E-15 2.2E-14 1.3 3.6E-02 2.3E-02 4.7E-0316 3.1E-15 2.0E-14 2 9.4E-02 2.1E-02 4.3E-03

2048128 16 1.7E-15 1.1E-14 1.8 6.9E-02 2.3E-02 5.1E-03

6432 1.7E-15 1.0E-14 1.6 6.5E-02 2.1E-02 4.6E-0316 1.6E-15 1.1E-14 1.8 4.7E-02 2.2E-02 4.9E-03

1024 64 16 8.7E-16 5.2E-15 1.6 1.2E-1 2.1E-02 4.7E-03GEPP

8192 - 3.9E-15 2.6E-14 1.6 1.3E-02 1.4E-02 2.8E-034096 - 2.1E-15 1.4E-14 1.6 1.8E-02 1.4E-02 2.9E-032048 - 1.1E-15 7.4E-15 2 2.9E-02 1.5E-02 3.4E-031024 - 6.6E-16 4.0E-15 2 5.8E-02 1.6E-02 3.7E-03

RR n° 7867


Table 20: Stability of the linear solver using flat tree based CALU PRRP, flattree based CALU, and GEPP.

n P b η wb NIR HPL1 HPL2 HPL3Flat tree based CALU PRRP

8096

- 8 6.2E-15 3.8E-14 1 1.7E-02 2.0E-02 4.0E-03- 16 6.6E-15 4.3E-14 1 3.6E-02 2.2E-02 4.3E-03- 32 7.2E-15 4.8E-14 1 6.5E-02 2.5E-02 5.3E-03- 64 7.3E-15 5.1E-14 1 4.8E-02 2.7E-02 5.4E-03

4096


2048

- 8 1.4E-15 1.1E-14 1 1.3E-1 2.3E-02 5.1E-03- 16 1.9E-15 1.1E-14 1 1.6E+0 2.3 E-02 5.3E-03- 32 2.1E-15 1.3E-14 1 2.5E-02 2.7E-02 5.8E-03- 64 1.9E-15 1.25E-14 1 1.2E-1 2.6E-02 5.9E-03

1024


Flat tree based CALU

8096

- 8 4.5E-15 3.1E-14 1.7 4.4E-02 1.6E-02 3.4E-03- 16 5.6E-15 3.7E-14 2 1.9E-02 2.0E-02 3.3E-03- 32 6.7E-15 4.4E-14 2 4.6E-02 2.4E-02 4.7E-03- 64 6.5E-15 4.2E-14 2 5.5E-02 2.2E-02 4.6E-03

4096

- 8 2.6E-15 1.7E-14 1.3 1.3E-02 1.8E-02 4.0E-03- 16 3.0E-15 1.9E-14 1.7 2.6E-02 2.0E-02 3.9E-03- 32 3.8E-15 2.4E-14 2 1.9E-02 2.5E-02 5.1E-03- 64 3.4E-15 2.0E-14 2 6.0E-02 2.1E-02 4.1E-03

2048

- 8 1.5E-15 8.7E-15 1.6 2.7E-02 1.8E-02 4.2E-03- 16 1.6E-15 1.0E-14 2 2.1E-1 2.1E-02 4.5E-03- 32 1.8E-15 1.1E-14 1.8 2.3E-1 2.3E-02 5.1E-03- 64 1.7E-15 1.0E-14 1.2 4.1E-02 2.1E-02 4.5E-03

1024

- 8 7.8E-16 4.9E-15 1.6 5.5E-02 2.0E-02 4.9E-03- 16 9.2E-16 5.2E-15 1.2 1.1E-1 2.1E-02 4.8E-03- 32 9.6E-16 5.8E-15 1.1 1.5E-1 2.3E-02 5.6E-03- 64 8.7E-16 4.9E-15 1.3 7.9E-02 2.0E-02 4.5E-03

GEPP8192 - 3.9E-15 2.6E-14 1.6 1.3E-02 1.4E-02 2.8E-034096 - 2.1E-15 1.4E-14 1.6 1.8E-02 1.4E-02 2.9E-032048 - 1.1E-15 7.4E-15 2 2.9E-02 1.5E-02 3.4E-031024 - 6.6E-16 4.0E-15 2 5.8E-02 1.6E-02 3.7E-03

RR n° 7867


Tab

le21

:S

tab

ilit

yof

the

LU

dec

omp

osit

ion

for

flat

tree

base

dC

AL

UP

RR

Pan

dG

EP

Pon

ran

dom

matr

ices

Fla

tT

ree

bas

edC

AL

UP

RR

P

nb

g W||L|| 1

||L−1|| 1

||U|| 1

||U−1|| 1

||PA−LU|| F

||A|| F

HP

L1

HP

L2

HP

L3

8192

128

3.4

4E

+01

2.9

6E

+03

2.0

3E+

03

1.29

E+

052.

86E

+03

8.67

E-1

41.

77E

-01

2.5

3E

-02

5.0

0E

-03

64

4.4

2E

+01

3.3

0E

+03

2.2

8E+

03

1.43

E+

056.

89E

+02

8.56

E-1

43.

62E

-02

2.9

7E

-02

6.4

0E

-03

32

5.8

8E

+01

4.3

4E

+03

2.5

0E+

03

1.54

E+

053.

79E

+02

8.15

E-1

43.

40E

-02

2.9

6E

-02

6.4

0E

-03

16

6.0

5E

+01

4.5

4E

+03

2.5

0E+

03

1.53

E+

054.

57E

+02

6.59

E-1

45.

66E

-02

2.2

7E

-02

4.6

0E

-03

85.2

5E

+01

3.2

4E

+03

2.6

3E+

03

1.60

E+

051.

78E

+02

5.75

E-1

41.

92E

-02

1.9

3E

-02

4.1

0E

-03

4096

128

2.3

8E

+01

1.6

0E

+03

1.0

5E+

03

4.75

E+

045.

11E

+03

4.15

E-1

41.5

8E

+00

2.4

0E

-02

5.1

6E

-03

64

2.9

0E

+01

1.8

5E

+03

1.1

9E+

03

5.18

E+

042.

38E

+03

4.55

E-1

41.

71E

-01

2.6

0E

-02

5.6

5E

-03

32

3.3

2E

+01

1.8

9E

+03

1.2

9E+

03

5.57

E+

043.

04E

+02

4.63

E-1

44.

63E

-02

2.5

7E

-02

5.5

1E

-03

16

3.8

7E

+01

1.9

4E

+03

1.3

4E+

03

5.90

E+

042.

62E

+02

4.54

E-1

42.

92E

-02

2.4

4E

-02

5.1

8E

-03

83.8

0E

+01

1.6

3E

+03

1.3

6E+

03

5.90

E+

045.

63E

+02

4.24

E-1

46.

70E

-02

2.1

9E

-02

4.7

3E

-03

2048

128

1.5

7E

+01

6.9

0E

+02

5.2

6E+

02

1.66

E+

043.

79E

+02

2.01

E-1

46.

29E

-02

2.4

3E

-02

5.3

9E

-03

64

1.8

0E

+01

7.9

4E

+02

6.0

6E+

02

1.86

E+

044.

22E

+02

2.25

E-1

41.

26E

-01

2.6

3E

-02

5.9

6E

-03

32

2.1

9E

+01

9.3

0E

+02

6.8

6E+

02

2.05

E+

041.

08E

+02

2.41

E-1

42.

59E

-02

2.7

0E

-02

5.8

0E

-03

16

2.5

3E

+01

8.6

2E

+02

6.9

9E+

02

2.17

E+

043.

50E

+02

2.36

E-1

41.6

8E

+00

2.3

0E

-02

5.3

6E

-03

82.6

2E

+01

8.3

2E

+02

7.1

9E+

02

2.21

E+

044.

58E

+02

2.27

E-1

41.

37E

-01

2.3

6E

-02

5.1

3E

-03

1024

128

9.3

6E

+00

3.2

2E

+02

2.5

9E+

02

5.79

E+

039.

91E

+02

9.42

E-1

53.

98E

-02

2.5

0E

-02

5.7

7E

-03

64

1.1

9E

+01

3.5

9E

+02

3.0

0E+

02

6.67

E+

031.

79E

+02

1.09

E-1

43.

69E

-02

2.7

4E

-02

6.7

3E

-03

32

1.4

0E

+01

4.2

7E

+02

3.5

1E+

02

5.79

E+

032.

99E

+02

1.21

E-1

44.

23E

-02

2.2

7E

-02

5.4

2E

-03

16

1.7

2E

+01

4.4

2E

+02

3.7

4E+

02

8.29

E+

032.

10E

+02

1.24

E-1

46.

21E

-02

2.4

3E

-02

5.4

8E

-03

81.6

7E

+01

4.1

6E

+02

3.8

3E+

02

8.68

E+

031.

85E

+02

1.19

E-1

43.

80E

-02

2.2

2E

-02

5.3

7E

-03

GE

PP

8192

-5.

5E

+01

1.9

E+

03

2.6E

+03

8.7

E+

03

6.0E

+02

7.2E

-14

1.3

E-0

21.

4E

-02

2.8

E-0

340

96

-3.6

1E

+01

1.0

1E

+03

1.3

8E+

03

2.31

E+

041.

87E

+02

3.88

E-1

41.

83E

-02

1.4

3E

-02

2.9

0E

-03

2048

-2.6

3E

+01

5.4

6E

+02

7.4

4E+

02

6.10

E+

041.

81E

+02

2.05

E-1

42.

88E

-02

1.5

4E

-02

3.3

7E

-03

1024

-1.8

1E

+01

2.8

1E

+02

4.0

7E+

02

1.60

E+

054.

31E

+02

1.06

E-1

45.

78E

-02

1.6

2E

-02

3.7

4E

-03

RR n° 7867


Tab

le22

:S

tab

ilit

yof

the

LU

dec

om

posi

tion

for

flat

tree

base

dC

AL

UP

RR

Pon

spec

ial

matr

ices

.

mat

rix

con

d(A

,2)

g W||L|| 1

||L−1|| 1

||U|| 1

||U−1|| 1

||PA−LU|| F

||A|| F

ηwb

NIR

well-conditioned

had

am

ard

1.0E

+0

5.1E

+02

5.1E

+02

1.4E

+02

5.8

E+

045.3

E+

005.

2E-1

59.

5E-1

67.9

E-1

51

hou

se1.

0E+

06.

5E+

009.

0E+

023.

3E+

023.3

E+

025.2

E+

016.

6E-1

64.

8E-1

75.3

E-1

51

part

er4.

8E+

01.

5E+

004.

7E+

013.

7E+

001.4

E+

013.6

E+

018.

6E-1

66.

8E-1

64.4

E-1

51

ris

4.8E

+0

1.5E

+00

4.7E

+01

3.7E

+00

7.3

E+

007.2

E+

018.

5E-1

66.

8E-1

64.

1-1

51

km

s9.

1E+

01.

0E+

005.

4E+

003.

2E+

003.9

E+

009.8

E+

001.

13E

-16

7.0E

-17

4.2

E-1

61

toep

pen

1.0E

+1

1.3E

+00

2.3E

+00

3.4E

+00

3.6

E+

011.0

E+

001.

0E-1

68.

3E-1

72.7

E-1

51

con

dex

1.0E

+2

1.0E

+00

1.9E

+00

5.5E

+00

5.9

E+

021.2

E+

006.

5E-1

67.

4E-1

65.2

E-1

51

mol

er1.

9E+

21.

4E+

002.

7E+

031.

5E+

032.0

E+

063.0

E+

158.

7E-1

66.

0E-1

94.4

E-1

70

circ

ul

3.7E

+2

1.5E

+02

1.5E

+03

1.4E

+03

8.5

E+

042.5

E+

014.

7E-1

44.

0E-1

52.4

E-1

41

ran

dco

rr1.

4E+

31.

0E+

003.

0E+

015.

9E+

013.5

E+

012.4

E+

045.

8E-1

65.

9E-1

75.3

E-1

61

poi

sson

1.7E

+3

1.0E

+00

1.9E

+00

2.2E

+01

7.3

E+

002.0

E+

011.

6E-1

69.

0E-1

71.8

E-1

51

han

kel

2.9E

+3

7.9E

+01

1.6E

+03

1.6E

+03

6.5

E+

048.2

E+

014.

5E-1

43.

6E-1

52.2

E-1

41

jord

blo

c5.

2E+

31.

0E+

001.

0E+

001.

0E+

002.0

E+

004.0

E+

030.

0E+

00

1.9E

-17

6.1

E-1

70

com

pan

7.5E

+3

1.0E

+00

2.9E

+00

5.9E

+02

6.8

E+

024.1

E+

006.

7E-1

65.

4E-1

65.5

E-1

21

pei

1.0E

+4

4.9E

+00

2.8E

+03

1.6E

+01

1.7

E+

011.7

E+

016.

5E-1

62.

8E-1

74.7

E-1

70

ran

dco

lu1.

5E+

43.

7E+

011.

6E+

031.

3E+

031.0

E+

031.3

E+

044.

4E-1

43.

3E-1

52.1

E-1

41

spra

nd

n1.

6E+

46.

3E+

001.

2E+

031.

6E+

039.7

E+

033.5

E+

034.

0E-1

41.

2E-1

41.5

E-1

31

riem

ann

1.9E

+4

1.0E

+00

4.0E

+03

1.0E

+03

3.1

E+

061.4

E+

024.

5E-1

61.

5E-1

62.0

E-1

51

com

par

1.8E

+6

9.0E

+02

2.0E

+03

1.3E

+05

9.0

E+

051.0

E+

031.

7E-1

41.

2E-1

58.3

E-1

51

trid

iag

6.8E

+6

1.0E

+00

1.9E

+00

1.8E

+02

4.0

E+

001.6

E+

043.

2E-1

62.

8E-1

72.6

E-1

60

cheb

spec

1.3E

+7

1.0E

+00

5.2E

+01

5.6E

+01

9.7

E+

061.7

E+

005.

9E-1

61.

0E-1

84.4

E-1

51

leh

mer

1.8E

+7

1.0E

+00

1.5E

+03

8.9E

+00

8.9

E+

008.1

E+

035.

6E-1

61.

0E-1

76.0

E-1

70

toep

pd

2.1E

+7

1.0E

+00

4.2E

+01

1.3E

+03

6.6

E+

047.2

E+

015.

8E-1

62.

6E-1

71.7

E-1

60

min

ij2.

7E+

71.

0E+

004.

0E+

031.

1E+

031.7

E+

064.0

E+

005.

3E-1

67.

7E-1

91.0

E-1

60

ran

dsv

d6.

7E+

75.

0E+

001.

5E+

031.

3E+

035.8

E+

004.1

E+

096.

1E-1

55.

1E-1

62.9

E-1

51

fors

yth

e6.

7E+

71.

0E+

001.

0E+

001.

0E+

001.0

E+

006.7

E+

070.

0E+

00

0.0E

+00

0.0E

+00

0

ill-conditioned

fied

ler

2.5E

+10

1.0E

+00

9.3E

+02

1.4E

+01

6.5

E+

011.5

E+

072.

4E-1

68.

9E-1

71.9

E-1

51

dorr

7.4E

+10

1.0E

+00

2.0E

+00

4.0E

+01

6.7

E+

052.5

E+

051.

6E-1

63.

7E-1

73.4

E-1

51

dem

mel

1.0E

+14

1.6E

+00

1.6E

+02

2.5E

+02

5.7

E+

153.5

E+

013.

7E-1

59.

2E-2

11.1

E-0

81

cheb

van

d3.

8E+

192.

1E+

021.

1E+

042.

6E+

038.5

E+

043.5

E+

196.

2E-1

46.

9E-1

75.1

E-1

61

invh

ess

4.1E

+19

2.0E

+00

4.0E

+03

1.5E

+03

8.2

E+

032.9

E+

281.

2E-1

58.

0E-1

83.7

E-1

51

pro

late

1.4E

+20

1.1E

+01

1.4E

+03

4.1E

+03

1.6

E+

033.6

E+

212.

6E-1

41.

2E-1

54.0

E-1

41

fran

k1.

7E+

201.

0E+

001.

9E+

001.

9E+

004.1

E+

062.3

E+

161.

6E-1

72.

6E-2

06.8

E-1

70

cau

chy

5.5E

+21

1.0E

+00

3.2E

+02

1.8E

+02

4.3

E+

075.6

E+

185.

6E-1

66.

8E-1

93.3

E-1

41

hil

b8.

0E+

211.

0E+

003.

0E+

031.

3E+

032.4

E+

009.5

E+

223.

2E-1

65.

2E-1

91.8

E-1

70

lotk

in5.

4E+

221.

0E+

002.

8E+

031.

2E+

032.4

E+

002.5

E+

216.

5E-1

75.

6E-1

82.3

E-1

5(1

)ka

han

1.1E

+28

1.0E

+00

1.0E

+00

1.0E

+00

5.3

E+

007.6

E+

520.

0E+

00

8.1E

-18

4.3

E-1

61

RR n° 7867


Tab

le23

:S

tab

ilit

yof

the

LU

dec

omp

osit

ion

for

bin

ary

tree

base

dC

AL

UP

RR

Pan

dG

EP

Pon

ran

dom

matr

ices

.B

inary

Tre

eb

ase

dC

AL

UP

RR

P

nP

bg W

||L|| 1

||L−1|| 1

||U|| 1

||U−1|| 1

||PA−LU|| F

||A|| F

HP

L1

HP

L2

HP

L3

8192

256

164.

29E

+01

4.41

E+

032.

68E

+03

1.6

2E

+05

2.1

5E

+02

7.3

9E

-14

3.4

4E

-02

2.2

0E

-02

4.6

9E

-03

128

324.

58E

+01

3.47

E+

032.

58E

+03

1.5

6E

+05

2.9

6E

+02

8.6

7E

-14

6.9

2E

-02

2.6

3E

-02

4.9

8E

-03

164.

97E

+01

4.05

E+

032.

73E

+03

1.6

3E

+05

2.1

6E

+02

7.5

9E

-14

4.2

2E

-02

2.7

1E

-02

5.7

6E

-03

64

644.

77E

+01

3.23

E+

032.

35E

+03

1.4

5E

+05

3.2

2E

+02

9.1

4E

-14

6.3

7E

-02

2.8

5E

-02

5.9

1E

-03

324.

86E

+01

4.16

E+

032.

64E

+03

1.5

8E

+05

5.1

5E

+02

9.0

4E

-14

7.0

7E

-02

2.6

9E

-02

5.0

9E

-03

165.

37E

+01

3.83

E+

032.

67E

+03

1.6

4E

+05

2.5

6E

+02

7.3

0E

-14

4.2

1E

-02

2.7

0E

-02

5.7

5E

-03

4096

256

83.

49E

+01

1.69

E+

031.

4E+

03

6.1

2E

+04

3.0

7E

+02

4.7

1E

-14

7.2

6E

-02

2.1

4E

-02

4.6

9E

-03

128

164.

09E

+01

1.91

E+

031.

39E

+03

6.0

2E

+04

5.1

3E

+02

4.9

9E

-14

2.2

2E

-02

2.4

4E

-02

5.2

5E

-03

83.

85E

+01

1.78

E+

031.

40E

+03

6.0

9E

+04

1.5

2E

+04

4.7

5E

-14

4.4

1E

+00

2.4

1E

-02

4.9

8E

-03

6432

3.06

E+

011.

95E

+03

1.34

E+

03

5.8

0E

+04

2.9

9E

+02

4.9

4E

-14

3.2

6E

-02

2.5

1E

-02

5.7

1E

-03

163.

88E

+01

1.95

E+

031.

37E

+03

6.0

2E

+04

2.3

9E

+02

5.0

6E

-14

3.4

5E

-02

2.3

3E

-02

5.4

4E

-03

83.

64E

+01

1.73

E+

031.

40E

+03

5.9

7E

+04

1.8

7E

+02

4.7

3E

-14

4.3

3E

-02

2.0

6E

-02

4.2

9E

-03

32

642.

80E

+01

1.69

E+

031.

18E

+03

5.1

5E

+04

2.1

2E

+03

4.6

7E

-14

2.3

4E

-01

2.8

1E

-02

6.0

6E

-03

323.

71E

+01

1.83

E+

031.

32E

+03

5.8

2E

+04

3.7

4E

+02

4.9

2E

-14

5.1

3E

-02

2.3

5E

-02

5.2

8E

-03

163.

75E

+01

1.97

E+

031.

41E

+03

6.2

4E

+04

2.2

3E

+02

5.0

5E

-14

3.5

8E

-02

2.3

9E

-02

5.2

7E

-03

83.

62E

+01

1.95

E+

031.

43E

+03

6.1

1E

+04

7.2

1E

+02

4.7

0E

-14

1.6

9E

-01

2.1

7E

-02

4.9

1E

-03

2048

128

82.

91E

+01

8.48

E+

027.

59E

+02

2.3

2E

+04

3.3

7E

+02

2.4

9E

-14

7.7

1E

-02

2.2

0E

-02

4.7

2E

-03

6416

2.63

E+

018.

42E

+02

7.34

E+

02

2.2

2E

+04

2.7

6E

+02

2.5

7E

-14

6.0

9E

-02

2.3

4E

-02

5.2

9E

-03

83.

01E

+01

8.48

E+

027.

36E

+02

2.2

8E

+04

3.7

9E

+02

2.4

5E

-14

6.5

5E

-02

2.4

2E

-02

5.1

0E

-03

3232

2.18

E+

018.

70E

+02

6.66

E+

02

2.0

4E

+04

5.7

9E

+03

2.4

6E

-14

1.9

1E

+00

2.3

8E

-02

5.4

4E

-03

162.

66E

+01

1.05

E+

037.

24E

+02

2.2

5E

+04

8.4

8E

+02

2.5

7E

-14

6.3

7E

-02

2.3

7E

-02

5.2

9E

-03

82.

81E

+01

8.86

E+

027.

42E

+02

2.3

0E

+04

1.4

5E

+02

2.4

3E

-14

1.2

5E

-02

2.2

1E

-02

5.0

4E

-03

1024

648

1.73

E+

014.

22E

+02

3.87

E+

02

8.4

4E

+03

2.3

3E

+02

1.3

4E

-14

7.9

2E

-02

2.4

6E

-02

5.7

0E

-03

3216

1.55

E+

014.

39E

+02

3.72

E+

02

7.9

8E

+03

1.6

7E

+02

1.2

7E

-14

7.5

5E

-02

2.2

0E

-02

4.8

9E

-03

81.

84E

+01

4.00

E+

023.

92E

+02

8.4

6E

+03

1.2

9E

+02

1.2

5E

-14

5.4

1E

-02

2.3

2E

-02

5.4

0E

-03

GE

PP

8192

--

5.5E

+01

1.9E

+03

2.6E

+03

8.7

E+

03

6.0

E+

02

7.2

E-1

41.3

E-0

21.4

E-0

22.8

E-0

340

96-

-3.

61E

+01

1.01

E+

031.

38E

+03

2.3

1E

+04

1.8

7E

+02

3.8

8E

-14

1.8

3E

-02

1.4

3E

-02

2.9

0E

-03

2048

--

2.63

E+

015.

46E

+02

7.44

E+

02

6.1

0E

+04

1.8

1E

+02

2.0

5E

-14

2.8

8E

-02

1.5

4E

-02

3.3

7E

-03

1024

--

1.81

E+

012.

81E

+02

4.07

E+

02

1.6

0E

+05

4.3

1E

+02

1.0

6E

-14

5.7

8E

-02

1.6

2E

-02

3.7

4E

-03

RR n° 7867


Tab

le24

:S

tab

ilit

yof

the

LU

dec

omp

osi

tion

for

bin

ary

tree

base

dC

AL

UP

RR

Pon

spec

ial

matr

ices

.

mat

rix

con

d(A

,2)

g W||L|| 1

||L−1|| 1

||U|| 1

||U−1|| 1

||PA−LU|| F

||A|| F

ηwb

NIR

well-conditioned

had

amard

1.0

E+

05.

1E

+02

5.1E

+02

1.3E

+02

4.9

E+

046.

5E+

00

5.3

E-1

51.

2E

-15

8.7

E-1

52

hou

se1.0

E+

07.

4E

+00

8.3E

+02

3.8E

+02

3.7

E+

025.

0E+

01

5.8

E-1

64.

2E

-17

4.8

E-1

53

par

ter

4.8

E+

01.

5E

+00

4.7E

+01

3.7E

+00

1.4

E+

013.

6E+

01

8.6

E-1

66.

9E

-16

4.4

E-1

53

ris

4.8

E+

01.

5E

+00

4.7E

+01

3.7E

+00

7.3

E+

007.

2E+

01

8.5

E-1

66.

2E

-16

4.4

E-1

52

km

s9.1

E+

01.

0E

+00

5.1E

+00

3.4E

+00

4.3

E+

009.

3E+

00

1.1

E-1

66.

3E

-17

3.6

E-1

61

toep

pen

1.0

E+

11.

3E

+00

2.3E

+00

3.4E

+00

3.6

E+

011.

0E+

00

1.0

E-1

68.

2E

-17

4.0

E-1

51

con

dex

1.0

E+

21.

0E

+00

1.9E

+00

5.5E

+00

5.9

E+

021.

2E+

00

6.5

E-1

67.

4E

-16

5.2

E-1

52

mol

er1.9

E+

21.

0E

+00

4.0E

+03

9.0E

+00

2.2

E+

043.

8E+

15

1.0

E-1

82.

1E

-20

2.7

E-1

61

circ

ul

3.7

E+

21.

5E

+02

1.6E

+03

1.4E

+03

8.4

E+

042.

1E+

01

5.2

E-1

44.

1E

-15

3.2

E-1

42

ran

dco

rr1.4

E+

31.

4E

+02

5.4E

+02

4.4E

+03

2.9

E+

036.

1E+

03

2.2

E-1

51.

3E

-16

5.7

E-1

52

poi

sson

1.7

E+

31.

0E

+00

1.9E

+00

2.2E

+01

7.3

E+

002.

0E+

01

1.7

E-1

69.

3E

-17

1.5

E-1

51

han

kel

2.9

E+

36.

1E

+01

1.8

4E

+03

1.5E

+03

6.5

E+

048.

0E+

01

4.9

E-1

43.

9E

-15

2.4

E-1

42

jord

blo

c5.2

E+

31.

0E

+00

1.0E

+00

1.0E

+00

2.0

E+

004.

0E+

03

0.0E

+00

1.9E

-17

7.6

7E

-17

0co

mp

an

7.5

E+

32.

0E

+00

2.9E

+00

6.8E

+02

6.7

E+

029.

4E+

00

9.8

E-1

66.

1E

-16

6.3

E-1

21

pei

1.0

E+

41.

3E

+00

5.5E

+02

7.9E

+00

1.6

E+

012.

9E+

00

3.6

E-1

62.

2E

-17

5.0

E-1

70

ran

dco

lu1.5

E+

44.

6E

+01

1.7E

+03

1.4E

+03

1.0

E+

031.

1E+

04

4.8

E-1

43.

5E

-15

2.2

E-1

42

spra

nd

n1.6

E+

46.

6E

+00

1.1E

+03

1.6E

+03

9.6

E+

031.

4E+

03

4.1

E-1

41.

3E

-14

2.0

E-1

3(1

)ri

eman

n1.9

E+

41.

0E

+00

4.0E

+03

1.8E

+03

7.6

E+

061.

4E+

02

1.7

E-1

61.

2E

-16

1.1

E-1

51

com

par

1.8

E+

69.

0E

+02

2.0E

+03

1.3E

+05

9.0

E+

056.

5E+

02

1.9

E-1

41.

2E

-15

8.9

E-1

51

trid

iag

6.8

E+

61.

0E

+00

1.9E

+00

1.8E

+02

4.0

E+

001.

6E+

04

3.2

E-1

62.

8E

-17

2.6

E-1

60

cheb

spec

1.3

E+

71.

0E

+00

5.2E

+01

5.6E

+01

9.7

E+

061.

7E+

00

6.0

E-1

67.

8E

-19

1.3

E-1

51

leh

mer

1.8

E+

71.

0E

+00

1.5E

+03

8.9E

+00

8.9

E+

008.

1E+

03

5.7

E-1

69.9

2E-1

85.7

E-1

70

toep

pd

2.1

E+

71.

0E

+00

4.3E

+01

9.1E

+02

6.8

E+

041.

8E+

01

5.8

E-1

62.

7E

-17

1.7

E-1

60

min

ij2.7

E+

71.

0E

+00

4.0E

+03

9.0E

+00

1.8

E+

044.

0E+

00

6.4

E-1

96.

9E

-19

6.1

E-1

80

ran

dsv

d6.7

E+

75.

4E

+00

1.7E

+03

1.4E

+03

6.4

E+

003.

2E+

09

7.2

E-1

55.

5E

-16

3.3

E-1

52

fors

yth

e6.7

E+

71.

0E

+00

1.0E

+00

1.0E

+00

1.0

E+

006.

7E+

07

0.0E

+00

0.0

E+

00

0.0

E+

00

0

ill-conditioned

fied

ler

2.5E

+10

1.0E

+00

9.0E

+02

1.2E

+01

5.5

E+

011.

4E+

07

2.5

E-1

67.

9E

-17

1.7

E-1

51

dor

r7.

4E

+10

1.0E

+00

2.0E

+00

4.0E

+01

6.7

E+

052.

5E+

05

1.6

E-1

63.

3E

-17

3.4

E-1

51

dem

mel

1.0E

+14

1.5E

+00

1.2E

+02

5.3E

+02

5.5

E+

152.

9E+

01

3.3

E-1

59.

0E

-21

7.9

E-0

92

cheb

van

d3.

8E

+19

3.2E

+02

2.2E

+03

3.3E

+03

8.9

E+

041.

6E+

20

7.5

E-1

43.

7E

-17

2.3

E-1

60

invh

ess

4.1E

+19

1.7E

+00

4.0E

+03

9.0E

+00

4.5

E+

031.4

E+

100

2.9

E-1

51.

5E

-16

6.9

E-1

41

pro

late

1.4E

+20

1.2E

+01

2.1E

+03

4.7E

+03

2.1

E+

031.

6E+

21

2.8

E-1

41.

2E

-15

3.9

E-1

41

fran

k1.

7E

+20

1.0E

+00

1.9E

+00

1.9E

+00

4.1

E+

062.

3E+

16

1.6

E-1

72.

6E

-20

6.8

E-1

70

cau

chy

5.5E

+21

1.0E

+00

3.2E

+02

4.1E

+02

5.4

E+

074.

7E+

18

6.1

E-1

66.

7E

-19

5.7

E-1

42

hil

b8.

0E

+21

1.0E

+00

2.9E

+03

1.4E

+03

2.4

E+

002.

3E+

22

3.2

E-1

66.

3E

-19

2.2

E-1

70

lotk

in5.

4E

+22

1.0E

+00

3.0E

+03

1.4E

+03

2.4

E+

003.

7E+

22

6.6

E-1

75.

2E

-18

1.7

5E

-15

1ka

han

1.1E

+28

1.0E

+00

1.0E

+00

1.0E

+00

5.3

E+

007.

6E+

52

0.0E

+00

1.2E

-17

3.9

E-1

61

RR n° 7867


Appendix C

Here we summarize the floating-point operation counts for the LU PRRP algorithmperformed on an input matrix A of size m × n, where m ≥ n . We first focus on thestep k of the algorithm, that is we consider the kth panel of size (m− (k−1)b)× b. Wefirst perform a Strong RRQR on the transpose of the considered panel, then updatethe trailing matrix of size (m−kb)×(n−kb), finally we perform GEPP on the diagonalblock of size b× b. We assume that m− (k− 1)b ≥ b+ 1. The Strong RRQR performsnearly as many floating-point operations as the QR with column pivoting. Here weconsider that is the same, since in practice performing QR with column pivoting isenough to obtain the bound τ , and thus it is

FlopsSRRQR,1block,stepk = 2(m− (k − 1)b)b2 − 2

3b3.

If we consider the update step (3), then the flops count is

Flopsupdate,stepk = 2b(m− kb)(n− kb).

For the additional GEPP on the diagonal block, the flops count is

Flopsgepp,stepk =2

3b3 + (n− kb)b2.

Then the flops count for the step k is

FlopsLU PRRP,stepk = b2(2m+ n− 3(k − 1)b)− b3 + 2b(m− kb)(n− kb).

This gives us an arithmetic operation count of

FlopsLU PRRP (m,n, b) = Σnbk=1

[b2(2m+ n− 2(k − 1)b− kb) + 2b(m− kb)(n− kb)

],

F lopsLU PRRP (m,n, b) = mn2 + 2mnb+ 2nb2 − 1

2n2b− 1

3n3

∼ mn2 − 1

3n3 + 2mnb− 1

2n2b.

Then for a square matrix of size n× n, the flops count is

FlopsLU PRRP (n, n, b) =2

3n3 +

3

2n2b+ 2nb2 ∼ 2

3n3 +

3

2n2b.

Appendix D

Here we detail the performance model of the parallel version of the CALU PRRPfactorization performed on an input matrix A of size m × n where, m ≥ n. Weconsider a 2D layout P = Pr × Pc . We first focus on the panel factorization for theblock LU factorization, that is the selection of the b pivot rows with the tournamentpivoting strategy. This step is similar to CALU except that the reduction operator isStrong RRQR instead of GEPP, then for each panel the amount of communication isthe same as for TSLU:

# messages = logPr

# words = b2 logPr

However the floating-point operations count is different. We consider as in AppendixC that Strong RRQR performs as many flops as QR with columns pivoting, thenthe panel factorization performs the QR factorization with columns pivoting on thetranspose of the blocks of the panel and logPr reduction steps:

RR n° 7867


# flops = 2m− (k − 1)b

Prb2 − 2

3b3 + logPr(2(2b)b2 − 2

3b3)

= 2m− (k − 1)b

Prb2 +

10

3b3 logPr −

2

3b3

To perform the QR factorization without pivoting on the transpose of the panel, andthe update of the trailing matrix:

• broadcast the pivot information along the rows of the process grid.

# messages = logPc

# words = b logPc

• apply the pivot information to the original rows.

# messages = logPr

# words =nb

PclogPr

• Compute the block column L and broadcast it through blocks of columns

# messages = logPc

# words =m− kbPr

b logPc

# flops = 2m− kbPr

b2

• broadcast the upper block of the permuted matrix A through blocks of rows

# messages = logPr

# words =n− kbPc

b logPr

• perform a rank-b update of the trailing matrix

# flops = 2bm− kbPr

n− kbPc

Thus to get the block LU factorization :

# messages =3n

blogPr +

2n

blogPc

# words = (mn

Pr− 1

2

n2

Pr+ n) logPc + (nb+

3

2

n2

Pc) logPr

# flops =1

P(mn2 − 1

3n3) +

2

3nb2(5 logPr − 1) +

b

Pr(4mn+ 2nb− 2n2)

∼ 1

P(mn2 − 1

3n3) +

2

3nb2(5 logPr − 1) +

4

Pr(mn− n2

2)b

Then to obtain the full LU factorization, for each b×b block, we perform the Gaussianelimination with partial pivoting and we update the corresponding trailing matrix ofsize b × n − kb. During this additional step, we first perform GEPP on the diagonalblock, broadcast pivot rows through column blocks (this broadcast can be done to-gether with the previous broadcast of L, thus there is no additional message to send),apply pivots, and finally compute U.

# words = n(1 + b2) logPc

RR n° 7867


# flops =

nb∑

k=1

[2

3b3 +

n− kbPc

b2]

=2

3nb2 +

1

2

n2b

Pc

Finally the total count is :

# messages =3n

blogPr +

2n

blogPc

# words = (mn

Pr− 1

2

n2

Pr+nb

2+ 2n) logPc + (nb+

3

2

n2

Pc) logPr

∼ (mn

Pr− 1

2

n2

Pr) logPc + (nb+

3

2

n2

Pc) logPr

# flops =1

P(mn2 − 1

3n3) +

4

Pr(mn− n2

2)b+

n2b

2Pc+

10

3nb2 logPr

Appendix E

Here we detail the algebra of the block parallel LU-PRRP. At the first iteration, thematrix A has the following partition

A =

[A11 A12

A21 A22

].

The block A11 is of size m/p × b, where p is the number of processors used andm/p ≥ b+ 1, the block A12 is of size m/p×n− b, the block A21 is of size m−m/p× b,and the block A22 is of size m−m/p× n− b.

To describe the algebra, we consider 4 processors and a block of size b. We firstfocus on the b first columns,

A(:, 1 : b) =

A0

A1

A2

A3

.Each block Ai is of size m/p× b. In the following we describe the different steps of thepanel factorization. First, we perform Strong RRQR factorization on the transpose ofeach block Ai so we obtain :

AT0 Π00 = Q00R00

AT1 Π10 = Q10R10

AT2 Π20 = Q20R20

AT3 Π30 = Q30R30

Each matrix Ri is of size b×m/p. Using MATLAB notations, we can write Ri asfollowing :

Ri = Ri(1 : b, 1 : b) ¯Ri = Ri(1 : b, b+ 1 : m/p)

This step aims to eliminate the last m/p − b rows of each block Ai. We define thematrix D0Π0 :

D0Π0 =

[Ib

−D00 Im/p−b

][

Ib−D10 Im/p−b

][

Ib−D20 Im/p−b

][

Ib−D30 Im/p−b

]

×

ΠT

00ΠT

10ΠT

20ΠT

30

,

RR n° 7867


where

D00 = ¯RT00(R−1

00 )T ,

D10 = ¯RT10(R−1

10 )T ,

D20 = ¯RT20(R−1

20 )T ,

D30 = ¯RT30(R−1

30 )T .

Multiplying A(:, 1 : b) by D0Π0 we obtain :

D0Π0 ×A(:, 1 : b) =

(ΠT00 ×A0)(1 : b, 1 : b)

0m/p−b

(ΠT10 ×A1)(1 : b, 1 : b)

0m/p−b

(ΠT20 ×A2)(1 : b, 1 : b)

0m/p−b

(ΠT30 ×A3)(1 : b, 1 : b)

0m/p−b

=

A01

0m/p−b

A11

0m/p−b

A21

0m/p−b

A31

0m/p−b

.

The second step corresponds to the second level of the reduction tree. We mergepairs of b×b blocks and as in the previous step we perform Strong RRQR factorizationon the transpose of the 2b× b blocks :[

A01

A11

]and [

A21

A31

].

We obtain

[A01

A11

]TΠ01 = Q01R01,[

A21

A31

]TΠ11 = Q11R11.

We note :

R01 = R01(1 : b, 1 : b) ¯R01 = R01(1 : b, b+ 1 : 2b),

R11 = R11(1 : b, 1 : b) ¯R11 = R11(1 : b, b+ 1 : 2b).

As in the previous level, we aim to eliminate b rows in each block, so we consider thematrix

RR n° 7867


D1Π1 =

Ib

Im/p−b

−D01 IbIm/p−b

IbIm/p−b

−D11 IbIm/p−b

×[

ΠT01

ΠT11

],

where

D01 = ¯RT01(R−1

01 )T ,

D11 = ¯RT11(R−1

01 )T .

The matrices Π01 and Π11 are the permutations corresponding to the Strong RRQRfactorizations of the two 2b×b blocks. The matrices Π01 and Π11 can easily be deducedfrom the matrices Π01 and Π11 extended by the appropriate identity matrices to getmatrices of size 2m/p× 2m/p.

The multiplication of the block D0Π0A(:, 1 : b) with the matrix D1Π1 leeds to

D1Π1D0Π0A(:, 1 : b) =

[ΠT

01 ×[A01

A11

]](1 : b, 1 : b)

02m/p−b[ΠT

11 ×[A21

A31

]](1 : b, 1 : b)

02m/p−b

.

This matrix can be written as

D1Π1D0Π0A(:, 1 : b) =

[ΠT

01 ×[

(ΠT00 ×A0)(1 : b, 1 : b)

(ΠT10 ×A1)(1 : b, 1 : b)

]](1 : b, 1 : b)

02m/p−b[ΠT

11 ×[

(ΠT20 ×A2(1 : b, 1 : b))

(ΠT30 ×A3(1 : b, 1 : b))

]](1 : b, 1 : b)

02m/p−b

=

A02

02m/p−b

A12

02m/p−b

.

For the final step we consider the last 2b× b block

[A02

A12

].

We perform a Strong RRQR factorization on the transpose of this block and thenwe obtain

RR n° 7867


[A02

A12

]TΠ02 = Q02R02.

We note

R02 = R02(1 : b, 1 : b) ¯R02 = R02(1 : b, b+ 1 : 2b).

We define the matrix D2Π2 as

D2Π2 =

IbIm/p−b

IbIm/p−b

−D02 IbIm/p−b

IbIm/p−b

×ΠT

02,

where

D02 = ¯RT02(R−1

02 )T ,

and the permutation matrix Π02 can easily be deduced from the permutation matrixΠ02.

The multiplication of the block D1Π1D0Π0A(:, 1 : b) by the matrix D2Π2 leeds to :

D2Π2D1Π1D0Π0A(:, 1 : b) =

[RT

021QT02

04m/p−b

]= ΠT

02 ×[

A02

04m/p−b

]=

[A03

04m/p−b

].

If we consider the block A(:, 1 : b) of the beginning and all the steps performed,we get :

D2Π2D1Π1D0Π0A(:, 1 : b) =

[ΠT

02 ×[A02

A12

](1 : b, 1 : b)

]04m/p−b

We can also write

D2Π2D1Π1D0Π0A(:, 1 : b) =

ΠT

02 ×

ΠT01 ×

(ΠT00 ×A0)(1 : b, 1 : b)

(ΠT10 ×A1)(1 : b, 1 : b)

(1 : b, 1 : b)

ΠT11 ×

(ΠT20 ×A2)(1 : b, 1 : b)

(ΠT30 ×A3)(1 : b, 1 : b)

(1 : b, 1 : b)

(1 : b, 1 : b)

04m/p−b

.

RR n° 7867


We can also write

A(:, 1 : b) = (D2Π2D1Π1D0Π0)−1

ΠT

02 ×

ΠT01 ×

(ΠT00 ×A0)(1 : b, 1 : b)

(ΠT10 ×A1)(1 : b, 1 : b)

(1 : b, 1 : b)

ΠT11 ×

(ΠT20 ×A2)(1 : b, 1 : b)

(ΠT30 ×A3)(1 : b, 1 : b)

(1 : b, 1 : b)

(1 : b, 1 : b)

04m/p−b

.

Then, we have

A(:, 1 : b) = ΠT0D−10 ΠT

1D−11 ΠT

2D−12

ΠT

02 ×

ΠT01 ×

(ΠT00 ×A0)(1 : b, 1 : b)

(ΠT10 ×A1)(1 : b, 1 : b)

(1 : b, 1 : b)

ΠT11 ×

(ΠT20 ×A2)(1 : b, 1 : b)

(ΠT30 ×A3)(1 : b, 1 : b)

(1 : b, 1 : b)

(1 : b, 1 : b)

04m/p−b

,

where

ΠT0 =

Π00

Π10

Π20

Π30

,

D−10 =

[IbD00 Im/p−b

][

IbD10 Im/p−b

][

IbD20 Im/p−b

][

IbD30 Im/p−b

]

.

For each Strong RRQR performed previously, the corresponding trailing matrix is up-dated at the same time. Thus, at the end of the process, we have

A = ΠT0D−10 ΠT

1D−11 ΠT

2D−12 ×

[A03 A12

04m/p−b A22

],

where A03 is the b × b diagonal block containing the b selected pivot rows from thecurrent panel. An additional GEPP should be performed on this block to get the fullLU factorization. A22 is the trailing matrix already updated, and on which the blockparallel LU-PRRP should be continued.

Appendix F

Here is the matlab code for generating the matrix T used in Section 2 to define thegeneralized Wilkinson matrix on which GEPP fails. For any given integer r > 0, thisgeneralized Wilkinson matrix is an upper triangular semi-separable matrix with rankat most r in all of its submatrices above the main diagonal. The entries are all negative

RR n° 7867


and are chosen randomly. The Wilkinson matrix is for the special case where r = 1and every entry above the main diagonal is 1.

function [A] = counterexample_GEPP(n,r,u,v);

% Function counterexample_GEPP generates a matrix which fails

% GEPP in terms of large element growth.

% This is a generalization of the Wilkinson matrix.

%

if (nargin == 2)

u = rand(n,r);

v = rand(n,r);

A = - triu(u * v’);

for k = 2:n

umax = max(abs(A(k-1,k:n))) * (1 + 1/n);

A(k-1,k:n) = A(k-1,k:n) / umax;

end

A = A - diag(diag(A));

A = A’ + eye(n);

A(1:n-1,n) = ones(n-1,1);

else

A = triu(u * v’);

A = A - diag(diag(A));

A = A’ + eye(n);

A(1:n-1,n) = ones(n-1,1);

end

RR n° 7867

RESEARCH CENTRESACLAY – ÎLE-DE-FRANCE

Parc Orsay Université4 rue Jacques Monod91893 Orsay Cedex

PublisherInriaDomaine de Voluceau - RocquencourtBP 105 - 78153 Le Chesnay Cedexinria.fr

ISSN 0249-6399

lu factorization with panel rank revealing pivoting and its … · 2012-08-14 · with partial...

Documents