
LAPACK++: A Design Overview of Object-Oriented Extensions for High Performance Linear Algebra

Jack J. Dongarra (Oak Ridge National Laboratory and University of Tennessee), Roldan Pozo (University of Tennessee, Department of Computer Science), and David W. Walker (Oak Ridge National Laboratory, Mathematical Sciences Section)

[This project was supported in part by the Defense Advanced Research Projects Agency under contract DAAL03-91-C-0047, administered by the Army Research Office; by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy, under Contract DE-AC05-84OR21400; and by the National Science Foundation Science and Technology Center Cooperative Agreement No. CCR-8809615.]

Abstract

LAPACK++ is an object-oriented C++ extension of the LAPACK (Linear Algebra PACKage) library for solving the common problems of numerical linear algebra: linear systems, linear least squares, and eigenvalue problems on high-performance computer architectures. The advantages of an object-oriented approach include the ability to encapsulate various matrix representations, hide their implementation details, reduce the number of subroutines, simplify their calling sequences, and provide an extendible software framework that can incorporate future extensions of LAPACK, such as ScaLAPACK++ for distributed memory architectures. We present an overview of the object-oriented design of the matrix and decomposition classes in C++ and discuss its impact on elegance, generality, and performance.

1 Introduction

LAPACK++ is an object-oriented C++ extension to the Fortran LAPACK [1] library for numerical linear algebra. This package includes state-of-the-art numerical algorithms for the more common linear algebra problems encountered in scientific and engineering applications. It is based on the widely used LINPACK [5] and EISPACK [13] libraries for solving linear equations, linear least squares, and eigenvalue problems for dense and banded systems.

The current LAPACK software consists of over 1,000 routines and 600,000 lines of Fortran 77 source code.

The numerical algorithms in LAPACK utilize block-matrix operations, such as matrix multiply, in the innermost loops to achieve high performance on cached and hierarchical memory architectures. These operations, standardized as a set of subroutines called the Level 3 BLAS (Basic Linear Algebra Subprograms [6]), improve performance by increasing the granularity of the computations and keeping the most frequently accessed subregions of a matrix in the fastest level of memory. The result is that these block-matrix versions of the fundamental algorithms typically show performance improvements of a factor of three over non-blocked versions [1].

LAPACK++ provides a framework for describing general block matrix computations in C++. Without a proper design of fundamental matrix and factorization classes, the performance benefits of blocked codes can easily be lost due to unnecessary data copying, inefficient access of submatrices, and excessive run-time overhead in the dynamic-binding mechanisms of C++.

LAPACK++, however, is not a general purpose array package. There are no functions, for example, to compute trigonometric operations on matrices, or to deal with multi-dimensional arrays. There are several good public domain and commercial C++ packages for these problems [4], [9], [11], [12]. The classes in LAPACK++, however, can easily integrate with these or with any other C++ matrix interface. These objects have been explicitly designed with block matrix algorithms in mind and make extensive use of the Level 3 BLAS. Furthermore, LAPACK++ is more than just a shell to the Fortran library; some of the key routines, such as the matrix factorizations, are actually implemented in C++ so that the general algorithm can be applied to derived matrix classes, such as distributed memory matrix objects.
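To make the blocking idea above concrete, the sketch below shows the classic blocked matrix-multiply loop structure that the Level 3 BLAS exploit: the outer loops walk over NB x NB blocks so that each block pair stays in the fastest level of memory while it is reused. This is an illustrative kernel only (our own plain C++, with the column-major storage assumed throughout this paper); production codes call an optimized DGEMM instead.

    #include <algorithm>

    // Illustrative blocked matrix multiply: C += A*B for n x n matrices
    // stored column-major. NB is the block size; an optimized Level 3
    // BLAS plays this role (much faster) inside LAPACK itself.
    void blocked_matmul(int n, const double *A, const double *B, double *C)
    {
        const int NB = 64;   // block dimension chosen so blocks fit in cache
        for (int jj = 0; jj < n; jj += NB)
          for (int kk = 0; kk < n; kk += NB)
            for (int ii = 0; ii < n; ii += NB)
              // multiply one NB x NB block pair, accumulating into C
              for (int j = jj; j < std::min(jj + NB, n); j++)
                for (int k = kk; k < std::min(kk + NB, n); k++)
                  for (int i = ii; i < std::min(ii + NB, n); i++)
                    C[i + j*n] += A[i + k*n] * B[k + j*n];
    }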

LAPACK++ provides speed and efficiency competitive with native Fortran codes (see Section 2.2), while allowing programmers to capitalize on the software engineering benefits of object-oriented programming. Replacing the Fortran 77 interface of LAPACK with an object-oriented framework simplifies the coding style and allows for a more flexible and extendible software platform. In Section 6, for example, we discuss extensions to support distributed matrix computations on scalable architectures [2].

The motivation and design goals for LAPACK++ include:

- Maintaining competitive performance with Fortran 77.
- Providing a simple interface that hides implementation details of various matrix storage schemes and their corresponding factorization structures.
- Providing a universal interface and open system design for integration into user-defined data structures and third-party matrix packages.
- Replacing the static work array limitations of Fortran with more flexible and type-safe dynamic memory allocation schemes.
- Providing an efficient indexing scheme for matrix elements that has minimal overhead and can be optimized in most application code loops.
- Utilizing function and operator overloading in C++ to simplify and reduce the number of interface entry points to LAPACK.
- Providing the capability to access submatrices by reference, rather than by value, and to perform factorizations "in place" - vital for implementing blocked algorithms efficiently.
- Providing more meaningful naming conventions for variables and functions (e.g., names no longer limited to six alphanumeric characters).

LAPACK++ also provides an object-oriented interface to the Basic Linear Algebra Subprograms (BLAS) [6], allowing programmers to utilize these optimized computational kernels in their own C++ applications.

2 Overview

The underlying philosophy of the LAPACK++ design is to provide an interface which is simple, yet powerful enough to express the sophisticated numerical algorithms within LAPACK, including those which optimize performance and/or storage. Programmers who wish to utilize LAPACK++ as a black box to solve AX = B need not be concerned with the intricate details.

Following the framework of LAPACK, the C++ extension contains driver routines for solving standard types of problems, computational routines to perform distinct computational tasks, and auxiliary routines to perform certain subtasks or common low-level computations. Each driver routine typically calls a sequence of computational routines. Taken as a whole, the computational routines can perform a wider range of tasks than are covered by the driver routines.

Utilizing function overloading and object inheritance in C++, the procedural interface to LAPACK has been simplified: two fundamental drivers and their variants, LaLinSolve() and LaEigenSolve(), replace several hundred subroutines in the original Fortran version.

LAPACK++ supports various algorithms for solving linear equations and eigenvalue problems:

- Algorithms
  - LU factorization
  - Cholesky (LL^T) factorization
  - QR factorization (linear least squares)
  - Singular Value Decomposition (SVD)
  - Eigenvalue problems (as included in LAPACK)
- Storage classes
  - rectangular matrices
  - symmetric and symmetric positive definite (SPD) matrices
  - banded matrices
  - tri/bidiagonal matrices
- Element data types
  - float, double, single and double precision complex

In this paper we focus on matrix factorizations for linear equations and linear least squares.

2.1 A simple code example

To illustrate how LAPACK++ simplifies the user interface, we present a small code fragment to solve linear systems. The examples are incomplete and are meant merely to illustrate the interface style. The next few sections discuss the details of matrix classes and their operations.

Consider solving the linear system Ax = b using LU factorization in LAPACK++:

    #include <lapack++.h>

    LaGenMatDouble A(N,N);
    LaVectorDouble x(N), b(N);

    // ...

    LaLinSolve(A,x,b);

The first line includes the LAPACK++ object and function declarations. The second line declares A to be a square N x N coefficient matrix, while the third line declares the right-hand-side and solution vectors. Finally, the LaLinSolve() function in the last line calls the underlying LAPACK driver routine for solving general linear equations.

Consider now solving a similar system with a tridiagonal coefficient matrix:

    #include <lapack++.h>

    LaTridiagMatDouble A(N,N);
    LaVectorDouble x(N), b(N);

    // ...

    LaLinSolve(A,x,b);

The only code modification is in the declaration of A. In this case LaLinSolve() calls the driver routine DGTSV() for tridiagonal linear systems. The LaLinSolve() function has been overloaded to perform different tasks depending on the type of the input matrix A. There is no runtime overhead associated with this; it is resolved by C++ at compile time.
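For readers who want a version that actually runs, the fragment below fills in illustrative values for A and b. The size N and the diagonally dominant test matrix are our own choices, not part of the paper; the sketch relies only on the constructors, element access, and LaLinSolve() driver shown above, and assumes vector elements are accessed with a single subscript, b(i), in line with the matrix syntax.

    #include <lapack++.h>

    int main()
    {
        const int N = 3;                  // illustrative size
        LaGenMatDouble A(N,N);
        LaVectorDouble x(N), b(N);

        // fill in a small, diagonally dominant test system
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++)
                A(i,j) = (i == j) ? 4.0 : 1.0;
            b(i) = 1.0;
        }

        LaLinSolve(A, x, b);              // x now holds the solution of Ax = b
        return 0;
    }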

2.2 Performance

The elegance of the LAPACK++ matrix classes may seem to imply that they incur a significant runtime performance overhead compared to similar computations using optimized Fortran. This is not true. The design of LAPACK++ has been optimized for performance and can utilize the BLAS kernels as efficiently as Fortran. Figure 1, for example, illustrates the megaflop rating of the simple code

    C = A*B;

for square matrices of various sizes on the IBM RS/6000 Model 550 workstation. This particular implementation used GNU g++ v. 2.3.1 and utilized the Level 3 BLAS routines from the native ESSL library. The performance results are nearly identical with those of optimized Fortran calling the same library. This is accomplished by inlining the LAPACK++ BLAS kernels: these functions are expanded at the point of their call by the C++ compiler, saving the runtime overhead of an explicit function call.

In this case the above expression calls the underlying DGEMM Level 3 BLAS routine. This binding occurs at compile time, without any runtime overhead. The performance numbers are very near the machine peak and illustrate that using C++ with optimized computational kernels provides an elegant high-level interface without sacrificing performance.

Figure 2 illustrates the performance characteristics of the LU factorization of various matrices on the same architecture using the LAPACK++ routine LaLUFactorIP(A,F), which overwrites A with its LU factors. This routine essentially inlines to the underlying LAPACK routine DGETRF() and incurs no runtime overhead. (The IP suffix stands for "In Place" factorization; see Section 4.1 for details.)
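The mechanism behind C = A*B can be sketched as follows: an overloaded product routine simply forwards the operands, with their column-major storage and leading dimensions, to the Fortran DGEMM kernel. The struct and function names below are hypothetical stand-ins (the real LAPACK++ classes carry more state); only the DGEMM calling convention is standard.

    // Hypothetical sketch of forwarding a C++ matrix product to DGEMM.
    extern "C" void dgemm_(const char *transa, const char *transb,
                           const int *m, const int *n, const int *k,
                           const double *alpha, const double *a, const int *lda,
                           const double *b, const int *ldb,
                           const double *beta, double *c, const int *ldc);

    struct Mat {            // stand-in for a dense column-major matrix class
        int rows, cols;
        double *data;
    };

    // C = A * B in one DGEMM call; dimension checks omitted for brevity
    void mat_mult(const Mat &A, const Mat &B, Mat &C)
    {
        const double one = 1.0, zero = 0.0;
        dgemm_("N", "N", &A.rows, &B.cols, &A.cols,
               &one, A.data, &A.rows, B.data, &B.rows,
               &zero, C.data, &C.rows);
    }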

[Figure 1: Matrix Multiply (C = A*B) on the IBM RS/6000 Model 550 workstation: performance of matrix multiply in LAPACK++, plotted as speed in Mflops against the order of the matrix. GNU g++ v. 2.3.1 was used together with the ESSL Level 3 routine dgemm.]

[Figure 2: Performance of LAPACK++ LU factorization (solid) versus Fortran LAPACK (dashed) on the IBM RS/6000 Model 550 workstation, plotted as speed in Mflops against the order of the vectors/matrices, using GNU g++ v. 2.3.1 and BLAS routines from the IBM ESSL library. The results are indistinguishable between the Fortran and C++ interfaces.]

3 LAPACK++ Matrix Objects

The fundamental objects in LAPACK++ are numerical vectors and matrices; however, LAPACK++ is not a general-purpose array package. Rather, LAPACK++ is a self-contained interface consisting of only the minimal number of classes to support the functionality of the LAPACK algorithms and data structures.

LAPACK++ matrices can be referenced, assigned, and used in mathematical expressions as naturally as if they were an integral part of C++; the matrix element a_ij, for example, is referenced as A(i,j). By default, matrix subscripts begin at zero, in keeping with the indexing convention of C++; however, they can be set to any user-defined value. (Fortran programmers typically prefer 1.) Internally, LAPACK++ matrices are typically stored in column order for compatibility with Fortran subroutines and libraries.

Various types of matrices are supported: banded, symmetric, Hermitian, packed, triangular, tridiagonal, bidiagonal, and non-symmetric. Rather than have an unstructured collection of matrix classes, LAPACK++ maintains a class hierarchy (Figure 3) to exploit commonality in the derivation of the fundamental matrix types. This limits much of the code redundancy and allows for an open-ended design which can be extended as new matrix storage structures are introduced.
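The following schematic suggests how such a hierarchy can factor out common storage and indexing; the class names and members are hypothetical illustrations of the idea, not the actual LAPACK++ source.

    #include <vector>

    // Hypothetical base class: one column-major storage scheme shared by
    // the derived matrix types (names are illustrative only).
    class DenseBase {
    public:
        DenseBase(int m, int n) : m_(m), n_(n), v_(m*n, 0.0) {}
        int rows() const { return m_; }
        int cols() const { return n_; }
        double& operator()(int i, int j) { return v_[i + j*m_]; }
    protected:
        int m_, n_;
        std::vector<double> v_;
    };

    // A general rectangular matrix adds nothing beyond the base storage,
    class GenMat : public DenseBase {
    public:
        GenMat(int m, int n) : DenseBase(m, n) {}
    };

    // while a symmetric type reuses the same storage and merely folds
    // subscripts so that (i,j) and (j,i) address the same element.
    class SymmMat : public DenseBase {
    public:
        explicit SymmMat(int n) : DenseBase(n, n) {}
        double& at(int i, int j)
            { return (i >= j) ? (*this)(i,j) : (*this)(j,i); }
    };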

[Figure 3: The matrix hierarchy of LAPACK++: base classes for general dense, band, tri/bidiagonal, symmetric, Hermitian, SPD, triangular (unit, upper/lower), and orthogonal (used in QR) matrix types, along with general vectors and subranges. Rectangular and Vector classes allow indexing of any rectangular submatrix by reference.]

Matrix classes and other related data types specific to LAPACK++ begin with the prefix "La" to avoid naming conflicts with user-defined or third-party matrix packages. The list of valid names is a subset of the nomenclature shown in Figure 4.

3.1 General Matrices

One of the fundamental matrix types in LAPACK++ is a general (nonsymmetric) rectangular matrix. The possible data element types of this matrix include single and double precision real and complex numbers. The corresponding LAPACK++ names are given as LaGenMat{type}, for "LAPACK General Matrix", where the type suffix can be float, double, fcomplex, or dcomplex.

[Figure 4: LAPACK++ matrix nomenclature. Names are composed as La + {Gen, Orthog, Symm, Hermitian, SPD} + {Band, Tridiag, Bidiag, Packed} + [Unit] {Upper, Lower} {Triang} + Mat + {Float, Double, Fcomplex, Dcomplex}; items in square brackets are optional.]

Matrices in this category have the added property that submatrices can be efficiently accessed and referenced in matrix expressions. This is a necessity for describing block-structured algorithms.

3.1.1 Declarations

General LAPACK++ matrices may be declared (constructed) in various ways:

    #include <lapack++.h>

    float d[4] = {1.0, 2.0, 3.0, 4.0};

    LaGenMatDouble A(200,100); A = 0.0;   // 1
    LaGenMatDComplex B;                   // 2
    LaGenMatDouble C; C.ref(A);           // 3
    LaGenMatDouble D(A);                  // 4
    LaGenMatFloat E(d, 2, 2);             // 5

Line (1) declares A to be a rectangular 200x100 matrix, with all of its elements initialized to 0.0. Line (2) declares B to be an empty (uninitialized) matrix; until B is initialized, any attempt to reference its elements will result in a run-time error. Line (3) declares C to share the same elements as A. Line (4) illustrates an equivalent way of specifying this at the time of a new object's construction. Finally, line (5) demonstrates how one can initialize a 2x2 matrix with the data from a standard C array. The values are initialized in column-major form, so that the first column of E contains {1.0, 2.0}^T, and the second column contains {3.0, 4.0}^T.
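A short sketch of the sharing semantics just described (the numeric values are our own; the behavior follows the documentation of ref() above):

    LaGenMatDouble A(2,2);
    A(0,0) = 1.0;  A(0,1) = 3.0;
    A(1,0) = 2.0;  A(1,1) = 4.0;

    LaGenMatDouble C;
    C.ref(A);              // C aliases A's elements; no copy is made
    C(0,0) = 99.0;         // A(0,0) now reads 99.0 through either object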

3.1.2 Submatrices

Blocked linear algebra algorithms utilize submatrices as their basic unit of computation, so it is crucial that submatrix operations be highly optimized. Because of this, LAPACK++ provides mechanisms for accessing rectangular subregions of a general matrix. These regions are accessed by reference, that is, without copying data, and can be used in any matrix expression.

Ideally, one would like to use the familiar colon notation of Fortran 90 or Matlab for expressing submatrices. However, this modification of the C++ syntax is not possible without redefining the language specifications. As a reasonable compromise, LAPACK++ denotes submatrices by specifying a subscript range through the LaIndex() function. For example, the 3x3 matrix in the upper left corner of A is denoted as

    A( LaIndex(0,2), LaIndex(0,2) )

This references A_ij, i = 0,1,2, j = 0,1,2, and is equivalent to the A(0:2,0:2) colon notation used elsewhere. Submatrix expressions may also be used as a destination for assignment, as in

    A( LaIndex(0,2), LaIndex(0,2) ) = 0.0;

which sets the 3x3 submatrix of A to zero. Following the Fortran 90 conventions, the index notation has an optional third argument denoting the stride value,

    LaIndex(start, end, increment)

If the increment value is not specified, it is assumed to be one. The expression LaIndex(s, e, i) is equivalent to the index sequence

    s, s+i, s+2i, ..., s + floor((e-s)/i) * i

The internal representation of an index is not expanded to a full vector, but kept in its compact triplet format. The increment values may be negative, allowing one to traverse a subscript range in the opposite direction, such as LaIndex(10,7,-1) to denote the sequence 10, 9, 8, 7.
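The triplet semantics can be made explicit with a small helper (our own illustration, not part of LAPACK++) that expands LaIndex(s, e, i) into the sequence above, including the negative-increment case:

    #include <vector>

    // Expand an index triplet (s, e, i) into the explicit sequence
    // s, s+i, s+2i, ..., s + floor((e-s)/i)*i.
    std::vector<int> expand_triplet(int s, int e, int i)
    {
        std::vector<int> idx;
        for (int k = s; (i > 0) ? (k <= e) : (k >= e); k += i)
            idx.push_back(k);
        return idx;          // e.g. (10, 7, -1) yields 10, 9, 8, 7
    }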

Indices can be named and used in expressions, as in the following submatrix assignments:

    LaGenMat<double> A(10,10), B, C;   // 1
    LaIndex I(1,9,2),                  // 2
            J(1,3,2);                  // 3

    B.ref(A(I,I));                     // 4
    B(2,3) = 3.1;                      // 5
    C = B(LaIndex(2,4,2), J);          // 6

In lines (2) and (3) we declare the indices I = {1,3,5,7,9} and J = {1,3}. Line (4) sets B to the specified 5x5 submatrix of A. The matrix B can be used in any matrix expression, including accessing its individual elements, as in line (5). Note that B(2,3) refers to the same memory location as A(5,7), so a change to B will also modify the contents of A. Line (6) assigns the 2x2 submatrix of B to C. Note that C can also be referenced as A( LaIndex(5,9,2), LaIndex(3,7,2) ).

Although LAPACK++ submatrix expressions allow one to access non-contiguous rows or columns, many of the LAPACK routines only allow submatrices with unit stride in the column direction. Calling a LAPACK++ routine with non-contiguous submatrix columns may cause the data to be copied into a contiguous submatrix, and can optionally generate a runtime warning to advise the programmer that data copying has taken place. (In Fortran, the user would need to do this copying by hand.)

4 Driver Routines

This section discusses the LAPACK++ routines for solving a system of linear equations

    Ax = b,

where A is the coefficient matrix, b is the right-hand side, and x is the solution. A is assumed to be a square matrix of order n, although the underlying computational routines allow A to be rectangular. For several right-hand sides, we write

    AX = B,

where the columns of B are the individual right-hand sides, and the columns of X are the corresponding solutions. The task is to find X, given A and B. The coefficient matrix A can be of the types shown in Figure 4.

The basic syntax for a linear equation driver in LAPACK++ is given by

    LaLinSolve(op(A), X, B);

The matrices A and B are input, and X is the output. A is an M x N matrix of one of the above types. Letting nrhs denote the number of right-hand sides in AX = B, X and B are both rectangular matrices of size N x nrhs. The syntax op(A) can denote either A or the transpose of A, expressed as transp(A). This version requires intermediate storage of approximately M * (N + nrhs) elements.

In cases where no additional information is supplied, the LAPACK++ routines attempt to follow an intelligent course of action. For example, if LaLinSolve(A,X,B) is called with a non-square M x N matrix, the solution returned will be the linear least squares solution that minimizes ||Ax - b||_2, computed using a QR factorization. Or, if A is declared as SPD, then a Cholesky factorization will be used. Alternatively, one can directly specify the exact factorization method, such as LaLUFactor(F, A). In this case, if A is non-square, the factors return only a partial factorization of the upper square portion of A.

Error conditions in performing the LaLinSolve() operations can be retrieved via the LaLinSolveInfo() function, which returns information about the last called LaLinSolve(). A zero value denotes successful completion. A value of -i denotes that the i-th argument was somehow invalid or inappropriate. A positive value i denotes that in the LU decomposition, U(i,i) = 0; the factorization has been completed but the factor U is exactly singular, so the solution could not be computed. In this case, the value returned by LaLinSolve() is a null (0x0) matrix.
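Putting the pieces together, a hedged usage sketch (the function name and the decision to return the status code are our own; the driver, transp(), and LaLinSolveInfo() behave as just described):

    #include <lapack++.h>

    // Solve the transposed system A^T X = B and report the outcome.
    int solve_transposed(LaGenMatDouble &A, LaGenMatDouble &X,
                         LaGenMatDouble &B)
    {
        LaLinSolve(transp(A), X, B);

        int info = LaLinSolveInfo();   // 0 on success
        if (info > 0) {
            // U(info,info) == 0: U is exactly singular, no solution was
            // computed, and X was returned as a null (0x0) matrix.
        } else if (info < 0) {
            // argument number -info was invalid or inappropriate
        }
        return info;
    }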

4.1 Memory Optimizations: Factorizing in Place

When using large matrices that consume a significant portion of available memory, it may be beneficial to remove the requirement of storing intermediate factorization representations, at the expense of destroying the contents of the input matrix A. For most matrix factorizations we require temporary data structures roughly equal to the size of the original input matrix. (For general banded matrices, one needs slightly more storage due to pivoting, which causes fill-in of additional bands.) For example, the temporary memory requirement of a square N x N dense non-symmetric factorization can be reduced from N * (N + nrhs + 1) elements to N * 1. Such memory-efficient factorizations are performed with the LaLinSolveIP() routine:

    LaLinSolveIP(A, X, B);

Here the contents of A are overwritten (with the respective factorization). These "in-place" functions are intended for advanced programmers and are not recommended for general use. They make it the programmer's responsibility to recognize that the contents of A have been destroyed; however, they can allow a large numerical problem to be solved on a machine with limited memory.

5 Programming Examples

This code example solves the linear least squares problem of fitting N data points (x_i, y_i) with a d-th degree polynomial

    p(x) = a_0 + a_1 x + a_2 x^2 + ... + a_d x^d

using QR factorization. It is intended for illustrative purposes only; there are more effective methods to solve this problem.

Given the two vectors x and y, it returns the vector of coefficients a = {a_0, a_1, a_2, ..., a_d}. It is assumed that N >= d. The solution arises from solving the overdetermined Vandermonde system Xa = y,

    [ 1  x_0      x_0^2      ...  x_0^d     ] [ a_0 ]   [ y_0     ]
    [ 1  x_1      x_1^2      ...  x_1^d     ] [ a_1 ] = [ y_1     ]
    [ ...                                   ] [ ... ]   [ ...     ]
    [ 1  x_{N-1}  x_{N-1}^2  ...  x_{N-1}^d ] [ a_d ]   [ y_{N-1} ]

in the least squares sense, that is, minimizing ||Xa - y||_2. The resulting code is shown in Figure 5.

    void poly_fit(LaVector<double> &x, LaVector<double> &y,
                  LaVector<double> &p)
    {
        int N = min(x.size(), y.size());
        int d = p.size();            // number of coefficients
        LaGenMatDouble P(N,d);
        double x_to_the_j;

        // construct the Vandermonde matrix
        for (int i = 0; i < N; i++)
        {
            x_to_the_j = 1;
            for (int j = 0; j < d; j++)
            {
                P(i,j) = x_to_the_j;
                x_to_the_j *= x(i);
            }
        }

        // solve Pa = y using linear least squares (in place)
        LaLinSolveIP(P, p, y);
    }

Figure 5: LAPACK++ code example: polynomial data fitting.

6 ScaLAPACK++: An Extension for Distributed Architectures

There are various ways to extend LAPACK++. Here we discuss one such extension, ScaLAPACK++ [2], for linear algebra on distributed memory architectures. The intent is that for large scale problems ScaLAPACK++ should effectively exploit the computational hardware of medium grain-sized multicomputers with up to a few thousand processors, such as the Intel Paragon and Thinking Machines Corporation's CM-5.

Achieving these goals while maintaining portability, flexibility, and ease-of-use can present serious challenges, since the layout of an application's data within the hierarchical memory of a concurrent computer is critical in determining the performance and scalability of the parallel code. To enhance the programmability of the library we would like details of the parallel implementation to be hidden as much as possible, while still providing the user with the capability to control the data distribution.

The design of ScaLAPACK++ includes a general two-dimensional matrix decomposition that supports the most common block scattered decompositions encountered in the current literature. ScaLAPACK++ can be extended to support arbitrary matrix decompositions by providing the specific parallel BLAS library to operate on such matrices.

Decoupling the matrix operations from the details of the decomposition not only simplifies the encoding of an algorithm but also allows the possibility of postponing the decomposition scheme until runtime. This is only possible with object-oriented programming languages, like C++, that support dynamic binding, or polymorphism: the ability to examine an object's type and dynamically select the appropriate action [10]. In many applications the optimal matrix decomposition is strongly dependent on how the matrix is utilized in other parts of the program.

Furthermore, it is often necessary to dynamically alter the matrix decomposition at runtime to accommodate special routines. The ability to support dynamic run-time decomposition strategies is one of the key features of ScaLAPACK++ that makes it integrable with scalable applications.

[Figure 6: An example of block scattered decomposition over a 2x4 processor grid: the eight processors, numbered 0 through 7, are stamped repeatedly across the matrix in 2x4 cells.]

The currently supported decomposition scheme defines global matrix objects which are distributed across a P x Q logical grid of processors. Matrices are mapped to processors using a block scattered class of decompositions (Figure 6) that allows a wide variety of matrix mappings while enhancing scalability and maintaining good load balance for various factorization algorithms. Each node of the multicomputer runs a sequential LAPACK++ library that provides the object-oriented framework to describe block algorithms on conventional matrices in each individual processor.

We can view the block scattered decomposition as stamping a P x Q processor grid, or template, over the matrix, where each cell of the grid covers r x s data items and is labeled by its position in the template (Figure 6). The block and scattered decompositions may be regarded as special cases of the block scattered (SBS) decomposition. This scheme is practical and sufficiently general-purpose for most, if not all, dense linear algebra computations. Furthermore, in problems, such as LU factorization, in which rows and/or columns are eliminated in successive steps, the SBS decomposition enhances scalability by ensuring processor load balance.
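The template mapping described above can be written down directly; the following helper (an illustration under the stated conventions, not ScaLAPACK++ source) computes which processor of the P x Q grid owns global element (i, j) when each template cell covers r x s items:

    struct GridPos { int p, q; };

    // Block scattered ownership: stamp a P x Q grid of r x s cells over
    // the matrix; element (i,j) (zero-based) lands on processor (p,q).
    GridPos owner(int i, int j, int r, int s, int P, int Q)
    {
        GridPos g;
        g.p = (i / r) % P;   // block-row index, wrapped over the P rows
        g.q = (j / s) % Q;   // block-column index, wrapped over Q columns
        return g;
    }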

Preliminary experiments with an object-based LU factorization algorithm using an SBS decomposition [2] suggest these algorithms scale well on multicomputers. For example, on the Intel Touchstone Delta System, a 520-node i860-based multicomputer, such algorithms can achieve nearly twelve Gflops (Figure 8).

Parallelism is exploited through the use of distributed memory versions of the Basic Linear Algebra Subprograms (BLAS) [6], [3] that perform the basic computational units of the block algorithms. Thus, at a higher level, the block algorithms look the same for the parallel and sequential versions, and only one version of each needs to be maintained.

[Figure 7: Design hierarchy of ScaLAPACK++. A parallel application sits on ScaLAPACK++ and the parallel BLAS, which in turn build on the sequential BLAS, LAPACK++, and message passing primitives (BLACS). In an SPMD environment, components above the horizontal reference line represent a global viewpoint (a single distributed structure), while elements below represent a per-node local viewpoint of data.]

The benefits of an object-oriented design (Figure 7) include the ability to hide the implementation details of distributed matrices from the application programmer, and the ability to support a generic interface to basic computational kernels (BLAS), such as matrix multiply, without specifying the details of the matrix storage class.

[Figure 8: Performance of distributed LU factorization using a Square Block Scattered (SBS) matrix decomposition on the Intel Touchstone Delta system, plotted as speed in Gflops against matrix size N. Results are provided for processor-grid configurations of 2x16, 4x16, 4x32, 8x32, and 8x64.]

7 Conclusion

We have presented a design overview for object-oriented linear algebra on high performance architectures, and we have described extensions for distributed memory architectures. These designs treat both conventional and distributed matrices as fundamental objects.

In both the parallel and sequential aspects of LAPACK++, decoupling the matrix algorithm from a specific data decomposition provides three important attributes: (1) it results in simpler code which more closely matches the underlying mathematical formulation; (2) it allows for one "universal" algorithm, rather than supporting one version for each data decomposition needed; and (3) it allows one to postpone the data decomposition decision until runtime.

We have used the inheritance mechanism of the object-oriented design to provide a common source code for both the parallel and sequential versions. Because the parallelism is embedded in the parallel BLAS library, the sequential and parallel high-level matrix algorithms in ScaLAPACK++ look the same.

This tremendously reduces the complexity of the parallel libraries.

We have used polymorphism, or dynamic binding, to achieve a truly portable matrix library in which the data decomposition may be dynamically changed at runtime.

We have utilized the operator and function overloading capabilities of C++ to simplify the syntax and user interface to the LAPACK and BLAS functions.

We have utilized the function inlining capabilities of C++ to reduce the function-call overhead usually associated with interfacing Fortran or assembly kernels.

In short, we have used various important aspects of object-oriented mechanisms and C++ in the design of LAPACK++. These attributes were utilized not because of novelty, but out of necessity, to arrive at a design which provides scalability, portability, flexibility, and ease-of-use.

References

[1] E. Anderson, Z. Bai, C. Bischof, J. W. Demmel, J. J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen, LAPACK Users' Guide, SIAM, Philadelphia, 1992.

[2] J. Choi, J. J. Dongarra, R. Pozo, and D. W. Walker, ScaLAPACK: A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers, Frontiers of Massively Parallel Computing, McLean, Virginia, October 19-21, 1992.

[3] J. Choi, J. J. Dongarra, and D. W. Walker, PB-BLAS: Parallel Block Basic Linear Algebra Subroutines on Distributed Memory Concurrent Computers, Oak Ridge National Laboratory, Mathematical Sciences Section, in preparation, 1993.

[4] R. Davies, newmat07, an experimental matrix package in C++, [email protected], 1993.

[5] J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart, LINPACK Users' Guide, SIAM, Philadelphia, PA, 1979.

[6] J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling, A Set of Level 3 Basic Linear Algebra Subprograms, ACM Trans. Math. Soft., 16 (1990), pp. 1-17.

[7] J. J. Dongarra, R. Pozo, and D. W. Walker, LAPACK++ Users' Manual, in preparation.

[8] J. J. Dongarra, R. Pozo, and D. W. Walker, An Object Oriented Design for High Performance Linear Algebra on Distributed Memory Architectures, Object Oriented Numerics Conference (OONSKI), Sunriver, Oregon, May 26-27, 1993.

[9] Dyad Software Corporation, M++ Class Library, Bellevue, Washington, 1991.

[10] M. A. Ellis and B. Stroustrup, The Annotated C++ Reference Manual, Addison-Wesley, 1990.

[11] K. E. Gorlen, S. M. Orlow, and P. S. Plexico, Data Abstraction and Object-Oriented Programming in C++, John Wiley & Sons, Chichester, England, 1990.

[12] Rogue Wave Software, Math.h++, Corvallis, Oregon, 1992.

[13] B. T. Smith, J. M. Boyle, J. J. Dongarra, B. S. Garbow, Y. Ikebe, V. C. Klema, and C. B. Moler, Matrix Eigensystem Routines: EISPACK Guide, vol. 6 of Lecture Notes in Computer Science, Springer-Verlag, Berlin, 2nd ed., 1976.