
Solving Challenging Numerical Linear Algebra Algorithms Using Multiple GPU Accelerators

Hatem Ltaief

KAUST Supercomputing Laboratory

Stanimire Tomov

University of Tennessee, Knoxville

Outline

MAGMA: LAPACK for GPUs

Methodology overview

— Use both GPUs and multicore CPUs

MAGMA: from single to multiGPU support

— One-sided factorizations and linear solvers

— Two-sided factorizations and eigensolvers

Dynamic scheduling approaches to DLA

MAGMA algorithms with dynamic scheduling

Conclusions

MAGMA: LAPACK for GPUs

MAGMA

— Matrix Algebra on GPU and Multicore Architectures

— To provide LAPACK/ScaLAPACK on hybrid architectures

— http://icl.cs.utk.edu/magma/

MAGMA BLAS

— A subset of BLAS for GPUs, highly optimized for NVIDIA GPGPUs

— Fast GEMM for Fermi (CUBLAS3.2) [IJHPCA’10]

MAGMA developers & collaborators

— UTK, UC Berkeley, UC Denver, INRIA (France), KAUST (Saudi Arabia)

— Community effort, similarly to LAPACK/ScaLAPACK

A New Generation of DLA Software

MAGMA: Hybrid Algorithms (heterogeneity friendly)

Rely on

- hybrid scheduler

- hybrid kernels

MAGMA Software Stack

MAGMA 1.1

50+ hybrid LAPACK algorithms have been developed (200+ routines in total)

Every algorithm is available in 4 precisions (s/c/d/z)

There are 3 mixed precision algorithms (zc & ds)

These are hybrid algorithms, expressed in terms of BLAS

MAGMA BLAS: a subset of GPU BLAS, optimized for Tesla and Fermi GPUs

MAGMA Methodology

A methodology to use all available resources:

MAGMA uses a HYBRIDIZATION methodology based on

– Representing linear algebra algorithms as collections of TASKS and DATA DEPENDENCIES among them

– Properly SCHEDULING tasks' execution over multicore and GPU hardware components

Successfully applied to fundamental linear algebra algorithms

– One- and two-sided factorizations and solvers

– Iterative linear and eigen-solvers

Productivity

– 1) High level; 2) Leveraging prior developments; 3) Exceeding the performance of homogeneous solutions

Hybrid CPU+GPU algorithms (small tasks for multicores and large tasks for GPUs)

Hybrid Algorithms

One-sided factorizations (LU, QR, Cholesky)

Hybridization

– Panels (Level 2 BLAS) are factored on CPU using LAPACK

– Trailing matrix updates (Level 3 BLAS) are done on the GPU using “look-ahead”

A Hybrid Algorithm Example

Left-looking hybrid Cholesky factorization in MAGMA 1.0

The difference from LAPACK: the 3 additional lines (shown in red on the slide)

Line 10 (done on CPU) is overlapped with work on the GPU
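To make the hybridization concrete, here is a minimal single-GPU sketch of a left-looking hybrid Cholesky written directly with cuBLAS and LAPACKE. It is not MAGMA's code (which adds look-ahead, multiple streams, and error handling); the routine name and the dA(i,j) macro are illustrative, and the cuBLAS handle is assumed to use the default CUDA stream so calls execute in issue order.

#include <cublas_v2.h>
#include <lapacke.h>

/* dA: n-by-n SPD matrix already on the GPU (column major, leading dimension
 * ldda); work: host buffer of at least nb*nb doubles. */
void hybrid_potrf_lower(cublasHandle_t handle, int n, int nb,
                        double *dA, int ldda, double *work)
{
    const double one = 1.0, m_one = -1.0;
    #define dA(i_, j_) (dA + (size_t)(j_) * ldda + (i_))

    for (int j = 0; j < n; j += nb) {
        int jb = (nb < n - j) ? nb : n - j;

        /* Left-looking update of the diagonal block with the already
         * factored columns (Level 3 BLAS, on the GPU). */
        if (j > 0)
            cublasDsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, jb, j,
                        &m_one, dA(j, 0), ldda, &one, dA(j, j), ldda);

        /* Bring the jb-by-jb diagonal block to the CPU (blocking copy). */
        cublasGetMatrix(jb, jb, sizeof(double), dA(j, j), ldda, work, jb);

        /* Enqueue the update of the rest of the panel on the GPU; the call
         * returns immediately, so it overlaps with the CPU work below. */
        if (j > 0 && j + jb < n)
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n - j - jb, jb, j,
                        &m_one, dA(j + jb, 0), ldda, dA(j, 0), ldda,
                        &one, dA(j + jb, j), ldda);

        /* Factor the small diagonal block on the CPU with LAPACK
         * (Level 2 BLAS rich, ill-suited to the GPU). */
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', jb, work, jb);

        /* Send the factored block back and finish the panel with a
         * triangular solve on the GPU. */
        cublasSetMatrix(jb, jb, sizeof(double), work, jb, dA(j, j), ldda);
        if (j + jb < n)
            cublasDtrsm(handle, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                        CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT, n - j - jb, jb,
                        &one, dA(j, j), ldda, dA(j + jb, j), ldda);
    }
    #undef dA
}

The key point mirrors the slide: the GEMM updating the rest of the panel is enqueued before the CPU factors the diagonal block, so the CPU and GPU work overlap.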

LU Factorization (Single GPU)

From single to multiple GPUs support

Data distribution

— 1-D block-cyclic distribution

Algorithm

— The GPU holding the current panel sends it to the CPU

— All updates are done in parallel on the GPUs

— Look-ahead is done with the GPU holding the next panel

[Figure: nb-wide column blocks distributed round-robin over GPU 0, GPU 1, GPU 2, GPU 0, ...]
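For reference, the 1-D block-cyclic assignment sketched above boils down to a modular mapping (a minimal sketch with illustrative helper names, not MAGMA API):

/* nb-wide block column j lives on GPU j % ngpu, as local block j / ngpu.
 * With 3 GPUs, block columns 0,1,2,3,4,... land on GPUs 0,1,2,0,1,...  */
static inline int owner_gpu  (int j, int ngpu) { return j % ngpu; }
static inline int local_block(int j, int ngpu) { return j / ngpu; }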

LU factorization (multiGPUs)

LU factorization (multiGPUs), matrix out of GPU memory

Out of GPU Memory Algorithms

Perform left-looking factorizations on sub-matrices that fit in the GPU memory (using existing algorithms)

The rest of the matrix stays on the CPU

Left-looking versions minimize writing on the CPU

[Figure: factored sub-matrix A1 resides on the CPU; the sub-matrix A2 to be factored resides on the GPU]

1) Copy A2 to the GPU

2) Update A2 using A1 (a panel of A1 at a time)

3) Factor the updated A2 using the existing hybrid code

4) Copy the factored A2 back to the CPU

Trivially extended to multiGPUs: A2 is "larger", with a 1-D block-cyclic distribution, again reusing existing algorithms
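A schematic of steps 1)-4), hedged: apply_panel_update and hybrid_factor_gpu below are hypothetical stand-ins for the existing MAGMA update and factorization routines; only the CUDA copies are real API calls.

#include <cuda_runtime.h>

/* Hypothetical stand-ins for the existing MAGMA building blocks:
 * apply_panel_update is assumed to copy an nb-wide panel of A1 to the GPU
 * and apply its update to A2; hybrid_factor_gpu is the existing hybrid
 * (single/multi-GPU) factorization. */
void apply_panel_update(double *dA2, int n2, const double *A1_panel, int nb);
void hybrid_factor_gpu(double *dA2, int n2);

/* Left-looking out-of-GPU-memory pass over one sub-matrix A2 (n2 columns),
 * with the already factored A1 (n1 columns) kept on the CPU. */
void factor_next_submatrix(const double *A1, int n1, double *A2, int n2,
                           int lda, int nb)
{
    size_t bytes = (size_t)lda * n2 * sizeof(double);
    double *dA2;
    cudaMalloc((void **)&dA2, bytes);

    cudaMemcpy(dA2, A2, bytes, cudaMemcpyHostToDevice);        /* 1) A2 -> GPU      */

    for (int k = 0; k < n1; k += nb)                            /* 2) update A2      */
        apply_panel_update(dA2, n2, A1 + (size_t)k * lda, nb);  /*    panel by panel */

    hybrid_factor_gpu(dA2, n2);                                 /* 3) factor A2      */

    cudaMemcpy(A2, dA2, bytes, cudaMemcpyDeviceToHost);         /* 4) A2 -> CPU      */
    cudaFree(dA2);
}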

Hybrid Algorithms

Two-sided factorizations (to bidiagonal, tridiagonal, and upper Hessenberg forms) for eigen- and singular-value problems

Hybridization

– Trailing matrix updates (Level 3 BLAS) are done on the GPU (similar to the one-sided factorizations)

– Panels (Level 2 BLAS) are hybrid

– operations with memory footprint restricted to the panel are done on the CPU

– the time-consuming matrix-vector products involving the entire trailing matrix are done on the GPU

Hybrid Two-Sided Factorizations

MultiGPUs Two-Sided Factorizations

Performance of DSYMV on multi M2090s

Need HP multiGPU Level 2 BLAS

T. Dong, J. Dongarra, S. Tomov, I. Yamazaki, T. Schulthess, and R. Solca, Symmetric dense matrix-vector multiplication on multiple GPUs and its application to symmetric dense and sparse eigenvalue problems, ICL Technical Report, 03/2012.
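To illustrate the split/merge pattern behind a multiGPU Level 2 BLAS, here is a rough sketch that computes y = A x with one contiguous block of rows per GPU using plain DGEMV. It is illustrative only: MAGMA's DSYMV kernel additionally exploits symmetry and a block-cyclic layout, and a real implementation would enqueue the copies and GEMVs asynchronously on all devices before gathering, rather than serving the GPUs one after another as done here for brevity.

#include <cuda_runtime.h>
#include <cublas_v2.h>

/* y = A * x with the n-by-n matrix split into contiguous blocks of rows,
 * one block per GPU (column-major A with leading dimension lda). */
void multi_gpu_gemv(int ngpu, int n, const double *A, int lda,
                    const double *x, double *y)
{
    const double one = 1.0, zero = 0.0;
    int rows_per_gpu = (n + ngpu - 1) / ngpu;

    for (int g = 0; g < ngpu; g++) {
        int row0 = g * rows_per_gpu;
        int m    = (row0 + rows_per_gpu <= n) ? rows_per_gpu : n - row0;
        if (m <= 0) break;

        cudaSetDevice(g);                 /* all following calls target GPU g */
        cublasHandle_t h;
        cublasCreate(&h);

        double *dA, *dx, *dy;
        cudaMalloc((void **)&dA, (size_t)m * n * sizeof(double));
        cudaMalloc((void **)&dx, (size_t)n * sizeof(double));
        cudaMalloc((void **)&dy, (size_t)m * sizeof(double));

        /* Ship this GPU's block of rows (strided 2-D copy) and the full x. */
        cublasSetMatrix(m, n, sizeof(double), A + row0, lda, dA, m);
        cublasSetVector(n, sizeof(double), x, 1, dx, 1);

        /* Local product: y(row0 : row0+m-1) = A(row0 : row0+m-1, :) * x. */
        cublasDgemv(h, CUBLAS_OP_N, m, n, &one, dA, m, dx, 1, &zero, dy, 1);

        cublasGetVector(m, sizeof(double), dy, 1, y + row0, 1);

        cudaFree(dA); cudaFree(dx); cudaFree(dy);
        cublasDestroy(h);
    }
}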

Tridiagonalization


50% of the flops are in SYMV

Memory bound, i.e., does not scale well on multicore CPUs

Use the GPU's high memory bandwidth and optimized SYMV

8x speedup over 12 Intel cores (X5660 @ 2.8 GHz)

Keeneland system, using one node: 3 NVIDIA GPUs (M2070 @ 1.1 GHz, 5.4 GB) and 2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB)

Performance of MAGMA tridiagonalization in DP

From Static to Dynamic Scheduling…

Static scheduling may stall in situations where work is available

Hand-tuned optimizations

Hardware heterogeneity

Kernel heterogeneity

Separation of concerns

Dynamic Runtime System

Punch Lines

Productivity!

Productivity!

Productivity!

Oh… Did I say Productivity?

Block Algorithms

Panel-Update Sequence

Transformations are blocked/accumulated within the panel (Level 2 BLAS)

Transformations applied at once on the trailing submatrix (Level 3 BLAS)

Parallelism hidden inside the BLAS

Fork-join Model

Block QR Factorization

Fork-Join Paradigm
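The fork-join panel-update sequence can be summarized by the following sketch (illustrative names, not library API: panel_factor stands for the Level 2 BLAS panel kernel, trailing_update for the blocked Level 3 BLAS application of the accumulated transformations):

/* Hypothetical stand-ins for the panel and update kernels of a blocked QR. */
void panel_factor(double *A, int m, int n, int lda, int j, int nb);    /* Level 2 BLAS, little parallelism */
void trailing_update(double *A, int m, int n, int lda, int j, int nb); /* Level 3 BLAS, parallel inside    */

void blocked_qr(double *A, int m, int n, int lda, int nb)
{
    for (int j = 0; j < n; j += nb) {
        panel_factor(A, m, n, lda, j, nb);     /* accumulate nb Householder reflectors     */
        /* implicit join here: all threads wait, then fork into the BLAS                   */
        trailing_update(A, m, n, lda, j, nb);  /* apply them at once to the trailing part  */
    }
}

The synchronization point between the two phases is exactly what the tile algorithms and dynamic scheduling discussed next remove.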

Leveraging Block Algorithms…

[Figure: column-major data layout vs. tile data layout]

Lessons Learnt from PLASMA

PLASMA: Parallel Linear Algebra for Scalable Multi-core Architectures:

http://icl.cs.utk.edu/plasma/

Tile Algorithms on homogeneous x86 cores

Parallelism is brought to the fore

May require the redesign of linear algebra algorithms

Tile data layout translation

Remove unnecessary synchronization points between Panel-Update sequences

DAG execution where nodes represent tasks and edges define dependencies between them

Dynamic runtime system environment QUARK

Example: Tile QR Factorization

First panel factorization and corresponding updates; DAG for a 4x4 tile matrix

Let’s go crazy!

DAG of 20x20 tile QR Factorization

Dynamic Scheduling

Conceptually similar to out-of-order processor scheduling because it has:

— Dynamic runtime DAG scheduler

— Out-of-order execution flow of fine-grained tasks

— Task scheduling as soon as dependencies are satisfied

— Producer-Consumer

Data Flow Programming Model: a five-decade-old concept

— Think "how things connect" rather than "how things happen"

— Assembly line

— Inherently parallel

Matrices Over Runtime Systems at Exascale (MORSE)

Mission statement: "Design dense and sparse linear algebra methods that achieve the fastest possible time to an accurate solution on large-scale hybrid systems"

Runtime challenges due to the ever-growing hardware complexity

Algorithmic challenges to exploit the hardware capabilities to the fullest

Integrated into MAGMA software stack

MAGMA-MORSE: x86 + MultiGPUs

Lessons Learned from PLASMA!

CUDA-based hybrid systems

New high performance numerical kernels

StarPU runtime system (Augonnet et al., INRIA Bordeaux)

Both: x86 and GPUs => Hybrid Computations

Similar to LAPACK in functionality

Achieving High Level of Productivity

From Sequential Nested-Loop Code to Parallel Execution

for (k = 0; k < min(MT, NT); k++) {
    zgeqrt(A[k;k], ...);
    for (n = k+1; n < NT; n++)
        zunmqr(A[k;k], A[k;n], ...);
    for (m = k+1; m < MT; m++) {
        ztsqrt(A[k;k], A[m;k], ...);
        for (n = k+1; n < NT; n++)
            ztsmqr(A[m;k], A[k;n], A[m;n], ...);
    }
}

Achieving High Level of Productivity

From Sequential Nested-Loop Code to Parallel Execution

for (k = 0; k < min(MT, NT); k++) {
    starpu_insert_task(&cl_zgeqrt, k, k, ...);
    for (n = k+1; n < NT; n++)
        starpu_insert_task(&cl_zunmqr, k, n, ...);
    for (m = k+1; m < MT; m++) {
        starpu_insert_task(&cl_ztsqrt, m, k, ...);
        for (n = k+1; n < NT; n++)
            starpu_insert_task(&cl_ztsmqr, m, n, k, ...);
    }
}
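For context, a codelet such as cl_zgeqrt referenced above is, in StarPU terms, a small descriptor bundling kernel implementations and data access modes. The sketch below is a hypothetical CPU-only codelet; the real MAGMA-MORSE codelets also register CUDA implementations and pass the workspace arguments elided by the "..." on the slide.

#include <starpu.h>

/* Hypothetical CPU implementation behind cl_zgeqrt (illustrative only). */
static void zgeqrt_cpu_func(void *buffers[], void *cl_arg)
{
    void *tile = (void *)STARPU_MATRIX_GET_PTR(buffers[0]); /* the complex*16 tile */
    int   nb   = STARPU_MATRIX_GET_NX(buffers[0]);          /* tile size           */
    (void)tile; (void)nb; (void)cl_arg;
    /* ... call the tile QR panel kernel on this tile ... */
}

static struct starpu_codelet cl_zgeqrt = {
    .cpu_funcs = { zgeqrt_cpu_func },   /* CUDA variants would be added in .cuda_funcs */
    .nbuffers  = 1,                     /* one tile accessed ...                       */
    .modes     = { STARPU_RW },         /* ... in read-write mode                      */
};

The insert-task loop then only declares tasks and their data; StarPU derives the DAG from the declared access modes and schedules ready tasks on CPU cores or GPUs as their dependencies are satisfied.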

Hybrid Architecture Targeted

⇒ PCIe x16 interconnect, 64 Gb/s: a very thin pipe!

⇒ Fermi C2050: 448 CUDA cores, 515 Gflop/s

Performance Charts

Cholesky

QR

LU

Symmetric Matrix Inversion

Cholesky Performance 8 Intel x86 cores + 3 Tesla GPUs

E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, S. Tomov, Software for GPUs, GPU Computing Gems, vol. 2, 2011.

QR Performance

E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, S. Tomov, IEEE International Parallel and Distributed Processing Symposium, 2011.

8 AMD x86 cores + 4 Tesla GPUs

QR Performance

+~200 Gflop/s but 12 cores = ~150 Gflop/s

8 AMD x86 cores + 4 Tesla GPUs

E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, S. Tomov, IEEE International Parallel and Distributed Processing Symposium, 2011.

Performance Breakdown

Task distribution observed on StarPU:

— sgeqrt : 20% of tasks on GPUs

— stsmqr : 92.5% of tasks on GPUs

Taking advantage of heterogeneity !

— Only do what you are good for

— Don’t do what you are not good for

Kernel    CPU           GPU           Speedup
sgeqrt    9 Gflop/s     60 Gflop/s    ~ 6
stsqrt    12 Gflop/s    67 Gflop/s    ~ 6
sormqr    8.5 Gflop/s   227 Gflop/s   ~ 27
stsmqr    10 Gflop/s    285 Gflop/s   ~ 27

LU Performance

E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief, and S. Tomov, ACS/IEEE International Conference on Computer Systems and Applications (best paper award), 2011.

8 Intel x86 cores + 3 Tesla GPUs

Symmetric Matrix Inversion

A−1, Seriously???

YES!

Critical component of the variance-covariance matrix computation in statistics

Three steps:

— Cholesky factorization (DPOTRF)

— Inverting the Cholesky factor (DTRTRI)

— Calculating the product of the inverted Cholesky factor with its transpose (DLAUUM)

Built on previous work from E. Agullo, H. Bouwmeester, J. Dongarra, J. Kurzak, J. Langou, and L. Rosenberg
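For reference, the three steps compose as follows on a single CPU with plain LAPACKE calls. This is a sketch only: the tile/StarPU version runs the same three sweeps as DAGs of tasks across CPUs and GPUs, it assumes the LAPACKE wrappers for dpotrf, dtrtri, and dlauum are available, and DPOTRI combines steps 2 and 3 in one call.

#include <lapacke.h>

/* In-place inverse of a symmetric positive definite matrix (lower storage). */
int spd_inverse_lower(int n, double *A, int lda)
{
    int info;
    info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, A, lda);      /* A = L * L^T            */
    if (info != 0) return info;
    info = LAPACKE_dtrtri(LAPACK_COL_MAJOR, 'L', 'N', n, A, lda); /* L -> L^{-1}            */
    if (info != 0) return info;
    return LAPACKE_dlauum(LAPACK_COL_MAJOR, 'L', n, A, lda);      /* A^{-1} = L^{-T} L^{-1} */
}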

Scheduling Algorithms as DAGs

[Figure: DAGs of POTRF, TRTRI, and LAUUM on 48 cores; matrix size 4000 x 4000, tile size 200 x 200]

A−1 Performance

H. Ibeid, D. Kaushik, D. Keyes and H. Ltaief, HIPC'11, India

8 Intel x86 cores + 2 Fermi GPUs

Summary and Future Directions

Two methodologies for solving challenging DLA

Static Scheduling (performance)

Dynamic Scheduling (productivity)

LAPACK compliant API

Source codes freely available in MAGMA

What’s next?

— Extended numerical functionality

— Distributed Memory Systems

Collaborators / Support

MAGMA [Matrix Algebra on GPU and Multicore Architectures] team

http://icl.cs.utk.edu/magma/

PLASMA [Parallel Linear Algebra for Scalable Multicore Architectures] team

http://icl.cs.utk.edu/plasma

Collaborating partners:

University of Tennessee, Knoxville

University of California, Berkeley

University of Colorado, Denver

INRIA, France

KAUST, Saudi Arabia

Questions?