
Page 1: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


CS 267 Applications of Parallel Processors

Lecture 13: Parallel Matrix Multiply

Kathy Yelick

http://www.cs.berkeley.edu/~dmartin/cs267

Page 2: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Outline

- Recap

- Sources of large dense linear systems

- BLAS for linear algebra

- Parallel Matrix multiply

Page 3: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Model overview

• Work-depth

• PRAM

• Latency/bandwidth model
  – α is the one-time cost per message (latency)
  – β is the per-byte cost of communication
  – Use this model today (summarized after this list)

• LogP model
  – correction: the gap should be greater than the overhead
  – more on this with parallel sorting

• Topology-specific models
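For reference, in the latency/bandwidth model (standard formulation; the symbols are restored here rather than quoted from the slide) the time to send one message of n words is

  time ≈ α + n*β

so an algorithm's communication cost is roughly (number of messages)*α + (volume of data sent)*β.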

Page 4: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Dense Linear Algebra in Electromagnetics

Page 5: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Computational Electromagnetics

• developed during 1980s, driven by defense applications

• determine the RCS (radar cross section) of airplane

• reduce signature of plane (stealth technology)

• other applications are antenna design, medical equipment

• two fundamental numerical approaches:

• MOM: method of moments (frequency domain), and

• finite differences (time domain)

Page 6: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Computational Electromagnetics

image: NW Univ. Comp. Electromagnetics Laboratory http://nueml.ece.nwu.edu/

- discretize surface into triangular facets using standard modeling tools

- amplitudes of the currents on the surface are the unknowns

- integral equation is discretized into a set of linear equations

Page 7: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Computational Electromagnetics (MOM)

After discretization the integral equation

has the form Z J = V where

Z is the impedance matrix, J is the unknown vector of amplitudes, and V is the excitation vector.

(see Cwik, Patterson, and Scott, Electromagnetic Scattering on the Intel Touchstone Delta, IEEE Supercomputing ‘92, pp 538 - 542)

Page 8: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Computational Electromagnetics (MOM)

The main steps in the solution process are:

A) computing the matrix elements

B) factoring the dense matrix

C) solving for one or more excitations (RHS)

D) computing the fields scattered from the object

Page 9: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Analysis of MOM for Parallel Implementation

Task Work Parallelism Parallel Speed

Fill O(n**2) embarrassing low

Factor O(n**3) moderately diff. very high

Solve O(n**2) moderately diff. high

Field Calc. O(n) embarrassing high

For most scientific applications the biggest gain in performance is from parallelism within each task.

Page 10: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Results for Parallel Implementation on Delta

Task Time (hours)

Fill 9.20

Factor 8.25

Solve 2.17

Field Calc. 0.12

The problem solved was for a matrix of size 48,672. (The world record in 1991.)

Page 11: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Current Records for Solving Dense Systems

Year      Size of system    Machine

1950's    O(100)
1991      55,296            CM-2
1992      75,264            Intel
1993      75,264            Intel
1994      76,800            CM-5
1995      128,600           Paragon XP
1996      215,000           ASCI Red (Tflop)

source: Alan Edelman http://www-math.mit.edu/~edelman/records.html

Page 12: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Sources for large dense linear systems

- Not many basic factorizations outside CEM

- Large dense eigen problems used in chemistry

- Alternatives are often debated

- The choice of algorithms in existing codes is often not the result of careful planning and design:
  - they reflect the state of the art at the time, or
  - may be purely coincidental

Page 13: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Solving Large Dense Linear Systems

see Demmel http://HTTP.CS.Berkeley.EDU/~demmel/cs267/lecture12/lecture12.html

Gaussian elimination to solve Ax = b where A is a dense matrix:

• Add multiples of each row to subsequent rows in order to create zeros below the diagonal.

• End up with an upper triangular matrix U.

• Solve the linear system with U by substitution, starting with the last variable.

Solving these systems uses basic vector and matrix operations called BLAS.
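As a rough illustration of the process just described (a minimal sketch without pivoting or blocking, not code from the lecture; the matrix, right-hand side, and sizes are made up):

    #include <stdio.h>

    #define N 3

    /* Reduce A to upper triangular form in place, applying the same
       row operations to the right-hand side b (no pivoting).         */
    static void eliminate(double A[N][N], double b[N]) {
        for (int k = 0; k < N; k++)
            for (int i = k + 1; i < N; i++) {
                double m = A[i][k] / A[k][k];   /* multiplier            */
                for (int j = k; j < N; j++)     /* zero out column k     */
                    A[i][j] -= m * A[k][j];
                b[i] -= m * b[k];
            }
    }

    /* Back substitution: solve U x = b starting with the last variable. */
    static void back_solve(const double A[N][N], const double b[N], double x[N]) {
        for (int i = N - 1; i >= 0; i--) {
            double s = b[i];
            for (int j = i + 1; j < N; j++) s -= A[i][j] * x[j];
            x[i] = s / A[i][i];
        }
    }

    int main(void) {
        double A[N][N] = {{2, 1, 1}, {4, 3, 3}, {8, 7, 9}};
        double b[N] = {4, 10, 24}, x[N];
        eliminate(A, b);
        back_solve(A, b, x);
        for (int i = 0; i < N; i++) printf("x[%d] = %g\n", i, x[i]);
        return 0;
    }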

Page 14: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Parallel Matrix Multiply

Page 15: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Parallel Matrix Multiply

• Computing C = C + A*B

• Using the basic algorithm: 2*n^3 flops

• Variables are:
  – Data layout
  – Topology of machine
  – Scheduling of communication

• Use of performance models for algorithm design

Page 16: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


1D Layout

• Assume matrices are nxn and n is divisible by p

• A(i) refers to the n by n/p block column that processor i owns (similarly for B(i) and C(i))

• B(i,j) is the n/p by n/p sub-block of B(i) – in rows j*n/p through (j+1)*n/p - 1

[1D block-column layout: p0 | p1 | p2 | p3 | p4 | p5 | p6 | p7]
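A tiny sketch of the index arithmetic implied by this layout (an illustration, not from the slides; assumes p divides n):

    #include <stdio.h>

    int main(void) {
        int n = 16, p = 8;                      /* example sizes           */
        for (int i = 0; i < p; i++) {
            int first = i * (n / p);            /* first column owned by i */
            int last  = (i + 1) * (n / p) - 1;  /* last column owned by i  */
            printf("processor %d owns columns %d..%d of A, B, and C\n",
                   i, first, last);
        }
        return 0;
    }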

Page 17: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Matrix Multiply: 1D Layout on Bus

• Algorithm uses the formula

C(i) = C(i) + A*B(i) = C(i) + Σ_{j=0..p-1} A(j)*B(j,i)

• First consider a bus-connected machine without broadcast: only one pair of processors can communicate at a time.

• Second consider a bus-connected machine with broadcast: one processor may send to many others in a single step.

Page 18: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


MatMul on 1D Bus without Broadcast

Naïve algorithm:

C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)

for i = 0 to p-1

for j = 0 to p-1 except i

if (myproc == i) send A(i) to processor j // message passing

if (myproc == j)

receive A(i) from processor i

C(myproc) = C(myproc) + A(i)*B(i,myproc)

barrier

Cost of inner loop:

computation: 2*n*(n/p)^2 = 2*n^3/p^2

communication: α + β*n^2/p   // approximately

Page 19: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Naïve MatMul (continued)

Cost of inner loop:

computation: 2*n*(n/p)^2 = 2*n^3/p^2

communication: α + β*n^2/p   // approximately

Only 1 pair of processors (i and j) is active on any iteration,

and of those, only the receiving processor j is doing computation

=> the algorithm is almost entirely serial

Running time: (p*(p-1) + 1)*computation + p*(p-1)*communication

~= 2*n^3 + p^2*α + p*n^2*β, which is worse than the serial time and grows with p
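Plugging the per-iteration costs above into this running time (a quick check, not part of the original slide):

  (p*(p-1) + 1) * 2*n^3/p^2  ~=  2*n^3
  p*(p-1) * (α + β*n^2/p)    ~=  p^2*α + p*n^2*β

so the serial 2*n^3 term is not reduced at all, while the communication terms grow with p.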

Page 20: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Better MatMul on a Bus

Remove the barrier and send A(i) multiple times:

C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)

for i = 0 to myproc-1

receive A(i) from processor i

C(myproc) = C(myproc) + A(i)*B(i,myproc)

for i = 0 to p-1 except myproc

send A(myproc) to processor i

for i = myproc+1 to p-1

receive A(i) from processor i

C(myproc) = C(myproc) + A(i)*B(i,myproc)

Program is indeterminate: sends/receives may happen in different orders
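A rough MPI sketch of this pipelined version (an illustration under assumptions, not the lecture's own code: p divides n, each process stores its n x (n/p) block columns column-major, and the data values are arbitrary):

    #include <mpi.h>
    #include <stdlib.h>

    /* C(my) += A(i) * B(i, my): Cl and Bl are my n x np block columns
       (leading dimension n), Ablk is processor i's block column of A, and
       the relevant np x np sub-block of Bl starts at row i*np.             */
    static void local_update(double *Cl, const double *Ablk, const double *Bl,
                             int n, int np, int i) {
        for (int c = 0; c < np; c++)
            for (int k = 0; k < np; k++) {
                double bkc = Bl[(size_t)c * n + i * np + k];
                for (int r = 0; r < n; r++)
                    Cl[(size_t)c * n + r] += Ablk[(size_t)k * n + r] * bkc;
            }
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int p, me;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);

        int n = 8 * p, np = n / p;                 /* example sizes */
        double *A  = calloc((size_t)n * np, sizeof *A);
        double *B  = calloc((size_t)n * np, sizeof *B);
        double *C  = calloc((size_t)n * np, sizeof *C);
        double *Ai = malloc((size_t)n * np * sizeof *Ai);
        for (int e = 0; e < n * np; e++) { A[e] = me + 1.0; B[e] = 1.0; }

        /* local term first: C(my) += A(my) * B(my, my) */
        local_update(C, A, B, n, np, me);

        /* receive A(i) from lower-ranked processors and use each right away */
        for (int i = 0; i < me; i++) {
            MPI_Recv(Ai, n * np, MPI_DOUBLE, i, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            local_update(C, Ai, B, n, np, i);
        }
        /* send my block column of A to every other processor */
        for (int i = 0; i < p; i++)
            if (i != me)
                MPI_Send(A, n * np, MPI_DOUBLE, i, 0, MPI_COMM_WORLD);
        /* receive A(i) from higher-ranked processors */
        for (int i = me + 1; i < p; i++) {
            MPI_Recv(Ai, n * np, MPI_DOUBLE, i, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            local_update(C, Ai, B, n, np, i);
        }

        free(A); free(B); free(C); free(Ai);
        MPI_Finalize();
        return 0;
    }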

Page 21: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Performance of “Better” Algorithm

• Intuitively, if a computation step is sufficiently large compared to a communication step, efficiency will be good
  – communication: α + n*(n/p)*β
  – computation: 2*n^3/p^2

• Assume computation > communication

i-j represents communication from i to j

iC represents computation step by processor i

Page 22: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Performance of “Better” Algorithm

• Timeline of execution

• If computation <= (p-2)*communication
  – No bubbles in the pipeline
  – Time = p*(p-1)*communication + 2*computation
  – Time <= (p^2 + p - 4)*communication
  – If communication = computation/(p-2), this is close to the lower bound of 2*n^3/p

• If communication is faster, there is only a small impact on performance

[Timeline diagram: processor 0 computes (0C), then sends A(0) to processors 1, 2, ..., p-1 (0-1, 0-2, ..., 0-(p-1)); each subsequent processor computes its updates (1C, 2C, ..., (p-1)C) and sends its own block, with computation overlapping the communication on the bus.]

Page 23: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


MatMul: 1D Layout and Broadcast

• Modify the previous algorithm so that each send of A(i) is a broadcast (assumed to take 1 communication step)

• Time is now 2*n^3/p + p*α + n^2*β

• Again, we require n >> p for good efficiency
  – p times less communication time
  – broadcast helps performance (as expected)
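On a message-passing machine the broadcast step could be written with a collective operation; a minimal sketch (assuming MPI; the function and buffer names are illustrative, not from the slides):

    #include <mpi.h>

    /* Broadcast processor i's block column A_i (n x n/p doubles) to all
       processors; every processor, including the root i, makes this call. */
    void broadcast_block_column(double *A_i, int n, int p, int i) {
        MPI_Bcast(A_i, n * (n / p), MPI_DOUBLE, i, MPI_COMM_WORLD);
    }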

Page 24: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


MatMul with 2D Layout

• Consider processors in 2D grid (physical or logical)

p(0,0) p(0,1) p(0,2)

p(1,0) p(1,1) p(1,2)

p(2,0) p(2,1) p(2,2)

Page 25: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Cannon’s Algorithm

• C(i,j) = C(i,j) + Σ_{k=0..s-1} A(i,k)*B(k,j)

• Algorithm (s = sqrt(p)):

  for i = 0 to s-1                          // skew A
    left-circular-shift row i of A by i
    so that A(i,j) is overwritten by A(i, (j+i) mod s)
  for i = 0 to s-1                          // skew B
    up-circular-shift column i of B by i
    so that B(i,j) is overwritten by B((i+j) mod s, j)
  for k = 0 to s-1
    for i = 0 to s-1 and j = 0 to s-1
      C(i,j) = C(i,j) + A(i,j)*B(i,j)
    left-circular-shift each row of A by 1
    up-circular-shift each column of B by 1
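A minimal serial simulation of these shifts (a sketch for illustration, not the lecture's code: the whole s x s grid of b x b blocks lives in one address space, Bm stands in for the matrix B, and the sizes and test data are arbitrary):

    #include <stdio.h>
    #include <string.h>

    #define S 3          /* sqrt(p): the grid is S x S  */
    #define B 2          /* block size                  */

    /* multiply one block pair: Cblk += Ablk * Bblk */
    static void block_mult(double *Cblk, const double *Ablk, const double *Bblk) {
        for (int i = 0; i < B; i++)
            for (int k = 0; k < B; k++)
                for (int j = 0; j < B; j++)
                    Cblk[i*B+j] += Ablk[i*B+k] * Bblk[k*B+j];
    }

    int main(void) {
        /* A[i][j], Bm[i][j], C[i][j] are the blocks held by grid processor (i,j) */
        static double A[S][S][B*B], Bm[S][S][B*B], C[S][S][B*B];
        double tmp[S][B*B];
        for (int i = 0; i < S; i++)
            for (int j = 0; j < S; j++)
                for (int e = 0; e < B*B; e++) {
                    A[i][j][e]  = i + j + e;       /* arbitrary test data */
                    Bm[i][j][e] = i - j + e;
                }

        /* skew: shift row i of A left by i, column j of B up by j */
        for (int i = 0; i < S; i++) {
            for (int j = 0; j < S; j++) memcpy(tmp[j], A[i][(j+i)%S], sizeof tmp[j]);
            for (int j = 0; j < S; j++) memcpy(A[i][j], tmp[j], sizeof tmp[j]);
        }
        for (int j = 0; j < S; j++) {
            for (int i = 0; i < S; i++) memcpy(tmp[i], Bm[(i+j)%S][j], sizeof tmp[i]);
            for (int i = 0; i < S; i++) memcpy(Bm[i][j], tmp[i], sizeof tmp[i]);
        }

        /* s steps of multiply-and-shift */
        for (int step = 0; step < S; step++) {
            for (int i = 0; i < S; i++)
                for (int j = 0; j < S; j++)
                    block_mult(C[i][j], A[i][j], Bm[i][j]);
            /* shift each row of A left by 1, each column of B up by 1 */
            for (int i = 0; i < S; i++) {
                for (int j = 0; j < S; j++) memcpy(tmp[j], A[i][(j+1)%S], sizeof tmp[j]);
                for (int j = 0; j < S; j++) memcpy(A[i][j], tmp[j], sizeof tmp[j]);
            }
            for (int j = 0; j < S; j++) {
                for (int i = 0; i < S; i++) memcpy(tmp[i], Bm[(i+1)%S][j], sizeof tmp[i]);
                for (int i = 0; i < S; i++) memcpy(Bm[i][j], tmp[i], sizeof tmp[i]);
            }
        }
        printf("C(0,0) first entry = %g\n", C[0][0][0]);
        return 0;
    }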

Page 26: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Communication in Cannon

Page 27: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


BLAS Review

Page 28: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Fast linear algebra kernels: BLAS

• Simple linear algebra kernels such as matrix-matrix multiply

• More complicated algorithms can be built from these basic kernels.

• The interfaces of these kernels have been standardized as the Basic Linear Algebra Subroutines (BLAS).

• Early agreement on standard interface (~1980)

• Led to portable libraries for vector and shared memory parallel machines.

• On distributed memory, there is a less-standard interface called the PBLAS

Page 29: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Level 1 BLAS

• Operate on vectors or pairs of vectors
  – perform O(n) operations;
  – return either a vector or a scalar.

• saxpy: y(i) = a * x(i) + y(i), for i = 1 to n
  – s stands for single precision; daxpy is for double precision, caxpy for complex, and zaxpy for double complex

• sscal: x = a * x, for scalar a and vector x (the vector is scaled in place)

• sdot computes s = Σ_{i=1..n} x(i)*y(i)
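For concreteness, these routines look like the following from C through the CBLAS interface (assuming a CBLAS implementation such as the reference BLAS, OpenBLAS, or ATLAS is installed; this example is not from the slides):

    #include <stdio.h>
    #include <cblas.h>

    int main(void) {
        float x[4] = {1, 2, 3, 4};
        float y[4] = {4, 3, 2, 1};

        cblas_saxpy(4, 2.0f, x, 1, y, 1);     /* y = 2*x + y                 */
        cblas_sscal(4, 0.5f, x, 1);           /* x = 0.5*x, scaled in place  */
        float s = cblas_sdot(4, x, 1, y, 1);  /* s = sum_i x(i)*y(i)         */

        printf("dot = %f\n", s);
        return 0;
    }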

Page 30: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Level 2 BLAS

• Operate on a matrix and a vector;
  – return a matrix or a vector;
  – O(n^2) operations

• sgemv: matrix-vector multiply
  – y = y + A*x
  – where A is m-by-n, x is n-by-1 and y is m-by-1

• sger: rank-one update
  – A = A + y*x', i.e., A(i,j) = A(i,j) + y(i)*x(j)
  – where A is m-by-n, y is m-by-1, x is n-by-1, and x' is x transpose

• strsv: triangular solve
  – solves y = T*x for x, where T is triangular
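A corresponding Level 2 example (again assuming a CBLAS interface; not from the slides), computing y = y + A*x with sgemv:

    #include <stdio.h>
    #include <cblas.h>

    int main(void) {
        float A[6] = {1, 2, 3,
                      4, 5, 6};     /* 2 x 3, row-major */
        float x[3] = {1, 1, 1};
        float y[2] = {0, 0};

        /* sgemv computes y = alpha*A*x + beta*y; here alpha = beta = 1 */
        cblas_sgemv(CblasRowMajor, CblasNoTrans, 2, 3,
                    1.0f, A, 3, x, 1, 1.0f, y, 1);

        printf("y = [%g, %g]\n", y[0], y[1]);
        return 0;
    }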

Page 31: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Level 3 BLAS

• Operate on pairs or triples of matrices
  – returning a matrix;
  – complexity is O(n**3)

• sgemm: matrix-matrix multiplication
  – C = C + A*B,
  – where C is m-by-n, A is m-by-k, and B is k-by-n

• strsm: multiple triangular solve
  – solves Y = T*X for X,
  – where T is a triangular matrix and X is a rectangular matrix
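And a Level 3 example (assuming a CBLAS interface; not from the slides), computing C = C + A*B with sgemm:

    #include <stdio.h>
    #include <cblas.h>

    int main(void) {
        enum { M = 2, K = 3, N = 2 };
        float A[M*K] = {1, 2, 3,
                        4, 5, 6};       /* 2 x 3, row-major */
        float B[K*N] = {1, 0,
                        0, 1,
                        1, 1};          /* 3 x 2, row-major */
        float C[M*N] = {0, 0, 0, 0};    /* 2 x 2            */

        /* sgemm computes C = alpha*A*B + beta*C; here alpha = beta = 1 */
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K, 1.0f, A, K, B, N, 1.0f, C, N);

        printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
        return 0;
    }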

Page 32: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Performance of BLAS

[Performance plot comparing Level 1, Level 2, and Level 3 BLAS; the Level 3 routines achieve the highest performance.]

Page 33: CS 267 Applications of Parallel Processors Lecture 13: Parallel Matrix Multiply


Performance of BLAS

• The BLAS are specially optimized by the vendor
  – the Sun BLAS uses features of the UltraSPARC

• Big payoff for algorithms that can be expressed in terms of the BLAS3 instead of BLAS2 or BLAS1.

• The top speed is achieved by the BLAS3

• Algorithms like Gaussian elimination are organized so that they use the BLAS3
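For instance (a sketch under assumptions, not the lecture's code), the bulk of the work in one step of blocked Gaussian elimination is a trailing-submatrix update A22 = A22 - A21*A12, which maps onto a single sgemm call; the names A21, A12, A22 and the row-major storage are illustrative:

    #include <cblas.h>

    /* Trailing-submatrix update in blocked Gaussian elimination:
       A22 = A22 - A21*A12, where A21 is m x k, A12 is k x n, A22 is m x n
       (all row-major; CBLAS interface assumed).                            */
    void trailing_update(int m, int n, int k,
                         const float *A21, const float *A12, float *A22) {
        /* sgemm with alpha = -1 and beta = 1 performs the subtraction */
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, -1.0f, A21, k, A12, n, 1.0f, A22, n);
    }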