parallel matlab programming using distributed arrays · slide-1 parallel matlab mit lincoln...

28
Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln Laboratory This work is sponsored by the Department of Defense under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.

Upload: others

Post on 04-Sep-2019

17 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

Slide-1Parallel MATLAB

MIT Lincoln Laboratory

Parallel Matlab programming using Distributed Arrays

Jeremy Kepner

MIT Lincoln Laboratory

This work is sponsored by the Department of Defense under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not

necessarily endorsed by the United States Government.

Page 2: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-2

Parallel MATLAB

Goal: Think Matrices not Messages

Programmer Effort

Perf

orm

ance

Spe

edup 100

10

1

0.1hours days weeks months

acceptable

hardware limitExpert

Novice

• In the past, writing well performing parallel programs has required a lot of code and a lot of expertise

• pMatlab distributed arrays eliminates the coding burden– However, making programs run fast still requires expertise

• This talk illustrates the key math concepts experts use to make parallel programs perform well

Page 3: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-3

Parallel MATLAB

• Serial Program• Parallel Execution• Distributed Arrays• Explicitly Local

Outline

• Parallel Design

• Distributed Arrays

• Concurrency vs Locality

• Execution

• Summary

Page 4: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-4

Parallel MATLAB

Serial Program

• Matlab is a high level language• Allows mathematical expressions to be written concisely• Multi-dimensional arrays are fundamental to Matlab

Y = X + 1

X,Y : NxN

Y(:,:) = X + 1;

X = zeros(N,N);Y = zeros(N,N);

Math Matlab

Page 5: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-5

Parallel MATLAB

Pid=Np-1

Pid=1PID=NP-1

PID=1Pid=0

PID=0

Parallel Execution

• Run NP (or Np) copies of same program– Single Program Multiple Data (SPMD)

• Each copy has a unique PID (or Pid)• Every array is replicated on each copy of the program

Y = X + 1

X,Y : NxN

Y(:,:) = X + 1;

X = zeros(N,N);Y = zeros(N,N);

Math pMatlab

Page 6: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-6

Parallel MATLAB

Pid=Np-1

Pid=1Pid=0

PID=NP-1

PID=1PID=0

Distributed Array Program

• Use P() notation (or map) to make a distributed array• Tells program which dimension to distribute data• Each program implicitly operates on only its own data

(owner computes rule)

Y = X + 1

X,Y : P(N)xN

Y(:,:) = X + 1;

XYmap = map([Np N1],{},0:Np-1);X = zeros(N,N,XYmap);Y = zeros(N,N,XYap);

Math pMatlab

Page 7: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-7

Parallel MATLAB

Explicitly Local Program

• Use .loc notation (or local function) to explicitly retrieve local part of a distributed array

• Operation is the same as serial program, but with different data on each processor (recommended approach)

Y.loc = X.loc + 1

X,Y : P(N)xN

Yloc(:,:) = Xloc + 1;

XYmap = map([Np 1],{},0:Np-1);Xloc = local(zeros(N,N,XYmap));Yloc = local(zeros(N,N,XYmap));

Math pMatlab

Page 8: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-8

Parallel MATLAB

• Maps• Redistribution

Outline

• Parallel Design

• Distributed Arrays

• Concurrency vs Locality

• Execution

• Summary

Page 9: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-9

Parallel MATLAB

Parallel Data Maps

• A map is a mapping of array indices to processors• Can be block, cyclic, block-cyclic, or block w/overlap• Use P() notation (or map) to set which dimension to split

among processors

P(N)xN Xmap=map([Np 1],{},0:Np-1)

Math Matlab

0 1 2 3Computer

PID Pid

Array

NxP(N) Xmap=map([1 Np],{},0:Np-1)

P(N)xP(N) Xmap=map([Np/2 2],{},0:Np-1)

Page 10: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-10

Parallel MATLAB

Maps and Distributed Arrays

A processor map for a numerical array is an assignment of blocks of data to processing elements.

Amap = map([Np 1],{},0:Np-1);

Processor Grid

A = zeros(4,6,Amap);

0000

0000

0000

0000

0000

0000

P0P1P2P3

List of processors

pMatlab constructors are overloaded to take a map as an argument, and return a distributed array.

A =

Distribution{}=default=block

Page 11: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-11

Parallel MATLAB

Advantages of Maps

FFT along columns

Matrix Multiply

*

MAP1 MAP2Maps are scalable. Changing the number of processors or distribution does not change the application.

Maps support different algorithms. Different parallel algorithms have different optimal mappings.

Maps allow users to set up pipelinesin the code (implicit task parallelism).

foo1

foo2

foo3

foo4

%ApplicationA=rand(M,map<i>);B=fft(A);

map1=map([Np 1],{},0:Np-1) map2=map([1 Np],{},0:Np-1)

map([2 2],{},0:3)map([2 2],{},[0 2 1 3])

map([2 2],{},1)

map([2 2],{},0) map([2 2],{},2)

map([2 2],{},3)

Page 12: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-12

Parallel MATLAB

Redistribution of Data

• Different distributed arrays can have different maps• Assignment between arrays with the “=” operator causes

data to be redistributed• Underlying library determines all the message to send

Y = X + 1Y : NxP(N)

Xmap = map([Np 1],{},0:Np-1);Ymap = map([1 Np],{},0:Np-1);X = zeros(N,N,Xmap);Y = zeros(N,N,Ymap);

Y(:,:) = X + 1;

Math pMatlab

X : P(N)xN

P0P1P2P3

X =

P0P1P2P3

Y =

Data Sent

Page 13: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-13

Parallel MATLAB

• Definition• Example• Metrics

Outline

• Parallel Design

• Distributed Arrays

• Concurrency vs Locality

• Execution

• Summary

Page 14: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-14

Parallel MATLAB

Definitions

Parallel Concurrency

• Number of operations that can be done in parallel (i.e. no dependencies)

• Measured with:Degrees of Parallelism

Parallel Locality

• Is the data for the operations local to the processor

• Measured with ratio:Computation/Communication

= (Work)/(Data Moved)

• Concurrency is ubiquitous; “easy” to find

• Locality is harder to find, but is the key to performance

• Distributed arrays derive concurrency from locality

Page 15: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-15

Parallel MATLAB

Serial

• Concurrency: max degrees of parallelism = N2

• Locality– Work = N2

– Data Moved: depends upon map

Math Matlab

for i=1:Nfor j=1:NY(i,j) = X(i,j) + 1

for i=1:Nfor j=1:NY(i,j) = X(i,j) + 1;

endend

X = zeros(N,N);Y = zeros(N,N);X,Y : NxN

Page 16: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-16

Parallel MATLAB

1D distribution

• Concurrency: degrees of parallelism = min(N,NP)• Locality: Work = N2, Data Moved = 0• Computation/Communication = Work/(Data Moved) ® ¥

Math pMatlab

for i=1:Nfor j=1:NY(i,j) = X(i,j) + 1

X,Y : P(N)xN

for i=1:Nfor j=1:NY(i,j) = X(i,j) + 1;

endend

XYmap = map([NP 1],{},0:Np-1);X = zeros(N,N,XYmap);Y = zeros(N,N,XYmap);

Page 17: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-17

Parallel MATLAB

2D distribution

• Concurrency: degrees of parallelism = min(N2,NP)• Locality: Work = N2, Data Moved = 0• Computation/Communication = Work/(Data Moved) ® ¥

Math pMatlab

for i=1:Nfor j=1:NY(i,j) = X(i,j) + 1

for i=1:Nfor j=1:NY(i,j) = X(i,j) + 1;

endend

XYmap = map([Np/2 2],{},0:Np-1);X = zeros(N,N,XYmap);Y = zeros(N,N,XYmap);X,Y : P(N)xP(N)

Page 18: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-18

Parallel MATLAB

2D Explicitly Local

• Concurrency: degrees of parallelism = min(N2,NP)

• Locality: Work = N2, Data Moved = 0

• Computation/Communication = Work/(Data Moved) ® ¥

Math pMatlab

for i=1:size(X.loc,1)for j=1:size(X.loc,2)Y.loc(i,j) =

X.loc(i,j) + 1

for i=1:size(Xloc,1)for j=1:size(Xloc,2)Yloc(i,j) = Xloc(i,j) + 1;

endend

XYmap = map([Np/2 2],{},0:Np-1);Xloc = local(zeros(N,N,XYmap));Yloc = local(zeros(N,N,XYmap));X,Y : P(N)xP(N)

Page 19: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-19

Parallel MATLAB

1D with Redistribution

• Concurrency: degrees of parallelism = min(N,NP)

• Locality: Work = N2, Data Moved = N2

• Computation/Communication = Work/(Data Moved) = 1

Math pMatlab

for i=1:Nfor j=1:NY(i,j) = X(i,j) + 1

Xmap = map([Np 1],{},0:Np-1);Ymap = map([1 Np],{},0:Np-1);X = zeros(N,N,Xmap);Y = zeros(N,N,Ymap);

for i=1:Nfor j=1:NY(i,j) = X(i,j) + 1;

endend

Y : NxP(N)X : P(N)xN

Page 20: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-20

Parallel MATLAB

• Four Step Process• Speedup• Amdahl’s Law• Perforfmance vs Effort• Portability

Outline

• Parallel Design

• Distributed Arrays

• Concurrency vs Locality

• Execution

• Summary

Page 21: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-21

Parallel MATLAB

Running

• Start Matlab– Type: cd examples/AddOne

• Run dAddOne– Edit pAddOne.m and set: PARALLEL = 0;– Type: pRUN(’pAddOne’,1,{})

• Repeat with: PARALLEL = 1;

• Repeat with: pRUN(’pAddOne’,2,{});

• Repeat with: pRUN(’pAddOne’,2,{’cluster’});

• Four steps to taking a serial Matlab program and making it a parallel Matlab program

Page 22: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-22

Parallel MATLAB

Parallel Debugging Processes

• Simple four step process for debugging a parallel programSerialMatlab

SerialpMatlab

ParallelpMatlab

OptimizedpMatlab

MappedpMatlab

Functional correctness

pMatlab correctness

Parallel correctness

Performance

Step 1Add DMATs

Step 2Add Maps

Step 4Add CPUs

Step 3Add Matlabs

Add distributed matrices without maps, verify functional correctnessPARALLEL=0; pRUN(’pAddOne’,1,{});

Add maps, run on 1 processor, verify parallel correctness, compare performance with Step 1PARALLEL=1; pRUN(’pAddOne’,1,{});

Run with more processes, verify parallel correctnessPARALLEL=1; pRUN(’pAddOne’,2,{}) );

Run with more processors, compare performance with Step 2PARALLEL=1; pRUN(’pAddOne’,2,{‘cluster’});

• Always debug at earliest step possible (takes less time)

Page 23: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-23

Parallel MATLAB

Timing

• Run dAddOne: pRUN(’pAddOne’,1,{’cluster’});– Record processing_time

• Repeat with: pRUN(’pAddOne’,2,{’cluster’});– Record processing_time

• Repeat with: pRUN(’pAddone’,4,{’cluster’});– Record processing_time

• Repeat with: pRUN(’pAddone’,8,{’cluster’});– Record processing_time

• Repeat with: pRUN(’pAddone’,16,{’cluster’});– Record processing_time

• Run program while doubling number of processors• Record execution time

Page 24: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-24

Parallel MATLAB

Computing Speedup

Number of Processors

Spee

dup

• Speedup Formula: Speedup(NP) = Time(NP=1)/Time(NP)• Goal is sublinear speedup• All programs saturate at some value of NP

Page 25: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-25

Parallel MATLAB

Amdahl’s Law

Processors

Spee

dup

Smax = w|-1

NP

= w

|-1

Smax/2

Linear

w| = 0.1

• Divide work into parallel (w||) and serial (w|) fractions• Serial fraction sets maximum speedup: Smax = w|

-1

• Likewise: Speedup(NP=w|-1) = Smax/2

Page 26: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-26

Parallel MATLAB

HPC Challenge Speedup vs Effort

0.0001

0.001

0.01

0.1

1

10

100

1000

0.001 0.01 0.1 1 10Relative Code Size

Spe

edup

C + MPIMatlabpMatlab

ideal speedup = 128

FFT

HPL

STREAM

Random AccessRandom

AccessRandom Access

STREAM

FFT

HPL(32)

STREAMFFT

HPL

• Ultimate Goal is speedup with minimum effort• HPC Challenge benchmark data shows that pMatlab can

deliver high performance with a low code size

Serial C

Page 27: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-27

Parallel MATLAB

Portable Parallel Programming

Amap = map([Np 1],{},0:Np-1);Bmap = map([1 Np],{},0:Np-1);A = rand(M,N,Amap);B = zeros(M,N,Bmap);B(:,:) = fft(A);

Universal Parallel Matlab programming

• pMatlab runs in all parallel Matlab environments

• Only a few functions are needed– Np– Pid– map– local– put_local– global_index– agg– SendMsg/RecvMsg

Jeremy Kepner

Parallel MATLABfor Multicore and Multinode Systems

• Only a small number of distributed array functions are necessary to write nearly all parallel programs

• Restricting programs to a small set of functions allows parallel programs to run efficiently on the widest range of platforms

1 2 3 4

Page 28: Parallel Matlab programming using Distributed Arrays · Slide-1 Parallel MATLAB MIT Lincoln Laboratory Parallel Matlab programming using Distributed Arrays Jeremy Kepner MIT Lincoln

MIT Lincoln LaboratorySlide-28

Parallel MATLAB

SerialMATLAB

SerialpMatlab

ParallelpMatlab

OptimizedpMatlab

MappedpMatlab

Add DMATs Add Maps Add Matlabs Add CPUs

Functional correctness

pMatlab correctness

Parallel correctness

Performance

Step 1 Step 2 Step 3 Step 4

Get It Right Make It Fast

Summary

• Distributed arrays eliminate most parallel coding burden• Writing well performing programs requires expertise• Experts rely on several key concepts

– Concurrency vs Locality– Measuring Speedup– Amdahl’s Law

• Four step process for developing programs– Minimizes debugging time– Maximizes performance