a brief introduction to parallel programming...
TRANSCRIPT
A brief introduction to parallel programming models
Eduard Ayguadé
Computational Sciences Dep., Barcelona Supercomputing Center
Computer Architecture Dep., Technical University of Catalunya
Scalable programming models

Shared-memory programming models
- threading model
- memory model (consistency)
- work creation and distribution
- synchronization

Distributed-memory programming models
- process model
- data distribution
- data movement
- synchronization

Chip-level programming models
- multicore, SMT, …
- streaming
- local memories, DMA transfers, …

Grid programming models
- granularity issues
- heterogeneous systems
- dynamicity in availability of resources

[Figure: future Petaflop systems span several scales, each matched to a programming model: chip (chip-level models); on-board SMP and cc-NUMA (shared memory, with hardware DSM: thread affinity, first/next-touch page migration); small DMM and large cluster systems (distributed memory, with software DSM: paged shared memory, data movement overlapped with computation); and the Grid.]
Hybrid programming models
Most modern high-performance computing systems are clusters of SMP nodes with multicore chips
Programming models must specify:
- how computation is distributed
- how data is distributed and how it is accessed
- how to avoid data races
Hybrid vs. single programming model
[Figure: four SMP nodes, each with several multicore chips (MC) sharing a local memory, connected through an interconnection network. MC: multicore. SMP: symmetric multiprocessor.]
Shared address space
Programmer needs to:
- distribute work (all threads can access data, heap and stacks)
- use synchronization mechanisms to avoid data races
[Figure: address-space layout. A sequential process has Code, Data + Heap and one Stack; a multithreaded process shares Code and Data + Heap among threads, each of which has its own Stack.]
Example: heat distribution

[Figure: a rod discretized into points x_i, with fixed temperatures Temp=20 at the left end and Temp=100 at the right end; interior points start at Temp=0. The temperature distribution is shown at Time=0, 1, 2, 3, …, 99, 100.]

Each point becomes the average of its two neighbours at the previous time step:

    x_i(t) = (x_{i-1}(t-1) + x_{i+1}(t-1)) / 2
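In C, one time step of this update rule is a one-line stencil loop; a minimal sketch (the function name heat_step is mine, not from the slides):

```c
#include <stddef.h>

/* One time step of the heat-distribution stencil:
   y[i] = 0.5 * (x[i-1] + x[i+1]) for interior points 1..n,
   with the fixed boundary temperatures kept in x[0] and x[n+1]. */
void heat_step(const float *x, float *y, size_t n)
{
    for (size_t i = 1; i <= n; i++)
        y[i] = 0.5f * (x[i - 1] + x[i + 1]);
}
```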
Heat distribution: sequential program
int t;                 /* time step */
int i;                 /* array index */
float x[n+2], y[n+2];  /* temperatures (n interior points + 2 boundaries) */

x[0] = y[0] = 20; x[n+1] = y[n+1] = 100;
for (i = 1; i < n+1; i++)
    x[i] = 0;

for (t = 1; t <= 100; t++) {
    for (i = 1; i < n+1; i++)
        y[i] = 0.5 * (x[i-1] + x[i+1]);
    swap(x, y);
}
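The swap(x, y) call in the sequential program is not a standard C routine; it is usually implemented by exchanging the two array pointers rather than copying elements. A sketch under that assumption (heat_solve and its parameters are illustrative names):

```c
/* Runs `steps` time steps, ping-ponging between buffers x and y,
   and returns the buffer that holds the final temperatures.
   Both buffers must carry the boundary values in [0] and [n+1]. */
float *heat_solve(float *x, float *y, int n, int steps)
{
    for (int t = 1; t <= steps; t++) {
        for (int i = 1; i <= n; i++)
            y[i] = 0.5f * (x[i - 1] + x[i + 1]);
        float *tmp = x; x = y; y = tmp;   /* swap(x, y) by pointer exchange */
    }
    return x;
}
```

Swapping pointers lets each time step read the previous step's values from one buffer while writing the new values into the other, with no copying.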
Heat distribution: shared memory
int t;                          /* time step */
int i;
int id;                         /* my processor id */
shared float x[1002];           /* temperatures */
int leftmost, rightmost;        /* processor boundary points */
float x_leftmost, x_rightmost;  /* their temperatures */

id = GetID();
leftmost = id*100 + 1; rightmost = (id+1)*100;
if (id == 0) { x[0] = 20; x[1001] = 100; }
for (i = leftmost; i <= rightmost; i++) x[i] = 0;

for (t = 1; t <= 100; t++) {
    barrier(n);                 /* n = number of processors */
    x_leftmost  = 0.5 * (x[leftmost-1]  + x[leftmost+1]);
    x_rightmost = 0.5 * (x[rightmost-1] + x[rightmost+1]);
    barrier(n);
    for (i = leftmost+1; i < rightmost; i++)
        x[i] = 0.5 * (x[i-1] + x[i+1]);
    x[leftmost]  = x_leftmost;
    x[rightmost] = x_rightmost;
}
barrier(n);
[Figure: 1002 points distributed among 10 processors; each processor owns 100 interior points.]
Altix 4700: ccNUMA shared memory

ccNUMA: cache-coherent Non-Uniform Memory Access

[Figure: all blades see a single shared memory.]
Altix 4700: blade, IRU and Rack

Altix 4700: C2 blade
Each blade contains 2 sockets for Itanium2 Montecito dual-core processors, an SHub 2.0 with FSB at 533 MHz (*), 12 DDR2 memory sockets and 2 NumaLink4 channels (6.4 GB/s each).

(*) 16 bytes x 533 MHz = 8.5 GB/s
[Blade diagram: the two Montecito sockets share an 8.5 GB/s front-side bus to the SHub 2.0, which also connects to the DDR2 memory sockets and to the two NumaLink4 (NL4) channels.]
Altix 4700: Individual Rack Unit (physical)

Altix 4700: Individual Rack Unit (logic)
Altix 4700: 32 and 64 processor system
6 x 6.4 = 38.4 GB/s = 3.84 GB/s/blade
8 x 6.4 = 51.2 GB/s = 2.56 GB/s/blade
Altix 4700: 128 processor system
Systems larger than 32 blades require additional dense router modules, each with 4 8x8 crossbar ASICs, supporting up to 256 compute blades.

Systems larger than 256 compute blades require additional SHubs that create 1D or 2D meshes (up to 4096 compute blades).
8 x 6.4 = 51.2 GB/s = 1.28 GB/s/blade
Altix 4700: up to 256 compute blades
Shared address space

Programmer needs to:
- distribute work (all threads can access data, heap and stacks)
- use synchronization mechanisms to avoid data races

But memory is not flat: distribute the work so as to benefit from spatial and temporal locality.
NASA Columbia Supercomputer
20 SGI® Altix™ 3700 superclusters, each with 512 Itanium2 processors (1.5 GHz, 6 MB cache)
Infiniband network to connect superclusters
Distributed address space
Programmer needs to:
- distribute work
- distribute data
- use communication mechanisms to share data explicitly
- use synchronization mechanisms to avoid data races
[Figure: each node runs a multithreaded process with its own Code, Data + Heap and per-thread Stacks in shared memory; nodes communicate through the distributed memory.]
Message passing programming model
- Send specifies the buffer to be transmitted and the receiving process
- Receive specifies the sending process and the application storage to receive into
- Optional tag on send and matching rule on receive
- Implicit synchronization (e.g. non-blocking send and blocking receive)
- Many overheads: copying, buffer management, protection
[Figure: process P executes "Send Q, X, tag" and process Q executes "Receive P, Y, tag"; the tag matches the send with the receive, copying the data at address X in P's local address space to address Y in Q's local address space.]
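The matching rule can be illustrated with a toy in-process mailbox; everything here (toy_send, toy_receive, the struct layout) is an illustrative sketch, not a real message-passing API:

```c
#define MAXMSG 16

/* One pending message: sender id, tag, and a copy of the payload. */
struct msg { int src, tag; float data; int valid; };
static struct msg mailbox[MAXMSG];

/* "Send Q, X, tag": deposit a copy of X for the receiver to pick up. */
static int toy_send(int src, float x, int tag)
{
    for (int i = 0; i < MAXMSG; i++)
        if (!mailbox[i].valid) {
            mailbox[i] = (struct msg){ src, tag, x, 1 };
            return 0;
        }
    return -1;   /* mailbox full */
}

/* "Receive P, Y, tag": find a message matching (sender, tag),
   copy its payload into *y, and consume it. */
static int toy_receive(int src, float *y, int tag)
{
    for (int i = 0; i < MAXMSG; i++)
        if (mailbox[i].valid && mailbox[i].src == src && mailbox[i].tag == tag) {
            *y = mailbox[i].data;
            mailbox[i].valid = 0;
            return 0;
        }
    return -1;   /* no matching message yet */
}
```

A real blocking receive would wait for a matching message instead of returning -1, and a real implementation pays the copying and buffer-management overheads mentioned above.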
Heat distribution: message passing
int t;                 /* time step */
int i;
int id;                /* my processor id */
int num_points;        /* number of points per processor */
float x[-1:(1000/P)];  /* temperatures, including shadow points */

num_points = 1000/P;
id = GetID();
if (id == 0)     x[-1] = 20;
if (id == (P-1)) x[num_points] = 100;
for (i = 0; i < num_points; i++) x[i] = 0;

for (t = 1; t <= 100; t++) {
    /* send boundary values first (non-blocking sends); receiving
       first on every processor would deadlock */
    if (id > 0)   send(id-1, x[0]);
    if (id < P-1) send(id+1, x[num_points-1]);
    /* blocking receives fill the shadow points */
    if (id < P-1) receive(id+1, &x[num_points]);
    if (id > 0)   receive(id-1, &x[-1]);
    for (i = 0; i < num_points; i++)
        x[i] = 0.5 * (x[i-1] + x[i+1]);
}
[Figure: 1002 points distributed among 10 processors; each processor id stores its own points plus two shadow points holding the time t-1 boundary values of its neighbours id-1 and id+1.]
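Because each "process" only ever touches its own points and its two shadow points, the whole scheme can be mimicked in one sequential program, with each send/receive pair replaced by an explicit shadow copy. A sketch (all names are illustrative; the local update here uses a temporary buffer, i.e. a pure Jacobi step):

```c
#define P     4     /* number of "processes" (the slides use 10)     */
#define NP    2     /* points owned per process (the slides use 100) */
#define STEPS 500

/* chunk[p][0] and chunk[p][NP+1] are the shadow points;
   chunk[p][1..NP] are the points owned by "process" p.   */
static float chunk[P][NP + 2];

/* Returns the final temperature of local point i on process p. */
float heat_distributed(int p, int i)
{
    for (int q = 0; q < P; q++)
        for (int j = 0; j <= NP + 1; j++)
            chunk[q][j] = 0.0f;
    chunk[0][0] = 20.0f;          /* left boundary lives on process 0    */
    chunk[P-1][NP+1] = 100.0f;    /* right boundary lives on process P-1 */

    for (int t = 1; t <= STEPS; t++) {
        /* "communication": copy boundary values into the neighbours'
           shadow points (stands in for the send/receive pairs)       */
        for (int q = 0; q < P - 1; q++) {
            chunk[q+1][0]  = chunk[q][NP];
            chunk[q][NP+1] = chunk[q+1][1];
        }
        /* local computation on owned points only */
        for (int q = 0; q < P; q++) {
            float newv[NP + 2];
            for (int j = 1; j <= NP; j++)
                newv[j] = 0.5f * (chunk[q][j-1] + chunk[q][j+1]);
            for (int j = 1; j <= NP; j++)
                chunk[q][j] = newv[j];
        }
    }
    return chunk[p][i];
}
```

The same structure maps directly onto the message-passing version: the two shadow copies per neighbour pair become a send and a matching receive, and the local loop is unchanged.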