a brief introduction to parallel programming...
TRANSCRIPT
A brief introduction to parallel programming models
Eduard Ayguadé
Computational Sciences Dep., Barcelona Supercomputing Center
Computer Architecture Dep., Technical University of Catalunya
Scalable programming models

Shared-memory programming models
- threading model
- memory model (consistency)
- work creation and distribution
- synchronization

Distributed-memory programming models
- process model
- data distribution
- data movement
- synchronization

Chip-level programming models
- multicore, SMT, …
- streaming
- local memories, DMA transfers, …

Grid programming models
- granularity issues
- heterogeneous systems
- dynamicity in availability of resources

[Figure: future Petaflop systems span several scales, each matched to a programming model: chip (chip-level models); on-board SMP and cc-NUMA (shared memory, with hardware DSM: thread affinity, first/next-touch page migration); small DMM and large cluster systems (distributed memory, with software DSM: paged shared memory, data movement overlapped with computation); and the Grid.]
Hybrid programming models
Most modern high-performance computing systems are clusters of SMP nodes with multicore chips
Programming models must specify:
- how computation is distributed
- how data is distributed and how it is accessed
- how to avoid data races
Hybrid vs. single programming model
[Figure: four SMP nodes, each with several multicore chips (MC) sharing a local memory, connected through an interconnection network. MC: multicore. SMP: symmetric multiprocessor.]
Shared address space
Programmer needs to:
- distribute work (all threads can access data, heap and stacks)
- use synchronization mechanisms to avoid data races
[Figure: address-space layout. A sequential process has Code, Data + Heap and one Stack; a multithreaded process shares Code and Data + Heap among threads, each of which has its own Stack.]
Example: heat distribution

[Figure: a rod discretized into points x_i, with fixed temperatures Temp=20 at the left end and Temp=100 at the right end; interior points start at Temp=0. The temperature distribution is shown at Time=0, 1, 2, 3, …, 99, 100.]

Each point becomes the average of its two neighbours at the previous time step:

    x_i(t) = (x_{i-1}(t-1) + x_{i+1}(t-1)) / 2
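In C, one time step of this update rule is a one-line stencil loop; a minimal sketch (the function name heat_step is mine, not from the slides):

```c
#include <stddef.h>

/* One time step of the heat-distribution stencil:
   y[i] = 0.5 * (x[i-1] + x[i+1]) for interior points 1..n,
   with the fixed boundary temperatures kept in x[0] and x[n+1]. */
void heat_step(const float *x, float *y, size_t n)
{
    for (size_t i = 1; i <= n; i++)
        y[i] = 0.5f * (x[i - 1] + x[i + 1]);
}
```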
Heat distribution: sequential program
int t;                 /* time step */
int i;                 /* array index */
float x[n+2], y[n+2];  /* temperatures (n interior points + 2 boundaries) */

x[0] = y[0] = 20; x[n+1] = y[n+1] = 100;
for (i = 1; i < n+1; i++)
    x[i] = 0;

for (t = 1; t <= 100; t++) {
    for (i = 1; i < n+1; i++)
        y[i] = 0.5 * (x[i-1] + x[i+1]);
    swap(x, y);
}
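The swap(x, y) call in the sequential program is not a standard C routine; it is usually implemented by exchanging the two array pointers rather than copying elements. A sketch under that assumption (heat_solve and its parameters are illustrative names):

```c
/* Runs `steps` time steps, ping-ponging between buffers x and y,
   and returns the buffer that holds the final temperatures.
   Both buffers must carry the boundary values in [0] and [n+1]. */
float *heat_solve(float *x, float *y, int n, int steps)
{
    for (int t = 1; t <= steps; t++) {
        for (int i = 1; i <= n; i++)
            y[i] = 0.5f * (x[i - 1] + x[i + 1]);
        float *tmp = x; x = y; y = tmp;   /* swap(x, y) by pointer exchange */
    }
    return x;
}
```

Swapping pointers lets each time step read the previous step's values from one buffer while writing the new values into the other, with no copying.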
Heat distribution: shared memory
int t;                          /* time step */
int i;
int id;                         /* my processor id */
shared float x[1002];           /* temperatures */
int leftmost, rightmost;        /* processor boundary points */
float x_leftmost, x_rightmost;  /* their temperatures */

id = GetID();
leftmost = id*100 + 1; rightmost = (id+1)*100;
if (id == 0) { x[0] = 20; x[1001] = 100; }
for (i = leftmost; i <= rightmost; i++) x[i] = 0;

for (t = 1; t <= 100; t++) {
    barrier(n);                 /* n = number of processors */
    x_leftmost  = 0.5 * (x[leftmost-1]  + x[leftmost+1]);
    x_rightmost = 0.5 * (x[rightmost-1] + x[rightmost+1]);
    barrier(n);
    for (i = leftmost+1; i < rightmost; i++)
        x[i] = 0.5 * (x[i-1] + x[i+1]);
    x[leftmost]  = x_leftmost;
    x[rightmost] = x_rightmost;
}
barrier(n);
[Figure: 1002 points distributed among 10 processors; each processor owns 100 interior points.]
Altix 4700: ccNUMA shared memory

ccNUMA: cache-coherent Non-Uniform Memory Access

[Figure: all blades see a single shared memory.]
Altix 4700: blade, IRU and Rack

Altix 4700: C2 blade
Each blade contains 2 sockets for Itanium2 Montecito dual-core processors, an SHub 2.0 with FSB at 533 MHz (*), 12 DDR2 memory sockets and 2 NumaLink4 channels (6.4 GB/s each).

(*) 16 bytes x 533 MHz = 8.5 GB/s
[Blade diagram: the two Montecito sockets share an 8.5 GB/s front-side bus to the SHub 2.0, which also connects to the DDR2 memory sockets and to the two NumaLink4 (NL4) channels.]
Altix 4700: Individual Rack Unit (physical)

Altix 4700: Individual Rack Unit (logic)
Altix 4700: 32 and 64 processor system
6 x 6.4 = 38.4 GB/s = 3.84 GB/s/blade
8 x 6.4 = 51.2 GB/s = 2.56 GB/s/blade
Altix 4700: 128 processor system
Systems larger than 32 blades require additional dense router modules, each with 4 8x8 crossbar ASICs, supporting up to 256 compute blades.

Systems larger than 256 compute blades require additional SHubs that create 1D or 2D meshes (up to 4096 compute blades).
8 x 6.4 = 51.2 GB/s = 1.28 GB/s/blade
Altix 4700: up to 256 compute blades
Shared address space

Programmer needs to:
- distribute work (all threads can access data, heap and stacks)
- use synchronization mechanisms to avoid data races

But memory is not flat: distribute the work so as to benefit from spatial and temporal locality.
NASA Columbia Supercomputer
20 SGI® Altix™ 3700 superclusters, each with 512 Itanium2 processors (1.5 GHz, 6 MB cache)
Infiniband network to connect superclusters
Distributed address space
Programmer needs to:
- distribute work
- distribute data
- use communication mechanisms to share data explicitly
- use synchronization mechanisms to avoid data races
[Figure: each node runs a multithreaded process with its own Code, Data + Heap and per-thread Stacks in shared memory; nodes communicate through the distributed memory.]
Message passing programming model
- Send specifies the buffer to be transmitted and the receiving process
- Receive specifies the sending process and the application storage to receive into
- Optional tag on send and matching rule on receive
- Implicit synchronization (e.g. non-blocking send and blocking receive)
- Many overheads: copying, buffer management, protection
[Figure: process P executes "Send Q, X, tag" and process Q executes "Receive P, Y, tag"; the tag matches the send with the receive, copying the data at address X in P's local address space to address Y in Q's local address space.]
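The matching rule can be illustrated with a toy in-process mailbox; everything here (toy_send, toy_receive, the struct layout) is an illustrative sketch, not a real message-passing API:

```c
#define MAXMSG 16

/* One pending message: sender id, tag, and a copy of the payload. */
struct msg { int src, tag; float data; int valid; };
static struct msg mailbox[MAXMSG];

/* "Send Q, X, tag": deposit a copy of X for the receiver to pick up. */
static int toy_send(int src, float x, int tag)
{
    for (int i = 0; i < MAXMSG; i++)
        if (!mailbox[i].valid) {
            mailbox[i] = (struct msg){ src, tag, x, 1 };
            return 0;
        }
    return -1;   /* mailbox full */
}

/* "Receive P, Y, tag": find a message matching (sender, tag),
   copy its payload into *y, and consume it. */
static int toy_receive(int src, float *y, int tag)
{
    for (int i = 0; i < MAXMSG; i++)
        if (mailbox[i].valid && mailbox[i].src == src && mailbox[i].tag == tag) {
            *y = mailbox[i].data;
            mailbox[i].valid = 0;
            return 0;
        }
    return -1;   /* no matching message yet */
}
```

A real blocking receive would wait for a matching message instead of returning -1, and a real implementation pays the copying and buffer-management overheads mentioned above.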
Heat distribution: message passing
int t;                 /* time step */
int i;
int id;                /* my processor id */
int num_points;        /* number of points per processor */
float x[-1:(1000/P)];  /* temperatures, including shadow points */

num_points = 1000/P;
id = GetID();
if (id == 0)     x[-1] = 20;
if (id == (P-1)) x[num_points] = 100;
for (i = 0; i < num_points; i++) x[i] = 0;

for (t = 1; t <= 100; t++) {
    /* send boundary values first (non-blocking sends); receiving
       first on every processor would deadlock */
    if (id > 0)   send(id-1, x[0]);
    if (id < P-1) send(id+1, x[num_points-1]);
    /* blocking receives fill the shadow points */
    if (id < P-1) receive(id+1, &x[num_points]);
    if (id > 0)   receive(id-1, &x[-1]);
    for (i = 0; i < num_points; i++)
        x[i] = 0.5 * (x[i-1] + x[i+1]);
}
[Figure: 1002 points distributed among 10 processors; each processor id stores its own points plus two shadow points holding the time t-1 boundary values of its neighbours id-1 and id+1.]
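Because each "process" only ever touches its own points and its two shadow points, the whole scheme can be mimicked in one sequential program, with each send/receive pair replaced by an explicit shadow copy. A sketch (all names are illustrative; the local update here uses a temporary buffer, i.e. a pure Jacobi step):

```c
#define P     4     /* number of "processes" (the slides use 10)     */
#define NP    2     /* points owned per process (the slides use 100) */
#define STEPS 500

/* chunk[p][0] and chunk[p][NP+1] are the shadow points;
   chunk[p][1..NP] are the points owned by "process" p.   */
static float chunk[P][NP + 2];

/* Returns the final temperature of local point i on process p. */
float heat_distributed(int p, int i)
{
    for (int q = 0; q < P; q++)
        for (int j = 0; j <= NP + 1; j++)
            chunk[q][j] = 0.0f;
    chunk[0][0] = 20.0f;          /* left boundary lives on process 0    */
    chunk[P-1][NP+1] = 100.0f;    /* right boundary lives on process P-1 */

    for (int t = 1; t <= STEPS; t++) {
        /* "communication": copy boundary values into the neighbours'
           shadow points (stands in for the send/receive pairs)       */
        for (int q = 0; q < P - 1; q++) {
            chunk[q+1][0]  = chunk[q][NP];
            chunk[q][NP+1] = chunk[q+1][1];
        }
        /* local computation on owned points only */
        for (int q = 0; q < P; q++) {
            float newv[NP + 2];
            for (int j = 1; j <= NP; j++)
                newv[j] = 0.5f * (chunk[q][j-1] + chunk[q][j+1]);
            for (int j = 1; j <= NP; j++)
                chunk[q][j] = newv[j];
        }
    }
    return chunk[p][i];
}
```

The same structure maps directly onto the message-passing version: the two shadow copies per neighbour pair become a send and a matching receive, and the local loop is unchanged.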