Load Balancing Hybrid Programming Models for SMP Clusters and Fully Permutable Loops
Nikolaos Drosinos and Nectarios Koziris
National Technical University of Athens
Computing Systems Laboratory
{ndros,nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr
Oslo, June 15, 2005 ICPP-HPSEC 2005 2

Motivation
• fully permutable loops: always a computational challenge for HPC
• hybrid parallelization is attractive for DSM architectures
• currently, popular free message passing libraries provide limited multi-threading support
• SPMD hybrid parallelization suffers from intrinsic load imbalance

Contribution
• two static thread load balancing schemes (constant and variable) for coarse-grain funneled hybrid parallelization of fully permutable loops
  • generic
  • simple to implement
• experimental evaluation against micro-kernel benchmarks of different programming models
  • message passing
  • fine-grain hybrid
  • coarse-grain hybrid (unbalanced, balanced)

Algorithmic model
foracross tile_1 do
  …
    foracross tile_N do
      for tile_{n-1} do
        Receive(tile);
        Compute(A, tile);
        Send(tile);
Restrictions:
• fully permutable loops
• unitary inter-process dependencies
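The Receive/Compute/Send loop above can be sketched as a sequential simulation. This is a minimal illustration, not the authors' implementation: the recurrence `A[i][j] = A[i-1][j] + A[i][j-1]`, the function names `run_sequential` and `run_tiled`, and the dictionary `channel` that stands in for message passing are all assumptions made for the example.

```python
def run_sequential(rows, cols):
    """Reference run of a 2D recurrence with unit dependencies (hypothetical kernel)."""
    a = [[1] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a


def run_tiled(rows, cols, procs, tile_cols):
    """Simulate `procs` processes, each owning a block of rows and sweeping column tiles.

    Processes are run one after another, which is a valid (non-parallel)
    order for the dependencies; a real run would pipeline them.
    """
    assert rows % procs == 0 and cols % tile_cols == 0
    block = rows // procs
    a = [[1] * cols for _ in range(rows)]
    channel = {}  # (receiver, tile index) -> boundary row; stands in for messages
    for p in range(procs):                    # parallel dimension: one block per process
        lo, hi = p * block, (p + 1) * block
        for k in range(cols // tile_cols):    # sequential dimension: tiles of columns
            j0, j1 = k * tile_cols, (k + 1) * tile_cols
            # Receive(tile): last row of the predecessor's tile
            halo = channel.pop((p, k)) if p > 0 else None
            # Compute(A, tile)
            for i in range(max(lo, 1), hi):
                for j in range(max(j0, 1), j1):
                    up = halo[j - j0] if i == lo else a[i - 1][j]
                    a[i][j] = up + a[i][j - 1]
            # Send(tile): last row of this tile to the successor
            if p < procs - 1:
                channel[(p + 1, k)] = a[hi - 1][j0:j1]
    return a
```

Each process reads its upper boundary only from the received row, so the tiled run touches another process's data only through the simulated messages.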

Message passing parallelization
• tiling transformation
• (overlapped?) computation and communication phases
• pipelined execution
Advantages:
• portable
• scalable
• highly optimized
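The effect of pipelined execution can be estimated with a small model. This is a sketch under simplifying assumptions that do not come from the slides: a one-dimensional chain of processes, the same cost for every tile, and no communication cost. The function name `pipeline_steps` is hypothetical.

```python
def pipeline_steps(procs, tiles):
    """Time steps until the last process finishes its last tile.

    A tile can start once the same tile on the previous process and the
    previous tile on the same process are both finished.
    """
    finish = [[0] * tiles for _ in range(procs)]
    for p in range(procs):
        for k in range(tiles):
            prev_proc = finish[p - 1][k] if p > 0 else 0
            prev_tile = finish[p][k - 1] if k > 0 else 0
            finish[p][k] = 1 + max(prev_proc, prev_tile)
    return finish[-1][-1]
```

With 4 processes and 10 tiles per process the model gives 13 steps, compared with 40 if the processes ran strictly one after another.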

Hybrid parallelization
So… why bother?

Hybrid parallelization: why bother I
shared memory programming model vs message passing programming model for shared memory architecture

Hybrid parallelization: why bother II
DSM architectures are popular!

Fine-grain hybrid parallelization
Advantages:
• incremental parallelization of loops
• relatively easy to implement
• popular
Drawbacks:
• Amdahl's law restricts parallel efficiency
• overhead of re-initializing thread structures
• restrictive programming model for many applications
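The Amdahl's law limit on the fine-grain model can be made concrete with a short calculation. The parallel fraction used in the note below is illustrative only, not a measurement from this work, and the function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Upper bound on speedup when only `parallel_fraction` of the run time is threaded."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


def parallel_efficiency(parallel_fraction, threads):
    """Speedup bound divided by the number of threads."""
    return amdahl_speedup(parallel_fraction, threads) / threads
```

For example, with two threads and 90% of the run time inside parallelized loops, the bound is a speedup of about 1.82, or roughly 91% efficiency.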

Coarse-grain hybrid parallelization
Advantages:
• generic SPMD programming style
• good parallelization efficiency
• no thread re-initialization overhead
Drawbacks:
• more difficult to implement
• intrinsic load imbalance, assuming the common funneled thread support level

MPI thread support levels
• single
• masteronly (fine-grain hybrid)
• funneled (coarse-grain hybrid)
• serialized
• multiple
[Figure: comm and comp phases per thread for the fine-grain hybrid and coarse-grain hybrid models.]

Load balancing
Idea
$$t_{comp}^{master} + t_{comm}^{master} = t_{comp}^{other}$$
Consequence
master thread assumes a smaller fraction of the process tile computational load compared to other threads

Load balancing (2)
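T: total number of threads
p: current process id
Assuming
$$t_{comp}(x) = x \, t_{comp}(1), \qquad t_{comm}(x) = t_{startup} + x \, t_{data}$$
It follows
$$bal = 1 - \frac{T-1}{t_{comp}(tile)} \sum_{\substack{dir = 1 \\ dir \in C_p}}^{N} t_{comm}(dir)$$
where the sum runs over the communication directions dir that apply to process p.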
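The balancing idea can be checked numerically. The sketch below rests on assumptions that are not spelled out in the transcript: the master thread takes a fraction `bal / T` of the process tile, the remaining work is split evenly over the other `T - 1` threads, and computation time is linear in the amount of work. Equating the master's time (computation plus communication) with any other thread's time then gives `bal = 1 - (T - 1) * t_comm / t_comp`. The function names and the clamping to the range 0 to 1 are additions for the example.

```python
def balancing_factor(threads, t_comp_tile, t_comm_total):
    """bal = 1 - (T - 1) * t_comm / t_comp(tile), clamped to [0, 1] (clamping is an assumption)."""
    bal = 1.0 - (threads - 1) * t_comm_total / t_comp_tile
    return min(1.0, max(0.0, bal))


def thread_times(threads, t_comp_tile, t_comm_total, bal):
    """Per-tile time of the master thread and of any other thread for a given bal."""
    master_share = bal / threads
    other_share = (1.0 - master_share) / (threads - 1)
    master = master_share * t_comp_tile + t_comm_total
    other = other_share * t_comp_tile
    return master, other
```

With two threads, a tile computation time of 1.0 and a communication time of 0.1, `bal` is 0.9 and both threads take 0.55 per tile; with `bal = 1.0` (unbalanced) the master takes 0.6 while the other thread waits after 0.5.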

Load balancing (3)
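[Figure: process grid over X1 and X2, from process (0,0) to process (3,1), with each process tile divided along Z between thread 0 and thread 1. Labels: 87%, 87%, 87%, 92% (first row) and 95%, 95%, 95%, 100% (second row).]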
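Variable balancing can be sketched by computing one factor per process. The example below is a hedged illustration only. It reads the percentages in the figure as per-process balancing factors, assumes a 4 x 2 process grid with two threads per process, and assumes a process pays communication cost along X1 or X2 only when it has a successor in that direction. The per-direction costs (0.05 and 0.08 of the tile computation time) are hypothetical values picked so that the output resembles the figure labels; they are not measurements. The function names are hypothetical as well.

```python
def variable_balancing(grid, threads, t_comp_tile, t_comm_dir):
    """Per-process balancing factor on a 2D process grid.

    grid        -- (P1, P2): number of processes along X1 and X2
    t_comm_dir  -- (c1, c2): communication time per tile along X1 and X2
    Assumption: a direction counts only if the process has a successor there.
    """
    p1, p2 = grid
    factors = {}
    for i in range(p1):
        for j in range(p2):
            comm = 0.0
            if i < p1 - 1:
                comm += t_comm_dir[0]
            if j < p2 - 1:
                comm += t_comm_dir[1]
            factors[(i, j)] = 1.0 - (threads - 1) * comm / t_comp_tile
    return factors


def constant_balancing(threads, t_comp_tile, t_comm_dir):
    """One factor for every process, sized here for communication in all directions (assumption)."""
    return 1.0 - (threads - 1) * sum(t_comm_dir) / t_comp_tile
```

Under these assumptions, processes with successors in both directions get about 87%, while process (3,1), which has no successor in either direction, keeps 100%. A constant scheme would apply the same factor to every process, which leaves boundary processes slightly mis-balanced.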

Experimental Results
• 8-node dual SMP Linux cluster (800 MHz PIII, 256 MB RAM, kernel 2.4.26)
• MPICH v.1.2.6 (--with-device=ch_p4, --with-comm=shared, P4_SOCKBUFSIZE=104KB)
• Intel C++ compiler 8.1 (-O3 -static -mcpu=pentiumpro)
• FastEthernet interconnection network

Alternating Direction Implicit (ADI)
• stencil computation used for solving partial differential equations
• unitary data dependencies
• 3D iteration space (X x Y x Z)
[Figure: 3D iteration space with axes X, Y and Z, showing sequential execution, processor mapping and data dependencies.]
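Unitary data dependencies in a 3D iteration space are what make the loops fully permutable: any nesting order of the three loops yields the same values. The sketch below shows this with a stand-in stencil, not the ADI kernel itself; the recurrence and the function name `sweep` are assumptions chosen for the example.

```python
from itertools import product


def sweep(shape, order):
    """Run a 3D unit-dependence stencil with the loops nested in the given axis order.

    `order` is a permutation of (0, 1, 2); order[0] is the outermost loop.
    All loops ascend, so every point is computed after its three predecessors.
    """
    a = {idx: 1 for idx in product(*(range(n) for n in shape))}
    ranges = [range(1, shape[axis]) for axis in order]
    for point in product(*ranges):
        idx = [0, 0, 0]
        for axis, value in zip(order, point):
            idx[axis] = value
        x, y, z = idx
        a[(x, y, z)] = a[(x - 1, y, z)] + a[(x, y - 1, z)] + a[(x, y, z - 1)]
    return a
```

Comparing `sweep(shape, (0, 1, 2))` with `sweep(shape, (2, 1, 0))`, or any other permutation, gives identical results, which is the property that tiling and pipelining along any dimension rely on.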

ADI

Synthetic benchmark

Conclusions
• fine-grain hybrid parallelization is inefficient
• unbalanced coarse-grain hybrid parallelization is also inefficient
• balancing improves hybrid model performance
• the variable balanced coarse-grain hybrid model is the most efficient approach overall
• the relative performance improvement increases for higher communication vs computation needs

Thank You!
Questions?