Parallel Multigrid Preconditioning on Graphics Processing Units (GPUs) for Robust Power Grid Analysis

Zhuo Feng, Michigan Technological University
Zhiyu Zeng, Texas A&M University
Design Automation Group

2010 ACM/EDAC/IEEE Design Automation Conference
Motivation
On-chip power distribution network verification challenge
– Tens of millions of grid nodes (a recent IBM design reaches ~400M)
– Long simulation times are needed for transient power grid verification
Parallel circuit simulation algorithms on GPUs
– Pros: very cost-efficient (a 240-core GPU costs $400)
– Hardware resource usage limitations: shared memory size, number of registers, etc.
– Algorithm and data structure design preferences: multilevel iterative algorithms for the SIMD computing platform, GPU-friendly device memory access patterns, simple control flow
Our contribution: a robust power grid simulation method for GPUs
– Multigrid preconditioning assures fast convergence (< 20 iterations)
– GPU-specific data structures guarantee coalesced memory access
IR Drop in Power Distribution Network
IR drop: voltage drop due to non-ideal resistive wires
[Figure: chip power distribution network between VDD and GND rails, with IR-drop hot spots marked (image: Cadence)]
Power Grid Modeling & Analysis
Multi-layer interconnects are modeled as a 3D RC network
– Switching gate effects are modeled by time-varying current loadings
[Figure: RC grid between Vdd pads with time-varying current-source loads]
DC analysis solves a linear system with tens of millions of unknowns:
    G · v = b
Transient analysis solves:
    G · v(t) + C · dv(t)/dt = b(t)
where
    G ∈ ℝ^{n×n}: conductance matrix
    C ∈ ℝ^{n×n}: capacitance matrix
    v ∈ ℝ^{n×1}: node voltage vector
    b ∈ ℝ^{n×1}: current loading vector
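A worked note on how the transient system is typically stepped: the slide does not name the integration scheme, so treat the fixed-timestep backward Euler discretization below as an assumption rather than the talk's stated method.

```latex
% Backward Euler with fixed step h applied to  G v(t) + C dv(t)/dt = b(t):
%   C (v_{m+1} - v_m)/h + G v_{m+1} = b(t_{m+1})
% Each time step then reduces to one DC-like solve with a constant matrix:
\left( G + \tfrac{1}{h} C \right) v_{m+1} = b(t_{m+1}) + \tfrac{1}{h} C \, v_m
```

Because G + C/h stays fixed for a fixed step size, the same preconditioner can be reused across all time steps, which is what makes the iterative transient solves later in the talk economical.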
Prior Work
Prior power grid analysis approaches
– Direct methods (LU factorization, Cholesky decomposition)
  – Cholmod uses 7 GB of memory and >1,000 s for a 9-million-node grid
– Iterative methods
  – Preconditioned conjugate gradient (T. Chen et al., DAC'01)
  – Multigrid methods (S. Nassif et al., DAC'00)
– Stochastic methods
  – Random walk (H. Qian et al., DAC'05)
[Figure: illustrations of the direct method, multigrid, and random walk approaches on a VDD grid]
Prior Work (Cont.)
Recent GPU-based power grid analysis methods
– Hybrid multigrid method on GPU (Z. Feng et al., ICCAD'08)
  – Pros: very fast (solves four million nodes per second)
  – Cons: convergence rate depends on the 2D grid approximation
– Poisson solver (J. Shi et al., DAC'09)
  – Pros: public CUFFT library -> easier implementation
  – Cons: only suitable for 2D regular grids
Robust preconditioned Krylov subspace iterative methods on GPU
– Preconditioners using incomplete LU or Cholesky matrix factors
  – Matrix factors are hard to store and process on GPU
– Multigrid-based preconditioning methods
  – SIMD multigrid solver + sparse matrix-vector operations on GPU
NVIDIA GPU Architecture
Streaming Multiprocessor (SM)
– 8 streaming processors (SPs)
– 2 special function units (SFUs)
– Multithreaded instruction fetch/dispatch unit
Multi-threaded instruction dispatch
– 1 to 512 active threads
– 32 threads (a warp) share one instruction fetch
– Covers memory load latency
Some facts about an SM
– 16 KB shared memory
– 8,192 registers
– >30 Gflops peak performance
[Figure: GPU block diagram — a thread execution manager feeds many SMs; each SM holds an instruction fetch/dispatch unit, instruction and data L1, shared memory, 8 SPs, and 2 SFUs; SMs share texture units and parallel data caches, with read/write access to global memory]
GPU Memory Space (CUDA Memory Model)
Each thread can:
– R/W per-thread local memory
– R/W per-block shared memory
– R/W per-grid global memory
– Read only per-grid texture/constant memory
(a short CUDA example follows the table below)
[Figure: CUDA memory hierarchy — threads with local memory, blocks 1~N of grids 0 and 1 with per-block shared memory, all sharing global memory]

Device Memory Comparison
          Local     Shared   Global    Texture
Read      Yes       Yes      Yes       Yes
Write     Yes       Yes      Yes       No
Size      Large     Small    Large     Large
BW        High      High     High      High
Cached?   No        Yes      No        Yes
Latency   500 cyc.  20 cyc.  500 cyc.  300 cyc.
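A minimal CUDA sketch (illustrative only, not from the talk) showing where each memory space in the table surfaces in kernel code:

```cuda
__constant__ float scale;            // per-grid constant memory (read-only, cached);
                                     // set from the host via cudaMemcpyToSymbol

__global__ void memory_spaces_demo(const float *g_in, float *g_out) {
    __shared__ float tile[256];      // per-block shared memory (~20-cycle latency)
    float local;                     // per-thread register/local storage

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = g_in[i];     // one ~500-cycle global read into shared memory
    __syncthreads();                 // make the tile visible to the whole block

    local = tile[threadIdx.x] * scale;
    g_out[i] = local;                // result written back to global memory
}
```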
Contribution of This Work
Multigrid preconditioned Krylov subspace iterative solver on GPU (MGPCG)
DC:  G x = b
TR:  G x(t) + C dx(t)/dt = b(t)
[Figure: MGPCG algorithm on GPU — a 3D multi-layer irregular power grid is moved from host (CPU) memory into GPU global memory (DRAM); the original grid matrix keeps both an ELL-like sparse-matrix form (feeding the Jacobi smoother) and a geometrical representation (feeding the geometrical multigrid solver, GMD, used for multigrid preconditioning). Flow: set initial solution -> get initial residual and search direction -> multigrid preconditioning -> update solution and residual -> check convergence -> if not converged, update search direction and repeat; if converged, return final solution. A GPU-friendly multi-level iterative algorithm.]
Multigrid Methods
Among the fastest numerical algorithms for PDE-like problems
– Linear complexity in the number of unknowns
A hierarchy of exact to coarse replicas of the problem
– High (low) frequency errors are damped on fine (coarse) grids
– Direct/iterative solvers handle the coarsest grid
Multigrid operations (see the V-cycle sketch below)
– Smoothing, restriction, prolongation and correction, etc.
Algebraic MG (AMG) and geometric MG (GMD)
– GMD: suitable for the GPU's SIMD computation
– AMG: robust for handling irregular grids, but needs irregular memory access and complex control flow
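A generic recursive V-cycle sketch capturing these operations (the structure and helper names are assumptions for illustration, not the talk's code):

```cuda
// Hypothetical per-level types and multigrid operators (declarations only):
struct Grid;                                              // level operator/topology
void   solve_coarsest(Grid &g, float *x, const float *b); // direct/iterative bottom solve
void   smooth(Grid &g, float *x, const float *b, int nu); // e.g. weighted Jacobi sweeps
float *residual(Grid &g, const float *x, const float *b); // r = b - A x
float *restrict_to(Grid &coarse, const float *r_fine);    // fine -> coarse transfer
float *zeros(Grid &g);                                    // zero coarse-grid vector
void   prolong_and_correct(Grid &fine, float *x, const float *e_coarse); // x += P e

const int COARSEST = 3, NU1 = 2, NU2 = 2;

// One V-cycle on level L: pre-smooth, restrict the residual, recurse for a
// coarse-grid correction, prolong it back, then post-smooth.
void v_cycle(int L, Grid *grids, float *x, const float *b) {
    if (L == COARSEST) { solve_coarsest(grids[L], x, b); return; }

    smooth(grids[L], x, b, NU1);                // damp high-frequency error
    float *r  = residual(grids[L], x, b);
    float *rc = restrict_to(grids[L + 1], r);
    float *ec = zeros(grids[L + 1]);
    v_cycle(L + 1, grids, ec, rc);              // coarse grid handles low frequencies
    prolong_and_correct(grids[L], x, ec);
    smooth(grids[L], x, b, NU2);                // smooth out interpolation artifacts
}
```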
Power Grid Topology Regularization
Location-based mapping (Z. Feng et al., ICCAD'08)
[Figure: metal layers 5~6, 3~4, and 1~2 of an irregular grid mapped onto a 2D regular grid]
Parallel Multigrid Preconditioning
3D grid smoother + 2D grid GMD solver (a preconditioner-apply sketch follows below)
– The 3D finest grid is stored using an ELL-like sparse matrix format
– The 2D coarser to coarsest grids are processed geometrically
– Coalesced memory accesses are guaranteed on GPU
[Figure: V-cycle — Jacobi smoothing on the finest grid, then restrict/smooth down to the coarsest grid handled by an iterative matrix solver, then prolong & correct/smooth back up (RHS in, solution out); execution alternates Jacobi smoothing with GMD solves over time]
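A hypothetical outline of one application of this hybrid preconditioner (helper names are assumed; the actual kernels are described on the following slides):

```cuda
// Assumed helpers for the hybrid 3D-smoother + 2D-GMD preconditioner:
void   set_zero(float *z);
void   jacobi_smooth_3d(float *z, const float *r, int sweeps); // ELL sweeps, finest 3D grid
float *residual_3d(const float *z, const float *r);            // finest-grid residual
float *restrict_3d_to_2d(const float *r3d);                    // map onto regularized 2D grid
float *gmd_v_cycle(const float *r2d);                          // geometric MG on 2D grids
void   prolong_2d_to_3d_and_correct(float *z, const float *e2d);

const int PRE_SWEEPS = 2, POST_SWEEPS = 2;

// One application of the preconditioner, z ~= A^{-1} r:
void hybrid_mg_precond(const float *r, float *z) {
    set_zero(z);
    jacobi_smooth_3d(z, r, PRE_SWEEPS);                  // smooth on the irregular 3D grid
    float *r2d = restrict_3d_to_2d(residual_3d(z, r));   // remaining low-frequency error
    float *e2d = gmd_v_cycle(r2d);                       // solved geometrically on 2D grids
    prolong_2d_to_3d_and_correct(z, e2d);                // correction back to the 3D grid
    jacobi_smooth_3d(z, r, POST_SWEEPS);                 // post-smoothing
}
```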
GMD Smoother
Mixed block-wise relaxation on GPU (sketched below)
– Weighted Jacobi iterations within each block (operating in each SM's shared memory)
– Gauss-Seidel-like iterations among blocks (boundary data exchanged through global memory)
[Figure: streaming processors SP1~SP8 of multiprocessors SM1, SM2, SM3, ... each relaxing one block in shared memory, with inter-block updates passing through global memory over execution time]
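A simplified CUDA sketch of the mixed relaxation idea on a 1D model stencil (the talk's kernels work on regularized 2D grids; the stencil and names here are assumptions): weighted Jacobi sweeps run inside each block in shared memory, while the halo values read from global memory pick up whatever neighboring blocks have already committed, giving the Gauss-Seidel flavor across blocks.

```cuda
#define BSIZE  256
#define SWEEPS 4
#define OMEGA  0.8f   // weighted-Jacobi damping factor

// Relax one contiguous chunk of a 1D chain (model stencil: -x[i-1] + 2x[i] - x[i+1] = b[i]).
// Launch with BSIZE threads per block; assumes n is a multiple of BSIZE.
__global__ void block_mixed_relax(float *x, const float *b, int n) {
    __shared__ float xs[BSIZE + 2];                 // chunk plus left/right halo
    int i = blockIdx.x * BSIZE + threadIdx.x;

    xs[threadIdx.x + 1] = x[i];                     // load this block's interior
    if (threadIdx.x == 0)         xs[0]         = (i > 0)     ? x[i - 1] : 0.0f;
    if (threadIdx.x == BSIZE - 1) xs[BSIZE + 1] = (i < n - 1) ? x[i + 1] : 0.0f;
    __syncthreads();

    for (int s = 0; s < SWEEPS; ++s) {              // weighted Jacobi within the block
        float xnew = 0.5f * (xs[threadIdx.x] + xs[threadIdx.x + 2] + b[i]);
        __syncthreads();
        xs[threadIdx.x + 1] = (1.0f - OMEGA) * xs[threadIdx.x + 1] + OMEGA * xnew;
        __syncthreads();
    }
    x[i] = xs[threadIdx.x + 1];                     // commit results to global memory
}
```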
Memory Layout on GPU
Mixed data structures
– Original grid (finest grid level, Level 0): ELL-like sparse matrix
– Regularized coarse to coarsest grids (Levels 1~3): stored as graphics pixels (regular 2D arrays) on GPU
[Figure: restriction maps Level 0 (the nodal matrix) down through Levels 1 and 2 to Level 3 (the coarsest grid); prolongation maps back up]
Nodal Analysis Matrix: ELL-like Sparse Matrix Storage
Split the nodal analysis matrix as A = D + M
– D: diagonal elements of A, stored as the inverted diagonal D^{-1}
– M: off-diagonal elements of A
Off-diagonals are packed into fixed-width vectors (two columns, Col 1 and Col 2, in this example)
– Element value vector: the off-diagonal values of each row, padded (P) where a row has fewer entries
– Element index vector: the matching column indices, padded the same way
[Figure: an 8×8 example matrix (entries a_{1,1}, a_{1,4}, a_{1,5}; a_{2,2}, a_{2,3}, a_{2,6}; ...; a_{8,3}, a_{8,6}, a_{8,8}) packed into the value/index vectors plus the inverted-diagonal vector]
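A sketch of this storage scheme as a C struct (field names are assumed): values and indices are laid out column-major, so consecutive rows — and hence consecutive threads — sit at consecutive addresses.

```cuda
// ELL-like storage of A = D + M for an n-node grid with at most K
// off-diagonal entries per row (padded slots hold value 0 / index 0).
// Both arrays are column-major: entry k of row i lives at [k * n + i],
// so threads i, i+1, ... of a warp read consecutive addresses -> coalesced.
struct EllMatrix {
    int    n;      // number of grid nodes (matrix rows)
    int    K;      // max off-diagonals per row (padded width; 2 in the example)
    float *val;    // K*n off-diagonal element values
    int   *idx;    // K*n column indices of those values
    float *dinv;   // n entries of D^{-1} (diagonal stored pre-inverted)
};
```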
GPU Device Memory Access Pattern
GPU-based Jacobi iteration (smoother), computed in two steps (kernel sketch below):
    x^{(k+1)} = D^{-1} [ b − M x^{(k)} ]
    Step 1: S^{(k)} = b − M x^{(k)}        Step 2: x^{(k+1)} = D^{-1} S^{(k)}
[Figure: threads t1~t8 (one per matrix row) step through the ELL value/index columns in lockstep over time slots T1~T4, so each warp touches consecutive addresses — fully coalesced device memory accesses]
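A minimal CUDA kernel for one such Jacobi sweep over the EllMatrix layout sketched earlier (one thread per grid node; an assumption-level sketch, not the talk's exact kernel):

```cuda
// One Jacobi smoothing step: x_new = D^{-1} (b - M * x_old).
// The column-major ELL layout makes val[k*n + i] / idx[k*n + i] coalesced
// across the threads of a warp for every k; padded slots hold value 0,
// so they contribute nothing to the sum.
__global__ void jacobi_ell(int n, int K,
                           const float *val, const int *idx, const float *dinv,
                           const float *b, const float *x_old, float *x_new) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float s = b[i];
    for (int k = 0; k < K; ++k)           // accumulate b_i - (M x_old)_i
        s -= val[k * n + i] * x_old[idx[k * n + i]];
    x_new[i] = dinv[i] * s;               // scale by the inverted diagonal
}
```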
Algorithm Flow
Set initial solution
→ Get initial residual and search direction
→ Multigrid preconditioning: Jacobi smoother using the ELL-like sparse matrix + geometrical multigrid solver (GMD)
→ Update solution and residual
→ Check convergence
→ If not converged: update search direction and repeat the preconditioned iteration
→ If converged: return final solution
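This flow is standard preconditioned CG with the multigrid solve playing the role of the preconditioner z = M^{-1} r. A host-side sketch under assumed helper names (dot/axpy/spmv stand in for GPU vector kernels or CUBLAS calls; mg_precond is the hybrid preconditioner sketched earlier):

```cuda
// Assumed GPU vector helpers (names illustrative):
float *new_vec(int n);
void   copy_vec(int n, const float *src, float *dst);
float  dot(int n, const float *a, const float *b);
float  norm2(int n, const float *a);
void   axpy(int n, float a, const float *x, float *y);  // y += a * x
void   xpay(int n, float a, const float *x, float *y);  // y  = x + a * y
void   spmv(const EllMatrix &A, const float *x, float *y);                          // y = A x
void   spmv_residual(const EllMatrix &A, const float *x, const float *b, float *r); // r = b - A x
void   mg_precond(const EllMatrix &A, const float *r, float *z);  // z = M^{-1} r

// MGPCG: PCG where each preconditioner solve is one multigrid cycle.
void mgpcg(const EllMatrix &A, const float *b, float *x, float tol, int max_it) {
    float *r = new_vec(A.n), *z = new_vec(A.n), *p = new_vec(A.n), *q = new_vec(A.n);

    spmv_residual(A, x, b, r);           // initial residual  r = b - A x
    mg_precond(A, r, z);                 // multigrid preconditioning
    copy_vec(A.n, z, p);                 // initial search direction
    float rho = dot(A.n, r, z);

    for (int it = 0; it < max_it && norm2(A.n, r) > tol; ++it) {
        spmv(A, p, q);                                  // q = A p
        float alpha = rho / dot(A.n, p, q);
        axpy(A.n,  alpha, p, x);                        // update solution
        axpy(A.n, -alpha, q, r);                        // update residual
        mg_precond(A, r, z);                            // precondition new residual
        float rho_new = dot(A.n, r, z);
        xpay(A.n, rho_new / rho, z, p);                 // update search direction
        rho = rho_new;
    }
}
```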
Experiment Results
Linux computing system: C++ & CUDA
– CPU: Core 2 Quad 2.66 GHz + 6 GB DRAM
– GPU: NVIDIA GTX 285, 1.5 GHz with 240 SPs ($400)
Power grid test cases
– IBM power grid benchmark circuits CKT1~5 (0.13M ~ 1.7M nodes)
– Larger industrial power grid designs CKT6~8 (4.5M ~ 10M nodes)
Direct solver on the host
– Cholmod with Supernodal and Metis functions
Iterative solvers on GPU
– MGPCG: multigrid preconditioned CG
– DPCG: diagonally preconditioned CG
– HMD: hybrid multigrid (Z. Feng et al., ICCAD'08)
Power Grid Design Information

CKT    N_node   N_layer  N_nnz    N_res   N_cur
CKT1   127.0K   5        532.9K   209.7K  37.9K
CKT2   851.6K   5        3.7M     1.4M    201.1K
CKT3   953.6K   6        4.1M     1.5M    277.0K
CKT4   1.0M     3        4.3M     1.7M    540.8K
CKT5   1.7M     3        6.6M     2.5M    761.5K
CKT6   4.7M     8        18.8M    6.8M    185.5K
CKT7   6.7M     8        26.2M    9.7M    267.3K
CKT8   10.5M    8        40.8M    14.8M   419.3K

N_layer: the number of metal layers
N_res: the number of resistors
N_cur: the number of current sources
Convergence Comparison
[Figure: residual (log scale, 10^0 down to 10^-4) vs. iteration number (0~14) for MGPCG and HMD — MGPCG converges much faster]
Max errors
– HMD: 1e-3 Volt
– MGPCG: 1e-5 Volt
HMD: Hybrid Multigrid Method
MGPCG: Multigrid Preconditioned Conjugate Gradient Method
Results
Power Grid DC Analysis

CKT    NCG    NDPCG  NMGPCG  NHMD  TCG   TDPCG
CKT1   1,405  400    4       7     0.2   0.12
CKT2   4,834  3,351  4       13    5.9   3.9
CKT3   2,253  681    3       8     5.4   2.1
CKT4   4,062  411    5       >30   5.3   3.7
CKT5   6,433  700    6       >30   15.2  9.5

NCG: the number of CG iterations
NDPCG: the number of diagonally preconditioned CG iterations
NMGPCG: the number of multigrid preconditioned CG iterations
NHMD: the number of hybrid multigrid iterations
TCG: the runtime of CG (seconds)
TDPCG: the runtime of diagonally preconditioned CG (seconds)
Results (Cont.)
Power Grid DC Analysis (Cont.)

CKT    TMGPCG  THMD   TCHOL  Eavg  Emax  Speedup
CKT1   0.05    0.1    1.7    1e-4  4e-4  34X
CKT2   0.5     0.78   20.2   2e-6  2e-5  40X
CKT3   0.4     0.72   21.6   1e-5  1e-4  54X
CKT4   0.9     >2.4   19.4   8e-5  7e-4  22X
CKT5   1.1     >4.5   25.6   1e-4  5e-4  25X

TMGPCG: the runtime of multigrid preconditioned CG iterations (seconds)
THMD: the runtime of hybrid multigrid iterations (seconds)
TCHOL: the runtime of the direct matrix solver (Cholmod, seconds)
Eavg: average error
Emax: maximum error
Speedup: TCHOL / TMGPCG
Results (Cont.)
DC Analysis of Large Circuits

CKT    N_node  N_MGPCG  T_MGPCG  T_CHOL  Speedup
CKT6   4.7M    7        4.9      131.5   27X
CKT7   6.7M    9        7.9      205.1   26X
CKT8   10.5M   11       11.6     N/A     N/A

N_MGPCG: the number of MGPCG iterations
T_MGPCG: the runtime of the MGPCG solver (seconds)
T_CHOL: the runtime of the Cholmod solver (seconds)
Results (Cont.)
Transient Analysis Results

CKT    Tcpu   Tgpu  Ngpu  Eavg  Emax  Speedup
CKT1   40.8   2.1   102   3e-6  8e-4  20X
CKT2   315.3  15.2  99    1e-5  3e-4  21X
CKT3   360.7  15.6  100   5e-6  1e-4  23X
CKT4   352.7  19.6  140   3e-5  2e-4  18X
CKT5   553.9  26.1  130   2e-5  1e-4  21X

Tcpu: Cholmod solve time (seconds)
Tgpu: MGPCG time (seconds)
Ngpu: the number of MGPCG iterations
Transient Analysis: CKT1
[Figure: voltage (V) vs. time (0~5 ns) waveforms from Cholmod and the GPU solver, with a zoomed panel (~1.7072~1.7079 V around 2.84~2.865 ns) showing the two curves overlapping]
CKT1 with 127K nodes, 500 time steps, 509 MGPCG iterations
Cholmod: 212 s; GPU: 9.2 s -> 23X speedup
Transient Analysis: CKT5
[Figure: voltage (V) vs. time (0~5 ns) waveforms from Cholmod and the GPU solver, with a zoomed panel (~1.7708~1.7711 V around 2.3~2.335 ns) showing close agreement]
CKT5 with 1.7M nodes, 500 time steps, 693 MGPCG iterations
Cholmod: 2,700 s; GPU: 128 s -> 22X speedup
Conclusion and Future Work
Robust circuit simulation on GPU is challenging
– How to accelerate simulations for irregular problems?
– Hard to guarantee accuracy and robustness
Parallel multigrid preconditioning method for power grid analysis
– Multigrid preconditioning (geometrical + matrix representations)
– Geometrical multigrid solver on GPU
– ELL-like sparse matrix-vector operations for original grids on GPU
– Applicable to more general power grids with strong irregularities
– Much faster convergence & higher accuracy than ever before
Future work
– Node ordering and grid partitioning for multi-core/multi-GPU systems
– GPU performance modeling to further improve solver efficiency
– Heterogeneous computing to adaptively balance the workloads