Parallel Multigrid Preconditioning on Graphics Processing Units (GPUs) for Robust Power Grid Analysis

Zhuo Feng, Michigan Technological University
Zhiyu Zeng, Texas A&M University
Design Automation Group

2010 ACM/EDAC/IEEE Design Automation Conference
Motivation
On-chip power distribution network verification challenge
– Tens of millions of grid nodes (a recent IBM design reaches ~400M)
– Long simulation times are needed for transient power grid verification
Parallel circuit simulation algorithms on GPUs
– Pros: very cost-efficient (a 240-core GPU costs $400)
– Hardware resource usage limitations: shared memory size, number of registers, etc.
– Algorithm and data structure design preferences: multilevel iterative algorithms for the SIMD computing platform, GPU-friendly device memory access patterns, simple control flow
Our contribution: a robust power grid simulation method for GPUs
– Multigrid preconditioning assures fast convergence (< 20 iterations)
– GPU-specific data structures guarantee coalesced memory access
IR Drop in Power Distribution Network
IR drop: voltage drop due to non-ideal resistive wires
[Figure: chip power distribution network between VDD and GND rails, with IR-drop hot spots marked (image: Cadence)]
Power Grid Modeling & Analysis
Multi-layer interconnects are modeled as a 3D RC network
– Switching gate effects are modeled by time-varying current loadings
[Figure: RC grid between Vdd pads with time-varying current-source loads]
DC analysis solves a linear system with tens of millions of unknowns:
    G · v = b
Transient analysis solves:
    G · v(t) + C · dv(t)/dt = b(t)
where
    G ∈ ℝ^{n×n}: conductance matrix
    C ∈ ℝ^{n×n}: capacitance matrix
    v ∈ ℝ^{n×1}: node voltage vector
    b ∈ ℝ^{n×1}: current loading vector
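A worked note on how the transient system is typically stepped: the slide does not name the integration scheme, so treat the fixed-timestep backward Euler discretization below as an assumption rather than the talk's stated method.

```latex
% Backward Euler with fixed step h applied to  G v(t) + C dv(t)/dt = b(t):
%   C (v_{m+1} - v_m)/h + G v_{m+1} = b(t_{m+1})
% Each time step then reduces to one DC-like solve with a constant matrix:
\left( G + \tfrac{1}{h} C \right) v_{m+1} = b(t_{m+1}) + \tfrac{1}{h} C \, v_m
```

Because G + C/h stays fixed for a fixed step size, the same preconditioner can be reused across all time steps, which is what makes the iterative transient solves later in the talk economical.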
Prior Work
Prior power grid analysis approaches
– Direct methods (LU factorization, Cholesky decomposition)
  – Cholmod uses 7 GB of memory and >1,000 s for a 9-million-node grid
– Iterative methods
  – Preconditioned conjugate gradient (T. Chen et al., DAC'01)
  – Multigrid methods (S. Nassif et al., DAC'00)
– Stochastic methods
  – Random walk (H. Qian et al., DAC'05)
[Figure: illustrations of the direct method, multigrid, and random walk approaches on a VDD grid]
Prior Work (Cont.)
Recent GPU-based power grid analysis methods
– Hybrid multigrid method on GPU (Z. Feng et al., ICCAD'08)
  – Pros: very fast (solves four million nodes per second)
  – Cons: convergence rate depends on the 2D grid approximation
– Poisson solver (J. Shi et al., DAC'09)
  – Pros: public CUFFT library -> easier implementation
  – Cons: only suitable for 2D regular grids
Robust preconditioned Krylov subspace iterative methods on GPU
– Preconditioners using incomplete LU or Cholesky matrix factors
  – Matrix factors are hard to store and process on GPU
– Multigrid-based preconditioning methods
  – SIMD multigrid solver + sparse matrix-vector operations on GPU
NVIDIA GPU Architecture
Streaming Multiprocessor (SM)
– 8 streaming processors (SPs)
– 2 special function units (SFUs)
– Multithreaded instruction fetch/dispatch unit
Multi-threaded instruction dispatch
– 1 to 512 active threads
– 32 threads (a warp) share one instruction fetch
– Covers memory load latency
Some facts about an SM
– 16 KB shared memory
– 8,192 registers
– >30 Gflops peak performance
[Figure: GPU block diagram — a thread execution manager feeds many SMs; each SM holds an instruction fetch/dispatch unit, instruction and data L1, shared memory, 8 SPs, and 2 SFUs; SMs share texture units and parallel data caches, with read/write access to global memory]
GPU Memory Space (CUDA Memory Model)
Each thread can:
– R/W per-thread local memory
– R/W per-block shared memory
– R/W per-grid global memory
– Read only per-grid texture/constant memory
(a short CUDA example follows the table below)
[Figure: CUDA memory hierarchy — threads with local memory, blocks 1~N of grids 0 and 1 with per-block shared memory, all sharing global memory]

Device Memory Comparison
          Local     Shared   Global    Texture
Read      Yes       Yes      Yes       Yes
Write     Yes       Yes      Yes       No
Size      Large     Small    Large     Large
BW        High      High     High      High
Cached?   No        Yes      No        Yes
Latency   500 cyc.  20 cyc.  500 cyc.  300 cyc.
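A minimal CUDA sketch (illustrative only, not from the talk) showing where each memory space in the table surfaces in kernel code:

```cuda
__constant__ float scale;            // per-grid constant memory (read-only, cached);
                                     // set from the host via cudaMemcpyToSymbol

__global__ void memory_spaces_demo(const float *g_in, float *g_out) {
    __shared__ float tile[256];      // per-block shared memory (~20-cycle latency)
    float local;                     // per-thread register/local storage

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = g_in[i];     // one ~500-cycle global read into shared memory
    __syncthreads();                 // make the tile visible to the whole block

    local = tile[threadIdx.x] * scale;
    g_out[i] = local;                // result written back to global memory
}
```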
Contribution of This Work
Multigrid preconditioned Krylov subspace iterative solver on GPU (MGPCG)
DC:  G x = b
TR:  G x(t) + C dx(t)/dt = b(t)
[Figure: MGPCG algorithm on GPU — a 3D multi-layer irregular power grid is moved from host (CPU) memory into GPU global memory (DRAM); the original grid matrix keeps both an ELL-like sparse-matrix form (feeding the Jacobi smoother) and a geometrical representation (feeding the geometrical multigrid solver, GMD, used for multigrid preconditioning). Flow: set initial solution -> get initial residual and search direction -> multigrid preconditioning -> update solution and residual -> check convergence -> if not converged, update search direction and repeat; if converged, return final solution. A GPU-friendly multi-level iterative algorithm.]
Multigrid Methods
Among the fastest numerical algorithms for PDE-like problems
– Linear complexity in the number of unknowns
A hierarchy of exact to coarse replicas of the problem
– High (low) frequency errors are damped on fine (coarse) grids
– Direct/iterative solvers handle the coarsest grid
Multigrid operations (see the V-cycle sketch below)
– Smoothing, restriction, prolongation and correction, etc.
Algebraic MG (AMG) and geometric MG (GMD)
– GMD: suitable for the GPU's SIMD computation
– AMG: robust for handling irregular grids, but needs irregular memory access and complex control flow
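A generic recursive V-cycle sketch capturing these operations (the structure and helper names are assumptions for illustration, not the talk's code):

```cuda
// Hypothetical per-level types and multigrid operators (declarations only):
struct Grid;                                              // level operator/topology
void   solve_coarsest(Grid &g, float *x, const float *b); // direct/iterative bottom solve
void   smooth(Grid &g, float *x, const float *b, int nu); // e.g. weighted Jacobi sweeps
float *residual(Grid &g, const float *x, const float *b); // r = b - A x
float *restrict_to(Grid &coarse, const float *r_fine);    // fine -> coarse transfer
float *zeros(Grid &g);                                    // zero coarse-grid vector
void   prolong_and_correct(Grid &fine, float *x, const float *e_coarse); // x += P e

const int COARSEST = 3, NU1 = 2, NU2 = 2;

// One V-cycle on level L: pre-smooth, restrict the residual, recurse for a
// coarse-grid correction, prolong it back, then post-smooth.
void v_cycle(int L, Grid *grids, float *x, const float *b) {
    if (L == COARSEST) { solve_coarsest(grids[L], x, b); return; }

    smooth(grids[L], x, b, NU1);                // damp high-frequency error
    float *r  = residual(grids[L], x, b);
    float *rc = restrict_to(grids[L + 1], r);
    float *ec = zeros(grids[L + 1]);
    v_cycle(L + 1, grids, ec, rc);              // coarse grid handles low frequencies
    prolong_and_correct(grids[L], x, ec);
    smooth(grids[L], x, b, NU2);                // smooth out interpolation artifacts
}
```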
Power Grid Topology Regularization
Location-based mapping (Z. Feng et al., ICCAD'08)
[Figure: metal layers 5~6, 3~4, and 1~2 of an irregular grid mapped onto a 2D regular grid]
Parallel Multigrid Preconditioning
3D grid smoother + 2D grid GMD solver (a preconditioner-apply sketch follows below)
– The 3D finest grid is stored using an ELL-like sparse matrix format
– The 2D coarser to coarsest grids are processed geometrically
– Coalesced memory accesses are guaranteed on GPU
[Figure: V-cycle — Jacobi smoothing on the finest grid, then restrict/smooth down to the coarsest grid handled by an iterative matrix solver, then prolong & correct/smooth back up (RHS in, solution out); execution alternates Jacobi smoothing with GMD solves over time]
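A hypothetical outline of one application of this hybrid preconditioner (helper names are assumed; the actual kernels are described on the following slides):

```cuda
// Assumed helpers for the hybrid 3D-smoother + 2D-GMD preconditioner:
void   set_zero(float *z);
void   jacobi_smooth_3d(float *z, const float *r, int sweeps); // ELL sweeps, finest 3D grid
float *residual_3d(const float *z, const float *r);            // finest-grid residual
float *restrict_3d_to_2d(const float *r3d);                    // map onto regularized 2D grid
float *gmd_v_cycle(const float *r2d);                          // geometric MG on 2D grids
void   prolong_2d_to_3d_and_correct(float *z, const float *e2d);

const int PRE_SWEEPS = 2, POST_SWEEPS = 2;

// One application of the preconditioner, z ~= A^{-1} r:
void hybrid_mg_precond(const float *r, float *z) {
    set_zero(z);
    jacobi_smooth_3d(z, r, PRE_SWEEPS);                  // smooth on the irregular 3D grid
    float *r2d = restrict_3d_to_2d(residual_3d(z, r));   // remaining low-frequency error
    float *e2d = gmd_v_cycle(r2d);                       // solved geometrically on 2D grids
    prolong_2d_to_3d_and_correct(z, e2d);                // correction back to the 3D grid
    jacobi_smooth_3d(z, r, POST_SWEEPS);                 // post-smoothing
}
```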
GMD Smoother
Mixed block-wise relaxation on GPU (sketched below)
– Weighted Jacobi iterations within each block (operating in each SM's shared memory)
– Gauss-Seidel-like iterations among blocks (boundary data exchanged through global memory)
[Figure: streaming processors SP1~SP8 of multiprocessors SM1, SM2, SM3, ... each relaxing one block in shared memory, with inter-block updates passing through global memory over execution time]
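A simplified CUDA sketch of the mixed relaxation idea on a 1D model stencil (the talk's kernels work on regularized 2D grids; the stencil and names here are assumptions): weighted Jacobi sweeps run inside each block in shared memory, while the halo values read from global memory pick up whatever neighboring blocks have already committed, giving the Gauss-Seidel flavor across blocks.

```cuda
#define BSIZE  256
#define SWEEPS 4
#define OMEGA  0.8f   // weighted-Jacobi damping factor

// Relax one contiguous chunk of a 1D chain (model stencil: -x[i-1] + 2x[i] - x[i+1] = b[i]).
// Launch with BSIZE threads per block; assumes n is a multiple of BSIZE.
__global__ void block_mixed_relax(float *x, const float *b, int n) {
    __shared__ float xs[BSIZE + 2];                 // chunk plus left/right halo
    int i = blockIdx.x * BSIZE + threadIdx.x;

    xs[threadIdx.x + 1] = x[i];                     // load this block's interior
    if (threadIdx.x == 0)         xs[0]         = (i > 0)     ? x[i - 1] : 0.0f;
    if (threadIdx.x == BSIZE - 1) xs[BSIZE + 1] = (i < n - 1) ? x[i + 1] : 0.0f;
    __syncthreads();

    for (int s = 0; s < SWEEPS; ++s) {              // weighted Jacobi within the block
        float xnew = 0.5f * (xs[threadIdx.x] + xs[threadIdx.x + 2] + b[i]);
        __syncthreads();
        xs[threadIdx.x + 1] = (1.0f - OMEGA) * xs[threadIdx.x + 1] + OMEGA * xnew;
        __syncthreads();
    }
    x[i] = xs[threadIdx.x + 1];                     // commit results to global memory
}
```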
Memory Layout on GPU
Mixed data structures
– Original grid (finest grid level, Level 0): ELL-like sparse matrix
– Regularized coarse to coarsest grids (Levels 1~3): stored as graphics pixels (regular 2D arrays) on GPU
[Figure: restriction maps Level 0 (the nodal matrix) down through Levels 1 and 2 to Level 3 (the coarsest grid); prolongation maps back up]
Nodal Analysis Matrix: ELL-like Sparse Matrix Storage
Split the nodal analysis matrix as A = D + M
– D: diagonal elements of A, stored as the inverted diagonal D^{-1}
– M: off-diagonal elements of A
Off-diagonals are packed into fixed-width vectors (two columns, Col 1 and Col 2, in this example)
– Element value vector: the off-diagonal values of each row, padded (P) where a row has fewer entries
– Element index vector: the matching column indices, padded the same way
[Figure: an 8×8 example matrix (entries a_{1,1}, a_{1,4}, a_{1,5}; a_{2,2}, a_{2,3}, a_{2,6}; ...; a_{8,3}, a_{8,6}, a_{8,8}) packed into the value/index vectors plus the inverted-diagonal vector]
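A sketch of this storage scheme as a C struct (field names are assumed): values and indices are laid out column-major, so consecutive rows — and hence consecutive threads — sit at consecutive addresses.

```cuda
// ELL-like storage of A = D + M for an n-node grid with at most K
// off-diagonal entries per row (padded slots hold value 0 / index 0).
// Both arrays are column-major: entry k of row i lives at [k * n + i],
// so threads i, i+1, ... of a warp read consecutive addresses -> coalesced.
struct EllMatrix {
    int    n;      // number of grid nodes (matrix rows)
    int    K;      // max off-diagonals per row (padded width; 2 in the example)
    float *val;    // K*n off-diagonal element values
    int   *idx;    // K*n column indices of those values
    float *dinv;   // n entries of D^{-1} (diagonal stored pre-inverted)
};
```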
GPU Device Memory Access Pattern
GPU-based Jacobi iteration (smoother), computed in two steps (kernel sketch below):
    x^{(k+1)} = D^{-1} [ b − M x^{(k)} ]
    Step 1: S^{(k)} = b − M x^{(k)}        Step 2: x^{(k+1)} = D^{-1} S^{(k)}
[Figure: threads t1~t8 (one per matrix row) step through the ELL value/index columns in lockstep over time slots T1~T4, so each warp touches consecutive addresses — fully coalesced device memory accesses]
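A minimal CUDA kernel for one such Jacobi sweep over the EllMatrix layout sketched earlier (one thread per grid node; an assumption-level sketch, not the talk's exact kernel):

```cuda
// One Jacobi smoothing step: x_new = D^{-1} (b - M * x_old).
// The column-major ELL layout makes val[k*n + i] / idx[k*n + i] coalesced
// across the threads of a warp for every k; padded slots hold value 0,
// so they contribute nothing to the sum.
__global__ void jacobi_ell(int n, int K,
                           const float *val, const int *idx, const float *dinv,
                           const float *b, const float *x_old, float *x_new) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float s = b[i];
    for (int k = 0; k < K; ++k)           // accumulate b_i - (M x_old)_i
        s -= val[k * n + i] * x_old[idx[k * n + i]];
    x_new[i] = dinv[i] * s;               // scale by the inverted diagonal
}
```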
Algorithm Flow
Set initial solution
→ Get initial residual and search direction
→ Multigrid preconditioning: Jacobi smoother using the ELL-like sparse matrix + geometrical multigrid solver (GMD)
→ Update solution and residual
→ Check convergence
→ If not converged: update search direction and repeat the preconditioned iteration
→ If converged: return final solution
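This flow is standard preconditioned CG with the multigrid solve playing the role of the preconditioner z = M^{-1} r. A host-side sketch under assumed helper names (dot/axpy/spmv stand in for GPU vector kernels or CUBLAS calls; mg_precond is the hybrid preconditioner sketched earlier):

```cuda
// Assumed GPU vector helpers (names illustrative):
float *new_vec(int n);
void   copy_vec(int n, const float *src, float *dst);
float  dot(int n, const float *a, const float *b);
float  norm2(int n, const float *a);
void   axpy(int n, float a, const float *x, float *y);  // y += a * x
void   xpay(int n, float a, const float *x, float *y);  // y  = x + a * y
void   spmv(const EllMatrix &A, const float *x, float *y);                          // y = A x
void   spmv_residual(const EllMatrix &A, const float *x, const float *b, float *r); // r = b - A x
void   mg_precond(const EllMatrix &A, const float *r, float *z);  // z = M^{-1} r

// MGPCG: PCG where each preconditioner solve is one multigrid cycle.
void mgpcg(const EllMatrix &A, const float *b, float *x, float tol, int max_it) {
    float *r = new_vec(A.n), *z = new_vec(A.n), *p = new_vec(A.n), *q = new_vec(A.n);

    spmv_residual(A, x, b, r);           // initial residual  r = b - A x
    mg_precond(A, r, z);                 // multigrid preconditioning
    copy_vec(A.n, z, p);                 // initial search direction
    float rho = dot(A.n, r, z);

    for (int it = 0; it < max_it && norm2(A.n, r) > tol; ++it) {
        spmv(A, p, q);                                  // q = A p
        float alpha = rho / dot(A.n, p, q);
        axpy(A.n,  alpha, p, x);                        // update solution
        axpy(A.n, -alpha, q, r);                        // update residual
        mg_precond(A, r, z);                            // precondition new residual
        float rho_new = dot(A.n, r, z);
        xpay(A.n, rho_new / rho, z, p);                 // update search direction
        rho = rho_new;
    }
}
```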
Experiment Results
Linux computing system: C++ & CUDA
– CPU: Core 2 Quad 2.66 GHz + 6 GB DRAM
– GPU: NVIDIA GTX 285, 1.5 GHz with 240 SPs ($400)
Power grid test cases
– IBM power grid benchmark circuits CKT1~5 (0.13M ~ 1.7M nodes)
– Larger industrial power grid designs CKT6~8 (4.5M ~ 10M nodes)
Direct solver on the host
– Cholmod with Supernodal and Metis functions
Iterative solvers on GPU
– MGPCG: multigrid preconditioned CG
– DPCG: diagonally preconditioned CG
– HMD: hybrid multigrid (Z. Feng et al., ICCAD'08)
Power Grid Design Information

CKT    N_node   N_layer  N_nnz    N_res   N_cur
CKT1   127.0K   5        532.9K   209.7K  37.9K
CKT2   851.6K   5        3.7M     1.4M    201.1K
CKT3   953.6K   6        4.1M     1.5M    277.0K
CKT4   1.0M     3        4.3M     1.7M    540.8K
CKT5   1.7M     3        6.6M     2.5M    761.5K
CKT6   4.7M     8        18.8M    6.8M    185.5K
CKT7   6.7M     8        26.2M    9.7M    267.3K
CKT8   10.5M    8        40.8M    14.8M   419.3K

N_layer: the number of metal layers
N_res: the number of resistors
N_cur: the number of current sources
Convergence Comparison
[Figure: residual (log scale, 10^0 down to 10^-4) vs. iteration number (0~14) for MGPCG and HMD — MGPCG converges much faster]
Max errors
– HMD: 1e-3 Volt
– MGPCG: 1e-5 Volt
HMD: Hybrid Multigrid Method
MGPCG: Multigrid Preconditioned Conjugate Gradient Method
Results
Power Grid DC Analysis

CKT    NCG    NDPCG  NMGPCG  NHMD  TCG   TDPCG
CKT1   1,405  400    4       7     0.2   0.12
CKT2   4,834  3,351  4       13    5.9   3.9
CKT3   2,253  681    3       8     5.4   2.1
CKT4   4,062  411    5       >30   5.3   3.7
CKT5   6,433  700    6       >30   15.2  9.5

NCG: the number of CG iterations
NDPCG: the number of diagonally preconditioned CG iterations
NMGPCG: the number of multigrid preconditioned CG iterations
NHMD: the number of hybrid multigrid iterations
TCG: the runtime of CG (seconds)
TDPCG: the runtime of diagonally preconditioned CG (seconds)
Results (Cont.)
Power Grid DC Analysis (Cont.)

CKT    TMGPCG  THMD   TCHOL  Eavg  Emax  Speedup
CKT1   0.05    0.1    1.7    1e-4  4e-4  34X
CKT2   0.5     0.78   20.2   2e-6  2e-5  40X
CKT3   0.4     0.72   21.6   1e-5  1e-4  54X
CKT4   0.9     >2.4   19.4   8e-5  7e-4  22X
CKT5   1.1     >4.5   25.6   1e-4  5e-4  25X

TMGPCG: the runtime of multigrid preconditioned CG iterations (seconds)
THMD: the runtime of hybrid multigrid iterations (seconds)
TCHOL: the runtime of the direct matrix solver (Cholmod, seconds)
Eavg: average error
Emax: maximum error
Speedup: TCHOL / TMGPCG
Results (Cont.)
DC Analysis of Large Circuits

CKT    N_node  N_MGPCG  T_MGPCG  T_CHOL  Speedup
CKT6   4.7M    7        4.9      131.5   27X
CKT7   6.7M    9        7.9      205.1   26X
CKT8   10.5M   11       11.6     N/A     N/A

N_MGPCG: the number of MGPCG iterations
T_MGPCG: the runtime of the MGPCG solver (seconds)
T_CHOL: the runtime of the Cholmod solver (seconds)
Results (Cont.)
Transient Analysis Results

CKT    Tcpu   Tgpu  Ngpu  Eavg  Emax  Speedup
CKT1   40.8   2.1   102   3e-6  8e-4  20X
CKT2   315.3  15.2  99    1e-5  3e-4  21X
CKT3   360.7  15.6  100   5e-6  1e-4  23X
CKT4   352.7  19.6  140   3e-5  2e-4  18X
CKT5   553.9  26.1  130   2e-5  1e-4  21X

Tcpu: Cholmod solve time (seconds)
Tgpu: MGPCG time (seconds)
Ngpu: the number of MGPCG iterations
Transient Analysis: CKT1
[Figure: voltage (V) vs. time (0~5 ns) waveforms from Cholmod and the GPU solver, with a zoomed panel (~1.7072~1.7079 V around 2.84~2.865 ns) showing the two curves overlapping]
CKT1 with 127K nodes, 500 time steps, 509 MGPCG iterations
Cholmod: 212 s; GPU: 9.2 s -> 23X speedup
Transient Analysis: CKT5
[Figure: voltage (V) vs. time (0~5 ns) waveforms from Cholmod and the GPU solver, with a zoomed panel (~1.7708~1.7711 V around 2.3~2.335 ns) showing close agreement]
CKT5 with 1.7M nodes, 500 time steps, 693 MGPCG iterations
Cholmod: 2,700 s; GPU: 128 s -> 22X speedup
Conclusion and Future Work
Robust circuit simulation on GPU is challenging
– How to accelerate simulations for irregular problems?
– Hard to guarantee accuracy and robustness
Parallel multigrid preconditioning method for power grid analysis
– Multigrid preconditioning (geometrical + matrix representations)
– Geometrical multigrid solver on GPU
– ELL-like sparse matrix-vector operations for original grids on GPU
– Applicable to more general power grids with strong irregularities
– Much faster convergence & higher accuracy than ever before
Future work
– Node ordering and grid partitioning for multi-core/multi-GPU systems
– GPU performance modeling to further improve solver efficiency
– Heterogeneous computing to adaptively balance the workloads