TBS: Fast Analysis of Structured Power Grid by Triangularization Based Structure Preserving Model Order Reduction

Hao Yu, Yiyu Shi and Lei He
Electrical Engineering Dept., UCLA

Partially supported by NSF and UC-MICRO fund sponsored by Analog Devices, Intel and Mindspeed

http://eda.ee.ucla.edu

New Challenges in Integrity Verification

Integrity verification checks for transient V/T violations in linear power/signal/thermal networks
  Large-scale: millions of nodes and ports
  Often structured: e.g., locally regular and globally irregular P/G network [Singh-Sapatnekar:TCAD’05]

A fast yet accurate linear simulator is necessary to perform large-scale transient verification
  Linear-network macromodeling is one effective approach

How can structure information be used to build accurate and efficient macromodels?

Existing Structured Macromodeling

Hierarchical node-elimination (HNE) by [Zhao-Panda-Sapatnekar-Blaauw:DAC’00]
  Build macromodel by internal node elimination with source mapping
  Analyze macromodel in a hierarchical (two-level) fashion
  Require a sparsification by linear-programming (LP) due to the dense fill-in

SPRIM [Freund:ICCAD’04] and BSMOR [Yu-He-Tan:BMAS’05]
  Leverage block structure in the state matrix
  Build macromodel by a structure-preserved moment matching

HiPRIME [Cao-Lee-Chen:DAC’02], a hierarchical extension of PRIMA [Odabasioglu-Celik-Pileggi:TCAD’98]
  Build macromodel by hierarchical orthonormalization
  Lose the hierarchy due to the final flat projection

We propose a new structure-preserved moment matching, with 20x less waveform error and 50x speedup

Outline

Review macromodeling by moment matching

Our Approach: TBS method

Experimental Results

Conclusions

Macromodeling by Moment Matching (I)

Electric systems can be described in MNA (modified nodal analysis)

Solution (x) of MNA is contained in block Krylov subspace

Grimme’s Projection Theorem
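The three statements above can be written out in a standard form. This is a sketch for orientation only, using the symbols that appear elsewhere in the deck (G, C, B, L, V, q); the expansion point s_0, the transfer-function symbol H, and the hat notation for reduced matrices are assumptions, not notation taken from the slides.

```latex
\begin{aligned}
&\text{MNA:} && (G + sC)\,x(s) = B\,u(s), \qquad y(s) = L^{T} x(s) \\
&\text{Block Krylov subspace:} && \mathcal{K}_q(A, R) = \operatorname{span}\{R,\ AR,\ \dots,\ A^{q-1}R\}, \\
& && A = (G + s_0 C)^{-1} C, \qquad R = (G + s_0 C)^{-1} B \\
&\text{Projection:} && \hat G = V^{T} G V,\quad \hat C = V^{T} C V,\quad \hat B = V^{T} B,\quad \hat L = V^{T} L \\
&\text{Projection theorem:} && \mathcal{K}_q(A, R) \subseteq \operatorname{span}(V)
  \ \Longrightarrow\ \hat H(s) = \hat L^{T} (\hat G + s \hat C)^{-1} \hat B
  \text{ matches the first } q \text{ moments of } H(s) \text{ about } s_0
\end{aligned}
```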

Macromodeling by Moment Matching (II)

a) To remove linear dependency in the low-dimensional projection matrix V, block-Arnoldi orthonormalization is applied

b) To preserve passivity, a congruence transformation is used to project the state matrices (G, C, B, L) respectively

c) To handle a large number of inputs, as in a P/G network, SIMO (single-input-multi-output) reduction can be assumed
  Replace the input port matrix B by a common input vector J
  All poles are matched w.r.t. one superposed input
  Matched moments/poles (q) are independent of the input number (p)

V is flat and destroys the structure of the state matrices [Feldmann-Liu:ICCAD’04]
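Steps a) and b) can be illustrated with a minimal single-input sketch in plain Python. It is not the authors' implementation: the 6-node RC ladder, the expansion point s = 0, and all function names are assumptions chosen for illustration. It builds an orthonormal Krylov basis with Arnoldi-style Gram-Schmidt, applies the congruence transformation, and compares the first q moments of the full and reduced models.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        s = sum(M[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (M[k][n] - s) / M[k][k]
    return x

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def krylov_basis(G, C, b, q):
    """Orthonormal basis of span{r, A r, ..., A^(q-1) r}, A = G^-1 C, r = G^-1 b."""
    V = []
    w = solve(G, b)
    for _ in range(q):
        for v in V:  # modified Gram-Schmidt against earlier vectors
            h = dot(v, w)
            w = [wi - h * vi for wi, vi in zip(w, v)]
        norm = dot(w, w) ** 0.5
        v_new = [wi / norm for wi in w]
        V.append(v_new)
        w = solve(G, matvec(C, v_new))
    return V  # list of q basis vectors (the columns of V)

def project(M, V):
    """Congruence transformation V^T M V."""
    MV = [matvec(M, v) for v in V]
    return [[dot(vi, mvj) for mvj in MV] for vi in V]

def moments(G, C, b, l_out, count):
    """Moments l^T (-G^-1 C)^k G^-1 b for k = 0 .. count-1 (expansion at s = 0)."""
    out, r = [], solve(G, b)
    for _ in range(count):
        out.append(dot(l_out, r))
        r = [-x for x in solve(G, matvec(C, r))]
    return out

# Hypothetical RC ladder: unit resistors, unit capacitors, input at node 0.
n, q = 6, 3
G = [[0.0] * n for _ in range(n)]
for i in range(n):
    G[i][i] = 2.0 if i < n - 1 else 1.0
    if i + 1 < n:
        G[i][i + 1] = G[i + 1][i] = -1.0
C = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
b_in = [1.0] + [0.0] * (n - 1)
l_out = [0.0] * (n - 1) + [1.0]

V = krylov_basis(G, C, b_in, q)
Gr, Cr = project(G, V), project(C, V)
br = [dot(v, b_in) for v in V]
lr = [dot(v, l_out) for v in V]
full = moments(G, C, b_in, l_out, q)
reduced = moments(Gr, Cr, br, lr, q)
```

Here `full` and `reduced` should agree to rounding error for the first q moments, while the reduced matrices `Gr` and `Cr` are dense q x q matrices, which is the loss of structure noted above.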

Structure-preserved Moment Matching

SPRIM [Freund:ICCAD’04] leverages the 2 x 2 block structure in MNA
  Splits V into a 2 x 2 block diagonal form
  Preserves the structure of reciprocity (symmetry between input and output), and hence achieves a higher accuracy than PRIMA

BSMOR [Yu-He-Tan:BMAS’05] partitions state matrices into more blocks
  Splits V into an m x m block diagonal form
  Preserves the block structure and sparsity, and hence achieves better efficiency than SPRIM

Limitations of SPRIM and BSMOR
  Moment/pole matching is not localized
  Reduction does not preserve the structure of latency
  Model does not leverage redundancy
  Inefficient and inaccurate for P/G grid macromodeling

Outline

Our Approach: TBS method
  Triangular Block Structured moment matching

Conclusions

From Layout to Structured Model

Build a structured state matrix by partitioning the layout
  Stamp basic blocks diagonally
  Stamp interconnection blocks off-diagonally

[Figure: an 8-node layout example with block conductances g1 and g2, interconnect conductance gx, and the resulting block-structured conductance matrix, where g3 = 2g1 + gx and g4 = 2g2 + gx]

A number of interconnected basic blocks can be used to represent both homogeneous and heterogeneous circuits
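The stamping rule can be sketched in a few lines of Python. The topology below is a hypothetical stand-in for the slide's 8-node example (two 4-node ring blocks joined by two interconnect resistors); the node numbering and edge lists are assumptions, since the original figure is not recoverable in detail.

```python
def stamp(G, i, j, g):
    """Stamp a conductance g between nodes i and j into the nodal matrix G."""
    G[i][i] += g
    G[j][j] += g
    G[i][j] -= g
    G[j][i] -= g

def build_structured_G(g1, g2, gx):
    """8 x 8 conductance matrix: two basic blocks plus interconnect (assumed layout)."""
    n = 8
    G = [[0.0] * n for _ in range(n)]
    block1_edges = [(0, 1), (0, 2), (1, 3), (2, 3)]  # basic block 1, conductance g1
    block2_edges = [(4, 5), (4, 6), (5, 7), (6, 7)]  # basic block 2, conductance g2
    link_edges = [(2, 4), (3, 5)]                    # interconnect, conductance gx
    for i, j in block1_edges:
        stamp(G, i, j, g1)   # lands in the first diagonal block
    for i, j in block2_edges:
        stamp(G, i, j, g2)   # lands in the second diagonal block
    for i, j in link_edges:
        stamp(G, i, j, gx)   # off-diagonal entries, plus gx on the boundary diagonals
    return G

g1, g2, gx = 1.0, 2.0, 0.5
G = build_structured_G(g1, g2, gx)
```

With this layout the boundary nodes of block 1 get the diagonal entry 2*g1 + gx and those of block 2 get 2*g2 + gx, which matches the g3 and g4 definitions on the slide, and the off-diagonal blocks contain only -gx entries.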

Properties of Interconnected Basic Blocks

Structure of latency: the spatial distribution of time constants
  Each basic block has a time constant

Redundancy: different basic blocks can share the same or a similar time constant

Due to redundancy, the basic block representation is not compact

TBS Flow

Basic Blocks
  -> Dominant-pole Clustering -> Compact Blocks
  -> Triangularization -> Triangular Blocks
  -> Block Diagonal Projection -> Reduced Blocks
  -> Two-level Relaxation Analysis -> Block Integrity

Dominant-pole based clustering removes redundancy

Clustering Procedure

Compress basic blocks into compact blocks

1. Calculate the q-dominant pole-set (mode) for each basic block

2. Cluster basic blocks if the mode-distance is small enough

Cluster number is determined by the nature of the network structure
  There is no need to cluster a homogeneous circuit, but TBS still applies
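The two clustering steps can be sketched as follows. The slide's definition of the mode-distance did not survive extraction, so the distance used here (largest relative difference between sorted pole-sets) and the greedy grouping rule are assumptions for illustration only, as are the example pole values.

```python
def mode_distance(mode_a, mode_b):
    """Hypothetical mode-distance: largest relative gap between sorted pole-sets."""
    pairs = zip(sorted(mode_a), sorted(mode_b))
    return max(abs(a - b) / max(abs(a), abs(b)) for a, b in pairs)

def cluster_blocks(modes, tol):
    """Greedily group basic blocks whose modes lie within tol of a cluster's first member."""
    clusters = []  # each cluster is a list of block indices; the first is the representative
    for idx, mode in enumerate(modes):
        for members in clusters:
            if mode_distance(modes[members[0]], mode) <= tol:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters

# q = 2 dominant poles (rad/s) per basic block; values are made up for the example.
modes = [
    [-1.0e9, -3.0e9],
    [-1.02e9, -3.05e9],
    [-5.0e10, -9.0e10],
    [-1.0e9, -2.98e9],
]
clusters = cluster_blocks(modes, tol=0.05)
print(clusters)  # [[0, 1, 3], [2]]
```

Blocks 0, 1 and 3 share nearly the same pole-set and collapse into one compact block, while block 2 keeps its own; for a homogeneous circuit every block would land in a single cluster.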

Advantages of Clustering

Redundant poles are removed
  Hence redundant columns in the projection matrix are also removed, i.e., the effective rank of the projection matrix is improved

Structure of latency is leveraged
  Each compact block can be solved with a different time-step

A complete modal decomposition is achieved
  Each compact block has a unique pole-set or mode, and the resulting system is block-wise stiff

System poles are determined by both diagonal and off-diagonal blocks, which is not efficient

TBS Flow

Triangularization can localize system poles to diagonal blocks, which is the key contribution of this work

Triangularization Procedure

1. Stack a replica-block diagonally

2. Move the original lower-triangular parts to the new upper-triangular parts

This procedure is implemented by a block matrix data structure without increasing memory usage

Advantages of Triangularization

System poles are determined only by the compact blocks on the diagonal
  Compact blocks are almost decoupled from each other

A triangular system has a factorization cost coming only from the diagonal blocks
  There is no need to factorize the entire matrix

Block duplication results in an equivalent solution
  Simpler than the existing permutation-based triangularization procedure [Kim Davis: KLU]

Due to the replica block, the overall cost of factorization is the same as the original

TBS Flow

Block diagonal projection can reduce the system size and the cost of the factorization

Block Diagonal Projection Procedure

1. Split a flat projection matrix V into a structured (block diagonal) V, increasing its rank by a factor of the cluster number

2. Reduce the state matrices block by block respectively

The reduced system preserves the upper-triangular structure
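The two steps can be sketched in plain Python. This is an illustration under assumptions, not the authors' code: the 4 x 4 matrix, the block sizes, and the one-column V are made up, and the per-block re-orthonormalization that a real implementation would apply to each slice of V is omitted for brevity.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def split_rows(V, sizes):
    """Step 1: split a flat n x q matrix V into per-block row slices V_1 .. V_m."""
    parts, start = [], 0
    for n_i in sizes:
        parts.append(V[start:start + n_i])
        start += n_i
    return parts

def get_block(G, sizes, i, j):
    """Return block (i, j) of G for the given block sizes."""
    r0, c0 = sum(sizes[:i]), sum(sizes[:j])
    return [row[c0:c0 + sizes[j]] for row in G[r0:r0 + sizes[i]]]

def block_diag_project(G, V, sizes):
    """Step 2: reduce block by block, Gr[i][j] = V_i^T G_ij V_j."""
    Vs = split_rows(V, sizes)
    m = len(sizes)
    return [[matmul(transpose(Vs[i]), matmul(get_block(G, sizes, i, j), Vs[j]))
             for j in range(m)] for i in range(m)]

# Upper block-triangular example with m = 2 blocks of size 2 and q = 1.
G = [[2.0, -1.0, -1.0, 0.0],
     [-1.0, 2.0, 0.0, -1.0],
     [0.0, 0.0, 3.0, -1.0],
     [0.0, 0.0, -1.0, 3.0]]
V = [[1.0], [1.0], [1.0], [1.0]]
sizes = [2, 2]
Gr = block_diag_project(G, V, sizes)
```

The flat V has q = 1 column, while the block diagonal form diag(V_1, V_2) has m*q = 2 columns, and the zero lower-left block of G stays zero in `Gr`, so the reduced system is still upper block-triangular.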

Advantages of Block Diagonal Projection

System moments and poles are matched locally
  Each compact block is reduced locally to match q poles
  A total of mq poles are matched for m unique compact blocks (poles from the replica are duplicate poles)

Reduced model preserves block triangular structure and structure of latency
  Each reduced block can be factorized independently
  Each reduced block could have a different time-constant

More matched poles improve accuracy
  Using a low-order reduction for each compact block locally can achieve a high-order accuracy for the overall system

The reduced model can be efficiently solved by a block backward-substitution or a two-level analysis with relaxation
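Block backward-substitution on an upper block-triangular system can be sketched as below. The small dense solver, the `None` convention for zero blocks, and the example numbers are assumptions for illustration; the point is that only the diagonal blocks are ever factorized or solved.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        s = sum(M[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (M[k][n] - s) / M[k][k]
    return x

def block_back_substitute(blocks, rhs):
    """Solve an upper block-triangular system.

    blocks[i][j] is block (i, j), or None when it is zero;
    rhs[i] is the right-hand-side slice of block row i.
    """
    m = len(rhs)
    x = [None] * m
    for i in reversed(range(m)):
        r = rhs[i][:]
        for j in range(i + 1, m):          # subtract couplings to solved blocks
            if blocks[i][j] is not None:
                for k, row in enumerate(blocks[i][j]):
                    r[k] -= sum(a * v for a, v in zip(row, x[j]))
        x[i] = solve(blocks[i][i], r)      # only diagonal blocks are solved
    return x

A11 = [[2.0, -1.0], [-1.0, 2.0]]
A12 = [[-1.0, 0.0], [0.0, -1.0]]
A22 = [[3.0, -1.0], [-1.0, 3.0]]
x = block_back_substitute([[A11, A12], [None, A22]], [[0.0, 0.0], [2.0, 2.0]])
```

For this example every unknown equals 1 up to rounding. The cost is two 2 x 2 solves instead of one 4 x 4 solve, which is the saving that grows with the number of blocks.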

TBS Flow

Two-level relaxation can further reduce simulation cost

Two-level Relaxation Solver

Two-level representation and analysis

The time-domain iteration of a triangular system always converges [White: Book’87]

Each reduced diagonal block can be factorized independently, and solved with a different time step during backward-Euler (BE) integration
  In contrast, the previous pole-residue solution eigen-decomposes the entire reduced matrix (dense and without structure), so the structure of latency cannot be exploited
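The idea of integrating each diagonal block with its own time step can be sketched with two scalar blocks. Everything here is a simplifying assumption (scalar blocks, the parameter values, unit-step input, and sampling the fast block at the coarse time points); it shows one sweep over a triangular system rather than the authors' two-level solver.

```python
def backward_euler(c, g, rhs, h, steps, x0=0.0):
    """Integrate c*x' + g*x = rhs(t) with backward Euler; returns x(h), x(2h), ..."""
    xs, x = [], x0
    a = c / h + g  # the block's "factorization", done once per step size
    for n in range(1, steps + 1):
        x = (c / h * x + rhs(n * h)) / a
        xs.append(x)
    return xs

# Upper-triangular two-block system (block 2 does not depend on block 1):
#   c1*x1' + g1*x1 + g12*x2 = 0      slow block, time constant 1
#   c2*x2' + g2*x2          = u(t)   fast block, time constant 0.01
c1, g1, g12 = 1.0, 1.0, -0.5
c2, g2 = 0.01, 1.0

h2, k = 0.01, 10   # fine step for the fast block
h1 = k * h2        # coarse step for the slow block

# Solve the last block first with its own fine step.
x2 = backward_euler(c2, g2, lambda t: 1.0, h2, steps=200)

def rhs1(t):
    """Coupling from block 2, sampled at the coarse time points."""
    idx = round(t / h2) - 1
    return -g12 * x2[idx]

# Then solve the slow block with a 10x larger step.
x1 = backward_euler(c1, g1, rhs1, h1, steps=20)
```

Because the system is triangular, one backward sweep (block 2, then block 1) suffices and no iteration between the blocks is needed; the slow block takes 20 steps where a single shared step size would have forced 200.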

Outline


Our Approach: TBS method


Conclusions

Experiment Settings

Large-scale homogeneous and heterogeneous P/G grids (RC-mesh) with millions of nodes

For the heterogeneous case, each block has a different wire-pitch/width and block-size, and hence a different time-constant

The reduction algorithm assumes SIMO reduction for a large number of inputs, but also supports general MIMO reduction

Compare TBS to BSMOR [Yu-He-Tan:BMAS’05], HiPRIME [Cao-Lee-Chen:DAC’02], and HNE [Zhao-Panda-Sapatnekar-Blaauw:DAC’00]

Triangular Block Structure Preservation

[Figure: nonzero (nz) pattern of conductance matrices: (a) original system, (b) triangular system, (c) reduced system by TBS]

m x q Pole Matching

(m0=32, m=4, q=8): TBS has exactly 32 poles matched, BSMOR has exactly 8 poles matched and 24 poles approximately matched, and HiPRIME (a partitioned PRIMA) has only 8 poles matched

Waveforms in time domain: improved accuracy with more matched poles

Study Waveform-error Scalability

ckt   Node (N)  Port (p)  Order (q)  HNE      HiPRIME  BSMOR    TBS
ckt1  1K        48        8          5.54e-6  9.09e-6  4.87e-6  5.03e-7
ckt2  10K       320       40         1.21e-5  2.31e-5  7.93e-6  1.84e-6
ckt3  100K      480       60         1.31e-2  6.82e-4  1.91e-4  3.02e-5
ckt4  1M        800       100        6.01e-2  9.67e-3  4.23e-3  1.27e-4
ckt5  7.68M     4800      200        0.11     9.93e-2  5.10e-2  3.01e-3
ckt6  7.68M     6.14M     300        NA       NA       NA       5.04e-3

HiPRIME, BSMOR and TBS use the same order (moments) to generate the macromodel

The macromodel obtained by HNE has a similar size and sparsity as TBS

1. TBS reduces waveform-error by 38X compared to HNE as truncation used in HNE leads to large error

2. TBS reduces waveform-error by 33X compared to HiPRIME as more poles are matched

3. TBS reduces waveform-error by 17X compared to BSMOR as more poles are exactly matched
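The quoted factors can be checked against the table with simple arithmetic. The sketch below uses the ckt5 row, the largest circuit on which all four methods ran; tying the quoted factors to ckt5 is an inference from the numbers, not something stated on the slide.

```python
# Waveform errors for ckt5, copied from the table above.
errors_ckt5 = {"HNE": 0.11, "HiPRIME": 9.93e-2, "BSMOR": 5.10e-2, "TBS": 3.01e-3}

def error_reduction(errors, baseline, method="TBS"):
    """Factor by which `method` reduces the waveform error of `baseline`."""
    return errors[baseline] / errors[method]

ratios = {name: error_reduction(errors_ckt5, name)
          for name in ("HNE", "HiPRIME", "BSMOR")}
for name, ratio in ratios.items():
    print(f"TBS vs {name}: {ratio:.1f}X")
```

The HiPRIME and BSMOR ratios come out at about 33X and 17X, in line with the quoted figures. The HNE ratio comes out near 36.5X rather than 38X, which is consistent with the HNE entry having been rounded to 0.11 in the table.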

Study Runtime Scalability

ckt   HNE                            HiPRIME                  BSMOR                      TBS
      build          simulation      build     simulation     build      simulation      build     simulation
ckt1  0.44s          0.08s           0.15s     1.02s          0.12s      0.08s           0.09s     0.08s
ckt2  2.19s          1.24s           0.54s     1min:42s       0.63s      1.18s           0.11s     1.02s
ckt3  1min:17s       1min:51s        5.76s     2hr:48min:20s  1min:2s    1min:38s        1.62s     1min:32s
ckt4  34min:58s      21min:32s       47.3s     ~1day          4min:54s   11min:42s       20.7s     11min:23s
ckt5  4hr:43min:18s  1day:5hr:11min  2min:42s  ~5day          1hr:45min  1day:1hr:36min  2min:8s   1day:18min
ckt6  NA             NA              NA        NA             NA         NA              6min:16s  1day:1hr:29min

All methods generate macromodels with similar accuracy

1. TBS (and HiPRIME) is 133X faster to build than HNE as no LP-truncation is needed to preserve sparsity

2. TBS (and HiPRIME) is 54X faster to build than BSMOR as the orthonormalization is performed locally

3. TBS (and BSMOR/HNE) is 109X faster to simulate than HiPRIME as their macromodels have hierarchy

Runtime includes macromodel-building/simulation time

Conclusions

TBS enables localized moment matching, and matches more poles than PRIMA
TBS is stable, and is passive for MIMO reduction
TBS is applicable to both homogeneous and heterogeneous designs
TBS achieves over 20x less waveform error and 50x speedup compared to HNE, HiPRIME, and BSMOR (an improved version of SPRIM)

TBS approach has been extended to
  Handle inductance and its inverse element [Yu-Shi-He:ICCAD’06]
  Optimize simultaneous power and thermal integrity in 3D integration [Yu-Ho-He:ICCAD’06]

More details can be found in DAC Ph.D. forum 2006
