Transient Analysis
CK Cheng
UC San Diego
Jan. 25, 2007
Outline
• Research Directions
• Simulation test case results
• Overview of Simulation
• Commercial Package
• Alternating direction implicit (ADI) Method
• General Operator Splitting Method
• Distributed Computing
• Conclusions and Future Works
Research Directions
• Simulation: SPICE, STA
• Network on Chip: topology and wire styles,
• Power, and Clock Networks
• Data Path Components: adders, shifters, multipliers, division
• Packaging: passive distortion compensation
6x6 Bump Simulation Results
• The Circuit:
– 184K Capacitors, 17K Current Sources, 120K Inductors and 246K Resistors.
– 306K Nodes
• Accuracy:
– Waveform and measurement results match Fujitsu’s with less than 0.002% error.
• Runtime / Memory Comparison:
                   CPU_Time   Memory   Computer Used
UCSD               678s       600.2M   Pentium 4 3.2G, Linux
Fujitsu Log File   1845s      771M     unknown
6x6 Bump Simulation Results
• Measurement results and waveform
                   Min_pwr_l_est_10000954   Min_18269323   Min_33085875
UCSD               0.9980790                0.9967357      0.9934251
Fujitsu Log File   0.9980620                0.9966940      0.9933790
Error              0.002%                   0.004%         0.005%
(Figure: waveform comparison. Red curve is UCSD result)
703KR Simulation Results
• The Circuit:
– 514K Capacitors, 76K Current Sources, 370K Inductors and 703K Resistors.
– 1.3M Nodes
• Accuracy:
– Measurement results match Fujitsu’s with less than 0.02% error.
• Runtime / Memory Comparison:
                   CPU_Time         Memory   Computer Used
UCSD               2575s (0.7h)     1.7G     Pentium 4 3.2G, Linux
Fujitsu Log File   864561s (240h)   2.28G    unknown
703KR Simulation Results
• Measurement results and waveform
                   Min_33096003   Min_33096004   Min_33097557
UCSD               0.9400988      0.9421157      0.9370827
Fujitsu Log File   0.9399610      0.9419260      0.9368400
Error              0.015%         0.02%          0.026%
(Figure: waveform, UCSD results only. Fujitsu waveform is not available for comparison)
Further Speed-ups
• Reduce iteration count by 50% for pure linear circuits (like 6x6 bump and 703KR)
– 2x speed up
• More effective time step control
– DVDT, breakpoint, truncation error. 1.5 - 3x speed up
• Use Multigrid solver
– 1.5 - 2x speed up for medium circuits (6x6 bump)
– 2x - 10x speed up for large circuits (703KR)
• Parallel simulation
– 4 or more processors on Linux cluster
– 32 to hundreds of processors on supercomputer.
• Overall speed-up
– 6x - 60x speed up without parallel simulation
– 12x - 1000x speed up with parallel simulation
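As a quick check of how the overall non-parallel range relates to the individual factors, the sketch below multiplies the listed speed-ups. It assumes the gains compound independently (an assumption, not something the slide states) and uses the large-circuit multigrid range; the names `factors` and `combined_range` are illustrative only.

```python
# Individual speed-up ranges (low, high) from the list above,
# using the multigrid range quoted for large circuits.
factors = {
    "iteration_count": (2.0, 2.0),
    "time_step_control": (1.5, 3.0),
    "multigrid_large": (2.0, 10.0),
}

def combined_range(ranges):
    """Multiply (low, high) ranges, assuming the gains are independent."""
    low, high = 1.0, 1.0
    for low_factor, high_factor in ranges:
        low *= low_factor
        high *= high_factor
    return low, high

overall = combined_range(factors.values())
print(overall)
```

Under that assumption the product reproduces the 6x - 60x range quoted for the non-parallel case.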
Performance and capacity prediction
Cases 10x-100x larger than 703KR.
Circuit size                  Preferred Solver       CPU Time        Memory
Small - Medium (0.3M nodes)   LU Decomposition       11 minutes      600M
Medium - Large (1.3M nodes)   Multigrid              43 minutes      1.7G
Huge (10–100M nodes)          Multigrid + Parallel   5 – 100 hours   15G - 200G
Overview of Simulation
Our research
• Fast speed with SPICE accuracy
• Nonlinear devices
• Efficient matrix solvers
• Effective integration methods
• Time step controls according to different integration methods
• Distributed computing
(Flowchart of the simulation loop. Boxes: Load Circuit, Device Evaluation, Linearization, Integration Approximation, LU Decomposition, N-R Converge? (Yes / No), Time Step Control, Next Time Point)
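The loop named in the flowchart boxes can be sketched on a one-node circuit. The example below is a minimal illustration, not the simulator described in these slides: it assumes a source Vs driving a node through R, with a capacitor C and a diode from the node to ground, backward Euler integration, and a fixed time step. The function name and all parameter values are hypothetical.

```python
import math

def simulate_rc_diode(t_end=2e-8, h=1e-10, R=1e3, C=1e-12,
                      Vs=1.0, Is=1e-14, Vt=0.02585):
    """Backward-Euler transient of Vs -- R -- node v, with C and a diode to ground."""
    n_steps = round(t_end / h)
    v = 0.0                          # "Load Circuit": initial condition
    trace = [(0.0, v)]
    for step in range(1, n_steps + 1):
        v_new = v                    # N-R starting guess: previous time point
        for _ in range(50):
            # Device evaluation and linearization of the diode
            expo = math.exp(v_new / Vt)
            i_d = Is * (expo - 1.0)
            g_d = Is / Vt * expo
            # Integration approximation: C dv/dt ~ C (v_new - v) / h
            residual = C * (v_new - v) / h - (Vs - v_new) / R + i_d
            jacobian = C / h + 1.0 / R + g_d
            dv = -residual / jacobian    # scalar stand-in for the LU solve
            v_new += dv
            if abs(dv) < 1e-12:          # "N-R Converge?" answered Yes
                break
        else:
            raise RuntimeError("Newton-Raphson did not converge")
        v = v_new                    # "Next Time Point" (fixed step, no step control)
        trace.append((step * h, v))
    return trace

trace = simulate_rc_diode()
t_final, v_final = trace[-1]
```

A production simulator would replace the scalar division with a sparse LU solve and choose each step from an error estimate; the fixed step here only keeps the sketch short.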
Overview of Simulation
• Matrix Solver
– LU Decomposition
– Iterative Approach
• Integration
– Time Step Control
– ADI
• Nonlinear Devices
– Two Stage Newton Raphson
• Distributed Computing
– Commercial Implementation
Overview of Simulation
• Integration
– Time Step Control
– ADI (two-way partitioning)
– Operator Splitting (multi-way)
• Distributed Computing
– MPI
– Partitioning
• Three Ph.D. Students
Commercial Package: Fastrack Design
• Founded in January 2001
• Headquartered in San Jose
• Privately funded, cash-flow positive
• Two Business Units
– Design Services
– Technology Products
Analog Designs
Design              # Elements   Sim. Len   HSpice       mSPICE   Speedup Factor
LVDS                13490        20us       80h          26h      3.1X
Oscillator          222          1 ms       13,706s      2,670s   5.1X
Biasing Circuit     49197        200ns      427s         82s      5.2X
PLL                 16050        40us       67d          12d      5.6X
PLL (post-layout)   300K         40us       290d (est)   16d      18.1X
Digital Blocks
Design Name   MOS     R        C        mSPICE Runtime   Traditional Spice Runtime   Speedup Factor
ALU           10.1k   12.7k    7.5k     6.9m             7m                          1.0X
CONTROL       69k     83.7k    52.5k    1.5h             9.5h                        6.3X
YN_BLK        205K    242.8k   203.9k   3.5h             > 2d                        >13.7X
THP           437k    499.3k   313.5k   5.0h             COULD NOT RUN               ∞
VCON          936k    753k     561k     15.0h            COULD NOT RUN               ∞
Memory Blocks
Design                # Tr   # R     # C    # Vectors / Sim. Length   mSPICE Run Time
BRAM (pre)            220K   0       500    2                         2.5 hours
SRAM (pre) 8Kx8 SP    410K   0       0      2                         7 hours
eRAM (post) 256x16    72K    28K     427K   48ns                      8 hours
BRAM (post)           220K   1320K   870K   2                         18 hours
• 100% accurate Spice simulation
mSPICE-Parallel
• Industry’s first practical parallel Spice simulation solution
– Increases capacity further
– Dramatically improves throughput
• Uses Matrix Level Partitioning
– No loss of accuracy
– Client-Server configuration
– Minimal memory requirement for client nodes
Client-Server Configuration
• Server distributes sub-matrices to clients
• Clients communicate partial solutions
• Minimal memory requirements for clients
(Figure: a matrix of 0/1 entries on the server, split into sub-matrices sent to the clients)
Experimental Results
Design                   Total Elements   Sim. Length   Runtime 1-proc   Runtime 2-proc   Runtime 4-proc
ASIC                     1.2M             8ns           12.2h            7.0h (1.7X)      5.1h (2.4X)
38IO SSO                 1.4M             30ns          3.0h             2.0h (1.5X)      1.4h (2.2X)
Signal-power             2.1M             1.2us         13d              7d18h (1.7X)     5d12h (2.4X)
4096x8 RAM (extracted)   2.3M             10ns          32h              18.5h (1.7X)     13.4h (2.4X)
120IO SSO                3.5M             30ns          6.2h             4.1h (1.5X)      3.1h (2.0X)
ADI: Previous Works
• 1999, Namiki and Ito
– the alternating direction implicit (ADI) is used to simulate a 2D TE wave.
• 2001, Zheng et al.
– extended to 3D problems
• 2001 & 2003, Lee and Chen
– ADI is applied to power grids modeled as transmission lines
The alternation is among different geometric directions, so the simulated geometric structure is constrained.
Alternating Direction Implicit (ADI)
• ADI Integration Method
– Two-way partition of the circuit
– One partition is used for each backward integration
– Unconditionally stable (A-stable: independent of time step size)
– Time step size is chosen according to local truncation error.
Alternating Direction Implicit (ADI)
• ADI method formulation
• Circuit partition algorithm
• Local truncation error estimation
• Stability discussion
• Experimental results
SPICE Formulation
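• Equations for RLC circuits
$$\begin{bmatrix} C & 0 \\ 0 & L \end{bmatrix} \begin{bmatrix} \dot V(t) \\ \dot I(t) \end{bmatrix} = -\begin{bmatrix} G & E \\ -E^T & R \end{bmatrix} \begin{bmatrix} V(t) \\ I(t) \end{bmatrix} + U(t)$$
where C: capacitance matrix, L: inductance matrix, R: resistance matrix, G: conductance matrix, E: incidence matrix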
ADI Formulation
• Transient simulation
– Split the resistor and inductor branches into two parts
• G = G1 + G2
• E = E1 + E2
• R = R1 + R2
– Alternate backward and forward integration on each partition
ADI Formulation (Cont.)
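• Equations of ADI method
$$\begin{bmatrix} \frac{2C}{h}+G_1 & E_1 \\ -E_1^T & \frac{2L}{h}+R_1 \end{bmatrix} \begin{bmatrix} V(t+\frac{h}{2}) \\ I(t+\frac{h}{2}) \end{bmatrix} = \begin{bmatrix} \frac{2C}{h}-G_2 & -E_2 \\ E_2^T & \frac{2L}{h}-R_2 \end{bmatrix} \begin{bmatrix} V(t) \\ I(t) \end{bmatrix} + U(t+\tfrac{h}{2})$$
$$\begin{bmatrix} \frac{2C}{h}+G_2 & E_2 \\ -E_2^T & \frac{2L}{h}+R_2 \end{bmatrix} \begin{bmatrix} V(t+h) \\ I(t+h) \end{bmatrix} = \begin{bmatrix} \frac{2C}{h}-G_1 & -E_1 \\ E_1^T & \frac{2L}{h}-R_1 \end{bmatrix} \begin{bmatrix} V(t+\frac{h}{2}) \\ I(t+\frac{h}{2}) \end{bmatrix} + U(t+\tfrac{h}{2})$$
– the size of left-hand-side matrix remains unchanged
– the number of non-zero elements is decreased
– direct solving methods can be efficient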
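To make the two half steps concrete, here is a minimal sketch on a one-node circuit with no inductors: a capacitor C with two conductances G1 and G2 to ground and a constant current input u, so that C dv/dt = -(G1 + G2) v + u. Treating G1 and G2 as the two partitions is an illustrative assumption, as are the function names and parameter values. The first half step is implicit in G1 and explicit in G2, the second swaps the roles.

```python
import math

def adi_transient(n_steps, t_end=1.0, C=1.0, G1=1.0, G2=2.0, u=1.0, v0=0.0):
    """ADI on C dv/dt = -(G1 + G2) v + u with G1, G2 as the two partitions."""
    h = t_end / n_steps
    v = v0
    for _ in range(n_steps):
        # First half step: partition 1 implicit, partition 2 explicit
        v_half = ((2 * C / h - G2) * v + u) / (2 * C / h + G1)
        # Second half step: partition 2 implicit, partition 1 explicit
        v = ((2 * C / h - G1) * v_half + u) / (2 * C / h + G2)
    return v

def exact(t, C=1.0, G1=1.0, G2=2.0, u=1.0, v0=0.0):
    """Closed-form solution of the same first-order circuit."""
    g = G1 + G2
    return u / g + (v0 - u / g) * math.exp(-g * t / C)

err_coarse = abs(adi_transient(100) - exact(1.0))
err_fine = abs(adi_transient(200) - exact(1.0))
ratio = err_coarse / err_fine   # about 4 is expected for a second-order method
```

Halving the step should cut the error at t = 1 by roughly a factor of four, which is the behaviour expected of a second-order scheme.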
Experiments of non-zero fill-ins
• A small ASIC Design
Spice matrix: Dimension: 10,286. The number of non-zero elements: 46,655. The number of non-zero fill-ins: 90,960
• A large I/O Design
Spice matrix: Dimension: 615,436. The number of non-zero elements: 2,126,246
         Sub-matrix1 elements   Sub-matrix1 fill-ins   Sub-matrix2 elements   Sub-matrix2 fill-ins   Total fill-ins
Case 1   38,572                 2,618                  42,020                 10,040                 12,658
Case 2   1,176,208              12,421,534             950,038                14,772,068             27,193,602
(elements = # non-zero elements, fill-ins = # non-zero fill-ins)
Local Truncation Error (LTE)
• Time step control using LTE
– In circuit transient analysis, the next time step can be estimated from the local truncation error at the present time point
– LTE is defined as the difference between the calculated solution and the exact solution
$$LTE = \varepsilon_n = x_{n+1} - \hat{x}_{n+1} = f(\Delta t_n)$$
– To ensure consistency, the local truncation error should not exceed the error tolerance, thus the time step can be estimated using
$$LTE = \varepsilon_n = x_{n+1} - \hat{x}_{n+1} = f(\Delta t_n) \le E_{tol}$$
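One common way to turn the tolerance condition into a step size is sketched below. It assumes the LTE of a second-order method scales as the cube of the step, and it adds a safety factor and a growth cap; those two knobs, the function name, and the default values are assumptions for illustration and are not taken from the slides.

```python
def next_time_step(h, lte, tol, order=2, safety=0.9, max_growth=2.0):
    """Suggest the next step so the predicted LTE stays below tol.

    Assumes LTE scales as h**(order + 1), as for a second-order method.
    """
    if lte <= 0.0:
        # No measurable error: grow by the maximum allowed factor.
        return h * max_growth
    scale = safety * (tol / lte) ** (1.0 / (order + 1))
    return h * min(scale, max_growth)

# An LTE eight times over tolerance halves the step when safety is 1.
halved = next_time_step(1.0, 8e-3, 1e-3, safety=1.0)
```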
Local Truncation Error (Cont.)
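• LTE of ADI method
(1) Equations
$$\begin{bmatrix} C & 0 \\ 0 & L \end{bmatrix} \begin{bmatrix} \dot V(t) \\ \dot I(t) \end{bmatrix} = -\begin{bmatrix} G & E \\ -E^T & R \end{bmatrix} \begin{bmatrix} V(t) \\ I(t) \end{bmatrix} + U(t)$$
let
$$M = \begin{bmatrix} C & 0 \\ 0 & L \end{bmatrix}, \quad N = \begin{bmatrix} G & E \\ -E^T & R \end{bmatrix}, \quad X = \begin{bmatrix} V(t) \\ I(t) \end{bmatrix}$$
then
$$M\dot X = -NX + U$$
$$\dot X = -M^{-1}NX + M^{-1}U = AX + BU$$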
Local Truncation Error (Cont.)
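• LTE of ADI method
(2) Estimate exact solution
We characterize the input as a simple ramp over the interval $(t_n, t_{n+1})$. The exact analytic solution with time step $\Delta t_n$ is:
$$X_{n+1} = e^{A\Delta t_n}\left[X_n + A^{-1}\left(BU_n + A^{-1}B\frac{\Delta U_n}{\Delta t_n}\right)\right] - A^{-1}\left[B(U_n + \Delta U_n) + A^{-1}B\frac{\Delta U_n}{\Delta t_n}\right]$$
$$= \left(I + A\Delta t_n + \tfrac{1}{2}A^2\Delta t_n^2 + \tfrac{1}{6}A^3\Delta t_n^3\right)X_n + \left(B\Delta t_n + \tfrac{1}{2}AB\Delta t_n^2 + \tfrac{1}{6}A^2B\Delta t_n^3\right)U_n + \left(\tfrac{1}{2}B\Delta t_n + \tfrac{1}{6}AB\Delta t_n^2\right)\Delta U_n + O(\Delta t_n^4)$$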
Local Truncation Error (Cont.)
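• LTE of ADI method
(3) Estimate ADI solution
$$\left(\frac{2}{\Delta t_n}M + N_1\right)X_{n+1/2} = \left(\frac{2}{\Delta t_n}M - N_2\right)X_n + U_{n+1/2}$$
$$\left(\frac{2}{\Delta t_n}M + N_2\right)X_{n+1} = \left(\frac{2}{\Delta t_n}M - N_1\right)X_{n+1/2} + U_{n+1/2}$$
With $A_i = -M^{-1}N_i$ and $B = M^{-1}$:
$$\hat X_{n+1} = \left(I - \tfrac{\Delta t_n}{2}A_2\right)^{-1}\left(I + \tfrac{\Delta t_n}{2}A_1\right)\left(I - \tfrac{\Delta t_n}{2}A_1\right)^{-1}\left(I + \tfrac{\Delta t_n}{2}A_2\right)X_n$$
$$+ \left[\left(I - \tfrac{\Delta t_n}{2}A_2\right)^{-1}\left(I + \tfrac{\Delta t_n}{2}A_1\right)\left(I - \tfrac{\Delta t_n}{2}A_1\right)^{-1} + \left(I - \tfrac{\Delta t_n}{2}A_2\right)^{-1}\right]\frac{\Delta t_n}{2}BU_{n+1/2}$$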
Local Truncation Error (Cont.)
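• LTE of ADI method
(3) Estimate ADI solution
$$\hat X_{n+1} = \left(I - \tfrac{\Delta t_n}{2}A_2\right)^{-1}\left(I + \tfrac{\Delta t_n}{2}A_1\right)\left(I - \tfrac{\Delta t_n}{2}A_1\right)^{-1}\left(I + \tfrac{\Delta t_n}{2}A_2\right)X_n$$
$$+ \left[\left(I - \tfrac{\Delta t_n}{2}A_2\right)^{-1}\left(I + \tfrac{\Delta t_n}{2}A_1\right)\left(I - \tfrac{\Delta t_n}{2}A_1\right)^{-1} + \left(I - \tfrac{\Delta t_n}{2}A_2\right)^{-1}\right]\frac{\Delta t_n}{2}BU_{n+1/2}$$
$$= \left(I + A\Delta t_n + \tfrac{1}{2}A^2\Delta t_n^2 + \tfrac{1}{4}A^3\Delta t_n^3 - \tfrac{1}{4}A_1A_2A\Delta t_n^3\right)X_n$$
$$+ \left(B\Delta t_n + \tfrac{1}{2}AB\Delta t_n^2 + \tfrac{1}{4}A^2B\Delta t_n^3 - \tfrac{1}{4}A_1A_2B\Delta t_n^3\right)U_n$$
$$+ \left(\tfrac{1}{2}B\Delta t_n + \tfrac{1}{4}AB\Delta t_n^2\right)\Delta U_n + O(\Delta t_n^4)$$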
Local Truncation Error (Cont.)
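• LTE of ADI method
(4) LTE estimation
$$LTE = \varepsilon_n = X_{n+1} - \hat X_{n+1}$$
$$= \left(-\tfrac{1}{12}A^3\Delta t_n^3 + \tfrac{1}{4}A_1A_2A\Delta t_n^3\right)X_n + \left(-\tfrac{1}{12}A^2B\Delta t_n^3 + \tfrac{1}{4}A_1A_2B\Delta t_n^3\right)U_n - \tfrac{1}{12}AB\Delta t_n^2\Delta U_n + O(\Delta t_n^4)$$
$$= -\tfrac{1}{12}\Delta t_n^3 X_n^{(3)} + \tfrac{1}{4}A_1A_2\Delta t_n^3\dot X_n + O(\Delta t_n^4)$$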
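The leading LTE term, -(1/12) A^3 Δt^3 X + (1/4) A1 A2 A Δt^3 X for zero input, can be checked numerically in the scalar case. The sketch below assumes scalar coefficients a1 and a2 (so they commute, which hides the ordering of the matrix product) and compares one ADI step against the exact exponential; the names and values are illustrative.

```python
import math

def adi_one_step(x, a1, a2, dt):
    """One ADI step for the scalar test equation x' = (a1 + a2) x, zero input."""
    h = dt / 2.0
    x_half = (1.0 + h * a2) * x / (1.0 - h * a1)     # implicit in a1, explicit in a2
    return (1.0 + h * a1) * x_half / (1.0 - h * a2)  # implicit in a2, explicit in a1

a1, a2, x0, dt = -1.0, -2.0, 1.0, 1e-3
a = a1 + a2

# Exact solution minus computed solution after one step
lte_measured = math.exp(a * dt) * x0 - adi_one_step(x0, a1, a2, dt)
# Leading term of the LTE expansion for zero input
lte_predicted = (-a**3 / 12.0 + a1 * a2 * a / 4.0) * dt**3 * x0
```

The two values should agree up to a relative difference of order Δt, since the neglected terms are O(Δt^4).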
Local Truncation Error (Cont.)
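• LTE of ADI method
(5) Time step control
$$\left(\frac{2}{\Delta t_n}M + N_1\right)X_{n+1/2} = \left(\frac{2}{\Delta t_n}M - N_2\right)X_n + U_{n+1/2}$$
$$\left(\frac{2}{\Delta t_n}M + N_2\right)X_{n+1} = \left(\frac{2}{\Delta t_n}M - N_1\right)X_{n+1/2} + U_{n+1/2}$$
Rearranged:
$$\frac{2}{\Delta t_n}M(X_{n+1/2} - X_n) = -N_1X_{n+1/2} - N_2X_n + U_{n+1/2}$$
$$\frac{2}{\Delta t_n}M(X_{n+1} - X_{n+1/2}) = -N_1X_{n+1/2} - N_2X_{n+1} + U_{n+1/2}$$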
Local Truncation Error (Cont.)
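• LTE of ADI method
(5) Time step control
Eliminating $X_{n+1/2}$ from the two half-step equations, with $A_i = -M^{-1}N_i$ and a ramp input, the computed solution satisfies
$$X_{n+1} = X_n + \frac{\Delta t_n}{2}\left(\dot X_{n+1} + \dot X_n\right) - \frac{1}{4}A_1A_2\Delta t_n^2\left(X_{n+1} - X_n\right)$$
The last term is $\tfrac{1}{4}A_1A_2\Delta t_n^3\dot X_n$ to leading order, so the LTE is
$$LTE = -\frac{1}{12}\Delta t_n^3 X_n^{(3)} + \frac{1}{4}A_1A_2\Delta t_n^3\dot X_n$$
(The remaining equations on this slide estimate the $A_1A_2$ term from the previous step, using $X_n$, $X_{n-1}$ and $\Delta t_{n-1}$, and scale it by $\Delta t_n^3$; their exact form is not recoverable from the extracted text.)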
Stability Discussion
• The stability is concerned with whether the accumulated error grows or decays as time evolves through a series of time steps.
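• For one-step integration approximations, the error is accumulated by a factor of $e^{A\Delta t_n}$:
$$X_{n+1} + A^{-1}\left[B(U_n + \Delta U_n) + A^{-1}B\frac{\Delta U_n}{\Delta t_n}\right] = e^{A\Delta t_n}\left[X_n + A^{-1}\left(BU_n + A^{-1}B\frac{\Delta U_n}{\Delta t_n}\right)\right]$$
• If the final steady state error vector is smaller than the initial, then the integration method is stable.
• In the ADI integration method:
$$e^{A\Delta t_n} \approx \left(I - \tfrac{\Delta t_n}{2}A_2\right)^{-1}\left(I + \tfrac{\Delta t_n}{2}A_1\right)\left(I - \tfrac{\Delta t_n}{2}A_1\right)^{-1}\left(I + \tfrac{\Delta t_n}{2}A_2\right)$$
– It can be proved to be unconditionally stable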
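For intuition, the scalar version of the ADI amplification factor can be evaluated over a wide range of step sizes. The sketch below assumes scalar, negative a1 and a2 (a stable system) and contrasts the ADI factor with forward Euler, which is only conditionally stable; it illustrates the claim and does not replace the proof. Names and values are illustrative.

```python
def adi_gain(a1, a2, dt):
    """|(1 + h a1)(1 + h a2) / ((1 - h a1)(1 - h a2))| with h = dt / 2."""
    h = dt / 2.0
    return abs((1 + h * a1) * (1 + h * a2) / ((1 - h * a1) * (1 - h * a2)))

def forward_euler_gain(a, dt):
    """|1 + a dt|, the forward Euler amplification factor."""
    return abs(1 + a * dt)

a1, a2 = -1.0, -2.0
steps = [10.0 ** k for k in range(-3, 4)]   # 1e-3 ... 1e3
adi_gains = [adi_gain(a1, a2, dt) for dt in steps]
fe_gains = [forward_euler_gain(a1 + a2, dt) for dt in steps]
```

Every ADI gain stays below one, while the forward Euler gain exceeds one as soon as the step is large relative to 1/|a|.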
Experimental Results
                        Circuit1   Circuit2   Circuit3   1k-cell
#Nodes                  10,000     40,000     90,000     10,200
#Transistors            0          0          0          6,500
Period                  10ns       10ns       10ns       10ns
SPICE3 CPU time (sec)   77.8       485.3      3,061.1    181.6
SPICE3 #steps           115        115        114        193
ADI CPU time (sec)      28.6       117.8      275.2      523.3
ADI #steps              102        102        102        949
Speedup                 2.7x       4.1x       11.1x      -
(Figure: Voltage drop of Circuit3 (power mesh with sinks))
(Figure: Signal in 1k_cell (ASIC design))
General Operator Splitting
• General operator splitting method
– Multi-way partitions
– Each partition is considered separately in each time step simulation
– No geometry constraints
– Local truncation error is used to dynamically control time step size
General Operator Splitting
• Fundamental theory
• Operator splitting formulation
• Local truncation error estimation
• Stability discussion
• Experimental results
Fundamental theory
• In circuit transient simulation, the integration approximation is actually the approximation of the exponential operator
• The exponential operators can be approximated in any order using a general scheme of fractal decomposition
• The decomposition of exponential operators corresponds to the circuit multi-way partition
New integration approximation in transient simulation
Fundamental theory
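• Approximation of exponential operator
– General circuit equation and solution
$$\dot x(t) = Ax(t) + Bu(t)$$
$$x(t + \Delta t) = e^{A\Delta t}x(t) + \int_t^{t+\Delta t} e^{A(t + \Delta t - \tau)}Bu(\tau)\,d\tau$$
– If we characterize the input as a simple ramp over the interval $(t_n, t_{n+1})$, the exact analytic solution with time step $\Delta t_n$ is
$$X_{n+1} = e^{A\Delta t_n}\left[X_n + A^{-1}\left(BU_n + A^{-1}B\frac{\Delta U_n}{\Delta t_n}\right)\right] - A^{-1}\left[B(U_n + \Delta U_n) + A^{-1}B\frac{\Delta U_n}{\Delta t_n}\right]$$
– Exponential operator approximation
• Forward Euler: $e^{A\Delta t} \approx I + A\Delta t$
• Backward Euler: $e^{A\Delta t} \approx (I - A\Delta t)^{-1}$
• Trapezoidal: $e^{A\Delta t} \approx \left(I - \tfrac{1}{2}A\Delta t\right)^{-1}\left(I + \tfrac{1}{2}A\Delta t\right)$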
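The three approximations of the exponential operator can be compared on a scalar test value. The sketch below assumes a scalar a in place of the matrix A and measures how the one-step error shrinks when the step is halved; function names and values are illustrative.

```python
import math

def approximations(a, dt):
    """Scalar forms of the three approximations of exp(a * dt)."""
    z = a * dt
    return {
        "forward_euler": 1.0 + z,
        "backward_euler": 1.0 / (1.0 - z),
        "trapezoidal": (1.0 + z / 2.0) / (1.0 - z / 2.0),
    }

def errors(a, dt):
    """Absolute one-step error of each approximation."""
    exact = math.exp(a * dt)
    return {name: abs(value - exact) for name, value in approximations(a, dt).items()}

coarse, fine = errors(-1.0, 0.02), errors(-1.0, 0.01)
ratios = {name: coarse[name] / fine[name] for name in coarse}
```

Ratios near 4 indicate a one-step error proportional to Δt^2 (first-order methods), and a ratio near 8 indicates Δt^3 (second order), which is why the trapezoidal rule is used for the sub-steps later on.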
Fundamental theory
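• Decomposition of exponential operators (Masuo Suzuki, 1991, Physics)
– Function:
$$F(x) = e^{x(A+B)}$$
– First order:
$$f_1(x) = e^{xA}e^{xB}$$
– Second order:
$$f_2(x) = e^{\frac{1}{2}xA}e^{xB}e^{\frac{1}{2}xA}$$
– Third order:
$$f_3(x) = e^{\frac{s}{2}xA}e^{sxB}e^{\frac{1-s}{2}xA}e^{(1-2s)xB}e^{\frac{1-s}{2}xA}e^{sxB}e^{\frac{s}{2}xA}, \quad s = 1/(2 - \sqrt[3]{2})$$
– (2m-1)th and (2m)th order:
$$f_{2m-1}(x) = f_{2m}(x) = f_{2m-3}(k_mx)\,f_{2m-3}((1-2k_m)x)\,f_{2m-3}(k_mx), \quad k_m = 1/(2 - \sqrt[2m-1]{2})$$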
Fundamental theory
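• Decomposition of exponential operators
$$F(x) = e^{x(A+B)} = I + (A+B)x + \tfrac{1}{2}(A+B)^2x^2 + O(x^3)$$
$$f_2(x) = e^{\frac{1}{2}xA}e^{xB}e^{\frac{1}{2}xA}$$
$$= \left[I + \tfrac{1}{2}Ax + \tfrac{1}{8}A^2x^2 + O(x^3)\right]\left[I + Bx + \tfrac{1}{2}B^2x^2 + O(x^3)\right]\left[I + \tfrac{1}{2}Ax + \tfrac{1}{8}A^2x^2 + O(x^3)\right]$$
$$= I + \left(\tfrac{1}{2}A + B + \tfrac{1}{2}A\right)x + \left(\tfrac{1}{8}A^2 + \tfrac{1}{2}AB + \tfrac{1}{8}A^2 + \tfrac{1}{2}B^2 + \tfrac{1}{2}BA + \tfrac{1}{4}A^2\right)x^2 + O(x^3)$$
$$= I + (A+B)x + \left(\tfrac{1}{2}A^2 + \tfrac{1}{2}AB + \tfrac{1}{2}BA + \tfrac{1}{2}B^2\right)x^2 + O(x^3)$$
$$= I + (A+B)x + \tfrac{1}{2}(A+B)^2x^2 + O(x^3)$$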
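The orders of the first- and second-order decompositions can be checked numerically with two small matrices that do not commute. The sketch below is an illustration only: the 2x2 matrices are arbitrary choices, and `expm` is a plain Taylor series written for this example rather than a library routine. It measures the error of f1(x) = e^{xA} e^{xB} and f2(x) = e^{xA/2} e^{xB} e^{xA/2} against e^{x(A+B)}.

```python
def mul(P, Q):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def expm(P, x, terms=30):
    """Taylor series of exp(x * P); adequate for the small inputs used here."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    term = [[1.0, 0.0], [0.0, 1.0]]
    for k in range(1, terms):
        term = [[x * v / k for v in row] for row in mul(term, P)]
        result = [[result[i][j] + term[i][j] for j in range(2)] for i in range(2)]
    return result

def max_diff(P, Q):
    return max(abs(P[i][j] - Q[i][j]) for i in range(2) for j in range(2))

# Arbitrary non-commuting test matrices (assumed for illustration)
A = [[-1.0, 0.5], [0.0, -2.0]]
B = [[-0.5, 0.0], [1.0, -1.5]]
A_plus_B = [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

def split_errors(x):
    """Errors of f1 and f2 against exp(x (A + B))."""
    exact = expm(A_plus_B, x)
    f1 = mul(expm(A, x), expm(B, x))
    f2 = mul(mul(expm(A, x / 2), expm(B, x)), expm(A, x / 2))
    return max_diff(exact, f1), max_diff(exact, f2)

e1_coarse, e2_coarse = split_errors(0.02)
e1_fine, e2_fine = split_errors(0.01)
ratio_f1 = e1_coarse / e1_fine   # near 4: error of f1 is O(x^2)
ratio_f2 = e2_coarse / e2_fine   # near 8: error of f2 is O(x^3)
```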
General Operator Splitting Formulation
• Transient simulation:
– Apply the second order approximation
$$e^{x(A_1+A_2)} \approx e^{\frac{1}{2}xA_1}e^{xA_2}e^{\frac{1}{2}xA_1}$$
– In each time step, every partition is calculated separately and trapezoidal integration is used for every partition
– The size of left-hand-side matrix may be changed
– The number of non-zero elements is definitely decreased
– Can be easily extended to multi-way partitions
$$e^{xA} = e^{x(A_1+A_2+\dots+A_q)} \approx e^{\frac{1}{2}xA_1}e^{\frac{1}{2}xA_2}\cdots e^{\frac{1}{2}xA_{q-1}}e^{xA_q}e^{\frac{1}{2}xA_{q-1}}\cdots e^{\frac{1}{2}xA_2}e^{\frac{1}{2}xA_1}$$
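A multi-way step can be sketched by chaining trapezoidal sub-steps in the symmetric order: half steps for A_1 through A_{q-1}, a full step for A_q, then the half steps in reverse. The example below is an illustration under stated assumptions: three arbitrary 2x2 partitions, zero input, and hand-written 2x2 helpers (`mul`, `inv`, `expm`) that are not part of any library.

```python
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]

def mul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def axpy(P, Q, s):
    """Return P + s * Q for 2x2 matrices."""
    return [[P[i][j] + s * Q[i][j] for j in range(2)] for i in range(2)]

def inv(P):
    det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
    return [[P[1][1] / det, -P[0][1] / det], [-P[1][0] / det, P[0][0] / det]]

def trapezoid(P, step):
    """(I - step/2 P)^-1 (I + step/2 P): trapezoidal approximation of exp(step * P)."""
    return mul(inv(axpy(IDENTITY, P, -step / 2)), axpy(IDENTITY, P, step / 2))

def gos_step(parts, h):
    """Half steps for A_1..A_{q-1}, a full step for A_q, then the half steps reversed."""
    sweep = [trapezoid(P, h / 2) for P in parts[:-1]]
    result = IDENTITY
    for factor in sweep + [trapezoid(parts[-1], h)] + sweep[::-1]:
        result = mul(result, factor)
    return result

def expm(P, x, terms=30):
    """Taylor series of exp(x * P), used only as the reference solution."""
    result, term = IDENTITY, IDENTITY
    for k in range(1, terms):
        term = [[x * v / k for v in row] for row in mul(term, P)]
        result = axpy(result, term, 1.0)
    return result

# Three arbitrary partitions A_1, A_2, A_3 (assumed for illustration)
parts = [[[-1.0, 0.5], [0.0, 0.0]],
         [[0.0, 0.0], [0.5, -1.0]],
         [[-0.5, 0.2], [0.2, -0.5]]]
total = axpy(axpy(parts[0], parts[1], 1.0), parts[2], 1.0)

def one_step_error(h):
    approx, exact = gos_step(parts, h), expm(total, h)
    return max(abs(approx[i][j] - exact[i][j]) for i in range(2) for j in range(2))

ratio = one_step_error(0.02) / one_step_error(0.01)   # near 8 for an O(h^3) local error
```

Each sub-step only needs a solve with one partition's matrix, which is where the reduction in non-zero elements comes from.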
General Operator Splitting Formulation
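• Equations
The three sub-steps below implement $e^{h(A_1+A_2)} \approx e^{\frac{1}{2}hA_1}e^{hA_2}e^{\frac{1}{2}hA_1}$; the superscripts (1) and (2) mark the intermediate states.
$$\begin{bmatrix} \frac{2C}{h}+\frac{G_1}{2} & \frac{E_1}{2} \\ -\frac{E_1^T}{2} & \frac{2L}{h}+\frac{R_1}{2} \end{bmatrix} \begin{bmatrix} V^{(1)}(t) \\ I^{(1)}(t) \end{bmatrix} = \begin{bmatrix} \frac{2C}{h}-\frac{G_1}{2} & -\frac{E_1}{2} \\ \frac{E_1^T}{2} & \frac{2L}{h}-\frac{R_1}{2} \end{bmatrix} \begin{bmatrix} V(t) \\ I(t) \end{bmatrix} + \frac{1}{2}U(t+\tfrac{h}{2})$$
$$\begin{bmatrix} \frac{C}{h}+\frac{G_2}{2} & \frac{E_2}{2} \\ -\frac{E_2^T}{2} & \frac{L}{h}+\frac{R_2}{2} \end{bmatrix} \begin{bmatrix} V^{(2)}(t) \\ I^{(2)}(t) \end{bmatrix} = \begin{bmatrix} \frac{C}{h}-\frac{G_2}{2} & -\frac{E_2}{2} \\ \frac{E_2^T}{2} & \frac{L}{h}-\frac{R_2}{2} \end{bmatrix} \begin{bmatrix} V^{(1)}(t) \\ I^{(1)}(t) \end{bmatrix} + \frac{1}{2}U(t+\tfrac{h}{2})$$
$$\begin{bmatrix} \frac{2C}{h}+\frac{G_1}{2} & \frac{E_1}{2} \\ -\frac{E_1^T}{2} & \frac{2L}{h}+\frac{R_1}{2} \end{bmatrix} \begin{bmatrix} V(t+h) \\ I(t+h) \end{bmatrix} = \begin{bmatrix} \frac{2C}{h}-\frac{G_1}{2} & -\frac{E_1}{2} \\ \frac{E_1^T}{2} & \frac{2L}{h}-\frac{R_1}{2} \end{bmatrix} \begin{bmatrix} V^{(2)}(t) \\ I^{(2)}(t) \end{bmatrix} + \frac{1}{2}U(t+\tfrac{h}{2})$$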
Local Truncation Error (Cont.)
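• LTE of general operator splitting method
Estimate solution
$$\left(\frac{2}{\Delta t_n}M + \frac{N_1}{2}\right)X_n^{(1)} = \left(\frac{2}{\Delta t_n}M - \frac{N_1}{2}\right)X_n + \frac{1}{2}U_{n+1/2}$$
$$\left(\frac{1}{\Delta t_n}M + \frac{N_2}{2}\right)X_n^{(2)} = \left(\frac{1}{\Delta t_n}M - \frac{N_2}{2}\right)X_n^{(1)} + \frac{1}{2}U_{n+1/2}$$
$$\left(\frac{2}{\Delta t_n}M + \frac{N_1}{2}\right)X_{n+1} = \left(\frac{2}{\Delta t_n}M - \frac{N_1}{2}\right)X_n^{(2)} + \frac{1}{2}U_{n+1/2}$$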
Local Truncation Error (Cont.)
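• LTE of general operator splitting method
Estimate solution
With $T_1 = \left(I - \tfrac{\Delta t_n}{4}A_1\right)^{-1}\left(I + \tfrac{\Delta t_n}{4}A_1\right)$ and $T_2 = \left(I - \tfrac{\Delta t_n}{2}A_2\right)^{-1}\left(I + \tfrac{\Delta t_n}{2}A_2\right)$:
$$\hat X_{n+1} = T_1T_2T_1X_n + \left[T_1T_2\left(I - \tfrac{\Delta t_n}{4}A_1\right)^{-1}\frac{\Delta t_n}{4} + T_1\left(I - \tfrac{\Delta t_n}{2}A_2\right)^{-1}\frac{\Delta t_n}{2} + \left(I - \tfrac{\Delta t_n}{4}A_1\right)^{-1}\frac{\Delta t_n}{4}\right]BU_{n+1/2}$$
$$= \left(I + A\Delta t_n + \tfrac{1}{2}A^2\Delta t_n^2 + \tfrac{1}{4}A^3\Delta t_n^3 - \left(\tfrac{1}{16}A_1^3 + \tfrac{1}{8}A_1^2A_2 + \tfrac{1}{8}A_2A_1^2 + \tfrac{1}{4}A_2A_1A_2\right)\Delta t_n^3\right)X_n$$
$$+ \left(B\Delta t_n + \tfrac{1}{2}AB\Delta t_n^2 + \tfrac{1}{4}A^2B\Delta t_n^3 - \left(\tfrac{3}{32}A_1^2 + \tfrac{3}{16}A_2A_1\right)B\Delta t_n^3\right)U_n$$
$$+ \left(\tfrac{1}{2}B\Delta t_n + \tfrac{1}{4}AB\Delta t_n^2\right)\Delta U_n + O(\Delta t_n^4)$$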
Local Truncation Error (Cont.)
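• LTE of general operator splitting method
LTE estimation
$$LTE = \varepsilon_n = X_{n+1} - \hat X_{n+1}$$
$$= -\tfrac{1}{12}\Delta t_n^3X_n^{(3)} + \left(\tfrac{1}{16}A_1^3 + \tfrac{1}{8}A_1^2A_2 + \tfrac{1}{8}A_2A_1^2 + \tfrac{1}{4}A_2A_1A_2\right)\Delta t_n^3X_n + \left(\tfrac{3}{32}A_1^2 + \tfrac{3}{16}A_2A_1\right)B\Delta t_n^3U_n + O(\Delta t_n^4)$$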
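For zero input the leading LTE term is -(1/12) A^3 Δt^3 X plus the splitting term (1/16 A1^3 + 1/8 A1^2 A2 + 1/8 A2 A1^2 + 1/4 A2 A1 A2) Δt^3 X. The sketch below checks that coefficient in the scalar case; it assumes scalar a1 and a2 (which commute, so the ordering of the matrix products is not exercised), and the names and values are illustrative.

```python
import math

def trapezoid_gain(a, step):
    """Scalar trapezoidal approximation of exp(a * step)."""
    return (1.0 + step * a / 2.0) / (1.0 - step * a / 2.0)

def gos_one_step(x, a1, a2, dt):
    """Half step with a1, full step with a2, half step with a1 (zero input)."""
    x = trapezoid_gain(a1, dt / 2.0) * x
    x = trapezoid_gain(a2, dt) * x
    return trapezoid_gain(a1, dt / 2.0) * x

a1, a2, x0, dt = -1.0, -2.0, 1.0, 1e-3
a = a1 + a2

# Exact solution minus computed solution after one step
lte_measured = math.exp(a * dt) * x0 - gos_one_step(x0, a1, a2, dt)
# Scalar form of the splitting term, written in the same order as the matrix terms
splitting_term = a1**3 / 16 + a1**2 * a2 / 8 + a2 * a1**2 / 8 + a2 * a1 * a2 / 4
lte_predicted = (-a**3 / 12 + splitting_term) * dt**3 * x0
```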
Local Truncation Error (Cont.)
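• LTE of general operator splitting method
LTE estimation
$$X_n^{(1)} - X_n = \tfrac{1}{4}A_1\Delta t_n\left(X_n^{(1)} + X_n\right) + \tfrac{1}{4}B\Delta t_nU_{n+1/2}$$
$$X_n^{(2)} - X_n^{(1)} = \tfrac{1}{2}A_2\Delta t_n\left(X_n^{(2)} + X_n^{(1)}\right) + \tfrac{1}{2}B\Delta t_nU_{n+1/2}$$
$$X_{n+1} - X_n^{(2)} = \tfrac{1}{4}A_1\Delta t_n\left(X_{n+1} + X_n^{(2)}\right) + \tfrac{1}{4}B\Delta t_nU_{n+1/2}$$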
Local Truncation Error (Cont.)
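• LTE of general operator splitting method
LTE estimation
Combining the three sub-steps, the computed solution satisfies
$$X_{n+1} = X_n + \frac{\Delta t_n}{2}\left(\dot X_{n+1} + \dot X_n\right) - \left(\tfrac{1}{16}A_1^3 + \tfrac{1}{8}A_1^2A_2 + \tfrac{1}{8}A_2A_1^2 + \tfrac{1}{4}A_2A_1A_2\right)\Delta t_n^3X_n - \left(\tfrac{3}{32}A_1^2 + \tfrac{3}{16}A_2A_1\right)B\Delta t_n^3U_n + O(\Delta t_n^4)$$
(The closing expression on this slide writes the LTE as a bracketed estimate built from $X_n^{(3)}/12$ and a difference of $X_n$ and $X_{n-1}$ over $\Delta t_{n-1}$, multiplied by $\Delta t_n^3$; its exact form is not recoverable from the extracted text.)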
Stability Discussion
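• The trapezoidal integration method is unconditionally stable for a stable system.
$$e^{A\Delta t_n} \approx \left(I - \tfrac{\Delta t_n}{2}A\right)^{-1}\left(I + \tfrac{\Delta t_n}{2}A\right)$$
• In our operator splitting method, the trapezoidal method is used for all the sub-systems, so it is still unconditionally stable
$$e^{x(A_1+A_2)} \approx e^{\frac{1}{2}xA_1}e^{xA_2}e^{\frac{1}{2}xA_1}$$
$$e^{A\Delta t_n} \approx \left(I - \tfrac{\Delta t_n}{4}A_1\right)^{-1}\left(I + \tfrac{\Delta t_n}{4}A_1\right)\left(I - \tfrac{\Delta t_n}{2}A_2\right)^{-1}\left(I + \tfrac{\Delta t_n}{2}A_2\right)\left(I - \tfrac{\Delta t_n}{4}A_1\right)^{-1}\left(I + \tfrac{\Delta t_n}{4}A_1\right)$$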
Experimental Results
                                   Circuit1   Circuit2   Circuit3
#Nodes                             10,000     40,000     90,000
#Transistors                       0          0          0
Period                             10ns       10ns       10ns
SPICE3 CPU time (sec)              77.8       485.3      3,061.1
SPICE3 #steps                      115        115        114
GOS CPU time (sec)                 164.7      1011.6     3435.9
GOS #steps                         102        102        102
Comparison (GOS time / SPICE3)     2.1x       2x         1.1x
(Figure: Voltage drop of Circuit3 (power mesh with sinks))
Conclusions
• We investigate alternating direction implicit and general operator splitting integration methods for transistor-level circuit transient simulation.
• In both methods, the circuit will be divided into several sub-circuits, thus the direct matrix solver is still efficient because the matrix is simplified.
• Both methods are second order accurate and unconditionally stable.
• Overhead:
– Circuit partition
– Each time step consists of many sub-steps, and each sub-step is an N-R iteration process
• Better for circuits with large linear network
Distributed Computing
• Distributed Processors
– Cluster
– Supercomputer
– Multi-Core Processors (Intel Dual/Quad-Core, IBM Cell etc.)
• Standard
– MPI
– Partitioning
– Matrix Solver
• Capabilities
– Speed-up (10-100+)
– Memory Capacity (10-100+)
Future Works
• ADI method
– More experiments
• General operator splitting method
– Design and implement multi-way circuit partition algorithm
– Implement multi-way general operator splitting program
– Derive LTE for general multi-way situation
– More experiments
• Distributed Computing
– MPI Standard
– Distributed Partitioning, Matrix Solver