An Efficient Clustering Algorithm For
Low Power Clock Tree Synthesis
Rupesh S. Shelar
Enterprise Microprocessor Group
Intel Corporation, Hillsboro, OR
March 21st, 2007
ISPD 2007, Austin
Outline
•Introduction
•Problem Formulation
•Clustering Algorithm
•Experimental Results
•Conclusion
Local Clock Capacitance Distribution in a Microprocessor
•Interconnects contribute a major portion of the total capacitance
•Clocks are the most active nets in the design
•Minimizing interconnect capacitance in clocks leads to reduction in dynamic power
Distribution generated from several blocks in a microprocessor
Microprocessor Clock Hierarchy
•Clock network in a processor:
– Distributed as a grid followed by a tree
[Figure: clock hierarchy with labels PLL; global clock distribution using multiple spines; tunable grid buffers; clock grid; regional clock buffers (RCBs); local clock buffers (LCBs); to state elements. Local clock network: CTS solution space.]
Previous Work
•Zero skew (unbuffered) trees: Tsay TCAD’93, Boese et al. ASIC’92, Edahiro DAC’93, ’94
•Buffered trees:
– Vittal et al., DAC’95: Trades off buffers with wires; unsuitable for controlled implementation of clock gating and delayed clocking
– Mehta et al., ICCD’97: Uses dynamic programming based heuristic for clustering
– Tsai et al., ICCAD’05: Formulation employing tunable buffers
Clock Tree Synthesis (CTS)
•Performed after the placement/sizing of sequentials
•Converts logical clock tree into physical one
•Flow employed in several microprocessor designs
[Figure (simplified version): design flow with stages RTL, logic synthesis, physical synthesis, CTS, routing. CTS inputs: logical clock tree; sequentials' (x, y) locations and sizes. CTS steps: clock buffer duplication, sizing clock buffers, routing clock nets.]
Clock Buffer Duplication
•Given a clock buffer, duplicate it to meet delay, slope, RC, skew constraints
– Decides
  • receivers driven by the same driver
  • the clock tree topology
•Applied recursively in reverse topological order
•Driven by clustering or partitioning
– Often intractable when capacity constraints are specified
– Many heuristics available
[Figure: duplication applied to K-stage buffers and K-stage receivers]
Outline
•Introduction
•Problem Formulation
•Clustering Algorithm
•Experimental Results
•Conclusion
Effect of Clustering on Capacitance
• A cluster implies a clock buffer
• Interconnect capacitance varies significantly for different solutions even with same number of clusters
[Figure: three clustering solutions (Solution 1, Solution 2, Solution 3) for 4 placed sequentials]
Clustering Targeting Power
•Find the clusters such that total local clock power is minimum
– Power in local clock: P_{Local Clock} = P_{Dynamic} + P_{Leakage}
– P_{Dynamic} = P_{Sequential Cap} + P_{Buffer Cap} + P_{Routing Cap}
– P_{Leakage} and P_{Buffer Cap} can be shown to be proportional to total cap
– P_{Sequential Cap} is fixed for CTS purposes
– Reducing P_{Local Clock} is equivalent to minimizing interconnect cap
•Find the clusters such that total interconnect capacitance is minimum
Routing-aware Clustering: Chicken-and-Egg Problem
Routing cap is unknown till the clustering is performed
Clustering cannot be performed till routing cap is known
Problem Simplification
•Let’s assume minimum spanning tree (MST) routing estimates
– Other candidates: HPWL, Edahiro metric
– Data in the paper show MST and Edahiro metric strongly correlated with actual clock tree wirelength
– MST possesses submodularity property suitable for greedy optimization
•Can the problem be solved optimally, i.e., can we perform clustering such that the routing cap./overall power is minimum?
•Yes, it can be (if capacity constraints are dropped)
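To make two of the candidate estimates concrete, here is a minimal sketch that computes HPWL and MST length for a set of sink locations. It assumes Manhattan (rectilinear) distance between sinks, which the slides do not state; the function names and the sample coordinates are hypothetical.

```python
def hpwl(points):
    """Half-perimeter wirelength: width plus height of the bounding box."""
    xs, ys = zip(*points)
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def mst_length(points):
    """MST length under Manhattan distance (assumed metric), via Prim's algorithm."""
    if len(points) < 2:
        return 0

    def dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    # best[i] = cheapest connection from point i to the tree grown so far
    best = {i: dist(points[0], points[i]) for i in range(1, len(points))}
    total = 0
    while best:
        nxt = min(best, key=best.get)   # closest point not yet in the tree
        total += best.pop(nxt)
        for i in best:                  # relax remaining points against nxt
            best[i] = min(best[i], dist(points[nxt], points[i]))
    return total


# Hypothetical sink locations at the corners of a 4 x 3 rectangle.
sinks = [(0, 0), (4, 0), (0, 3), (4, 3)]
print(hpwl(sinks), mst_length(sinks))
```

For these four corner sinks the bounding box gives a smaller number than the spanning tree, since a tree has to physically reach every sink.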
Problem Definition
•Given: Set of receivers S = {s_1, …, s_n}, their loads (c_{s_i}), and locations (x_{s_i}, y_{s_i})
•Find: A set of clusters, S_{clusters} = {c_1, …, c_m}, such that Σ_i (α + MST(c_i)) is minimum
•Subject to Constraints (or Design Parameters):
– Maximum # of receivers
  • Due to process, routing, etc.
– Maximum load in a cluster
  • Due to library
– Bounding box width/height
  • To control RC delay and variations in it
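The three constraints can be read as a feasibility test on a candidate cluster. The sketch below is one possible encoding, not the author's implementation: the `Receiver` and `Limits` types, their field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Receiver:
    x: float
    y: float
    load: float  # receiver load, c_{s_i} in the problem definition


@dataclass
class Limits:  # hypothetical container for the design parameters
    max_receivers: int
    max_load: float
    max_width: float
    max_height: float


def is_feasible(cluster, limits):
    """True if a non-empty cluster satisfies all three constraints."""
    xs = [r.x for r in cluster]
    ys = [r.y for r in cluster]
    return (len(cluster) <= limits.max_receivers
            and sum(r.load for r in cluster) <= limits.max_load
            and max(xs) - min(xs) <= limits.max_width
            and max(ys) - min(ys) <= limits.max_height)


limits = Limits(max_receivers=3, max_load=10.0, max_width=50.0, max_height=50.0)
near = [Receiver(0, 0, 2.0), Receiver(10, 5, 3.0)]
far = near + [Receiver(80, 0, 1.0)]  # violates the bounding-box width
print(is_feasible(near, limits), is_feasible(far, limits))
```

All three checks are monotone: once a cluster violates one of them, every larger cluster containing it does too, which is what lets a greedy merge procedure reject a merge permanently.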
Outline
•Introduction
•Problem Formulation
•Clustering Algorithm
•Experimental Results
•Conclusion
Power-aware Clustering Algorithm
•Similar to Kruskal’s MST construction algorithm
•Steps in algorithm:
– Create complete graph G(S, E, W)
– Assign each edge estimated capacitance as the weight
– Create trivial solution with each cluster containing a receiver
– For each edge, in ascending order of weights
  • Merge clusters till the cost function is minimized
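The steps above can be sketched as a Kruskal-style loop over a union-find structure. This is an illustrative reading, not the PoAwCl source: it assumes a merge is accepted only while the edge weight is below the per-cluster constant α (merging replaces two clusters by one, so the cost sum drops by α and rises by the edge weight), and it enforces only the maximum-receivers constraint. The function name, its parameters, and the edge-to-node assignment in the sample graph are hypothetical.

```python
def power_aware_clusters(n, edges, alpha, max_receivers):
    """Kruskal-style clustering sketch.

    n: number of receivers, labeled 0..n-1
    edges: iterable of (weight, u, v) taken from the complete graph
    alpha: per-cluster constant in the cost sum (assumed positive)
    max_receivers: capacity limit per cluster
    Returns (clusters, total_mst).
    """
    parent = list(range(n))
    size = [1] * n
    mst = [0] * n  # MST weight of the cluster rooted at each index

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for w, u, v in sorted(edges):
        if w >= alpha:
            break  # assumed stop rule: a merge would no longer lower the cost
        ru, rv = find(u), find(v)
        if ru == rv or size[ru] + size[rv] > max_receivers:
            continue  # same cluster already, or capacity would be exceeded
        parent[rv] = ru
        size[ru] += size[rv]
        mst[ru] += mst[rv] + w

    clusters = {}
    for u in range(n):
        clusters.setdefault(find(u), []).append(u)
    total_mst = sum(mst[root] for root in clusters)
    return sorted(clusters.values()), total_mst


# Hypothetical 4-receiver graph with edge weights 1, 2, 4, 4, 5, 5.
edges = [(1, 0, 1), (2, 1, 2), (4, 0, 2), (4, 2, 3), (5, 0, 3), (5, 1, 3)]
clusters, total = power_aware_clusters(4, edges, alpha=10, max_receivers=3)
print(clusters, total)
```

With a limit of three receivers per cluster, the two cheapest edges are accepted and every later edge is rejected, leaving one three-receiver cluster and one singleton. A full version would also check maximum load and bounding-box size before each merge.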
Example
•Constraint: maximum # of receivers is 3
[Figure: receivers connected by edges with weights 1, 2, 4, 4, 5, 5; legend: a cluster, an edge, the weight]
•Power-aware clustering results in clusters with total MST value of 3, which is optimal in this case
Optimality, Time Complexity of Algorithm
•Ensures optimality when no capacity constraints (max. load, # of receivers) are specified
– Reduces to minimum spanning forest problem
•Runs in O(n² log n) time, where n is the number of receivers
– Handles blocks with ~5K sequentials easily
– 1.34 seconds for clustering of 1037 sequentials
•Run-times practical and comparable to competitive algorithms
– Clock buffer duplication takes minutes on ~5K sequential blocks
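One way to see where a bound of this form can come from, assuming the run time is dominated by sorting the edges of the complete graph (the slides do not break the cost down, so this is an assumption):

```latex
|E| = \binom{n}{2} = \frac{n(n-1)}{2}, \qquad
O(|E| \log |E|) = O(n^2 \log n^2) = O(n^2 \log n)
```

For n = 1037 this gives 1037 · 1036 / 2 = 537,166 candidate edges.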
Outline
•Introduction
•Problem Formulation
•Clustering Algorithm
•Experimental Results
•Conclusion
Evaluation of Power-Aware Clustering (PoAwCl)
•Implemented clustering algorithm, PoAwCl, in C++
•Incorporated in the clock buffer duplication step using TCL
•Rest of the CTS kept unchanged
•Generated clock trees on microprocessor blocks by changing only the clustering/partitioning heuristics
•Best of the results compared with the PoAwCl
Results on Clock Trees: Int. Cap. Improvement
13% Average Improvement
Results on Clock Trees: Total Cap. Improvement
6% Average Improvement
Results on Clock Trees: Wirelength Improvement
11% Average Improvement
Looking at Cluster Pictures
•●, +, *, ▼ denote locations of sequentials; symbols of the same type denote a cluster
•The 4 clusters, in each case, represent 4 clock buffers driving the sequentials in their clusters
[Figure: two panels, "Clustering aimed at minimizing # of buffers" and "Power-aware clustering"]
Viewing the Routing
Power-aware clustering (on right) results in smaller wirelength
Agenda
•Introduction
•Motivation
•Problem Formulation
•Clustering Algorithm
•Experimental Results
•Conclusion
Conclusion
•Power-aware clustering results in 13% improvement in interconnect cap
•Also frees up routing resources by 11%, discounting shielding and spacing of clock wires
•Used for other applications such as enable logic (or clock gating) synthesis, trunk-routing
•Acknowledgment: Intel’s CAD Organization, for providing the source code of the CTS package, which sped up the development
Thank you….