An Efficient Clustering Algorithm For Low Power Clock Tree Synthesis. Rupesh S. Shelar, Enterprise Microprocessor Group, Intel Corporation, Hillsboro, OR. March 21st, 2007


An Efficient Clustering Algorithm For Low Power Clock Tree Synthesis

Rupesh S. Shelar

Enterprise Microprocessor Group

Intel Corporation, Hillsboro, OR

March 21st, 2007

ISPD 2007, Austin


Outline

• Introduction
• Problem Formulation
• Clustering Algorithm
• Experimental Results
• Conclusion


Local Clock Capacitance Distribution in a Microprocessor

•Interconnects contribute a major portion of the total capacitance

•Clocks are the most active nets in the design

•Minimizing interconnect capacitance in clocks leads to reduction in dynamic power

Distribution generated from several blocks in a microprocessor


Microprocessor Clock Hierarchy

• Clock network in a processor:

– Distributed as a grid followed by a tree

[Figure: clock hierarchy. Labels: PLL; Global Clock Distribution Using Multiple Spines; Tunable Grid Buffers; Clock Grid; Regional Clock Buffers (RCBs); Local Clock Buffers (LCBs); To state elements. Local Clock Network: CTS Solution Space.]


Previous Work

• Zero skew (unbuffered) trees: Tsay TCAD’93, Boese et al. ASIC’92, Edahiro DAC’93, ’94

• Buffered trees:

– Vittal et al., DAC’95: Trades off buffers with wires; unsuitable for controlled implementation of clock gating and delayed clocking

– Mehta et al., ICCD’97: Uses dynamic programming based heuristic for clustering

– Tsai et al., ICCAD’05: Formulation employing tunable buffers


Clock Tree Synthesis (CTS)

• Performed after the placement/sizing of sequentials

• Converts logical clock tree into physical one

• Flow employed in several microprocessor designs

[Figure: design flow (simplified version). Labels: RTL; Logic Synthesis; Physical Synthesis; CTS; Routing; Logical Clock Tree; Sequentials (x, y), sizes; Clock Buffer Duplication; Sizing Clock Buffers; Routing Clock Nets.]


Clock Buffer Duplication

• Given a clock buffer, duplicate it to meet delay, slope, RC, and skew constraints

– Decides
  • receivers driven by the same driver
  • the clock tree topology

• Applied recursively in reverse topological order

• Driven by clustering or partitioning

– Often intractable when capacity constraints are specified

– Many heuristics available

[Figure: duplication of K-stage buffers driving K-stage receivers.]


Outline

• Introduction
• Problem Formulation
• Clustering Algorithm
• Experimental Results
• Conclusion


Effect of Clustering on Capacitance

• A cluster implies a clock buffer

• Interconnect capacitance varies significantly for different solutions even with same number of clusters

[Figure: 4 placed sequentials and three clusterings of them, labeled Solution 1, Solution 2, and Solution 3.]


Clustering Targeting Power

• Find the clusters such that total local clock power is minimum

– Power in local clock: P_LocalClock = P_Dynamic + P_Leakage

– P_Dynamic = P_SequentialCap + P_BufferCap + P_RoutingCap

– P_Leakage and P_BufferCap can be shown to be proportional to total cap

– P_SequentialCap is fixed for CTS purposes

– Reducing P_LocalClock is equivalent to minimizing interconnect cap

• Find the clusters such that total interconnect capacitance is minimum
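The chain of reasoning in these bullets can be written out as a short derivation. This is only a sketch: the constant k, and the assumption that buffer and leakage power scale with the load the buffers drive, are introduced here for illustration. The slides state only that the proportionality to total cap can be shown.

```latex
\begin{align*}
P_{\text{LocalClock}} &= P_{\text{Dynamic}} + P_{\text{Leakage}} \\
  &= P_{\text{SequentialCap}} + P_{\text{RoutingCap}}
     + \bigl(P_{\text{BufferCap}} + P_{\text{Leakage}}\bigr) \\
  % Assumption (not from the slides): buffer and leakage power scale
  % with the driven load by some constant k >= 0.
  &\approx (1 + k)\,\bigl(P_{\text{SequentialCap}} + P_{\text{RoutingCap}}\bigr)
\end{align*}
```

Because P_SequentialCap is fixed for CTS purposes, the only term that clustering can change is P_RoutingCap, so under this assumption minimizing interconnect capacitance also minimizes P_LocalClock.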


Routing-aware Clustering: Chicken-and-Egg Problem

Routing cap is unknown till the clustering is performed

Clustering cannot be performed till routing cap is known


Problem Simplification

• Let’s assume minimum spanning tree (MST) routing estimates

– Other candidates: half-perimeter wirelength (HPWL), Edahiro metric

– Data in the paper show MST and Edahiro metric strongly correlated with actual clock tree wirelength

– MST possesses submodularity property suitable for greedy optimization

• Can the problem be solved optimally, i.e., can we perform clustering such that the routing cap./overall power is minimum?

• Yes, it can be (if capacity constraints are dropped)
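To make the two simplest estimates concrete, the sketch below computes the MST length and the HPWL for a set of pin locations. It assumes Manhattan (rectilinear) distance between pins, which the slides do not specify, and the function names `mst_length` and `hpwl` are hypothetical. The Edahiro metric is not shown.

```python
def manhattan(p, q):
    """Rectilinear distance between two (x, y) points (assumed metric)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def hpwl(points):
    """Half-perimeter wirelength: width plus height of the bounding box."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def mst_length(points):
    """Total edge length of a minimum spanning tree on the complete
    graph over `points`, built with Prim's algorithm in O(n^2) time."""
    n = len(points)
    if n < 2:
        return 0
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known connection to the tree
    best[0] = 0
    total = 0
    for _ in range(n):
        # Pick the cheapest vertex that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v]:
                d = manhattan(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total


# Four pins on the corners of a 3 x 2 rectangle (made-up coordinates).
pins = [(0, 0), (0, 2), (3, 0), (3, 2)]
print(hpwl(pins))        # 5
print(mst_length(pins))  # 7
```

On this layout HPWL underestimates the wire needed to reach all four pins, which illustrates why a tree-based estimate can track clock wirelength more closely for multi-pin nets.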


Problem Definition

• Given: Set of receivers S = {s_1, …, s_n}, their loads (c_si), and locations (x_si, y_si)

• Find: A set of clusters, S_clusters = {c_1, …, c_m}, such that Σ_i (α + MST(c_i)) is minimum

• Subject to Constraints (or Design Parameters):

– Maximum # of receivers
  • Due to process, routing, etc.

– Maximum load in a cluster
  • Due to library

– Bounding box width/height
  • To control RC delay and variations in it
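The objective and the three constraints can be expressed as a cost function and a feasibility check. The sketch below is illustrative only: the `Receiver` and `Limits` types, the Manhattan distance metric, and all numeric values are assumptions, not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receiver:
    x: float
    y: float
    load: float  # receiver load c_si


@dataclass(frozen=True)
class Limits:  # hypothetical container for the design parameters
    max_receivers: int
    max_load: float
    max_width: float
    max_height: float


def mst_length(cluster):
    """Manhattan MST length of a cluster (Prim's algorithm)."""
    pts = [(r.x, r.y) for r in cluster]
    if len(pts) < 2:
        return 0.0
    # dist[i] = cheapest connection from point i to the growing tree.
    dist = {i: abs(pts[i][0] - pts[0][0]) + abs(pts[i][1] - pts[0][1])
            for i in range(1, len(pts))}
    total = 0.0
    while dist:
        u = min(dist, key=dist.get)
        total += dist.pop(u)
        for v in dist:
            d = abs(pts[u][0] - pts[v][0]) + abs(pts[u][1] - pts[v][1])
            dist[v] = min(dist[v], d)
    return total


def is_feasible(cluster, lim):
    """Check receiver count, total load, and bounding-box limits."""
    xs = [r.x for r in cluster]
    ys = [r.y for r in cluster]
    return (len(cluster) <= lim.max_receivers
            and sum(r.load for r in cluster) <= lim.max_load
            and max(xs) - min(xs) <= lim.max_width
            and max(ys) - min(ys) <= lim.max_height)


def total_cost(clusters, alpha):
    """Objective from the slide: sum over clusters of alpha + MST(c_i)."""
    return sum(alpha + mst_length(c) for c in clusters)


a, b = Receiver(0, 0, 1.0), Receiver(1, 0, 1.0)
c, d = Receiver(5, 0, 1.0), Receiver(5, 2, 1.0)
lim = Limits(max_receivers=3, max_load=2.5, max_width=4, max_height=4)

print(total_cost([[a, b], [c, d]], alpha=2))  # 7.0
print(is_feasible([a, b], lim))               # True
print(is_feasible([a, b, c], lim))            # False
```

Here each cluster pays the fixed term α once plus its own MST length, so splitting into more clusters trades extra α terms against shorter trees.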


Outline

• Introduction
• Problem Formulation
• Clustering Algorithm
• Experimental Results
• Conclusion


Power-aware Clustering Algorithm

•Similar to Kruskal’s MST construction algorithm

• Steps in algorithm:

– Create complete graph G(S, E, W)

– Assign each edge estimated capacitance as the weight

– Create trivial solution with each cluster containing a receiver

– For each edge, in ascending order of weights
  • Merge clusters till the cost function is minimized
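The steps above can be sketched as a Kruskal-style loop over a union-find structure. This is a minimal illustration, not the PoAwCl implementation: it assumes Manhattan distance as the edge weight, interprets "till the cost function is minimized" as "stop once an edge weight reaches α" (merging adds the edge weight but saves one α term), and omits the bounding-box constraint for brevity. The function name and parameters are hypothetical.

```python
from itertools import combinations


def power_aware_clusters(points, loads, alpha, max_receivers, max_load):
    """Greedy Kruskal-like clustering sketch.

    Returns (clusters, wire), where clusters is a sorted list of lists of
    receiver indices and wire is the sum of the accepted edge weights.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest, one cluster per receiver
    size = [1] * n           # receivers per cluster root
    load = list(loads)       # total load per cluster root

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Complete graph: one weighted edge per pair of receivers.
    edges = sorted(
        (abs(points[i][0] - points[j][0]) + abs(points[i][1] - points[j][1]), i, j)
        for i, j in combinations(range(n), 2)
    )

    wire = 0
    for w, i, j in edges:
        if w >= alpha:
            break  # assumed stop rule: a merge would no longer lower the cost
        a, b = find(i), find(j)
        if a == b:
            continue  # already in the same cluster
        if size[a] + size[b] > max_receivers or load[a] + load[b] > max_load:
            continue  # merge would violate a capacity constraint
        parent[b] = a
        size[a] += size[b]
        load[a] += load[b]
        wire += w

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values()), wire


# Four receivers on a line (made-up layout, not the graph from the slides).
pts = [(0, 0), (1, 0), (3, 0), (7, 0)]
result = power_aware_clusters(pts, loads=[1, 1, 1, 1], alpha=10,
                              max_receivers=3, max_load=100)
print(result)  # ([[0, 1, 2], [3]], 3)
```

With at most 3 receivers per cluster, the edges of weight 1 and 2 are accepted and the edge of weight 4 is rejected, leaving one cluster of three receivers and one singleton.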


Example

• Constraint: maximum number of receivers per cluster is 3

[Figure: example graph with edge weights 1, 2, 4, and 5. Legend: a cluster, an edge, the weight.]


Example

• Constraint: maximum number of receivers per cluster is 3

• Power-aware clustering results in clusters with total MST value of 3, which is optimal in this case

[Figure: clustered example graph; edge weights 1, 2, 4, and 5.]


Optimality, Time Complexity of Algorithm

• Ensures optimality when no capacity constraints (max. load, # of receivers) are specified

– Reduces to minimum spanning forest problem

• Runs in O(n^2 log n) time, where n is the number of receivers

– Handles blocks with ~5K sequentials easily

– 1.34 seconds for clustering of 1037 sequentials

• Run-times practical and comparable to competitive algorithms

– Clock buffer duplication takes minutes on ~5K sequential blocks
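One way to see where the O(n^2 log n) bound comes from, assuming (as in Kruskal's algorithm) that sorting the edges of the complete graph dominates the run time. The slides do not break the bound down, so this is a sketch of the usual argument rather than the paper's own analysis.

```latex
\begin{align*}
|E| &= \binom{n}{2} = \frac{n(n-1)}{2} = O(n^2) \\
T_{\text{sort}} &= O\bigl(|E| \log |E|\bigr) = O\bigl(n^2 \log n^2\bigr) = O\bigl(n^2 \log n\bigr)
\end{align*}
```

The merge loop then visits each of the O(n^2) edges once, and with a union-find structure each visit costs close to constant time, so it does not exceed the sorting term.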


Outline

• Introduction
• Problem Formulation
• Clustering Algorithm
• Experimental Results
• Conclusion


Evaluation of Power-Aware Clustering (PoAwCl)

•Implemented clustering algorithm, PoAwCl, in C++

•Incorporated in the clock buffer duplication step using TCL

•Rest of the CTS kept unchanged

•Generated clock trees on microprocessor blocks by changing only the clustering/partitioning heuristics

•Best of those results compared with PoAwCl


Results on Clock Trees: Interconnect Cap. Improvement

[Figure: interconnect capacitance improvement; 13% average improvement.]


Results on Clock Trees: Total Cap. Improvement

[Figure: total capacitance improvement; 6% average improvement.]


Results on Clock Trees: Wirelength Improvement

[Figure: wirelength improvement; 11% average improvement.]


Looking at Cluster Pictures

• ●, +, *, ▼ denote locations of sequentials; symbols of the same type denote a cluster

• 4 clusters, in each case, represent 4 clock buffers driving the sequentials in their clusters

[Figure: two cluster plots, captioned "Clustering aimed at minimizing # of buffers" and "Power-aware clustering".]


Viewing the Routing

Power-aware clustering (on right) results in smaller wirelength


Agenda

• Introduction
• Motivation
• Problem Formulation
• Clustering Algorithm
• Experimental Results
• Conclusion


Conclusion

• Power-aware clustering results in 13% improvement in interconnect cap

• Also frees up routing resources by 11%, not counting shielding and spacing of clock wires

• Used for other applications such as enable logic (or clock gating) synthesis, trunk-routing

• Acknowledgment: Intel’s CAD Organization, for providing the source code of the CTS package, which sped up the development


Thank you.