Architecture and Details of a High Quality, Large-Scale Analytical Placer Andrew B. Kahng, Sherief Reda and Qinke Wang VLSI CAD Lab University of California, San Diego http://vlsicad.ucsd.edu/ Work partially supported by the MARCO Gigascale Systems Research Center. ABK is currently with Blaze DFM, Inc., Sunnyvale, CA.

Post on 20-Dec-2015


Architecture and Details of a High Quality, Large-Scale Analytical Placer

Andrew B. Kahng, Sherief Reda and Qinke Wang

VLSI CAD Lab

University of California, San Diego
http://vlsicad.ucsd.edu/

Work partially supported by the MARCO Gigascale Systems Research Center. ABK is currently with Blaze DFM, Inc., Sunnyvale, CA.


Outline

• History of APlace

• From APlace1.0 to APlace2.0

• Anatomy of APlace2.0

• New techniques in APlace2.0

• Experimental Results

• Conclusions and Future Work


History of APlace

• Research to study Synopsys patent
  – Naylor et al., US Patent 6,301,693 (2001)

• Extensible foundation: APlace1.0
  – Timing-driven placement
  – Mixed-size placement
  – Area-I/O placement

• ISPD-2005 placement contest: APlace2.0
  – Many parts of APlace rewritten
  – Superior performance




APlace Problem Formulation

• Constrained Nonlinear Optimization: Divide the layout area into uniform bins, and seek to minimize HPWL etc. so that total cell area in every bin is equalized

– D_g(x, y) : density function that equals the total cell area in a global bin g

– D : average cell area over all global bins
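Written out, the formulation on this slide can be read as the constrained problem below. The symbols WL (the wirelength objective, HPWL or its smooth approximation) and D_g (the density of bin g) are notation assumed here, because the slide's own formula did not survive extraction.

```latex
\min_{x,\,y} \; WL(x, y)
\quad \text{subject to} \quad
D_g(x, y) = D \quad \text{for every global bin } g
```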


Nonlinear Optimization

• Smooth approximation of placement objectives: wirelength, density function, etc.

• Quadratic Penalty method
  – Solve a sequence of unconstrained minimization problems for a sequence of µ → 0

• Conjugate Gradient (CG) solver
  – Useful for finding an unconstrained minimum of a high-dimensional function
  – Adaptable to large-scale placement problems: memory requirement is linear in problem size
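A minimal sketch of the quadratic penalty loop with a CG inner solver, on a two-variable toy problem rather than a placement instance. The textbook penalty form objective + (1/(2µ)) · Σ constraint², the Polak-Ribiere CG variant with backtracking line search, and all function names (`minimize_cg`, `penalty_method`) are assumptions made for illustration; the slides do not say which CG variant or penalty scaling APlace uses.

```python
def minimize_cg(f, grad, x0, iters=2000, tol=1e-18):
    """Nonlinear CG (Polak-Ribiere+, Armijo backtracking). Stores only a few vectors."""
    x = list(x0)
    g = grad(x)
    d = [-gi for gi in g]
    for _ in range(iters):
        gg = sum(gi * gi for gi in g)
        if gg < tol:
            break
        slope = sum(gi * di for gi, di in zip(g, d))
        if slope >= 0.0:
            # Not a descent direction: restart from steepest descent.
            d = [-gi for gi in g]
            slope = -gg
        fx, t = f(x), 1.0
        x_new = x
        while t > 1e-16:
            x_new = [xi + t * di for xi, di in zip(x, d)]
            if f(x_new) <= fx + 1e-4 * t * slope:
                break
            t *= 0.5
        else:
            break  # no acceptable step length found
        g_new = grad(x_new)
        beta = max(0.0, sum(gn * (gn - go) for gn, go in zip(g_new, g)) / gg)
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        x, g = x_new, g_new
    return x


def penalty_method(obj, obj_grad, cons, cons_jac, x0,
                   mus=(1.0, 0.1, 0.01, 0.001, 0.0001)):
    """Solve a sequence of unconstrained problems for mu -> 0, warm-starting each one."""
    x = list(x0)
    for mu in mus:
        def q(z, mu=mu):
            return obj(z) + sum(c * c for c in cons(z)) / (2.0 * mu)

        def q_grad(z, mu=mu):
            g = list(obj_grad(z))
            for c, row in zip(cons(z), cons_jac(z)):
                for i, dc in enumerate(row):
                    g[i] += c * dc / mu
            return g

        x = minimize_cg(q, q_grad, x)
    return x


# Toy stand-in for "objective subject to equality constraints":
# minimize (x0-3)^2 + (x1-1)^2 subject to x0 + x1 = 1.
solution = penalty_method(
    obj=lambda z: (z[0] - 3.0) ** 2 + (z[1] - 1.0) ** 2,
    obj_grad=lambda z: [2.0 * (z[0] - 3.0), 2.0 * (z[1] - 1.0)],
    cons=lambda z: [z[0] + z[1] - 1.0],
    cons_jac=lambda z: [[1.0, 1.0]],
    x0=[0.0, 0.0],
)
```

Each solve starts from the previous solution, so the later, more ill-conditioned problems begin close to their minimum.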


Wirelength Approximation

• Half-Perimeter Wirelength (HPWL)
  – Half-perimeter of net’s bounding box
  – Simple, close measure of routing congestion
  – Not strictly convex, or everywhere differentiable

• Log-Sum-Exp approximation
  – Naylor et al., US Patent 6,301,693 (2001)
  – Precise, closer to HPWL when α → 0
  – Strictly convex, continuously differentiable
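A small sketch of the log-sum-exp idea in one dimension, assuming the commonly used form α · (ln Σ exp(x_i/α) + ln Σ exp(−x_i/α)) for a single net. The function names are hypothetical, and the shift by the maximum and minimum coordinate is only there for numerical stability.

```python
import math


def hpwl_1d(xs):
    """Exact x-extent of one net (the x part of HPWL)."""
    return max(xs) - min(xs)


def lse_wirelength_1d(xs, alpha):
    """Smooth upper bound on hpwl_1d(xs); tightens as alpha -> 0."""
    hi, lo = max(xs), min(xs)
    # alpha*ln(sum exp(x/alpha)) = hi + alpha*ln(sum exp((x-hi)/alpha))
    pos = math.log(sum(math.exp((x - hi) / alpha) for x in xs))
    # alpha*ln(sum exp(-x/alpha)) = -lo + alpha*ln(sum exp((lo-x)/alpha))
    neg = math.log(sum(math.exp((lo - x) / alpha) for x in xs))
    return (hi - lo) + alpha * (pos + neg)


pins = [0.0, 3.0, 10.0]
tight = lse_wirelength_1d(pins, 0.01)   # very close to the exact extent
loose = lse_wirelength_1d(pins, 5.0)    # smoother, but overestimates
```

For n pins the overestimate is at most 2α · ln n, which is one way to see why a smaller α is "closer to HPWL".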


α: Smoothing Parameter

• “Significance criterion” for choosing nets with large wirelength to minimize
  – Larger gradients for longer nets
  – Minimize long nets more efficiently than short nets

[Figure: partial gradient of x1 (vertical axis, -1 to 1) versus (x1 - x2) / α (horizontal axis, -10 to 10)]

• Two-pin net

• Partial gradient for x1
  – close to 0, when net length |x1 - x2| is small compared to α
  – close to 1 or -1, otherwise
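For a two-pin net, differentiating the log-sum-exp form with respect to x1 gives tanh((x1 − x2) / (2α)), which reproduces the behavior described on this slide. The sketch below assumes that log-sum-exp form and checks the closed form against a finite difference; the function names are hypothetical.

```python
import math


def two_pin_lse(x1, x2, alpha):
    """Log-sum-exp wirelength of a two-pin net in one dimension."""
    return alpha * (math.log(math.exp(x1 / alpha) + math.exp(x2 / alpha))
                    + math.log(math.exp(-x1 / alpha) + math.exp(-x2 / alpha)))


def two_pin_gradient(x1, x2, alpha):
    """d(two_pin_lse)/d(x1): near 0 for short nets, near +1 or -1 for long nets."""
    return math.tanh((x1 - x2) / (2.0 * alpha))


# Central finite difference as an independent check of the closed form.
h = 1e-6
numeric = (two_pin_lse(1.0 + h, 0.0, 2.0) - two_pin_lse(1.0 - h, 0.0, 2.0)) / (2 * h)
```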


Area Potential Function

• Overlap area =
  – overlap along the x and y directions
  – 0/1 function with cell size ignored

• Area potential function: defines an “area potential” exerted by a cell to nearby grids
  – smooth bell-shaped function for standard cells
  [Naylor et al., US Patent 6,301,693 (2001)]


Module Area Potential Function

• Mixed-size placement: decide scope of area potential based on module's dimension

• p(d) : potential function
  – d : distance from module to grid
  – radius r = w/2 + 2wg for block with width w

[Figure: p(d) versus d, with curve segments labeled 1 - a*d^2 and b*(r-d)^2 and axis marks at -w/2-2wg and w/2+wg]

  – convex curve: d < w/2 + wg
  – concave curve: w/2 + wg < d < w/2 + 2wg
  – smooth at d = w/2 + wg
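A sketch of the piecewise potential, assuming that "smooth at d = w/2 + wg" means both the value and the slope of the two pieces match there. Under that assumption the coefficients follow as a = 1 / ((w/2 + wg) · r) and b = 1 / (wg · r); these values are derived here, not read from the slide, and the function name is hypothetical.

```python
def module_potential(d, w, wg):
    """Area potential of a module of width w at distance d, for grid width wg."""
    d = abs(d)
    inner = w / 2.0 + wg          # junction between the two curve pieces
    r = w / 2.0 + 2.0 * wg        # radius: potential is zero beyond this
    a = 1.0 / (inner * r)         # from matching value and slope at `inner`
    b = 1.0 / (wg * r)
    if d <= inner:
        return 1.0 - a * d * d
    if d <= r:
        return b * (r - d) ** 2
    return 0.0
```

With w = 4 and wg = 1 the junction is at d = 3 and the radius is 4; both pieces evaluate to 0.25 at the junction.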


Changes: APlace1.0 → APlace2.0

• Strong scalability from new clustering algorithm

• Dynamic adjustment of weights for wirelength and overlap penalty during global placement

• Improvements to legalization, detailed placement
  – whitespace compaction
  – cell reordering algorithms
  – global greedy cell movement

• APlace2.0 vs. APlace1.0: up to 19% WL reduction, 1.5-2x speedup


IBM BigBlue4 Placement

2.1M instances, HPWL = 833.21, CPU = 23h




Anatomy of APlace 2.0

[Flow diagram]

Global phase: Clustering, Adaptive APlace engine, Unclustering

Detailed phase: Legalization, Global moving, WS arrangement, Cell order polishing


New Feature 1: Multi-Level Clustering

Objective: cluster to reduce runtime and allow scalable implementations with no compromise to quality

Multi-level approach using best-choice clustering (ISPD’05)

• Clustering ratio: 10
• # Top-level clusters: 2000
• Wirelength calculation
  – assume modules located at cluster center
  – only consider inter-cluster parts of nets

[Flowchart: netlist → reduce netlist size by 10x → size ~ 2000? → global placement → uncluster → flat? → Legalization; each of the two decision boxes has a yes branch and a no branch]


Best-Choice Clustering

• Each clustering level uses the best-choice heuristic with lazy updates and tight area control

For each clustering level:
    Calculate the clustering score of each node to its neighbors based on the number of connections and areas
    Sort all nodes based on their best scores using a heap
    Until target clustering ratio is reached:
        If top node of heap is “valid” then cluster it with its closest neighbor
        Else recalculate the top node score and reinsert in heap; continue
        Calculate the clustering score of the new node and reinsert into the heap
        Update netlist and mark all neighbors of the new node as invalid
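The loop above can be sketched on a plain weighted graph. This is a simplified illustration, not the APlace2.0 code: the score connection weight / (area(u) + area(v)), the `max_area` limit standing in for "tight area control", the edge-dictionary input format, and the name `best_choice_cluster` are all assumptions.

```python
import heapq
import itertools


def best_choice_cluster(areas, edges, target_count, max_area=float("inf")):
    """areas: {node: area}; edges: {(u, v): weight}. Returns {cluster: set of nodes}."""
    adj = {n: {} for n in areas}
    for (u, v), w in edges.items():
        adj[u][v] = adj[u].get(v, 0.0) + w
        adj[v][u] = adj[v].get(u, 0.0) + w
    area = dict(areas)
    members = {n: {n} for n in areas}
    valid = {n: True for n in areas}
    heap, tick = [], itertools.count()

    def push(u):
        # Score u against every neighbor that respects the area limit; keep the best.
        cands = [(w / (area[u] + area[v]), v) for v, w in adj[u].items()
                 if area[u] + area[v] <= max_area]
        if cands:
            score, v = max(cands, key=lambda c: c[0])
            heapq.heappush(heap, (-score, next(tick), u, v))

    for n in areas:
        push(n)
    new_id = 0
    while len(area) > target_count and heap:
        _, _, u, v = heapq.heappop(heap)
        if u not in area:
            continue                      # stale entry: u was already merged
        if not valid[u] or v not in area:
            valid[u] = True               # lazy update: rescore only when u reaches the top
            push(u)
            continue
        new = ("c", new_id)
        new_id += 1
        area[new] = area.pop(u) + area.pop(v)
        members[new] = members.pop(u) | members.pop(v)
        nbrs = {}
        for old in (u, v):
            for n, w in adj.pop(old).items():
                if n in (u, v):
                    continue
                nbrs[n] = nbrs.get(n, 0.0) + w
                del adj[n][old]
        adj[new] = nbrs
        for n, w in nbrs.items():
            adj[n][new] = w
            valid[n] = False              # neighbor scores are now out of date
        del valid[u], valid[v]
        valid[new] = True
        push(new)
    return members


clusters = best_choice_cluster(
    areas={"a": 1, "b": 1, "c": 1, "d": 1},
    edges={("a", "b"): 3.0, ("b", "c"): 1.0, ("c", "d"): 3.0},
    target_count=2,
)
```

Marking neighbors invalid instead of rescoring them immediately is what keeps the heap maintenance cheap: a stale node is only rescored if it actually rises to the top.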


Two Clustering Concerns

• Mark boundaries of clustering hierarchy at each clustering level
  – allow exact reversal of clustering during unclustering

• Meet target number of objects by avoiding “saturation”
  – bypass small fixed objects during clustering

[Figure: a cluster and a fixed object, illustrating clustering that bypasses fixed objects]


Multiple Levels of Grids

• Adaptive grid size based on average cluster size

• Better global optimization
  – use solution of placement problem constrained with coarser grids as initial solution for problem constrained with finer grids

• Better scalability
  – larger grid size spreads modules faster

• Different levels of relaxation for density constraints
  – According to grid size


New Feature 2: Adaptive WL Weight

• Important to QOR

• Initial weight value
  – For each cluster level and grid level
  – Based on wirelength and density partial derivatives
  – Goal: Magnitudes of gradients roughly equal

• Decrease WL weight by half whenever CG solver obtains a stable solution
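One way to read the two rules above, as a hedged sketch: pick the initial wirelength weight so that the weighted wirelength gradient and the density gradient have equal total magnitude, then halve it each time the solver settles. Using the sum of absolute partial derivatives as the "magnitude" is an assumption, as are the function names.

```python
def initial_wl_weight(wl_grad, density_grad):
    """Weight w such that sum|w * dWL| equals sum|dDensity| (L1 magnitude assumed)."""
    wl_mag = sum(abs(g) for g in wl_grad)
    density_mag = sum(abs(g) for g in density_grad)
    if wl_mag == 0.0:
        raise ValueError("wirelength gradient is zero; weight is undefined")
    return density_mag / wl_mag


def wl_weight_schedule(w0, rounds):
    """Weights used over successive stable CG solutions: halved each round."""
    return [w0 * 0.5 ** k for k in range(rounds)]


w0 = initial_wl_weight([1.0, -1.0, 2.0], [4.0, -4.0, 8.0])
schedule = wl_weight_schedule(w0, 3)
```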


New Feature 3: Legalization and Detailed Placement

Variant of greedy legalization algorithm (Hill’01):

1. Sort all cells from left to right: move each cell in order to the closest legal position

2. Sort all cells from right to left: move each cell in order to the closest legal position

3. Pick the better of (1) and (2)

Detailed Placement Components:
• Global cell movement (Goto81, KenningsM98 BoxPlace, FP…)
• Whitespace compaction (KahngTZ’99, KahngMR’04)
• Cell order polishing (similar to rowIroning, FS detailed placer)
  – Intra-row cell reordering
  – Inter-row cell reordering
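A single-row sketch of the two-sweep idea in steps 1 to 3. It is deliberately simplified: one row only, continuous x positions instead of sites, and "better" measured by total displacement rather than wirelength. All of those choices and the function names are assumptions for illustration.

```python
def sweep_left_to_right(cells, row_start, row_end):
    """cells: list of (name, desired_x, width). Returns {name: legal left edge}."""
    order = sorted(cells, key=lambda c: c[1])
    remaining = sum(w for _, _, w in order)
    if remaining > row_end - row_start:
        raise ValueError("cells do not fit in the row")
    frontier, placed = row_start, {}
    for name, x, w in order:
        # Closest position that avoids the previous cell and leaves room for the rest.
        pos = min(max(x, frontier), row_end - remaining)
        placed[name] = pos
        frontier = pos + w
        remaining -= w
    return placed


def sweep_right_to_left(cells, row_start, row_end):
    """Same sweep on mirrored coordinates: [x, x+w] maps to [-(x+w), -x]."""
    mirrored = [(n, -(x + w), w) for n, x, w in cells]
    placed = sweep_left_to_right(mirrored, -row_end, -row_start)
    widths = {n: w for n, _, w in cells}
    return {n: -(p + widths[n]) for n, p in placed.items()}


def legalize_row(cells, row_start, row_end):
    """Run both sweeps and keep the one with the smaller total displacement."""
    def displacement(placed):
        return sum(abs(placed[n] - x) for n, x, _ in cells)

    candidates = [sweep_left_to_right(cells, row_start, row_end),
                  sweep_right_to_left(cells, row_start, row_end)]
    return min(candidates, key=displacement)


row_cells = [("a", 0, 2), ("b", 1, 2), ("c", 9, 2)]
legal = legalize_row(row_cells, 0, 10)
```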


Global Moving

• Move cell to “optimal” location among available whitespace
  – improve quality when utilization is low

• Two steps
  – search for available location in optimal region of a cell’s placement
  – search for available location in “best” bin
    • divide placement area into uniform bins
    • choose “best” bin according to available whitespace and cost of moving cell to bin center
    • assume normal distribution of whitespace with width and estimate if an available location exists
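The "optimal region" in the first step is commonly computed with a median construction: for each net on the cell, take the bounding box of the net's other pins, then take the two middle values of all box edges in each direction. The sketch below shows that standard construction; whether APlace2.0 computes the region exactly this way is not stated on the slide, and the function names are hypothetical.

```python
def optimal_interval(bounds):
    """bounds: list of (low, high) net extents with the moving cell removed."""
    values = sorted(v for low, high in bounds for v in (low, high))
    k = len(bounds)
    # With 2k edge values, any point between the two medians minimizes the
    # summed distance to all intervals.
    return values[k - 1], values[k]


def optimal_region(net_boxes):
    """net_boxes: list of (xl, yl, xh, yh). Returns the region as (xl, yl, xh, yh)."""
    xl, xh = optimal_interval([(b[0], b[2]) for b in net_boxes])
    yl, yh = optimal_interval([(b[1], b[3]) for b in net_boxes])
    return xl, yl, xh, yh


region = optimal_region([(0, 0, 2, 1), (4, 2, 6, 5)])
```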


WhiteSpace (WS) Compaction

Each chain represents the possible placement sites for one cell

The cost on each arrow is the change in HPWL when the cell moves to that site

The order of chains corresponds to the order of cells from left to right in a row

A shortest path from source to sink gives the best way to compact WS

[Figure: shortest-path graph for one row with sites 1 to 12; one chain of candidate sites per cell (cell 1, cell 2, cell 3, ..., cell n) between a start node and an end node]
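Because the graph is layered (one chain per cell, in row order), the shortest path can be found with a simple dynamic program. The sketch below assumes integer site indices, cell widths in sites, and a caller-supplied `cost(i, s)` giving the cost of putting cell i at start site s; the name `compact_row` and the interface are hypothetical.

```python
def compact_row(widths, num_sites, cost):
    """Return start sites for cells (fixed left-to-right order) with minimum total cost."""
    n = len(widths)
    if n == 0:
        return []
    inf = float("inf")
    best = [[inf] * num_sites for _ in range(n)]     # best[i][s]: cell i starts at s
    pred = [[None] * num_sites for _ in range(n)]    # start site chosen for cell i-1
    for i in range(n):
        run_min, run_arg = inf, None                 # cheapest legal predecessor so far
        for s in range(num_sites - widths[i] + 1):
            if i == 0:
                prev = 0.0
            else:
                p = s - widths[i - 1]                # latest non-overlapping start of cell i-1
                if p >= 0 and best[i - 1][p] < run_min:
                    run_min, run_arg = best[i - 1][p], p
                prev = run_min
            if prev < inf:
                best[i][s] = prev + cost(i, s)
                pred[i][s] = run_arg
    s = min(range(num_sites), key=lambda t: best[n - 1][t])
    if best[n - 1][s] == inf:
        raise ValueError("cells do not fit in the row")
    starts = [0] * n
    for i in range(n - 1, -1, -1):                   # walk the path back to the source
        starts[i] = s
        s = pred[i][s]
    return starts


desired, weight = [3, 3], [1, 3]
starts = compact_row([2, 2], 8, lambda i, s: weight[i] * abs(s - desired[i]))
```

In the usage line the cost is a weighted distance to a desired site, standing in for the HPWL change described on the slide.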


Cell Order Polishing

• Permute a small window of neighboring cells in order to improve wirelength
  – MetaPlacer’s rowIroning: up to 15 cells in one row assuming equal whitespace distribution
  – FengShui's cell ordering: six objects in one or more rows regarding whitespace as pseudo cells

• Branch-and-bound algorithm
  – four nearby cells in one or multiple rows
  – consider optimal placement for each permutation
  – more accurate, overlap-free permutations and no cell shifting


Single-Row Cell Ordering

• Cost of placing first j cells of a permutation
  – cost = wirelength increase when placing a cell
  – ΔWL ≠ 0 only if cell is leftmost or rightmost
  – remaining cells placed to the right of first j cells
  – unrelated to order or placement of remaining cells

• B&B algorithm
  – construct permutations in lexicographic order
    • next permutation has same prefix as the previous one
    • beginning rows of DP table can be reused as far as possible
  – cut branch when minimum cost of placing first j cells > best cost so far
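A compact branch-and-bound sketch for a single-row window. It simplifies the slide's method in ways that should be read as assumptions: cells are packed from the window's left edge with no whitespace, pins sit at cell centers, and the bound is the x-extent of each net over the pins placed so far (which can only grow as more cells are placed). It also recomputes the bound instead of reusing DP rows. The name `polish_window` is hypothetical.

```python
def polish_window(window, widths, left, fixed_x, nets):
    """Return (best_order, best_cost) for the cells in `window`, tried in lexicographic order."""
    best = {"cost": float("inf"), "order": None}

    def partial_cost(x_of):
        # x-extent of each net over pins whose position is already known.
        total = 0.0
        for pins in nets:
            xs = [x_of[p] for p in pins if p in x_of]
            if len(xs) > 1:
                total += max(xs) - min(xs)
        return total

    def extend(order, x_of, next_x):
        bound = partial_cost(x_of)
        if bound >= best["cost"]:
            return                                   # cut: prefix already too expensive
        if len(order) == len(window):
            best["cost"], best["order"] = bound, list(order)
            return
        for cell in window:
            if cell in order:
                continue
            x_of[cell] = next_x + widths[cell] / 2.0
            order.append(cell)
            extend(order, x_of, next_x + widths[cell])
            order.pop()
            del x_of[cell]

    extend([], dict(fixed_x), left)
    return best["order"], best["cost"]


order, cost = polish_window(
    window=["a", "b", "c"],
    widths={"a": 2, "b": 2, "c": 2},
    left=0,
    fixed_x={"L": -5, "R": 11},          # pins outside the window
    nets=[("L", "c"), ("a", "R")],
)
```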


Two- or Three-Row Cell Ordering

• DP algorithm
  – decide how many cells are assigned to each row, from top to bottom
  – construct a permutation in lexicographic order
  – find “optimal” placement within the window

• Y-cost of placing first j cells: accurate
  – remaining cells placed lower than first j cells

• X-cost of placing first j cells: inaccurate when a net connects placed and unplaced cells
  – results show still effective with small set of cells and small window


Outline

• Introduction
• Clustering
• Global Placement
• Detailed Placement
• Experimental Results
  – IBM ISPD04
  – IBM-PLACE v2
  – IBM ICCAD04
  – IBM ISPD05
• Conclusions and Future Work


IBM ISPD04

• 3% better than the best other placer, mPL5 (ISPD05)

• Test basic placer performance with standard cells

Circuit   APlace2.0   mPL5   Capo9.0   Dragon3   FP1    FS2.6
ibm10     17.20       17.3   1.1       1.04      1.07   1.07
ibm11     13.22       14     1.09      1.03      1.09   1.04
ibm12     21.83       22.3   1.11      1.03      1.08   1.07
ibm13     16.46       16.6   1.1       1.05      1.11   1.09
ibm14     30.55       31.6   1.1       1.05      1.11   1.04
ibm15     38.38       38.5   1.09      1.04      1.13   1.07
ibm16     41.36       43     1.1       1.05      1.07   1.09
ibm17     60.82       61.3   1.09      1.08      1.08   1.08
ibm18     39.32       41     1.09      1.02      1.1    1.04
Average   0.97        1      1.09      1.03      1.08   1.06


IBM Place V2

• Test placer under whitespace presence and routability

Circuit      APlace2.0   Vias     mPL+WSA
ibm09-easy   3.023       495073   3.5
ibm09-hard   3.027       503410   3.65
ibm10-easy   5.977       758598   6.84
ibm10-hard   5.931       772744   6.76
ibm11-easy   4.577       638523   5.16
ibm11-hard   4.654       656525   5.15
ibm12-easy   8.337       892915   10.52
ibm12-hard   8.317       902465   10.13
Average      0.88                 1

• 12% better than mPL-R+WSA (ICCAD04)


IBM ICCAD04

• Test placer performance with cells and blocks (floorplacement)

Circuit   APlace2.0   FS2.6   Capo9.0
ibm10     28.55       41.96   34.98
ibm11     18.67       21.19   22.31
ibm12     33.51       40.84   40.78
ibm13     23.03       25.45   28.7
ibm14     35.9        39.93   40.97
ibm15     46.82       51.96   59.19
ibm16     54.58       62.77   67
ibm17     66.49       69.38   78.78
ibm18     42.14       45.59   50.39
Average   0.86        1       1.05

• 14% and 19% better than FS and Capo, respectively


IBM ISPD05

• Test placer performance with cells and movable/fixed blocks

Placer      adaptec2   adaptec4   BB1      BB2      BB3      BB4       AVG
APlace2.0   87.31      187.65     94.64    143.82   357.89   833.21    1
mFAR        91.53      190.84     97.7     168.7    379.95   876.28    1.06
Dragon      94.72      200.88     102.39   159.71   380.45   903.96    1.08
mPL         97.11      200.94     98.31    173.22   369.66   904.19    1.09
FastPlace   107.86     204.48     101.56   169.89   458.49   889.87    1.16
Capo        99.71      211.25     108.21   172.3    382.63   1098.76   1.17
NTUP        100.31     206.45     106.54   190.66   411.81   1154.15   1.21
FengShui    122.99     337.22     114.57   285.43   471.15   1040.05   1.5
KW          157.65     352.01     149.44   322.22   656.19   1403.79   1.84

• 6% better than the best other placer (mFAR)


APlace2.0 Conclusions

• 60 days + clean sheet of paper + Qinke Wang + Sherief Reda
• Scalable implementation
• State-of-the-art clustering and global placement engines
• Improved detailed placement engine
• Better than best published results by
  – 3% ISPD’04 suite
  – 14% ICCAD’04
  – 12% IBMPLACE V.2
  – 6% ISPD’05 Placement Contest

• Recent Applications (other than restoring functionality)
  – IR-drop driven placement (ICCD-2005 Best Paper)
  – Lens aberration-aware placement (DATE-2006)

• Toward APlace3.0: ?


Thank You

Questions?


Goals and Plan

Goals:
• Build a new placer to win the competition
• Scalable, robust, high-quality implementation
• Leave no stone unturned / QOR on the table

Plan and Schedule:
• Work within most promising framework: APlace
• 30 days for coding + 30 days for tuning


Philosophy

Respect the competition
• Well-funded groups with decades of experience
  – ABKGroup’s Capo, MLPart, APlace = all unfunded side projects
  – No placement-related industry interactions
• QOR target: 24-26% better than Capo v9r6 on all known benchmarks
  – Nearly pulled out 10 days before competition

Work smart
• Solve scalability and speed basics first
  – Slimmed-down data structure, -msse compiler options, etc.
• Ordered list of ~15 QOR ideas to implement
• Daily regressions on all known benchmarks
• Synthetic testcases to predict bb3, bb4, etc.


Implementation Framework

APlace weaknesses:
• Weak clustering
• Poor legalization / detailed placement

New APlace:
1. New clustering
2. Adaptive parameter setting for scalability
3. New legalization + iterative detailed placement

[Figure: New APlace Flow. Global phase: Clustering, Adaptive APlace engine, Unclustering. Detailed phase: Legalization, Global moving, WS arrangement, Cell order polishing]


Parameterization and Parallelizing

Tuning Knobs:
• Clustering ratio, # top-level clusters, cluster area constraints
• Initial wirelength weight, wirelength weight reduction ratio
• Max # CG iterations for each wirelength weight
• Target placement discrepancy
• Detailed placement parameters, etc.

Resources:
• SDSC ROCKS Cluster: 8 Xeon CPUs at 2.8GHz
• Michigan Prof. Sylvester’s Group: 8 various CPUs
• UCSD FWGrid: 60 Opteron CPUs at 1.6GHz
• UCSD VLSICAD Group: 8 Xeon CPUs at 2.4GHz

Wirelength Improvement after Tuning: 2-3%


Artificial Benchmark Synthesis

Synthetic benchmarks to test code scalability and performance
• Rapid response to broadcast of s00-nam.pdf
• Created “synthetic” versions of bigblue3 and bigblue4 within 48 hours
• Mimicked fixed-block layout diagrams in the artificial benchmark creation

This process was useful: we identified (and solved) a problem with clustering in presence of many small fixed blocks


Results

Circuit    GP HPWL   Leg HPWL   DP HPWL   CPU (h)
adaptec1   80.20     81.80      79.50     3
adaptec2   84.70     92.18      87.31     3
adaptec3   218.00    230.00     218.00    10
adaptec4   182.90    194.75     187.71    13
bigblue1   93.67     97.85      94.64     5
bigblue2   140.68    147.85     143.80    12
bigblue3   357.28    407.09     357.89    22
bigblue4   813.91    868.07     833.21    50


Conclusions

ISPD05 = an exercise in process and philosophy

At end, we were still 4% short of where we wanted

Not happy with how we handled 5-day time frame

Auto-tuning first results ~ best results

During competition, wrote but then left out “annealing” DP improvements that gained another 0.5%

Students and IBM ARL did a really, really great job

Currently restoring capabilities (congestion, timing-driven, etc.) and cleaning (antecedents in Naylor patent)