
Circuit Placement on Multicore CPUs

May 10-02

Mike Drob

Grant Furgiuele

Ben Winters

Advisor: Dr. Chris Chu

Client: IBM

IBM Contact: Karl Erickson

Project Overview

The circuit placement problem is a bottleneck of physical design

FastPlace is currently single-core only, with no threads

Will attempt to parallelize some functions of the FastPlace algorithm using the Linux pthreads library

Implement the RQL idea (IBM) in FastPlace

Project Plan

Start with the existing serial FastPlace algorithm

Parallelize the FastPlace algorithm to decrease run-time

Hope to get as close as possible to an N-times speedup (N = number of cores)

Realistically, expect 0.75N or 0.5N

End goal is mostly a proof of concept

IBM uses an in-house algorithm

Contains proprietary circuit processing

Project Design

Written in C

Run under Linux using the POSIX thread library

Consider scalability: 2, 4, 8, etc. cores

RQL implementation

IBM concept

Netlist optimization for placement

Implementation – Overall

Using data parallelism as the scheme

Assigning loop iterations to threads

Localizing variable usage

Where absolutely necessary, using thread synchronization (mutex, etc.)

To maximize the speed improvement from threads, minimize the total number of tasks the threads have to accomplish

Have individual threads do as much as possible
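The scheme on this slide can be illustrated with a minimal sketch. It is not taken from the FastPlace source: `SumTask`, `sum_slice`, `parallel_sum`, and `MAX_THREADS` are hypothetical names, and a simple array sum stands in for the real per-cell work. Each thread gets one contiguous block of loop iterations, accumulates into a local variable, and takes the mutex only once to merge its result.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64   /* hypothetical upper bound for this sketch */

typedef struct {
    const double *values;    /* read-only input shared by all threads */
    size_t begin, end;       /* this thread's iterations: [begin, end) */
    double *total;           /* shared accumulator */
    pthread_mutex_t *lock;   /* guards *total */
} SumTask;

static void *sum_slice(void *arg)
{
    SumTask *t = arg;
    double local = 0.0;              /* localized variable: no sharing in the loop */
    for (size_t i = t->begin; i < t->end; ++i)
        local += t->values[i];
    pthread_mutex_lock(t->lock);     /* synchronize once per thread, not per item */
    *t->total += local;
    pthread_mutex_unlock(t->lock);
    return NULL;
}

/* Sum n values with n_threads threads, one large task per thread. */
double parallel_sum(const double *values, size_t n, int n_threads)
{
    assert(n_threads > 0 && n_threads <= MAX_THREADS);
    pthread_t tid[MAX_THREADS];
    SumTask task[MAX_THREADS];
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    double total = 0.0;

    for (int k = 0; k < n_threads; ++k) {
        task[k].values = values;
        task[k].begin = n * (size_t)k / (size_t)n_threads;
        task[k].end = n * (size_t)(k + 1) / (size_t)n_threads;
        task[k].total = &total;
        task[k].lock = &lock;
        int rc = pthread_create(&tid[k], NULL, sum_slice, &task[k]);
        assert(rc == 0);
        (void)rc;
    }
    for (int k = 0; k < n_threads; ++k)
        pthread_join(tid[k], NULL);
    pthread_mutex_destroy(&lock);
    return total;
}
```

Calling `parallel_sum(values, n, 4)` splits the loop into four blocks. This sketch creates threads on every call, which is the overhead a thread pool is meant to avoid.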

Implementation – Thread Pool

Threads are created once at start

Various benefits:

Minimizes overhead from thread creation

Increases cache performance

Allows core scalability: the number of threads running can equal the number of cores available
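One way such a pool could look is sketched below. This is an assumption about the general technique, not the project's actual thread pool: the names `Pool`, `PoolFn`, `pool_init`, `pool_run`, `pool_destroy`, and the example job `square_slice` are all hypothetical. Workers are created once, sleep on a condition variable, and each run the posted job on their own slice of the data.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define POOL_MAX_WORKERS 64
/* A job receives its worker index so it can select its own slice of data. */
typedef void (*PoolFn)(int worker, int n_workers, void *arg);

typedef struct {
    pthread_t threads[POOL_MAX_WORKERS];
    int n_workers;
    int next_id;                /* hands out worker indices 0..n_workers-1 */
    pthread_mutex_t lock;
    pthread_cond_t work_ready;  /* a job was posted, or shutdown began */
    pthread_cond_t work_done;   /* the last worker finished the job */
    PoolFn fn;                  /* current job */
    void *arg;                  /* argument passed to the current job */
    unsigned long generation;   /* incremented once per posted job */
    int pending;                /* workers that have not finished yet */
    int shutting_down;
} Pool;

static void *worker_main(void *p)
{
    Pool *pool = p;
    unsigned long seen = 0;     /* last generation this worker ran */
    int id = -1;
    for (;;) {
        pthread_mutex_lock(&pool->lock);
        if (id < 0)
            id = pool->next_id++;
        while (!pool->shutting_down && pool->generation == seen)
            pthread_cond_wait(&pool->work_ready, &pool->lock);
        if (pool->shutting_down) {
            pthread_mutex_unlock(&pool->lock);
            return NULL;
        }
        seen = pool->generation;
        PoolFn fn = pool->fn;
        void *arg = pool->arg;
        pthread_mutex_unlock(&pool->lock);
        fn(id, pool->n_workers, arg);       /* run the job outside the lock */
        pthread_mutex_lock(&pool->lock);
        if (--pool->pending == 0)
            pthread_cond_signal(&pool->work_done);
        pthread_mutex_unlock(&pool->lock);
    }
}

/* Create the workers once. Sketch only: aborts if a thread cannot start. */
void pool_init(Pool *pool, int n_workers)
{
    assert(n_workers > 0 && n_workers <= POOL_MAX_WORKERS);
    pool->n_workers = n_workers;
    pool->next_id = pool->pending = pool->shutting_down = 0;
    pool->generation = 0;
    pthread_mutex_init(&pool->lock, NULL);
    pthread_cond_init(&pool->work_ready, NULL);
    pthread_cond_init(&pool->work_done, NULL);
    for (int i = 0; i < n_workers; ++i)
        if (pthread_create(&pool->threads[i], NULL, worker_main, pool) != 0)
            abort();
}

/* Run fn once on every worker and block until all of them have returned. */
void pool_run(Pool *pool, PoolFn fn, void *arg)
{
    pthread_mutex_lock(&pool->lock);
    pool->fn = fn;
    pool->arg = arg;
    pool->pending = pool->n_workers;
    pool->generation++;
    pthread_cond_broadcast(&pool->work_ready);
    while (pool->pending > 0)
        pthread_cond_wait(&pool->work_done, &pool->lock);
    pthread_mutex_unlock(&pool->lock);
}

void pool_destroy(Pool *pool)
{
    pthread_mutex_lock(&pool->lock);
    pool->shutting_down = 1;
    pthread_cond_broadcast(&pool->work_ready);
    pthread_mutex_unlock(&pool->lock);
    for (int i = 0; i < pool->n_workers; ++i)
        pthread_join(pool->threads[i], NULL);
    pthread_mutex_destroy(&pool->lock);
    pthread_cond_destroy(&pool->work_ready);
    pthread_cond_destroy(&pool->work_done);
}

/* Example job (hypothetical): each worker squares its own slice of an array. */
typedef struct { int *data; int n; } SquareJob;
void square_slice(int worker, int n_workers, void *arg)
{
    SquareJob *job = arg;
    int end = job->n * (worker + 1) / n_workers;
    for (int i = job->n * worker / n_workers; i < end; ++i)
        job->data[i] *= job->data[i];
}
```

A caller would run `pool_init(&pool, n_cores)` once at startup, call `pool_run(&pool, square_slice, &job)` for each parallel phase, and call `pool_destroy(&pool)` at exit. Because the same threads are reused for every phase, the thread-creation cost is paid only once.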

Implementation - RQL

Force-vector Modulation

Forces acting upon cells

Forces are modeled as a spring potential energy problem

The native force in the algorithm tries to reduce wire length by bringing connected cells closer to each other

The spreading force tries to move cells into sparse areas within the placement region

A balance of the two is needed to meet placement and wire length objectives

Modulate the spreading forces

A high spreading force means the connection belongs to a fan-out net or boundary

Therefore, cells with connections in the top 5 percent of spreading forces are skipped in quadratic placement

Skipping these leaves the cell’s other connections minimized instead of degrading them

Results in placing cells in their overall optimal location

Implementation - RQL

During quadratic placement (global placement process):

Calculate the magnitude of spreading forces for all cells in each iteration

Calculate the force on the current cell

If the current cell’s force is above the 5% threshold, skip its placement
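The thresholding step above can be sketched as follows. This is an illustration under assumptions, not the RQL or FastPlace code: the function name `mark_high_force_cells`, the flat `force` array of per-cell magnitudes, and the sort-based way of finding the threshold are all hypothetical choices.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int compare_doubles(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/*
 * Set skip[i] = 1 for cells whose spreading-force magnitude falls in the
 * top `top_percent` percent, and 0 otherwise. Returns the number of cells
 * marked, or -1 if memory allocation fails. Ties at the threshold are all
 * marked, so slightly more than top_percent of cells can be skipped.
 */
int mark_high_force_cells(const double *force, size_t n,
                          unsigned top_percent, unsigned char *skip)
{
    assert(top_percent <= 100);
    memset(skip, 0, n);
    size_t n_skip = n * top_percent / 100;      /* rounds down */
    if (n_skip == 0)
        return 0;

    double *sorted = malloc(n * sizeof *sorted);
    if (sorted == NULL)
        return -1;
    memcpy(sorted, force, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, compare_doubles);
    double threshold = sorted[n - n_skip];      /* smallest force in the top group */
    free(sorted);

    int marked = 0;
    for (size_t i = 0; i < n; ++i) {
        if (force[i] >= threshold) {
            skip[i] = 1;
            ++marked;
        }
    }
    return marked;
}
```

With `top_percent` set to 5, the placement loop would consult `skip[i]` and leave a marked cell where it is for that iteration. Sorting a copy costs O(n log n) per iteration; a selection algorithm would be a cheaper alternative if this step showed up in profiling.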

Implementation - Functions

Move_8pt family

move_8pt, move_8pt_withMap, move_8pt_mixedMode, move_8pt_mixedMode_withMap, move_8pt_clustering, move_8pt_clustering_withMap

Calculates a score based on cell coordinates and bin utilization

Does not lend itself well to parallelization

The fix?

If a new cell is within the 3x3 box of the cell currently being calculated, the new cell is skipped

Helps remove significant wirelength degradation
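The slide does not say what the 3x3 box is measured in. The sketch below assumes it means the 3x3 block of placement bins centered on the bin of the cell being processed. `BinIndex` and `in_3x3_box` are hypothetical names, not FastPlace identifiers.

```c
#include <stdlib.h>

/* Hypothetical bin coordinates: column and row index in the placement grid. */
typedef struct { int col; int row; } BinIndex;

/* Returns 1 if `other` lies in the 3x3 block of bins centered on `center`. */
int in_3x3_box(BinIndex center, BinIndex other)
{
    return abs(other.col - center.col) <= 1 &&
           abs(other.row - center.row) <= 1;
}
```

Under that assumption, a thread would skip a candidate cell whenever `in_3x3_box` returns 1 for the cell it is currently scoring. That keeps two cells whose scores depend on the same neighboring bins from being moved at the same time.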

Implementation - Functions

Swap_move family

swap_move_FM, vswap_move, local_order3_FM, flipAllCells

Row-based data processing

Break up the matrix into segments based on the number of threads

Assign each thread to do X rows
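A minimal sketch of the row split follows, assuming contiguous segments with any remainder rows given to the lowest-numbered threads. `RowRange` and `rows_for_thread` are hypothetical names. The slide only says each thread does X rows, so the remainder handling is an assumption.

```c
#include <assert.h>

typedef struct {
    int first_row;   /* index of the first row owned by the thread */
    int row_count;   /* number of consecutive rows it processes */
} RowRange;

/* Split n_rows into n_threads contiguous segments of near-equal size. */
RowRange rows_for_thread(int n_rows, int n_threads, int thread_id)
{
    assert(n_rows >= 0 && n_threads > 0);
    assert(thread_id >= 0 && thread_id < n_threads);
    int base = n_rows / n_threads;    /* rows every thread gets */
    int extra = n_rows % n_threads;   /* first `extra` threads get one more */
    RowRange r;
    r.row_count = base + (thread_id < extra ? 1 : 0);
    r.first_row = thread_id * base + (thread_id < extra ? thread_id : extra);
    return r;
}
```

For 10 rows and 4 threads this gives segments of 3, 3, 2, and 2 rows starting at rows 0, 3, 6, and 8. Contiguous segments keep each thread working on adjacent rows, which also helps cache locality.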

Testing

Profiled original FastPlace algorithm

gprof gives CPU time per function

Profiling parallel FastPlace

Valgrind

FastPlace code outputs actual time elapsed

Can be used to compare performance

Not 100% consistent
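Elapsed (wall-clock) time is the right basis for comparing thread counts, because CPU time adds up the work of all threads. The sketch below shows one way to take such a measurement and turn it into speedup and efficiency figures. It is not the FastPlace timing code: `wall_seconds`, `speedup`, and `efficiency` are hypothetical helpers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Wall-clock seconds from a monotonic clock (unaffected by clock changes). */
double wall_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Speedup S = T_serial / T_parallel. */
double speedup(double serial_seconds, double parallel_seconds)
{
    assert(parallel_seconds > 0.0);
    return serial_seconds / parallel_seconds;
}

/* Parallel efficiency E = S / N; a speedup of 0.75N means E = 0.75. */
double efficiency(double serial_seconds, double parallel_seconds, int n_cores)
{
    assert(n_cores > 0);
    return speedup(serial_seconds, parallel_seconds) / n_cores;
}
```

A run would be timed by calling `wall_seconds()` before and after the placement step and subtracting. Since the measurements are not fully consistent from run to run, averaging several runs per configuration gives a fairer comparison.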

Testing & Results

Test results for correctness

Compare “wire length” results

Average total wirelength no worse than 1% greater

Thread pool is tested and working

Test results for speedup

Compared actual run-time

See slides on next page
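The correctness criterion can be expressed as a small check. This sketch assumes the comparison is the mean percent change in total wirelength across benchmarks, parallel versus serial. `mean_percent_change` and `wirelength_ok` are hypothetical names, and the arrays would be filled from the placer's reported wirelengths.

```c
#include <assert.h>
#include <stddef.h>

/* Mean percent change of candidate[i] relative to baseline[i].
 * A positive result means the candidate wirelength is longer on average. */
double mean_percent_change(const double *baseline, const double *candidate,
                           size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        assert(baseline[i] > 0.0);
        sum += 100.0 * (candidate[i] - baseline[i]) / baseline[i];
    }
    return sum / (double)n;
}

/* Returns 1 if the average degradation is within max_percent, else 0. */
int wirelength_ok(const double *baseline, const double *candidate,
                  size_t n, double max_percent)
{
    return mean_percent_change(baseline, candidate, n) <= max_percent;
}
```

With `max_percent` set to 1.0, `wirelength_ok` returns 1 when the parallel results meet the stated 1% criterion.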

Test Results – RQL Implementation

Wire length results

Wire length decreased by 0.12% to 2.08% on ISPD98 benchmarks, with an average of 0.98%

Wire length decreased by 0.11% to 3.18% on ISPD2005 benchmarks, with an average of 1.39%

Run-time results

Some run-time slowdown

Average increase of 3.36% on ISPD98

Average increase of 4.02% on ISPD2005

Test Results – Global Placement

[Chart: benchmarks adaptec2 and adaptec4; vertical axis from 0 to 600; series for 1 Core, 2 Cores, and 8 Cores]

Test Results – Detailed Placement

[Chart: benchmarks adaptec2 and adaptec4; vertical axis from 0 to 700; series for 1 Core, 2 Cores, and 8 Cores]

Project Impact

Shows that threads can be used to speed up the placement process

With the availability of multi-core CPUs and the scalability of the thread implementation, speed improvement could continue

Reduces a bottleneck in the development process

Questions?
