High-Quality, Deterministic Parallel Placement for FPGAs on Commodity Hardware
Adrian Ludwin, Vaughn Betz & Ketan Padalia
FPGA Seminar Presentation
Nov 10, 2009
Overview
Motivation
Review of simulated annealing
Approaches
Summary
Motivation
Simulated Annealing Placement
Probabilistic approach to finding a near-optimal solution
Behavior:
Moves through the solution space, both greedily and randomly
The balance between greediness and randomness is controlled by a temperature
The temperature evolves over time according to a cooling schedule
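As a concrete illustration of a cooling schedule, here is a minimal sketch of a simple geometric schedule. This is only one common choice; the talk does not specify which schedule the placer uses, and the factor `alpha` is an assumption.

```python
def next_temperature(temp: float, alpha: float = 0.95) -> float:
    """Geometric cooling: reduce the temperature by a constant factor
    each annealing iteration (illustrative schedule only)."""
    return temp * alpha

# Starting at T = 100, three iterations of cooling:
temps = []
t = 100.0
for _ in range(3):
    t = next_temperature(t)
    temps.append(t)
# temps == [95.0, 90.25, 85.7375]
```

Adaptive schedules (which adjust `alpha` based on the fraction of accepted moves) are also common in placement tools.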
Simulated Annealing Placement
For a single move:
Compute the change in cost, ΔC
Accept the move if ΔC < 0; if ΔC > 0, accept with probability e^(−ΔC/T)
Repeat while gradually decreasing T and the window size
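The acceptance rule above (the Metropolis criterion) can be sketched as follows. This is a generic illustration of the rule, not the authors' implementation.

```python
import math
import random

def accept_move(delta_c: float, temperature: float, rng: random.Random) -> bool:
    """Metropolis criterion: always accept improving moves (ΔC <= 0);
    accept worsening moves with probability e^(-ΔC/T)."""
    if delta_c <= 0:
        return True
    if temperature <= 0:
        return False  # at T = 0 the anneal is purely greedy
    return rng.random() < math.exp(-delta_c / temperature)

rng = random.Random(42)
assert accept_move(-1.0, 0.5, rng)  # improving move: always accepted
# A large cost increase at a very low temperature is essentially never accepted.
```

At high T, e^(−ΔC/T) is near 1 and the search is nearly random; as T falls, worsening moves become rarer and the search becomes greedy.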
Constraints
Runs on commodity hardware
Good quality of results
Robust
Determinism
Enables bug reporting and consistent regression results
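One way to make a parallel annealer deterministic is to derive each move's random choices from a global seed and the move's index, rather than from shared RNG state that depends on thread timing. The sketch below is an illustrative technique under that assumption, not the authors' code; the mixing constant is arbitrary.

```python
import random

def move_rng(seed: int, move_index: int) -> random.Random:
    """Derive an independent RNG stream for each move from the global seed
    and the move's index, so random draws do not depend on which thread
    evaluates the move or in what order moves complete."""
    return random.Random(seed * 1_000_003 + move_index)

# The same move index always yields the same draws, on any thread count:
a = move_rng(7, 1234).random()
b = move_rng(7, 1234).random()
assert a == b
```

With per-move streams, a run with N threads can reproduce the single-threaded result exactly, which is what makes bug reports and regression results consistent.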
Selected Previous Work
Closely related:
Move acceleration
Parallel moves
Other methods:
Independent sets
Partitioned placements
Speculative execution
Algorithm #1
Algorithm #2
Objective
Determine efficacy
Analyze runtime and categorize it into:
Memory
Synchronization
Infrastructure
Evaluation
Proposal
Methodology
Parallel-equivalent flow:
A serial flow that mimics the parallel flow
Emulates the behavior of the multithreaded application using only one thread/core
Useful for comparison: accounts for infrastructure overhead
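The idea of a parallel-equivalent flow can be sketched as follows: work that would be split across N threads is partitioned the same way, but executed round-robin on one real thread, so the parallel infrastructure (queues, per-thread state) is exercised without real concurrency. The structure and names here are assumptions for illustration, not the authors' code.

```python
from collections import deque

def run_parallel_equivalent(tasks, num_virtual_threads=2):
    """Execute tasks serially, but partitioned and interleaved exactly as
    the parallel flow would schedule them across virtual threads."""
    queues = [deque() for _ in range(num_virtual_threads)]
    for i, task in enumerate(tasks):
        queues[i % num_virtual_threads].append(task)  # same partitioning as parallel
    results = []
    while any(queues):
        for q in queues:          # round-robin on a single real thread
            if q:
                results.append(q.popleft()())
    return results

# e.g. run_parallel_equivalent([lambda: 1, lambda: 2, lambda: 3]) -> [1, 2, 3]
```

Comparing this flow against the true serial flow isolates the infrastructure overhead; comparing it against the multithreaded flow isolates the cost of real parallelism (cache effects, synchronization).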
Methodology
Attributing runtime
Two types of measurements:
Bottom-up (bu): measure each component of a move
End-to-end (e2e): measure the runtime of the entire run
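The two measurement styles can be sketched like this: a bottom-up timer accumulates wall time per move component, while a single end-to-end timer wraps the whole run. The component names are assumptions for illustration.

```python
import time

class MoveTimer:
    """Bottom-up (bu) attribution: accumulate wall time per component."""
    def __init__(self):
        self.totals = {}

    def measure(self, component, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        elapsed = time.perf_counter() - start
        self.totals[component] = self.totals.get(component, 0.0) + elapsed
        return result

timer = MoveTimer()
e2e_start = time.perf_counter()                    # end-to-end (e2e): one timer
timer.measure("proposal", lambda: sum(range(1000)))
timer.measure("evaluation", lambda: sum(range(1000)))
e2e = time.perf_counter() - e2e_start

# The bu totals must sum to no more than the e2e time; the gap is
# unattributed overhead (loop control, timing calls, etc.).
assert sum(timer.totals.values()) <= e2e
```

Comparing the summed bottom-up times against the end-to-end time is what exposes overheads that no individual component accounts for.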
Test sets:
11 Stratix® II FPGA benchmark designs (IP and customer circuits, 10k to 100k logic cells)
Also tested on 40 Stratix II FPGA circuits, with similar results
Results for Algorithm #1
Move attribution
Overhead analysis
Observations
Theoretical speedup: 1.7x; measured: 1.3x (best)
Increase in evaluation runtime, due to reduced cache locality
Proposal time is "hidden"
Analysis
Time spent on stalls is negligible
Evaluation accounts for most of the overhead
Little to gain by removing determinism: serial equivalency costs less than 3% of runtime
Summary for Algorithm #1
Speedup: 1 – 1.3x
Memory inefficiency is the biggest bottleneck
Theoretically the algorithm should scale; however, it is difficult to partition and balance the two stages
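The two stages referred to here (move proposal and move evaluation) form a pipeline, which can be sketched with a bounded queue between two threads. This is an illustrative model of why balance matters, not the authors' implementation; the cost function is a placeholder.

```python
import queue
import threading

def proposer(moves, q):
    """Stage 1: propose moves and push them to the evaluation stage."""
    for m in moves:
        q.put(m)
    q.put(None)  # sentinel: no more moves

def evaluator(q, results):
    """Stage 2: evaluate each proposed move (placeholder cost function)."""
    while True:
        m = q.get()
        if m is None:
            break
        results.append(m * m)

q = queue.Queue(maxsize=4)  # bounded queue: the faster stage stalls here
results = []
t1 = threading.Thread(target=proposer, args=(range(8), q))
t2 = threading.Thread(target=evaluator, args=(q, results))
t1.start(); t2.start(); t1.join(); t2.join()
assert results == [i * i for i in range(8)]
```

If one stage is consistently slower, the other blocks on the bounded queue and the pipeline's throughput is capped by the slow stage, which is why partitioning and balancing the two stages is hard.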
Speedups for Algorithm #2
Attribution on 2 cores
Attribution on 4 cores
Observations
Memory latency due to inter-processor communication; worsens with more cores
Summary for Algorithm #2
Parallel moves have better scalability than pipelined moves
The bottleneck is still memory
Again, serial equivalency costs little
Take Home Messages
Memory is important
Good algorithms are even more important