![Page 1: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/1.jpg)
Self-Hosted Placementfor
Massively Parallel Processor Arrays(MPPAs)
Graeme Smecher,Steve Wilton, Guy Lemieux
Thursday, December 10, 2009FPT 2009
![Page 2: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/2.jpg)
Landscape
• Massively Parallel Processor Arrays– 2D array of processors• Ambric: 336, PicoChip: 273, AsAP: 167, Tilera: 100
– Processor-to-processor communication
• Placement (locality) matters– Tools/algorithms immature
2
![Page 3: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/3.jpg)
Opportunity
• MPPAs track Moore’s Law– Array size grows
• E.g. Ambric:336, Fermi:512
• Opportunity for FPGA-like CAD?– Compiler-esque speed needed– Self-hosted parallel placement
• M x N array of CPUs computes placement forM x N programs
• Inherently scalable
3
![Page 4: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/4.jpg)
Overview
• Architecture• Placement Problem• Self-Hosted Placement Algorithm• Experimental Results• Conclusions
4
![Page 5: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/5.jpg)
MPPA Architecture
• 32 x 32 = 1024 PEs• PE = RISC + Router• RISC core– In-order pipeline– More powerful
PE than prev talk
• Router– 1-cycle per hop
5
![Page 6: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/6.jpg)
Overview
• Architecture• Placement Problem• Self-Hosted Placement Algorithm• Experimental Results• Conclusions
7
![Page 7: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/7.jpg)
Placement Problem
• Given: netlist graph– Set of “cluster” programs
– One per PE
– Communication paths
• Find: good 2D placement– Use simulated annealing– E.g., minimum total
Manhattan wirelength
8
C
C
CC
C
CC
C
C
CC
C
C
C
C
C
C
C
C
C
C
C
C
C
CC C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
![Page 8: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/8.jpg)
Overview
• Architecture• Placement Problem• Self-Hosted Placement Algorithm• Experimental Results• Conclusions
9
![Page 9: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/9.jpg)
Self-Hosted Placement
• Idea from Wrighton and DeHon, FPGA03– Use FPGA to place itself– Imbalanced: tiny problem size needs HUGE FPGA– N-FPGAs needed to place 1-FPGA design
10
![Page 10: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/10.jpg)
Self-Hosted Placement
• Use MPPA to place itself– PE powerful enough to place itself– Removes imbalance– 2 x 3 PEs to place 6 “clusters” into 2 x 3 array
11
C
C
C
C
C
C
5
0
C
C
3
1
C C
C
C
C
C
C
C
C
C
C
2
4
C
C
C
C
C
C
C
C
C
C
1
4
C
C
0
3
C
C
C
C
C
C
C
C
C
C
C
C
2
5
C
C
C
C
![Page 11: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/11.jpg)
Regular Simulated Annealing
1. initial: random placement
2. for T in {temperatures}1. for n in 1..N clusters
1. Randomly select 2 blocks
2. Compute swap cost3. Accept swap if
i) cost decreases, orii) random trial succeeds
12
PE
![Page 12: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/12.jpg)
Modified Simulated Annealing
1. initial: random placement
2. for T in {temperatures}1. for n in 1..N clusters
1. Consider all pairs in neighbourhood of n
2. Compute swap cost3. Accept swap if
i) cost decreases, orii) random trial succeeds
13
PE
![Page 13: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/13.jpg)
PEPE
Self-Hosted Simulated Annealing
1. initial: random placement
2. for T in {temperatures}1.1. for n in 1..N clustersfor n in 1..N clusters
1. Update position chain2. Consider all pairs in
neighbourhood of n3. Compute swap cost4. Accept swap if
i) cost decreases, orii) random trial succeeds
14
![Page 14: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/14.jpg)
Algorithm Data Structures
• Place-to-block maps • Net-to-block maps
Nets
Blocks(programs)
PEs <x,y>nbmbnm
pbmbpm
STATICDYNAMIC15
![Page 15: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/15.jpg)
pbmstatic
bpm
bnm
nbm16
Full map in each PE Partial map in each PE
Algorithm Data Structures
![Page 16: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/16.jpg)
Swap Transaction• PEs pair up
– Deterministic order, hardcoded in algorithm
• Each PE computes cost for own BlockID– Current placement cost– After cost if BlockID was swapped
• PE 1 sends cost of swap to PE 2– PE 2 adds costs, determines if swap accepted– PE 2 sends decision back to PE 1– PE 1 and PE2 exchange data structures if swap
17
![Page 17: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/17.jpg)
Data Structure Updates
18
Dynamic structuresLocal <x,y>: update on swap
Other <x,y>: update chain
Static structuresExchanged with swap
![Page 18: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/18.jpg)
Data CommunicationSwap Transaction
19
PEs exchangeBlockIDs
PEs exchange nets for their BlockIDs
PEs exchange BlockIDs for their nets
(already updated)
![Page 19: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/19.jpg)
Overview
• Architecture• Placement Problem• Self-Hosted Placement Algorithm• Experimental Results• Conclusions
20
![Page 20: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/20.jpg)
Methodology
• Three versions of Simulated Annealing (SA)– Slow sequential SA• Baseline, generates “ideal” placement• Very slow schedule (200k swaps per T drop)• Impractical, but nearly optimal
– Fast Sequential SA• Vary parameters across practical range
– Fast Self-Hosted SA
21
![Page 21: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/21.jpg)
Benchmark “Programs”
• Behavioral Verilog dataflow circuits– Courtesy Deming Chen, UIUC– Compiled using RVETool into parallel programs
• Hand-coded Motion Estimation kernel– Handcrafted in RVEArch– Not exactly a circuit
22
![Page 22: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/22.jpg)
Benchmark Characteristics
23
Up to 32 x 32 array size
![Page 23: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/23.jpg)
Result Comparisons
• Investigate options– Best neighbourhood size: 4 8 12
– Update chain frequency– Stopping temperature
24
![Page 24: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/24.jpg)
4-Neighbour Swaps
CC
25
![Page 25: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/25.jpg)
8-Neighbour Swaps
CC
26
![Page 26: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/26.jpg)
12-Neighbour Swaps
CC
27
![Page 27: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/27.jpg)
Update-chain Frequency
CC
28
![Page 28: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/28.jpg)
Stopping Temperature
CC
29
![Page 29: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/29.jpg)
Limitations and Future Work
• These results were simulated on a PC– Need to target real MPPA– Performance in <# swaps> vs
<amount of communication> vs <runtime>
• Need to model limited RAM per PE– We assume complete netlist, placement state can be
divided among all PEs– Incomplete state if memory is limited?
• e.g., discard some nets?
30
![Page 30: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/30.jpg)
Conclusions
• Self-Hosted Simulated Annealing– High-quality placements (within 5%)– Excellent parallelism and speed• Only 1/256th number of swaps needed
– Runs on target architecture itself• Eat you own dog food• Computationally scalable• Memory footprint may not scale to uber-large arrays
31
![Page 31: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/31.jpg)
Conclusions
• Self-Hosted Simulated Annealing– High-quality placements (within 5%)– Excellent parallelism and speed• Only 1/256th number of swaps needed
– Runs on target architecture itself• Eat you own dog food• Computationally scalable• Memory footprint may not scale to uber-large arrays
• Thank you!32
![Page 32: Self-Hosted Placement for Massively Parallel Processor Arrays (MPPAs) Graeme Smecher, Steve Wilton, Guy Lemieux Thursday, December 10, 2009 FPT 2009](https://reader033.vdocuments.mx/reader033/viewer/2022051522/5a4d1b657f8b9ab0599afcf2/html5/thumbnails/32.jpg)
EOF
33