Coarse and Fine Grain Programmable Overlay Architectures for FPGAs
![Page 1: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/1.jpg)
Coarse and Fine Grain Programmable Overlay Architectures for FPGAs
Alex Brant
Advisor: Guy Lemieux
University of British Columbia
![Page 2: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/2.jpg)
Outline
Motivation
Contributions
Prior Work
ZUMA FPGA Overlay
CARBON-Razor Overlay
Summary
![Page 3: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/3.jpg)
Motivation - 1
FPGA Overlays
  FPGA designs that can be further programmed by the user
What are the benefits?
  Ease of use (simpler languages, tools, etc.)
  Optimized for particular problem domains
  Open access to architecture & CAD
  User-configured logic added to fixed FPGA bitstream
  Dynamic reconfiguration on any device
  Portability between vendors and devices
![Page 4: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/4.jpg)
Motivation - 2
Fine Grain Overlay – ZUMA
FPGA-like architecture
  Compatible with VTR CAD tools
  “Virtual” FPGA for portability of designs
  Open source for research and applications
Implements fine grain part of MALIBU architecture
Generic implementation has high area overhead
  Overcome by utilizing low level FPGA resources, implementing more efficient structures
![Page 5: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/5.jpg)
Motivation - 3
Coarse Grain Overlay – CARBON
Array of time-multiplexed ALUs
  Fast compile
  High density
  Efficient mapping of word oriented circuits
Implements coarse grain part of MALIBU
Time-multiplexing limits overall performance
  Performance gained using overclocking with error tolerance (CARBON-Razor)
![Page 6: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/6.jpg)
Contributions
Area efficient implementation of fine grain routing and logic with LUTRAMs
Area efficient 2-stage local routing network and configuration controller
Extension of Razor error tolerance from pipelined processors to 2D processing arrays
Design of an overclockable coarse grain FPGA overlay with in-circuit error correction
![Page 7: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/7.jpg)
Publications
ZUMA: An Open FPGA Overlay Architecture, Alexander Brant and Guy G.F. Lemieux (FCCM 2012)
Pipeline Frequency Boosting: Hiding Dual-Ported Block RAM Latency using Intentional Clock Skew, Alexander Brant, Ameer Abdelhadi, Aaron Severance, Guy G.F. Lemieux (FPT 2012)
CARBON-Razor: An Error-Tolerant Coarse Grain FPGA (in preparation)
![Page 8: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/8.jpg)
![Page 9: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/9.jpg)
FPGA Architecture
Implements any logic function
![Page 10: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/10.jpg)
MALIBU Architecture
Hybrid coarse/fine grain FPGA
Time-multiplexed ALU (CG) combined with FPGA cluster
CG passes data to neighbors through memories
![Page 11: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/11.jpg)
MALIBU Hybrid FPGA
CGs are run on fast system clock (e.g. > 1 GHz)
System clock / Schedule length = User clock rate
Advantages:
  Greater density from time-multiplexing
  Ability to trade off between area and speed
  Compiles up to 300x faster than normal FPGA
  Better performance for word-oriented circuits
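The clock relation on this slide (system clock / schedule length = user clock rate) can be checked with a few lines of Python. This is a minimal sketch: the function name and the example numbers (a 1 GHz system clock and a 20-timeslot schedule) are illustrative assumptions, not figures from the talk.

```python
def user_clock_hz(system_clock_hz: float, schedule_length: int) -> float:
    """User clock rate = system clock / schedule length (in timeslots)."""
    if schedule_length <= 0:
        raise ValueError("schedule_length must be positive")
    return system_clock_hz / schedule_length


# Hypothetical example: 1 GHz system clock, 20-slot schedule.
example_rate = user_clock_hz(1e9, 20)
print(f"{example_rate / 1e6:.1f} MHz")
```

A longer schedule packs more operations onto each ALU (higher density) at the cost of a proportionally lower user clock, which is the area/speed trade-off listed above.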
![Page 12: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/12.jpg)
Razor Timing Error Tolerance
Works with feed-forward pipeline circuits
Detects timing errors by capturing data a second time with a delayed clock
Tolerates errors by stalling pipeline one cycle
![Page 13: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/13.jpg)
Razor Timing Error Example
Data captured in main FF
![Page 14: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/14.jpg)
Razor Timing Error Example
Data captured in main FF
Fraction of cycle later, data captured by shadow latch
![Page 15: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/15.jpg)
Razor Timing Error Example
Data captured in main FF
Fraction of cycle later, data captured by shadow latch
Main FF and shadow latch are compared
![Page 16: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/16.jpg)
Razor Timing Error Example
Data captured in main FF
Fraction of cycle later, data captured by shadow latch
Main FF and shadow latch are compared
  If different, shadow data loaded to main FF, pipeline is stalled
![Page 17: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/17.jpg)
Razor Timing Error Example
Data captured in main FF
Fraction of cycle later, data captured by shadow latch
Main FF and shadow latch are compared
  If different, shadow data loaded to main FF, pipeline is stalled
  If not, pipelining proceeds normally
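The capture-and-compare sequence above can be sketched as a small behavioral model in Python. This is only an illustration under simplifying assumptions: time is a plain number, the input switches from `old` to `new` exactly once at `arrival`, and all names (`razor_capture`, `RazorResult`, `shadow_delay`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RazorResult:
    main_value: int    # value captured by the main FF at the clock edge
    shadow_value: int  # value captured by the shadow latch after the delay
    error: bool        # True when the two captures disagree
    committed: int     # value the stage holds after any recovery


def razor_capture(old: int, new: int, arrival: float,
                  period: float, shadow_delay: float) -> RazorResult:
    """Model one register stage of a Razor-style pipeline.

    `new` becomes valid `arrival` time units after the launching edge;
    before that the input still shows `old`.
    """
    main = new if arrival <= period else old
    shadow = new if arrival <= period + shadow_delay else old
    error = main != shadow
    # On an error the shadow data is loaded into the main FF
    # (the pipeline would stall for one cycle while this happens).
    committed = shadow if error else main
    return RazorResult(main, shadow, error, committed)


# Data arrives slightly after the clock edge but before the shadow capture.
print(razor_capture(old=0, new=1, arrival=1.1, period=1.0, shadow_delay=0.3))
```

In this model, data that arrives later than `period + shadow_delay` is missed by both captures and goes undetected, so the shadow delay bounds how far the clock can be pushed.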
![Page 18: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/18.jpg)
![Page 19: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/19.jpg)
ZUMA Overlay
Island style FPGA architecture, implemented on an FPGA
Initially implemented in generic Verilog
  High area overhead, 125+ host LUTs for each ZUMA LUT (eLUT)
Area efficiency improvements:
  Implementation of routing and logic with FPGA LUTRAMs
  Design of efficient 2-stage local interconnect
![Page 20: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/20.jpg)
ZUMA Layout
[Figure: One tile of ZUMA Architecture. Labeled blocks: S-Block, Input Block, and Logic Cluster (Two Stage Crossbar Network, K-LUT, FF)]
![Page 21: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/21.jpg)
Details - LUTRAM
[Figure: Reprogrammable LUTRAM in Xilinx and Altera Devices. Labels: we, data in, wr addr (k bits), rd addr (k bits), Decoder, 2^k Config Bits, data out]
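As a rough mental model, a LUTRAM behaves like a k-input LUT whose 2^k configuration bits can be rewritten through a write port at run time. The Python sketch below illustrates that idea only; the class and method names are hypothetical and it does not model any vendor primitive.

```python
class LutRam:
    """A k-input LUT modeled as a writable 2**k x 1-bit memory."""

    def __init__(self, k: int):
        self.k = k
        self.bits = [0] * (2 ** k)  # the 2**k configuration bits

    def write(self, wr_addr: int, data_in: int, we: bool = True) -> None:
        """Write port: update one configuration bit when `we` is set."""
        if we:
            self.bits[wr_addr] = data_in & 1

    def read(self, rd_addr: int) -> int:
        """Read port: the k LUT inputs form the read address."""
        return self.bits[rd_addr]


def program_truth_table(lut: LutRam, func) -> None:
    """Load a logic function by writing every configuration bit in turn."""
    for addr in range(2 ** lut.k):
        lut.write(addr, func(addr))


# Configure a 2-input LUT as AND: output is 1 only for address 0b11.
and_lut = LutRam(2)
program_truth_table(and_lut, lambda addr: int(addr == 0b11))
print([and_lut.read(a) for a in range(4)])
```

Because the contents are written through ordinary user logic, an overlay built this way can be reprogrammed without touching the host FPGA bitstream.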
![Page 22: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/22.jpg)
Details – LUTRAM Multiplexer
[Figure: left, a 6-LUT configured as a 4-to-1 MUX (data inputs d0-d3, select inputs s0 and s1, output y); right, a 6-LUT in RAM mode configured as a 6-to-1 MUX (data inputs d0-d5, output y)]
LUTRAM can implement larger MUXs than a normal LUT and needs no extra configuration memory
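One way to see why the RAM-mode LUT reaches 6-to-1: the selection is stored in the LUT contents instead of arriving on select pins, so all six inputs can carry data. The Python sketch below illustrates this under that reading of the slide; the function names are hypothetical.

```python
def mux_truth_table(k: int, select: int) -> list[int]:
    """2**k configuration bits that make a k-input LUT pass input `select`."""
    if not 0 <= select < k:
        raise ValueError("select must name one of the k inputs")
    # Entry `addr` holds bit `select` of the address, i.e. the chosen input.
    return [(addr >> select) & 1 for addr in range(2 ** k)]


def lut_read(table: list[int], inputs: list[int]) -> int:
    """Evaluate the LUT: input i supplies address bit i."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return table[addr]


table = mux_truth_table(6, select=4)        # route d4 to the output
print(lut_read(table, [0, 1, 0, 1, 1, 0]))  # inputs are d0..d5
```

Changing the routed input means rewriting the 64 table bits, which is the configuration step; no separate select register is needed.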
![Page 23: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/23.jpg)
Details – Local Routing Crossbar
[Figure: Reduced two-stage network feeding the ZUMA eLUTs. I+N inputs enter a first stage of P k x k LUTRAM crossbars, where P = (I+N)/k; a second stage of k P x N LUTRAM crossbars drives N*k outputs, which feed the N k-input LUTs]
Two-Stage (I+N) x (k*N) crossbar used in ZUMA Logic Cluster
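To get a feel for the saving, the crossbar dimensions given above can be turned into crosspoint counts. This is a back-of-the-envelope sketch in Python: crosspoints are only a rough proxy for host LUT cost, the parameter values are made up for illustration, and the function name is hypothetical.

```python
def crossbar_crosspoints(I: int, N: int, k: int) -> dict:
    """Compare a full (I+N) x (k*N) crossbar with the two-stage network."""
    total_inputs = I + N
    if total_inputs % k != 0:
        raise ValueError("this sketch assumes (I + N) is a multiple of k")
    P = total_inputs // k              # P = (I + N) / k
    full = total_inputs * (k * N)      # single-stage crossbar
    first_stage = P * (k * k)          # P crossbars of size k x k
    second_stage = k * (P * N)         # k crossbars of size P x N
    return {"P": P, "full": full, "two_stage": first_stage + second_stage}


# Hypothetical cluster: 28 external inputs, 8 LUTs, 6 inputs per LUT.
print(crossbar_crosspoints(I=28, N=8, k=6))
```

The two-stage network is a reduced network, so the lower count comes with less routing flexibility than a full crossbar.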
![Page 24: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/24.jpg)
Results
Both Xilinx and Altera versions implemented
Our generic version is 125-150 LUTs per eLUT
Area overhead as low as 40 host LUTs per eLUT with improvements
Compared to previous work (vFPGA) on 4-LUT host, overhead reduced 3x with same parameters
![Page 25: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/25.jpg)
![Page 26: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/26.jpg)
CARBON Overlay
FPGA implementation of MALIBU CG
  Modifications to support FPGA block RAMs
Critical path is memory to ALU to memory
![Page 27: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/27.jpg)
CARBON-Razor
Razor is applied to the CARBON overlay
  Error tolerance on memory to memory critical path
How to do it:
  Shadow registers are applied to the CARBON memories
  CARBON schedule gets 1-3 extra timeslots for error recovery
  Stall propagation is extended from a 1D pipeline (Razor) to a 2D array (CARBON)
![Page 28: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/28.jpg)
CARBON-Razor Memory
Shadow register paired with RAM
Stratix memory mode allows read-back of previously written data
![Page 29: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/29.jpg)
2D Error Propagation
Can’t propagate errors to entire chip fast enough
  We can propagate it one tile per cycle
Error propagation logic can then combine multiple errors into one stall region
![Page 30: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/30.jpg)
2D Error Propagation Example
Error at tile at cycle 0
Each cycle, stall propagates to nearest neighbors
[Figure: tile grid with the erroring tile labeled 0]
![Page 31: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/31.jpg)
2D Error Propagation Example
Error at tile at cycle 0
Each cycle, stall propagates to nearest neighbors
[Figure: tile grid; the erroring tile is labeled 0 and its nearest neighbors are labeled 1]
![Page 32: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/32.jpg)
2D Error Propagation Example
Error at tile at cycle 0
Each cycle, stall propagates to nearest neighbors
[Figure: tile grid; tiles are labeled 0, 1, and 2 by the cycle in which the stall reaches them]
![Page 33: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/33.jpg)
2D Error Propagation Example
Error at tile at cycle 0
Each cycle, stall propagates to nearest neighbors
[Figure: tile grid; tiles are labeled 0 through 3 by the cycle in which the stall reaches them]
![Page 34: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/34.jpg)
2D Error Propagation Example
Error at tile at cycle 0
Each cycle, stall propagates to nearest neighbors
[Figure: tile grid; tiles are labeled 0 through 4 by the cycle in which the stall reaches them]
![Page 35: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/35.jpg)
![Page 36: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/36.jpg)
Stall Propagation Logic
When an error is detected at a CG:
  Instruction schedule stalls
  Memories in CG load from shadow register
  Any writes from neighbor captured in shadow register
Next cycle:
  Schedule resumes
  Neighbor’s write performed from shadow register
  4 neighbors stall, unless they stalled last cycle
Stall region continues in expanding diamond shaped wave
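The expanding diamond can be reproduced with a short simulation. The Python sketch below is a simplified model, not the actual stall logic: it handles a single error, and it approximates the "unless they stalled last cycle" rule by assuming that a tile which has already stalled for this error ignores later stall requests. The function name `stall_wave` is hypothetical.

```python
def stall_wave(rows: int, cols: int,
               error_tile: tuple[int, int]) -> list[list[int]]:
    """Return, for each tile, the cycle in which it stalls."""
    stall_cycle = [[-1] * cols for _ in range(rows)]
    r0, c0 = error_tile
    stall_cycle[r0][c0] = 0  # the erroring tile stalls in cycle 0
    front = [(r0, c0)]
    cycle = 0
    while front:
        cycle += 1
        next_front = []
        for r, c in front:
            # Each stalled tile asks its 4 nearest neighbors to stall.
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                inside = 0 <= nr < rows and 0 <= nc < cols
                # Assumption: a tile that already stalled for this error
                # ignores the request instead of stalling again.
                if inside and stall_cycle[nr][nc] == -1:
                    stall_cycle[nr][nc] = cycle
                    next_front.append((nr, nc))
        front = next_front
    return stall_cycle


for row in stall_wave(4, 4, (1, 1)):
    print(" ".join(str(c) for c in row))
```

Under these assumptions each tile stalls exactly once, in the cycle equal to its Manhattan distance from the erroring tile, so every tile still loses one timeslot and the schedule stays aligned across the array.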
![Page 37: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/37.jpg)
CARBON Schedule Extension
We add 1-3 cycles of slack to schedule
  Allows margin of safety
Speedup determined by difference in FMAX and schedule length
If no hard deadline is needed (e.g. when used as compute accelerator), average extension of schedule can be used to find speedup

$$\text{Speedup} = \frac{F_{MAX\text{-}Razor} \times SL_{Base}}{F_{MAX\text{-}Base} \times SL_{Razor}}$$
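The speedup ratio (FMAX-Razor * SLBase) / (FMAX-Base * SLRazor) is easy to evaluate directly. The Python sketch below uses made-up clock frequencies purely for illustration; they are not measurements from this work, and the function name is hypothetical.

```python
def razor_speedup(fmax_razor: float, fmax_base: float,
                  sl_base: int, sl_razor: float) -> float:
    """Speedup = (FMAX-Razor * SL-Base) / (FMAX-Base * SL-Razor).

    `sl_razor` is the extended schedule length; it may be fractional when
    the average extension is used instead of the worst case.
    """
    return (fmax_razor * sl_base) / (fmax_base * sl_razor)


# Hypothetical numbers: 250 MHz baseline, 300 MHz overclocked,
# schedule of 24 timeslots extended by 2 recovery slots.
ratio = razor_speedup(fmax_razor=300.0, fmax_base=250.0,
                      sl_base=24, sl_razor=24 + 2)
print(f"{(ratio - 1) * 100:.1f}% speedup")
```

The extra timeslots eat into the gain from overclocking, which is why longer schedules absorb the same extension with less penalty.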
![Page 38: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/38.jpg)
Results
Performance compared between CARBON and CARBON-Razor for 4 benchmarks
Maximum performance found by pushing clock speed and shadow register delay
Average increases to 14% with no hard deadline

| Benchmark | SL | Extra Cycles | Speedup |
|---|---|---|---|
| Random Ops | 24 | 2 | 11% |
| Wang | 28 | 1 | 6% |
| Mean(256) | 67 | 2 | 20% |
| PR | 29 | 1 | 3% |
| Average | | | 13% |
![Page 39: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/39.jpg)
Contributions
Area efficient implementation of FPGA routing and logic with LUTRAMs
Area efficient 2-stage local routing network and configuration controller
Extension of Razor error tolerance from pipelined processors to 2D processing arrays
Design of an overclockable coarse grain FPGA overlay with in-circuit error correction
![Page 40: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/40.jpg)
Summary
Fine Grain Overlay – ZUMA
  FPGA-like architecture, compatible with VTR CAD tools
  High area overhead implementing fine grain structures
    Overcome by utilizing FPGA resources, implementing alternate structures
  Area reduced to 40 host LUTs per eLUT, 3x improvement
Coarse Grain Overlay – CARBON
  Fast compile, efficient mapping of word oriented circuits
  Time-multiplexing decreases overall performance
    Performance gained using overclocking with error tolerance
  Speedup of 13% on average compared to baseline design
![Page 41: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/41.jpg)
Thank you
![Page 42: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/42.jpg)
ZUMA Config Controller
[Figure: ZUMA configuration controller. Labels: Bitstream In (ROM, JTAG), Begin Config, 2k bit counter (Count, Overflow), Shift Chain of FFs driving the per-tile we signals, and Tiles with data, addr, and we ports containing LUTRAMs fed by data[0] and data[1]]
![Page 43: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/43.jpg)
LUTRAM Crossbar
[Figure: a LUTRAM used as a 2^m x n memory (ports: we, wr addr, data in, rd addr of m bits, data out of n bits), shown as equivalent to an n x m crossbar (data in, data out)]
![Page 44: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/44.jpg)
CARBON Razor Timing
Shadow register latches correct data if delay is sufficient
![Page 45: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/45.jpg)
CARBON-Razor Stall Logic
![Page 46: Coarse and Fine Grain Programmable Overlay Architectures for FPGAs](https://reader035.vdocuments.mx/reader035/viewer/2022062301/5681492b550346895db66656/html5/thumbnails/46.jpg)
CARBON-Razor Test
[Figure: CARBON-Razor test setup. Labels: Nios II/f, Dynamic PLL (freq., phase, enable), System Clock, Razor Clock (Ø+Δ), Random Vectors, Output Vectors, Error Count, MALIBU–Razor]