Topic 3 Slide 1 | PYKC 9-Jan-08 | E3.05 Digital System Design
Topic 3
Modern FPGA Architectures
Peter Cheung
Department of Electrical & Electronic Engineering
Imperial College London
URL: www.ee.imperial.ac.uk/pcheung/
E-mail: [email protected]
Topic 3 Slide 2
Key architectural improvements
1. Better programmable logic element (or CLB)
2. More flexible on-chip memories
3. Faster computation with embedded DSP blocks/multipliers
4. Reduced routing delay and improved performance
5. Lower power
We will focus on:
• Altera’s latest Stratix III family
• Xilinx’s latest Virtex-5 family
Topic 3 Slide 3
1. Better programmable logic element (or CLB)
For many years, all FPGAs were based around 4-LUTs (deemed to be most efficient*)
Add block memories
Add multipliers/DSP modules
* J. Rose, R.J. Francis, D. Lewis and P. Chow, “Architecture of Field-Programmable Gate Arrays: The Effect of Logic Functionality on Area Efficiency”, IEEE Journal of Solid-State Circuits, pp. 1217-1225, 1990.
[Figure: basic logic element built from a 4-LUT, a register (REG), and a MUX]
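The classic logic element above can be sketched in a few lines of Python. This is only an illustrative model, not any vendor's implementation: the class name `LogicElement`, the helper `truth_table_from`, and the choice that the MUX simply selects between the combinational and the registered LUT output are all assumptions made for this example.

```python
def truth_table_from(func):
    """Build the 16 configuration bits of a 4-LUT from a Python function."""
    table = 0
    for index in range(16):
        a, b, c, d = ((index >> i) & 1 for i in range(4))
        table |= (func(a, b, c, d) & 1) << index
    return table


class LogicElement:
    """Toy model (hypothetical): 4-LUT, one register, and an output MUX."""

    def __init__(self, truth_table, use_register=False):
        if not 0 <= truth_table < (1 << 16):
            raise ValueError("a 4-LUT holds exactly 16 configuration bits")
        self.truth_table = truth_table
        self.use_register = use_register  # MUX select, fixed at configuration
        self.reg = 0

    def lut(self, a, b, c, d):
        # The four inputs form an address into the 16-entry truth table.
        index = a | (b << 1) | (c << 2) | (d << 3)
        return (self.truth_table >> index) & 1

    def clock_edge(self, a, b, c, d):
        # On a clock edge the register captures the current LUT output.
        self.reg = self.lut(a, b, c, d)

    def output(self, a, b, c, d):
        # The MUX picks the registered or the combinational value.
        return self.reg if self.use_register else self.lut(a, b, c, d)


xor4 = truth_table_from(lambda a, b, c, d: a ^ b ^ c ^ d)
comb = LogicElement(xor4)
print(comb.output(1, 0, 0, 0), comb.output(1, 1, 0, 0))
```

Any Boolean function of four inputs fits in the same 16 bits, which is why the LUT is a general-purpose building block.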
Topic 3 Slide 4
FPGA’s journey: from Glue Logic to SoC
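[Figure: system complexity (lower to higher) plotted against time. The FPGA's role moves from glue logic to control logic, data path, and finally heart of the system. Device labels along the curve: 16K LEs at 50 MHz, 40K LEs at 100 MHz, 80K LEs at 200 MHz, 330K LEs at 300 MHz.]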
FPGAs are now being used everywhere!
Topic 3 Slide 5
One LUT Size Does Not Fit All
[Figure: bar chart, vertical axis 0% to 50%, showing the share of 7-LUTs, 6-LUTs, 5-LUTs, 4-LUTs, and 3-LUTs or less, for area synthesis and for speed synthesis]
4-LUTs are no longer the best. Here are some results from real benchmarks showing the need for other LUT sizes.
Topic 3 Slide 6
Altera Patented design: Adaptive Logic Module
[Figure: ALM block diagram. Inputs 1 to 8 feed a combined logic block, followed by two adders and two registers.]
ALM:
• 8-input fracturable LUT
• Two 3-input adders
• Two registers
First introduced in Stratix II devices. The ALM has 8 inputs and 2 outputs.
Topic 3 Slide 7
Altera’s Stratix II/III ALM (1)
New and far more flexible Logic Element, now called Adaptive Logic Modules (ALMs)
Each ALM contains a variety of look-up table (LUT) based resources, which can implement functions with up to 7 inputs and complex logic-arithmetic functions
Topic 3 Slide 8
Altera’s Stratix II/III ALM (2)
Topic 3 Slide 9
Altera’s Stratix II/III ALM (2)
Topic 3 Slide 10
ALM in Arithmetic Mode
Each ALM can also be turned into TWO 3-bit, cascadable fast adders
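[Figure: ALM in arithmetic mode. Four 4-LUTs, driven by inputs a, b, c, d, e0, e1, f0, and f1, feed two adders chained from carry_in to carry_out; the adder outputs go to Reg0 and Reg1.]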
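A rough Python sketch of the arithmetic mode follows. It reads each adder as a full adder (two operand bits plus carry) whose operands are the outputs of two 4-LUTs. That reading, the function names, and the way inputs are assigned to LUTs are assumptions for illustration, not details confirmed by the slide.

```python
def lut4(table, inputs):
    """Evaluate a 4-LUT: `table` holds 16 bits, `inputs` is four 0/1 values."""
    index = sum(bit << i for i, bit in enumerate(inputs))
    return (table >> index) & 1


def full_adder(x, y, carry_in):
    """Return (sum bit, carry out) for two operand bits and a carry."""
    total = x + y + carry_in
    return total & 1, total >> 1


def alm_arithmetic(tables, inputs, carry_in):
    """Hypothetical ALM model: four 4-LUTs feeding two chained full adders."""
    s0, carry = full_adder(lut4(tables[0], inputs[0]),
                           lut4(tables[1], inputs[1]), carry_in)
    s1, carry_out = full_adder(lut4(tables[2], inputs[2]),
                               lut4(tables[3], inputs[3]), carry)
    return s0, s1, carry_out


PASS_INPUT0 = 0xAAAA  # truth table whose output equals the first LUT input


def add_2bit(x, y, carry_in=0):
    """Add two 2-bit numbers in one modelled ALM; LUTs just pass bits through."""
    tables = [PASS_INPUT0] * 4
    inputs = [(x & 1, 0, 0, 0), (y & 1, 0, 0, 0),
              ((x >> 1) & 1, 0, 0, 0), ((y >> 1) & 1, 0, 0, 0)]
    s0, s1, carry_out = alm_arithmetic(tables, inputs, carry_in)
    return s0 | (s1 << 1) | (carry_out << 2)


print(add_2bit(3, 2))
```

Because carry_out of one ALM can drive carry_in of the next, wider adders are built by cascading; replacing the pass-through tables with other functions gives the combined logic-arithmetic behaviour mentioned earlier.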
Topic 3 Slide 11
Xilinx’s Virtex-5 CLB
[Figure: Virtex-5 CLB with one slice labelled]
Xilinx took a different approach with fixed, larger LUTs
Each CLB has two slices
Each slice has FOUR 6-input LUTs
Retain fast look-ahead carry logic
Better intra-CLB routing
Topic 3 Slide 12
Virtex-5 Slice can be configured as memory
Two types of slices
• SLICE_L: Logic only
• SLICE_M: Logic / RAM / ShiftReg
Logic-to-memory ratio optimized for area and power
• Some CLBs with two SLICE_Ls
• Some CLBs with one SLICE_L plus one SLICE_M
Topic 3 Slide 13
Xilinx’s New 6-Input LUT with Two Outputs
True 6-input LUT
• Any function of 6 variables
• No input shared with other LUTs
Second output adds functionality
• Reduces average slice count by 10%
• 2 independent functions of 5 variables
• 1 function of 6 variables plus 1 subfunction of 5 variables
• 1 function of 3 variables plus 1 function of 2 other variables
• Plus other combinations of subfunctions...
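One way to picture the second output is as two 32-entry tables inside the 64-bit LUT, with the sixth input choosing between them. The sketch below follows that picture. The names `lut6_2` and `pack_two_lut5` are made up for this example, and in this simplified model both 5-input functions read the same five input values.

```python
def lut6_2(init, a):
    """Model of a 6-input LUT with two outputs.

    init: 64 configuration bits; a: six 0/1 inputs (a[0] is the lowest).
    Returns (o6, o5).
    """
    index5 = sum(bit << i for i, bit in enumerate(a[:5]))
    lower = (init >> index5) & 1         # 5-input table in bits 0..31
    upper = (init >> (32 + index5)) & 1  # 5-input table in bits 32..63
    o6 = upper if a[5] else lower        # sixth input selects the half
    o5 = lower                           # second output: lower half only
    return o6, o5


def pack_two_lut5(f, g):
    """Place 5-input function f in the lower half and g in the upper half."""
    init = 0
    for index in range(32):
        bits = tuple((index >> i) & 1 for i in range(5))
        init |= (f(*bits) & 1) << index
        init |= (g(*bits) & 1) << (32 + index)
    return init


def xor5(a, b, c, d, e):
    return a ^ b ^ c ^ d ^ e


def and5(a, b, c, d, e):
    return a & b & c & d & e


init = pack_two_lut5(xor5, and5)
# With the sixth input held at 1, o6 gives and5 and o5 gives xor5.
print(lut6_2(init, (1, 1, 1, 1, 1, 1)))
```

With the sixth input used as a real variable instead, o6 is a full function of 6 variables and o5 is its 5-variable subfunction, matching the third bullet above.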
Topic 3 Slide 14
Why 6-Input LUT?
Xilinx invented the LUT architecture
• LUT4 has been the de-facto FPGA standard since 1988
But:
• Transistors got smaller
• LUTs became a relatively smaller part of the overall logic
• Interconnect got more significant in area and speed
Time to re-evaluate the architecture
• Research shows that a LUT6 is optimal for 65-nm
• Slightly bigger LUT, compensated by significant savings through logic compaction and reduced routing
Topic 3 Slide 15
Examples of LUT6 Efficiency
Design                                              Virtex-4 LUTs   Virtex-5 LUTs   % Reduction
16-state FSM, 4-way branch, 6 encoded outputs       -               8               -
32x32-bit Quad-Port memory (1 write + 3 read)       -               64              -
Opencores_VGA_LCD                                   1369            993             28%
16-input MUX                                        8               4               50%
MicroBlaze 32x32-bit register file, 2 read ports    384             44              89%
64x16 bit RAM                                       64              16              75%
Ternary adder 32 bits wide                          64              32              50%
Opencores Huffman encoder                           270             209             23%
Opencores Huffman decoder                           672             463             32%
Opencores discrete cosine transform                 217             186             9%
Opencores triple-DES decryptor                      1232            834             33%
Topic 3 Slide 16
One LUT6 Equals 1.6 Logic Cells
The logic cell (LC) has become an industry standard for measuring FPGA capacity
• 10 years ago, an LC was defined as a LUT4 plus a flip-flop
Extensive testing with 163 benchmark designs shows that the LUT6 plus FF is equivalent to ~1.6 LCs (old 4-LUT logic cells)
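As a quick worked example of this conversion, the snippet below turns a LUT6 count into equivalent logic cells using the ~1.6 factor quoted above. The function name and the sample device size are hypothetical.

```python
LC_PER_LUT6 = 1.6  # factor quoted on the slide: one LUT6 + FF is ~1.6 LCs


def equivalent_logic_cells(lut6_count, lc_per_lut6=LC_PER_LUT6):
    """Convert a count of LUT6 + flip-flop pairs into old-style logic cells."""
    return round(lut6_count * lc_per_lut6)


# Hypothetical device with 10,000 LUT6s, used only to show the arithmetic.
print(equivalent_logic_cells(10_000))
```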
Topic 3 Slide 17
2. More, and more flexible, on-chip memories: Altera
[Figure: one LAB made of 10 ALMs, reconfigured as an MLAB (640 bits) organised as 32 words x 20 or 64 words x 10, with 1 read and 1 write port and no efficiency loss]
In Stratix III, 10 ALMs make up one Logic Array Block (LAB). 50% of the LABs in a Stratix III FPGA can be converted to a memory LAB (MLAB) with 640 bits of storage. This can be used as dual-port memory (1 read + 1 write port).
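A minimal behavioural sketch of an MLAB is given below, using the two organisations listed for the 640 bits (32 words x 20 or 64 words x 10) and one write plus one read port. The class name `Mlab` and its methods are invented for this illustration; timing and read-during-write behaviour are not modelled.

```python
class Mlab:
    """Toy model of a 640-bit memory LAB with one write and one read port."""

    TOTAL_BITS = 640
    SHAPES = {(32, 20), (64, 10)}  # (words, width) options from the slide

    def __init__(self, words, width):
        if (words, width) not in self.SHAPES:
            raise ValueError("supported shapes are 32x20 and 64x10")
        self.words = words
        self.width = width
        self.mem = [0] * words

    def write(self, addr, data):
        # Write port: data is truncated to the configured word width.
        self.mem[addr] = data & ((1 << self.width) - 1)

    def read(self, addr):
        # Read port: independent address, returns the stored word.
        return self.mem[addr]


ram = Mlab(32, 20)
ram.write(3, 0xABCDE)
print(hex(ram.read(3)))
```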
Topic 3 Slide 18
Xilinx has even more flexibility to turn LUTs into memory
Single port
• One LUT6 = 64x1 or 32x2 RAM
• Cascadable up to 256x1 RAM in one CLB
Dual port (D)
• 1 read/write port + 1 read port
Simple dual port (SDP)
• 1 write-only port + 1 read-only port
• Used in LUT FIFOs, and is also key to the MicroBlaze register file
Quad-port (Q)
• 1 read/write port + 3 read-only ports
Topic 3 Slide 19
Xilinx: Quad-Port Memory in One SLICE_M
Write Port: Four LUT6s can share the write address and data
Read Ports: Three independent read operations
Provides 6x improvement in density over Virtex-4
[Figure: four LUTs in one SLICE_M. Write port: common write address and common write data shared by all four LUTs. Read ports: three LUTs each have an independent read address and associated data.]
New modes for distributed memory that did not exist before
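The quad-port arrangement can be modelled as four copies of the same 64x1 memory that are always written together and read separately. The sketch below assumes exactly that; the class and method names are hypothetical, and treating the fourth LUT as the one read at the write address is an assumption based on the "1 read/write port + 3 read-only ports" description.

```python
class QuadPortLutRam:
    """Toy 64x1 quad-port memory built from four modelled LUT6s."""

    DEPTH = 64

    def __init__(self):
        # Each LUT6 holds its own 64 one-bit entries.
        self.luts = [[0] * self.DEPTH for _ in range(4)]

    def write(self, addr, bit):
        # Common write address and data: every LUT gets the same update.
        for lut in self.luts:
            lut[addr] = bit & 1

    def read(self, addr_a, addr_b, addr_c):
        # Three independent read ports, one per LUT.
        return (self.luts[0][addr_a],
                self.luts[1][addr_b],
                self.luts[2][addr_c])

    def read_at_write_port(self, addr):
        # Fourth LUT serves the read side of the read/write port.
        return self.luts[3][addr]


ram = QuadPortLutRam()
ram.write(5, 1)
ram.write(9, 1)
print(ram.read(5, 9, 0))
```

Keeping four identical copies costs storage but gives every port its own address decoder, which is what makes the three reads independent.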
Topic 3 Slide 20
32-bit Shift Registers in 1 LUT
Each bit uses 2 latches as master/slave
Length is dynamically determined by the A inputs
Cascadable up to 128x1 shift register in one CLB
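The dynamically selectable length can be illustrated with a small model: data shifts in at one end on every clock, and the address inputs pick which stage drives the output. The class name `Srl32` and the convention that address `a` gives a delay of `a + 1` clocks are assumptions for this sketch.

```python
class Srl32:
    """Toy model of a 32-bit shift register held in one LUT."""

    LENGTH = 32

    def __init__(self):
        self.bits = [0] * self.LENGTH

    def clock(self, d):
        # New bit enters stage 0; every stored bit moves one stage along.
        self.bits = [d & 1] + self.bits[:-1]

    def q(self, a):
        # The address selects the tap, so the effective length is a + 1.
        return self.bits[a]


srl = Srl32()
for bit in (1, 0, 0, 0):
    srl.clock(bit)
print(srl.q(3), srl.q(0))
```

Changing the address while the circuit runs changes the delay without touching the stored data.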
Topic 3 Slide 21
3. Faster computation with embedded DSP blocks
[Figure: only the label "512K RAM" survives]
Topic 3 Slide 22
Stratix II DSP Blocks
Topic 3 Slide 23
Stratix III DSP Blocks are twice the size
Topic 3 Slide 24
Xilinx’s Virtex 5 DSP Block
[Figure: DSP block diagram. Inputs A (25-bit), B (18-bit), and C (48-bit); cascade inputs ACIN, BCIN, PCIN and cascade outputs ACOUT, BCOUT, PCOUT; optional pipeline registers and routing logic around the multiplier; optional output register; a 48-bit "=" block; output P (48-bit), optionally P (96-bit)*.]
• 25x18 multiplier (better precision and efficiency)
• Pattern detect circuitry (enhanced functionality)
• Expanded second stage (SIMD, bitwise logic)
• Integrated cascade routing (scalable performance)
• Pipeline registers (550 MHz performance)
• Wider data path and 96-bit accumulated output (higher precision)
* 96-bit output using MACC extension mode (uses 2 DSP48E slices)
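The data widths above can be exercised with a small multiply-add model: a 25-bit by 18-bit product added to a 48-bit value, with the result wrapped to 48 bits. Treating all operands as signed two's complement, and the function names themselves, are assumptions for this sketch; pipelining, pattern detection, and SIMD modes are not modelled.

```python
def to_signed(value, bits):
    """Wrap an integer to `bits` bits and interpret it as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def dsp_multiply_add(a, b, c):
    """P = A * B + C with A 25-bit, B 18-bit, C and P 48-bit."""
    a = to_signed(a, 25)
    b = to_signed(b, 18)
    c = to_signed(c, 48)
    return to_signed(a * b + c, 48)


def dsp_macc(samples, coeffs):
    """Multiply-accumulate: feed P back in as C for each new product."""
    p = 0
    for a, b in zip(samples, coeffs):
        p = dsp_multiply_add(a, b, p)
    return p


print(dsp_multiply_add(3, -4, 10), dsp_macc([1, 2, 3], [4, 5, 6]))
```

A 25x18 product needs up to 43 bits, so the 48-bit accumulator leaves headroom for summing many products before it can overflow.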
Topic 3 Slide 25
4. Reduced routing delay and improved performance
Altera’s Stratix III: minimizing the number of hops is critical for:
• Achieving high performance for faster timing closure
• Reducing fitting congestion
Topic 3 Slide 26
Xilinx’s Virtex-4 routing
[Figure: routing reach around one CLB; legend: Fast Connect, 1 Hop, 2 Hops, 3 Hops]
Interconnects can be bottlenecks in demanding designs
Topic 3 Slide 27
Virtex-4 Interconnect Pattern
[Figure: routing reach around one CLB; legend: Fast Connect, 1 Hop, 2 Hops, 3 Hops]
Virtex-5 has More and Faster Routing Between CLBs
Symmetric routing pattern reaches more CLBs with fewer hops
Dramatically increases design performance
Topic 3 Slide 28
Result: Shorter Routing Delays
Average delays from the input of the center CLB to inputs of any other CLB

            Virtex-4   Virtex-5   Improvement
1st Ring    751 ps     665 ps     13 %
2nd Ring    906 ps     723 ps     25 %

[Figure: per-CLB delay map around the center CLB, with the first ring and second ring marked. Values as extracted, positions not recoverable: 715, 715, 640, 590, 701, 756, 789, 786, 742, 651, 712, 636, 719, 648, 713, 723, 699, 662, 747, 813, 792, 666, 681, 637, 565.]
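The quoted improvements can be checked with a few lines of Python. The 13 % and 25 % figures come out when the saving is expressed relative to the Virtex-5 delay; that baseline is inferred from the numbers rather than stated on the slide, and the names below are made up for this example.

```python
# (Virtex-4 delay, Virtex-5 delay) in picoseconds, taken from the table.
RING_DELAYS_PS = {
    "1st ring": (751, 665),
    "2nd ring": (906, 723),
}


def improvement_percent(old_ps, new_ps):
    """Delay saving as a percentage of the new (Virtex-5) delay."""
    return 100.0 * (old_ps - new_ps) / new_ps


for ring, (v4, v5) in RING_DELAYS_PS.items():
    print(ring, round(improvement_percent(v4, v5)))
```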
Topic 3 Slide 29
5. Lower Power – the 65nm Power Challenge
On the positive side: operating voltage (V) and node capacitance (C) decrease with process shrink, lowering Dynamic Power (CV²f)
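Because dynamic power goes as CV²f, the voltage term dominates. The short sketch below evaluates the formula and the ratio between two operating points; the function names are hypothetical, and the example holds capacitance and frequency constant, which is a simplification.

```python
def dynamic_power(c_farads, v_volts, f_hz):
    """Dynamic power in watts from P = C * V^2 * f."""
    return c_farads * v_volts ** 2 * f_hz


def relative_dynamic_power(v_new, v_old, c_ratio=1.0, f_ratio=1.0):
    """Ratio of new to old dynamic power for scaled V, C, and f."""
    return c_ratio * (v_new / v_old) ** 2 * f_ratio


# Dropping the supply from 1.1 V to 0.9 V with C and f unchanged leaves
# about two thirds of the original dynamic power.
print(round(relative_dynamic_power(0.9, 1.1), 2))
```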
Topic 3 Slide 30
Question: How fast should logic be?
Only a small proportion of logic is performance critical
Topic 3 Slide 31
Altera’s Programmable Power Technology (1)
Could make all logic fast (and high power)
Critical path shown in red
Topic 3 Slide 32
Programmable Power Technology
Better if only the critical-path logic is high speed
The rest is low power (and low speed)
Topic 3 Slide 33
Programmable Power Technology
High Performance Where You Need It, Lowest Power Everywhere Else
Topic 3 Slide 34
Stratix III has Selectable Core Voltage
Customer selects the FPGA core voltage
• 1.1 V for maximum performance
• 0.9 V for minimum power
I/O and PLL voltages unaffected
• Still get maximum I/O interface speed

All figures are relative to Stratix II FPGAs:

Core Voltage   Performance Gain   Dynamic Power Reduction   Static Power Reduction
1.1 V          25%                33%                       52%
0.9 V          0%                 55%                       64%

Total Power Reduction reflects reduced capacitance, Programmable Power Technology, and other Stratix III architectural optimizations
Topic 3 Slide 35
Static Power Tamed (85 degree C)
Topic 3 Slide 36
Xilinx focus on Leakage Current at 65nm
[Figure: leakage current against junction temperature (-40 to 140 °C), with curves for IGATE, IS→D, and ITOTAL. C Grade and I Grade ranges and the typical design operating range are marked. Inset: transistor with source, drain, and gate.]
ITOTAL = IS→D + IGATE
Source-to-Drain Leakage dominant at high temp
Gate Leakage dominant at 25°C
Topic 3 Slide 37
Xilinx’s way to Optimize Speed and Leakage
[Figure: configuration memory, interconnect pass gates, logic / buffers, and IOB blocks, with supply labels Vccint, Vcco, and Vgg]
Transistor types
• Thin Ox: fast switching time, highest leakage
• Mid Ox: fast signal passing, low leakage
• Thick Ox: max voltage tolerance, lowest leakage
No Compromise on Performance
Topic 3 Slide 38
References & Additional Reading
J. Rose, R.J. Francis, D. Lewis and P. Chow, “Architecture of Field-Programmable Gate Arrays: The Effect of Logic Functionality on Area Efficiency”, IEEE Journal of Solid-State Circuits, pp. 1217-1225, 1990.
Altera Stratix III information:
• http://www.altera.com/literature/lit-stx3.jsp
Xilinx Virtex-4 information:
• http://www.xilinx.com/products/silicon_solutions/fpgas/virtex/virtex4/index.htm
Xilinx Virtex-5 information:
• http://www.xilinx.com/products/silicon_solutions/fpgas/virtex/virtex5/index.htm