topic 3 - modern fpgas - imperial college london 3 - modern fpgas.… · pykc 9-jan-08 e3.05...

10
Topic 3 Slide 1 PYKC 9-Jan-08 E3.05 Digital System Design Topic 3 Modern FPGA Architectures Peter Cheung Department of Electrical & Electronic Engineering Imperial College London URL: www.ee.imperial.ac.uk/pcheung/ E-mail: [email protected] Topic 3 Slide 2 PYKC 9-Jan-08 E3.05 Digital System Design Key architectural improvements 1. Better programmable logic element (or CLB) 2. More flexible on-chip memories 3. Faster computation with embedded DSP blocks/Multipliers 4. Reduced routing delay and improved performance 5. Lower power We will focus on: Altera’s latest Stratix III Family Xilinx’s latest Virtex 5 Family Topic 3 Slide 3 PYKC 9-Jan-08 E3.05 Digital System Design 1. Better programmable logic element (or CLB) For many years, all FPGAs are based around 4-LUTs (deemed to be most efficient*) Add block memories Add multipliers/DSP modules * J. Rose, R.J. Francis, D. Lewis and P. Chow, “Architecture of Field-Programmable Gate Arrays: The Effect of Logic Functionality on Area Efficiency”, IEEE Journal of Solid-State Circuits, pp. 1217-1225, 1990. 4-LUT R E G M U X Topic 3 Slide 4 PYKC 9-Jan-08 E3.05 Digital System Design FPGA’s journey: from Glue Logic to SoC Time Lower Higher System Complexity Glue Logic Control Logic Data Path Heart of the System 16K LEs 50 MHz 40K LEs 100 MHz 80K LEs 200 MHz 330K LEs 300 MHz Source: FPGAs are now being used everywhere!

Upload: dangnhan

Post on 06-Feb-2018

215 views

Category:

Documents


0 download

TRANSCRIPT

Topic 3 Slide 1PYKC 9-Jan-08 E3.05 Digital System Design

Topic 3

Modern FPGA Architectures

Peter CheungDepartment of Electrical & Electronic Engineering

Imperial College London

URL: www.ee.imperial.ac.uk/pcheung/E-mail: [email protected]

Topic 3 Slide 2PYKC 9-Jan-08 E3.05 Digital System Design

Key architectural improvements

1. Better programmable logic element (or CLB)2. More flexible on-chip memories3. Faster computation with embedded DSP blocks/Multipliers4. Reduced routing delay and improved performance5. Lower power

We will focus on:• Altera’s latest Stratix III Family• Xilinx’s latest Virtex 5 Family

Topic 3 Slide 3PYKC 9-Jan-08 E3.05 Digital System Design

1. Better programmable logic element (or CLB)

For many years, all FPGAs are based around 4-LUTs (deemed to be most efficient*)Add block memoriesAdd multipliers/DSP modules

* J. Rose, R.J. Francis, D. Lewis and P. Chow, “Architecture of Field-Programmable Gate Arrays: The Effect of Logic Functionality on Area Efficiency”, IEEE Journal of Solid-State Circuits, pp. 1217-1225, 1990.

4-LUTREG

MUX

Topic 3 Slide 4PYKC 9-Jan-08 E3.05 Digital System Design

FPGA’s journey: from Glue Logic to SoC

TimeLower

Higher

Sys

tem

Com

plex

ity

Glue Logic

Control Logic

Data Path

Heart of theSystem

16K LEs50 MHz

40K LEs100 MHz

80K LEs200 MHz

330K LEs300 MHz

Source:

FPGAs are now being used everywhere!

Topic 3 Slide 5PYKC 9-Jan-08 E3.05 Digital System Design

One LUT Size Does Not Fit All

0%

5%

10%

15%

20%

25%

30%

35%

40%

45%

50%

7-LUTs 6-LUTs 5-LUTs 4-LUTs 3-LUTs or less

Area Synthesis

Speed Synthesis

Source:

4-LUTs are no longer the best. Here is some results of real benchmarks showing the need for other sizes of LUTs.

Topic 3 Slide 6PYKC 9-Jan-08 E3.05 Digital System Design

Altera Patented design: Adaptive Logic Module

Register

Register

Adder

Adder

12345678

CombinedLogic

ALM

Inpu

ts

ALM

8-Input Fracturable LUTTwo 3 Input Adders Two Registers

Source:

First introduced in Stratix II devices – ALM has 8 inputs 2 outputs

Topic 3 Slide 7PYKC 9-Jan-08 E3.05 Digital System Design

Altera’s Stratix II/III ALM (1)

New and far more flexible Logic Element, now called Adaptive Logic Modules (ALMs)

Each ALM contains a variety of (look-up table) LUT-based resources which can implement functions with up to 7 inputs and complex logic-arithmetic functions

Source:

Topic 3 Slide 8PYKC 9-Jan-08 E3.05 Digital System Design

Altera’s Stratix II/III ALM (2)

Source:

Topic 3 Slide 9PYKC 9-Jan-08 E3.05 Digital System Design

Altera’s Stratix II/III ALM (2)

Source:

Topic 3 Slide 10PYKC 9-Jan-08 E3.05 Digital System Design

ALM in Arithmetic Mode

Each ALM can also be turned into TWO 3-bit, cascadable fast adders

4-LUT

+

4-LUT

4-LUT

+

4-LUT

f0

a

e0

bc

d

f1

e1

carry_in

carry_out

Reg0

Reg1

Source:

Topic 3 Slide 11PYKC 9-Jan-08 E3.05 Digital System Design

Xilinx’s Virtex-5 CLB

Slice

Source:

Xilinx took a different approach with fixed, larger LUTs

Each CLB has two slicesEach slice has FOUR 6-input LUTs

Retain fast look-ahead carry logic

Better intra-CLB routing

Topic 3 Slide 12PYKC 9-Jan-08 E3.05 Digital System Design

Virtex-5 Slice can be configured as memory

Two types of slices• SLICE_L: Logic only

• SLICE_M: Logic / RAM / ShiftReg

Logic-to-memory ratio optimized for area and power• Some CLBs with two SLICE_Ls

• Some CLBs with one SLICE_L plus one SLICE_M

Source:

Topic 3 Slide 13PYKC 9-Jan-08 E3.05 Digital System Design

Xilinx’s New 6-Input LUT with Two Outputs

True 6-input LUT• Any function of 6 variables• No input shared with other LUTs

Second output adds functionality• Reduces average slice count by

10% • 2 independent functions of 5

variables• 1 function of 6 variables plus

1 subfunction of 5 variables• 1 function of 3 variables plus

1 function of 2 other variables• Plus other combinations of

subfunctions...

Source:

Topic 3 Slide 14PYKC 9-Jan-08 E3.05 Digital System Design

Why 6-Input LUT?

Xilinx invented the LUT architecture• LUT4 has been the de-facto FPGA standard since

1988

But:• Transistors got smaller• LUTs became a relatively smaller part of the overall

logic• Interconnect got more significant in area and speed

Time to re-evaluate the architecture• Research shows that a LUT6 is optimal for 65-nm• Slightly bigger LUT, compensated by significant

savings through logic compaction and reduced routing

Source:

Topic 3 Slide 15PYKC 9-Jan-08 E3.05 Digital System Design

Examples of LUT6 Efficiency

-8-16-state FSM, 4-way branch, 6 encoded outputs

-64-32x32-bit Quad-Port memory (1 write + 3 read)

28%9931369Opencores_VGA_LCD

50%4816-input MUX

89%44384MicroBlaze 32x32-bit register file, 2 read ports

75%166464x16 bit RAM

50%3264Ternary adder 32bis wide

209

463

186

834

VirtexVirtex--5 5 LUTsLUTs

23%270Opencores Huffman encoder

32%672Opencores Huffman decoder

9%217Opencores discrete cosine transform

33%1232Opencores triple-DES decryptor

% % ReductionReduction

VirtexVirtex--4 4 LUTsLUTs

Source:

Topic 3 Slide 16PYKC 9-Jan-08 E3.05 Digital System Design

One LUT6 Equals 1.6 Logic Cells

LC has become an industry standard of measuring FPGA capacity• 10 years ago, a LC was defined as a LUT4 plus a flip-flop

Extensive testing with 163 benchmark designs shows that the LUT6 plus FF is equivalent to ~1.6 LCs (old 4-LUT logic cells)

Source:

Topic 3 Slide 17PYKC 9-Jan-08 E3.05 Digital System Design

MLAB(640 Bits)

32 words x 20or

64 words x 10

1 LAB

ALM

ALMALM

ALM

ALMALMALM

ALMALMALM

1 Read and 1 Write Port with No

Efficiency Loss

2. Flexible and more on-chip memories: Altera

For Stratix III, 10 ALM for 1 Logic Array Block (LAB). 50% of LABs in a Stratix III FPGA can be converted to a memory LAB (MLAB) with 640 bit of storage.This can be used as dual port memory (1 Read + 1 Write port).

Source:

Topic 3 Slide 18PYKC 9-Jan-08 E3.05 Digital System Design

Xilinx has even more flexibility to turn LUT to memory

Single port • One LUT6 = 64x1 or 32x2 RAM• Cascadable up to 256x1 RAM

in one CLB

Dual port (D)• 1 read/write port + 1 read port

Simple dual port (SDP)• 1 write-only port + 1 read-only port• Used in LUT FIFOs, and is also key

to Microblaze register file.

Quad-port (Q)• 1 read/write port + 3 read-only port

Source:

Topic 3 Slide 19PYKC 9-Jan-08 E3.05 Digital System Design

Xilinx: Quad-Port Memory in One SLICE_M

Write Port: Four LUT6s can share the write address and data

Read Ports: Three independent read operations

Provides 6x improvement in density over Virtex-4

LUTLUT

LUTLUT

LUTLUT

LUTLUT

••Independent read addressIndependent read address

••Associated dataAssociated data

••Independent read addressIndependent read address

••Associated dataAssociated data

••Independent read addressIndependent read address

••Associated dataAssociated data

CommonCommon

write addresswrite address

CommonCommonwrite datawrite data

Write PortWrite Port Read PortRead Port

New modes for distributed memory that did not exist beforeNew modes for distributed memory that did not exist before

Source:

Topic 3 Slide 20PYKC 9-Jan-08 E3.05 Digital System Design

32-bit Shift Registers in 1 LUT

Each bit uses 2 latches as master/slaveLength is dynamically determined by the A inputsCascadable up to 128x1 shift register in one CLB

Source:

Topic 3 Slide 21PYKC 9-Jan-08 E3.05 Digital System Design

3. Faster computation with embedded DSP blocks

512K RAM

Source:

Topic 3 Slide 22PYKC 9-Jan-08 E3.05 Digital System Design

Stratix II DSP Blocks

Source:

Topic 3 Slide 23PYKC 9-Jan-08 E3.05 Digital System Design

Stratix III DSP Blocks are twice the size

Source:

Topic 3 Slide 24PYKC 9-Jan-08 E3.05 Digital System Design

ACIN BCIN

ACOUT BCOUT

PCIN

PCOUT

Opt

iona

l Pip

elin

e Re

gist

er/

Rout

ing

Logi

cO

ptio

nal P

ipel

ine

Regi

ster

/Ro

utin

g Lo

gic

Opt

iona

l Pip

elin

e Re

gist

er/

Rout

ing

Logi

cO

ptio

nal P

ipel

ine

Regi

ster

/Ro

utin

g Lo

gic

Rout

ing

Logi

cRo

utin

g Lo

gic

Opt

iona

l Reg

iste

rO

ptio

nal R

egis

ter

Mul

tiplie

r

P (48-bit)*Optional P(96-bit)

C (48-bit)

B (18-bit)

A (25-bit)

=

48-bit

Xilinx’s Virtex 5 DSP Block

25x18 Multiplier(better precision and efficiency)

25x18 Multiplier(better precision and efficiency)

Pattern Detect Circuitry(enhanced functionality)Pattern Detect Circuitry(enhanced functionality)

Expanded second stage (SIMD, bitwise logic)

Expanded second stage (SIMD, bitwise logic)

Integrated Cascade Routing(scalable performance )

Integrated Cascade Routing(scalable performance )

Pipeline Registers(550Mhz performance)

Pipeline Registers(550Mhz performance)

Wider data-pathand 96-bit Accumulated output

(Higher precision)

Wider data-pathand 96-bit Accumulated output

(Higher precision)

* 96-bit output using MACC extension mode (uses 2 DSP48E slices) Source:

Topic 3 Slide 25PYKC 9-Jan-08 E3.05 Digital System Design

4. Reduced routing delay and improved performance

Altera’s Stratix III - Minimizing number of hops critical for:• Achieving high performance for faster timing closure• Reducing fitting congestion

Source:

Topic 3 Slide 26PYKC 9-Jan-08 E3.05 Digital System Design

Xilinx’s Virtex-4 routing

Fast Connect1 Hop2 Hops3 Hops

1 CLB

Interconnects can be bottlenecks in demanding designsInterconnects can be bottlenecks in demanding designsSource:

Topic 3 Slide 27PYKC 9-Jan-08 E3.05 Digital System Design

Virtex-4 Interconnect Pattern

Fast Connect1 Hop2 Hops3 Hops

Virtex-5 has More and Faster Routing Between CLBs

1 CLB

Dramatically increases design performanceDramatically increases design performance

Symmetric routing pattern reaches more CLBswith fewer hops

Source:

Topic 3 Slide 28PYKC 9-Jan-08 E3.05 Digital System Design

Result: Shorter Routing Delays

25 %

13 %

Improvement

665 ps751 ps1st Ring

723 ps906 ps2nd Ring

Virtex-5Virtex-4

715715

640 590

701

756789786 742

651

712

636

719

648

713723 699

662

747

813

792

666

681 637

Average delays from the input of the center CLB to inputs of any other CLB

565

First ringFirst ring

Second ringSecond ring

Source:

Topic 3 Slide 29PYKC 9-Jan-08 E3.05 Digital System Design

5. Lower Power – the 65nm Power Challenge

On the positive side: operating voltage (V) and node capacitance (C) decrease with process shrink, lowering Dynamic Power (CV2f)

Source:

Topic 3 Slide 30PYKC 9-Jan-08 E3.05 Digital System Design

Question: How fast should logic be?

Only a small proportion of logic is performance critical

Source:

Topic 3 Slide 31PYKC 9-Jan-08 E3.05 Digital System Design

Altera’s Programmable Power Technology (1)

Source:

Could make all logic fast (and high power)

Critical path shown in red

Topic 3 Slide 32PYKC 9-Jan-08 E3.05 Digital System Design

Programmable Power Technology

Source:

Better if only critical path logic are high speed

The rest are low-power (and low speed)

Topic 3 Slide 33PYKC 9-Jan-08 E3.05 Digital System Design

Programmable Power Technology

High Performance Where You Need It, Lowest Power Everywhere Else

Source:

Topic 3 Slide 34PYKC 9-Jan-08 E3.05 Digital System Design

Stratix III has Selectable Core Voltage

Customer selects the FPGA core voltage• 1.1 V for maximum performance• 0.9 V for minimum power

I/O and PLL voltages unaffected • Still get maximum I/O interface speed

64%

52%

Static Power Reduction From Stratix II FPGAs

0%

25%

Performance Gain Over Stratix II

FPGAs

55%0.9 V

33%1.1 V

Dynamic PowerReduction From Stratix II FPGAs

Core Voltage

Total Power Reduction reflects reduced capacitance, Programmable Power Technology, and other Stratix III architectural optimizations

Source:

Topic 3 Slide 35PYKC 9-Jan-08 E3.05 Digital System Design

Static Power Tamed (85 degree C)

Source:

Topic 3 Slide 36PYKC 9-Jan-08 E3.05 Digital System Design

Xilinx focus on Leakage Current at 65nm

Junction Temp Junction Temp °°CC

Curr

ent

Curr

ent

0 20 40 60 80 100 120 140-20-40

IGATE

IS→D

ITOTAL

source drain

gate

IIGATEGATE

IISS→→DD

ITOTAL = IS→D + IGATE

C G

rade

C G

rade

I Gra

deI G

rade

Typical Design Operating Range

Source-to-Drain Leakage dominant at high temp

Gate Leakage dominant at 25°C

Source:

Topic 3 Slide 37PYKC 9-Jan-08 E3.05 Digital System Design

Xilinx’s way to Optimize Speed and Leakage

Configuration MemoryConfiguration Memory VccoVccint

(Vgg)

Logic / BuffersLogic / Buffers IOBIOB

Interconnect Pass Gates

Transistor TypesThin Ox – Fast switching time, highest leakage

Mid Ox – Fast signal passing, low leakage

Thick Ox – Max Voltage tolerance, lowest leakage

No Compromise on PerformanceNo Compromise on Performance Source:

Topic 3 Slide 38PYKC 9-Jan-08 E3.05 Digital System Design

References & Addition Reading

J. Rose, R.J. Francis, D. Lewis and P. Chow, “Architecture of Field-Programmable Gate Arrays: The Effect of Logic Functionality on Area Efficiency”, IEEE Journal of Solid-State Circuits, pp. 1217-1225, 1990.

Altera Stratix III information:• http://www.altera.com/literature/lit-stx3.jsp

Xilinx Virtex-4 information:• http://www.xilinx.com/products/silicon_solutions/fpgas/virtex/virtex4/index.htm

Xilinx Virtex-5 information:-• http://www.xilinx.com/products/silicon_solutions/fpgas/virtex/virtex5/index.htm