High-Performance Processing with 90-nm FPGAs

Hot Chips 17, Aug 2005

Kees Vissers, Erich Goetting, Peter Alfke



2

Agenda

• Virtex-4 Overview
• Processor and FPGA performance
• Three Examples
– Cellular Telephone Base Station
– Scientific Co-Processor
– Image Enhancement for Movies
• Conclusion

3

ASMBL Using Flip-Chip

• Application-Specific Modular BLock Architecture
• Groups specific circuit blocks in dedicated columns
– Logic, DSP, BRAM, Clocking, DCMs, I/O, MGTs, PowerPC, Configuration
• I/O columns distributed throughout the device (Flip-Chip)

4

Three Virtex-4 Families

• Application-Specific Modular Block Architecture makes it easier to create sub-families
– LX has logic, BlockRAMs, DSP-Blocks, I/O
– SX has more DSP Blocks and BlockRAMs, less logic
– FX adds powerful system features: PPC, Ethernet controller, 11 Gbps transceivers

Virtex-4 = eight ‘LX, three ‘SX, six ‘FX circuits

17 family members available in 2005

5

3 Families = Scalable Performance

LX 15, 25, 40, 60, 80, 100, 160, 200

FX 12, 20, 40, 60, 100, 140

SX 25, 35, 55

(the number in each name is the approximate logic-cell count in thousands)

[Chart: LOGIC CELLS (K) versus DSP BLOCKS for the three families. Labeled points: LX200, SX55, and FX140 (+ 24 MGTs, + 2 PPC).]

6

Dedicated Circuits in FPGAs

• “Hard” cores offer density, speed, lower power
– Equal to 90-nm ASICs, but far less expensive
• Expandable, pipelined Multiplier/Accumulator
• Dual-ported BlockRAM with FIFO controller
• ChipSync I/O serializer/deserializer + IDELAY
• Multi-Gigabit transceivers, 0.6 to 11 Gbps
• PowerPC µProcessor with co-processor interface (APU) and Ethernet controller

Dedicated circuits provide a big performance boost

7

Processor and FPGA Performance

8

FPGAs: 1000:1 Performance Range

Clocks per sample    1000 : 1       100 : 1        10 : 1        1 : 1
500 MHz clock        500 Ks/sec     5 Ms/sec       50 Ms/sec     500 Ms/sec
Element              PPC            PPC + APU      Interleave    LUTs + DSP48
Memory               4-256 KByte    4-256 KByte    10 KByte      1 KByte

Applications: control → audio → mobile video → HDTV → communications

[Chart bands for other technologies: GPP, DSPs, ASICs]
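The sample rates on this slide follow from dividing the 500 MHz clock by the number of clocks spent on each sample. A minimal sketch of that arithmetic (the function and constant names are illustrative, not from the slides):

```python
CLOCK_HZ = 500e6                        # 500 MHz clock, as stated on the slide
CLOCKS_PER_SAMPLE = [1000, 100, 10, 1]  # the four operating points on the slide


def sample_rate(clock_hz, clocks_per_sample):
    """Samples per second when each sample consumes `clocks_per_sample` clocks."""
    return clock_hz / clocks_per_sample


rates = [sample_rate(CLOCK_HZ, c) for c in CLOCKS_PER_SAMPLE]

for clocks, rate in zip(CLOCKS_PER_SAMPLE, rates):
    print(f"{clocks:>4} clocks/sample -> {rate / 1e6:g} Ms/sec")
```

The ratio between the slowest and fastest operating point is the 1000:1 range named in the slide title.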

9

Covering a Wide Range

[Chart: Throughput versus Arithmetic Performance, with regions for the FX, LX, and SX families. Application labels: Network Processing, Scientific Processing, Supercomputing.]

10

Example: Cellular Telephone Base Station

11

Wireless Base Station: 3G Functional Diagram

12

TX/RX Implementation

• 3G specification: each channel has a sample rate of 3.84 Ms/sec
– Digital Filtering at 491 MHz
• 128 channels using 128x the sample rate
– Digital Up/Down Conversion
– Pre-Distortion and Digital Filtering
• Up to 512 18x18 multipliers + accumulators to build massively parallel filter implementations
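The 491 MHz figure is 128 times the 3.84 Ms/sec channel rate. The sketch below reproduces that number and, as an assumption not stated on the slide, estimates an upper bound on multiply-accumulate throughput if all 512 multipliers were busy on every clock. The names are illustrative.

```python
SAMPLE_RATE_HZ = 3.84e6   # 3G channel sample rate from the slide
OVERSAMPLING = 128        # "128x the sample rate"
MAC_UNITS = 512           # "up to 512 18x18 multipliers + accumulators"


def filter_clock_hz(sample_rate_hz, factor):
    """Clock needed to run `factor` operations per input sample on one unit."""
    return sample_rate_hz * factor


clock_hz = filter_clock_hz(SAMPLE_RATE_HZ, OVERSAMPLING)

# Assumption: every unit completes one multiply-accumulate per clock.
peak_macs_per_second = MAC_UNITS * clock_hz

print(f"filter clock: {clock_hz / 1e6:.2f} MHz")
print(f"peak MAC rate: {peak_macs_per_second / 1e9:.1f} GMAC/s")
```

Running one multiplier at 128x the sample rate means it can serve 128 filter taps (or 128 channels of one tap) per sample period, which is how a fixed pool of multipliers is shared.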

13

DSP Block

• Evolution from embedded multipliers:
– Pipeline registers enable 500 MHz performance
– Cascade logic enables sustained 500 MHz performance throughout DSP column
– Build high-speed multi-level filters using DSP Slices
– Achieve 128x the sample rate of 3.84 Ms/sec (491.5 MHz clocking)
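To illustrate how a chain of multiply-add slices with cascade registers can form a filter, here is a small behavioral model of a transposed-form FIR. This is a generic textbook structure chosen for illustration; the slides do not say which filter structure is used, and the function names are hypothetical.

```python
def cascaded_fir(samples, coeffs):
    """Transposed-form FIR: one multiply-add stage per tap.

    Stage k multiplies the current input by coeffs[k] and adds the partial sum
    that stage k+1 registered on the previous clock.
    """
    taps = len(coeffs)
    cascade = [0] * (taps + 1)  # cascade[taps] stays 0: the last stage has no upstream input
    out = []
    for x in samples:
        nxt = [coeffs[k] * x + cascade[k + 1] for k in range(taps)]
        out.append(nxt[0])      # stage 0 holds the finished output sample
        cascade = nxt + [0]     # clock edge: all stage registers update together
    return out


def direct_fir(samples, coeffs):
    """Reference: y[n] = sum over k of coeffs[k] * x[n-k]."""
    return [
        sum(coeffs[k] * samples[n - k] for k in range(len(coeffs)) if n - k >= 0)
        for n in range(len(samples))
    ]


print(cascaded_fir([1, 2, 3, 4], [1, 10, 100]))
```

Each stage only talks to its neighbor, so the critical path is one multiply and one add regardless of the number of taps, which is the property that lets a long column keep running at the full clock rate.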

14

Wireless Base Station Example: Virtex-4 Solution

[Diagram with four blocks labeled FX]

Support for 128 channels using Virtex-4 FPGAs

15

Example: Scientific Co-Processor

16

Supercomputer

• Programmed with Carte™ tools, using C or Fortran
• Overlap of DMA and computing

[Block diagram: Series D MAP (MAP® Series D & E). Labels recoverable from the figure:]
– On-Board Memory: six banks, dual-ported (24 MB)
– Dual-ported Memory (4 MB)
– Controller: XC2V6000
– Microcode ROM, Config ROM
– User Logic 1: XC2VP100
– User Logic 2: XC2VP100
– Link labels: 4800 MB/s (6 x 64b); 192b; 108b; GPIO, 2400 MB/s each; 1400 MB/s sustained payload

MAP and Carte are trademarks of SRC Computers, Inc.
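"Overlap of DMA and computing" is commonly achieved with double buffering: the next block is fetched while the current one is processed. The sketch below shows the idea in plain Python with a worker thread standing in for the DMA engine. It does not use Carte or any SRC API; `dma_fetch` and `compute` are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def dma_fetch(block_id, block_size=4):
    """Hypothetical stand-in for a DMA transfer: returns one block of input data."""
    start = block_id * block_size
    return list(range(start, start + block_size))


def compute(block):
    """Hypothetical stand-in for the user-logic computation on one block."""
    return sum(x * x for x in block)


def process_overlapped(num_blocks):
    """Double buffering: request block i+1 before computing on block i."""
    results = []
    if num_blocks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(dma_fetch, 0)
        for i in range(num_blocks):
            block = pending.result()                     # wait for the current block
            if i + 1 < num_blocks:
                pending = dma.submit(dma_fetch, i + 1)   # start the next transfer
            results.append(compute(block))               # compute while it is in flight
    return results


print(process_overlapped(2))
```

When transfer and compute take similar time, this arrangement hides most of the transfer latency behind the computation.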

17

MAP® Logic Performance

Number of 2.8 GHz microprocessors required to equal the performance of one SRC MAP® on select algorithms

[Bar chart: Performance Speedup of a Single MAP vs. DLDs. The axis runs from 0 to 200, then changes scale (ticks at 1000 and 2000). Legend: Series C MAP (XC2V6000 User Logic); Series D & E MAP (XC2VP100 User Logic).]

Bar values as labeled on the chart:
69      P7Viterbi
52      Bit Matrix Multiply
206     3DES
150     SPAM
40      Wavelet Compression
280     3DES/Med. Filter & Edge Det./3DES
2170    Median Filter & Edge Detector
30      Median Filter & Edge Detector

Source: SRC Computers, Inc. Comparisons are based on measured single microprocessor and single MAP processor performance

18

Example: Image Enhancement for Movies

19

Digital Film Processing

High-resolution digital film scanning and preprocessing

[Chart: image ops / s versus data rate. Levels shown: SDTV (DVD); HDTV – High resolution TV (720p); HDTV (high-end) – Film/Broadcast, Blu-ray Disc, HD-DVD; digital film processing with computation-intensive special functions. Step factors marked on the chart: x 6, x 6, x 9, x 9, x 100.]

20

Application: Noise Reduction

[Image sequence: Original, 1. Filter stage, 2. Filter stage, 3. Filter stage]

21

Application: Noise Reduction

[Block diagram from IN to OUT. Blocks: bi-directional motion analysis; motion compensation; 3D-DWT; wavelet-based noise reduction; inverse 3D-DWT; 3D median filter; noise post processing]
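One of the blocks named above is a 3D median filter. As a rough illustration of what such a block computes, the following sketch applies a 3x3x3 median over a small stack of frames. It is a generic reference model, not the implementation behind the slides; the window size and border handling are assumptions.

```python
from statistics import median


def median_filter_3d(volume):
    """3x3x3 median over a [frame][row][col] volume.

    At the borders the window shrinks to the neighbours that exist (an assumption).
    """
    frames, rows, cols = len(volume), len(volume[0]), len(volume[0][0])
    out = [[[0] * cols for _ in range(rows)] for _ in range(frames)]
    for t in range(frames):
        for y in range(rows):
            for x in range(cols):
                window = [
                    volume[tt][yy][xx]
                    for tt in range(max(0, t - 1), min(frames, t + 2))
                    for yy in range(max(0, y - 1), min(rows, y + 2))
                    for xx in range(max(0, x - 1), min(cols, x + 2))
                ]
                out[t][y][x] = median(window)
    return out


# Three flat 3x3 frames with a single noisy pixel in the middle frame.
clip = [[[10] * 3 for _ in range(3)] for _ in range(3)]
clip[1][1][1] = 255
filtered = median_filter_3d(clip)
print(filtered[1][1][1])
```

Because the window spans neighbouring frames as well as neighbouring pixels, an isolated outlier in one frame is replaced by the value that dominates its spatio-temporal neighbourhood.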

22

Film Processing Statistics

Image 2048x2048, 10 bits per RGB component = 30 bits/pixel

2D intra-frame algorithm

• Software solution: compiled from Matlab -> C/C++, on Pentium IV 2.4GHz, 1.5GB RAM, 10.2 Specint2000
Performance: 70 seconds per frame
• Hardware solution: One XC2VP50-6*, 120 MHz, synthesis,
Performance: 24 frames per second

FPGA speed-up: 1,680

Operations (Giga-ops):
Multiply    5.81
Add         9.44
Compare     3.62

* an XC2VP50 is comparable to a Virtex-4 LX40
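The frame format and frame rate quoted above fix the pixel throughput the hardware has to sustain. The short calculation below derives it from the slide's numbers; the conclusion about clocks per pixel is an inference from those numbers, not a statement from the slides.

```python
WIDTH = HEIGHT = 2048     # image size from the slide
BITS_PER_PIXEL = 30       # 10 bits per RGB component
FRAMES_PER_SECOND = 24    # hardware performance from the slide
CLOCK_HZ = 120e6          # 120 MHz FPGA clock from the slide

pixels_per_second = WIDTH * HEIGHT * FRAMES_PER_SECOND
bits_per_second = pixels_per_second * BITS_PER_PIXEL
clocks_per_pixel = CLOCK_HZ / pixels_per_second

print(f"pixel rate:  {pixels_per_second / 1e6:.1f} Mpixel/s")
print(f"data rate:   {bits_per_second / 1e9:.2f} Gbit/s")
print(f"clock budget: {clocks_per_pixel:.2f} clocks per pixel")
```

With barely more than one clock available per pixel, the design has to accept close to one new pixel every cycle, which suggests a deeply pipelined datapath rather than a sequential one.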

23

Film Processing Statistics

Image size 2048x2048, 10 bit per RGB component

3D motion estimation/compensation inter-frame algorithm

• Software solution: compiled from Matlab -> C/C++, on Pentium IV 2.4GHz, 1.5GB RAM, 10.2 Specint2000
Performance: 11 minutes per frame
• Hardware solution: Four XC2VP50-6, 120 MHz, synthesis,
Performance: 24 frames per second

FPGA speed-up: 15,840

Operations (Giga-ops):
Multiply    11.63
Add         179.61
Compare     11.88
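The speed-up figures on the two statistics slides can be checked directly: the software time per frame multiplied by the hardware frame rate gives the ratio of throughputs. A minimal check (the function name is illustrative):

```python
def speedup(software_seconds_per_frame, hardware_frames_per_second):
    """Ratio of hardware to software throughput, both expressed in frames per second."""
    software_fps = 1.0 / software_seconds_per_frame
    return hardware_frames_per_second / software_fps


intra_frame = speedup(70, 24)        # 2D intra-frame: 70 seconds per frame in software
inter_frame = speedup(11 * 60, 24)   # 3D inter-frame: 11 minutes per frame in software

print(round(intra_frame), round(inter_frame))
```

Note that the inter-frame result uses four FPGAs, so the per-device figure is a quarter of the total; this is an inference from the device count on the slide.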

24

DSP Programming Model

Work in the language of your problem

System Level Modeling & Simulation Framework

[Diagram labels: HDL, HW in the loop, MATLAB, C]

Methodology re-couples behavior with implementation
(while abstracting hardware details whenever possible)

25

Conclusion

• FPGAs combine configurable logic, fast I/O, processors, DSP elements, and hard cores

• Virtex-4 has a family of scalable solutions

• For high-performance applications, the speed-up over a good general purpose processor can range from tens to several thousands

• Excellent FPGA-based solutions exist for base-stations, video-applications, networking, and high performance compute platforms

26

Acknowledgements

• The whole V4 team of the Advanced Product Division at Xilinx, see www.xilinx.com

• SRC Computers, Inc. for data on the compute platform, see www.srccomp.com

• University of Braunschweig, Prof. Rolf Ernst and Amilcar Lucas for the film processing data, see www.flexfilm.org

27

Appendix: Virtex-4 Family

Device      Logic Cells  Block RAM [Kb]  DCM  SelectIO  XtremeDSP Slice  PowerPC  10/100/1000 EMAC  RocketIO Transceiver
XC4VLX15         13,824             864    4       320               32        -                 -                     -
XC4VLX25         24,192           1,296    8       448               48        -                 -                     -
XC4VLX40         41,472           1,728    8       640               64        -                 -                     -
XC4VLX60         59,904           2,880    8       640               64        -                 -                     -
XC4VLX80         80,640           3,600   12       768               80        -                 -                     -
XC4VLX100       110,592           4,320   12       960               96        -                 -                     -
XC4VLX160       152,064           5,184   12       960               96        -                 -                     -
XC4VLX200       200,448           6,048   12       960               96        -                 -                     -
XC4VSX25         23,040           2,304    4       320              128        -                 -                     -
XC4VSX35         34,560           3,456    8       448              192        -                 -                     -
XC4VSX55         55,296           5,760    8       640              512        -                 -                     -
XC4VFX12         12,312             648    4       320               32        1                 2                     -
XC4VFX20         19,224           1,224    4       320               32        1                 2                     8
XC4VFX40         41,904           2,592    8       448               48        2                 4                    12
XC4VFX60         56,880           4,176   12       576              128        2                 4                    16
XC4VFX100        94,896           6,768   12       768              160        2                 4                    20
XC4VFX140       142,128           9,936   20       896              192        2                 4                    24
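The appendix figures can be used to check the naming rule from the family overview, where the number in a device name stands for its logic-cell count in thousands. The sketch below takes the logic-cell counts from the appendix and compares them with the nominal value implied by each name. The helper name and the tolerance printed at the end are illustrative.

```python
# Logic-cell counts copied from the appendix table.
LOGIC_CELLS = {
    "XC4VLX15": 13824, "XC4VLX25": 24192, "XC4VLX40": 41472, "XC4VLX60": 59904,
    "XC4VLX80": 80640, "XC4VLX100": 110592, "XC4VLX160": 152064, "XC4VLX200": 200448,
    "XC4VSX25": 23040, "XC4VSX35": 34560, "XC4VSX55": 55296,
    "XC4VFX12": 12312, "XC4VFX20": 19224, "XC4VFX40": 41904,
    "XC4VFX60": 56880, "XC4VFX100": 94896, "XC4VFX140": 142128,
}


def nominal_cells(device):
    """Logic cells implied by the name: the digits after 'XC4V' plus the family letters, times 1000."""
    return int(device[6:]) * 1000


ratios = {dev: cells / nominal_cells(dev) for dev, cells in LOGIC_CELLS.items()}

# Count devices per family (characters 4-5 of the name: LX, SX, or FX).
family_counts = {}
for dev in LOGIC_CELLS:
    family_counts[dev[4:6]] = family_counts.get(dev[4:6], 0) + 1

print(family_counts)
print(f"actual/nominal ranges from {min(ratios.values()):.2f} to {max(ratios.values()):.2f}")
```

The family counts reproduce the eight LX, three SX, and six FX devices stated earlier, and the ratios show that the name is a rounded label rather than an exact count.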