FPGAs for Signal Processing and Communication Systems

Raghu Rao, Wireless and Signal Processing Group, Xilinx Inc., 05/14/2010

R. M. Rao, 2008

Agenda
• Overview of FPGAs
– Building DSP sub-systems on FPGAs
– Digital baseband
• The Platform FPGA
• Communication systems and DSP on FPGAs
• Architectural tradeoffs for FPGAs
– The Matrix inversion problem
• FPGA tools and design methodology


What are FPGAs?

• An array of configurable logic blocks with configurable interconnects between them.

• Each logic block can implement any 6-input combinatorial function.

• Logic blocks can be connected to generate larger circuits.

• Additional DSP specific resources (multiply accumulate units).
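Behaviourally, a 6-input logic block can be thought of as a 64-entry truth table addressed by its six input bits, which is why it can implement any 6-input combinatorial function. The sketch below illustrates that idea only; the names `make_lut6` and `lut6_eval` are hypothetical and do not correspond to any Xilinx tool or primitive.

```python
def make_lut6(func):
    """Precompute the 64-entry truth table of a 6-input Boolean function."""
    table = []
    for index in range(64):
        # bits[0] is the least significant input of the table address
        bits = [(index >> i) & 1 for i in range(6)]
        table.append(func(*bits) & 1)
    return table


def lut6_eval(table, bits):
    """Look up the output for six input bits (bits[0] is the LSB)."""
    index = sum(bit << i for i, bit in enumerate(bits))
    return table[index]


# Example: "configure" one LUT as 6-input parity and another as 6-input AND.
parity_table = make_lut6(lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f)
and_table = make_lut6(lambda a, b, c, d, e, f: a & b & c & d & e & f)
```

Changing the function only changes the table contents, not the lookup structure, which mirrors how reconfiguring an FPGA changes behaviour without changing the silicon.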


Virtex-4/5 FPGA Architecture: High-Level View

• FPGA family with 3 members tailored for specific classes of processing
– SX: DSP

– LX: Logic centric

– FX: Full featured

• Embedded PowerPC hard IP

• Giga-bit serial connectivity

• DSP processing tiles “DSP48”


Virtex-5 FPGA Platform

• 2 slices per CLB, 4 LUTs per slice
• LUTs can be configured as a shift register
• LUTs can be configured as distributed memory (RAM)


[Figure: Virtex-5 DSP48E slice block diagram. Inputs A (25-bit), B (18-bit) and C (48-bit); cascade ports ACIN/BCIN, ACOUT/BCOUT and PCIN/PCOUT; multiplier; optional pipeline registers and routing logic on each path; pattern detect comparator (=); output P (48-bit), optional P (96-bit).]

Virtex-5 DSP48E: Full Custom Design Enabling Efficient DSP

• New 25x18 input increases precision and efficiency
• Pattern detect circuitry increases functionality
• New second stage enables SIMD and bitwise logic operations
• Cascade routing enables scalable performance
• Pipeline registers enable 550 MHz performance
• Wider internal data path and 96-bit accumulated output enable higher precision


Dynamically Reconfigurable DSP OPMODEs

• Over 40 different modes
• Each XtremeDSP Slice individually controllable
• Change operation in a single clock cycle
• Enables resource sharing for maximum utilization

OPMODE bits are listed as Z (bits 6-4), Y (bits 3-2), X (bits 1-0).

Mode | OpMode (Z Y X) | Output
Zero | 000 00 00 | +/- Cin
Hold P | 000 00 10 | +/- (P + Cin)
A:B Select | 000 00 11 | +/- (A:B + Cin)
Multiply | 000 01 01 | +/- (A * B + Cin)
C Select | 000 11 00 | +/- (C + Cin)
Feedback Add | 000 11 10 | +/- (C + P + Cin)
36-Bit Adder | 000 11 11 | +/- (A:B + C + Cin)
P Cascade Select | 001 00 00 | PCIN +/- Cin
P Cascade Feedback Add | 001 00 10 | PCIN +/- (P + Cin)
P Cascade Add | 001 00 11 | PCIN +/- (A:B + Cin)
P Cascade Multiply Add | 001 01 01 | PCIN +/- (A * B + Cin)
P Cascade Add | 001 11 00 | PCIN +/- (C + Cin)
P Cascade Feedback Add Add | 001 11 10 | PCIN +/- (C + P + Cin)
P Cascade Add Add | 001 11 11 | PCIN +/- (A:B + C + Cin)
Hold P | 010 00 00 | P +/- Cin
Double Feedback Add | 010 00 10 | P +/- (P + Cin)
Feedback Add | 010 00 11 | P +/- (A:B + Cin)
Multiply-Accumulate | 010 01 01 | P +/- (A * B + Cin)
Feedback Add | 010 11 00 | P +/- (C + Cin)
Double Feedback Add | 010 11 10 | P +/- (C + P + Cin)
Feedback Add Add | 010 11 11 | P +/- (A:B + C + Cin)
C Select | 011 00 00 | C +/- Cin
Feedback Add | 011 00 10 | C +/- (P + Cin)
36-Bit Adder | 011 00 11 | C +/- (A:B + Cin)
Multiply-Add | 011 01 01 | C +/- (A * B + Cin)
17-Bit Shift P Cascade Select | 101 00 00 | Shift(PCIN) +/- Cin
17-Bit Shift P Cascade Feedback Add | 101 00 10 | Shift(PCIN) +/- (P + Cin)
17-Bit Shift P Cascade Add | 101 00 11 | Shift(PCIN) +/- (A:B + Cin)
17-Bit Shift P Cascade Multiply Add | 101 01 01 | Shift(PCIN) +/- (A * B + Cin)
17-Bit Shift P Cascade Add | 101 11 00 | Shift(PCIN) +/- (C + Cin)
17-Bit Shift P Cascade Add Add | 101 11 11 | Shift(PCIN) +/- (A:B + C + Cin)
17-Bit Shift Feedback | 110 00 00 | Shift(P) +/- Cin
17-Bit Shift Feedback Feedback Add | 110 00 10 | Shift(P) +/- (P + Cin)
17-Bit Shift Feedback Add | 110 00 11 | Shift(P) +/- (A:B + Cin)
17-Bit Shift Feedback Multiply Add | 110 01 01 | Shift(P) +/- (A * B + Cin)
17-Bit Shift Feedback Add | 110 11 00 | Shift(P) +/- (C + Cin)
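The OPMODE table can be read as three multiplexer selections (Z, Y, X) feeding one adder/subtracter, with the output equal to Z +/- (X + Y + Cin). The following is a simplified behavioural sketch of that reading, not a bit-accurate model: the function name `dsp48_output` and its arguments are hypothetical, operands are assumed non-negative, A:B is modelled as an 18-bit:18-bit concatenation, and 48-bit wraparound is ignored.

```python
def dsp48_output(opmode, a=0, b=0, c=0, p=0, pcin=0, cin=0, subtract=False):
    """Evaluate one OPMODE of a simplified DSP48 model: Z +/- (X + Y + Cin)."""
    x_sel = opmode & 0b11          # bits 1-0
    y_sel = (opmode >> 2) & 0b11   # bits 3-2
    z_sel = (opmode >> 4) & 0b111  # bits 6-4

    if x_sel == 0b01 and y_sel == 0b01:
        # X and Y together carry the multiplier's partial products.
        xy = a * b
    else:
        x_inputs = {0b00: 0, 0b10: p, 0b11: (a << 18) | (b & 0x3FFFF)}
        y_inputs = {0b00: 0, 0b11: c}
        if x_sel not in x_inputs or y_sel not in y_inputs:
            raise ValueError("OPMODE combination not listed in the table")
        xy = x_inputs[x_sel] + y_inputs[y_sel]

    z_inputs = {0b000: 0, 0b001: pcin, 0b010: p, 0b011: c,
                0b101: pcin >> 17, 0b110: p >> 17}
    if z_sel not in z_inputs:
        raise ValueError("Z selection not listed in the table")
    z = z_inputs[z_sel]

    return z - (xy + cin) if subtract else z + (xy + cin)


# Multiply-accumulate (OPMODE 010 01 01): P + A * B
mac_result = dsp48_output(0b0100101, a=3, b=4, p=10)
```

Because the mode is just a 7-bit input, a controller can switch the same slice between, say, multiply and multiply-accumulate from one clock cycle to the next, which is the resource sharing the slide refers to.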


Reconfigurability

[Figure: FPGA floorplan with a waveform identification module and separate regions for Waveform 1 and Waveform 2]

• Waveform 3 can be reconfigured into this region of the FPGA.
• Waveform 2 can be “reloaded” into its region when the waveform identification module detects Waveform 2 being received.


Virtex-6 resources


GMACs Performance DSP48 slices


Processing capabilities of FPGAs

BDTI Certified(tm) Results (c) 2008 BDTI. For more info and results see www.BDTI.com.


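[Figure: Virtex-4 DSP48 slice datapath. 18-bit A and B inputs, with B cascadable through BCIN/BCOUT; 48-bit C input; pipelined 18x18 multiplier (3-delay latency, z^-3) whose 36-bit product is sign extended to 48 bits; X, Y and Z multiplexers (with ZERO inputs) feeding a 48-bit adder/subtracter with CIN and SUB controls; wire shift right by 17 bits on the PCIN and P feedback paths; 48-bit output register P with PCOUT cascade to the adjacent DSP48 tile; output shown as LS word and MS word.]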


Pipelined Complex 18x18 MPY

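[Figure: pipelined complex 18x18 multiplier built from four cascaded DSP48 slices S1 to S4 (sn = slice n). 18-bit inputs Ar, Ai, Br, Bi; 36-bit products sign extended to 48 bits; ‘0’ on the first cascade input of each slice pair; one pair adds and the other subtracts on the 48-bit cascade; registered outputs Pr and Pi.]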


Wide Filters At Full Speed Within the Virtex-4 DSP Slice Column

• Systolic N-tap FIR
– Scalable N-levels deep implementation
– N-levels deep at 500 MHz performance
• Uses integrated pipeline registers to synchronize filter inputs
• Utilizes input and output cascade routing

Build a massively parallel 512-tap FIR filter in a single device, achieving 256 GMAC/s performance.

An equivalent implementation would consume 444 embedded multipliers and 77,008 LCs, and would only achieve ½ the performance.
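The 256 GMAC/s figure is consistent with 512 taps each performing one multiply-accumulate per cycle at 500 MHz (512 x 500e6 = 256e9). The sketch below is a cycle-by-cycle behavioural model of a systolic FIR under stated assumptions: two input registers between taps, one register on the cascaded partial sum, and hypothetical function names `fir_direct` and `fir_systolic`. It is meant only to show that the systolic form produces the direct-form output delayed by N-1 cycles.

```python
def fir_direct(samples, coeffs):
    """Reference direct-form FIR: y[n] = sum_k h[k] * x[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, h in enumerate(coeffs):
            if n - k >= 0:
                acc += h * samples[n - k]
        out.append(acc)
    return out


def fir_systolic(samples, coeffs):
    """Systolic FIR model: tap k sees x delayed by 2k cycles and adds its
    product to the partial sum registered by tap k-1 in the previous cycle."""
    n_taps = len(coeffs)
    x_line = [0] * (2 * (n_taps - 1) + 1)  # input delay line (2 registers per tap)
    sums = [0] * n_taps                    # registered partial sums (cascade path)
    out = []
    for x in samples:
        x_line = [x] + x_line[:-1]
        new_sums = []
        for k, h in enumerate(coeffs):
            cascade_in = sums[k - 1] if k > 0 else 0
            new_sums.append(cascade_in + h * x_line[2 * k])
        sums = new_sums
        out.append(sums[-1])
    return out


# Peak rate quoted on the slide: 512 taps x 500 MHz
peak_macs_per_second = 512 * 500_000_000
```

Every tap only talks to its neighbour, so the critical path stays one multiply-add long regardless of N, which is what allows the filter to scale at full clock rate.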


Xilinx FFT IP (4)

• FFT fully utilizes FPGA arithmetic hardware resources

• FFT viewed as a recursion using a butterfly kernel

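Phase factors: $W_N^k = e^{-j 2\pi k / N}$

[Figure: butterfly kernel built from two complex adders (CADD1, CADD2) and a complex multiplier (CMPY) that applies the phase factor]

• CADD{1|2}: complex adder
• CMPY: complex multiplier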
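To make the "FFT as a recursion using a butterfly kernel" idea concrete, here is a minimal radix-2 sketch. It assumes a decimation-in-time butterfly (one complex multiply followed by two complex adds); the slide does not say which butterfly form the Xilinx core uses, so the mapping of CMPY, CADD1 and CADD2 onto the lines below is an assumption, and the function names are hypothetical.

```python
import cmath


def twiddle(k, n):
    """Phase factor e^{-j 2 pi k / N}."""
    return cmath.exp(-2j * cmath.pi * k / n)


def butterfly(a, b, w):
    """One butterfly: a complex multiply (CMPY) and two complex adds (CADD1, CADD2)."""
    t = w * b
    return a + t, a - t


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], twiddle(k, n))
    return out


def dft(x):
    """Naive O(N^2) DFT used only as a reference."""
    n = len(x)
    return [sum(x[m] * twiddle(k * m, n) for m in range(n)) for k in range(n)]
```

An N-point transform needs (N/2) log2(N) butterflies, so the area/throughput tradeoff amounts to how many physical butterflies are instantiated versus how many are time-shared.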


Virtex-4 DSP Slice

• DSP slice key for implementing high-performance arithmetic
• Embedded 18x18 MPY and 48b adder
– Butterfly phase rotator
– Cross-addition


Butterfly CMPLX MPY

• Complex MPY used in FFT butterfly

• Optimized to employ Virtex-4 DSP Slice
– 4 and 3 MPY option
• Complex MPY available as IP module†

[Figure: complex multiplier built from DSP Slices 1 to 4, with inputs Ar, Ai, Br, Bi and outputs Pr, Pi]

Pr + jPi = (Ar+jAi) x (Br + jBi)

† Available: 6.2i IP Update 2
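The "4 and 3 MPY option" refers to two standard ways of forming a complex product. The sketch below shows both for the equation Pr + jPi = (Ar + jAi) x (Br + jBi); the function names are hypothetical, and the particular 3-multiplier factorization used here is one common choice, not necessarily the one inside the Xilinx IP.

```python
def cmul4(ar, ai, br, bi):
    """Complex multiply with 4 real multiplications and 2 additions."""
    pr = ar * br - ai * bi
    pi = ar * bi + ai * br
    return pr, pi


def cmul3(ar, ai, br, bi):
    """Complex multiply with 3 real multiplications and 5 additions."""
    k1 = br * (ar + ai)
    k2 = ar * (bi - br)
    k3 = ai * (br + bi)
    pr = k1 - k3  # = ar*br - ai*bi
    pi = k1 + k2  # = ar*bi + ai*br
    return pr, pi
```

The 3-multiplier form trades one multiplier for extra pre-adders and one extra bit of operand width, which matters when DSP slices are the scarce resource.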


Performance/Parallelism/Area

• FPGA: highly parallel computing machine
• Achieve performance using functional unit parallelism
• Area/throughput tradeoff delivered via Xilinx IP library
• Butterfly array to produce high-performance FFT processor
• High computation rate using (possibly) hundreds of DSP slices
– Allocate resources as appropriate to meet system requirements
• Large memory bandwidth using multi-port memory constructed from BRAMs

Mem read BW: 320 x 36 x 500e6 = 5.76 Tera-bps


FFT Architecture

• For a small number of carriers and modest data rates, a single-butterfly (I)FFT is probably suitable
– Small FPGA footprint

[Figure: single-butterfly FFT architecture: input data, switch, Data RAM 0 and Data RAM 1, iteration engine with phase factor ROM, switch, output data]


Block boundary detection/Fine timing acquisition

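[Figure: block boundary detector. The incoming SAMPLES and the KNOWN SEQUENCE run through tapped delay lines (z^-1); the products, with a complex conjugate ()*, are summed. The squared magnitude |·|² of the sum gives the timing estimate; the argument (arg), averaged (ave), gives the frequency estimate. Annotations: 1 OFDM block of repeated data; half an OFDM block.]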

F. Tufvesson, O. Edfors, M. Faulkner, “Time and Frequency Synchronization for OFDM using PN-Sequence Preambles”, VTC-1999/Fall, vol 4, pp.2203-7, New Jersey, 1999.


Fine-timing acquisition using a clipped correlator

[Figure: System Generator model of the clipped correlator. A bank of correlator cells C1 to C7 (ports xn, DAddr, CAddr, LD, yn, xnz), each containing a coefficient ROM and a 1-bit MAC, is driven by an FSM (BaudClk, Data Addr, Coef Addr, load) and chained through delay blocks; the cell outputs are combined in an adder tree (AddSub blocks) to give y.]

• Bank of correlators built from 1-bit correlators
• 10 time multiplexed correlators
• Each 1-bit correlator: 10 slices
• Total for clipped correlator: 589 slices
• Full precision correlators: 32 embedded multipliers, 896 flip-flops
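The resource savings come from correlating only the sign bit of each sample against the known sequence, so every multiply reduces to a sign comparison. The sketch below illustrates that principle on real-valued samples; it is not the time-multiplexed hardware structure from the slide. The Barker-13 code stands in for the actual preamble (an assumption), and the function names are hypothetical.

```python
BARKER_13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]


def sign(x):
    """Clip a sample to 1 bit: +1 for non-negative, -1 for negative."""
    return 1 if x >= 0 else -1


def clipped_correlate(samples, code):
    """Correlate the sign bits of the samples against a +/-1 code at every lag."""
    signs = [sign(s) for s in samples]
    n_lags = len(samples) - len(code) + 1
    return [sum(signs[n + k] * c for k, c in enumerate(code))
            for n in range(n_lags)]


def detect_offset(samples, code):
    """Return (lag, value) of the first maximum of the clipped correlation."""
    corr = clipped_correlate(samples, code)
    peak = max(corr)
    return corr.index(peak), peak


# The code is embedded at offset 7 with an arbitrary amplitude of 3.0;
# clipping makes the peak value independent of that amplitude.
rx = [0.5] * 7 + [3.0 * c for c in BARKER_13] + [0.5] * 5
offset, peak = detect_offset(rx, BARKER_13)
```

Since each product is +/-1, a hardware cell needs only an XNOR and a small up/down counter, which is consistent with the roughly 10 slices per 1-bit correlator quoted above.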


The Platform

• Connectivity
– Serial Gigabit
– OBSAI/CPRI
– Proprietary serial backplane
– Inter-chip connectivity
• Embedded Software
– MAC (Media Access)
– Decision oriented tasks
– CORBA
– RTOS
– NBAP
– SCA (JTRS radios)
• Logic & IO
– OBSAI/CPRI
– SRIO
– AD/DA interface
– EMIF
• High Performance Processing
– High MIPS tasks
– Radio PHY
– Supported by embedded DSP tiles, distributed memory, block memory and logic fabric

[Figure: platform block diagram with DACs and ADCs, DUC, DDC, CFR, DPD, RACH, Searcher, OFDM PHY, TCC and MIMO blocks, and SRIO and EMIF interfaces]


Digital Receiver Architecture: Abstracted Architecture

• Common model of abstraction for digital receiver is inner/outer receiver

Receiver Abstraction

• Digital IF Processing
– Up-Conversion
– Down-Conversion
– Channelizer
– Fast AGC
• Inner Receiver
– Frequency Offset Estimation/Correction
– Sample Clock Offset Correction
– Channel Estimation/Equalization
– Frame detection
– AGC
– Successive Interference Cancellation
– Space-Time-Coding
– IFFT/FFT
– Per sub-carrier processing (Beamforming, QRD-RLS)
• Outer Receiver
– Channel Coding (LDPC, TPC, CTC, Viterbi, (De-)Interleave)
• Control, Protocol and Link Layer processing
– Medium Access Control (MAC)
– Link Layer Processing
– System Initialization, Control and Monitoring
– Application
– Ethernet, PCI Express, SRIO
– CPRI, OBSAI


Receiver Abstraction and Projection on to Platform FPGA

Receiver Function | Characteristics | FPGA Platform | Comments
Digital IF Processing | MAC intensive | SX | DSP48 main requirement
Inner Receiver | MAC intensive; some functions LUT intensive (CORDIC in QRD-RLS); FFT processing for OFDM; correlation processing for timing; per-carrier complexity processing (MIMO-OFDM) | SX/LX | DSP48 leveraged FFT; FPGA fabric for CORDIC
Outer Receiver | Symbol rate tasks; channel coding | LX | ACS/ACSO dominated by low bit precision adds/multiplexers; good match for fabric; lots of memory required
Control/Protocol | Gigabit connectivity; Linux OS “heavy” tasks; TCP/IP | FX | Embedded PPC used; RocketIO for PCI Express, SRIO

[Figure: receiver abstraction layers annotated with the matching FPGA families (SX, SX/LX, LX, FX); the number of sub-carriers is N for both TX and RX]

FPGA product portfolio tailored for various processing tasks in a communications receiver.


Digital Frontend

• Digital upconversion (downconversion)
• Crest factor reduction
• Digital pre-distortion

Wired Communications

• Flexible serial transceivers support multi-rate applications.
• GTX transceivers run at 150 Mbps to 6.5 Gbps with 25% lower power consumption.
• GTH transceivers support line rates beyond 11 Gbps to enable 40G and 100G protocols and more.


Orthogonal Frequency Division Multiplexing (OFDM)

[Figure: channel magnitude versus frequency, divided into narrow subcarrier bands]

OFDM divides a frequency selective channel into a number of flat fading channels.


OFDM Modulation

(a) Transmitter: QAM Mapping -> S/P -> IFFT -> Cyclic Prefix -> P/S -> D/A and RF

(b) Receiver: RF and A/D -> Strip cyclic prefix -> S/P -> FFT -> FEQ -> P/S -> QAM decoding

• A QAM symbol is modulated onto each subcarrier
• IFFT/FFT are used for efficient modulation and demodulation
– IFFT: frequency domain to time domain
– FFT: time domain to frequency domain
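The transmit and receive chains above can be sketched end to end in a few lines. This is an illustrative model under stated assumptions: a naive O(N^2) DFT stands in for the FFT core, the channel is a hypothetical noise-free 2-tap FIR that the receiver knows exactly, and all function names are made up for the example. It shows why a cyclic prefix lets a one-tap frequency-domain equalizer (FEQ) per subcarrier undo a frequency selective channel.

```python
import cmath


def dft(x, inverse=False):
    """Naive DFT/IDFT (stand-in for the FFT/IFFT blocks)."""
    n = len(x)
    sgn = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sgn * 2j * cmath.pi * k * m / n)
                  for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


def ofdm_modulate(symbols, cp_len):
    """IFFT the subcarrier symbols, then prepend the cyclic prefix (cp_len >= 1)."""
    time = dft(symbols, inverse=True)
    return time[-cp_len:] + time


def channel(tx, taps):
    """Multipath channel modelled as a causal FIR filter."""
    return [sum(h * tx[n - k] for k, h in enumerate(taps) if n - k >= 0)
            for n in range(len(tx))]


def ofdm_demodulate(rx, n_sub, cp_len, taps):
    """Strip the cyclic prefix, FFT, and apply a one-tap FEQ per subcarrier."""
    freq = dft(rx[cp_len:cp_len + n_sub])
    h_freq = dft(list(taps) + [0] * (n_sub - len(taps)))
    return [y / h for y, h in zip(freq, h_freq)]


# Eight QPSK symbols, one per subcarrier
qpsk = [1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j, 1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
taps = [1.0, 0.5]
received = channel(ofdm_modulate(qpsk, cp_len=2), taps)
recovered = ofdm_demodulate(received, n_sub=8, cp_len=2, taps=taps)
```

As long as the cyclic prefix is at least as long as the channel memory, the linear convolution looks circular over the FFT window, so each subcarrier sees a single complex gain.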


MIMO Systems

[Figure: M transmit antennas (Tx Antenna 1, 2, ..., M) and N receive antennas (Rx Antenna 1, 2, ..., N) linked by the channel matrix H]

• MIMO systems:
– Multiple antennas at the transmitter and receiver.
• 3 types of MIMO systems:
– STBC MIMO systems: diversity gain.
– Spatial multiplexing MIMO systems: capacity/throughput gain.
– Feedback MIMO systems: higher performance through interference reduction.
• MISO (multiple input single output) systems:
– STBC can be used with just 1 receive antenna.
– Provides diversity gain.
– To achieve array gain, need knowledge of the channel at the transmitter (feedback).


Spatial Multiplexing

• A spatial multiplexing MIMO system transmits different data symbols from each transmitter.

• The signals from each transmitter combine over the air and are received by multiple receive antennas.

• SM systems have a rate=M (num transmit antennas). The diversity order depends on the type of encoding and receiver (uncoded SM with ML decoding has diversity order=N (num receive antennas)).

[Figure: three modulators transmit x(t), y(t) and z(t), generated from symbol streams x(n), y(n), z(n). The MIMO receiver observes r1(t) = a11 x(t) + a12 y(t) + a13 z(t) through r3(t) = a31 x(t) + a32 y(t) + a33 z(t) and recovers x(n), y(n), z(n).]


MIMO and OFDM

• MIMO – Multiple Input Multiple Output Communication System. Employs multiple antennas at both transmitter and receiver.

• OFDM – Orthogonal Frequency Division Multiplexing. Breaks up a broadband channel into many parallel narrowband channels (subcarriers).

• MIMO-OFDM – A Combination of MIMO and OFDM. Appears like many parallel MIMO systems on orthogonal subcarriers.


MIMO-OFDM System

[Figure: OFDM transmitters 1 to N radiate into a rich scattering environment; OFDM demodulators 1 to N feed a MIMO decoder]

Each transmitter is an independent OFDM modulator.

The source symbols could be space-time block coded or just QAM modulated for spatial multiplexing.

Each receiver is an OFDM demodulator combined with a MIMO decoder to invert the channel on each subcarrier and extract the source symbols.


Spatial Multiplexing Receivers

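Zero Forcing receiver:

[Figure: 2x2 MIMO link. Tx Antenna 1 and Tx Antenna 2 reach Rx Antenna 1 and Rx Antenna 2 through channel gains h11, h12, h21, h22.]

$$y_1 = h_{11} x_1 + h_{12} x_2 + n_1$$
$$y_2 = h_{21} x_1 + h_{22} x_2 + n_2$$

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} n_1 \\ n_2 \end{bmatrix}$$

$$\begin{bmatrix} \hat{x}_1 \\ \hat{x}_2 \end{bmatrix} = W \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$$

$$\begin{bmatrix} \hat{x}_1 \\ \hat{x}_2 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{bmatrix}^{-1} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$$

For ZF receivers, $W = H^{-1}$.

Significant increase in noise when the channel is in a deep fade.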
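A minimal sketch of the 2x2 zero forcing receiver follows, using the closed-form inverse of a 2x2 matrix. The function names are hypothetical and the example is noise-free, so it only demonstrates that applying the inverse of H separates the two transmitted streams.

```python
def inv2x2(h):
    """Closed-form inverse of a 2x2 (complex) matrix."""
    (a, b), (c, d) = h
    det = a * d - b * c
    if det == 0:
        raise ValueError("channel matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def zf_detect(h, y):
    """Zero forcing estimate: x_hat = H^{-1} y."""
    return matvec(inv2x2(h), y)


# Example channel and transmitted symbols (arbitrary illustrative values)
h = [[1 + 1j, 0.5 + 0j], [0.2j, 1 - 0.5j]]
x = [1 + 1j, -1 + 1j]
y = matvec(h, x)        # noise-free received vector
x_hat = zf_detect(h, y)
```

When the determinant of H is small (a deep fade), the entries of the inverse grow large and any noise in y is amplified by the same factor, which is the noise enhancement noted above.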


Spatial Multiplexing Receivers

• MMSE MIMO Decoders:
– Cancel interference and minimize noise.
– Minimize the overall error (mean squared error), $E[(\hat{x} - x)^2]$.

$$W_{MMSE} = \sqrt{\frac{M}{E_s}} \left( H^H H + \frac{M}{SNR} I_M \right)^{-1} H^H$$


QRD

• One of the popular methods of matrix inversion is based on QRD: $H = QR$
• Q is unitary and R is upper triangular
• A unitary matrix has a trivial inverse, $Q^{-1} = Q^H$
• An upper triangular matrix can be inverted by back-substitution

$$H^{-1} = R^{-1} Q^H$$


Architectures for QRD

• There are many architectures to get the QR decomposition of any matrix.– Givens Rotations and its variations– Householder transformations, etc.

• A systolic structure makes implementation straightforward and scalable.

• Givens rotations based QRD has a nice and easy systolic structure.


Givens Rotations

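• For a 2x1 vector of real numbers:

$$\begin{bmatrix} c & s \\ -s & c \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} \sqrt{a^2 + b^2} \\ 0 \end{bmatrix}, \qquad c = \frac{a}{\sqrt{a^2 + b^2}}, \quad s = \frac{b}{\sqrt{a^2 + b^2}}$$

• For an NxM matrix, repeat the process 2 cells at a time (entries are updated by each rotation):

$$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \rightarrow \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ 0 & a_{32} & a_{33} \end{bmatrix} \rightarrow \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ 0 & a_{32} & a_{33} \end{bmatrix} \rightarrow \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ 0 & 0 & a_{33} \end{bmatrix}$$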
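The rotation above can be turned into a small QR decomposition routine. This is a software sketch for real-valued matrices only, with hypothetical function names; it zeroes the sub-diagonal entries column by column, which is one valid ordering and not necessarily the order used by the systolic array.

```python
import math


def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] @ [a, b] = [sqrt(a^2 + b^2), 0]."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0
    return a / r, b / r


def qr_givens(A):
    """QR decomposition of a real matrix (list of rows) by Givens rotations."""
    n_rows, n_cols = len(A), len(A[0])
    R = [list(map(float, row)) for row in A]
    # Qt accumulates the product of all rotations, i.e. Q transposed.
    Qt = [[1.0 if i == j else 0.0 for j in range(n_rows)] for i in range(n_rows)]
    for col in range(n_cols):
        for row in range(col + 1, n_rows):
            c, s = givens(R[col][col], R[row][col])
            for M in (R, Qt):
                top, bot = M[col], M[row]
                M[col] = [c * t + s * b for t, b in zip(top, bot)]
                M[row] = [-s * t + c * b for t, b in zip(top, bot)]
    Q = [list(column) for column in zip(*Qt)]
    return Q, R


A = [[12.0, -51.0, 4.0], [6.0, 167.0, -68.0], [-4.0, 24.0, -41.0]]
Q, R = qr_givens(A)
```

Each rotation touches only two rows, which is why the computation maps naturally onto an array of identical cells that pass (c, s) pairs along a row.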


Systolic Arrays

• Structured arrays with identical cells. Usually a “boundary” cell and an “internal” cell for the QRD process.

[Figure: triangular systolic array built from boundary cells and internal cells]

1. The boundary cell generates the rotations.
2. The internal cell applies the rotations to all the cells in the row.
3. The systolic array in this figure can handle any matrix up to 3x3.


Boundary and Internal Cell

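[Figure: boundary cell and internal cell datapaths. Signals: input a/x, outputs c/1 and s/w, stored value r, output z, a mode select, unit delays (z^-1) and a selectable negation (-ve).]

• The negation is applied in mode 0 (-ve) and not in mode 1 (+ve).
• This negative is needed since $W_{12} = -(W_{11} a_{12}) W_{22}$.
• This register needs to be initialized to 1, since in the first cycle the output needs to be +1.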


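Triangularization mode

• For QRD of up to a 3x3 matrix we need 3 boundary cells and 3 internal cells.
• Boundary cells calculate rotation vectors and internal cells store them.
• Data is fed column-wise into the systolic array.
• This may have to be staggered depending on the pipelining delays through the boundary cell and internal cell.

$$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \rightarrow \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \rightarrow \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ 0 & a_{32} & a_{33} \end{bmatrix} \rightarrow \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ 0 & 0 & a_{33} \end{bmatrix}$$

[Figure: the three columns (a11, a21, a31), (a12, a22, a32) and (a13, a23, a33) are fed into the top of the systolic array, one column per array column]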

The rotation factors for zeroing out cell A(2,1) are stored in cell A(1,2), etc.


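Back-substitution mode

• Computing $R^{-1}$ with back-substitution:

$$W_{ij} = \begin{cases} 0 & \text{if } i > j \\ r_{jj}^{-1} & \text{if } i = j \\ -\left( \sum_{m=i}^{j-1} W_{im} r_{mj} \right) r_{jj}^{-1} & \text{if } i < j \end{cases}$$

• The diagonal term $W_{ij} = r_{jj}^{-1}$ (for $i = j$) is already computed in the boundary cell and stored away. So just use it.

$$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ 0 & 0 & a_{33} \end{bmatrix} \begin{bmatrix} W_{11} & W_{12} & W_{13} \\ 0 & W_{22} & W_{23} \\ 0 & 0 & W_{33} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

$$W_{12} = -W_{11} a_{12} W_{22}$$
$$W_{13} = -(W_{11} a_{13} + W_{12} a_{23}) W_{33}$$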
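The recursion for the entries of the inverse of an upper triangular matrix can be written directly in code. The sketch below follows the three cases (below the diagonal, on the diagonal, above the diagonal); the function name `invert_upper_triangular` is hypothetical and the example uses real numbers, although the same recursion holds for complex entries.

```python
def invert_upper_triangular(R):
    """Invert an upper triangular matrix R (list of rows) by back-substitution."""
    n = len(R)
    W = [[0.0] * n for _ in range(n)]      # entries with i > j stay 0
    for j in range(n):
        W[j][j] = 1.0 / R[j][j]            # diagonal: W_jj = 1 / r_jj
        for i in range(j):
            # above the diagonal: W_ij = -(sum_{m=i}^{j-1} W_im r_mj) / r_jj
            acc = sum(W[i][m] * R[m][j] for m in range(i, j))
            W[i][j] = -acc * W[j][j]
    return W


R = [[2.0, 1.0, -3.0], [0.0, 4.0, 5.0], [0.0, 0.0, -1.0]]
W = invert_upper_triangular(R)
```

Only the reciprocals of the diagonal entries require a division; everything else is multiply-accumulate work, which suits the DSP slices.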


Q-matrix computation mode

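$$Q^H A = R$$
$$Q^H I = Q^H$$

$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & c_{32} & s_{32} \\ 0 & -s_{32} & c_{32} \end{bmatrix} \begin{bmatrix} c_{21} & s_{21} & 0 \\ -s_{21} & c_{21} & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} c_{31} & 0 & s_{31} \\ 0 & 1 & 0 \\ -s_{31} & 0 & c_{31} \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ 0 & 0 & a_{33} \end{bmatrix}$$

[Figure: the columns of the identity matrix, (0, 0, 1), (0, 1, 0) and (1, 0, 0), are fed into the systolic array holding the stored rotations; the outputs are labelled first, second and third column of the Q matrix. The cell update equations on this slide, which combine the input x, the identity input I and the stored c and s values (with conjugates), are not recoverable from the extracted text.]

Equivalently, for invertible A, $Q^H = R A^{-1}$.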


Scalability

• A 4x4 systolic array needs 4 boundary cells and 6 internal cells and can handle all matrices up to 4x4 (i.e. 1x1, 1x2, ..., 2x2, ..., 3x4, 4x4).
• But if your design is restricted to only a 2x2, you need only a 2x2 systolic array. With this you can handle 1x1, 1x2 and 2x2.

[Figure: nested systolic arrays showing the cells used for a 4x1 matrix and for 2x2, 3x3 and 4x4 matrices]
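The cell counts quoted for the 3x3 and 4x4 arrays fit the pattern of a triangular array with one boundary cell per diagonal position and one internal cell per position above the diagonal. Assuming that pattern holds for a general NxN array (an inference from the two cases given, not a statement from the slides), the counts are:

$$N_{boundary} = N, \qquad N_{internal} = \frac{N(N-1)}{2}, \qquad N_{total} = \frac{N(N+1)}{2}$$

For $N = 3$ this gives 3 boundary and 3 internal cells, and for $N = 4$ it gives 4 boundary and 6 internal cells, matching the figures above.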


FPGA Tools for DSP Systems Design

• Higher level tools are raising the level of abstraction.

• Allows non-hardware engineers (algorithm designers) to get a first look at hardware.

• System Generator
– Simulink to hardware
• C-to-Gates tools
– C or “higher” level languages to gates


Xilinx DSP Tools and Flows Accelerate DSP Design

[Figure: three design flows that all produce RTL for FPGA implementation with ISE]

• Graphical based flow: Simulink
• Mixed flow: MATLAB / Simulink
• Language based flow: C/C++ through ESL partners


System Generator: System Level Modeling & Simulation Framework

Work in the language of your problem

[Figure: System Generator design environment combining block diagrams with HDL and C]


DSP Development Flow

System Generator flow:
1. Develop algorithm & system model (Simulink MDL)
2. Automatic code generation (RTL VHDL & cores)
3. Xilinx implementation flow (bitstream)

Download to FPGA

HDL simulation flow: HDL test bench and test vectors

[Figure: flow from Simulink MDL through generated RTL and bitstream to the FPGA]


Hardware/Software Co-simulation

• HDL co-simulation
• Hardware co-simulation
• Encapsulates HDL semantics
• Simulink as verification framework

ADVANCED SYSTEMS TECHNOLOGY GROUP (ASTG)

FlexOFDM

• A configurable MIMO-OFDM technology demonstrator.
• Not specific to any standard, but can be configured (with some effort) to showcase technologies that are part of some of the wireless standards.
• Provides an architecture for the PHY and MAC layers, which can act as a starting point or springboard for product development.
• Investigate communication algorithms and architectures as they efficiently map to Xilinx FPGAs.

This is not a product/IP from Xilinx, but is available to partners, to speed up their MIMO-OFDM development efforts, on an AS IS basis.


Configurable MIMO-OFDM Transmitter

[Figure: System Generator (Simulink) model of the configurable MIMO-OFDM transmitter. Inputs DataIn, DataEnable; outputs RealOut1 to RealOut4, ImagOut1 to ImagOut4 and DataDone; internal data paths mostly Fix_16_10.]

Blocks in the model:
• Packet Controller
• Packetization and configurable STBC encoding
• Pilot insertion and data loading
• Time shared FFT across antennas
• Add Cyclic Extension/Block Shaping
• Spatial Demultiplexing and Interpolation
• Clock Generator (SampleClk, BaudClk)

Resource sharing (folding factor): a ratio of system clock rate to symbol rate greater than 8 is needed for a 4 transmit antenna system.


MIMO Receiver Architecture

[Figure: MIMO receiver block diagram for four receive paths (Rx 1 to Rx 4). Each path has Packet Detection, Strip CP, Input FIFO, FFT and Output FIFO blocks. Shared blocks: Combine PD, Block Boundary Detection, coarse CFO estimate, CFO estimator, pilot based CFO estimator, CFO compensator, Channel Estimator, MIMO Decoder Matrix (MMSE, etc.), MIMO Decoder FIFO, MIMO Decode, Soft Decisions and Packet Controller. Paths are labelled Preamble and Payload. Part of the design processes samples at the sample clock rate and part at the system clock rate.]


MIMO-OFDM Receiver

[Figure: System Generator (Simulink) top-level model of the MIMO-OFDM receiver. Inputs RealIn1 to RealIn4, ImagIn1 to ImagIn4 and Reset; outputs SoftDecReal1 to SoftDecReal4, SoftDecImag1 to SoftDecImag4, PacketDetect and ValidOut. The 4x4 channel estimates (Ch_tx1rx1 to Ch_tx4rx4) feed the weight matrix computation, whose outputs (wreal_i_j, wimag_i_j) feed the MIMO decoder.]

Blocks in the model:
• Packet Detection
• Fine Timing Acq
• Input Buffer
• Cyclic prefix removal
• Carrier Frequency Offset Correction
• FFT
• Channel Estimation
• Weight Matrix Computation
• MIMO Decoder
• Output FIFO
• Clock Generator (SampleClk, BaudClk)


Channel Estimation

[Figure: System Generator (Simulink) model of the channel estimator. Inputs Rxreal1 to Rxreal4, Rximag1 to Rximag4, ValidData, Addr, ChEstPilots and ReadAddr; outputs Chreal1 to Chreal16 and Chimag1 to Chimag16. Sixteen ChEst blocks (Tx1-Rx1 to Tx4-Rx4) each combine the received samples with stored training symbols; constant multipliers scale by 0.3535; data paths are Fix_16_10 and Fix_32_20.]

Blocks in the model:
• Channel estimation pilots for Tx1 to Tx4 (training symbol memories)
• 4x4 channel estimation memory
• Control signals
• Input FIFO


Packet Detection

Packet detection and coarse carrier frequency offset (CFO) estimation use the Schmidl and Cox algorithm.

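T. M. Schmidl, D. C. Cox, “Low Overhead Low Complexity Synchronization for OFDM”, ICC 1996, Vol 3, pp 1301-1306.

[Figure: packet detection block diagram. The received signal r(n) is multiplied by the conjugate of a copy delayed by D samples (z^-D). Block C produces the correlation c(n) and block P the power p(n). The metric m(n) is formed from |c(n)|^2 and (p(n))^2. The training preamble is one OFDM symbol with two identical halves.]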
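The metric in the figure can be sketched in floating-point Python as follows. This is a minimal sketch, not the slide's fixed-point implementation: the function name `schmidl_cox_metric`, the parameter `half_len` (the delay D, taken to be the length of one preamble half), and the choice to measure p(n) over the second half (as in the cited paper) are assumptions.

```python
import numpy as np

def schmidl_cox_metric(r, half_len):
    """Return (c, p, m) for every candidate preamble start n.

    c[n] = sum_k conj(r[n+k]) * r[n+k+L]   (correlation of the two halves)
    p[n] = sum_k |r[n+k+L]|^2              (energy of the second half)
    m[n] = |c[n]|^2 / p[n]^2               (timing metric)
    with k = 0..L-1 and L = half_len (an assumed name for the delay D).
    """
    r = np.asarray(r, dtype=complex)
    L = half_len
    if len(r) < 2 * L:
        raise ValueError("need at least 2 * half_len samples")
    prod = np.conj(r[:-L]) * r[L:]       # sample times conjugate of delayed sample
    power = np.abs(r[L:]) ** 2
    window = np.ones(L)                  # moving sum over one half
    c = np.convolve(prod, window, mode="valid")
    p = np.convolve(power, window, mode="valid")
    m = np.abs(c) ** 2 / np.maximum(p, 1e-12) ** 2   # guard against divide-by-zero
    return c, p, m
```

A packet is typically declared when m(n) stays above a threshold; the angle of c(n) at that point feeds the coarse CFO estimate. In hardware the moving sums are usually computed recursively (add the newest product, subtract the oldest) rather than by convolution.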


Pre-FFT Carrier Frequency Offset Estimation

[Figure: System Generator model of the pre-FFT CFO estimator (CFO_Est1). Inputs: RealIn1, ImagIn1, RealIn2, ImagIn2, Baud_clk, Rst, BBD. A Packet Detection block (inputs RealIn1, ImagIn1, RealIn2, ImagIn2, BaudClk, Rst; outputs CorrMetric_real, CorrMetric_imag, AvePwr) feeds a Truncate block, a Convert (cast) block, and a CORDIC ATAN block (inputs x, y; outputs mag, atan; latency z^-17). Supporting blocks: rising edge detector, Register1, enabled delays (z^-14, z^-24), and constant multiplier CMult8 (x 0.003906).]

The angle of the correlation metric is proportional to the carrier frequency offset.

Right-size the number of bits (the Truncate block) before the CORDIC operation.

The CORDIC ATAN block from the Xilinx Math library calculates the angle.
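To illustrate what a CORDIC arctangent computes, here is a minimal floating-point sketch of the vectoring-mode iteration. It is not the Xilinx block: the function name `cordic_atan`, the 16-iteration default, and the quadrant pre-rotation are assumptions made for the example. In hardware, each multiply by 2^-i is a right shift and the arctangent values come from a small table.

```python
import math

def cordic_atan(x, y, iterations=16):
    """Vectoring-mode CORDIC: return (magnitude, angle in radians) of (x, y)."""
    angle = 0.0
    # Pre-rotate by +/-90 degrees so the vector starts in the right half-plane,
    # where the iteration converges.
    if x < 0:
        if y >= 0:
            x, y, angle = y, -x, math.pi / 2
        else:
            x, y, angle = -y, x, -math.pi / 2
    gain = 1.0
    for i in range(iterations):
        step = 2.0 ** -i                          # a right shift by i bits in hardware
        if y >= 0:
            x, y = x + y * step, y - x * step     # rotate clockwise, driving y to 0
            angle += math.atan(step)
        else:
            x, y = x - y * step, y + x * step     # rotate counter-clockwise
            angle -= math.atan(step)
        gain *= math.sqrt(1.0 + step * step)      # accumulated CORDIC gain
    return x / gain, angle
```

Each iteration adds roughly one bit of angle precision, which is why the input word length matters: extra input bits cost adders and latency in every stage without improving the estimate.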

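[Equation: estimated carrier frequency offset as a function of the correlation angle; the expression is not recoverable from the extracted text.]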
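The proportionality between the correlation angle and the frequency offset can be sketched as follows. This is an illustration under stated assumptions, not the slide's formula: with two identical preamble halves of L samples each and sample rate fs, an offset of df Hz rotates the second half by phi = 2*pi*df*L/fs relative to the first, so df = phi*fs/(2*pi*L). The names `estimate_cfo`, `half_len`, and `sample_rate` are hypothetical.

```python
import numpy as np

def estimate_cfo(r, half_len, sample_rate):
    """Coarse CFO estimate in Hz from a two-identical-halves preamble starting at r[0]."""
    r = np.asarray(r, dtype=complex)
    L = half_len
    # Correlate the first half against the second half.
    c = np.sum(np.conj(r[:L]) * r[L:2 * L])
    phi = np.angle(c)                    # radians, in (-pi, pi]
    # phi = 2*pi*df*L/fs, so solve for df.
    return phi * sample_rate / (2 * np.pi * L)
```

Because the angle wraps at plus or minus pi, this estimate is unambiguous only for offsets smaller in magnitude than fs/(2L); larger offsets need a separate integer-offset search.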


Carrier Frequency Offset Correction

[Figure: System Generator model of carrier frequency offset correction. Inputs: RealIn1..4, ImagIn1..4, CFO_Est, FFT_Start, CFO_Est_valid, Reset. Outputs: RealOut1..4, ImagOut1..4. A DDS block (ports freq_off, Enable, Reset, cos_out, sin_out) drives four Complex Multiply blocks (ports RealIn1, ImagIn1, RealIn2, ImagIn2, BaudClk, RealOut, ImagOut). Supporting logic: counter, two relational blocks (a<b, a<=b), logical AND and OR, rising edge detector, negate, unit delays, constants (0, 1, 78), and constant multiplier CMult (x 0.01563). Data signals are mostly Fix_16_10.]

Direct digital synthesizer (DDS) from the Xilinx DSP SysGen library.
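The correction path in the figure (a DDS producing cosine and sine, followed by a complex multiply per receive stream) can be sketched in floating point as follows. This is a minimal sketch, not the SysGen design: the function name `correct_cfo` and its parameters are assumptions, and the DDS is modeled as an ideal oscillator rather than a phase accumulator with a lookup table.

```python
import numpy as np

def correct_cfo(r, cfo_hz, sample_rate, start_phase=0.0):
    """Remove a frequency offset of cfo_hz from r (samples along the last axis)."""
    r = np.asarray(r, dtype=complex)
    n = np.arange(r.shape[-1])
    phase = start_phase + 2 * np.pi * cfo_hz * n / sample_rate
    # Model of the DDS outputs: cos_out and sin_out combined into the
    # conjugate tone exp(-j*phase).
    tone = np.cos(phase) - 1j * np.sin(phase)
    # One complex multiply per sample; broadcasting applies the same tone
    # to every receive stream when r is 2-D (streams x samples).
    return r * tone
```

Passing a 2-D array with one row per receive antenna mirrors the four parallel complex multipliers, all driven by the same oscillator.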


Design methodology issues

• FPGA tools
– Where to from here?
• C-to-gates
– Higher-level design languages to gates
– Raising the level of abstraction


‘C’ or higher level language to Gates

• The design community is interested in higher-level design methodologies, such as C-to-gates.

• Electronic system level (ESL) tools and design methodologies are being explored.

• However, extracting all the concurrency from a sequential description is a hard problem.


C to Gates evaluation flow


Source: BDTI. For more info and results see www.BDTI.com.


C to Gates evaluation by BDTI


Source: BDTI. For more info and results see www.BDTI.com.


Conclusion

• FPGAs are finding wide use in infrastructure communication systems and signal processing systems.

• FPGAs are an efficient choice for exploring VLSI architectures.

• FPGA tools are raising the level of abstraction so that algorithm designers can explore hardware architectures without learning hardware design tools and languages.


Questions?
