FPGAs for Signal Processing and Communication Systems
Raghu Rao, Wireless and Signal Processing Group, Xilinx Inc., 05/14/2010
R. M. Rao, 2008
Agenda
• Overview of FPGAs
  – Building DSP sub-systems on FPGAs
  – Digital baseband
• The Platform FPGA
• Communication systems and DSP on FPGAs
• Architectural tradeoffs for FPGAs
  – The Matrix inversion problem
• FPGA tools and design methodology
What are FPGAs?
• An array of configurable logic blocks with configurable interconnects between them.
• Each logic block can implement any 6-input combinatorial function.
• Logic blocks can be connected to generate larger circuits.
• Additional DSP specific resources (multiply accumulate units).
Virtex-4/5 FPGA Architecture: High-Level View
• FPGA family with 3 members tailored for specific classes of processing
– SX: DSP
– LX: Logic centric
– FX: Full featured
• Embedded PowerPC hard IP
• Giga-bit serial connectivity
• DSP processing tiles “DSP48”
Virtex-5 FPGA Platform
• 2 slices per CLB, 4 LUTs per slice
• LUTs can be configured as a shift register
• LUTs can be configured as distributed memory (RAM)
Virtex-5 DSP48E: Full Custom Design Enabling Efficient DSP
[Figure: DSP48E slice block diagram. Inputs A (25-bit), B (18-bit) and C (48-bit), with cascade ports ACIN/BCIN, ACOUT/BCOUT and PCIN/PCOUT; optional pipeline registers and routing logic around the multiplier; output P (48-bit), optional P (96-bit).]
• New 25x18 input increases precision and efficiency
• Pattern detect circuitry increases functionality
• New second stage enables SIMD and bitwise logic operations
• Cascade routing enables scalable performance
• Pipeline registers enable 550 MHz performance
• Wider internal data-path and 96-bit accumulated output enable higher precision
Dynamically Reconfigurable DSP OPMODEs
OPMODE bits: Z = bits 6:4, Y = bits 3:2, X = bits 1:0

Mode                                   Z    Y   X   Output
Zero                                   000  00  00  +/- Cin
Hold P                                 000  00  10  +/- (P + Cin)
A:B Select                             000  00  11  +/- (A:B + Cin)
Multiply                               000  01  01  +/- (A * B + Cin)
C Select                               000  11  00  +/- (C + Cin)
Feedback Add                           000  11  10  +/- (C + P + Cin)
36-Bit Adder                           000  11  11  +/- (A:B + C + Cin)
P Cascade Select                       001  00  00  PCIN +/- Cin
P Cascade Feedback Add                 001  00  10  PCIN +/- (P + Cin)
P Cascade Add                          001  00  11  PCIN +/- (A:B + Cin)
P Cascade Multiply Add                 001  01  01  PCIN +/- (A * B + Cin)
P Cascade Add                          001  11  00  PCIN +/- (C + Cin)
P Cascade Feedback Add Add             001  11  10  PCIN +/- (C + P + Cin)
P Cascade Add Add                      001  11  11  PCIN +/- (A:B + C + Cin)
Hold P                                 010  00  00  P +/- Cin
Double Feedback Add                    010  00  10  P +/- (P + Cin)
Feedback Add                           010  00  11  P +/- (A:B + Cin)
Multiply-Accumulate                    010  01  01  P +/- (A * B + Cin)
Feedback Add                           010  11  00  P +/- (C + Cin)
Double Feedback Add                    010  11  10  P +/- (C + P + Cin)
Feedback Add Add                       010  11  11  P +/- (A:B + C + Cin)
C Select                               011  00  00  C +/- Cin
Feedback Add                           011  00  10  C +/- (P + Cin)
36-Bit Adder                           011  00  11  C +/- (A:B + Cin)
Multiply-Add                           011  01  01  C +/- (A * B + Cin)
17-Bit Shift P Cascade Select          101  00  00  Shift(PCIN) +/- Cin
17-Bit Shift P Cascade Feedback Add    101  00  10  Shift(PCIN) +/- (P + Cin)
17-Bit Shift P Cascade Add             101  00  11  Shift(PCIN) +/- (A:B + Cin)
17-Bit Shift P Cascade Multiply Add    101  01  01  Shift(PCIN) +/- (A * B + Cin)
17-Bit Shift P Cascade Add             101  11  00  Shift(PCIN) +/- (C + Cin)
17-Bit Shift P Cascade Add Add         101  11  11  Shift(PCIN) +/- (A:B + C + Cin)
17-Bit Shift Feedback                  110  00  00  Shift(P) +/- Cin
17-Bit Shift Feedback Feedback Add     110  00  10  Shift(P) +/- (P + Cin)
17-Bit Shift Feedback Add              110  00  11  Shift(P) +/- (A:B + Cin)
17-Bit Shift Feedback Multiply Add     110  01  01  Shift(P) +/- (A * B + Cin)
17-Bit Shift Feedback Add              110  11  00  Shift(P) +/- (C + Cin)

• Over 40 different modes
• Each XtremeDSP slice individually controllable
• Change operation in a single clock cycle
• Enables resource sharing for maximum utilization
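The OPMODE table can be read as "output = Z +/- (X + Y + Cin)", with three multiplexers selected by the OPMODE bit fields. The following is a minimal behavioral sketch of that reading, not a model of the actual silicon: the function name `dsp48_output`, the 18-bit packing of A:B, and the absence of 48-bit wraparound and pipeline registers are all assumptions made for illustration.

```python
def dsp48_output(opmode, a=0, b=0, c=0, p=0, pcin=0, cin=0, subtract=False):
    """Behavioral sketch of the OPMODE table: output = Z +/- (X + Y + Cin)."""
    x_sel = opmode & 0b11           # OPMODE bits 1:0
    y_sel = (opmode >> 2) & 0b11    # OPMODE bits 3:2
    z_sel = (opmode >> 4) & 0b111   # OPMODE bits 6:4

    if x_sel == 0b01 and y_sel == 0b01:
        # X and Y together carry the multiplier output.
        xy = a * b
    else:
        # Assumed packing of A:B: A in the upper bits, 18-bit B in the lower bits.
        x_choices = {0b00: 0, 0b10: p, 0b11: (a << 18) | (b & 0x3FFFF)}
        y_choices = {0b00: 0, 0b11: c}
        if x_sel not in x_choices or y_sel not in y_choices:
            raise ValueError("OPMODE combination not listed in the table")
        xy = x_choices[x_sel] + y_choices[y_sel]

    # Shift(...) in the table is a 17-bit right shift.
    z_choices = {0b000: 0, 0b001: pcin, 0b010: p, 0b011: c,
                 0b101: pcin >> 17, 0b110: p >> 17}
    if z_sel not in z_choices:
        raise ValueError("OPMODE combination not listed in the table")
    z = z_choices[z_sel]

    operand = xy + cin
    return z - operand if subtract else z + operand
```

For example, `dsp48_output(0b0100101, a=3, b=4, p=10)` selects the Multiply-Accumulate row (P + A * B + Cin), and changing only the `opmode` argument switches the same datapath to a different row of the table.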
Reconfigurability
Waveform identification module
[Figure: FPGA floorplan with regions labeled Waveform 1 and Waveform 2.]
Waveform 3 can be reconfigured into this region of the FPGA.
Waveform 2 can be “reloaded” into its region when Waveform identification module detects waveform 2 being received.
Virtex-6 resources
GMACs Performance DSP48 slices
Processing capabilities of FPGAs
BDTI Certified(tm) Results (c) 2008 BDTI. For more info and results see www.BDTI.com.
[Figure: DSP48 slice datapath. 18-bit A and B inputs (B also through the BCIN/BCOUT cascade) feed a pipelined multiplier with 3-delay latency (z^-3); the 36b product is sign extended to 48b; X, Y and Z multiplexers select among the product, C (48-bit), P feedback, PCIN, ZERO and a wired shift right by 17b; the adder/subtracter (CIN, SUB) drives the 48-bit P register and PCOUT to the adjacent DSP48 tile.]
Pipelined Complex 18x18 MPY
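[Figure: four cascaded DSP slices S1 to S4 (sn = Slice n) take the 18-bit inputs Ar, Ai, Br and Bi; each 36-bit product is sign extended and registered, and the 48-bit cascade combines the products, with one subtraction, to produce Pr and Pi.]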
Wide Filters At Full Speed Within the Virtex-4 DSP Slice Column
• Systolic N-tap FIR
  – Scalable N-levels deep implementation
  – N-levels deep at 500 MHz performance
• Uses Integrated Pipeline Registers to Synchronize Filter Inputs
• Utilizes Input and Output Cascade Routing
• Build Massively Parallel 512-TAP FIR Filter In a Single Device Achieving 256 GMACCs/s Performance
• Equivalent Implementation Would Consume 444 Embedded Multipliers and 77,008 LCs And Would Only Achieve ½ The Performance
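The systolic FIR structure described above can be sketched as a cycle-by-cycle simulation. This is an illustrative model under stated assumptions, not the Xilinx implementation: each tap is assumed to have two input registers on the data cascade and one register on the partial-sum cascade, and the function name `systolic_fir` is hypothetical.

```python
def systolic_fir(samples, coeffs):
    """Cycle-by-cycle sketch of an N-tap systolic FIR built from cascaded MAC taps."""
    n_taps = len(coeffs)
    d1 = [0] * n_taps    # first input register of each tap
    d2 = [0] * n_taps    # second input register of each tap (feeds the next tap)
    acc = [0] * n_taps   # partial-sum register of each tap (output cascade)
    outputs = []
    # Append zeros so the pipeline is flushed after the last sample.
    for x in list(samples) + [0] * n_taps:
        outputs.append(acc[-1])       # value at the end of the sum cascade
        u = [x] + d2[:-1]             # data seen by each tap's multiplier
        s_in = [0] + acc[:-1]         # partial sum arriving from the previous tap
        acc = [s_in[k] + coeffs[k] * u[k] for k in range(n_taps)]
        d2 = d1
        d1 = u
    # The first n_taps outputs are pipeline latency, not filter results.
    return outputs[n_taps:]
```

With this register arrangement every tap does one multiply-accumulate per clock and only talks to its neighbor, which is why the structure scales to many taps without a wide adder tree; the cost is a latency of N clock cycles.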
Xilinx FFT IP (4)
• FFT fully utilizes FPGA arithmetic hardware resources
• FFT viewed as a recursion using a butterfly kernel
Phase factors: $e^{-j2\pi k/N}$
[Figure: butterfly kernel built from two complex adders (CADD1, CADD2) and a complex multiplier (CMPY).]
• CADD{1|2}: complex adder
• CMPY: complex multiplier
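The recursion-with-butterfly view of the FFT can be sketched as follows. This is a textbook radix-2 decimation-in-time recursion written for illustration, not the architecture of the Xilinx FFT IP; the function names `butterfly` and `fft` are hypothetical, and the input length is assumed to be a power of two.

```python
import cmath

def butterfly(a, b, w):
    """One butterfly: a complex multiply (CMPY) followed by two complex adds (CADD1, CADD2)."""
    t = w * b
    return a + t, a - t

def fft(x):
    """Recursive radix-2 FFT; len(x) is assumed to be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # phase factor e^{-j 2 pi k / N}
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], w)
    return out
```

Each level of the recursion applies N/2 butterflies, so a hardware design can trade area for throughput by choosing how many of them run in parallel.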
Virtex-4 DSP Slice
• DSP slice key for implementing high-performance arithmetic
• Embedded 18x18 MPY and 48b adder
  – Butterfly phase rotator
  – Cross-addition
Butterfly CMPLX MPY
• Complex MPY used in FFT butterfly
• Optimized to employ Virtex-4 DSP Slice
  – 4 and 3 MPY option
• Complex MPY available as IP module†
[Figure: inputs Ar, Br, Ai and Bi feed DSP Slices 1 to 4, which produce Pr and Pi.]
$P_r + jP_i = (A_r + jA_i) \times (B_r + jB_i)$
† Available: 6.2i IP Update 2
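The 4-multiplier and 3-multiplier options can be compared with a short sketch. The 3-multiplier form below is the standard algebraic rearrangement that trades one multiplication for extra additions; whether the IP module uses exactly this rearrangement is not stated in the slides, and both function names are hypothetical.

```python
def cmpy_4mult(ar, ai, br, bi):
    """Complex product using four real multiplications."""
    pr = ar * br - ai * bi
    pi = ar * bi + ai * br
    return pr, pi

def cmpy_3mult(ar, ai, br, bi):
    """Complex product using three real multiplications and extra additions."""
    k1 = br * (ar + ai)
    k2 = ar * (bi - br)
    k3 = ai * (br + bi)
    return k1 - k3, k1 + k2
```

The 3-multiplier form saves one DSP slice per complex multiply at the cost of pre-adders, and the pre-additions grow the operand width by one bit, which matters when inputs already fill the 18-bit multiplier ports.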
Performance/Parallelism/Area
• FPGA: highly parallel computing machine
• Achieve performance using functional unit parallelism
• Area/throughput tradeoff delivered via Xilinx IP library
• Butterfly array to produce high-performance FFT processor
• High computation rate using (possibly) hundreds of DSP slices
  – Allocate resources as appropriate to meet system requirements
• Large memory bandwidth using multi-port memory constructed from BRAMs
Mem read BW: 320 x 36 x 500e6 = 5.76 Tera-bps
FFT Architecture
• For a small number of carriers and modest data rates, a single-butterfly (I)FFT is probably suitable
  – Small FPGA footprint
[Figure: single-butterfly iteration engine. Input data and output data pass through switches to two data RAMs (DataRam 0, DataRam 1), with a phase factor ROM supplying the iteration engine.]
Block boundary detection/Fine timing acquisition
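[Figure: the incoming samples and a known sequence are held in two tapped delay lines (Z^-1) and correlated. The magnitude squared (|·|^2) of the result gives the timing estimate; the argument (arg) of a conjugate product (()*), after averaging (ave), gives the frequency estimate. The preamble is 1 OFDM block of repeated data, and the delay between the repeated parts is half an OFDM block.]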
F. Tufvesson, O. Edfors, M. Faulkner, “Time and Frequency Synchronization for OFDM using PN-Sequence Preambles”, VTC-1999/Fall, vol 4, pp.2203-7, New Jersey, 1999.
Fine-timing acquisition using a clipped correlator
[Figure: System Generator model of the clipped correlator. An FSM clocked by BaudClk generates the data address, coefficient address and load signals for a bank of MAC-based correlators (C1 to C7 in the view shown); each correlator contains a coefficient ROM, an add/subtract unit and a register, and a chain of delays and adders combines the correlator outputs.]
• Bank of correlators, each a 1-bit correlator
• 10 time-multiplexed correlators
• Each 1-bit correlator: 10 slices
• Total for clipped correlator: 589 slices
• Full-precision correlators: 32 embedded multipliers, 896 flip-flops
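The idea behind a clipped (1-bit) correlator can be sketched in a few lines: both the received samples and the known sequence are reduced to their signs, so every multiplication becomes a sign comparison that maps to a few LUTs instead of an embedded multiplier. The code below is an illustrative sketch only; the function names, the use of a Barker-13 sequence as the known sequence, and the treatment of zero as positive are assumptions, not details taken from the slides.

```python
def sign(value):
    """Clip a sample to one bit: +1 for non-negative values, -1 otherwise."""
    return 1 if value >= 0 else -1

def clipped_correlate(samples, reference):
    """Slide the 1-bit reference over the 1-bit samples and count sign agreements."""
    ref_bits = [sign(r) for r in reference]
    bits = [sign(s) for s in samples]
    n = len(ref_bits)
    return [sum(bits[i + k] * ref_bits[k] for k in range(n))
            for i in range(len(bits) - n + 1)]

# Hypothetical known sequence (Barker-13) embedded at offset 7 in the received samples.
barker13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]
received = [-0.5] * 7 + [0.8 * b for b in barker13] + [-0.5] * 7
correlation = clipped_correlate(received, barker13)
timing_estimate = correlation.index(max(correlation))
```

The correlation peak marks the block boundary. Clipping discards amplitude information, so the peak is less sharp in noise than with full-precision correlation, which is the tradeoff against the much lower resource count quoted above.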
The Platform
Connectivity
• Serial Gigabit: OBSAI/CPRI
• Proprietary serial backplane
• Inter-chip connectivity
Embedded Software
• MAC (Media Access)
• Decision oriented tasks
• CORBA, RTOS, NBAP, SCA (JTRS radios)
Logic & IO
• OBSAI/CPRI, SRIO, AD/DA interface, EMIF
High Performance Processing
• High MIPs tasks: Radio PHY
• Supported by embedded DSP tiles, distributed memory, block memory and logic fabric
• DUC, DDC, CFR, DPD, RACH, Searcher, OFDM PHY, TCC, MIMO
[Figure labels: DAC, ADC, SRIO, EMIF]
Digital Receiver Architecture: Abstracted Architecture
• Common model of abstraction for digital receiver is inner/outer receiver
Receiver Abstraction
Digital IF Processing
  – Up-Conversion
  – Down-Conversion
  – Channelizer
  – Fast AGC
Inner Receiver
  – Frequency Offset Estimation/Correction
  – Sample Clock Offset Correction
  – Channel Estimation/Equalization
  – Frame detection
  – AGC
  – Successive Interference Cancellation
  – Space-Time-Coding
  – IFFT/FFT
  – Per sub-carrier processing: Beamforming, QRD-RLS
Outer Receiver
  – Channel Coding: LDPC, TPC, CTC, Viterbi, (De-) Interleave
Control, Protocol and Link Layer processing
  – Medium Access Control (MAC)
  – Link Layer Processing
  – System Initialization, Control and Monitoring
  – Application
  – Ethernet, PCI Express, SRIO
  – CPRI, OBSAI
Receiver Abstraction and Projection on to Platform FPGA
Receiver function: Digital IF Processing
  Characteristics: MAC intensive
  FPGA platform: SX
  Comments: DSP48 main requirement
Receiver function: Inner Receiver
  Characteristics: MAC intensive; some functions LUT intensive (CORDIC in QRD-RLS); FFT processing for OFDM; correlation processing for timing; per-carrier complexity processing (MIMO-OFDM; Num. sub-carriers, N_TX, N_RX)
  FPGA platform: SX/LX
  Comments: DSP48 leveraged FFT; FPGA fabric for CORDIC FFT
Receiver function: Outer Receiver
  Characteristics: symbol rate tasks; channel coding
  FPGA platform: LX
  Comments: ACS/ACSO dominated by low bit precision add/multiplexors; good match for fabric; lots of memory required
Receiver function: Control/Protocol
  Characteristics: Gigabit connectivity; Linux OS; "heavy" tasks; TCP/IP
  FPGA platform: FX
  Comments: embedded PPC used; Rocket IO for PCI Express, SRIO
FPGA product portfolio (SX, LX, FX) tailored for various processing tasks in the communications receiver
Digital Frontend
• Digital upconversion (downconversion)
• Crest factor reduction
• Digital pre-distortion
Wired Communications
• Flexible serial transceivers support multi-rate applications.
• GTX transceivers run at 150 Mbps to 6.5 Gbps with 25% lower power consumption.
• GTH transceivers support line rates beyond 11 Gbps to enable 40G and 100G protocols and more.
Orthogonal Frequency Division Multiplexing (OFDM)
[Figure: magnitude versus frequency.]
OFDM divides a frequency selective channel into a number of flat fading channels
OFDM Modulation
(a) Transmitter: S/P, QAM mapping, IFFT (frequency domain to time domain), cyclic prefix, P/S, D/A and RF
(b) Receiver: RF and A/D, strip cyclic prefix, S/P, FFT (time domain to frequency domain), FEQ, P/S, QAM decoding
• A QAM symbol is modulated onto each subcarrier
• IFFT/FFT are used for efficient modulation and demodulation
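The transmit and receive chains above can be sketched end to end over an ideal channel. The sketch below is illustrative only: it uses a direct O(N^2) DFT instead of an FFT to stay dependency-free, assumes QPSK as the QAM constellation, omits the frequency-domain equalizer (FEQ) because the channel is ideal, and all function names and the bit-to-symbol mapping are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Direct DFT/IDFT, standing in for the FFT/IFFT blocks."""
    n = len(x)
    sgn = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sgn * 2j * cmath.pi * k * m / n) for m in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

# Assumed Gray-coded QPSK mapping: two bits per subcarrier.
QPSK = {(0, 0): 1 + 1j, (0, 1): -1 + 1j, (1, 1): -1 - 1j, (1, 0): 1 - 1j}

def ofdm_modulate(bits, n_sub, cp_len):
    """Map bits to QPSK, take the IDFT, and prepend a cyclic prefix."""
    symbols = [QPSK[(bits[2 * i], bits[2 * i + 1])] for i in range(n_sub)]
    time_block = dft(symbols, inverse=True)
    return time_block[-cp_len:] + time_block

def ofdm_demodulate(rx, n_sub, cp_len):
    """Strip the cyclic prefix, take the DFT, and slice to the nearest QPSK point."""
    freq = dft(rx[cp_len:cp_len + n_sub])
    bits = []
    for s in freq:
        bits += list(min(QPSK, key=lambda b: abs(QPSK[b] - s)))
    return bits
```

For instance, `ofdm_demodulate(ofdm_modulate(bits, 8, 2), 8, 2)` returns the original 16 bits when nothing is added between transmitter and receiver. With a multipath channel shorter than the cyclic prefix, each subcarrier would see a single complex gain, which is what the FEQ block corrects.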
MIMO Systems
[Figure: M transmit antennas (Tx Antenna 1 to Tx Antenna M) and N receive antennas (Rx Antenna 1 to Rx Antenna N) connected through the channel H.]
• MIMO systems: multiple antennas at the transmitter and receiver.
• 3 types of MIMO systems:
  – STBC MIMO systems: diversity gain.
  – Spatial Multiplexing MIMO systems: capacity/throughput gain.
  – Feedback MIMO systems: higher performance through interference reduction.
• MISO (multiple input single output) systems:
  – STBC can be used with just 1 receive antenna.
  – Provides diversity gain.
  – To achieve array gain, need knowledge of channel at the transmitter (feedback).
Spatial Multiplexing
• A spatial multiplexing MIMO system transmits different data symbols from each transmitter.
• The signals from each transmitter combine over the air and are received by multiple receive antennas.
• SM systems have a rate=M (num transmit antennas). The diversity order depends on the type of encoding and receiver (uncoded SM with ML decoding has diversity order=N (num receive antennas)).
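[Figure: three modulators transmit x(t), y(t) and z(t), carrying the symbols x(n), y(n) and z(n); the MIMO receiver recovers x(n), y(n) and z(n) from the received signals.]
$r_1(t) = a_{11}x(t) + a_{12}y(t) + a_{13}z(t)$
$\vdots$
$r_3(t) = a_{31}x(t) + a_{32}y(t) + a_{33}z(t)$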
MIMO and OFDM
• MIMO – Multiple Input Multiple Output Communication System. Employs multiple antennas at both transmitter and receiver.
• OFDM – Orthogonal Frequency Division Multiplexing. Breaks up a broadband channel into many parallel narrowband channels (subcarriers).
• MIMO-OFDM – A Combination of MIMO and OFDM. Appears like many parallel MIMO systems on orthogonal subcarriers.
MIMO-OFDM System
[Figure: OFDM transmitters 1 to N send through a rich scattering environment to OFDM demodulators 1 to N, which feed a MIMO decoder.]
Each transmitter is an independent OFDM modulator.
The source symbols could be space-time block coded or just QAM modulated for spatial multiplexing.
Each receiver is an OFDM demodulator combined with a MIMO decoder to invert the channel on each subcarrier and extract the source symbols.
Spatial Multiplexing Receivers
Zero Forcing receiver:
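[Figure: 2x2 channel between Tx Antennas 1 and 2 and Rx Antennas 1 and 2, with path gains $h_{11}$, $h_{12}$, $h_{21}$, $h_{22}$.]
$y_1 = h_{11}x_1 + h_{12}x_2 + n_1$
$y_2 = h_{21}x_1 + h_{22}x_2 + n_2$
$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} n_1 \\ n_2 \end{bmatrix}$
$\begin{bmatrix} \hat{x}_1 \\ \hat{x}_2 \end{bmatrix} = W \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{bmatrix}^{-1} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$
For ZF receivers, $W = H^{-1}$.
Significant increase in noise when the channel is in a deep fade.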
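A zero-forcing receiver for the 2x2 case can be sketched directly from the relation W = H^{-1}. The code below is a minimal illustration with hypothetical function names; it uses the closed-form 2x2 inverse and does not model noise.

```python
def inv2x2(h):
    """Closed-form inverse of a 2x2 (possibly complex) matrix."""
    (a, b), (c, d) = h
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def zf_detect(h, y):
    """Zero-forcing estimate x_hat = W y with W = H^{-1}."""
    w = inv2x2(h)
    return [w[0][0] * y[0] + w[0][1] * y[1],
            w[1][0] * y[0] + w[1][1] * y[1]]
```

When the channel is in a deep fade the determinant `det` becomes small, so the entries of W become large and any noise in y is amplified by the same factor, which is the noise enhancement noted on the slide.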
Spatial Multiplexing Receivers
• MMSE MIMO Decoders:
  – Cancels interference and minimizes noise.
  – Minimizes the overall error (mean squared error): $E[(\hat{x} - x)^2]$
$W_{\mathrm{MMSE}} = \sqrt{\frac{M}{E_s}} \left( H^H H + \frac{M}{\mathrm{SNR}} I_M \right)^{-1} H^H$
QRD
• One of the popular methods of matrix inversion is based on QRD.
$H = QR$
• Q is unitary and R is upper triangular
• A unitary matrix has a trivial inverse, $Q^{-1} = Q^H$
• An upper triangular matrix can be inverted by back-substitution
$H^{-1} = R^{-1} Q^H$
Architectures for QRD
• There are many architectures to get the QR decomposition of any matrix.
  – Givens Rotations and its variations
  – Householder transformations, etc.
• A systolic structure makes implementation straightforward and scalable.
• Givens rotations based QRD has a nice and easy systolic structure.
Givens Rotations
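• For a 2x1 vector of real numbers:
$\begin{bmatrix} c & s \\ -s & c \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} \sqrt{a^2 + b^2} \\ 0 \end{bmatrix}, \quad c = \frac{a}{\sqrt{a^2 + b^2}}, \quad s = \frac{b}{\sqrt{a^2 + b^2}}$
• For an NxM matrix, repeat the process 2 cells at a time:
$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \rightarrow \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ 0 & a_{32} & a_{33} \end{bmatrix} \rightarrow \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ 0 & a_{32} & a_{33} \end{bmatrix} \rightarrow \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ 0 & 0 & a_{33} \end{bmatrix}$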
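The repeated two-cell rotation can be sketched in software for a real-valued matrix. This is a plain sequential illustration, not the systolic hardware; the function name `givens_qr` and the column-by-column zeroing order are assumptions, and complex matrices would need a complex-valued rotation.

```python
import math

def givens_qr(a):
    """QR decomposition of a real matrix by Givens rotations; returns (Q, R)."""
    n_rows, n_cols = len(a), len(a[0])
    r = [row[:] for row in a]
    # qt accumulates the product of all rotations, which equals Q transposed.
    qt = [[1.0 if i == j else 0.0 for j in range(n_rows)] for i in range(n_rows)]
    for j in range(n_cols):
        for i in range(j + 1, n_rows):
            x, y = r[j][j], r[i][j]
            norm = math.hypot(x, y)
            if norm == 0.0:
                continue                      # nothing to zero out
            c, s = x / norm, y / norm         # rotation that zeros r[i][j]
            for m in (r, qt):
                row_j, row_i = m[j][:], m[i][:]
                m[j] = [c * u + s * v for u, v in zip(row_j, row_i)]
                m[i] = [-s * u + c * v for u, v in zip(row_j, row_i)]
    q = [[qt[j][i] for j in range(n_rows)] for i in range(n_rows)]
    return q, r
```

Each rotation touches only two rows, which is what lets the hardware version assign the computation of (c, s) to one cell and the row updates to its neighbors.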
Systolic Arrays
• Structured arrays with identical cells. Usually a “boundary” cell and an “internal” cell for the QRD process.
[Figure: triangular systolic array of boundary cells and internal cells.]
1. The boundary cell generates the rotations.
2. The internal cell applies the rotations to all the cells in the row.
3. The systolic array in this figure can handle any matrix up to 3x3.
Boundary and Internal Cell
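[Figure: datapaths of the boundary cell and the internal cell. Both have a mode input, unit delays ($Z^{-1}$) and a sign control (-ve). The boundary cell takes a/x as input and outputs c/1 and s/w; the internal cell takes x together with the incoming c/1 and s/w, holds r, and outputs z.]
• The sign is negative in mode 0 and positive in mode 1. This negative is needed since $W_{12} = -(W_{11} a_{12}) W_{22}$.
• This register needs to be initialized to 1, since in the first cycle the output needs to be +1.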
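Triangularization mode
• For QRD of up to a 3x3 matrix we need 3 boundary cells and 3 internal cells.
• Boundary cells calculate rotation vectors and internal cells store them.
• Data is fed column-wise into the systolic array.
• This may have to be staggered depending on the pipelining delays through the boundary cell and internal cell.
$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \rightarrow \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \rightarrow \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ 0 & a_{32} & a_{33} \end{bmatrix} \rightarrow \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ 0 & 0 & a_{33} \end{bmatrix}$
[Figure: the columns $(a_{11}, a_{21}, a_{31})$, $(a_{12}, a_{22}, a_{32})$ and $(a_{13}, a_{23}, a_{33})$ are fed into the three columns of the array.]
The rotation factors for zeroing out cell A(2,1) are stored in cell A(1,2), etc.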
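Back-substitution mode
• Computing $R^{-1}$ with back-substitution:
$W_{ij} = \begin{cases} 0 & \text{if } i > j \\ r_{jj}^{-1} & \text{if } i = j \\ -\left( \sum_{m=i}^{j-1} W_{im} r_{mj} \right) r_{jj}^{-1} & \text{if } i < j \end{cases}$
• The $r_{jj}^{-1}$ is already computed in the boundary cell and stored away. So just use it: $W_{jj} = r_{jj}^{-1}$.
With $a_{ij}$ denoting the entries of the triangularized matrix:
$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ 0 & 0 & a_{33} \end{bmatrix}^{-1} = \begin{bmatrix} W_{11} & W_{12} & W_{13} \\ 0 & W_{22} & W_{23} \\ 0 & 0 & W_{33} \end{bmatrix}$
$W_{12} = -W_{11} a_{12} W_{22}$
$W_{13} = -(W_{11} a_{13} + W_{12} a_{23}) W_{33}$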
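The back-substitution recurrence can be sketched as a short routine. This is a sequential software illustration of the formula W_ij = -(sum of W_im r_mj) / r_jj for i < j, with W_jj = 1 / r_jj; the function name is hypothetical and the input is assumed to be a square upper triangular matrix with a nonzero diagonal.

```python
def invert_upper_triangular(r):
    """Invert an upper triangular matrix by back-substitution."""
    n = len(r)
    w = [[0.0] * n for _ in range(n)]        # entries below the diagonal stay 0
    for j in range(n):
        w[j][j] = 1.0 / r[j][j]              # reciprocal of the diagonal entry
        for i in range(j - 1, -1, -1):
            # Uses only columns of W that were completed earlier (m < j).
            acc = sum(w[i][m] * r[m][j] for m in range(i, j))
            w[i][j] = -acc * w[j][j]
    return w
```

Only one reciprocal per diagonal element is required; every other entry is a multiply-accumulate followed by a multiplication, which maps onto the same cells used for triangularization.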
Q-matrix computation mode
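$Q^H A = R$
$Q^H I = Q^H$
$\begin{bmatrix} 1 & 0 & 0 \\ 0 & c_{32} & s_{32} \\ 0 & -s_{32} & c_{32} \end{bmatrix} \begin{bmatrix} c_{31} & 0 & s_{31} \\ 0 & 1 & 0 \\ -s_{31} & 0 & c_{31} \end{bmatrix} \begin{bmatrix} c_{21} & s_{21} & 0 \\ -s_{21} & c_{21} & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ 0 & 0 & a_{33} \end{bmatrix}$
[Figure: the columns of the identity matrix, (1, 0, 0), (0, 1, 0) and (0, 0, 1), are fed into the array; the corresponding outputs are labeled first column of Q matrix, second column of Q matrix and third column of Q matrix.]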
* *
* . * .
* . * .
;
s x I s s I c
z x I c s I s
c c
HQ RA
Scalability
• A 4x4 systolic array needs 4 boundary cells and 6 internal cells and can handle all matrices up to 4x4 (i.e. 1x1, 1x2, .. 2x2, …, 3x4, 4x4).
• But if your design is restricted to only a 2x2, you need only a 2x2 systolic array. With this you can handle 1x1, 1x2 and 2x2.
[Figure: nested regions of the array used for a 4x1 matrix, a 2x2 matrix, a 3x3 matrix and a 4x4 matrix.]
FPGA Tools for DSP Systems Design
• Higher level tools are raising the level of abstraction.
• Allows non-hardware engineers (algorithm designers) to get a first look at hardware.
• System Generator
  – Simulink to Hardware
• C-to-Gates tools
  – C or “higher” level languages to gates
Xilinx DSP Tools and Flows Accelerate DSP Design
[Figure: three design flows that each produce RTL for FPGA implementation with ISE: a graphical based flow (MATLAB / Simulink), a mixed flow (Simulink and MATLAB), and a language based flow (C/C++ through ESL partners).]
System Generator: System Level Modeling & Simulation Framework
Work in the language of your problem
[Figure labels: HDL, C]
DSP Development Flow
System Generator Flow:
1. Develop Algorithm & System Model (Simulink MDL)
2. Automatic Code Generation (RTL VHDL & Cores; HDL Test Bench and Test Vectors for the HDL Simulation Flow)
3. Xilinx Implementation Flow (Bitstream)
Download to FPGA
Hardware/Software Co-simulation
HDL co-simulation
• Encapsulates HDL semantics
• Simulink as verification framework
Hardware co-simulation
ADVANCED SYSTEMS TECHNOLOGY GROUP (ASTG)
FlexOFDM
• A Configurable MIMO-OFDM Technology Demonstrator.
• Not specific to any standard, but can be configured (with some effort) to showcase technologies that are part of some of the Wireless standards.
• Provides an architecture for the PHY and MAC layers, which can act as a starting point or spring board for product development.
• Investigate communication algorithms and architectures as they efficiently map to Xilinx FPGAs.
This is not a product/IP from Xilinx, but is available to partners, to speed up their MIMO-OFDM development efforts, on an AS IS basis.
Configurable MIMO-OFDM Transmitter
[Figure: System Generator model of the transmitter. Blocks: Packet Controller, Packetization and Encoding, Pilot Insertion and Data loading, FFT, Add Cyclic Extension, Spatial Demultiplexing and Clock Generator (SampleClk, BaudClk). Input: DataIn. Outputs: RealOut1/ImagOut1 to RealOut4/ImagOut4, DataEnable, DataDone. Signal types shown include Fix_16_10, UFix_6_0 and Bool.]
Processing chain:
• Packet Controller
• Packetization and configurable STBC encoding
• Pilot insertion and data loading
• Time shared FFT across antennas
• Add Cyclic Extension/Block Shaping
• Spatial Demultiplexing and Interpolation
Resource sharing (folding factor): a ratio of system clock rate to symbol rate > 8 is needed for a 4 transmit antenna system.
MIMO Receiver Architecture
[Figure: four receive chains (Rx 1 to Rx 4), each with Packet Detection, Strip CP, Input FIFO, FFT and Output FIFO. Shared blocks: Combine PD, Block Boundary Detection, Coarse CFO estimate, CFO estimator, CFO Compensator, Pilot based CFO estimator, Channel Estimator, MIMO Decoder Matrix (MMSE, etc), MIMO Decoder FIFO, MIMO Decode (Soft Decisions) and Packet Controller (Preamble, Payload).]
• Part of the chain processes samples at the sample clock rate; the rest processes samples at the system clock rate.
MIMO-OFDM Receiver
[Figure: System Generator model of the MIMO-OFDM receiver. Inputs: RealIn1/ImagIn1 to RealIn4/ImagIn4 and Reset. Outputs: SoftDecReal1/SoftDecImag1 to SoftDecReal4/SoftDecImag4, PacketDetect and ValidOut. The Channel Estimation block provides the 16 channel estimates Ch_tx1rx1 to Ch_tx4rx4 to the Weight Matrix Computation block, which provides the weights wreal_1_1/wimag_1_1 to wreal_4_4/wimag_4_4 to the MIMO Decoder. A Clock Generator supplies SampleClk and BaudClk.]
Blocks:
• Packet Detection
• Fine Timing Acq
• Cyclic prefix removal
• Channel Estimation
• Weight Matrix Computation
• MIMO Decoder
• FFT
• Carrier Frequency Offset Correction
• Output FIFO
Channel Estimation
[Figure: System Generator model of the channel estimator. Inputs: Rxreal1/Rximag1 to Rxreal4/Rximag4, ValidData, Addr, ChEstPilots and ReadAddr. Outputs: Chreal1/Chimag1 to Chreal16/Chimag16. Sixteen ChEst blocks (Tx1-Rx1 to Tx4-Rx4), training symbol sources for Tx1 to Tx4, single port RAMs, a ControlSignals block with two counters, and constant multipliers (x 0.3535). Signal types shown include Fix_16_10, Fix_32_20, UFix_6_0 and Bool.]
• Channel Estimation Pilots for Tx1 to Tx4
• 4x4 Channel Estimation Memory
• Control Signals
• Input FIFO
Packet Detection
Schmidl and Cox algorithm for Packet Detection and coarse carrier frequency offset estimation.
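T. M. Schmidl, D. C. Cox, “Low Overhead Low Complexity Synchronization for OFDM”, ICC 1996, Vol 3, pp 1301-1306.
[Figure: the received signal r(n) and a copy delayed by D samples ($Z^{-D}$) are multiplied, with one of them conjugated, and accumulated in block C to give the correlation c(n); block P accumulates the power p(n); the metric m(n) is formed from $|c(n)|^2$ and $(p(n))^2$. The preamble has identical halves of 1 OFDM symbol.]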
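The packet detection metric can be sketched as below. The sketch follows the commonly cited form of the Schmidl and Cox metric, m = |c|^2 / p^2, with c the correlation between the two identical halves and p the power of the second half; the function name, the conjugation convention and the CFO scaling in cycles per sample are assumptions of this example rather than details taken from the slide.

```python
import cmath
import math

def schmidl_cox(r, d, half_len):
    """Timing metric and coarse CFO estimate at candidate start index d."""
    # c: correlation between the first half and the (conjugated) first half's repeat.
    c = sum(r[d + m].conjugate() * r[d + m + half_len] for m in range(half_len))
    # p: received power over the second half.
    p = sum(abs(r[d + m + half_len]) ** 2 for m in range(half_len))
    metric = abs(c) ** 2 / p ** 2 if p > 0 else 0.0
    # A frequency offset rotates the second half by 2*pi*f*half_len relative to the first.
    cfo_cycles_per_sample = cmath.phase(c) / (2 * math.pi * half_len)
    return metric, cfo_cycles_per_sample
```

At the true start of the preamble the two halves differ only by a phase rotation, so the metric reaches 1 and the angle of c gives the coarse frequency offset, unambiguous as long as the rotation over half a symbol stays within plus or minus pi.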
Pre-FFT Carrier Frequency Offset Estimation
[Figure: System Generator model. Inputs: RealIn1/ImagIn1, RealIn2/ImagIn2, Baud_clk, Rst and BBD. The Packet Detection block outputs CorrMetric_real, CorrMetric_imag and AvePwr; a Truncate block, delays, a CORDIC ATAN block (inputs x and y, outputs mag and atan, latency z^-17), a constant multiplier (x 0.003906), a rising edge detector and a register produce the output CFO_Est.]
The angle of the correlation metric is proportional to the Carrier frequency offset.
Right size the number of bits before the CORDIC operation.
CORDIC ATAN from the Xilinx Math library calculates the angle.
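$\Delta\hat{f} = \dfrac{\phi}{2\pi \frac{N}{2} T_s}$, with $\phi$ the angle of the correlation metric, $N/2$ the delay in samples between the identical halves, and $T_s$ the sample period.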
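The angle computation performed by a CORDIC in vectoring mode can be sketched as follows. This is a floating-point illustration of the general CORDIC technique, not the implementation in the Xilinx library block; the function name and the default of 20 iterations are assumptions.

```python
import math

def cordic_vectoring(x, y, iterations=20):
    """Return (magnitude, angle) of the vector (x, y) using CORDIC micro-rotations."""
    angle = 0.0
    if x < 0:
        # Pre-rotate by pi so the vector lies in the right half plane.
        angle = math.pi if y >= 0 else -math.pi
        x, y = -x, -y
    gain = 1.0
    for i in range(iterations):
        step = 2.0 ** -i                      # a right shift by i bits in hardware
        if y > 0:
            x, y = x + y * step, y - x * step  # rotate clockwise
            angle += math.atan(step)
        else:
            x, y = x - y * step, y + x * step  # rotate counter-clockwise
            angle -= math.atan(step)
        gain *= math.sqrt(1.0 + step * step)   # accumulated CORDIC gain
    return x / gain, angle
```

Each iteration needs only shifts, adds and a small table of arctangent constants, and the angle resolution improves by roughly one bit per iteration. This is why right-sizing the input width before the CORDIC, as noted above, directly controls its latency and area.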
Carrier Frequency Offset Correction
[Figure: System Generator model of the CFO corrector. Inputs: RealIn 1/ImagIn 1 to RealIn 4/ImagIn 4, CFO_Est, CFO_Est_valid, FFT_Start and Reset. A DDS block (inputs freq_off, Enable, Reset; outputs cos_out, sin_out) drives four Complex Multiply blocks, one per receive stream, producing RealOut 1/ImagOut 1 to RealOut 4/ImagOut 4. Supporting logic: a constant multiplier (x 0.01563), a counter, relational and logical blocks, a rising edge detector and delays. Signal types shown include Fix_16_10, Fix_16_15 and Bool.]
Direct digital synthesizer (DDS) from the Xilinx DSP SysGen library.
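The correction step can be sketched as a phase accumulator (standing in for the DDS) followed by a complex multiply per sample. This is a floating-point illustration with hypothetical names; the frequency offset is assumed to be expressed in cycles per sample, and a real DDS would use a fixed-point phase accumulator and a sine/cosine lookup table.

```python
import math

def correct_cfo(samples, cfo_cycles_per_sample):
    """De-rotate samples by the estimated carrier frequency offset."""
    phase = 0.0
    step = 2 * math.pi * cfo_cycles_per_sample
    corrected = []
    for s in samples:
        # DDS outputs cos and sin of the accumulated phase; multiply by the conjugate phasor.
        phasor = complex(math.cos(phase), -math.sin(phase))
        corrected.append(s * phasor)
        phase = (phase + step) % (2 * math.pi)   # phase accumulator wraps like a DDS
    return corrected
```

The same phasor is applied to all four receive streams in the model above, since the streams are assumed to share one local oscillator.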
Design methodology issues
• FPGA tools
  – Where to from here?
• C-to-gates
  – Higher level design languages to gates
  – Raising the level of abstraction
‘C’ or higher level language to Gates
• There is interest in higher level design methodologies, such as C-to-Gates from the design community.
• ESL (Electronic system level) tools/design methodologies are being explored.
• But, extracting all the concurrency from a sequential description is not an easy problem.
C to Gates evaluation flow
Source: BDTI. For more info and results see www.BDTI.com.
C to Gates evaluation by BDTI
Source: BDTI. For more info and results see www.BDTI.com.
Conclusion
• FPGAs are finding wide use in infrastructure communication systems and signal processing systems.
• FPGAs are an efficient choice for exploring VLSI architectures.
• FPGA tools are raising the level of abstraction to allow algorithm designers the ability to explore h/w architectures without learning “h/w design tools/languages”.
Questions?