Arithmetic Acceleration Techniques for Wireless Communication Receivers
Suman Das, Sridhar Rajagopal, Chaitali Sengupta and Joseph R. Cavallaro
{suman,sridhar,chaitali,cavallar}@rice.edu
Rice University
This work is supported by Nokia, Texas Instruments, Texas Advanced Technology Program and NSF
http://www.ece.rice.edu/
Objective
Next generation Wireless Base-station
– Real-Time Requirements
Multiuser Channel Estimation and Detection
– High Complexity Algorithms for Advanced Receiver Structures
Task Decomposition
Potential for parallelism
Application-Specific Design / Single Processor
Outline
Motivation
Real-time Requirements
Joint Estimation and Detection
Task Decomposition
Results
Summary
Motivation
Next Generation Wireless Systems
– Higher Data Rates, up to 2 Mbps
– Multimedia Capabilities
– Multi-rate, QoS
High Complexity in Proposed Algorithms
Pressure on existing hardware
– Time, power, size constraints
Acceleration on Hardware Needed
Wireless Communication Uplink
Asynchronous CDMA System
Multiple Users
Channel Effects
– Fading
– Multiple paths
– Multiple Access Interference
[Figure: uplink channel. User 1 and User 2 transmit to the Base Station over a Direct Path and Reflected Paths, with Noise + MAI added.]
Base-Station Receiver
The Physical Layer
[Figure: base-station receiver block diagram for Multiple Users. Labelled blocks: Antenna, Demodulator, MUX (Pilot / Data), Channel Estimation, Multiuser Detection, Decoder, Delay, Decision Feedback; signals: pilot bits b, Detected Bits d.]
Real-Time Requirements
W-CDMA
– Transmission done by multiplication of signature waveform (Spreading)
– Data Transmission in 10 ms Frames
– Multiple Data Rates by Varying Spreading Factors
Detection needs to be done in real-time
– 1953 cycles available in a C6x DSP at 250 MHz to detect 1 bit at 128 Kbps
Spreading Factor    Number of Bits / Frame    Data Rate Requirement
4                   10240                     1024 Kbps
32                  1280                      128 Kbps
256                 160                       16 Kbps
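The cycle budget quoted above follows directly from the clock rate and the frame parameters. A minimal Python sketch of that arithmetic is given here; the constant and function names are illustrative, and the figure of 40960 chips per frame is not stated on the slide but is implied by every row of the table (spreading factor times bits per frame).

```python
CLOCK_HZ = 250_000_000     # C6x DSP clock rate quoted on the slide
FRAMES_PER_SECOND = 100    # one frame every 10 ms
CHIPS_PER_FRAME = 40960    # assumption: implied by the table (e.g. 32 * 1280)


def bits_per_frame(spreading_factor):
    """Bits carried by one 10 ms frame at the given spreading factor."""
    return CHIPS_PER_FRAME // spreading_factor


def data_rate_bps(spreading_factor):
    """Required data rate in bits per second."""
    return bits_per_frame(spreading_factor) * FRAMES_PER_SECOND


def cycles_per_bit(spreading_factor):
    """DSP cycles available to detect one bit in real time."""
    return CLOCK_HZ / data_rate_bps(spreading_factor)


for sf in (4, 32, 256):
    print(sf, bits_per_frame(sf), data_rate_bps(sf) // 1000, "Kbps",
          int(cycles_per_bit(sf)), "cycles/bit")
```

At spreading factor 32 this gives the 1953-cycle budget used as the real-time reference in the rest of the talk; the budget shrinks by a factor of 8 at spreading factor 4.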
Joint Estimation and Detection
Algorithm to jointly estimate the channel response and detect the bits of all users.
Shown to have better performance as well as reduced computational complexity.
Maximum Likelihood Based Channel Estimation
– [C. Sengupta et al.: PIMRC’1998, WCNC’1999]
Differencing Multistage Detection based on Parallel Interference Cancellation
– [G. Xu et al.: SPIE’1999]
Computations Involved
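Model
\[ r_i = A_i b_i + \eta_i, \qquad r_i \in \mathbb{C}^{N}, \quad b_i \in \mathbb{R}^{2K} \]
– b_i: bits of K asynchronous users aligned at times i and i-1 (2K bits)
– r_i: received vector of spreading length N for K users
– \eta_i: noise
Compute Correlation Matrices
\[ R_{br} = \frac{1}{L} \sum_{i=1}^{L} b_i r_i^{H}, \qquad R_{bb} = \frac{1}{L} \sum_{i=1}^{L} b_i b_i^{T} \]
[Figure: timeline of the received vector r_i against the bits b_{i-1} and b_i, showing the delay.]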
Multishot Detection
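\[ r = A b + \eta, \qquad
A = \begin{bmatrix}
A_0 & A_1 & 0 & \cdots & 0 \\
0 & A_0 & A_1 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & A_0
\end{bmatrix} \in \mathbb{C}^{ND \times KD} \]
\[ b = [\, b_{1,1}, \ldots, b_{1,K}, \ldots, b_{D,1}, \ldots, b_{D,K} \,]^{T} \]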
Multishot Detection
\[ A_i = [\, A_0 \;\; A_1 \,], \qquad A_i \in \mathbb{C}^{N \times 2K} \]
Solve for the channel estimate, A_i
\[ R_{bb} A_i^{H} = R_{br} \]
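The estimation step (form the two correlation matrices from known bits, then solve the linear system for the channel matrix) can be sketched in NumPy as follows. This is an illustrative floating-point sketch, not the authors' C6x implementation: the function name `estimate_channel`, the random test channel, the noise level, and the sizes N, K and L are all assumptions chosen for the demonstration.

```python
import numpy as np


def estimate_channel(B, R):
    """Estimate the N x 2K channel matrix from L training vectors.

    B: (L, 2K) real array, row i holds the bit vector b_i.
    R: (L, N) complex array, row i holds the received vector r_i.
    """
    L = B.shape[0]
    R_bb = B.T @ B / L                       # (1/L) sum of b_i b_i^T
    R_br = B.T @ R.conj() / L                # (1/L) sum of b_i r_i^H
    A_hat_H = np.linalg.solve(R_bb, R_br)    # solve R_bb A^H = R_br
    return A_hat_H.conj().T


# Hypothetical test case: random channel, random +/-1 bits.
rng = np.random.default_rng(0)
N, K, L = 32, 4, 200
A_true = (rng.standard_normal((N, 2 * K))
          + 1j * rng.standard_normal((N, 2 * K))) / np.sqrt(2)
B = rng.choice([-1.0, 1.0], size=(L, 2 * K))
R_clean = B @ A_true.T                       # row i is (A b_i)^T
noise = 0.05 * (rng.standard_normal((L, N))
                + 1j * rng.standard_normal((L, N)))
A_hat = estimate_channel(B, R_clean + noise)
rel_err = np.linalg.norm(A_hat - A_true) / np.linalg.norm(A_true)
print("relative estimation error:", rel_err)
```

Without noise the solve recovers the channel exactly, because the cross-correlation then equals R_bb times A^H. The solve is written with a general linear solver here; the Future Work slide mentions LU decomposition for a fixed-point version.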
Differencing Multistage Detection
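Stage 0
\[ y^{(0)} = \mathrm{Re}[A^{H} r], \qquad d^{(0)} = \mathrm{sign}(y^{(0)}) \]
Stage 1
\[ y^{(1)} = y^{(0)} - \mathrm{Re}[A^{H}A - S]\, d^{(0)}, \qquad d^{(1)} = \mathrm{sign}(y^{(1)}) \]
Successive Stages
\[ x^{(l)} = d^{(l)} - d^{(l-1)} \]
\[ y^{(l+1)} = y^{(l)} - \mathrm{Re}[A^{H}A - S]\, x^{(l)} \]
\[ d^{(l+1)} = \mathrm{sign}(y^{(l+1)}) \]
S = diag(A^H A)
y: soft decision
d: detected bits (hard decision)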
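The stage recursion can be sketched in NumPy as below. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `differencing_multistage` is hypothetical, the previous decision before stage 0 is taken as the zero vector so that stage 1 falls out of the same update, and a soft decision of exactly zero is mapped to +1.

```python
import numpy as np


def differencing_multistage(A, r, stages):
    """Differencing multistage detection; returns (hard bits d, soft outputs y)."""
    G = np.real(A.conj().T @ A)
    C = G - np.diag(np.diag(G))          # Re[A^H A - S] with S = diag(A^H A)
    y = np.real(A.conj().T @ r)          # stage 0 soft decision
    d = np.where(y >= 0, 1.0, -1.0)      # stage 0 hard decision
    d_prev = np.zeros_like(d)            # assumption: makes stage 1 use x = d
    for _ in range(stages):
        x = d - d_prev                   # nonzero only where a bit changed
        changed = np.nonzero(x)[0]
        # Only the columns of C for changed bits contribute to the update.
        y = y - C[:, changed] @ x[changed]
        d_prev = d
        d = np.where(y >= 0, 1.0, -1.0)
    return d, y


# Hypothetical test case: 6 bits observed through a random 24 x 6 matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((24, 6)) + 1j * rng.standard_normal((24, 6))
b = rng.choice([-1.0, 1.0], size=6)
r = A @ b + 0.1 * (rng.standard_normal(24) + 1j * rng.standard_normal(24))
d, y = differencing_multistage(A, r, stages=3)
print("detected:", d)
```

Because x is nonzero only where a decision flipped, the product touches fewer columns as the decisions settle, which is where the reduced work on later stages comes from. The soft outputs match ordinary parallel interference cancellation, which recomputes the full interference term from y at stage 0 on every stage.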
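Structure of A^H A
\[ A^{H}A = \begin{bmatrix}
A_0^{H}A_0 & A_0^{H}A_1 & 0 & \cdots & 0 \\
A_1^{H}A_0 & A_0^{H}A_0 + A_1^{H}A_1 & A_0^{H}A_1 & \cdots & 0 \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & A_1^{H}A_0 & A_0^{H}A_0 + A_1^{H}A_1
\end{bmatrix}, \qquad \mathrm{Re}[A^{H}A] \in \mathbb{R}^{KD \times KD} \]
Block tri-diagonal matrix (A itself is block bi-diagonal)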
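The block structure can be checked numerically. The sketch below builds a multishot matrix from two N x K blocks and inspects the K x K blocks of its Gram matrix. The placement of A_0 on the block diagonal and A_1 on the block superdiagonal is an assumption about the slide's layout, and the helper names `multishot_matrix` and `gram_block` are hypothetical.

```python
import numpy as np


def multishot_matrix(A0, A1, D):
    """Stack D windows into an ND x KD block bi-diagonal matrix.

    Assumption: A0 sits on the block diagonal, A1 on the block superdiagonal.
    """
    N, K = A0.shape
    A = np.zeros((N * D, K * D), dtype=complex)
    for w in range(D):
        A[w * N:(w + 1) * N, w * K:(w + 1) * K] = A0
        if w + 1 < D:
            A[w * N:(w + 1) * N, (w + 1) * K:(w + 2) * K] = A1
    return A


def gram_block(G, i, j, K):
    """Return the K x K block (i, j) of the Gram matrix G = A^H A."""
    return G[i * K:(i + 1) * K, j * K:(j + 1) * K]


# Hypothetical sizes and random blocks for the check.
rng = np.random.default_rng(2)
N, K, D = 8, 3, 4
A0 = rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))
A1 = rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))
A = multishot_matrix(A0, A1, D)
G = A.conj().T @ A

# Interior diagonal blocks are A0^H A0 + A1^H A1; neighbours are A0^H A1.
print(np.allclose(gram_block(G, 1, 1, K), A0.conj().T @ A0 + A1.conj().T @ A1))
print(np.allclose(gram_block(G, 1, 2, K), A0.conj().T @ A1))
print(np.allclose(gram_block(G, 0, 2, K), 0))
```

Under this layout only three distinct K x K products are needed (A_0^H A_0, A_1^H A_1 and A_0^H A_1, with A_1^H A_0 as its conjugate transpose), so the full KD x KD matrix never has to be formed.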
Bottlenecks
Identify using C6x DSP Implementation
Channel Estimation
– Can be done less frequently
– Depends on BER needed
Multiuser Detection
– Needs to be done all the time
– Differencing Multistage
– Fewer computations on successive stages
Analysis on Various levels of Optimization for Detection
Task Decomposition
Channel Estimation
Block I: Correlation Matrices (Per Bit)
– R_br[R]: O(KN)
– R_br[I]: O(KN)
– R_bb: O(K^2)
Block II: Inverse
– R_bb A^H = R_br[R]: O(K^2 N)
– R_bb A^H = R_br[I]: O(K^2 N)
Block III: Matrix Products
– A_0^H A_0: O(K^2 N)
– A_0^H A_1: O(K^2 N)
– A_1^H A_1: O(K^2 N)
– A^H r: O(KND)
Multistage Detection
Block IV: Multistage Detection (Per Window)
– O(D K^2 Me)
[Figure: block diagram linking Blocks I to IV. Labels: Pilot, Data, Data’, MUX, b, d.]
Sequential / Pipeline A B
Task A: Data → A^H r, O(KND), 13272 cycles
Task B: Block IV, O(D K^2 Me) → d, 3367*Me cycles
(Single PE) Sequential: A + B: 13272 + 3367*Me cycles: 10.7 Kbps
(2 PE) Pipeline: A B: max(13272, 3367*Me) cycles: 18.8 Kbps
Real-time: 1953 cycles, 128 Kbps
*Me = 3
(Parallel A) B
Task A: Data → A^H r, split across K parallel elements (1 ... K), O(ND) each, 885 cycles
Task B: Block IV, O(D K^2 Me) → d, 3367*Me cycles
(K+1 PE) Parallel A B: 3367*Me cycles: 24.75 Kbps
Real-time: 1953 cycles, 128 Kbps
Parallel A Pipeline B / Parallel A Parallel + Pipeline B
Task A: K parallel elements (1 ... K), O(N), 885 cycles
Task B stage, pipelined: O(K^2), 3367 cycles
Task B stage, parallel + pipelined: O(K), 225 cycles
(K+3 PE) Parallel A Pipeline B: 3367 cycles: 74.25 Kbps
((Me+1)K PE) Parallel A Parallel + Pipeline B: 885 cycles: 282.5 Kbps
Real-time: 1953 cycles, 128 Kbps
[Figure: processing-element layout with Data, Block I & II, Block III, Block IV (1 ... K), and Multistage Detection split into Stage 1, Stage 2, Stage 3, ...; Task A and Task B marked.]
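The quoted data rates follow from dividing the 250 MHz clock by the cycle count of the slowest stage in each arrangement. The Python sketch below redoes that arithmetic from the cycle counts on the slides. Treating the bottleneck as the maximum over stages is stated on the slide only for the two-PE pipeline and is an assumption for the parallel variants; the variable and function names are illustrative.

```python
CLOCK_HZ = 250e6          # C6x DSP clock rate from the real-time slide
ME = 3                    # Me, as set on the slide
A_SEQ = 13272             # Task A on one processing element (cycles)
A_PAR = 885               # Task A split across K elements (cycles)
B_STAGE = 3367            # one Task B stage on one element (cycles)
B_STAGE_PAR = 225         # one Task B stage with parallelism (cycles)


def rate_kbps(bottleneck_cycles):
    """Data rate when one bit needs this many cycles at the bottleneck."""
    return CLOCK_HZ / bottleneck_cycles / 1e3


# Bottleneck cycle count per arrangement (max over stages is an assumption
# for the parallel cases; the slide states it for the 2-PE pipeline).
schemes = {
    "Sequential A + B": A_SEQ + B_STAGE * ME,
    "Pipeline A B": max(A_SEQ, B_STAGE * ME),
    "(Parallel A) B": max(A_PAR, B_STAGE * ME),
    "(Parallel A) (Pipe B)": max(A_PAR, B_STAGE),
    "(Parallel A) (Parallel+Pipe B)": max(A_PAR, B_STAGE_PAR),
}

for name, cycles in schemes.items():
    verdict = "meets" if cycles <= 1953 else "misses"
    print(f"{name}: {cycles} cycles, {rate_kbps(cycles):.2f} Kbps, "
          f"{verdict} the 1953-cycle budget")
```

Only the fully parallel and pipelined arrangement gets under the 1953-cycle budget for 128 Kbps; once Task B is fully decomposed, the 885-cycle Task A becomes the bottleneck.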
Achieved Data Rates
[Figure: Data Rates for Different Levels of Pipelining and Parallelism. x-axis: Number of Users (9 to 15); y-axis: Data Rates (0 to 3 x 10^5). Curves: (Parallel A) (Parallel+Pipe B); (Parallel A) (Pipe B); (Parallel A) B; A B; Sequential A + B. Reference line: Data Rate Requirement = 128 Kbps.]
Mapping to Hardware
Analysis independent of hardware
– DSP with coprocessors
– Multiple Processors
– Combination of a processor with ASIC/FPGA
– Single ASIC
Minimize Idle time in processing elements
– Some computations can be shared
Assumptions
– Critical processing elements have functional units similar to C6x
– No communication overhead between processors
Number of elements dependent on number of users
Summary
Acceleration Techniques for Multiuser Estimation and Detection: computationally intensive algorithm
Task Decomposition
C6x DSP Simulator
Real-time Analysis
Hardware Mapping Issues
Application Specific Design more effective than a single processor solution
Future Work
Fixed Point Implementation
– LU Decomposition
– Other Algorithms for decomposition
Matrix Oriented Architectures
– Vector Processor with SIMD
– 2 Levels of Parallelism
Complex Arithmetic