RICE UNIVERSITY
DSP architectures for wireless communications
Sridhar Rajagopal
Department of Electrical and Computer Engineering, Rice University, Houston TX
ECE Pizza Talk March 28, 2003
This work has been supported in part by Nokia, TI, TATP and NSF
Future wireless devices:
High data rate mobile devices with multimedia
Multiple antennas with complex algorithms, GOPs of computation
Area-Time-Power constraints
Seamless connection across environments and standards
Use the fastest and cheapest available service
[Figure labels: Bluetooth/Home Networks, Wireless Cellular, Wireless LAN]
Aim of the talk
Design me
Trends
             Past                  Current               Future
Year         1990's                2002-2005             2006+
Function     Voice                 Data                  Multimedia
Data rates   10's of Kbps          100's of Kbps (10x)   10's of Mbps (10-100x)
Complexity   KOPs                  MOPs (1000x)          GOPs (1000x)
Power        < 500 mW              < 500 mW              < 500 mW
Antennas     Single                Single                Multiple
Standard     GSM (Europe),         GSM/TDMA/CDMA         GSM/TDMA/CDMA/EDGE/
             CDMA (Qualcomm),      on same device        Wireless LAN/Bluetooth
             TDMA (Nokia)                                on same device
             (different devices)
FLEXIBILITY
Change in flexibility requirements
Physical Layer: maximum change (needs to support multiple environments, algorithms and standards)
MAC Layer
Network Layer
Application Layer: no change (already flexible)
Architecture trade-offs
Past : more DSP + less ASIC, Current : less DSP + more ASIC
Reason: need less flexibility OR DSPs not powerful enough?
Can’t we build better DSPs? How much flexibility do we need?
ASICs: Area-Time-Power benefits
Intermediate
Programmable: Flexibility, time-to-market, software updates
Problems with current DSPs
Current DSPs
  Not enough functional units (FUs) for GOPs of computation
  Need 100's of FUs
  Not low power enough!!
Cannot extend to more FUs
  Limited Instruction Level Parallelism (ILP)
  Limited Subword Parallelism (such as MMX)
  Cannot support more registers (area, ports)
  Compilers: difficult to find ILP as FUs increase
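Scalable Wireless Application-specific Processors (SWAPs)
Exploit data parallelism (DP)
  Available in many wireless algorithms
  This is what ASICs do!!
Example:
int a[1024], b[1024], c[1024];        // 32 bits
short int d[1024], e[1024], f[1024];  // 16 bits packed
for (int i = 0; i < 1024; ++i) { a[i] = b[i] + c[i]; d[i] = e[i] + f[i]; }
Parallelism in the loop: ILP (the two additions in the body are independent), DP (the iterations are independent), Subword (the 16-bit operands can be packed)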
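A minimal sketch of the subword parallelism named above, assuming two 16-bit values packed into one 32-bit word. The function name `packed_add16` and the masking approach are illustrative assumptions, not taken from the talk; MMX-style instruction sets provide this as a single packed-add instruction.

```c
#include <stdint.h>

/* Add two pairs of 16-bit lanes packed into 32-bit words.
   Each lane wraps on overflow and no carry crosses into the other lane. */
uint32_t packed_add16(uint32_t x, uint32_t y) {
    const uint32_t top = 0x80008000u;          /* top bit of each lane */
    uint32_t low = (x & ~top) + (y & ~top);    /* add the lower 15 bits of each lane */
    return low ^ ((x ^ y) & top);              /* add the top bits without carry-out */
}
```

For example, `packed_add16(0x0001FFFFu, 0x00010001u)` yields `0x00020000u`: the upper lanes add to 2 and the lower lanes wrap to 0 without carrying into the upper lane.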
SWAPs: stream processors for wireless
[Figure: stream program for a receiver. Input data (received signal) passes as streams through the kernels Correlator channel estimation, Matched filter, Interference cancellation and Viterbi decoding to produce output data (decoded bits).]
Kernels (computation) and streams (communication)
Operations within kernels use local data
Streams expose data parallelism
Imagine stream processor at Stanford
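The kernel/stream split can be sketched in plain C, assuming streams are arrays and kernels are functions that read one stream and write another. The kernel names, the integer sample type and the spreading-code matched filter are hypothetical choices for illustration, not the talk's implementation.

```c
#include <stddef.h>

/* Kernel 1 (hypothetical): matched filter. For each symbol, correlate sf
   received chips with the code and emit one soft value to the output stream. */
void matched_filter_kernel(const int *rx, const int *code, size_t sf,
                           size_t nsym, int *soft) {
    for (size_t s = 0; s < nsym; ++s) {      /* symbols are independent: DP */
        int acc = 0;                         /* kernel-local data */
        for (size_t k = 0; k < sf; ++k)
            acc += rx[s * sf + k] * code[k];
        soft[s] = acc;
    }
}

/* Kernel 2 (hypothetical): hard decision on each soft value. */
void slicer_kernel(const int *soft, size_t nsym, int *bits) {
    for (size_t s = 0; s < nsym; ++s)
        bits[s] = (soft[s] < 0) ? 1 : 0;
}
```

Chaining the two kernels, the stream `soft` written by the first is the only data passed to the second; everything else stays local to a kernel.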
DSP vs. SWAPs
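[Figure: DSP (1 cluster): functional units (+, *) with internal memory, exploiting ILP. SWAPs (max. clusters): replicated clusters of functional units fed by a Stream Register File (SRF), exploiting ILP within each cluster and DP across clusters.]
All clusters are the same and do the same operations.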
Arithmetic clusters
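FUs (+, *, /)
Scratch-pad (Sp): indexed accesses
Comm. unit (CU): intercluster comm.
Distributed reg. files: more FUs
[Figure: one arithmetic cluster. Adders, multipliers and dividers with local register files, a scratch-pad (Sp) and a comm. unit (CU), joined by a cross point, with links from/to the SRF and to the intercluster network.]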
SWAPs vs. DSPs trade-offs
Same internal memory size as DSPs
  Dependent on application, not architecture
Needs more area to support more functional units
  Area is less of a constraint than power
Varying levels of DP in applications
  Needs reconfiguration!!
  Need to turn off unused clusters (and FUs)
More parallelism -> lower clock frequency -> lower voltage -> low power (CV^2f + leakage) in spite of larger area
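A rough sketch of the power argument above, assuming the usual first-order model in which dynamic power is capacitance times voltage squared times frequency, with leakage added on top. The function name and the numbers in the usage note are illustrative assumptions, not measurements from the talk.

```c
/* First-order power model: switched capacitance c, supply voltage v,
   clock frequency f, plus a leakage term. Units are whatever the caller
   uses consistently (here: relative to a baseline of 1.0). */
double total_power(double c, double v, double f, double leak) {
    return c * v * v * f + leak;
}
```

With these assumed numbers, doubling the switched capacitance (more clusters) while halving the clock and lowering the supply to 0.8 of its original value gives `total_power(2.0, 0.8, 0.5, 0.0)`, about 0.64 of the baseline `total_power(1.0, 1.0, 1.0, 0.0)`.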
Design methodology
Chain of receiver algorithms: low "complexity", parallel, fixed point
High level language implementation
Modular programmable architecture design: DSP, SWAPs
ASIC design: FPGA, customized, reconfigurable, heterogeneous designs
H-SWAPs (with "learn" links from both design paths)
Architecture exploration: flexibility-performance tradeoffs
Physical layer of wireless receivers
[Block diagram: Antenna -> RF front-end -> Baseband processing (Channel estimation, Detection, Decoding) -> Higher (MAC/Network/OS) layers]
Receiver more complex than transmitter
Algorithms for
Multiple antenna systems (MIMO systems)
  Complexity exponential with transmit * receive antennas
Wide range of extremely complex algorithms
  Optimal choice depends on fading, mobility, bandwidth, antennas
  GOPs of computations
Estimation: Linear MMSE, blind, conjugate gradient...
Detection: FFT, (blind) interference cancellation...
Decoding: Viterbi, Turbo, LDPC...
Implement ALL of them AND the NEXT one in line
  Use the best one for the situation
Example for concept demonstration: Viterbi decoding
Parallel Viterbi Decoding
1. Add-Compare-Select (ACS): trellis interconnect
   Parallelism depends on constraint length (#states)
2. Conventional traceback
   Sequential (no DP)
   Difficult to implement in parallel architecture
   Use Register Exchange (RE): parallel solution
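The two steps above can be sketched as a small hard-decision decoder in which every state runs its own ACS and copies its survivor's register (register exchange), so no sequential traceback is needed. This is a toy under stated assumptions: constraint length 3 with generators 7 and 5 (octal), Hamming metrics, and messages of at most 64 bits; the function names are hypothetical and the talk's own kernels are not shown in the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy rate-1/2 code (assumed for illustration): K = 3, generators 7 and 5 octal. */
enum { K = 3, NSTATES = 1 << (K - 1) };
static const unsigned G0 = 07, G1 = 05;

static unsigned parity(unsigned x) {
    unsigned p = 0;
    while (x) { p ^= x & 1u; x >>= 1; }
    return p;
}

/* Two coded bits produced when leaving state s with input bit b. */
static unsigned branch_output(unsigned s, unsigned b) {
    unsigned full = (s << 1) | b;
    return (parity(full & G0) << 1) | parity(full & G1);
}

/* Hamming distance between two 2-bit symbols. */
static unsigned hamming2(unsigned a, unsigned b) {
    unsigned x = a ^ b;
    return (x & 1u) + (x >> 1);
}

/* Encoder: one 2-bit symbol per input bit, starting in state 0. */
void conv_encode(const unsigned char *bits, size_t n, unsigned char *sym) {
    unsigned s = 0;
    for (size_t t = 0; t < n; ++t) {
        sym[t] = (unsigned char)branch_output(s, bits[t]);
        s = ((s << 1) | bits[t]) & (NSTATES - 1);
    }
}

/* Decoder: returns the decoded bits, first bit in the most significant
   used position (bit n-1), last bit in bit 0. */
uint64_t viterbi_re_decode(const unsigned char *sym, size_t n) {
    unsigned pm[NSTATES], new_pm[NSTATES];      /* path metrics */
    uint64_t reg[NSTATES], new_reg[NSTATES];    /* survivor registers */
    assert(n <= 64);
    for (unsigned s = 0; s < NSTATES; ++s) {
        pm[s] = (s == 0) ? 0u : 1000u;          /* encoder starts in state 0 */
        reg[s] = 0;
    }
    for (size_t t = 0; t < n; ++t) {
        for (unsigned j = 0; j < NSTATES; ++j) { /* states are independent: DP */
            unsigned b = j & 1u;                 /* input bit that leads to j */
            unsigned p0 = j >> 1, p1 = (j >> 1) | (NSTATES >> 1);
            unsigned m0 = pm[p0] + hamming2(branch_output(p0, b), sym[t]);
            unsigned m1 = pm[p1] + hamming2(branch_output(p1, b), sym[t]);
            unsigned win = (m1 < m0) ? p1 : p0;  /* add, compare, select */
            new_pm[j] = (m1 < m0) ? m1 : m0;
            new_reg[j] = (reg[win] << 1) | b;    /* register exchange */
        }
        memcpy(pm, new_pm, sizeof pm);
        memcpy(reg, new_reg, sizeof reg);
    }
    unsigned best = 0;
    for (unsigned s = 1; s < NSTATES; ++s)
        if (pm[s] < pm[best]) best = s;
    return reg[best];
}
```

The inner loop over `j` has no dependence between states, which is the data parallelism a cluster array would exploit; the cost of register exchange is that every state carries a full survivor register.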
Re-ordering for parallel Viterbi
[Figure: (a) Trellis, with states X(0) to X(15) in natural order in both columns. (b) Shuffled trellis, with one column reordered as X(0), X(2), ..., X(14), X(1), X(3), ..., X(15) and the other in natural order X(0) to X(15).]
Exploiting Viterbi DP in SWAPs: Re-order ACS, RE Overhead
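The shuffled ordering in the figure (even-indexed states first, then odd-indexed) can be written as an index mapping. This is a sketch inferred from the listed order only; the function name is hypothetical, and the remark below additionally assumes the common state update next = (2s + b) mod n, which the slides do not state.

```c
#include <assert.h>

/* Position of trellis state j (0 <= j < n, n even) in the shuffled order
   X(0), X(2), ..., X(n-2), X(1), X(3), ..., X(n-1). */
unsigned shuffled_position(unsigned j, unsigned n) {
    assert(n % 2 == 0 && j < n);
    return (j >> 1) + (j & 1u) * (n / 2);
}
```

Under the assumed state update, states 2i and 2i+1 are both reached from states i and i + n/2, so after shuffling, positions i and i + n/2 of the new column read exactly positions i and i + n/2 of the previous one. Every ACS then uses the same regular access pattern.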
SWAP: Algorithms + Architecture
Algorithm design for parallelism
Architecture design?
SWAP design
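Decide how many clusters
  Exploit DP
Decide what to put within each cluster
  Maximize ILP with high functional unit efficiency
  Search design space with "explore" tool
See how it meets time-area-power constraints
[Figure: clusters of adders (+) and multipliers (*); the number of units per cluster (ILP) and the number of clusters (DP) are marked as open questions.]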
Inside a SWAP cluster: EXPLORE
Auto-exploration of adders and multipliers for "ACS"
[Figure: instruction count (axis 40 to 160) versus #Adders (1 to 5) and #Multipliers (1 to 5); each point is labeled with (Adder FU%, Multiplier FU%).]
“Explore” tool benefits
Instruction count vs. functional unit efficiency
  What goes inside each cluster
Explore all algorithms: turn off functional units not in use for a given kernel
Design customized application-specific units
  Better performance with increased FU utilization
Algorithm 1: 3 adders, 3 multipliers, 32 clusters
Algorithm 2: 4 adders, 1 multiplier, 64 clusters
Architecture: 4 adders, 3 multipliers, 64 clusters
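The architecture line above is consistent with taking, for each resource, the largest requirement over all algorithms. A minimal sketch of that rule follows; the struct and function names are hypothetical, and the actual "explore" tool may weigh other factors.

```c
/* Per-algorithm (or per-architecture) resource counts. */
typedef struct {
    int adders;
    int multipliers;
    int clusters;
} SwapConfig;

static int max_int(int a, int b) { return a > b ? a : b; }

/* Smallest configuration that satisfies both requirements. */
SwapConfig merge_requirements(SwapConfig a, SwapConfig b) {
    SwapConfig m;
    m.adders = max_int(a.adders, b.adders);
    m.multipliers = max_int(a.multipliers, b.multipliers);
    m.clusters = max_int(a.clusters, b.clusters);
    return m;
}
```

Merging {3, 3, 32} with {4, 1, 64} gives {4, 3, 64}, matching the slide; units a given kernel does not need would then be turned off.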
Viterbi reconfiguration
Packet 1: constraint length 7 (16 clusters)
Packet 2: constraint length 9 (64 clusters)
Packet 3: constraint length 5 (4 clusters)
DP can be turned OFF
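The cluster counts above match four trellis states per cluster, with 2^(K-1) states for constraint length K. A small sketch of that relationship follows; the ratio of four is inferred from the three packets listed, and the function name is hypothetical.

```c
#include <assert.h>

enum { STATES_PER_CLUSTER = 4 };   /* inferred from the slide, an assumption */

/* Clusters to keep active for constraint length k, capped at max_clusters. */
unsigned active_clusters(unsigned k, unsigned max_clusters) {
    assert(k >= 3 && k <= 16);
    unsigned states = 1u << (k - 1);
    unsigned needed = states / STATES_PER_CLUSTER;
    return needed < max_clusters ? needed : max_clusters;
}
```

On a 64-cluster machine this gives 16, 64 and 4 active clusters for constraint lengths 7, 9 and 5, so the remaining clusters can be switched off per packet.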
Viterbi decoding: rate 1/2 at 128 Kbps = 10 MHz
[Figure: log-log plot of frequency needed to attain real-time (in MHz, 1 to 1000) versus number of clusters (1 to 100), with curves for K = 9, K = 7 and K = 5 and points marked DSP, Static architecture and SWAPs.]
Ideal C64x (w/o co-proc) needs ~200 MHz for real-time
SWAPs : Salient features
1-2 orders of magnitude better than a single-processor DSP
Any constraint length: 10 MHz at 128 Kbps
Same code for all constraint lengths
  No need to re-compile or load another code as long as parallelism/cluster ratio is constant
Power savings due to dynamic cluster scaling
Expected SWAP power consumption
64 clusters and 1 multiplier per cluster: 0.13 micron, 1.2 V
  Peak Active Power: ~9 mW at 1 MHz
  Area: ~53.7 mm^2
10 MHz, 128 Kbps with reconfiguration
*Exploring the VLSI Scalability of Stream Processors, Brucek Khailany et al, Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003, Anaheim, California, USA, pp. 153-164
[Figure: power (in mW, 0 to 90) versus active clusters (max 64).]
Viterbi     Clusters used   Peak Power
K = 9       64              ~90 mW
K = 7       16              ~28.57 mW
K = 5       4               ~13.8 mW
overhead    0               ~8.1 mW
Flexibility vs. performance
Suitable for mobile devices?
SWAPs: Real-time at ~10-100 mW
Maybe; but can we do better?
ASICs: Real-time at ~10-100 µW
No special customization for the application
No application-specific units
Generic inter-cluster communication network
Overhead for extracting parallelism
SWAPs suitable for base-stations?
Why not? Power is not a primary constraint!
Multiuser Estimation-Detection+Decoding
Real-time target : 128 Kbps per user
[Figure: log-log plot of frequency needed to attain real-time (in MHz, 10 to 100000) versus number of clusters (1 to 100), with curves labeled FAST, MEDIUM and SLOW and points marked DSP, Mobile and 32-user base-station.]
Ideal C64x (w/o co-proc) needs ~15 GHz for real-time
Current research
SWAPs : Completely flexible and general
How do we trade off flexibility for better performance?
Handset SWAPs (H-SWAPs)
H-SWAPs: Potential advantages
[Figure: execution time, two panels.
SWAPs panel: DSP (RE); SWAP (DP); ASIC/FPGA real-time performance (task pipelining, dedicated interconnect).
H-SWAPs panel: DSP (RE); H-SWAP (partial DP + task pipelining, application-specific units); ASIC/FPGA real-time performance (dedicated interconnect).]
Conclusions
Need flexible architectures for future wireless devices
  Higher data rates, lower power, more complex algorithms
Design methodology (SWAPs, H-SWAPs, ASICs)
  Flexibility vs. performance trade-offs
  Blurs distinction between ASICs and programmable solutions
Also need parallel, low precision algorithms for efficient mapping
Inter-disciplinary research: computer architecture, VLSI, wireless communications, computer arithmetic, compilers