RICE UNIVERSITY

DSP architectures for wireless communications

Sridhar Rajagopal

Department of Electrical and Computer Engineering, Rice University, Houston, TX

ECE Pizza Talk, March 28, 2003

This work has been supported in part by Nokia, TI, TATP and NSF

Future wireless devices:

High data rate mobile devices with multimedia

Multiple antennas with complex algorithms, GOPs of computation

Area-Time-Power constraints

Seamless connection across environments and standards

Use the fastest and cheapest available service

Bluetooth/Home Networks

Wireless Cellular

Wireless LAN

Aim of the talk

Design me

Trends

              Past            Current               Future
Year          1990s           2002-2005             2006+
Function      Voice           Data                  Multimedia
Data rates    10s of Kbps     100s of Kbps (10x)    10s of Mbps (10-100x)
Complexity    KOPs            MOPs (1000x)          GOPs (1000x)
Power         < 500 mW        < 500 mW              < 500 mW
Antennas      Single          Single                Multiple

Standard (past): GSM (Europe), CDMA (Qualcomm), TDMA (Nokia) on different devices
Standard (current): GSM/TDMA/CDMA on same device
Standard (future): GSM/TDMA/CDMA/EDGE/Wireless LAN/Bluetooth on same device

FLEXIBILITY

Change in flexibility requirements

Application Layer: no change (already flexible)

Network Layer

MAC Layer

Physical Layer: maximum change (needs to support multiple environments, algorithms and standards)

Architecture trade-offs

Past: more DSP + less ASIC. Current: less DSP + more ASIC.

Reason: need less flexibility OR DSPs not powerful enough?

Can’t we build better DSPs? How much flexibility do we need?

ASICs: area-time-power benefits

Intermediate

Programmable: flexibility, time-to-market, software updates

Problems with current DSPs

Current DSPs
  Not enough functional units (FUs) for GOPs of computation
  Need 100s of FUs
  Not low power enough!!

Cannot extend to more FUs
  Limited Instruction Level Parallelism (ILP)
  Limited Subword Parallelism (such as MMX)
  Cannot support more registers (area, ports)
  Compilers: difficult to find ILP as FUs increase

Scalable Wireless Application-specific Processors (SWAPs)

Exploit data parallelism (DP)
  Available in many wireless algorithms
  This is what ASICs do!!

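Example:

int i;
int a[1024], b[1024], c[1024];        // 32 bits
short int d[1024], e[1024], f[1024];  // 16 bits, packed

for (i = 0; i < 1024; ++i) {
    a[i] = b[i] + c[i];   // ILP: independent of the statement below
    d[i] = e[i] + f[i];   // Subword: 16-bit adds can share a 32-bit unit
}                         // DP: the 1024 iterations are independent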
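The subword case in the example can be sketched in portable C. This is only an illustration of the idea, not code from the talk: the helper names `pack16`, `packed_add16` and `packed_vector_add` are hypothetical, and a real DSP would use a dedicated packed-add instruction instead of the masking trick shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word (hi in the upper half). */
static uint32_t pack16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Add both 16-bit lanes of x and y in one pass over a 32-bit word.
 * The top bit of each lane is handled separately, so a carry out of the
 * lower lane never leaks into the upper lane (each lane wraps mod 2^16). */
static uint32_t packed_add16(uint32_t x, uint32_t y)
{
    const uint32_t top = 0x80008000u;          /* top bit of each lane */
    uint32_t sum = (x & ~top) + (y & ~top);    /* add the low 15 bits  */
    return sum ^ ((x ^ y) & top);              /* fold in the top bits */
}

/* d[i] = e[i] + f[i] over n packed words, i.e. 2*n 16-bit additions. */
static void packed_vector_add(uint32_t *d, const uint32_t *e,
                              const uint32_t *f, size_t n)
{
    assert(d && e && f);
    for (size_t i = 0; i < n; ++i)
        d[i] = packed_add16(e[i], f[i]);
}
```

Each call to `packed_add16` performs two of the 16-bit additions from the loop, which is the saving that subword parallelism offers on a 32-bit datapath.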

SWAPs: stream processors for wireless

[Figure: kernels connected by streams. Kernels shown: correlator (channel estimation), matched filter, interference cancellation, Viterbi decoding. Streams carry the received signal (input data) in and the decoded bits (output data) out.]

Kernels (computation) and streams (communication)
  Operations in kernels use local data
  Streams expose data parallelism
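The kernel and stream split can be sketched in a few lines of C. This is a toy model under stated assumptions, not the Imagine or SWAP programming interface: `stream_t`, `kernel_fn` and `run_chain` are hypothetical names, and the two kernels are placeholders standing in for stages such as a matched filter or a Viterbi decoder.

```c
#include <assert.h>
#include <stddef.h>

/* A stream: a sequence of records passed between kernels. */
typedef struct {
    int    *data;
    size_t  len;
} stream_t;

/* A kernel: reads one stream and writes another, using only local data. */
typedef void (*kernel_fn)(const stream_t *in, stream_t *out);

/* Placeholder kernel (hypothetical): doubles every element. */
static void scale_by_two(const stream_t *in, stream_t *out)
{
    for (size_t i = 0; i < in->len; ++i)
        out->data[i] = 2 * in->data[i];
    out->len = in->len;
}

/* Placeholder kernel (hypothetical): adds one to every element. */
static void add_one(const stream_t *in, stream_t *out)
{
    for (size_t i = 0; i < in->len; ++i)
        out->data[i] = in->data[i] + 1;
    out->len = in->len;
}

/* Run a chain of kernels, ping-ponging between two stream buffers.
 * Returns the stream that holds the output of the last kernel. */
static stream_t *run_chain(const kernel_fn *kernels, size_t n_kernels,
                           stream_t *a, stream_t *b)
{
    assert(kernels && a && b);
    stream_t *in = a, *out = b;
    for (size_t k = 0; k < n_kernels; ++k) {
        kernels[k](in, out);
        stream_t *tmp = in;   /* the output becomes the next input */
        in = out;
        out = tmp;
    }
    return in;
}
```

Because every element of a stream is processed the same way and independently, the inner loops of the kernels are where the data parallelism becomes visible.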

Imagine stream processor at Stanford

DSP vs. SWAPs

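[Figure: DSP (1 cluster) versus SWAPs (max. clusters). The DSP has internal memory feeding a single cluster of adders and multipliers, which exploits ILP. SWAPs have a Stream Register File (SRF) feeding many clusters; all clusters are the same and do the same operations, so ILP is exploited within a cluster and DP across clusters.]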
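A rough way to picture "all clusters do the same operations" is to split a vector addition across clusters. The sketch below is a sequential emulation under assumptions of my own: the interleaved assignment of elements to clusters and the names `cluster_add` and `swap_vector_add` are illustrative, and real hardware would run the clusters in lockstep, not one after another.

```c
#include <assert.h>
#include <stddef.h>

/* The work of one cluster: cluster `id` out of `n_clusters` handles
 * elements id, id + n_clusters, id + 2*n_clusters, ... */
static void cluster_add(size_t id, size_t n_clusters,
                        const int *a, const int *b, int *c, size_t n)
{
    for (size_t i = id; i < n; i += n_clusters)
        c[i] = a[i] + b[i];
}

/* Sequential emulation of the cluster array: every cluster runs the
 * same code on its own slice of the data. */
static void swap_vector_add(size_t n_clusters,
                            const int *a, const int *b, int *c, size_t n)
{
    assert(n_clusters > 0);
    for (size_t id = 0; id < n_clusters; ++id)
        cluster_add(id, n_clusters, a, b, c, n);
}
```

With a single cluster this degenerates to the ordinary DSP loop; adding clusters shortens each cluster's share of the work without changing the code it runs.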

Arithmetic clusters

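FUs (+, *, /)

Scratch-pad (Sp): indexed accesses

Comm. unit (CU): intercluster communication

Distributed register files: more FUs

[Figure: one arithmetic cluster. Adders, multipliers and dividers, a scratch-pad (Sp) and a comm. unit (CU), each with a local register file, are joined by a cross point that connects from/to the SRF and to the intercluster network.]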

SWAPs vs. DSPs trade-offs

Same internal memory size as DSPs
  Dependent on application, not architecture

Needs more area to support more functional units
  Area is less of a constraint than power

Varying levels of DP in applications
  Needs reconfiguration!!
  Need to turn off unused clusters (and FUs)

More parallelism -> lower clock frequency -> lower voltage -> low power (CV^2f + leakage) in spite of larger area
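The power argument can be written out as a short first-order derivation. It is a sketch that is not taken from the talk: it assumes that N identical clusters multiply the switched capacitance by N, that the work parallelizes perfectly so the clock can drop to f/N, and that the supply can then be lowered from V to some V_N < V. The symbols N, V_N and P_leak are introduced here for illustration.

```latex
\begin{align*}
P_1 &= C V^2 f + P_{\mathrm{leak},1} \\
P_N &= (N C)\, V_N^2\, \frac{f}{N} + P_{\mathrm{leak},N}
     = C V_N^2 f + P_{\mathrm{leak},N} \\
\frac{P_N^{\mathrm{dyn}}}{P_1^{\mathrm{dyn}}} &= \left(\frac{V_N}{V}\right)^2 < 1
\end{align*}
```

Under these assumptions the dynamic term falls with the square of the voltage ratio, while leakage grows with the added area, so the saving holds only as long as the first effect outweighs the second.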

Design methodology

Chain of receiver algorithms: low "complexity", parallel, fixed point

High level language implementation

Modular programmable architecture design: DSP, SWAPs

ASIC design: FPGA, customized, reconfigurable, heterogeneous designs

H-SWAPs (reached by "learn" arrows from both designs)

Architecture exploration: flexibility-performance tradeoffs

Physical layer of wireless receivers

[Figure: receiver chain. Antenna -> RF front-end -> baseband processing (channel estimation -> detection -> decoding) -> higher (MAC/Network/OS) layers.]

Receiver more complex than transmitter

Algorithms for

Multiple antenna systems (MIMO systems)
  Complexity exponential with transmit * receive antennas

Wide range of extremely complex algorithms
  Optimal choice depends on fading, mobility, bandwidth, antennas
  GOPs of computation

Estimation: Linear MMSE, blind, conjugate gradient...

Detection: FFT, (blind) interference cancellation...

Decoding: Viterbi, Turbo, LDPC...

Implement ALL of them AND the NEXT one in line
  Use the best one for the situation

Example for concept demonstration: Viterbi decoding

Parallel Viterbi Decoding

1. Add-Compare-Select (ACS): trellis interconnect
   Parallelism depends on constraint length (#states)

2. Conventional Traceback
   Sequential (no DP)
   Difficult to implement in a parallel architecture
   Use Register Exchange (RE): a parallel solution
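The two steps can be sketched as a small hard-decision decoder in C. The code is an illustration under assumptions that do not come from the talk: it uses a constraint length 3, rate 1/2 code with generators 7 and 5 (octal), limits the message to 32 bits so that each survivor fits in one word, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NUM_STATES = 4 };      /* constraint length K = 3 (assumed) */
#define METRIC_INF 1000u      /* path metric of an unreachable state */

/* Encoder outputs for generators 7 and 5 (octal), given the state
 * (the two previous input bits) and the new input bit. */
static void branch_output(unsigned state, unsigned bit, unsigned out[2])
{
    unsigned reg = (state << 1) | bit;               /* 3-bit register */
    out[0] = (reg ^ (reg >> 1) ^ (reg >> 2)) & 1u;   /* taps 111 */
    out[1] = (reg ^ (reg >> 2)) & 1u;                /* taps 101 */
}

/* Rate 1/2 convolutional encoder: n message bits -> 2n coded bits. */
static void conv_encode(const unsigned char *msg, size_t n,
                        unsigned char *coded)
{
    unsigned state = 0;
    for (size_t t = 0; t < n; ++t) {
        unsigned out[2];
        branch_output(state, msg[t], out);
        coded[2 * t]     = (unsigned char)out[0];
        coded[2 * t + 1] = (unsigned char)out[1];
        state = ((state << 1) | msg[t]) & (NUM_STATES - 1);
    }
}

/* Viterbi decoding with ACS and register exchange. Returns the decoded
 * bits packed into one word, first message bit most significant. */
static uint32_t viterbi_re_decode(const unsigned char *coded, size_t n)
{
    assert(n <= 32);
    unsigned metric[NUM_STATES] = {0, METRIC_INF, METRIC_INF, METRIC_INF};
    uint32_t survivor[NUM_STATES] = {0, 0, 0, 0};

    for (size_t t = 0; t < n; ++t) {
        unsigned new_metric[NUM_STATES];
        uint32_t new_survivor[NUM_STATES];
        /* Every next state is independent of the others: this is the DP. */
        for (unsigned ns = 0; ns < NUM_STATES; ++ns) {
            unsigned bit = ns & 1u, best = 0, best_prev = 0;
            for (unsigned k = 0; k < 2; ++k) {
                unsigned prev = (ns >> 1) | (k << 1);
                unsigned out[2];
                branch_output(prev, bit, out);
                unsigned cand = metric[prev]                 /* add */
                              + (out[0] ^ coded[2 * t])
                              + (out[1] ^ coded[2 * t + 1]);
                if (k == 0 || cand < best) {                 /* compare */
                    best = cand;                             /* select */
                    best_prev = prev;
                }
            }
            new_metric[ns] = best;
            /* Register exchange: copy the winner's survivor, append bit. */
            new_survivor[ns] = (survivor[best_prev] << 1) | bit;
        }
        for (unsigned s = 0; s < NUM_STATES; ++s) {
            metric[s] = new_metric[s];
            survivor[s] = new_survivor[s];
        }
    }
    unsigned best_state = 0;
    for (unsigned s = 1; s < NUM_STATES; ++s)
        if (metric[s] < metric[best_state])
            best_state = s;
    return survivor[best_state];
}
```

Register exchange trades memory traffic for parallelism: every state carries its full decoded history forward, so there is no sequential traceback pass at the end, only a pick of the best final state.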

Re-ordering for parallel Viterbi

[Figure: (a) Trellis: 16 states X(0) to X(15) in natural order on both sides. (b) Shuffled trellis: states reordered as X(0), X(2), ..., X(14), X(1), X(3), ..., X(15) on one side and in natural order X(0) to X(15) on the other.]

Exploiting Viterbi DP in SWAPs: Re-order ACS, RE Overhead
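The shuffled order in the figure, X(0), X(2), ..., X(14), X(1), X(3), ..., X(15), can be expressed as an index mapping. This is a minimal sketch with hypothetical helper names; the comment about why the layout helps is my own reading and is not stated in the talk.

```c
#include <assert.h>
#include <stddef.h>

/* State index stored at position `pos` of the shuffled order
 * X(0), X(2), ..., X(n-2), X(1), X(3), ..., X(n-1). */
static size_t shuffled_state(size_t pos, size_t n_states)
{
    assert(n_states % 2 == 0 && pos < n_states);
    size_t half = n_states / 2;
    return pos < half ? 2 * pos : 2 * (pos - half) + 1;
}

/* Reorder per-state values (for example path metrics) into the shuffled
 * layout. Presumably this lets every cluster fetch its operands at a
 * regular stride, which is an assumption about the motivation. */
static void shuffle_states(const int *natural, int *shuffled,
                           size_t n_states)
{
    for (size_t pos = 0; pos < n_states; ++pos)
        shuffled[pos] = natural[shuffled_state(pos, n_states)];
}
```

For 16 states, position 7 holds X(14) and position 8 holds X(1), matching the order drawn in the shuffled trellis.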

SWAP: Algorithms + Architecture

Algorithm design for parallelism

Architecture design?

SWAP design

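Decide how many clusters
  Exploit DP

Decide what to put within each cluster
  Maximize ILP with high functional unit efficiency
  Search design space with "explore" tool

See how it meets time-area-power constraints

[Figure: an array of identical clusters (DP across clusters, ILP within a cluster); the number of clusters and the mix of adders and multipliers in each cluster are marked "?" as the open design choices.]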

Inside a SWAP cluster: EXPLORE

Auto-exploration of adders and multipliers for "ACS"

[Figure: 3-D plot of instruction count (axis 40 to 160) against #Adders (1 to 5) and #Multipliers (1 to 5). Each point is labeled with a pair (Adder FU%, Multiplier FU%), for example (43,58).]

“Explore” tool benefits

Instruction count vs. functional unit efficiency
  What goes inside each cluster

Explore all algorithms -> turn off functional units not in use for a given kernel

Design customized application-specific units
  Better performance with increased FU utilization

Algorithm 1: 3 adders, 3 multipliers, 32 clusters
Algorithm 2: 4 adders, 1 multiplier, 64 clusters

Architecture: 4 adders, 3 multipliers, 64 clusters
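The architecture line follows from the two algorithm lines by taking the maximum of each resource. A minimal sketch of that step, with hypothetical type and function names:

```c
#include <assert.h>

/* Per-cluster functional units plus the number of clusters. */
typedef struct {
    unsigned adders;
    unsigned multipliers;
    unsigned clusters;
} swap_config_t;

static unsigned max_u(unsigned a, unsigned b)
{
    return a > b ? a : b;
}

/* Smallest configuration that covers the needs of both inputs. */
static swap_config_t cover(swap_config_t a, swap_config_t b)
{
    swap_config_t c;
    c.adders      = max_u(a.adders, b.adders);
    c.multipliers = max_u(a.multipliers, b.multipliers);
    c.clusters    = max_u(a.clusters, b.clusters);
    return c;
}

/* Fold `cover` over the needs of every algorithm to be supported. */
static swap_config_t cover_all(const swap_config_t *needs, unsigned n)
{
    assert(needs && n > 0);
    swap_config_t arch = needs[0];
    for (unsigned i = 1; i < n; ++i)
        arch = cover(arch, needs[i]);
    return arch;
}
```

Applied to {3, 3, 32} and {4, 1, 64} this yields {4, 3, 64}; whatever a given algorithm does not need is what gets turned off while it runs.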

Viterbi reconfiguration

Packet 1: Constraint length 7 (16 clusters)

(64 clusters)

(4 clusters)

DP: unused clusters can be turned OFF
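The cluster counts quoted in this talk (4, 16 and 64 clusters for constraint lengths 5, 7 and 9) are consistent with four trellis states per cluster, since a constraint length K code has 2^(K-1) states. The sketch below encodes that reading; the constant `STATES_PER_CLUSTER` is inferred from those numbers, not stated in the slides, and the function names are hypothetical.

```c
#include <assert.h>

enum {
    MAX_CLUSTERS       = 64,
    STATES_PER_CLUSTER = 4    /* inferred from the quoted cluster counts */
};

/* Number of trellis states for constraint length k. */
static unsigned trellis_states(unsigned k)
{
    assert(k >= 1 && k <= 16);
    return 1u << (k - 1);
}

/* Clusters kept active for constraint length k, capped at the maximum. */
static unsigned active_clusters(unsigned k)
{
    unsigned wanted = (trellis_states(k) + STATES_PER_CLUSTER - 1)
                      / STATES_PER_CLUSTER;
    return wanted < MAX_CLUSTERS ? wanted : MAX_CLUSTERS;
}

/* Clusters that can be switched off for constraint length k. */
static unsigned idle_clusters(unsigned k)
{
    return MAX_CLUSTERS - active_clusters(k);
}
```

Under this reading, a packet with K = 5 leaves 60 of the 64 clusters idle, which is where the power saving from reconfiguration would come from.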

Viterbi decoding: rate 1/2 at 128 Kbps = 10 MHz

[Figure: log-log plot of the frequency needed to attain real-time (in MHz, 1 to 1000) against the number of clusters (1 to 100), with curves for K = 9, K = 7 and K = 5. Annotations mark the DSP, the SWAPs and a static architecture.]

Ideal C64x (w/o co-proc) needs ~200 MHz for real-time
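The clock figures on this slide translate directly into a cycle budget per decoded bit. The arithmetic is sketched below, assuming that 128 Kbps means 128,000 bits per second; the function names are hypothetical.

```c
#include <assert.h>

/* Cycles available per decoded bit at a given clock and bit rate. */
static double cycles_per_bit(double clock_hz, double bit_rate_bps)
{
    assert(bit_rate_bps > 0.0);
    return clock_hz / bit_rate_bps;
}

/* Clock needed for real-time, given the cycles spent per decoded bit. */
static double required_clock_hz(double cycles, double bit_rate_bps)
{
    assert(cycles >= 0.0 && bit_rate_bps > 0.0);
    return cycles * bit_rate_bps;
}
```

At 10 MHz and 128 Kbps the budget is about 78 cycles per bit, while ~200 MHz at the same rate corresponds to about 1560 cycles per bit, a factor of 20 between the two operating points.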

SWAPs : Salient features

1-2 orders of magnitude better than a single-processor DSP

Any constraint length: 10 MHz at 128 Kbps

Same code for all constraint lengths
  No need to re-compile or load another code, as long as the parallelism/cluster ratio is constant

Power savings due to dynamic cluster scaling

Expected SWAP power consumption

64 clusters and 1 multiplier per cluster:
  0.13 micron, 1.2 V
  Peak Active Power: ~9 mW at 1 MHz
  Area: ~53.7 mm^2

10 MHz, 128 Kbps with reconfiguration

*Exploring the VLSI Scalability of Stream Processors, Brucek Khailany et al, Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003, Anaheim, California, USA, pp. 153-164

[Figure: power (in mW, 0 to 90) against active clusters (0 to 70, max 64).]

Viterbi     Clusters used   Peak Power
K = 9       64              ~90 mW
K = 7       16              ~28.57 mW
K = 5       4               ~13.8 mW
overhead    0               ~8.1 mW

Flexibility vs. performance

Suitable for mobile devices?
  SWAPs: Real-time at ~10-100 mW
  Maybe; but can we do better?

ASICs: Real-time at ~10-100 uW

No special customization for the application
  No application-specific units
  Generic inter-cluster communication network
  Overhead for extracting parallelism

SWAPs suitable for base-stations?
  Why not? Power is not a primary constraint!

Multiuser Estimation-Detection+Decoding

Real-time target : 128 Kbps per user

[Figure: log-log plot of the frequency needed to attain real-time (in MHz, 10 to 100000) against the number of clusters (1 to 100), with curves labeled FAST, MEDIUM and SLOW. Annotations mark the 32-user base-station, the mobile and the DSP.]

Ideal C64x (w/o co-proc) needs ~15 GHz for real-time

Current research

SWAPs : Completely flexible and general

How do we trade-off flexibility for better performance?

Handset SWAPs (H-SWAPs)

H-SWAPs: Potential advantages

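[Figure: two block diagrams and a bar chart. SWAP diagram: DSP (RE), SWAP (DP), and ASIC/FPGA for real-time performance (task pipelining, dedicated interconnect). H-SWAP diagram: DSP (RE), H-SWAP (partial DP + task pipelining, application-specific units), and ASIC/FPGA for real-time performance (dedicated interconnect). Bar chart: execution time of H-SWAPs versus SWAPs.]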

Conclusions

Need flexible architectures for future wireless devices
  Higher data rates, lower power, more complex algorithms

Design methodology (SWAPs, H-SWAPs, ASICs)
  Flexibility vs. performance trade-offs
  Blurs distinction between ASICs and programmable solutions

Also need parallel, low precision algorithms for efficient mapping

Inter-disciplinary research: computer architecture, VLSI, wireless communications, computer arithmetic, compilers
