
RICE UNIVERSITY

A programmable communications processor for future wireless systems

Sridhar Rajagopal

Scott Rixner, Joseph R. Cavallaro, Behnaam Aazhang

This work has been supported by Nokia, TI, TATP and NSF


Overview of research at Rice

Center for Multimedia Communications
Behnaam Aazhang (wireless communications)
Joseph R. Cavallaro (VLSI signal processing)

http://cmc.rice.edu

Computer Architecture
Scott Rixner (Microprocessor architecture)
Vijay Pai (Simulators, Network Processors)

http://www.cs.rice.edu/CS/Architecture


Motivation

[Figure: wireless mobile device. Blocks: RF unit, A/D and D/A converters, baseband programmable communications processor.]

Mobile: Switch between standards and between parameters

Base-station: varying no. of users with different parameters

Programmability - flexibility is good


Motivation

Processor type   Algorithms      Data rate targets         Constraints
Mobile           W-CDMA, W-LAN   1 Mbps, 100 Mbps/#users   Time, power, area
Base-station     W-CDMA          4 Mbps                    Time, maybe area
Base-station     W-LAN           100 Mbps                  Time, maybe area

[Figure: implementation options GPP, DSP, FPGA and VLSI placed along performance and flexibility axes.]


Lower bounds on + and * for a 500 MHz system

[Figure: estimation, detection and decoding in a W-CDMA multiuser system. x-axis: number of users (0 to 300). y-axis: adders/multipliers required to meet real-time (log scale, 10^0 to 10^3). Curves: Add, Mul. Annotations: slow fading (estimation every 1000 bits), medium fading (estimation every 100 bits), fast fading (estimation every 10 bits), data rates.]
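A lower bound of this kind follows from dividing the required operations per second by the clock frequency. The sketch below shows that arithmetic only; the operation count per bit is a hypothetical placeholder, not a figure from the slides, and the function name `min_units` is introduced here for illustration.

```python
import math

def min_units(ops_per_bit, users, bits_per_second_per_user, clock_hz):
    """Fewest functional units that can sustain the workload, assuming one
    operation per unit per cycle and perfect utilization."""
    ops_per_second = ops_per_bit * users * bits_per_second_per_user
    return math.ceil(ops_per_second / clock_hz)

# Hypothetical workload: 2000 additions per decoded bit (assumed value),
# 32 users at 128 Kbps each, on a 500 MHz clock.
adders = min_units(ops_per_bit=2000, users=32,
                   bits_per_second_per_user=128_000, clock_hz=500_000_000)
print(adders)
```

Because the bound assumes every unit is busy on every cycle, a real schedule would typically need more units than this figure.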


The Problem

Algorithms well understood at data-flow level

Can design real-time systems in VLSI.

Pushing implementation higher in the chain

Current DSPs not powerful enough for our application

Use an architecture simulator to design our own


Proposed solution

[Figure: current solutions that meet real-time requirements (racks of DSPs) contrasted with a programmable processor for 4G wireless systems measuring < x cm by < x cm.]

Future wireless architectures:
x = 2.5 (W-CDMA BS)
x = 2.0 (W-LAN BS)
x = 1.5 (Mobile Handset)


Ph.D. Thesis Outline

[Figure: design-flow diagram. Labels, in the order they appear: Algorithm (in Matlab); Complexity? Parallelize? Fixed point?; Compiler; Operation Count; Parameter-free; Architecture Design; New Algorithm Characteristics (?); Real-Time (Area/Power) Requirements; Architecture Synthesizer; Processor Architecture Parameters (# functional units, # registers, # memory ...); Architecture; Code; New architecture design; Future Work.]


Advantages of this solution

Fast and smooth transition to future standards that simultaneously meets real-time and other constraints

Avoids re-designing the system from scratch

Joint algorithm–architecture hardware-software co-design

Matlab code can be re-used when new standards are being designed.

Tries to account for data rate increases and future algorithm changes


Past research contributions

[Figure: timeline of contributions under the heading "System Design". Platforms: Algorithms, DSP, VLSI, FPGA, IMAGINE. Topics: multiuser channel estimation and multiuser detection; task-partitioning, parallelism and pipelining; conventional arithmetic and on-line arithmetic; architecture innovations, functional unit design and usage. Time labels: distant past, recent past, recent and near future.]


Contents

Motivation

Parallel algorithms for estimation/detection/decoding

The “Imagine” simulator

Performance comparisons and results


Typical workload representation (Base-station)

Wireless LAN: Equalization?, FFT, Viterbi decoding

W-CDMA: Multiuser channel estimation, Multiuser detection, Viterbi decoding

Advanced receiver schemes: Turbo decoding, Multiple antenna systems (MIMO)


Parallel estimation/detection/decoding

Multiuser estimation
• Replaced matrix inversion by gradient descent

Multiuser detection
• Parallel Interference Cancellation (PIC)
• Pipelined algorithm that avoids block-based detection

Viterbi decoding
• Trellis structures suited for decoding
• Register exchange for survivor memory
• No traceback latency
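To illustrate how gradient descent can stand in for a matrix inversion, here is a minimal sketch that solves R x = b iteratively. It is a generic textbook iteration, not the estimation algorithm from the slides; the matrix, step size `mu` and iteration count are assumed values chosen so that the iteration converges.

```python
def gradient_descent_solve(R, b, mu, iterations):
    """Approximate the solution of R x = b using x <- x - mu * (R x - b).

    Converges for symmetric positive definite R when mu is smaller than
    2 divided by the largest eigenvalue of R.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        residual = [sum(R[i][j] * x[j] for j in range(n)) - b[i]
                    for i in range(n)]
        x = [x[i] - mu * residual[i] for i in range(n)]
    return x

# Small assumed example; the exact solution is x = [1/11, 7/11].
R = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = gradient_descent_solve(R, b, mu=0.2, iterations=100)
print(x)
```

Each iteration needs only multiplications and additions, which suits hardware dominated by adders and multipliers better than an explicit inverse.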


Estimation/Detection (64,32 sizes)

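[Equations garbled in extraction; only the following structure is recoverable.]

Multiuser Estimation (Kernels 1, 2, 3): updates of the correlation matrices R_bb and R_br, followed by an update of the channel estimate A that uses A, R_bb and R_br.

Massaging matrices for detection (Kernels 4, 5): the matrices L and C are formed as real parts, Re[.], of products of A_0 and A_1 with their Hermitian transposes; C also contains a diag(.) term.

Multiuser Detection (Kernels 6, 7): soft outputs y_i computed as a real part, Re[.], and refined using L, C and R, with hard decisions d_i = sign(y_i).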
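Parallel interference cancellation, named earlier as the detection method, can be sketched in its generic hard-decision textbook form for a synchronous system. This is not a reconstruction of the slide's equations: the cross-correlation matrix `R`, the amplitudes and the bits below are assumed values, and the function name `pic_detect` is introduced here for illustration.

```python
def sign(v):
    return 1 if v >= 0 else -1

def pic_detect(y, R, amplitudes, stages):
    """Hard-decision PIC: every user subtracts the interference estimated
    from all other users' previous-stage decisions, all in parallel."""
    k = len(y)
    bits = [sign(v) for v in y]            # stage 0: matched-filter decisions
    for _ in range(stages):
        bits = [
            sign(y[u] - sum(R[u][j] * amplitudes[j] * bits[j]
                            for j in range(k) if j != u))
            for u in range(k)
        ]
    return bits

# Assumed noiseless example with one weak user (near-far situation).
R = [[1.0, 0.5, 0.4], [0.5, 1.0, 0.1], [0.4, 0.1, 1.0]]
amplitudes = [0.5, 1.0, 1.0]
true_bits = [1, -1, -1]
y = [sum(R[u][j] * amplitudes[j] * true_bits[j] for j in range(3))
     for u in range(3)]

stage0 = pic_detect(y, R, amplitudes, stages=0)
stage1 = pic_detect(y, R, amplitudes, stages=1)
print(stage0, stage1)
```

In this example the matched filter alone gets the weak user wrong, and one cancellation stage corrects it. Because each user's update depends only on the previous stage, the per-user work maps naturally onto parallel units.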


[Figure: three 16-state trellis diagrams over states X(0) to X(15). Panels: a. Unsuitable Trellis; b. Suitable Trellis; c. Shuffled Suitable Trellis.]

Trellis for rate ½ code with K = 5

Upper bound on parallel clusters for good FU utilization : N/2k

Maximum 8 parallel units for rate ½ with 16 states


Survivor Management in Viterbi

Two techniques
• Traceback: commonly used
• Register exchange

Traceback is good for VLSI architectures
• Drawback: sequential, with additional latency

Register exchange is good for programmable solutions
• Parallel updates
• Packing decoded bits in the register needs to access the entire register
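Register exchange can be illustrated with a small hard-decision Viterbi decoder in which each state carries its own register of decoded bits, so no traceback pass is needed. This is a minimal sketch under stated assumptions: it uses a rate ½ code with K = 3 and generators (7, 5) octal for brevity rather than the K = 5 code on the slide, and the function names are introduced here for illustration.

```python
def branch_output(reg, gens):
    """Parity of the register bits selected by each generator polynomial."""
    return tuple(bin(reg & g).count("1") & 1 for g in gens)

def encode(bits, K=3, gens=(0b111, 0b101)):
    state, out = 0, []
    for b in bits:
        reg = (b << (K - 1)) | state        # newest bit in the top position
        out.append(branch_output(reg, gens))
        state = reg >> 1                    # drop the oldest bit
    return out

def viterbi_register_exchange(symbols, K=3, gens=(0b111, 0b101)):
    n_states = 1 << (K - 1)
    inf = float("inf")
    metric = [0.0] + [inf] * (n_states - 1)  # encoder starts in state 0
    regs = [[] for _ in range(n_states)]      # one survivor register per state
    for rx in symbols:
        new_metric = [inf] * n_states
        new_regs = [None] * n_states
        for state in range(n_states):
            if metric[state] == inf:
                continue
            for b in (0, 1):
                reg = (b << (K - 1)) | state
                nxt = reg >> 1
                dist = sum(e != r for e, r in zip(branch_output(reg, gens), rx))
                cand = metric[state] + dist
                if cand < new_metric[nxt]:    # add-compare-select
                    new_metric[nxt] = cand
                    # Exchange: copy the winner's whole register, append the bit.
                    new_regs[nxt] = regs[state] + [b]
        metric, regs = new_metric, new_regs
    best = min(range(n_states), key=lambda s: metric[s])
    return regs[best]

message = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
received = encode(message)
received[2] = (received[2][0] ^ 1, received[2][1])   # inject one bit error
decoded = viterbi_register_exchange(received)
print(decoded == message)
```

The per-state register copies are independent of one another, which is what allows the updates to run in parallel; the cost is that every update touches a whole register.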






The IMAGINE architecture

[Figure: block diagram of the Imagine stream processor. Components: host processor, stream controller, network interface (to the network), stream register file, micro-controller, ALU clusters 0 to 7, and a streaming memory system connected to four SDRAMs.]


Why IMAGINE simulator?

RSIM, SimpleScalar: GPP simulators

Great for media processing algorithms

Has a VLIW-based cluster: allows DSP comparisons

A good base architecture: 1024-pt FFT

Processor type      Area      Time     Frequency   Power    Energy
Imagine [Float]     2.5 cm²   7.4 µs   500 MHz     3.8 W    28 µJ
TI C6711 [Float]    -         138 µs   150 MHz     1.3 W    180 µJ
TI C6411 [Fixed]    -         40 µs    300 MHz     0.25 W   10 µJ
Virtex II [Fixed]   -         2 µs     125 MHz     <1 W     <2 µJ
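The energy column is the product of the time and power columns, which the short check below reproduces. It assumes the time values are in microseconds, the reading under which the listed time, power and energy figures agree.

```python
# (time in microseconds, power in watts) for the 1024-pt FFT, from the table.
rows = {
    "Imagine [Float]": (7.4, 3.8),
    "TI C6711 [Float]": (138.0, 1.3),
    "TI C6411 [Fixed]": (40.0, 0.25),
}

# microseconds * watts = microjoules
energy_uj = {name: t_us * p_w for name, (t_us, p_w) in rows.items()}

for name, e in energy_uj.items():
    print(f"{name}: {e:.1f} uJ")
```

The products come out at roughly 28, 179 and 10 microjoules, in line with the rounded table entries.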


Simulator knobs that we can turn

Cycle-accurate simulator

Varying number of Functional units and their design

Varying memory, register sizes

Graphical tools to investigate FU utilization, bottlenecks, memory stalls, communication overhead …

Almost anything can be changed, some changes easier than others!


Programming Imagine

Two-level C++ programming

StreamC:

• transfers streams of data between main memory and stream register file (SRF)

KernelC:

• transfers streams from the SRF to the ALU clusters

Code optimized to the number of ALU clusters and the size of the data
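The two-level split can be pictured with a plain Python analogy: a stream-level routine hands batches of elements to the clusters, and a kernel-level routine does the arithmetic on each element. This is only an analogy under assumed names; it does not use StreamC or KernelC syntax, and `kernel_scale_add` and `run_stream` are hypothetical.

```python
NUM_CLUSTERS = 8  # the architecture diagram shows ALU clusters 0 to 7

def kernel_scale_add(a, b, alpha):
    """Kernel level: the arithmetic one cluster performs on one element pair."""
    return alpha * a + b

def run_stream(stream_a, stream_b, alpha):
    """Stream level: walk the input streams in batches of NUM_CLUSTERS,
    so that each cluster handles one element of every batch."""
    out = []
    for start in range(0, len(stream_a), NUM_CLUSTERS):
        batch = zip(stream_a[start:start + NUM_CLUSTERS],
                    stream_b[start:start + NUM_CLUSTERS])
        out.extend(kernel_scale_add(a, b, alpha) for a, b in batch)
    return out

result = run_stream(list(range(10)), [1] * 10, alpha=2)
print(result)
```

A stream whose length is not a multiple of the cluster count leaves some clusters idle in the last batch, which is one reason code is tuned to both the number of clusters and the data size.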






Kernel 2 (mmult) for 3 +, 2 *

[Figure: functional unit schedule over time, with the main loop marked. Legend: communication (waiting for input); FU unavailable (input ready but FU busy).]

Adders have limited FU utilization

O(N^3) *, O(N^3) +

Multipliers 100% in loop

Divider not being utilized

Replace / with *


Kernel 2 (mmult) for 3 +, 3 *

[Figure: functional unit schedule over time.]

Better adder utilization

Needs sufficient registers for scaling (register allocation may fail)

Code may also need slight tuning of variables for optimization


Kernel computational time

Algorithm       Kernel   FU utilization*   Execution time   FU utilization*   Execution time   Performance improvement
                         (3 +, 2 *)        (cycles)         (3 +, 3 *)        (cycles)         (expected: 1.5)
Estimate        1        70%, 100%         1224             78.6%, 78%        1064             1.15
                2        53%, 91%          22720            85%, 99%          14360            1.5822
                3        55%, 42%          1058             55%, 42%          1058             1
                Total                                                         14464
Glue matrices   4        59%, 91%          7468             78%, 84%          5573             1.341
                5        63%, 96%          12192            68%, 71%          11084            1.1
                Total                                                         16657
Detect          6        67%, 100%         366              90%, 89.6%        275              1.33
                7        67%, 96%          996              89%, 84.2%        760              1.31
                Total                                                         1035
Decode          8        70%, 10%          32576                              32576

Time available at 128 Kbps for each of 32 users at 500 MHz : 4000 cycles
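The cycle budget and the improvement column can be checked with a few lines of arithmetic. The sketch below assumes the budget is simply the clock frequency divided by the per-user bit rate, which gives about 3906 cycles, close to the rounded 4000 quoted above; the cycle counts are copied from the table.

```python
CLOCK_HZ = 500e6        # 500 MHz
BIT_RATE = 128e3        # 128 Kbps per user

cycles_per_bit = CLOCK_HZ / BIT_RATE

# kernel number: (cycles with 3 +, 2 *, cycles with 3 +, 3 *)
kernel_cycles = {
    1: (1224, 1064),
    2: (22720, 14360),
    3: (1058, 1058),
    4: (7468, 5573),
    5: (12192, 11084),
    6: (366, 275),
    7: (996, 760),
}

speedup = {k: before / after for k, (before, after) in kernel_cycles.items()}

print(f"cycles per bit period: {cycles_per_bit:.2f}")
for k, s in speedup.items():
    print(f"kernel {k}: improvement {s:.2f}")
```

Only kernel 2 reaches the expected factor of 1.5 from the extra multiplier; the other kernels gain less, and kernel 3 gains nothing.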


[Figure: execution timeline. Labels: kernels (micro-controller executing); memory operations; initialization; idle time between kernels; communication overhead.]


Comparisons with TI C6701 DSPs

[Figure: execution time (in seconds, log scale from 10^-6 to 10^-2) versus number of users (0 to 35). Curves: single DSP implementation; 2 DSP implementation; target data rate of 128 Kbps/user; our architecture based on Imagine. Annotations: "1 DSP", "2 DSPs", "IMAGINE with increasing functional units", "Efficiency = ?".]


Future work

Real-time design is possible with a larger number of functional units, but efficiency is the key

Eliminating communication stalls between kernels

Support for matrix transposes and bit-level operations

Power and area constraints

Scalability with data rates – Boundaries of architecture

Handset algorithms


Conclusions

Various programmable architectures can be investigated and implemented QUICKLY for future systems, depending on the algorithms and on time, area and power constraints.

The insights gained from the design can be applied to DSPs and other processors with constraints on time, area and power.

http://www.ece.rice.edu/~sridhar/

[email protected]
