A programmable communications processor for future wireless systems
DESCRIPTION
A programmable communications processor for future wireless systems. Sridhar Rajagopal, Scott Rixner, Joseph R. Cavallaro, Behnaam Aazhang. This work has been supported by Nokia, TI, TATP and NSF. Overview of research at Rice. Center for Multimedia Communications
TRANSCRIPT
RICE UNIVERSITY
A programmable communications processor for future wireless systems
Sridhar Rajagopal
Scott Rixner, Joseph R. Cavallaro, Behnaam Aazhang
This work has been supported by Nokia, TI, TATP and NSF
Overview of research at Rice
Center for Multimedia Communications
Behnaam Aazhang (wireless communications)
Joseph R. Cavallaro (VLSI signal processing)
http://cmc.rice.edu
Computer Architecture
Scott Rixner (Microprocessor architecture)
Vijay Pai (Simulators, Network Processors)
http://www.cs.rice.edu/CS/Architecture
Motivation
[Figure: wireless mobile device with an RF unit, A/D and D/A converters, and a baseband programmable communications processor]
Mobile: Switch between standards and between parameters
Base-station: varying no. of users with different parameters
Programmability - flexibility is good
Motivation
Processor Type | Algorithms | Data rate targets | Constraints
Mobile | W-CDMA, W-LAN | 1 Mbps, 100 Mbps/#users | Time, power, area
Base-station | W-CDMA | 4 Mbps | Time, maybe area
Base-station | W-LAN | 100 Mbps | Time, maybe area
[Figure: GPP, DSP, FPGA and VLSI placed along a performance versus flexibility axis]
Lower bounds on + and * for a 500 MHz system
[Figure: Estimation, Detection and Decoding in a W-CDMA multiuser system. x-axis: Number of users (0 to 300). y-axis: Adders/Multipliers required to meet real-time (log scale, 10^0 to 10^3). Curves: Add, Mul.]
SLOW FADING (estimation every 1000 bits)
MEDIUM FADING (estimation every 100 bits)
FAST FADING (estimation every 10 bits)
DATA RATES
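A lower bound of this kind can be sketched by dividing the required operation rate by the clock rate. The sketch below assumes one operation per functional unit per cycle; the function name and the per-bit operation count are hypothetical placeholders, not the values behind the figure.

```python
import math

def min_functional_units(ops_per_bit, bits_per_second, clock_hz=500e6):
    """Lower bound on units of one type, assuming one operation per unit per cycle."""
    ops_per_second = ops_per_bit * bits_per_second
    return math.ceil(ops_per_second / clock_hz)

# Hypothetical workload: 32 users at 128 Kbps each, 2000 additions per decoded bit.
aggregate_rate = 32 * 128e3
adders = min_functional_units(ops_per_bit=2000, bits_per_second=aggregate_rate)
print(adders)
```

More frequent estimation (faster fading) raises the operations needed per bit, which is why the fast-fading curves would sit higher under this model.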
The Problem
Algorithms well understood at data-flow level
Can design real-time systems in VLSI.
Pushing implementation higher in the chain
Current DSPs not powerful enough for our application
Use an architecture simulator to design our own
Proposed solution
Current solutions to meet real-time (racks of DSPs)
Programmable processor for 4G wireless systems (figure dimensions: < x cm, < x cm)
Future wireless architectures
x = 2.5 (W-CDMA BS)
x = 2.0 (W-LAN BS)
x = 1.5 (Mobile Handset)
Ph.D. Thesis Outline
[Flowchart; box labels:]
Algorithm (in Matlab)
Complexity? Parallelize? Fixed point?
Compiler
Operation Count
Parameter-free
Architecture Design
New Algorithm Characteristics?
Real-Time (Area/Power) Requirements
Architecture Synthesizer
Processor Architecture Parameters (# Functional units, # registers, # memory ....)
Architecture Code
New architecture design
Future Work
Advantages of this solution
Fast and smooth transition to future standards that simultaneously meets real-time and other constraints
Avoids re-designing the system from scratch
Joint algorithm–architecture hardware-software co-design
Matlab code can be re-used when new standards are being designed.
Tries to account for data rate increases and future algorithm changes
Past research contributions
[Figure: past contributions arranged over time, with an axis labeled System Design]
Areas: Algorithms, DSP, VLSI, FPGA, IMAGINE
Topics: Multiuser channel estimation, Multiuser detection; Task-partitioning, Parallelism, Pipelining; Conventional arithmetic, On-line arithmetic; Architecture innovations, Functional unit design and usage
Time periods: Distant Past, Recent Past, Recent and Near Future
Contents
Motivation
Parallel algorithms for estimation/detection/decoding
The “Imagine” simulator
Performance comparisons and results
Typical workload representation (Base-station)
Wireless LAN: Equalization?, FFT, Viterbi decoding
W-CDMA: Multiuser channel estimation, Multiuser detection, Viterbi decoding
Advanced receiver schemes: Turbo decoding, Multiple antenna systems (MIMO)
Parallel estimation/detection/decoding
Multiuser estimation
• Replaced matrix inversion by gradient descent
Multiuser detection
• Parallel Interference Cancellation (PIC)
• Pipelined algorithm that avoids block-based detection
Viterbi decoding
• Trellis structures suited for decoding
• Register exchange for survivor memory
• No traceback latency
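The idea of replacing a matrix inversion with gradient descent can be sketched as follows. This is a minimal real-valued illustration, not the presented implementation: the helper names, the step size mu, the iteration count and the small matrices are all assumptions, and R_br is laid out here so that A times R_bb has the same shape as R_br.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def gradient_descent_solve(R_bb, R_br, mu=0.2, iterations=200):
    """Approximate the A that satisfies A * R_bb = R_br without inverting R_bb."""
    A = [[0.0] * len(R_br[0]) for _ in R_br]
    for _ in range(iterations):
        product = matmul(A, R_bb)
        # Update rule: A <- A - mu * (A * R_bb - R_br)
        A = [[a - mu * (p - target) for a, p, target in zip(a_row, p_row, t_row)]
             for a_row, p_row, t_row in zip(A, product, R_br)]
    return A

# Hypothetical 2x2 autocorrelation matrix and a known solution to recover.
R_bb = [[2.0, 0.5], [0.5, 1.0]]
A_true = [[1.0, -2.0]]
R_br = matmul(A_true, R_bb)
A_est = gradient_descent_solve(R_bb, R_br)
print(A_est)
```

Each iteration uses only multiplications and additions, which map directly onto adders and multipliers; the price is that the step size must be small enough for the iteration to converge.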
Estimation/Detection (64,32 sizes)
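Multiuser estimation (Kernels 1, 2, 3):
\[ R_{bb} = R_{bb} - b_0 b_0^{T} + b_L b_L^{T} \]
\[ R_{br} = R_{br} - b_0 r_0^{H} + b_L r_L^{H} \]
\[ A = A - \mu \, (A R_{bb} - R_{br}) \]
Massaging matrices for detection (Kernels 4, 5):
\[ L = \mathrm{Re}[A_1^{H} A_0], \qquad R = L^{H} \]
\[ C = \mathrm{Re}[A_0^{H} A_0 + A_1^{H} A_1 - \mathrm{diag}(A_0^{H} A_0 + A_1^{H} A_1)] \]
Multiuser detection (Kernels 6, 7):
\[ y_i = \mathrm{Re}[A_0^{H} x_i + A_1^{H} x_{i-1}], \qquad d_i = \mathrm{sign}(y_i) \]
\[ y_i = y_i - L x_{i-1} - C x_i - R x_{i+1}, \qquad d_i = \mathrm{sign}(y_i) \]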
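One stage of parallel interference cancellation, as used in the multiuser detection kernels, might look like the sketch below. It is real-valued and makes several assumptions: d denotes the hard decisions from the previous stage, the future-symbol term uses the transpose of L, and the function names and the two-user numbers are hypothetical.

```python
def sign(value):
    return 1 if value >= 0 else -1

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def pic_stage(y, d, L, C):
    """Subtract the interference estimated from past, current and future decisions.

    y: matched-filter outputs, one list of K soft values per symbol time
    d: previous-stage hard decisions, same shape as y
    """
    K = len(y[0])
    zeros = [0] * K
    L_t = transpose(L)
    refined = []
    for i in range(len(y)):
        past = matvec(L, d[i - 1] if i > 0 else zeros)
        present = matvec(C, d[i])
        future = matvec(L_t, d[i + 1] if i + 1 < len(y) else zeros)
        refined.append([y[i][k] - past[k] - present[k] - future[k] for k in range(K)])
    return refined, [[sign(v) for v in row] for row in refined]

# Hypothetical case: a strong user masks a weak one (true bits are +1, -1).
y = [[3.2, 0.55]]                      # matched filter wrongly suggests +1, +1
d0 = [[sign(v) for v in row] for row in y]
L = [[0.0, 0.0], [0.0, 0.0]]           # no inter-symbol term in this toy case
C = [[0.0, 0.8], [0.8, 0.0]]           # cross-correlation with the diagonal removed
soft, d1 = pic_stage(y, d0, L, C)
print(d1)
```

Every user's update depends only on the previous stage's decisions, so all users can be updated in parallel, and successive stages can be pipelined across symbol times.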
[Figure: three 16-state trellis diagrams over states X(0) to X(15): a. Unsuitable Trellis, b. Suitable Trellis, c. Shuffled Suitable Trellis]
Trellis for rate ½ code with K = 5
Upper bound on parallel clusters for good FU utilization : N/2k
Maximum 8 parallel units for rate ½ with 16 states
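The figure of 8 parallel units for a 16-state, rate ½ trellis can be illustrated by grouping states into independent add-compare-select butterflies. This sketch assumes a shift-register state convention in which the new input bit enters at the least significant bit; the function name is hypothetical.

```python
def butterflies(num_states):
    """Group a rate-1/2 shift-register trellis into independent butterflies.

    Assumes next_state = ((state << 1) | bit) mod num_states, so states j and
    j + num_states/2 both lead to states 2j and 2j + 1 and to no others.
    """
    half = num_states // 2
    groups = []
    for j in range(half):
        predecessors = (j, j + half)
        successors = (2 * j, 2 * j + 1)
        groups.append((predecessors, successors))
    return groups

groups = butterflies(16)  # K = 5 gives 2^(K-1) = 16 states
print(len(groups))
for predecessors, successors in groups:
    print(predecessors, "->", successors)
```

Each butterfly reads two path metrics and writes two, with no dependence on the other butterflies, so at most one cluster per butterfly can be kept busy.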
Survivor Management in Viterbi
Two techniques
• Traceback: commonly used
• Register exchange
Traceback is good for VLSI architectures
• Drawback: sequential and additional latency
Register exchange is good for programmable solutions
• Parallel updates
• Packing decoded bits in the register needs to access the entire register
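Register exchange can be sketched with a small hard-decision Viterbi decoder in which every state carries its own register of decoded bits. To keep the example short it assumes a K = 3, rate ½ code with generators 7 and 5 (octal) rather than the K = 5 code of the slides; all names are hypothetical.

```python
K = 3                          # constraint length (assumed; the slides use K = 5)
GENERATORS = (0b111, 0b101)    # assumed rate-1/2 generator polynomials
NUM_STATES = 1 << (K - 1)

def parity(value):
    return bin(value).count("1") & 1

def step(state, bit):
    """Return (next_state, output_pair) for one input bit."""
    register = (bit << (K - 1)) | state
    output = tuple(parity(register & g) for g in GENERATORS)
    return register >> 1, output

def encode(bits):
    state, coded = 0, []
    for bit in bits:
        state, output = step(state, bit)
        coded.extend(output)
    return coded

def decode_register_exchange(received):
    infinity = float("inf")
    metrics = [0] + [infinity] * (NUM_STATES - 1)
    registers = [[] for _ in range(NUM_STATES)]
    for t in range(0, len(received), 2):
        symbol = received[t:t + 2]
        new_metrics = [infinity] * NUM_STATES
        new_registers = [[] for _ in range(NUM_STATES)]
        for state in range(NUM_STATES):
            if metrics[state] == infinity:
                continue
            for bit in (0, 1):
                target, output = step(state, bit)
                cost = metrics[state] + sum(a != b for a, b in zip(output, symbol))
                if cost < new_metrics[target]:
                    new_metrics[target] = cost
                    # Register exchange: copy the survivor's bits, append the new bit.
                    new_registers[target] = registers[state] + [bit]
        metrics, registers = new_metrics, new_registers
    best = min(range(NUM_STATES), key=metrics.__getitem__)
    return registers[best]  # decoded bits are already in order, no traceback pass

message = [1, 0, 1, 1, 0, 0]
coded = encode(message)
coded[3] ^= 1  # inject a single channel error
print(decode_register_exchange(coded) == message)
```

The copy of a whole register on every update is the cost noted above; in exchange, all states update in parallel and the output needs no sequential traceback.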
Contents
Motivation
Parallel algorithms for estimation/detection/decoding
The “Imagine” simulator
Performance comparisons and results
The IMAGINE architecture
[Figure: block diagram of the Imagine Stream Processor: host processor, stream controller, network interface (to network), stream register file, micro-controller, ALU clusters 0 to 7, and a streaming memory system connected to four SDRAMs]
Why IMAGINE simulator?
RSIM, SimpleScalar: GPP simulators
Great for media processing algorithms
Has a VLIW-based cluster -- DSP comparisons
A good base architecture: 1024-pt FFT
Processor Type | Area | Time | Frequency | Power | Energy
Imagine [Float] | 2.5 cm² | 7.4 µs | 500 MHz | 3.8 W | 28 µJ
TI C6711 [Float] | - | 138 µs | 150 MHz | 1.3 W | 180 µJ
TI C6411 [Fixed] | - | 40 µs | 300 MHz | 0.25 W | 10 µJ
Virtex II [Fixed] | - | 2 µs | 125 MHz | <1 W | <2 µJ
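The energy column of the 1024-pt FFT comparison can be cross-checked as power times execution time. The sketch assumes the time and energy columns are in microseconds and microjoules, the reading under which the listed rows agree with energy = power × time; the dictionary and function names are hypothetical.

```python
# (power in watts, time in microseconds, listed energy in microjoules)
fft_1024 = {
    "Imagine [Float]": (3.8, 7.4, 28),
    "TI C6711 [Float]": (1.3, 138, 180),
    "TI C6411 [Fixed]": (0.25, 40, 10),
}

def energy_uj(power_w, time_us):
    """Energy in microjoules: watts multiplied by microseconds."""
    return power_w * time_us

for name, (power_w, time_us, listed_uj) in fft_1024.items():
    computed = energy_uj(power_w, time_us)
    print(f"{name}: {computed:.1f} uJ computed, {listed_uj} uJ listed")
```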
Simulator knobs that we can turn
Cycle-accurate simulator
Varying number of Functional units and their design
Varying memory, register sizes
Graphical tools to investigate FU utilization, bottlenecks, memory stalls, communication overhead …
Almost anything can be changed, some changes easier than others!
Programming Imagine
2 level C++ programming
StreamC:
• transfers streams of data between main memory and stream register file (SRF)
KernelC:
• transfers streams from the SRF to the ALU clusters
Code optimized to the number of ALU clusters and the size of the data
Contents
Motivation
Parallel algorithms for estimation/detection/decoding
The “Imagine” simulator
Performance comparisons and results
Kernel 2 (mmult) for 3 +,2*
Adders have limited FU utilization
O(N^3) *, O(N^3) +
Multipliers 100% in loop
Divider not being utilized
Replace / with *
[Figure: kernel schedule over time with the loop marked. Legend: Communication (waiting for input); FU unavailable (input ready but FU busy)]
Kernel 2 (mmult) for 3 +, 3 *
better adder utilization
needs sufficient registers for scaling [register allocation may fail]
code may also need slight tuning of variables for optimization
[Figure: kernel schedule over time]
Kernel computational time
Algorithm | Kernel | Functional unit utilization* (3 +, 2 *) | Execution time (cycles) | Functional unit utilization* (3 +, 3 *) | Execution time (cycles) | Performance improvement (expected: 1.5)
Estimate | 1 | 70%, 100% | 1224 | 78.6%, 78% | 1064 | 1.15
Estimate | 2 | 53%, 91% | 22720 | 85%, 99% | 14360 | 1.5822
Estimate | 3 | 55%, 42% | 1058 | 55%, 42% | 1058 | 1
Estimate | Total | | | | 14464 |
Glue matrices | 4 | 59%, 91% | 7468 | 78%, 84% | 5573 | 1.341
Glue matrices | 5 | 63%, 96% | 12192 | 68%, 71% | 11084 | 1.1
Glue matrices | Total | | | | 16657 |
Detect | 6 | 67%, 100% | 366 | 90%, 89.6% | 275 | 1.33
Detect | 7 | 67%, 96% | 996 | 89%, 84.2% | 760 | 1.31
Detect | Total | | | | 1035 |
Decode | 8 | 70%, 10% | 32576 | | 32576 |
Time available at 128 Kbps for each of 32 users at 500 MHz : 4000 cycles
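The cycle budget and the improvement ratios can be reproduced with a few lines of arithmetic. This is a sketch using the cycle counts from the kernel table; the variable names are hypothetical, and comparing the detection total against the per-bit budget is an illustrative reading rather than a claim from the slides.

```python
CLOCK_HZ = 500e6   # processor clock
BIT_RATE = 128e3   # target data rate per user

# Cycles that elapse during one bit period (rounded to 4000 on the slide).
cycles_per_bit = CLOCK_HZ / BIT_RATE

# Kernel number -> (cycles with 3 adders, 2 multipliers; cycles with 3 adders, 3 multipliers)
kernel_cycles = {
    1: (1224, 1064),
    2: (22720, 14360),
    4: (7468, 5573),
    5: (12192, 11084),
    6: (366, 275),
    7: (996, 760),
}
speedups = {k: before / after for k, (before, after) in kernel_cycles.items()}

detect_total = kernel_cycles[6][1] + kernel_cycles[7][1]
print(cycles_per_bit)
print({k: round(v, 2) for k, v in speedups.items()})
print(detect_total, detect_total <= cycles_per_bit)
```

Only kernel 2, the matrix multiplication, reaches the 1.5 improvement that adding a third multiplier would ideally give; the other kernels are limited by something other than multiplier count.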
Communication overhead
[Figure: execution timeline showing initialization, memory operations, kernels (micro-controller executing), and idle time between kernels]
Comparisons with TI C6701 DSPs
[Figure: execution time (in seconds, log scale, 10^-6 to 10^-2) versus number of users (0 to 35). Curves: single DSP implementation, 2 DSP implementation, target data rate (128 Kbps/user), our architecture based on Imagine. Annotations: 1 DSP, 2 DSPs, IMAGINE with increasing functional units, Efficiency = ?]
Future work
Real-time design possible with larger number of functional units, but efficiency is the key
Eliminating communication stalls between kernels
Support for matrix transposes and bit-level operations
Power and area constraints
Scalability with data rates – Boundaries of architecture
Handset algorithms
Conclusions
Various programmable architectures can be investigated and implemented QUICKLY for future systems, depending on algorithms and on time, area and power constraints
The insights gained from the design can be applied to DSPs and other processors with constraints on time, area and power.
http://www.ece.rice.edu/~sridhar/
[email protected]