Wireless Communication Extensions for DSPs and
General Purpose Processors
Sridhar Rajagopal
COMP 625
April 17, 2000
Motivation
Wireless, the next wave after Multimedia
Highly Compute-Intensive Algorithms
Real-Time Requirements
Design based on Time-to-Market
Outline
Processor Core with Reconfigurable Support
Permutation Based Interleaved Memory
Processor Architecture - EPIC
Instruction Set Extensions
Truncated Multipliers
Software Support Needed
Characteristics of Wireless Algorithms
Massive Parallelism
Bit-level Computations
Matrix Based Operations
Memory Intensive
Complex-valued Data
Approximate Computations
What’s wrong with Current Architectures for these applications?
Problems with Current Architectures
UltraSPARC, C6x, MMX, IA-64
Not enough MIPS/FLOPS
Unable to fully exploit parallelism
Bit Level Computations
Memory Bottlenecks
Specialized Instructions for Wireless Communications
Why Reconfigurable
Adapt algorithms to environment
Seamless and Continuous Data Processing during Handoffs
[Figure: three environments: Home Area Wireless LAN, High Speed Office Wireless LAN, Outdoor CDMA Cellular Network]
Reconfigurable Support
OSI Layers 3-7: User Interface, Translation, Synchronization, Transport, Network
OSI Layer 2: Data Link Layer (converts frames to bits)
OSI Layer 1: Physical Layer (hardware; raw bit stream)
Different Protocols
[Figure: block diagram with the blocks Source Coding, Channel Coding, Channel Decoding, Source Decoding, Multiuser Detection, Channel Estimation]
MPEG-4, H.723 - Voice, Multimedia
Convolutional, Turbo - Channel Coding
A New Architecture
[Figure: Processor Core (GPP/DSP), Cache, two queues (Q), Crossbar, Main Memory, and Reconfigurable Logic fed by a Real-Time I/O Bit Stream from the RF Unit; labels: Processor, Add-on PCMCIA Network Interface Card]
Why Reconfigurable
Process initial bit level computations
Optimize for fast I/O transfer
[Figure: RF Unit, Real-Time I/O Bit Stream, Reconfigurable Logic]
Reconfigurable Support
GARP Architecture at UC Berkeley
Configuration Caches
2 64-bit data buses, 1 64-bit address bus
Control Blocks
Sequencer
Boolean values
64-bit Datapath
Fast I/O
Reconfigurable Support
Wide Path to Memory
– Data Transfer
– Minimize Load Times
Configuration Caches
– Recently Displaced Configurations (5 cycles)
– Can hold 4 full size Configurations
Independent Execution
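The configuration cache behaviour listed on this slide can be sketched as a small Python model. This is only an illustration: the 4 slots and the 5-cycle cost for a cached configuration come from the slide, while the least-recently-used replacement policy, the `miss_cycles` value, and all class and method names are assumptions made for the example.

```python
from collections import OrderedDict


class ConfigCache:
    """Toy model of a configuration cache (names and policy are hypothetical)."""

    def __init__(self, slots: int = 4, hit_cycles: int = 5, miss_cycles: int = 1000):
        self.slots = slots              # slide: holds 4 full size configurations
        self.hit_cycles = hit_cycles    # slide: 5 cycles for a cached configuration
        self.miss_cycles = miss_cycles  # assumption: cost of loading from memory
        self._resident = OrderedDict()  # configuration name -> True, in LRU order

    def load(self, name: str) -> int:
        """Make `name` the active configuration and return the cycles spent."""
        if name in self._resident:
            self._resident.move_to_end(name)    # mark as most recently used
            return self.hit_cycles
        if len(self._resident) >= self.slots:
            self._resident.popitem(last=False)  # evict least recently used (assumed)
        self._resident[name] = True
        return self.miss_cycles


if __name__ == "__main__":
    cache = ConfigCache()
    for kernel in ["viterbi", "correlate", "viterbi", "despread"]:
        print(kernel, cache.load(kernel))
```

In this model a kernel that alternates between a few configurations pays the memory load cost only once per configuration, which is the effect the slide attributes to the cache.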
Reconfigurable Support
Access to same Memory System as Processor
– Minimize overhead
When idle
– Load Configurations
– Transfer Data
Operation
Load Configuration
– If in configuration cache, minimal time
Copy initial data with coprocessor move instructions
Start execution
Issue wait that interlocks while active
Copy registers back at kernel completion
Memory Interface
Access to Main Memory and L1 Data Cache
– Large, fast Memory Store
Memory Prefetch Queues for Sequential Accesses
– Read aheads and Write Behinds
[Figure: Processor Core (GPP/DSP), Instruction Cache, L1 Data Cache, two queues (Q), Crossbar, FPGA, Main Memory]
Permutation Based Interleaved Memory (PBI)
High Memory Bandwidth Needed
Stride-Insensitive Memory System for Matrices
Multiple Banks
Sustained Peak Throughput (95%)
[Figure: L1 Data Cache and Main Memory]
PBI Scheme
N = address length (bits)
M = 2^n Banks
2^(N-n) words in each bank
To access a word,
– n-bit bank number
– N-n bit address (high-order)
Calculation of the n-bit Bank Number
Calculate Bank Number
Use all N bits of the address X to get the n-bit vector Y = A X, where A is an n x N matrix of 0's and 1's (arithmetic mod 2)
Y = A_h X_h + A_l X_l, with X split into N-n high-order and n low-order bits [A_l has rank n]
N-bit parity circuit with log_k N levels of XOR gates (fan-in k)
[Figure: the N-bit address feeds n parity circuits, one per row of A (Row 0 to Row n-1); the n parity bit signals enter a decoder that produces 2^n bank select signals]
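The bank-number calculation Y = A X over 0/1 values can be sketched in Python as follows. This is a minimal illustration, not the design from the slides: the sizes N = 6 and n = 2, the particular rows in `EXAMPLE_ROWS`, and all function names are assumptions chosen so that the low-order 2 x 2 part of A is the identity (rank 2).

```python
from typing import List


def parity(x: int) -> int:
    """XOR of all bits of x, i.e. what one parity circuit computes."""
    return bin(x).count("1") & 1


def bank_number(address: int, rows: List[int]) -> int:
    """Compute Y = A X (mod 2); rows[i] is row i of A stored as a bit mask."""
    bank = 0
    for i, row in enumerate(rows):
        bank |= parity(address & row) << i
    return bank


def low_order_bank(address: int, n: int = 2) -> int:
    """Conventional interleaving for comparison: bank = low n address bits."""
    return address & ((1 << n) - 1)


# Hypothetical A for N = 6, n = 2 (4 banks):
# row 0 XORs address bits 0, 2, 4 and row 1 XORs address bits 1, 3, 5.
EXAMPLE_ROWS = [0b010101, 0b101010]

if __name__ == "__main__":
    stride4 = [0, 4, 8, 12]
    print([bank_number(a, EXAMPLE_ROWS) for a in stride4])  # [0, 1, 2, 3]
    print([low_order_bank(a) for a in stride4])             # [0, 0, 0, 0]
```

With these example rows, a stride-4 access pattern that would hit a single bank under low-order interleaving is spread over all four banks, while consecutive addresses still map to distinct banks because the low-order part of A has full rank.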
Interleaved Memory Model
[Figure: Address Source, Input Buffers, Memory Banks M(0), M(1), ..., M(M-1), Output Buffers, Data Sequencer, Data Sink]
Processor Core
64-bit EPIC Architecture with Extensions (IA-64/C6x)
Statically determined Parallelism; exploit ILP
Execution Time Predictability
[Figure: Processor Core (GPP/DSP), Cache, two queues (Q), Crossbar, FPGA]
EPIC Principle
Explicitly Parallel Instruction Computing
Evolution of VLIW Computing
Compiler- Key role
Architecture to assist Compiler
Better cope with dynamic factors
– which limited VLIW Parallelism
Aspects of EPIC
Designing Plan of Execution (POE) at Compile Time
Permitting Compiler to play Statistics
– Conditional Branches, Memory references
Communicating POE to the hardware
– Static Scheduling
– Branch information
Architecture Features in EPIC
Static Scheduling
– MultiOP
– Non-Unit Assumed Latency (NUAL)
The Branch Problem
– Predicated Execution
– Control Speculation
– Predicated Code Motion
The Memory Problem
– Cache Specifiers
– Data Speculation
Instruction Set Extensions
To accelerate Bit level computations in Wireless
Real/Complex Integer - Bit Multiplications
– Used in Multiuser Detection, Decoding
Bit - Bit Multiplications
– Used in Outer Product Updates
– Correlation, Channel Estimation
Complex Integer-Integer Multiplications
Useful in other Signal Processing applications
– Speech, Video, ...
Architecture Support
Support via Instruction Set Extensions
Minimal ALU Modifications necessary
Transparent to Register Files/Memory
Additional 8-bit Special Purpose Registers
Integer - Bit Multiplications
D[i] = D[i] + b[j]*C[j]
Eg: Cross-Correlation
Register Renaming?
[Figure: 64-bit Register A, 64-bit Register C, 8-bit Register b, a row of +/- units, and 64-bit Register D]
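One way to read the integer-bit multiplication on this slide is as a lane-wise add or subtract: each bit of the 8-bit register decides whether a lane of C is added to or subtracted from the matching lane of D. The Python sketch below works under several assumptions that are not stated on the slide: the 64-bit registers are treated as 8 lanes of 8 bits, a bit value of 1 stands for +1 and 0 for -1, lanes wrap around modulo 256, and every function name is hypothetical.

```python
LANES = 8       # assumption: 64-bit register split into 8 lanes
LANE_BITS = 8   # assumption: 8 bits per lane
MASK = (1 << LANE_BITS) - 1


def pack(lanes):
    """Pack 8 small integers into one 64-bit register value (lane 0 lowest)."""
    reg = 0
    for i, value in enumerate(lanes):
        reg |= (value & MASK) << (LANE_BITS * i)
    return reg


def unpack(reg):
    """Split a 64-bit register value into 8 unsigned lane values."""
    return [(reg >> (LANE_BITS * i)) & MASK for i in range(LANES)]


def to_signed(value):
    """Interpret an 8-bit lane value as a two's-complement integer."""
    return value - (1 << LANE_BITS) if value & (1 << (LANE_BITS - 1)) else value


def signed_lanes(reg):
    return [to_signed(v) for v in unpack(reg)]


def int_bit_mac(d_reg, c_reg, b):
    """Lane-wise D[i] += s[i] * C[i], where s[i] = +1 if bit i of b is set, else -1."""
    d, c = unpack(d_reg), unpack(c_reg)
    out = []
    for i in range(LANES):
        sign = 1 if (b >> i) & 1 else -1
        out.append(d[i] + sign * to_signed(c[i]))
    return pack(out)  # pack() wraps each lane back to 8 bits


if __name__ == "__main__":
    c = pack([1, 2, 3, 4, 5, 6, 7, 8])
    d = int_bit_mac(0, c, 0b00001111)
    print(signed_lanes(d))  # [1, 2, 3, 4, -5, -6, -7, -8]
```

No multiplier is involved: the "multiplication" by a bit reduces to selecting add or subtract per lane, which is why the slides describe the required ALU change as small.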
8-bit to 64-bit conversions
D = D + b*b^T
Eg: Auto-Correlation
b1 = b(1:8), b(1:8), ..., b(1:8) (the 8-bit vector repeated 8 times)
b2 = b(1)...b(1), b(2)...b(2), ..., b(8)...b(8) (each bit repeated 8 times)
[Figure: 8-bit Register b expanded into 64-bit Register A; steps labelled 1.1, 1.2 and 2.1]
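The two 8-bit to 64-bit expansions can be sketched in Python as below. The reading of b1 as the whole vector repeated 8 times and b2 as each bit repeated 8 times is an interpretation of the slide, and the function names and the choice of bit 0 as b(1) are assumptions for the example.

```python
def repeat_vector(b: int) -> int:
    """b1: the 8-bit vector b copied into each of the 8 bytes of a 64-bit word."""
    out = 0
    for k in range(8):
        out |= (b & 0xFF) << (8 * k)
    return out


def repeat_bits(b: int) -> int:
    """b2: bit i of b replicated across all 8 bits of byte i of a 64-bit word."""
    out = 0
    for i in range(8):
        if (b >> i) & 1:
            out |= 0xFF << (8 * i)
    return out


if __name__ == "__main__":
    print(hex(repeat_vector(0xA5)))       # 0xa5a5a5a5a5a5a5a5
    print(hex(repeat_bits(0b00000101)))   # 0xff00ff
```

After the expansion, bit position 8*i + j holds b(j) in b1 and b(i) in b2, so a single 64-bit bitwise operation on the two words touches every (i, j) pair of the outer product b*b^T at once.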
Bit-Bit Multiplications
D = D + b*b^T
Eg: Auto-Correlation
64-bit Register A = b1
64-bit Register B = b2
Ex-NOR
64-bit Register C = b1*b2

B1  B2  B1*B2
0   0   1
0   1   0
1   0   0
1   1   1
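The truth table matches ordinary multiplication if a bit value of 0 is read as -1 and 1 as +1; that encoding is an assumption here, since the slide does not state it. A short Python check of this reading, with hypothetical function names:

```python
MASK64 = (1 << 64) - 1


def xnor64(a: int, b: int) -> int:
    """Bitwise Ex-NOR of two 64-bit register values."""
    return ~(a ^ b) & MASK64


def to_pm1(bit: int) -> int:
    """Assumed encoding: bit 1 stands for +1, bit 0 stands for -1."""
    return 1 if bit else -1


def bit_product(x: int, y: int) -> int:
    """Product of two single bits under the +/-1 encoding, returned as a bit."""
    return xnor64(x, y) & 1


if __name__ == "__main__":
    for x in (0, 1):
        for y in (0, 1):
            # Ex-NOR of the bits equals the product of the +/-1 values.
            assert to_pm1(bit_product(x, y)) == to_pm1(x) * to_pm1(y)
            print(x, y, bit_product(x, y))
    print(hex(xnor64(0xFF00, 0x0FF0)))  # 0xffffffffffff0f0f
```

Because the operation is bitwise, one Ex-NOR on two 64-bit registers yields 64 bit-by-bit products in a single step.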
Increment/Decrement
D = D + b*b^T
Eg: Auto-Correlation
[Figure: 64-bit Register D, 8-bit Register b1*b2, a row of +/- units with the constant 1, and the resulting 64-bit Register (D+b1*b2)]
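The increment/decrement step can be sketched in Python as a rank-one update of an 8 x 8 table of counters, one row (8 product bits) at a time. This is a sketch under assumptions not given on the slide: a product bit of 1 means "add 1" and 0 means "subtract 1", counters are plain Python integers rather than packed lanes, and the function names are made up for the example.

```python
def inc_dec(counters, bits8):
    """Add +1 to counter j if bit j of bits8 is set, otherwise add -1."""
    return [v + (1 if (bits8 >> j) & 1 else -1) for j, v in enumerate(counters)]


def autocorr_update(D, b):
    """One update D = D + b*b^T for an 8-bit vector b (bit 1 = +1, bit 0 = -1)."""
    updated = []
    for i in range(8):
        bi = 0xFF if (b >> i) & 1 else 0x00   # bit i of b repeated 8 times
        product_bits = ~(bi ^ b) & 0xFF       # Ex-NOR gives row i of b*b^T as bits
        updated.append(inc_dec(D[i], product_bits))
    return updated


if __name__ == "__main__":
    D = [[0] * 8 for _ in range(8)]
    for b in (0b10110010, 0b10110010, 0b01001101):
        D = autocorr_update(D, b)
    print(D[0])
```

Each row update uses only an Ex-NOR followed by eight unit increments or decrements, so the accumulation needs no multiplier.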
Complex-valued Data Processing
Is it easy to add?
Is this worth additional ALU Support?
Typically supported by Software!
Truncated Multipliers
Many applications need approximate computations
Adaptive Algorithms: Y = Y + mu*(Y*C)
Truncate lower bits
Truncated Multipliers - half the area/half the delay
Can do 2 truncated multiplies in parallel with regular
[Figure: ALU Multipliers: Multiplier 1, Multiplier 2, Truncated Multiplier]
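The idea of truncating the lower bits can be illustrated with a Python model of an unsigned array multiplier that simply never forms the partial-product bits in the low-order columns. This is a sketch, not the multiplier design from the slides: the operand width `n`, the cut-off column `k`, and the function names are assumptions, and practical truncated multipliers often add a correction constant that is left out here.

```python
def truncated_multiply(a: int, b: int, n: int = 8, k: int = 8) -> int:
    """Sum only the partial-product bits a_j * b_i that fall in columns i + j >= k."""
    total = 0
    for i in range(n):
        if not (b >> i) & 1:
            continue
        for j in range(n):
            if (a >> j) & 1 and i + j >= k:
                total += 1 << (i + j)
    return total


def max_truncation_error(k: int) -> int:
    """Largest possible dropped value: sum over columns c < k of (c + 1) * 2**c."""
    return (k - 1) * (1 << k) + 1 if k > 0 else 0


if __name__ == "__main__":
    a, b = 200, 123
    exact = a * b
    approx = truncated_multiply(a, b)
    print(exact, approx, exact - approx)
    print(max_truncation_error(8))  # 1793
```

With n = 8 and k = 8 roughly half of the 64 partial-product bits are never generated, which is consistent with the "half the area" figure quoted on the slide, at the price of a bounded error in the low-order part of the result.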
Software Support
Greater Interaction between Compilers and Architectures
– EPIC
– Reconfigurable Logic
Compiler needs to find and exploit bit level computations
Reconfigurable Logic Programming
Area Estimates
Area increase of 20% over an IA-64 architecture size due to Reconfigurable Support
Instruction Set Extensions need minimal hardware support
Parallel Interleaved Memory Banks will need larger area
Other Uses
Reconfigurable Logic
– For accelerating loops of general purpose processors
Bit Level Support
– For other voice, video and multimedia applications
Conclusions
Processor Core with Reconfigurable Support developed for Wireless Applications
Instruction Set Extensions added for accelerating performance of the algorithms
Integration of Wireless Appliances with General Purpose Processors
Great Impact on Performance of Wireless Algorithms
Future Work
Simulations for finding performance improvements
Other Processor Architectures
– Bit Slice Architectures
– Out-of-order
References
The GARP Architecture and C Compiler
– T.C. Callahan, J.R. Hauser, J. Wawrzynek, IEEE Computer, April 2000, pp. 62-69
http://brass.cs.berkeley.edu
EPIC: Explicitly Parallel Instruction Computing
– M.S. Schlansker, B.R. Rau, IEEE Computer, Feb 2000, pp. 37-45
High-Bandwidth Interleaved Memories for Vector Processors - A Simulation Study
– G.S. Sohi, IEEE Transactions on Computers, Vol. 42, No. 1, Jan 1993, pp. 34-44
Acknowledgements
Vijay Pai
Partha Ranganathan
Joseph Cavallaro