
Wireless Communication Extensions for DSPs and General Purpose Processors

Sridhar Rajagopal

COMP 625

April 17, 2000


Motivation

Wireless, the next wave after Multimedia

Highly Compute-Intensive Algorithms

Real-Time Requirements

Design based on Time-to-Market


Outline

Processor Core with Reconfigurable Support

Permutation Based Interleaved Memory

Processor Architecture - EPIC

Instruction Set Extensions

Truncated Multipliers

Software Support Needed


Characteristics of Wireless Algorithms

Massive Parallelism

Bit-level Computations

Matrix Based Operations

Memory Intensive

Complex-valued Data

Approximate Computations


What’s wrong with Current Architectures for these applications?


Problems with Current Architectures

UltraSPARC, C6x, MMX, IA-64

Not enough MIPS/FLOPS

Unable to fully exploit parallelism

Bit Level Computations

Memory Bottlenecks

Specialized Instructions for Wireless Communications


Why Reconfigurable

Adapt algorithms to environment

Seamless and Continuous Data Processing during Handoffs

[Figure: three environments: Home Area Wireless LAN, High Speed Office Wireless LAN, Outdoor CDMA Cellular Network]


Reconfigurable Support

[Figure: OSI layer stack]

OSI Layers 3-7: User Interface, Translation, Synchronization, Transport, Network

OSI Layer 2: Data Link Layer (converts frames to bits)

OSI Layer 1: Physical Layer (hardware; raw bit stream)


Different Protocols

[Figure: block diagram with Source Coding, Channel Coding, Channel Decoding, Source Decoding, Multiuser Detection, and Channel Estimation blocks]

MPEG-4, H.723 - Voice, Multimedia

Convolutional, Turbo - Channel Coding


A New Architecture

[Figure: block diagram with Processor Core (GPP/DSP), Cache, two queues (Q), Crossbar, Reconfigurable Logic, Real-Time I/O, Bit Stream, Main Memory, RF Unit, and Processor]

Add-on PCMCIA Network Interface Card


Why Reconfigurable

Process initial bit level computations

Optimize for fast I/O transfer

[Figure: Reconfigurable Logic, Real-Time I/O, Bit Stream, RF Unit]



GARP Architecture at UC Berkeley

Boolean values

64-bit Datapath

Fast I/O

[Figure: GARP array with Configuration Caches, two 64-bit data buses, one 64-bit address bus, Control Blocks, and Sequencer]



Wide Path to Memory

– Data Transfer

– Minimize Load Times

Configuration Caches

– Recently Displaced Configurations (5 cycles)

– Can hold 4 full-size Configurations

Independent Execution



Access to same Memory System as Processor

– Minimize overhead

When idle

– Load Configurations

– Transfer Data


Operation

Load Configuration

– If in configuration cache, minimal time

Copy initial data with coprocessor move instructions

Start execution

Issue wait that interlocks while active

Copy registers back at kernel completion
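The operation steps above, together with the configuration cache figures from the earlier slide (4 full-size configurations, 5 cycles for a cached one), can be sketched as a toy model. This is only an illustration: the least-recently-used replacement policy, the `MEMORY_LOAD_CYCLES` value, and the names `ConfigCache` and `run_kernel` are assumptions, not part of the slides or of GARP itself.

```python
from collections import OrderedDict

CACHE_SLOTS = 4          # from the slides: holds 4 full-size configurations
CACHED_LOAD_CYCLES = 5   # from the slides: cost when the configuration is cached
MEMORY_LOAD_CYCLES = 50  # assumption: placeholder cost of a load from main memory


class ConfigCache:
    """Toy configuration cache; LRU replacement is an assumed policy."""

    def __init__(self, slots=CACHE_SLOTS):
        self.slots = slots
        self.entries = OrderedDict()

    def load(self, name):
        """Return the modelled load cost in cycles for configuration `name`."""
        if name in self.entries:
            self.entries.move_to_end(name)      # mark as most recently used
            return CACHED_LOAD_CYCLES
        if len(self.entries) >= self.slots:
            self.entries.popitem(last=False)    # evict the least recently used
        self.entries[name] = True
        return MEMORY_LOAD_CYCLES


def run_kernel(cache, name, kernel, inputs):
    """Follow the slide's steps: load, copy in, start, wait, copy back."""
    load_cycles = cache.load(name)   # load configuration
    array_regs = list(inputs)        # copy initial data to the array
    result = kernel(array_regs)      # start execution and wait for completion
    return result, load_cycles       # copy results back to the processor


cache = ConfigCache()
xor_pair = lambda regs: [regs[0] ^ regs[1]]
print(run_kernel(cache, "xor", xor_pair, [0b1100, 0b1010]))  # ([6], 50)
print(run_kernel(cache, "xor", xor_pair, [0b1111, 0b0001]))  # ([14], 5)
```

The second call is cheaper only because the configuration is still cached, which is the effect the "minimal time" bullet describes.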


Memory Interface

Access to Main Memory and L1 Data Cache

– Large, fast Memory Store

Memory Prefetch Queues for Sequential Accesses

– Read Aheads and Write Behinds

[Figure: block diagram with Processor Core (GPP/DSP), L1 Data Cache, two queues (Q), Crossbar, Main Memory, FPGA, and Instruction Cache]


Permutation Based Interleaved Memory (PBI)

High Memory Bandwidth Needed

Stride-Insensitive Memory System for Matrices

Multiple Banks

Sustained Peak Throughput (95%)

[Figure: L1 Data Cache, Main Memory]


PBI Scheme

N = address length in bits

M = 2^n banks

2^(N-n) words in each bank

To access a word,

– n-bit bank number

– N-n bit address (high-order)

Calculation of the n-bit Bank Number
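The address split described above can be sketched as follows. The sizes `N = 16` and `n = 3` are illustrative, and `low_order_bank` is a conventional low-order interleaving baseline included only for comparison; it is not the PBI mapping. It shows why the choice of bank-number function matters for strided (matrix) accesses.

```python
N = 16   # address length in bits (illustrative value)
n = 3    # bank-number bits, so M = 2**n banks

M = 2 ** n
WORDS_PER_BANK = 2 ** (N - n)


def split_address(addr, bank_of):
    """Return (bank, offset): an n-bit bank number and the N-n high-order bits."""
    assert 0 <= addr < 2 ** N
    return bank_of(addr), addr >> n


def low_order_bank(addr):
    """Baseline for comparison only: bank = low-order n bits (not PBI)."""
    return addr & (M - 1)


def banks_touched(stride, count, bank_of):
    """Number of distinct banks hit by `count` accesses at a fixed stride."""
    return len({bank_of((i * stride) % 2 ** N) for i in range(count)})


print(M, WORDS_PER_BANK)                            # 8 8192
print(split_address(0b10110110, low_order_bank))    # (6, 22)
print(banks_touched(1, M, low_order_bank))          # 8: unit stride uses every bank
print(banks_touched(M, M, low_order_bank))          # 1: stride M hits one bank
```

With the baseline mapping, a stride equal to the number of banks serializes every access onto one bank, which is the case a stride-insensitive scheme is meant to avoid.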


Calculate Bank Number

Use all N bits to get n-bit vector: Y = A X, where A is an n x N matrix of 0's and 1's

Y = A_h X_h + A_l X_l, with X split into N-n high-order and n low-order bits [A_l has rank n]

N-bit parity circuit with log_k(N) levels of XOR gates (k = fan-in)

[Figure: the N-bit address feeds n parity circuits, one per row of A (Row 0 to Row n-1); the n parity bit signals feed a decoder that produces 2^n bank select signals]
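A minimal sketch of the bank-number calculation, assuming a small example matrix. The sizes (`N = 6`, `n = 3`) and the rows in `A_ROWS` are chosen here for illustration only and are not the matrix used in the presentation; each row is stored as an N-bit mask, and one parity per row gives one bit of Y.

```python
N, n = 6, 3   # illustrative sizes: 6-bit addresses, 2**3 = 8 banks

# Assumed example rows of A as N-bit masks:
# bank bit r = address bit r XOR address bit r + 3.
A_ROWS = [0b001001, 0b010010, 0b100100]


def parity(x):
    """XOR of all bits of x (what one parity circuit computes)."""
    return bin(x).count("1") & 1


def bank_number(addr, rows=A_ROWS):
    """Y = A X over GF(2): row r of A yields bit r of the bank number."""
    y = 0
    for r, row in enumerate(rows):
        y |= parity(row & addr) << r
    return y


def low_block_has_full_rank(rows, n):
    """A_l has rank n exactly when the 2**n low-bit patterns map to distinct banks."""
    return len({bank_number(x, rows) for x in range(2 ** n)}) == 2 ** n


print(low_block_has_full_rank(A_ROWS, n))           # True
print([bank_number(a) for a in range(0, 64, 8)])    # [0, 1, 2, 3, 4, 5, 6, 7]
```

With this example matrix, eight accesses at stride 8 land in eight different banks, whereas taking only the low-order bits would send all of them to bank 0. The rank condition on A_l is what lets the bank number plus the N-n high-order bits identify a word uniquely.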


Interleaved Memory Model

[Figure: Address Source, Input Buffers, Memory Banks M(0), M(1), ..., M(M-1), Output Buffers, Data Sequencer, Data Sink]


Processor Core

64-bit EPIC Architecture with Extensions (IA-64/C6x)

Statically determined Parallelism; exploit ILP

Execution Time Predictability

[Figure: block diagram with Processor Core (GPP/DSP), Cache, two queues (Q), Crossbar, and FPGA]


EPIC Principle

Explicitly Parallel Instruction Computing

Evolution of VLIW Computing

Compiler - Key role

Architecture to assist Compiler

Better cope with dynamic factors

– which limited VLIW Parallelism


Aspects of EPIC

Designing Plan of Execution (POE) at Compile Time

Permitting Compiler to play Statistics

– Conditional Branches, Memory references

Communicating POE to the hardware

– Static Scheduling

– Branch information


Architecture Features in EPIC

Static Scheduling

– MultiOP

– Non-Unit Assumed Latency (NUAL)

The Branch Problem

– Predicated Execution

– Control Speculation

– Predicated Code Motion

The Memory Problem

– Cache Specifiers

– Data Speculation


Instruction Set Extensions

To accelerate Bit level computations in Wireless

Real/Complex Integer - Bit Multiplications

– Used in Multiuser Detection, Decoding

Bit - Bit Multiplications

– Used in Outer Product Updates

– Correlation, Channel Estimation

Complex Integer-Integer Multiplications

Useful in other Signal Processing applications

– Speech, Video, ...


Architecture Support

Support via Instruction Set Extensions

Minimal ALU Modifications necessary

Transparent to Register Files/Memory

Additional 8-bit Special Purpose Registers


Integer - Bit Multiplications

D[i] = D[i] + b[j]*C[j]

E.g.: Cross-Correlation

[Figure: 64-bit Register A, 64-bit Register C, a row of +/- units, 64-bit Register D, 8-bit Register b]

Register Renaming?
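One possible reading of the integer-bit multiply-accumulate is sketched here. It assumes that a 64-bit register holds eight signed 8-bit lanes, that bit i of the 8-bit register controls lane i, and that a bit value of 1 stands for +1 and 0 for -1; none of these details is stated on the slide, and the helper names are hypothetical.

```python
LANES, LANE_BITS = 8, 8            # assumption: eight 8-bit lanes per 64-bit register
MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack eight small signed integers into one 64-bit word, lane 0 lowest."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Inverse of pack: return the eight signed lane values."""
    out = []
    for i in range(LANES):
        v = (word >> (i * LANE_BITS)) & MASK
        out.append(v - (1 << LANE_BITS) if v >> (LANE_BITS - 1) else v)
    return out


def int_bit_mac(d, c, b):
    """Per lane: D += C if the matching bit of b is 1, else D -= C (wraps at 8 bits)."""
    d_lanes, c_lanes = unpack(d), unpack(c)
    result = []
    for i in range(LANES):
        sign = 1 if (b >> i) & 1 else -1     # assumed encoding: 1 -> +1, 0 -> -1
        result.append(d_lanes[i] + sign * c_lanes[i])
    return pack(result)


d = pack([0] * 8)
c = pack([1, 2, 3, 4, 5, 6, 7, 8])
print(unpack(int_bit_mac(d, c, 0b00001111)))   # [1, 2, 3, 4, -5, -6, -7, -8]
```

No multiplier is involved: each control bit only selects between an add and a subtract, which matches the row of +/- units in the figure.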


8-bit to 64-bit conversions

D = D + b*b^T

E.g.: Auto-Correlation

b1 = b(1:8), b(1:8), ..., b(1:8)

b2 = b(1)b(1) ... b(8)b(8)

[Figure: expansion of 8-bit Register b into 64-bit Register A, once as repeated copies of b(1)..b(8) and once with each bit replicated]
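The two 8-bit to 64-bit expansions can be sketched as below. The function names are hypothetical, and the bit ordering (b(1) taken as the least significant bit) is an assumption made for the example.

```python
def repeat_byte(b):
    """b1: the whole 8-bit pattern b(1:8) repeated eight times."""
    assert 0 <= b < 256
    return b * 0x0101010101010101


def replicate_bits(b):
    """b2: each bit of b replicated eight times, b(1)b(1)...b(8)b(8)."""
    assert 0 <= b < 256
    word = 0
    for i in range(8):
        if (b >> i) & 1:
            word |= 0xFF << (8 * i)   # bit i fills byte i (assumed ordering)
    return word


b = 0b10100101
print(hex(repeat_byte(b)))      # 0xa5a5a5a5a5a5a5a5
print(hex(replicate_bits(b)))   # 0xff00ff0000ff00ff
```

Lining the two words up bit by bit pairs every b(i) with every b(j), which is what an 8 x 8 outer product needs.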


Bit-Bit Multiplications

D = D + b*b^T

[Figure: 64-bit Register A = b1 and 64-bit Register B = b2 are combined by Ex-NOR into 64-bit Register C = b1*b2]

B1  B2  B1*B2
0   0   1
0   1   0
1   0   0
1   1   1
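The Ex-NOR truth table corresponds to multiplying +/-1 values when a bit value of 1 encodes +1 and 0 encodes -1. A minimal sketch under that encoding, with hypothetical function names and with b(1) assumed to be the least significant bit:

```python
MASK64 = (1 << 64) - 1


def xnor64(a, b):
    """Bitwise Ex-NOR of two 64-bit words: 1 where the bits agree."""
    return ~(a ^ b) & MASK64


def outer_product_bits(b):
    """All 64 products b(i)*b(j) of an 8-bit vector, one per result bit."""
    b1 = b * 0x0101010101010101                                    # b repeated eight times
    b2 = sum(0xFF << (8 * i) for i in range(8) if (b >> i) & 1)    # each bit replicated
    return xnor64(b1, b2)


def product_sign(word, i, j):
    """Decode result bit (i, j) back to +1 or -1."""
    return 1 if (word >> (8 * i + j)) & 1 else -1


b = 0b00000110
c = outer_product_bits(b)
print(product_sign(c, 1, 2), product_sign(c, 0, 1), product_sign(c, 0, 3))  # 1 -1 1
```

A single 64-bit Ex-NOR therefore yields the full 8 x 8 sign matrix b*b^T in one operation.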


Increment/Decrement

[Figure: 64-bit Register D, a row of +/- units, 8-bit Register b1*b2, 64-bit Register (D + b1*b2)]

D = D + b*b^T
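The increment/decrement step can be sketched as follows: each bit of an 8-bit product register adds +1 or -1 to one lane of D. The row-by-row organization, the unbounded Python integers standing in for fixed-width lanes, and the function names are assumptions of this sketch.

```python
def inc_dec_lanes(d_lanes, product_bits):
    """Add +1 to lane i when bit i of product_bits is 1, otherwise add -1."""
    return [d + (1 if (product_bits >> i) & 1 else -1)
            for i, d in enumerate(d_lanes)]


def accumulate_outer_product(D, b):
    """D = D + b*b^T for an 8-element +/-1 vector b stored as bits (1 -> +1, 0 -> -1)."""
    new_D = []
    for i, row in enumerate(D):
        b_i = (b >> i) & 1
        # Ex-NOR of b(i) with every bit of b gives row i of the product.
        row_bits = ~(b ^ (0xFF if b_i else 0x00)) & 0xFF
        new_D.append(inc_dec_lanes(row, row_bits))
    return new_D


D = [[0] * 8 for _ in range(8)]
for b in (0b11110000, 0b11110000, 0b00001111):
    D = accumulate_outer_product(D, b)
print(D[0][:2], D[0][4])   # [3, 3] -3
```

Real hardware lanes would be fixed-width and could overflow; the sketch does not model that.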



Complex-valued Data Processing

Is it easy to add?

Is this worth additional ALU Support?

Typically supported by Software!


Truncated Multipliers

Many applications need approximate computations

Adaptive Algorithms: Y = Y + mu*(Y*C)

Truncate lower bits

Truncated Multipliers - half the area/half the delay

Can do 2 truncated multiplies in parallel with regular

[Figure: Multiplier 1, Multiplier 2, Truncated Multiplier, ALU, Multipliers]
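A minimal sketch of one common form of truncated multiplication: the partial-product bits that fall in the lower columns are never formed, and only the upper half of the product is kept. The unsigned operands, the choice of cut-off column, and the absence of any correction term are assumptions of this sketch, not details from the slides.

```python
def truncated_multiply(x, y, bits=8):
    """Sum only partial-product bits in columns >= bits; lower columns are never formed."""
    total = 0
    for i in range(bits):
        if not (y >> i) & 1:
            continue
        for j in range(bits):
            if (x >> j) & 1 and i + j >= bits:
                total += 1 << (i + j)
    return total >> bits          # keep the upper `bits` bits


def max_error(bits=4):
    """Largest gap between the exact upper half and the truncated result."""
    worst = 0
    for x in range(1 << bits):
        for y in range(1 << bits):
            exact = (x * y) >> bits
            worst = max(worst, exact - truncated_multiply(x, y, bits))
    return worst


print(truncated_multiply(200, 100), (200 * 100) >> 8)   # 78 78
print(max_error(4))                                      # 3
```

Roughly half of the partial-product array is skipped, which is where the area saving comes from; the price is a small bounded error in the low bits of the result, acceptable for the approximate computations the slide mentions.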


Software Support

Greater Interaction between Compilers and Architectures

– EPIC

– Reconfigurable Logic

Compiler needs to find and exploit bit level computations

Reconfigurable Logic Programming


Area Estimates

Area increases by 20% over the size of an IA-64 architecture due to reconfigurable support

Instruction Set extensions need minimal hardware support

Parallel Interleaved Memory Banks will need larger area


Other Uses

Reconfigurable Logic

– For accelerating loops of general purpose processors

Bit Level Support

– For other voice, video and multimedia applications


Conclusions

Processor Core with Reconfigurable Support developed for Wireless Applications

Instruction Set Extensions added for accelerating performance of the algorithms

Integration of Wireless Appliances with General Purpose Processors

Great Impact on Performance of Wireless Algorithms


Future Work

Simulations for finding performance improvements

Other Processor Architectures

– Bit Slice Architectures

– Out-of-order


References

The GARP Architecture and C Compiler

– T.C. Callahan, J.R. Hauser, J. Wawrzynek, IEEE Computer, April 2000, pp. 62-69

http://brass.cs.berkeley.edu

EPIC: Explicitly Parallel Instruction Computing

– M.S. Schlansker, B.R. Rau, IEEE Computer, Feb 2000, pp. 37-45

High-Bandwidth Interleaved Memories for Vector Processors - A Simulation Study

– G.S. Sohi, IEEE Transactions on Computers, Vol. 42, No. 1, Jan 1993, pp. 34-44


Acknowledgements

Vijay Pai

Partha Ranganathan

Joseph Cavallaro
