dsp architectures for next-generation wireless...


ISSCC 2000, DSP Tutorial. Ingrid Verbauwhede, Chris Nicol

DSP Architectures for Next-Generation Wireless Communications

Chris Nicol
Bell Laboratories Australia
Lucent Technologies
[email protected]

Ingrid Verbauwhede
Department of Electrical Engineering
University of California Los Angeles

[email protected]

Mobile Wireless Trends

[Figure: global wireline and global wireless subscribers (in thousands, 0 to 1,600,000) by year, 1995 to 2010.
Wireless: CAGR 21%; global penetration (2010) 21% (Cellular + PCS + WLAS + Other).
Wireline: CAGR 5%; global penetration (2010) 20%.
Global population: 7 billion; CAGR 1995-2010 1.4%.]

World-wide deployment of mobile communications is exceeding expectations


DSP Evolution and Markets

[Figure, left: power (mW/MIPS, log scale from 1 to 10K) versus year, 1980 to 2000.
DSPs: DSP-1 ($150), DSP-32C ($250), DSP16A ($15), DSP1600 (<$10), DSP16210.
Microprocessors: M68000 ($200), 80286 ($200), 80386 ($300), Pentium ($300), Pentium (MMX) ($700).]

[Figure, right: DSP market by segment.
Wireless, $1.01B: cellular infrastructure, mobile handsets, cordless, GPS.
Modem, $727M: V.34, V.90, xDSL.
Consumer & Automotive, Disk, Other (one of these slices is labeled $270M).]

$2B market, 30% growth rate

Source: Forward Concepts 1996


The DSP Market Splits - and so does this tutorial

Today's general purpose, assembly coded DSP:
• 100 MOPS
• 250 mW
• $40

High performance DSPs (infrastructure; presented by Chris Nicol):
• 1-10 GOPS
• 1-5 watts
• < $50

Low cost, low power DSPs (mobile terminals; presented by Ingrid Verbauwhede):
• 200-1000 MOPS
• < 100 mW
• $10


Overview

• Introduction
• Low Power DSP Architectures for Handsets
  • Domain Specific Processors
  • DSP Processor Fundamentals
  • Datapath Design, Instruction Set Design
  • Pipeline Control, Memory Architecture, Low Power Design
  • for FIR - Viterbi - speech codec
• High performance DSP Processors for BTS
  • 2G and 3G Wireless Standards
  • Mobile Wireless Basestation Systems
  • Receiver Algorithms, Smart Antennas
  • Wideband TRX Architectures
  • Convolutional and Turbo coding
• High Performance DSP Architectures for 3G Wireless
  • LU DSP16210, TI 'C6x, Starcore SC140
  • Future Trends - MIMD DSP


Domain Specific Processors

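[Figure: spectrum of processor types, from ASIC through Application Specific, Domain Specific and General DSP to General Purpose.
Performance / Power: high at the ASIC end, low at the general purpose end.
Programmability: none at the ASIC end, then parameters, then high, up to very high at the general purpose end.]

Low power programmable DSP's for wireless communications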


Domain Specific Processors

Domain specific processors: to combine
• High performance
• Low power
• High degree of programmability

Application domains that need it:
• Wireless communications (baseband processing)
• Video processors
• Embedded micro controllers
• Etc.

Application domain is narrower, hence need high volume to compensate development cost.


Application domain: wireless communications

[Figure: block diagram of a cellular handset.
RF board: receiver, transmit, synthesizer, PA, TCXO.
Baseband board: digital ASIC, microprocessor, DSP, analog ASIC, audio codec, external memories, power supply, battery pack.
The handset is drawn with a display ("No network") and a keypad.]


Performance requirements: digital cellular phone

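[Figure: digital cellular phone signal chain.
Receive path: RF receive -> demodulation -> channel decoder -> speech decoder.
Send path: speech encoder -> channel encoder -> modulation -> RF send.
The chain is labeled "Communication" and "Application".]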

Goal: Minimum “MIPS” to get the job done.


Note: Definition of MIPS, MOPS

What is inside a MIPS = Million Instructions per Second ?

DSPs use Complex Instructions

One instruction = 5 operations
E.g. Lode instruction: 2 memory operations, 2 address generations and 1 arithmetic operation

So: benchmarks are expressed in minimum number of operations to finish a job, usually expressed in "MIPS"

Small example: Viterbi butterfly operation in 4 cycles/butterfly
Large example: GSM half rate speech codec in only 12 "MIPS"


Application Domain: compute intensive functions

Source encoder/decoder = speech coders
Advanced vocoders for improved speech quality & higher capacity.
Example: ACELP derivatives for GSM and IS136A
• Digital filtering (FIR, IIR)
• Vector quantization, code book search (square distance computation)

Channel encoder/decoder = error correcting
Complex wireless modems:
• Galois field arithmetic
• Convolution coders based on Viterbi trellis search
• Turbo coders

Modulation/demodulation =
• Receivers based on Maximum Likelihood Sequence Estimation (requires again fast Viterbi butterfly operations)


Compute intensive functions: evolution of DSP’s

Simple FIR example

Square distance

Speed-up of FIR example

Viterbi acceleration

Evolution of DSPs follows these examples


Evolution of DSP processors

Generation       Features                                        Examples
0 (1980)         Von Neumann architecture                        DSP-1 (AT&T)
1 (1982)         Basic Harvard architecture                      TMS320C10 (TI), NEC7720
2 (1986)         1 data/program bus, 1 data bus                  TMS320C25 (TI), DSP16A (AT&T)
3 (1990)         Extra addressing modes, extra functions         TMS320C5x (TI), DSP16xx (AT&T)
4 (1994)         2 data busses, 1 program bus                    TMS320C54x (TI)
5 (1996 – now)   2 data busses, 1 program bus, multiple units    Lucent 16xxx, Atmel Lode, Siemens Carmel


DSP Processor Fundamentals

Processor Components [Skillikorn-88]:
• Data Path Processing Unit
• Interconnect Processing Unit
• Memory Management Unit
• Instruction Processing Unit


Basic Harvard Architecture

[Figure: basic Harvard architecture. Blocks: program memory, data memory, instruction processing unit, and a multiply-accumulate datapath (16 x 16 mpy, ALU).]

Separate data memory from program memory!

Different from Von Neumann machine: one address bus - one data bus - one memory space


Example 1: TMS320C10 (1982)

[Figure: TMS320C10 block diagram.
Memory: data RAM (144 x 16), program ROM (1.5K x 16).
CPU: 16-bit T-register, 16 x 16 multiply, 32-bit P-register, 16-bit barrel shifter (L), 32-bit ALU, 32-bit accumulator, shift L (0,1,4), 2 auxiliary registers, four level H/W stack, status register.
External interface: D (15-0), A (11-0), I/O ports 8 x 16, PA (7-0) (A 2-0, D 15-0).]

• 160/200 ns instruction cycle time
• 4K word external address reach
• 60 general purpose and DSP specific instructions
• Single cycle multiply
• 16-bit barrel shifter
• External interrupt and polled input pins
• Eight 16-bit I/O ports
• 40-pin DIP / 44-pin PLCC

Courtesy: Texas Instruments


Compute Intensive function 1: FIR

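[Figure: N-tap FIR filter (50 taps). The input x(n) passes through a chain of Z^-1 delays giving x(n-1), ..., x(n-(N-1)); each tap is multiplied by its coefficient c(0), ..., c(N-1) and the products are summed to give y(n).]

$$y(n) = \sum_{i=0}^{N-1} c(i)\, x(n-i)$$

TMS320C10                    TMS320C25
  LTD                          RPTK 49
  MPY                          MACD
  LTD
  MPY
  ...
100 words program memory     3 words program memory
100 cycles                   53 cycles

Figure annotations: LT, DMOV, APAC (next to LTD); LT, DMOV, APAC, MPY (next to MACD).

Single Cycle Multiply - Accumulate!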

Example 2: Single Cycle MAC

TMS320C2x Multiplier/ALU

• Single cycle 16x16 bit multiply yielding a 32-bit product
• Supports simultaneous program and two data operand acquisition
• Supports simultaneous ALU and multiplier operations
• 0-16 bit left post-shifter

[Figure: TMS320C2x multiplier/ALU datapath. Blocks: data bus (16), program bus (16), left shifter (0-16), T register (16), multiplier (16x16), P register (32), left shifter (0-7), two MUXes, arithmetic logic unit (ALU), accumulator register (32) with carry C. Internal paths after the multiplier are 32 bits wide.]

Courtesy: Texas Instruments


Compute Intensive function 1: FIR (cont.)

[Figure: N-tap FIR filter (50 taps), as on the previous FIR slide.]

$$y(n) = \sum_{i=0}^{N-1} c(i)\, x(n-i)$$

y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);

y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);

y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);

. . .

y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2)+ . . + c(N-1)x(n-(N-1));

One output = 2N reads, N MAC’s, 1 write

Classic Harvard: one output = N cycles
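The operation count on this slide can be checked with a short sketch. This is only an illustration in Python, not code for any of the DSPs discussed; the function name `fir_output` and the convention that samples before x(0) are zero are assumptions made here.

```python
def fir_output(c, x, n):
    """Compute y(n) = sum_{i=0}^{N-1} c(i) x(n-i) and count the work done.

    Assumption for this sketch: samples outside the list x are zero.
    Returns (y, memory_reads, macs).
    """
    acc = 0
    reads = 0
    macs = 0
    for i in range(len(c)):
        coeff = c[i]                                      # coefficient read
        sample = x[n - i] if 0 <= n - i < len(x) else 0   # data read
        reads += 2
        acc += coeff * sample                             # multiply-accumulate
        macs += 1
    return acc, reads, macs


c = [1, 2, 3]
x = [1, 2, 3, 4, 5]
y2, reads, macs = fir_output(c, x, 2)
```

With N coefficients the loop performs 2N reads and N MACs per output, which is why a machine that can fetch both operands and do one MAC per cycle needs N cycles per output.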


FIR speed-up

y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);

y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);

y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);

. . .

y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2)+ . . + c(N-1)x(n-(N-1));

FIR filtering: two outputs in parallel

Two outputs = 4N reads, 2N MAC's, 2 writes
Dual MAC architecture with ONLY 2 data busses??

• Run MAC at double frequency: solution by Matsushita
• Read two 32-bit numbers instead of four 16-bit numbers: solution by Lucent 16000 core with dual MAC
• Insert delay register: solution by Atmel's LODE


Example 3: Lucent DSP16210

Horizontal parallelism, one sample at a time

2G mobile wireless base-stations

[Figure: DSP16210 datapath. Buses XDB(32) and IDB(32); registers X(32) and Y(32); two 16 x 16 multipliers producing p0 (32) and p1 (32); shift/saturate stages; ALU, ADD and BMU units; accumulator file, 8 x 40.]

Inner loop of 32-tap FIR filter:

    do 14 {                  // one instruction!
      a0=a0+p0+p1
      p0=xh*yh p1=xl*yl
      y=*r0++ x=*pt0++
    }

• Outer loop: 19 cycles, 38 bytes
• 1 cycle in inner loop
• 5 exec units used in inner loop
• 2 MACs per cycle

Courtesy: Gareth Hughes, Bell Labs Australia


FIR on Lode

FIR filter: two outputs in parallel with delay register

y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);

y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);

y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);

. . .

y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2)+ . . + c(N-1)x(n-(N-1));

Total energy for one output sample:

Energy                       Single MAC   Dual MAC   Dual MAC with REG
No. of MAC operations        N            N          N
No. of memory reads          2N           2N         N
No. of instruction cycles    N            N/2        N/2


FIR on Lode

Two MAC units with dedicated bus network

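[Figure: Lode dual-MAC datapath for FIR. Buses DB0(16) and DB1(16) feed MAC0 and MAC1, with the delay register LREG between them. One MAC accumulates c(i)x(n-i) into A0 for y(n); the other accumulates c(i)x(n-i+1) into A1 for y(n+1).]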

• DB0 fetches coefficient

• DB1 fetches data

• LREG delays input data

• A0 stores y(n) output

• A1 stores y(n+1) output

Same structure can be used for IIR
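A minimal sketch of the delay-register idea, written in Python rather than Lode assembly. The loop order (coefficients walked from c(N-1) down to c(0)), the priming read, and all names are assumptions made here to show how one data fetch can serve both accumulators; the slides do not give the actual instruction sequence.

```python
def fir_reference(c, x, n):
    """Direct form y(n); samples outside x are taken as zero (assumption)."""
    return sum(c[i] * (x[n - i] if 0 <= n - i < len(x) else 0)
               for i in range(len(c)))


def fir_dual_mac(c, x, n):
    """Compute y(n) and y(n+1) together using a delay register.

    Returns (y_n, y_n_plus_1, memory_reads).
    """
    N = len(c)
    reads = 0

    def load_x(k):
        nonlocal reads
        reads += 1                      # one access on the data bus
        return x[k] if 0 <= k < len(x) else 0

    lreg = load_x(n - (N - 1))          # prime the delay register
    a0 = a1 = 0
    for i in range(N - 1, -1, -1):
        coeff = c[i]                    # coefficient bus fetch
        reads += 1
        data = load_x(n + 1 - i)        # data bus fetch: x(n-i+1)
        a1 += coeff * data              # accumulates y(n+1)
        a0 += coeff * lreg              # accumulates y(n) from delayed x(n-i)
        lreg = data                     # delay the sample for the next cycle
    return a0, a1, reads
```

For two outputs this uses 2N + 1 reads, about N per output, against 2N per output when every operand is fetched separately, which matches the "Dual MAC with REG" column of the energy table.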


Compute Intensive function 2: Viterbi

[Figure: Viterbi butterfly. States i and i + s/2 at one trellis stage connect to states 2i and 2i+1 at the next stage, with branch metrics +a and -a.]

i = state index
s = # of states = 2^(k-1)
w = decoding window

Basic equations:

d(2i) = min { d(i) + a, d(i + s/2) - a }
d(2i + 1) = min { d(i) - a, d(i + s/2) + a }

IS-95: k = 8, w = 192, corresponds to 2^7 x 192 x (cycles for one ACS)

Basic algorithm in Viterbi channel decoders and MLSE based receivers, modified version in turbo decoders.

Key operation: Add-Compare-Select (ACS)
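The butterfly equations can be sketched as one trellis stage in Python. This is an illustrative model only: the function name `acs_stage`, the use of one branch metric `a[i]` per butterfly, and the tie-breaking rule are assumptions, not details taken from the slides.

```python
def acs_stage(d, a):
    """One add-compare-select stage over s = len(d) states.

    d[i] is the path metric of state i; a[i] is the branch metric of
    butterfly i (hypothetical layout, one value per butterfly).
    Returns (new_metrics, decisions); decision 0 means the path from
    state i survived, 1 means the path from state i + s/2 survived.
    """
    s = len(d)
    half = s // 2
    new_d = [0] * s
    decision = [0] * s
    for i in range(half):
        upper, lower = d[i], d[i + half]
        # d(2i) = min{ d(i) + a, d(i + s/2) - a }
        c0, c1 = upper + a[i], lower - a[i]
        new_d[2 * i], decision[2 * i] = (c0, 0) if c0 <= c1 else (c1, 1)
        # d(2i+1) = min{ d(i) - a, d(i + s/2) + a }
        c0, c1 = upper - a[i], lower + a[i]
        new_d[2 * i + 1], decision[2 * i + 1] = (c0, 0) if c0 <= c1 else (c1, 1)
    return new_d, decision
```

Each butterfly performs two ACS operations, so a stage over s states runs s/2 butterflies; the stored decision bits are what a traceback step would later read.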


Viterbi on Lode

Two MAC units & ALU: Add-Compare-Select

• DMAC operates as dual add/subtract unit

• ALU finds minimum

• Shortest distance saved

• Path indicator saved

• 4 cycles / butterfly

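[Figure: Lode ACS datapath. Buses DB0(16) and DB1(16) supply the branch metrics µ1 and µ2; MAC1 and MAC0 add them to the path metrics Γ1 and Γ2 (accumulators A0 and A1); the ALU computes Min() and writes the surviving metric Γ to an accumulator (A2, A3); the decision bit goes to memory.]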

Γ = min [(Γ1 + µ1), (Γ2 + µ2)]

Viterbi on TI C54x

ALU and CSSU: Add-Compare-Select

• ALU splits in 16 bit halves

• ACC splits in half

• Shortest distance saved

• CSSU compares halves

• Path indicator saved

• 4 cycles / butterfly

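[Figure: C54x ACS datapath. Buses DB0(16) and DB1(16) and TREG supply the branch metrics µ1 and µ2; the split ALU adds them to Γ1 and Γ2 held in the two accumulator halves; the compare unit (with MSW/LSW select) picks the minimum Γ; the decision bit is shifted into the TRN register; the result goes to memory over data bus EB.]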

Γ = min [(Γ1 + µ1), (Γ2 + µ2)]


Viterbi on LU DSP16210

GSM (K=5, 16 states)

    do 8 {
      a0=a4+y a1=a5-y *r3++=a0h
      a2=a4-y a3=a5+y *r5++=a2h
      a0=cmp1(a1,a0) yh=*r0 r0=r1+j j=k k=*pt1++
      a2=cmp1(a3,a2) a4_5h=*pt0++
    }

• Hardware support for Viterbi algorithm:
  – ACS calculations are efficient
  – Minimal overhead
• 4 cycles per butterfly
  – 32 cycles per GSM timeslot.
• Comparison functions store ACS decision bits:

[Figure: successive a0=cmp1(a1,a0) and a2=cmp1(a3,a2) operations shift their decision bits into AR0; results are written to memory.]

Courtesy: Gareth Hughes, Bell Labs Australia


Square distance on Lode

ALU in parallel with MAC: Sum of square distance

• ALU performs subtraction and absolute value

• MAC performs squaring and accumulation

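Vector quantization in vocoders: vector size N = 50, codebook > 1000

$$D = \sum_{i=0}^{N-1} \| x(i) - y(i) \|^2$$

[Figure: Lode square-distance datapath. Buses DB0(16) and DB1(16) supply x(i) and y(i); the ALU forms the difference; the MAC squares it and accumulates D in A0.]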


Lode Core Architecture


Domain specific instruction set

Basic instruction set for general purpose DSP, e.g. MAC, min, max, etc.

Extra instructions for performance with every new generation, e.g. "square distance and accumulate":

$$D = \sum_{i=0}^{N-1} \| x(i) - y(i) \|^2$$

One 32 bit instruction:

a3 = abs (*r0 - *r1 < asr), a0 = a0 + sqr(a3), r0++, r1++;

Bus network and instruction set design go together

CISC, thus compiler unfriendly
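As a rough functional model of what the "square distance and accumulate" instruction computes when it is repeated over a vector, here is a Python sketch. It does not model the instruction's pipelining or the shift operand; the helper names `square_distance` and `nearest_codeword` are hypothetical.

```python
def square_distance(x, y):
    """D = sum_i |x(i) - y(i)|^2 for two equal-length vectors."""
    acc = 0
    for xi, yi in zip(x, y):
        diff = abs(xi - yi)   # ALU: subtraction and absolute value
        acc += diff * diff    # MAC: squaring and accumulation
    return acc


def nearest_codeword(x, codebook):
    """Index of the codebook vector with the smallest square distance to x."""
    return min(range(len(codebook)),
               key=lambda j: square_distance(x, codebook[j]))
```

A codebook search calls the inner loop once per entry, so with N = 50 and more than 1000 entries the per-element cost dominates, which is the motivation for folding the whole loop body into one instruction.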


Control & Pipeline for DSP's

RISC: load/store machine
Memory access with load/store instructions (DLX, MIPS, D10V)

Pipeline: Fetch | Decode | Execute | Memory Access | Write Back
(Execute stage: execution / address generation. Memory Access stage: memory access / branch.)

Excellent for complex decision making!

DSP: register-memory architecture (TI, Lucent, HX, Lode)

Pipeline: Fetch | Decode | Memory Access | Execute | Write Back
(Memory Access stage: memory access. Execute stage: execution.)

Excellent for number crunching!


Pipeline RISC compared to DSP

RISC: example

    r0 = *p0;        // load data
    a0 = a0 + r0;    // execute

[Figure: three overlapped RISC pipeline rows (Fetch, Decode, Execute, Memory Access).]

Too expensive for DSP

DSP: memory intensive applications:

[Figure: four overlapped DSP pipeline rows (Fetch, Decode, Memory Access, Execute).]

Penalty: data dependent branch is expensive


Other control features

Hardware looping:
• Because software branch is expensive
• "Zero overhead hardware loops" (for tight FIR loops), hardware supported

Interrupts: hardware with shadow registers for extremely fast context switching.

Special instruction cache:
• Single instruction "repeat" buffer
• Multiple instruction cache: under programmer's control!
• E.g. Lucent DSP16210: 31 x 32 instruction cache

Predictable worst case execution time!


Low Power DSP's

C54x 1V DSP (Texas Instruments - ISSCC 1997)
• 0.25µ 3LM CMOS
• 65 M 16b MAC/s at 1.0V
• 0.21 mW/MHz at 1.0V
• 4.0 mW stand-by power
• Dual Vt process

DSP 1600 Core (Lucent - 1609 low cost consumer 16-bit)
• 0.35µ 3LM CMOS
• 80 M 16b MAC/s at 3.3V
• 1.4 mW/MHz at 3.3V
• 30 µW stand-by power


BUT: DSP Software Development

• Complex DSP architecture not amenable to compiler technology

• Algorithms are modeled in high level language (e.g. C++)

• Solutions are implemented and debugged in hand-optimized assembler - large development effort with minimal tool support

[Figure: development flow. HLL algorithmic model -> prototype code -> production code (hand coded assembler), with an "optimize & debug" loop.]

Long, frustrating time to market

Fragile legacy code

Still used in handhelds, but change in basestations, Part II


Mobile Wireless Evolution

First Generation (Past)
  Service: mobile telephone service (carphone)
  Technology: analog cellular technology; macrocellular systems
  Standards: NMT, TACS, Analog AMPS

Second Generation (Now)
  Service: digital voice + messaging/data services; fixed wireless loop
  Technology: digital cellular technology + IN emergence; microcellular & picocellular (capacity, quality); enhanced cordless technology
  Standards: GSM, IS-54/136 TDMA, IS-95/cdmaOne, PDC, DECT

Third Generation (Year 2000-2005)
  Service: integrated high quality audio and data; narrowband and broadband multimedia services + IN integration
  Technology: broader bandwidth, efficient radio transmission; information compression; higher frequency spectrum utilization; IN + network management integration
  Standards: WCDMA, UWC-136 TDMA, cdma2000

Fourth Generation (Year 2010?)
  Service: TelePresencing; education, training and dynamic information access
  Technology: wireless-wireline and broadband transparency; knowledge-based network operations; unified service network

Global roaming

We are entering the decade of wireless data communications - and World-War 3G


Mobile Data Services

• Carriers invest >$500 per subscriber but subscriber voice calls (and therefore revenues) are reducing.
• Data currently 3% of wireless traffic - projected to >50% by 2005
• Wireless Internet: average internet connection 30 mins
• Text Messaging: saturating 2G voice networks

2.5 Generation Mobile Standards [1]
• GPRS: packet data over GSM - timeslot multiplexing, multi-slots per user.
• EDGE: 8-PSK modulation + GPRS, 384 Kbps max to 1 user.

3G - IMT2000 Proposals
• 144 Kbps automobile, 384 Kbps pedestrian, 2 Mbps stationary.
• Several proposals - UWC 136 (200 kHz, TDMA, 8-PSK = EDGE).
• UMTS, CDMA-2000 are both CDMA proposals.


Evolution of Mobile Wireless Network Architecture

[Figure: 2G network versus IP-based 3G network.
2G network: radio clients, base stations, BSC, MSC (mobile switches), network servers, PSTN.
IP-based 3G network: radio clients and base stations on packet connectivity (ATM / IP), with packet mode servers (high speed data, multimedia, voice over IP, etc.), circuit mode servers (voice, low speed data, etc.), and wireless control servers (feature control, network management, billing, etc.), connected to the Internet / advanced services and the PSTN.]

Mobile networks are being upgraded in preparation for the delivery of high speed data services.


Mobile Wireless Infrastructure

[Photo: Macro-cell GSM basestation (6-12 TRX)]

[Photo: Micro-cell GSM basestation (2 TRX)]


2G Basestation Baseband Processing

• Multiple DSPs used for baseband processing.
• RISC microcontroller for timing, framing, I/O control
• Software upgradable over the network
• DSPs dominate cost and power consumption

[Figure: Tx/Rx baseband processing board for 2-carrier GSM basestation. A RISC microcontroller with T1/E1 I/O, an I/O ASIC, RAM, and multiple DSPs for channel equalization, channel de/coding and encryption, connected to the Tx and Rx analog front ends (AFE).]

Future trend - integrate baseband processing - low cost Pico BTS


3G Basestation Baseband Processing

• Increased receiver algorithm sensitivity
• Antenna arrays - smart antennas
• Multi-standard basestations using software radio architecture
• 3G - constraint length 9, rate 1/2 convolutional coding for voice.
• 3G - constraint length 4, Turbo codes for data

Increased DSP performance needed in next-generation basestation

High performance DSPs + custom logic needed for 3G (Viterbi decoding and Turbo decoding)

[Figure: 3G receiver block diagram, with the implementation target of each block:
• Sliding correlator: despreading (ASIC)
• RAKE combiner: reassemble multipath (DSP, ASIC)
• Deinterleaver (DSP)
• Decoder: Viterbi algorithm, Turbo decoding (DSP, ASIC)
• Code tracking: delay-lock-loop (ASIC, DSP)
• Channel estimation (DSP)
• Code generator (two instances): channelisation code, scrambling code (ASIC)
• Synchronisation: cell search, slot sync, frame sync (DSP)
• Path search (ASIC)
• SIR measurement: fast power control (DSP)
• Power control]

Courtesy: Bing Xu: Bell Labs Australia


Receiver Algorithms for GSM Basestation

• Enhanced receiver sensitivity
• Larger cells in suburban areas = reduced network cost
• Mobile transmits with less power = increased battery life

Existing Receiver: estimating wireless channel -> equalizing multi-path effects -> channel decoding -> speech decoding

New Iterative Receiver: the same four stages, with speech statistics added (1.3dB improvement)

Challenge - requires 6x DSP MIPS of existing receiver in basestation

Courtesy: Magnus Sandell: Bell Labs UK


Smart Antennas

[Figure: three cell site types: omnidirectional cell site, three sector cell site, intelligent antenna cell site.]

• A multiple antenna element system
• Combined with a base station architecture and signal processing techniques designed to dynamically select or form the "optimum" beam pattern per user

Increased cost in RF electronics and enhanced DSP requirements.


Fixed Multi-Beam Versus Adaptive Beam

Fixed Multi-Beam: select from, or use, multiple "fixed" antenna beams to optimize performance.

Adaptive Beam: adaptively "weight" and combine multiple antenna elements to optimize performance.

[Figure: both schemes shown with Mobile 1, Mobile 2 and an interferer, with direct and reflected rays arriving at the basestation.]


Digital Radio Trends - Software Radio

[Figure: multi-standard basestation chain: antennas, RF/analog processing (AMP), A/D, digital processing, network interface, network.]

Digital processing: DSPs - higher speed, more powerful
Filtering, modulation, demodulation, equalization, Rake receiver, correlator, channel coding, encryption, diversity, . . .

RF/IF: linear amplification, combining, higher dynamic range, smaller, amplifiers, mixers, filters, . . .


Wideband Receiver Architecture

[Figure: wideband receiver chain: RF-IF & filter -> high speed A/D -> digital channeliser -> baseband processing. Spectrum sketches show channels CH1, CH2, CH3, ..., CHM together at fRF and at fIF, and individual channels CH1 ... CHM at baseband (fBB) after the channeliser.]

Increased DSP performance needed for Software Radio


Turbo Codes

• Parallel concatenation of convolutional codes is used to give the codes structure so they can be decoded

• Pseudorandom interleaving is used to give the codes performance which approaches that for random coding

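• Resulting encoder structure: two Recursive Systematic Convolutional (RSC) codes

[Figure: turbo encoder. The input goes directly to the systematic output and to Encoder #1; an interleaved copy of the input goes to Encoder #2; a MUX combines the two encoder outputs into the parity output.]

For 3G Wireless (UMTS and CDMA2000)
• Voice service: BER requirement 10^-3
• Data service: BER requirement 10^-5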
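The encoder structure can be sketched in a few lines of Python. The slides do not give generator polynomials or an interleaver design, so the choices below are assumptions for illustration: a constraint length 4 RSC code with feedback 1 + D^2 + D^3 and feedforward 1 + D + D^3, and a seeded pseudorandom permutation as the interleaver. Trellis termination and puncturing are left out.

```python
import random


def rsc_parity(bits):
    """Parity stream of a recursive systematic convolutional encoder.

    Assumed polynomials: feedback 1 + D^2 + D^3, feedforward 1 + D + D^3.
    """
    s1 = s2 = s3 = 0                   # shift register, s1 is the newest bit
    parity = []
    for u in bits:
        fb = u ^ s2 ^ s3               # recursive feedback
        parity.append(fb ^ s1 ^ s3)    # feedforward taps 1, D, D^3
        s1, s2, s3 = fb, s1, s2
    return parity


def turbo_encode(bits, seed=0):
    """Parallel concatenation of two RSC encoders.

    Returns (systematic, parity1, parity2, permutation).
    """
    perm = list(range(len(bits)))
    random.Random(seed).shuffle(perm)  # hypothetical pseudorandom interleaver
    interleaved = [bits[p] for p in perm]
    return list(bits), rsc_parity(bits), rsc_parity(interleaved), perm
```

Because the encoder is recursive, a single 1 at the input produces a parity sequence that does not die out, which is part of what the interleaver exploits when it spreads input patterns differently for the two encoders.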


Turbo Decoding

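• Key idea: iterative decoding (up to 10 iterations for 3G)
• There is one decoder for each elementary encoder.
• Each decoder estimates the a-posteriori probability (APP) of each data bit.
• The APP's are used as a priori information by the other decoder.

[Figure: turbo decoder. A DeMUX splits the received stream into systematic data and parity data. Decoder #1 passes its APP output through an interleaver to Decoder #2; Decoder #2 passes its APP output back through a deinterleaver to Decoder #1 and produces the hard bit decisions. The systematic data reaches Decoder #2 through an interleaver.]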


Soft-Output Decoding Algorithms

Requirements for Turbo:
– Accept soft-inputs in the form of a priori probabilities (APP)
– Produce APP estimates of the data.
– "Soft-Input Soft-Output"

Trellis-Based Estimation Algorithms
• Sequence estimation: Viterbi algorithm -> SOVA -> improved SOVA
• Symbol-by-symbol estimation: MAP algorithm -> max-log-MAP, log-MAP

SOVA and log-MAP use modified Add-Compare-Select operations - not only select the maximum path metric - but also need to keep the difference.

Today's high-performance DSPs are highly MAC-focussed (for filtering in modem applications). Some DSPs provide hardware support for efficient implementation of Viterbi - none support SOVA or log-MAP.

Iterative channel estimation also uses Soft-Input Soft-Output decoders.


The Maximum A Posteriori (MAP) Algorithm

Log-Likelihood Ratio:

$$L(d) = \ln \frac{\Pr[d=1]}{\Pr[d=0]}$$

$$L(d \mid \mathbf{y}) = \ln \frac{\Pr[d=1 \mid \mathbf{y}]}{\Pr[d=0 \mid \mathbf{y}]} = \ln \frac{p(\mathbf{y} \mid d=1)}{p(\mathbf{y} \mid d=0)} + L(d)$$

• A priori value of Pr[d=1], Pr[d=0]
• Output of decoder contains additional extrinsic information
• The sum of the a priori information and the extrinsic information will be the a priori information for the next stage of decoding, for both the 2nd decoder or the 1st decoder in the next iteration

1) u_k is the kth bit of the desired data sequence, 2) y is the observed sequence, 3) the state transitions from state s' at time k-1 to state s at time k, 4) we want to evaluate this LLR for every k:

$$L(u_k \mid \mathbf{y}) = \ln \frac{\Pr[u_k=1 \mid \mathbf{y}]}{\Pr[u_k=0 \mid \mathbf{y}]} = \ln \frac{\sum_{(s',s):\,u_k=1} p(s', s, \mathbf{y})}{\sum_{(s',s):\,u_k=0} p(s', s, \mathbf{y})}$$

Break the probability computation into:

$$p(s', s, \mathbf{y}) = p(s', \mathbf{y}_{j<k}) \cdot p(s, y_k \mid s') \cdot p(\mathbf{y}_{j>k} \mid s)$$

Alpha: $\alpha_{k-1}(s') = p(s', \mathbf{y}_{j<k})$
Gamma: $\gamma_k(s', s) = p(s, y_k \mid s')$
Beta: $\beta_k(s) = p(\mathbf{y}_{j>k} \mid s)$


Gamma, Alpha and Beta Calculations

Gamma: calculated from known bits up to k, needs to be stored

$$\gamma_k(s', s) = p(s, y_k \mid s') = P(s \mid s') \cdot p(y_k \mid s', s) = P(u_k) \cdot p(y_k \mid u_k)$$

where $P(u_k)$ is calculated from the a priori information and $p(y_k \mid u_k)$ is calculated from the received bits.

Alpha: calculated by a forward recursion through the trellis based on Gamma

$$\alpha_k(s) = \sum_{s'} \gamma_k(s', s) \cdot \alpha_{k-1}(s')$$

Beta: calculated by a backward recursion from the end of the trellis

$$\beta_{k-1}(s') = \sum_{s} \gamma_k(s', s) \cdot \beta_k(s)$$

[Figure: window algorithm. Alpha, Gamma and Beta are computed over a window of the trellis, with dummy Beta's used to start the backward recursion.]
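The two recursions can be written directly from the alpha and beta definitions. The Python sketch below works in the probability domain on a generic trellis; the data layout (`gammas[k][sp][s]` holding the gamma value for the step into time k+1 from state sp to state s) and the function name are assumptions for illustration, and no normalization or windowing is applied.

```python
def forward_backward(gammas, alpha0, beta_end):
    """Run the alpha (forward) and beta (backward) recursions.

    gammas[k][sp][s]: gamma for the transition sp -> s at step k (0-based).
    alpha0: initial alpha values, one per state.
    beta_end: final beta values, one per state.
    Returns (alphas, betas), each a list of K + 1 per-state lists.
    """
    K = len(gammas)
    S = len(alpha0)
    alphas = [list(alpha0)]
    for k in range(K):
        # alpha_k(s) = sum over s' of gamma_k(s', s) * alpha_{k-1}(s')
        alphas.append([sum(gammas[k][sp][s] * alphas[k][sp] for sp in range(S))
                       for s in range(S)])
    betas = [None] * K + [list(beta_end)]
    for k in range(K - 1, -1, -1):
        # beta_{k-1}(s') = sum over s of gamma_k(s', s) * beta_k(s)
        betas[k] = [sum(gammas[k][sp][s] * betas[k + 1][s] for s in range(S))
                    for sp in range(S)]
    return alphas, betas
```

A useful sanity check is that the sum over states of alpha times beta is the same at every time index, since each such sum equals the total weight of all paths through the trellis.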


Log MAP and MAX-log MAP

Compute logarithms of alpha, beta and gamma, which means we compute:

$$\ln\left(e^{\delta_1} + e^{\delta_2}\right)$$

Log-MAP: $\ln(e^{\delta_1} + e^{\delta_2}) \approx \max(\delta_1, \delta_2) + f_c(|\delta_1 - \delta_2|)$, where $f_c$ is the correction function (implemented as a table)

MAX-Log-MAP: $\ln(e^{\delta_1} + e^{\delta_2}) \approx \max(\delta_1, \delta_2)$

[Figure: BER (10^-6 to 10^-1, log scale) versus a horizontal axis running from 2.2 to 2.9, with curves MaxlogAPP and LogAPP]

MAX-log MAP loses approx. 0.5 dB relative to log MAP.

For log-MAP, a small correction table is needed (approx. 6 non-zero values). The absolute difference is used as the table look-up index. We need the difference!

Courtesy: Bing Xu: Bell Labs Australia
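The three ways of evaluating ln(e^δ1 + e^δ2) can be compared directly. In this sketch the table step of 0.5 and the 6-entry table length are hypothetical choices made only to mirror the "approx. 6 non-zero values" remark; the exact correction term used is ln(1 + e^-|δ1-δ2|).

```python
import math

STEP = 0.5  # hypothetical quantisation step of the table index
# Hypothetical 6-entry table of f_c(x) = ln(1 + e^-x) at x = 0, 0.5, ..., 2.5
TABLE = [math.log(1.0 + math.exp(-i * STEP)) for i in range(6)]

def max_log(d1, d2):
    """MAX-Log-MAP: drop the correction term."""
    return max(d1, d2)

def log_map_table(d1, d2):
    """Log-MAP with a table look-up indexed by the absolute difference."""
    idx = int(abs(d1 - d2) / STEP)
    corr = TABLE[idx] if idx < len(TABLE) else 0.0
    return max(d1, d2) + corr

def log_map_exact(d1, d2):
    """Exact value of ln(e^d1 + e^d2), written as max plus correction."""
    return max(d1, d2) + math.log1p(math.exp(-abs(d1 - d2)))

if __name__ == "__main__":
    for pair in [(1.0, 2.0), (0.0, 0.2), (0.0, 10.0)]:
        print(pair, max_log(*pair), log_map_table(*pair), log_map_exact(*pair))
```

The table version needs a subtract, an absolute value and an indexed load on top of the max, which is the extra work a Viterbi-only compare-select instruction does not cover.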


High Performance DSP Requirements

• Very high levels of DSP integer performance

• Scalability to meet wide range of cost, power, performance.

• Large memory and I/O bandwidth.

• Friendly, compiler driven, programming environment.

• Support for complex real-time synchronous applications (latency, predictable throughput, synchronization)

• Cost & power efficient solution.

[Figure: Some DSP Applications. Required MOPS (log scale, 10 to 100K) versus year (1997, 1999, 2001) for V.34, K56, GSM term, PCS term, ADSL 500k, ADSL 6M, 24 ch. modem, DAB rcvr, 16 HR GSM, 1G eth. xcvr, set-top box, MPEG II encode, soft radio, 3G Wireless and 3-D graphics(?), with a "traditional DSP" reference marked]


Compiler Driven VLIW

Large orthogonal register set, regular interconnect

[Block diagram: data memory and register array connected through an interconnect to execution units ex1 (alu), ex2 (alu), ex3 (mpy), ex4 (ld/st), ..., exn (ld/st)]

Instruction format: cond/branch | ex1 | ex2 | ex3 | ... | exn

Atomic RISC-like operations => heavily pipelined, high freq. clock


Explicitly Parallel Instruction Computing

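[Block diagram: Execution Clusters. Data memory feeds two clusters, each with its own register array and interconnect; execution units ex1 (alu), ex2 (alu), ex3 (ld/st), ex4 (alu), ex5 (mpy), ex6 (ld/st)]

[Diagram: Execution Sets. A fetch set carrying the marker bits 1 1 1 0 1 0 1 0 is divided into exec. sets]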


Explicitly Parallel Instruction Computing

Predication (guarded) exec.

Format: cond | any instruction

- eliminates branches: improves compiler efficiency
- eliminates branches: removes pipeline bubbles
- fill delayed branch slots with predicated instructions

Instruction modifiers

Format: modifier | instr1 | instr2 | instr3 | instr4

- allows shorter instruction length
- extend register addressing
- predication
- execution set identifier
- looping
- extended operations


Texas Instruments ‘C6201

[Block diagram: two execution clusters, each with ALU, shift, mpy and add units; Register Bank A (16 x 32) and Register Bank B (16 x 32); Instruction Dispatch & Decode; Program Memory (16K x 32) with 256-bit fetch; Data Memory (32K x 16)]

• 8-way VLIW with two execution clusters
• 256 bit (8x32) instruction fetch with variable length execute set
• Each 32 bit instruction individually predicated
• 11 stage pipeline
• 1600 MIPS, 400 MMACs @ 200 MHz


FIR Filter on TI ‘C6x

Hand-coded assembly: 32-tap FIR filter

loop:
        ldw   .d1t1  *a4++,a5
||      ldw   .d2t2  *b4++,b5
||[b0]  sub   .s2    b0,1,b0
||[b0]  b     .s1    loop
||      mpy   .m1x   a5,b5,a6
||      mpyh  .m2x   a5,b5,b6
||      add   .l1    a7,a6,a7
||      add   .l2    b7,b6,b7

• Outer Loop: 23 cycles, 180 bytes
  – 1 cycle in inner loop
• All 8 exec units used in inner loop - maximum efficiency
  – 2 MACs per cycle

• Assembly syntax more difficult to learn.
• Hard to get full use of all 8 execution units at once.
• Software pipelining difficult to implement, and requires longer prolog/epilog (larger code size).

Courtesy: Gareth Hughes: Bell Labs Australia
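The arithmetic of the inner loop can be mirrored in a high-level sketch: each iteration consumes one pair of 16-bit samples and one pair of coefficients and feeds two independent accumulators (the roles played by a7 and b7). The function name, the list-based data layout and the final addition of the two accumulators are assumptions for illustration; software pipelining and the loop counter are not modelled.

```python
def fir_dual_mac(samples, coeffs):
    """Dot product computed with two accumulators, two products per iteration."""
    if len(samples) != len(coeffs) or len(samples) % 2 != 0:
        raise ValueError("need two equal-length sequences of even length")
    acc_a = 0  # role of a7: sums the products of one 16-bit half (mpy)
    acc_b = 0  # role of b7: sums the products of the other half (mpyh)
    for i in range(0, len(samples), 2):
        acc_a += samples[i] * coeffs[i]
        acc_b += samples[i + 1] * coeffs[i + 1]
    # Assumed final step: combine the two partial sums after the loop.
    return acc_a + acc_b

if __name__ == "__main__":
    print(fir_dual_mac([1, 2, 3, 4], [5, 6, 7, 8]))
```

Keeping two separate accumulators removes the dependency between the two multiply-adds, which is what lets both run in the same cycle.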


Viterbi on TI ‘C6x

LOOP:
  [b1]  b      .s1   LOOP
||[b1]  sub    .s2   b1,1,b1
||[!a2] sth    .d1   b12,*+a6[8]
||[!a2] add    .d2   b0,b14,b14
||      cmpgt  .l1   a11,a10,a1
||      cmpgt  .l2   b11,b10,b0
||      mpy    .m1x  1,b5,a4

  [a2]  sub    .s1   a2,1,a2
||[!a2] sth    .d1   a12,*a6++
||[a1]  add    .s2   2,b0,b0
||[b0]  mpy    .m2   1,b11,b12
||      mpy    .m1   1,a10,a12
||      sub    .l2x  a7,b5,b10
||      ldh    .d2   *++b9,b5

        shl    .s2   b14,2,b14
||[a1]  mpy    .m1   1,a11,a12
||      add    .s1   a7,a4,a10
||      sub    .l1x  b13,a4,a11
||      add    .l2   b13,b5,b11
||      mpy    .m2   1,b10,b12
||      ldh    .d2   *b4++[2],a7
||      ldh    .d1   *a5++[2],b13
; end of LOOP

Utilization of execution units in Viterbi decoder (cycles 0 to 31). The inner loop repeats the following 3-cycle pattern over cycles 0 to 23:

Unit | Cycle 3n   | Cycle 3n+1  | Cycle 3n+2
.D1  | STH new 8  | STH new 0   | LDH old0
.D2  | ADD tr     | LDH mj      | LDH old1
.M1  | MPY mj     | MPY a0      | *MPY b0
.M2  | (idle)     | *MPY b8     | MPY a8
.L1  | CMPGT t0   | (idle)      | SUB b0
.L2  | CMPGT t8   | SUB a8      | ADD b8
.S1  | B JLOOP    | SUB k       | ADD a0
.S2  | SUB j      | *ADD t0,t8  | SHL tr

Cycles 24 to 31 hold the remaining outer-loop operations (loads of sd0/sd1 and old0/old1, stores of m[0] to m[6] and trans, MVK j, MVK k, and related ADD/SUB instructions).

• 16-state Viterbi decoder for GSM from TI WWW site: ftp://ftp.ti.com/pub/tms320bbs/c62xfiles/vitgsm.asm
  – 3 cycles per butterfly
  – 32 cycles per GSM timeslot (8 butterflies)
  – MPY instructions used to move data
• 3-cycle 2-ACS inner loop, repeated x 8
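One Viterbi butterfly performs two add-compare-select (ACS) operations on a pair of old path metrics. The sketch below follows the add/sub, compare and select pattern of the loop above, assuming the larger metric survives (as the cmpgt-based selection suggests); the function name, argument names and return layout are illustrative only.

```python
def acs_butterfly(old0, old1, branch):
    """Two add-compare-select operations sharing one branch metric.

    Returns (new_a, dec_a, new_b, dec_b): the two surviving path metrics
    and a decision bit for each (1 when the path from old1 was selected).
    """
    # Add: four candidate metrics.
    a0 = old0 + branch
    b0 = old1 - branch
    a1 = old0 - branch
    b1 = old1 + branch
    # Compare: decision bits.
    dec_a = 1 if b0 > a0 else 0
    dec_b = 1 if b1 > a1 else 0
    # Select: keep the larger candidate of each pair.
    new_a = b0 if dec_a else a0
    new_b = b1 if dec_b else a1
    return new_a, dec_a, new_b, dec_b

if __name__ == "__main__":
    print(acs_butterfly(10, 12, 3))
```

On a machine without a dedicated instruction, the select step is what costs extra operations, which is why the loop above uses predicated multiplies by 1 as conditional moves.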


Lucent / Motorola Star*Core SC140

• 6-way VLIW with 128 bit (8x16) instruction fetch
• Prefix instructions for high performance without sacrificing code density
• Each execution set (parallel instructions + prefix) predicated
• 5 stage pipeline
• 1800 MIPS, 1200 MMACs @ 300 MHz

[Block diagram: Program / Data Memory; Program Sequencer and Instruction Dispatcher; Address Registers (27) with two AAUs; Data Registers (16) with four MAC/ALU/BFU units]


Viterbi on Star*Core

• Hardware support for Viterbi algorithm:
  – max2vit instruction
  – vsl instruction

• 1 cycle per butterfly through software-pipelining

• Decision bits are manually stored using the Viterbi Shift Left (VSL) instruction

GSM (K=5, 16 states):

[ move.2l (r0)+,d0:d1   move.2l (r1)+,d1:d2 ]
[ add2 d0,d4   sub2 d6,d2
  sub2 d4,d0   add2 d2,d6 ]
[ max2vit d4,d2   max2vit d0,d6 ]
[ vsl.4w d2:d6:d1:d3,(r2)+n0
  vsl.4f d2:d6:d1:d3,(r3)+n0 ]

[Diagram: max2vit d4,d2 and max2vit d0,d6 produce path metrics and decisions (SR); vsl.4w d2:d6:d1:d3,(r2)+n0 takes D1, D3, D2, D6 and the results are written to memory; x 4]

Courtesy: Gareth Hughes: Bell Labs Australia


Log-MAP on Star*Core

Star*Core code for log-MAP Butterfly

Inputs: d0: a, d1: b, d6: x

move.w (r0)+,d0      move.w (r1)+,d1
add d0,d6,d0         sub d6,d0,d5          ; d0: a+x, d5: a-x
sub d6,d1,d4         add d1,d6,d1          ; d4: b-x, d1: b+x
sub d0,d4,d2         sub d1,d5,d3          ; d2: d0-d4, d3: d1-d5
max d0,d4            max d1,d5             ; d4: max(d0,d4), d5: max(d1,d5)
abs d2               abs d3
move.l d2,n0                               ; n0: |d2|
move.l d3,n0         move.w (r6+n0),d2     ; n0: |d3|, d2: table look-up via r6
add d4,d2,d4         move.w (r6+n0),d3     ; d4: d4+d2, d3: table look-up via r6
add d5,d3,d5                               ; d5: d5+d3
move.2w d4:d5,(r2)+

The 11 execution sets above occupy Cycle 1 to Cycle 11. This code uses 2 of the 4 ALUs and can be software pipelined to achieve 6 cycles per log-MAP butterfly.

Courtesy: Gareth Hughes: Bell Labs Australia
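The dataflow of the log-MAP butterfly can be followed in a fixed-point sketch. The scale factor of 8, the 32-entry table and the function name are hypothetical; only the sequence of operations (add/sub, difference, max, absolute value, table look-up, add) follows the register labels above.

```python
import math

SCALE = 8  # hypothetical fixed-point scale: metric value = real value * 8
# Hypothetical correction table, indexed directly by the absolute difference.
TABLE = [round(SCALE * math.log1p(math.exp(-i / SCALE))) for i in range(32)]

def correction(diff):
    """Table look-up on |diff|; differences past the table contribute 0."""
    idx = abs(diff)
    return TABLE[idx] if idx < len(TABLE) else 0

def log_map_butterfly(a, b, x):
    """Return the two updated metrics, named after registers d4 and d5."""
    d0, d5 = a + x, a - x
    d4, d1 = b - x, b + x
    d2 = d0 - d4              # difference kept for the table look-up
    d3 = d1 - d5
    d4 = max(d0, d4) + correction(d2)
    d5 = max(d1, d5) + correction(d3)
    return d4, d5

if __name__ == "__main__":
    print(log_map_butterfly(40, 24, 8))
```

Compared with the Viterbi butterfly, the difference must be kept and turned into a table address, which is where the extra cycles in the listing go.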


Parallel DSP Architectures

Arch.    | Parallelism                    | Compile? | Power?
S/scalar | Dynamic instruction level      |          |
VLIW     | Static instruction level       |          |
SIMD     | Highly regular, data dependent |          |
MIMD     | Task level                     |          |

[The rating symbols in the Compile? and Power? columns are not recoverable from the source.]

MIMD with VLIW / SIMD provides high order parallel execution

The future of high performance DSPs is MIMD


Daytona: A Multiprocessor DSP Architecture

[Block diagram: Chip containing Programmable Processing Elements (PEs) and a Hardware Accelerator on a split transaction bus (128 bits), with Arbitration and Synchronization; an I/O Subsystem with Buffered I/O and I/O Interfaces; External Memory]

• Scalable architecture: multiple programmable DSPs on a single chip
• One bus supports different programmable DSPs and microcontrollers


Split Transaction Bus

• Separate Address and Data busses - each with pipelined protocol
• Multiple outstanding transactions - varying size/priority
• Separate Bus Arbitration

[Diagram: PE and Memory Controller exchanging ID-tagged addr and data transfers; Address Bus (100MHz) and Data Bus (128 bits, 100MHz), each with its own round-robin Arbiter]
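Two ideas on this slide lend themselves to a small behavioural sketch: round-robin arbitration, and ID tags that let replies be matched to outstanding requests in any order. The class and method names are hypothetical, and bus timing, transaction sizes and priorities are not modelled.

```python
class RoundRobinArbiter:
    """Grants one requester per call, rotating priority after each grant."""

    def __init__(self, n_requesters):
        self.n = n_requesters
        self.last = n_requesters - 1  # so requester 0 has first priority

    def grant(self, requests):
        # Search starting just after the most recent winner.
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if requests[candidate]:
                self.last = candidate
                return candidate
        return None  # nobody is requesting


class SplitTransactionBus:
    """Tracks outstanding transactions by ID so replies can return in any order."""

    def __init__(self):
        self.outstanding = {}
        self.next_id = 0

    def issue(self, requester, addr):
        # Address phase: record the request and hand back its ID tag.
        tid = self.next_id
        self.next_id += 1
        self.outstanding[tid] = (requester, addr)
        return tid

    def complete(self, tid, data):
        # Data phase: the ID tag identifies which request is being answered.
        requester, addr = self.outstanding.pop(tid)
        return requester, addr, data


if __name__ == "__main__":
    arb = RoundRobinArbiter(3)
    print([arb.grant([True, True, False]) for _ in range(3)])
    bus = SplitTransactionBus()
    first, second = bus.issue(0, 0x100), bus.issue(1, 0x200)
    print(bus.complete(second, "B"), bus.complete(first, "A"))
```

Because the address bus is released as soon as the request is tagged, a slow memory access does not block other processing elements from issuing their own requests.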


Memory Hierarchy in MIMD DSPs

• Heterogeneous mix of applications: mix of different applications (e.g. equalisation, convolutional decoding)
• Multiple copies of same software - shared memory multiprocessing: multiple copies of 1 application (e.g. odd/even slot channel equalisation)

[Figure: Flat Memory Architecture vs. Hierarchical Memory Architecture. Flat: each DSP has its own SRAM (2 copies of software). Hierarchical: each DSP has a cache backed by shared DRAM (1 copy of software). Label: Inefficient]


Shared Memory Multiprocessing

64 Semaphores provided for process synchronization

[Diagram: four DSPs with caches on a bus with the Memory Controller. One DSP issues a Coherent Transaction to access shared data; the other caches respond Snoop (miss), Snoop (hit), Snoop (miss)]

Access to shared data uses a coherent transaction. Caches “snoop” the address and query their tag RAMs. A cache hit prevents the memory controller from servicing the request.

L-1 cache coherency using a snoopy protocol (modified MESI used)
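The snoop decision can be illustrated with a toy model in which each cache is a dictionary from address to data. This is only a sketch under strong assumptions: the function name is hypothetical, the hitting cache is assumed to supply the data, and the MESI states and write traffic of the real protocol are not modelled.

```python
def coherent_read(addr, requester, caches, memory):
    """Serve a coherent read and report who supplied the data.

    Every cache other than the requester's checks its tags (here, its
    dictionary keys). A snoop hit stops the memory controller from
    servicing the request.
    """
    for idx, cache in enumerate(caches):
        if idx != requester and addr in cache:  # snoop hit
            return cache[addr], "cache %d" % idx
    # All snoops missed: the memory controller services the request.
    return memory[addr], "memory"

if __name__ == "__main__":
    caches = [{}, {0x40: 7}, {}, {}]
    memory = {0x40: 1, 0x80: 2}
    print(coherent_read(0x40, 0, caches, memory))
    print(coherent_read(0x80, 0, caches, memory))
```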


Daytona Multiprocessor DSP Chip

Bell Laboratories Research Chip for 3G Wireless Base-stations / Head-end xDSL

[Block diagram: four processing elements, each with a 32-b RISC, a 64-b 4-MAC SIMD DSP and Cache Memory, on a 128-b Split Transaction Bus together with the Host Interface, I/O & Memory Controller, Test & JTAG Port, Arbiter and Semaphore]

Chip Characteristics
Core Area | 120 mm^2
Speed     | 100 MHz
Power     | 4 W
Tech      | 0.25 um

Paper 4.2, ISSCC2000


Photomicrograph of Daytona Test Chip

[Photomicrograph labels: 8KB Re-configurable Memory, DLL, SPARC, Vector Unit (RVU), BUS INT, HDS, LRU, I/O Subsystem, Arbiter, Semph, Split Transaction Bus, Processing Elements (PE)]

Paper 4.2, ISSCC2000


Acknowledgements

The following people contributed to the work in this tutorial:

Low Power DSPs for Wireless
• Wanda Gass: Texas Instruments
• Mihran Touriguian: Atmel

High Performance DSPs for Wireless Infrastructure
• Bryan Ackland: Bell Labs US - High Perf. DSP Architecture
• Gareth Hughes: Bell Labs Australia - LU DSP16210, ‘C6x and Starcore benchmarks
• Bing Xu: Bell Labs Australia - SOVA, MAP, LOG-MAP
• Ran-Hong Yan: Bell Labs UK - 3G Wireless
• Daytona Team: (J Williams, K.J. Singh, J. Othmer, B. Ackland), Bell Labs US.


References

[1] P. Lapsley, J. Bier, A. Shoham, E. Lee, “DSP Processor Fundamentals,” IEEE Press, New York, 1997.
[2] D. Skillicorn, “A Taxonomy for Computer Architectures,” Computer Magazine, Nov. 1988.
[3] H. Kabuo, M. Okamoto, I. Tanaka, H. Yasoshima, S. Marui, M. Yamasaki, T. Sugimura, K. Ueda, T. Ishikawa, H. Suzuki, R. Asahi, “An 80 MOPS-Peak High-Speed and Low-Power-Consumption 16-b Digital Signal Processor,” IEEE Journal of Solid-State Circuits, Vol. 31, No. 4, April 1996, pp. 494-503.
[4] E. A. Lee, D. G. Messerschmitt, Digital Communication, Boston: Kluwer Academic Publishers, 1988.
[5] W. Lee et al., “A 1V DSP for Wireless Communications,” Proceedings IEEE International Solid-State Circuits Conference, pp. 92-93, February 1997.
[6] S. Lin and J. Costello Jr., Error Control Coding: Fundamentals and Applications, Prentice Hall, New Jersey, 1983.
[7] Lucent 16000, http://www.lucent.com/micro/ or http://www.lucent.dk/micro/dsp16000/
[8] Thomas Parsons, Voice and Speech Processing, McGraw-Hill Book Company, New York, 1987.
[9] TMS320C54x User’s Guide, available from the Texas Instruments Literature Response Center.
[10] I. Verbauwhede, M. Touriguian, “A Low Power DSP Engine for Wireless Communications,” Journal of VLSI Signal Processing 18, pp. 177-186, 1998, Kluwer Academic Publishers.
[11] I. Verbauwhede, M. Touriguian, “Wireless digital signal processors,” Chapter in Digital Signal Processing for Multimedia Systems, Edited by K.K. Parhi, T. Nishitani, Publisher: Marcel Dekker, New York, 1999.
[12] M. Okamoto, K. Stone, T. Sawai, H. Kabuo, S. Marui, M. Yamasaki, Y. Uto, Y. Sugisawa, Y. Sasagawa, T. Ishikawa, H. Suzuki, N. Minamida, R. Yamanaka, K. Ueda, “A High Performance DSP Architecture for Next Generation Mobile Phone Systems,” 1998 IEEE DSP Workshop.
[13] Lode specifications, available from www.atmel.com
[14] M.W. Oliphant, “The Mobile Phone meets the Internet”, IEEE Spectrum, pp. 20-28, Aug. 1999.
[15] L. C. Godara, “Application of Antenna Arrays to Mobile Communications: Part 1”, Proc. IEEE, Vol. 85, No. 7, pp. 1031-1060, July 1997.
[16] G. D. Forney, Jr., “Maximum Likelihood Sequence Estimation of Digital Sequences in the Presence of Intersymbol Interference”, IEEE Trans. Inform. Theory, Vol. IT-18, pp. 363-378, May 1972.
[17] C. Berrou, A. Glavieux, P. Thitimajshima, “Near Shannon Limit Error-Correcting Coding and Decoding: Turbo-Codes (1)”, Proc. ICC’93, May 1993.
[18] J. Hagenauer, P. Hoeher, “A Viterbi Algorithm with Soft-Decision Outputs and its Applications”, Proc. Globecom 89, Nov. 1989, pp. 47.1.1-47.1.7.
[19] L. Bahl, J. Cocke, F. Jelinek, J. Raviv, “Optimal Decoding of Linear Codes for Minimizing Symbol Error Rate”, IEEE Trans. Inform. Theory, Vol. IT-20, pp. 284-287, Mar. 1974.
[20] J. Turley, H. Hakkarainen, “TI’s new ‘C6x DSP Screams at 1600 MIPS”, Microprocessor Report, Vol. 11, No. 2, pp. 14, Feb. 1997.
[21] “Starcore Launched First Architecture”, Microprocessor Report, Vol. 12, No. 14, pp. 22, Oct. 1998.
[22] B. Ackland & P. D’Arcy, “A New Generation of DSP Architectures”, Proc. IEEE CICC99, Paper 25.1.1.
[23] J. Williams, K.J. Singh, C.J. Nicol, B. Ackland, “A 3.2 GOPs Multiprocessor DSP for Communication Applications”, Proc. IEEE ISSCC2000, Paper 4.2.