
HJ94, Spring 2004, Ingrid Verbauwhede, lecture 9

Architecture exploration Lecture 9

Ingrid Verbauwhede

Departement Elektrotechniek, afdeling ESAT/COSIC



Motivation

• Architecture exploration

• Specification: MATLAB, SPW, C/C++, Java

• Floating point

• Fixed point

• Algorithm transformations

• Architecture alternatives

– Bit parallel (bit serial)
– ASIC, special purpose (Art Designer)
– Retargetable coprocessor (Target Compiler Technologies)
– DSP extensions to RISC (Gezel, Tensilica)
– DSP processors (TI TMS320C54x, TMS320C55x, ADI Blackfin, etc.)


References

• The origins:
– E.A. Lee, “Programmable DSP Processors,” Part I, IEEE ASSP Magazine, October 1988, pp. 4-19.
– Part II, IEEE ASSP Magazine, January 1989, pp. 4-14.

• Continue on this:
– I. Verbauwhede, C. Nicol, “Low power DSP's for wireless communications,” 2000 International Symposium on Low Power Electronics and Design (ISLPED), July 2000.
– I. Verbauwhede, P. Schaumont, C. Piguet, B. Kienhuis, “Architectures and design techniques for energy efficient embedded DSP and multimedia,” 2004 Design Automation and Test in Europe (DATE 2004).


Today

• SOC components (continued)
– DSP processors
– VLIW processors

• Design of SOC itself


DSP Processors

• Today’s general purpose, assembly coded DSP: 100 MOPS, 250 mW, $40

• Low cost, low power DSPs (mobile terminals): 200-1000 MOPS, < 100 mW, $10
– Highly optimized domain specific processors

• High performance DSPs (infrastructure): 1-10 GOPS, 1-5 watts, < $50
– Compiler friendly VLIW type of DSP processors

DSP processors

• Last lecture: DSP = domain specific processor
– Highly optimized for wireless communication
– EVERY component of the processor:
   • Datapath = MAC
   • Memory = Harvard or Modified Harvard
   • Address arithmetic: indirect, modulo, bit reverse (FFT)
   • Control: CISC with specialized instruction set
– Example of FIR calculation

• Today:
– Pipeline specifics of DSP processors
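The address arithmetic modes listed above (modulo and bit reverse) can be illustrated with a small sketch. This is only a software model under assumptions: the function names and the buffer layout (`base`, `length`) are hypothetical and do not correspond to any particular DSP's address generation unit.

```python
def modulo_addresses(base, length, start, step, count):
    """Addresses produced by a pointer that wraps inside [base, base + length)."""
    offset = start
    addresses = []
    for _ in range(count):
        addresses.append(base + offset)
        offset = (offset + step) % length  # wrap around: circular buffer
    return addresses


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index (FFT-style addressing)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Access order for an n-point FFT buffer, n a power of two."""
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]


wrapped = modulo_addresses(base=100, length=4, start=2, step=1, count=6)
fft_order = bit_reversed_order(8)
```

Modulo addressing keeps a delay line in place without copying samples, and bit-reversed addressing reads FFT data in shuffled order without a separate reordering pass.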


Pipelining:

[Figure: three successive instructions, each passing through the stages Fetch, Decode, Memory Access, Execute, overlapped in time]

Fetch = fetch instruction
Decode = decode instruction
Memory access = address generation and read operands
Execute = perform operation
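As a rough illustration of the overlap shown on this slide, the sketch below lists which instruction occupies which stage in every cycle. It assumes an ideal four-stage pipeline without stalls; the function name and the instruction labels are made up for the example.

```python
STAGES = ["Fetch", "Decode", "Memory Access", "Execute"]


def pipeline_schedule(instructions):
    """Return {cycle: {stage: instruction}} for an ideal pipeline without stalls."""
    schedule = {}
    for issue_cycle, instruction in enumerate(instructions):
        for depth, stage in enumerate(STAGES):
            # Instruction issued in cycle n reaches stage k in cycle n + k.
            schedule.setdefault(issue_cycle + depth, {})[stage] = instruction
    return schedule


schedule = pipeline_schedule(["I1", "I2", "I3"])
for cycle in sorted(schedule):
    print(cycle, schedule[cycle])
```

Once the pipeline is full, one instruction completes per cycle even though each instruction takes four cycles from fetch to execute.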


Pipelining

How does the pipeline appear to the programmer?
Lee’s paper (part II) discusses 3 variations (the difference is often blurry):
• interlocking
• time stationary coding
• data stationary coding

Trade-off between efficiency and “ease-of-use”

Interlocking: the instructions appear as if executed one after another


Interlocking on C10

[Figure: pipeline diagram for the instruction sequence LT, MPY, LTD, MPY, each passing through Fetch, Decode, Memory Access, Execute]

[Figure: reservation table with rows PMEM, DMEM, ALU, MPY over instruction cycles. PMEM row: LT, MPY, LTD, MPY, ...; DMEM row: data, coef1, data, coef2, ...]

Interlocking on C2x

The programmer does not need to know the pipeline. If an access conflict occurs, the hardware will “stall” and finish one (part of an) instruction before finishing a second part.

RPTK 49
MACD

[Figure: reservation table with rows PMEM, DMEM, ALU, MPY. PMEM row: RPTK, MACD, coef1, coef2, coef3, ...; DMEM row: data1, data2, ...]


Time stationary

An instruction specifies “one instruction cycle”, so it specifies everything that occurs in parallel in that cycle.

[Figure: four successive instructions, each passing through Fetch, Decode, Memory Access, Execute]

Example:

Motorola:
MAC X0,Y0,A X:(R0)+,X0 Y:(R4)-,Y0
(multiply-accumulate of values read from memory in the previous cycle)

Lucent 16x:
a0 = a0 + p, p = x * y, y = *r0++, x = *pt++
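The Lucent-style instruction above can be modeled in a few lines to show why time stationary code works on several samples at once. This is a sketch under assumptions: all four operations read the old register values within a cycle, and the function name and the two extra drain cycles are part of the model, not taken from the slide.

```python
def time_stationary_dot_product(data, coefs):
    """Repeat 'a0 = a0 + p, p = x * y, y = *r0++, x = *pt++' once per cycle.

    Every right-hand side uses the register values from the previous cycle,
    so one instruction accumulates product k-2, multiplies pair k-1 and
    loads pair k.
    """
    n = len(data)
    a0 = p = x = y = 0
    r0 = pt = 0
    for _ in range(n + 2):  # n loads plus 2 cycles to drain multiply and add
        new_a0 = a0 + p
        new_p = x * y
        new_y = data[r0] if r0 < n else 0
        new_x = coefs[pt] if pt < n else 0
        a0, p, y, x = new_a0, new_p, new_y, new_x  # all updates happen together
        r0 += 1
        pt += 1
    return a0


result = time_stationary_dot_product([1, 2, 3], [4, 5, 6])
```

The extra cycles at the end correspond to the prolog/epilog effort the programmer has to manage by hand in this coding style.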

Data stationary

Time stationary: working on different samples in one instruction.
Data stationary: describes what happens with one input data from start to end.

Example (Lode):

*r3++ = a0 += a2 * *r2++;
(read from memory with pointer register r2, multiply with a2, add to a0 and store back in a0, store the result in memory with pointer r3, post-modify r2 and r3)

[Figure: pipeline stages Fetch, Decode, Read, Execute, Write]
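For comparison, the data stationary instruction above can be sketched as one function that follows a single operand from read to write. The register dictionary and memory list are hypothetical stand-ins for the processor state.

```python
def data_stationary_step(mem, regs):
    """Model of '*r3++ = a0 += a2 * *r2++' for one input value."""
    operand = mem[regs["r2"]]            # read from memory via pointer r2
    regs["a0"] += regs["a2"] * operand   # multiply with a2, accumulate in a0
    mem[regs["r3"]] = regs["a0"]         # store the result via pointer r3
    regs["r2"] += 1                      # post-modify both pointers
    regs["r3"] += 1


mem = [1, 2, 3, 0, 0, 0]
regs = {"a0": 0, "a2": 2, "r2": 0, "r3": 3}
for _ in range(3):
    data_stationary_step(mem, regs)
```

Here the programmer describes the whole operation once, and it is up to the hardware to spread the read, execute and write over the pipeline stages.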


Control & Pipeline for DSP’s

RISC: load/store machine
Memory access with load/store instructions (DLX, MIPS, D10V)

[Figure: RISC pipeline: Fetch, Decode, Execute (execution / address generation), Memory Access (memory access / branch), Write Back]

Excellent for complex decision making!

DSP: register-memory architecture (TI, Lucent, HX, Lode)

[Figure: DSP pipeline: Fetch, Decode, Memory Access, Execute, Write Back]

Excellent for number crunching!

Pipeline RISC compared to DSP

RISC: example

r0 = *p0; // load data
a0 = a0 + r0; // execute

[Figure: RISC pipeline, successive instructions through Fetch, Decode, Execute, Memory Access]

Too expensive for DSP

DSP: memory intensive applications:

[Figure: DSP pipeline, four successive instructions through Fetch, Decode, Memory Access, Execute]

Penalty: data dependent branch is expensive


Application domain: wireless communications

[Figure: block diagram of a cellular phone. RF board: receiver, transmit, synthesizer, PA, TCXO. Baseband board: external memories, digital ASIC, microprocessor, DSP, analog ASIC, audio codec, power supply, battery pack, display and keypad.]


Performance requirements: digital cellular phone

[Figure: receive chain: RF receive, demodulation, channel decoder, speech decoder. Transmit chain: speech encoder, channel encoder, modulation, RF send. The blocks are grouped into Communication and Application.]

Goal: Minimum “MIPS” to get the job done.


Application Domain: compute intensive functions

Source encoder/decoder = speech coders
Advanced vocoders for improved speech quality & higher capacity.
Example: ACELP derivatives for GSM and IS136A

• Digital filtering (FIR, IIR)

• Vector quantization, code book search (square distance computation)

Channel encoder/decoder = error correcting
Complex wireless modems:

• Galois field arithmetic

• Convolution coders based on Viterbi trellis search

• Turbo coders


Compute intensive functions: evolution of DSP’s

Simple FIR example

Square distance for speech processing

Speed-up of FIR example

Viterbi acceleration for communication algorithms

Evolution of DSPs follows these examples


The Viterbi Decoding (Introduction)

• Error correcting decoding algorithm for convolutional codes
• Trellis representation
• Maximum likelihood decoding algorithm
• GSM system


Convolutional Code (ex. Wyner-Ash Code)

• Generator matrix G(D) = [ 1  1+D ]
• Input sequence u(D) = 1, 1, 0, 1, 0, …
• Output sequence c(D) = u(D)G(D) = 11, 10, 01, 11, 01, …

[Figure: encoder with one delay element D]
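The encoder for G(D) = [ 1  1+D ] can be checked with a short sketch. The function name is hypothetical; the only assumption is that the delay element starts at 0.

```python
def encode_rate_half(bits):
    """Encode with G(D) = [1, 1+D]: output (u, u XOR previous u) per input bit."""
    previous = 0  # contents of the delay element D, assumed to start at 0
    output = []
    for u in bits:
        output.append((u, u ^ previous))
        previous = u
    return output


codeword = encode_rate_half([1, 1, 0, 1, 0])
print(["%d%d" % pair for pair in codeword])
```

For the input sequence on the slide this reproduces the output sequence 11, 10, 01, 11, 01.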


Constraint length K and Rate

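• v = 1, K = 2, 2 states

• Rate = 1/2, one input bit generates two coded output bits.

[Figure: encoder with one delay element D and its two-state diagram. Edges are labeled input,output. State 0: 0,00 (stay) and 1,11 (to state 1). State 1: 1,10 (stay) and 0,01 (to state 0).]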


Trellis Representation

• Example: G(D) = [ 1+D^2  1+D+D^2 ], v = 2, K = 3, 4 states

• Instead of writing a State Diagram,

[Figure: encoder with two delay elements D, and the trellis over t = 0 … 4 with states S00, S10, S01, S11; branches are labeled with the output bits 00, 11, 10, 01]
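The branches of this trellis follow directly from the generator matrix. The sketch below enumerates them; it assumes a state is written as (most recent input bit, the bit before that), which matches a move from S00 to S10 on input 1. The function name is illustrative.

```python
def trellis_branches():
    """Map (state, input bit) to (next state, output bits) for G(D) = [1+D^2, 1+D+D^2]."""
    branches = {}
    for a in (0, 1):          # a = input bit one step ago
        for b in (0, 1):      # b = input bit two steps ago
            for u in (0, 1):  # current input bit
                output = (u ^ b, u ^ a ^ b)
                branches[((a, b), u)] = ((u, a), output)
    return branches


branches = trellis_branches()
for (state, u), (next_state, output) in sorted(branches.items()):
    print("S%d%d --%d/%d%d--> S%d%d" % (state + (u,) + output + next_state))
```

Each state has exactly two outgoing and two incoming branches, which is the structure the Viterbi algorithm exploits.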


Efficiency of Viterbi decoding

• Identifies the path through the trellis
--- Selecting survivor paths for each state by calculating the Hamming distance

• The total number of paths grows exponentially with the number of states
--- As K increases, H/W complexity increases exponentially, but the error rate decreases


Viterbi Decoding Algorithm (1)

• Assume N = 7 blocks

[Figure: trellis over t = 0 … 7 with states S00, S10, S01, S11 and branch output labels 00, 11, 10, 01]

Block               1    2    3    4    5    6    7
Information Data    0    1    1    0    1    0    0
Convolution Codes   00   11   10   10   00   01   11
Error Sequence      00   01   10   00   00   10   00
Received Data       00   10   00   10   00   11   11

(The last information bits are tail bits.)


Viterbi Decoding Algorithm (2)

• Calculate Hamming Distance (Choose smaller one)

[Figure: the trellis over t = 0 … 7 with the same information data, convolution codes, error sequence and received data as in Viterbi Decoding Algorithm (1), annotated with the accumulated Hamming distances of the first blocks; at each state the smaller distance is kept]


Viterbi Decoding Algorithm (3)

• Selecting the Optimal Path

[Figure: the trellis over t = 0 … 7 with the same information data, convolution codes, error sequence and received data as in Viterbi Decoding Algorithm (1), annotated with the accumulated Hamming distance at every state; the selected optimal path ends with a total distance of 3]
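The three decoding steps can be put together in a compact sketch for the G(D) = [ 1+D^2  1+D+D^2 ] example. It is a minimal hard-decision decoder under assumptions: the function names are hypothetical, states are (most recent bit, bit before that), and the traceback starts in S00 because the tail bits return the encoder to that state.

```python
def branch(state, u):
    """Next state and output bits for input u, G(D) = [1+D^2, 1+D+D^2]."""
    a, b = state
    return (u, a), (u ^ b, u ^ a ^ b)


def encode(bits):
    state, output = (0, 0), []
    for u in bits:
        state, code = branch(state, u)
        output.append(code)
    return output


def viterbi_decode(received):
    """Return (decoded bits, Hamming distance of the survivor ending in S00)."""
    infinity = float("inf")
    states = [(0, 0), (1, 0), (0, 1), (1, 1)]
    metric = {s: (0 if s == (0, 0) else infinity) for s in states}
    history = []
    for r in received:
        new_metric = {s: infinity for s in states}
        survivor = {}
        for s in states:
            if metric[s] == infinity:
                continue
            for u in (0, 1):
                nxt, code = branch(s, u)
                # Add: accumulated distance plus Hamming distance of this branch
                distance = metric[s] + (code[0] != r[0]) + (code[1] != r[1])
                if distance < new_metric[nxt]:  # Compare and Select
                    new_metric[nxt] = distance
                    survivor[nxt] = (s, u)
        metric = new_metric
        history.append(survivor)
    state, bits = (0, 0), []
    for survivor in reversed(history):  # traceback from S00
        state, u = survivor[state]
        bits.append(u)
    bits.reverse()
    return bits, metric[(0, 0)]


info = [0, 1, 1, 0, 1, 0, 0]
received = [(0, 0), (1, 0), (0, 0), (1, 0), (0, 0), (1, 1), (1, 1)]
decoded, distance = viterbi_decode(received)
```

With the received data from the slides, the survivor ending in S00 has distance 3 and reproduces the information data, so the three bit errors are corrected in this example.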


Traceback

• We cannot wait for the end of sequence for some applications

• The amount of “delay” is called traceback depth LD.

--- Larger LD: better performance, but needs more memory and complexity


Viterbi in GSM

• Full-rate speech channel, 22.8 kbps: Rate = 1/2, K = 5

• Half-rate speech channel, 11.4 kbps: Rate = 1/3, K = 7
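To relate the constraint length K to decoder effort, the sketch below computes the number of trellis states (2^(K-1)) and butterflies (two states each) for the two GSM channels. The helper name is illustrative.

```python
def trellis_size(constraint_length):
    """Number of states and butterflies per trellis step for constraint length K."""
    states = 2 ** (constraint_length - 1)
    butterflies = states // 2  # each butterfly updates two states
    return states, butterflies


full_rate = trellis_size(5)
half_rate = trellis_size(7)
print("K = 5:", full_rate, "K = 7:", half_rate)
```

Going from K = 5 to K = 7 quadruples the add-compare-select work per decoded bit.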


Required Performance


Compute Intensive function 2: Viterbi

[Figure: Viterbi butterfly. Old states i and i + s/2 connect to new states 2i and 2i+1 with branch metrics +a, -a, -a, +a.]

i = state index
s = # of states = 2^(k-1)
w = decoding window

Basic equations:

d(2i) = min { d(i) + a, d(i + s/2) - a }
d(2i + 1) = min { d(i) - a, d(i + s/2) + a }

IS-95: k = 8, w = 192, corresponds to 2^7 x 192 x (cycles for one ACS)

Basic algorithm in Viterbi channel decoders, modified version in turbo decoders.

Key operation: Add-Compare-Select (ACS)
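The butterfly equations translate almost literally into code. The sketch below updates all path metrics for one trellis step; it assumes one branch metric a[i] per butterfly and records a decision bit (0 = first candidate, 1 = second candidate) for traceback. Both the function name and the decision-bit convention are assumptions.

```python
def acs_stage(d, a):
    """One trellis step: d holds s path metrics, a holds s/2 branch metrics."""
    s = len(d)
    half = s // 2
    new_d = [0] * s
    decisions = [0] * s
    for i in range(half):
        top, bottom = d[i], d[i + half]
        for offset, sign in ((0, +1), (1, -1)):
            first = top + sign * a[i]       # d(i) + a for 2i, d(i) - a for 2i+1
            second = bottom - sign * a[i]   # d(i+s/2) - a for 2i, + a for 2i+1
            new_d[2 * i + offset] = min(first, second)
            decisions[2 * i + offset] = 0 if first <= second else 1
    return new_d, decisions


new_metrics, decision_bits = acs_stage([0, 10, 10, 10], [1, 2])
```

The DSP implementations on the following slides differ mainly in how many of these additions, comparisons and decision-bit stores they can perform per cycle.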


Viterbi on Atmel’s Lode

Two MAC units & ALU: Add-Compare-Select

• DMAC operates as dual add/subtract unit

• ALU finds minimum

• Shortest distance saved

• Path indicator saved

• 4 cycles / butterfly

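[Figure: datapath with data buses DB0(16) and DB1(16); MAC0 and MAC1 add µ1 and µ2 to Γ1 and Γ2 using accumulators A0 and A1; the ALU takes Min() with A2 and A3, producing Γ and a decision bit that goes to memory]

Γ = min [(Γ1 + µ1), (Γ2 + µ2)]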

Viterbi on TI C54x

ALU and CSSU: Add-Compare-Select

• ALU splits in 16 bit halves

• ACC splits in half

• Shortest distance saved

• CSSU compares halves

• Path indicator saved

• 4 cycles / butterfly

[Figure: datapath with data buses DB0(16) and DB1(16), TREG, the ALU adding µ1 and µ2 to Γ1 and Γ2 held in the two accumulator halves, a compare unit with MSW/LSW select, the decision bit going to the TRN register, and Γ written over data bus EB to memory]

Γ = min [(Γ1 + µ1), (Γ2 + µ2)]
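The idea of splitting a 32 bit datapath into two 16 bit halves can be sketched in software. This is only a model of the bullets above, not of the actual C54x instructions: the function names, the unsigned metrics and the decision-bit convention (1 when the low half is selected) are assumptions.

```python
MASK16 = 0xFFFF


def pack(high, low):
    """Place two 16 bit values in one 32 bit word."""
    return ((high & MASK16) << 16) | (low & MASK16)


def dual_add(acc, operand):
    """Add the two halves independently, with no carry between them."""
    high = ((acc >> 16) + (operand >> 16)) & MASK16
    low = ((acc & MASK16) + (operand & MASK16)) & MASK16
    return (high << 16) | low


def compare_select(acc, trn):
    """Keep the smaller half and shift a decision bit into trn."""
    high, low = acc >> 16, acc & MASK16
    if high <= low:
        return high, (trn << 1) & MASK16
    return low, ((trn << 1) | 1) & MASK16


# Gamma1 = 5, Gamma2 = 7 in the accumulator halves; mu1 = 3, mu2 = 0
candidates = dual_add(pack(5, 7), pack(3, 0))
gamma, trn = compare_select(candidates, 0)
```

One packed addition replaces two separate additions, which is how a single ALU can serve both branches of the add-compare-select.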


Viterbi on LU DSP16210

GSM (K=5, 16 states)

do 8 {
  a0=a4+y a1=a5-y *r3++=a0h
  a2=a4-y a3=a5+y *r5++=a2h
  a0=cmp1(a1,a0) yh=*r0 r0=r1+j j=k k=*pt1++
  a2=cmp1(a3,a2) a4_5h=*pt0++
}

• Hardware support for Viterbi algorithm:
– ACS calculations are efficient
– Minimal overhead

• 4 cycles per butterfly
– 32 cycles per GSM timeslot.

• Comparison functions store ACS decision bits:

[Figure: successive a0=cmp1(a1,a0) and a2=cmp1(a3,a2) operations shift decision bits into AR0; the results are written to memory]

Courtesy: Gareth Hughes, Bell Labs Australia


BUT: DSP Software Development

• Complex DSP architecture not amenable to compiler technology

• Algorithms are modeled in high level language (e.g. C++)

• Solutions are implemented and debugged in hand-optimized assembler - large development effort with minimal tool support

[Figure: development flow from HLL algorithmic model via prototype code to production code in hand coded assembler, with an optimize & debug loop]

Long, frustrating time to market

Fragile legacy code

Widely used in handhelds, but basestations are changing to VLIW


2G Basestation Baseband Processing

• Multiple DSPs used for baseband processing
• RISC microcontroller for timing, framing, I/O control
• Software upgradable over the network
• DSPs dominate cost and power consumption

[Figure: Tx/Rx baseband processing board for 2-carrier GSM basestation. Rx and Tx paths, each with an AFE and several DSPs for channel equalization, channel de/coding and encryption, plus RAM, I/O blocks, an ASIC, a RISC microcontroller and a T1/E1 interface.]

Future trend: integrate baseband processing, low cost Pico BTS


Compiler Driven VLIW

Large orthogonal register set, regular interconnect

[Figure: data memory, register array and interconnect feeding the execution units ex1 (alu), ex2 (alu), ex3 (mpy), ex4 (ld/st), …, exn (ld/st)]

Instruction format: cond/branch | ex1 | ex2 | ex3 | ….. | exn

Atomic RISC-like operations => heavily pipelined, high freq. clock


Explicitly Parallel Instruction Computing

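Execution Clusters

[Figure: data memory and two register arrays, each with its own interconnect and execution units: ex1 (alu), ex2 (alu), ex3 (ld/st), ex4 (alu), ex5 (mpy), ex6 (ld/st)]

Execution Sets

[Figure: a fetch set shown with the bit pattern 1 1 1 0 1 0 1 0 and the execution sets derived from it]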


Texas Instruments ‘C6201

[Figure: two execution clusters, each with ALU, shift, mpy and add units, on Register Bank A (16 x 32) and Register Bank B (16 x 32); instruction dispatch & decode fed over a 256 bit bus from program memory (16K x 32); data memory (32K x 16)]

8-way VLIW with two execution clusters
256 bit (8x32) instruction fetch with variable length execute set
Each 32 bit instruction individually predicated
11 stage pipeline
1600 MIPS, 400 MMACs @ 200 MHz
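The peak figures quoted for the ‘C6201 follow from simple arithmetic: issue width times clock rate gives MIPS, and the number of multipliers times clock rate gives MMACs. The sketch below assumes two multiply units (one "mpy" per cluster, as in the figure); the function name is illustrative.

```python
def peak_rates(issue_width, multiply_units, clock_mhz):
    """Peak MIPS and MMACs, assuming every unit is busy in every cycle."""
    return issue_width * clock_mhz, multiply_units * clock_mhz


c6201_mips, c6201_mmacs = peak_rates(issue_width=8, multiply_units=2, clock_mhz=200)
print(c6201_mips, "MIPS,", c6201_mmacs, "MMACs")
```

These are upper bounds; how close real code gets depends on how well all eight units can be kept busy.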


FIR Filter on TI ‘C6x

Hand-coded assembly: 32-tap FIR filter

loop:
        ldw  .d1t1 *a4++,a5
|| ldw  .d2t2 *b4++,b5
||[b0] sub  .s2 b0,1,b0
||[b0] b    .s1 loop
|| mpy  .m1x a5,b5,a6
|| mpyh .m2x a5,b5,b6
|| add  .l1 a7,a6,a7
|| add  .l2 b7,b6,b7

• Outer loop: 23 cycles, 180 bytes
– 1 cycle in inner loop

• All 8 exec units used in inner loop: maximum efficiency
– 2 MACs per cycle

Assembly syntax more difficult to learn.
Hard to get full use of all 8 execution units at once.
Software pipelining difficult to implement, and requires longer prolog/epilog (larger code size).
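What the inner loop computes can be sketched at a high level: each cycle handles two taps, one through each multiplier, into two separate accumulators that are combined at the end. This models only the arithmetic, not the software pipelining; the function name and the even-length requirement are assumptions of the sketch.

```python
def fir_dual_mac(samples, coefs):
    """Dot product with two multiply-accumulates per loop iteration."""
    if len(samples) != len(coefs) or len(samples) % 2:
        raise ValueError("expects equal, even-length inputs")
    acc_a = acc_b = 0  # play the role of the two accumulators a7 and b7
    for k in range(0, len(samples), 2):
        acc_a += samples[k] * coefs[k]          # first multiplier (mpy)
        acc_b += samples[k + 1] * coefs[k + 1]  # second multiplier (mpyh)
    return acc_a + acc_b


output_sample = fir_dual_mac([1, 2, 3, 4], [5, 6, 7, 8])
```

A 32-tap filter needs 16 such iterations per output sample, which is where the "2 MACs per cycle" figure comes from.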

Courtesy: Gareth Hughes: Bell Labs Australia


Viterbi on TI ‘C6x

LOOP:
  [b1]  b     .s1  LOOP
||[b1]  sub   .s2  b1,1,b1
||[!a2] sth   .d1  b12,*+a6[8]
||[!a2] add   .d2  b0,b14,b14
||      cmpgt .l1  a11,a10,a1
||      cmpgt .l2  b11,b10,b0
||      mpy   .m1x 1,b5,a4

  [a2]  sub   .s1  a2,1,a2
||[!a2] sth   .d1  a12,*a6++
||[a1]  add   .s2  2,b0,b0
||[b0]  mpy   .m2  1,b11,b12
||      mpy   .m1  1,a10,a12
||      sub   .l2x a7,b5,b10
||      ldh   .d2  *++b9,b5

        shl   .s2  b14,2,b14
||[a1]  mpy   .m1  1,a11,a12
||      add   .s1  a7,a4,a10
||      sub   .l1x b13,a4,a11
||      add   .l2  b13,b5,b11
||      mpy   .m2  1,b10,b12
||      ldh   .d2  *b4++[2],a7
||      ldh   .d1  *a5++[2],b13
; end of LOOP

[Table: utilization of execution units in Viterbi decoder. Rows .D1, .D2, .M1, .M2, .L1, .L2, .S1, .S2 over cycles 0 to 31; the 3-cycle 2-ACS inner loop is repeated x 8.]

• 16-state Viterbi decoder for GSM from TI WWW site: ftp://ftp.ti.com/pub/tms320bbs/c62xfiles/vitgsm.asm
– 3 cycles per butterfly
– 32 cycles per GSM timeslot (8 butterflies)
– MPY instructions used to move data


Lucent / Motorola Star*Core SC140

6-way VLIW with 128 bit (8x16) instruction fetch
Prefix instructions for high performance without sacrificing code density
Each execution set (parallel instructions + prefix) predicated
5 stage pipeline
1800 MIPS, 1200 MMACs @ 300 MHz

[Figure: program / data memory, program sequencer and instruction dispatcher, address registers (27) with two AAUs, data registers (16) with four MAC / ALU / BFU units]


Viterbi on Star*Core

• Hardware support for Viterbi algorithm:
– max2vit instruction
– vsl instruction

• 1 cycle per butterfly through software-pipelining

• Decision bits are manually stored using the Viterbi Shift Left (VSL) instruction:

GSM (K=5, 16 states)

[ move.2l (r0)+,d0:d1 move.2l (r1)+,d1:d2 ]
[ add2 d0,d4 sub2 d6,d2
  sub2 d4,d0 add2 d2,d6 ]
[ max2vit d4,d2 max2vit d0,d6 ]
[ vsl.4w d2:d6:d1:d3,(r2)+n0
  vsl.4f d2:d6:d1:d3,(r3)+n0 ]

[Figure: max2vit d4,d2 and max2vit d0,d6 produce path metrics in the data registers (D1, D3, D2, D6) and decisions in SR; vsl.4w d2:d6:d1:d3,(r2)+n0 writes the results to memory; x 4]

Courtesy: Gareth Hughes: Bell Labs Australia


SOC


Energy-efficient SoCs are distributed

[‘Under the Hood’, EET, D. Carey, 9/5/02]

[Figure: T-Mobile PocketPC Phone teardown: TI baseband DSP, HTC interface ASIC, TI power management, Intel 32Mb flash, Intel 128Mb flash, Winbond 128Mb SDRAM, TI RF synth, TI RF TX/RX, Conexant power amp, Intel StrongArm, Sony LCD interface, Sony 240x320 color LCD, Philips audio codec, touchscreen, SIM, MMIC, expansion]


PalmPilot i705: architecture tuned to application

[Figure: PalmPilot i705 board: display, AD7873 digitizer, Motorola DragonBall, 8M SDRAM, 4M FLASH, FPGA, Philips USB, Maxim transceivers, Agere POM baseband, Motorola transceiver, RF Micro power amp, Maxim control driver, memory card slot]


Energy-flexibility trade-off

[Figure: chart with the labels Power and Cost, and design points General Purpose, Platform (???), and Fixed Application ASIC]


General purpose architectures also become heterogeneous.

[Figure: FPGA with embedded IBM PowerPC® RISC CPU, synchronous dual-port RAM, SelectIO-Ultra™, SystemIO™ & XCITE™, Conexant 3.125Gb serial, XtremeDSP™]

Source: Xilinx webpage


Question

• Energy and flexibility are opposing demands!
• How to navigate in this jungle?
• 3D design space:
– Computational abstraction level
– Reconfigurable feature
– Binding rate

• Next question: how to map (or compile) an application onto such an architecture?


Flexibility (1) - Abstraction level

Computational Abstraction Level

• Instruction set level = “programmable”

• CLB level = “reconfigurable”


Flexibility (2) - Reconfigurable feature

• Basic components (rows: computational abstraction level; columns: reconfigurable feature):

Abstraction level | Computation | Storage | Communication
Systems | Number & type of processes | Memory hierarchy | Interconnect network
Instruction set architecture | Custom instructions | Register set | Size of address/data bus
Micro-architecture | Execution unit type | Register file | Cross-bar, busses
Implementation | CLB | RAM details | Switches, muxes


Flexibility (3) - Binding rate

Binding rate: compare processing to binding
• Configurable (“compile-time”)
• Re-configurable
• Dynamic reconfigurable (“adaptive”)


SOC architecture: RINGS

[Figure: RINGS SoC with domain-specific engines (CPU, RF, baseband processing, video engine, DSP) around MEMORY and a reconfigurable interconnect. Each domain (networking, video, signal processing) has its own stack of abstraction levels, with labels such as standard, algorithm, architecture, µarchitecture, circuit, assembly, medium access, hardware and software.]


Instruction set extension

• Instruction set extension
• Register mapped
• Tightly coupled
• Experiment: DFT

1000 iterations | SW on embedded proc. | SW with HW datapath | Improvement
Energy | 67.6 mJ | 5.76 mJ | 12.5 times


Co-processor

• Memory mapped
• Loosely coupled
• Experiment: AES

[Figure: co-processor with local memory]

175 iterations | SW on emb. proc. | SW with HW datapath | Improvement
Energy | 89.2 mJ | 13.5 mJ | 25 times


Independent IP

• Loosely coupled
• Network on chip connected
• Flexible interconnect
• Experiment: TCP/IP checksum

[Figure: IP blocks connected through routers]

100 packets | SW on emb. proc. | HW datapath | Improvement
Energy | 17.0 mJ | 0.20 mJ | 84 times


Example: The Security Pyramid

[Figure: security pyramid with the levels Protocol (identification, confidentiality, integrity), Algorithm (Kasumi, Rijndael, RC4, MD5, …), Architecture (Java, JCA, JVM; CPU, Crypto, MEM), Micro-Architecture, and Circuit (D/Q flip-flop, Vcc, CLK)]


Example: AES Coprocessor

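[Figure: AES coprocessor core with Input FSM, Proc FSM and Output FSM, Encrypt and Key Schedule datapaths exchanging the roundkey, instruction and handshake signals, and bus widths of 16, 16, 256 and 256 bits]

[DAC 2002]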


Design options: AES acceleration: Gbits/Joule

AES 128bit key, 128bit data | Throughput | Power | Figure of Merit (Gb/s/W)
0.18µm CMOS | 2 Gbits/sec | 56 mW | 35.7 (1/1)
FPGA [1] | 1.32 Gbit/sec | 490 mW | 2.7 (1/11)
Asm Pentium III [2] | 648 Mbits/sec | 41.4 W | 0.015 (1/1900)
C Emb. Sparc [3] | 133 Kbits/sec | 120 mW | 0.0011 (1/33000)
Java [4] Emb. Sparc | 450 bits/sec | 120 mW | 0.0000037 (1/9.600.000)

[1] Amphion CS5230 on Virtex2 + Xilinx Virtex2 Power Estimator
[2] Helger Lipmaa PIII assembly handcoded + Intel Pentium III (1.13 GHz) Datasheet
[3] gcc, 1 mW/MHz @ 120 MHz Sparc, assumes 0.25 µm CMOS
[4] Java on KVM (Sun J2ME, non-JIT) on 1 mW/MHz @ 120 MHz Sparc, assumes 0.25 µm CMOS
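The figure of merit in this comparison is throughput divided by power, which is the same as Gbits per Joule. The sketch below recomputes it from the throughput and power values on the slide; the dictionary keys and the function name are only illustrative.

```python
def figure_of_merit(throughput_bits_per_s, power_watts):
    """Energy efficiency in Gb/s/W, equivalent to Gbits per Joule."""
    return throughput_bits_per_s / 1e9 / power_watts


designs = {
    "0.18um CMOS": (2e9, 0.056),
    "FPGA": (1.32e9, 0.490),
    "Asm Pentium III": (648e6, 41.4),
    "C Emb. Sparc": (133e3, 0.120),
    "Java Emb. Sparc": (450, 0.120),
}

merit = {name: figure_of_merit(*values) for name, values in designs.items()}
for name, value in merit.items():
    print("%-16s %.7f Gb/s/W" % (name, value))
```

The results span roughly seven orders of magnitude between the dedicated CMOS implementation and interpreted Java on an embedded processor, which is the point of the comparison.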


Conclusion

Applications mapped onto architectures

Design methods

= Low Power!


Motivation

• Architecture exploration

• Specification: MATLAB, SPW, C/C++, Java

• Floating point

• Fixed point

• Algorithm transformations

• Architecture alternatives

– Bit parallel (bit serial)
– ASIC, special purpose (Art Designer)
– Retargetable coprocessor (Target Compiler Technologies)
– DSP extensions to RISC (Gezel, Tensilica)
– DSP processors (TI TMS320C54x, TMS320C55x, ADI Blackfin, etc.)