chapter 4 pipelining and parallel processing - soc & dsp...


Page 1: Chapter 4 Pipelining and Parallel Processing - SOC & DSP …socdsp.ee.nchu.edu.tw/class/download/vlsi_dsp_102/night/DSP/Ch4... · Chapter 4 Pipelining and Parallel Processing VLSI

VLSI DSP 2008 Y.T. Hwang 5-1

Chapter 4 Pipelining and Parallel Processing


Introduction (1)

Pipelining:

Reduction in critical path

Increase the clock speed

Reduce power consumption at same speed

Parallel processing:

Parallelism

Increase effective sampling speed

Reduction of power consumption


Introduction (2)

A 3-tap FIR filter:

y(n) = ax(n) + bx(n-1) + cx(n-2)

Critical path: 1 multiply and 2 add

$$T_{sample} \ge T_M + 2T_A$$

$$f_{sample} \le \frac{1}{T_M + 2T_A}$$


Introduction (3)

Pipelining or parallel processing can be used to increase the sampling frequency

Critical path: 2 add

Pipelining

Parallel processing


Pipelining of FIR digital filters (1)

Feed-forward cut set:

Two iterations are computed concurrently

Critical path reduced from TM+2TA to TM+TA

Latency increased from 1 to 2
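The effect of the pipelining latches can be checked with a small simulation. The sketch below is a minimal Python model, assuming the latches sit on the edge between the two adders and on the delayed input feeding the third multiplier; the function names and this cut placement are illustrative assumptions, not taken from the slides.

```python
def fir_direct(x, a, b, c):
    """Unpipelined 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    y = []
    x1 = x2 = 0  # x(n-1) and x(n-2), initially zero
    for xn in x:
        y.append(a * xn + b * x1 + c * x2)
        x1, x2 = xn, x1
    return y


def fir_pipelined(x, a, b, c):
    """Same filter with one latch on each edge of an assumed feed-forward cutset."""
    y = []
    x1 = x2 = 0
    r_sum = 0  # latch between the two adders (assumed cut edge)
    r_x2 = 0   # latch on the input of the third multiplier (assumed cut edge)
    for xn in x:
        # Second stage: one multiply and one add on values latched last cycle.
        y.append(r_sum + c * r_x2)
        # First stage: one multiply and one add, latched for the next cycle.
        r_sum = a * xn + b * x1
        r_x2 = x2
        x1, x2 = xn, x1
    return y


x = [1, 2, 3, 4, 5]
ref = fir_direct(x, 2, 3, 5)
pip = fir_pipelined(x, 2, 3, 5)
# The pipelined output equals the reference delayed by one clock cycle.
print(ref)  # [2, 7, 17, 27, 37]
print(pip)  # [0, 2, 7, 17, 27]
```

In this model each stage contains at most one multiplication and one addition, which mirrors the TM+TA critical path, while the extra cycle of delay is the added latency.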


Pipelining of FIR digital filters (2)

Drawbacks of pipelining:

Increase in the number of latches and in system latency

Observations:

The clock period is limited by the longest path between:

Two latches

An input and a latch

A latch and an output

An input and an output

Critical path can be reduced by suitably placing the pipelining latches

Pipelining latches can be placed across any feed-forward cutset of the graph


Pipelining of FIR digital filters (3)

Cut set:

A set of edges of a graph such that if these edges are removed from the graph, the graph becomes disjoint

Feed-forward cut set:

A cut set in which the data move in the forward direction on all of its edges

Placing a latch on every edge of a feed-forward cut set does not affect the functionality of the algorithm; it only adds latency
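The cut set definitions can be made concrete with a short check. The sketch below is a minimal Python illustration, assuming a cut is described by a partition of the nodes into two sides; the graph, node names, and helper functions are hypothetical and only loosely model a 3-tap FIR data-flow graph.

```python
def cut_edges(edges, side_a):
    """Edges whose endpoints lie on opposite sides of the partition."""
    return [(u, v) for u, v in edges if (u in side_a) != (v in side_a)]


def is_feed_forward_cutset(edges, side_a):
    """True if every crossing edge goes from side_a to the other side."""
    cut = cut_edges(edges, side_a)
    return bool(cut) and all(u in side_a for u, _ in cut)


# Hypothetical data-flow graph: input x, delays d1/d2, multipliers m1..m3,
# adders a1/a2. Each pair (u, v) is a directed edge u -> v.
EDGES = [
    ("x", "m1"), ("x", "d1"), ("d1", "m2"), ("d1", "d2"), ("d2", "m3"),
    ("m1", "a1"), ("m2", "a1"), ("a1", "a2"), ("m3", "a2"),
]

good_side = {"x", "d1", "d2", "m1", "m2", "a1"}
bad_side = {"x", "d1", "m1", "m2", "m3", "a1"}

print(cut_edges(EDGES, good_side))               # [('d2', 'm3'), ('a1', 'a2')]
print(is_feed_forward_cutset(EDGES, good_side))  # True
print(is_feed_forward_cutset(EDGES, bad_side))   # False: d2 -> m3 runs backward
```

Only the first partition is a valid place for pipelining latches, because all of its crossing edges carry data in the same direction.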


Pipelining of FIR digital filters (4)

Example 3.2.1

[Figure: incorrect pipelining vs. correct pipelining]

Original critical path: A3 → A5 → A4 → A6

After pipelining: A3 → A5 or A4 → A6

Critical path is reduced by one half


Direct vs. transpose form

Direct form with long critical path

Transpose form with data broadcast structure:

Critical path is reduced to TM + TA
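That the two forms compute the same filter can be verified numerically. The following Python sketch is a minimal illustration with hypothetical function names: the direct form delays the input samples, while the transpose form broadcasts x(n) to all multipliers and delays the partial sums.

```python
def fir_direct(x, h):
    """Direct form: a delay line on the input, then a chain of adders."""
    delay = [0] * (len(h) - 1)  # x(n-1), x(n-2), ...
    y = []
    for xn in x:
        taps = [xn] + delay
        y.append(sum(hi * ti for hi, ti in zip(h, taps)))
        delay = taps[:-1]
    return y


def fir_transposed(x, h):
    """Transpose form: x(n) is broadcast, registers hold partial sums."""
    state = [0] * (len(h) - 1)
    y = []
    for xn in x:
        products = [hi * xn for hi in h]  # data broadcast to every multiplier
        y.append(products[0] + (state[0] if state else 0))
        shifted = state[1:] + [0]         # each register takes its right neighbour
        state = [p + s for p, s in zip(products[1:], shifted)]
    return y


x = [1, 2, 3, 4, 5]
h = [2, 3, 5]
print(fir_direct(x, h))      # [2, 7, 17, 27, 37]
print(fir_transposed(x, h))  # [2, 7, 17, 27, 37]
```

In the transpose form every output is one product plus one registered partial sum, which is why its critical path is a single multiply and a single add.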


Fine-Grain pipelining

Pipelining the function unit:

Assume TM = 10 units, TA = 2 units

After pipelining the multiplier into two stages (6 and 4 units), the critical path is 6 units


Parallel processing of FIR filter (1)

Block processing of size L (here L = 3):

y(n) = ax(n) + bx(n-1) + cx(n-2)

y(3k) = ax(3k) + bx(3k-1) + cx(3k-2)

y(3k+1) = ax(3k+1) + bx(3k) + cx(3k-1)

y(3k+2) = ax(3k+2) + bx(3k+1) + cx(3k)

Block delay (L-slow): placing a latch at any line of MIMO structures produces an effective delay of L clocks at the sample rate
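The block equations can be checked against the serial filter. This is a minimal Python sketch with hypothetical function names, assuming the input length is a multiple of 3 and samples before n = 0 are zero; each loop iteration stands for one clock cycle of the MIMO system.

```python
def fir_serial(x, a, b, c):
    """Reference: y(n) = a*x(n) + b*x(n-1) + c*x(n-2), one sample per step."""
    def xs(n):
        return x[n] if n >= 0 else 0
    return [a * xs(n) + b * xs(n - 1) + c * xs(n - 2) for n in range(len(x))]


def fir_3_parallel(x, a, b, c):
    """3-parallel version: three outputs are produced per clock cycle k."""
    if len(x) % 3 != 0:
        raise ValueError("input length must be a multiple of the block size 3")

    def xs(n):
        return x[n] if n >= 0 else 0

    y = []
    for k in range(len(x) // 3):
        y0 = a * xs(3 * k) + b * xs(3 * k - 1) + c * xs(3 * k - 2)
        y1 = a * xs(3 * k + 1) + b * xs(3 * k) + c * xs(3 * k - 1)
        y2 = a * xs(3 * k + 2) + b * xs(3 * k + 1) + c * xs(3 * k)
        y.extend([y0, y1, y2])
    return y


x = [1, 2, 3, 4, 5, 6]
print(fir_serial(x, 2, 3, 5))      # [2, 7, 17, 27, 37, 47]
print(fir_3_parallel(x, 2, 3, 5))  # [2, 7, 17, 27, 37, 47]
```

The samples x(3k-1) and x(3k-2) used in cycle k are inputs of the previous block, which is the role of the block delay in the hardware.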


Parallel processing of FIR filter (2)

Block size 3:

3 times hardware

Critical path remains unchanged: TM+2TA

Tclk ≥ TM+2TA

3 samples are produced in 1 clock cycle

Effective iteration period is

$$T_{iter} = T_{sample} = \frac{1}{L} T_{clk} \ge \frac{1}{3}(T_M + 2T_A)$$

Note: Tclk ≠ Tsample


Parallel processing of FIR filter (3)

MIMO system

Complete parallel processing:

System with block size 4

A serial-to-parallel converter

A parallel-to-serial converter


Pipelining vs. parallel processing

Limitation of pipelining:

Input/output bottleneck, i.e. a communication-bounded system

The pipelining period cannot be smaller than the communication or I/O bound


Pipelining & parallel processing

Combined fine-grain pipelining and parallel processing for 3-tap FIR filter

L = 3, M = 2

$$T_{iter} = T_{sample} = \frac{1}{LM} T_{clk} = \frac{1}{6}(T_M + 2T_A) = \frac{14}{6}$$
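The iteration period for the parallel-only and the combined case (L = 3, M = 2) can be computed directly. The Python sketch below is a minimal illustration: the function name is hypothetical, the unit delays TM = 10 and TA = 2 are taken from the fine-grain pipelining slide, and ideal pipelining is assumed to divide the critical path evenly into M stages.

```python
from fractions import Fraction


def iteration_period(t_crit, L=1, M=1):
    """Effective iteration (sample) period of an L-parallel, M-stage design.

    t_crit is the critical path of the original datapath. Splitting it into
    M equal stages is an idealisation made for this sketch.
    """
    return Fraction(t_crit) / (L * M)


T_M, T_A = 10, 2           # unit delays from the fine-grain pipelining slide
t_crit = T_M + 2 * T_A     # 14 units for the 3-tap direct-form filter

parallel_only = iteration_period(t_crit, L=3)   # 14/3
combined = iteration_period(t_crit, L=3, M=2)   # 14/6, reduced to 7/3
print(parallel_only, combined)                  # 14/3 7/3
```

Using exact fractions keeps the 1/(LM) scaling visible instead of a rounded decimal.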


Pipelining & parallel processing for low power

Advantages of pipelining and parallel processing:

High speed

Low power

CMOS circuit model (1st order analysis):

Propagation delay

$$T_{pd} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$$

Power consumption

$$P = C_{total} V_0^2 f$$


Pipelining for low power (1)

Sequential version

M-level pipelined version:

Working at the same frequency, i.e. f = 1/Tseq remains unchanged

Capacitance in each pipeline stage is reduced to Ccharge/M

Only βV0 (β < 1) is needed to charge Ccharge/M in Tseq

$$P_{seq} = C_{total} V_0^2 f, \quad f = 1/T_{seq}$$

$$P_{pip} = C_{total} \beta^2 V_0^2 f = \beta^2 P_{seq}$$


Pipelining for low power (2)

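Calculation of β

$$T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$$

$$T_{pip} = \frac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2}$$

Let $T_{seq} = T_{pip}$:

$$M (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$$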


Pipelining for low power (3)

Example:

3-tap FIR filter

Tm = 10, Ta = 2, Cm = 5Ca

Pipelined multiplier: Tm1 = 6, Tm2 = 4, Cm1 = 3Ca, Cm2 = 2Ca

V0 = 5V, Vt = 0.6V

Supply voltage calculation:

Sequential: Ccharge = Cm + Ca = 6Ca

Pipelined: Ccharge = Cm1 = Cm2 + Ca = 3Ca

2(5β - 0.6)² = β(5 - 0.6)², i.e. 50β² - 31.36β + 0.72 = 0, so β = 0.6033

Vpip = βV0 = 3.0165V

Power consumption ratio = β² = 36.4%
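The numbers in this example can be reproduced with the quadratic formula. The sketch below is a minimal Python illustration: `solve_beta` is a hypothetical helper that solves ratio·(βV0 − Vt)² = β(V0 − Vt)², where `ratio` is the factor by which the charged capacitance shrinks (6Ca / 3Ca = 2 here), and it keeps only the root for which the scaled supply stays above the threshold voltage.

```python
import math


def solve_beta(ratio, v0, vt):
    """Valid root of ratio*(beta*v0 - vt)**2 = beta*(v0 - vt)**2."""
    # Expanded form: a*beta**2 + b*beta + c = 0
    a = ratio * v0 ** 2
    b = -(2 * ratio * v0 * vt + (v0 - vt) ** 2)
    c = ratio * vt ** 2
    root = math.sqrt(b * b - 4 * a * c)
    candidates = [(-b + root) / (2 * a), (-b - root) / (2 * a)]
    # A usable supply voltage must remain above the threshold voltage.
    valid = [beta for beta in candidates if beta * v0 > vt]
    return max(valid)


V0, VT = 5.0, 0.6
beta = solve_beta(ratio=2, v0=V0, vt=VT)  # coefficients 50, -31.36, 0.72
print(round(beta, 4))        # 0.6033
print(round(beta * V0, 3))   # 3.017 (the slide rounds beta first: 3.0165)
print(round(beta ** 2, 3))   # 0.364
```

The second root, about 0.024, is discarded because it would put the supply voltage below Vt.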


Parallel processing for low power (1)

L-parallel version:

Working at 1/L of the original frequency, i.e. f = 1/(LTseq)

Total capacitance is increased to LCcharge

Since each Ccharge is charged in LTseq, only βV0 (β < 1) is needed to charge it


Parallel processing for low power (2)

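Calculation of β

$$T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}, \quad L T_{seq} = \frac{C_{charge}\, \beta V_0}{k (\beta V_0 - V_t)^2}$$

$$L (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$$

$$P_{par} = (L C_{charge}) (\beta V_0)^2 \frac{f}{L} = \beta^2 C_{charge} V_0^2 f = \beta^2 P_{seq}$$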


Parallel processing for low power (3)

Example of 2-parallel version:

4-tap FIR filter

Tm = 8, Ta = 1, Cm = 8Ca

Tseq = 9

V0 = 3.3V, Vt = 0.45V


Parallel processing for low power (4)

2-parallel FIR filter design:

Note each delay is 2-slow

[Figure: 2-parallel FIR filter; labeled signals x(2k-1), x(2k-2)]


Parallel processing for low power (5)

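Supply voltage calculation:

Sequential: Ccharge = Cm + Ca = 9Ca

2-parallel: Ccharge = Cm + 2Ca = 10Ca

$$T_{seq} = \frac{9 C_a V_0}{k (V_0 - V_t)^2}, \quad T_{par} = \frac{10 C_a\, \beta V_0}{k (\beta V_0 - V_t)^2}$$

Let $T_{par} = 2T_{sample} = 2T_{seq}$:

$$9 (\beta V_0 - V_t)^2 = 5\beta (V_0 - V_t)^2$$

$$98.01\beta^2 - 67.3425\beta + 1.8225 = 0$$

β = 0.6589 or 0.0282 (rejected, since βV0 < Vt)

Vpar = βV0 = 2.17437V

Power consumption ratio = β² = 43.41%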
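As a cross-check that does not rely on expanding the quadratic, the supply voltage of the 2-parallel example (Ccharge = 9Ca sequential, 10Ca parallel, V0 = 3.3V, Vt = 0.45V) can be found numerically from the first-order delay model. The Python sketch below uses bisection; the function names are hypothetical, and Ca and k are set to 1 because they cancel in the comparison.

```python
def t_pd(c_charge, v, vt, k=1.0):
    """First-order CMOS propagation delay: C * V / (k * (V - Vt)^2)."""
    return c_charge * v / (k * (v - vt) ** 2)


def supply_for_delay(c_charge, target, v0, vt, k=1.0, tol=1e-9):
    """Bisection for the supply voltage in (vt, v0] whose delay equals target.

    Relies on t_pd decreasing monotonically in v for v > vt, and assumes the
    target delay is reachable at or below v0.
    """
    lo, hi = vt + 1e-6, v0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if t_pd(c_charge, mid, vt, k) > target:
            lo = mid   # too slow at this voltage: raise it
        else:
            hi = mid   # fast enough: try a lower voltage
    return 0.5 * (lo + hi)


V0, VT = 3.3, 0.45
t_seq = t_pd(9.0, V0, VT)                           # sequential, 9 Ca
v_par = supply_for_delay(10.0, 2 * t_seq, V0, VT)   # 2-parallel, 10 Ca, T = 2 Tseq
beta = v_par / V0
print(round(v_par, 3))      # 2.174
print(round(beta ** 2, 3))  # 0.434
```

The bisection converges to the same operating point as the larger root of the quadratic, and the smaller root never appears because the search interval starts above Vt.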


Parallel processing for low power (6)

Area efficient 2-parallel version

Multipliers: 8 → 6, adders: 6 → 7, delays: 3 → 4


Parallel processing for low power (7)

Architecture verification

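$$y(n) = h_0 x(n) + h_1 x(n-1) + h_2 x(n-2) + h_3 x(n-3)$$

$$y_A = h_0 x(2k) + h_2 x(2k-2)$$

$$y_B = (h_0 + h_1)(x(2k) + x(2k+1)) + (h_2 + h_3)(x(2k-2) + x(2k-1))$$

$$y_C = h_1 x(2k+1) + h_3 x(2k-1)$$

$$y(2k) = y_A + y_C\ [\text{after 1 block delay}] = h_0 x(2k) + h_1 x(2k-1) + h_2 x(2k-2) + h_3 x(2k-3)$$

$$y(2k+1) = y_B - y_A - y_C = h_0 x(2k+1) + h_1 x(2k) + h_2 x(2k-1) + h_3 x(2k-2)$$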
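The verification can also be run numerically. The Python sketch below is a minimal model of the area-efficient 2-parallel structure, assuming three 2-tap subfilters A, B, and C, with the output of C delayed by one block before it is added to A; the function names and the test data are illustrative.

```python
def x_at(x, n):
    """Sample x(n), taken as zero outside the recorded range."""
    return x[n] if 0 <= n < len(x) else 0


def fir4_reference(x, h):
    """Serial 4-tap FIR: y(n) = sum of h[i] * x(n - i) for i = 0..3."""
    return [sum(h[i] * x_at(x, n - i) for i in range(4)) for n in range(len(x))]


def fir4_area_efficient(x, h):
    """2-parallel 4-tap FIR using 6 multiplications per block of 2 samples."""
    h0, h1, h2, h3 = h
    y = [0] * len(x)
    y_c_delayed = 0  # output of subfilter C from the previous block
    for k in range(len(x) // 2):
        xe, xo = x_at(x, 2 * k), x_at(x, 2 * k + 1)            # current block
        xe1, xo1 = x_at(x, 2 * k - 2), x_at(x, 2 * k - 1)      # previous block
        y_a = h0 * xe + h2 * xe1
        y_b = (h0 + h1) * (xe + xo) + (h2 + h3) * (xe1 + xo1)
        y_c = h1 * xo + h3 * xo1
        y[2 * k] = y_a + y_c_delayed
        y[2 * k + 1] = y_b - y_a - y_c
        y_c_delayed = y_c
    return y


x = [3, -1, 4, 1, -5, 9, 2, 6]
h = [1, 2, 3, 4]
print(fir4_area_efficient(x, h) == fir4_reference(x, h))  # True
```

Each block uses two multiplications in each of A, B, and C, which matches the reduction from 8 to 6 multipliers; the sums h0+h1 and h2+h3 are fixed coefficients and can be precomputed.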


Parallel processing for low power (8)

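Supply voltage calculation:

Sequential: Ccharge = Cm + Ca = 9Ca

2-parallel: Ccharge = Cm + 4Ca = 12Ca

$$T_{seq} = \frac{9 C_a V_0}{k (V_0 - V_t)^2}, \quad T_{par} = \frac{12 C_a\, \beta V_0}{k (\beta V_0 - V_t)^2}$$

Let $T_{par} = 2T_{sample} = 2T_{seq}$:

$$\frac{12 C_a\, \beta V_0}{k (\beta V_0 - V_t)^2} = 2 \cdot \frac{9 C_a V_0}{k (V_0 - V_t)^2}$$

$$32.67\beta^2 - 25.155\beta + 0.6075 = 0$$

β = 0.745 or 0.025 (rejected, since βV0 < Vt)

Vpar = βV0 = 2.4585V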


Parallel processing for low power (9)

Power consumption ratio

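$$C_{total}^{seq} = 4C_m + 3C_a = 35C_a, \quad P_{seq} = C_{total}^{seq} V_0^2 f_{seq}$$

$$C_{total}^{par} = 6C_m + 7C_a = 55C_a, \quad P_{par} = C_{total}^{par} \beta^2 V_0^2 f_{par}$$

$$f_{par} = \frac{1}{2} f_{seq} = \frac{1}{2} f_s$$

$$P_{seq} = 35 C_a V_0^2 f_s, \quad P_{par} = 55 C_a \beta^2 V_0^2 \cdot \frac{1}{2} f_s$$

$$ratio = \frac{P_{par}}{P_{seq}} = \frac{55 \times 0.555}{2 \times 35} = 43.6\%$$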
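The power ratio of the area-efficient design can be recomputed from the operator counts. The Python sketch below is a minimal illustration, assuming 4 multipliers and 3 adders for the sequential 4-tap filter, 6 multipliers and 7 adders for the area-efficient 2-parallel filter, Cm = 8Ca, and β = 0.745 as obtained in the supply voltage calculation; the helper names are hypothetical, and Ca and fs are normalised to 1 because they cancel in the ratio.

```python
def total_capacitance(n_mult, n_add, c_mult=8.0, c_add=1.0):
    """Total switched capacitance in units of Ca (Cm = 8 Ca in this example)."""
    return n_mult * c_mult + n_add * c_add


def power(c_total, v, f):
    """First-order dynamic power: P = C_total * V^2 * f."""
    return c_total * v ** 2 * f


V0 = 3.3
BETA = 0.745   # valid root from the supply voltage calculation
F_S = 1.0      # sample rate, normalised

c_seq = total_capacitance(4, 3)   # 35 Ca
c_par = total_capacitance(6, 7)   # 55 Ca

p_seq = power(c_seq, V0, F_S)
p_par = power(c_par, BETA * V0, F_S / 2)   # the parallel clock runs at fs / 2

ratio = p_par / p_seq
print(round(ratio, 3))   # 0.436
```

Unlike the plain 2-parallel filter, the ratio here is not simply β², because the total capacitance grows by 55/35 rather than by a factor of 2.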


Combining pipelining and parallel processing

Pipelining:

Reduces the capacitance to be charged/discharged in 1 clock period

Parallel processing:

Increases the clock period for charging/discharging the original capacitance

3-parallel 2-stage pipelining


Pipelining + parallel processing

Propagation delay of the parallel pipelined filter

$$L T_{pd} = \frac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2} = \frac{L\, C_{charge} V_0}{k (V_0 - V_t)^2}$$

Solution of β

$$ML (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$$