chapter 4 pipelining and parallel processing - soc & dsp...

VLSI DSP 2008 Y.T. Hwang 5-1

Chapter 4 Pipelining and Parallel Processing


Introduction (1)

Pipelining

Reduction in critical path

Increase the clock speed

Reduce power consumption at same speed

Parallel processing

Parallelism

Increase effective sampling speed

Reduction of power consumption


Introduction (2)

A 3-tap FIR filter

y(n) = a x(n) + b x(n-1) + c x(n-2)

Critical path: 1 multiply and 2 add

$$T_{sample} \ge T_M + 2T_A$$

$$f_{sample} \le \frac{1}{T_M + 2T_A}$$


Introduction (3)

Pipelining or parallel processing to increase the sampling frequency

Critical path: 2 add

Pipelining

Parallel processing


Pipelining of FIR digital filters (1)

Feed-forward cut set

Two iterations are computed concurrently

Critical path reduced from TM+2TA to TM+TA

Latency increased from 1 to 2
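The effect of the pipelining latches can be checked with a small behavioural sketch in Python. It assumes the two latches of the feed-forward cut set sit on the partial sum a·x(n) + b·x(n-1) and on the c·x(n-2) branch; the slide's figure is not reproduced here, so this placement and all function names are illustrative assumptions.

```python
def fir_direct(x, a, b, c):
    """Direct-form 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2), zero initial state."""
    y = []
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0
        x2 = x[n - 2] if n >= 2 else 0
        y.append(a * x[n] + b * x1 + c * x2)
    return y


def fir_pipelined(x, a, b, c):
    """Same filter with one latch on each edge of an assumed feed-forward cut set.

    Stage 1 computes a*x(n) + b*x(n-1) and c*x(n-2); stage 2 adds the two
    latched values, so each output appears one clock cycle later.
    """
    latch_sum = 0  # holds a*x(n-1) + b*x(n-2) during cycle n
    latch_cx = 0   # holds c*x(n-3) during cycle n
    x1 = x2 = 0    # the filter's own delay elements
    out = []
    for xn in x:
        # Stage 2 works on the values latched in the previous cycle.
        out.append(latch_sum + latch_cx)
        # Stage 1 results are captured for the next cycle.
        latch_sum = a * xn + b * x1
        latch_cx = c * x2
        x2, x1 = x1, xn
    return out


x = [1, 2, 3, 4, 5]
y_ref = fir_direct(x, 2, 3, 4)
y_pip = fir_pipelined(x, 2, 3, 4)
# The pipelined output is the reference output delayed by one cycle.
assert y_pip[1:] == y_ref[:-1]
```

Each stage now contains at most one multiplier followed by one adder, which is the TM+TA critical path stated above, at the cost of two extra latches and one extra cycle of latency.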



Drawbacks of pipelining

Increase in the number of latches and in system latency

Observations

The clock period is limited by the longest path between any of the following:

Two latches

An input and a latch

A latch and an output

An input and an output

Critical path can be reduced by suitably placing the pipelining latches

Pipelining latches can be placed across any feed-forward cutset of the graph



Cut set

A set of edges of a graph such that if these edges are removed from the graph, the graph becomes disjoint

Feed-forward cut set

The data move in the forward direction on all the edges of the cut set

We can arbitrarily place latches on a feed-forward cut set w/o affecting the functionality of the algorithm
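The two definitions can be sketched as a small check in Python. The graph below is a hypothetical data-flow graph of the 3-tap filter (node names such as `Ma`, `D1`, `A1` are made up for this illustration), and a cut set is described by the set of nodes on its input side.

```python
def cut_edges(edges, left):
    """Edges with exactly one endpoint in `left`; removing them separates the graph."""
    return [(u, v) for (u, v) in edges if (u in left) != (v in left)]


def is_feed_forward(edges, left):
    """True if every edge of the cut set goes from `left` to the other side."""
    cut = cut_edges(edges, left)
    return bool(cut) and all(u in left for (u, v) in cut)


# Hypothetical data-flow graph of y(n) = a x(n) + b x(n-1) + c x(n-2):
# D1, D2 are delays, Ma/Mb/Mc multipliers, A1/A2 adders.
edges = [
    ("x", "Ma"), ("x", "D1"), ("D1", "Mb"), ("D1", "D2"), ("D2", "Mc"),
    ("Ma", "A1"), ("Mb", "A1"), ("A1", "A2"), ("Mc", "A2"), ("A2", "y"),
]

# Cutting A1->A2 and Mc->A2: all data cross in the forward direction.
good = {"x", "D1", "D2", "Ma", "Mb", "Mc", "A1"}
# Cutting x->D1, Mb->A1 and Mc->A2: data cross in both directions.
bad = {"x", "Ma", "A1", "A2", "y"}

print(cut_edges(edges, good), is_feed_forward(edges, good))
print(cut_edges(edges, bad), is_feed_forward(edges, bad))
```

Only the first partition is a valid place for pipelining latches; latching the edges of the second one would delay some operands relative to others and change the computed function.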



Example 3.2.1

Figure: incorrect pipelining and correct pipelining

Original critical path: A3 → A5 → A4 → A6

After pipelining: A3 → A5 or A4 → A6

Critical path is reduced by one half


Direct vs. transpose form

Direct form with long critical path

Transpose form with data broadcast structure

Critical path is reduced to TM + TA


Fine-Grain pipelining

Pipelining the function unit

Assume TM = 10 units, TA = 2 units

After pipelining, the critical path is 6 units
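The 6-unit figure can be reproduced with a short calculation. This sketch assumes the data broadcast structure with a TM + TA critical path, and a multiplier split into sub-units of 6 and 4 time units (the split used in the low-power example later in this chapter); the names are illustrative.

```python
def clock_period(stage_delays):
    """The clock period is set by the slowest pipeline stage."""
    return max(stage_delays)


T_M, T_A = 10, 2
T_M1, T_M2 = 6, 4  # assumed split of the multiplier into two sub-units

before = clock_period([T_M + T_A])        # multiplier followed by adder
after = clock_period([T_M1, T_M2 + T_A])  # latch placed inside the multiplier

print(before, after)
```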


Parallel processing of FIR filter (1)

Block processing of size L

y(n) = a x(n) + b x(n-1) + c x(n-2)

y(3k) = a x(3k) + b x(3k-1) + c x(3k-2)

y(3k+1) = a x(3k+1) + b x(3k) + c x(3k-1)

y(3k+2) = a x(3k+2) + b x(3k+1) + c x(3k)

Block delay (L-slow): placing a latch at any line of MIMO structures produces an effective delay of L clocks at the sample rate
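The block equations above can be checked against the serial filter with a short Python sketch. Function names are illustrative, and samples before the start of the input are assumed to be zero.

```python
def fir_serial(x, a, b, c):
    """y(n) = a*x(n) + b*x(n-1) + c*x(n-2), one output per iteration."""
    at = lambda n: x[n] if 0 <= n < len(x) else 0
    return [a * at(n) + b * at(n - 1) + c * at(n - 2) for n in range(len(x))]


def fir_3_parallel(x, a, b, c):
    """Block processing with L = 3: three outputs are computed per iteration k."""
    assert len(x) % 3 == 0, "input length must be a multiple of the block size"
    at = lambda n: x[n] if 0 <= n < len(x) else 0
    y = []
    for k in range(len(x) // 3):
        n = 3 * k
        y0 = a * at(n) + b * at(n - 1) + c * at(n - 2)      # y(3k)
        y1 = a * at(n + 1) + b * at(n) + c * at(n - 1)      # y(3k+1)
        y2 = a * at(n + 2) + b * at(n + 1) + c * at(n)      # y(3k+2)
        y.extend([y0, y1, y2])
    return y


x = [1, -2, 3, 4, 0, 5]
assert fir_3_parallel(x, 2, 3, 4) == fir_serial(x, 2, 3, 4)
```

Each loop iteration corresponds to one clock cycle of the parallel hardware, which uses three copies of the multiply-add datapath.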



Block size 3

3 times hardware

Critical path remains unchanged: TM+2TA

Tclk ≥ TM+2TA

3 samples are produced in 1 clock cycle

The effective iteration period is

$$T_{iter} = T_{sample} = \frac{1}{L} T_{clk} \ge \frac{1}{3}(T_M + 2T_A)$$

Note: Tclk ≠ Tsample



MIMO system

Complete parallel processing

System with block size 4

A serial-to-parallel converter

A parallel-to-serial converter


Pipelining vs. parallel processing

Limitation of pipelining processing

Input/output bottleneck, i.e. communication-bounded system

Pipelining period cannot be smaller than the communication or I/O bound


Pipelining & parallel processing

Combined fine-grain pipelining and parallel processing for 3-tap FIR filter

L = 3, M = 2

$$T_{iter} = T_{sample} = \frac{1}{LM} T_{clk} = \frac{1}{6}(T_M + 2T_A) = \frac{14}{6}$$
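As a quick arithmetic check, the sample period of the 3-tap direct-form filter can be computed for any block size L and pipelining level M. This is a sketch using the relation T_sample = (T_M + 2T_A)/(LM) and the values TM = 10, TA = 2 assumed earlier; the function name is illustrative.

```python
from fractions import Fraction


def sample_period(t_m, t_a, L=1, M=1):
    """Effective sample period (T_M + 2*T_A) / (L*M) of the 3-tap direct-form filter."""
    return Fraction(t_m + 2 * t_a, L * M)


print(sample_period(10, 2))            # original filter
print(sample_period(10, 2, L=3))       # 3-parallel only
print(sample_period(10, 2, L=3, M=2))  # 3-parallel with 2-level pipelining
```

`Fraction` keeps the results exact; note that it reduces 14/6 to 7/3.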


Pipelining & parallel processing for low power

Advantages of pipelining and parallel processing

High speed

Low power

CMOS circuit model

1st order analysis

Propagation delay

$$T_{pd} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$$

Power consumption

$$P = C_{total} V_0^2 f$$
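The trade-off behind the first-order model can be illustrated numerically: lowering the supply voltage cuts power quadratically but lengthens the propagation delay. The sketch below uses V0 = 5 V and Vt = 0.6 V; the reduced supply of 3 V and the unit values for k, the capacitances and f are arbitrary assumptions that cancel in the ratios.

```python
def propagation_delay(c_charge, v0, vt, k):
    """T_pd = C_charge * V0 / (k * (V0 - Vt)^2)."""
    return c_charge * v0 / (k * (v0 - vt) ** 2)


def dynamic_power(c_total, v0, f):
    """P = C_total * V0^2 * f."""
    return c_total * v0 ** 2 * f


V_T = 0.6
delay_ratio = propagation_delay(1.0, 3.0, V_T, 1.0) / propagation_delay(1.0, 5.0, V_T, 1.0)
power_ratio = dynamic_power(1.0, 3.0, 1.0) / dynamic_power(1.0, 5.0, 1.0)

print(f"delay x{delay_ratio:.3f}, power x{power_ratio:.2f}")
```

Pipelining and parallel processing recover the lost speed, so the supply voltage can be lowered without reducing the sample rate.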


Pipelining for low power (1)

Sequential version

$$P_{seq} = C_{total} V_0^2 f, \qquad f = 1/T_{seq}$$

M-level pipelined version

Working at the same frequency, i.e. f = 1/Tseq remains unchanged

Capacitance in each pipeline stage is reduced to Ccharge/M

Only βV0 (β < 1) is needed to charge Ccharge/M in Tseq

$$P_{pip} = C_{total} \beta^2 V_0^2 f = \beta^2 P_{seq}$$



Calculation of β

$$T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}$$

$$T_{pip} = \frac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2}$$

Let Tseq = Tpip:

$$M (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$$



Example

3-tap FIR filter

Tm = 10, Ta = 2, Cm = 5Ca

Pipelined multiplier: Tm1 = 6, Tm2 = 4, Cm1 = 3Ca, Cm2 = 2Ca

V0 = 5V, Vt = 0.6V

Supply voltage calculation

Sequential: Ccharge = Cm + Ca = 6Ca

Pipelined: Ccharge = Cm1 = Cm2 + Ca = 3Ca

$$50\beta^2 - 31.36\beta + 0.72 = 0 \;\Rightarrow\; \beta = 0.6033$$

Vpip = βV0 = 3.0165V

Power consumption ratio = β² = 36.4%
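The numbers in this example can be reproduced by solving M(βV0 - Vt)² = β(V0 - Vt)² for β. The following Python sketch does so with the quadratic formula; the function name is illustrative. The two roots multiply to (Vt/V0)², so exactly one of them keeps βV0 above Vt, and that is the larger root.

```python
import math


def beta_pipelined(M, v0, vt):
    """Feasible root of M*(beta*v0 - vt)**2 = beta*(v0 - vt)**2."""
    a = M * v0 ** 2
    b = -(2 * M * v0 * vt + (v0 - vt) ** 2)
    c = M * vt ** 2
    # The larger root is the one for which beta*v0 stays above vt.
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)


V0, VT = 5.0, 0.6
beta = beta_pipelined(M=2, v0=V0, vt=VT)
v_pip = beta * V0
power_ratio = beta ** 2

print(f"beta = {beta:.4f}, Vpip = {v_pip:.3f} V, power ratio = {power_ratio:.1%}")
```

With M = 2 the coefficients are 50, -31.36 and 0.72, matching the quadratic in the example.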


Parallel processing for low power (1)

L-parallel version

Working at 1/L of the original frequency, i.e. f = 1/(L·Tseq)

Total capacitance is increased to L·Ccharge

Since each Ccharge is charged in L·Tseq, only βV0 (β < 1) is needed to charge it



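Calculation of β

$$T_{seq} = \frac{C_{charge} V_0}{k (V_0 - V_t)^2}, \qquad L\,T_{seq} = \frac{C_{charge}\, \beta V_0}{k (\beta V_0 - V_t)^2}$$

$$L (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$$

$$P_{par} = (L C_{charge})(\beta V_0)^2 \frac{f}{L} = \beta^2 C_{charge} V_0^2 f = \beta^2 P_{seq}$$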



Example of 2-parallel version

4-tap FIR filter

Tm = 8, Ta = 1, Cm = 8Ca

Tseq = 9

V0 = 3.3V, Vt = 0.45V



2-parallel FIR filter design

Note each delay is 2-slow

x(2k-1)

x(2k-2)




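2-parallel: Ccharge = Cm + 2Ca = 10Ca

$$T_{seq} = \frac{9 C_a V_0}{k (V_0 - V_t)^2}, \qquad T_{par} = \frac{10 C_a\, \beta V_0}{k (\beta V_0 - V_t)^2}$$

Let Tpar = 2Tsample = 2Tseq:

$$9 (\beta V_0 - V_t)^2 = 5 \beta (V_0 - V_t)^2$$

$$98.01\beta^2 - 67.3425\beta + 1.8225 = 0$$

$$\beta = 0.6589 \text{ or } 0.0282 \text{ (rejected, since } \beta V_0 < V_t)$$

Vpar = βV0 = 2.17437V

Power consumption ratio = β² = 43.41%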
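When the parallel structure has a different charging capacitance than the sequential one, the condition Tpar = L·Tseq becomes L·Cseq(βV0 - Vt)² = Cpar·β(V0 - Vt)². The Python sketch below solves it for the 2-parallel example (Cseq = 9Ca, Cpar = 10Ca, expressed in units of Ca); the function and parameter names are illustrative.

```python
import math


def beta_parallel(L, c_seq, c_par, v0, vt):
    """Feasible root of L*c_seq*(beta*v0 - vt)**2 = c_par*beta*(v0 - vt)**2."""
    a = L * c_seq * v0 ** 2
    b = -(2 * L * c_seq * v0 * vt + c_par * (v0 - vt) ** 2)
    c = L * c_seq * vt ** 2
    # The smaller root would put beta*v0 below the threshold voltage.
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)


V0, VT = 3.3, 0.45
beta = beta_parallel(L=2, c_seq=9, c_par=10, v0=V0, vt=VT)

print(f"beta = {beta:.4f}, Vpar = {beta * V0:.3f} V, beta^2 = {beta ** 2:.2%}")
```

Dividing the coefficients by 2 gives 98.01, -67.3425 and 1.8225, the quadratic shown in the example.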



Area efficient 2-parallel version

Multipliers: 8 → 6, adders: 6 → 7, delays: 3 → 4



Architecture verification

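$$y(n) = h_0 x(n) + h_1 x(n-1) + h_2 x(n-2) + h_3 x(n-3)$$

$$y_A = h_0 x(2k) + h_2 x(2k-2)$$

$$y_B = (h_0 + h_1)\bigl(x(2k) + x(2k+1)\bigr) + (h_2 + h_3)\bigl(x(2k-2) + x(2k-1)\bigr)$$

$$y_C = h_1 x(2k+1) + h_3 x(2k-1)$$

$$y(2k) = y_A + y_C\ [\text{after 1 block delay}] = h_0 x(2k) + h_1 x(2k-1) + h_2 x(2k-2) + h_3 x(2k-3)$$

$$y(2k+1) = y_B - y_A - y_C = h_0 x(2k+1) + h_1 x(2k) + h_2 x(2k-1) + h_3 x(2k-2)$$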
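The verification can also be done numerically. The Python sketch below evaluates the three sub-filter outputs per block, here called y_A = h0·x(2k) + h2·x(2k-2), y_B = (h0+h1)(x(2k)+x(2k+1)) + (h2+h3)(x(2k-2)+x(2k-1)) and y_C = h1·x(2k+1) + h3·x(2k-1), and compares the combined outputs with the direct 4-tap filter. Function names are illustrative and samples before the start of the input are assumed to be zero.

```python
def fir4_direct(x, h):
    """y(n) = h0*x(n) + h1*x(n-1) + h2*x(n-2) + h3*x(n-3)."""
    at = lambda n: x[n] if 0 <= n < len(x) else 0
    return [sum(h[i] * at(n - i) for i in range(4)) for n in range(len(x))]


def fir4_area_efficient(x, h):
    """2-parallel filter built from three 2-tap sub-filters (6 multiplications per block)."""
    assert len(x) % 2 == 0, "input length must be a multiple of the block size"
    at = lambda n: x[n] if 0 <= n < len(x) else 0
    h01, h23 = h[0] + h[1], h[2] + h[3]  # coefficient sums are precomputed
    y = [0] * len(x)
    y_c_prev = 0  # y_C of the previous block (the one-block delay)
    for k in range(len(x) // 2):
        n = 2 * k
        y_a = h[0] * at(n) + h[2] * at(n - 2)
        y_b = h01 * (at(n) + at(n + 1)) + h23 * (at(n - 2) + at(n - 1))
        y_c = h[1] * at(n + 1) + h[3] * at(n - 1)
        y[n] = y_a + y_c_prev        # y(2k)   = y_A + delayed y_C
        y[n + 1] = y_b - y_a - y_c   # y(2k+1) = y_B - y_A - y_C
        y_c_prev = y_c
    return y


x = [1, -2, 3, 4, 0, 5, -1, 2]
h = [1, 2, 3, 4]
assert fir4_area_efficient(x, h) == fir4_direct(x, h)
```

The saving comes from sharing y_A and y_C between the two outputs, at the cost of the extra pre- and post-additions.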




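2-parallel: Ccharge = Cm + 4Ca = 12Ca

$$T_{seq} = \frac{9 C_a V_0}{k (V_0 - V_t)^2}, \qquad T_{par} = \frac{12 C_a\, \beta V_0}{k (\beta V_0 - V_t)^2}$$

Let Tpar = 2Tsample = 2Tseq:

$$2 \cdot \frac{9 C_a V_0}{k (V_0 - V_t)^2} = \frac{12 C_a\, \beta V_0}{k (\beta V_0 - V_t)^2}$$

$$32.67\beta^2 - 25.155\beta + 0.6075 = 0$$

$$\beta = 0.745 \text{ or } 0.025 \text{ (rejected, since } \beta V_0 < V_t)$$

Vpar = βV0 = 2.4585V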



Power consumption ratio

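$$C_{total}^{seq} = 4C_m + 3C_a = 35C_a, \qquad P_{seq} = C_{total}^{seq} V_0^2 f_{seq}$$

$$C_{total}^{par} = 6C_m + 7C_a = 55C_a, \qquad P_{par} = C_{total}^{par} (\beta V_0)^2 f_{par}$$

$$f_{seq} = f_s, \qquad f_{par} = \tfrac{1}{2} f_s$$

$$P_{seq} = 35 C_a V_0^2 f_s, \qquad P_{par} = 55 C_a \beta^2 V_0^2 \cdot \tfrac{1}{2} f_s$$

$$\text{ratio} = \frac{P_{par}}{P_{seq}} = \frac{55 \times 0.555 \times \tfrac{1}{2}}{35} = 43.6\%$$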
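Because the area-efficient structure changes the total capacitance, the power ratio is no longer simply β². A short Python sketch of the calculation, using β = 0.745, Cm = 8Ca and the operator counts of the two filters (4 multipliers and 3 adders versus 6 multipliers and 7 adders); capacitances are in units of Ca and the function name is illustrative.

```python
def power_ratio(c_total_seq, c_total_par, beta, L):
    """P_par / P_seq with P = C_total * V^2 * f, V_par = beta*V0, f_par = f_seq / L."""
    return (c_total_par * beta ** 2 / L) / c_total_seq


C_M, C_A = 8, 1
c_seq = 4 * C_M + 3 * C_A  # sequential 4-tap filter
c_par = 6 * C_M + 7 * C_A  # area-efficient 2-parallel filter

ratio = power_ratio(c_seq, c_par, beta=0.745, L=2)
print(c_seq, c_par, f"{ratio:.1%}")
```

The result is close to the 43.41% of the plain 2-parallel design, which shows that the hardware saving costs very little in power.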


Combining pipelining and parallel processing

Pipelining

Reduces the capacitance to be charged/discharged in 1 clock period

Parallel processing

Increases the clock period for charging/discharging the original capacitance

3-parallel 2-stage pipelining


Pipelining + parallel processing

Propagation delay of the parallel pipelined filter

$$L\,T_{pd} = \frac{(C_{charge}/M)\, \beta V_0}{k (\beta V_0 - V_t)^2} = \frac{L\, C_{charge} V_0}{k (V_0 - V_t)^2}$$

Solution of β

$$ML (\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2$$
