5012: vlsi signal processing

30
1 ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-1 5012: VLSI Signal Processing 5012: VLSI Signal 5012: VLSI Signal Processing Processing Lecture 1 Pipelining & Retiming Lecture 1 Pipelining & Retiming ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-2 Introduction DSP System Real time requirement Data driven – synchronized by data Programmable processor ASIC or custom DSP DSPs What to use depends on requirements • Throughput Power, Energy • Area • Wordlength – precision • Flexibility • Time-to-market • Time-in-market

Upload: others

Post on 14-Apr-2022

13 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 5012: VLSI Signal Processing

1

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-1

5012: VLSI Signal Processing

5012: VLSI Signal 5012: VLSI Signal ProcessingProcessing

Lecture 1 Pipelining & RetimingLecture 1 Pipelining & Retiming

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-2

Introduction• DSP System

– Real time requirement– Data driven – synchronized by data

– Programmable processor– ASIC or custom DSP DSPs

– What to use depends on requirements• Throughput• Power, Energy• Area• Wordlength – precision• Flexibility• Time-to-market• Time-in-market

Page 2: 5012: VLSI Signal Processing

2

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-3

Introduction• Selecting Application Processors for

Mobile Multimedia (http://www.bdti.com/articles/030930CDC_Application_Processors.pdf)

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-4

DSP Representations• DSP Algorithm are non-terminating

• Iteration period: the time for one iteration, one execution of the loop within DSP algorithm

• Critical path: longest path without delay• Critical path sets bound on clock frequency, i.e. Tclk≥Tcritical• Sampling rate: number of samples processed / second• Latency: time difference between output and corresponding

input (combinational logic = gate delays, sequential logic = number of clock cycles)

• The clock rate (or frequency) of the DSP system is not the same as its sampling rate

]3[]2[]1[][][ 3210 −+−+−+= nxhnxhnxhnxhny

Page 3: 5012: VLSI Signal Processing

3

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-5

Block DiagramThe graphical representations of DSP algorithm exhibit all the parallelism (concurrence) and data-driven properties (dependence) of the system, as well as provide insight into space and time tradeoffs.

• Consists of functional blocks connected with directed edges, which represent data flow from its input block to its output block

]3[]2[]1[][][ 3210 −+−+−+= nxhnxhnxhnxhny

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-6

Signal-Flow Graph (SFG)• Nodes: represent computations and/or task• Directed edge (j,k): denote a linear transformation

from the input signal at node j to the output signal at node k

• Source: no entering edge• Sink: only entering edges

]3[]2[]1[][][ 3210 −+−+−+= nxhnxhnxhnxhny

Page 4: 5012: VLSI Signal Processing

4

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-7

Transposition of an SFG• Applicable to linear SISO systems• Flow graph reversal

– Reserve the direction of all edges– Exchange input and output

]3[]2[]1[][][ 3210 −+−+−+= nxhnxhnxhnxhny

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-8

Data-Flow Graph (DFG)• DFG captures the data-driven property of the DSP algorithm• Nodes represent computations, functions, or tasks• Directed edges represents data flow with non-negative

number of delays• A node can execute whenever all its input data are available• Each edge describes a precedence constraint between two

nodes in DFG– Intra iteration precedence constraint: if no delay– Inter-iteration precedence constraint: if existing one or more

delays

]3[]2[]1[][][ 3210 −+−+−+= nxhnxhnxhnxhny

Page 5: 5012: VLSI Signal Processing

5

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-9

Examples of DFGs• DFGs and Block diagrams can be used to describe both

linear single-rate and nonlinear multi-rate DSP systems.• Synchronous DFG (SDFG): the number of input and output

samples are specified at each node in each execution.

• Multi-rate SDFG

Decimator (down-sampling)

Expander(up-sampling)

3fA=5fB

2fB=3fC

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-10

Critical Path – fundamental limit• 4 possible paths

– Input nodes to delay element– Input node to output node– Delay element to delay element– Delay element to output

• Critical path: the path with the longest computation time among all paths without delay elements

• Clock period is bounded by the computation time of the critical path• Example: 5-tap FIR filter and assume TA=4ns, TM=10ns

critical paths = 26ns

Page 6: 5012: VLSI Signal Processing

6

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-11

RemarksDirect Form 4-tap FIR Transposed Form

1. TCritical=TM+(N-1) TA, where N is the

number of taps

2. Short word-length delay elements

1. TCritical=TM+TA

2. High fan out causes high

driving force

3. Long word-length delay

elements

10-bit 10-bit

20-bit 20-bit

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-12

Iteration Period• Iteration: execution of all computations of an algorithm once• Iteration period: the time required for execution of on iteration

• Iteration rate: the number iterations executed per second• Sample rate (throughput): the number of samples processed per

second

MA TT +

A0→B0⇒A1 →B1⇒A2….

Page 7: 5012: VLSI Signal Processing

7

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-13

Loop Bound• Loop: a directed path that begins and ends at the same node• Loop bound of the loop j: Tj / Wj, where Tj is the loop

computation time and Wj is the number of delays in the loopExamples:– assume TA+TM =6ns,

a, b, c, a is a loop with loop bound = 6ns

– assume TA+TM =6ns, the loop bound = 3ns

y(n)=ay(n-2)+x(n)

2MA

sTTT +≥

MAs TTT +≥

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-14

Remarks

AN⇒BN+1 ⇒AN+2 ⇒BN+3…

retiming

Page 8: 5012: VLSI Signal Processing

8

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-15

• Critical loop: the loop with the maximum loop bound• Iteration bound: the loop bound of the critical loop• Not possible to achieve iteration period lower than iteration

bound even with infinite processing power• Definition

• Example

Iteration Bound – fundamental limit

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-16

Algorithms for Computing Iteration Bound

• Longest Path Matrix (LPM) Algorithm• Minimum Cycle Mean (MCM) Algorithm

Page 9: 5012: VLSI Signal Processing

9

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-17

LPM Algorithm (1/2)• d: number of delays, di: ith delay element• Construct matrices L(m), 1≤m≤d, where

each entry ℓi,j(m) is the longest path from di

to dj with m-1 delays• ℓi,j

(m)=-1 if no such path• L(m+1) can be obtained from L(1) and L(m)

recursively by, if there is k such that ℓi,j

(m+1)= ℓi,k(1) + ℓk,j

(m), otherwise ℓi,j(m)=-1

• Note that the diagonal element represents the longest computation time of all loops with m delays contains di

• Then the iteration bound can be calculated by T∞=max{ℓi,i

(m) / m}, for 1≤ i,m ≤ d

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-18

LPM Algorithm (2/2)

Page 10: 5012: VLSI Signal Processing

10

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-19

T∞ of Multirate DFGs (MRDFGs)• Construct the equivalent single-rate DFG (SRDFG)• Compute the iteration bound of the equivalent

SRDFG (unrolling or unfolding)• The iteration bound of the MRDFG is the same as

the iteration bound of the equivalent SRDFG.

3fA=5fB

2fB=3fC

fB=(3/5)fAfc=(2/3)fB=(2/5)fA

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-20

Transformation of MRDFG

kx: the number of nodes related to “x”

Note: same number of delay elements

Page 11: 5012: VLSI Signal Processing

11

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-21

Pipelining & Parallel Processing• Pipelining processing

– By using pipelining latches to reduce critical path– Either to increase the clock speed or sample speed, or to

reduce the power consumption at same speed• Parallel processing

– By using replicating hardware to process multiple samples in parallel in a clock period

– The effective sampling speed is increased by the level of parallelism.

– Can be used to the reduction of power consumption

continue sendingTsample=TCLK Tsample≠TCLK

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-22

Parallel Processing• Parallel processing and pipelining processing are duals each

other• Parallel processing is just the block processing. • The number of inputs processed in a clock cycle is referred

to as the block size.

)2()1()()( −+−+= ncxnbxnaxny

⎪⎩

⎪⎨

++++=+−+++=+−+−+=

)3()13()23()23()13()3()13()13()23()13()3()3(

kcxkbxkaxkykcxkbxkaxkykcxkbxkaxky

delay Block delay

Page 12: 5012: VLSI Signal Processing

12

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-23

Pipelining LatchesDirect Form 4-tap FIR

1. TCritical=TM+(N-1) TA, where N is

the number of taps

2. Clock speed limited by critical

path

1. TCritical=TM+TA

2. Pipelining will not affect

functionality but introduce latency

Feed-forward pipelining latches

2-level pipelined architecture

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-24

Cutset Pipelining (1/2)• Cutset: a set of edges of a graph such that if

these edges are removed from the graph, the graph becomes disjoint

• Feed-forward cutset: the data move in the forward direction on all edges of the cutset

• In a synchronous DFG (SDFG),– communications occurs on the edges only– data flow through the functional units directly and the

operations are synchronized with the arranged data arrival times (via delay insertion)

Page 13: 5012: VLSI Signal Processing

13

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-25

Cutset Pipelining (2/2)• Cutset pipelining preserves architecture’s

behavior– the data movement between the two disjoint sub-graphs

only occurs on the feed-forward cutset, delaying or advancing the data movement along all edges on the cutset by the same amount of time do not change the architecture’s behavior

• Two main drwaback– Increase in the number of latches– System latency

+ kD

+kD

+ kD

cutset

G2

G1

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-26

Feed-forward Cutset

Must place delays on all edges in the cutset

DD

D

Not a valid pipelining

Page 14: 5012: VLSI Signal Processing

14

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-27

Retiming• Retiming is to move around existing delays s.t.

– Does not alter the latency of the system– Changes (Reduces) the critical path of the system– Reduces the number of register

• Pipelining is equivalent to introducing many delays at the input followed by retiming

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-28

Example -- Cutset Retiming

Example+ kD

+ kD

- kD

cutset

G2

G1

A

B

C

D

E

F

2D

D

cutset

D

D

D

D

Add delays to edges going one way and remove from ones going the other

Page 15: 5012: VLSI Signal Processing

15

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-29

Example – Node RetimingTo cutset around one specified node

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-30

Node Retiming• The cutset can be selected such that one of the disjoint sub-

graphs contains only one node• Retiming value r(V): defined as the number of delays moved from

each output edge of the node V to each of its input edges• Feasibility constraints: for each directed edge UV of the retimed

SDFG, the number of delay elements must be non-negative

• For any SDFG, there is a retiming design space defined by the system of inequalities (each edge contributes an inequality)

• Solving systems of inequalities

( ) ( ) ( ) ( )( ) ( ) ( )

( ) ( )lyrespective retiming before andafter

edge on the elementsdelay ofnumber theare and where

0

UVewew

ewVrUrUrVrewew

r

r

≤−≥−+=

Page 16: 5012: VLSI Signal Processing

16

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-31

Node Retiming with Formation

Result

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-32

Properties of Retiming• P.1:The weight of a retimed path p = V0, V1, … , Vk is given by

wr(p) = w(p)+r(Vk)-r(V0)

• P.2: Retiming does not change the number of delays in a cycle (special case of P.1 with Vk=V0)

• P.3: Retiming does not alter the iteration bound in a DFG as the number of delays in a cycle does not change

• P.4: Adding a constant value j to the retiming value of each node does not change the mapping from G to Gr

Page 17: 5012: VLSI Signal Processing

17

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-33

Pipelining – Feedforward Cutset• A special case of cutset retiming by

placing delays at feedforward cutset

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-34

Remarks3-stage Lattice Filter

TCritical=2TM+(N-1) TA, where N is the number of taps

Cutset retiming

TCritical=4TM+4TA, a constant, which is independent of N

Page 18: 5012: VLSI Signal Processing

18

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-35

K Slow-down and RetimingReplace each D by KD

2-slowtransformation

1. K-1 null operation must be inserted2. Hardware utilization is reduced (e.g. 50%)

D

Retiming 2-slow version

Hardware can be fully utilized if two independent operations are available

Tclk=1 t.u., Titer=2 t.u.

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-36

Cutset Retiming2-slow Lattice Filter

Inserting of “0” to preserve behavior !!

x0 0 x1 0 x2 0 x3 0 …

Tsample = 2 Tcritical

2-slow

TCritical=2TM+(N-1) TA, where N is tap number

Page 19: 5012: VLSI Signal Processing

19

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-37

Retiming for Min Clock Period (1/3)• : the minimum feasible clock period is defined as

• In the retiming design space (defined by the system of inequalities), find a retimed SDFG with the minimum

• Definition:– W(U,V): the minimum number of registers on any path

from node U to node V

– D(U,V): the maximum computation time among all paths from U to V with weight W(U,V)

( )GΦ

( ) { }0)(:)(max ==Φ pwptG

( )GΦ

( ) ( ){ }VUpwVUW p⎯→⎯= :min,

( ) ( ) ( ) ( ){ }VUWpwVUptVUD p , and : max, =⎯→⎯=

w(p) denotes the delays of the patht(p) denotes the computation time of the path

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-38

Retiming for Min Clock Period (2/3)Step 1: Compute W(U,V) and D(U,V)

• Let M = tmaxn, where tmax is the maximum computation time of the nodes in G and n is the number of nodes in G

• Form a new graph G’, which is the same as G except the edge weights are replaced by w’(e) = Mw(e) – t(u) for all edges from U to V

• Solve the all-pair shortest path problem on G’ (using Floyd-Warshall algorithm). Let S’UV be the shortest path from U to V

• If U = V,– W(U,V) = 0 and D(U,V) = t(U)

else,– W(U,V) = ceiling (S’UV/M) and D(U,V) = MW(U,V) - S’UV + t(V)

Page 20: 5012: VLSI Signal Processing

20

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-39

Retiming for Min Clock Period (3/3)Step 2: Given a clock period c, find a feasible solution such that

• The retimed design space is now defined by– feasibility constraints

– critical path constraints

i.e.

• If the system of inequalities is not , try a smaller c

( ) cG ≤Φ

( ) ( ) ( )ewVrUr ≤−

( ) ( ) ( )( ) cVUD

VUWVrUr>

−≤−,such that G,in V and Unodes allfor

1,

( ) ( ) ( ) ( )( ) ( ) ( ) ( ) cVUDUrVrVUW

VUWVrUrcVUD≤<−+⇔

−≤−>, ,1, if

1, ,, if

φ

r(U)-r(V) ≤ W(U,V)

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-40

Retiming for Register Minimization• Find a feasible solution with minimum number of registers in the

design space constrained by feasibility and critical-path constraints (the refined retimed space with minimum clock period)

• Multiple fan-out problem

• Modeled as an integer linear programming (ILP) problemMinimize , while satisfying– longest fanout constraints– feasibility constraints– critical path constraints

U 3D

D

7D

V1

V2

V3 U D 2D 4D

V1

V2

V3

( ) ( ) ( )ewVrUr ≤−( ) ( ) ( )

( ) cVUDVUWVrUr

>−≤−

,such that G,in V and Unodes allfor 1,

( )∑ Uq

( ) ( ) ( ) ( )UqUrVrew ≤−+

Note solving the ILP model is NP-hard, and a gadget model can be found in the literature.

Page 21: 5012: VLSI Signal Processing

21

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-41

Pipelining & Parallel Processing• Pipelining processing

– By using pipelining latches to reduce critical path– Either to increase the clock speed or sample speed, or to

reduce the power consumption at same speed• Parallel processing

– By using replicating hardware to process multiple samples in parallel in a clock period

– The effective sampling speed is increased by the level of parallelism.

– Can be used to the reduction of power consumption

continue sendingTsample=TCLK Tsample≠TCLK

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-42

Pipelining & Parallel Processing (2/1)• Communication bounded

– When the critical path is less than that the I/O bound, i.e. output-pad delay plus input-pad delay and the wire delay between two chips, then the system is communication bounded.

– Pipelining can be used only to the extent such that the criticalpath is limited by the communication bound.

– Once this is reached, pipelining can no longer increase the speed

Page 22: 5012: VLSI Signal Processing

22

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-43

Pipelining & Parallel Processing (2/2)• M-level pipelining processing

– The critical path is reduced to 1/M

• L-level parallel processing– The critical path of the block processing system remains

unchanged.– L samples are processed in 1 (not L) clock cycle, the

iteration period is Titeration=Tsample=Tclk/L, where Tclk≥Tcritical

• By combining M-level pipelining and L-level parallel processing, then – Titeration=Tsample=(1/LM)Tclk

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-44

Power Consumption in CMOS• Ptotal=Pdynamic+Pshort-circuit+Pstatic

• Short circuit Power Consumption: current spikes

• Static Power Consumption: due to leakage current

Page 23: 5012: VLSI Signal Processing

23

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-45

CMOS Power Consumption• (Dynamic) power dissipation

– Ctotal: the total capacitance of the CMOS circuit• Propagation delay

– Ccharge : the capacitance to be charged or discharged in a single clock cycle (along the critical path)

– V0 : the supply voltage; Vt : the threshold voltage– K : technology parameter

fVCP total ⋅⋅= 20

( )20

0chargepd

tVVkVC

T−

⋅=

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-46

Reduce…• Capacitances

– Transistor/Gate C– Load C– Interconnects– External

• Activity• Frequency• Power supply• Some comments

– Off-chip connections have high capacitive load– System integration

Page 24: 5012: VLSI Signal Processing

24

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-47

Pipelining for Low Power (1/2)• For an M-level pipelined architecture, the critical path is

reduced to 1/M and the capacitance to be charged or discharged in a single cycle (i.e. Ccharge) is also reduced to 1/M

• If the same clock speed is maintained (i.e. f = 1/Tpd), only 1/M of the non-pipelined capacitance is required to be charged or discharged, which suggests voltage reduction

• Suppose the voltage can be reduced to , the power consumption of a pipelined architecture becomes

0V⋅β

( )

pipelinednon

totalpipelined

P

fVCP

−⋅=

⋅⋅⋅=2

20

β

β

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-48

Remarks

Propagation delay of the original filter and the pipelined filter

Page 25: 5012: VLSI Signal Processing

25

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-49

Pipelining for Low Power (2/2)• Solving

– propagation delay of the original architecture

– propagation delay of the pipelined architecture

– setting the above two equations equal, the following quadratic equation can be obtained to solve

( )20

0chargepipelined-non

tVVkVC

T−

⋅=

( )20

0charge

pipelinedtVVk

VM

C

T−⋅

⋅⋅=

β

β

β

( ) ( )202

0 tt VVVVM −⋅=−⋅ ββ

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-50

Example: Reduce Power by Pipelining• Consider the following two FIR filters.

• Assume– the multiplication and the addition take 10 u.t. and 2 u.t. respectively– the capacitance of the multiplier is 5 times that of an adder– In the fine-grain pipelined filter, the multiplier is broken into m1and m2,

with 6 and 4 u.t. computation delay and 3 and 2 times capacitance of an adder

– supply voltage V0 and threshold voltage Vt are 5 and 0.6 respectively• Problems:

• What is the supply voltage of the pipelined architecture if the clock periods are identical?

• What is the relative power consumption?

D y(n)D

x(n)

D y(n)D

x(n)

D D D

m1

m2

m1 m1

m2 m2

Page 26: 5012: VLSI Signal Processing

26

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-51

Solution• Calculate Ccharge: let CM, CA, Cm1 and Cm2 be the capacitance

of the multiplier, the adder, and the two fine-grain pipelining parts of the pipelined FIR filter respectively– Non-pipelined: Ccharge= CM + CA = 6CA– Pipelined: Ccharge= Cm1 = Cm2 +CA = 3CA

Notice the pipelining level M = 2 and the charging capacitance of the pipelined filter is one-half of that of the original one

• Solve

• The supply voltage of the pipelined filter is

• Power consumption ratio is

( ) ( )

)11950 (invalid, 02390or 6033007203631506056052

0

2

22

tV.Vβ..β.β.β

.β.β

<=⋅==+⋅−⋅

−⋅=−⋅⋅

Q

0165.30 =⋅= VVpipelined β

%4.362 =β

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-52

Parallel Processing for Low Power (1/2)• For an L-parallel architecture, the charge capacitance

remains the same, but the total capacitance (i.e. Ctotal) is increased L times

• The clock speed of the L-parallel architecture is reduced to 1/L (i.e. f = 1/LTpd) to maintain the same sample rate, which means the Ccharge is charged or discharged L times longer.

• The supply voltage can be reduced to , since more time is allowed to charged or discharged the same capacitance. Thus the power consumption of the parallel architecture becomes

0V⋅β

( ) ( )

parallelnon

totalparallel

PLfVCLP

−⋅=

⋅⋅⋅⋅=

2

20

β

β

Page 27: 5012: VLSI Signal Processing

27

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-53

RemarksPropagation delay of the original filter and the parallel filter

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-54

Parallel Processing for Low Power (2/2)• Solving

– propagation delay of the original architecture

– propagation delay of the parallel architecture

– setting these two propagation delays equal, the following quadratic equation can be obtained to solve β

β

( )20

0chargeparallel-non

tVVkVC

T−

⋅=

( )LT

VVkVC

Tt

⋅=−⋅

⋅⋅= parallel-non2

0

0chargeparallel β

β

( ) ( )202

0 tt VVVVL −⋅=−⋅ ββ

Page 28: 5012: VLSI Signal Processing

28

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-55

Example: Reduce Power by Parallel• Consider the following two FIR filters, with critical paths denoted

in dash lines respectively

• Assume– the multiplication and the addition take 8 u.t. and 1 u.t. respectively– the capacitance of the multiplier is 8 times that of an adder– both architectures operate at the sample period of 9 u.t.– supply voltage V0 and threshold voltage Vt are 3.3 and 0.45 respectively

• What is the supply voltage of the parallel architecture?• What is the relative power consumption?

D y(n)D

x(n)

D

D y(2k+1)

x(2k)

y(2k)D D

x(2k+1)

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-56

Solution• Calculate Ccharge: let CM and CA be the capacitance of the

multiplier and the adder respectively– Non-parallel: Ccharge= CM + CA = 9CA– Parallel: CM +2CA = 10CA

Notice the charging capacitance of the two architectures are not equal, and we cannot use the equation directly

• Solve

• The supply voltage of the parallel filter is• Power consumption ratio is

β

( ) ( )

( ) ( )

)093060 (invalid, 02820or 6589008225.13425.6701.98

95

2

10 and 9

0

2

20

20

20

02

0

0

t

tt

parallelnonparallel

t

Aparallel

t

Aparallelnon

V.Vβ..βββ

VVVV

TTVVkVCT

VVkVCT

<=⋅==+⋅−⋅

−⋅⋅=−⋅⋅

⋅=−⋅⋅⋅

=−⋅

=

Q

Q

ββ

ββ

17437.20 =⋅= VVpipelined β%41.432 =β

Page 29: 5012: VLSI Signal Processing

29

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-57

Parallel Processing• Block processing

– the number of inputs processed in a clock cycle is referred to as the block size

– at the k-th clock cycle, three inputs x(3k), x(3k+1), and x(3k+2) are processed simultaneously to generate y(3k), y(3k+1), and y(3k+2)

Serial toParallel

Converter

SISOx(n) y(n)

MIMO

x(3k) y(3k)x(3k+1)x(3k+2)

y(3k+1)y(3k+2)

Parallel toSerial

Converterx(n) y(n)

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-58

I/O Conversion• Serial to parallel converter

• Parallel to serial converter

3k

D DT/3T/3

sampling period

y(3k)y(3k+1)y(3k+2)

y(n)

x(n) D D

x(3k)x(3k+1)x(3k+2)

T/3T/3

sampling period

Page 30: 5012: VLSI Signal Processing

30

ADSP Lecture1 - Pipelining & Retiming ([email protected]) 1-59

Mathematical Formulation• e.g. y(n) = ay(n-9) + x(n)• 2-parallel

Y(2k) = ay(2k-9) + x(2k)Y(2k+1) = ay(2k-8) + x (2k+1)

• In 2-parallel SDFG, one active clock edge leads two samplesY(2k) = ay(2(k-5)+1) + x(2k)Y(2k+1) = ay(2(k-4)+0) + x(2k+1)

• Dependency with less than # parallelism of sample delays can be implemented with internal routing