chapter 7 systolic arrays - york university · 2012. 3. 19. · chapter 7 systolic arrays cse4210...
TRANSCRIPT
1
YORK UNIVERSITY CSE4210
Chapter 7Systolic Arrays
Mokhtar AboelazeCSE4210 Winter 2012
YORK UNIVERSITY CSE4210
Systolic Architecture
• A number of usually similar processing elements connected together to implement a specific algorithm.
• Data move between PE’s in a rhythmic fashion.
PE PE PE
Processor
. . .
2
YORK UNIVERSITY CSE4210
Systolic Architecture• Typically, fully pipelined (all communication
between PE’s contain delay element (why?). Also communication between neighboring PE’s only.
• Some relaxation techniques can get rid of the delay. Also, there may be communication between close bet not neighboring PE’s
• Some processors (especially boundary ones may be different than the rest.
• Could be used as a coprocessor
YORK UNIVERSITY CSE4210
Design Methodology• Using linear mapping techniques from the
dependence space to the space-time• Usually, algorithm is described by a dependence
graph.• Dependence graph is regular if the presence of
any edge connected to a node, means the existence of a similar edge in every node.
• There is no concept of time in the dependence graph.
3
YORK UNIVERSITY CSE4210
FIR Filter
x0 x1 x2 x3
ω2
ω1
ω0
• Y(n)=ω0 x(n)+ ω1 x(n-1) + ω2 x(n-2)
• Data is moving in three directions
• X in [0 1]T
• ω in [1 0]T
• Y in [1 -1]T
YORK UNIVERSITY CSE4210
Design Methodology• We map the N-dimensional DG to a lower
dimension systolic architecture• Three vectors are introduced• Projection vector d=[d1 d2]T• Processor space vector PT = [p1 p2]• Scheduling vector ST=[S1 S2]
• Hardware Utilization Efficiency =1/|STd|
4
YORK UNIVERSITY CSE4210
Design Methodology• Projection vector
– Two nodes are displaced by d or multiple of it, are mapped to the same processor
• Processor space vector– Any node in the DG I (IT=(i,j)) is mapped to processor
PTI• Scheduling vector
– Any node in the DG I would be executed at time STI• Subject to some constraints
YORK UNIVERSITY CSE4210
Design Methodology• Steps
– Represent algorithm as a DG– Apply mapping (projection and scheduling)– Edge mapping
• If an edge e exists in the DG, then an edge PTe is introduced in the systolic array with STe delay
– Construct the systolic array
5
YORK UNIVERSITY CSE4210
Design Methodology• Constraints
– Processor space vector and the projection vector must be orthogonal PTD=0. if IA-IB = multiple of d, they are executed by the same processor
– If A and B are mapped to the same processor, they should not be executed at the same time STIA ≠ STIB i.e. STd≠0
YORK UNIVERSITY CSE4210
Example -- IIR• Y(n)=ω0 x(n)+ ω1 x(n-1) + ω2 x(n-2)
y0 y1 y2 y3
x0 x1 x2 x3
ω2
ω1
ω0
Single assignment format with broadcasting data:
Do n=1,2, . . .y1(n,-1)=0Do k=0,Ky1(n,k)=y1(n,k-1)
+w(k)*x(n-k)enddoy(n)=y1(n,K)
Enddo
6
YORK UNIVERSITY CSE4210
Design I[ ] [ ]0110
01
==⎥⎦
⎤⎢⎣
⎡= TT SPd
1-1Y(1 -1)
01X(0 1)
10ω(1 0)
STePTeEdge eω ω ω
yx
D D
If an edge e, then an edge PTeis introduced in the array with delay STe
Weights stay, broadcast input, move results
YORK UNIVERSITY CSE4210
Design Ix0 x1 x2 x3
ω2
ω1
ω0
time0 1 2 3
1
2
3pr
oces
sor
Point I is executed in PE PTI at time STI
Point (i,j) is executed at PE j at time i
D ⊗⊕
ω
D ⊗⊕
ω
D ⊗⊕
ω
DD
7
YORK UNIVERSITY CSE4210
⊗ ⊗ ⊗
⊕ ⊕ ⊕DD
DDDω2ω1ω0
X3
⊗ ⊗ ⊗
⊕ ⊕ ⊕DD
DDDω2ω1ω0
X4
⊗ ⊗ ⊗
⊕ ⊕ ⊕DD
DDDω2ω1ω0
X5
?
?
?
YORK UNIVERSITY CSE4210
Design II[ ] [ ]0111
11
==⎥⎦
⎤⎢⎣
⎡−
= TT SPd
10Y(1 -1)
01X(0 1)
11W(1 0)
STePTeEdge e
ω D D
X[n]
Y[n]
Point I is executed in PE PTI at time STI
Point (i,j) is executed at PE i+j at time i
Broadcast input, move weights, result stay
8
YORK UNIVERSITY CSE4210
Design II
YORK UNIVERSITY CSE4210
Design II• How can you square the previous design
with.• Point (i,j) is executed at PE i+j at time i
x0 x1 x2 x3
ω2
ω1
ω00,0
x,y means at time x in processor y
1,0
2,0
0,1
2,1
3,1
2,2
3,2
4,2 5,3
4,3
3,3
9
YORK UNIVERSITY CSE4210
Design III[ ] [ ]1110
01
==⎥⎦
⎤⎢⎣
⎡= TT SPd
0-1Y(1 -1)
11X(0 1)
10W(1 0)
STePTeEdge e
Weights stay, move input, fan in output
YORK UNIVERSITY CSE4210
Design III
10
YORK UNIVERSITY CSE4210
Design IV[ ] [ ]1111
11
−==⎥⎦
⎤⎢⎣
⎡−
= TT SPd
20Y(1 -1)
1-1X(0 1)
11W(1 0)
STePTeEdge e
YORK UNIVERSITY CSE4210
Design IV
D
11
YORK UNIVERSITY CSE4210
Design V[ ] [ ]1211
11
==⎥⎦
⎤⎢⎣
⎡−
= TT SPd
10Y(1 -1)
11X(0 1)
21W(1 0)
STePTeEdge e
YORK UNIVERSITY CSE4210
Design V
D
D
D D
D
⊗ ⊕D
D
D
12
YORK UNIVERSITY CSE4210
Dual
[ ] [ ]21111
1==⎥
⎦
⎤⎢⎣
⎡−
= TT SPd
10Y(1 -1)
21X(0 1)
11W(1 0)
STePTeEdge e
Dual of the previous design.
X and w are exchanged
YORK UNIVERSITY CSE4210
Design VI[ ] [ ]1110
01
−==⎥⎦
⎤⎢⎣
⎡= TT SPd
2-1Y(1 -1)
1-1X(0 1)
10W(1 0)
STePTeEdge e
PE0 PE1 PE0x D D
2D 2D
13
YORK UNIVERSITY CSE4210
Dual
YORK UNIVERSITY CSE4210
Transformation
14
YORK UNIVERSITY CSE4210
Scheduling Vector• Consider the dependence X Y• Y can start after X has started and
completed.• We also have to take into consideration the
time it will take the data to travel from X to Y• Constraints on the scheduling vector.
YORK UNIVERSITY CSE4210
yy
yy
Ty
xx
xx
Tx
y
yy
Ty
x
xx
Tx
xxy
y
yY
x
xx
Ji
ssISS
Ji
ssISS
Ji
ssISS
Ji
ssISS
TSS
Ji
IYJi
IX
γ
γ
+⎟⎟⎠
⎞⎜⎜⎝
⎛==
+⎟⎟⎠
⎞⎜⎜⎝
⎛==
⎟⎟⎠
⎞⎜⎜⎝
⎛==
⎟⎟⎠
⎞⎜⎜⎝
⎛==
+≥
⎟⎟⎠
⎞⎜⎜⎝
⎛=→⎟⎟
⎠
⎞⎜⎜⎝
⎛=
)(
)(
)(
)(
::
21
21
21
21Linear
scheduling
Affine scheduling
15
YORK UNIVERSITY CSE4210
xxyyxT
xxxT
yyT
TeS
TISIS
≥−+
++≥+
→ γγ
γγ
edgean for inequality scheduling The
,schedulingaffine Using
Assume that ex y=Iy – Ix
YORK UNIVERSITY CSE4210
Scheduling Vector• Capture all the fundamental edges
(Reduced Dependence Graph RDG).• Use the Regular Iterative Algorithm (RIA)
to describe the problem.• Construct the scheduling inequalities and
solve them for a possible ST
16
YORK UNIVERSITY CSE4210
RIA Description• The regular iterative algorithm has two
standard forms• Standard Input if the index of the inputs
are all the same for all equations• Standard Output if the index of the output
are all the same for all equations
YORK UNIVERSITY CSE4210
RIA Description• W(i+1,j) = W(i,j)• X(i,j+1) = X(i,j)• Y(i+1,j-1) = Y(i,j)+W(i+1,j-1)X(i+1,j-1)x0 x1 x2 x3
ω2
ω1
ω0
Not RIA Indices are
not the same
17
YORK UNIVERSITY CSE4210
RIA Description• W(i,j) = W(i-1,j)• X(i,j) = X(i,j-1)• Y(i,j) = Y(i-1,j+1)+W(i,j)X(i,j)x0 x1 x2 x3
ω2
ω1
ω0
RIA
YORK UNIVERSITY CSE4210
RIA Description
),(),()1,1(),()1,(),(),1(),(
jiXjiWjiYjiYjiXjiX
jiWjiW
++−=−=
−=
Reduced RIA graph for the FIR filter
18
YORK UNIVERSITY CSE4210
[ ]
[ ]
[ ]
[ ]
[ ] 125,1
1:
0,00
:
1,01
:
1,10
:
0,00
:
2121
21
121
221
21
++≥−+−⎟⎟⎠
⎞⎜⎜⎝
⎛−
=→
≥−⎟⎟⎠
⎞⎜⎜⎝
⎛=→
≥−+⎟⎟⎠
⎞⎜⎜⎝
⎛=→
≥−+⎟⎟⎠
⎞⎜⎜⎝
⎛=→
≥−⎟⎟⎠
⎞⎜⎜⎝
⎛=→
yy
xy
ww
xx
wy
sssseYY
sseYX
ssseWW
ssseXX
sseYW
γγ
γγ
γγ
γγ
γγ
xxyyxT Tes ≥−+→ γγ Tmul=5,Tad=2
Tcomm=1
YORK UNIVERSITY CSE4210
FIR• Solving the set of equation assuming all γ’s to be zero.
• A possible solution is s=[9 1]• A possible selection for d=[1,-1] and p =
[1 1]• sTd=8,HUE =1/ 8 eT PTe STe
W(1,0) 1 9
X(0,1) 1 1
Y(1,-1) 0 8
19
YORK UNIVERSITY CSE4210
FIRDD
9D 9D
X
W
8D8D8D
YORK UNIVERSITY CSE4210
Matrix Multiplication
2222122122
2122112121
2212121112
2112111111
2221
1211
2221
1211
2221
1211
babacbabacbabacbabac
bbbb
aaaa
cccc
+=+=+=+=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=⎟⎟
⎠
⎞⎜⎜⎝
⎛
20
YORK UNIVERSITY CSE4210
Matrix Multiplication
YORK UNIVERSITY CSE4210
Matrix Multiplication
21
YORK UNIVERSITY CSE4210
Matrix Multiplication
Taccu = 1 T com = 0
YORK UNIVERSITY CSE4210
Matrix Multiplication[ ] [ ]
⎥⎦
⎤⎢⎣
⎡=
==
010001
100111
p
dS TT
22
YORK UNIVERSITY CSE4210
Solution 2
HUE = 1
[ ] [ ] ⎥⎦
⎤⎢⎣
⎡=−==
110101
,111111 pdS TT
YORK UNIVERSITY CSE4210
2,2 3,2 4,2
2,3 3,3 4,3
2,4 3,4 4,4
0,0 1,0
0,1 1,1 2,1 3,1 4,1
2,0 3,0 4,0
0,2 1,2
0,3 1,3
0,4 1,4
a20 a21 a22a10 a11 a12
a00 a01 a02
b00
b01 b10
b02 b11 b20
b12 b21
b22
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
D
DD
D
D DD
DD
D DD
D D
D D D
D
D
D
D
D
D
DD
D
D
D
D
D D
DD
23
YORK UNIVERSITY CSE4210
Solution 3
( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛ −=
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛==
010101
,101
,111 TT PdS
YORK UNIVERSITY CSE4210
Solution 4
( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛−−
=⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛==
110101
,111
,111 TT PdS
24
YORK UNIVERSITY CSE4210
Solution 5
( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛
−==
001110
,1
10
,121 TT PdS
Vector PTe sTe
a(0,1,0) 1,0 2
b(1,0,0) 0,1 1
c(0,0,1) 1,0 1
YORK UNIVERSITY CSE4210
0,0 1,0 2,0 3,0 4,0
0,1 1,1 2,1 3,1 4,1
0,2 1,2 2,2 3,2 4,2
a00b00
a01
b10
a02
b22
b01
b11b21b02
b12
a10 a11
b20
a12
a20 a21 a22
Assuming one delay element in every PE
25
YORK UNIVERSITY CSE4210
Solution 6
( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛−−−
=⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛==
110111
,112
,111 TT PdS
YORK UNIVERSITY CSE4210
Solution 7
( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛−
=⎟⎟⎟
⎠
⎞
⎜⎜⎜
⎝
⎛−−
==011011
,211
,121 TT PdS
26
YORK UNIVERSITY CSE4210
Solution 4