systolic architecture
Systolic Architecture
• Conventional architectures operate on load and store operations from memory.
• This requires more memory references, which slows down the system as shown below:
Systolic Architecture
• In systolic processing, the data to be processed flows through various operation stages and is finally put in memory as shown below:
Systolic Architecture
• The basic architecture consists of processing elements (PEs) that are simple and identical in behavior at all instants.
• Each PE may have some registers and an ALU.
• PEs are interlinked in a manner dictated by the requirements of the specific algorithm.
• E.g. 2D mesh, hexagonal arrays etc.
Systolic Architecture
• PEs at the boundary of the structure are connected to memory.
• Data picked up from memory is circulated among the PEs that require it in a rhythmic manner, and the result is fed back to memory; hence the name systolic.
• Example : Multiplication of two n x n matrices
Example : Multiplication of two n x n matrices
• Every element of the input matrices is picked up n times from memory, as it contributes to n elements of the output.
• To reduce this memory access, the systolic architecture ensures that each element is pulled from memory only once.
• Consider an example where n = 3
Matrix Multiplication
a11 a12 a13     b11 b12 b13     c11 c12 c13
a21 a22 a23  *  b21 b22 b23  =  c21 c22 c23
a31 a32 a33     b31 b32 b33     c31 c32 c33
Conventional Method: O(n^3)
For I = 1 to N
  For J = 1 to N
    For K = 1 to N
      C[I,J] = C[I,J] + A[I,K] * B[K,J];
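The conventional method can be sketched as runnable Python to make the memory traffic visible. This is only an illustration: the function name `matmul_conventional` and the `reads` counter are hypothetical helpers, not part of the original slides. The counter shows that the triple loop fetches every input element n times.

```python
from collections import Counter

def matmul_conventional(A, B):
    """Triple-loop product of two n x n matrices, counting operand fetches."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    reads = Counter()                      # hypothetical fetch counter, for illustration only
    for i in range(n):
        for j in range(n):
            for k in range(n):
                reads[("A", i, k)] += 1    # A[i][k] is fetched again for every j
                reads[("B", k, j)] += 1    # B[k][j] is fetched again for every i
                C[i][j] += A[i][k] * B[k][j]
    return C, reads

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, reads = matmul_conventional(A, B)
print(C)                      # [[19, 22], [43, 50]]
print(set(reads.values()))    # {2}: every element is fetched n = 2 times
```

In total the loop performs 2 * n^3 fetches, whereas pulling each element only once would need 2 * n^2.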
Systolic Method
This will run in O(n) time!
To run in O(n) time we need n x n processing units; in our example n = 3, so we need 3 x 3 = 9 processors.
P9 P8 P7
P6 P5 P4
P1 P2 P3
For systolic processing, the input data need to be modified as:
Flip columns 1 & 3 of A:
a13 a12 a11
a23 a22 a21
a33 a32 a31
Flip rows 1 & 3 of B:
b31 b32 b33
b21 b22 b23
b11 b12 b13
and finally stagger the data sets for input.
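The flip and stagger steps can be sketched in Python as per-tick input streams. The flip only changes how the data is drawn, so that a11 and b11 sit nearest the array and enter first; in code this is simply feeding each row of A and each column of B in index order. The stagger is modelled here by holding back row i of A and column i of B by i ticks (counting from zero). That delay is an assumption, since the slides do not state it explicitly. The name `stagger_inputs` is a hypothetical helper.

```python
def stagger_inputs(A, B):
    """Build the per-tick input streams for an n x n array.

    a_in[t][i] is the value entering row i of the array at tick t + 1,
    b_in[t][j] is the value entering column j; None marks an idle slot.
    """
    n = len(A)
    ticks = 2 * n - 1                      # the last row/column finishes feeding at tick 2n - 1
    a_in = [[None] * n for _ in range(ticks)]
    b_in = [[None] * n for _ in range(ticks)]
    for i in range(n):
        for k in range(n):
            a_in[i + k][i] = A[i][k]       # row i of A is delayed by i ticks (assumed stagger)
            b_in[i + k][i] = B[k][i]       # column i of B is delayed by i ticks (assumed stagger)
    return a_in, b_in

A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
a_in, b_in = stagger_inputs(A, A)
for tick, (a_row, b_row) in enumerate(zip(a_in, b_in), start=1):
    print(tick, a_row, b_row)
```

Each value appears exactly once in the streams, which is the sense in which every element is pulled from memory only once.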
At every tick of the global system clock, data is passed to each processor from two different directions; the two values are multiplied and the product is added to the running sum saved in that processor's register.
a13 a12 a11
a23 a22 a21
a33 a32 a31
b31 b21 b11
b32 b22 b12
b33 b23 b13
P9 P8 P7
P6 P5 P4
P1 P2 P3
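The timing of the array can be written down as a short formula. This is a sketch under stated assumptions: row i of A and column j of B are each held back by i - 1 and j - 1 ticks, and a value moves one processor per tick. The symbols t_ij(k) and T are introduced here for illustration and do not appear in the slides.

```latex
% Tick at which the PE computing c_{ij} forms its k-th product a_{ik} b_{kj}
t_{ij}(k) = (i - 1) + (j - 1) + k, \qquad 1 \le i, j, k \le n

% The last product is formed by the PE for c_{nn} with k = n
T = t_{nn}(n) = 3n - 2
```

For n = 3 this gives T = 7 ticks, which grows linearly in n and is consistent with the O(n) claim; it also matches the seven clock ticks of the worked example in this presentation.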
3 4 2     3 4 2     23 36 28
2 5 3  *  2 5 3  =  25 39 34
3 2 5     3 2 5     28 32 37
Using a systolic array.
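Waiting inputs (rightmost value enters next):
A row 1: 2 4 3
A row 2: 3 5 2
A row 3: 5 2 3
B column 1: 3 2 3
B column 2: 2 5 4
B column 3: 5 3 2
Array:
P9 P8 P7
P6 P5 P4
P1 P2 P3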
P1 9
P2 0
P3 0
P4 0
P5 0
P6 0
P7 0
P8 0
P9 0
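Waiting inputs (rightmost value enters next):
A row 1: 2 4
A row 2: 3 5 2
A row 3: 5 2 3
B column 1: 3 2
B column 2: 2 5 4
B column 3: 5 3 2
Array (products formed this tick):
P9 P8 P7
P6 P5 P4
3*3 P2 P3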
Clock tick : 1
P1 9+8=17
P2 12
P3 0
P4 6
P5 0
P6 0
P7 0
P8 0
P9 0
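Waiting inputs (rightmost value enters next):
A row 1: 2
A row 2: 3 5
A row 3: 5 2 3
B column 1: 3
B column 2: 2 5
B column 3: 5 3 2
Array (products formed this tick):
P9 P8 P7
P6 P5 2*3
4*2 3*4 P3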
Clock tick : 2
P1 17+6=23
P2 12+20=32
P3 6
P4 6+10=16
P5 8
P6 0
P7 9
P8 0
P9 0
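Waiting inputs (rightmost value enters next):
A row 1: (empty)
A row 2: 3
A row 3: 5 2
B column 1: (empty)
B column 2: 2
B column 3: 5 3
Array (products formed this tick):
P9 P8 3*3
P6 2*4 5*2
2*3 4*5 3*2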
Clock tick : 3
P1 23
P2 32+4=36
P3 6+12=18
P4 16+9=25
P5 8+25=33
P6 4
P7 9+4=13
P8 12
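P9 0
Waiting inputs (rightmost value enters next):
A row 1: (empty)
A row 2: (empty)
A row 3: 5
B column 1: (empty)
B column 2: (empty)
B column 3: 5
Array (products formed this tick, or finished sums):
P9 3*4 2*2
2*2 5*5 3*3
23 2*2 4*3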
Clock tick : 4
P1 23
P2 36
P3 18+10=28
P4 25
P5 33+6=39
P6 4+15=19
P7 13+15=28
P8 12+10=22
P9 6
Array (products formed this tick, or finished sums):
3*2 2*5 5*3
5*3 3*2 25
23 36 2*5
Clock tick : 5
P1 23
P2 36
P3 28
P4 25
P5 39
P6 19+15=34
P7 28
P8 22+10=32
P9 6+6=12
Array (products formed this tick, or finished sums):
2*3 5*2 28
3*5 39 25
23 36 28
Clock tick : 6
P1 23
P2 36
P3 28
P4 25
P5 39
P6 34
P7 28
P8 32
P9 12+25=37
Array (products formed this tick, or finished sums):
5*5 32 28
34 39 25
23 36 28
Clock tick : 7
P1 23
P2 36
P3 28
P4 25
P5 39
P6 34
P7 28
P8 32
P9 37
Array (finished sums):
37 32 28
34 39 25
23 36 28
End
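The trace above can be reproduced with a small tick-by-tick simulation. This is a minimal sketch, not the presenter's implementation: it assumes A values move one processor to the right per tick, B values move one processor down per tick, and row i and column j are staggered by i and j ticks (counting from zero). The name `systolic_matmul` and the `history` list are hypothetical. In the result, row 1 corresponds to P1 to P3, row 2 to P4 to P6, and row 3 to P7 to P9, matching the register values listed per clock tick.

```python
def systolic_matmul(A, B):
    """Simulate an n x n systolic array in which each PE accumulates one C[i][j].

    Returns the final sums and a snapshot of all result registers after each tick.
    """
    n = len(A)
    acc = [[0] * n for _ in range(n)]          # result register of each PE
    a_reg = [[None] * n for _ in range(n)]     # A value currently held by each PE
    b_reg = [[None] * n for _ in range(n)]     # B value currently held by each PE
    history = []
    for t in range(3 * n - 2):                 # clock tick number is t + 1
        new_a = [[None] * n for _ in range(n)]
        new_b = [[None] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:                     # boundary PE: take the next staggered A value
                    k = t - i
                    new_a[i][j] = A[i][k] if 0 <= k < n else None
                else:                          # inner PE: take A from the left neighbour
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:                     # boundary PE: take the next staggered B value
                    k = t - j
                    new_b[i][j] = B[k][j] if 0 <= k < n else None
                else:                          # inner PE: take B from the neighbour above
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b
        for i in range(n):
            for j in range(n):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    acc[i][j] += a_reg[i][j] * b_reg[i][j]   # multiply and accumulate
        history.append([row[:] for row in acc])
    return acc, history

A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
C, history = systolic_matmul(A, A)
print(C)             # [[23, 36, 28], [25, 39, 34], [28, 32, 37]]
print(history[1])    # [[17, 12, 0], [6, 0, 0], [0, 0, 0]], the registers after clock tick 2
```

The simulation uses 3n - 2 ticks in total, which is 7 for n = 3.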
Samba: Systolic Accelerator for Molecular Biological Applications
This systolic array contains 128 processors spread across 32 full-custom VLSI chips. Each chip houses 4 processors, and each processor computes 10 million matrix cells per second.
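The figures quoted for Samba can be checked with simple arithmetic. The aggregate rate below is only a peak estimate that assumes all 128 processors are busy at the same time, which the slides do not state; the variable names are illustrative.

```python
chips = 32
processors_per_chip = 4
cells_per_processor_per_second = 10_000_000

processors = chips * processors_per_chip                  # 128, as quoted
peak_cells_per_second = processors * cells_per_processor_per_second
print(processors)               # 128
print(peak_cells_per_second)    # 1280000000, i.e. 1.28 billion cells per second at peak
```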