Systolic Architecture

• Conventional architectures operate on data through load and store operations to and from memory.

• This requires more memory references, which slows down the system as shown below:


• In systolic processing, the data to be processed flows through various operation stages and is finally put in memory as shown below:

• The basic architecture consists of processing elements (PEs) that are simple and behave identically at all times.

• Each PE may have some registers and an ALU.

• PEs are interconnected in a pattern dictated by the requirements of the specific algorithm, e.g. a 2D mesh or a hexagonal array.
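As a rough illustration of such a PE, the sketch below assumes a multiply-accumulate operation, which is what the matrix example later in these slides needs. The class name `ProcessingElement`, its field names and the `step` method are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ProcessingElement:
    """Hypothetical PE: two operand registers, one accumulator, one tiny ALU."""
    a: Optional[int] = None  # operand latched from one neighbour
    b: Optional[int] = None  # operand latched from the other neighbour
    acc: int = 0             # running result kept inside the PE

    def step(self, a_in: Optional[int], b_in: Optional[int]) -> Tuple[Optional[int], Optional[int]]:
        """One clock tick: pass the old operands on, latch new ones, accumulate."""
        a_out, b_out = self.a, self.b      # values handed to the next PEs
        self.a, self.b = a_in, b_in        # latch the incoming operands
        if a_in is not None and b_in is not None:
            self.acc += a_in * b_in        # the ALU's only job in this sketch
        return a_out, b_out


pe = ProcessingElement()
outputs = [pe.step(3, 3), pe.step(4, 2), pe.step(2, 3)]
print(pe.acc)    # 23
print(outputs)   # [(None, None), (3, 3), (4, 2)]
```

Every PE runs the same `step` on every tick; only the data arriving from its neighbours differs, which is what keeps the elements simple and identical.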

• PEs at the boundary of the structure are connected to memory.

• Data picked up from memory is circulated among the PEs that require it in a rhythmic manner, and the result is fed back to memory; hence the name systolic.

• Example : Multiplication of two n x n matrices

Example : Multiplication of two n x n matrices

• In the conventional method, every input element is fetched from memory n times, because it contributes to n elements of the output.

• To reduce these memory accesses, the systolic architecture ensures that each element is fetched only once.

• Consider an example where n = 3

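Matrix Multiplication

$$
\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}
\times
\begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{pmatrix}
=
\begin{pmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \end{pmatrix}
$$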

Conventional Method: O(n^3)

For I = 1 to N
  For J = 1 to N
    For K = 1 to N
      C[I,J] = C[I,J] + A[I,K] * B[K,J];
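The loop nest above can be made runnable to show where the memory traffic comes from. The following is a minimal sketch in Python that also counts operand fetches; the function name `matmul_conventional` and the `reads` counter are illustrative additions, and the input is the 3 x 3 matrix used later in the walkthrough.

```python
def matmul_conventional(A, B):
    """Triple-loop product that also counts operand fetches from A and B."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    reads = 0  # one fetch of A[i][k] plus one of B[k][j] per inner step
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
                reads += 2
    return C, reads


A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
C, reads = matmul_conventional(A, A)
print(C)      # [[23, 36, 28], [25, 39, 34], [28, 32, 37]]
print(reads)  # 54
```

For n = 3 the loop performs 2n^3 = 54 operand fetches although the two inputs hold only 2n^2 = 18 distinct elements, so each element is fetched n = 3 times.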

Systolic Method

This will run in O(n) time!

To run in O(n) time we need n x n processing units; in our example n = 3, so 9 units.

P9 P8 P7
P6 P5 P4
P1 P2 P3

For systolic processing, the input data need to be modified as:

A with columns 1 & 3 flipped:
a13 a12 a11
a23 a22 a21
a33 a32 a31

B with rows 1 & 3 flipped:
b31 b32 b33
b21 b22 b23
b11 b12 b13

and finally stagger the data sets for input.
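The flip-and-stagger preparation can be sketched in a few lines of Python. This is an illustration under assumptions, not the slides' own procedure: the function name `prepare_queues` is hypothetical, each queue is consumed from its right-hand end, and `None` marks an idle slot that delays a queue by one tick.

```python
def prepare_queues(A, B):
    """Build the input queues for the array; values leave each queue from the right."""
    n = len(A)
    # Flip the columns of A: row i is queued as a_i3 a_i2 a_i1,
    # so a_i1 sits at the right-hand end and leaves first.
    a_queues = [row[::-1] for row in A]
    # Flip the rows of B, then read it column by column:
    # column j is queued as b_3j b_2j b_1j, so b_1j leaves first.
    flipped_B = B[::-1]
    b_queues = [[flipped_B[r][j] for r in range(n)] for j in range(n)]
    # Stagger: queue number i (counting from 0) waits i ticks before its
    # first value leaves, modelled here by i idle slots at the exit end.
    a_queues = [q + [None] * i for i, q in enumerate(a_queues)]
    b_queues = [q + [None] * j for j, q in enumerate(b_queues)]
    return a_queues, b_queues


A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
a_queues, b_queues = prepare_queues(A, A)
print(a_queues)  # [[2, 4, 3], [3, 5, 2, None], [5, 2, 3, None, None]]
print(b_queues)  # [[3, 2, 3], [2, 5, 4, None], [5, 3, 2, None, None]]
```

Calling `pop()` on each queue once per tick then delivers a11 and b11 on the first tick, a12, a21, b21 and b12 on the second, and so on.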

At every tick of the global system clock, data is passed to each processor from two different directions; the two values are multiplied and the product is added to the result saved in that processor's register.

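A rows waiting to enter (the rightmost value enters first):
a13 a12 a11
a23 a22 a21
a33 a32 a31

B columns waiting to enter (one column per line, the last value enters first):
b31 b21 b11
b32 b22 b12
b33 b23 b13

Processor array:
P9 P8 P7
P6 P5 P4
P1 P2 P3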
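The schedule implied by this staggered feed can be written down explicitly. The following is a sketch under an indexing assumption that the slides do not state: the processor accumulating c_ij receives a_ik and b_kj together at tick t(i, j, k), with ticks counted from 1.

```latex
c_{ij} = \sum_{k=1}^{n} a_{ik}\, b_{kj},
\qquad
t(i,j,k) = i + j + k - 2,
\qquad
t_{\text{last}} = \max_{1 \le i,j,k \le n} t(i,j,k) = 3n - 2 .
```

For n = 3 this gives 3n - 2 = 7, which agrees with the seven clock ticks of the walkthrough that follows and grows linearly with n.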

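$$
\begin{pmatrix} 3 & 4 & 2 \\ 2 & 5 & 3 \\ 3 & 2 & 5 \end{pmatrix}
\times
\begin{pmatrix} 3 & 4 & 2 \\ 2 & 5 & 3 \\ 3 & 2 & 5 \end{pmatrix}
=
\begin{pmatrix} 23 & 36 & 28 \\ 25 & 39 & 34 \\ 28 & 32 & 37 \end{pmatrix}
$$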

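Using a systolic array.

Inputs not yet fed (within each group, the last value enters next):
A rows: 2 4 3 / 3 5 2 / 5 2 3
B columns: 3 2 3 / 2 5 4 / 5 3 2

Processor array:
P9 P8 P7
P6 P5 P4
P1 P2 P3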

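Clock tick : 1

Inputs not yet fed:
A rows: 2 4 / 3 5 2 / 5 2 3
B columns: 3 2 / 2 5 4 / 5 3 2

Products computed this tick (idle PEs show their name):
P9 P8 P7
P6 P5 P4
3*3 P2 P3

Register contents after this tick:
P1: 9, P2: 0, P3: 0
P4: 0, P5: 0, P6: 0
P7: 0, P8: 0, P9: 0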

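Clock tick : 2

Inputs not yet fed:
A rows: 2 / 3 5 / 5 2 3
B columns: 3 / 2 5 / 5 3 2

Products computed this tick (idle PEs show their name):
P9 P8 P7
P6 P5 2*3
4*2 3*4 P3

Register contents after this tick:
P1: 9+8=17, P2: 12, P3: 0
P4: 6, P5: 0, P6: 0
P7: 0, P8: 0, P9: 0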

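Clock tick : 3

Inputs not yet fed:
A rows: (none) / 3 / 5 2
B columns: (none) / 2 / 5 3

Products computed this tick (idle PEs show their name):
P9 P8 3*3
P6 2*4 5*2
2*3 4*5 3*2

Register contents after this tick:
P1: 17+6=23, P2: 12+20=32, P3: 6
P4: 6+10=16, P5: 8, P6: 0
P7: 9, P8: 0, P9: 0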

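Clock tick : 4

Inputs not yet fed:
A rows: (none) / (none) / 5
B columns: (none) / (none) / 5

Products computed this tick (idle PEs show their name, finished PEs their result):
P9 3*4 2*2
2*2 5*5 3*3
23 2*2 4*3

Register contents after this tick:
P1: 23, P2: 32+4=36, P3: 6+12=18
P4: 16+9=25, P5: 8+25=33, P6: 4
P7: 9+4=13, P8: 12, P9: 0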

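Clock tick : 5

Products computed this tick (finished PEs show their result):
3*2 2*5 5*3
5*3 3*2 25
23 36 2*5

Register contents after this tick:
P1: 23, P2: 36, P3: 18+10=28
P4: 25, P5: 33+6=39, P6: 4+15=19
P7: 13+15=28, P8: 12+10=22, P9: 6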

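Clock tick : 6

Products computed this tick (finished PEs show their result):
2*3 5*2 28
3*5 39 25
23 36 28

Register contents after this tick:
P1: 23, P2: 36, P3: 28
P4: 25, P5: 39, P6: 19+15=34
P7: 28, P8: 22+10=32, P9: 6+6=12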

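Clock tick : 7

Products computed this tick (finished PEs show their result):
5*5 32 28
34 39 25
23 36 28

Register contents after this tick:
P1: 23, P2: 36, P3: 28
P4: 25, P5: 39, P6: 34
P7: 28, P8: 32, P9: 12+25=37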

End

Final contents of the array:
37 32 28
34 39 25
23 36 28

Final register contents:
P1: 23, P2: 36, P3: 28
P4: 25, P5: 39, P6: 34
P7: 28, P8: 32, P9: 37
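The whole walkthrough can be reproduced with a short simulation. The sketch below makes assumptions that the slides do not spell out: A values travel along the rows of the array, B values travel along its columns, and the PE at row i, column j accumulates c_ij. The function name `systolic_matmul` and the `history` list are illustrative. Judging from the register values in the trace, P1 to P9 correspond to the accumulators read row by row.

```python
def systolic_matmul(A, B):
    """Simulate an n x n array of multiply-accumulate PEs, tick by tick."""
    n = len(A)
    acc = [[0] * n for _ in range(n)]        # result register of each PE
    a_reg = [[None] * n for _ in range(n)]   # A operand currently in each PE
    b_reg = [[None] * n for _ in range(n)]   # B operand currently in each PE
    history = []                             # accumulator snapshot after each tick
    for t in range(3 * n - 2):
        # Every PE passes its A operand along its row and its B operand
        # along its column; iterate backwards so nothing is overwritten.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Boundary PEs fetch from memory; row i and column j start i and j
        # ticks late (the stagger), and each element is fetched only once.
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else None
        for j in range(n):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else None
        # Each PE multiplies its two operands and adds the product to its register.
        for i in range(n):
            for j in range(n):
                if a_reg[i][j] is not None and b_reg[i][j] is not None:
                    acc[i][j] += a_reg[i][j] * b_reg[i][j]
        history.append([row[:] for row in acc])
    return acc, history


A = [[3, 4, 2], [2, 5, 3], [3, 2, 5]]
C, history = systolic_matmul(A, A)
print(C)             # [[23, 36, 28], [25, 39, 34], [28, 32, 37]]
print(len(history))  # 7
print(history[1])    # [[17, 12, 0], [6, 0, 0], [0, 0, 0]]
```

The snapshot after the second tick matches the register contents shown for clock tick 2, and the run finishes after 3n - 2 = 7 ticks while fetching each input element from memory exactly once.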

Samba: Systolic Accelerator for Molecular Biological Applications

This systolic array contains 128 processors spread over 32 full-custom VLSI chips. Each chip houses 4 processors, and each processor computes 10 million matrix cells per second.
