TRANSCRIPT
Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense
Tensor Computations
Nitish Srivastava, Hanchen Jin, Shaden Smith², Hongbo Rong³,
David Albonesi, and Zhiru Zhang
Cornell University, ²Microsoft AI & Research, ³Intel Parallel Computing Lab
1
What is a Tensor?
• Tensors are generalizations of matrices to n dimensions
• A scalar is a tensor with 0 dimensions
• A vector is a tensor with 1 dimension
• A matrix is a tensor with 2 dimensions, and so on
[Figure: examples of a 0D tensor (scalar), a 1D tensor (vector), a 2D tensor (matrix), a 3D tensor (cube), and a 4D tensor]
Product Reviews
• High-dimensional data (users × products × words)
• Density: 10⁻⁷%
• Requires a low-dimensional representation for ease of analysis
[Figure: a 3D product-review tensor, illustrating a slice (a 2D array in a tensor) and a fiber (a 1D array in a tensor)]
2
Tensor Decompositions for Low-Dimensional Representation
MTTKRP:
    Y(i,f) = Σ_{j=0..J} B(j,f) · Σ_{k=0..K} A(i,j,k) · C(k,f)
TTMc:
    Y(i,f,g) = Σ_{j=0..J} B(j,f) · Σ_{k=0..K} A(i,j,k) · C(k,g)
[Figure: CP decomposition approximates an I×J×K tensor as a sum of rank-one outer products of factor-matrix columns from B and C; Tucker decomposition approximates it as a small core tensor multiplied along each mode by factor matrices. The input tensor is sparse data; the factors are dense data.]
3
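As a concrete reference for the MTTKRP formula above, here is a minimal dense NumPy sketch (the function name and test shapes are illustrative, not from the talk):

```python
import numpy as np

def mttkrp(A, B, C):
    """Dense reference MTTKRP: Y[i,f] = sum_j B[j,f] * sum_k A[i,j,k]*C[k,f]."""
    I, J, K = A.shape
    F = B.shape[1]
    Y = np.zeros((I, F))
    for i in range(I):
        for j in range(J):
            # inner contraction over k yields a length-F fiber
            t = A[i, j, :] @ C          # shape (F,)
            # Hadamard product with B's row, accumulated over j
            Y[i, :] += B[j, :] * t
    return Y

# cross-check against the equivalent einsum formulation
A = np.random.rand(3, 4, 5)
B = np.random.rand(4, 2)
C = np.random.rand(5, 2)
assert np.allclose(mttkrp(A, B, C), np.einsum('ijk,jf,kf->if', A, B, C))
```

When A is sparse, only the nonzero A[i,j,k] entries contribute, which is exactly the mixed sparse-dense workload the talk targets.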
Kernel-Sparsity Spectrum of Tensor Applications
[Figure: tensor kernels (MTTKRP, MM, MV) and applications plotted against density on a log scale (1, 10⁻², 10⁻⁴, 10⁻⁶, 10⁻⁸). Dense CNNs/RNNs and MRI sit near density 1; compression and sparse CNNs/RNNs around 10⁻²–10⁻⁴; graph neural networks, PageRank, graph processing, and recommendation systems at 10⁻⁶–10⁻⁸.]
MM = Matrix-Matrix multiply
MV = Matrix-Vector multiply
4
Kernel-Sparsity Spectrum of Tensor Applications
[Same figure as the previous slide, annotated with arithmetic intensity: dense kernels are compute bound, moderately sparse kernels are compute & memory bound, and highly sparse kernels are memory bound.]
5
Kernel-Sparsity Spectrum of Tensor Applications
[Same figure, annotated with existing accelerators: dense NN accelerators, sparse NN accelerators, SpMV accelerators, graph NN accelerators, and sparse-sparse MM accelerators each cover only a narrow region of the density spectrum.]
6
Kernel-Sparsity Spectrum of Tensor Applications
[Same figure, split into two regimes: sparse-sparse MM at the extremely sparse end, and sparse-dense and dense-dense tensor kernels across the rest of the spectrum.]
7
Kernel-Sparsity Spectrum of Tensor Applications
[Same figure: Tensaurus targets the entire sparse-dense and dense-dense region of the spectrum, complementing sparse-sparse MM accelerators.]
8
Challenges with Sparse-Dense Tensor Acceleration
▸ Dense Tensor Acceleration
– Systolic arrays provide high utilization of both memory and compute
[Figure: an r×c systolic array of PEs (PE00 … PErc), labeled "highly efficient"]
9
Challenges with Sparse-Dense Tensor Acceleration
▸ Dense Tensor Acceleration
– Systolic arrays provide high utilization of both memory and compute
▸ Mixed Sparse-Dense Tensor Acceleration
– Memory bound
– Hard to achieve high compute and bandwidth utilization
– How do we push sparse data into a dense systolic array?
▸ Goal: Leverage a dense accelerator to efficiently perform sparse-dense compute
▸ Key approach: Co-design of the accelerator architecture and the sparse format
– Low overhead for supporting sparse compute
– High compute and bandwidth utilization
10
Importance of Accelerator-Friendly Formats
SpMV (sparse matrix × dense vector multiply)
Compressed Sparse Row (CSR) format stores the matrix as three arrays:
• Values:  a00 a02 a11 a13 a21 a33
• Col. Id: 0 2 1 3 1 3
• Row Ptr: offsets marking where each row's nonzeros begin
[Figure: a 4×4 sparse matrix A (rows 0–3) with nonzeros a00, a02, a11, a13, a21, a33, multiplied by a dense vector b = (b0, b1, b2, b3)]
11
Importance of Accelerator-Friendly Formats
[Animation: two PEs process the CSR matrix row by row. Cycle 0: PE0 computes a00·b0 and PE1 computes a11·b1. Cycle 1: PE0 computes a02·b2 and PE1 computes a13·b3. Cycle 2: PE0 computes a21·b1 and PE1 computes a33·b3. Each PE must decode Row Ptr to find its rows, and each nonzero triggers an indirect access into b through Col. Id.]
11
Importance of Accelerator-Friendly Formats
Problems with CSR:
• Non-streaming & non-vectorized accesses
• Indirect memory accesses
[Figure: the same two-PE CSR SpMV example, highlighting the scattered accesses into the Values, Col. Id, and b arrays]
16
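The CSR problems called out above are visible in a few lines of code: the gather `b[col_id[p]]` is the indirect, non-streaming access, and the pointer-chasing through `row_ptr` prevents simple vectorization. A minimal sketch (the function name and numeric values are placeholders; the sparsity pattern is the 4×4 matrix from the slide):

```python
def spmv_csr(row_ptr, col_id, values, b):
    """SpMV y = A @ b with A in CSR format."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        # row boundaries must be decoded from Row Ptr before any work starts
        for p in range(row_ptr[i], row_ptr[i + 1]):
            # indirect gather into b: the access pattern depends on the data
            y[i] += values[p] * b[col_id[p]]
    return y

# sparsity pattern from the slide (a00 a02 / a11 a13 / a21 / a33),
# with placeholder numeric values 1..6
row_ptr = [0, 2, 4, 5, 6]
col_id  = [0, 2, 1, 3, 1, 3]
values  = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = [1.0, 1.0, 1.0, 1.0]
print(spmv_csr(row_ptr, col_id, values, b))  # [3.0, 7.0, 5.0, 6.0]
```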
Importance of Accelerator-Friendly Formats
Compressed Interleaved Sparse Row (CISR) [1]:
• Values:  a00 a11 a02 a13 a21 a33 (nonzeros interleaved across PEs)
• Col. Id: 0 1 2 3 1 3
• PE:      0 1 0 1 0 1
• Row Len: 2 2 1 1 (for rows 0–3)
Rows are assigned to PE channels (rows 0, 2 → PE0; rows 1, 3 → PE1), and their nonzeros are interleaved one per PE per memory word, so the Values and Col. Id streams can be read sequentially.
[1] Fowers et al., "A High Memory Bandwidth FPGA Accelerator for Sparse Matrix-Vector Multiplication," Int'l Symp. on Field-Programmable Custom Computing Machines (FCCM), 2014.
17
Importance of Accelerator-Friendly Formats
[Animation: the same SpMV example in CISR. Cycle 0: PE0 computes a00·b0 and PE1 computes a11·b1. Cycle 1: PE0 computes a02·b2 and PE1 computes a13·b3. Cycle 2: PE0 computes a21·b1 and PE1 computes a33·b3. The Values and Col. Id streams are consumed strictly in order, with no pointer chasing through Row Ptr.]
17
Importance of Accelerator-Friendly Formats
Utilized bandwidth vs. # PEs (max: 16 GB/s, DDR3):
• CSR:  1.6 GB/s with 2 PEs; 1.8 GB/s with 8 PEs
• CISR: 6.1 GB/s with 4 PEs; 11.2 GB/s with 8 PEs
⇒ ~6x higher bandwidth utilization with CISR
17
Importance of Accelerator-Friendly Formats
Benefits of CISR:
• Streaming & vectorized accesses
[Figure: the same two-PE CISR SpMV example alongside the bandwidth comparison above]
24
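To make the row-to-PE assignment concrete, here is a software sketch of CISR-style encoding. Note this is an assumed simplification: the Fowers et al. design assigns rows to channels dynamically in hardware as each channel runs out of work, whereas this sketch uses a greedy shortest-channel heuristic; all names are hypothetical.

```python
def encode_cisr(rows, num_pes):
    """Greedy sketch of CISR encoding: hand each row to the PE channel with
    the least work so far, producing per-PE (col, val) streams plus per-PE
    row lengths. The per-PE streams would then be interleaved one element
    per PE per memory word, making the off-chip access pattern streaming."""
    streams = [[] for _ in range(num_pes)]   # (col, val) stream per PE
    lengths = [[] for _ in range(num_pes)]   # row lengths per PE
    for r, row in enumerate(rows):
        ch = min(range(num_pes), key=lambda p: len(streams[p]))
        streams[ch].extend(row)
        lengths[ch].append(len(row))
    return streams, lengths

# slide example: rows 0 and 1 have two nonzeros, rows 2 and 3 have one
rows = [[(0, 'a00'), (2, 'a02')],
        [(1, 'a11'), (3, 'a13')],
        [(1, 'a21')],
        [(3, 'a33')]]
streams, lengths = encode_cisr(rows, num_pes=2)
# PE0 gets rows 0 and 2; PE1 gets rows 1 and 3 — matching the slide
```

Because each PE knows its own row lengths, it can detect row boundaries locally instead of decoding a shared Row Ptr array.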
Extending CISR to CISS for Tensors
CISR+ avoids row-decoding by adding a few explicit zeros, which makes the design more scalable ⇨ HW/SW co-design.
CISR+ for the matrix example:
    PE:             0 1 0 1 0 1 0 1 0 1
    Values:         0 0 a00 a11 a02 a13 0 0 a21 a33
    Row Id/Col. Id: 0 1 0 1 2 3 2 3 1 3
Compressed Interleaved Sparse Slice (CISS) extends CISR+ from rows to slices (two-dimensional arrays in a tensor). For a 3D tensor over (i, j, k) with nonzeros a000, a011, a111, a200, a201, a301:
    PE:     0 1 0 1 0 1 0 1 0 1
    Values: 0 0 a000 a111 a011 0 0 a200 a301 a201
    i/j:    0 1 0 1 1 2 3 0 0 0
    k:      x x 0 1 1 x x 0 1 1
25
Computation Pattern for Tensor Kernels
Scalar-Fiber product followed by Fiber-Fiber product (SF3) pattern:
    fiber_out = ⊕_{D1} ( fiber1 op ⊕_{D0} ( scalar · fiber0 ) )
where op is a fiber-fiber operator and the reduction domains select dense or sparse iteration:
    D1 = [0, J) (dense)  or  { j | ∃k s.t. A(i,j,k) ≠ 0 } (sparse)
    D0 = [0, K) (dense)  or  { k | A(i,j,k) ≠ 0 } (sparse)
Instances:
    MTTKRP: ∀i  Y(i,:)   = Σ_j B(j,:) ∘ Σ_k A(i,j,k) · C(k,:)
    TTMc:   ∀i  Y(i,:,:) = Σ_j B(j,:) ⊗ Σ_k A(i,j,k) · C(k,:)
    MM:     ∀i  Y(i,:)   = Σ_j A(i,j) · B(j,:)   (null fiber-fiber op)
    MV:     similarly, with a dense vector b in place of B
The SF3 compute pattern can express all the common dense and mixed sparse-dense tensor kernels.
26
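To see how one pattern covers several kernels, here is a small executable sketch of SF3 (all names hypothetical): the caller supplies the two reduction domains, the scalar, the two fibers, and the fiber-fiber operator, and the same loop nest then computes an MTTKRP output fiber.

```python
import numpy as np

def sf3(d1, fiber1, op, d0, scalar, fiber0, F):
    """One output fiber of the SF3 pattern:
    fiber_out = sum_{j in D1} fiber1(j) op sum_{k in D0(j)} scalar(j,k)*fiber0(k).
    For MM/MV the fiber-fiber op degenerates to passing the inner sum through."""
    out = np.zeros(F)
    for j in d1:
        inner = np.zeros(F)
        for k in d0(j):
            inner += scalar(j, k) * fiber0(k)   # scalar-fiber products, reduced
        out += op(fiber1(j), inner)             # fiber-fiber product, reduced
    return out

# MTTKRP row i as an SF3 instance: Y[i,:] = sum_j B[j,:] * (sum_k A[i,j,k]*C[k,:])
A = np.random.rand(2, 3, 4)
B = np.random.rand(3, 5)
C = np.random.rand(4, 5)
i = 0
y = sf3(range(3), lambda j: B[j], np.multiply,
        lambda j: range(4), lambda j, k: A[i, j, k], lambda k: C[k], F=5)
assert np.allclose(y, np.einsum('jk,jf,kf->f', A[i], B, C))
```

Swapping `range(4)` for the set of k with A[i,j,k] ≠ 0 gives the sparse variant of the same pattern, which is what CISS feeds to the hardware.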
PE for SF3 Compute Pattern
    fiber_out = ⊕_{D1} ( fiber1 op ⊕_{D0} ( scalar · fiber0 ) )
[Figure: PE microarchitecture — a control unit feeds a vector multiplier (VMUL) with a scalar and either fiber0 or fiber1, followed by a vector adder (VADD) with a REPEAT path implementing the D0 and D1 reductions]
Different output fibers can be computed in parallel.
27
Vertical Scaling Using Coarse-Grained Parallelism
[Figure: r PEs (PE0 … PEr-1) fed from HBM, each computing an independent output fiber]
    fiber_out[0]   = ⊕_{D′1}  ( fiber1 op ⊕_{D′0}  ( scalar · fiber0 ) )
    fiber_out[1]   = ⊕_{D″1}  ( fiber1 op ⊕_{D″0}  ( scalar · fiber0 ) )
    …
    fiber_out[r-1] = ⊕_{D‴1}  ( fiber1 op ⊕_{D‴0}  ( scalar · fiber0 ) )
Long-vector SIMD compute
28
Horizontal Scaling Using SIMD-Vector Parallelism
[Figure: an r×c grid of PEs (PE00 … PErc) fed from HBM and connected by a systolic network]
29
Tensaurus Architecture
[Figure: the r×c PE array connected to HBM through a Tensor Load Unit (TLU), a Matrix Load Unit (MLU), and a Matrix Store Unit (MSU), with double-buffered scratchpads (SPM0 … SPM7) behind a crossbar]
MLU: Matrix Load Unit; TLU: Tensor Load Unit; MSU: Matrix Store Unit
30
Tensaurus Architecture
[Same figure as the previous slide]
An accelerator for both dense and sparse-dense computation!
31
Evaluation Methodology
▸ Cycle-level simulation in gem5
– 8 × 8 PE array, VLEN = 8
– 8 16KB RAMs per SPM
– HBM: 8 128-bit physical channels (128 GB/s peak bandwidth)
▸ RTL modeling of a PE using PyMTL
– 28 nm (Synopsys & Cadence tools)
▸ Baselines
– CPU: Intel(R) Xeon(R) CPU E7-8867
• Sparse BLAS and SPLATT
– GPU: Titan Xp
• cuSPARSE and ParTI
– Sparse NN accelerator: Cambricon-X [1]
▸ Datasets
– FROSTT tensors, Florida sparse matrices, AlexNet, VGG-16
[Figure: area and power breakdown of the design]
[1] Zhang, Shijin, et al., "Cambricon-X: An Accelerator for Sparse Neural Networks," Int'l Symp. on Microarchitecture (MICRO), 2016.
32
Cambricon-X Baseline
▸ Cambricon-X uses a CSR variant
– Pads short rows with explicit padding entries (x)
– Uses vector bit-masks to indicate non-zero positions
– Specialized for CNNs with low sparsity
▸ CSR results in a load-imbalanced schedule
– Cambricon-X has synchronization boundaries across rows
[Figure: a sparse matrix with nonzeros a00, a02, a11, a23, a31, a33 stored as padded values plus vector bitmasks; a two-PE timeline shows load imbalance caused by synchronization at row boundaries]
33
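The bitmask scheme described above is easy to illustrate in a few lines. This is a generic sketch of a bitmask sparse row, not the exact Cambricon-X storage layout (which additionally pads rows for its PE grouping); the function names are hypothetical.

```python
def encode_bitmask(row):
    """Bitmask sparse row: one bit per position marks a nonzero;
    only the nonzero values themselves are stored."""
    mask = [1 if v != 0 else 0 for v in row]
    vals = [v for v in row if v != 0]
    return mask, vals

def decode_bitmask(mask, vals):
    """Reconstruct the dense row by walking the mask and consuming values."""
    it = iter(vals)
    return [next(it) if m else 0 for m in mask]

row = [5.0, 0, -2.0, 0]
mask, vals = encode_bitmask(row)
assert mask == [1, 0, 1, 0] and vals == [5.0, -2.0]
assert decode_bitmask(mask, vals) == row
```

At CNN-level (moderate) sparsity the mask costs only one bit per element, but at the 10⁻⁶–10⁻⁸ densities seen in tensor decomposition the mask itself dominates storage, which is one reason this format does not scale across the whole sparsity spectrum.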
Results on Sparse Neural Nets
[Figure: two bar charts (log scale, CPU baseline) for AlexNet, VGG-16, and overall — speedup over CPU and energy efficiency relative to CPU — comparing GPU-cuSPARSE, Cambricon-X, and Tensaurus]
Overall, Tensaurus is 1.9x faster and 1.7x more energy-efficient than Cambricon-X, even on sparse neural nets.
34
Results on Sparse Tensor Decomposition
[Figure: two bar charts (log scale, CPU baseline) of speedup and performance/watt for MTTKRP across nine sparse tensors, comparing GPU-ParTI and Tensaurus]
Tensaurus is 22.9x and 3.1x faster, and 220x and 290x more energy-efficient, than the CPU and GPU respectively for MTTKRP.
35
Concluding Remarks
▸ Tensaurus: a versatile accelerator for mixed sparse-dense tensor computations
– First accelerator for sparse tensor decompositions (MTTKRP, TTMc)
– Versatile: NOT limited to tensor decompositions; also efficient for sparse-dense matrix computations
– Adaptable:
• Also accelerates dense kernels
• Easily adapts to the different levels of sparsity found in various domains
▸ Key approach: co-design of the sparse format and the architecture
▸ Key results:
– High bandwidth utilization (> 70% of peak bandwidth)
– High speedup and energy efficiency compared to CPU, GPU, and Cambricon-X
36