TRANSCRIPT
Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense
Tensor Computations
Nitish Srivastava, Hanchen Jin, Shaden Smith², Hongbo Rong³,
David Albonesi, and Zhiru Zhang
Cornell University, ²Microsoft AI & Research, ³Intel Parallel Computing Lab
1
What is a Tensor?
• Tensors are generalizations of matrices to n dimensions
• A scalar is a tensor with 0 dimensions
• A vector is a tensor with 1 dimension
• A matrix is a tensor with 2 dimensions, and so on
[Figure: examples of a 0D tensor (scalar), a 1D tensor (vector), a 2D tensor (matrix), a 3D tensor (cube), and a 4D tensor]
Product Reviews
• High-dimensional data (users × products × words)
• Density: 10⁻⁷%
• Requires a low-dimensional representation for ease of analysis
[Figure: a 3D product-review tensor, illustrating a slice (a 2D array in a tensor) and a fiber (a 1D array in a tensor)]
2
Tensor Decompositions for Low-Dimensional Representation
MTTKRP:
    Y(i,f) = Σ_{j=0..J} B(j,f) · Σ_{k=0..K} A(i,j,k) · C(k,f)
TTMc:
    Y(i,f,g) = Σ_{j=0..J} B(j,f) · Σ_{k=0..K} A(i,j,k) · C(k,g)
[Figure: CP decomposition approximates an I×J×K tensor as a sum of rank-one outer products of factor-matrix columns from B and C; Tucker decomposition approximates it as a small core tensor multiplied along each mode by factor matrices. The input tensor is sparse data; the factors are dense data.]
3
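As a concrete reference for the MTTKRP formula above, here is a minimal dense NumPy sketch (the function name and test shapes are illustrative, not from the talk):

```python
import numpy as np

def mttkrp(A, B, C):
    """Dense reference MTTKRP: Y[i,f] = sum_j B[j,f] * sum_k A[i,j,k]*C[k,f]."""
    I, J, K = A.shape
    F = B.shape[1]
    Y = np.zeros((I, F))
    for i in range(I):
        for j in range(J):
            # inner contraction over k yields a length-F fiber
            t = A[i, j, :] @ C          # shape (F,)
            # Hadamard product with B's row, accumulated over j
            Y[i, :] += B[j, :] * t
    return Y

# cross-check against the equivalent einsum formulation
A = np.random.rand(3, 4, 5)
B = np.random.rand(4, 2)
C = np.random.rand(5, 2)
assert np.allclose(mttkrp(A, B, C), np.einsum('ijk,jf,kf->if', A, B, C))
```

When A is sparse, only the nonzero A[i,j,k] entries contribute, which is exactly the mixed sparse-dense workload the talk targets.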
Kernel-Sparsity Spectrum of Tensor Applications
[Figure: tensor kernels (MTTKRP, MM, MV) and applications plotted against density on a log scale (1, 10⁻², 10⁻⁴, 10⁻⁶, 10⁻⁸). Dense CNNs/RNNs and MRI sit near density 1; compression and sparse CNNs/RNNs around 10⁻²–10⁻⁴; graph neural networks, PageRank, graph processing, and recommendation systems at 10⁻⁶–10⁻⁸.]
MM = Matrix-Matrix multiply
MV = Matrix-Vector multiply
4
Kernel-Sparsity Spectrum of Tensor Applications
[Same figure as the previous slide, annotated with arithmetic intensity: dense kernels are compute bound, moderately sparse kernels are compute & memory bound, and highly sparse kernels are memory bound.]
5
Kernel-Sparsity Spectrum of Tensor Applications
[Same figure, annotated with existing accelerators: dense NN accelerators, sparse NN accelerators, SpMV accelerators, graph NN accelerators, and sparse-sparse MM accelerators each cover only a narrow region of the density spectrum.]
6
Kernel-Sparsity Spectrum of Tensor Applications
[Same figure, split into two regimes: sparse-sparse MM at the extremely sparse end, and sparse-dense and dense-dense tensor kernels across the rest of the spectrum.]
7
Kernel-Sparsity Spectrum of Tensor Applications
[Same figure: Tensaurus targets the entire sparse-dense and dense-dense region of the spectrum, complementing sparse-sparse MM accelerators.]
8
Challenges with Sparse-Dense Tensor Acceleration
▸ Dense Tensor Acceleration
– Systolic arrays provide high utilization of both memory and compute
[Figure: an r×c systolic array of PEs (PE00 … PErc), labeled "highly efficient"]
9
Challenges with Sparse-Dense Tensor Acceleration
▸ Dense Tensor Acceleration
– Systolic arrays provide high utilization of both memory and compute
▸ Mixed Sparse-Dense Tensor Acceleration
– Memory bound
– Hard to achieve high compute and bandwidth utilization
– How do we push sparse data into a dense systolic array?
▸ Goal: Leverage a dense accelerator to efficiently perform sparse-dense compute
▸ Key approach: Co-design of the accelerator architecture and the sparse format
– Low overhead for supporting sparse compute
– High compute and bandwidth utilization
10
Importance of Accelerator-Friendly Formats
SpMV (sparse matrix × dense vector multiply)
Compressed Sparse Row (CSR) format stores the matrix as three arrays:
• Values:  a00 a02 a11 a13 a21 a33
• Col. Id: 0 2 1 3 1 3
• Row Ptr: offsets marking where each row's nonzeros begin
[Figure: a 4×4 sparse matrix A (rows 0–3) with nonzeros a00, a02, a11, a13, a21, a33, multiplied by a dense vector b = (b0, b1, b2, b3)]
11
Importance of Accelerator-Friendly Formats
[Animation: two PEs process the CSR matrix row by row. Cycle 0: PE0 computes a00·b0 and PE1 computes a11·b1. Cycle 1: PE0 computes a02·b2 and PE1 computes a13·b3. Cycle 2: PE0 computes a21·b1 and PE1 computes a33·b3. Each PE must decode Row Ptr to find its rows, and each nonzero triggers an indirect access into b through Col. Id.]
11
Importance of Accelerator-Friendly Formats
Problems with CSR:
• Non-streaming & non-vectorized accesses
• Indirect memory accesses
[Figure: the same two-PE CSR SpMV example, highlighting the scattered accesses into the Values, Col. Id, and b arrays]
16
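The CSR problems called out above are visible in a few lines of code: the gather `b[col_id[p]]` is the indirect, non-streaming access, and the pointer-chasing through `row_ptr` prevents simple vectorization. A minimal sketch (the function name and numeric values are placeholders; the sparsity pattern is the 4×4 matrix from the slide):

```python
def spmv_csr(row_ptr, col_id, values, b):
    """SpMV y = A @ b with A in CSR format."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        # row boundaries must be decoded from Row Ptr before any work starts
        for p in range(row_ptr[i], row_ptr[i + 1]):
            # indirect gather into b: the access pattern depends on the data
            y[i] += values[p] * b[col_id[p]]
    return y

# sparsity pattern from the slide (a00 a02 / a11 a13 / a21 / a33),
# with placeholder numeric values 1..6
row_ptr = [0, 2, 4, 5, 6]
col_id  = [0, 2, 1, 3, 1, 3]
values  = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = [1.0, 1.0, 1.0, 1.0]
print(spmv_csr(row_ptr, col_id, values, b))  # [3.0, 7.0, 5.0, 6.0]
```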
Importance of Accelerator-Friendly Formats
Compressed Interleaved Sparse Row (CISR) [1]:
• Values:  a00 a11 a02 a13 a21 a33 (nonzeros interleaved across PEs)
• Col. Id: 0 1 2 3 1 3
• PE:      0 1 0 1 0 1
• Row Len: 2 2 1 1 (for rows 0–3)
Rows are assigned to PE channels (rows 0, 2 → PE0; rows 1, 3 → PE1), and their nonzeros are interleaved one per PE per memory word, so the Values and Col. Id streams can be read sequentially.
[1] Fowers et al., "A High Memory Bandwidth FPGA Accelerator for Sparse Matrix-Vector Multiplication," Int'l Symp. on Field-Programmable Custom Computing Machines (FCCM), 2014.
17
Importance of Accelerator-Friendly Formats
[Animation: the same SpMV example in CISR. Cycle 0: PE0 computes a00·b0 and PE1 computes a11·b1. Cycle 1: PE0 computes a02·b2 and PE1 computes a13·b3. Cycle 2: PE0 computes a21·b1 and PE1 computes a33·b3. The Values and Col. Id streams are consumed strictly in order, with no pointer chasing through Row Ptr.]
17
Importance of Accelerator-Friendly Formats
Utilized bandwidth vs. # PEs (max: 16 GB/s, DDR3):
• CSR:  1.6 GB/s with 2 PEs; 1.8 GB/s with 8 PEs
• CISR: 6.1 GB/s with 4 PEs; 11.2 GB/s with 8 PEs
⇒ ~6x higher bandwidth utilization with CISR
17
Importance of Accelerator-Friendly Formats
Benefits of CISR:
• Streaming & vectorized accesses
[Figure: the same two-PE CISR SpMV example alongside the bandwidth comparison above]
24
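To make the row-to-PE assignment concrete, here is a software sketch of CISR-style encoding. Note this is an assumed simplification: the Fowers et al. design assigns rows to channels dynamically in hardware as each channel runs out of work, whereas this sketch uses a greedy shortest-channel heuristic; all names are hypothetical.

```python
def encode_cisr(rows, num_pes):
    """Greedy sketch of CISR encoding: hand each row to the PE channel with
    the least work so far, producing per-PE (col, val) streams plus per-PE
    row lengths. The per-PE streams would then be interleaved one element
    per PE per memory word, making the off-chip access pattern streaming."""
    streams = [[] for _ in range(num_pes)]   # (col, val) stream per PE
    lengths = [[] for _ in range(num_pes)]   # row lengths per PE
    for r, row in enumerate(rows):
        ch = min(range(num_pes), key=lambda p: len(streams[p]))
        streams[ch].extend(row)
        lengths[ch].append(len(row))
    return streams, lengths

# slide example: rows 0 and 1 have two nonzeros, rows 2 and 3 have one
rows = [[(0, 'a00'), (2, 'a02')],
        [(1, 'a11'), (3, 'a13')],
        [(1, 'a21')],
        [(3, 'a33')]]
streams, lengths = encode_cisr(rows, num_pes=2)
# PE0 gets rows 0 and 2; PE1 gets rows 1 and 3 — matching the slide
```

Because each PE knows its own row lengths, it can detect row boundaries locally instead of decoding a shared Row Ptr array.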
Extending CISR to CISS for Tensors
CISR+ avoids row-decoding by adding a few explicit zeros, which makes the design more scalable ⇨ HW/SW co-design.
CISR+ for the matrix example:
    PE:             0 1 0 1 0 1 0 1 0 1
    Values:         0 0 a00 a11 a02 a13 0 0 a21 a33
    Row Id/Col. Id: 0 1 0 1 2 3 2 3 1 3
Compressed Interleaved Sparse Slice (CISS) extends CISR+ from rows to slices (two-dimensional arrays in a tensor). For a 3D tensor over (i, j, k) with nonzeros a000, a011, a111, a200, a201, a301:
    PE:     0 1 0 1 0 1 0 1 0 1
    Values: 0 0 a000 a111 a011 0 0 a200 a301 a201
    i/j:    0 1 0 1 1 2 3 0 0 0
    k:      x x 0 1 1 x x 0 1 1
25
Computation Pattern for Tensor Kernels
Scalar-Fiber product followed by Fiber-Fiber product (SF3) pattern:
    fiber_out = ⊕_{D1} ( fiber1 op ⊕_{D0} ( scalar · fiber0 ) )
where op is a fiber-fiber operator and the reduction domains select dense or sparse iteration:
    D1 = [0, J) (dense)  or  { j | ∃k s.t. A(i,j,k) ≠ 0 } (sparse)
    D0 = [0, K) (dense)  or  { k | A(i,j,k) ≠ 0 } (sparse)
Instances:
    MTTKRP: ∀i  Y(i,:)   = Σ_j B(j,:) ∘ Σ_k A(i,j,k) · C(k,:)
    TTMc:   ∀i  Y(i,:,:) = Σ_j B(j,:) ⊗ Σ_k A(i,j,k) · C(k,:)
    MM:     ∀i  Y(i,:)   = Σ_j A(i,j) · B(j,:)   (null fiber-fiber op)
    MV:     similarly, with a dense vector b in place of B
The SF3 compute pattern can express all the common dense and mixed sparse-dense tensor kernels.
26
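To see how one pattern covers several kernels, here is a small executable sketch of SF3 (all names hypothetical): the caller supplies the two reduction domains, the scalar, the two fibers, and the fiber-fiber operator, and the same loop nest then computes an MTTKRP output fiber.

```python
import numpy as np

def sf3(d1, fiber1, op, d0, scalar, fiber0, F):
    """One output fiber of the SF3 pattern:
    fiber_out = sum_{j in D1} fiber1(j) op sum_{k in D0(j)} scalar(j,k)*fiber0(k).
    For MM/MV the fiber-fiber op degenerates to passing the inner sum through."""
    out = np.zeros(F)
    for j in d1:
        inner = np.zeros(F)
        for k in d0(j):
            inner += scalar(j, k) * fiber0(k)   # scalar-fiber products, reduced
        out += op(fiber1(j), inner)             # fiber-fiber product, reduced
    return out

# MTTKRP row i as an SF3 instance: Y[i,:] = sum_j B[j,:] * (sum_k A[i,j,k]*C[k,:])
A = np.random.rand(2, 3, 4)
B = np.random.rand(3, 5)
C = np.random.rand(4, 5)
i = 0
y = sf3(range(3), lambda j: B[j], np.multiply,
        lambda j: range(4), lambda j, k: A[i, j, k], lambda k: C[k], F=5)
assert np.allclose(y, np.einsum('jk,jf,kf->f', A[i], B, C))
```

Swapping `range(4)` for the set of k with A[i,j,k] ≠ 0 gives the sparse variant of the same pattern, which is what CISS feeds to the hardware.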
PE for SF3 Compute Pattern
    fiber_out = ⊕_{D1} ( fiber1 op ⊕_{D0} ( scalar · fiber0 ) )
[Figure: PE microarchitecture — a control unit feeds a vector multiplier (VMUL) with a scalar and either fiber0 or fiber1, followed by a vector adder (VADD) with a REPEAT path implementing the D0 and D1 reductions]
Different output fibers can be computed in parallel.
27
Vertical Scaling Using Coarse-Grained Parallelism
[Figure: r PEs (PE0 … PEr-1) fed from HBM, each computing an independent output fiber]
    fiber_out[0]   = ⊕_{D′1}  ( fiber1 op ⊕_{D′0}  ( scalar · fiber0 ) )
    fiber_out[1]   = ⊕_{D″1}  ( fiber1 op ⊕_{D″0}  ( scalar · fiber0 ) )
    …
    fiber_out[r-1] = ⊕_{D‴1}  ( fiber1 op ⊕_{D‴0}  ( scalar · fiber0 ) )
Long-vector SIMD compute
28
Horizontal Scaling Using SIMD-Vector Parallelism
[Figure: an r×c grid of PEs (PE00 … PErc) fed from HBM and connected by a systolic network]
29
Tensaurus Architecture
[Figure: the r×c PE array connected to HBM through a Tensor Load Unit (TLU), a Matrix Load Unit (MLU), and a Matrix Store Unit (MSU), with double-buffered scratchpads (SPM0 … SPM7) behind a crossbar]
MLU: Matrix Load Unit; TLU: Tensor Load Unit; MSU: Matrix Store Unit
30
Tensaurus Architecture
[Same figure as the previous slide]
An accelerator for both dense and sparse-dense computation!
31
Evaluation Methodology
▸ Cycle-level simulation in gem5
– 8 × 8 PE array, VLEN = 8
– 8 16KB RAMs per SPM
– HBM: 8 128-bit physical channels (128 GB/s peak bandwidth)
▸ RTL modeling of a PE using PyMTL
– 28 nm (Synopsys & Cadence tools)
▸ Baselines
– CPU: Intel(R) Xeon(R) CPU E7-8867
• Sparse BLAS and SPLATT
– GPU: Titan Xp
• cuSPARSE and ParTI
– Sparse NN accelerator: Cambricon-X [1]
▸ Datasets
– FROSTT tensors, Florida sparse matrices, AlexNet, VGG-16
[Figure: area and power breakdown of the design]
[1] Zhang, Shijin, et al., "Cambricon-X: An Accelerator for Sparse Neural Networks," Int'l Symp. on Microarchitecture (MICRO), 2016.
32
Cambricon-X Baseline
▸ Cambricon-X uses a CSR variant
– Pads short rows with explicit padding entries (x)
– Uses vector bit-masks to indicate non-zero positions
– Specialized for CNNs with low sparsity
▸ CSR results in a load-imbalanced schedule
– Cambricon-X has synchronization boundaries across rows
[Figure: a sparse matrix with nonzeros a00, a02, a11, a23, a31, a33 stored as padded values plus vector bitmasks; a two-PE timeline shows load imbalance caused by synchronization at row boundaries]
33
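The bitmask scheme described above is easy to illustrate in a few lines. This is a generic sketch of a bitmask sparse row, not the exact Cambricon-X storage layout (which additionally pads rows for its PE grouping); the function names are hypothetical.

```python
def encode_bitmask(row):
    """Bitmask sparse row: one bit per position marks a nonzero;
    only the nonzero values themselves are stored."""
    mask = [1 if v != 0 else 0 for v in row]
    vals = [v for v in row if v != 0]
    return mask, vals

def decode_bitmask(mask, vals):
    """Reconstruct the dense row by walking the mask and consuming values."""
    it = iter(vals)
    return [next(it) if m else 0 for m in mask]

row = [5.0, 0, -2.0, 0]
mask, vals = encode_bitmask(row)
assert mask == [1, 0, 1, 0] and vals == [5.0, -2.0]
assert decode_bitmask(mask, vals) == row
```

At CNN-level (moderate) sparsity the mask costs only one bit per element, but at the 10⁻⁶–10⁻⁸ densities seen in tensor decomposition the mask itself dominates storage, which is one reason this format does not scale across the whole sparsity spectrum.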
Results on Sparse Neural Nets
[Figure: two bar charts (log scale, CPU baseline) for AlexNet, VGG-16, and overall — speedup over CPU and energy efficiency relative to CPU — comparing GPU-cuSPARSE, Cambricon-X, and Tensaurus]
Overall, Tensaurus is 1.9x faster and 1.7x more energy-efficient than Cambricon-X, even on sparse neural nets.
34
Results on Sparse Tensor Decomposition
[Figure: two bar charts (log scale, CPU baseline) of speedup and performance/watt for MTTKRP across nine sparse tensors, comparing GPU-ParTI and Tensaurus]
Tensaurus is 22.9x and 3.1x faster, and 220x and 290x more energy-efficient, than the CPU and GPU respectively for MTTKRP.
35
Concluding Remarks
▸ Tensaurus: a versatile accelerator for mixed sparse-dense tensor computations
– First accelerator for sparse tensor decompositions (MTTKRP, TTMc)
– Versatile: NOT limited to tensor decompositions; also efficient for sparse-dense matrix computations
– Adaptable:
• Also accelerates dense kernels
• Easily adapts to the different levels of sparsity found in various domains
▸ Key approach: co-design of the sparse format and the architecture
▸ Key results:
– High bandwidth utilization (> 70% of peak bandwidth)
– High speedup and energy efficiency compared to CPU, GPU, and Cambricon-X
36