Case study: Sparse Matrix-Vector Multiplication
SpMVM: The Basics
Sparse Matrix-Vector Multiplication (SpMV)
Key ingredient in some matrix diagonalization algorithms: Lanczos, Davidson, Jacobi-Davidson

Store only the $N_{nz}$ nonzero elements of the matrix; the RHS and LHS vectors have $N_r$ (number of matrix rows) entries each

"Sparse": $N_{nz} \sim N_r$

Average number of nonzeros per row: $N_{nzr} = N_{nz}/N_r$

$$c = c + A \cdot b \qquad (c, b\text{: vectors of length } N_r;\ A\text{: sparse matrix})$$

General case: some indirect addressing required!
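Written out componentwise, each row update gathers RHS entries through the stored column indices, which is where the indirect access enters:

$$c_i \leftarrow c_i + \sum_{j:\ A_{ij} \neq 0} A_{ij}\, b_j, \qquad i = 1, \dots, N_r$$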
SpMVM characteristics
For large problems, SpMV is inevitably memory-bound: intra-socket saturation effect on modern multicores

SpMV is easily parallelizable in shared and distributed memory; the main concerns are load balancing and communication overhead

The data storage format is crucial for performance properties. The most useful general format on CPUs is Compressed Row Storage (CRS); the best choice depends on the compute architecture
CRS matrix storage scheme
[Figure: CRS storage scheme. The nonzeros are packed row by row into val[]; col_idx[] records the column index of each stored value; row_ptr[] marks where each row begins in val[].]
val[] stores all the nonzeros (length $N_{nz}$)

col_idx[] stores the column index of each nonzero (length $N_{nz}$)

row_ptr[] stores the starting index of each new row in val[] (length: $N_r + 1$)
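As a concrete illustration (a hypothetical 4×4 example, not from the slides), the CRS arrays for

$$A = \begin{pmatrix} 1 & 0 & 2 & 0 \\ 0 & 3 & 0 & 4 \\ 5 & 0 & 6 & 0 \\ 0 & 7 & 0 & 8 \end{pmatrix}$$

look like this in Fortran (1-based indexing):

```fortran
program crs_example
  implicit none
  ! CRS arrays for the 4x4 example matrix above (Nr = 4, Nnz = 8)
  real(8), dimension(8) :: val     = (/ 1d0, 2d0, 3d0, 4d0, 5d0, 6d0, 7d0, 8d0 /)
  integer, dimension(8) :: col_idx = (/ 1, 3, 2, 4, 1, 3, 2, 4 /)
  ! Nr+1 entries; row i occupies val(row_ptr(i) : row_ptr(i+1)-1)
  integer, dimension(5) :: row_ptr = (/ 1, 3, 5, 7, 9 /)
  print *, 'row 3 values:', val(row_ptr(3):row_ptr(4)-1)   ! prints 5.0 6.0
end program crs_example
```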
Case study: Sparse matrix-vector multiply
Strongly memory-bound for large data sets: streaming, with partially indirect access

Usually many SpMVMs are required to solve a problem

Now let's look at some performance measurements…
```fortran
!$OMP parallel do schedule(???)
do i = 1, Nr
  do j = row_ptr(i), row_ptr(i+1) - 1
    c(i) = c(i) + val(j) * b(col_idx(j))
  enddo
enddo
!$OMP end parallel do
```
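The schedule(???) is deliberately left open: with plain schedule(static), each thread gets one contiguous block of rows, so matrices with an uneven nonzero distribution cause load imbalance. One remedy, used in the load-balancing measurements later in this section, is a block-cyclic schedule:

```fortran
! Same kernel with a block-cyclic schedule: chunks of 2048 rows are dealt
! round-robin to the threads, spreading short and long rows more evenly
!$OMP parallel do schedule(static,2048)
do i = 1, Nr
  do j = row_ptr(i), row_ptr(i+1) - 1
    c(i) = c(i) + val(j) * b(col_idx(j))
  enddo
enddo
!$OMP end parallel do
```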
SpMVM: Performance Analysis
Performance characteristics
Strongly memory-bound for large data sets: performance saturates across the cores of the chip

Performance seems to depend on the matrix

Can we explain this?
Is there a "light speed" for SpMV?
Optimization?

[Figure: SpMV performance scaling on a 10-core Ivy Bridge, static scheduling]
SpMV node performance model
Sparse MVM in double precision with CRS data storage. Per inner-loop iteration (2 flops): 8 B matrix entry + 4 B column index + $8\alpha$ B RHS traffic + $20/N_{nzr}$ B for the LHS update and row pointer:

$$B_{DP,CRS} = \frac{8 + 4 + 8\alpha + 20/N_{nzr}}{2}\,\frac{B}{F} = \left(6 + 4\alpha + \frac{10}{N_{nzr}}\right)\frac{B}{F}$$

Absolute minimum code balance: $B_{min} = 6\,\frac{B}{F}$, i.e., $I_{max} = \frac{1}{6}\,\frac{F}{B}$

Hard upper limit for in-memory performance: $b_S/B_{min}$
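Plugging in the memory bandwidth of the benchmark chip used below ($b_S = 46.6$ GB/s) gives the matrix-independent ceiling quoted later:

$$P_{max} = \frac{b_S}{B_{min}} = \frac{46.6\ \mathrm{GB/s}}{6\ \mathrm{B/F}} \approx 7.8\ \mathrm{GF/s}$$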
The "α effect"

The DP CRS code balance parameter $\alpha$ quantifies the traffic for loading the RHS:

$\alpha = 0$: RHS is in cache
$\alpha = 1/N_{nzr}$: RHS loaded exactly once
$\alpha = 1$: no cache
$\alpha > 1$: Houston, we have a problem!

"Target" performance $= b_S/B_c$. Caveat: the maximum memory bandwidth may not be achieved with SpMVM (see later).

Can we predict $\alpha$? Not in general. Simple cases (banded, block-structured matrices) admit an analysis similar to the layer condition. Otherwise: determine $\alpha$ by measuring the actual memory traffic.
Determine α (RHS traffic quantification)
$V_{meas}$ is the measured overall memory data traffic (using, e.g., likwid-perfctr).

Solve for $\alpha$:

$$B_{DP,CRS}(\alpha) = \left(6 + 4\alpha + \frac{10}{N_{nzr}}\right)\frac{B}{F} = \frac{V_{meas}}{N_{nz} \cdot 2\,\mathrm{F}} \quad\Rightarrow\quad \alpha = \frac{1}{4}\left(\frac{V_{meas}}{N_{nz} \cdot 2\,\mathrm{B}} - 6 - \frac{10}{N_{nzr}}\right)$$

Example: kkt_power matrix from the UoF collection on one Intel SNB socket: $N_{nz} = 14.6 \times 10^6$, $N_{nzr} = 7.1$, $V_{meas} \approx 258$ MB $\Rightarrow$ $\alpha = 0.36$, $\alpha N_{nzr} = 2.5$

The RHS is loaded 2.5 times from memory, and

$$\frac{B_{DP,CRS}(\alpha)}{B_{DP,CRS}(1/N_{nzr})} = 1.11$$

$\Rightarrow$ 11% extra traffic; optimization potential!
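A minimal sketch of this bookkeeping (variable names are ours; the numbers are the kkt_power values from above):

```fortran
program alpha_from_traffic
  implicit none
  real(8) :: Nnz, Nnzr, Vmeas, B_meas, B_opt, alpha
  Nnz   = 14.6d6       ! number of nonzeros
  Nnzr  = 7.1d0        ! average nonzeros per row
  Vmeas = 258d6        ! measured memory traffic in bytes (e.g., from likwid-perfctr)
  ! measured code balance in B/F (2 flops per nonzero)
  B_meas = Vmeas / (2d0 * Nnz)
  ! solve B(alpha) = B_meas for alpha
  alpha  = 0.25d0 * (B_meas - 6d0 - 10d0 / Nnzr)
  ! best-case balance at alpha = 1/Nnzr (RHS loaded exactly once)
  B_opt  = 6d0 + 4d0 / Nnzr + 10d0 / Nnzr
  print '(A,F6.2)', 'alpha            =', alpha          ! ~ 0.36
  print '(A,F6.2)', 'RHS loads        =', alpha * Nnzr   ! ~ 2.5
  print '(A,F6.2)', 'traffic overhead =', B_meas / B_opt ! ~ 1.11
end program alpha_from_traffic
```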
Three different sparse matrices
| Matrix | $N_r$ | $N_{nzr}$ | $B_c^{opt}$ [B/F] | $P_{opt}$ [GF/s] |
|---|---|---|---|---|
| DLR1 | 278,502 | 143 | 6.1 | 7.64 |
| scai1 | 3,405,035 | 7.0 | 8.0 | 5.83 |
| kkt_power | 2,063,494 | 7.08 | 8.0 | 5.83 |

[Figure: sparsity patterns of DLR1, scai1, and kkt_power]

Benchmark system: Intel Xeon Ivy Bridge E5-2660v2, 2.2 GHz, $b_S = 46.6$ GB/s
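These $B_c^{opt}$ values are consistent with the code balance formula at the best case $\alpha = 1/N_{nzr}$ (RHS loaded exactly once), with $P_{opt} = b_S/B_c^{opt}$; e.g., for scai1: $B_c^{opt} = 6 + 4/7.0 + 10/7.0 = 8.0$ B/F and $P_{opt} = 46.6/8.0 = 5.83$ GF/s.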
Now back to the start…
$b_S = 46.6$ GB/s, $B_{min} = 6$ B/F $\Rightarrow$ maximum SpMVM performance: $P_{max} = 7.8$ GF/s

DLR1 attains the minimum CRS code balance (as expected).

scai1 measured balance: $B_{meas} \approx 8.5$ B/F $> B_c^{opt}$ $\Rightarrow$ good bandwidth utilization, slightly non-optimal $\alpha$.

kkt_power measured balance: $B_{meas} \approx 8.8$ B/F $> B_c^{opt}$ $\Rightarrow$ performance degraded by load imbalance; fix with a block-cyclic schedule.

[Figure: measured scaling; $b_S/B_c^{opt}$ marks the common upper limit for scai1 and kkt_power]
Investigating the load imbalance with kkt_power
[Figure: likwid-perfctr measurements (MEM_DP group) comparing schedule(static) with schedule(static,2048)]

Fewer overall instructions, (almost) bandwidth saturation, 50% better performance with load balancing.

CPI value unchanged!
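As a sketch, such a measurement could be launched like this (core list and binary name are placeholders):

```
likwid-perfctr -C S0:0-9 -g MEM_DP ./spmv_benchmark
```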
CPU performance comparison

Cascade Lake AP: $b_S = 500$ GB/s · NVIDIA V100: $b_S = 840$ GB/s · Fujitsu A64FX in Fugaku: $b_S = 859$ GB/s

Absolute upper limits (matrix independent) given by $b_S/(6\ \mathrm{B/F})$
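With $B_{min} = 6$ B/F these bandwidths translate directly into matrix-independent ceilings: $500/6 \approx 83$ GF/s (Cascade Lake AP), $840/6 = 140$ GF/s (V100), and $859/6 \approx 143$ GF/s (A64FX).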
C. L. Alappat et al., DOI: 10.1002/cpe.6512
SpMVM with multiple RHS & LHS Vectors
Multiple RHS vectors (SpMMV)
The unchanged matrix is applied to multiple RHS vectors, yielding multiple LHS vectors. Vector loop outermost:

```fortran
do s = 1, r
  do i = 1, Nr
    do j = row_ptr(i), row_ptr(i+1) - 1
      C(i,s) = C(i,s) + val(j) * B(col_idx(j),s)
    enddo
  enddo
enddo
```

$B_c$ unchanged, no reuse of matrix data.

Vector loop innermost:

```fortran
do i = 1, Nr
  do j = row_ptr(i), row_ptr(i+1) - 1
    do s = 1, r
      C(i,s) = C(i,s) + val(j) * B(col_idx(j),s)
    enddo
  enddo
enddo
```

Lower $B_c$ due to maximum reuse of matrix data.

Vector loop innermost with cache-line-friendly (row-major) vector storage:

```fortran
do i = 1, Nr
  do j = row_ptr(i), row_ptr(i+1) - 1
    do s = 1, r
      C(s,i) = C(s,i) + val(j) * B(s,col_idx(j))
    enddo
  enddo
enddo
```
SpMMV code balance

One complete inner ($s$) loop traversal of the row-major kernel above performs $2r$ flops and moves:

12 bytes of matrix data (value + index)
$16r/N_{nzr}$ bytes from the $r$ LHS updates
$4/N_{nzr}$ bytes from the row pointer
$8 r\,\alpha(r)$ bytes from the $r$ RHS reads

$$B_C(r) = \frac{1}{2r}\left(12 + 8 r\,\alpha(r) + \frac{16 r + 4}{N_{nzr}}\right)\frac{B}{F} = \left(\frac{6}{r} + 4\alpha(r) + \frac{8 + 2/r}{N_{nzr}}\right)\frac{B}{F}$$

OK, so what now???
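A quick numeric illustration (our arithmetic, assuming the best case $\alpha(r) = 1/N_{nzr}$ and $N_{nzr} = 7$, similar to scai1): $r = 1$ gives $B_C \approx 8.0$ B/F, while $r = 8$ gives $6/8 + 4/7 + 8.25/7 \approx 2.5$ B/F, i.e., more than $3\times$ less memory traffic per flop for the same matrix.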
SpMMV code balance
Let's check some limits to see if this makes sense!

$$B_C(r) = \left(\frac{6}{r} + 4\alpha(r) + \frac{8 + 2/r}{N_{nzr}}\right)\frac{B}{F}$$

$r = 1$: $B_C(1) = \left(6 + 4\alpha + \frac{10}{N_{nzr}}\right)$ B/F, the CRS SpMV balance from before. Reassuring!

$r \to \infty$: the $\frac{6}{r}$ B/F matrix-data term vanishes, leaving $\left(4\alpha(r) + \frac{8}{N_{nzr}}\right)$ B/F. This can become very small for large $N_{nzr}$ $\Rightarrow$ decoupling from memory bandwidth is possible!

M. Kreutzer et al.: Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems. Proc. IPDPS 2015, DOI: 10.1109/IPDPS.2015.76
Roofline analysis for spMVM
Conclusions from the Roofline analysis:

The Roofline model does not "work" out of the box for SpMVM because of the uncertainty in the RHS traffic.

We have "turned the model around" and measured the actual memory traffic to determine the RHS overhead. The result indicates:

1. how much actual traffic the RHS generates
2. how efficient the RHS access is (compare the achieved bandwidth with the maximum bandwidth)
3. how much optimization potential there is in matrix reordering

Do not forget about load balancing!

Sparse matrix times multiple vectors bears the potential of huge savings in data volume.

Consequence: Modeling is not always 100% predictive. It's all about learning more about performance properties!