GPU accelerated Arnoldi solver for small batched matrix
15. 09. 22
Hyung-Jin Kim
Samsung Advanced Institute of Technology
Contents
- Eigenvalue problems
- Solution
- Arnoldi Algorithm
- Target
- CUDA optimization
Eigenvalue Problem
A x = \lambda x

A \equiv \begin{pmatrix} a_{00} & \cdots & a_{0n} \\ \vdots & \ddots & \vdots \\ a_{n0} & \cdots & a_{nn} \end{pmatrix} : n x n complex-valued matrix
x \equiv \begin{pmatrix} x_0 \\ \vdots \\ x_n \end{pmatrix} : vector which satisfies the linear equation above
\lambda : scalar value which satisfies the linear equation above
Solution of Schrödinger Equation
𝐻ψ𝐸 = 𝐸ψ𝐸
Vibration analysis
m\ddot{x} + kx = 0
Principal Component Analysis
Eigenfaces
…
How to solve eigenvalue problems?
- Find the roots of the characteristic equation:
  \det(A - \lambda I) = 0
- For the 2x2 case (a worked example follows the list below),
  A \equiv \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \quad I \equiv \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
  \det\begin{pmatrix} a-\lambda & b \\ c & d-\lambda \end{pmatrix} = (a-\lambda)(d-\lambda) - bc = 0
  \quad\Rightarrow\quad \lambda = \frac{(a+d) \pm \sqrt{(a-d)^2 + 4bc}}{2}
- There is no general closed-form solution for the n roots of an n-th order polynomial (for n ≥ 5, Abel-Ruffini theorem)
- Iterative methods are used instead: power method, QR, Arnoldi, Lanczos, …
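As a quick sanity check of the 2x2 formula above, here is a worked example with arbitrarily chosen entries (the numbers are illustrative, not from the talk):

```latex
A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}
\quad\Rightarrow\quad
\lambda = \frac{(2+2) \pm \sqrt{(2-2)^2 + 4 \cdot 1 \cdot 1}}{2} = \frac{4 \pm 2}{2} = 3,\ 1
```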
Arnoldi Algorithm
v_1 = v / |v|
for k = 1 to m-1 do
    v_{k+1} = A v_k
    for j = 1 to k do                       /* Gram-Schmidt orthogonalization */
        h_{jk} = v_j^{\dagger} * v_{k+1}
        v_{k+1} = v_{k+1} - v_j h_{jk}
    end for
    h_{k+1,k} = |v_{k+1}|
    if h_{k+1,k} = 0 then
        return {v_1, …, v_k}                /* {v_1, …, v_k} is invariant under A */
    end if
    v_{k+1} = v_{k+1} / h_{k+1,k}
end for
- The Arnoldi algorithm only produces a k-orthonormal basis and the Hessenberg matrix H (see the C++ sketch below)
- If the matrix A is Hermitian, the iteration takes a simpler form: the Lanczos algorithm
H \equiv \begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ 0 & a_{21} & a_{22} & a_{23} \\ 0 & 0 & a_{32} & a_{33} \end{pmatrix}
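To make the pseudocode above concrete, here is a minimal single-matrix Arnoldi sketch in plain C++; the function and variable names are illustrative, and the batched GPU solver in the talk replaces these loops with cuBLAS calls:

```cpp
// Minimal single-matrix Arnoldi sketch (plain C++, CPU) mirroring the pseudocode above.
// Builds up to m+1 orthonormal basis vectors V and an (m+1) x m upper-Hessenberg H.
#include <cmath>
#include <complex>
#include <vector>

using cplx = std::complex<double>;

void arnoldi(const std::vector<cplx>& A, int n,   // A: n x n matrix, row-major
             std::vector<cplx>& V,                // out: (m+1) basis vectors, each of length n
             std::vector<cplx>& H, int m)         // out: (m+1) x m Hessenberg matrix, row-major
{
    V.assign((size_t)(m + 1) * n, cplx(0.0));
    H.assign((size_t)(m + 1) * m, cplx(0.0));
    V[0] = cplx(1.0);                             // v_1 = v / |v| (start from e_0 here)

    for (int k = 0; k < m; ++k) {
        const cplx* vk = &V[(size_t)k * n];
        cplx* vk1      = &V[(size_t)(k + 1) * n];

        // v_{k+1} = A v_k
        for (int i = 0; i < n; ++i) {
            cplx s(0.0);
            for (int j = 0; j < n; ++j) s += A[(size_t)i * n + j] * vk[j];
            vk1[i] = s;
        }

        // Gram-Schmidt: h_{jk} = v_j^dagger v_{k+1};  v_{k+1} -= v_j h_{jk}
        for (int j = 0; j <= k; ++j) {
            const cplx* vj = &V[(size_t)j * n];
            cplx h(0.0);
            for (int i = 0; i < n; ++i) h += std::conj(vj[i]) * vk1[i];
            H[(size_t)j * m + k] = h;
            for (int i = 0; i < n; ++i) vk1[i] -= h * vj[i];
        }

        // h_{k+1,k} = |v_{k+1}|; stop if the Krylov space is invariant under A
        double nrm = 0.0;
        for (int i = 0; i < n; ++i) nrm += std::norm(vk1[i]);
        nrm = std::sqrt(nrm);
        H[(size_t)(k + 1) * m + k] = nrm;
        if (nrm == 0.0) return;                   // {v_1, ..., v_{k+1}} is invariant
        for (int i = 0; i < n; ++i) vk1[i] /= nrm;
    }
}
```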
Further work…(1)
- QR algorithm (iterated QR factorization) : finds an upper triangular matrix T and a unitary matrix U s.t. A = U T U*
Set A_0 = A and U_0 = I.
for k = 1, 2, ... do
    Q_k R_k = A_{k-1}       /* QR factorization */
    A_k = R_k Q_k
    U_k = U_{k-1} Q_k       /* Update transformation matrix */
end for
Set T = A_∞ and U = U_∞.
- In general, QR factorization of a dense matrix is an O(n^3) algorithm → computationally expensive!
- QR (Givens) rotation : define a rotation transform G(i, j, \theta) acting on rows/columns i and j,
  G(i, j, \theta) = \begin{pmatrix}
  1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\
  \vdots & & \vdots & & \vdots & & \vdots \\
  0 & \cdots & c & \cdots & s & \cdots & 0 \\
  \vdots & & \vdots & & \vdots & & \vdots \\
  0 & \cdots & -s & \cdots & c & \cdots & 0 \\
  \vdots & & \vdots & & \vdots & & \vdots \\
  0 & \cdots & 0 & \cdots & 0 & \cdots & 1
  \end{pmatrix}
  with
  c = \frac{x_i}{\sqrt{x_i^2 + x_j^2}}, \quad s = \frac{-x_j}{\sqrt{x_i^2 + x_j^2}}
- Applied to the upper Hessenberg matrix, each rotation eliminates one subdiagonal element:
  H \equiv \begin{pmatrix} \times & \times & \times & \times \\ \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & 0 & \times & \times \end{pmatrix}
  \xrightarrow{G(0,1,\theta_0)}
  \begin{pmatrix} \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & 0 & \times & \times \end{pmatrix}
  \xrightarrow{G(1,2,\theta_1)}
  \begin{pmatrix} \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & 0 & \times & \times \\ 0 & 0 & \times & \times \end{pmatrix}
  \xrightarrow{G(\ldots,\ldots,\theta_{\ldots})}
  \begin{pmatrix} \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & 0 & \times & \times \\ 0 & 0 & 0 & \times \end{pmatrix} = R
- This is an O(n^2) algorithm → computationally cheap! (see the sketch below)
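As a sketch of why one QR sweep over the Hessenberg matrix stays O(n^2), here is a minimal real-valued, single-matrix Givens reduction using the c and s defined above; names are illustrative, and the talk's solver works on complex batched data instead:

```cpp
// Sketch: one sweep of Givens rotations reduces a (real, for brevity) upper-Hessenberg
// matrix to the triangular factor R. Each G(i, i+1, theta) only touches two rows, so the
// whole sweep costs O(n^2).
#include <cmath>
#include <vector>

void hessenberg_qr_givens(std::vector<double>& H, int n)   // H: n x n, row-major, upper Hessenberg
{
    for (int i = 0; i + 1 < n; ++i) {
        double xi = H[(size_t)i * n + i];          // pivot element
        double xj = H[(size_t)(i + 1) * n + i];    // subdiagonal element to eliminate
        double r  = std::hypot(xi, xj);
        if (r == 0.0) continue;
        double c = xi / r;                         // c =  x_i / sqrt(x_i^2 + x_j^2)
        double s = -xj / r;                        // s = -x_j / sqrt(x_i^2 + x_j^2)

        // Apply G(i, i+1, theta) from the left: only rows i and i+1 change
        for (int k = i; k < n; ++k) {
            double a = H[(size_t)i * n + k];
            double b = H[(size_t)(i + 1) * n + k];
            H[(size_t)i * n + k]       = c * a - s * b;
            H[(size_t)(i + 1) * n + k] = s * a + c * b;   // becomes 0 for k == i
        }
    }
    // H now holds R; accumulating the rotations gives the unitary factor Q.
}
```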
Further work…(2)
- Diagonal elements of T are eigenvalues of A : Done!
(Eigenvalues are preserved under similarity transform)
- From (H - \lambda_i I) s_i = 0, we can calculate the n unknown elements of the i-th eigenvector of the Hessenberg matrix
- The eigenvectors of A can then be derived from x_i = V s_i, where V \equiv (v_1 \; v_2 \; \cdots \; v_n) collects the Arnoldi basis vectors
- Overall, most of the computations are O(n^2) algorithms except the x_i computation (sketched with cuBLAS below)
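Recovering all eigenvectors at once is a dense matrix product X = V S, which is the O(n^3) step mentioned above. A hedged sketch of that step with cuBLAS (column-major layout; pointer and function names are illustrative, not the talk's actual solver code):

```cpp
// Hedged sketch of the X = V S step with cuBLAS (double complex).
// d_V holds the Arnoldi basis vectors as columns, d_S the Hessenberg eigenvectors s_i as
// columns, and d_X receives the eigenvectors x_i of A.
#include <cuComplex.h>
#include <cublas_v2.h>

void eigvecs_from_hessenberg(cublasHandle_t handle,
                             const cuDoubleComplex* d_V,   // n x n, device memory
                             const cuDoubleComplex* d_S,   // n x n, device memory
                             cuDoubleComplex* d_X,         // n x n, device memory
                             int n)
{
    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);

    // X = 1.0 * V * S + 0.0 * X  (the single O(n^3) step)
    cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &one, d_V, n, d_S, n, &zero, d_X, n);
}
```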
Implementation strategy
- Matrix size is less than 1000x1000
- Complex, non-symmetric matrices
- Parallel processing of more than ~100 matrices → simultaneous kernel runs
- The matrices are laid out sequentially in memory → batched data
- The cuBLAS library (ver. ≥ 7.0) is good enough for most of the BLAS computations
- The structural sub-routine calls of the Arnoldi and QR algorithms are issued from the CPU side
- "cudaStream" is a good solution for parallel (or simultaneous) kernel launching
[Diagram] Current solver: a single GPU kernel walks through the batched input matrices A_k one after another and writes the corresponding diagonalized outputs D_k.
[Diagram] cudaStream threaded solver: Kernel 0, Kernel 1, Kernel 2, Kernel 3, … each run in their own stream, so several matrices of the batch are processed concurrently.
Implementation Detail
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// kernels launched into different streams may execute concurrently
kernel1<<<grid, block, 0, stream1>>>(data_1);
kernel2<<<grid, block, 0, stream2>>>(data_2);
※ Nsight profiler view: 8 concurrent kernel launches
- cuBLAS also supports "cudaStream" through its own API:
  cudaStreamCreate() and cublasSetStream() (see the sketch below)
- For <t>gemm(), <t>trsm() operations, batch mode is natively supported
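A hedged sketch of the stream-per-matrix pattern described above, issuing one cuBLAS GEMV per small matrix into its own stream; the matrix count, layout, and names are illustrative, not the talk's production code:

```cpp
// Each small matrix in the batch gets its own stream, and a cuBLAS GEMV is issued into
// that stream so the calls can overlap on the GPU.
#include <cuComplex.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

void batched_zgemv_streams(cublasHandle_t handle, int n, int num_matrices,
                           const cuDoubleComplex* d_A,   // num_matrices contiguous n x n matrices
                           const cuDoubleComplex* d_x,   // num_matrices contiguous length-n vectors
                           cuDoubleComplex* d_y)         // num_matrices contiguous length-n results
{
    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);

    std::vector<cudaStream_t> streams(num_matrices);
    for (int k = 0; k < num_matrices; ++k) {
        cudaStreamCreate(&streams[k]);
        cublasSetStream(handle, streams[k]);      // subsequent cuBLAS calls go into this stream
        cublasZgemv(handle, CUBLAS_OP_N, n, n, &one,
                    d_A + (size_t)k * n * n, n,
                    d_x + (size_t)k * n, 1, &zero,
                    d_y + (size_t)k * n, 1);
    }

    cudaDeviceSynchronize();                      // wait for every stream to finish
    for (int k = 0; k < num_matrices; ++k) cudaStreamDestroy(streams[k]);
}
```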
Performance(1)
- Dgemm operation
※Tested on Xeon E5-2680v2, K80 GPUs
            MKL (GFLOPS)   Single kernel (GFLOPS)   10 stream kernels (GFLOPS)   10/Single ratio
128x128          65                  42                        192                     4.5
256x256         115                 262                        532                     2.0
- Dgemv operation
            MKL (GFLOPS)   Single kernel (GFLOPS)   10 stream kernels (GFLOPS)   10/Single ratio
128x128         1.10                1.14                       6.35                    5.5
256x256         2.58                4.00                       11.8                    3.0
- Dgemv performance for different numbers of concurrent kernels

            Single kernel (GFLOPS)   10 kernels (GFLOPS)   100 kernels (GFLOPS)   1000 kernels (GFLOPS)
128x128             1.14                    6.35                   26.8                   33.8
256x256             4.00                    11.8                   32.5                   34.9
Performance(2)
- The full eigenvalue evaluation sequence is still under development
- Tested the Arnoldi iterations only, solving 10 matrices (preliminary!)
- Intel MKL and MAGMA library results are given for comparison
※Tested on Xeon E5-2680v2, K80 GPUs
            MKL (sec)   MAGMA (sec)   Optimized solver (sec)
256x256        1.0          1.1               0.37
512x512        5.3          5.1               2.1
Potential problems?
- The maximum number of concurrent cudaStreams is unknown (help!)
- Fermi GPUs: 16, Kepler GPUs: 32 concurrent kernel launches
- Maxwell GPUs: unknown (help!)
→ If the matrix size is too small, GPU utilization could still be low
⇒ Would the "per-thread" default-stream option be useful?
- The 4 shaded elements (in the figure) are not contiguously aligned in GPU memory
- For the QR rotation we do not need to access the full Hessenberg matrix
- A custom data structure for the Hessenberg matrix might be useful! (one assumed packing is sketched below)
  \begin{pmatrix} \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & 0 & \times & \times \end{pmatrix}
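One possible direction for such a custom structure, shown purely as an assumed sketch (not the layout used in the talk's solver): pack each Hessenberg column's nonzero entries contiguously, so the entries touched by a Givens rotation on a column sit next to each other in memory.

```cpp
// Assumed packed Hessenberg layout: column j stores only its potentially nonzero entries
// (rows 0 .. j+1) back to back.
#include <vector>

struct PackedHessenberg {
    int n;
    std::vector<double> data;        // packed columns, one after another
    std::vector<size_t> col_start;   // offset of each packed column in `data`

    explicit PackedHessenberg(int n_) : n(n_), col_start((size_t)n_ + 1, 0) {
        for (int j = 0; j < n; ++j) {
            int len = (j + 2 < n) ? (j + 2) : n;          // rows 0 .. min(j+1, n-1)
            col_start[(size_t)j + 1] = col_start[j] + (size_t)len;
        }
        data.assign(col_start[n], 0.0);
    }

    // Entry (i, j); entries below the first subdiagonal are structurally zero and not stored.
    double& at(int i, int j) { return data[col_start[j] + (size_t)i]; }
};
```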
- Eigenvectors are calculated from (H - \lambda_i I) s_i = 0
- The eigenvector elements are calculated sequentially, by backward computation (a minimal back-substitution sketch follows below):
  \begin{pmatrix} \times & \times & \times & \times \\ \times & \times & \times & \times \\ 0 & \times & \times & \times \\ 0 & 0 & \times & \times \end{pmatrix}
  \begin{pmatrix} s_{i0} \\ s_{i1} \\ s_{i2} \\ s_{i3} \end{pmatrix} = 0
  \;\rightarrow\; \times_{43}\, s_{i2} = -\times_{44}\, s_{i3}, \ldots
- Concurrent eigenvector computation is also needed
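A minimal sketch of that backward computation for a single eigenvalue (real-valued and scalar for brevity; names are illustrative, and a production version would guard against tiny subdiagonal pivots):

```cpp
// Solve (H - lambda I) s = 0 for an unreduced upper-Hessenberg H by fixing the last
// component and back-substituting through the subdiagonal.
#include <vector>

std::vector<double> hessenberg_eigvec(const std::vector<double>& H, int n, double lambda)
{
    std::vector<double> B(H);                              // B = H - lambda * I, row-major
    for (int i = 0; i < n; ++i) B[(size_t)i * n + i] -= lambda;

    std::vector<double> s(n, 0.0);
    s[n - 1] = 1.0;                                        // the overall scale is free

    // Row k of B s = 0 (k = n-1 .. 1): B[k][k-1] s[k-1] + sum_{j >= k} B[k][j] s[j] = 0
    for (int k = n - 1; k >= 1; --k) {
        double acc = 0.0;
        for (int j = k; j < n; ++j) acc += B[(size_t)k * n + j] * s[j];
        s[k - 1] = -acc / B[(size_t)k * n + (k - 1)];      // needs h_{k,k-1} != 0
    }
    return s;                                              // row 0 is the redundant equation
}
```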
Thank you!

References
- Arnoldi Algorithm : http://people.inf.ethz.ch/arbenz/ewp/Lnotes/lsevp2010.pdf
- Kernel Streaming : CUDA C Programming Guide, CUDA SDK (concurrentKernels)
- CUBLAS : CUBLAS Library documentation, CUDA SDK (batchCUBLAS)