Extreme-Scale Parallel Symmetric Eigensolver for Very Small-Size
Matrices Using a Communication-Avoiding Algorithm for Pivot Vectors
Takahiro Katagiri (Information Technology Center, The University of Tokyo)
Jun'ichi Iwata and Kazuyuki Uchida (Department of Applied Physics, School of Engineering, The University of Tokyo)
Thursday, February 20, Room: Salon A, 10:35-10:55
MS34 Auto-tuning Technologies for Extreme-Scale Solvers - Part I of III
SIAM PP14, Feb. 18-21, 2014, Marriott Portland Downtown Waterfront, Portland, OR, USA
Outline
• Target Application: RSDFT
• Parallel Algorithm of Symmetric Eigensolver for Small Matrices
• Performance Evaluation with 76,800 cores of the Fujitsu FX10
• Conclusion
RSDFT (Real Space Density Functional Theory)
The Kohn-Sham equation is solved as a finite-difference equation:

\[
\left[ -\frac{1}{2}\nabla^{2} + v_{\mathrm{ion}}(\mathbf{r}) + \int \frac{\rho(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}\, d\mathbf{r}' + v_{\mathrm{XC}}(\mathbf{r}) \right] \psi_{j}(\mathbf{r}) = \varepsilon_{j}\, \psi_{j}(\mathbf{r})
\]
J.-I. Iwata et al., J. Comput. Phys. 229, 2339 (2010).
Structural properties of Si crystal
[Figure: a 10,648-atom cell of Si crystal and its electron density; a plot of Energy/atom (eV) vs. Volume/atom for the 10,648-atom and 21,952-atom cells.]
Requirements of Mathematical Software from RSDFT
• An FFT-free algorithm.
• Computation of all eigenvalues and eigenvectors of a dense real symmetric matrix.
  – A standard eigenproblem.
  – Executed O(100) times in the SCF (Self-Consistent Field) process.
• Re-orthogonalization of the eigenvectors.
• Due to computational complexity, the eigensolver and orthogonalization parts become a bottleneck.
  – These parts require O(N^3) computations, while the others require O(N^2) computations.
• The matrix and eigenvalues are distributed to obtain parallelism in the parts other than the eigensolver.
  – It is therefore difficult to hold the whole data on one node, even though it is small.
Requirements of Mathematical Software from RSDFT (Cont’d)
• Parts of the application other than the eigensolver are also time-consuming.
Source: Y. Hasegawa et al.: First-principles calculations of electron states of a silicon nanowire with 100,000 atoms on the K computer, SC11 (2011).
RSDFT Processes Breakdown

  Process                      Cost relative to whole time [%]   Order
  SCF                          99.6                              O(N^3)
    SD                         47.2                              O(N^3)
      Subspace Diag.           44.2                              O(N^3)
        MatE                   10.0                              O(N^3), DGEMM
        Eigensolve             19.6                              O(N^3)
        Rot V                  14.6                              O(N^3)
    CG (Conjugate Gradient)    26.0                              O(N^2)
    GS (Gram-Schmidt Orth.)    25.8                              O(N^3), DGEMM
  Others                        0.6                              -
The Eigensolve and GS parts will be the bottleneck in large-scale computation,
but the other processes also need to be considered.
• Required memory space also needs to be considered.
  – Due to the API of the numerical library (e.g., re-distribution of data), the actual problem size is limited to small sizes by the remaining memory space.
Our Assumption
• Target: the eigensolver part in RSDFT.
• Exa-scale computing: the total number of nodes is on the order of 1,000,000 (a million).
• Since the matrix is two-dimensional (2D), the matrix size required on exa-scale computers reaches the order of 10,000 * sqrt(1,000,000) = 10,000,000 (ten million), if each node holds a matrix of N=10,000.
• Since most dense solvers have O(N^3) computational complexity, the execution time for a matrix of N=10,000,000 (ten million) is unrealistic in actual applications (in the production-run phase).
Our Assumption (Cont'd)
• We presume that N=1,000 per node is the maximum size; the exa-scale size is then on the order of N=1,000,000 (a million).
• The memory used for the matrix per node is then only on the order of 8 MB.
  – This is for the eigensolver part only.
• This is just the cache size of current CPUs.
  – Next-generation CPUs may have caches on the order of 100 MB, such as the IBM POWER8 with eDRAM (3D stacked memory) for its L4 cache.
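As a quick sanity check of these sizes (my arithmetic, not from the slides), the per-node memory and the exa-scale matrix order follow directly:

```latex
% Per-node matrix of N = 1,000 in double precision:
%   1,000 x 1,000 elements x 8 bytes/element = 8 x 10^6 bytes = 8 MB.
% Exa-scale matrix order on ~1,000,000 nodes with a 2D distribution:
%   N_exa ~ 1,000 x sqrt(1,000,000) = 1,000,000.
\[
1{,}000 \times 1{,}000 \times 8\ \text{bytes} = 8 \times 10^{6}\ \text{bytes} = 8\ \text{MB},
\qquad
N_{\mathrm{exa}} \approx 1{,}000 \times \sqrt{1{,}000{,}000} = 1{,}000{,}000 .
\]
```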
Originalities of Our Eigensolver
1. Non-blocking computation algorithm
   – since the data stays in cache under our exa-scale assumption.
2. Communication-reducing and communication-avoiding algorithm
   – for the tridiagonalization and Householder inverse transformation of the symmetric eigensolver,
   – by duplicating Householder vectors.
3. Hybrid MPI-OpenMP execution
   – with a full system of a peta-scale supercomputer (the Fujitsu FX10) consisting of 4,800 nodes (76,800 cores).
Outline
• Target Application: RSDFT
• Parallel Algorithm of Symmetric Eigensolver for Small Matrices
• Performance Evaluation with 76,800 cores of the Fujitsu FX10
• Conclusion
A Classical Householder Algorithm (Standard Eigenproblem Ax = λx)

Given a symmetric dense matrix A:
1. Householder transformation (tridiagonalization): T = Q^T A Q, with Q = H_1 H_2 ... H_{n-2}. Cost: O(n^3).
2. Bisection on the tridiagonal matrix T: all eigenvalues Λ. Cost: O(n^2).
3. Inverse iteration on T: all eigenvectors Y. Cost: O(n^2) ~ O(n^3) (MRRR: O(n^2)).
4. Householder inverse transformation back to the dense matrix A: all eigenvectors X = QY. Cost: O(n^3).
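To make step 1 concrete, here is a minimal serial sketch of unblocked Householder tridiagonalization (the textbook algorithm; the routine name, storage layout, and all identifiers are illustrative assumptions, not the authors' parallel code):

```c
/* Minimal serial sketch of unblocked Householder tridiagonalization
 * (step 1 above); illustration only, not the authors' parallel code.
 * A is n x n symmetric, row-major. On exit: d[0..n-1] is the diagonal
 * of T, e[0..n-2] the subdiagonal, and the Householder vectors u_k
 * stay stored below the subdiagonal of A (needed later for X = QY). */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

void tridiagonalize(double *A, int n, double *d, double *e)
{
    double *p = malloc(n * sizeof *p), *w = malloc(n * sizeof *w);
    for (int k = 0; k < n - 2; k++) {
        /* Householder vector u annihilating column k below row k+1. */
        double nrm2 = 0.0;
        for (int i = k + 1; i < n; i++) nrm2 += A[i*n+k] * A[i*n+k];
        double alpha = -copysign(sqrt(nrm2), A[(k+1)*n+k]);
        e[k] = alpha;                        /* new subdiagonal entry   */
        if (alpha == 0.0) continue;          /* column already reduced  */
        double u0 = A[(k+1)*n+k] - alpha;
        A[(k+1)*n+k] = u0;                   /* u now lives in column k */
        double beta = -1.0 / (alpha * u0);   /* H_k = I - beta u u^T    */

        /* p = beta * A(k+1:,k+1:) u,  w = p - (beta/2)(u^T p) u       */
        double utp = 0.0;
        for (int i = k + 1; i < n; i++) {
            double s = 0.0;
            for (int j = k + 1; j < n; j++) s += A[i*n+j] * A[j*n+k];
            p[i] = beta * s;
            utp += A[i*n+k] * p[i];
        }
        for (int i = k + 1; i < n; i++)
            w[i] = p[i] - 0.5 * beta * utp * A[i*n+k];

        /* Symmetric rank-2 update: A := A - u w^T - w u^T (trailing). */
        for (int i = k + 1; i < n; i++)
            for (int j = k + 1; j < n; j++)
                A[i*n+j] -= A[i*n+k] * w[j] + w[i] * A[j*n+k];
    }
    for (int i = 0; i < n; i++) d[i] = A[i*n+i];
    e[n-2] = A[(n-1)*n + (n-2)];
    free(p); free(w);
}

int main(void)   /* tiny smoke test */
{
    double A[16] = { 4,1,2,2,  1,4,1,2,  2,1,4,1,  2,2,1,4 };
    double d[4], e[3];
    tridiagonalize(A, 4, d, e);
    for (int i = 0; i < 4; i++) printf("d[%d]=%g\n", i, d[i]);
    for (int i = 0; i < 3; i++) printf("e[%d]=%g\n", i, e[i]);
    return 0;
}
```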
Whole Parallel Processes of the Eigensolver
• The matrix A is stored in a 2D cyclic-cyclic distribution.
• Tridiagonalization: A → T.
• Gather all elements of T onto every process.
• Bisection: compute upper and lower limits for the eigenvalues; all eigenvalues Λ, in rising order.
• Inverse iteration: compute eigenvectors, corresponding to the rising order of the eigenvalues.
• Householder inverse transformation: compute the eigenvectors from Y.
• Gather all eigenvalues Λ.
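For reference, an element-wise 2D cyclic(-cyclic) distribution can be expressed by a small index mapping; this sketch (0-based convention, my assumption, since the slides do not spell it out) shows which process owns global entry (i, j) and at which local index:

```c
/* Sketch of a 2D cyclic-cyclic (element-wise cyclic in both dimensions)
 * distribution over a p x q process grid. 0-based indices throughout. */
typedef struct { int prow, pcol, li, lj; } Owner;

static Owner owner_of(int i, int j, int p, int q)
{
    Owner o;
    o.prow = i % p;  /* grid row that owns global row i       */
    o.pcol = j % q;  /* grid column that owns global column j */
    o.li   = i / p;  /* local row index on the owner          */
    o.lj   = j / q;  /* local column index on the owner       */
    return o;
}
```

For example, with p = q = 2, global entry (5, 3) lands on process (1, 1) at local index (2, 1).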
Data Duplication in Tridiagonalization
[Figure: the matrix A on a p x q process grid, with the vectors u_k and x_k distributed over it; the Householder vector u_k is duplicated across the grid, and the vector y_k is duplicated likewise.]
Transposed y_k in Tridiagonalization (the case of p < q)
[Figure: a p = 2 by q = 4 process grid; the transposed y_k is formed by a multi-casting MPI_ALLREDUCE rooted at designated processes, with a rectangular processor grid [Katagiri and Itoh, 2010]. The duplication of y_k enables communication avoiding.]
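One way to realize this multi-cast duplication of y_k is with sub-communicators; below is a minimal, hypothetical sketch (the rank-to-grid mapping and all names are assumptions, not the paper's code):

```c
/* Hypothetical sketch: carve a p x q grid out of MPI_COMM_WORLD and
 * duplicate y_k along each grid column with an allreduce, so every
 * process in a column ends up holding the summed vector. */
#include <mpi.h>

void duplicate_yk_in_column(const double *yk_partial, double *yk_dup,
                            int len, int q /* grid columns */)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int myrow = rank / q;   /* row-major rank-to-grid mapping: assumption */
    int mycol = rank % q;

    MPI_Comm col_comm;      /* all processes sharing grid column mycol */
    MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, &col_comm);

    /* Sum the partial contributions of y_k held along the column; the
     * result lands, duplicated, on every member of the column. */
    MPI_Allreduce(yk_partial, yk_dup, len, MPI_DOUBLE, MPI_SUM, col_comm);

    MPI_Comm_free(&col_comm);
}
```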
Parallel Householder Inverse Transformation

<1> do k = n-2, 1, -1
<2>   Gather the vector u_k and its scalar by using multiple MPI_BCASTs.
<3>   do i = nstart, nend
<4>     alpha_i^(k) = u_k^T A_(k:n, i)
<5>     A_(k:n, i) = A_(k:n, i) - alpha_i^(k) u_k
<6>   enddo
<7> enddo

Here u_k is the Householder vector.
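A hedged C/MPI sketch of lines <1>-<7> follows. The real code gathers u_k with the two-stage multi-cast MPI_BCASTs shown on the next slides; this sketch uses a single broadcast per step, and `beta`, `root_of`, and the column layout are illustrative assumptions:

```c
/* Sketch of the parallel Householder inverse transformation loop.
 * Each process holds eigenvector columns i = nstart..nend as full
 * columns of length n (leading dimension ld). u_k is broadcast from
 * an owner process; root_of(k) is a placeholder for that mapping. */
#include <mpi.h>

void householder_inverse_transform(double *Y, int ld, int n,
                                   int nstart, int nend,
                                   double *uk,          /* buffer, length n */
                                   const double *beta,  /* scalars beta_k   */
                                   MPI_Comm comm, int (*root_of)(int))
{
    for (int k = n - 2; k >= 1; k--) {                       /* <1> */
        /* <2> gather u_k (entries k..n-1) onto every process. */
        MPI_Bcast(uk + k, n - k, MPI_DOUBLE, root_of(k), comm);

        for (int i = nstart; i <= nend; i++) {               /* <3> */
            double *col = Y + (long)(i - nstart) * ld;
            double alpha = 0.0;                              /* <4> */
            for (int r = k; r < n; r++) alpha += uk[r] * col[r];
            alpha *= beta[k];      /* fold in beta_k of H_k = I - beta u u^T */
            for (int r = k; r < n; r++) col[r] -= alpha * uk[r];  /* <5> */
        }                                                    /* <6> */
    }                                                        /* <7> */
}
```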
Gathering Vector u_k for the Inverse Transformation: Non-Packing Messages for Gathering u_k
[Figure: a p = 2 by q = 4 process grid; u_k is gathered by two multi-casting MPI_BCASTs (① and ②), and the duplication of u_k enables communication avoiding.]
Gathering Vector u_k for the Inverse Transformation: Packing Messages for Gathering u_k
[Figure: a p = 2 by q = 4 process grid; u_k and u_{k+1} are packed and sent by one communication through the two multi-casting MPI_BCASTs (① and ②). Packing the messages yields communication avoiding and reducing: sending the two vectors in one communication is "communication blocking", here with communication blocking length = 2.]
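The packing idea in a minimal form (buffer layout and names are assumptions): two pivot vectors travel in one MPI_BCAST, so the per-message latency is paid once instead of twice.

```c
/* Sketch: communication blocking with length 2. u_k and u_(k+1) are
 * packed into one buffer on the root and sent by a single MPI_Bcast;
 * receivers unpack. Layout and names are illustrative. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void bcast_two_pivots(double *uk, double *uk1, int len,
                      int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    double *buf = malloc(2 * (size_t)len * sizeof *buf);
    if (rank == root) {                 /* pack before sending       */
        memcpy(buf,       uk,  len * sizeof(double));
        memcpy(buf + len, uk1, len * sizeof(double));
    }
    MPI_Bcast(buf, 2 * len, MPI_DOUBLE, root, comm);  /* one message */
    if (rank != root) {                 /* unpack on receivers       */
        memcpy(uk,  buf,       len * sizeof(double));
        memcpy(uk1, buf + len, len * sizeof(double));
    }
    free(buf);
}
```

With blocking length b, the number of broadcasts drops from roughly n to n/b, trading message latency for larger message sizes.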
Outline
• Target Application: RSDFT
• Parallel Algorithm of Symmetric Eigensolver for Small Matrices
• Performance Evaluation with 76,800 cores of the Fujitsu FX10
• Conclusion
Oakleaf-FX (ITC, U. Tokyo), the Fujitsu PRIMEHPC FX10

Whole system:
  Total performance:             1.135 PFLOPS
  Total memory:                  150 TB
  Total #nodes:                  4,800
  Interconnect:                  Tofu (6-dimensional mesh/torus)
  Local file system:             1.1 PB
  Shared file system:            2.1 PB

Node:
  Theoretical peak performance:  236.5 GFLOPS
  #Processors (#cores):          16
  Main memory:                   32 GB

Processor:
  Name:                          SPARC64 IXfx
  Frequency:                     1.848 GHz
  Theoretical peak (per core):   14.78 GFLOPS

4,800 nodes (76,800 cores) in total.
COMMUNICATION AVOIDING EFFECT
Householder Inverse Transformation (4,096 Nodes (65,536 Cores), 64x64 Grid), N=38,400, Hybrid
[Figure: time in seconds (0 to 90) for three communication implementations, with bars broken down into Other, HIT Ker, and Send Piv: MPI_BCAST (non-packing sending), binary-tree MPI_Isend (non-blocking MPI), and block MPI_BCAST (packing sending). With the best parameters (#Processes = 4,096, #Threads = 16/node, communication block = 12), the packing implementation runs 1.57x faster.]
HYBRID MPI-OPENMP EFFECT
Pure MPI vs. Hybrid MPI-OpenMP (64 Nodes (1,024 Cores)), N=4,800, Total Time
[Figure: time in seconds (0 to 3.5) for the process organizations 16x64 (pure MPI) and 8x8 (hybrid MPI: 64 MPI processes, 16 OpenMP threads per MPI process), with bars broken down into Householder Inv, Calculating Eigenvectors, Re-distribution, and Tridiagonalization. The hybrid execution is 1.61x faster.]
Pure MPI vs. Hybrid MPI-OpenMP (64 Nodes (1,024 Cores)), N=4,800, Tridiagonalization
[Figure: time in seconds (0 to 2.5) for 16x64 (pure MPI) and 8x8 (hybrid MPI), with bars broken down into Other, Update MatVec, MatVec Reduce, Send xt, Send yt, and Send Piv. The communication ratio falls from 46.1% (pure MPI) to 27.9% (hybrid), an 18.2-point reduction, while the computation ratio rises from 53.9% to 72.1%.]
Pure MPI vs. Hybrid MPI-OpenMP (64 Nodes (1,024 Cores)), N=4,800, Householder Inverse Transformation
[Figure: time in seconds (0 to 0.7) for 16x64 (pure MPI) and 8x8 (hybrid MPI), with bars broken down into Other, HIT Ker, and Send Piv. The communication ratio falls from 44.6% (pure MPI) to 15.6% (hybrid), a 29-point reduction, while the computation ratio rises from 55.4% to 84.4%.]
FX10 76,800 CORES (4,800 NODES) RESULTS
Hybrid MPI-OpenMP Execution on 4,800 Nodes (76,800 Cores), 40x120 Process Grid
[Figure: stacked execution times in seconds for N=41,568, N=83,138, and N=166,276, broken down into Householder Inv, Calculating Eigenvec, Re-dist, and Tridiag; the best HIT communication block is 6, 4, and 2, respectively. Reported component times grow as 31.8 s → 83.3 s → 429.9 s and 34.3 s → 180.1 s → 904.0 s across the three sizes, with growth annotations of 2.61x, 5.24x, 5.16x, 5.01x, 3.97x, and 5.05x, and an "inner L1 cache size" annotation where the working set fits in L1 cache.]
Only about a 4x increase in time with a 2x problem size in an O(N^3) algorithm.
Execution Time in Pure MPI: ScaLAPACK PDSYEVD vs. Ours

  Problem size (grid, #cores)        ScaLAPACK   Ours
  N=4,800  (8x8, 64 cores)           4.26        1.79
  N=9,600  (16x16, 256 cores)        10.96       4.61
  N=19,200 (32x32, 1,024 cores)      25.76       15.52
  [Time in seconds; lower is better.]

ScaLAPACK (version 1.8) on the Fujitsu FX10; Fujitsu optimized BLAS is used. The best block size is specified for each ScaLAPACK execution, chosen from among 1, 8, 16, 32, 64, 128, and 256.
Conclusion
• Our eigensolver is effective for very small matrices because it utilizes communication-reducing and communication-avoiding techniques:
  – by holding duplicated Householder vectors in the tridiagonalization and Householder inverse transformation phases;
  – by reducing communications for the multiple sends through the 2D splitting of the process grid;
  – by packing messages in the Householder inverse transformation part.
• The selection of implementations for the communication processes is the target of AT (auto-tuning).
  – The best implementation depends on the process grid, the number of processors, and the block size for data packing.
Conclusion (Cont'd)
• One of the drawbacks is the increase in memory space:
  – O(N^2 / p), where the process grid is p x q;
  – since the memory space for the matrix is within cache size, the increase can be ignored.
• Comparison with new blocking algorithms is future work:
  – the 2-step method with block Householder tridiagonalization:
    • Eigen-K (RIKEN)
    • ELPA (Technische Universität München)
    • new implementations of PLASMA and MAGMA
Acknowledgements
• Computational resources on the Fujitsu FX10 were awarded through the "Large-scale HPC Challenge" Project, Information Technology Center, The University of Tokyo.
This work was submitted to Parallel Computing (as of December 2013).