Extreme-Scale Parallel Symmetric Eigensolver for Very Small-Size
Matrices Using a Communication-Avoiding Algorithm for Pivot Vectors
Takahiro Katagiri (Information Technology Center, The University of Tokyo)
Jun'ichi Iwata and Kazuyuki Uchida (Department of Applied Physics, School of Engineering, The University of Tokyo)
Thursday, February 20, Room: Salon A, 10:35-10:55
MS34 Auto-tuning Technologies for Extreme-Scale Solvers - Part I of III
SIAM PP14, Feb. 18-21, 2014, Marriott Portland Downtown Waterfront, Portland, OR, USA
Outline
• Target Application: RSDFT
• Parallel Algorithm of Symmetric Eigensolver for Small Matrices
• Performance Evaluation with 76,800 cores of the Fujitsu FX10
• Conclusion
RSDFT (Real Space Density Functional Theory)
The Kohn-Sham equation is solved as a finite-difference equation:

\[
\left[ -\frac{1}{2}\nabla^{2} + v_{\mathrm{ion}}(\mathbf{r}) + \int \frac{\rho(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}\, d\mathbf{r}' + v_{\mathrm{XC}}(\mathbf{r}) \right] \psi_{j}(\mathbf{r}) = \varepsilon_{j}\, \psi_{j}(\mathbf{r})
\]
J.-I. Iwata et al., J. Comput. Phys. 229, 2339 (2010).
Structural properties of Si crystal
[Figure: a 10,648-atom cell of Si crystal and its electron density; a plot of Energy/atom (eV) vs. Volume/atom for the 10,648-atom and 21,952-atom cells.]
Requirements of Mathematical Software from RSDFT
• An FFT-free algorithm.
• Computation of all eigenvalues and eigenvectors of a dense real symmetric matrix.
  – A standard eigenproblem.
  – Executed O(100) times in the SCF (Self-Consistent Field) process.
• Re-orthogonalization of the eigenvectors.
• Due to computational complexity, the eigensolver and orthogonalization parts become a bottleneck.
  – These parts require O(N^3) computations, while the others require O(N^2) computations.
• The matrix and eigenvalues are distributed to obtain parallelism in the parts other than the eigensolver.
  – It is therefore difficult to hold the whole data on one node, even though it is small.
Requirements of Mathematical Software from RSDFT (Cont’d)
• Parts of the application other than the eigensolver are also time-consuming.
Source: Y. Hasegawa et al.: First-principles calculations of electron states of a silicon nanowire with 100,000 atoms on the K computer, SC11 (2011).
RSDFT Processes Breakdown

  Process                      Cost relative to whole time [%]   Order
  SCF                          99.6                              O(N^3)
    SD                         47.2                              O(N^3)
      Subspace Diag.           44.2                              O(N^3)
        MatE                   10.0                              O(N^3), DGEMM
        Eigensolve             19.6                              O(N^3)
        Rot V                  14.6                              O(N^3)
    CG (Conjugate Gradient)    26.0                              O(N^2)
    GS (Gram-Schmidt Orth.)    25.8                              O(N^3), DGEMM
  Others                        0.6                              -
The Eigensolve and GS parts will be the bottleneck in large-scale computation,
but the other processes also need to be considered.
• Required memory space also needs to be considered.
  – Due to the API of the numerical library (e.g., re-distribution of data), the actual problem size is limited to small sizes by the remaining memory space.
Our Assumption
• Target: the eigensolver part in RSDFT.
• Exa-scale computing: the total number of nodes is on the order of 1,000,000 (a million).
• Since the matrix is two-dimensional (2D), the matrix size required on exa-scale computers reaches the order of 10,000 * sqrt(1,000,000) = 10,000,000 (ten million), if each node holds a matrix of N=10,000.
• Since most dense solvers have O(N^3) computational complexity, the execution time for a matrix of N=10,000,000 (ten million) is unrealistic in actual applications (in the production-run phase).
Our Assumption (Cont'd)
• We presume that N=1,000 per node is the maximum size; the exa-scale size is then on the order of N=1,000,000 (a million).
• The memory used for the matrix per node is then only on the order of 8 MB.
  – This is for the eigensolver part only.
• This is just the cache size of current CPUs.
  – Next-generation CPUs may have caches on the order of 100 MB, such as the IBM POWER8 with eDRAM (3D stacked memory) for its L4 cache.
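As a quick sanity check of these sizes (my arithmetic, not from the slides), the per-node memory and the exa-scale matrix order follow directly:

```latex
% Per-node matrix of N = 1,000 in double precision:
%   1,000 x 1,000 elements x 8 bytes/element = 8 x 10^6 bytes = 8 MB.
% Exa-scale matrix order on ~1,000,000 nodes with a 2D distribution:
%   N_exa ~ 1,000 x sqrt(1,000,000) = 1,000,000.
\[
1{,}000 \times 1{,}000 \times 8\ \text{bytes} = 8 \times 10^{6}\ \text{bytes} = 8\ \text{MB},
\qquad
N_{\mathrm{exa}} \approx 1{,}000 \times \sqrt{1{,}000{,}000} = 1{,}000{,}000 .
\]
```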
Originalities of Our Eigensolver
1. Non-blocking computation algorithm
   – since the data stays in cache under our exa-scale assumption.
2. Communication-reducing and communication-avoiding algorithm
   – for the tridiagonalization and Householder inverse transformation of the symmetric eigensolver,
   – by duplicating Householder vectors.
3. Hybrid MPI-OpenMP execution
   – with a full system of a peta-scale supercomputer (the Fujitsu FX10) consisting of 4,800 nodes (76,800 cores).
Outline
• Target Application: RSDFT
• Parallel Algorithm of Symmetric Eigensolver for Small Matrices
• Performance Evaluation with 76,800 cores of the Fujitsu FX10
• Conclusion
A Classical Householder Algorithm (Standard Eigenproblem Ax = λx)

Given a symmetric dense matrix A:
1. Householder transformation (tridiagonalization): T = Q^T A Q, with Q = H_1 H_2 ... H_{n-2}. Cost: O(n^3).
2. Bisection on the tridiagonal matrix T: all eigenvalues Λ. Cost: O(n^2).
3. Inverse iteration on T: all eigenvectors Y. Cost: O(n^2) ~ O(n^3) (MRRR: O(n^2)).
4. Householder inverse transformation back to the dense matrix A: all eigenvectors X = QY. Cost: O(n^3).
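To make step 1 concrete, here is a minimal serial sketch of unblocked Householder tridiagonalization (the textbook algorithm; the routine name, storage layout, and all identifiers are illustrative assumptions, not the authors' parallel code):

```c
/* Minimal serial sketch of unblocked Householder tridiagonalization
 * (step 1 above); illustration only, not the authors' parallel code.
 * A is n x n symmetric, row-major. On exit: d[0..n-1] is the diagonal
 * of T, e[0..n-2] the subdiagonal, and the Householder vectors u_k
 * stay stored below the subdiagonal of A (needed later for X = QY). */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

void tridiagonalize(double *A, int n, double *d, double *e)
{
    double *p = malloc(n * sizeof *p), *w = malloc(n * sizeof *w);
    for (int k = 0; k < n - 2; k++) {
        /* Householder vector u annihilating column k below row k+1. */
        double nrm2 = 0.0;
        for (int i = k + 1; i < n; i++) nrm2 += A[i*n+k] * A[i*n+k];
        double alpha = -copysign(sqrt(nrm2), A[(k+1)*n+k]);
        e[k] = alpha;                        /* new subdiagonal entry   */
        if (alpha == 0.0) continue;          /* column already reduced  */
        double u0 = A[(k+1)*n+k] - alpha;
        A[(k+1)*n+k] = u0;                   /* u now lives in column k */
        double beta = -1.0 / (alpha * u0);   /* H_k = I - beta u u^T    */

        /* p = beta * A(k+1:,k+1:) u,  w = p - (beta/2)(u^T p) u       */
        double utp = 0.0;
        for (int i = k + 1; i < n; i++) {
            double s = 0.0;
            for (int j = k + 1; j < n; j++) s += A[i*n+j] * A[j*n+k];
            p[i] = beta * s;
            utp += A[i*n+k] * p[i];
        }
        for (int i = k + 1; i < n; i++)
            w[i] = p[i] - 0.5 * beta * utp * A[i*n+k];

        /* Symmetric rank-2 update: A := A - u w^T - w u^T (trailing). */
        for (int i = k + 1; i < n; i++)
            for (int j = k + 1; j < n; j++)
                A[i*n+j] -= A[i*n+k] * w[j] + w[i] * A[j*n+k];
    }
    for (int i = 0; i < n; i++) d[i] = A[i*n+i];
    e[n-2] = A[(n-1)*n + (n-2)];
    free(p); free(w);
}

int main(void)   /* tiny smoke test */
{
    double A[16] = { 4,1,2,2,  1,4,1,2,  2,1,4,1,  2,2,1,4 };
    double d[4], e[3];
    tridiagonalize(A, 4, d, e);
    for (int i = 0; i < 4; i++) printf("d[%d]=%g\n", i, d[i]);
    for (int i = 0; i < 3; i++) printf("e[%d]=%g\n", i, e[i]);
    return 0;
}
```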
Whole Parallel Processes of the Eigensolver
• The matrix A is stored in a 2D cyclic-cyclic distribution.
• Tridiagonalization: A → T.
• Gather all elements of T onto every process.
• Bisection: compute upper and lower limits for the eigenvalues; all eigenvalues Λ, in rising order.
• Inverse iteration: compute eigenvectors, corresponding to the rising order of the eigenvalues.
• Householder inverse transformation: compute the eigenvectors from Y.
• Gather all eigenvalues Λ.
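For reference, an element-wise 2D cyclic(-cyclic) distribution can be expressed by a small index mapping; this sketch (0-based convention, my assumption, since the slides do not spell it out) shows which process owns global entry (i, j) and at which local index:

```c
/* Sketch of a 2D cyclic-cyclic (element-wise cyclic in both dimensions)
 * distribution over a p x q process grid. 0-based indices throughout. */
typedef struct { int prow, pcol, li, lj; } Owner;

static Owner owner_of(int i, int j, int p, int q)
{
    Owner o;
    o.prow = i % p;  /* grid row that owns global row i       */
    o.pcol = j % q;  /* grid column that owns global column j */
    o.li   = i / p;  /* local row index on the owner          */
    o.lj   = j / q;  /* local column index on the owner       */
    return o;
}
```

For example, with p = q = 2, global entry (5, 3) lands on process (1, 1) at local index (2, 1).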
Data Duplication in Tridiagonalization
[Figure: the matrix A on a p x q process grid, with the vectors u_k and x_k distributed over it; the Householder vector u_k is duplicated across the grid, and the vector y_k is duplicated likewise.]
Transposed y_k in Tridiagonalization (the case of p < q)
[Figure: a p = 2 by q = 4 process grid; the transposed y_k is formed by a multi-casting MPI_ALLREDUCE rooted at designated processes, with a rectangular processor grid [Katagiri and Itoh, 2010]. The duplication of y_k enables communication avoiding.]
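One way to realize this multi-cast duplication of y_k is with sub-communicators; below is a minimal, hypothetical sketch (the rank-to-grid mapping and all names are assumptions, not the paper's code):

```c
/* Hypothetical sketch: carve a p x q grid out of MPI_COMM_WORLD and
 * duplicate y_k along each grid column with an allreduce, so every
 * process in a column ends up holding the summed vector. */
#include <mpi.h>

void duplicate_yk_in_column(const double *yk_partial, double *yk_dup,
                            int len, int q /* grid columns */)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int myrow = rank / q;   /* row-major rank-to-grid mapping: assumption */
    int mycol = rank % q;

    MPI_Comm col_comm;      /* all processes sharing grid column mycol */
    MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, &col_comm);

    /* Sum the partial contributions of y_k held along the column; the
     * result lands, duplicated, on every member of the column. */
    MPI_Allreduce(yk_partial, yk_dup, len, MPI_DOUBLE, MPI_SUM, col_comm);

    MPI_Comm_free(&col_comm);
}
```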
Parallel Householder Inverse Transformation

<1> do k = n-2, 1, -1
<2>   Gather the vector u_k and its scalar by using multiple MPI_BCASTs.
<3>   do i = nstart, nend
<4>     alpha_i^(k) = u_k^T A_(k:n, i)
<5>     A_(k:n, i) = A_(k:n, i) - alpha_i^(k) u_k
<6>   enddo
<7> enddo

Here u_k is the Householder vector.
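A hedged C/MPI sketch of lines <1>-<7> follows. The real code gathers u_k with the two-stage multi-cast MPI_BCASTs shown on the next slides; this sketch uses a single broadcast per step, and `beta`, `root_of`, and the column layout are illustrative assumptions:

```c
/* Sketch of the parallel Householder inverse transformation loop.
 * Each process holds eigenvector columns i = nstart..nend as full
 * columns of length n (leading dimension ld). u_k is broadcast from
 * an owner process; root_of(k) is a placeholder for that mapping. */
#include <mpi.h>

void householder_inverse_transform(double *Y, int ld, int n,
                                   int nstart, int nend,
                                   double *uk,          /* buffer, length n */
                                   const double *beta,  /* scalars beta_k   */
                                   MPI_Comm comm, int (*root_of)(int))
{
    for (int k = n - 2; k >= 1; k--) {                       /* <1> */
        /* <2> gather u_k (entries k..n-1) onto every process. */
        MPI_Bcast(uk + k, n - k, MPI_DOUBLE, root_of(k), comm);

        for (int i = nstart; i <= nend; i++) {               /* <3> */
            double *col = Y + (long)(i - nstart) * ld;
            double alpha = 0.0;                              /* <4> */
            for (int r = k; r < n; r++) alpha += uk[r] * col[r];
            alpha *= beta[k];      /* fold in beta_k of H_k = I - beta u u^T */
            for (int r = k; r < n; r++) col[r] -= alpha * uk[r];  /* <5> */
        }                                                    /* <6> */
    }                                                        /* <7> */
}
```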
Gathering Vector u_k for the Inverse Transformation: Non-Packing Messages for Gathering u_k
[Figure: a p = 2 by q = 4 process grid; u_k is gathered by two multi-casting MPI_BCASTs (① and ②), and the duplication of u_k enables communication avoiding.]
Gathering Vector u_k for the Inverse Transformation: Packing Messages for Gathering u_k
[Figure: a p = 2 by q = 4 process grid; u_k and u_{k+1} are packed and sent by one communication through the two multi-casting MPI_BCASTs (① and ②). Packing the messages yields communication avoiding and reducing: sending the two vectors in one communication is "communication blocking", here with communication blocking length = 2.]
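The packing idea in a minimal form (buffer layout and names are assumptions): two pivot vectors travel in one MPI_BCAST, so the per-message latency is paid once instead of twice.

```c
/* Sketch: communication blocking with length 2. u_k and u_(k+1) are
 * packed into one buffer on the root and sent by a single MPI_Bcast;
 * receivers unpack. Layout and names are illustrative. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void bcast_two_pivots(double *uk, double *uk1, int len,
                      int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    double *buf = malloc(2 * (size_t)len * sizeof *buf);
    if (rank == root) {                 /* pack before sending       */
        memcpy(buf,       uk,  len * sizeof(double));
        memcpy(buf + len, uk1, len * sizeof(double));
    }
    MPI_Bcast(buf, 2 * len, MPI_DOUBLE, root, comm);  /* one message */
    if (rank != root) {                 /* unpack on receivers       */
        memcpy(uk,  buf,       len * sizeof(double));
        memcpy(uk1, buf + len, len * sizeof(double));
    }
    free(buf);
}
```

With blocking length b, the number of broadcasts drops from roughly n to n/b, trading message latency for larger message sizes.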
Outline
• Target Application: RSDFT
• Parallel Algorithm of Symmetric Eigensolver for Small Matrices
• Performance Evaluation with 76,800 cores of the Fujitsu FX10
• Conclusion
Oakleaf-FX (ITC, U. Tokyo), the Fujitsu PRIMEHPC FX10

Whole system:
  Total performance:             1.135 PFLOPS
  Total memory:                  150 TB
  Total #nodes:                  4,800
  Interconnect:                  Tofu (6-dimensional mesh/torus)
  Local file system:             1.1 PB
  Shared file system:            2.1 PB

Node:
  Theoretical peak performance:  236.5 GFLOPS
  #Processors (#cores):          16
  Main memory:                   32 GB

Processor:
  Name:                          SPARC64 IXfx
  Frequency:                     1.848 GHz
  Theoretical peak (per core):   14.78 GFLOPS

4,800 nodes (76,800 cores) in total.
COMMUNICATION AVOIDING EFFECT
Householder Inverse Transformation (4,096 Nodes (65,536 Cores), 64x64 Grid), N=38,400, Hybrid
[Figure: time in seconds (0 to 90) for three communication implementations, with bars broken down into Other, HIT Ker, and Send Piv: MPI_BCAST (non-packing sending), binary-tree MPI_Isend (non-blocking MPI), and block MPI_BCAST (packing sending). With the best parameters (#Processes = 4,096, #Threads = 16/node, communication block = 12), the packing implementation runs 1.57x faster.]
HYBRID MPI-OPENMP EFFECT
Pure MPI vs. Hybrid MPI-OpenMP (64 Nodes (1,024 Cores)), N=4,800, Total Time
[Figure: time in seconds (0 to 3.5) for the process organizations 16x64 (pure MPI) and 8x8 (hybrid MPI: 64 MPI processes, 16 OpenMP threads per MPI process), with bars broken down into Householder Inv, Calculating Eigenvectors, Re-distribution, and Tridiagonalization. The hybrid execution is 1.61x faster.]
Pure MPI vs. Hybrid MPI-OpenMP (64 Nodes (1,024 Cores)), N=4,800, Tridiagonalization
[Figure: time in seconds (0 to 2.5) for 16x64 (pure MPI) and 8x8 (hybrid MPI), with bars broken down into Other, Update MatVec, MatVec Reduce, Send xt, Send yt, and Send Piv. The communication ratio falls from 46.1% (pure MPI) to 27.9% (hybrid), an 18.2-point reduction, while the computation ratio rises from 53.9% to 72.1%.]
Pure MPI vs. Hybrid MPI-OpenMP (64 Nodes (1,024 Cores)), N=4,800, Householder Inverse Transformation
[Figure: time in seconds (0 to 0.7) for 16x64 (pure MPI) and 8x8 (hybrid MPI), with bars broken down into Other, HIT Ker, and Send Piv. The communication ratio falls from 44.6% (pure MPI) to 15.6% (hybrid), a 29-point reduction, while the computation ratio rises from 55.4% to 84.4%.]
FX10 76,800 CORES (4,800 NODES) RESULTS
Hybrid MPI-OpenMP Execution on 4,800 Nodes (76,800 Cores), 40x120 Process Grid
[Figure: stacked execution times in seconds for N=41,568, N=83,138, and N=166,276, broken down into Householder Inv, Calculating Eigenvec, Re-dist, and Tridiag; the best HIT communication block is 6, 4, and 2, respectively. Reported component times grow as 31.8 s → 83.3 s → 429.9 s and 34.3 s → 180.1 s → 904.0 s across the three sizes, with growth annotations of 2.61x, 5.24x, 5.16x, 5.01x, 3.97x, and 5.05x, and an "inner L1 cache size" annotation where the working set fits in L1 cache.]
Only about a 4x increase in time with a 2x problem size in an O(N^3) algorithm.
Execution Time in Pure MPI: ScaLAPACK PDSYEVD vs. Ours

  Problem size (grid, #cores)        ScaLAPACK   Ours
  N=4,800  (8x8, 64 cores)           4.26        1.79
  N=9,600  (16x16, 256 cores)        10.96       4.61
  N=19,200 (32x32, 1,024 cores)      25.76       15.52
  [Time in seconds; lower is better.]

ScaLAPACK (version 1.8) on the Fujitsu FX10; Fujitsu optimized BLAS is used. The best block size is specified for each ScaLAPACK execution, chosen from among 1, 8, 16, 32, 64, 128, and 256.
Conclusion
• Our eigensolver is effective for very small matrices because it utilizes communication-reducing and communication-avoiding techniques:
  – by holding duplicated Householder vectors in the tridiagonalization and Householder inverse transformation phases;
  – by reducing communications for the multiple sends through the 2D splitting of the process grid;
  – by packing messages in the Householder inverse transformation part.
• The selection of implementations for the communication processes is the target of AT (auto-tuning).
  – The best implementation depends on the process grid, the number of processors, and the block size for data packing.
Conclusion (Cont'd)
• One of the drawbacks is the increase in memory space:
  – O(N^2 / p), where the process grid is p x q;
  – since the memory space for the matrix is within cache size, the increase can be ignored.
• Comparison with new blocking algorithms is future work:
  – the 2-step method with block Householder tridiagonalization:
    • Eigen-K (RIKEN)
    • ELPA (Technische Universität München)
    • new implementations of PLASMA and MAGMA
Acknowledgements
• Computational resources on the Fujitsu FX10 were awarded through the "Large-scale HPC Challenge" Project, Information Technology Center, The University of Tokyo.
This work was submitted to Parallel Computing (as of December 2013).