
Page 1: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size 

Matrices Using A Communication‐Avoiding for Pivot Vectors 

Takahiro Katagiri (Information Technology Center, The University of Tokyo)

Jun'ichi Iwata and Kazuyuki Uchida (Department of Applied Physics, School of Engineering,

The University of Tokyo)

Thursday, February 20, Room: Salon A, 10:35-10:55
MS34 Auto-tuning Technologies for Extreme-Scale Solvers - Part I of III
SIAM PP14, Feb. 18-21, 2014, Marriott Portland Downtown Waterfront, Portland, OR, USA

Page 2: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Outline
• Target Application: RSDFT
• Parallel Algorithm of Symmetric Eigensolver for Small Matrices

• Performance Evaluation with 76,800 cores of the Fujitsu FX10

• Conclusion

Page 3: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Outline
• Target Application: RSDFT
• Parallel Algorithm of Symmetric Eigensolver for Small Matrices

• Performance Evaluation with 76,800 cores of the Fujitsu FX10

• Conclusion

Page 4: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

RSDFT (Real Space Density Functional Theory)

The Kohn-Sham equation is solved as a finite-difference equation:

\Bigl[-\tfrac{1}{2}\nabla^{2} + v_{\mathrm{ion}}(\mathbf{r}) + \int \frac{\rho(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}\,d\mathbf{r}' + \frac{\delta E_{\mathrm{XC}}}{\delta\rho(\mathbf{r})}\Bigr]\,\psi_{j}(\mathbf{r}) = \varepsilon_{j}\,\psi_{j}(\mathbf{r})

J.-I. Iwata et al., J. Comp. Phys. 229, 2339 (2010).
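As an illustration of the finite-difference treatment, a second-order central-difference stencil on a uniform grid of spacing h would discretize the Laplacian as follows (generic textbook form shown only as a sketch; the order of the stencil actually used in RSDFT is not specified here):

\nabla^{2}\psi(\mathbf{r}_{i}) \approx \sum_{d=x,y,z} \frac{\psi(\mathbf{r}_{i}+h\,\mathbf{e}_{d}) - 2\,\psi(\mathbf{r}_{i}) + \psi(\mathbf{r}_{i}-h\,\mathbf{e}_{d})}{h^{2}}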

[Figures: a 10,648-atom cell of a Si crystal and its electron density; a plot of Energy/atom (eV) versus Volume/atom for 10,648-atom and 21,952-atom cells. Structural properties of Si crystal.]

Page 5: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Requirements of Mathematical Software from RSDFT

• An FFT-free algorithm.
• Computation of all eigenvalues and eigenvectors of a dense real symmetric matrix.
  – Standard eigenproblem.

  – The eigensolver is executed O(100) times during the SCF (Self-Consistent Field) process.

• Re-orthogonalization of the eigenvectors.
• Due to computational complexity, the eigensolver and orthogonalization parts become a bottleneck.
  – These parts require O(N^3) computations, while the others require O(N^2) computations.

• The matrix and eigenvalues are distributed to obtain parallelism in the parts of the application other than the eigensolver.
  – It is difficult to gather the whole data on one node, even though the matrix is small.

Page 6: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Requirements of Mathematical Software from RSDFT (Cont’d)

• Parts of the application other than the eigensolver are also time-consuming.

Source: Y. Hasegawa et al.: First-principles calculations of electron states of a silicon nanowire with 100,000 atoms on the K computer, SC11 (2011).

RSDFT Process Breakdown

  Process                   Cost [% of whole time]   Order
  SCF                       99.6%                    O(N^3)
  SD                        47.2%                    O(N^3)
  Subspace Diag.            44.2%                    O(N^3)
  MatE                      10.0%                    O(N^3), DGEMM
  Eigensolve                19.6%                    O(N^3)
  Rot V                     14.6%                    O(N^3)
  CG (Conjugate Gradient)   26.0%                    O(N^2)
  GS (Gram-Schmidt Orth.)   25.8%                    O(N^3), DGEMM
  Others                     0.6%                    -

The Eigensolve and GS parts will be bottlenecks in large-scale computation, but the other processes also need to be considered.

• Required memory space also needs to be considered.
  – Due to the API of the numerical library (e.g., re-distribution of data), the actual problem size is limited to small sizes by the remaining memory space.

Page 7: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Our Assumption
• Target: the eigensolver part in RSDFT.
• Exa-scale computing: the total number of nodes is on the order of 1,000,000 (a million).

• Since the matrix is two-dimensional (2D), the matrix size required on exa-scale computers reaches the order of 10,000 * sqrt(1,000,000) = 10,000,000 (ten million), if each node holds a local matrix of N = 10,000. The arithmetic behind this size is worked out below.

• Since most dense solvers require O(N^3) computational complexity, the execution time with a matrix of N = 10,000,000 (ten million) is unrealistic in actual applications (in the production-run phase).
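The global size above just reflects the 2D block distribution over a sqrt(P) x sqrt(P) process grid, worked out here for the stated numbers:

N_{\mathrm{global}} = N_{\mathrm{local}} \times \sqrt{P} = 10{,}000 \times \sqrt{10^{6}} = 10{,}000 \times 1{,}000 = 10^{7}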

Page 8: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Our Assumption (Cont'd)
• We presume that N = 1,000 per node is the maximum size. The global size at exa-scale is then on the order of N = 1,000,000 (a million).

• The memory used for the matrix per node is then only on the order of 8 MB (see the arithmetic below).
  – Note: this is for the eigensolver part only.

• This is about the cache size of current CPUs.
  – Next-generation CPUs may have on the order of 100 MB of cache.

  • Such as the IBM POWER8 with eDRAM (3D stacked memory) for the L4 cache.
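The 8 MB figure is simply the double-precision storage of a 1,000 x 1,000 local block (assuming 8-byte reals):

1{,}000 \times 1{,}000 \times 8\ \mathrm{bytes} = 8 \times 10^{6}\ \mathrm{bytes} \approx 8\ \mathrm{MB}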

Page 9: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Originalities of Our Eigensolver
1. Non-blocking computation algorithm
   – Since the data fits in cache under our assumption for exa-scale computing.
2. Communication-reducing and communication-avoiding algorithm
   – For the tridiagonalization and Householder inverse transformation of symmetric eigensolvers.
   – By duplicating Householder vectors.
3. Hybrid MPI-OpenMP execution
   – With a full system of a peta-scale supercomputer (the Fujitsu FX10) consisting of 4,800 nodes (76,800 cores).

Page 10: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Outline
• Target Application: RSDFT
• Parallel Algorithm of Symmetric Eigensolver for Small Matrices

• Performance Evaluation with 76,800 cores of the Fujitsu FX10

• Conclusion

Page 11: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

A Classical Householder Algorithm (Standard Eigenproblem: A x = λ x)

A: symmetric dense matrix.

1. Householder Transformation (tridiagonalization): Q^{T} A Q = T, with Q = H_1 H_2 \cdots H_{n-2}; T is a tridiagonal matrix; cost O(n^3).
2. Bisection: from the tridiagonal matrix T, all eigenvalues Λ; cost O(n^2).
3. Inverse Iteration: from the tridiagonal matrix T, all eigenvectors Y; cost O(n^2) ~ O(n^3) (MRRR: O(n^2)).
4. Householder Inverse Transformation (back to the dense matrix A): all eigenvectors X = Q Y; cost O(n^3).
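For reference, each factor H_k is a Householder reflector built from a pivot vector u_k; the scalar is written \alpha_k here to match the later slides (standard formulation, not specific to this implementation):

H_k = I - \alpha_k\, u_k u_k^{T}, \qquad \alpha_k = \frac{2}{u_k^{T} u_k}, \qquad H_k = H_k^{T} = H_k^{-1}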

Page 12: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Whole Parallel Processes of the Eigensolver

[Flow diagram, with A in a 2D cyclic-cyclic distribution:
 A → Tridiagonalization → T
 → Gather all elements of T
 → Compute upper and lower limits for the eigenvalues → Λ in rising order (1, 2, 3, 4, ...)
 → Compute eigenvectors Y (corresponding to the rising order of the eigenvalues)
 → Gather all eigenvalues Λ
 → Householder Inverse Transformation → eigenvectors of A]

Page 13: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Data Duplication in Tridiagonalization

[Diagram: matrix A distributed over a p x q process grid.
 The vectors u_k and x_k are duplicated along one grid dimension, and the vectors y_k along the other.
 u_k: Householder vector.]

Page 14: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Transposed y_k in Tridiagonalization (the case of p < q)

[Diagram: process grid with p = 2, q = 4. The duplication of y_k is formed by a multi-casting MPI_ALLREDUCE rooted at selected processes, with a rectangular processor grid [Katagiri and Itoh, 2010]. Communication avoiding by using the duplications; a minimal sketch of this pattern follows.]
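A minimal sketch of the duplication pattern, assuming column communicators obtained by splitting the 2D grid; all names (q, nloc, y_part, y_dup) are illustrative assumptions, not the authors' code:

      program allreduce_dup_demo
        use mpi
        implicit none
        integer, parameter :: nloc = 500   ! assumed local vector length
        integer, parameter :: q    = 4     ! assumed number of process columns
        double precision :: y_part(nloc), y_dup(nloc)
        integer :: ierr, myrank, mycol, col_comm

        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)

        ! split the 2D (cyclic-cyclic) grid into column communicators
        mycol = mod(myrank, q)
        call MPI_Comm_split(MPI_COMM_WORLD, mycol, myrank, col_comm, ierr)

        call random_number(y_part)         ! stand-in for the local partial sums of y_k

        ! every process in the column ends up with the full, duplicated y_k
        call MPI_Allreduce(y_part, y_dup, nloc, MPI_DOUBLE_PRECISION, &
                           MPI_SUM, col_comm, ierr)

        call MPI_Comm_free(col_comm, ierr)
        call MPI_Finalize(ierr)
      end program allreduce_dup_demo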

Page 15: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Parallel Householder Inverse Transformation

<1> do k = n-2, 1, -1
<2>   Gather the vector u_k and the scalar \alpha_k by using multiple MPI_BCASTs.
<3>   do i = nstart, nend
<4>     \sigma_i^{(k)} = \alpha_k \, u_k^{T} A_{k:n,\,i}
<5>     A_{k:n,\,i} = A_{k:n,\,i} - \sigma_i^{(k)} u_k
<6>   enddo
<7> enddo
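A minimal sketch of the per-column update in lines <4>-<5>, assuming u_k and \alpha_k have already been broadcast; the subroutine and argument names are illustrative, not the authors' kernel:

      subroutine apply_reflector(nrow, jstart, jend, a, lda, u, alpha)
        ! Apply H_k = I - alpha * u * u^T to local columns jstart..jend of a.
        implicit none
        integer, intent(in) :: nrow, jstart, jend, lda
        double precision, intent(inout) :: a(lda, *)
        double precision, intent(in) :: u(nrow), alpha
        double precision :: sigma
        integer :: i, j
        do i = jstart, jend
           sigma = 0.0d0
           do j = 1, nrow
              sigma = sigma + u(j) * a(j, i)      ! u_k^T A(k:n, i)
           end do
           sigma = alpha * sigma                  ! sigma_i^(k)
           do j = 1, nrow
              a(j, i) = a(j, i) - sigma * u(j)    ! rank-1 update of column i
           end do
        end do
      end subroutine apply_reflector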

Page 16: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Gathering vector u_k for the Inverse Transformation: non-packing messages for gathering u_k

[Diagram: process grid with p = 2, q = 4 and the duplication of u_k.
 (1) Multi-casting MPI_BCAST along one grid dimension; (2) multi-casting MPI_BCAST along the other.
 Communication avoiding by using the duplications.]

Page 17: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Gathering vector u_k for the Inverse Transformation: packing messages for gathering u_k

[Diagram: process grid with p = 2, q = 4 and the duplication of u_k.
 (1) Multi-casting MPI_BCAST; (2) multi-casting MPI_BCAST.
 Communication avoiding and reducing by packing the messages: the two vectors u_k and u_{k+1} are sent in one communication (communication blocking, blocking length = 2). A minimal sketch follows.]
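A minimal sketch of the packing idea with blocking length 2; MPI_COMM_WORLD stands in for the row/column communicators of the grid, and all sizes and names are illustrative assumptions:

      program pack_bcast_demo
        use mpi
        implicit none
        integer, parameter :: n = 1000, nb = 2      ! assumed vector length and blocking length
        double precision :: u(n, nb), buf(n*nb)
        integer :: ierr, myrank, root

        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
        root = 0

        if (myrank == root) then
           call random_number(u)                    ! stand-in for u_k and u_{k+1}
           buf = reshape(u, (/ n*nb /))             ! pack nb vectors into one message
        end if

        ! one broadcast instead of nb separate broadcasts
        call MPI_Bcast(buf, n*nb, MPI_DOUBLE_PRECISION, root, MPI_COMM_WORLD, ierr)

        if (myrank /= root) then
           u = reshape(buf, (/ n, nb /))            ! unpack on the receiving processes
        end if

        call MPI_Finalize(ierr)
      end program pack_bcast_demo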

Page 18: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Outline
• Target Application: RSDFT
• Parallel Algorithm of Symmetric Eigensolver for Small Matrices

• Performance Evaluation with 76,800 cores of the Fujitsu FX10

• Conclusion

Page 19: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Oakleaf-FX (ITC, U. Tokyo), the Fujitsu PRIMEHPC FX10

Whole system
  Total performance               1.135 PFLOPS
  Total memory amount             150 TB
  Total #nodes                    4,800
  Interconnect                    Tofu (6-dimensional mesh/torus)
  Local file system capacity      1.1 PB
  Shared file system capacity     2.1 PB

Node
  Theoretical peak performance    236.5 GFLOPS
  #Cores                          16
  Main memory amount              32 GB

Processor
  Processor name                  SPARC64 IXfx
  Frequency                       1.848 GHz
  Theoretical peak performance    14.78 GFLOPS (per core)

4,800 nodes (76,800 cores) in total.

Page 20: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

COMMUNICATION AVOIDING EFFECT

Page 21: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Householder Inverse Transformation (4,096 Nodes (65,536 Cores), 64x64), N=38,400, Hybrid

[Chart: time in seconds (0-90) of the Householder Inverse Transformation for four communication implementations: MPI_BCAST, Binary Tree, MPI_Isend, and Block MPI_BCAST; each bar is broken down into Other, HIT Ker, and Send Piv.
 #Processes = 4,096, #Threads = 16/node; the best parameter is communication block = 12.
 Annotations: non-packing sending vs. packing sending with a 1.57x difference; non-blocking MPI.]

Page 22: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

HYBRID MPI-OPENMP EFFECT

Page 23: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Pure MPI vs. Hybrid MPI-OpenMP (64 Nodes (1,024 Cores)), N=4,800, Total Time

[Chart: time in seconds (0-3.5) for two process organizations, 16x64 (pure MPI) and 8x8 (hybrid MPI); each bar is broken down into Householder Inv, Calculating Eigenvectors, Re-distribution, and Tridiagonalization.
 The hybrid run uses 64 MPI processes with 16 OpenMP threads per MPI process and is 1.61x faster. A minimal hybrid MPI-OpenMP sketch follows.]
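A minimal sketch of the hybrid pattern discussed here: one MPI process per node with OpenMP threads inside it. The update loop and all names are illustrative assumptions, not the authors' kernel:

      program hybrid_demo
        use mpi
        use omp_lib
        implicit none
        integer, parameter :: nloc = 100000        ! assumed local block size
        double precision :: a(nloc), u(nloc), sigma
        integer :: ierr, provided, myrank, j

        ! MPI between nodes, OpenMP threads inside each node
        call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)

        call random_number(a)
        call random_number(u)
        sigma = 0.5d0

        ! node-local work is shared among the OpenMP threads
      !$omp parallel do
        do j = 1, nloc
           a(j) = a(j) - sigma * u(j)
        end do
      !$omp end parallel do

        if (myrank == 0) print *, 'threads per process:', omp_get_max_threads()
        call MPI_Finalize(ierr)
      end program hybrid_demo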

Page 24: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Pure MPI vs. Hybrid MPI-OpenMP (64 Nodes (1,024 Cores)), N=4,800, Tridiagonalization

[Chart: time in seconds (0-2.5) for 16x64 (pure MPI) vs. 8x8 (hybrid MPI); each bar is broken down into Other, Update, MatVec, MatVec Reduce, Send xt, Send yt, and Send Piv.
 Communication vs. computation share: pure MPI 46.1% / 53.9%, hybrid 27.9% / 72.1%, i.e. an 18.2-point reduction of the communication share.]

Page 25: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Pure MPI vs. Hybrid MPI-OpenMP (64 Nodes (1,024 Cores)), N=4,800, Householder Inverse Transformation

[Chart: time in seconds (0-0.7) for 16x64 (pure MPI) vs. 8x8 (hybrid MPI); each bar is broken down into Other, HIT Ker, and Send Piv.
 Communication vs. computation share: pure MPI 44.6% / 55.4%, hybrid 15.6% / 84.4%, i.e. a 29-point reduction of the communication share.]

Page 26: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

FX10 76,800 CORES (4,800 NODES) RESULTS

Page 27: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Hybrid MPI-OpenMP Execution on 4,800 nodes (76,800 Cores) (40x120)

[Chart: time in seconds (0-1,600) for N=41,568, N=83,138, and N=166,276; each bar is broken down into Householder Inv, Calculating Eigenvec, Re-dist, and Tridiag; component times shown include 31.8, 83.3, and 429.9 s and 34.3, 180.1, and 904.0 s.
 HIT communication block = 6, 4, and 2, respectively; annotated ratios: 2.61x, 5.24x, 5.16x, 5.01x, 3.97x, 5.05x; annotation: "Inner L1 Cache Size".
 Note on the slide: only a 4x increase with a 2x problem size in an O(N^3) algorithm.]
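For context on that note: with fixed resources, a pure O(N^3) cost model would predict a run time about 2^3 = 8 times longer when N doubles, so an increase of only about 4x-5x is better than the naive prediction:

\frac{t(2N)}{t(N)} \approx \frac{(2N)^{3}}{N^{3}} = 8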

Page 28: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Execution Time in Pure MPI between ScaLAPACK PDSYEVD and Ours

ScaLAPACK (version 1.8) on the Fujitsu FX10, with Fujitsu-optimized BLAS. The best block size is chosen for each ScaLAPACK execution from 1, 8, 16, 32, 64, 128, and 256.

Execution time [seconds] (lower is better)
  Problem                              ScaLAPACK   Ours
  N=4,800  (8x8 grid, 64 cores)            4.26    1.79
  N=9,600  (16x16 grid, 256 cores)        10.96    4.61
  N=19,200 (32x32 grid, 1,024 cores)      25.76   15.52

Page 29: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Conclusion
• Our eigensolver is effective for very small matrices because it utilizes communication-reducing and communication-avoiding techniques.
  – By duplicating Householder vectors in the Tridiagonalization and Householder Inverse Transformation phases.

  – By reducing communications for multiple sends with a 2D splitting of the process grid.

  – By packing messages in the Householder Inverse Transformation part.

• The selection of implementations for the communication processes is the target of auto-tuning (AT).
  – The best implementation depends on the process grid, the number of processors, and the block size for data packing.

Page 30: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Conclusion (Cont'd)
• One of the drawbacks is the increase of memory space.
  – The extra space is O(N^2 / p), where the process grid is p x q.
  – Since the memory space for the matrix is within cache size, the increase of memory space can be ignored.

• Comparison with new blocking algorithms is future work.
  – Two-step methods with block Householder tridiagonalization:
    • Eigen-K (RIKEN)
    • ELPA (Technische Universität München)
    • A new implementation of PLASMA and MAGMA

Page 31: Extreme‐Scale Parallel Symmetric Eigensolver for Very Small‐Size Matrices Using A Communication-Avoiding for Pivot Vectors

Acknowledgements
• The computational resources of the Fujitsu FX10 were awarded through the "Large-scale HPC Challenge" Project, Information Technology Center, The University of Tokyo.

This work has been submitted to Parallel Computing (as of December 2013).