
Performance of Density Functional Theory codes on Cray XE6
Zhengji Zhao and Nicholas Wright
National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory


Page 1:

Performance of Density Functional Theory codes on Cray XE6

Zhengji Zhao and Nicholas Wright
National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory

Page 2:

•  Motivation
•  Introduction to DFT codes
•  Threads and performance of VASP
•  OpenMP threads and performance of Quantum Espresso
•  Conclusion

Outline

Page 3:

•  Challenges from the multi-core trend
  –  Address the reduced memory per core
  –  Make use of faster intra-node memory access

•  The recommended path forward is to use threads/OpenMP
•  The majority of NERSC application codes are still flat MPI
•  Examine the performance implications of using threads in real user applications

Motivation

Page 4:

•  Materials and chemistry applications account for 1/3 of the NERSC workload.
•  75% of them run various DFT codes.
•  Among the ~500 application codes in use at NERSC, VASP consumes the most computing cycles (~8%).
•  VASP is pure MPI, reflecting the current status of the majority of user codes.
•  Quantum Espresso, an OpenMP/MPI hybrid code, is the #8 code at NERSC.

Why DFT codes

Page 5:

•  What it solves
  –  The Kohn-Sham equations

Density Functional Theory

The Kohn-Sham equations:

$\{-\tfrac{1}{2}\nabla^2 + V(r)[\rho]\}\,\psi_i(r) = E_i\,\psi_i(r)$

Planewave expansion of the wavefunctions:

$\psi_i(r) = \sum_{G} C_{i,G}\, e^{i(k+G)\cdot r}$

Orthonormality constraints:

$\int \psi_i^*(r)\,\psi_j(r)\,dr = \delta_{ij}, \quad \{\psi_i\}_{i=1,\dots,N}$

Effective potential (Local Density Approximation):

$V(r)[\rho] = -\sum_R \frac{Z_R}{|r-R|} + \int \frac{\rho(r')}{|r-r'|}\,d^3r' + \mu(\rho(r))$

where $\mu(\rho(r))$ is the exchange-correlation potential, a function of the local density only.

Page 6:

Flow chart of DFT codes

For N electrons and N wavefunctions, the self-consistency cycle is:

1. Start from a trial charge density $\rho(r)$ and trial wavefunctions $\{\psi_i\}_{i=1,\dots,N}$.
2. Solve $\{-\tfrac{1}{2}\nabla^2 + V_{in}(r)\}\,\psi_i(r) = E_i\,\psi_i(r)$ iteratively; each $H\psi$ costs 2 FFTs (CG, RMM-DIIS, or Davidson).
3. Subspace diagonalization: $\langle\psi_i|H|\psi_j\rangle$.
4. Orthogonalization: $\int \psi_i^*(r)\,\psi_j(r)\,dr = \delta_{ij}$.
5. New charge density: $\rho(r) = \sum_n f_n\,|\psi_n(r)|^2$.
6. Potential generation: solve the Poisson equation and evaluate the density functional to obtain $V_{out}(r)$.
7. If $\Delta E < \epsilon$, stop; otherwise mix the potentials ($V_{in}$, $V_{out}$ → new $V_{in}$) and go back to step 2 with the new $\rho(r)$ and $\{\psi_i\}_{i=1,\dots,N}$.

(A runnable toy version of this loop is sketched below.)

Page 7:

Parallelization in DFT codes
Level 1: Parallel over k-points

$\{\psi_{i,k}\},\quad i = 1,\dots,N;\ k = 1,\dots,nk_{tot}$

•  The total number of processors, Ntot, is divided into nkg groups, each with Nk processors (Ntot = nkg*Nk)
•  Each group of processors deals with nktot/nkg k-points (a small sketch of this bookkeeping follows below)
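A minimal sketch of this level-1 decomposition, assuming made-up sizes (144 processors, 4 k-point groups, 8 k-points); only the naming (Ntot, nkg, Nk, nktot) follows the slide.

n_tot, nkg, nktot = 144, 4, 8             # processors, k-point groups, k-points
n_k = n_tot // nkg                        # processors per group: Ntot = nkg * Nk

proc_groups = [list(range(g * n_k, (g + 1) * n_k)) for g in range(nkg)]
kpt_groups = [list(range(g, nktot, nkg)) for g in range(nkg)]   # one simple way to share the k-points

for g in range(nkg):
    print(f"group {g}: procs {proc_groups[g][0]}..{proc_groups[g][-1]}, "
          f"handles {len(kpt_groups[g])} k-points {kpt_groups[g]}")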

Page 8:

The N wavefunctions $\{\psi_{i,k}\}_{i=1,\dots,N}$ are split into Ng blocks of m bands, one block per band group:

Group 1: $\{\psi_{i,k}\}_{i=1,\dots,m}$ on processors 1 to Np
Group 2: $\{\psi_{i,k}\}_{i=m+1,\dots,2m}$ on processors Np+1 to 2Np
......
Group Ng: $\{\psi_{i,k}\}_{i=m(N_g-1)+1,\dots,N}$ on processors Nk-Np+1 to Nk

•  The Nk processors of a k-point group are divided into Ng band groups, each with Np processors (Nk = Ng*Np)
•  The N wavefunctions are also divided into Ng groups, each with m wavefunctions
•  Each group of processors deals with one group of wavefunctions (see the sketch after this slide)

Parallelization in DFT codes
Level 2: Parallel over bands
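The same bookkeeping one level down, again with made-up sizes (Nk = 36 processors in the k-point group, Ng = 3 band groups, N = 120 wavefunctions); a sketch of the decomposition only, not code from either package.

n_k, n_g, n_bands = 36, 3, 120             # procs in k-point group, band groups, bands
n_p = n_k // n_g                           # processors per band group: Nk = Ng * Np
m = n_bands // n_g                         # wavefunctions per band group

for g in range(n_g):
    first_proc, last_proc = g * n_p + 1, (g + 1) * n_p
    first_band, last_band = g * m + 1, (g + 1) * m
    print(f"band group {g + 1}: processors {first_proc}..{last_proc}, "
          f"wavefunctions {first_band}..{last_band}")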

Page 9:

Within each band group, the planewave basis is divided among the Np processors (a round-robin sketch appears at the end of this page):

[Figure: the sphere of G-vectors is divided into columns that are distributed over the Np processors; FFTs transform between this G-space decomposition and the real-space grid.]

Figures from http://hpcrd.lbl.gov/~linwang/PEtot/PEtot_parallel.html

Parallelization in DFT codes
Level 3: Parallel over the planewave basis set

$\psi_{i,k}(r) = \sum_{G} C_{i,G}\, e^{i(k+G)\cdot r}$
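A sketch of how the level-3 split can be organized: the planewave coefficients C_{i,G} are grouped into G-space columns (fixed Gx, Gy, all Gz) and the columns are dealt round-robin to the Np processors of a band group. The grid extent and Np below are made-up numbers; real codes typically balance the number of G-vectors per processor.

from collections import defaultdict

ng_x, ng_y, n_p = 6, 6, 4                 # G-grid extent in x and y, processors

columns = [(gx, gy) for gx in range(ng_x) for gy in range(ng_y)]
owner = defaultdict(list)
for i, col in enumerate(columns):
    owner[i % n_p].append(col)            # round-robin distribution of G-columns

for p in range(n_p):
    print(f"proc {p}: {len(owner[p])} G-columns, e.g. {owner[p][:3]}")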

Page 10:

•  A planewave pseudopotential code
  –  A commercial code from the University of Vienna
•  Libraries used
  –  BLAS, FFT
•  Parallel implementation
  –  Over the planewave basis set and over bands
  –  Scales to >1 processor per atom
  –  Achieves 20-50% of peak flops in real calculations
•  VASP use at NERSC
  –  Used by 83 projects, 200 active users

VASP

http://cmp.univie.ac.at/vasp

Page 11:


VASP: Performance vs threads

•  As the number of threads increases there is little or no performance gain; with more threads the code runs slower.

•  However, at threads=3 VASP runs 20-25% faster than flat MPI on unpacked nodes.

!"#!!"$!!!"$#!!"%!!!"%#!!"&!!!"&#!!"'!!!"'#!!"

$" %" &" (" $%" %'"

$''" )%" '*" %'" $%" ("

!"#$%&'(%

)*#+$,%-.%/0,$12'3456%/1'7'%

+$#',-./01234"

+$#',566,-778"

9:.;"6<7,566,-778,"=4>.?@A1"431A2"

9:.;"6<7,-./01234,=4>.?@A1"431A2"

Test case A154:

154 atoms 998 electrons Zn48O48C22S2H34 80x70x140 real-space

grids; 160x140x280 FFT

grids 4 kpoints

Page 12:


VASP: Memory usage vs threads

•  Memory usage decreases as the number of threads increases

•  At threads=3, memory usage is about 10% lower than at threads=2

Test case A154 (same as page 11).

[Figure: VASP memory usage per core vs. number of OpenMP threads per MPI task, test case A154.]

Page 13:


VASP: runs slower as the number of threads increases

At its best (threads=2), threaded VASP is slightly slower (~12%) than flat MPI

!"

#!!"

$!!!"

$#!!"

%!!!"

%#!!"

&!!!"

&#!!"

'!!!"

$" %" &" (" $%" %'"

)(*" &*'" %#(" $%*" ('" &%"

!"#$%&'(%

)*#+$,%-.%/0,$12'3456%/1'7'%

+((!,-./01234"

+((!,566,-778"

Test case A660:

660 atoms 2220 electrons C200H230N70Na20O120P20 240x240x486 real-

space grids; 480x380x972 FFT grids 1 kpoint (Gamma point) Gamma kpoint only

VASP

Page 14:


VASP: Memory usage vs threads

Comparing memory usage at threads=2 with flat MPI: RMM-DIIS shows a slight memory saving; Davidson shows no saving and uses slightly more memory (<3%)

!"

!#!$"

!#%"

!#%$"

!#&"

!#&$"

!#'"

%" &" '" (" %&" &)"

*(+" '+)" &$(" %&+" ()" '&"

!"#

$%&'("

%')$%"'*+

,-'

./#0"%'$1'23%"4567!89'246:6'

,((!-./012345"

,((!-677-.889"

Test case A660:

Page 15:

•  A planewave pseudopotential code
  –  Open-source software from the DEMOCRITOS National Simulation Center and SISSA, developed in collaboration with many other institutes
•  Libraries used
  –  BLAS, FFT
•  Parallel implementation
  –  Over k-points, the planewave basis set, and bands
  –  Scales to >1 processor per atom
•  QE use at NERSC
  –  Used by 21 projects

Quantum Espresso

http://www.quantum-espresso.org

Page 16:


QE: The Hybrid OpenMP+MPI code runs faster than the flat MPI

At threads=2, QE runs faster than the flat MPI on half-packed nodes by 38%

!"

#!!"

$!!!"

$#!!"

%!!!"

%#!!"

&!!!"

$" %" &" '" $%" %("

$((!" )%!" (*!" %(!" $%!" '!"

!"#$%&'(%

)*#+$,%-.%/0,$12'3456%/1'7'%

+,-,'*'"

./01"23-"45"60/7890:;<="54=<>"

Test case GRIR686:

686 atoms 5174 electrons C200Ir486 180x180x216 FFT

grids 2 kpoints

Page 17:


At threads=2, the memory usage is reduced by 64% when compared to the flat MPI

!"!#$"!#%"!#&"!#'"("

(#$"(#%"(#&"(#'"$"

(" $" )" &" ($" $%"

(%%!" *$!" %'!" $%!" ($!" &!"

!"#

$%&'("

%')$%"'*+

,-'

./#0"%'$1'23%"4567!89'246:"6'

+,-,&'&"

./01"23-"45"60/7890:;<="54=<>"

Test case GRIR686:

686 atoms 5174 electrons C200Ir486 180x180x216 FFT

grids 2 kpoints

QE: The OpenMP+MPI code uses less memory than the flat MPI

Page 18:


At threads=2, QE runs faster than the flat MPI on half-packed nodes by 28%

!"

#!!"

$!!!"

$#!!"

%!!!"

%#!!"

&!!!"

$" %" &" '" $%" %("

$'&%" )$'" #((" %*%" $&'" ')"

!"#$%&'(%

)*#+$,%-.%/0,$12'3456%/1'7'%

+,-$!./0)"

1234"5.6"78"932:;<3=>?@"87@?A"

Test case CNT10POR8:

1532 atoms 5232 electrons C200Ir486 540x540x540 FFT

grids 1 kpoint (Gamma point)

QE: The Hybrid OpenMP+MPI code runs faster than the flat MPI

Page 19:


At threads=2, the memory usage is reduced by 30%

!"!!!#

!"$!!#

!"%!!#

!"&!!#

!"'!!#

("!!!#

("$!!#

("%!!#

(# $# )# &# ($# $%#

(&)$# '(&# *%%# $+$# ()&# &'#

!"#

$%&'("

%')$%"'*+

,-'

./#0"%'$1'23%"4567!89'246:6'

,-.(!/01'#

Test case CNT10POR8:

QE: The OpenMP+MPI code uses less memory than the flat MPI

Page 20:


At threads=2, QE runs faster than the flat MPI on half-packed nodes by 22%

!"#!"

$!!"$#!"%!!"%#!"&!!"&#!"'!!"'#!"#!!"

$" %" &" ("

%))" $''" *(" ')"

!"#$%&'(%

)*#+$,%-.%/0,$12'3456%/1'7'%

+,-,./$$%"

/012"345"67"8109:;1<=>?"76?>@"

Test case AUSURF112:

112 atoms 5232 electrons C200Ir486 125x64x200 FFT grids 80x90x288 smooth

grids 2 k-points

QE: The Hybrid OpenMP+MPI code runs faster than the flat MPI

Page 21:


At threads=2, QE runs faster than the flat MPI on half-packed nodes by 38%

!"

!#!$"

!#%"

!#%$"

!#&"

!#&$"

%" &" '" ("

&))" %**" +(" *)"

!"#

$%&'("

%')$%"'*+

,-'

./#0"%'$1'23%"4567!89'246:6'

,-.-/0%%&"

Test case AUSURF112:

QE: The OpenMP+MPI code uses less memory than the flat MPI

Page 22:

•  Performance of VASP with the MPI+OpenMP programming model

  –  A low-effort thread implementation: linked with multi-threaded BLAS libraries (a minimal illustration of this idea follows this slide)
  –  Slight performance gains, on the order of 20-25%
  –  Adding OpenMP directives to the source code should improve this situation
  –  Slight memory savings
  –  Many optional parameters affect VASP performance; our results do not cover them all

Conclusions
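As an illustration of the "low-effort" approach named above (threading that comes entirely from the multi-threaded BLAS library, with no OpenMP in the application source), the toy timing below uses numpy's BLAS as a stand-in for the BLAS calls inside VASP; the thread count is whatever OMP_NUM_THREADS the underlying BLAS honours at launch. This is an analogy only, not how VASP is built or run.

import os
import time
import numpy as np

n = 2048
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.time()
c = a @ b                                  # dense matmul, executed by the (possibly threaded) BLAS
elapsed = time.time() - t0

print(f"OMP_NUM_THREADS={os.environ.get('OMP_NUM_THREADS', 'unset')}: "
      f"{n}x{n} matmul took {elapsed:.2f} s")

Running it twice, for example with OMP_NUM_THREADS=1 and then OMP_NUM_THREADS=6, shows the kind of speedup a threaded BLAS can deliver without touching the application source.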

Page 23:

•  Performance of QE with the MPI+OpenMP programming model

  –  OpenMP directives in the source code, plus linking to multi-threaded libraries
  –  Performance gains on the order of 40% compared to flat MPI; best performance at threads=2
  –  Significant memory savings, 20-40% per core compared to flat MPI

Conclusions --continued

Page 24:

•  OpenMP+MPI is a promising programming model on Hopper

  –  It should also benefit other DFT codes and other MPI codes that can make use of multi-threaded BLAS routines

Conclusions --continued

Page 25:

•  NERSC users Wai-Yim Ching and Sefa Dag for providing the VASP test cases
•  NERSC resources

Acknowledgement