vasp: running on hpc resourcesto the presently considered subspace: • rayleigh-ritz optimization...
TRANSCRIPT
UniversityofVienna,FacultyofPhysicsandCenterforComputationalMaterialsScience,
Vienna,Austria
VASP:runningonHPCresources
TheMany-BodySchrödingerequation
H (r1, ..., rN ) = E (r1, ..., rN )0
@�1
2
X
i
�i +X
i
V (ri) +X
i 6=j
1
|ri � rj |
1
A (r1, ..., rN ) = E (r1, ..., rN )
(r1, ..., rN ) ! { 1(r), 2(r), ..., N (r)}
Forinstance,many-bodyWFstoragedemandsareprohibitive:
Asolution:maponto“one-electron”theory:
5electronsona10×10×10grid~10PetaBytes !
(r1, ..., rN ) (#grid points)N
Hohenberg-Kohn-ShamDFT
E[⇢] = Ts[{ i[⇢]}] + EH [⇢] + Exc[⇢] + EZ [⇢] + U [Z]
(r1, ..., rN ) =NY
i
i(ri)
⇢(r) =NX
i
| i(r)|2
⇣�1
2�+ VZ(r) + VH [⇢](r) + Vxc[⇢](r)
⌘ i(r) = ✏i i(r)
(r1, ..., rN ) ! { 1(r), 2(r), ..., N (r)}
Maponto“one-electron”theory:
Totalenergyisafunctionalofthedensity:
Thedensityiscomputedusingtheone-electronorbitals:
Theone-electronorbitalsarethesolutionsoftheKohn-Shamequation:
BUT:Exc[⇢] =??? Vxc[⇢](r) =???
Exchange-Correlation
Exc[⇢] =??? Vxc[⇢](r) =???
• Exchange-Correlationfunctionals aremodeledontheuniform-electron-gas(UEG):Thecorrelationenergy(andpotential)hasbeencalculatedbymeansofMonte-Carlomethodsforawiderangeofdensities,andhasbeenparametrized toyieldadensityfunctional.
• LDA:wesimplypretendthataninhomogeneouselectronicdensitylocallybehaveslikeahomogeneouselectrongas.
• Many,many,manydifferentfunctionals available:LDA,GGA,meta-GGA,van-der-Waalsfunctionals,etc etc
AnN-electronsystem:N=O(1023)
N ⇥ (#grid points)(#grid points)N
(r1, ..., rN ) ! { 1(r), 2(r), ..., N (r)}
Hohenberg-Kohn-ShamDFTtakesusalongway:
Niceforatomsandmolecules,butinarealisticpieceofsolidstatematerialN=O(1023)!
Translationalinvariance:PeriodicBoundaryConditions
nk(r+R) = nk(r)eikR
nk(r) = unk(r)eikr
unk(r+R) = unk(r)
Translationalinvarianceimplies:
and
AllstatesarelabeledbyBlochvector k andthebandindex n:
• TheBlochvectork isusuallyconstrainedtoliewithinthefirstBrillouin zoneofthereciprocalspacelattice.
• Thebandindexnisoftheorderifthenumberofelectronsperunitcell.
Reciprocalspace&thefirstBrillouin zone
b2b1
b3
z
yx
C
b2 b1
b3
z
yx
B
4pi/L
a2a1
a3 y
Lz
x
A
b1 =2⇡
⌦a2 ⇥ a3 b2 =
2⇡
⌦a3 ⇥ a1 b3 =
2⇡
⌦a1 ⇥ a2
⌦ = a1 · a2 ⇥ a3 ai · bj = 2⇡�ij
Samplingthe1st BZ
⇢(r) =1
⌦BZ
X
n
Z
BZfnk| nk(r)|2dk
⇢(r) =X
nk
wkfnk| nk(r)|2dk,
Theevaluationofmanykeyquantitiesinvolvesanintegraloverthe1st BZ.Forinstancethechargedensity:
WeexploitthefactthattheorbitalsatBlochvectorsk thatareclosetogetherarealmostidenticalandapproximatetheintegraloverthe1st BZbyaweightedsumoveradiscretesetofk-points:
Fazit:theintractabletaskofdeterminingwithN=1023,hasbeenreducedtocalculatingatadiscretesetofk-pointsinthe1st BZ,foranumberofbandsthatisoftheorderifthenumberofelectronsintheunitcell.
(r1, ..., rN ) nk(r)
E[⇢, {R, Z}] = Ts[{ nk[⇢]}] + EH [⇢, {R, Z}] + Exc[⇢] + U({R, Z})
Ts[{ nk[⇢]}] =X
nk
wkfnkh nk|�1
2�| nki
EH[⇢, {R, Z}] = 1
2
ZZ⇢eZ(r)⇢eZ(r0)
|r� r0| dr0dr
⇢eZ(r) = ⇢(r) +X
i
Zi�(r�Ri) ⇢(r) =X
nk
wkfnk| nk(r)|2dk
⇣�1
2�+ VH [⇢eZ ](r) + Vxc[⇢](r)
⌘ nk(r) = ✏nk nk(r)
VH [⇢eZ ](r) =
Z⇢eZ(r0)
|r� r0|dr0
Thetotalenergy
Thekineticenergy
TheHartree energy
where
TheKohn-Shamequations
TheHartree potential
Aplanewavebasisset
unk(r+R) = unk(r) nk(r) = unk(r)eikr
unk(r) =1
⌦1/2
X
G
CGnkeiGr nk(r) =
1
⌦1/2
X
G
CGnkei(G+k)r
⇢(r) =X
G
⇢GeiGr
1
2|G+ k|2 < Ecuto↵
V (r) =X
G
VGeiGr
Allcell-periodicfunctionsareexpandedinplanewaves(Fourieranalysis):
Thebasissetincludesallplanewavesforwhich
Crnk =X
G
CGnkeiGr FFT �! CGnk =
1
NFFT
X
r
Crnke�iGr
TransformationbymeansofFFTbetween“real”spaceand“reciprocal”space:
τ1
τ1 π / τ 1
τ2
b1
b2
real space reciprocal spaceFFT
0 1 2 3 0
1 1 x = n / N 1 1 g = n 2
N−1 −1−2−3−4 0 1 2 3 4 55
N/2−N/2+1
cutG
Crnk =X
G
CGnkeiGr FFT �! CGnk =
1
NFFT
X
r
Crnke�iGr
Thechargedensity
ψrψG ψr
ρ rρG
2 Gcut
Gcut
FFT
FFT
TheSelf-Consistency-Cycle(cont.)
Twosub-problems:
• OptimizationofIterativeDiagonalizatione.g. RMM-DIISorBlockedDavidson
• ConstructionofDensityMixinge.g. Broyden mixer
{ n}
⇢in
Theself-Consistency-Cycle
H = hG|H[⇢]|G0i ! diagonalize H ! { i, ✏i} i = 1, .., NFFT
⇢0 ! H0 ! ⇢0 ! ⇢1 = f(⇢0, ⇢0) ! H1 ! ...
⇢ = ⇢0
Anaïvealgorithm:expresstheHamiltonmatrixinaplanewavebasisanddiagonalize it:
Self-consistency-cycle:
Iterateuntil:
BUT: wedonotneedNFFT eigenvectorsoftheHamiltonian(atacostofO(NFFT3)).
ActuallyweonlytheNb lowesteigenstates ofH,whereNb isoftheorderofthenumberofelectronsperunitcell(Nb <<NFFT).
Solution:useiterativematrixdiagonalization techniquestofindtheNb lowestEigenvectoroftheHamiltonian:RMM-DIIS,blocked-Davidson,etc.
Keyingredients:SubspacediagonalizationandtheResidual
X
m
HnmBmk =X
m
✏appk SnmBmk
Hnm = h n|H| mi Snm = h n|S| mi
| ki =X
m
Bmk| mi
|R( n)i = (H � ✏appS)| ni
• Rayleigh-Ritz:diagonalization oftheNb xNb subspace
with
yieldsNb eigenvectorswitheigenvaluesεapp.
Theseeigenstates arethebestapproximationtotheexactNb lowesteigenstates ofH withinthesubspacespannedbythecurrentorbitals.
• TheResidual:
✏app =h n|H| nih n|S| ni
(itsnormismeasurefortheerrorintheeigenvector)
Blocked-Davidson• Takeasubsetofallbands: { n|n = 1, .., N} ) { 1
k|k = 1, .., n1}
{ 1k/g
1k = K(H� ✏appS)
1k|k = 1, .., n1}
{ 2k|k = 1, .., n1}
• Extendthissubsetbyaddingthe(preconditioned)residualvectorstothepresentlyconsideredsubspace:
• Rayleigh-Ritzoptimization(“sub-spacerotation”)inthe2n1 dimensionalsubspacetodeterminethen1 lowesteigenvectors:
diag{ 1k/g
1k}
• Extendsubspacewiththeresidualsof { 2k}
{ 1k/g
1k/g
2k = K(H� ✏appS)
2k|k = 1, .., n1}
• Rayleigh-Ritzoptimization ) { 3k|k = 1, .., n1}
• Etc …
{ mk |k = 1, .., n1} { n|n = 1, .., n1}
• Theoptimizedsetreplacestheoriginalsubset:
• Continuewithnextsubset:,etc,…{ 1k|k = n1 + 1, .., n2}
Aftertreatingallbands:Rayleigh-Ritzoptimizationof { n|n = 1, .., N}
TheactionoftheHamiltonianH| nki ✓
�1
2�+ V (r)
◆ nk(r)
hr|G+ ki = 1
⌦1/2ei(G+k)r �! hG+ k| nki = CGnk
NNPLWhG+ k|� 1
2�| nki =
1
2|G+ k|2CGnk
hG+ k|V | nki =1
NFFT
X
r
VrCrnke�iGr NFFT logNFFT
Theaction
Usingtheconvention
• Kineticenergy:
• Localpotential:• Exchange-correlation:easilyobtainedinrealspace• FFTtoreciprocalspace• Hartree potential:solvePoissoneq.inreciprocalspace• Addallcontributions• FFTtorealspaceTheaction
V = VH[⇢] + Vxc[⇢] + Vext
Vxc,r = Vxc[⇢r]
VH,G =4⇡
|G|2 ⇢GVG = VH,G + Vxc,G + Vext,G
{VG} �! {Vr}
{Vxc,r} �! {Vxc,G}
SolvingtheKSequations
• FFTsextensivelyusedtoevaluate
• Weactuallyuseamixedbasisset(Projector-Augmented-Waves):
whichinvolvesprojectionofthepseudo-wavefunctionsonlocalprojectionoperators(DGEMM).
• Oneneedstokeepthesolutions(bands)ateachk-pointorthonormal:essentiallydonebyaCholeski decomposition(LU)oftheoverlapmatrix,followedbyaninversionofU(LAPACK/scaLAPACK)andatransformationbetweenthewavefunction(ZGEMM).
• DiagonalizationoftheHamiltonianinthesubspaceofthecurrentwavefunctions(LAPACK/scaLAPACK orELPA),followedbyaunitarytransformationbetweenthewavefunction(ZGEMM).
⇣�1
2�+ Vext(r) + VH(r) + Vxc(r)
⌘ nk(r) = ✏nk nk(r)
H| ni
| ni = | e ni+X
i
(|�ii � |e�ii)hepi| e ni
AtypicalworkloadH| ni• Action:
FFT( n)
V (r) n(r)
N lnN
N
8 n
8 n N2
N2 lnN fftw
hpi| ni N 8 i, nN3
const. N2 (realsp.) BLAS3(DGEMM)
• Subspacerotation:
Hnm = h n|H| mi 8 n,m N3 BLAS3(ZGEMM)
diag(H) N3 (sca)LAPACK
| 0
ni =X
m
Unm| mi 8 n N3 BLAS3(ZGEMM)
LREAL=A
Scalingwithsystemsize(N)
Self-ConsistencyCycle(SCC):RMM-DIISrunningon8IntelX5550quadcore procs.(total:32x2.66GHzcores)
Distributionofworkanddata
Distributeworkanddata“over-orbitals”• Default• NCORE=1
(orequivalently:NPAR=#-of-MPI-ranks)• KPAR=1
⇣�1
2�+ Vext(r) + VH(r) + Vxc(r)
⌘ nk(r) = ✏nk nk(r)
TheKohn-Shamequation:
• Orbitalindexn
Distributionofworkanddata
Distributeworkanddata“over-orbitals”• Default• NCORE=1
(orequivalently:NPAR=#-of-MPI-ranks)• KPAR=1
Distributionworkanddata“over-plane-waves”• NCORE=#-of-MPI-ranks
(orequivalently:NPAR=1)• KPAR=1
Distributionofworkanddata
Combinationsof”over-orbitals”and”over-plane-wave”distributionsareallowedaswell
DistributionofworkanddataAdditionallyworkmaybedistributed”over-k-points”• KPAR=n (n>1)• m =(#-of-MPI-ranks/n)mustbeaninteger• Workisdistributedinaround-robinfashionovergroupsofmMPI-ranks• Dataisduplicated!
⇣�1
2�+ Vext(r) + VH(r) + Vxc(r)
⌘ nk(r) = ✏nk nk(r)
• Orbitalindexn,k-pointindexk
All-2-Allcommunication
ParallelFFT:Ball-2-Cube
(A) (B) (C) (D)
G2G
4G
4G
4G
G
4G
FFT
FFT
FFT
z
x
y
In-houseparallelBall-2-CubeFFT:• Less1D-FFTs(reduction:≈ 1.76 ×)• BUT:communicationfrom(B)→(C)and(C)→(D)• ForsmalltomediumsizedFFTgridsahighlyoptimized3D-FFT
(e.g.fromIntel’smkl)isequallyfast
HardwareconsiderationsTypicalconfiguration:
• N interconnectednodes• 2packages/node• M cores/package
Distribution“over-plane-waves”:MPI-ranksthatshareanorbitalshouldresideonthesamenode(betterevenonthesamepackage).
• NCORE=n ≤ 2M• (2M/ n)shouldbeaninteger• Typically:n =M orn =M/2
Distribution“over-k-points”:• KPAR=n ≤ #-of-k-points(NKPTS)• (#-of-MPI-ranks / n)shouldbeaninteger• Ifmemoryallows:KPAR=NKPTS
DefaultplacementofMPI-ranksonthenodes/packages/coresdependsontheparticularsoftheMPIimplementationanditsconfiguration!
Forinstance:• 2 interconnectednodes• 2packages/node• 4cores/package
DefaultplacementofMPI-ranksonthenodes/packages/coresdependsontheparticularsoftheMPIimplementationanditsconfiguration!
Forinstance:• 2 interconnectednodes• 2packages/node• 4cores/package
Good:PlacesubsequentMPI-ranksclosetogether,i.e.,firstonsubsequentcoresofthesamepackage,thenmovingonthesecondpackageofthesamenode,beforestartingtofillthenextnode.
Forinstance:• 2 interconnectednodes• 2packages/node• 4cores/package
Bad:DistributesubsequentMPI-ranksinaround-robinfashionoverthepackages.
Forinstance:• 2 interconnectednodes• 2packages/node• 4cores/package
Worse:DistributesubsequentMPI-ranksinaround-robinfashionoverthenodes.
• N interconnectednodes• 2packages/node• M cores/package• 2NMMPI-ranks
Distribution“over-plane-waves”:MPI-ranksthatshareanorbitalshouldresideonthesamenode(betterevenonthesamepackage).
• NCORE=n ≤ 2M• (2M/ n)shouldbeaninteger• Typically:n =M orn =M/2
Distribution“over-k-points”:• KPAR=n ≤ #-of-k-points(NKPTS)• (#-of-MPI-ranks / n)shouldbeaninteger• Ifmemoryallows:KPAR=NKPTS
Distribution“over-orbitals”:• Thenumberoforbitals(NBANDS)issuch
thatNBANDS/(2NM /NCORE/KPAR)isanintegernumber
⇒ Increasing#-of-MPI-ranksmayleadtoanunnecessarilylargeNBANDS(i.e.,adding“empty”orbitals)
• SomealgorithmsconvergefasterwheneachMPIrankowns(partof)afeworbitals(e.g. blocked-Davidson)
• Generallyspeaking:havinglotsofMPIranksandverylittlework/dataperrankisneveragoodideasince(all-2-all)communicationbecomesunreasonablyexpensive.
NCORE,KPAR,andthe#-of-MPI-ranks
Strong/Weakscaling(SiN)
ScalingunderMPI(onaCrayXC-40)
CourtesyofP.Saxe,MaterialsDesignInc.(andCray).
• 5713e-• 2k-points• RMM-DIIS• PBE• KPAR=2
ScalingunderMPI(onaCrayXC-40)
• 864e-• 8k-points• HSE06• Davidson• KPAR=8
CourtesyofP.Saxe,MaterialsDesignInc.(andCray).
Largesystems
hpi| ni N 8 i, nN3
const. N2 (realsp.) BLAS3(DGEMM)LREAL=A
• Usereal-spacePAWprojectors(insteadofreciprocalspaceprojectors)
• Ifyoucanlimitk-pointsamplingtotheGamma(𝚪 ≡ 𝐤 = 𝟎)point:usethe“gamma-only”versionofVASP
Inthe”gamma-only”versionofVASPtheorbitalsarestoredasrealquantitiesinreal-space:• real-2-complexFFTs• DGEMMsinsteadofZGEMMs
Compilation
• FastFFTs:theFFTsfromIntel’smkl-libraryseemtobeunbeatablyfast…• scaLAPACK forlargesystems• AcompilerthateffectivelygeneratesAVX2instructionsandlibraries(e.g. BLAS)
thatareoptimizedforAVX2(upto20%performancegain)
TheEnd
Thankyou!