Performance optimization of WEST and Qbox on Intel Knights Landing

Huihuo Zheng (1), Christopher Knight (1), Giulia Galli (1,2), Marco Govoni (1,2), and Francois Gygi (3)

(1) Argonne National Laboratory; (2) University of Chicago; (3) University of California, Davis

September 27th, 2017, IXPUG Annual Fall Conference 2017



Page 2: ALCF Theta Early Science Project: First-Principles Simulations of Functional Materials for Energy Conversion

http://qboxcode.org/; http://west-code.org; http://www.quantum-espresso.org/

M. Govoni, G. Galli, J. Chem. Theory Comput. 2015, 11, 2680−2696

P. Giannozzi et al., J. Phys.: Condens. Matter 21, 395502 (2009)

• Embedded nanocrystal: T. Li, Phys. Rev. Lett. 107, 206805 (2011)

• Matter at extreme conditions: D. Pan et al., Proc. Natl. Acad. Sci. 110, 6250 (2013)

• Aqueous solution: A. Gaiduk et al., J. Am. Chem. Soc. Comm. (2016)

• Organic photovoltaics: M. Goldey, Phys. Chem. Chem. Phys., Advance Article (2016)

• Quantum information: H. Seo, Sci. Rep. 2016; 6: 20803

Page 3: Optimization focus

[Figure: time-to-solution versus number of processors, with the strong scaling limit marked.]

• Utilizing tuned math libraries (FFTW, MKL, ELPA, …)

• Vectorization: AVX512

• High Bandwidth Memory

• Adding extra layers of parallelization -> increase intrinsic scaling limit

• Reducing communication overhead to reach the intrinsic limit

Theta: 3,624 KNL nodes, 9.65 petaFLOPS

Page 4: Outline

• WEST: additional layers of parallelization

    • Band parallelization of the Sternheimer equation

    • Task group parallelization to fit 3D FFTs within a single KNL node, to reduce communication overheads and take advantage of HBM

• Qbox: reduce communication overheads of dense linear algebra with on-the-fly data redistribution

    • Gather & scatter remap

    • Transpose remap

• Conclusions and insights

Page 5: Optoelectronic calculations using many-body perturbation theory (GW)

Linear response theory: $\Delta\rho = \chi\,\Delta V_{\mathrm{pert}}$, where $\Delta\rho$ labels the electronic density, $\chi$ the response function, and $\Delta V_{\mathrm{pert}}$ the perturbation potential.

Massively parallel by distributing perturbations. Each perturbation $\Delta V_i$ yields its own $\Delta\rho_i$, computed with 3D FFTs:

$\Delta V_1, \Delta V_2, \Delta V_3, \Delta V_4, \ldots, \Delta V_{N-1}, \Delta V_N \;\rightarrow\; \Delta\rho_1, \Delta\rho_2, \Delta\rho_3, \Delta\rho_4, \ldots, \Delta\rho_{N-1}, \Delta\rho_N$

• Parallelization scheme: image groups & plane waves

• Intrinsic strong scaling limit: $n_{\mathrm{proc}} \sim N_{\mathrm{pert}} \times N_z$

• Strong scaling is efficient up to the intrinsic limit, then stops.

Test system: Si35H36, 176 electrons, 256 perturbations
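The limit above can be illustrated with a small sketch. It assumes a round-robin assignment of perturbations to image groups, which is an illustrative choice and not necessarily what WEST does. `N_Z` is a hypothetical value, since the slides do not give it, and the function names are made up for this example.

```python
def distribute_perturbations(n_pert, n_image):
    """Assign perturbation indices 0..n_pert-1 to image groups round-robin."""
    groups = [[] for _ in range(n_image)]
    for p in range(n_pert):
        groups[p % n_image].append(p)
    return groups


def intrinsic_limit(n_pert, n_z):
    """Largest useful process count when only perturbations and N_z are split."""
    return n_pert * n_z


N_PERT = 256   # from the slide (Si35H36 test case)
N_Z = 64       # hypothetical value, not given in the slides

# 32 image groups: every group gets the same share of the work.
balanced = distribute_perturbations(N_PERT, 32)
loads = [len(g) for g in balanced]

# 512 image groups: more groups than perturbations, so half of them get nothing.
oversubscribed = distribute_perturbations(N_PERT, 512)
idle_groups = sum(1 for g in oversubscribed if not g)

print(loads[0], idle_groups, intrinsic_limit(N_PERT, N_Z))
```

Once the number of image groups exceeds the number of perturbations, the extra groups have no work, which is why scaling stops at the intrinsic limit unless another parallel dimension is added.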

Page 6: Single perturbation runtime (4 BG/Q vs 1 KNL)

• 80% of runtime is spent in external libraries

• 3.7x speedup from BG/Q (ESSL) to KNL (MKL)

• High-bandwidth memory on Theta is critical for performance (e.g. 3D FFTs): 3.1x speedup

[Figure: runtime breakdown for CdSe, 884 electrons; labels: BG/Q ESSL, cache mode.]

Page 7: Improvement of strong scaling by band parallelization of the Sternheimer equation

• Increased parallelism by arranging the MPI ranks in a 3D grid (perturbations & bands & FFT)

• New intrinsic strong scaling limit: $n_{\mathrm{proc}} = N_{\mathrm{pert}} \times N_{\mathrm{band}} \times N_z$

• Cost: AllReduce once across band groups (relatively cheap)

[Figure: strong scaling for Si35H36, 176 electrons, 256 perturbations; labels: image para., band para., Sternheimer equation, call hpsi.]
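A 3D arrangement of ranks (perturbations, bands, FFT) can be sketched as plain index arithmetic. The ordering below, with the FFT index varying fastest, is an assumption made for illustration, and the function names are hypothetical. In an MPI code the same grouping would typically be built with communicator splitting (`MPI_Comm_split`).

```python
def rank_to_coords(rank, n_image, n_band, n_fft):
    """Map a flat rank to (image, band, fft) coordinates, FFT index fastest."""
    if not 0 <= rank < n_image * n_band * n_fft:
        raise ValueError("rank outside the process grid")
    image, rest = divmod(rank, n_band * n_fft)
    band, fft = divmod(rest, n_fft)
    return image, band, fft


def band_group_peers(rank, n_image, n_band, n_fft):
    """Ranks sharing this rank's image and fft index: the span of one Allreduce."""
    image, _, fft = rank_to_coords(rank, n_image, n_band, n_fft)
    return [(image * n_band + b) * n_fft + fft for b in range(n_band)]


# Toy grid: 4 image groups x 2 band groups x 8 FFT ranks = 64 ranks.
coords = rank_to_coords(13, n_image=4, n_band=2, n_fft=8)
peers = band_group_peers(13, n_image=4, n_band=2, n_fft=8)
print(coords, peers)
```

With this ordering, the ranks that share one FFT are contiguous, and the reduction across band groups involves only `n_band` ranks per call, which is consistent with the slide's remark that the added cost is relatively cheap.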

Page 8: Improving strong scaling of 3D FFTs using task group parallelization when a band group spreads across multiple nodes

Small 3D FFTs do not scale well across multiple KNL nodes because of internode communication overheads relative to shared-memory MPI. Task groups (tg) redistribute complete wave functions to separate nodes to simultaneously compute multiple 3D FFTs.

[Figure: strong scaling of 3D FFT (plane and pencil decomposition) on Cetus (BG/Q) and Theta (KNL) using a 256x256x256 FFT grid and FFTXlib in QE. Annotations: cost for data redistribution; MPI_Alltoall within a single KNL node.]
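The redistribution idea behind task groups can be mimicked with plain lists. This is a toy model under stated assumptions, not the FFTXlib implementation: every rank starts with one slice of every band, and after regrouping each task group holds complete bands, so several FFTs could proceed at once. The function names and the band-to-group assignment (`b % n_tg`) are hypothetical.

```python
def initial_layout(bands, n_ranks):
    """Each rank holds one contiguous slice of every band."""
    size = len(bands[0]) // n_ranks
    return [[band[r * size:(r + 1) * size] for band in bands]
            for r in range(n_ranks)]


def to_task_groups(layout, n_tg):
    """Regroup so task group t holds the complete bands t, t + n_tg, ..."""
    n_ranks = len(layout)
    n_bands = len(layout[0])
    groups = []
    for t in range(n_tg):
        owned = {}
        for b in range(t, n_bands, n_tg):
            # Collect the slices of band b from every rank (the all-to-all step).
            owned[b] = [x for r in range(n_ranks) for x in layout[r][b]]
        groups.append(owned)
    return groups


# 4 bands of 8 coefficients each, spread over 4 ranks, regrouped into 2 task groups.
bands = [[100 * b + k for k in range(8)] for b in range(4)]
layout = initial_layout(bands, n_ranks=4)
groups = to_task_groups(layout, n_tg=2)
print(sorted(groups[0]), sorted(groups[1]))
```

In the toy model each task group ends up with whole bands, which is the property the slide relies on: an FFT on a complete wave function can then stay within one node.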


Page 10: Strong scaling of Qbox for hybrid-DFT calculations

Test system: SiC512, 512 atoms, 2048 electrons, PBE0

F. Gygi and I. Duchemin, J. Chem. Theory Comput., 2013, 9 (1), pp 582–587

[Figure: strong scaling plot; labels: dgemm; dgemm, Gram-Schmidt (syrk, potrf, trsm); exact exchange 3D FFTs.]

Page 11: Data layout: block distribution of wave functions to 2D process grid

• Wave function coefficients: $\psi_i(k)$, $i = 1, 2, \ldots, N_{\mathrm{band}}$, $k = 0, 1, \ldots, n_{pw} - 1$

• Process grid: $nprow$ process rows (index $k$) by $npcol$ process columns (index $i$)

• SiC512: 140,288 × 1,024

• Good for 3D FFTs

• Intrinsic limit: $n_{\mathrm{proc}} = N_{\mathrm{band}} N_z$

[Figure: wave function matrix of 140,288 rows by 1,024 columns on the process grid, with an FFT label on each process column; communication labels: MPI_Alltoallv, MPI_Allreduce.]
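A block distribution of the 140,288 × 1,024 matrix can be worked out with a few lines of index arithmetic. The 128 × 1,024 process grid used here is a hypothetical choice (its product happens to equal the 131,072 ranks quoted later in the talk); the slides do not state the actual grid, and the helper names are made up for this sketch.

```python
def block_range(n_global, n_procs, p):
    """Half-open index range owned by process p under a block distribution."""
    block = -(-n_global // n_procs)          # ceiling division
    start = min(p * block, n_global)
    stop = min(start + block, n_global)
    return start, stop


def local_shape(n_rows, n_cols, nprow, npcol, prow, pcol):
    """Shape of the local block held by process (prow, pcol)."""
    r0, r1 = block_range(n_rows, nprow, prow)
    c0, c1 = block_range(n_cols, npcol, pcol)
    return r1 - r0, c1 - c0


N_PW, N_BAND = 140_288, 1_024     # matrix size from the slide (SiC512)
NPROW, NPCOL = 128, 1_024         # hypothetical process grid

first = local_shape(N_PW, N_BAND, NPROW, NPCOL, 0, 0)
last = local_shape(N_PW, N_BAND, NPROW, NPCOL, NPROW - 1, NPCOL - 1)
print(first, last)
```

Under this assumed grid every process holds a 1,096 × 1 block, that is, a slice of a single band. That layout suits per-band 3D FFTs, but it leaves the dense linear algebra operating on very thin local blocks.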

Page 12: Poor scalability of ScaLAPACK for tall-skinny matrices and small square matrices due to communication overheads

[Figure: the overlap matrix (small square matrix) formed as a product of wave function matrices (tall-skinny matrices); kernels: Gram-Schmidt, d(z)gemm.]

Page 13: Reducing communication overheads from ScaLAPACK with “gather & scatter” remap

Solution: creating a context with fewer columns and on-the-fly data redistribution

• Compute 3D FFTs on original grid

• Gather data to smaller grid

• Run ScaLAPACK on smaller grid

• Scatter data back to original grid

The remap communication pattern only involves procs within the same row or column.

Key: remap communication time needs to be small.

[Figures: schematic of gather & scatter remap, where gray processes are idle during ScaLAPACK computation; gather & scatter communication pattern.]
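The four steps above can be sketched for a single process row using plain lists. This is a minimal model under stated assumptions: each process column owns an equally sized block, neighbouring columns are merged onto one active process, and the names `gather` and `scatter` are illustrative, not Qbox functions.

```python
def gather(row_blocks, factor):
    """Merge each run of `factor` neighbouring process columns into one block."""
    if len(row_blocks) % factor:
        raise ValueError("factor must divide the number of process columns")
    gathered = []
    for start in range(0, len(row_blocks), factor):
        merged = []
        for block in row_blocks[start:start + factor]:
            merged.extend(block)
        gathered.append(merged)
    return gathered


def scatter(gathered, factor):
    """Inverse of gather: split each merged block into `factor` equal parts."""
    row_blocks = []
    for merged in gathered:
        size = len(merged) // factor
        for i in range(factor):
            row_blocks.append(merged[i * size:(i + 1) * size])
    return row_blocks


# One process row with 8 process columns, two local entries per column.
row_blocks = [[2 * c, 2 * c + 1] for c in range(8)]
gathered = gather(row_blocks, factor=4)        # 2 active columns remain
idle = len(row_blocks) - len(gathered)         # 6 columns idle during the solve
restored = scatter(gathered, factor=4)
print(len(gathered), idle, restored == row_blocks)
```

The round trip restores the original layout, and the count of idle columns makes the trade-off visible: the dense kernels see fatter local blocks, but most processes wait while they run.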

Page 14: Improvement of strong scaling using “gather & scatter” remap

• With remap, the hpsi + wf_update time remains minimal and relatively flat, and the remap time (custom) is two orders of magnitude smaller than the hpsi + wf_update time.

• The custom remap function is 1000x faster than ScaLAPACK’s pdgemr2d.

• Improvement of Qbox’s strong scaling after optimizations: runtime improves from ~400 to ~30 s per SCF iteration (13x speedup) on 131,072 ranks for 2048 electrons.

[Figure: strong scaling before and after optimizations; runtimes marked: 400.8 s, 67.5 s, 30.5 s.]

Page 15: Reducing communication overheads from ScaLAPACK by “transpose” remap

Problem of “gather & scatter”: idle processes. How can they be used? Assign the idle processes to active columns.

Transpose remap:

• Perform 3D FFTs in the original context.

• Transfer data through a series of local regional transposes.

• Run ScaLAPACK in the new context.

Key concept for remap: creating different contexts that are optimal for different kernels, and redistributing the data on-the-fly.

Remap methods compared:

(1) $npcol' = npcol / 8$, $nprow' = nprow$

(2) $npcol' = npcol / 8$, $nprow' = 8 \times nprow$

[Figures: transpose communication pattern; process rearrangement and data movement of transpose remap; improvement of runtime by remap methods (labels: 1.0, 2.5x).]
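The process rearrangement of option (2) can be sketched as a coordinate mapping that stacks every 8 neighbouring process columns as extra rows. The specific mapping below is an assumption for illustration, since the slide's schematic is not recoverable from the transcript. The 128 × 1,024 starting grid is likewise hypothetical, and the matrix size is the SiC512 case from the data layout slide.

```python
def new_grid(nprow, npcol, factor):
    """Grid shape after the remap: factor times more rows, factor times fewer columns."""
    if npcol % factor:
        raise ValueError("factor must divide npcol")
    return nprow * factor, npcol // factor


def remap_coords(prow, pcol, factor):
    """New (row, col) of a process when `factor` neighbouring columns become rows."""
    new_col, offset = divmod(pcol, factor)
    return prow * factor + offset, new_col


def local_shape(n_rows, n_cols, nprow, npcol):
    """Local block shape, assuming the grid divides the matrix evenly."""
    return n_rows // nprow, n_cols // npcol


NPROW, NPCOL, FACTOR = 128, 1_024, 8           # hypothetical original grid
grid = new_grid(NPROW, NPCOL, FACTOR)
before = local_shape(140_288, 1_024, NPROW, NPCOL)
after = local_shape(140_288, 1_024, *grid)

# Every process keeps a unique position, so no process is left idle.
targets = {remap_coords(r, c, FACTOR) for r in range(NPROW) for c in range(NPCOL)}
print(grid, before, after, len(targets))
```

Under these assumptions each process still holds 1,096 elements, but the local block changes from 1,096 × 1 to 137 × 8, and all 131,072 processes remain active, in contrast with the gather & scatter variant.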

Page 16: Conclusion and Insights

• Band parallelization reduces the internode communication overhead and improves strong scaling of WEST up to $N_{\mathrm{FFT}} N_{\mathrm{pert}} N_{\mathrm{band}}$ cores.

• Optimal remapping of data for matrix operations in Qbox reduces ScaLAPACK communication overhead at large scale, and makes hybrid-DFT calculations scale to $N_{\mathrm{FFT}} N_{\mathrm{band}}$ cores.

• Given the increased computational performance relative to network bandwidths, it is crucial to reduce and/or hide inter-node communication costs.

• Guiding principles for developing codes on many-core architectures:

    1) Parallelizing independent, fine-grain units of work, reducing inter-node communication, and maximizing utilization of on-node resources.

    2) Optimizing communication patterns for performance-critical kernels with on-the-fly data redistribution and process reconfiguration.

Page 17: Acknowledgement

• This research is part of the Theta Early Science Project at the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility under Contract DE-AC02-06CH11357.

• This work was supported by MICCoM, as part of the Comp. Mats. Sci. Program funded by the U.S. DOE, Office of Science, BES, MSE Division.