
Page 1:

Parallel Transport Time Dependent Density Functional Theory Calculations with Hybrid

Functional on Summit

Weile Jia¹, Lin-Wang Wang², Lin Lin¹,²

1. University of California, Berkeley

2. Lawrence Berkeley National Lab

SC'19, Denver, Nov. 21, 2019

Page 2:

Outline

• Motivation

• rt-TDDFT algorithm and parallelization

• Implementation details and results

• What is next?

Page 3:

Motivation

• electronic ultrafast phenomena

• ion collision

• light absorption spectrum

• laser-induced demagnetization and phase change

• charge transfer

• excited carrier dynamics

• chemical reactions

• …..

Page 4:

Real-time TDDFT

• A time-dependent many-electron system, starting from an initial state Ψ(0), is determined by the one-body time-dependent density alone (Runge and Gross, 1984).

$i\,\partial_t \psi_i(t) = H(P(t), t)\,\psi_i(t)$

$P(t) = \Psi(t)\,\Psi^*(t)$

Explicit RK-4 method for rt-TDDFT
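As a point of reference, here is a minimal numpy sketch of one explicit RK-4 step for the TDKS equation above; `apply_h` is a hypothetical callback that applies the density-dependent Hamiltonian to the orbitals, not the actual PWDFT interface.

```python
import numpy as np

def rk4_step(psi, t, dt, apply_h):
    """One explicit RK-4 step of i d(psi)/dt = H(P(t), t) psi.

    psi     : (Ng, Ne) complex array of orbitals
    apply_h : hypothetical callback (psi, tau) -> H(tau) @ psi that rebuilds
              the density-dependent Hamiltonian internally
    """
    f = lambda p, tau: -1j * apply_h(p, tau)
    k1 = f(psi, t)
    k2 = f(psi + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = f(psi + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = f(psi + dt * k3, t + dt)
    return psi + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
```

Each step needs four Hamiltonian applications, and the small stability-limited time step is what Problem No. I below is about.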

Page 5:

State of the art

• SALMON: https://salmon-tddft.jp/

• OCTOPUS: https://gitlab.com/octopus-code/octopus

Pros: complexity O(N²), easy to parallelize

Mainly use explicit time integrators

Page 6:

Problem No. I

• Gómez Pueyo, Adrián, Miguel A. L. Marques, Angel Rubio, and Alberto Castro. "Propagators for the Time-Dependent Kohn–Sham Equations: Multistep, Runge–Kutta, Exponential Runge–Kutta, and Commutator Free Magnus Methods." Journal of Chemical Theory and Computation 14, no. 6 (2018): 3040–3052.

• Rehn, D. A., Shen, Y., Buchholz, M. E., Dubey, M., Namburu, R., & Reed, E. J. (2019). ODE integration schemes for plane-wave real-time time-dependent density functional theory. The Journal of Chemical Physics, 150(1), 014101.

Time step too small!

$\Delta t < \|H\|^{-1} \sim 1$ attosecond

Total time: 10-100 fs

Number of steps: ~ 10,000
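Spelling out the step count implied by these numbers (my arithmetic, using the time step and simulation lengths quoted above):

```latex
N_{\text{steps}} \;=\; \frac{T_{\text{total}}}{\Delta t}
\;\approx\; \frac{10\text{--}100~\mathrm{fs}}{1~\mathrm{as}}
\;=\; 10^{4}\text{--}10^{5}.
```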

Page 7:

Problem No. II

• Accuracy

• PBE/LDA relatively cheap, but not accurate enough

• Hybrid functional: accurate, but too expensive

Literature: hybrid functional rt-TDDFT has been reported for ~8-atom systems

Computational complexity: $O(N_e^2\, N_g \log N_g)$

$N_g \sim 10^5$, $N_e \sim N_{\text{atom}}$

100 atoms require ~20,000 FFTs; 1000 atoms require ~2,000,000 FFTs
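One way to reproduce these FFT counts (my arithmetic, assuming one forward and one inverse FFT per orbital pair in the exchange operator):

```latex
N_{\text{FFT}} \;\approx\; 2\,N_e^{2}, \qquad N_e \sim N_{\text{atom}}
\;\;\Rightarrow\;\;
2\times 100^{2} = 2\times 10^{4}, \qquad
2\times 1000^{2} = 2\times 10^{6}.
```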

Page 8:

Parallel transport gauge formulation

• $P(t) = \Psi(t)\Psi^*(t)$ oscillates much more slowly

• von Neumann equation: $i\partial_t P(t) = [H(P(t),t),\, P(t)]$

Black line: oscillation of the real part of the wavefunction $\psi(t, r_0)$; green line: the optimal (parallel transport) gauge $\varphi(t, r_0)$

• $\psi(t)$ oscillates rapidly

• $\Phi(t) = \Psi(t)U(t)$, where $U(t)$ is a unitary matrix

• Parallel transport governing equation: $i\partial_t \Phi(t) = H\Phi(t) - \Phi(t)\,\bigl(\Phi^*(t) H \Phi(t)\bigr)$

Page 9:

Crank–Nicolson implicit time integrator (PWDFT)

• Preconditioned Anderson mixing method

• Time step: 10-50 attosecond

• ~ 5-20x speedup for the hybrid functional

Addresses Problem No. I

Jia, W., An, D., Wang, L. W., & Lin, L. (2018). Fast real-time time-dependent density functional theory calculations with the parallel transport gauge. Journal of Chemical Theory and Computation, 14(11), 5645-5652.
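To make the PT-CN iteration concrete, here is a minimal numpy sketch of one step, with a hypothetical `apply_h(phi, tau) -> H @ phi` callback standing in for the Hamiltonian application and plain damped fixed-point iteration standing in for the preconditioned Anderson mixing used in the actual code.

```python
import numpy as np

def pt_cn_step(phi0, t, dt, apply_h, n_scf=30, mix=0.3):
    """One parallel transport Crank-Nicolson (PT-CN) step, solved by damped
    fixed-point iteration (a simplification of preconditioned Anderson mixing)."""
    def pt_rhs(phi, tau):
        hphi = apply_h(phi, tau)
        return hphi - phi @ (phi.conj().T @ hphi)      # parallel transport term

    b = phi0 - 0.5j * dt * pt_rhs(phi0, t)             # fixed right-hand side
    phi1 = phi0.copy()
    for _ in range(n_scf):                             # self-consistent loop
        residual = phi1 + 0.5j * dt * pt_rhs(phi1, t + dt) - b
        phi1 = phi1 - mix * residual                   # damped fixed-point update
    phi1, _ = np.linalg.qr(phi1)                       # re-orthonormalize columns
    return phi1
```

The implicit solve is what allows the 10-50 attosecond time steps quoted above; the deck later reports roughly 22 SCF iterations per TDDFT step on average.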

Page 10:

Strong scaling is essential

• Total time: 30 fs

• Each time step: 50 as

• Total steps: 6000

• Each step: 262 seconds

• Total cost: 18.25 days

1024 atom silicon

2048 CPU cores

Ecut: 10 Hartree

FFT grid: 72³

#FFTs: 184,000,000 per TDDFT step

380 nm laser

Page 11:

Summit Supercomputer

2 IBM POWER9 sockets

6 NVIDIA V100 GPUs

512 GB main memory

96 GB GPU memory (6 × 16 GB)

NVLink: 50 GB/s

NIC connected to both sockets

V100: 7.6 TFLOPS, memory bandwidth: 900 GB/s

1 GPU per MPI rank in our code

One of 4600 nodes of Summit

Page 12:

Data distribution (PWDFT)

• Band-index parallelization:

• good for FFT calculation.

• G-parallelization:

• good for GEMM calculation.

• K-parallelization:

• Not mentioned here.

Data distribution of the wave function:

For a 1000-atom system: $N_e \sim 1000$, $N_g \sim 10^6$

[Figure: band-index parallel vs. G-parallel distribution of the wavefunction]
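A toy illustration of the two layouts (my own sketch, not PWDFT code): the wavefunction is an Ng × Ne matrix, band-index parallelization gives each rank whole columns so that per-band FFTs need no communication, while G-parallelization gives each rank a slab of rows so that overlap matrices such as Φ*Φ become a local GEMM plus a reduction.

```python
import numpy as np

def band_index_slice(psi, rank, nranks):
    """Band-index parallel: each rank owns a contiguous block of columns (bands)."""
    per = psi.shape[1] // nranks
    return psi[:, rank * per:(rank + 1) * per]

def g_parallel_slice(psi, rank, nranks):
    """G-parallel: each rank owns a contiguous block of rows (G vectors)."""
    per = psi.shape[0] // nranks
    return psi[rank * per:(rank + 1) * per, :]

# toy sizes: an Ng x Ne wavefunction distributed over 4 "ranks"
ng, ne, nranks = 4096, 8, 4
psi = np.random.rand(ng, ne) + 1j * np.random.rand(ng, ne)

# band-index layout: rank 0 can FFT each of its own bands independently
local_bands = band_index_slice(psi, rank=0, nranks=nranks)        # (Ng, Ne/P)

# G-parallel layout: overlap matrix = local GEMM on each rank + MPI_Allreduce
overlap = sum(g_parallel_slice(psi, r, nranks).conj().T
              @ g_parallel_slice(psi, r, nranks) for r in range(nranks))
```

Converting between the two layouts is what the MPI_Alltoallv calls later in the deck are for.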

Page 13:

rt-TDDFT PT-CN Algorithm

• Challenges for rt-TDDFT code

• Computation:

Fock exchange operator takes 92% of total time

nonlinear part $\Psi(\Psi^* H \Psi)$ calculation

occupation of Ψ

Anderson mixing

…..

• Storage:

20 copies of the wavefunction
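A rough upper bound on what those 20 copies mean for the 1536-atom case quoted later (my estimate, assuming complex double precision on the full 60×90×120 G-grid; the true plane-wave coefficient count inside the cutoff sphere is smaller):

```latex
N_g N_e \times 16~\mathrm{B}
  \approx 6.5\times 10^{5} \times 3072 \times 16~\mathrm{B}
  \approx 32~\mathrm{GB\ per\ copy}
  \;\Rightarrow\; 20~\mathrm{copies} \approx 0.6~\mathrm{TB},
```

which has to stay distributed, since each V100 on Summit has only 16 GB of memory.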

Page 14:

Fock exchange operator calculation

• Band-index parallel

• Two parts:

• MPI_Bcast

• Calculation

Page 15:

Fock exchange operator on GPU – I

• Step 1. band-by-band

FFTW => cuFFT

Custom CUDA kernels
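For orientation, a numpy sketch of what the band-by-band exchange kernel computes (a simplification: occupation factors and the hybrid mixing fraction are omitted, and `coul_g`, the reciprocal-space Coulomb kernel, is an assumed input); in the GPU port the two FFTs per orbital pair go through cuFFT and the pointwise multiplications become custom CUDA kernels.

```python
import numpy as np

def apply_fock_exchange(psi, coul_g, grid_shape):
    """Apply the Fock exchange operator band by band: two FFTs per (i, j) pair.

    psi        : (Ng, Ne) complex orbitals on the real-space grid
    coul_g     : (Ng,) Coulomb kernel 4*pi/|G|^2 on the FFT grid (0 at G = 0)
    grid_shape : (n1, n2, n3) with n1 * n2 * n3 == Ng
    """
    ne = psi.shape[1]
    vx_psi = np.zeros_like(psi)
    for i in range(ne):
        for j in range(ne):
            pair = psi[:, j].conj() * psi[:, i]                       # rho_ij(r)
            pair_g = np.fft.fftn(pair.reshape(grid_shape))            # forward FFT
            pot = np.fft.ifftn(coul_g.reshape(grid_shape) * pair_g)   # inverse FFT
            vx_psi[:, i] -= psi[:, j] * pot.ravel()                   # accumulate
    return vx_psi
```

The double loop is where the 2·Ne² FFT count from Problem No. II comes from.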

Page 16:

Fock exchange operator on GPU – II

• Step 1. band-by-band

• Step 2. batched implementation

Further utilize GPU bandwidth
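A sketch of what batching means here (my own illustration): instead of transforming one pair density at a time, a block of pairs is stacked and transformed in a single call, which is what keeps the GPU's memory bandwidth busy (batched cuFFT plans in the real code).

```python
import numpy as np

def batched_pair_potentials(psi_i, psi_block, coul_g, grid_shape):
    """Batched inner loop of the exchange kernel: FFT a whole block of pair
    densities rho_ij(r) = conj(psi_j) * psi_i at once along a leading batch axis.

    psi_i     : (Ng,)     one target orbital
    psi_block : (nb, Ng)  a block of orbitals psi_j
    """
    pairs = psi_block.conj() * psi_i[None, :]                    # (nb, Ng)
    pairs = pairs.reshape(-1, *grid_shape)                       # (nb, n1, n2, n3)
    pairs_g = np.fft.fftn(pairs, axes=(1, 2, 3))                 # one batched FFT
    pots = np.fft.ifftn(coul_g.reshape(grid_shape) * pairs_g,    # batched inverse
                        axes=(1, 2, 3))
    return pots.reshape(len(psi_block), -1)                      # (nb, Ng)
```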

Page 17:

Fock exchange operator on GPU – III

• Step 1. band-by-band

• Step 2. batched implementation

• Step 3. CUDA-aware MPI

Page 18:

Fock exchange operator on GPU – IV

• Step 1. band-by-band

• Step 2. batched implementation

• Step 3. CUDA-aware MPI

• Step 4. single precision MPI

Implicit barrier during MPI_Bcast
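A minimal mpi4py sketch of the single-precision communication idea (my own illustration, not the PWDFT code): orbitals stay in double precision locally but are downcast just for the broadcast, halving the MPI_Bcast volume.

```python
import numpy as np
from mpi4py import MPI

def bcast_single_precision(psi_block, root, comm=MPI.COMM_WORLD):
    """Broadcast complex128 orbital blocks as complex64 to halve MPI traffic."""
    buf = psi_block.astype(np.complex64)    # downcast before communication
    comm.Bcast(buf, root=root)              # blocking collective broadcast
    return buf.astype(np.complex128)        # upcast for local computation
```

The implicit barrier noted above is exactly the kind of cost that makes the next step, overlapping communication with computation, worthwhile.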

Page 19:

Fock exchange operator on GPU – V

• Step 1. band-by-band

• Step 2. batched implementation

• Step 3. CUDA-aware MPI

• Step 4. single precision MPI

• Step 5. overlap MPI/GPU
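A sketch of this final step (again mpi4py/numpy pseudocode under my own naming, with `compute_contribution` standing in for the GPU work on one block of bands): the broadcast of the next block is posted as a non-blocking Ibcast while the current block is being processed, so communication hides behind computation.

```python
import numpy as np
from mpi4py import MPI

def exchange_with_overlap(blocks, compute_contribution, comm=MPI.COMM_WORLD):
    """Double-buffered pipeline over blocks of bands.

    blocks : list of same-shaped numpy arrays, allocated on every rank; in the
             real code the broadcast root cycles over the ranks owning each block
             (fixed to 0 here to keep the sketch short).
    """
    total = 0.0
    req = comm.Ibcast(blocks[0], root=0)                  # prefetch first block
    for k in range(len(blocks)):
        req.Wait()                                        # block k has arrived
        if k + 1 < len(blocks):
            req = comm.Ibcast(blocks[k + 1], root=0)      # post next broadcast
        total = total + compute_contribution(blocks[k])   # overlapped compute
    return total
```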

Page 20:

Fock exchange operator speedup

[Bar chart: Fock exchange time (s), split into MPI time and computation time, for Summit 3072 CPU cores, Cori 3072 CPU cores, and the GPU versions (band-by-band, batched, GPUDirect, single-precision MPI, MPI/computation overlap)]

1536 atoms, ONCV pseudopotential, Ecut = 10 Hartree, 3072 bands
G-grid: 60×90×120, density grid: 120×180×240
3072 CPU cores ≈ 74 nodes; 72 GPUs = 12 nodes

Fock exchange part time: 3072 CPU cores vs. 72 GPUs
7x speedup under the same power consumption

380 nm laser, 30 fs simulation, 22 SCF iterations per TDDFT step (on average), 50 as per step

GPU bandwidth utilization: 90%
5.5% of peak FLOPS

Page 21:

PT-CN algorithm onto GPU

Port entire PT-CN onto GPU

• On GPU:

• occupation of Ψ

• 𝐻Ψ

• Residual $R$

• Orthogonalization

• On CPU:

• ρ => 𝑉

Page 22:

Residual calculation => GPU

• GEMM on GPU

• MPI_Alltoall with CUDA-aware MPI

Residual calculation on GPU
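In the G-parallel layout the GEMMs look like the following mpi4py/numpy sketch (my own illustration; the CUDA-aware MPI_Alltoallv that converts from the band-parallel layout is not shown): each rank holds a slab of rows of Φ and HΦ, forms its local contribution to Φ*HΦ with a GEMM, and an Allreduce completes the Ne × Ne matrix before the residual is assembled locally.

```python
import numpy as np
from mpi4py import MPI

def pt_residual_g_parallel(phi_loc, hphi_loc, comm=MPI.COMM_WORLD):
    """Residual R = H Phi - Phi (Phi^H H Phi) in the G-parallel layout.

    phi_loc, hphi_loc : (Ng_local, Ne) slabs of Phi and H @ Phi on this rank.
    """
    overlap_loc = phi_loc.conj().T @ hphi_loc         # local GEMM, (Ne, Ne)
    overlap = np.empty_like(overlap_loc)
    comm.Allreduce(overlap_loc, overlap, op=MPI.SUM)  # sum contributions over slabs
    return hphi_loc - phi_loc @ overlap               # local GEMM for the residual
```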

Page 23:

Strong scaling

[Figure: strong scaling of the 1536-atom silicon system; speedup baseline: time with 36 GPUs]

Page 24:

Wall clock time: PWDFT

[Bar chart: single SCF time, strong scaling over 36, 72, 144, 288, and 768 GPUs; time (s) broken down into HΨ, residual-related work, density evaluation, Anderson mixing, and others]

[Bar chart: wall clock time of one rt-TDDFT step (22 SCFs) over 36 to 1536 GPUs; time (s) broken down into MPI_Bcast, memory copy operations, MPI_Alltoallv, MPI_Allreduce, and computation time]

~5 minutes per rt-TDDFT step

Page 25:

Weak scaling and comparison with RK-4

[Plot: weak scaling, total time (s) vs. number of atoms (48 to 1536), compared against ideal scaling]

[Bar chart: RK-4 vs. PT-CN time (s) over 36 to 768 GPUs; smaller is better]

Page 26:

Time-to-solution (PWDFT)

1536 atoms, Ecut = 10 Hartree, 3072 bands, G-grid: 60×90×120, density grid: 120×180×240

380 nm laser

30 fs simulation, 22 SCF iterations per TDDFT step, 50 as per step

RK-4, CPU: ~3.4 years
RK-4, GPU: ~37 days
PT-CN, CPU: ~63 days
PT-CN, GPU: ~45 hours

3072 CPU cores ≈ 74 nodes; 786 GPUs = 131 nodes

Page 27:

Conclusion and future work

• New algorithm (PT-CN, 20x) + new machine (Summit, 34x) leads to a ~680x speedup for the 1536-atom silicon rt-TDDFT hybrid functional calculation.

• The GPU code is ~7x more power efficient than the CPU code.

• Data movement is the key issue in the GPU implementation.

• Future work:

• Metal systems

• Better preconditioner for the rt-TDDFT

Page 28:

Some thoughts

• Data movement is important; try to reduce it

• NVLink
• CUDA-aware MPI

• Watch out for unexpected behavior

• Algorithm entirely on GPU
• Batch computation
• Reduce data copies

• Try new libraries: cuSOLVER instead of MAGMA
• Try mixed precision, for both computation and communication

• Try different resource setups:
• https://jsrunvisualizer.olcf.ornl.gov/?s1f1o01n1c42g6r16d1b27l0=

• Summit tutorial:
• https://www.olcf.ornl.gov/for-users/system-user-guides/summit/

Page 29:

Thank you for your attention!