
Page 1:

Parallel Transport Time Dependent Density Functional Theory Calculations with Hybrid

Functional on Summit

Weile Jia¹, Lin-Wang Wang², Lin Lin¹,²

1. University of California, Berkeley

2. Lawrence Berkeley National Lab

SC'19, Denver, Nov. 21, 2019

Page 2:

Outline

• Motivation

• rt-TDDFT algorithm and parallelization

• Implementation details and results

• What is next?

Page 3:

Motivation

• electronic ultrafast phenomena

• ion collision

• light absorption spectrum

• laser-induced demagnetization and phase change

• charge transfer

• excited carrier dynamics

• chemical reactions

• …..

Page 4:

Real-time TDDFT

• A time-dependent many-electron system, starting from an initial state Ψ(0), is determined by the one-body time-dependent density alone (Runge and Gross, 1984).

$i\,\partial_t \psi_i(t) = H(P(t), t)\,\psi_i(t)$

$P(t) = \Psi(t)\,\Psi^*(t)$

Explicit RK-4 method for rt-TDDFT
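As a point of reference, here is a minimal numpy sketch of one explicit RK-4 step for the TDKS equation above; `apply_h` is a hypothetical callback that applies the density-dependent Hamiltonian to the orbitals, not the actual PWDFT interface.

```python
import numpy as np

def rk4_step(psi, t, dt, apply_h):
    """One explicit RK-4 step of i d(psi)/dt = H(P(t), t) psi.

    psi     : (Ng, Ne) complex array of orbitals
    apply_h : hypothetical callback (psi, tau) -> H(tau) @ psi that rebuilds
              the density-dependent Hamiltonian internally
    """
    f = lambda p, tau: -1j * apply_h(p, tau)
    k1 = f(psi, t)
    k2 = f(psi + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = f(psi + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = f(psi + dt * k3, t + dt)
    return psi + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
```

Each step needs four Hamiltonian applications, and the small stability-limited time step is what Problem No. I below is about.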

Page 5:

State of the art

• SALMON: https://salmon-tddft.jp/

• OCTOPUS: https://gitlab.com/octopus-code/octopus

Pros: complexity O(N²), easy to parallelize

Mainly use explicit time integrators

Page 6:

Problem No. I

• Gómez Pueyo, Adrián, Miguel A. L. Marques, Angel Rubio, and Alberto Castro. "Propagators for the Time-Dependent Kohn–Sham Equations: Multistep, Runge–Kutta, Exponential Runge–Kutta, and Commutator Free Magnus Methods." Journal of Chemical Theory and Computation 14, no. 6 (2018): 3040–3052.

• Rehn, D. A., Shen, Y., Buchholz, M. E., Dubey, M., Namburu, R., & Reed, E. J. (2019). ODE integration schemes for plane-wave real-time time-dependent density functional theory. The Journal of Chemical Physics, 150(1), 014101.

Time step too small!

$\Delta t < \|H\|^{-1} \sim 1$ attosecond

Total time: 10-100 fs

Number of steps: ~ 10,000
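Spelling out the step count implied by these numbers (my arithmetic, using the time step and simulation lengths quoted above):

```latex
N_{\text{steps}} \;=\; \frac{T_{\text{total}}}{\Delta t}
\;\approx\; \frac{10\text{--}100~\mathrm{fs}}{1~\mathrm{as}}
\;=\; 10^{4}\text{--}10^{5}.
```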

Page 7:

Problem No. II

• Accuracy

• PBE/LDA relatively cheap, but not accurate enough

• Hybrid functional: accurate, but too expensive

Literature: hybrid functional rt-TDDFT has been reported for ~8-atom systems

Computational complexity: $O(N_e^2\, N_g \log N_g)$

$N_g \sim 10^5$, $N_e \sim N_{\text{atom}}$

100 atoms require ~20,000 FFTs; 1000 atoms require ~2,000,000 FFTs
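One way to reproduce these FFT counts (my arithmetic, assuming one forward and one inverse FFT per orbital pair in the exchange operator):

```latex
N_{\text{FFT}} \;\approx\; 2\,N_e^{2}, \qquad N_e \sim N_{\text{atom}}
\;\;\Rightarrow\;\;
2\times 100^{2} = 2\times 10^{4}, \qquad
2\times 1000^{2} = 2\times 10^{6}.
```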

Page 8:

Parallel transport gauge formulation

• $P(t) = \Psi(t)\Psi^*(t)$ oscillates much more slowly

• von Neumann equation: $i\partial_t P(t) = [H(P(t),t),\, P(t)]$

Black line: oscillation of the real part of the wavefunction $\psi(t, r_0)$; green line: the optimal (parallel transport) gauge $\varphi(t, r_0)$

• $\psi(t)$ oscillates rapidly

• $\Phi(t) = \Psi(t)U(t)$, where $U(t)$ is a unitary matrix

• Parallel transport governing equation: $i\partial_t \Phi(t) = H\Phi(t) - \Phi(t)\,\bigl(\Phi^*(t) H \Phi(t)\bigr)$

Page 9:

Crank–Nicolson implicit time integrator (PWDFT)

• Preconditioned Anderson mixing method

• Time step: 10-50 attosecond

• ~ 5-20x speedup for the hybrid functional

Addresses Problem No. I

Jia, W., An, D., Wang, L. W., & Lin, L. (2018). Fast real-time time-dependent density functional theory calculations with the parallel transport gauge. Journal of Chemical Theory and Computation, 14(11), 5645-5652.
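To make the PT-CN iteration concrete, here is a minimal numpy sketch of one step, with a hypothetical `apply_h(phi, tau) -> H @ phi` callback standing in for the Hamiltonian application and plain damped fixed-point iteration standing in for the preconditioned Anderson mixing used in the actual code.

```python
import numpy as np

def pt_cn_step(phi0, t, dt, apply_h, n_scf=30, mix=0.3):
    """One parallel transport Crank-Nicolson (PT-CN) step, solved by damped
    fixed-point iteration (a simplification of preconditioned Anderson mixing)."""
    def pt_rhs(phi, tau):
        hphi = apply_h(phi, tau)
        return hphi - phi @ (phi.conj().T @ hphi)      # parallel transport term

    b = phi0 - 0.5j * dt * pt_rhs(phi0, t)             # fixed right-hand side
    phi1 = phi0.copy()
    for _ in range(n_scf):                             # self-consistent loop
        residual = phi1 + 0.5j * dt * pt_rhs(phi1, t + dt) - b
        phi1 = phi1 - mix * residual                   # damped fixed-point update
    phi1, _ = np.linalg.qr(phi1)                       # re-orthonormalize columns
    return phi1
```

The implicit solve is what allows the 10-50 attosecond time steps quoted above; the deck later reports roughly 22 SCF iterations per TDDFT step on average.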

Page 10:

Strong scaling is essential

• Total time: 30 fs

• Each time step: 50 as

• Total steps: 6000

• Each step: 262 seconds

• Total cost: 18.25 days

1024 atom silicon

2048 CPU cores

Ecut: 10 Hartree

FFT grid: 72³

#FFTs: 184,000,000 per TDDFT step

380 nm laser

Page 11:

Summit Supercomputer

2 IBM POWER9 sockets

6 NVIDIA V100 GPUs

512 GB main memory

96 GB GPU memory (6 × 16 GB)

NVLink: 50 GB/s

NIC connected to both sockets

V100: 7.6 TFLOPS, memory bandwidth: 900 GB/s

1 GPU per MPI rank in our code

One of 4600 nodes of Summit

Page 12:

Data distribution (PWDFT)

• Band-index parallelization:

• good for FFT calculation.

• G-parallelization:

• good for GEMM calculation.

• K-parallelization:

• Not mentioned here.

Data distribution of the wave function:

For a 1000-atom system: $N_e \sim 1000$, $N_g \sim 10^6$

[Figure: band-index parallel vs. G-parallel distribution of the wavefunction]
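A toy illustration of the two layouts (my own sketch, not PWDFT code): the wavefunction is an Ng × Ne matrix, band-index parallelization gives each rank whole columns so that per-band FFTs need no communication, while G-parallelization gives each rank a slab of rows so that overlap matrices such as Φ*Φ become a local GEMM plus a reduction.

```python
import numpy as np

def band_index_slice(psi, rank, nranks):
    """Band-index parallel: each rank owns a contiguous block of columns (bands)."""
    per = psi.shape[1] // nranks
    return psi[:, rank * per:(rank + 1) * per]

def g_parallel_slice(psi, rank, nranks):
    """G-parallel: each rank owns a contiguous block of rows (G vectors)."""
    per = psi.shape[0] // nranks
    return psi[rank * per:(rank + 1) * per, :]

# toy sizes: an Ng x Ne wavefunction distributed over 4 "ranks"
ng, ne, nranks = 4096, 8, 4
psi = np.random.rand(ng, ne) + 1j * np.random.rand(ng, ne)

# band-index layout: rank 0 can FFT each of its own bands independently
local_bands = band_index_slice(psi, rank=0, nranks=nranks)        # (Ng, Ne/P)

# G-parallel layout: overlap matrix = local GEMM on each rank + MPI_Allreduce
overlap = sum(g_parallel_slice(psi, r, nranks).conj().T
              @ g_parallel_slice(psi, r, nranks) for r in range(nranks))
```

Converting between the two layouts is what the MPI_Alltoallv calls later in the deck are for.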

Page 13:

rt-TDDFT PT-CN Algorithm

• Challenges for rt-TDDFT code

• Computation:

Fock exchange operator takes 92% of total time

nonlinear part $\Psi(\Psi^* H \Psi)$ calculation

occupation of Ψ

Anderson mixing

…..

• Storage:

20 copies of the wavefunction
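A rough upper bound on what those 20 copies mean for the 1536-atom case quoted later (my estimate, assuming complex double precision on the full 60×90×120 G-grid; the true plane-wave coefficient count inside the cutoff sphere is smaller):

```latex
N_g N_e \times 16~\mathrm{B}
  \approx 6.5\times 10^{5} \times 3072 \times 16~\mathrm{B}
  \approx 32~\mathrm{GB\ per\ copy}
  \;\Rightarrow\; 20~\mathrm{copies} \approx 0.6~\mathrm{TB},
```

which has to stay distributed, since each V100 on Summit has only 16 GB of memory.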

Page 14:

Fock exchange operator calculation

• Band-index parallel

• Two parts:

• MPI_Bcast

• Calculation

Page 15:

Fock exchange operator on GPU – I

• Step 1. band-by-band

FFTW => cuFFT

Custom CUDA kernels
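For orientation, a numpy sketch of what the band-by-band exchange kernel computes (a simplification: occupation factors and the hybrid mixing fraction are omitted, and `coul_g`, the reciprocal-space Coulomb kernel, is an assumed input); in the GPU port the two FFTs per orbital pair go through cuFFT and the pointwise multiplications become custom CUDA kernels.

```python
import numpy as np

def apply_fock_exchange(psi, coul_g, grid_shape):
    """Apply the Fock exchange operator band by band: two FFTs per (i, j) pair.

    psi        : (Ng, Ne) complex orbitals on the real-space grid
    coul_g     : (Ng,) Coulomb kernel 4*pi/|G|^2 on the FFT grid (0 at G = 0)
    grid_shape : (n1, n2, n3) with n1 * n2 * n3 == Ng
    """
    ne = psi.shape[1]
    vx_psi = np.zeros_like(psi)
    for i in range(ne):
        for j in range(ne):
            pair = psi[:, j].conj() * psi[:, i]                       # rho_ij(r)
            pair_g = np.fft.fftn(pair.reshape(grid_shape))            # forward FFT
            pot = np.fft.ifftn(coul_g.reshape(grid_shape) * pair_g)   # inverse FFT
            vx_psi[:, i] -= psi[:, j] * pot.ravel()                   # accumulate
    return vx_psi
```

The double loop is where the 2·Ne² FFT count from Problem No. II comes from.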

Page 16:

Fock exchange operator on GPU – II

• Step 1. band-by-band

• Step 2. batched implementation

Further utilize GPU bandwidth
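A sketch of what batching means here (my own illustration): instead of transforming one pair density at a time, a block of pairs is stacked and transformed in a single call, which is what keeps the GPU's memory bandwidth busy (batched cuFFT plans in the real code).

```python
import numpy as np

def batched_pair_potentials(psi_i, psi_block, coul_g, grid_shape):
    """Batched inner loop of the exchange kernel: FFT a whole block of pair
    densities rho_ij(r) = conj(psi_j) * psi_i at once along a leading batch axis.

    psi_i     : (Ng,)     one target orbital
    psi_block : (nb, Ng)  a block of orbitals psi_j
    """
    pairs = psi_block.conj() * psi_i[None, :]                    # (nb, Ng)
    pairs = pairs.reshape(-1, *grid_shape)                       # (nb, n1, n2, n3)
    pairs_g = np.fft.fftn(pairs, axes=(1, 2, 3))                 # one batched FFT
    pots = np.fft.ifftn(coul_g.reshape(grid_shape) * pairs_g,    # batched inverse
                        axes=(1, 2, 3))
    return pots.reshape(len(psi_block), -1)                      # (nb, Ng)
```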

Page 17:

Fock exchange operator on GPU – III

• Step 1. band-by-band

• Step 2. batched implementation

• Step 3. CUDA-aware MPI

Page 18:

Fock exchange operator on GPU – IV

• Step 1. band-by-band

• Step 2. batched implementation

• Step 3. CUDA-aware MPI

• Step 4. single precision MPI

Implicit barrier during MPI_Bcast
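A minimal mpi4py sketch of the single-precision communication idea (my own illustration, not the PWDFT code): orbitals stay in double precision locally but are downcast just for the broadcast, halving the MPI_Bcast volume.

```python
import numpy as np
from mpi4py import MPI

def bcast_single_precision(psi_block, root, comm=MPI.COMM_WORLD):
    """Broadcast complex128 orbital blocks as complex64 to halve MPI traffic."""
    buf = psi_block.astype(np.complex64)    # downcast before communication
    comm.Bcast(buf, root=root)              # blocking collective broadcast
    return buf.astype(np.complex128)        # upcast for local computation
```

The implicit barrier noted above is exactly the kind of cost that makes the next step, overlapping communication with computation, worthwhile.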

Page 19:

Fock exchange operator on GPU – V

• Step 1. band-by-band

• Step 2. batched implementation

• Step 3. CUDA-aware MPI

• Step 4. single precision MPI

• Step 5. overlap MPI/GPU
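A sketch of this final step (again mpi4py/numpy pseudocode under my own naming, with `compute_contribution` standing in for the GPU work on one block of bands): the broadcast of the next block is posted as a non-blocking Ibcast while the current block is being processed, so communication hides behind computation.

```python
import numpy as np
from mpi4py import MPI

def exchange_with_overlap(blocks, compute_contribution, comm=MPI.COMM_WORLD):
    """Double-buffered pipeline over blocks of bands.

    blocks : list of same-shaped numpy arrays, allocated on every rank; in the
             real code the broadcast root cycles over the ranks owning each block
             (fixed to 0 here to keep the sketch short).
    """
    total = 0.0
    req = comm.Ibcast(blocks[0], root=0)                  # prefetch first block
    for k in range(len(blocks)):
        req.Wait()                                        # block k has arrived
        if k + 1 < len(blocks):
            req = comm.Ibcast(blocks[k + 1], root=0)      # post next broadcast
        total = total + compute_contribution(blocks[k])   # overlapped compute
    return total
```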

Page 20:

Fock exchange operator speedup

[Bar chart: Fock exchange time (s), split into MPI time and computation time, for Summit 3072 CPU cores, Cori 3072 CPU cores, and the GPU versions (band-by-band, batched, GPUDirect, single-precision MPI, MPI/computation overlap)]

1536 atoms, ONCV pseudopotential, Ecut = 10 Hartree, 3072 bands
G-grid: 60×90×120, density grid: 120×180×240
3072 CPU cores ≈ 74 nodes; 72 GPUs = 12 nodes

Fock exchange part time: 3072 CPU cores vs. 72 GPUs
7x speedup under the same power consumption

380 nm laser, 30 fs simulation, 22 SCF iterations per TDDFT step (on average), 50 as per step

GPU bandwidth utilization: 90%
5.5% of peak FLOPS

Page 21:

PT-CN algorithm onto GPU

Port entire PT-CN onto GPU

• On GPU:

• occupation of Ψ

• 𝐻Ψ

• Residual $R$

• Orthogonalization

• On CPU:

• ρ => 𝑉

Page 22:

Residual calculation => GPU

• GEMM on GPU

• MPI_Alltoall with CUDA-aware MPI

Residual calculation on GPU
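In the G-parallel layout the GEMMs look like the following mpi4py/numpy sketch (my own illustration; the CUDA-aware MPI_Alltoallv that converts from the band-parallel layout is not shown): each rank holds a slab of rows of Φ and HΦ, forms its local contribution to Φ*HΦ with a GEMM, and an Allreduce completes the Ne × Ne matrix before the residual is assembled locally.

```python
import numpy as np
from mpi4py import MPI

def pt_residual_g_parallel(phi_loc, hphi_loc, comm=MPI.COMM_WORLD):
    """Residual R = H Phi - Phi (Phi^H H Phi) in the G-parallel layout.

    phi_loc, hphi_loc : (Ng_local, Ne) slabs of Phi and H @ Phi on this rank.
    """
    overlap_loc = phi_loc.conj().T @ hphi_loc         # local GEMM, (Ne, Ne)
    overlap = np.empty_like(overlap_loc)
    comm.Allreduce(overlap_loc, overlap, op=MPI.SUM)  # sum contributions over slabs
    return hphi_loc - phi_loc @ overlap               # local GEMM for the residual
```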

Page 23:

Strong scaling

[Figure: strong scaling of the 1536-atom silicon system; speedup baseline: time with 36 GPUs]

Page 24:

Wall clock time: PWDFT

[Bar chart: single SCF time, strong scaling over 36, 72, 144, 288, and 768 GPUs; time (s) broken down into HΨ, residual-related work, density evaluation, Anderson mixing, and others]

[Bar chart: wall clock time of one rt-TDDFT step (22 SCFs) over 36 to 1536 GPUs; time (s) broken down into MPI_Bcast, memory copy operations, MPI_Alltoallv, MPI_Allreduce, and computation time]

~5 minutes per rt-TDDFT step

Page 25:

Weak scaling and comparison with RK-4

[Plot: weak scaling, total time (s) vs. number of atoms (48 to 1536), compared against ideal scaling]

[Bar chart: RK-4 vs. PT-CN time (s) over 36 to 768 GPUs; smaller is better]

Page 26:

Time-to-solution (PWDFT)

1536 atoms, Ecut = 10 Hartree, 3072 bands, G-grid: 60×90×120, density grid: 120×180×240

380 nm laser

30 fs simulation, 22 SCF iterations per TDDFT step, 50 as per step

RK-4, CPU: ~3.4 years
RK-4, GPU: ~37 days
PT-CN, CPU: ~63 days
PT-CN, GPU: ~45 hours

3072 CPU cores ≈ 74 nodes; 786 GPUs = 131 nodes

Page 27:

Conclusion and future work

• New algorithm (PT-CN, 20x) + new machine (Summit, 34x) leads to a ~680x speedup for the 1536-atom silicon rt-TDDFT hybrid functional calculation.

• The GPU code is ~7x more power efficient than the CPU code.

• Data movement is the key issue in the GPU implementation.

• Future work:

• Metal systems

• Better preconditioner for the rt-TDDFT

Page 28:

Some thoughts

• Data movement is important; try to reduce it

• NVLink
• CUDA-aware MPI

• Watch out for unexpected behavior

• Algorithm entirely on GPU
• Batch computation
• Reduce data copies

• Try new libraries: cuSOLVER instead of MAGMA
• Try mixed precision, for both computation and communication

• Try different resource setups:
• https://jsrunvisualizer.olcf.ornl.gov/?s1f1o01n1c42g6r16d1b27l0=

• Summit tutorial:
• https://www.olcf.ornl.gov/for-users/system-user-guides/summit/

Page 29:

Thank you for your attention!