First principles modeling with Octopus: massive parallelization towards petaflop computing and more
A. Castro, J. Alberdi and A. Rubio
Theoretical Spectroscopy
Electronic excitations:
~ Optical absorption
~ Electron energy loss
~ Inelastic X-ray scattering
~ Photoemission
~ Inverse photoemission
~ …
Theoretical Spectroscopy
Goal: First principles (from electronic structure) theoretical description of the various spectroscopies (“theoretical beamlines”):
Theoretical Spectroscopy
Role: interpretation of (complex) experimental findings
Theoretical atomistic structures, and corresponding TEM images.
Theoretical Spectroscopy
The European Theoretical Spectroscopy Facility (ETSF)
~ Networking
~ Integration of tools (formalism, software)
~ Maintenance of tools
~ Support, service, training
Theoretical Spectroscopy
The octopus code is a member of a family of free software codes developed, to a large extent, within the ETSF:
~ abinit
~ octopus
~ dp
The octopus code
Targets:
~ Optical absorption spectra of molecules, clusters, nanostructures, solids.
~ Response to lasers (non-perturbative response to high-intensity fields).
~ Dichroic spectra, and other mixed (electric-magnetic) responses.
~ Adiabatic and non-adiabatic Molecular Dynamics (for, e.g., infrared and vibrational spectra, or photochemical reactions).
~ Quantum Optimal Control Theory for molecular processes.
The octopus code
Physical approximations and techniques:
~ Density-Functional Theory and Time-Dependent Density-Functional Theory to describe the electronic structure.
• Comprehensive set of functionals through the libxc library.
~ Mixed quantum-classical systems.
~ Both real-time and frequency-domain response (“Casida” and “Sternheimer” formulations).
The octopus code
Numerics:
~ Basic representation: real-space grid.
~ Usually regular and rectangular, occasionally curvilinear.
~ Plane waves for some procedures (especially for periodic systems).
~ Atomic orbitals for some procedures.
The octopus code
Derivative at a point: a sum over neighboring points.
The coefficients c_ij depend on the points used: the stencil.
More points -> more precision.
Semi-local operation.
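As an illustration of the stencil idea (a minimal sketch, not code taken from Octopus), the following applies central finite-difference stencils of 3 and 5 points to the second derivative of sin(x) in one dimension. The weights are the standard central-difference coefficients; the function name, the test function, and the grid spacing are hypothetical choices made for this example.

```python
import math

# Standard central-difference weights for the second derivative,
# indexed by offset from the central point (in units of the spacing h).
STENCILS = {
    3: {-1: 1.0, 0: -2.0, 1: 1.0},
    5: {-2: -1.0 / 12, -1: 4.0 / 3, 0: -5.0 / 2, 1: 4.0 / 3, 2: -1.0 / 12},
}

def second_derivative(f, x, h, n_points):
    """Approximate f''(x) as a weighted sum over neighboring grid points."""
    weights = STENCILS[n_points]
    return sum(c * f(x + k * h) for k, c in weights.items()) / h**2

x0, h = 0.7, 0.1
exact = -math.sin(x0)  # analytic second derivative of sin(x)
err3 = abs(second_derivative(math.sin, x0, h, 3) - exact)
err5 = abs(second_derivative(math.sin, x0, h, 5) - exact)
print(f"3-point error: {err3:.2e}")
print(f"5-point error: {err5:.2e}")
```

The wider stencil gives the smaller error, at the cost of reading more neighboring points per grid point.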
The octopus code
The key equations
~ Ground-state DFT: Kohn-Sham equations.
~ Time-dependent DFT: time-dependent KS equations.
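For reference, the standard textbook form of these two sets of equations is given below. Atomic units are used, and the notation is the usual one; it is an assumption that it matches the slide's own formulas.

```latex
% Ground-state Kohn-Sham equations
\left[ -\tfrac{1}{2}\nabla^2 + v_{\rm KS}[n](\mathbf{r}) \right] \varphi_i(\mathbf{r})
  = \varepsilon_i \, \varphi_i(\mathbf{r}),
\qquad
n(\mathbf{r}) = \sum_i^{\rm occ} |\varphi_i(\mathbf{r})|^2

% Time-dependent Kohn-Sham equations
i \frac{\partial}{\partial t} \varphi_i(\mathbf{r}, t)
  = \left[ -\tfrac{1}{2}\nabla^2 + v_{\rm KS}[n](\mathbf{r}, t) \right] \varphi_i(\mathbf{r}, t),
\qquad
n(\mathbf{r}, t) = \sum_i^{\rm occ} |\varphi_i(\mathbf{r}, t)|^2
```

Here the Kohn-Sham potential v_KS collects the external, Hartree, and exchange-correlation contributions.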
The octopus code
Key numerical operations:
~ Linear systems with sparse matrices.
~ Eigenvalue systems with sparse matrices.
~ Non-linear eigenvalue systems.
~ Propagation of “Schrödinger-like” equations.
~ The dimension can go up to 10 million points.
~ The storage needs can go up to 10 GB.
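A rough way to see how a 10-million-point grid leads to storage on the order of 10 GB is to count the bytes needed for the orbitals alone. This is a back-of-the-envelope sketch: the number of states (64) is a hypothetical value chosen for illustration, and the function name is made up for this example.

```python
BYTES_PER_COMPLEX = 16  # one double-precision complex number

def wavefunction_storage_bytes(n_points, n_states):
    """Memory needed to hold n_states complex orbitals on n_points grid points."""
    return n_points * n_states * BYTES_PER_COMPLEX

n_points = 10_000_000  # grid size quoted on the slide
n_states = 64          # hypothetical number of orbitals, for illustration only
per_state = wavefunction_storage_bytes(n_points, 1)
total = wavefunction_storage_bytes(n_points, n_states)
print(f"per state: {per_state / 1e9:.2f} GB, total: {total / 1e9:.2f} GB")
```

Each orbital takes 0.16 GB under these assumptions, so a few dozen orbitals already reach the 10 GB range.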
The octopus code
Use of libraries:
~ BLAS, LAPACK
~ GNU GSL mathematical library.
~ FFTW
~ NetCDF
~ ETSF input/output library
~ Libxc exchange and correlation library
~ Other optional libraries.
Objective
~ Reach petaflops computing with a scientific code
~ Simulate the absorption of light in chlorophyll (photosynthesis)
The Octopus code
~ Software package for electron dynamics
~ Developed at the UPV/EHU
~ Ground-state and excited-state properties
~ Real-time, Casida and Sternheimer TDDFT
~ Quantum transport and optimal control
~ Free software: GPL license
http://www.tddft.org/programs/octopus/
Octopus simulation strategy
~ Pseudopotential approximation
~ Real-space grids
~ Main operation: the finite-difference Laplacian
Libraries
Intensive use of libraries
General libraries:
~ BLAS
~ LAPACK
~ FFT
~ Zoltan/Metis
~ ...
Specific libraries:
~ Libxc
~ ETSF_IO
Multilevel parallelization
~ MPI: Kohn-Sham states, real-space domains
~ In node: OpenMP threads, OpenCL tasks, vectorization
~ CPU, GPU
Target systems:
Massive number of execution units
~ Multicore processors with vector FPUs
~ IBM Blue Gene architecture
~ Graphics processing units
Parallelization by states/orbitals
~ Assign each processor a group of states
~ Time propagation is independent for each state
~ Little communication required
~ Limited by the number of states in the system
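The assignment of states to processors, and the limit set by the number of states, can be sketched as follows. This is a simplified illustration in plain Python rather than MPI; the function name and the balanced contiguous split are assumptions, not necessarily the scheme used in Octopus.

```python
def distribute_states(n_states, n_procs):
    """Split state indices 0..n_states-1 into contiguous, balanced groups."""
    base, extra = divmod(n_states, n_procs)
    groups, start = [], 0
    for rank in range(n_procs):
        # The first `extra` processes take one additional state.
        size = base + (1 if rank < extra else 0)
        groups.append(range(start, start + size))
        start += size
    return groups

groups = distribute_states(n_states=10, n_procs=4)
for rank, states in enumerate(groups):
    print(rank, list(states))

# With more processes than states, some processes get nothing to do.
idle = sum(1 for g in distribute_states(n_states=10, n_procs=16) if len(g) == 0)
print("idle processes:", idle)
```

Once the process count exceeds the number of states, the extra processes stay idle, which is why this level of parallelization is combined with the others.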
Domain parallelization
~ Assign each processor a set of grid points
~ Partition libraries: Zoltan or Metis
Main operations in domain parallelization
~ Laplacian: copy points at domain boundaries
• Overlap computation and communication
~ Integration: global sums (reductions)
• Group reduction operations
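Both operations can be mimicked in a single process to show the data flow. The sketch below is an illustration under simplifying assumptions: a one-dimensional periodic grid, a 3-point Laplacian, equal-sized contiguous domains instead of a Zoltan or Metis partition, and plain list copies standing in for MPI messages and reductions. All names are hypothetical.

```python
import math

def laplacian_1d(values, h):
    """3-point Laplacian on the interior points of `values` (ends are ghosts)."""
    return [(values[i - 1] - 2.0 * values[i] + values[i + 1]) / h**2
            for i in range(1, len(values) - 1)]

n, n_domains = 64, 4
h = 2.0 * math.pi / n
field = [math.sin(i * h) for i in range(n)]  # periodic test function

# Serial reference: pad with periodic neighbors, then apply the stencil.
serial = laplacian_1d([field[-1]] + field + [field[0]], h)

# Domain decomposition: each "process" owns a contiguous block of points.
size = n // n_domains
domains = [field[d * size:(d + 1) * size] for d in range(n_domains)]

parallel, partial_sums = [], []
for d, local in enumerate(domains):
    # "Communication": copy one boundary point from each neighboring domain.
    left_ghost = domains[(d - 1) % n_domains][-1]
    right_ghost = domains[(d + 1) % n_domains][0]
    parallel.extend(laplacian_1d([left_ghost] + local + [right_ghost], h))
    # Local contribution to the integral of field**2 over this domain.
    partial_sums.append(sum(v * v for v in local) * h)

# "Reduction": combine the partial sums into one global value.
integral = sum(partial_sums)

print(max(abs(a - b) for a, b in zip(serial, parallel)))
print(integral, math.pi)
```

Only the ghost points cross domain boundaries for the Laplacian, whereas the integral needs one number from every domain; grouping several such reductions into a single call reduces the number of global synchronizations.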
Two approaches
OpenMP
~ Thread programming based on compiler directives
~ In-node parallelization
~ Little memory overhead compared to MPI
~ Scaling limited by memory bandwidth
~ Multithreaded BLAS and LAPACK
OpenCL
~ Hundreds of execution units
~ High memory bandwidth but with long latency
~ Behaves like a vector processor (length > 16)
~ Separate memory: copy from/to main memory
Supercomputers
Corvo cluster
~ x86_64
VARGAS (at IDRIS)
~ Power6
~ 67 teraflops
MareNostrum
~ PowerPC 970
~ 94 teraflops
Jugene (image)
~ 1 petaflops
Laplacian operator
Comparison in performance of the finite-difference Laplacian operator
~ CPU uses 4 threads
~ GPU is 4 times faster
~ Cache effects are visible
Time propagation
Comparison in performance for a time propagation
~ Fullerene molecule
~ The GPU is 3 times faster
~ Limited by copying and non-GPU code
Multilevel parallelization
~ Chlorophyll molecule: 650 atoms
~ Jugene Blue Gene/P
~ Sustained throughput: > 6.5 teraflops
~ Peak throughput: 55 teraflops
Target system
Jugene, all nodes
~ 294 912 processor cores = 73 728 nodes
~ Maximum theoretical performance of 1002 TFlops
5879-atom chlorophyll system
~ Complete molecule from spinach
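The core count and a theoretical peak of about 1002 teraflops (roughly 1 petaflops, in line with the Jugene entry on the Supercomputers slide) can be checked with a short calculation. The per-core figures below are the published Blue Gene/P hardware characteristics, not numbers taken from the slides, so treat them as assumptions of this sketch.

```python
CLOCK_HZ = 850e6     # Blue Gene/P core clock frequency
FLOPS_PER_CYCLE = 4  # per core (dual floating-point unit with fused multiply-add)
CORES_PER_NODE = 4

nodes = 73_728
cores = nodes * CORES_PER_NODE
peak_tflops = cores * CLOCK_HZ * FLOPS_PER_CYCLE / 1e12
print(cores, round(peak_tflops, 1))
```

The result is a theoretical maximum; sustained throughput on a real application is a fraction of it.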
Test systems
Smaller molecules
~ 180 atoms
~ 441 atoms
~ 650 atoms
~ 1365 atoms
Partitions of the machines
~ Jugene and Corvo
Improvements
Memory improvements in the ground-state (GS) calculation
~ Split the memory among the nodes
~ Use of ScaLAPACK
Improvements in the Poisson solver for time-dependent (TD) runs
~ Pipeline execution
• Run the Poisson solver while the propagation continues with an approximation
~ Use new algorithms like FMM (fast multipole method)
~ Use of parallel FFTs