performance engineering for legacy codes on a cray xc40 ... · 64+ core (based on intel atom...
TRANSCRIPT
![Page 1: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/1.jpg)
Performance Engineering for Legacy Codeson a Cray XC40 with Intel Xeon Phi (KNL)
Matthias Noack ([email protected]), Florian Wende,Thomas Steinke, Alexander Reinefeld
Zuse Institute Berlin
1 / 442017-06-22, Performance Engineering for HPC: Implementation, Processes & Case Studies at ISC’17
![Page 2: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/2.jpg)
“Konrad” (Berlin, ZIB)- 1872 Xeon nodes- 44928 cores
“Gottfried” (Hanover, LUIS)- 1680 Xeon nodes- 40320 cores + 64 SMP servers, 256/512 GB
TDS (Berlin, ZIB)- 16 KNC nodes (until July 2016)- 80 KNL nodes (since July 2016)- data warp nodes
Applications on the HLRN-III- many pure MPI codes- some MPI+OpenMP- some vectorized
10 Gbps (243 km linear distance)
North-German Supercomputing Alliance (HLRN)
2 / 44
![Page 3: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/3.jpg)
1987Cray X-MP
471 MFlops
1994Cray T3D38 GFlops
1997Cray T3E
486 GFlops
2002 (HLRN-I)IBM p6902,5 TFlops
1984Cray 1M
160 MFlops
2008 (HLRN-II)SGI ICE, XE150 TFlops
2013 (HLRN-III)Cray XC30/XC40
1,3 PFlops
[peak performance]
2
192DEC Alpha
256DEC Alpha
384IBM Power4
10.240Intel Harpertown,
Nehalem
44.928Intel Ivy Bridge, Haswell
1
[cores]
Supercomputer at ZIB
3 / 44
![Page 4: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/4.jpg)
1987Cray X-MP
471 MFlops
1994Cray T3D38 GFlops
1997Cray T3E
486 GFlops
2002 (HLRN-I)IBM p6902,5 TFlops200 kWatt
10 M€1984Cray 1M
160 MFlops
2008 (HLRN-II)SGI ICE, XE150 TFlops
2013 (HLRN-III)Cray XC30/XC40
1,3 PFlops
[peak performance]
2
192DEC Alpha
256DEC Alpha
384IBM Power4
10.240Intel Harpertown,
Nehalem
44.928Intel Ivy Bridge, Haswell
1
[cores]
Xeon Phi KNL, 2016
Intel Xeon Phi 729072 cores (288 threads)3 TFLOPS245 Watt6662 € (6/2017)
Y
Supercomputer at ZIB
4 / 44
![Page 5: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/5.jpg)
Applications
• GLAT (atomistic thermodynamics)
• VASP (electronic structure)
• BQCD (high-energy physics)
• HEOM (photo-active processes)
• BOSS (time series analysis, phase 2)
• PALM (fluid dynamics, phase 2)
• PHOENIX3D (astrophysics, phase 2)
Challenges
• Adapting data structures for enabling SIMD
• Vectorising complex code structures
• Transition to hybrid MPI + OpenMP
• (Offload with Intel LEO vs. OpenMP 4.x)
OBJECTIVE
Many-Core High-Performance Computing
RESEARCH
Programming Models,
Runtime Libraries
APPLICATIONS
Modernization,
OpenMP/MPI,
Scalability
Intel Parallel Compute Center (IPCC)
Research Center for Many-Core HPC at ZIB
5 / 44
![Page 6: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/6.jpg)
Knights Landing (KNL): Intel’s 2. Many Integrated Core (MIC) Architecture1)
self-booting CPU, optionall with integrated OmniPath fabric 64+ core (based on Intel Atom Silvermont architecture, x86-64)
4-way hardware-threading 512-bit SIMD vector processing (AVX-512)
On-Chip Multi-Channel (MC) DRAM: 16 GiB DDR4 main memory: up to 384 GiB
1) A. Sodani et al., Knights Landing: Second-Generation Intel Xeon Phi Product, IEEE Micro vol. 6, April 2016
Intel Xeon Phi (Knights Landing) – Architecture
6 / 44
![Page 7: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/7.jpg)
KNL CPU
PCIe 3 DMI
Tile
DDR MC
EDC EDC
EDC EDC
3 D
DR
4 C
han
nel
s
DDR MC
EDC EDC
EDC EDC
3 D
DR
4 C
han
nel
s
Misc
up to 36 active tilesconnected via 2D-Mesh-
Interconnect
MCDRAM MCDRAM MCDRAM MCDRAM
MCDRAM MCDRAM MCDRAM MCDRAM
Intel Xeon Phi (Knights Landing) – Architecture
7 / 44
![Page 8: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/8.jpg)
KNL CPU Tile
PCIe 3 DMI
Tile
DDR MC
EDC EDC
EDC EDC
3 D
DR
4 C
han
nel
s
DDR MC
EDC EDC
EDC EDC
3 D
DR
4 C
han
nel
s
Misc
up to 36 active tilesconnected via 2D-Mesh-
Interconnect
MCDRAM MCDRAM MCDRAM MCDRAM
MCDRAM MCDRAM MCDRAM MCDRAM
Core Core
2 VPUs 2 VPUs
1 MiB L2 Cache
CHA
2 out-of-order cores 2 VPUs per core (AVX-512) 1 MiB shared L2-Cache Caching/Home-Agent (CHA)
interface to 2D-Mesh-Interconnect
distributed tag directory
(MESIF cache coherency protocol)
Intel Xeon Phi (Knights Landing) – Architecture
8 / 44
![Page 9: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/9.jpg)
2 memory controller 6 DDR4 Channels
KNL CPU DDR4 Memory (up to 384 GiB)
PCIe 3 DMI
Tile
DDR MC
EDC EDC
EDC EDC
3 D
DR
4 C
han
nel
s
DDR MC
EDC EDC
EDC EDC
3 D
DR
4 C
han
nel
s
Misc
up to 36 active tilesconnected via 2D-Mesh-
Interconnect
MCDRAM MCDRAM MCDRAM MCDRAM
MCDRAM MCDRAM MCDRAM MCDRAM
Intel Xeon Phi (Knights Landing) – Architecture
9 / 44
![Page 10: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/10.jpg)
2 memory controller 6 DDR4 Channels
MCDRAM (16 GiB) 8 on-chip units, each 2 GiB
KNL CPU DDR4 Memory (up to 384 GiB)
PCIe 3 DMI
Tile
DDR MC
EDC EDC
EDC EDC
3 D
DR
4 C
han
nel
s
DDR MC
EDC EDC
EDC EDC
3 D
DR
4 C
han
nel
s
Misc
up to 36 active tilesconnected via 2D-Mesh-
Interconnect
MCDRAM MCDRAM MCDRAM MCDRAM
MCDRAM MCDRAM MCDRAM MCDRAM
Intel Xeon Phi (Knights Landing) – Architecture
10 / 44
![Page 11: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/11.jpg)
Flat-ModeDDR4 and MCDRAMin same address space
Cache-ModeMCDRAM = direct-mapped Cachefor DDR4
Hybrid-Mode8 | 4 GiB MCDRAMin Cache-Mode,remainder in Flat-Mode
KNL CPU Memory Modes:
PCIe 3 DMI
Tile
DDR MC
EDC EDC
EDC EDC
3 D
DR
4 C
han
nel
s
DDR MC
EDC EDC
EDC EDC
3 D
DR
4 C
han
nel
s
Misc
up to 36 active tilesconnected via 2D-Mesh-
Interconnect
MCDRAM MCDRAM MCDRAM MCDRAM
MCDRAM MCDRAM MCDRAM MCDRAM
DDR4
16 GiBMCDRAM
16 GiBMCDRAM
DDR4
DDR4
8 | 12 GiBMCDRAM
8 | 4 GiBMCDRAM
phy
s. a
dd
ress
sp
ace
phy
s. a
dd
ress
sp
ace
Intel Xeon Phi (Knights Landing) – Architecture
11 / 44
![Page 12: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/12.jpg)
KNL in the Roofline Model (Samuel Williams, 2008)
12 / 44
Arithmetic Intensity
[FLOP/Byte]1 2 4 8 16 32 64 128 256
Attainable
FLOPS
[GFLOPS]
1
4
16
64
256
1024
4096
![Page 13: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/13.jpg)
KNL in the Roofline Model - adding peak FLOPS
12 / 44Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s
Arithmetic Intensity
[FLOP/Byte]1 2 4 8 16 32 64 128 256
Attainable
FLOPS
[GFLOPS]
1
4
16
64
256
1024
4096 peak FLOPS
![Page 14: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/14.jpg)
KNL in the Roofline Model - adding peak DRAM bandwidth
12 / 44Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s
Arithmetic Intensity
[FLOP/Byte]1 2 4 8 16 32 64 128 256
Attainable
FLOPS
[GFLOPS]
1
4
16
64
256
1024
4096 peak FLOPS
DRAM BW
![Page 15: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/15.jpg)
KNL in the Roofline Model
12 / 44Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s
Arithmetic Intensity
[FLOP/Byte]1 2 4 8 16 32 64 128 256
Attainable
FLOPS
[GFLOPS]
1
4
16
64
256
1024
4096 peak FLOPS
DRAM BW
![Page 16: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/16.jpg)
KNL in the Roofline Model
12 / 44Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s
Arithmetic Intensity
[FLOP/Byte]1 2 4 8 16 32 64 128 256
Attainable
FLOPS
[GFLOPS]
1
4
16
64
256
1024
4096 peak FLOPS
DRAM BW
AttainableFLOPS
= min
PeakFLOPS
MemoryBandwidth
× ArithmeticIntensity
![Page 17: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/17.jpg)
KNL in the Roofline Model
12 / 44Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s
Arithmetic Intensity
[FLOP/Byte]1 2 4 8 16 32 64 128 256
Attainable
FLOPS
[GFLOPS]
1
4
16
64
256
1024
4096 peak FLOPS
DRAM BW
AttainableFLOPS
= min
PeakFLOPS
MemoryBandwidth
× ArithmeticIntensity
⇐ memory bound compute bound ⇒
![Page 18: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/18.jpg)
KNL in the Roofline Model
12 / 44Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s
Arithmetic Intensity
[FLOP/Byte]1 2 4 8 16 32 64 128 256
Attainable
FLOPS
[GFLOPS]
1
4
16
64
256
1024
4096 peak FLOPS
DRAM BW
⇐ memory bound compute bound ⇒
=
PeakFLOPS
MemoryBandwidth
![Page 19: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/19.jpg)
KNL in the Roofline Model
12 / 44Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/sNumbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s DDR, 490 GB/s MCDRAM; Xeon E5-2680v3: 480 GFLOPS, 68 GiB/s
Arithmetic Intensity
[FLOP/Byte]1 2 4 8 16 32 64 128 256
Attainable
FLOPS
[GFLOPS]
1
4
16
64
256
1024
4096 peak FLOPS
DRAM BW
=
PeakFLOPS
MemoryBandwidth
MCDRAM
BW
MinimalArithmetic Intensitynecessary forpeak FLOPS(HLRN nodes):KNL DRAM: 22.7KNL MCDRAM: 5.3Haswell: 7.1
![Page 20: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/20.jpg)
Refining the Ceilings - FLOPS
Realistic peak FLOPS• Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP
• 1.4 GHz × 68 core × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS• AVX frequency is only 1.2 GHz and might throttle down under heavy load⇒ actual peak: 2611.2 GFLOPS
Add more FLOPS ceilings• without instruction level parallelism (ILP), i.e. dual VPUs and FMA
• 1.2 GHz × 68 core × 8 SIMD = 652.8 GFLOPS (25%)• without ILP, and without SIMD
• 1.2 GHz × 68 core = 84.6 GFLOPS (3.2%)
13 / 44
![Page 21: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/21.jpg)
Refining the Ceilings - FLOPS
Realistic peak FLOPS• Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP
• 1.4 GHz × 68 core × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS
• AVX frequency is only 1.2 GHz and might throttle down under heavy load⇒ actual peak: 2611.2 GFLOPS
Add more FLOPS ceilings• without instruction level parallelism (ILP), i.e. dual VPUs and FMA
• 1.2 GHz × 68 core × 8 SIMD = 652.8 GFLOPS (25%)• without ILP, and without SIMD
• 1.2 GHz × 68 core = 84.6 GFLOPS (3.2%)
13 / 44
![Page 22: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/22.jpg)
Refining the Ceilings - FLOPS
Realistic peak FLOPS• Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP
• 1.4 GHz × 68 core × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS• AVX frequency is only 1.2 GHz and might throttle down under heavy load
⇒ actual peak: 2611.2 GFLOPS
Add more FLOPS ceilings• without instruction level parallelism (ILP), i.e. dual VPUs and FMA
• 1.2 GHz × 68 core × 8 SIMD = 652.8 GFLOPS (25%)• without ILP, and without SIMD
• 1.2 GHz × 68 core = 84.6 GFLOPS (3.2%)
13 / 44
![Page 23: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/23.jpg)
Refining the Ceilings - FLOPS
Realistic peak FLOPS• Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP
• 1.4 GHz × 68 core × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS• AVX frequency is only 1.2 GHz and might throttle down under heavy load⇒ actual peak: 2611.2 GFLOPS
Add more FLOPS ceilings• without instruction level parallelism (ILP), i.e. dual VPUs and FMA
• 1.2 GHz × 68 core × 8 SIMD = 652.8 GFLOPS (25%)• without ILP, and without SIMD
• 1.2 GHz × 68 core = 84.6 GFLOPS (3.2%)
13 / 44
![Page 24: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/24.jpg)
Refining the Ceilings - FLOPS
Realistic peak FLOPS• Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP
• 1.4 GHz × 68 core × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS• AVX frequency is only 1.2 GHz and might throttle down under heavy load⇒ actual peak: 2611.2 GFLOPS
Add more FLOPS ceilings• without instruction level parallelism (ILP), i.e. dual VPUs and FMA
• 1.2 GHz × 68 core × 8 SIMD = 652.8 GFLOPS (25%)
• without ILP, and without SIMD• 1.2 GHz × 68 core = 84.6 GFLOPS (3.2%)
13 / 44
![Page 25: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/25.jpg)
Refining the Ceilings - FLOPS
Realistic peak FLOPS• Xeon Phi 7250 (HLRN Cray TDS), advertised with 3.05 TFLOPS peak DP
• 1.4 GHz × 68 core × 8 SIMD × 2 VPUs × 2 FMA = 3046.4 GFLOPS• AVX frequency is only 1.2 GHz and might throttle down under heavy load⇒ actual peak: 2611.2 GFLOPS
Add more FLOPS ceilings• without instruction level parallelism (ILP), i.e. dual VPUs and FMA
• 1.2 GHz × 68 core × 8 SIMD = 652.8 GFLOPS (25%)• without ILP, and without SIMD
• 1.2 GHz × 68 core = 84.6 GFLOPS (3.2%)
13 / 44
![Page 26: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/26.jpg)
KNL in the Roofline Model
14 / 44Numbers for Intel Xeon Phi 7250: 2611.2 GFLOPS, 115.2 GiB/s
Arithmetic Intensity
[FLOP/Byte]1 2 4 8 16 32 64 128 256
Attainable
FLOPS
[GFLOPS]
1
4
16
64
256
1024
4096 peak FLOPS
DRAM BW
MCDRAM
BW
w/o ILP
w/o ILP+SIMD
Upper computebound in GFLOPS:peak: 2611.2w/o ILP: 652.8w/o ILP+SIMD: 84.6
![Page 27: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/27.jpg)
Case study I: Atmospheric Research with PALM
DUSTDEVILS
WIND ENERGY
CLOUD PHYSIC
TURBULENCE EFFECTS ON AIRCRAFT
URBAN METEOROLOGY AND CITY PLANING
15 / 44https://palm.muk.uni-hannover.de
![Page 28: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/28.jpg)
The PALM Code
• continuously developed since 1997 by the PALM group (Siegfried Raasch et al.)
16 / 44https://palm.muk.uni-hannover.de
![Page 29: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/29.jpg)
The PALM Code
• continuously developed since 1997 by the PALM group (Siegfried Raasch et al.)
• Fortran 95/2003.• hybrid MPI + OpenMP code• 140 kLOC, 79 modules and 171 source files• highly scalable, tested for up to 43,200 cores
16 / 44https://palm.muk.uni-hannover.de
![Page 30: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/30.jpg)
The PALM Code
• continuously developed since 1997 by the PALM group (Siegfried Raasch et al.)
• Fortran 95/2003.• hybrid MPI + OpenMP code• 140 kLOC, 79 modules and 171 source files• highly scalable, tested for up to 43,200 cores
• runs on the HLRN supercomputing facilities at Berlin (ZIB) and Hannover (LUIS)• modernisation target within the Intel Parallel Computing Center at ZIB
16 / 44https://palm.muk.uni-hannover.de
![Page 31: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/31.jpg)
PALM Hackathon at ZIB
17 / 44left to right: Matthias Noack, Florian Wende, Helge Knoop, Matthias Sühring, Tobias Gronemeier; behind the camera: Thomas Steinke
![Page 33: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/33.jpg)
Parallelization Strategy
2D domain decomposition
21
0
54
3
87
6
x
y
18 / 44https://palm.muk.uni-hannover.de
![Page 34: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/34.jpg)
Parallelization Strategy
2D domain decomposition
21
0
54
3
87
6
x
y
outer of 3 nested loops threaded
n• different variants for different
targets• e.g. cache optimised with
decomposed inner loop
• no vectorisation⇒ use of floating point
exceptions preventsautomatic vectorisation
⇒ no explicit SIMD constructs
18 / 44https://palm.muk.uni-hannover.de
![Page 35: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/35.jpg)
KNL Optimisation Strategy
19 / 44
getoperationalon KNL
![Page 36: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/36.jpg)
KNL Optimisation Strategy
19 / 44
getoperationalon KNL
- build scripts- job scripts- bug fixing
![Page 37: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/37.jpg)
KNL Optimisation Strategy
19 / 44
getoperationalon KNL
- build scripts- job scripts- bug fixing
definesome
benchmarks
![Page 38: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/38.jpg)
KNL Optimisation Strategy
19 / 44
getoperationalon KNL
- build scripts- job scripts- bug fixing
definesome
benchmarks
- small- medium- large
![Page 39: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/39.jpg)
KNL Optimisation Strategy
19 / 44
getoperationalon KNL
- build scripts- job scripts- bug fixing
definesome
benchmarks
- small- medium- large
measurebaselineon Xeon
![Page 40: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/40.jpg)
Benchmark Systems and Haswell Baseline
HLRN-III prod. system, 1872 nodes (Konrad)• 2 × Intel Xeon E5-2680v3 (Haswell)
• 2 × 12 cores at 2.5 GHz⇒ 960 GFLOPS per node• 64 GiB DDR3, 136 GiB/s
20 / 44
![Page 41: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/41.jpg)
Benchmark Systems and Haswell Baseline
HLRN-III prod. system, 1872 nodes (Konrad)• 2 × Intel Xeon E5-2680v3 (Haswell)
• 2 × 12 cores at 2.5 GHz⇒ 960 GFLOPS per node• 64 GiB DDR3, 136 GiB/s
Haswell Resultsnodes runtime
small 4 131.3 smedium 8 147.6 slarge 16 164.6 s
20 / 44
![Page 42: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/42.jpg)
Benchmark Systems and Haswell Baseline
HLRN-III prod. system, 1872 nodes (Konrad)• 2 × Intel Xeon E5-2680v3 (Haswell)
• 2 × 12 cores at 2.5 GHz⇒ 960 GFLOPS per node• 64 GiB DDR3, 136 GiB/s
HLRN KNL TDS node, 80 nodes• 1 × Intel Xeon Phi 7250 (KNL)
• 68 cores at 1.4 GHz⇒ 2611.2 GFLOPS with AVX clock
"3.05 TFLOPS"• 96 GiB DDR4, 115.2 GiB/s• 16 GiB MCDRAM, 490 GiB/s
Haswell Resultsnodes runtime
small 4 131.3 smedium 8 147.6 slarge 16 164.6 s
20 / 44
![Page 43: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/43.jpg)
Benchmark Systems and Haswell Baseline
HLRN-III prod. system, 1872 nodes (Konrad)• 2 × Intel Xeon E5-2680v3 (Haswell)
• 2 × 12 cores at 2.5 GHz⇒ 960 GFLOPS per node• 64 GiB DDR3, 136 GiB/s
HLRN KNL TDS node, 80 nodes• 1 × Intel Xeon Phi 7250 (KNL)
• 68 cores at 1.4 GHz⇒ 2611.2 GFLOPS with AVX clock
"3.05 TFLOPS"• 96 GiB DDR4, 115.2 GiB/s• 16 GiB MCDRAM, 490 GiB/s
• Upper bounds for speed-up:⇒ compute bound: 2.7×⇒ memory bound: 3.6× (MCDRAM)
Haswell Resultsnodes runtime
small 4 131.3 smedium 8 147.6 slarge 16 164.6 s
20 / 44
![Page 44: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/44.jpg)
KNL Optimisation Strategy
21 / 44
getoperationalon KNL
- build scripts- job scripts- bug fixing
definesome
benchmarks
- small- medium- large
measurebaselineon Xeon
X
![Page 45: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/45.jpg)
KNL Optimisation Strategy
21 / 44
getoperationalon KNL
- build scripts- job scripts- bug fixing
definesome
benchmarks
- small- medium- large
measurebaselineon Xeon
X
MCDRAMand
boot mode
![Page 46: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/46.jpg)
MCDRAM Usage and Boot Mode
boot flat mode
run in DDR
run in MCDRAM
boot cache mode
run again
- problem < 16 GiB- upper bound forgain from MCDRAM
- good enough?⇒ decide about
explicit placement
22 / 44
![Page 47: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/47.jpg)
MCDRAM Usage and Boot Mode
boot flat mode
run in DDR
run in MCDRAM
boot cache mode
run again
- problem < 16 GiB- upper bound forgain from MCDRAM
- good enough?⇒ decide about
explicit placement
• 25 - 41% gain from MCDRAM• ≤ 3% loss from Cache-Mode⇒ no need for explicit placement
• 2 MiB pages worked best
22 / 44
0
20
40
60
80
100
120
140
160
small medium large
runtime[s]
MCDRAM Usage Comparison
1.41
1.411.25
1.38
1.391.22
quad_�at, DDR4quad_�at, MCDRAMquad_cache, both
![Page 48: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/48.jpg)
KNL Optimisation Strategy
23 / 44
getoperationalon KNL
- build scripts- job scripts- bug fixing
definesome
benchmarks
- small- medium- large
measurebaselineon Xeon
X
MCDRAMand
boot mode
⇒ quad_cache
![Page 49: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/49.jpg)
KNL Optimisation Strategy
23 / 44
getoperationalon KNL
- build scripts- job scripts- bug fixing
definesome
benchmarks
- small- medium- large
measurebaselineon Xeon
X
MCDRAMand
boot mode
⇒ quad_cache
. . .
. . .MPI processes
vs.OpenMP threads
![Page 50: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/50.jpg)
MPI processes vs. OpenMP threads
Tuning run for optimal per-node config
ranks 64 32 16 8 4 2 1threads 1 2 4 8 16 32 64small X
medium Xlarge X
≤ 2% off from best≤ 15% off from best
worse
X fastest
24 / 44quad_cache mode
![Page 51: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/51.jpg)
MPI processes vs. OpenMP threads
Tuning run for optimal per-node config
ranks 64 32 16 8 4 2 1threads 1 2 4 8 16 32 64small X
medium Xlarge X
≤ 2% off from best≤ 15% off from best
worse
X fastest
Conclusion• fastest⇒ small: 16 ranks × 4 thread⇒ medium: 8 ranks × 8 threads⇒ large: 32 × 2 threads
• 16 × 4 performs for all setups• fastest vs. slowest config: 2.3×
24 / 44quad_cache mode
![Page 52: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/52.jpg)
MPI processes vs. OpenMP threads
Tuning run for optimal per-node config
ranks 64 32 16 8 4 2 1threads 1 2 4 8 16 32 64small X
medium Xlarge X
≤ 2% off from best≤ 15% off from best
worse
X fastest
Conclusion• fastest⇒ small: 16 ranks × 4 thread⇒ medium: 8 ranks × 8 threads⇒ large: 32 × 2 threads
• 16 × 4 performs for all setups• fastest vs. slowest config: 2.3×
• very low impact on speedupsbetween MCDRAM usage models
• absolute numbers vary largely⇒ do memory first
24 / 44quad_cache mode
![Page 53: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/53.jpg)
MPI processes vs. OpenMP threads
Tuning run for optimal per-node config
ranks 64 32 16 8 4 2 1threads 1 2 4 8 16 32 64small X
medium Xlarge X
≤ 2% off from best≤ 15% off from best
worse
X fastest
Conclusion• fastest⇒ small: 16 ranks × 4 thread⇒ medium: 8 ranks × 8 threads⇒ large: 32 × 2 threads
• 16 × 4 performs for all setups• fastest vs. slowest config: 2.3×
• very low impact on speedupsbetween MCDRAM usage models
• absolute numbers vary largely⇒ do memory first
• always check pinning/affinity
24 / 44quad_cache mode
![Page 54: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/54.jpg)
KNL Optimisation Strategy
25 / 44
getoperationalon KNL
- build scripts- job scripts- bug fixing
definesome
benchmarks
- small- medium- large
measurebaselineon Xeon
X
MCDRAMand
boot mode
⇒ quad_cache
. . .
. . .MPI processes
vs.OpenMP threads
⇒ 16 × 4
![Page 55: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/55.jpg)
KNL Optimisation Strategy
25 / 44
getoperationalon KNL
- build scripts- job scripts- bug fixing
definesome
benchmarks
- small- medium- large
measurebaselineon Xeon
X
MCDRAMand
boot mode
⇒ quad_cache
. . .
. . .MPI processes
vs.OpenMP threads
⇒ 16 × 4
startoptimisationworkflow
![Page 56: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/56.jpg)
KNL Optimisation Strategy
25 / 44
getoperationalon KNL
- build scripts- job scripts- bug fixing
definesome
benchmarks
- small- medium- large
measurebaselineon Xeon
X
MCDRAMand
boot mode
⇒ quad_cache
. . .
. . .MPI processes
vs.OpenMP threads
⇒ 16 × 4
startoptimisationworkflow
![Page 57: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/57.jpg)
KNL Optimisation Strategy
25 / 44
getoperationalon KNL
- build scripts- job scripts- bug fixing
definesome
benchmarks
- small- medium- large
measurebaselineon Xeon
X
MCDRAMand
boot mode
⇒ quad_cache
. . .
. . .MPI processes
vs.OpenMP threads
⇒ 16 × 4
startoptimisationworkflow
- compiler optimisation reports- Intel VTune Amplifier XE, Advisor XE
![Page 58: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/58.jpg)
First Code Changes
Floating point exception-handling
• prevents vectorisation• fp-model-strict → fp-model-source• remove exception handling• add NaN/Inf-tests⇒ when writing checkpoints
⇒ no significant auto vectorisation⇒ suboptimal memory layout
26 / 44
![Page 59: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/59.jpg)
First Code Changes
Floating point exception-handling
• prevents vectorisation• fp-model-strict → fp-model-source• remove exception handling• add NaN/Inf-tests⇒ when writing checkpoints
⇒ no significant auto vectorisation⇒ suboptimal memory layout
Using MKL FFT
• small gain for benchmarks⇒ might be significant for larger setups
26 / 44
![Page 60: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/60.jpg)
First Code Changes
Floating point exception-handling
• prevents vectorisation• fp-model-strict → fp-model-source• remove exception handling• add NaN/Inf-tests⇒ when writing checkpoints
⇒ no significant auto vectorisation⇒ suboptimal memory layout
Using MKL FFT
• small gain for benchmarks⇒ might be significant for larger setups
CONTIGUOUS keyword
• from Fortran 2008• tell the compiler about contiguously
allocated arrays
26 / 44
0
50
100
150
200
small medium large
runtime[s]
Current Results, Intel Compiler 17.0.0
1.001.00
1.00
1.091.13
1.01
1.24
1.07
0.96
HSW, BaselineHSW, CurrentKNL, Current
![Page 61: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/61.jpg)
Cray Specialities
Hardware/Software
• Cray Aries network• Cray MPI• Cray Performance Tools• Cray Compiler
27 / 44
![Page 62: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/62.jpg)
Cray Specialities
Hardware/Software
• Cray Aries network• Cray MPI• Cray Performance Tools• Cray Compiler
Rank Reordering
• instrumented application run• optimised mapping of MPI
ranks to cores and nodes⇒ no improvement on KNL
27 / 44
![Page 63: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/63.jpg)
Cray Specialities
Hardware/Software
• Cray Aries network• Cray MPI• Cray Performance Tools• Cray Compiler
Rank Reordering
• instrumented application run• optimised mapping of MPI
ranks to cores and nodes⇒ no improvement on KNL
Cray Compiler
• initial KNL support• crash with OpenMP⇒ 64 MPI ranks per KNL
27 / 44
0
50
100
150
200
small medium large
runtime[s]
Current Results, Intel 17.0.0 vs. Cray Compiler 8.5.3
1.001.00
1.00
1.091.13
1.01
1.24
1.07
0.96
1.29
1.17
1.12
HSW, Baseline, IntelHSW, Current, IntelKNL, Current, IntelKNL, Current, Cray
![Page 64: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/64.jpg)
Projected Production Run Performance• benchmark runs: ≈ 5 min, productions runs: ≈ 12 hours⇒ serial initialisation becomes negligible⇒ plot speedup based on ttotal − tinit
28 / 44
![Page 65: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/65.jpg)
Projected Production Run Performance• benchmark runs: ≈ 5 min, productions runs: ≈ 12 hours⇒ serial initialisation becomes negligible⇒ plot speedup based on ttotal − tinit
28 / 44
0
0.5
1
1.5
2
small medium large
speedup
Projected Speedups (without init)
1.00 1.00 1.001.10 1.14
1.00
1.251.11 1.06
1.45
1.25 1.23
HSW, BaselineHSW, Current, IntelKNL, Current, IntelKNL, Current, Cray
![Page 66: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/66.jpg)
PALM - Final RemarksBottom Line
• getting started on KNL was easy⇒ way easier than KNC and offloading
• good initial performance (cache-mode)• scalar parts hurt⇒ initialisation⇒ . . .
29 / 44https://palm.muk.uni-hannover.de
![Page 67: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/67.jpg)
PALM - Final RemarksBottom Line
• getting started on KNL was easy⇒ way easier than KNC and offloading
• good initial performance (cache-mode)• scalar parts hurt⇒ initialisation⇒ . . .
• Cray Fortran compiler up to 16.5%faster than Intel on KNL
29 / 44https://palm.muk.uni-hannover.de
![Page 68: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/68.jpg)
PALM - Final RemarksBottom Line
• getting started on KNL was easy⇒ way easier than KNC and offloading
• good initial performance (cache-mode)• scalar parts hurt⇒ initialisation⇒ . . .
• Cray Fortran compiler up to 16.5%faster than Intel on KNL
• speedup over dual HSW HLRN node:• benchmark: up to 1.29ו production (projected): up to 1.45×⇒ even without vectorisation
29 / 44https://palm.muk.uni-hannover.de
![Page 69: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/69.jpg)
PALM - Final RemarksBottom Line
• getting started on KNL was easy⇒ way easier than KNC and offloading
• good initial performance (cache-mode)• scalar parts hurt⇒ initialisation⇒ . . .
• Cray Fortran compiler up to 16.5%faster than Intel on KNL
• speedup over dual HSW HLRN node:• benchmark: up to 1.29ו production (projected): up to 1.45×⇒ even without vectorisation
What’s next• VTune Results:⇒ increase concurrency⇒ reduce L2 misses on KNL
• adapt data layout for SIMD⇒ twice the potential on KNL
29 / 44https://palm.muk.uni-hannover.de
![Page 70: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/70.jpg)
Case study II: Material Science with VASPVASP – Vienna Ab-Initio Simulation Package
• Plan wave electronic structure code to model atomic scale materials from firstprincipals: Hψ = Eψ
• widely used in material sciences• historically grown, large MPI-only Fortran code base
• different approximations (DFT, hybrid functionals, . . . ) to tackle the physics• library dependencies: FFTW, BLAS, scaLAPACK, ELPA
Collaboration of:
30 / 44ZIB contact: Florian Wende ([email protected])
![Page 71: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/71.jpg)
SIMD – Single Instruction Multiple Data Multiple words are processed at once sharing 1 program counter SIMD registers become increasingly larger: currently 512 bit with AVX-512
Slightly increased logic on the chip, but heavily increased arithmetic throughput Xeon Phi KNL w/ and w/o SIMD: 3 TFLOPS vs. 0.37 TFLOPS
SIMD Introduction
31 / 44
![Page 72: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/72.jpg)
SIMD – Single Instruction Multiple Data Multiple words are processed at once sharing 1 program counter
for (i = 0; i < N; ++i)y[i] = log(x[i]);
...
tim
eSIMD Introduction
32 / 44
![Page 73: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/73.jpg)
SIMD – Single Instruction Multiple Data Multiple words are processed at once sharing 1 program counter
for (i = 0; i < N; ++i) for (i = 0; i < N; i += 8)y[i] = log(x[i]); y[i + 0] = log(x[i + 0]);
... y[i] = vlog(x[i])y[i + 7] = log(x[i + 7]);
...
...
tim
e
8 times faster execution with SIMD
2
SIMD Introduction
33 / 44
![Page 74: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/74.jpg)
4 times faster execution with SIMD
SIMD – Single Instruction Multiple Data Multiple words are processed at once sharing 1 program counter Control flow divergences can hurt SIMD performance significantly
for (i = 0; i < N; ++i) for (i = 0; i < N; i += 8)if (p[i]) y[i] = log(x[i]); if (m← p[i]) y[i] = vlog_mask(y[i], m, x[i]);else y[i] = exp(x[i]); else y[i] = vexp_mask(y[i], ~m, x[i]);
...
...
tim
eSIMD Introduction
34 / 44
![Page 75: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/75.jpg)
OpenMP 4.x compiler directives portability across compilers low code invasiveness no SIMD intrinsics for Fortran
combine OpenMP 4.x SIMD with “high-level vectors” (loop chunking) to increase flexibility and expressiveness
SIMD Vectorisation in VASP
35 / 44
![Page 76: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/76.jpg)
Non-vectorizable loop split into parts to enable SIMD vectorization
idx = 0;for (i = 0; i < ni; ++i) {while (“some condition”)++idx;
d = data[idx];for (j = 0; j < nj; ++j)res[j] += d * (...);
}
C-version of the Codes(not optimized)
SIMD Vectorisation in VASP - Example
36 / 44
![Page 77: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/77.jpg)
Non-vectorizable loop split into parts to enable SIMD vectorization
idx = 0;for (i = 0; i < ni; ++i) {while (“some condition”)++idx;
d = data[idx];for (j = 0; j < nj; ++j)res[j] += d * (...);
}
nj rather small: not a candidate for SIMD vectorization
SIMD Vectorisation in VASP - Example
37 / 44
![Page 78: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/78.jpg)
Non-vectorizable loop split into parts to enable SIMD vectorization
idx = 0;for (i = 0; i < ni; ++i) {while (“some condition”)++idx;
d = data[idx];for (j = 0; j < nj; ++j)res[j] += d * (...);
}
loop iterations are not independent: idx
SIMD Vectorisation in VASP - Example
38 / 44
![Page 79: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/79.jpg)
Non-vectorizable loop split into parts to enable SIMD vectorization
idx = 0;for (i = 0; i < ni; ++i) {while (“some condition”)++idx;
d = data[idx];for (j = 0; j < nj; ++j)res[j] += d * (...);
}
idx = 0;for (i = 0; i < ni; i += CHUNKSIZE) {ii_max = min(CHUNKSIZE, ni – i);for (ii = 0; ii < ii_max; ++ii) {while (“some condition”)
++idx;vidx[ii] = idx;
}...
}
Compute idx-values in advance to enable SIMD vectorization afterwards!
Loop-Chunking, e.g. CHUNKSIZE=32
SIMD Vectorisation in VASP - Example
39 / 44
![Page 80: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/80.jpg)
Non-vectorizable loop split into parts to enable SIMD vectorization
idx = 0;for (i = 0; i < ni; ++i) {while (“some condition”)++idx;
d = data[idx];for (j = 0; j < nj; ++j)res[j] += d * (...);
}
idx = 0;for (i = 0; i < ni; i += CHUNKSIZE) {ii_max = min(CHUNKSIZE, ni – i);for (ii = 0; ii < ii_max; ++ii) {while (“some condition”)
++idx;vidx[ii] = idx;
}for (ii = 0; ii < ii_max; ++ii)vd[ii] = data[vidx[ii]];
...}
Load data in a separate loop: leave it to the compilerto vectorize or not
SIMD Vectorisation in VASP - Example
40 / 44
![Page 81: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/81.jpg)
Non-vectorizable loop split into parts to enable SIMD vectorization
idx = 0;for (i = 0; i < ni; ++i) {while (“some condition”)++idx;
d = data[idx];for (j = 0; j < nj; ++j)res[j] += d * (...);
}
idx = 0;for (i = 0; i < ni; i += CHUNKSIZE) {ii_max = min(CHUNKSIZE, ni – i);for (ii = 0; ii < ii_max; ++ii) {while (“some condition”)
++idx;vidx[ii] = idx;
}for (ii = 0; ii < ii_max; ++ii)vd[ii] = data[vidx[ii]];
for (j = 0; j < nj; ++j)#pragma omp simdfor (ii = 0; ii < ii_max; ++ii)
res[j] += vd[ii] * (...);}
SIMD Vectorisation in VASP - Example
41 / 44
![Page 82: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/82.jpg)
Non-vectorizable loop split into parts to enable SIMD vectorization
0
5
10
15
20
25
30
35
Tim
e [s
]
GW0 subroutine only
no-SIMD (KNL) SIMD (KNL)
no-SIMD (2x Haswell CPU) SIMD (2x Haswell CPU)
0
10
20
30
40
50
60
70
80
90
Whole program
further optimization needed!
Xeon Phi nodes: quadrant mode, all data in MCDRAM
SIMD Vectorisation in VASP - Results
42 / 44
![Page 83: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/83.jpg)
KNL SummaryXeon Phi (KNL) has a low entry barrier. . .
• no offloading• well-known CPU toolchains and workflows
. . . but getting performance is challenging
• ease of use can be misleading towards quick fixes• code needs to be re-thought and re-written for SIMD• effort pays off with significant speed-up for hot-spots⇒ Xeon benefits from Xeon Phi optimisations as well
• overall application performance suffers from low single-thread performance⇒ working on a few hotspots is not sufficient
With AVX-512 in Xeon and Xeon Phi, SIMD can no longer be neglected.
43 / 44
![Page 84: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/84.jpg)
KNL SummaryXeon Phi (KNL) has a low entry barrier. . .
• no offloading• well-known CPU toolchains and workflows
. . . but getting performance is challenging
• ease of use can be misleading towards quick fixes• code needs to be re-thought and re-written for SIMD• effort pays off with significant speed-up for hot-spots⇒ Xeon benefits from Xeon Phi optimisations as well
• overall application performance suffers from low single-thread performance⇒ working on a few hotspots is not sufficient
With AVX-512 in Xeon and Xeon Phi, SIMD can no longer be neglected.
43 / 44
![Page 85: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/85.jpg)
KNL SummaryXeon Phi (KNL) has a low entry barrier. . .
• no offloading• well-known CPU toolchains and workflows
. . . but getting performance is challenging
• ease of use can be misleading towards quick fixes• code needs to be re-thought and re-written for SIMD• effort pays off with significant speed-up for hot-spots⇒ Xeon benefits from Xeon Phi optimisations as well
• overall application performance suffers from low single-thread performance⇒ working on a few hotspots is not sufficient
With AVX-512 in Xeon and Xeon Phi, SIMD can no longer be neglected.
43 / 44
![Page 86: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/86.jpg)
Overall Conclusions
A deep knowledge of each hardware platform is necessary to fully exploit itscomputing power.
Code modernisation for KNL within IPCCs world-wide is just one example of theeffort it takes to keep the huge amount of legacy code in HPC usable.
Within the more and more diverse HPC hardware landscape, larger shares ofHPC centre’s budgets will have to be allocated for code modernisation work inorder to utilise future machines efficiently.
44 / 44
- EoP
![Page 87: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/87.jpg)
Overall Conclusions
A deep knowledge of each hardware platform is necessary to fully exploit itscomputing power.
Code modernisation for KNL within IPCCs world-wide is just one example of theeffort it takes to keep the huge amount of legacy code in HPC usable.
Within the more and more diverse HPC hardware landscape, larger shares ofHPC centre’s budgets will have to be allocated for code modernisation work inorder to utilise future machines efficiently.
44 / 44
- EoP
![Page 88: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/88.jpg)
Overall Conclusions
A deep knowledge of each hardware platform is necessary to fully exploit itscomputing power.
Code modernisation for KNL within IPCCs world-wide is just one example of theeffort it takes to keep the huge amount of legacy code in HPC usable.
Within the more and more diverse HPC hardware landscape, larger shares ofHPC centre’s budgets will have to be allocated for code modernisation work inorder to utilise future machines efficiently.
44 / 44
- EoP
![Page 89: Performance Engineering for Legacy Codes on a Cray XC40 ... · 64+ core (based on Intel Atom Silvermont architecture, x86 -64) 4-way hardware -threading 512 -bit SIMD vector processing](https://reader034.vdocuments.mx/reader034/viewer/2022042805/5f5cc90a4d3be64b514b828c/html5/thumbnails/89.jpg)
Overall Conclusions
A deep knowledge of each hardware platform is necessary to fully exploit itscomputing power.
Code modernisation for KNL within IPCCs world-wide is just one example of theeffort it takes to keep the huge amount of legacy code in HPC usable.
Within the more and more diverse HPC hardware landscape, larger shares ofHPC centre’s budgets will have to be allocated for code modernisation work inorder to utilise future machines efficiently.
44 / 44
- EoP