TRANSCRIPT
Experiences Programming and Optimizing for Knights Landing
John Michalakes University Corporation for Atmospheric Research
Boulder, Colorado USA
2016 MultiCore 6 Workshop
September 13th 2016
(Post-workshop: slides 26 and 36 updated on 2016/09/22)
2
Outline
• Hardware and software
  – Applications: NEPTUNE, MPAS, WRF and kernels
  – Knights Landing overview
• Optimizing for KNL
  – Scaling
  – Performance
  – Roofline
• Tools
• Summary
3
Models: NEPTUNE/NUMA
Andreas Mueller, NPS, Monterey, CA; and M. Kopera, S. Marras, and F. X. Giraldo. “Towards Operational Weather Prediction at 3.0km Global Resolution With the Dynamical Core NUMA”. 96th Amer. Met. Society Annual Mtg. January, 2016. https://ams.confex.com/ams/96Annual/webprogram/Paper288883.html
4
Models: NEPTUNE/NUMA
• Spectral element
  – 4th, 5th and higher-order* continuous Galerkin (discontinuous planned)
  – Cubed sphere (also icosahedral)
  – Adaptive Mesh Refinement (AMR) in development
• Computationally dense but highly scalable
  – Constant width-one halo communication
  – Good locality for next-generation HPC
  – MPI-only, but OpenMP threading in development
[Figures: NEPTUNE 72-h forecast (5 km resolution) of accumulated precipitation for Hurricane Sandy; example of an adaptive grid tracking a severe event. Courtesy: Frank Giraldo, NPS]
* "This is not the same 'order' as is used to identify the leading term of the error in finite-difference schemes, which in fact describes accuracy. Evaluation of Gaussian quadrature over N+1 LGL quadrature points will be exact to machine precision as long as the polynomial integrand is of the order 2x(N-1)-3, or less." Gabersek et al., MWR, Apr 2012. DOI: 10.1175/MWR-D-11-00144.1
6
Models: NEPTUNE/NUMA
• NWP is one of the first HPC applications and, with climate, has been one of its key drivers
  – Exponential growth in HPC capability has translated directly into better forecasts and steadily improving value to the public
• HPC growth continues toward Peta-/Exaflop, but only parallelism is increasing
  – More floating point capability
  – Proportionately less cache, memory and I/O bandwidth
  – NWP parallelism scales in one dimension less than complexity
  – Ensembles scale computationally but move the problem to I/O
• Can operational NWP stay on the bus?
  – More scalable formulations
  – Expose and exploit all available parallelism, especially fine-grain
[Figure: computational resources needed to maintain fixed solution time with increasing resolution, from 3,000 threads (47 nodes) to 3,000,000 threads (47,000 nodes)]
Andreas Mueller, NPS, Monterey, CA; and M. Kopera, S. Marras, and F. X. Giraldo. "Towards Operational Weather Prediction at 3.0 km Global Resolution With the Dynamical Core NUMA". 96th Amer. Met. Society Annual Mtg., January 2016. https://ams.confex.com/ams/96Annual/webprogram/Paper288883.html
See also: Andreas Müller, Michal A. Kopera, Simone Marras, Lucas C. Wilcox, Tobin Isaac, Francis X. Giraldo. "Strong Scaling for Numerical Weather Prediction at Petascale with the Atmospheric Model NUMA" (last revised 8 Sep 2016, v3). http://arxiv.org/abs/1511.01561
7
Models: MPAS
Primary funding: National Science Foundation and the DOE Office of Science
Collaborative project between NCAR and LANL developing atmosphere, ocean and other earth-system simulation components for use in climate, regional climate and weather studies
• Applications include global NWP, global atmospheric chemistry, regional climate, tropical cyclones, and convection-permitting hazardous weather forecasting
• Finite volume, C-grid
• Refinement capability
  – Centroidal Voronoi-tessellated unstructured mesh allows arbitrary in-place horizontal mesh refinement
• HPC readiness
  – Current release (v4.0) supports parallelism via MPI (horizontal domain decomposition)
  – Hybrid (MPI+OpenMP) parallelism implemented, undergoing testing
8
Hardware
• Intel Xeon Phi 7250 (Knights Landing) announced at ISC'16 in June
  – 14 nanometer feature size, > 8 billion transistors
  – 68 cores, 1.4 GHz, modified "Silvermont" with out-of-order instruction execution
  – Two 512-bit wide Vector Processing Units per core
  – Peak ~3 TF/s double precision, ~6 TF/s single precision
  – 16 GB MCDRAM (on-package) memory, > 400 GB/s bandwidth
  – "Hostless": no separate host processor and no "offload" programming
  – Binary-compatible ISA (instruction set architecture)
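For reference, the arithmetic behind the quoted peak: 68 cores × 2 VPUs × 8 double precision lanes × 2 flops per FMA × 1.4 GHz ≈ 3.05 TF/s double precision, and roughly twice that in single precision.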
9
Optimizing for Intel Xeon Phi
• Most work in MIC programming involves optimization to achieve the best share of peak performance
  – Parallelism:
    • KNL has up to 288 hardware threads (4 per core) and a total of more than 2000 floating point units on the chip
    • Expose coarse, medium and especially fine-grain (vector) parallelism in the application to use Xeon Phi efficiently
  – Locality:
    • Reorganize and restructure data structures and computation to improve data reuse in cache and reduce the time floating point units sit idle waiting for data (memory latency)
    • Kernels that do not have high data reuse, or that do not fit in cache, require high memory bandwidth
• The combination of these factors affecting performance on the Xeon Phi (or any contemporary processor) can be characterized in terms of computational intensity, and conceptualized using the Roofline Model
Optimizing for Intel Xeon Phi
• Roofline Model of Processor Performance
  – Bounds application performance as a function of computational intensity*: the number of floating point operations per byte moved between the processor and a level of the memory hierarchy
Williams, S. W., A. Waterman, D. Patterson. Electrical Engineering and Computer Sciences, University of California at Berkeley. Technical Report No. UCB/EECS-2008-134. October 17, 2008.
[Roofline plot: achievable performance vs. computational intensity for KNL]
* Sometimes referred to as: arithmetic intensity (registers→L1), largely algorithmic; operational intensity (LLC→DRAM), improvable
Empirical Roofline Toolkit: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/
Thanks: Doug Doerfler, LBNL
slide 21
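In equation form (implied by the model rather than written on the slide): attainable GFLOP/s ≈ min(peak GFLOP/s, memory bandwidth [GB/s] × computational intensity [flops/byte]). The sloped part of the roofline is the bandwidth term; the flat ceilings are the compute term.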
Optimizing for Intel Xeon Phi
• Roofline Model of Processor Performance
  – Bounds application performance as a function of computational intensity
  – If intensity is high enough, the application is "compute bound" by floating point capability
    • 2333 GFLOP/s double precision, vector & FMA
[Roofline plot: KNL]
(Vectorization means the processor can perform 8 double precision operations at once as long as the operations are independent)
slide 22
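For illustration (not from the slides; array names are hypothetical), this is the kind of dependence-free loop whose iterations the compiler can spread across the 8 double precision SIMD lanes:

    !$omp simd
    do i = 1, n
       p(i) = rd * rho(i) * t(i)    ! iterations are independent, so they vectorize
    end do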
Optimizing for Intel Xeon Phi
• Roofline Model of Processor Performance
  – Bounds application performance as a function of computational intensity
  – If intensity is high enough, the application is "compute bound" by floating point capability
    • 2333 GFLOP/s double precision, vector & FMA
    • 1166 GFLOP/s double precision, vector, no FMA
[Roofline plot: KNL; 1166 GFLOP/sec ceiling without FMA]
(FMA instructions perform a floating point multiply and a floating point addition as the same operation when the compiler can detect expressions of the form: result = A*B+C)
slide 23
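A small illustration of such an expression (array names hypothetical): the loop body below is the A*B+C form the compiler can contract into a single vfmadd instruction when FMA code generation is enabled, e.g. with -xMIC-AVX512:

    do i = 1, n
       y(i) = a(i) * x(i) + y(i)    ! multiply and add fused into one FMA
    end do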
Optimizing for Intel Xeon Phi
• Roofline Model of Processor Performance
  – Bounds application performance as a function of computational intensity
  – If intensity is high enough, the application is "compute bound" by floating point capability
    • 2333 GFLOP/s double precision, vector & FMA
    • 1166 GFLOP/s double precision, vector, no FMA
    • 145 GFLOP/s double precision, no vector nor FMA
[Roofline plot: KNL; ceilings at 1166 GFLOP/sec (no FMA) and 145 GFLOP/sec (no vector)]
slide 24
Optimizing for Intel Xeon Phi
• Roofline Model of Processor Performance
  – Bounds application performance as a function of computational intensity
  – If intensity is high enough, the application is "compute bound" by floating point capability
  – If intensity is too low to satisfy the floating point units' demand for data, the application is "memory bound"
    • 128 GB main memory (DRAM)
[Roofline plot: KNL; ceilings at 1166 GFLOP/sec (no FMA) and 145 GFLOP/sec (no vector)]
slide 25
Optimizing for Intel Xeon Phi
• Roofline Model of Processor Performance
  – Bounds application performance as a function of computational intensity
  – If intensity is high enough, the application is "compute bound" by floating point capability
  – If intensity is too low to satisfy the floating point units' demand for data, the application is "memory bound"
    • 128 GB main memory (DRAM)
    • 16 GB high bandwidth memory (MCDRAM)
[Roofline plot: KNL; ceilings at 1166 GFLOP/sec (no FMA) and 145 GFLOP/sec (no vector)]
Among the ways to use MCDRAM:
• Configure KNL to use MCDRAM as a direct-mapped cache
• Configure KNL to use MCDRAM as a NUMA region (flat mode):
    numactl --membind=0 ./executable   (main memory)
    numactl --membind=1 ./executable   (MCDRAM)
• More info: http://colfaxresearch.com/knl-mcdram
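In flat (NUMA) mode, individual allocations can also be targeted at MCDRAM from the source. A minimal sketch, assuming Intel Fortran's FASTMEM attribute and the memkind library (link with -lmemkind); the array name and bounds are hypothetical:

    real, allocatable :: state(:,:,:)
    !DIR$ ATTRIBUTES FASTMEM :: state   ! request MCDRAM (via memkind) for this allocatable
    ...
    allocate(state(nx, ny, nz))         ! nx, ny, nz assumed set elsewhere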
Optimizing for Intel Xeon Phi
• Roofline Model of Processor Performance
  – Bounds application performance as a function of computational intensity
  – If intensity is high enough, the application is "compute bound" by floating point capability
  – If intensity is too low to satisfy the floating point units' demand for data, the application is "memory bound"
    • 128 GB main memory (DRAM)
    • 16 GB high bandwidth memory (MCDRAM)
  – KNL is nominally 3 TFLOP/sec, but to saturate full floating point capability you need:
    • 0.35 flops per byte from L1 cache
    • 1 flop per byte from L2 cache
    • 6 flops per byte from high bandwidth memory
    • 25 flops per byte from main memory
[Roofline plot: KNL; ceilings at 1166 GFLOP/sec (no FMA) and 145 GFLOP/sec (no vector)]
Optimizing for Intel Xeon Phi
• Roofline Model of Processor Performance
  – Bounds application performance as a function of computational intensity
  – If intensity is high enough, the application is "compute bound" by floating point capability
  – If intensity is too low to satisfy the floating point units' demand for data, the application is "memory bound"
    • 128 GB main memory (DRAM)
    • 16 GB high bandwidth memory (MCDRAM)
  – KNL is nominally 3 TFLOP/sec, but to saturate full floating point capability you need:
    • 0.35 flops per byte from L1 cache
    • 1 flop per byte from L2 cache
    • 6 flops per byte from high bandwidth memory
    • 25 flops per byte from main memory
  – Hard to come by in real applications!
    • NEPTUNE benefits from MCDRAM (it breaks through the DRAM ceiling) but realizes only a fraction of the MCDRAM ceiling
[Roofline plot: KNL; ceilings at 1166 GFLOP/sec (no FMA) and 145 GFLOP/sec (no vector). NEPTUNE E14P3L40 (U.S. Navy global model prototype) plotted at 0.312 flops/byte: 36.5 GF/s from MCDRAM, 20.3 GF/s from DRAM]
slide 28
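To make the memory-bound side of the roofline concrete, a minimal sketch (not from the talk) that evaluates attainable = min(peak, bandwidth × intensity). The ~500 GB/s MCDRAM and ~120 GB/s DDR bandwidths are assumptions back-derived from the 6 and 25 flop/byte thresholds above; the 0.312 flop/byte point is NEPTUNE's from this slide.

    program roofline_bound
      implicit none
      real :: peak, bw_mcdram, bw_ddr, ci
      peak      = 3000.0   ! GFLOP/s, nominal KNL double precision peak
      bw_mcdram = 500.0    ! GB/s, assumed MCDRAM bandwidth (3000/6)
      bw_ddr    = 120.0    ! GB/s, assumed DDR bandwidth (3000/25)
      ci        = 0.312    ! flops/byte, NEPTUNE E14P3L40 computational intensity
      print '(a,f7.1,a)', 'MCDRAM roofline bound: ', min(peak, bw_mcdram*ci), ' GFLOP/s'
      print '(a,f7.1,a)', 'DDR    roofline bound: ', min(peak, bw_ddr*ci),    ' GFLOP/s'
      ! Compare with the measured 36.5 GF/s (MCDRAM) and 20.3 GF/s (DRAM):
      ! NEPTUNE sits below even these memory ceilings, so vectorization and
      ! locality, not just bandwidth, limit performance here.
    end program roofline_bound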
Optimizing for Intel Xeon Phi
• Roofline Model of Processor Performance
  – Bounds application performance as a function of computational intensity
  – If intensity is high enough, the application is "compute bound" by floating point capability
  – If intensity is too low to satisfy the floating point units' demand for data, the application is "memory bound"
    • 128 GB main memory (DRAM)
    • 16 GB high bandwidth memory (MCDRAM)
  – KNL is nominally 3 TFLOP/sec, but to saturate full floating point capability you need:
    • 0.35 flops per byte from L1 cache
    • 1 flop per byte from L2 cache
    • 6 flops per byte from high bandwidth memory
    • 25 flops per byte from main memory
  – Hard to come by in real applications!
• Optimizing for KNL involves increasing parallelism and locality
  – Increasing parallelism and parallel efficiency will raise performance for a given computational intensity
  – Increasing locality will improve computational intensity and move the bar to the right
[Roofline plot: KNL; ceilings at 1166 GFLOP/sec (no FMA) and 145 GFLOP/sec (no vector); arrows indicate improved parallelism (up) and improved locality (right)]
slide 29
19
NEPTUNE Performance on 1 Node Knights Landing
NEPTUNE GFLOP/second, based on a count of 9.13 billion double precision floating point operations per time step (MPI-only)
[Bar chart: GF/sec; higher is better]
KNC → KNL: 4.2x
DRAM → MCDRAM: 2.1x
HSW → KNL: 1.3x
BDW → KNL: 0.9x
20
NEPTUNE Performance on 1 Node Knights Landing
NEPTUNE GFLOP/second, based on a count of 9.13 billion double precision floating point operations per time step (MPI-only)
[Bar chart: GF/sec; higher is better]
KNC → KNL: 4.2x
DRAM → MCDRAM: 2.1x
HSW → KNL: 1.3x
BDW → KNL: 0.9x
KNL peaks at 2 threads per core:
• Thread competition for resources (Q index is innermost in NEPTUNE)
• Better ILP
• Running out of work?
21
MPAS Performance on 1 Node Knights Landing
MPAS GFLOP/second, based on a count of 3.7 billion single precision floating point operations per dynamics step (MPI and OpenMP, but the best time shown here was single-threaded)
KNL 7250 B0, 68 cores (MCDRAM)
[Plot: GF/sec vs. number of threads; higher is better]
Caveat: plot shows floating point rate, not forecast speed.
22
MPAS Performance on 1 Node Knights Landing
MPAS GFLOP/second, based on a count of 3.7 billion single precision floating point operations per dynamics step (MPI and OpenMP, but the best time shown here was single-threaded)
KNL 7250 B0, 68 cores (MCDRAM)
[Plot: GF/sec vs. number of threads; higher is better]
http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/
23
MPAS and NEPTUNE Performance on Knights Landing
MPAS GFLOP/second, based on a count of 3.7 billion single precision floating point operations per dynamics step
NEPTUNE GFLOP/second, based on a count of 9.3 billion double precision floating point operations per dynamics step
[Plot: GF/sec vs. number of threads; higher is better]
Caveat: plot shows floating point rate, not forecast speed.
Optimizing NEPTUNE
• NUMA results on Blue Gene/Q suggest opportunities for improvement from vectorization and reduced parallel overheads
• Work in progress with NRL's NEPTUNE model using the Supercell configuration and workload from NGGPS testing
  – Analyzing and restructuring based on vector reports from the compiler
  – Adding multi-threading (OpenMP)
[Figure annotation: 0.7 FLOP/byte]
Andreas Müller, Michal A. Kopera, Simone Marras, Lucas C. Wilcox, Tobin Isaac, Francis X. Giraldo. "Strong Scaling for Numerical Weather Prediction at Petascale with the Atmospheric Model NUMA" (submitted 5 Nov 2015 (v1), last revised 8 Sep 2016 (v3)). http://arxiv.org/abs/1511.01561
25
Optimizing NEPTUNE
[Chart; higher is better]
26
Optimizing NEPTUNE
[Chart; higher is better]
27
Optimizing NEPTUNE
• Optimizations (a sketch of the first group of items follows this list)
  – Compile-time specification of important loop ranges
    • Number of variables per grid point
    • Dimensions of an element
    • Flattening nested loops to collapse them for vectorization
  – Facilitating inlining
    • Subroutines called from element loops in CREATE_LAPLACIAN and CREATE_RHS
    • Matrix multiplies (need to try MKL DGEMM)
  – Splitting apart loops where part vectorizes and part doesn't
  – Fixing unaligned data (apparently still a benefit on KNL)
• Drag forces
  – Using double precision, -fp-model precise
  – Need a larger workload
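An illustrative sketch (not NEPTUNE source; names are hypothetical) of the first group of items: the number of variables and element dimensions are made compile-time parameters, and the i/j/k point loops are flattened into one unit-stride loop of fixed trip count that the compiler can vectorize.

    subroutine add_tendency(nelem, dt, q, rhs)
      implicit none
      integer, parameter :: nvar = 5, ngl = 4, npts = ngl*ngl*ngl   ! loop ranges known at compile time
      integer, intent(in) :: nelem
      real(8), intent(in) :: dt, q(npts, nvar, nelem)
      real(8), intent(inout) :: rhs(npts, nvar, nelem)
      integer :: ie, m, ip
      do ie = 1, nelem
         do m = 1, nvar
            !$omp simd
            do ip = 1, npts        ! flattened (i,j,k) point loop, unit stride
               rhs(ip, m, ie) = rhs(ip, m, ie) + dt * q(ip, m, ie)
            end do
         end do
      end do
    end subroutine add_tendency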
28
Optimizing NEPTUNE
[Chart; higher is better]
29
Optimizing NEPTUNE
• SPMD OpenMP ** (see the sketch after this slide)
  – Create the threaded region very high in the call tree
    • Improves locality: threads behave more like MPI tasks
    • Avoids starting and stopping threaded regions
    • Avoids threading over different domain dimensions
  – HOMME uses this to block over elements, tracers and the vertical dimension
  – See also: "HBM code modernization – insights from a Xeon Phi experiment", Jacob Weismann Poulsen (DMI), Per Berg (DMI), Karthik Raman (Intel)
• MPI-3 shared memory windows †,‡,*
  – Re-syncing the memory model ate up any benefits
  – Intel MPI hard to beat on Phi; using a shmem-like transport already?
** Jacob Weismann Poulsen (DMI) et al. "HBM code modernization – insights from a Xeon Phi experiment", CUG 2016, London
† Torsten Hoefler, ETH Zurich (http://htor.inf.ethz.ch/)
‡ Brinskiy and Lubin: http://goparallel.sourceforge.net/wp-content/uploads/2015/06/PUM21-2-An_Introduction_to_MPI-3.pdf
* Jed Brown, "To Thread or Not To Thread", NCAR Multi-core Workshop, 2015. https://jedbrown.org/files/20150916-Threads.pdf
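A minimal sketch of the SPMD style described above (routine names are hypothetical, not NEPTUNE's): the parallel region is opened once, high in the call tree, and each thread keeps working on its own static block of elements across the whole time loop, preserving locality much like an MPI task.

    subroutine integrate(ntimesteps, nelem)
      use omp_lib
      implicit none
      integer, intent(in) :: ntimesteps, nelem
      integer :: istep, tid, nthreads, chunk, its, ite
      !$omp parallel private(istep, tid, nthreads, chunk, its, ite)
      tid      = omp_get_thread_num()
      nthreads = omp_get_num_threads()
      chunk    = (nelem + nthreads - 1) / nthreads
      its      = tid * chunk + 1
      ite      = min(nelem, its + chunk - 1)
      do istep = 1, ntimesteps
         call compute_rhs(its, ite)       ! hypothetical: work on this thread's elements
         !$omp barrier                    ! synchronize where neighbor data is needed
         call update_solution(its, ite)   ! hypothetical
         !$omp barrier
      end do
      !$omp end parallel
    end subroutine integrate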
30
Optimizing NEPTUNE
[Chart; higher is better]
NUMA effects on MCDRAM?
• Sub-NUMA Clustering (SNC-4) divides nodes into four NUMA regions, each assigned two MCDRAM memory controllers (EDCs)
• Can first touch by MPI task 0 during the application’s initialization phase cause uneven placement of addresses?
• Measure traffic as number of cache lines read/written using “uncore” event counters on each EDC
*Available from Intel and coming in RHEL/CENTOS
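One standard mitigation for the first-touch question above (not shown on the slide; the array is hypothetical) is to first-touch data in parallel from the threads that will later use it, so pages are spread over the SNC-4 regions rather than all being faulted in by the initializing task:

    subroutine first_touch(nx, ny, nz, field)
      implicit none
      integer, intent(in) :: nx, ny, nz
      real, intent(out) :: field(nx, ny, nz)
      integer :: k
      ! each thread touches the k-slabs it will later compute on, so the
      ! pages are faulted in near that thread rather than all near task 0
      !$omp parallel do schedule(static)
      do k = 1, nz
         field(:, :, k) = 0.0
      end do
      !$omp end parallel do
    end subroutine first_touch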
NUMA effects on MCDRAM?
• Sub-NUMA Clustering (SNC-4) divides nodes into four NUMA regions, each assigned two MCDRAM memory controllers (EDCs)
• Can first touch by MPI task 0 during the application’s initialization phase cause uneven placement of addresses?
• Measure traffic as number of cache lines read/written using “uncore” event counters on each EDC
  – Linux kernel extensions to support counting the KNL uncore in the "perf" utility*
  – Functionality being added to PAPI
  – Application-level library courtesy of Sumedh Naik, Intel
* Available from Intel and coming in RHEL/CentOS
program NEPTUNE
  call ucore_setup
  ...
  if (irank == irank0) then
    call ucore_start_counting
  endif
  !$omp parallel firstprivate(...)
  do istep = 1, ntimesteps
    ! compute a time step
  enddo
  !$omp end parallel
  if (irank == irank0) then
    call ucore_stop_counting
    call ucore_tellall
  endif
  ...
NUMA effects on MCDRAM?
• Sub-NUMA Clustering (SNC-4) divides nodes into four NUMA regions, each assigned two MCDRAM memory controllers (EDCs)
• Can first touch by MPI task 0 during the application’s initialization phase cause uneven placement of addresses?
• Measure traffic as number of cache lines read/written using “uncore” event counters on each EDC
  – Linux kernel extensions to support counting the KNL uncore in the "perf" utility*
  – Functionality being added to PAPI
  – Application-level ucore library courtesy of Sumedh Naik, Intel
* Available from Intel and coming in RHEL/CentOS
OUTPUT FROM SUMEDH NAIK’S UCORE EVENT LIBRARY:
DDR Reads:  69106824  68856106  69666949  42679419  42543281  43278055
DDR Reads Total: 336130634
DDR Writes:  9844349  9531724  9935443  10422308  10248463  10332718
DDR Writes Total: 60315005
MCDRAM Reads:  3417431200  3433620908  3449823624  3466726377  2752299717  2768241026  3089391878  3105872573
MCDRAM Reads Total: 25483407303
MCDRAM Writes:  1278938059  1280452163  1301248872  1303018021  1041108410  1043107191  1172588335  1174497296
MCDRAM Writes Total: 9594958347
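For scale (the counters are in 64-byte cache lines, as noted above): the MCDRAM totals correspond to roughly 25.5×10⁹ × 64 B ≈ 1.6 TB read and 9.6×10⁹ × 64 B ≈ 0.6 TB written over the counted interval, versus only about 21.5 GB read and 3.9 GB written to DDR; the per-controller counts show how evenly (or not) that traffic is spread across the eight MCDRAM EDCs.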
NUMA effects on MCDRAM?
• Sub-NUMA Clustering (SNC-4) divides nodes into four NUMA regions, each assigned two MCDRAM memory controllers
• Can first touch by MPI task 0 during the application's initialization phase cause uneven placement of addresses?
[Charts of per-EDC traffic; lower is better]
36
WRF Kernels revisited
WSM5 Microphysics (single precision)
          original   optimized   original→optimized
KNC       11 GF/s    154 GF/s    14x
KNL       68 GF/s    283 GF/s    4.2x
KNC→KNL   6.2x       1.8x

RRTMG Radiation (double precision)
          original   optimized   original→optimized
KNC       10 GF/s    36 GF/s     3.6x
KNL       36 GF/s    78 GF/s     2.2x
KNC→KNL   3.6x       2.2x
Michalakes, Iacono and Jessup. Optimizing Weather Model Radiative Transfer Physics for Intel's Many Integrated Core (MIC) Architecture. Parallel Processing Letters. World Scientific. Vol. 26, No. 3 (Sept. 2016) Accepted; publication date TBD.
37
Tools overview
Capability columns surveyed: profiling; op count and FP rate; memory traffic and bandwidth; thread utilization; vector utilization; roofline analysis; execution traces.
• Intel VTune Amplifier: profiling by routine, line, and instruction; time-series bandwidth output and traffic counts; thread-utilization histogram; General Exploration output (vector utilization); bandwidth, CPU time, spin time and MPI comms as a function of time.
• Intel Software Development Emulator: only way to count FP ops on current Xeon; counts loads/stores between registers and first-level cache; otherwise not tried.
• Intel Advisor: not tried for three of the categories; vector utilization routine by routine and line by line; roofline analysis routine by routine (in beta).
• MPE/Jumpshot (ANL/LANS): profiling by routine or instrumented region; detailed Gantt plots (execution traces).
• Empirical Roofline Toolkit (NERSC/LBL): max FP rate; multiple levels of the memory hierarchy; roofline analysis.
• Linux Perf: PAPI or other for op counts; counts loads/stores between LLC and memory; several other capabilities possible.
38
Summary
• New models such as MPAS and NUMA/NEPTUNE
  – Higher order, variable resolution
  – Demonstrated scalability
  – Focus is on computational efficiency to run faster
• Progress with accelerators
  – Early Intel Knights Landing results showing 1-2% of peak
  – Good news: with the new MCDRAM, we aren't memory-bandwidth bound
  – Need to improve vector utilization and other bottlenecks to reach > 10% of peak
  – Naval Postgraduate School's results speeding up NUMA show promise
• Ongoing efforts
  – Measurement, characterization and optimization of floating point and data movement
  – Explore the effect of different workloads, more aggressive compiler optimizations, and reduced (single) precision
39
Acknowledgements
• Intel: Mike Greenfield, Ruchira Sasanka, Indraneil Gokhale, Larry Meadows, Zakhar Matveev, Dmitry Petunin, Dmitry Dontsov, Sumedh Naik, Karthik Raman
• NRL and NPS: Alex Reinecke, Kevin Viner, Jim Doyle, Sasa Gabersek, Andreas Mueller (now at ECMWF), Frank Giraldo
• NCAR: Bill Skamarock, Michael Duda, John Dennis, Chris Kerr
• NOAA: Mark Govett, Jim Rosinski, Tom Henderson (now at Spire)
• Google search, which usually leads to something John McCalpin has already figured out...