
Page 1

Experiences Programming and Optimizing for Knights Landing

John Michalakes University Corporation for Atmospheric Research

Boulder, Colorado USA

2016 MultiCore 6 Workshop

September 13th 2016

(Post-workshop: slides 26 and 36 updated on 2016/09/22)

Page 2

2

Outline

• Hardware and software
  – Applications: NEPTUNE, MPAS, WRF and kernels
  – Knights Landing overview
• Optimizing for KNL
  – Scaling
  – Performance
  – Roofline
• Tools
• Summary

Page 3

3

Models: NEPTUNE/NUMA

Andreas Mueller, NPS, Monterey, CA; and M. Kopera, S. Marras, and F. X. Giraldo. “Towards Operational Weather Prediction at 3.0km Global Resolution With the Dynamical Core NUMA”. 96th Amer. Met. Society Annual Mtg. January, 2016. https://ams.confex.com/ams/96Annual/webprogram/Paper288883.html

Page 4

4

Models: NEPTUNE/NUMA

• Spectral element
  – 4th, 5th and higher-order* continuous Galerkin (discontinuous planned)
  – Cubed sphere (also icosahedral)
  – Adaptive Mesh Refinement (AMR) in development
• Computationally dense but highly scalable
  – Constant width-one halo communication
  – Good locality for next-generation HPC
  – MPI-only, but OpenMP threading in development

NEPTUNE 72‐h forecast (5 km resolution) of accumulated precipitation for Hurr. Sandy

Example of Adaptive Grid tracking a severe event

courtesy: Frank Giraldo, NPS

*“This is not the same ‘order’ as is used to identify the leading term of the error in finite-difference schemes, which in fact describes accuracy. Evaluation of Gaussian quadrature over N+1 LGL quadrature points will be exact to machine precision as long as the polynomial integrand is of the order 2x(N−1)−3, or less.” Gabersek et al., MWR, Apr 2012. DOI: 10.1175/MWR-D-11-00144.1

Page 5

5

Models: NEPTUNE/NUMA

• NWP is one of the first HPC applications and, with climate, has been one of its key drivers
  – Exponential growth in HPC capability has translated directly to better forecasts and steadily improving value to the public
• HPC growth continues toward Peta-/Exaflop, but only parallelism is increasing
  – More floating point capability
  – Proportionately less cache, memory and I/O bandwidth
  – NWP parallelism scales in one dimension less than complexity
  – Ensembles scale computationally but move the problem to I/O
• Can operational NWP stay on the bus?
  – More scalable formulations
  – Expose and exploit all available parallelism, especially fine-grain

Andreas Mueller, NPS, Monterey, CA; and M. Kopera, S. Marras, and F. X. Giraldo. “Towards Operational Weather Prediction at 3.0km Global Resolution With the Dynamical Core NUMA”. 96th Amer. Met. Society Annual Mtg. January, 2016. https://ams.confex.com/ams/96Annual/webprogram/Paper288883.html

(Chart: computational resources needed to maintain fixed solution time with increasing resolution, from 3,000 threads (47 nodes) to 3,000,000 threads (47,000 nodes).)

Page 6

6

• NWP is one of the first HPC applications and, with climate, has been one its key drivers

– Exponential growth in HPC capability has translated directly to better forecasts and steadily improving value to the public

• HPC growth continues toward Peta-/Exaflop, but only parallelism is increasing

– More floating point capability– Proportionately less cache, memory and I/O bandwidth NWP parallelism scales in one dimension less than complexity Ensembles scale computationally but move problem to I/O

• Can operational NWP stay on the bus?– More scalable formulations– Expose and exploit all available parallelism, especially fine-grain

Models: NEPTUNE/NUMA

Andreas Mueller, NPS, Monterey, CA; and M. Kopera, S. Marras, and F. X. Giraldo. “Towards Operational Weather Prediction at 3.0kmGlobal Resolution With the Dynamical Core NUMA”. 96th Amer. Met. Society Annual Mtg. January, 2016. https://ams.confex.com/ams/96Annual/webprogram/Paper288883.html

3,000 threads(47 nodes)

3,000,000 threads(47,000 nodes)

Shows computational resources needed to maintain fixed solution time with increasing resolution

See also: Andreas Müller, Michal A. Kopera, Simone Marras, Lucas C. Wilcox, Tobin Isaac, Francis X. Giraldo. “Strong Scaling for Numerical Weather Prediction at Petascale with the Atmospheric Model NUMA” (last revised 8 Sep 2016, v3). http://arxiv.org/abs/1511.01561

Page 7

7

Models: MPAS

Primary funding: National Science Foundation and the DOE Office of Science

Collaborative project between NCAR and LANL to develop atmosphere, ocean and other earth-system simulation components for use in climate, regional climate and weather studies.

• Applications include global NWP and global atmospheric chemistry, regional climate, tropical cyclones, and convection-permitting hazardous weather forecasting
• Finite volume, C-grid
• Refinement capability
  – Centroidal Voronoi-tessellated unstructured mesh allows arbitrary in-place horizontal mesh refinement
• HPC readiness
  – Current release (v4.0) supports parallelism via MPI (horizontal domain decomposition)
  – Hybrid (MPI+OpenMP) parallelism implemented, undergoing testing

Page 8

8

Hardware

• Intel Xeon Phi 7250 (Knights Landing) announced at ISC’16 in June
  – 14 nanometer feature size, > 8 billion transistors
  – 68 cores, 1.4 GHz modified “Silvermont” with out-of-order instruction execution
  – Two 512-bit wide Vector Processing Units (VPUs) per core
  – Peak ~3 TF/s double precision, ~6 TF/s single precision
  – 16 GB MCDRAM (on-chip) memory, > 400 GB/s bandwidth
  – “Hostless”: no separate host processor and no “offload” programming
  – Binary compatible ISA (instruction set architecture)
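The ~3 TF/s double precision figure can be sanity-checked from the numbers above (a back-of-envelope estimate at the 1.4 GHz base clock; the actual AVX frequency differs somewhat):

$$ 68\ \text{cores} \times 2\ \tfrac{\text{VPUs}}{\text{core}} \times 8\ \tfrac{\text{DP lanes}}{\text{VPU}} \times 2\ \tfrac{\text{flops}}{\text{FMA}} \times 1.4\ \text{GHz} \approx 3.05\ \text{TFLOP/s} $$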

Page 9

9

Optimizing for Intel Xeon Phi

• Most work in MIC programming involves optimization to achieve the best share of peak performance
  – Parallelism:
    • KNL has up to 288 hardware threads (4 per core) and a total of more than 2000 floating point units on the chip
    • Exposing coarse-, medium- and especially fine-grain (vector) parallelism in the application to use Xeon Phi efficiently
  – Locality:
    • Reorganizing and restructuring data structures and computation to improve data reuse in cache and reduce the time floating point units sit idle waiting for data (memory latency)
    • Kernels that do not have high data reuse, or that do not fit in cache, require high memory bandwidth
• The combination of these factors affecting performance on the Xeon Phi (or any contemporary processor) can be characterized in terms of computational intensity, and conceptualized using the Roofline Model

Page 10

Optimizing for Intel Xeon Phi

• Roofline Model of Processor Performance
  – Bounds application performance as a function of computational intensity: the number of floating point operations per byte moved between the processor and a level of the memory hierarchy
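In equation form (a standard statement of the model, not from the slides), attainable performance at a given level of the memory hierarchy is

$$ P \;=\; \min\bigl(P_{\text{peak}},\; B \times I\bigr) $$

where $P_{\text{peak}}$ is the peak floating point rate, $B$ is that level’s bandwidth, and $I$ is the computational intensity in flops per byte. For example, combining the > 400 GB/s MCDRAM figure from the hardware slide with the 0.312 flops/byte NEPTUNE intensity shown later bounds NEPTUNE near $400 \times 0.312 \approx 125$ GFLOP/s, well below the compute ceilings.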

Williams, S. W., A. Waterman, and D. Patterson. Electrical Engineering and Computer Sciences, University of California at Berkeley. Technical Report No. UCB/EECS-2008-134. October 17, 2008.

(Roofline chart: achievable performance vs. computational intensity* for KNL.)

*Sometimes referred to as: arithmetic intensity (registers→L1), largely algorithmic; operational intensity (LLC→DRAM), improvable.

Empirical Roofline Toolkit: https://crd.lbl.gov/departments/computer-science/PAR/research/roofline/

Thanks: Doug Doerfler, LBNL

Page 11

Optimizing for Intel Xeon Phi

• Roofline Model of Processor Performance
  – Bounds application performance as a function of computational intensity
  – If intensity is high enough, the application is “compute bound” by floating point capability
    • 2333 GFLOP/s double precision, vector & FMA

(Vectorization means the processor can perform 8 double precision operations at once, as long as the operations are independent.)

Page 12

Optimizing for Intel Xeon Phi

• Roofline Model of Processor Performance
  – Bounds application performance as a function of computational intensity
  – If intensity is high enough, the application is “compute bound” by floating point capability
    • 2333 GFLOP/s double precision, vector & FMA
    • 1166 GFLOP/s double precision, vector, no FMA

(FMA instructions perform a floating point multiply and a floating point addition as the same operation, when the compiler can detect expressions of the form: result = A*B+C.)
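A minimal sketch of the kind of loop this refers to (names and bounds are illustrative, not from the slides; compiled with something like ifort -O3 -xMIC-AVX512, the loop body maps to vector FMA instructions):

      ! Illustrative only: each iteration is an independent a*b+c,
      ! so the compiler can emit one vfmadd per 8 double precision lanes.
      subroutine fused_multiply_add(n, a, b, c, r)
        implicit none
        integer, intent(in)  :: n
        real(8), intent(in)  :: a(n), b(n), c(n)
        real(8), intent(out) :: r(n)
        integer :: i
        do i = 1, n
           r(i) = a(i)*b(i) + c(i)   ! result = A*B + C
        end do
      end subroutine fused_multiply_add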

Page 13

Optimizing for Intel Xeon Phi

• Roofline Model of Processor Performance
  – Bounds application performance as a function of computational intensity
  – If intensity is high enough, the application is “compute bound” by floating point capability
    • 2333 GFLOP/s double precision, vector & FMA
    • 1166 GFLOP/s double precision, vector, no FMA
    • 145 GFLOP/s double precision, no vector, no FMA

Page 14

Optimizing for Intel Xeon Phi

• Roofline Model of Processor Performance
  – Bounds application performance as a function of computational intensity
  – If intensity is high enough, the application is “compute bound” by floating point capability
  – If intensity is not enough to satisfy the demand for data by the processor’s floating point units, the application is “memory bound”
    • 128 GB main memory (DRAM)

Page 15

Optimizing for Intel Xeon Phi

• Roofline Model of Processor Performance
  – Bounds application performance as a function of computational intensity
  – If intensity is high enough, the application is “compute bound” by floating point capability
  – If intensity is not enough to satisfy the demand for data by the processor’s floating point units, the application is “memory bound”
    • 128 GB main memory (DRAM)
    • 16 GB high bandwidth memory (MCDRAM)

Among the ways to use MCDRAM:
• Configure KNL to use MCDRAM as a direct-mapped cache
• Configure KNL to use MCDRAM as a NUMA region:
      numactl --membind=0 ./executable   (main memory)
      numactl --membind=1 ./executable   (MCDRAM)
• More info: http://colfaxresearch.com/knl-mcdram
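A third option in flat mode is to place only selected arrays in MCDRAM. A hedged sketch using the Intel Fortran FASTMEM attribute, which is backed by the memkind library (link with -lmemkind; array name and size are illustrative):

      program fastmem_demo
        implicit none
        real(8), allocatable :: state(:,:)
        !DIR$ ATTRIBUTES FASTMEM :: state   ! allocation served from MCDRAM when available
        allocate(state(1024, 1024))
        state = 0.0d0
        print *, sum(state)
        deallocate(state)
      end program fastmem_demo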

Page 16

Optimizing for Intel Xeon Phi

• Roofline Model of Processor Performance
  – Bounds application performance as a function of computational intensity
  – If intensity is high enough, the application is “compute bound” by floating point capability
  – If intensity is not enough to satisfy the demand for data by the processor’s floating point units, the application is “memory bound”
    • 128 GB main memory (DRAM)
    • 16 GB high bandwidth memory (MCDRAM)
  – KNL is nominally 3 TFLOP/sec, but to saturate full floating point capability an application needs:
    • 0.35 flops per byte from L1 cache
    • 1 flop per byte from L2 cache
    • 6 flops per byte from high bandwidth memory
    • 25 flops per byte from main memory
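Each threshold is simply the peak rate divided by that level’s bandwidth. For the two memory roofs (bandwidth figures approximate, implied by the thresholds on the slide):

$$ I_{\text{MCDRAM}} \approx \frac{3000\ \text{GFLOP/s}}{\sim 500\ \text{GB/s}} \approx 6\ \tfrac{\text{flops}}{\text{byte}}, \qquad I_{\text{DRAM}} \approx \frac{3000\ \text{GFLOP/s}}{\sim 120\ \text{GB/s}} \approx 25\ \tfrac{\text{flops}}{\text{byte}} $$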


Page 17

Optimizing for Intel Xeon Phi

• Roofline Model of Processor Performance (bullets as on the previous slide)
  – Such intensities are hard to come by in real applications!
    • NEPTUNE benefits from MCDRAM (breaks through the DRAM ceiling) but realizes only a fraction of the MCDRAM ceiling

(Roofline chart: NEPTUNE E14P3L40, the U.S. Navy global model prototype, at 0.312 flops/byte achieves 36.5 GF/s from MCDRAM vs. 20.3 GF/s from DRAM.)

Page 18

Optimizing for Intel Xeon Phi

• Roofline Model of Processor Performance (bullets as on the previous slides)
  – Such intensities are hard to come by in real applications!
• Optimizing for KNL involves increasing parallelism and locality
  – Increasing parallelism and parallel efficiency will raise performance for a given computational intensity
  – Increasing locality will improve computational intensity and move the bar to the right

(Roofline chart: arrows showing improved parallelism raising achievable performance and improved locality moving intensity to the right.)

Page 19

19

NEPTUNE Performance on 1 Node Knights Landing

NEPTUNE: GFLOP/second, based on a count of 9.13 billion double precision floating point operations per time step. MPI-only.

(Chart: GF/sec, higher is better)
  KNC → KNL: 4.2x
  DRAM → MCDRAM: 2.1x
  HSW → KNL: 1.3x
  BDW → KNL: 0.9x

Page 20

20

NEPTUNE Performance on 1 Node Knights Landing

KNL peaks at 2 threads per core:
• Thread competition for resources (the Q index is innermost in NEPTUNE)
• Better ILP
• Running out of work?

(Chart: GF/sec, higher is better)

Page 21

21

MPAS Performance on 1 Node Knights Landing

MPAS: GFLOP/second, based on a count of 3.7 billion single precision floating point operations per dynamics step. MPI and OpenMP, but the best time shown here was single-threaded. KNL 7250 B0, 68 cores (MCDRAM).

(Chart: GF/sec vs. number of threads, higher is better)

Caveat: the plot shows floating point rate, not forecast speed.

Page 22

22

MPAS Performance on 1 Node Knights Landing

(Chart: GF/sec vs. number of threads, higher is better)

See: http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/

Page 23

23

MPAS and NEPTUNE Performance on Knights Landing

MPAS: GFLOP/second, based on a count of 3.7 billion single precision floating point operations per dynamics step.
NEPTUNE: GFLOP/second, based on a count of 9.3 billion double precision floating point operations per dynamics step.

(Chart: GF/sec vs. number of threads, higher is better)

Caveat: the plot shows floating point rate, not forecast speed.

Page 24

Optimizing NEPTUNE

• NUMA results on Blue Gene/Q suggest opportunities for improvement from vectorization and reduced parallel overheads
• Work in progress with NRL’s NEPTUNE model, using the Supercell configuration and workload from NGGPS testing
  – Analyzing and restructuring based on vector reports from the compiler
  – Adding multi-threading (OpenMP)

Andreas Müller, Michal A. Kopera, Simone Marras, Lucas C. Wilcox, Tobin Isaac, Francis X. Giraldo. “Strong Scaling for Numerical Weather Prediction at Petascale with the Atmospheric Model NUMA” (submitted 5 Nov 2015, v1; last revised 8 Sep 2016, v3). http://arxiv.org/abs/1511.01561

(Chart annotation: 0.7 FLOP/byte)

Page 25

25

Optimizing NEPTUNE

(Chart: higher is better)

Page 26

26

Optimizing NEPTUNE

(Chart: higher is better)

Page 27

27

Optimizing NEPTUNE

• Optimizations
  – Compile-time specification of important loop ranges
    • Number of variables per grid point
    • Dimensions of an element
    • Flattening nested loops to collapse for vectorization (see the sketch below)
  – Facilitating inlining
    • Subroutines called from element loops in CREATE_LAPLACIAN and CREATE_RHS
    • Matrix multiplies (need to try MKL DGEMM)
  – Splitting apart loops where part vectorizes and part doesn’t
  – Fixing unaligned data (apparently still a benefit on KNL)
• Drag forces
  – Using double precision, -fp-model precise
  – Need a larger workload
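A minimal sketch of the compile-time loop ranges and loop-flattening ideas (bounds and names are illustrative, not NEPTUNE’s):

      ! Fixing NVAR (variables per grid point) and NP (element dimension)
      ! at compile time lets the compiler collapse the short nested loops
      ! into one long stride-1 loop it can vectorize.
      subroutine flattened_update(nelem, q, rhs)
        implicit none
        integer, parameter :: NP = 4, NVAR = 5        ! compile-time loop ranges
        integer, intent(in)  :: nelem
        real(8), intent(in)  :: q(NVAR*NP*NP, nelem)
        real(8), intent(out) :: rhs(NVAR*NP*NP, nelem)
        integer :: ie, ij
        do ie = 1, nelem
           do ij = 1, NVAR*NP*NP   ! flattened trip count of 80, not 5x4x4
              rhs(ij, ie) = 2.0d0 * q(ij, ie)
           end do
        end do
      end subroutine flattened_update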

Page 28

28

Optimizing NEPTUNE

(Chart: higher is better)

Page 29

29

Optimizing NEPTUNE

• SPMD OpenMP**
  – Create the threaded region very high in the call tree (see the sketch below)
    • Improves locality: more like MPI tasks
    • Avoids starting and stopping threaded regions
    • Avoids threading over different domain dimensions
  – HOMME uses this to block over elements, tracers and the vertical dimension
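A minimal sketch of the SPMD pattern (names and the static element partition are illustrative): one parallel region spans the whole time loop, and each thread keeps the same block of elements for the entire run.

      program spmd_sketch
        use omp_lib
        implicit none
        integer, parameter :: nelem = 1024, nsteps = 10
        real(8) :: state(nelem)
        integer :: tid, nth, chunk, i1, i2, istep, i
        state = 1.0d0
        !$omp parallel private(tid, nth, chunk, i1, i2, istep, i)
        tid   = omp_get_thread_num()
        nth   = omp_get_num_threads()
        chunk = (nelem + nth - 1) / nth
        i1    = tid*chunk + 1                  ! this thread's elements,
        i2    = min(nelem, i1 + chunk - 1)     ! fixed for the whole run
        do istep = 1, nsteps
           do i = i1, i2
              state(i) = 0.5d0 * (state(i) + 1.0d0)
           end do
           !$omp barrier   ! stand-in for per-step synchronization
        end do
        !$omp end parallel
        print *, state(1), state(nelem)
      end program spmd_sketch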

• MPI-3 shared memory windows†,‡,* (see the sketch below)
  – Resyncing the memory model ate up any benefits
  – Intel MPI is hard to beat on Phi; using a shmem-like transport already?

** Jacob Weismann Poulsen (DMI), Per Berg (DMI), and Karthik Raman (Intel). “HBM code modernization – insights from a Xeon Phi experiment”. CUG 2016, London.
† Torsten Hoefler, ETH Zurich (http://htor.inf.ethz.ch/)
‡ Brinskiy and Lubin: http://goparallel.sourceforge.net/wp-content/uploads/2015/06/PUM21-2-An_Introduction_to_MPI-3.pdf
* Jed Brown, “To Thread or Not To Thread”, NCAR Multi-core Workshop, 2015. https://jedbrown.org/files/20150916-Threads.pdf
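For the MPI-3 shared memory windows item, a hedged sketch of the mechanism (the MPI calls are standard MPI-3; sizes and names are illustrative): one allocation per node, mapped by every rank on that node.

      program shmwin_sketch
        use mpi
        use iso_c_binding
        implicit none
        integer, parameter :: n = 1000
        integer :: ierr, nodecomm, noderank, win, disp_unit
        integer(kind=MPI_ADDRESS_KIND) :: winsize
        type(c_ptr) :: baseptr
        real(8), pointer :: shared(:)

        call MPI_Init(ierr)
        ! communicator of ranks that can share memory (one per node)
        call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                                 MPI_INFO_NULL, nodecomm, ierr)
        call MPI_Comm_rank(nodecomm, noderank, ierr)

        ! node rank 0 provides the segment; other ranks attach with size 0
        winsize = 0
        if (noderank == 0) winsize = int(n, MPI_ADDRESS_KIND) * 8
        call MPI_Win_allocate_shared(winsize, 8, MPI_INFO_NULL, nodecomm, &
                                     baseptr, win, ierr)
        call MPI_Win_shared_query(win, 0, winsize, disp_unit, baseptr, ierr)
        call c_f_pointer(baseptr, shared, [n])

        if (noderank == 0) shared = 0.0d0
        call MPI_Barrier(nodecomm, ierr)   ! real codes also need MPI_Win_sync:
                                           ! the "resyncing" cost noted above
        if (noderank < n) shared(noderank+1) = real(noderank, 8)
        call MPI_Barrier(nodecomm, ierr)
        if (noderank == 0) print *, shared(1:min(4,n))

        call MPI_Win_free(win, ierr)
        call MPI_Finalize(ierr)
      end program shmwin_sketch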

Page 30

30

Optimizing NEPTUNE

(Chart: higher is better)

Page 31

NUMA effects on MCDRAM?

• Sub-NUMA Clustering (SNC-4) divides the node into four NUMA regions, each assigned two MCDRAM memory controllers (EDCs)
• Can first touch by MPI task 0 during the application’s initialization phase cause uneven placement of addresses? (See the first-touch sketch below.)
• Measure traffic as the number of cache lines read/written, using “uncore” event counters on each EDC
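The first-touch question can be tested directly: under Linux, a page lands in the NUMA region of whichever thread or rank touches it first. A minimal sketch (array size illustrative) that initializes in parallel so pages are distributed across regions rather than all placed by thread 0:

      program first_touch
        implicit none
        integer, parameter :: n = 64*1024*1024
        real(8), allocatable :: a(:)
        integer :: i
        allocate(a(n))            ! virtual allocation; no pages mapped yet
        !$omp parallel do schedule(static)
        do i = 1, n
           a(i) = 0.0d0           ! first touch decides page placement
        end do
        print *, a(1)
      end program first_touch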

Page 32

NUMA effects on MCDRAM?

• Sub-NUMA Clustering (SNC-4) divides the node into four NUMA regions, each assigned two MCDRAM memory controllers (EDCs)
• Measure traffic as the number of cache lines read/written, using “uncore” event counters on each EDC
  – Linux kernel extensions to support counting the KNL uncore in the “Perf” utility*
  – Functionality being added to PAPI
  – Application-level library courtesy of Sumedh Naik, Intel

*Available from Intel and coming in RHEL/CentOS

program NEPTUNE
  call ucore_setup
  ...
  if (irank == irank0) then
    call ucore_start_counting
  endif
  !$omp parallel firstprivate(...)
  do istep = 1, ntimesteps
     ! compute a time step
  enddo
  !$omp end parallel
  if (irank == irank0) then
    call ucore_stop_counting
    call ucore_tellall
  endif
  ...

Page 33

NUMA effects on MCDRAM?


Output from Sumedh Naik’s ucore event library:

  DDR Reads:  69106824 68856106 69666949 42679419 42543281 43278055
  DDR Reads Total: 336130634
  DDR Writes: 9844349 9531724 9935443 10422308 10248463 10332718
  DDR Writes Total: 60315005
  MCDRAM Reads:  3417431200 3433620908 3449823624 3466726377 2752299717 2768241026 3089391878 3105872573
  MCDRAM Reads Total: 25483407303
  MCDRAM Writes: 1278938059 1280452163 1301248872 1303018021 1041108410 1043107191 1172588335 1174497296
  MCDRAM Writes Total: 9594958347

Page 34

NUMA effects on MCDRAM?

• Sub-NUMA Clustering (SNC-4) divides the node into four NUMA regions, each assigned two MCDRAM memory controllers
• Can first touch by MPI task 0 during the application’s initialization phase cause uneven placement of addresses?

(Chart: lower is better)

Page 35

NUMA effects on MCDRAM?

(Chart: lower is better)

Page 36

36

WRF Kernels revisited

WSM5 Microphysics (single precision)

              original    optimized   original→optimized
  KNC         11 GF/s     154 GF/s    14x
  KNL         68 GF/s     283 GF/s    4.2x
  KNC→KNL     6.2x        1.8x

RRTMG Radiation (double precision)

              original    optimized   original→optimized
  KNC         10 GF/s     36 GF/s     3.6x
  KNL         36 GF/s     78 GF/s     2.2x
  KNC→KNL     3.6x        2.2x

Michalakes, Iacono and Jessup. “Optimizing Weather Model Radiative Transfer Physics for Intel’s Many Integrated Core (MIC) Architecture”. Parallel Processing Letters, World Scientific, Vol. 26, No. 3 (Sept. 2016). Accepted; publication date TBD.

Page 37

37

Tools overview

Capabilities surveyed: profiling; op count and FP rate; memory traffic and BW; thread utilization; vector utilization; roofline analysis; execution traces.

• Intel VTune Amplifier: profiling by routine, line, and instruction; time-series B/W output and traffic counts; thread-utilization histogram; vector utilization via General Exploration output; execution traces of B/W, CPU time, spin time, and MPI comms as a function of time.
• Intel Software Development Emulator: the only way to count FP ops on current Xeon; counts loads/stores between registers and first-level cache; roofline analysis not tried.
• Intel Advisor: op count, memory traffic, and thread utilization not tried; vector utilization routine by routine and line by line; roofline analysis routine by routine (in beta).
• MPE/Jumpshot (ANL/LANS): profiling by routine and instrumented region; detailed Gantt plots of execution traces.
• Empirical Roofline Toolkit (NERSC/LBL): max FP rate; multiple levels of the memory hierarchy; roofline analysis.
• Linux Perf: profiling via PAPI or other; op counting possible; memory traffic possible (counts loads/stores between LLC and memory); roofline analysis and execution traces possible.

Page 38

38

Summary

• New models such as MPAS and NUMA/NEPTUNE
  – Higher order, variable resolution
  – Demonstrated scalability
  – Focus is on computational efficiency to run faster
• Progress with accelerators
  – Early Intel Knights Landing results showing 1-2% of peak
  – Good news: with new MCDRAM, we aren’t memory bandwidth bound
  – Need to improve vector utilization and address other bottlenecks to reach > 10% of peak
  – Naval Postgraduate School’s results speeding up NUMA show promise
• Ongoing efforts
  – Measurement, characterization and optimization of floating point and data movement
  – Explore the effect of different workloads, more aggressive compiler optimizations, and reduced (single) precision

Page 39

39

Acknowledgements

• Intel: Mike Greenfield, Ruchira Sasanka, Indraneil Gokhale, Larry Meadows, Zakhar Matveev, Dmitry Petunin, Dmitry Dontsov, Sumedh Naik, Karthik Raman

• NRL and NPS: Alex Reinecke, Kevin Viner, Jim Doyle, Sasa Gabersek, Andreas Mueller (now at ECMWF), Frank Giraldo

• NCAR: Bill Skamarock, Michael Duda, John Dennis, Chris Kerr

• NOAA: Mark Govett, Jim Rosinski, Tom Henderson (now at Spire)

• Google search, which usually leads to something John McCalpin has already figured out...