Intel Xeon Phi Programming Dr. Volker Weinberg (LRZ) with material from Dr. M. Allalen (LRZ) & Dr. K. Fürlinger (LMU)
PRACE Autumn School 2016, September 27-30, 2016, Hagenberg
Agenda Intel Xeon Phi Programming
28.9.2016 Intel Xeon Phi Programming
● 10:15-11:45 Introduction: Intel Xeon Phi @ LRZ & EU,
Architecture, Programming Models, Native Mode
● 11:45-12:00 Coffee Break
● 12:00-13:00 Offload Mode I
● 13:00-14:15 Lunch Break
● 14:15-15:00 Offload Mode II
● 15:00-15:45 MPI
● 15:45-16:00 Coffee Break
● 16:00-17:00 Intel MKL Library
● 17:00-17:30 Optimisation and Vectorisation
Intel Xeon Phi and GPU Training @ LRZ
Intel Xeon Phi Programming
28.-30.4.2014 @ LRZ (PATC): KNC+GPU
27.-29.4.2015 @ LRZ (PATC): KNC+GPU
3.-4.2.2016 @ IT4Innovations: KNC
27.-29.6.2016 @ LRZ (PATC): KNC+KNL
Sept. 2016 @ PRACE Seasonal School,
Hagenberg: KNC
Feb. 2017 @ IT4Innovations (PATC): KNC
Jun. 2017 @ LRZ (PATC): KNL
http://inside.hlrs.de/
inSiDE, Vol. 12, No. 2, p. 102, 2014
inSiDE, Vol. 13, No. 2, p. 79, 2015
inSiDE, Vol. 14, No. 1, p. 76f, 2016
28.9.2016
Evaluating Accelerators at LRZ
Research at LRZ within PRACE & KONWIHR:
● CELL programming
2008-2009 Evaluation of CELL programming.
IBM announced in Nov. 2009 that the CELL line would be discontinued.
● GPGPU programming
Regular GPGPU computing courses at LRZ since 2009.
Evaluation of GPGPU programming languages:
CAPS HMPP → OpenACC
PGI accelerator compiler → OpenACC
CUDA, cuBLAS, cuFFT
PyCUDA/R
● RapidMind → ArBB (Intel) → discontinued
● Larrabee (2009) → Knights Ferry (2010) → Knights Corner → Intel
Xeon Phi (2012) → KNL (2016)
28.9.2016 Intel Xeon Phi Programming
IPCC (Intel Parallel Computing Centre)
● New Intel Parallel Computing Centre (IPCC) since July 2014:
Extreme Scaling on MIC/x86
● Chair of Scientific Computing at the Department of Informatics in
the Technische Universität München (TUM) & LRZ
● https://software.intel.com/de-de/ipcc#centers
● https://software.intel.com/de-de/articles/intel-parallel-computing-center-at-leibniz-supercomputing-centre-and-technische-universit-t
● Codes:
Simulation of Dynamic Ruptures and Seismic Motion in Complex
Domains: SeisSol
Numerical Simulation of Cosmological Structure Formation: GADGET
Molecular Dynamics Simulation for Chemical Engineering: ls1 mardyn
Data Mining in High Dimensional Domains Using Sparse Grids: SG++
28.9.2016 Intel Xeon Phi Programming
● Czech-Bavarian Competence Team for
Supercomputing Applications (CzeBaCCA)
● New BMBF funded project that started in Jan. 2016 to:
Foster Czech-German Collaboration in Simulation Supercomputing
series of workshops will initiate and deepen collaboration between Czech
and German computational scientists
Establish Well-Trained Supercomputing Communities
joint training program will extend and improve trainings on both sides
Improve Simulation Software
establish and disseminate role models and best practices of simulation
software in supercomputing
Intel Xeon Phi Programming
CzeBaCCA Project
28.9.2016
CzeBaCCA Trainings and Workshops
Intel Xeon Phi Programming
● https://www.lrz.de/forschung/projekte/forschung-hpc/CzeBaCCA/
Intel MIC Programming Workshop,
3 – 4 February 2016, Ostrava, Czech Republic
Scientific Workshop: SeisMIC - Seismic Simulation on Current and Future
Supercomputers,
5 February 2016, Ostrava, Czech Republic
Intel MIC Programming Workshop,
27 - 29 June 2016, Garching, Germany
Scientific Workshop: High Performance Computing for Water Related Hazards,
29 June - 1 July 2016, Garching, Germany
http://inside.hlrs.de/ inSiDE, Vol. 14, No. 1, p. 76f, 2016
http://www.gate-germany.de/fileadmin/dokumente/Laenderprofile/Laenderprofil_Tschechien.pdf, p.27
28.9.2016
PRACE: Best Practice Guides
● http://www.prace-ri.eu/best-practice-guides/
● Best Practice Guide – Hydra, March 2013 PDF HTML
● Best Practice Guide – JUROPA, March 2013 PDF HTML
● Best Practice Guide – Anselm, June 2013 PDF HTML
● Best Practice Guide – Curie, November 2013 PDF HTML
● Best Practice Guide – Blue Gene/Q, January 2014 PDF HTML
● Best Practice Guide – Intel Xeon Phi, February 2014 PDF HTML
● Best Practice Guide - JUGENE, June 2012 PDF HTML
● Best Practice Guide - Cray XE-XC, December 2013 PDF HTML
● Best Practice Guide - IBM Power, June 2012 PDF HTML
● Best Practice Guide - IBM Power 775, November 2013 PDF HTML
● Best Practice Guide - Chimera, April 2013 PDF HTML
● Best Practice Guide - GPGPU, May 2013 PDF HTML
● Best Practice Guide - Jade, February 2013 PDF HTML
● Best Practice Guide - Stokes, February 2013 PDF HTML
● Best Practice Guide - SuperMUC, May 2013 PDF HTML
● Best Practice Guide - Generic x86, May 2013 PDF HTML
28.9.2016 Intel Xeon Phi Programming
Intel MIC within PRACE: Best Practice
Guide
● Best Practice Guide – Intel Xeon Phi
Created within PRACE-3IP.
Written in Docbook XML.
Michaela Barth (KTH Sweden), Mikko Byckling (CSC Finland), Nevena Ilieva (NCSA Bulgaria), Sami Saarinen (CSC Finland), Michael Schliephake (KTH Sweden), Volker Weinberg (LRZ, Editor).
http://www.prace-ri.eu/Best-Practice-Guide-Intel-Xeon-Phi-HTML
http://www.prace-ri.eu/IMG/pdf/Best-Practice-Guide-Intel-Xeon-Phi.pdf
28.9.2016 Intel Xeon Phi Programming
Intel MIC within PRACE: Preparatory
Access
28.9.2016 Intel Xeon Phi Programming
● Applications Enabling for Capability Science
27 enabling projects from 17 PRACE partners from 14 countries
Jul-Dec 2013
Computations on Eurora (EURopean many integrated cORe
Architecture) Prototype at CINECA, Italy with 64 Xeon Phi
coprocessors and 64 NVIDIA GPUs
X. Guo, Report on Application Enabling for Capability Science in
the MIC Architecture, PRACE Deliverable D7.1.3,
http://www.prace-ri.eu/IMG/pdf/d7.1.3_1ip.pdf
16 Whitepapers available online:
http://www.prace-project.eu/Evaluation-Intel-MIC
Intel MIC within PRACE: Preparatory
Access
● Performance Analysis and Enabling of the RayBen Code for the Intel® MIC Architecture
● Enabling the UCD-SPH code on the Xeon Phi
● Xeon Phi Meets Astrophysical Fluid Dynamics
● Multi-Kepler GPU vs. Multi-Intel MIC for spin systems simulations
● Enabling Smeagol on Xeon Phi: Lessons Learned
● Code Optimization and Scaling of the Astrophysics Software Gadget on Intel Xeon Phi
● Code Optimization and Scalability Testing of an Artificial Bee Colony Based Software for
Massively Parallel Multiple Sequence Alignment on the Intel MIC Architecture
● Optimization and Scaling of Multiple Sequence Alignment Software ClustalW on Intel Xeon
Phi
● Porting FEASTFLOW to the Intel Xeon Phi: Lessons Learned
● Optimising CP2K for the Intel Xeon Phi
● Towards Porting a Real-World Seismological Application to the Intel MIC Architecture
● FMPS on MIC
● Massively parallel Poisson Equation Solver for hybrid Intel Xeon – Xeon Phi HPC Systems
● Exploiting Locality in Sparse Matrix-Matrix Multiplication on the Many Integrated Core
Architecture
● Porting and Verification of ExaFMM Library in MIC Architecture
● AGBNP2 Implicit Solvent Library for Intel® MIC Architecture
28.9.2016 Intel Xeon Phi Programming
PRACE Systems
● MARCONI @ CINECA
The second partition is based on the Lenovo Adams Pass architecture and is equipped with the new Intel Knights Landing (KNL, BIN1) processors. It consists of 3600 nodes (1 KNL processor at 1.4 GHz and 96 GB of DDR4 RAM per node). Each KNL is equipped with 68 cores and 16 GB of MCDRAM.
● MareNostrum @ BSC
1 partition with 42 nodes, each with 2 Intel Xeon Phi 5110P
(60 cores / each with 4 hardware threads = 240 total threads, 8
GB of GDDR5 RAM, 1.053 GHz clock frequency)
● SuperMIC @ LRZ
1 partition with 32 nodes, each with 2 Intel Xeon Phi 5110P
28.9.2016 Intel Xeon Phi Programming
DEEP/ER Project: Towards Exascale
● Design of an architecture leading to Exascale.
● Development of hardware:
Implementation of a Booster based on MIC processors and EXTOLL
interconnect.
● Energy-aware integration of components:
Hot-water cooling.
● Cluster management system.
● Programming environment, programming models.
● Libraries and performance analysis tools.
● Porting applications.
28.9.2016 Intel Xeon Phi Programming
Green 500 List (Nov 2015)
28.9.2016 Intel Xeon Phi Programming
All systems in the top 10 are accelerator-based (mostly using GPUs)
Intel Xeon Phi @ top500 June 2016
● http://www.top500.org/list/2016/06/
● #2 National Super Computer Center in Guangzhou, China:
Tianhe-2 (MilkyWay-2) - TH-IVB-FEP Cluster, Intel Xeon E5-
2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P
NUDT
● #12 Texas Advanced Computing Center/Univ. of Texas United
States Stampede - PowerEdge C8220, Xeon E5-2680 8C
2.700GHz, Infiniband FDR, Intel Xeon Phi SE10P, Dell
● #25 (USA) / #34 (USA) / #42 (China)
● #55 IT4Innovations National Supercomputing Center, VSB-
Technical University of Ostrava Czech Republic Salomon - SGI
ICE X, Xeon E5-2680v3 12C 2.5GHz, Infiniband FDR, Intel, SGI
● #64 (USA) / #65 (USA) / #88 (Japan) / #100 (USA)
28.9.2016 Intel Xeon Phi Programming
Current Intel Xeon Phi Installations
● Tianhe-2 (China)
16000 nodes, each with 2 CPUs and 3 Intel Xeon Phis 31S1P
48000 Xeon Phi accelerators in total
3.1 million cores in total
33.8 PFlop/s Linpack, 17.8 MW
● Stampede (TACC, Texas)
6400 nodes, each with 1 Intel Xeon Phi SE10P
5.1 PFlop/s Linpack, 4.5 MW
● Salomon (IT4Innovations, Ostrava)
432 nodes, each with 2x Intel Xeon E5-2680v3 @ 2.5GHz and 2 x Intel
Xeon Phi 7120P with 61 cores @ 1.238 GHz, 16 GB RAM
● SuperMIC (LRZ, Munich)
32 nodes, each with 2 Intel Xeon E5-2650 @ 2.6 GHz and 2 x Intel
Xeon Phi 5110P with 60 cores @ 1.1 GHz, 8 GB RAM
28.9.2016 Intel Xeon Phi Programming
The Salomon System
28.9.2016 Intel Xeon Phi Programming
The Salomon cluster consists of 1008 compute nodes, totalling 24192 compute cores
with 129TB RAM and giving over 2 PFlop/s theoretical peak performance. Each node is
a powerful x86-64 computer, equipped with 24 cores, at least 128GB RAM. Nodes are
interconnected by 7D enhanced hypercube Infiniband network and equipped with Intel
Xeon E5-2680v3 processors. The Salomon cluster consists of 576 nodes without
accelerators and 432 nodes equipped with Intel Xeon Phi MIC accelerators.
Login: salomon.it4i.cz
Module System: module avail
module load intel
Batch System: PBS Pro job workload manager
Documentation: https://docs.it4i.cz/salomon
The Salomon System
In general
Primary purpose: High Performance Computing
Architecture of compute nodes: x86-64
Operating system: CentOS 6.7 Linux
Compute nodes
Total: 1008
Processor: 2x Intel Xeon E5-2680v3, 2.5 GHz, 12 cores
RAM: 128 GB, 5.3 GB per core, DDR4 @ 2133 MHz
Local disk drive: no
Compute network / topology: InfiniBand FDR56 / 7D enhanced hypercube
w/o accelerator: 576
MIC accelerated: 432
In total
Total theoretical peak performance (Rpeak): 2011 TFlop/s
Total amount of RAM: 129.024 TB
28.9.2016 Intel Xeon Phi Programming
The Salomon System: Compute Nodes
28.9.2016 Intel Xeon Phi Programming
Node | Count | Processor | Cores | Memory | Accelerator
w/o accelerator | 576 | 2x Intel Xeon E5-2680v3, 2.5 GHz | 24 | 128 GB | -
MIC accelerated | 432 | 2x Intel Xeon E5-2680v3, 2.5 GHz | 24 | 128 GB | 2x Intel Xeon Phi 7120P, 61 cores, 16 GB RAM
Xeon Phi - History
● Intel decided to enter the GPU market in the mid 2000s
● GPUs need massive parallelism
GPU as a CPU with many x86 cores
Code-named Larrabee
Compared to established GPUs it was not competitive
● The project was discontinued in favor of a product for the HPC market
● MIC (many integrated cores) architecture
Knights Ferry: prototype card, not a commercial product
Knights Corner: first commercial product – system used during this school
Knights Landing: the next iteration of MIC
Knights Hill: (announced at SC14 for 2017/18)
28.9.2016 Intel Xeon Phi Programming
Knights Corner vs. Xeon Phi vs. MIC
● MIC is the code name for Intel’s range of manycore CPUs
Knights Corner is the code name for the product
Xeon Phi is the official marketing terminology
● KNC comes in…
6 different specifications
3 main lines
57 / 60 / 61 cores, clocked at 1.1 / 1.053 / 1.238 GHz
6 / 8 / 16 GB of main memory
Different TDPs and memory bandwidths
3 different form factors
Actively cooled
Passively cooled
Dense form factor
28.9.2016 Intel Xeon Phi Programming
Comparison CPU – MIC - GPU
28.9.2016 Intel Xeon Phi Programming
CPU: general-purpose architecture
MIC: power-efficient multiprocessor (low frequency, Pentium design)
GPU: massively data-parallel
Intel Knights Landing
● At ISC 2016 in Frankfurt, Germany, Intel Corp. launched the second-generation Xeon Phi product family, formerly code-named Knights Landing, aimed at HPC and machine learning workloads.
● Will not be covered in this school!
28.9.2016 Intel Xeon Phi Programming
The Future of the MIC Architecture
● Knights Landing (KNL)
Next iteration of the MIC
architecture
14nm process
Based on Silvermont architecture
(Out-of-order Atom processors)
Major improvements and
upgrades over KNC
2D mesh interconnect instead of KNC ring
interconnect
Is available as a stand-alone CPU
Supports AVX-512 (Advanced Vector Extensions)
28.9.2016 Intel Xeon Phi Programming
The Xeon Phi (KNC) in use at LRZ & BSC
● This KNC is the 5110P model
Passively cooled, PCIe form factor
245 Watt Thermal Design Power (TDP)
60 cores / each with 4 hardware threads = 240 total threads
8 GB of GDDR5 RAM
1.053 GHz clock frequency
320 GB/sec peak memory bandwidth
http://ark.intel.com/de/products/71992/Intel-Xeon-Phi-Coprocessor-5110P-8GB-1_053-GHz-60-core
28.9.2016 Intel Xeon Phi Programming
The Xeon Phi (KNC) in use at
IT4Innovations
● This KNC is the 7120P model
Passively cooled, PCIe form factor
300 Watt Thermal Design Power (TDP)
61 cores / each with 4 hardware threads = 244 total threads
16 GB of GDDR5 RAM
1.238 GHz clock frequency
352 GB/sec peak memory bandwidth
http://ark.intel.com/de/products/75799/Intel-Xeon-Phi-Coprocessor-7120P-16GB-1_238-GHz-61-core
28.9.2016 Intel Xeon Phi Programming
The Intel MIC Architecture
● Up to 16 GB GDDR5 memory (350 GB/s).
● Coprocessor connected to the host by PCIe Gen2.
● Runs Linux OS (Linux Standard Base (LSB) core libraries &
Busybox minimal shell environment).
● Up to 61 cores @ 1 GHz interconnected by a ring interconnect
● Theoretical peak performance:1 TFlop/s (DP), 2 TFlop/s (SP).
● 64-bit execution.
● x86 architecture, but SSE/AVX not supported!
Different instruction set for SIMD:
Intel Initial Many Core Instructions (IMCI).
● Highly parallel and power-efficient design.
28.9.2016 Intel Xeon Phi Programming
The Intel MIC Architecture
28.9.2016 Intel Xeon Phi Programming
CRI: Core Ring Interface, bidirectional ring interconnect
which connects all the cores, L2 caches, PCIe client logic,
GDDR5 memory controllers etc.
The Intel MIC Architecture: HW Threads
● Derived from Pentium P54c design:
Intel gave RTL code to Pentagon to produce radiation
hardened version for the military
In-order architecture.
2 instructions per cycle: one on U-pipe, one on V-pipe.
At least 2 threads should be run per core.
● Xeon Phi supports 4 hardware threads
Intended to hide latencies.
Unlike hyperthreading, MIC HW threads cannot be switched
off.
Maximum performance may already be reached with fewer than 4 threads per core.
28.9.2016 Intel Xeon Phi Programming
The Intel MIC Architecture: Caches
● Cache sizes:
32 kB of L1 instruction cache.
32 kB of L1 data cache.
512 kB of local L2 cache.
● Latency:
L1 cache: 1 cycle.
L2 cache: 15-30 cycles.
GDDR5 memory: 500-1000 cycles.
● HW Prefetcher: L2 cache only
● L2 size depends on how data/code is shared between the cores
If no data is shared between cores: L2 size is 30.5 MB (61 cores).
If every core shares the same data: effective L2 size is 512 kB.
Cache coherency across the entire coprocessor.
Data remains consistent without software intervention.
28.9.2016 Intel Xeon Phi Programming
Network Access
● Network access possible using TCP/IP tools like ssh.
● NFS mounts on Xeon Phi supported.
● Proxy Console / File I/O.
28.9.2016 Intel Xeon Phi Programming
Advantages of the MIC Architecture
● Retains programmability and flexibility of standard x86
architecture.
● No need to learn a new complicated language like CUDA or
OpenCL.
● Offers possibilities we always missed on GPUs: logging in to the
system, watching and controlling processes via top, kill, etc., just like
on a Linux host.
● Allows many different parallel programming models like
OpenMP, MPI, Intel Cilk and Intel Threading Building Blocks.
● Offers standard math-libraries like Intel MKL.
● Supports whole Intel tool chain, e.g. Intel C/C++ and Fortran
Compiler, Debugger & Intel VTune Amplifier.
28.9.2016 Intel Xeon Phi Programming
Programming Modes
● Native Mode
Programs started on Xeon Phi.
Cross-compilation using -mmic.
User access to Xeon Phi necessary.
Necessary to support MPI ranks on Xeon Phi.
● Offload (Accelerator) Mode
Programs started on the host.
Intel Pragmas to offload code to Xeon Phi.
OpenMP possible, but no MPI ranks on Xeon Phi.
No user access to Xeon Phi needed.
No input data files on Xeon Phi possible.
28.9.2016 Intel Xeon Phi Programming
Offload Modes
● Host and MIC do not share physical or virtual memory in hardware.
● 2 Offload data transfer models are available:
1. Explicit copy: Language Extensions for Offload (LEO) / OpenMP 4
Syntax: pragma/directive based
offload directive specifies variables that need to be copied between host and
MIC
Example (LEO):
C: #pragma offload target(mic) in(data:length(size))
Fortran: !DIR$ offload target(mic) in(data:length(size))
2. Implicit Copy: MYO
Syntax: keyword extension based
shared variables need to be declared, same variables can be used on the
host and MIC, runtime automatically maintains coherence
Example:
C: _Cilk_shared double a; _Cilk_offload func(a);
Fortran: not supported
28.9.2016 Intel Xeon Phi Programming
Programming Languages / Libraries
● OpenMP
Native execution on MIC (cross-compilation with -mmic)
Execution on host, using offload pragmas / directives to offload code at
runtime
● MPI (and hybrid MPI & OpenMP)
Co-processor only MPI programming model: native execution on MIC
using mpiexec.hydra on MIC.
Symmetric MPI programming model: MPI ranks on MICs and host CPUs.
● MKL
Native execution on MIC (compilation with -mkl -mmic).
Compiler assisted offload.
Automatic Offload (AO): automatically uses both host and MIC,
transparent and automatic data transfer and execution management
(compilation with -mkl, mkl_mic_enable() / MKL_MIC_ENABLE=1); see the sketch below.
28.9.2016 Intel Xeon Phi Programming
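A minimal sketch of MKL Automatic Offload (not from the slides; the matrix size and values are illustrative). It assumes mkl_mic_enable() is available through mkl.h, as named above, and is built on the host with icc -mkl:

/* ao_dgemm.c - MKL Automatic Offload sketch.
   Build: icc -mkl ao_dgemm.c -o ao_dgemm
   Either call mkl_mic_enable() below or set MKL_MIC_ENABLE=1. */
#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(void) {
    int n = 4096;                               /* large enough that AO may offload */
    double *a = (double*)malloc((size_t)n*n*sizeof(double));
    double *b = (double*)malloc((size_t)n*n*sizeof(double));
    double *c = (double*)malloc((size_t)n*n*sizeof(double));
    for (long i = 0; i < (long)n*n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    mkl_mic_enable();                           /* enable Automatic Offload (as on the slide) */

    /* MKL decides whether to run this dgemm on the host, the MIC, or both. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    printf("c[0] = %f\n", c[0]);                /* expect 2*n = 8192.0 */
    free(a); free(b); free(c);
    return 0;
}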
Distributed vs. Shared Memory
28.9.2016 Intel Xeon Phi Programming
Distributed Memory
● Same program on each processor/machine (SPMD) or
Multiple programs with consistent communication structure (MPMD)
● Program written in a sequential language
all variables process-local
no implicit knowledge of data on other processors
● Data exchange between processes:
send/receive messages via appropriate library
most tedious, but also the most flexible way of parallelization
● Parallel library discussed here:
Message Passing Interface, MPI
Shared Memory
● Single program on a single machine: a UNIX process splits off threads, which are mapped to CPUs for work distribution
● Data may be process-global or thread-local
exchange of data is not needed, or happens via suitable synchronization mechanisms
● Programming models:
explicit threading (hard)
directive-based threading via OpenMP (easier)
automatic parallelization (very easy, but mostly not efficient)
MPI vs. OpenMP
28.9.2016 Intel Xeon Phi Programming
● MPI standard
MPI forum released version 2.2 in
September 2009
MPI version 3.1 in June 2015
unified document ("MPI1+2")
● Base languages
Fortran (77, 95)
C
C++ binding obsolescent
use C bindings
● Resources:
http://www.mpi-forum.org
● OpenMP standard
OpenMP 3.1 (July 2011) released by
architecture review board (ARB)
feature update (tasking etc.)
OpenMP 4.0 (July 2013)
SIMD, affinity policies, accelerator
support
OpenMP 4.5 (Nov 2015)
● Base languages
Fortran (77, 95)
C, C++
(Java is not a base language)
● Resources:
http://www.openmp.org
http://www.compunity.org
Simple OpenMP program
28.9.2016 Intel Xeon Phi Programming
#include <stdio.h>
#include <omp.h>
int main() {
int numth = 1;
#pragma omp parallel
{int myth = 0; /* private */
#pragma omp single
numth = omp_get_num_threads();
/* block above: one statement */
myth = omp_get_thread_num();
printf("Hello from %i of %i\n",
myth,numth);
} /* end parallel */
}
icc -openmp helloopenmp.c
Simple OpenMP Program
lu65fok@login12:~/mickurs> export OMP_NUM_THREADS=10
lu65fok@login12:~/mickurs> ./helloopenmp
Hello from 5 of 10
Hello from 2 of 10
Hello from 6 of 10
Hello from 0 of 10
Hello from 8 of 10
Hello from 3 of 10
Hello from 4 of 10
Hello from 9 of 10
Hello from 7 of 10
Hello from 1 of 10
28.9.2016 Intel Xeon Phi Programming
Simplest MPI Program
/* C Example */
#include <stdio.h>
#include <mpi.h>
int main (int argc, char* argv[])
{
int rank, size;
MPI_Init (&argc, &argv); /* starts MPI */
MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
printf("Hello from %i of %i\n“, rank, size);
MPI_Finalize();
return 0;
}
mpiicc hellompi.c
28.9.2016 Intel Xeon Phi Programming
Simplest MPI Program
lu65fok@login12:~/mickurs> mpiicc hellompi.c -o hellompi
lu65fok@login12:~/mickurs> mpirun -n 10 ./hellompi
Hello from 5 of 10
Hello from 6 of 10
Hello from 7 of 10
Hello from 8 of 10
Hello from 9 of 10
Hello from 0 of 10
Hello from 1 of 10
Hello from 2 of 10
Hello from 3 of 10
Hello from 4 of 10
28.9.2016 Intel Xeon Phi Programming
Useful Tools and Files on Coprocessor
● top - display Linux tasks
● ps - report a snapshot of the current processes.
● kill - send signals to processes, or list signals
● ifconfig - configure a network interface
● traceroute - print the route packets take to network host
● mpiexec.hydra – run Intel MPI natively
● /proc/cpuinfo
● /proc/meminfo
28.9.2016 Intel Xeon Phi Programming
/proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 11
model : 1
model name : 0b/01
stepping : 3
cpu MHz : 1052.630
cache size : 512 KB
physical id : 0
siblings : 240
core id : 59
cpu cores : 60
apicid : 236
initial apicid : 236
fpu : yes
fpu_exception : yes
cpuid level : 4
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr mca pat fxsr ht
syscall nx lm rep_good nopl lahf_lm
bogomips : 2094.86
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
28.9.2016 Intel Xeon Phi Programming
/proc/meminfo
[lu65fok@i01r13c01-mic0 proc]$ cat meminfo
MemTotal: 7882368 kB
MemFree: 7182704 kB
Buffers: 0 kB
Cached: 298824 kB
SwapCached: 0 kB
Active: 38660 kB
Inactive: 265544 kB
Active(anon): 38660 kB
Inactive(anon): 265544 kB
Active(file): 0 kB
Inactive(file): 0 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
…
28.9.2016 Intel Xeon Phi Programming
Native Mode
28.9.2016 Intel Xeon Phi Programming
● Compile on the Host (Login Node supermic):
lu65fok@login12:~/test> icpc -mmic hello.c -o hello
lu65fok@login12:~/tests> ifort -mmic hello.f90 -o hello
● Launch execution from the MIC:
lu65fok@login12:~/test> scp hello i01r13c01-mic0:
hello 100% 10KB 10.2KB/s 00:00
lu65fok@login12:~/test> ssh i01r13c01-mic0
[lu65fok@i01r13c01-mic0 ~]$ ./hello
hello, world
[lu65fok@i01r13c01-mic0 ~]$
Home directories might also be mounted on the MICs, as on Salomon and SuperMIC.
Native Mode: micnativeloadex
● Launch execution from the host:
lu65fok@login12:~/test> ./hello
-bash: ./hello: cannot execute binary file
lu65fok@i01r13c01:~/test> micnativeloadex ./hello
hello, world
lu65fok@i01r13c01:~/test> micnativeloadex ./hello -v
hello, world
Remote process returned: 0
Exit reason: SHUTDOWN OK
28.9.2016 Intel Xeon Phi Programming
micinfo
lu65fok@i01r13c01:~> micinfo -listdevices
MicInfo Utility Log
Created Thu Apr 17 17:22:27 2014
List of Available Devices
deviceId | domain | bus# | pciDev# | hardwareId
---------|----------|------|---------|-----------
0 | 0 | 20 | 0 | 22508086
1 | 0 | 8b | 0 | 22508086
-------------------------------------------------
28.9.2016 Intel Xeon Phi Programming
Micinfo Output
Version
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.1.2
Device Serial Number : ADKC33400625
Cores
Total No of Active Cores : 60
Voltage : 1027000 uV
Frequency : 1052631 kHz
GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV
28.9.2016 Intel Xeon Phi Programming
_SC_NPROCESSORS_ONLN
lu65fok@login12:~/tests> cat hello.c
#include <stdio.h>
#include <unistd.h>
int main(){
printf("Hello world! I have %ld logical cores.\n",
sysconf(_SC_NPROCESSORS_ONLN));
}
lu65fok@i01r13c01:~/tests> ./hello-host
Hello world! I have 32 logical cores.
[lu65fok@i01r13c01-mic0 ~]$ ./hello-mic
Hello world! I have 240 logical cores.
lu65fok@i01r13c01:~/tests> micnativeloadex ./hello-mic
Hello world! I have 240 logical cores.
28.9.2016 Intel Xeon Phi Programming
Native Mode: micnativeloadex -l
lu65fok@i01r13c01:~/test> micnativeloadex hello -l
Dependency information for hello
Full path was resolved as
/home/hpc/pr28fa/lu65fok/test/hello
Binary was built for Intel(R) Xeon Phi(TM) Coprocessor
(codename: Knights Corner) architecture
SINK_LD_LIBRARY_PATH =
Dependencies Found:
(none found)
Dependencies Not Found Locally (but may exist already on the coprocessor):
libm.so.6
libstdc++.so.6
libgcc_s.so.1
libc.so.6
libdl.so.2
28.9.2016 Intel Xeon Phi Programming
For the Labs:
Salomon System Initialisation
● Load the Intel environment on the host via:
module load intel
● Submit an interactive job via:
qsub -I -A DD-16-44 -q R??????
-l select=1:ncpus=24:accelerator=True:naccelerators=2
-l walltime=12:00:00
● No module system on the MIC, manual initialisation
needed, i.e.:
export PATH=$PATH:/apps/all/impi/5.1.2.150-iccifort-2016.1.150-GCC-4.9.3-2.25/mic/bin/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/apps/all/imkl/11.3.1.150-iimpi-8.1.5-GCC-4.9.3-2.25/mkl/lib/mic/:/apps/all/icc/2016.1.150-GCC-4.9.3-2.25/compilers_and_libraries_2016.1.150/linux/compiler/lib/mic/
28.9.2016 Intel Xeon Phi Programming
Intel Offload Directives
● Syntax:
C:
#pragma offload target(mic) <clauses>
<statement block>
Fortran:
!DIR$ offload target(mic) <clauses>
<statement>
!DIR$ omp offload target(mic) <clauses>
<OpenMP construct>
28.9.2016 Intel Xeon Phi Programming
Intel Offload Directive
● C:
Pragma can be before any statement, including a
compound statement or an OpenMP parallel
pragma
● Fortran: If OMP is specified: the next line, other than a
comment, must be an OpenMP PARALLEL,
PARALLEL SECTIONS, or PARALLEL DO directive.
If OMP is not specified, the next line must be one of:
An OpenMP* PARALLEL, PARALLEL SECTIONS, or
PARALLEL DO directive
A CALL statement
An assignment statement where the right side only calls a
function
28.9.2016 Intel Xeon Phi Programming
Intel Offload Directive
● Offloading a code block in Fortran:
!DIR$ offload begin target(MIC)
…
!DIR$ end offload
Code block can include any number of Fortran
statements, including DO, CALL and any assignments,
but not OpenMP directives.
28.9.2016 Intel Xeon Phi Programming
Intel Offload
● Implements the following steps:
1. Memory allocation on the MIC
2. Data transfer from the host to the MIC
3. Execution on the MIC
4. Data transfer from the MIC to the host
5. Memory deallocation on MIC
28.9.2016 Intel Xeon Phi Programming
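A minimal sketch (not from the slides; values are illustrative) showing the five steps above in a single offload region:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 1000;
    double *data = (double*)malloc(n * sizeof(double));
    double sum = 0.0;
    for (int i = 0; i < n; i++) data[i] = i;

    /* One offload region performs all five steps:
       1. allocate space for data/sum on the MIC,
       2. copy data (in) and sum (inout) from host to MIC,
       3. run the statement block on the MIC,
       4. copy sum (inout) from MIC to host,
       5. free the MIC copies. */
    #pragma offload target(mic) in(data:length(n)) inout(sum)
    {
        for (int i = 0; i < n; i++)
            sum += data[i];
    }

    printf("sum = %f\n", sum);   /* 0+1+...+999 = 499500 */
    free(data);
    return 0;
}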
Intel Offload: Hello World in C
#include <stdio.h>
int main (int argc, char* argv[]) {
#pragma offload target(mic)
{
printf("MIC: Hello world from MIC.\n");
}
printf( "Host: Hello world from host.\n");
}
28.9.2016 Intel Xeon Phi Programming
Note: the offloaded statement block must start on a new line after the pragma.
Intel Offload: Hello World in Fortran
PROGRAM HelloWorld
!DIR$ offload begin target(MIC)
PRINT *,'MIC: Hello world from MIC'
!DIR$ end offload
PRINT *,'Host: Hello world from host'
END
28.9.2016 Intel Xeon Phi Programming
Intel Offload: Hello World in C
lu65fok@login12:~/tests> icpc offload1.c -o offload1
lu65fok@login12:~/tests> ./offload1
offload error: cannot offload to MIC - device is not available
lu65fok@i01r13c01:~/tests> ./offload1
Host: Hello world from host.
MIC: Hello world from MIC.
28.9.2016 Intel Xeon Phi Programming
Intel Offload: Hello World in Fortran
lu65fok@login12:~/tests> ifort offload1.f90 -o offload1
lu65fok@login12:~/tests> ./offload1
offload error: cannot offload to MIC - device is not available
lu65fok@i01r13c01:~/tests> ./offload1
Host: Hello world from host.
MIC: Hello world from MIC.
28.9.2016 Intel Xeon Phi Programming
Intel Offload: Hello World with Hostnames
#include <stdio.h>
#include <unistd.h>
int main (int argc, char* argv[]) {
char hostname[100];
gethostname(hostname,sizeof(hostname));
#pragma offload target(mic)
{
char michostname[100];
gethostname(michostname, sizeof(michostname));
printf("MIC: Hello world from MIC. I am %s and I have %ld logical cores. I was
called from host: %s \n", michostname, sysconf(_SC_NPROCESSORS_ONLN),
hostname);
}
printf( "Host: Hello world from host. I am %s and I have %ld logical cores.\n", hostname, sysconf(_SC_NPROCESSORS_ONLN));
}
28.9.2016 Intel Xeon Phi Programming
Intel Offload: Hello World with Hostnames
lu65fok@login12:~/tests> icpc offload.c -o offload
lu65fok@i01r13c01:~/tests> ./offload
Host: Hello world from host. I am i01r13c01 and I have 32 logical
cores.
MIC: Hello world from MIC. I am i01r13c01-mic0 and I have 240
logical cores. I was called from host: i01r13c01
28.9.2016 Intel Xeon Phi Programming
Intel Offload: -offload=optional / mandatory
lu65fok@login12:~/tests> icpc -offload=optional offload.c -o offload
lu65fok@login12:~/tests> ./offload
MIC: Hello world from MIC. I am login12 and I have 16 logical cores. I
was called from host: login12
Host: Hello world from host. I am login12 and I have 16 logical cores.
lu65fok@login12:~/tests> icpc -offload=mandatory offload.c -o offload
lu65fok@login12:~/tests> ./offload
offload error: cannot offload to MIC - device is not available
28.9.2016 Intel Xeon Phi Programming
Intel Offload: -none
lu65fok@login12:~/tests> icpc -offload=none offload.c -o offload
offload.c(13): warning #161: unrecognized #pragma
#pragma offload target(mic)
^
lu65fok@login12:~/tests>
lu65fok@i01r13c01:~/tests> ./offload
MIC: Hello world from MIC. I am i01r13c01 and I have 32 logical cores.
I was called from host: i01r13c01
Host: Hello world from host. I am i01r13c01 and I have 32 logical
cores.
28.9.2016 Intel Xeon Phi Programming
Intel Offload
#include <stdio.h>
#include <stdlib.h>
int main(){
#pragma offload target (mic)
{
system("command");
}
}
28.9.2016 Intel Xeon Phi Programming
Intel Offload: system(“set”)
lu65fok@i01r13c01:~/tests> ./system
BASH=/bin/sh
BASHOPTS=cmdhist:extquote:force_fignore:hostcomplete:interactive_comments:progcomp:prom
ptvars:sourcepath
BASH_ALIASES=()
BASH_ARGC=()
BASH_ARGV=()
BASH_CMDS=()
BASH_EXECUTION_STRING=set
BASH_LINENO=()
BASH_SOURCE=()
BASH_VERSINFO=([0]="4" [1]="2" [2]="10" [3]="1" [4]="release" [5]="k1om-mpss-linux-gnu")
BASH_VERSION='4.2.10(1)-release'
COI_LOG_PORT=65535
COI_SCIF_SOURCE_NODE=0
DIRSTACK=()
ENV_PREFIX=MIC
EUID=400
GROUPS=()
28.9.2016 Intel Xeon Phi Programming
Intel Offload: system(“set”)
HOSTNAME=i01r13c01-mic0
HOSTTYPE=k1om
IFS='
'
LIBRARY_PATH=/lrz/sys/intel/compiler140_144/composer_xe_2013_sp1.2.144/tbb/lib/mic:/lrz/sys/intel/compiler1
40_144/composer_xe_2013_sp1.2.144/tbb/lib/mic:/lrz/sys/intel/compiler140_144/composer_xe_2013_sp1.2.144/t
bb/lib/mic:/lrz/sys/intel/compiler140_144/composer_xe_2013_sp1.2.144/tbb/lib/mic
MACHTYPE=k1om-mpss-linux-gnu
OPTERR=1
OPTIND=1
OSTYPE=linux-gnu
PATH=/usr/bin:/bin
POSIXLY_CORRECT=y
PPID=37141
PS4='+ '
PWD=/var/volatile/tmp/coi_procs/1/37141
SHELL=/bin/false
SHELLOPTS=braceexpand:hashall:interactive-comments:posix
SHLVL=1
TERM=dumb
UID=400
_=sh
28.9.2016 Intel Xeon Phi Programming
Intel Offload: system(command)
#pragma offload target (mic)
{
system("hostname");
system("uname -a");
system("whoami");
system("id");
}
lu65fok@i01r13c01:~/tests> ./system
i01r13c01-mic0
Linux i01r13c01-mic0 2.6.38.8+mpss3.1.2 #1 SMP Wed Dec 18
19:09:36 PST 2013 k1om GNU/Linux
micuser
uid=400(micuser) gid=400(micuser)
28.9.2016 Intel Xeon Phi Programming
Offload: Using several MIC Coprocessors
● To query the number of coprocessors:
int nmics = __Offload_number_of_devices();
● To specify which coprocessor n < nmics should do the
computation (see the sketch below):
#pragma offload target(mic:n)
● If (n > nmics) then coprocessor (n % nmics) is used
● Important for:
Asynchronous offloads
Coprocessor-Persistent data
28.9.2016 Intel Xeon Phi Programming
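A minimal sketch (illustrative) that sends one chunk of work to each of two coprocessors via target(mic:d); if fewer coprocessors are present, the d % nmics rule above applies:

#include <stdio.h>

#define N 1000

int main(void) {
    double part[2][N];
    double sums[2] = {0.0, 0.0};

    for (int d = 0; d < 2; d++)
        for (int i = 0; i < N; i++)
            part[d][i] = d + 1;

    /* Each iteration targets coprocessor d (0 or 1). */
    for (int d = 0; d < 2; d++) {
        double s = 0.0;
        double *p = part[d];
        #pragma offload target(mic:d) in(p:length(N)) inout(s)
        {
            for (int i = 0; i < N; i++)
                s += p[i];
        }
        sums[d] = s;
    }

    printf("sums = %f %f\n", sums[0], sums[1]);  /* 1000 and 2000 */
    return 0;
}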
Offloading OpenMP Computations
● C/C++ & OpenMP:
#pragma offload target(mic)
#pragma omp parallel for
for (int i=0;i<n;i++) {
a[i]=c*b[i]+d;
}
● Fortran & OpenMP
!DIR$ offload target(mic)
!$OMP PARALLEL DO
do i = 1, n
a(i) = c*b(i) + d
end do
!$omp END PARALLEL DO
28.9.2016 Intel Xeon Phi Programming
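A complete, compilable version of the C/OpenMP fragment above (array size and coefficients are illustrative); build with icc -openmp:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 1000000;
    double c = 2.0, d = 1.0;
    double *a = (double*)malloc(n * sizeof(double));
    double *b = (double*)malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) b[i] = i;

    /* Offload the OpenMP loop; b is only read (in), a is only written (out).
       Scalars c, d and n are copied in automatically. */
    #pragma offload target(mic) in(b:length(n)) out(a:length(n))
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = c*b[i] + d;

    printf("a[10] = %f\n", a[10]);   /* 2*10 + 1 = 21 */
    free(a); free(b);
    return 0;
}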
Functions and Variables on the MIC
● C:
__attribute__((target(mic))) variables / function
__declspec (target(mic)) variables / function
#pragma offload_attribute(push, target(mic))
… multiple lines with variables / functions
#pragma offload_attribute(pop)
● Fortran:
!DIR$ attributes offload:mic:: variables / function
28.9.2016 Intel Xeon Phi Programming
Functions and Variables on the MIC
#pragma offload_attribute(push,target(mic))
const int n=100;
int a[n], b[n],c,d;
void myfunction(int* a, int*b, int c, int d){
for (int i=0;i<n;i++) {
a[i]=c*b[i]+d;
}
}
#pragma offload_attribute(pop)
int main (int argc, char* argv[]) {
#pragma offload target(mic)
{
myfunction(a,b,c,d);
}
}
28.9.2016 Intel Xeon Phi Programming
Intel Offload Clauses
Clause | Syntax | Semantics
Multiple coprocessors | target(mic[:unit]) | Select specific coprocessors
Conditional offload | if (condition) / mandatory | Select coprocessor or host compute (see the sketch below)
Inputs | in(var-list [modifiers]) | Copy from host to coprocessor
Outputs | out(var-list [modifiers]) | Copy from coprocessor to host
Inputs & outputs | inout(var-list [modifiers]) | Copy host to coprocessor and back when offload completes
Non-copied data | nocopy(var-list [modifiers]) | Data is local to target
Async. offload | signal(signal-slot) | Trigger asynchronous offload
Async. offload | wait(signal-slot) | Wait for completion
28.9.2016 Intel Xeon Phi Programming
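A small sketch of the conditional-offload clause from the table above (the threshold is illustrative): the region is offloaded only when the condition holds, otherwise it runs on the host.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 200000;                 /* problem size (illustrative) */
    double *v = (double*)malloc(n * sizeof(double));
    double sum = 0.0;
    for (int i = 0; i < n; i++) v[i] = 1.0;

    /* Offload only if the problem is large enough to pay for the transfer. */
    #pragma offload target(mic) if(n > 100000) in(v:length(n)) inout(sum)
    {
        #ifdef __MIC__
        printf("running on the MIC\n");
        #else
        printf("running on the host\n");
        #endif
        for (int i = 0; i < n; i++) sum += v[i];
    }

    printf("sum = %f\n", sum);      /* n = 200000.0 */
    free(v);
    return 0;
}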
Intel Offload Modifier Options
Modifier | Syntax | Semantics
Specify copy length | length(N) | Copy N elements of the pointer's type
Coprocessor memory allocation | alloc_if(bool) | Allocate coprocessor space on this offload (default: TRUE)
Coprocessor memory release | free_if(bool) | Free coprocessor space at the end of this offload (default: TRUE)
Array partial allocation & variable relocation | alloc(array-slice), in(var-expr) | Enables partial array allocation and data copy into other vars & ranges
28.9.2016 Intel Xeon Phi Programming
Intel Offload: Data Movement
● #pragma offload target(mic) in(in1,in2,…)
out(out1,out2,…) inout(inout1,inout2,…)
● At Offload start:
Allocate Memory Space on MIC for all variables
Transfer in/inout variables from Host to MIC
● At Offload end:
Transfer out/inout variables from MIC to Host
Deallocate Memory Space on MIC for all variables
28.9.2016 Intel Xeon Phi Programming
Intel Offload: Data Movement
● data = (double*)malloc(n*sizeof(double));
● #pragma offload target(mic) in(data:length(n))
● Copies n doubles to the coprocessor,
not n * sizeof(double) Bytes
● ditto for out() and inout()
28.9.2016 Intel Xeon Phi Programming
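The bullets above as a runnable sketch (n is illustrative); length(n) counts elements of the pointer's type, so 8*n bytes are actually moved for doubles:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 1000;
    double *data = (double*)malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) data[i] = 1.0;

    /* length(n) = n doubles (8*n bytes), not n bytes. */
    #pragma offload target(mic) inout(data:length(n))
    {
        for (int i = 0; i < n; i++) data[i] *= 2.0;
    }

    printf("data[0] = %f\n", data[0]);   /* 2.0 */
    free(data);
    return 0;
}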
Allocation of Partial Arrays in C
● int n=1000;
● data = (double*)malloc(n*sizeof(double));
● #pragma offload target(mic) in(data[100:200] : alloc(data[300:400]))
● Host:
1000 doubles allocated
First element has index 0
Last element has index 999
● MIC:
400 doubles are allocated
First element has index 300
Last element has index 699
200 elements in the range data[100], …, data[299] are copied to
the MIC
28.9.2016 Intel Xeon Phi Programming
Allocation of Partial Arrays in Fortran
● integer :: n=1000
● double precision, allocatable :: data(:)
● allocate(data(n) )
● !C: #pragma offload target(mic) in(data[100:200] : alloc(data[300:400]))
● !DIR$ offload target(mic) in(data(100:299) : alloc(data(300:699)))
● Host:
1000 doubles allocated
First element has index 1
Last element has index 1000
● MIC:
400 doubles are allocated
First element has index 300
Last element has index 699
200 elements in the range data(100), …, data(299) are copied to the MIC
28.9.2016 Intel Xeon Phi Programming
An example for Offloading: Offloading
Code
28.9.2016 Intel Xeon Phi Programming
#pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n))
{
#pragma omp parallel for
for( i = 0; i < n; i++ ) {
for( k = 0; k < n; k++ ) {
#pragma vector aligned
#pragma ivdep
for( j = 0; j < n; j++ ) {
//c[i][j] = c[i][j] + a[i][k]*b[k][j];
c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];
}
}
}
}
Vectorisation Diagnostics
28.9.2016 Intel Xeon Phi Programming
lu65fok@login12:~/tests> icc -vec-report2 -openmp offloadmul.c -ooffloadmul
offloadmul.c(35): (col. 5) remark: LOOP WAS VECTORIZED
offloadmul.c(32): (col. 3) remark: loop was not vectorized: not inner loop
offloadmul.c(57): (col. 2) remark: LOOP WAS VECTORIZED
offloadmul.c(54): (col. 7) remark: loop was not vectorized: not inner loop
offloadmul.c(53): (col. 5) remark: loop was not vectorized: not inner loop
offloadmul.c(8): (col. 9) remark: loop was not vectorized: existence of
vector dependence
offloadmul.c(7): (col. 5) remark: loop was not vectorized: not inner loop
offloadmul.c(57): (col. 2) remark: *MIC* LOOP WAS VECTORIZED
offloadmul.c(54): (col. 7) remark: *MIC* loop was not vectorized: not inner
loop
offloadmul.c(53): (col. 5) remark: *MIC* loop was not vectorized: not inner
loop
-vec-report2 is deprecated since icc 15.0. Use -qopt-report=n -qopt-report-phase=vec and view the *.optrpt files.
Intel Offload: Example
28.9.2016 Intel Xeon Phi Programming
__attribute__((target(mic))) void mxm( int n, double * restrict a, double * restrict b,
double *restrict c ){
int i,j,k;
for( i = 0; i < n; i++ ) {
...}
}
main(){
...
#pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n))
{
mxm(n,a,b,c);
}
}
Offload Diagnostics
lu65fok@i01r13c06:~/tests> export OFFLOAD_REPORT=2
lu65fok@i01r13c06:~/tests> ./offloadmul
[Offload] [MIC 0] [File] offloadmul.c
[Offload] [MIC 0] [Line] 50
[Offload] [MIC 0] [Tag] Tag 0
[Offload] [HOST] [Tag 0] [CPU Time] 51.927456(seconds)
[Offload] [MIC 0] [Tag 0] [CPU->MIC Data] 24000016 (bytes)
[Offload] [MIC 0] [Tag 0] [MIC Time] 50.835065(seconds)
[Offload] [MIC 0] [Tag 0] [MIC->CPU Data] 8000016 (bytes)
28.9.2016 Intel Xeon Phi Programming
Offload Diagnostics
lu65fok@i01r13c06:~/tests> export H_TRACE=1
lu65fok@i01r13c06:~/tests> ./offloadmul
HOST: Offload function
__offload_entry_offloadmul_c_50mainicc638762473Jnx4JU,
is_empty=0, #varDescs=7, #waits=0, signal=none
HOST: Total pointer data sent to target: [24000000] bytes
HOST: Total copyin data sent to target: [16] bytes
HOST: Total pointer data received from target: [8000000] bytes
MIC0: Total copyin data received from host: [16] bytes
MIC0: Total copyout data sent to host: [16] bytes
HOST: Total copyout data received from target: [16] bytes
lu65fok@i01r13c06:~/tests>
28.9.2016 Intel Xeon Phi Programming
Offload Diagnostics
lu65fok@i01r13c06:~/tests> export H_TIME=1
lu65fok@i01r13c06:~/tests> ./offloadmul
[Offload] [MIC 0] [File] offloadmul.c
[Offload] [MIC 0] [Line] 50
[Offload] [MIC 0] [Tag] Tag 0
[Offload] [HOST] [Tag 0] [CPU Time] 51.920016(seconds)
[Offload] [MIC 0] [Tag 0] [MIC Time] 50.831497(seconds)
**************************************************************
timer data (sec)
**************************************************************
lu65fok@i01r13c06:~/tests>
28.9.2016 Intel Xeon Phi Programming
Environment Variables
● Host environment variables are automatically
forwarded to the coprocessor when offload mode is
used.
● To avoid name collisions:
Set MIC_ENV_PREFIX=MIC on the host
Then only names with prefix MIC_ are forwarded to
the coprocessor, with the prefix stripped
Exception: MIC_LD_LIBRARY_PATH is never passed
to the coprocessor.
Value of LD_LIBRARY_PATH cannot be changed via
forwarding of environment variables.
28.9.2016 Intel Xeon Phi Programming
Environment Variables on the MIC
#include <stdio.h>
#include <stdlib.h>
int main(){
#pragma offload target (mic)
{
char* varmic = getenv("VAR");
if (varmic) {
printf("VAR=%s on MIC.\n", varmic);
} else {
printf("VAR is not defined on MIC.\n");
}
}
char* varhost = getenv("VAR");
if (varhost) {
printf("VAR=%s on host.\n", varhost);
} else {
printf("VAR is not defined on host.\n");
}
}
28.9.2016 Intel Xeon Phi Programming
Environment Variables on the MIC
lu65fok@i01r13c01:~/tests> ./env
VAR is not defined on host.
VAR is not defined on MIC.
lu65fok@i01r13c01:~/tests> export VAR=299792458
lu65fok@i01r13c01:~/tests> ./env
VAR=299792458 on host.
VAR=299792458 on MIC.
lu65fok@i01r13c01:~/tests> export MIC_ENV_PREFIX=MIC
lu65fok@i01r13c01:~/tests> ./env
VAR=299792458 on host.
VAR is not defined on MIC.
lu65fok@i01r13c01:~/tests> export MIC_VAR=3.141592653
lu65fok@i01r13c01:~/tests> ./env
VAR=299792458 on host.
VAR=3.141592653 on MIC.
28.9.2016 Intel Xeon Phi Programming
The Preprocessor Macro __MIC__
● The macro __MIC__ is only defined in code version
for MIC, not in the fallback version for the host
● Allows checking where the code is running.
● Allows writing multiversioned code.
● __MIC__ also defined in native mode.
28.9.2016 Intel Xeon Phi Programming
The Preprocessor Macro __MIC__
#pragma offload target(mic)
{
#ifdef __MIC__
printf("Hello from MIC (offload succeeded).\n");
#else
printf("Hello from host (offload to MIC failed!).\n");
#endif
}
lu65fok@login12:~/tests> icpc -offload=optional offload-mic.c
lu65fok@login12:~/tests> ./a.out
Hello from host (offload to MIC failed!).
lu65fok@i01r13c06:~/tests> ./a.out
Hello from MIC (offload succeeded).
28.9.2016 Intel Xeon Phi Programming
Proxy Console I/O
● stderr and stdout on MIC are buffered and forwarded
(proxied) to the host console.
● Forwarding is done by the coi_daemon running on
the MIC.
● The output buffer should be flushed with fflush(0) from the
stdio library.
● Proxy console input not supported.
● Proxy I/O is enabled by default.
● Can be switched off using MIC_PROXY_IO=0.
28.9.2016 Intel Xeon Phi Programming
Proxy Console I/O
#include <stdio.h>
#include <unistd.h>
__attribute__((target(mic))) extern struct _IO_FILE *stderr;
int main (int argc, char* argv[]){
char hostname[100]; gethostname(hostname,sizeof(hostname));
#pragma offload target(mic)
{
char michostname[100]; gethostname(michostname, sizeof(michostname));
printf("MIC stdout: Hello world from MIC. I am %s and I have %ld logical cores. I was called
from host: %s \n", michostname, sysconf(_SC_NPROCESSORS_ONLN), hostname);
fprintf(stderr,"MIC stderr: Hello world from MIC. I am %s and I have %ld logical cores. I was
called from host: %s \n", michostname, sysconf(_SC_NPROCESSORS_ONLN), hostname);
fflush(0);
}
printf( "Host stdout: Hello world from host. I am %s and I have %ld logical cores.\n", hostname,
sysconf(_SC_NPROCESSORS_ONLN));
fprintf(stderr, "Host stderr: Hello world from host. I am %s and I have %ld logical cores.\n",
hostname, sysconf(_SC_NPROCESSORS_ONLN));
}
28.9.2016 Intel Xeon Phi Programming
Proxy Console I/O
lu65fok@i01r13c01:~/tests> ./proxyio 1>proxyio.out 2>proxyio.err
lu65fok@i01r13c01:~/tests> cat proxyio.out
MIC stdout: Hello world from MIC. I am i01r13c01-mic0 and I have 240 logical cores.
I was called from host: i01r13c01
Host stdout: Hello world from host. I am i01r13c01 and I have 32 logical cores.
lu65fok@i01r13c01:~/tests> cat proxyio.err
MIC stderr: Hello world from MIC. I am i01r13c01-mic0 and I have 240 logical cores. I
was called from host: i01r13c01
Host stderr: Hello world from host. I am i01r13c01 and I have 32 logical cores.
lu65fok@i01r13c01:~/tests>
28.9.2016 Intel Xeon Phi Programming
Proxy Console I/O
lu65fok@i01r13c01:~/tests> export MIC_PROXY_IO=0
lu65fok@i01r13c01:~/tests> ./proxyio 1>proxyio.out 2>proxyio.err
lu65fok@i01r13c01:~/tests> cat proxyio.out
Host stdout: Hello world from host. I am i01r13c01 and I have 32
logical cores.
lu65fok@i01r13c01:~/tests> cat proxyio.err
Host stderr: Hello world from host. I am i01r13c01 and I have 32
logical cores.
lu65fok@i01r13c01:~/tests>
28.9.2016 Intel Xeon Phi Programming
Data Traffic without Computation
● 2 possibilities:
Blank body of #pragma offload, i.e.
#pragma offload target(mic) in (data: length(n))
{}
Use a special pragma offload_transfer, i.e.
#pragma offload_transfer target(mic) in(data:
length(n))
28.9.2016 Intel Xeon Phi Programming
Asynchronous Offload
● Asynchronous data transfer helps to:
overlap computations on the host and the MIC(s),
distribute work to multiple coprocessors,
mask data transfer time.
28.9.2016 Intel Xeon Phi Programming
Asynchronous Offload
● To allow asynchronous data transfer, the specifiers
signal() and wait() can be used, i.e.
#pragma offload_transfer target(mic:0) in(data : length(n))
signal(data)
// work on other data concurrent to data transfer …
#pragma offload target(mic:0) wait(data) \
nocopy(data : length(N)) out(result : length(N))
{
….
result[i]=data[i] + …;
}
28.9.2016 Intel Xeon Phi Programming
Note: the device number must be specified, and any pointer-type variable can serve as a signal.
Asynchronous Offload
● Alternative to the wait() clause, a new pragma can be
used:
#pragma offload_wait target(mic:0) wait(data)
● Useful if no other offload or data transfer is necessary
at the synchronisation point.
28.9.2016 Intel Xeon Phi Programming
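A small end-to-end sketch (size illustrative) combining offload_transfer, signal() and the offload_wait pragma described above; the alloc_if/free_if modifiers (see the Persistent Data slide below) keep the data on the coprocessor between the two pragmas:

#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void) {
    double *data = (double*)malloc(N * sizeof(double));
    double sum = 0.0;
    for (int i = 0; i < N; i++) data[i] = 1.0;

    /* Start the transfer to coprocessor 0 asynchronously; keep the buffer (free_if(0)). */
    #pragma offload_transfer target(mic:0) in(data:length(N) free_if(0)) signal(data)

    /* ... other host work here, overlapping with the transfer ... */

    /* Wait for the transfer to finish. */
    #pragma offload_wait target(mic:0) wait(data)

    /* Compute on the data that is already on the coprocessor (alloc_if(0), no copy-in). */
    #pragma offload target(mic:0) nocopy(data:length(N) alloc_if(0)) inout(sum)
    {
        for (int i = 0; i < N; i++) sum += data[i];
    }

    printf("sum = %f\n", sum);   /* N = 1000000.0 */
    free(data);
    return 0;
}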
Asynchronous Offload to Multiple
Coprocessors
char* offload0;
char* offload1;
#pragma offload target(mic:0) signal(offload0) \
in(data0 : length(N)) out(result0 : length(N))
{
Calculate(data0, result0);
}
#pragma offload target(mic:1) signal(offload1) \
in(data1 : length(N)) out(result1 : length(N))
{
Calculate(data1, result1);
}
#pragma offload_wait target(mic:0) wait(offload0)
#pragma offload_wait target(mic:1) wait(offload1)
28.9.2016 Intel Xeon Phi Programming
Explicit Worksharing
28.9.2016 Intel Xeon Phi Programming
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
{
//section running on the coprocessor
#pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n))
{
mxm(n,a,b,c);
}
}
#pragma omp section
{
//section running on the host
mxm(n,d,e,f);
}
}
}
Persistent Data
● #define ALLOC alloc_if(1)
#define FREE free_if(1)
#define RETAIN free_if(0)
#define REUSE alloc_if(0)
● To allocate data and keep it for the next offload:
#pragma offload target(mic) in (p:length(l) ALLOC RETAIN)
● To reuse the data and still keep it on the coprocessor:
#pragma offload target(mic) in (p:length(l) REUSE RETAIN)
● To reuse the data again and free the memory. (FREE is the
default, and does not need to be explicitly specified):
#pragma offload target(mic) in (p:length(l) REUSE FREE)
28.9.2016 Intel Xeon Phi Programming
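The macros above in a runnable sketch (size illustrative). Here nocopy is used for the reuse steps, so the retained data is not re-transferred:

#include <stdio.h>
#include <stdlib.h>

#define ALLOC  alloc_if(1)
#define FREE   free_if(1)
#define RETAIN free_if(0)
#define REUSE  alloc_if(0)

#define N 1000

int main(void) {
    double *p = (double*)malloc(N * sizeof(double));
    double sum1 = 0.0, sum2 = 0.0;
    for (int i = 0; i < N; i++) p[i] = 1.0;

    /* 1st offload: allocate on the MIC, copy in, keep the buffer. */
    #pragma offload target(mic:0) in(p:length(N) ALLOC RETAIN) inout(sum1)
    { for (int i = 0; i < N; i++) sum1 += p[i]; }

    /* 2nd offload: reuse the buffer already on the MIC, still keep it. */
    #pragma offload target(mic:0) nocopy(p:length(N) REUSE RETAIN) inout(sum2)
    { for (int i = 0; i < N; i++) sum2 += 2.0*p[i]; }

    /* 3rd offload: reuse once more and free (FREE is the default anyway). */
    #pragma offload target(mic:0) nocopy(p:length(N) REUSE FREE) inout(sum1)
    { sum1 += 1.0; }

    printf("sum1 = %f, sum2 = %f\n", sum1, sum2);  /* 1001 and 2000 */
    free(p);
    return 0;
}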
Virtual Shared Classes
● Offload Model only allows offloading of bitwise-
copyable data.
● Sharing complicated structures with pointers or C++
classes is only possible via MYO
28.9.2016 Intel Xeon Phi Programming
MYO
● “Mine Yours Ours” virtual shared memory model.
● Alternative to Offload approach.
● Only available in C++.
● Allows sharing of complex data that is not bitwise-copyable
(like structures with pointer elements, C++ classes)
without data marshalling.
● Allocation of data at the same virtual addresses on
the host and the coprocessor.
● Runtime automatically maintains coherence.
● Syntax based on the keywords __Cilk_shared and
__Cilk_offload.
28.9.2016 Intel Xeon Phi Programming
MYO: Example
#define N 10000
_Cilk_shared int a[N], b[N], c[N];
_Cilk_shared void add() {
for (int i = 0; i < N; i++)
c[i] = a[i] + b[i];
}
int main(int argc, char *argv[]) {
…
_Cilk_offload add(); // Function call on coprocessor:
…
}
28.9.2016 Intel Xeon Phi Programming
MYO Language Extensions
Entity | Syntax | Semantics
Function | int _Cilk_shared f(int x){…} | Executable code for both host and MIC; may be called from either side
Global variable | _Cilk_shared int x = 0 | Visible on both sides
File/Function static | static _Cilk_shared int x | Visible on both sides, only to code within the file/function
Class | class _Cilk_shared x {...} | Class methods, members, and operators are available on both sides
Pointer to shared data | int _Cilk_shared *p | p is local (not shared), can point to shared data
A shared pointer | int *_Cilk_shared p | p is shared, should only point at shared data
Offloading a function call | x = _Cilk_offload func(y) | func executes on MIC if possible
 | x = _Cilk_offload_to(n) func | func must be executed on the specified (n-th) MIC
Offloading asynchronously | _Cilk_spawn _Cilk_offload func(y) | Non-blocking offload
Offload a parallel for-loop | _Cilk_offload _Cilk_for(i=0; i<N; i++) {…} | Loop executes in parallel on MIC
28.9.2016 Intel Xeon Phi Programming
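A short sketch built only from constructs in the table above (array size illustrative); the _Cilk_* keywords require the Intel compiler:

#include <stdio.h>

#define N 1000

/* Shared data: visible at the same virtual address on host and MIC. */
_Cilk_shared double a[N], b[N];
_Cilk_shared double result;

/* Shared function: compiled for both sides, callable from either. */
void _Cilk_shared scale_and_sum(double factor) {
    double s = 0.0;
    for (int i = 0; i < N; i++) {
        b[i] = factor * a[i];
        s += b[i];
    }
    result = s;
}

int main(void) {
    for (int i = 0; i < N; i++) a[i] = 1.0;

    /* Offload the function call to the coprocessor (if available). */
    _Cilk_offload scale_and_sum(2.0);
    printf("result = %f\n", result);            /* 2000 */

    /* Offload a parallel loop directly. */
    _Cilk_offload _Cilk_for (int i = 0; i < N; i++)
        b[i] = b[i] + 1.0;

    printf("b[0] = %f\n", b[0]);                /* 3 */
    return 0;
}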
Important MPI environment variables
● Important Paths are already set by intel module,
otherwise use:
. $ICC_BASE/bin/compilervars.sh intel64
. $MPI_BASE/bin64/mpivars.sh
● Recommended environment on Salomon:
module load intel
export I_MPI_HYDRA_BOOTSTRAP=ssh
export I_MPI_MIC=enable
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0,ofa-v2-mcm-1
export MIC_LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH:/apps/all/impi/5.1.2.150-iccifort-2016.1.150-GCC-4.9.3-2.25/mic/lib/ (path depends on version)
28.9.2016 Intel Xeon Phi Programming
Invocation of the Intel MPI compiler
28.9.2016 Intel Xeon Phi Programming
Language | MPI compiler wrapper | Underlying compiler
C | mpiicc | icc
C++ | mpiicpc | icpc
Fortran | mpiifort | ifort
I_MPI_FABRICS
● The following network fabrics are available for the
Intel Xeon Phi coprocessor:
28.9.2016 Intel Xeon Phi Programming
shm | Shared memory
tcp | TCP/IP-capable network fabrics, such as Ethernet and InfiniBand (through IPoIB)
ofa | OFA-capable network fabrics, including InfiniBand (through OFED verbs)
dapl | DAPL-capable network fabrics, such as InfiniBand, iWarp, Dolphin, and XPMEM (through DAPL)
I_MPI_FABRICS
● The default can be changed by setting the
I_MPI_FABRICS environment variable to
I_MPI_FABRICS=<fabric> or I_MPI_FABRICS=
<intra-node fabric>:<inter-nodes fabric>
● Intranode: Shared Memory, Internode: DAPL
(Default on SuperMIC/MUC)
export I_MPI_FABRICS=shm:dapl
● Intranode: Shared Memory, Internode: TCP
(Can be used in case of Infiniband problems)
export I_MPI_FABRICS=shm:tcp
28.9.2016 Intel Xeon Phi Programming
Sample MPI Program
lu65fok@login12:~/tests> cat testmpi.c
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>
int main (int argc, char* argv[]) {
char hostname[100];
int rank, size;
MPI_Init (&argc, &argv); /* starts MPI */
MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
gethostname(hostname,100);
printf( "Hello world from process %d of %d: host: %s\n", rank, size, hostname);
MPI_Finalize();
return 0;
}
28.9.2016 Intel Xeon Phi Programming
MPI on hosts
● Compile for host using mpiicc / mpiifort:
lu65fok@login12:~/tests> mpiicc testmpi.c -o testmpi-
host
● Run 2 MPI tasks on host node i01r13a01
lu65fok@login12:~/tests> mpiexec -n 2 -host i01r13a01
./testmpi-host
Hello world from process 0 of 2: host: i01r13a01
Hello world from process 1 of 2: host: i01r13a01
28.9.2016 Intel Xeon Phi Programming
MPI in native mode on 1 MIC
● Compile for MIC using mpiicc / mpiifort -mmic:
lu65fok@login12:~/tests> mpiicc -mmic testmpi.c -o testmpi-mic
● Copy binary to MIC:
lu65fok@login12:~/tests> scp testmpi-mic i01r13a01-mic0:
● Launch 2 MPI tasks from MIC node i01r13a01-mic0:
lu65fok@i01r13a04:~/tests> ssh i01r13a01-mic0
[lu65fok@i01r13a01-mic0 ~]$ mpiexec -n 2 ./testmpi-mic
Hello world from process 1 of 2: host: i01r13a01-mic0
Hello world from process 0 of 2: host: i01r13a01-mic0
28.9.2016 Intel Xeon Phi Programming
Do not mix up with mpicc and mpifort!!
lu65fok@login12:~/tests> mpicc -mmic testmpi.c -o testmpi-mic
/usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: skipping incompatible
/lrz/sys/intel/mpi_41_3_048/mic/lib/libmpigf.so when searching for -lmpigf
/usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: skipping incompatible
/lrz/sys/intel/mpi_41_3_048/mic/lib/libmpigf.a when searching for -lmpigf
/usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: cannot find -lmpigf
/usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: skipping incompatible
/lrz/sys/intel/mpi_41_3_048/mic/lib/libmpi.so when searching for -lmpi
/usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: skipping incompatible
/lrz/sys/intel/mpi_41_3_048/mic/lib/libmpi.a when searching for -lmpi
/usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: cannot find -lmpi
/usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: skipping incompatible
/lrz/sys/intel/mpi_41_3_048/mic/lib/libmpigi.a when searching for -lmpigi
/usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: cannot find -lmpigi
collect2: ld returned 1 exit status
28.9.2016 Intel Xeon Phi Programming
MPI on 1 MIC
● Compile for MIC using mpiicc / mpiifort -mmic:
lu65fok@login12:~/tests> mpiicc -mmic testmpi.c -o testmpi-mic
● Copy binary to MIC (not necessary if home is mounted on MICs):
lu65fok@login12:~/tests> scp testmpi-mic i01r13a01-mic0:
● Run 2 MPI tasks on MIC node i01r13a01-mic0 (the full path to the binary is needed!):
lu65fok@i01r13a04:~/tests> mpiexec -n 2 -host i01r13a01-mic0
/home/lu65fok/testmpi-mic
Hello world from process 1 of 2: host: i01r13a01-mic0
Hello world from process 0 of 2: host: i01r13a01-mic0
28.9.2016 Intel Xeon Phi Programming
MPI on 2 MICs
● Compile for MIC using mpiicc / mpiifort -mmic:
lu65fok@login12:~/tests> mpiicc -mmic testmpi.c -o testmpi-mic
● Copy binary to MICs (not necessary if home is mounted on MICs):
lu65fok@login12:~/tests> scp testmpi-mic i01r13a01-mic0:
lu65fok@login12:~/tests> scp testmpi-mic i01r13a01-mic1:
● Run 2 MPI tasks, one on each MIC of node i01r13a01:
lu65fok@login12:~/tests> mpirun -n 2 -perhost 1 -host
i01r13a01-mic0,i01r13a01-mic1 /home/lu65fok/testmpi-mic
Hello world from process 1 of 2: host: i01r13a01-mic1
Hello world from process 0 of 2: host: i01r13a01-mic0
28.9.2016 Intel Xeon Phi Programming
MPI on Host and 2 MICs attached to the
host
lu65fok@login12:~/tests> mpirun -n 1 -host i01r13a01 ./testmpi-host : -n 1 -
host i01r13a01-mic0 /home/lu65fok/testmpi-mic : -n 1 -host i01r13a01-mic1
/home/lu65fok/testmpi-mic
Hello world from process 0 of 3: host: i01r13a01
Hello world from process 2 of 3: host: i01r13a01-mic1
Hello world from process 1 of 3: host: i01r13a01-mic0
28.9.2016 Intel Xeon Phi Programming
MPI on multiple Hosts & MICs
28.9.2016 Intel Xeon Phi Programming
lu65fok@i01r13a01:~/tests> mpirun -n 1 -host i01r13a01 ./testmpi-host : -n 1 -host
i01r13a01-mic0 /home/lu65fok/testmpi-mic : -n 1 -host i01r13a01-mic1
/home/lu65fok/testmpi-mic : -n 1 -host i01r13a02 ./testmpi-host : -n 1 -host
i01r13a02-mic0 /home/lu65fok/testmpi-mic : -n 1 -host i01r13a02-mic1
/home/lu65fok/testmpi-mic
Hello world from process 3 of 6: host: i01r13a02
Hello world from process 0 of 6: host: i01r13a01
Hello world from process 2 of 6: host: i01r13a01-mic1
Hello world from process 5 of 6: host: i01r13a02-mic1
Hello world from process 1 of 6: host: i01r13a01-mic0
Hello world from process 4 of 6: host: i01r13a02-mic0
MPI Machine File
lu65fok@login12:~/tests> cat machinefile.txt
i01r13a01-mic0
i01r13a01-mic1
i01r13a02-mic0
i01r13a02-mic1
lu65fok@login12:~/tests> mpirun -n 4 -machinefile machinefile.txt
/home/lu65fok/testmpi-mic
Hello world from process 3 of 4: host: i01r13a02-mic1
Hello world from process 2 of 4: host: i01r13a02-mic0
Hello world from process 1 of 4: host: i01r13a01-mic1
Hello world from process 0 of 4: host: i01r13a01-mic0
28.9.2016 Intel Xeon Phi Programming
MPI Machine File
lu65fok@login12:~/tests> cat machinefile.txt
i01r13a01-mic0:2
i01r13a01-mic1
i01r13a02-mic0
i01r13a02-mic1
lu65fok@login12:~/tests> mpirun -n 4 -machinefile machinefile.txt
/home/lu65fok/testmpi-mic
Hello world from process 3 of 4: host: i01r13a02-mic0
Hello world from process 0 of 4: host: i01r13a01-mic0
Hello world from process 2 of 4: host: i01r13a01-mic1
Hello world from process 1 of 4: host: i01r13a01-mic0
28.9.2016 Intel Xeon Phi Programming
Offload from MPI Tasks
#include <unistd.h>
#include <stdio.h>
#include <mpi.h>
int main (int argc, char* argv[]) {
char hostname[100];
int rank, size;
MPI_Init (&argc, &argv); /* starts MPI */
MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
gethostname(hostname,100);
#pragma offload target(mic)
{
char michostname[50];
gethostname(michostname, 50);
printf("MIC: I am %s and I have %ld logical cores. I was called by process %d of %d: host: %s \n", michostname,
sysconf(_SC_NPROCESSORS_ONLN), rank, size, hostname);
}
printf( "Hello world from process %d of %d: host: %s\n", rank, size, hostname);
MPI_Finalize();
return 0;
}
28.9.2016 Intel Xeon Phi Programming
Offload from MPI Tasks using 1 host
lu65fok@login12:~/tests> mpiicc testmpioffload.c -o testmpioffload
lu65fok@login12:~/tests> mpirun -n 4 -host i01r13a01 ./testmpioffload
Hello world from process 3 of 4: host: i01r13a01
Hello world from process 1 of 4: host: i01r13a01
Hello world from process 0 of 4: host: i01r13a01
Hello world from process 2 of 4: host: i01r13a01
MIC: I am i01r13a01-mic0 and I have 240 logical cores. I was called by
process 3 of 4: host: i01r13a01
MIC: I am i01r13a01-mic0 and I have 240 logical cores. I was called by
process 0 of 4: host: i01r13a01
MIC: I am i01r13a01-mic0 and I have 240 logical cores. I was called by
process 1 of 4: host: i01r13a01
MIC: I am i01r13a01-mic0 and I have 240 logical cores. I was called by
process 2 of 4: host: i01r13a01
28.9.2016 Intel Xeon Phi Programming
Offload from MPI Tasks using multiple
hosts
lu65fok@login12:~/tests> mpirun -n 4 -perhost 2 -host
i01r13a01,i01r13a02 ./testmpioffload
Hello world from process 2 of 4: host: i01r13a02
Hello world from process 0 of 4: host: i01r13a01
Hello world from process 3 of 4: host: i01r13a02
Hello world from process 1 of 4: host: i01r13a01
MIC: I am i01r13a02-mic0 and I have 240 logical cores. I was called by
process 2 of 4: host: i01r13a02
MIC: I am i01r13a01-mic0 and I have 240 logical cores. I was called by
process 1 of 4: host: i01r13a01
MIC: I am i01r13a01-mic0 and I have 240 logical cores. I was called by
process 0 of 4: host: i01r13a01
MIC: I am i01r13a02-mic0 and I have 240 logical cores. I was called by
process 3 of 4: host: i01r13a02
28.9.2016 Intel Xeon Phi Programming
Offload from MPI Tasks: Using both MICs
#pragma offload target(mic:rank%2)
{
    char michostname[50];
    gethostname(michostname, sizeof(michostname));
    printf("MIC: I am %s and I have %ld logical cores. I was called by process %d of %d: host: %s \n",
           michostname, sysconf(_SC_NPROCESSORS_ONLN), rank, size, hostname);
}
28.9.2016 Intel Xeon Phi Programming
Offload from MPI Tasks: Using both MICs
lu65fok@login12:~/tests> mpirun -n 4 -perhost 2 -host
i01r13a01,i01r13a02 ./testmpioffload
Hello world from process 0 of 4: host: i01r13a01
Hello world from process 2 of 4: host: i01r13a02
Hello world from process 3 of 4: host: i01r13a02
Hello world from process 1 of 4: host: i01r13a01
MIC: I am i01r13a02-mic1 and I have 240 logical cores. I was called
by process 3 of 4: host: i01r13a02
MIC: I am i01r13a01-mic1 and I have 240 logical cores. I was called
by process 1 of 4: host: i01r13a01
MIC: I am i01r13a01-mic0 and I have 240 logical cores. I was called
by process 0 of 4: host: i01r13a01
MIC: I am i01r13a02-mic0 and I have 240 logical cores. I was called
by process 2 of 4: host: i01r13a02
28.9.2016 Intel Xeon Phi Programming
Intel MKL
● Math library for C and Fortran
● Includes
BLAS
LAPACK
ScaLAPACK
FFTW
…
● Contains optimised routines
For Intel CPUs and MIC architecture
● All MKL functions are supported on Xeon Phi
But optimised at different levels
28.9.2016 Intel Xeon Phi Programming
MKL Usage In Accelerator Mode
● Compiler Assisted Offload
Offloading is explicitly controlled by compiler pragmas or
directives.
All MKL functions can be inserted inside an offload region to run on
the Xeon Phi (in contrast, only a subset of MKL is subject to AO).
More flexibility in data transfer and remote execution
management.
● Automatic Offload Mode
MKL functions are automatically offloaded to the accelerator.
MKL decides:
when to offload
work division between host and targets
Data is managed automatically
● Native Execution
MKL functions are executed natively on the accelerator.
28.9.2016 Intel Xeon Phi Programming
Compiler Assisted Offload
MKL functions are offloaded in the same way as any other
offloaded function.
An example in C:
#pragma offload target(mic) \
in(transa, transb, N, alpha, beta) \
in(A:length(matrix_elements)) \
in(B:length(matrix_elements)) \
in(C:length(matrix_elements)) \
out(C:length(matrix_elements) alloc_if(0))
{
sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N,&beta, C, &N);
}
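For completeness, a hedged sketch of a full program around such an offload
region (the file name sgemm_cao.c, the matrix size and the initial values are
assumptions, not taken from the slides); it can be built on the host with,
e.g., icc -O3 -mkl sgemm_cao.c -o sgemm_cao:

/* sgemm_cao.c - sketch of SGEMM via Compiler Assisted Offload */
#include <stdio.h>
#include <mkl.h>
int main(void) {
    MKL_INT N = 2048;
    size_t matrix_elements = (size_t)N * N;
    size_t i;
    char transa = 'N', transb = 'N';
    float alpha = 1.0f, beta = 0.0f;
    /* 64-byte aligned buffers (see the data alignment slides) */
    float *A = (float*)mkl_malloc(matrix_elements * sizeof(float), 64);
    float *B = (float*)mkl_malloc(matrix_elements * sizeof(float), 64);
    float *C = (float*)mkl_malloc(matrix_elements * sizeof(float), 64);
    for (i = 0; i < matrix_elements; i++) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.0f; }
    /* offload the MKL call; data transfers are spelled out explicitly */
    #pragma offload target(mic) \
        in(transa, transb, N, alpha, beta) \
        in(A:length(matrix_elements)) \
        in(B:length(matrix_elements)) \
        inout(C:length(matrix_elements))
    {
        sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
    }
    printf("C[0] = %f (expected %f)\n", C[0], 2.0f * N);
    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}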
28.9.2016 Intel Xeon Phi Programming
How to use CAO
An example in Fortran:
!DEC$ ATTRIBUTES OFFLOAD : TARGET( MIC ) :: SGEMM
!DEC$ OMP OFFLOAD TARGET( MIC ) &
!DEC$ IN( TRANSA, TRANSB, M, N, K, ALPHA, BETA, LDA, LDB, LDC ), &
!DEC$ IN( A: LENGTH( NCOLA * LDA )), &
!DEC$ IN( B: LENGTH( NCOLB * LDB )), &
!DEC$ INOUT( C: LENGTH( N * LDC ))
!$OMP PARALLEL SECTIONS
!$OMP SECTION
CALL SGEMM( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )
!$OMP END PARALLEL SECTIONS
28.9.2016 Intel Xeon Phi Programming
Automatic Offload
- With Automatic Offload the user does not have to change the code at all;
it is enabled simply by setting MKL_MIC_ENABLE=1 (see the minimal sketch at the end of this slide)
- The runtime may automatically transfer data to the Xeon Phi coprocessor
and execute (all or part of) the computations there, transparently for the
user
In Intel MKL 11.0.2 the following functions are enabled for automatic offload:
Level-3 BLAS functions
*GEMM (for m,n > 2048, k > 256)
*TRSM (for M,N > 3072)
*TRMM (for M,N > 3072)
*SYMM (for M,N > 2048)
LAPACK functions
LU (M,N > 8192)
QR
Cholesky
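A minimal sketch (not from the slides) of running an unchanged host binary
with Automatic Offload enabled; the binary name sgemm.exe is only an example:

# enable Automatic Offload for this run
export MKL_MIC_ENABLE=1
# optionally restrict AO to selected coprocessors (cf. the environment settings slide)
export OFFLOAD_DEVICES=0,1
./sgemm.exe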
28.9.2016 Intel Xeon Phi Programming
Automatic Offload
BLAS only: Work can be divided between host and device using
mkl_mic_set_workdivision(TARGET_TYPE, TARGET_NUMBER, WORK_RATIO)
Users can use AO for some MKL calls and use CAO for others in
the same program
- Only supported by Intel compilers
- Work division must be set explicitly for AO, otherwise, all MKL
AO calls are executed on the host
28.9.2016 Intel Xeon Phi Programming
Automatic Offload Mode Example
#include "mkl.h"
err = mkl_mic_enable();
// Offload all work to the Xeon Phi (no work left on the host)
err = mkl_mic_set_workdivision(MKL_TARGET_HOST, MIC_HOST_DEVICE, 0.0);
// Let MKL decide the amount of work to offload to coprocessor 0
err = mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, MIC_AUTO_WORKDIVISION);
// Offload 50% of the work to coprocessor 0
err = mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5);
// Get the amount of work on coprocessor 0
err = mkl_mic_get_workdivision(MKL_TARGET_MIC, 0, &wd);
28.9.2016 Intel Xeon Phi Programming
Tips for Using Automatic Offload
● AO works only when matrix sizes are right
● SGEMM: Offloading only when M, N > 2048
● Square matrices give much better performance
● These settings may produce better results for SGEMM
calculations on a 60-core coprocessor:
export MIC_ENV_PREFIX=MIC
export MIC_USE_2MB_BUFFERS=16K
export MIC_OMP_NUM_THREADS=240
export MIC_KMP_AFFINITY=compact,granularity=fine
export MIC_KMP_PLACE_THREADS=60C,4t
● Work division settings are just hints to the MKL runtime
● Threading control tips:
Prevent thread migration on the host using:
export KMP_AFFINITY=granularity=fine,compact,1,0
28.9.2016 Intel Xeon Phi Programming
Native Execution
● In order to use Intel MKL in a native application, an
additional argument -mkl is required with the compiler
option -mmic.
● Native applications with Intel MKL functions operate
just like native applications with user-defined
functions.
● $ icc -O3 -mmic -mkl sgemm.c -o sgemm.exe
28.9.2016 Intel Xeon Phi Programming
Compile to use Intel MKL
● Compile using the -mkl flag
-mkl=parallel (default) for parallel execution
-mkl=sequential for sequential execution
● AO: built the same way as code for the Xeon host:
user@host $ icc -O3 -mkl sgemm.c -o sgemm.exe
● Native: additionally use -mmic
user@host $ icc -mmic -mkl myProgram.c -o myExec.mic
● MKL can also be used in native mode if compiled with -mmic
28.9.2016 Intel Xeon Phi Programming
More Code Examples
● $MKLROOT/examples/examples_mic.tgz
sgemm SGEMM example
sgemm_f SGEMM example (Fortran 90)
fft complex-to-complex 1D FFT
solverc Pardiso examples
sgaussian single precision Gaussian RNG
dgaussian double precision Gaussian RNG
...
28.9.2016 Intel Xeon Phi Programming
Which Model to Choose
● Native execution for
highly parallel code
using coprocessors as independent compute nodes
● AO if
a sufficiently high FLOP/Byte ratio (arithmetic intensity) makes offload beneficial
using Level-3 BLAS functions: GEMM, TRMM, TRSM
● CAO if
there is enough computation to offset the data transfer
overhead
transferred data can be reused by multiple operations
https://software.intel.com/en-us/articles/recommendations-to-
choose-the-right-mkl-usage-model-for-xeon-phi
28.9.2016 Intel Xeon Phi Programming
Memory Allocation: Data Alignment
Compiler-assisted offload
- Memory alignment is inherited from host!
General memory alignment (SIMD vectorisation)
- Align buffers (leading dimension) to a multiple of
vector width (64 Byte)
- mkl_malloc, _mm_malloc (_aligned_malloc),
tbb::scalable_aligned_malloc, …
28.9.2016 Intel Xeon Phi Programming
Memory Allocation: Data Alignment
void *darray;
int workspace;
int alignment = 64;
...
darray = mkl_malloc(sizeof(double) * workspace, alignment);
...
mkl_free(darray);
28.9.2016 Intel Xeon Phi Programming
Memory Allocation: Page Size
Performance of many Intel MKL routines improves when input and
output data reside in memory allocated with 2 MB pages:
- address more memory with fewer pages
- reduce the overhead of translating between host and MIC address
spaces
# Allocate all pointer-based variables with run-time
# length > 64 KB in 2 MB pages:
$ export MIC_USE_2MB_BUFFERS=64K
28.9.2016 Intel Xeon Phi Programming
Environment Settings
Native:
KMP_AFFINITY=balanced
OMP_NUM_THREADS=244
Compiler-Assisted Offload:
MIC_ENV_PREFIX=MIC
MIC_KMP_AFFINITY=balanced
MIC_OMP_NUM_THREADS=240
MIC_USE_2MB_BUFFERS=64K
Automatic Offload:
MKL_MIC_ENABLE=1
OFFLOAD_DEVICES=<list>
MKL_MIC_MAX_MEMORY=2GB
MIC_ENV_PREFIX=MIC
MIC_OMP_NUM_THREADS=240
MIC_KMP_AFFINITY=balanced
+ Compiler-Assisted Offload:
OFFLOAD_ENABLE_ORSL=1
https://software.intel.com/en-us/articles/performance-tips-of-using-intel-mkl-on-intel-xeon-phi-coprocessor
28.9.2016 Intel Xeon Phi Programming
Environment Settings: Affinity
KMP_AFFINITY=
- Host: e.g., compact,1
- Coprocessor: balanced
MIC_ENV_PREFIX=MIC; MIC_KMP_AFFINITY=
- Coprocessor (CAO): balanced
KMP_PLACE_THREADS
- Note: does not replace KMP_AFFINITY
- Helps to set/achieve pinning on, e.g., 60 cores with 3 threads
each (see the sketch below)
kmp_* (or mkl_*) functions take precedence over corresponding env.
variables
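A sketch combining these settings for the coprocessor side of a CAO/AO run
(60 cores with 3 threads each; the values are only examples):

export MIC_ENV_PREFIX=MIC
export MIC_KMP_PLACE_THREADS=60c,3t
export MIC_KMP_AFFINITY=balanced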
28.9.2016 Intel Xeon Phi Programming
More MKL Documentation
● https://software.intel.com/en-us/node/528430
● https://www.nersc.gov/assets/MKL_for_MIC.pdf
● https://software.intel.com/en-us/articles/intel-mkl-on-
the-intel-xeon-phi-coprocessors
● Intel Many Integrated Core Community website:
https://software.intel.com/en-us/mic-developer
● Intel MKL forum
https://software.intel.com/en-us/forums/intel-math-
kernel-library
28.9.2016 Intel Xeon Phi Programming
Xeon Phi Hardware (Recap.)
● 60 in-order cores, ring interconnect
● Scalar unit based on Intel Pentium P54C
– 64-bit addressing mode
● Vector unit added
– 512 bit (64 byte) vector instructions (IMCI)
● 4 hardware threads per core
– Each thread issues instructions in turn
– Round-robin execution hides latency
● Conclusion: need to fully utilise the vector units to achieve performance close to peak
28.9.2016 Intel Xeon Phi Programming
Performance
28.9.2016 Intel Xeon Phi Programming
● Sandy-Bridge-EP: 2 sockets ×8 cores @ 2.7 GHz.
● Xeon Phi: 60 cores @1.0 GHz.
● # cycles/s:
SandyBridge: 4.3E10 cycles/s.
Xeon Phi: 6.0E10 cycles /s.
● DP FLOP/s:
SandyBridge: 2 sockets × 8 cores × 2.7 GHz × 4
(SIMD) × 2 (ALUs) = 345.6 GFLOP/s
Xeon Phi: 60 cores × 1 GHz × 8 (SIMD) × 2 (FMA) =
960 GFLOP/s → Factor 2.7
The Intel MIC Architecture: VPU
Vector Processing Unit (VPU)
● The VPU includes the EMU (Extended Math Unit)
and executes 16 single-precision floating point, 16
32-bit integer operations or 8 double-precision
floating point operations per cycle. Each operation
can be a FMA, giving 32 single-precision or 16
double-precision floating-point operations per cycle.
● Contains the vector register file: 32 512-bit wide
registers per thread context, each register can hold
16 singles or 8 doubles.
● Most vector instructions have a 4-clock latency with a
1 clock throughput.
28.9.2016 Intel Xeon Phi Programming
SIMD Fused Multiply Add (FMA)
28.9.2016 Intel Xeon Phi Programming
vfmadd213ps source1, source2, source3
(the 213 form computes source1 = source1 × source2 + source3, element-wise on packed single-precision values)
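As an illustration (not from the slides), a simple multiply-add loop like the
following is typically compiled into packed vfmadd* instructions on the MIC:

/* each iteration is one multiply-add; the Intel compiler vectorises this
   into FMA vector instructions */
void saxpy(float a, const float *x, float *y, int n)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}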
Vectorisation
Vectorisation: Most important to get performance on Xeon Phi
● Use Intel options -vec-report, -vec-report2, -vec-report3 to
show information about vectorisation
● Use Intel option -guide-vec to get tips on improvement.
● Prefer SoA over AoS (AoS is good for encapsulation but bad
for vector processing) - see the sketch below
● Help the compiler with Intel Pragmas
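A minimal sketch of the two layouts (illustrative only):

#define N 1024
/* Array of Structures (AoS): the x, y, z of one point are adjacent, so a loop
   over all x values is a strided access - convenient, but bad for SIMD */
struct PointAoS { float x, y, z; };
struct PointAoS points_aos[N];
/* Structure of Arrays (SoA): all x values are contiguous, so loops over x
   vectorise with unit stride */
struct PointsSoA { float x[N], y[N], z[N]; };
struct PointsSoA points_soa;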
28.9.2016 Intel Xeon Phi Programming
Vectorisation
● Good news:
Compiler can vectorise code for you in many cases
● Bad news:
That sometimes doesn’t work perfectly and the
compiler may need your assistance
Other option: explicit vector programming
28.9.2016 Intel Xeon Phi Programming
Vectorisation: Approaches
● Let the compiler do the job: auto-vectorisation
Compiler might need your help: annotations/pragmas
● Explicit vector programming:
Use vector classes or array notation
Not fully general: restricted to C++ and Cilk Plus
● Intrinsics / assembly
Full control but low level
● Also important for successful vectorisation
Data alignment
Prefetching
28.9.2016 Intel Xeon Phi Programming
Auto-vectorisation (Intel compiler)
● The vectoriser for MIC works just like for the host
Enabled by default at optimisation level -O2 and above
Data alignment should be 64 bytes instead of 16
More loops can be vectorised, because of masked vector
instructions, gather/scatter and fused multiply-add (FMA)
Try to avoid 64 bit integers (except as addresses)
● Vectorised loops may be recognised by
Vectorisation and optimisation reports (recommended)
-qopt-report=2 -qopt-report-phase=vec
Unmasked vector instructions
Gather and scatter instructions
Math library calls to libsvml
28.9.2016 Intel Xeon Phi Programming
Vectorisation Report
By default, both host and target compilations may generate
messages for the same loop, e.g.:
host:~/> icc -qopt-report=2 -qopt-report-phase=vec test_vec.c
test_vec.c(10): (col. 1) remark: LOOP WAS VECTORIZED.
test_vec.c(10): (col. 1) remark: *MIC* LOOP WAS VECTORIZED.
To get a vectorisation report for the offload target compilation, but not for
the host compilation:
host:~/> icc -qopt-report=2 -qopt-report-phase=vec -qoffload-option,mic,compiler,"-qopt-report=2" test_vec.c
test_vec.c(10): (col. 1) remark: *MIC* LOOP WAS VECTORIZED.
test_vec.c(20): (col. 1) remark: *MIC* loop was not vectorized: existence of vector dependence.
test_vec.c(20): (col. 1) remark: *MIC* PARTIAL LOOP WAS VECTORIZED.
28.9.2016 Intel Xeon Phi Programming
Common Compiler Messages
● “Loop was not vectorized” because
“Low trip count”
“Existence of vector dependence”
Possible dependence of one loop iteration on another, e.g.:
for (j=n; j<MAX; j++) {
    a[j] = a[j] + c * a[j-n];
}
“vectorization possible but seems inefficient”
“Not inner loop”
● It may be possible to overcome these using switches, pragmas, or
source code changes
28.9.2016 Intel Xeon Phi Programming
Intel-Specific Vectorisation Pragmas
● #pragma ivdep: Instructs the compiler to ignore
assumed vector dependencies.
● #pragma loop_count: Specifies the iterations for the
for loop.
● #pragma novector: Specifies that the loop should
never be vectorized.
● #pragma omp simd: Transforms the loop into a loop
that will be executed concurrently using Single
Instruction Multiple Data (SIMD) instructions.
(OpenMP 4.0)
28.9.2016 Intel Xeon Phi Programming
#pragma vector
28.9.2016 Intel Xeon Phi Programming
always: instructs the compiler to override any efficiency heuristic during the decision to vectorize or
not, and to vectorize non-unit strides or very unaligned memory accesses; controls the
vectorization of the subsequent loop in the program; optionally takes the keyword assert
aligned: instructs the compiler to use aligned data movement instructions for all array references
when vectorizing
unaligned: instructs the compiler to use unaligned data movement instructions for all array references
when vectorizing
nontemporal: directs the compiler to use non-temporal (that is, streaming) stores on systems based on
all supported architectures, unless otherwise specified; optionally takes a comma-separated
list of variables.
On systems based on Intel® MIC Architecture, directs the compiler to generate clevict
(cache-line-evict) instructions after the stores based on the non-temporal pragma when the
compiler knows that the store addresses are aligned; optionally takes a comma-separated
list of variables
temporal: directs the compiler to use temporal (that is, non-streaming) stores on systems based on
all supported architectures, unless otherwise specified
vecremainder: instructs the compiler to vectorize the remainder loop when the original loop is vectorized
novecremainder: instructs the compiler not to vectorize the remainder loop when the original loop is
vectorized
Example for Vectorisation Pragmas
#pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n))
{
    #pragma omp parallel for
    for( i = 0; i < n; i++ ) {
        for( k = 0; k < n; k++ ) {
            #pragma vector aligned
            #pragma ivdep
            for( j = 0; j < n; j++ ) {
                // c[i][j] = c[i][j] + a[i][k]*b[k][j];
                c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];
            }
        }
    }
}
28.9.2016 Intel Xeon Phi Programming
#pragma simd
● The simd pragma is used to guide the compiler to
vectorize more loops. Vectorization using the simd
pragma complements (but does not replace) the fully
automatic approach.
● Without explicit vectorlength() and vectorlengthfor()
clauses, the compiler will choose a vector length using its
own cost model. Misclassification of variables into
private, firstprivate, lastprivate, linear, and reduction,
or lack of appropriate classification of variables, may
lead to unintended consequences such as runtime
failures and/or incorrect results.
28.9.2016 Intel Xeon Phi Programming
#pragma simd
void add_floats(float *a, float *b, float *c, float *d, float *e, int n)
{
int i;
#pragma simd
for (i=0; i<n; i++){
a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
}
}
28.9.2016 Intel Xeon Phi Programming
Note: the function uses too many unknown pointers for the compiler's
automatic runtime independence check optimization to kick in, hence
the explicit #pragma simd.
IMCI Instruction Set
● IMCI: Initial Many-Core instruction set
IMCI is not SSE/AVX!
28.9.2016 Intel Xeon Phi Programming
SSE2 intrinsics:
for (int i=0; i<n; i+=4) {
    __m128 Avec=_mm_load_ps(A+i);
    __m128 Bvec=_mm_load_ps(B+i);
    Avec=_mm_add_ps(Avec, Bvec);
    _mm_store_ps(A+i, Avec);
}
IMCI intrinsics:
for (int i=0; i<n; i+=16) {
    __m512 Avec=_mm512_load_ps(A+i);
    __m512 Bvec=_mm512_load_ps(B+i);
    Avec=_mm512_add_ps(Avec, Bvec);
    _mm512_store_ps(A+i, Avec);
}
Features of the IMCI Instruction Set
● Fused Multiply-Add (FMA) instruction support.
● Gather and Scatter instructions:
copy non-contiguous data from MEM to SIMD registers
(gather) or from SIMD registers to MEM (scatter).
● Swizzle and Permute instructions:
swizzle: rearranges elements within each 128-bit block
permute: rearranges the 128-bit blocks according to
patterns specified by the user.
● Bitmasked operations: control which of the elements
in the resulting vector are modified / preserved (see the sketch below).
● Reduction (sum, product), min/max operations.
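As an illustration (not from the slides), a bitmasked add with intrinsics;
lanes whose mask bit is 0 keep the value of the pass-through operand:

#include <immintrin.h>
/* add only the lower 8 of the 16 float lanes; the upper 8 lanes of the
   result are taken unchanged from 'a' */
__m512 masked_add(__m512 a, __m512 b)
{
    __mmask16 m = 0x00FF;
    return _mm512_mask_add_ps(a, m, a, b);
}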
28.9.2016 Intel Xeon Phi Programming
Thread Affinity
● Pinning Threads is important!
● export KMP_AFFINITY="granularity=thread,x"
x=compact, scatter, balanced
● See Intel compiler Documentation.
28.9.2016 Intel Xeon Phi Programming
KMP_AFFINITY=granularity=thread,scatter
KMP_AFFINITY=granularity=thread,compact
Data Alignment
● Prerequisite for successful use of the SIMD units.
● A pointer p is said to address a memory location
aligned on an n-byte boundary if ((size_t)p%n==0).
● The memory address should be a multiple of the vector
register width in bytes, i.e.
SSE2: 16-Byte alignment
AVX: 32-Byte alignment
MIC: 64-Byte alignment
28.9.2016 Intel Xeon Phi Programming
Data Alignment
● Data alignment on the stack:
__declspec(align(64)) double data[N]; (ICC)
double data[N] __attribute__((aligned(64))); (ICC, GCC)
● Data alignment on the heap (ICC)
_mm_malloc/free functions (ICC)
#include <malloc.h>
double*A = (double*)_mm_malloc(N*sizeof(double), 64);
_mm_free(A);
posix_memalign (ICC, GCC)
ok=posix_memalign((void**)&a, 64, n*n*sizeof(double));
28.9.2016 Intel Xeon Phi Programming
Memory Bandwidth
● Sandy-Bridge:
2 sockets× 4 memory channels × 6.4 GT/s × 2
bytes per channel = 102.4 GB/s
● Xeon Phi:
8 memory controllers × 2 channels/controller × 6.0
GT/s × 4 bytes per channel = 384 GB/s → Factor 3.8
● For complicated memory access patterns, memory
latency / cache performance is important. The Xeon Phi
caches are less powerful than the Xeon caches (e.g. no L1
hardware prefetcher).
28.9.2016 Intel Xeon Phi Programming
Memory / Host Access
● When is Xeon Phi expected to deliver better
performance than the host:
1. Bandwidth-bound code: If memory access patterns
are streamlined so that application is limited by
memory bandwidth and not memory-latency bound.
2. Compute-bound code: high arithmetic intensity (#
operations per byte of memory transferred). For example, a
DAXPY update y[i] = a*x[i] + y[i] performs only 2 FLOPs per
24 bytes moved (≈ 0.08 FLOP/Byte) and is therefore
bandwidth-bound, whereas large GEMMs reuse data and reach a
much higher ratio.
3. Code should not be dominated by Host <-> MIC
communication limited by the slow PCIe v2 bandwidth of
6 GB/s.
28.9.2016 Intel Xeon Phi Programming
Prefetching
● Hardware Prefetching:
Intel Xeon processors: L1 and L2 hardware
prefetchers
Intel Xeon Phi: only L2 hardware prefetcher
● Software Prefetching:
Instructions that request that a cache line is fetched
from memory into the cache (see the sketch below).
Does not stall execution.
Prefetch distance = time between the prefetch
instruction and the instruction using the data
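An illustrative sketch (not from the slides) of compiler-directed software
prefetching with the Intel-specific prefetch pragma; the distance of 16
iterations ahead is just an example value:

/* request prefetches of a[] into the L1 cache (hint 0), 16 iterations ahead */
void scale(float *a, float s, float c, int n)
{
    int i;
    #pragma prefetch a:0:16
    for (i = 0; i < n; i++)
        a[i] = a[i] * s + c;
}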
28.9.2016 Intel Xeon Phi Programming
Prefetching
● If software prefetches are doing a good job, then
hardware prefetching does not kick in.
● In several workloads (such as stream), maximal
software prefetching gives the best performance.
● Any references not prefetched by compiler may get
prefetched by hardware.
● Details: Rakesh Krishnaiyer,
http://software.intel.com/sites/default/files/article/326703/5.3-
prefetching-on-mic-update.pdf
28.9.2016 Intel Xeon Phi Programming
Summary
● Concerning ease of use and programmability, Intel Xeon Phi is a
promising hardware architecture compared to other accelerators like
GPGPUs, FPGAs, the former CELL processors or ClearSpeed cards.
● Codes using MPI, OpenMP or MKL etc. can be ported quickly. Some
MKL routines have been highly optimised for the MIC.
● Due to the large SIMD width of 64 Bytes, vectorisation is even more
important for the MIC architecture than for Intel Xeon based systems.
● It is extremely simple to get a code running on Intel Xeon Phi, but
getting performance out of the chip in most cases requires manual tuning
of the code, e.g. when auto-vectorisation fails.
● MIC programming forces the programmer to think about SIMD
vectorisation
→ Performance on current and future Xeon based systems is also much
better with MIC-optimised code.
28.9.2016 Intel Xeon Phi Programming
Xeon Phi References
● Books:
James Reinders, James Jeffers, Intel Xeon Phi Coprocessor High
Performance Programming, Morgan Kaufmann Publishers Inc., 2013,
http://lotsofcores.com ; new KNL edition in July 2016
Rezaur Rahman: Intel Xeon Phi Coprocessor Architecture and
Tools: The Guide for Application Developers, Apress, 2013.
Parallel Programming and Optimization with Intel Xeon Phi
Coprocessors, Colfax 2013,
http://www.colfaxintl.com/nd/xeonphi/book.aspx
● Intel Xeon Phi Programming, Training material, CAPS
● Intel Training Material and Webinars
● V. Weinberg (Editor) et al., Best Practice Guide - Intel Xeon Phi,
http://www.prace-project.eu/Best-Practice-Guide-Intel-Xeon-Phi-
HTML and references therein
28.9.2016 Intel Xeon Phi Programming
Acknowledgements
● IT4Innovations, Ostrava
● Partnership for Advanced Computing in Europe (PRACE)
● Intel
● BMBF (Federal Ministry of Education and Research)
● Dr. Karl Fürlinger (LMU)
28.9.2016 Intel Xeon Phi Programming