Intel Xeon Phi Programming Dr. Volker Weinberg (LRZ) with material from Dr. M. Allalen (LRZ) & Dr. K. Fürlinger (LMU)
PRACE Autumn School 2016, September 27-30, 2016, Hagenberg
Agenda Intel Xeon Phi Programming
28.9.2016 Intel Xeon Phi Programming
● 10:15-11:45 Introduction: Intel Xeon Phi @ LRZ & EU,
Architecture, Programming Models, Native Mode
● 11:45-12:00 Coffee Break
● 12:00-13:00 Offload Mode I
● 13:00-14:15 Lunch Break
● 14:15-15:00 Offload Mode II
● 15:00-15:45 MPI
● 15:45-16:00 Coffee Break
● 16:00-17:00 Intel MKL Library
● 17:00-17:30 Optimisation and Vectorisation
Intel Xeon Phi and GPU Training @ LRZ
Intel Xeon Phi Programming
28.-30.4.2014 @ LRZ (PATC): KNC+GPU
27.-29.4.2015 @ LRZ (PATC): KNC+GPU
3.-4.2.2016 @ IT4Innovations: KNC
27.-29.6.2016 @ LRZ (PATC): KNC+KNL
Sept. 2016 @ PRACE Seasonal School,
Hagenberg: KNC
Feb. 2017 @ IT4Innovations (PATC): KNC
Jun. 2017 @ LRZ (PATC): KNL
http://inside.hlrs.de/
inSiDE, Vol. 12, No. 2, p. 102, 2014
inSiDE, Vol. 13, No. 2, p. 79, 2015
inSiDE, Vol. 14, No. 1, p. 76f, 2016
28.9.2016
Evaluating Accelerators at LRZ
Research at LRZ within PRACE & KONWIHR:
● CELL programming
2008-2009 Evaluation of CELL programming.
IBM announced in Nov. 2009 that the CELL line would be discontinued.
● GPGPU programming
Regular GPGPU computing courses at LRZ since 2009.
Evaluation of GPGPU programming languages:
CAPS HMPP → OpenACC
PGI accelerator compiler → OpenACC
CUDA, cuBLAS, cuFFT
PyCUDA/R
● RapidMind → ArBB (Intel) → discontinued
● Larrabee (2009) → Knights Ferry (2010) → Knights Corner → Intel
Xeon Phi (2012) → KNL (2016)
28.9.2016 Intel Xeon Phi Programming
IPCC (Intel Parallel Computing Centre)
● New Intel Parallel Computing Centre (IPCC) since July 2014:
Extreme Scaling on MIC/x86
● Chair of Scientific Computing at the Department of Informatics in
the Technische Universität München (TUM) & LRZ
● https://software.intel.com/de-de/ipcc#centers
● https://software.intel.com/de-de/articles/intel-parallel-computing-center-at-leibniz-supercomputing-centre-and-technische-universit-t
● Codes:
Simulation of Dynamic Ruptures and Seismic Motion in Complex
Domains: SeisSol
Numerical Simulation of Cosmological Structure Formation: GADGET
Molecular Dynamics Simulation for Chemical Engineering: ls1 mardyn
Data Mining in High Dimensional Domains Using Sparse Grids: SG++
28.9.2016 Intel Xeon Phi Programming
● Czech-Bavarian Competence Team for
Supercomputing Applications (CzeBaCCA)
● New BMBF funded project that started in Jan. 2016 to:
Foster Czech-German Collaboration in Simulation Supercomputing
series of workshops will initiate and deepen collaboration between Czech
and German computational scientists
Establish Well-Trained Supercomputing Communities
joint training program will extend and improve trainings on both sides
Improve Simulation Software
establish and disseminate role models and best practices of simulation
software in supercomputing
Intel Xeon Phi Programming
CzeBaCCA Project
28.9.2016
CzeBaCCA Trainings and Workshops
Intel Xeon Phi Programming
● https://www.lrz.de/forschung/projekte/forschung-hpc/CzeBaCCA/
Intel MIC Programming Workshop,
3 – 4 February 2016, Ostrava, Czech Republic
Scientific Workshop: SeisMIC - Seismic Simulation on Current and Future
Supercomputers,
5 February 2016, Ostrava, Czech Republic
Intel MIC Programming Workshop,
27 - 29 June 2016, Garching, Germany
Scientific Workshop: High Performance Computing for Water Related Hazards,
29 June - 1 July 2016, Garching, Germany
http://inside.hlrs.de/ inSiDE, Vol. 14, No. 1, p. 76f, 2016
http://www.gate-germany.de/fileadmin/dokumente/Laenderprofile/Laenderprofil_Tschechien.pdf, p.27
28.9.2016
PRACE: Best Practice Guides
● http://www.prace-ri.eu/best-practice-guides/
● Best Practice Guide – Hydra, March 2013 PDF HTML
● Best Practice Guide – JUROPA, March 2013 PDF HTML
● Best Practice Guide – Anselm, June 2013 PDF HTML
● Best Practice Guide – Curie, November 2013 PDF HTML
● Best Practice Guide – Blue Gene/Q, January 2014 PDF HTML
● Best Practice Guide – Intel Xeon Phi, February 2014 PDF HTML
● Best Practice Guide - JUGENE, June 2012 PDF HTML
● Best Practice Guide - Cray XE-XC, December 2013 PDF HTML
● Best Practice Guide - IBM Power, June 2012 PDF HTML
● Best Practice Guide - IBM Power 775, November 2013 PDF HTML
● Best Practice Guide - Chimera, April 2013 PDF HTML
● Best Practice Guide - GPGPU, May 2013 PDF HTML
● Best Practice Guide - Jade, February 2013 PDF HTML
● Best Practice Guide - Stokes, February 2013 PDF HTML
● Best Practice Guide - SuperMUC, May 2013 PDF HTML
● Best Practice Guide - Generic x86, May 2013 PDF HTML
28.9.2016 Intel Xeon Phi Programming
Intel MIC within PRACE: Best Practice
Guide
● Best Practice Guide – Intel Xeon Phi
Created within PRACE-3IP.
Written in Docbook XML.
Michaela Barth (KTH Sweden), Mikko Byckling (CSC Finland), Nevena Ilieva (NCSA Bulgaria), Sami Saarinen (CSC Finland), Michael Schliephake (KTH Sweden), Volker Weinberg (LRZ, Editor).
http://www.prace-ri.eu/Best-Practice-Guide-Intel-Xeon-Phi-HTML
http://www.prace-ri.eu/IMG/pdf/Best-Practice-Guide-Intel-Xeon-Phi.pdf
28.9.2016 Intel Xeon Phi Programming
Intel MIC within PRACE: Preparatory
Access
28.9.2016 Intel Xeon Phi Programming
● Applications Enabling for Capability Science
27 enabling projects from 17 PRACE partners from 14 countries
Jul-Dec 2013
Computations on Eurora (EURopean many integrated cORe
Architecture) Prototype at CINECA, Italy with 64 Xeon Phi
coprocessors and 64 NVIDIA GPUs
X. Guo, Report on Application Enabling for Capability Science in
the MIC Architecture, PRACE Deliverable D7.1.3,
http://www.prace-ri.eu/IMG/pdf/d7.1.3_1ip.pdf
16 Whitepapers available online:
http://www.prace-project.eu/Evaluation-Intel-MIC
Intel MIC within PRACE: Preparatory
Access
● Performance Analysis and Enabling of the RayBen Code for the Intel® MIC Architecture
● Enabling the UCD-SPH code on the Xeon Phi
● Xeon Phi Meets Astrophysical Fluid Dynamics
● Multi-Kepler GPU vs. Multi-Intel MIC for spin systems simulations
● Enabling Smeagol on Xeon Phi: Lessons Learned
● Code Optimization and Scaling of the Astrophysics Software Gadget on Intel Xeon Phi
● Code Optimization and Scalability Testing of an Artificial Bee Colony Based Software for
Massively Parallel Multiple Sequence Alignment on the Intel MIC Architecture
● Optimization and Scaling of Multiple Sequence Alignment Software ClustalW on Intel Xeon
Phi
● Porting FEASTFLOW to the Intel Xeon Phi: Lessons Learned
● Optimising CP2K for the Intel Xeon Phi
● Towards Porting a Real-World Seismological Application to the Intel MIC Architecture
● FMPS on MIC
● Massively parallel Poisson Equation Solver for hybrid Intel Xeon – Xeon Phi HPC Systems
● Exploiting Locality in Sparse Matrix-Matrix Multiplication on the Many Integrated Core
Architecture
● Porting and Verification of ExaFMM Library in MIC Architecture
● AGBNP2 Implicit Solvent Library for Intel® MIC Architecture
28.9.2016 Intel Xeon Phi Programming
PRACE Systems
● MARCONI @ CINECA
The second partition is based on the Lenovo Adams Pass architecture and is equipped with the new Intel Knights Landing (KNL, BIN1) processors. It consists of 3600 nodes (1 KNL processor at 1.4 GHz and 96 GB of DDR4 RAM per node). Each KNL is equipped with 68 cores and 16 GB of MCDRAM.
● MareNostrum @ BSC
1 partition with 42 nodes, each with 2 Intel Xeon Phi 5110P
(60 cores / each with 4 hardware threads = 240 total threads, 8
GB of GDDR5 RAM, 1.053 GHz clock frequency)
● SuperMIC @ LRZ
1 partition with 32 nodes, each with 2 Intel Xeon Phi 5110P
28.9.2016 Intel Xeon Phi Programming
DEEP/ER Project: Towards Exascale
● Design of an architecture leading to Exascale.
● Development of hardware:
Implementation of a Booster based on MIC processors and EXTOLL
interconnect.
● Energy-aware integration of components:
Hot-water cooling.
● Cluster management system.
● Programming environment, programming models.
● Libraries and performance analysis tools.
● Porting applications.
28.9.2016 Intel Xeon Phi Programming
Green 500 List (Nov 2015)
28.9.2016 Intel Xeon Phi Programming
All systems in the top 10 are accelerator-based (mostly using GPUs)
Intel Xeon Phi @ top500 June 2016
● http://www.top500.org/list/2016/06/
● #2 National Super Computer Center in Guangzhou, China:
Tianhe-2 (MilkyWay-2) - TH-IVB-FEP Cluster, Intel Xeon E5-
2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P
NUDT
● #12 Texas Advanced Computing Center/Univ. of Texas United
States Stampede - PowerEdge C8220, Xeon E5-2680 8C
2.700GHz, Infiniband FDR, Intel Xeon Phi SE10P, Dell
● #25 (USA) / #34 (USA) / #42 (China)
● #55 IT4Innovations National Supercomputing Center, VSB-
Technical University of Ostrava Czech Republic Salomon - SGI
ICE X, Xeon E5-2680v3 12C 2.5GHz, Infiniband FDR, Intel, SGI
● #64 (USA) / #65 (USA) / #88 (Japan) / #100 (USA)
28.9.2016 Intel Xeon Phi Programming
Current Intel Xeon Phi Installations
● Tianhe-2 (China)
16000 nodes, each with 2 CPUs and 3 Intel Xeon Phis 31S1P
48000 Xeon Phi accelerators in total
3.1 million cores in total
33.8 PFlop/s Linpack, 17.8 MW
● Stampede (TACC, Texas)
6400 nodes, each with 1 Intel Xeon Phi SE10P
5.1 PFlop/s Linpack, 4.5 MW
● Salomon (IT4Innovations, Ostrava)
432 nodes, each with 2x Intel Xeon E5-2680v3 @ 2.5GHz and 2 x Intel
Xeon Phi 7120P with 61 cores @ 1.238 GHz, 16 GB RAM
● SuperMIC (LRZ, Munich)
32 nodes, each with 2 Intel Xeon E5-2650 @ 2.6 GHz and 2 x Intel
Xeon Phi 5110P with 60 cores @ 1.1 GHz, 8 GB RAM
28.9.2016 Intel Xeon Phi Programming
The Salomon System
28.9.2016 Intel Xeon Phi Programming
The Salomon cluster consists of 1008 compute nodes, totalling 24192 compute cores
with 129TB RAM and giving over 2 PFlop/s theoretical peak performance. Each node is
a powerful x86-64 computer, equipped with 24 cores, at least 128GB RAM. Nodes are
interconnected by 7D enhanced hypercube Infiniband network and equipped with Intel
Xeon E5-2680v3 processors. The Salomon cluster consists of 576 nodes without
accelerators and 432 nodes equipped with Intel Xeon Phi MIC accelerators.
Login: salomon.it4i.cz
Module System: module avail
module load intel
Batch System: PBS Pro job workload manager
Documentation: https://docs.it4i.cz/salomon
The Salomon System
In general
Primary purpose: High Performance Computing
Architecture of compute nodes: x86-64
Operating system: CentOS 6.7 Linux
Compute nodes
Total: 1008
Processor: 2x Intel Xeon E5-2680v3, 2.5 GHz, 12 cores
RAM: 128 GB, 5.3 GB per core, DDR4 @ 2133 MHz
Local disk drive: no
Compute network / topology: InfiniBand FDR56 / 7D enhanced hypercube
w/o accelerator: 576
MIC accelerated: 432
In total
Total theoretical peak performance (Rpeak): 2011 TFlop/s
Total amount of RAM: 129.024 TB
28.9.2016 Intel Xeon Phi Programming
The Salomon System: Compute Nodes
28.9.2016 Intel Xeon Phi Programming
Node | Count | Processor | Cores | Memory | Accelerator
w/o accelerator | 576 | 2x Intel Xeon E5-2680v3, 2.5 GHz | 24 | 128 GB | -
MIC accelerated | 432 | 2x Intel Xeon E5-2680v3, 2.5 GHz | 24 | 128 GB | 2x Intel Xeon Phi 7120P, 61 cores, 16 GB RAM
Xeon Phi - History
● Intel decided to enter the GPU market in the mid 2000s
● GPUs need massive parallelism
GPU as a CPU with many x86 cores
Code-named Larrabee
Compared to established GPUs it was not competitive
● The project was discontinued in favor of a product for the HPC market
● MIC (many integrated cores) architecture
Knights Ferry: prototype card, not a commercial product
Knights Corner: first commercial product – system used during this school
Knights Landing: the next iteration of MIC
Knights Hill: (announced at SC14 for 2017/18)
28.9.2016 Intel Xeon Phi Programming
Knights Corner vs. Xeon Phi vs. MIC
● MIC is the code name for Intel’s range of manycore CPUs
Knights Corner is the code name for the product
Xeon Phi is the official marketing terminology
● KNC comes in…
6 different specifications
3 main lines
57 / 60 / 61 cores, clocked at 1.1 / 1.053 / 1.238 GHz
6 / 8 / 16 GB of main memory
Different TDPs and memory bandwidths
3 different form factors
Actively cooled
Passively cooled
Dense form factor
28.9.2016 Intel Xeon Phi Programming
Comparison CPU – MIC - GPU
28.9.2016 Intel Xeon Phi Programming
CPU: general-purpose architecture
MIC: power-efficient multiprocessor (low frequency, Pentium design)
GPU: massively data-parallel
Intel Knights Landing
● At ISC 2016 in Frankfurt, Germany, Intel Corp. launched the second-generation Xeon Phi product family, formerly code-named Knights Landing, aimed at HPC and machine learning workloads.
● Will not be covered in this school!
28.9.2016 Intel Xeon Phi Programming
The Future of the MIC Architecture
● Knights Landing (KNL)
Next iteration of the MIC
architecture
14nm process
Based on Silvermont architecture
(Out-of-order Atom processors)
Major improvements and
upgrades over KNC
2D mesh interconnect instead of KNC ring
interconnect
Is available as a stand-alone CPU
Supports AVX-512 (Advanced Vector Extensions)
28.9.2016 Intel Xeon Phi Programming
The Xeon Phi (KNC) in use at LRZ & BSC
● This KNC is the 5110P model
Passively cooled, PCIe form factor
245 Watt Thermal Design Power (TDP)
60 cores / each with 4 hardware threads = 240 total threads
8 GB of GDDR5 RAM
1.053 GHz clock frequency
320 GB/sec peak memory bandwidth
http://ark.intel.com/de/products/71992/Intel-Xeon-Phi-Coprocessor-5110P-8GB-1_053-GHz-60-core
28.9.2016 Intel Xeon Phi Programming
The Xeon Phi (KNC) in use at
IT4Innovations
● This KNC is the 7120P model
Passively cooled, PCIe form factor
300 Watt Thermal Design Power (TDP)
61 cores / each with 4 hardware threads = 244 total threads
16 GB of GDDR5 RAM
1.238 GHz clock frequency
352 GB/sec peak memory bandwidth
http://ark.intel.com/de/products/75799/Intel-Xeon-Phi-Coprocessor-7120P-16GB-1_238-GHz-61-core
28.9.2016 Intel Xeon Phi Programming
The Intel MIC Architecture
● Up to 16 GB GDDR5 memory (350 GB/s).
● Coprocessor connected to the host by PCIe Gen2.
● Runs Linux OS (Linux Standard Base (LSB) core libraries &
Busybox minimal shell environment).
● Up to 61 cores @ 1 GHz interconnected by a ring interconnect
● Theoretical peak performance:1 TFlop/s (DP), 2 TFlop/s (SP).
● 64-bit execution.
● x86 architecture, but SSE/AVX not supported!
Different instruction set for SIMD:
Intel Initial Many Core Instructions (IMCI).
● Highly parallel and power-efficient design.
28.9.2016 Intel Xeon Phi Programming
The Intel MIC Architecture
28.9.2016 Intel Xeon Phi Programming
CRI: Core Ring Interface, bidirectional ring interconnect
which connects all the cores, L2 caches, PCIe client logic,
GDDR5 memory controllers etc.
The Intel MIC Architecture: HW Threads
● Derived from Pentium P54c design:
Intel gave RTL code to Pentagon to produce radiation
hardened version for the military
In-order architecture.
2 instructions per cycle: one on U-pipe, one on V-pipe.
At least 2 threads should be run per core.
● Xeon Phi supports 4 hardware threads
Intended to hide latencies.
Unlike hyperthreading, MIC HW threads cannot be switched
off.
Maximum performance may already be reached with fewer than 4 threads per core.
28.9.2016 Intel Xeon Phi Programming
The Intel MIC Architecture: Caches
● Cache sizes:
32 kB of L1 instruction cache.
32 kB of L1 data cache.
512 kB of local L2 cache.
● Latency:
L1 cache: 1 cycle.
L2 cache: 15-30 cycles.
GDDR5 memory: 500-1000 cycles.
● HW Prefetcher: L2 cache only
● L2 size depends on how data/code is shared between the cores
If no data is shared between cores: L2 size is 30.5 MB (61 cores).
If every core shares the same data: effective L2 size is 512 kB.
Cache coherency across the entire coprocessor.
Data remains consistent without software intervention.
28.9.2016 Intel Xeon Phi Programming
Network Access
● Network access possible using TCP/IP tools like ssh.
● NFS mounts on Xeon Phi supported.
● Proxy Console / File I/O.
28.9.2016 Intel Xeon Phi Programming
Advantages of the MIC Architecture
● Retains programmability and flexibility of standard x86
architecture.
● No need to learn a new complicated language like CUDA or
OpenCL.
● Offers possibilities we always missed on GPUs: logging in to the
system, watching and controlling processes via top, kill, etc., just like
on a Linux host.
● Allows many different parallel programming models like
OpenMP, MPI, Intel Cilk and Intel Threading Building Blocks.
● Offers standard math-libraries like Intel MKL.
● Supports whole Intel tool chain, e.g. Intel C/C++ and Fortran
Compiler, Debugger & Intel VTune Amplifier.
28.9.2016 Intel Xeon Phi Programming
Programming Modes
● Native Mode
Programs started on Xeon Phi.
Cross-compilation using -mmic.
User access to Xeon Phi necessary.
Necessary to support MPI ranks on Xeon Phi.
● Offload (Accelerator) Mode
Programs started on the host.
Intel Pragmas to offload code to Xeon Phi.
OpenMP possible, but no MPI ranks on Xeon Phi.
No user access to Xeon Phi needed.
No input data files on Xeon Phi possible.
28.9.2016 Intel Xeon Phi Programming
Offload Modes
● Host and MIC do not share physical or virtual memory in hardware.
● 2 Offload data transfer models are available:
1. Explicit copy: Language Extensions for Offload (LEO) / OpenMP 4
Syntax: pragma/directive based
offload directive specifies variables that need to be copied between host and
MIC
Example (LEO):
C: #pragma offload target(mic) in(data:length(size))
Fortran: !DIR$ offload target(mic) in(data:length(size))
2. Implicit Copy: MYO
Syntax: keyword extension based
shared variables need to be declared, same variables can be used on the
host and MIC, runtime automatically maintains coherence
Example:
C: _Cilk_shared double a; _Cilk_offload func(a);
Fortran: not supported
28.9.2016 Intel Xeon Phi Programming
Programming Languages / Libraries
● OpenMP
Native execution on MIC (cross-compilation with -mmic)
Execution on host, using offload pragmas / directives to offload code at
runtime
● MPI (and hybrid MPI & OpenMP)
Co-processor only MPI programming model: native execution on MIC
using mpiexec.hydra on MIC.
Symmetric MPI programming model: MPI ranks on MICs and host CPUs.
● MKL
Native execution on MIC (compilation with -mkl -mmic).
Compiler assisted offload.
Automatic Offload (AO): automatically uses both host and MIC,
transparent and automatic data transfer and execution management
(compilation with -mkl, mkl_mic_enable() / MKL_MIC_ENABLE=1); see the sketch below.
28.9.2016 Intel Xeon Phi Programming
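A minimal sketch of MKL Automatic Offload (not from the slides; the matrix size and values are illustrative). It assumes mkl_mic_enable() is available through mkl.h, as named above, and is built on the host with icc -mkl:

/* ao_dgemm.c - MKL Automatic Offload sketch.
   Build: icc -mkl ao_dgemm.c -o ao_dgemm
   Either call mkl_mic_enable() below or set MKL_MIC_ENABLE=1. */
#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(void) {
    int n = 4096;                               /* large enough that AO may offload */
    double *a = (double*)malloc((size_t)n*n*sizeof(double));
    double *b = (double*)malloc((size_t)n*n*sizeof(double));
    double *c = (double*)malloc((size_t)n*n*sizeof(double));
    for (long i = 0; i < (long)n*n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    mkl_mic_enable();                           /* enable Automatic Offload (as on the slide) */

    /* MKL decides whether to run this dgemm on the host, the MIC, or both. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    printf("c[0] = %f\n", c[0]);                /* expect 2*n = 8192.0 */
    free(a); free(b); free(c);
    return 0;
}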
Distributed vs. Shared Memory
28.9.2016 Intel Xeon Phi Programming
Distributed Memory
● Same program on each processor/machine (SPMD) or
Multiple programs with consistent communication structure (MPMD)
● Program written in a sequential language
all variables process-local
no implicit knowledge of data on other processors
● Data exchange between processes:
send/receive messages via appropriate library
most tedious, but also the most flexible way of parallelization
● Parallel library discussed here:
Message Passing Interface, MPI
Shared Memory
● Single program on a single machine: a UNIX process splits off threads, which are mapped to CPUs for work distribution
● Data may be process-global or thread-local
exchange of data is not needed, or happens via suitable synchronization mechanisms
● Programming models:
explicit threading (hard)
directive-based threading via OpenMP (easier)
automatic parallelization (very easy, but mostly not efficient)
MPI vs. OpenMP
28.9.2016 Intel Xeon Phi Programming
● MPI standard
MPI forum released version 2.2 in
September 2009
MPI version 3.1 in June 2015
unified document ("MPI1+2")
● Base languages
Fortran (77, 95)
C
C++ binding obsolescent
use C bindings
● Resources:
http://www.mpi-forum.org
● OpenMP standard
OpenMP 3.1 (July 2011) released by
architecture review board (ARB)
feature update (tasking etc.)
OpenMP 4.0 (July 2013)
SIMD, affinity policies, accelerator
support
OpenMP 4.5 (Nov 2015)
● Base languages
Fortran (77, 95)
C, C++
(Java is not a base language)
● Resources:
http://www.openmp.org
http://www.compunity.org
Simple OpenMP program
28.9.2016 Intel Xeon Phi Programming
#include <stdio.h>
#include <omp.h>
int main() {
int numth = 1;
#pragma omp parallel
{int myth = 0; /* private */
#pragma omp single
numth = omp_get_num_threads();
/* block above: one statement */
myth = omp_get_thread_num();
printf("Hello from %i of %i\n",
myth,numth);
} /* end parallel */
}
icc -openmp helloopenmp.c
Simple OpenMP Program
lu65fok@login12:~/mickurs> export OMP_NUM_THREADS=10
lu65fok@login12:~/mickurs> ./helloopenmp
Hello from 5 of 10
Hello from 2 of 10
Hello from 6 of 10
Hello from 0 of 10
Hello from 8 of 10
Hello from 3 of 10
Hello from 4 of 10
Hello from 9 of 10
Hello from 7 of 10
Hello from 1 of 10
28.9.2016 Intel Xeon Phi Programming
Simplest MPI Program
/* C Example */
#include <stdio.h>
#include <mpi.h>
int main (int argc, char* argv[])
{
int rank, size;
MPI_Init (&argc, &argv); /* starts MPI */
MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
printf("Hello from %i of %i\n“, rank, size);
MPI_Finalize();
return 0;
}
mpiicc hellompi.c
28.9.2016 Intel Xeon Phi Programming
Simplest MPI Program
lu65fok@login12:~/mickurs> mpiicc hellompi.c -o hellompi
lu65fok@login12:~/mickurs> mpirun -n 10 ./hellompi
Hello from 5 of 10
Hello from 6 of 10
Hello from 7 of 10
Hello from 8 of 10
Hello from 9 of 10
Hello from 0 of 10
Hello from 1 of 10
Hello from 2 of 10
Hello from 3 of 10
Hello from 4 of 10
28.9.2016 Intel Xeon Phi Programming
Useful Tools and Files on Coprocessor
● top - display Linux tasks
● ps - report a snapshot of the current processes.
● kill - send signals to processes, or list signals
● ifconfig - configure a network interface
● traceroute - print the route packets take to network host
● mpiexec.hydra – run Intel MPI natively
● /proc/cpuinfo
● /proc/meminfo
28.9.2016 Intel Xeon Phi Programming
/proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 11
model : 1
model name : 0b/01
stepping : 3
cpu MHz : 1052.630
cache size : 512 KB
physical id : 0
siblings : 240
core id : 59
cpu cores : 60
apicid : 236
initial apicid : 236
fpu : yes
fpu_exception : yes
cpuid level : 4
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr mca pat fxsr ht
syscall nx lm rep_good nopl lahf_lm
bogomips : 2094.86
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
28.9.2016 Intel Xeon Phi Programming
/proc/meminfo
[lu65fok@i01r13c01-mic0 proc]$ cat meminfo
MemTotal: 7882368 kB
MemFree: 7182704 kB
Buffers: 0 kB
Cached: 298824 kB
SwapCached: 0 kB
Active: 38660 kB
Inactive: 265544 kB
Active(anon): 38660 kB
Inactive(anon): 265544 kB
Active(file): 0 kB
Inactive(file): 0 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
…
28.9.2016 Intel Xeon Phi Programming
Native Mode
28.9.2016 Intel Xeon Phi Programming
● Compile on the Host (Login Node supermic):
lu65fok@login12:~/test> icpc -mmic hello.c -o hello
lu65fok@login12:~/tests> ifort -mmic hello.f90 -o hello
● Launch execution from the MIC:
lu65fok@login12:~/test> scp hello i01r13c01-mic0:
hello 100% 10KB 10.2KB/s 00:00
lu65fok@login12:~/test> ssh i01r13c01-mic0
[lu65fok@i01r13c01-mic0 ~]$ ./hello
hello, world
[lu65fok@i01r13c01-mic0 ~]$
Home directories might also be mounted on the MICs, as on Salomon and SuperMIC.
Native Mode: micnativeloadex
● Launch execution from the host:
lu65fok@login12:~/test> ./hello
-bash: ./hello: cannot execute binary file
lu65fok@i01r13c01:~/test> micnativeloadex ./hello
hello, world
lu65fok@i01r13c01:~/test> micnativeloadex ./hello -v
hello, world
Remote process returned: 0
Exit reason: SHUTDOWN OK
28.9.2016 Intel Xeon Phi Programming
micinfo
lu65fok@i01r13c01:~> micinfo -listdevices
MicInfo Utility Log
Created Thu Apr 17 17:22:27 2014
List of Available Devices
deviceId | domain | bus# | pciDev# | hardwareId
---------|----------|------|---------|-----------
0 | 0 | 20 | 0 | 22508086
1 | 0 | 8b | 0 | 22508086
-------------------------------------------------
28.9.2016 Intel Xeon Phi Programming
Micinfo Output
Version
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.1.2
Device Serial Number : ADKC33400625
Cores
Total No of Active Cores : 60
Voltage : 1027000 uV
Frequency : 1052631 kHz
GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV
28.9.2016 Intel Xeon Phi Programming
_SC_NPROCESSORS_ONLN
lu65fok@login12:~/tests> cat hello.c
#include <stdio.h>
#include <unistd.h>
int main(){
printf("Hello world! I have %ld logical cores.\n",
sysconf(_SC_NPROCESSORS_ONLN));
}
lu65fok@i01r13c01:~/tests> ./hello-host
Hello world! I have 32 logical cores.
[lu65fok@i01r13c01-mic0 ~]$ ./hello-mic
Hello world! I have 240 logical cores.
lu65fok@i01r13c01:~/tests> micnativeloadex ./hello-mic
Hello world! I have 240 logical cores.
28.9.2016 Intel Xeon Phi Programming
Native Mode: micnativeloadex -l
lu65fok@i01r13c01:~/test> micnativeloadex hello -l
Dependency information for hello
Full path was resolved as
/home/hpc/pr28fa/lu65fok/test/hello
Binary was built for Intel(R) Xeon Phi(TM) Coprocessor
(codename: Knights Corner) architecture
SINK_LD_LIBRARY_PATH =
Dependencies Found:
(none found)
Dependencies Not Found Locally (but may exist already on the coprocessor):
libm.so.6
libstdc++.so.6
libgcc_s.so.1
libc.so.6
libdl.so.2
28.9.2016 Intel Xeon Phi Programming
For the Labs:
Salomon System Initialisation
● Load the Intel environment on the host via:
module load intel
● Submit an interactive job via:
qsub -I -A DD-16-44 -q R??????
-l select=1:ncpus=24:accelerator=True:naccelerators=2
-l walltime=12:00:00
● No module system on the MIC, manual initialisation
needed, i.e.:
export PATH=$PATH:/apps/all/impi/5.1.2.150-iccifort-2016.1.150-GCC-4.9.3-2.25/mic/bin/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/apps/all/imkl/11.3.1.150-iimpi-8.1.5-GCC-4.9.3-2.25/mkl/lib/mic/:/apps/all/icc/2016.1.150-GCC-4.9.3-2.25/compilers_and_libraries_2016.1.150/linux/compiler/lib/mic/
28.9.2016 Intel Xeon Phi Programming
Intel Offload Directives
● Syntax:
C:
#pragma offload target(mic) <clauses>
<statement block>
Fortran:
!DIR$ offload target(mic) <clauses>
<statement>
!DIR$ omp offload target(mic) <clauses>
<OpenMP construct>
28.9.2016 Intel Xeon Phi Programming
Intel Offload Directive
● C:
Pragma can be before any statement, including a
compound statement or an OpenMP parallel
pragma
● Fortran: If OMP is specified: the next line, other than a
comment, must be an OpenMP PARALLEL,
PARALLEL SECTIONS, or PARALLEL DO directive.
If OMP is not specified, the next line must be one of:
An OpenMP* PARALLEL, PARALLEL SECTIONS, or
PARALLEL DO directive
A CALL statement
An assignment statement where the right side only calls a
function
28.9.2016 Intel Xeon Phi Programming
Intel Offload Directive
● Offloading a code block in Fortran:
!DIR$ offload begin target(MIC)
…
!DIR$ end offload
Code block can include any number of Fortran
statements, including DO, CALL and any assignments,
but not OpenMP directives.
28.9.2016 Intel Xeon Phi Programming
Intel Offload
● Implements the following steps:
1. Memory allocation on the MIC
2. Data transfer from the host to the MIC
3. Execution on the MIC
4. Data transfer from the MIC to the host
5. Memory deallocation on MIC
28.9.2016 Intel Xeon Phi Programming
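A minimal sketch (not from the slides; values are illustrative) showing the five steps above in a single offload region:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 1000;
    double *data = (double*)malloc(n * sizeof(double));
    double sum = 0.0;
    for (int i = 0; i < n; i++) data[i] = i;

    /* One offload region performs all five steps:
       1. allocate space for data/sum on the MIC,
       2. copy data (in) and sum (inout) from host to MIC,
       3. run the statement block on the MIC,
       4. copy sum (inout) from MIC to host,
       5. free the MIC copies. */
    #pragma offload target(mic) in(data:length(n)) inout(sum)
    {
        for (int i = 0; i < n; i++)
            sum += data[i];
    }

    printf("sum = %f\n", sum);   /* 0+1+...+999 = 499500 */
    free(data);
    return 0;
}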
Intel Offload: Hello World in C
#include <stdio.h>
int main (int argc, char* argv[]) {
#pragma offload target(mic)
{
printf("MIC: Hello world from MIC.\n");
}
printf( "Host: Hello world from host.\n");
}
28.9.2016 Intel Xeon Phi Programming
Note: the offloaded statement block must start on a new line after the pragma.
Intel Offload: Hello World in Fortran
PROGRAM HelloWorld
!DIR$ offload begin target(MIC)
PRINT *,'MIC: Hello world from MIC'
!DIR$ end offload
PRINT *,'Host: Hello world from host'
END
28.9.2016 Intel Xeon Phi Programming
Intel Offload: Hello World in C
lu65fok@login12:~/tests> icpc offload1.c -o offload1
lu65fok@login12:~/tests> ./offload1
offload error: cannot offload to MIC - device is not available
lu65fok@i01r13c01:~/tests> ./offload1
Host: Hello world from host.
MIC: Hello world from MIC.
28.9.2016 Intel Xeon Phi Programming
Intel Offload: Hello World in Fortran
lu65fok@login12:~/tests> ifort offload1.f90 -o offload1
lu65fok@login12:~/tests> ./offload1
offload error: cannot offload to MIC - device is not available
lu65fok@i01r13c01:~/tests> ./offload1
Host: Hello world from host.
MIC: Hello world from MIC.
28.9.2016 Intel Xeon Phi Programming
Intel Offload: Hello World with Hostnames
#include <stdio.h>
#include <unistd.h>
int main (int argc, char* argv[]) {
char hostname[100];
gethostname(hostname,sizeof(hostname));
#pragma offload target(mic)
{
char michostname[100];
gethostname(michostname, sizeof(michostname));
printf("MIC: Hello world from MIC. I am %s and I have %ld logical cores. I was
called from host: %s \n", michostname, sysconf(_SC_NPROCESSORS_ONLN),
hostname);
}
printf( "Host: Hello world from host. I am %s and I have %ld logical cores.\n", hostname, sysconf(_SC_NPROCESSORS_ONLN));
}
28.9.2016 Intel Xeon Phi Programming
Intel Offload: Hello World with Hostnames
lu65fok@login12:~/tests> icpc offload.c -o offload
lu65fok@i01r13c01:~/tests> ./offload
Host: Hello world from host. I am i01r13c01 and I have 32 logical
cores.
MIC: Hello world from MIC. I am i01r13c01-mic0 and I have 240
logical cores. I was called from host: i01r13c01
28.9.2016 Intel Xeon Phi Programming
Intel Offload: -offload=optional / mandatory
lu65fok@login12:~/tests> icpc -offload=optional offload.c -o offload
lu65fok@login12:~/tests> ./offload
MIC: Hello world from MIC. I am login12 and I have 16 logical cores. I
was called from host: login12
Host: Hello world from host. I am login12 and I have 16 logical cores.
lu65fok@login12:~/tests> icpc -offload=mandatory offload.c -o offload
lu65fok@login12:~/tests> ./offload
offload error: cannot offload to MIC - device is not available
28.9.2016 Intel Xeon Phi Programming
Intel Offload: -none
lu65fok@login12:~/tests> icpc -offload=none offload.c -o offload
offload.c(13): warning #161: unrecognized #pragma
#pragma offload target(mic)
^
lu65fok@login12:~/tests>
lu65fok@i01r13c01:~/tests> ./offload
MIC: Hello world from MIC. I am i01r13c01 and I have 32 logical cores.
I was called from host: i01r13c01
Host: Hello world from host. I am i01r13c01 and I have 32 logical
cores.
28.9.2016 Intel Xeon Phi Programming
Intel Offload
#include <stdio.h>
#include <stdlib.h>
int main(){
#pragma offload target (mic)
{
system("command");
}
}
28.9.2016 Intel Xeon Phi Programming
Intel Offload: system(“set”)
lu65fok@i01r13c01:~/tests> ./system
BASH=/bin/sh
BASHOPTS=cmdhist:extquote:force_fignore:hostcomplete:interactive_comments:progcomp:prom
ptvars:sourcepath
BASH_ALIASES=()
BASH_ARGC=()
BASH_ARGV=()
BASH_CMDS=()
BASH_EXECUTION_STRING=set
BASH_LINENO=()
BASH_SOURCE=()
BASH_VERSINFO=([0]="4" [1]="2" [2]="10" [3]="1" [4]="release" [5]="k1om-mpss-linux-gnu")
BASH_VERSION='4.2.10(1)-release'
COI_LOG_PORT=65535
COI_SCIF_SOURCE_NODE=0
DIRSTACK=()
ENV_PREFIX=MIC
EUID=400
GROUPS=()
28.9.2016 Intel Xeon Phi Programming
Intel Offload: system(“set”)
HOSTNAME=i01r13c01-mic0
HOSTTYPE=k1om
IFS='
'
LIBRARY_PATH=/lrz/sys/intel/compiler140_144/composer_xe_2013_sp1.2.144/tbb/lib/mic:/lrz/sys/intel/compiler1
40_144/composer_xe_2013_sp1.2.144/tbb/lib/mic:/lrz/sys/intel/compiler140_144/composer_xe_2013_sp1.2.144/t
bb/lib/mic:/lrz/sys/intel/compiler140_144/composer_xe_2013_sp1.2.144/tbb/lib/mic
MACHTYPE=k1om-mpss-linux-gnu
OPTERR=1
OPTIND=1
OSTYPE=linux-gnu
PATH=/usr/bin:/bin
POSIXLY_CORRECT=y
PPID=37141
PS4='+ '
PWD=/var/volatile/tmp/coi_procs/1/37141
SHELL=/bin/false
SHELLOPTS=braceexpand:hashall:interactive-comments:posix
SHLVL=1
TERM=dumb
UID=400
_=sh
28.9.2016 Intel Xeon Phi Programming
Intel Offload: system(command)
#pragma offload target (mic)
{
system("hostname");
system("uname -a");
system("whoami");
system("id");
}
lu65fok@i01r13c01:~/tests> ./system
i01r13c01-mic0
Linux i01r13c01-mic0 2.6.38.8+mpss3.1.2 #1 SMP Wed Dec 18
19:09:36 PST 2013 k1om GNU/Linux
micuser
uid=400(micuser) gid=400(micuser)
28.9.2016 Intel Xeon Phi Programming
Offload: Using several MIC Coprocessors
● To query the number of coprocessors:
int nmics = __Offload_number_of_devices();
● To specify which coprocessor n < nmics should do the
computation (see the sketch below):
#pragma offload target(mic:n)
● If (n > nmics) then coprocessor (n % nmics) is used
● Important for:
Asynchronous offloads
Coprocessor-Persistent data
28.9.2016 Intel Xeon Phi Programming
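A minimal sketch (illustrative) that sends one chunk of work to each of two coprocessors via target(mic:d); if fewer coprocessors are present, the d % nmics rule above applies:

#include <stdio.h>

#define N 1000

int main(void) {
    double part[2][N];
    double sums[2] = {0.0, 0.0};

    for (int d = 0; d < 2; d++)
        for (int i = 0; i < N; i++)
            part[d][i] = d + 1;

    /* Each iteration targets coprocessor d (0 or 1). */
    for (int d = 0; d < 2; d++) {
        double s = 0.0;
        double *p = part[d];
        #pragma offload target(mic:d) in(p:length(N)) inout(s)
        {
            for (int i = 0; i < N; i++)
                s += p[i];
        }
        sums[d] = s;
    }

    printf("sums = %f %f\n", sums[0], sums[1]);  /* 1000 and 2000 */
    return 0;
}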
Offloading OpenMP Computations
● C/C++ & OpenMP:
#pragma offload target(mic)
#pragma omp parallel for
for (int i=0;i<n;i++) {
a[i]=c*b[i]+d;
}
● Fortran & OpenMP
!DIR$ offload target(mic)
!$OMP PARALLEL DO
do i = 1, n
a(i) = c*b(i) + d
end do
!$omp END PARALLEL DO
28.9.2016 Intel Xeon Phi Programming
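A complete, compilable version of the C/OpenMP fragment above (array size and coefficients are illustrative); build with icc -openmp:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 1000000;
    double c = 2.0, d = 1.0;
    double *a = (double*)malloc(n * sizeof(double));
    double *b = (double*)malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) b[i] = i;

    /* Offload the OpenMP loop; b is only read (in), a is only written (out).
       Scalars c, d and n are copied in automatically. */
    #pragma offload target(mic) in(b:length(n)) out(a:length(n))
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = c*b[i] + d;

    printf("a[10] = %f\n", a[10]);   /* 2*10 + 1 = 21 */
    free(a); free(b);
    return 0;
}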
Functions and Variables on the MIC
● C:
__attribute__((target(mic))) variables / function
__declspec (target(mic)) variables / function
#pragma offload_attribute(push, target(mic))
… multiple lines with variables / functions
#pragma offload_attribute(pop)
● Fortran:
!DIR$ attributes offload:mic:: variables / function
28.9.2016 Intel Xeon Phi Programming
Functions and Variables on the MIC
#pragma offload_attribute(push,target(mic))
const int n=100;
int a[n], b[n],c,d;
void myfunction(int* a, int*b, int c, int d){
for (int i=0;i<n;i++) {
a[i]=c*b[i]+d;
}
}
#pragma offload_attribute(pop)
int main (int argc, char* argv[]) {
#pragma offload target(mic)
{
myfunction(a,b,c,d);
}
}
28.9.2016 Intel Xeon Phi Programming
Intel Offload Clauses
Clause | Syntax | Semantics
Multiple coprocessors | target(mic[:unit]) | Select specific coprocessors
Conditional offload | if (condition) / mandatory | Select coprocessor or host compute (see the sketch below)
Inputs | in(var-list [modifiers]) | Copy from host to coprocessor
Outputs | out(var-list [modifiers]) | Copy from coprocessor to host
Inputs & outputs | inout(var-list [modifiers]) | Copy host to coprocessor and back when offload completes
Non-copied data | nocopy(var-list [modifiers]) | Data is local to target
Async. offload | signal(signal-slot) | Trigger asynchronous offload
Async. offload | wait(signal-slot) | Wait for completion
28.9.2016 Intel Xeon Phi Programming
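A small sketch of the conditional-offload clause from the table above (the threshold is illustrative): the region is offloaded only when the condition holds, otherwise it runs on the host.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 200000;                 /* problem size (illustrative) */
    double *v = (double*)malloc(n * sizeof(double));
    double sum = 0.0;
    for (int i = 0; i < n; i++) v[i] = 1.0;

    /* Offload only if the problem is large enough to pay for the transfer. */
    #pragma offload target(mic) if(n > 100000) in(v:length(n)) inout(sum)
    {
        #ifdef __MIC__
        printf("running on the MIC\n");
        #else
        printf("running on the host\n");
        #endif
        for (int i = 0; i < n; i++) sum += v[i];
    }

    printf("sum = %f\n", sum);      /* n = 200000.0 */
    free(v);
    return 0;
}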
Intel Offload Modifier Options
Modifier | Syntax | Semantics
Specify copy length | length(N) | Copy N elements of the pointer's type
Coprocessor memory allocation | alloc_if(bool) | Allocate coprocessor space on this offload (default: TRUE)
Coprocessor memory release | free_if(bool) | Free coprocessor space at the end of this offload (default: TRUE)
Array partial allocation & variable relocation | alloc(array-slice), in(var-expr) | Enables partial array allocation and data copy into other vars & ranges
28.9.2016 Intel Xeon Phi Programming
Intel Offload: Data Movement
● #pragma offload target(mic) in(in1,in2,…)
out(out1,out2,…) inout(inout1,inout2,…)
● At Offload start:
Allocate Memory Space on MIC for all variables
Transfer in/inout variables from Host to MIC
● At Offload end:
Transfer out/inout variables from MIC to Host
Deallocate Memory Space on MIC for all variables
28.9.2016 Intel Xeon Phi Programming
Intel Offload: Data Movement
● data = (double*)malloc(n*sizeof(double));
● #pragma offload target(mic) in(data:length(n))
● Copies n doubles to the coprocessor,
not n * sizeof(double) Bytes
● ditto for out() and inout()
28.9.2016 Intel Xeon Phi Programming
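The bullets above as a runnable sketch (n is illustrative); length(n) counts elements of the pointer's type, so 8*n bytes are actually moved for doubles:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 1000;
    double *data = (double*)malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) data[i] = 1.0;

    /* length(n) = n doubles (8*n bytes), not n bytes. */
    #pragma offload target(mic) inout(data:length(n))
    {
        for (int i = 0; i < n; i++) data[i] *= 2.0;
    }

    printf("data[0] = %f\n", data[0]);   /* 2.0 */
    free(data);
    return 0;
}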
Allocation of Partial Arrays in C
● int n=1000;
● data = (double*)malloc(n*sizeof(double));
● #pragma offload target(mic) in(data[100:200] : alloc(data[300:400]))
● Host:
1000 doubles allocated
First element has index 0
Last element has index 999
● MIC:
400 doubles are allocated
First element has index 300
Last element has index 699
200 elements in the range data[100], …, data[299] are copied to
the MIC
28.9.2016 Intel Xeon Phi Programming
Allocation of Partial Arrays in Fortran
● integer :: n=1000
● double precision, allocatable :: data(:)
● allocate(data(n) )
● !C: #pragma offload target(mic) in(data[100:200] : alloc(data[300:400]))
● !DIR$ offload target(mic) in(data(100:299) : alloc(data(300:699)))
● Host:
1000 doubles allocated
First element has index 1
Last element has index 1000
● MIC:
400 doubles are allocated
First element has index 300
Last element has index 699
200 elements in the range data(100), …, data(299) are copied to the MIC
28.9.2016 Intel Xeon Phi Programming
An example for Offloading: Offloading
Code
28.9.2016 Intel Xeon Phi Programming
#pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n))
{
#pragma omp parallel for
for( i = 0; i < n; i++ ) {
for( k = 0; k < n; k++ ) {
#pragma vector aligned
#pragma ivdep
for( j = 0; j < n; j++ ) {
//c[i][j] = c[i][j] + a[i][k]*b[k][j];
c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];
}
}
}
}
Vectorisation Diagnostics
28.9.2016 Intel Xeon Phi Programming
lu65fok@login12:~/tests> icc -vec-report2 -openmp offloadmul.c -ooffloadmul
offloadmul.c(35): (col. 5) remark: LOOP WAS VECTORIZED
offloadmul.c(32): (col. 3) remark: loop was not vectorized: not inner loop
offloadmul.c(57): (col. 2) remark: LOOP WAS VECTORIZED
offloadmul.c(54): (col. 7) remark: loop was not vectorized: not inner loop
offloadmul.c(53): (col. 5) remark: loop was not vectorized: not inner loop
offloadmul.c(8): (col. 9) remark: loop was not vectorized: existence of
vector dependence
offloadmul.c(7): (col. 5) remark: loop was not vectorized: not inner loop
offloadmul.c(57): (col. 2) remark: *MIC* LOOP WAS VECTORIZED
offloadmul.c(54): (col. 7) remark: *MIC* loop was not vectorized: not inner
loop
offloadmul.c(53): (col. 5) remark: *MIC* loop was not vectorized: not inner
loop
-vec-report2 is deprecated since icc 15.0. Use -qopt-report=n -qopt-report-phase=vec and view the *.optrpt files.
Intel Offload: Example
28.9.2016 Intel Xeon Phi Programming
__attribute__((target(mic))) void mxm( int n, double * restrict a, double * restrict b,
double *restrict c ){
int i,j,k;
for( i = 0; i < n; i++ ) {
...}
}
main(){
...
#pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n))
{
mxm(n,a,b,c);
}
}
Offload Diagnostics
lu65fok@i01r13c06:~/tests> export OFFLOAD_REPORT=2
lu65fok@i01r13c06:~/tests> ./offloadmul
[Offload] [MIC 0] [File] offloadmul.c
[Offload] [MIC 0] [Line] 50
[Offload] [MIC 0] [Tag] Tag 0
[Offload] [HOST] [Tag 0] [CPU Time] 51.927456(seconds)
[Offload] [MIC 0] [Tag 0] [CPU->MIC Data] 24000016 (bytes)
[Offload] [MIC 0] [Tag 0] [MIC Time] 50.835065(seconds)
[Offload] [MIC 0] [Tag 0] [MIC->CPU Data] 8000016 (bytes)
28.9.2016 Intel Xeon Phi Programming
Offload Diagnostics
lu65fok@i01r13c06:~/tests> export H_TRACE=1
lu65fok@i01r13c06:~/tests> ./offloadmul
HOST: Offload function
__offload_entry_offloadmul_c_50mainicc638762473Jnx4JU,
is_empty=0, #varDescs=7, #waits=0, signal=none
HOST: Total pointer data sent to target: [24000000] bytes
HOST: Total copyin data sent to target: [16] bytes
HOST: Total pointer data received from target: [8000000] bytes
MIC0: Total copyin data received from host: [16] bytes
MIC0: Total copyout data sent to host: [16] bytes
HOST: Total copyout data received from target: [16] bytes
lu65fok@i01r13c06:~/tests>
28.9.2016 Intel Xeon Phi Programming
Offload Diagnostics
lu65fok@i01r13c06:~/tests> export H_TIME=1
lu65fok@i01r13c06:~/tests> ./offloadmul
[Offload] [MIC 0] [File] offloadmul.c
[Offload] [MIC 0] [Line] 50
[Offload] [MIC 0] [Tag] Tag 0
[Offload] [HOST] [Tag 0] [CPU Time] 51.920016(seconds)
[Offload] [MIC 0] [Tag 0] [MIC Time] 50.831497(seconds)
**************************************************************
timer data (sec)
**************************************************************
lu65fok@i01r13c06:~/tests>
28.9.2016 Intel Xeon Phi Programming
Environment Variables
● Host environment variables are automatically
forwarded to the coprocessor when offload mode is
used.
● To avoid name collisions:
Set MIC_ENV_PREFIX=MIC on the host
Then only names with prefix MIC_ are forwarded to
the coprocessor, with the prefix stripped
Exception: MIC_LD_LIBRARY_PATH is never passed
to the coprocessor.
Value of LD_LIBRARY_PATH cannot be changed via
forwarding of environment variables.
28.9.2016 Intel Xeon Phi Programming
Environment Variables on the MIC
#include <stdio.h>
#include <stdlib.h>
int main(){
#pragma offload target (mic)
{
char* varmic = getenv("VAR");
if (varmic) {
printf("VAR=%s on MIC.\n", varmic);
} else {
printf("VAR is not defined on MIC.\n");
}
}
char* varhost = getenv("VAR");
if (varhost) {
printf("VAR=%s on host.\n", varhost);
} else {
printf("VAR is not defined on host.\n");
}
}
28.9.2016 Intel Xeon Phi Programming
Environment Variables on the MIC
lu65fok@i01r13c01:~/tests> ./env
VAR is not defined on host.
VAR is not defined on MIC.
lu65fok@i01r13c01:~/tests> export VAR=299792458
lu65fok@i01r13c01:~/tests> ./env
VAR=299792458 on host.
VAR=299792458 on MIC.
lu65fok@i01r13c01:~/tests> export MIC_ENV_PREFIX=MIC
lu65fok@i01r13c01:~/tests> ./env
VAR=299792458 on host.
VAR is not defined on MIC.
lu65fok@i01r13c01:~/tests> export MIC_VAR=3.141592653
lu65fok@i01r13c01:~/tests> ./env
VAR=299792458 on host.
VAR=3.141592653 on MIC.
28.9.2016 Intel Xeon Phi Programming
The Preprocessor Macro __MIC__
● The macro __MIC__ is only defined in code version
for MIC, not in the fallback version for the host
● Allows checking where the code is running.
● Allows writing multiversioned code.
● __MIC__ also defined in native mode.
28.9.2016 Intel Xeon Phi Programming
The Preprocessor Macro __MIC__
#pragma offload target(mic)
{
#ifdef __MIC__
printf("Hello from MIC (offload succeeded).\n");
#else
printf("Hello from host (offload to MIC failed!).\n");
#endif
}
lu65fok@login12:~/tests> icpc -offload=optional offload-mic.c
lu65fok@login12:~/tests> ./a.out
Hello from host (offload to MIC failed!).
lu65fok@i01r13c06:~/tests> ./a.out
Hello from MIC (offload succeeded).
28.9.2016 Intel Xeon Phi Programming
Proxy Console I/O
● stderr and stdout on MIC are buffered and forwarded
(proxied) to the host console.
● Forwarding is done by the coi_daemon running on
the MIC.
● The output buffer should be flushed with fflush(0) from the
stdio library.
● Proxy console input not supported.
● Proxy I/O is enabled by default.
● Can be switched off using MIC_PROXY_IO=0.
28.9.2016 Intel Xeon Phi Programming
Proxy Console I/O
#include <stdio.h>
#include <unistd.h>
__attribute__((target(mic))) extern struct _IO_FILE *stderr;
int main (int argc, char* argv[]){
char hostname[100]; gethostname(hostname,sizeof(hostname));
#pragma offload target(mic)
{
char michostname[100]; gethostname(michostname, sizeof(michostname));
printf("MIC stdout: Hello world from MIC. I am %s and I have %ld logical cores. I was called
from host: %s \n", michostname, sysconf(_SC_NPROCESSORS_ONLN), hostname);
fprintf(stderr,"MIC stderr: Hello world from MIC. I am %s and I have %ld logical cores. I was
called from host: %s \n", michostname, sysconf(_SC_NPROCESSORS_ONLN), hostname);
fflush(0);
}
printf( "Host stdout: Hello world from host. I am %s and I have %ld logical cores.\n", hostname,
sysconf(_SC_NPROCESSORS_ONLN));
fprintf(stderr, "Host stderr: Hello world from host. I am %s and I have %ld logical cores.\n",
hostname, sysconf(_SC_NPROCESSORS_ONLN));
}
28.9.2016 Intel Xeon Phi Programming
Proxy Console I/O
lu65fok@i01r13c01:~/tests> ./proxyio 1>proxyio.out 2>proxyio.err
lu65fok@i01r13c01:~/tests> cat proxyio.out
MIC stdout: Hello world from MIC. I am i01r13c01-mic0 and I have 240 logical cores.
I was called from host: i01r13c01
Host stdout: Hello world from host. I am i01r13c01 and I have 32 logical cores.
lu65fok@i01r13c01:~/tests> cat proxyio.err
MIC stderr: Hello world from MIC. I am i01r13c01-mic0 and I have 240 logical cores. I
was called from host: i01r13c01
Host stderr: Hello world from host. I am i01r13c01 and I have 32 logical cores.
lu65fok@i01r13c01:~/tests>
28.9.2016 Intel Xeon Phi Programming
Proxy Console I/O
lu65fok@i01r13c01:~/tests> export MIC_PROXY_IO=0
lu65fok@i01r13c01:~/tests> ./proxyio 1>proxyio.out 2>proxyio.err
lu65fok@i01r13c01:~/tests> cat proxyio.out
Host stdout: Hello world from host. I am i01r13c01 and I have 32
logical cores.
lu65fok@i01r13c01:~/tests> cat proxyio.err
Host stderr: Hello world from host. I am i01r13c01 and I have 32
logical cores.
lu65fok@i01r13c01:~/tests>
28.9.2016 Intel Xeon Phi Programming
Data Traffic without Computation
● 2 possibilities:
Blank body of #pragma offload, i.e.
#pragma offload target(mic) in (data: length(n))
{}
Use a special pragma offload_transfer, i.e.
#pragma offload_transfer target(mic) in(data:
length(n))
28.9.2016 Intel Xeon Phi Programming
Asynchronous Offload
● Asynchronous data transfer helps to:
overlap computations on the host and the MIC(s),
distribute work to multiple coprocessors,
mask data transfer time.
28.9.2016 Intel Xeon Phi Programming
Asynchronous Offload
● To allow asynchronous data transfer, the specifiers
signal() and wait() can be used, i.e.
#pragma offload_transfer target(mic:0) in(data : length(n))
signal(data)
// work on other data concurrent to data transfer …
#pragma offload target(mic:0) wait(data) \
nocopy(data : length(N)) out(result : length(N))
{
….
result[i]=data[i] + …;
}
28.9.2016 Intel Xeon Phi Programming
Note: the device number must be specified, and any pointer-type variable can serve as a signal.
Asynchronous Offload
● Alternative to the wait() clause, a new pragma can be
used:
#pragma offload_wait target(mic:0) wait(data)
● Useful if no other offload or data transfer is necessary
at the synchronisation point.
28.9.2016 Intel Xeon Phi Programming
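A small end-to-end sketch (size illustrative) combining offload_transfer, signal() and the offload_wait pragma described above; the alloc_if/free_if modifiers (see the Persistent Data slide below) keep the data on the coprocessor between the two pragmas:

#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void) {
    double *data = (double*)malloc(N * sizeof(double));
    double sum = 0.0;
    for (int i = 0; i < N; i++) data[i] = 1.0;

    /* Start the transfer to coprocessor 0 asynchronously; keep the buffer (free_if(0)). */
    #pragma offload_transfer target(mic:0) in(data:length(N) free_if(0)) signal(data)

    /* ... other host work here, overlapping with the transfer ... */

    /* Wait for the transfer to finish. */
    #pragma offload_wait target(mic:0) wait(data)

    /* Compute on the data that is already on the coprocessor (alloc_if(0), no copy-in). */
    #pragma offload target(mic:0) nocopy(data:length(N) alloc_if(0)) inout(sum)
    {
        for (int i = 0; i < N; i++) sum += data[i];
    }

    printf("sum = %f\n", sum);   /* N = 1000000.0 */
    free(data);
    return 0;
}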
Asynchronous Offload to Multiple
Coprocessors
char* offload0;
char* offload1;
#pragma offload target(mic:0) signal(offload0) \
in(data0 : length(N)) out(result0 : length(N))
{
Calculate(data0, result0);
}
#pragma offload target(mic:1) signal(offload1) \
in(data1 : length(N)) out(result1 : length(N))
{
Calculate(data1, result1);
}
#pragma offload_wait target(mic:0) wait(offload0)
#pragma offload_wait target(mic:1) wait(offload1)
28.9.2016 Intel Xeon Phi Programming
Explicit Worksharing
28.9.2016 Intel Xeon Phi Programming
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
{
//section running on the coprocessor
#pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n))
{
mxm(n,a,b,c);
}
}
#pragma omp section
{
//section running on the host
mxm(n,d,e,f);
}
}
}
Persistent Data
● #define ALLOC alloc_if(1)
#define FREE free_if(1)
#define RETAIN free_if(0)
#define REUSE alloc_if(0)
● To allocate data and keep it for the next offload:
#pragma offload target(mic) in (p:length(l) ALLOC RETAIN)
● To reuse the data and still keep it on the coprocessor:
#pragma offload target(mic) in (p:length(l) REUSE RETAIN)
● To reuse the data again and free the memory. (FREE is the
default, and does not need to be explicitly specified):
#pragma offload target(mic) in (p:length(l) REUSE FREE)
28.9.2016 Intel Xeon Phi Programming
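The macros above in a runnable sketch (size illustrative). Here nocopy is used for the reuse steps, so the retained data is not re-transferred:

#include <stdio.h>
#include <stdlib.h>

#define ALLOC  alloc_if(1)
#define FREE   free_if(1)
#define RETAIN free_if(0)
#define REUSE  alloc_if(0)

#define N 1000

int main(void) {
    double *p = (double*)malloc(N * sizeof(double));
    double sum1 = 0.0, sum2 = 0.0;
    for (int i = 0; i < N; i++) p[i] = 1.0;

    /* 1st offload: allocate on the MIC, copy in, keep the buffer. */
    #pragma offload target(mic:0) in(p:length(N) ALLOC RETAIN) inout(sum1)
    { for (int i = 0; i < N; i++) sum1 += p[i]; }

    /* 2nd offload: reuse the buffer already on the MIC, still keep it. */
    #pragma offload target(mic:0) nocopy(p:length(N) REUSE RETAIN) inout(sum2)
    { for (int i = 0; i < N; i++) sum2 += 2.0*p[i]; }

    /* 3rd offload: reuse once more and free (FREE is the default anyway). */
    #pragma offload target(mic:0) nocopy(p:length(N) REUSE FREE) inout(sum1)
    { sum1 += 1.0; }

    printf("sum1 = %f, sum2 = %f\n", sum1, sum2);  /* 1001 and 2000 */
    free(p);
    return 0;
}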
Virtual Shared Classes
● Offload Model only allows offloading of bitwise-
copyable data.
● Sharing complicated structures with pointers or C++
classes is only possible via MYO
28.9.2016 Intel Xeon Phi Programming
MYO
● “Mine Yours Ours” virtual shared memory model.
● Alternative to Offload approach.
● Only available in C++.
● Allows sharing of complex data that is not bitwise-copyable
(like structures with pointer elements, C++ classes)
without data marshalling.
● Allocation of data at the same virtual addresses on
the host and the coprocessor.
● Runtime automatically maintains coherence.
● Syntax based on the keywords __Cilk_shared and
__Cilk_offload.
28.9.2016 Intel Xeon Phi Programming
MYO: Example
#define N 10000
_Cilk_shared int a[N], b[N], c[N];
_Cilk_shared void add() {
for (int i = 0; i < N; i++)
c[i] = a[i] + b[i];
}
int main(int argc, char *argv[]) {
…
_Cilk_offload add(); // Function call on coprocessor:
…
}
28.9.2016 Intel Xeon Phi Programming
MYO Language Extensions
Entity | Syntax | Semantics
Function | int _Cilk_shared f(int x){…} | Executable code for both host and MIC; may be called from either side
Global variable | _Cilk_shared int x = 0 | Visible on both sides
File/Function static | static _Cilk_shared int x | Visible on both sides, only to code within the file/function
Class | class _Cilk_shared x {...} | Class methods, members, and operators are available on both sides
Pointer to shared data | int _Cilk_shared *p | p is local (not shared), can point to shared data
A shared pointer | int *_Cilk_shared p | p is shared, should only point at shared data
Offloading a function call | x = _Cilk_offload func(y) | func executes on MIC if possible
 | x = _Cilk_offload_to(n) func | func must be executed on the specified (n-th) MIC
Offloading asynchronously | _Cilk_spawn _Cilk_offload func(y) | Non-blocking offload
Offload a parallel for-loop | _Cilk_offload _Cilk_for(i=0; i<N; i++) {…} | Loop executes in parallel on MIC
28.9.2016 Intel Xeon Phi Programming
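A short sketch built only from constructs in the table above (array size illustrative); the _Cilk_* keywords require the Intel compiler:

#include <stdio.h>

#define N 1000

/* Shared data: visible at the same virtual address on host and MIC. */
_Cilk_shared double a[N], b[N];
_Cilk_shared double result;

/* Shared function: compiled for both sides, callable from either. */
void _Cilk_shared scale_and_sum(double factor) {
    double s = 0.0;
    for (int i = 0; i < N; i++) {
        b[i] = factor * a[i];
        s += b[i];
    }
    result = s;
}

int main(void) {
    for (int i = 0; i < N; i++) a[i] = 1.0;

    /* Offload the function call to the coprocessor (if available). */
    _Cilk_offload scale_and_sum(2.0);
    printf("result = %f\n", result);            /* 2000 */

    /* Offload a parallel loop directly. */
    _Cilk_offload _Cilk_for (int i = 0; i < N; i++)
        b[i] = b[i] + 1.0;

    printf("b[0] = %f\n", b[0]);                /* 3 */
    return 0;
}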
Important MPI environment variables
● Important Paths are already set by intel module,
otherwise use:
. $ICC_BASE/bin/compilervars.sh intel64
. $MPI_BASE/bin64/mpivars.sh
● Recommended environment on Salomon:
module load intel
export I_MPI_HYDRA_BOOTSTRAP=ssh
export I_MPI_MIC=enable
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0,ofa-v2-mcm-1
export MIC_LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH:/apps/all/impi/5.1.2.150-iccifort-2016.1.150-GCC-4.9.3-2.25/mic/lib/ (path depends on version)
28.9.2016 Intel Xeon Phi Programming
Invocation of the Intel MPI compiler
28.9.2016 Intel Xeon Phi Programming
Language | MPI compiler wrapper | Underlying compiler
C | mpiicc | icc
C++ | mpiicpc | icpc
Fortran | mpiifort | ifort
I_MPI_FABRICS
● The following network fabrics are available for the
Intel Xeon Phi coprocessor:
28.9.2016 Intel Xeon Phi Programming
shm | Shared memory
tcp | TCP/IP-capable network fabrics, such as Ethernet and InfiniBand (through IPoIB)
ofa | OFA-capable network fabrics, including InfiniBand (through OFED verbs)
dapl | DAPL-capable network fabrics, such as InfiniBand, iWarp, Dolphin, and XPMEM (through DAPL)
I_MPI_FABRICS
● The default can be changed by setting the
I_MPI_FABRICS environment variable to
I_MPI_FABRICS=<fabric> or I_MPI_FABRICS=
<intra-node fabric>:<inter-nodes fabric>
● Intranode: Shared Memory, Internode: DAPL
(Default on SuperMIC/MUC)
export I_MPI_FABRICS=shm:dapl
● Intranode: Shared Memory, Internode: TCP
(Can be used in case of Infiniband problems)
export I_MPI_FABRICS=shm:tcp
28.9.2016 Intel Xeon Phi Programming
Sample MPI Program
lu65fok@login12:~/tests> cat testmpi.c
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>
int main (int argc, char* argv[]) {
char hostname[100];
int rank, size;
MPI_Init (&argc, &argv); /* starts MPI */
MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
gethostname(hostname,100);
printf( "Hello world from process %d of %d: host: %s\n", rank, size, hostname);
MPI_Finalize();
return 0;
}
28.9.2016 Intel Xeon Phi Programming
MPI on hosts
● Compile for host using mpiicc / mpiifort:
lu65fok@login12:~/tests> mpiicc testmpi.c -o testmpi-
host
● Run 2 MPI tasks on host node i01r13a01
lu65fok@login12:~/tests> mpiexec -n 2 -host i01r13a01
./testmpi-host
Hello world from process 0 of 2: host: i01r13a01
Hello world from process 1 of 2: host: i01r13a01
28.9.2016 Intel Xeon Phi Programming
MPI in native mode on 1 MIC
● Compile for MIC using mpiicc / mpiifort -mmic:
lu65fok@login12:~/tests> mpiicc -mmic testmpi.c -o testmpi-mic
● Copy binary to MIC:
lu65fok@login12:~/tests> scp testmpi-mic i01r13a01-mic0:
● Launch 2 MPI tasks from MIC node i01r13a01-mic0:
lu65fok@i01r13a04:~/tests> ssh i01r13a01-mic0
[lu65fok@i01r13a01-mic0 ~]$ mpiexec -n 2 ./testmpi-mic
Hello world from process 1 of 2: host: i01r13a01-mic0
Hello world from process 0 of 2: host: i01r13a01-mic0
28.9.2016 Intel Xeon Phi Programming
Do not mix up with mpicc and mpifort!!
lu65fok@login12:~/tests> mpicc -mmic testmpi.c -o testmpi-mic
/usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: skipping incompatible
/lrz/sys/intel/mpi_41_3_048/mic/lib/libmpigf.so when searching for -lmpigf
/usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: skipping incompatible
/lrz/sys/intel/mpi_41_3_048/mic/lib/libmpigf.a when searching for -lmpigf
/usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: cannot find -lmpigf
/usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: skipping incompatible
/lrz/sys/intel/mpi_41_3_048/mic/lib/libmpi.so when searching for -lmpi
/usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: skipping incompatible
/lrz/sys/intel/mpi_41_3_048/mic/lib/libmpi.a when searching for -lmpi
/usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: cannot find -lmpi
/usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: skipping incompatible
/lrz/sys/intel/mpi_41_3_048/mic/lib/libmpigi.a when searching for -lmpigi
/usr/lib64/gcc/x86_64-suse-linux/4.3/../../../../x86_64-suse-linux/bin/ld: cannot find -lmpigi
collect2: ld returned 1 exit status
28.9.2016 Intel Xeon Phi Programming
MPI on 1 MIC
● Compile for MIC using mpiicc / mpiifort -mmic:
lu65fok@login12:~/tests> mpiicc -mmic testmpi.c -o testmpi-mic
● Copy binary to MIC (not necessary if home is mounted on MICs):
lu65fok@login12:~/tests> scp testmpi-mic i01r13a01-mic0:
● Run 2 MPI tasks on MIC node i01r13a01-mic0 (the full path to the binary is needed!):
lu65fok@i01r13a04:~/tests> mpiexec -n 2 -host i01r13a01-mic0
/home/lu65fok/testmpi-mic
Hello world from process 1 of 2: host: i01r13a01-mic0
Hello world from process 0 of 2: host: i01r13a01-mic0
28.9.2016 Intel Xeon Phi Programming
MPI on 2 MICs
● Compile for MIC using mpiicc / mpiifort -mmic:
lu65fok@login12:~/tests> mpiicc -mmic testmpi.c -o testmpi-mic
● Copy binary to MICs (not necessary if home is mounted on MICs):
lu65fok@login12:~/tests> scp testmpi-mic i01r13a01-mic0:
lu65fok@login12:~/tests> scp testmpi-mic i01r13a01-mic1:
● Run 2 MPI tasks, one on each MIC of node i01r13a01:
lu65fok@login12:~/tests> mpirun -n 2 -perhost 1 -host
i01r13a01-mic0,i01r13a01-mic1 /home/lu65fok/testmpi-mic
Hello world from process 1 of 2: host: i01r13a01-mic1
Hello world from process 0 of 2: host: i01r13a01-mic0
28.9.2016 Intel Xeon Phi Programming
MPI on Host and 2 MICs attached to the
host
lu65fok@login12:~/tests> mpirun -n 1 -host i01r13a01 ./testmpi-host : -n 1 -
host i01r13a01-mic0 /home/lu65fok/testmpi-mic : -n 1 -host i01r13a01-mic1
/home/lu65fok/testmpi-mic
Hello world from process 0 of 3: host: i01r13a01
Hello world from process 2 of 3: host: i01r13a01-mic1
Hello world from process 1 of 3: host: i01r13a01-mic0
28.9.2016 Intel Xeon Phi Programming
MPI on multiple Hosts & MICs
28.9.2016 Intel Xeon Phi Programming
lu65fok@i01r13a01:~/tests> mpirun -n 1 -host i01r13a01 ./testmpi-host : -n 1 -host
i01r13a01-mic0 /home/lu65fok/testmpi-mic : -n 1 -host i01r13a01-mic1
/home/lu65fok/testmpi-mic : -n 1 -host i01r13a02 ./testmpi-host : -n 1 -host
i01r13a02-mic0 /home/lu65fok/testmpi-mic : -n 1 -host i01r13a02-mic1
/home/lu65fok/testmpi-mic
Hello world from process 3 of 6: host: i01r13a02
Hello world from process 0 of 6: host: i01r13a01
Hello world from process 2 of 6: host: i01r13a01-mic1
Hello world from process 5 of 6: host: i01r13a02-mic1
Hello world from process 1 of 6: host: i01r13a01-mic0
Hello world from process 4 of 6: host: i01r13a02-mic0
MPI Machine File
lu65fok@login12:~/tests> cat machinefile.txt
i01r13a01-mic0
i01r13a01-mic1
i01r13a02-mic0
i01r13a02-mic1
lu65fok@login12:~/tests> mpirun -n 4 -machinefile machinefile.txt
/home/lu65fok/testmpi-mic
Hello world from process 3 of 4: host: i01r13a02-mic1
Hello world from process 2 of 4: host: i01r13a02-mic0
Hello world from process 1 of 4: host: i01r13a01-mic1
Hello world from process 0 of 4: host: i01r13a01-mic0
28.9.2016 Intel Xeon Phi Programming
MPI Machine File
lu65fok@login12:~/tests> cat machinefile.txt
i01r13a01-mic0:2
i01r13a01-mic1
i01r13a02-mic0
i01r13a02-mic1
lu65fok@login12:~/tests> mpirun -n 4 -machinefile machinefile.txt
/home/lu65fok/testmpi-mic
Hello world from process 3 of 4: host: i01r13a02-mic0
Hello world from process 0 of 4: host: i01r13a01-mic0
Hello world from process 2 of 4: host: i01r13a01-mic1
Hello world from process 1 of 4: host: i01r13a01-mic0
28.9.2016 Intel Xeon Phi Programming
Offload from MPI Tasks
#include <unistd.h>
#include <stdio.h>
#include <mpi.h>
int main (int argc, char* argv[]) {
char hostname[100];
int rank, size;
MPI_Init (&argc, &argv); /* starts MPI */
MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
gethostname(hostname,100);
#pragma offload target(mic)
{
char michostname[50];
gethostname(michostname, 50);
printf("MIC: I am %s and I have %ld logical cores. I was called by process %d of %d: host: %s \n", michostname,
sysconf(_SC_NPROCESSORS_ONLN), rank, size, hostname);
}
printf( "Hello world from process %d of %d: host: %s\n", rank, size, hostname);
MPI_Finalize();
return 0;
}
28.9.2016 Intel Xeon Phi Programming
Offload from MPI Tasks using 1 host
lu65fok@login12:~/tests> mpiicc testmpioffload.c -o testmpioffload
lu65fok@login12:~/tests> mpirun -n 4 -host i01r13a01 ./testmpioffload
Hello world from process 3 of 4: host: i01r13a01
Hello world from process 1 of 4: host: i01r13a01
Hello world from process 0 of 4: host: i01r13a01
Hello world from process 2 of 4: host: i01r13a01
MIC: I am i01r13a01-mic0 and I have 240 logical cores. I was called by
process 3 of 4: host: i01r13a01
MIC: I am i01r13a01-mic0 and I have 240 logical cores. I was called by
process 0 of 4: host: i01r13a01
MIC: I am i01r13a01-mic0 and I have 240 logical cores. I was called by
process 1 of 4: host: i01r13a01
MIC: I am i01r13a01-mic0 and I have 240 logical cores. I was called by
process 2 of 4: host: i01r13a01
28.9.2016 Intel Xeon Phi Programming
Offload from MPI Tasks using multiple
hosts
lu65fok@login12:~/tests> mpirun -n 4 -perhost 2 -host
i01r13a01,i01r13a02 ./testmpioffload
Hello world from process 2 of 4: host: i01r13a02
Hello world from process 0 of 4: host: i01r13a01
Hello world from process 3 of 4: host: i01r13a02
Hello world from process 1 of 4: host: i01r13a01
MIC: I am i01r13a02-mic0 and I have 240 logical cores. I was called by
process 2 of 4: host: i01r13a02
MIC: I am i01r13a01-mic0 and I have 240 logical cores. I was called by
process 1 of 4: host: i01r13a01
MIC: I am i01r13a01-mic0 and I have 240 logical cores. I was called by
process 0 of 4: host: i01r13a01
MIC: I am i01r13a02-mic0 and I have 240 logical cores. I was called by
process 3 of 4: host: i01r13a02
28.9.2016 Intel Xeon Phi Programming
Offload from MPI Tasks: Using both MICs
#pragma offload target(mic:rank%2)
{
    char michostname[50];
    gethostname(michostname, sizeof(michostname));
    printf("MIC: I am %s and I have %ld logical cores. I was called by process %d of %d: host: %s \n",
           michostname, sysconf(_SC_NPROCESSORS_ONLN), rank, size, hostname);
}
28.9.2016 Intel Xeon Phi Programming
Offload from MPI Tasks: Using both MICs
lu65fok@login12:~/tests> mpirun -n 4 -perhost 2 -host
i01r13a01,i01r13a02 ./testmpioffload
Hello world from process 0 of 4: host: i01r13a01
Hello world from process 2 of 4: host: i01r13a02
Hello world from process 3 of 4: host: i01r13a02
Hello world from process 1 of 4: host: i01r13a01
MIC: I am i01r13a02-mic1 and I have 240 logical cores. I was called
by process 3 of 4: host: i01r13a02
MIC: I am i01r13a01-mic1 and I have 240 logical cores. I was called
by process 1 of 4: host: i01r13a01
MIC: I am i01r13a01-mic0 and I have 240 logical cores. I was called
by process 0 of 4: host: i01r13a01
MIC: I am i01r13a02-mic0 and I have 240 logical cores. I was called
by process 2 of 4: host: i01r13a02
28.9.2016 Intel Xeon Phi Programming
Intel MKL
● Math library for C and Fortran
● Includes
BLAS
LAPACK
ScaLAPACK
FFTW
…
● Contains optimised routines
For Intel CPUs and MIC architecture
● All MKL functions are supported on Xeon Phi
But optimised at different levels
28.9.2016 Intel Xeon Phi Programming
MKL Usage In Accelerator Mode
● Compiler Assisted Offload
Offloading is explicitly controlled by compiler pragmas or
directives.
All MKL functions can be inserted inside an offload region to run on
the Xeon Phi (in contrast, only a subset of MKL is subject to AO).
More flexibility in data transfer and remote execution
management.
● Automatic Offload Mode
MKL functions are automatically offloaded to the accelerator.
MKL decides:
when to offload
work division between host and targets
Data is managed automatically
● Native Execution
MKL functions are executed natively on the accelerator.
28.9.2016 Intel Xeon Phi Programming
Compiler Assisted Offload
MKL functions are offloaded in the same way as any other
offloaded function.
An example in C:
#pragma offload target(mic) \
in(transa, transb, N, alpha, beta) \
in(A:length(matrix_elements)) \
in(B:length(matrix_elements)) \
in(C:length(matrix_elements)) \
out(C:length(matrix_elements) alloc_if(0))
{
sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N,&beta, C, &N);
}
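For completeness, a hedged sketch of a full program around such an offload
region (the file name sgemm_cao.c, the matrix size and the initial values are
assumptions, not taken from the slides); it can be built on the host with,
e.g., icc -O3 -mkl sgemm_cao.c -o sgemm_cao:

/* sgemm_cao.c - sketch of SGEMM via Compiler Assisted Offload */
#include <stdio.h>
#include <mkl.h>
int main(void) {
    MKL_INT N = 2048;
    size_t matrix_elements = (size_t)N * N;
    size_t i;
    char transa = 'N', transb = 'N';
    float alpha = 1.0f, beta = 0.0f;
    /* 64-byte aligned buffers (see the data alignment slides) */
    float *A = (float*)mkl_malloc(matrix_elements * sizeof(float), 64);
    float *B = (float*)mkl_malloc(matrix_elements * sizeof(float), 64);
    float *C = (float*)mkl_malloc(matrix_elements * sizeof(float), 64);
    for (i = 0; i < matrix_elements; i++) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.0f; }
    /* offload the MKL call; data transfers are spelled out explicitly */
    #pragma offload target(mic) \
        in(transa, transb, N, alpha, beta) \
        in(A:length(matrix_elements)) \
        in(B:length(matrix_elements)) \
        inout(C:length(matrix_elements))
    {
        sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
    }
    printf("C[0] = %f (expected %f)\n", C[0], 2.0f * N);
    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}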
28.9.2016 Intel Xeon Phi Programming
How to use CAO
An example in Fortran:
!DEC$ ATTRIBUTES OFFLOAD : TARGET( MIC ) :: SGEMM
!DEC$ OMP OFFLOAD TARGET( MIC ) &
!DEC$ IN( TRANSA, TRANSB, M, N, K, ALPHA, BETA, LDA, LDB, LDC ), &
!DEC$ IN( A: LENGTH( NCOLA * LDA )), &
!DEC$ IN( B: LENGTH( NCOLB * LDB )), &
!DEC$ INOUT( C: LENGTH( N * LDC ))
!$OMP PARALLEL SECTIONS
!$OMP SECTION
CALL SGEMM( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )
!$OMP END PARALLEL SECTIONS
28.9.2016 Intel Xeon Phi Programming
Automatic Offload
- With Automatic Offload the user does not have to change the code at all;
it is enabled simply by setting MKL_MIC_ENABLE=1 (see the minimal sketch at the end of this slide)
- The runtime may automatically transfer data to the Xeon Phi coprocessor
and execute (all or part of) the computations there, transparently for the
user
In Intel MKL 11.0.2 the following functions are enabled for automatic offload:
Level-3 BLAS functions
*GEMM (for m,n > 2048, k > 256)
*TRSM (for M,N > 3072)
*TRMM (for M,N > 3072)
*SYMM (for M,N > 2048)
LAPACK functions
LU (M,N > 8192)
QR
Cholesky
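A minimal sketch (not from the slides) of running an unchanged host binary
with Automatic Offload enabled; the binary name sgemm.exe is only an example:

# enable Automatic Offload for this run
export MKL_MIC_ENABLE=1
# optionally restrict AO to selected coprocessors (cf. the environment settings slide)
export OFFLOAD_DEVICES=0,1
./sgemm.exe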
28.9.2016 Intel Xeon Phi Programming
Automatic Offload
BLAS only: Work can be divided between host and device using
mkl_mic_set_workdivision(TARGET_TYPE, TARGET_NUMBER, WORK_RATIO)
Users can use AO for some MKL calls and use CAO for others in
the same program
- Only supported by Intel compilers
- Work division must be set explicitly for AO, otherwise, all MKL
AO calls are executed on the host
28.9.2016 Intel Xeon Phi Programming
Automatic Offload Mode Example
#include "mkl.h"
err = mkl_mic_enable();
// Offload all work to the Xeon Phi (no work left on the host)
err = mkl_mic_set_workdivision(MKL_TARGET_HOST, MIC_HOST_DEVICE, 0.0);
// Let MKL decide the amount of work to offload to coprocessor 0
err = mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, MIC_AUTO_WORKDIVISION);
// Offload 50% of the work to coprocessor 0
err = mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5);
// Get the amount of work on coprocessor 0
err = mkl_mic_get_workdivision(MKL_TARGET_MIC, 0, &wd);
28.9.2016 Intel Xeon Phi Programming
Tips for Using Automatic Offload
● AO works only when matrix sizes are right
● SGEMM: Offloading only when M, N > 2048
● Square matrices give much better performance
● These settings may produce better results for SGEMM
calculations on a 60-core coprocessor:
export MIC_ENV_PREFIX=MIC
export MIC_USE_2MB_BUFFERS=16K
export MIC_OMP_NUM_THREADS=240
export MIC_KMP_AFFINITY=compact,granularity=fine
export MIC_KMP_PLACE_THREADS=60C,4t
● Work division settings are just hints to the MKL runtime
● Threading control tips:
Prevent thread migration on the host using:
export KMP_AFFINITY=granularity=fine,compact,1,0
28.9.2016 Intel Xeon Phi Programming
Native Execution
● In order to use Intel MKL in a native application, an
additional argument -mkl is required with the compiler
option -mmic.
● Native applications with Intel MKL functions operate
just like native applications with user-defined
functions.
● $ icc -O3 -mmic -mkl sgemm.c -o sgemm.exe
28.9.2016 Intel Xeon Phi Programming
Compile to use Intel MKL
● Compile using the -mkl flag
-mkl=parallel (default) for parallel execution
-mkl=sequential for sequential execution
● AO: built the same way as code for the Xeon host:
user@host $ icc -O3 -mkl sgemm.c -o sgemm.exe
● Native: additionally use -mmic
user@host $ icc -mmic -mkl myProgram.c -o myExec.mic
● MKL can also be used in native mode if compiled with -mmic
28.9.2016 Intel Xeon Phi Programming
More Code Examples
● $MKLROOT/examples/examples_mic.tgz
sgemm SGEMM example
sgemm_f SGEMM example (Fortran 90)
fft complex-to-complex 1D FFT
solverc Pardiso examples
sgaussian single precision Gaussian RNG
dgaussian double precision Gaussian RNG
...
28.9.2016 Intel Xeon Phi Programming
Which Model to Choose
● Native execution for
highly parallel code
using coprocessors as independent compute nodes
● AO if
a sufficiently high FLOP/Byte ratio (arithmetic intensity) makes offload beneficial
using Level-3 BLAS functions: GEMM, TRMM, TRSM
● CAO if
there is enough computation to offset the data transfer
overhead
transferred data can be reused by multiple operations
https://software.intel.com/en-us/articles/recommendations-to-
choose-the-right-mkl-usage-model-for-xeon-phi
28.9.2016 Intel Xeon Phi Programming
Memory Allocation: Data Alignment
Compiler-assisted offload
- Memory alignment is inherited from host!
General memory alignment (SIMD vectorisation)
- Align buffers (leading dimension) to a multiple of
vector width (64 Byte)
- mkl_malloc, _mm_malloc (_aligned_malloc),
tbb::scalable_aligned_malloc, …
28.9.2016 Intel Xeon Phi Programming
Memory Allocation: Data Alignment
void *darray;
int workspace;
int alignment = 64;
...
darray = mkl_malloc(sizeof(double) * workspace, alignment);
...
mkl_free(darray);
28.9.2016 Intel Xeon Phi Programming
Memory Allocation: Page Size
Performance of many Intel MKL routines improves when input and
output data reside in memory allocated with 2 MB pages:
- address more memory with fewer pages
- reduce the overhead of translating between host and MIC address
spaces
# Allocate all pointer-based variables with run-time
# length > 64 KB in 2 MB pages:
$ export MIC_USE_2MB_BUFFERS=64K
28.9.2016 Intel Xeon Phi Programming
Environment Settings
Native:
KMP_AFFINITY=balanced
OMP_NUM_THREADS=244
Compiler-Assisted Offload:
MIC_ENV_PREFIX=MIC
MIC_KMP_AFFINITY=balanced
MIC_OMP_NUM_THREADS=240
MIC_USE_2MB_BUFFERS=64K
Automatic Offload:
MKL_MIC_ENABLE=1
OFFLOAD_DEVICES=<list>
MKL_MIC_MAX_MEMORY=2GB
MIC_ENV_PREFIX=MIC
MIC_OMP_NUM_THREADS=240
MIC_KMP_AFFINITY=balanced
+ Compiler-Assisted Offload:
OFFLOAD_ENABLE_ORSL=1
https://software.intel.com/en-us/articles/performance-tips-of-using-intel-mkl-on-intel-xeon-phi-coprocessor
28.9.2016 Intel Xeon Phi Programming
Environment Settings: Affinity
KMP_AFFINITY=
- Host: e.g., compact,1
- Coprocessor: balanced
MIC_ENV_PREFIX=MIC; MIC_KMP_AFFINITY=
- Coprocessor (CAO): balanced
KMP_PLACE_THREADS
- Note: does not replace KMP_AFFINITY
- Helps to set/achieve pinning on, e.g., 60 cores with 3 threads
each (see the sketch below)
kmp_* (or mkl_*) functions take precedence over corresponding env.
variables
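A sketch combining these settings for the coprocessor side of a CAO/AO run
(60 cores with 3 threads each; the values are only examples):

export MIC_ENV_PREFIX=MIC
export MIC_KMP_PLACE_THREADS=60c,3t
export MIC_KMP_AFFINITY=balanced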
28.9.2016 Intel Xeon Phi Programming
More MKL Documentation
● https://software.intel.com/en-us/node/528430
● https://www.nersc.gov/assets/MKL_for_MIC.pdf
● https://software.intel.com/en-us/articles/intel-mkl-on-
the-intel-xeon-phi-coprocessors
● Intel Many Integrated Core Community website:
https://software.intel.com/en-us/mic-developer
● Intel MKL forum
https://software.intel.com/en-us/forums/intel-math-
kernel-library
28.9.2016 Intel Xeon Phi Programming
Xeon Phi Hardware (Recap.)
● 60 in-order cores, ring interconnect
● Scalar unit based on Intel Pentium P54C
– 64-bit addressing mode
● Vector unit added
– 512 bit (64 byte) vector instructions (IMCI)
● 4 hardware threads per core
– Each thread issues instructions in turn
– Round-robin execution hides latency
● Conclusion: need to fully utilise the vector units to achieve performance close to peak
28.9.2016 Intel Xeon Phi Programming
Performance
28.9.2016 Intel Xeon Phi Programming
● Sandy-Bridge-EP: 2 sockets ×8 cores @ 2.7 GHz.
● Xeon Phi: 60 cores @1.0 GHz.
● # cycles/s:
SandyBridge: 4.3E10 cycles/s.
Xeon Phi: 6.0E10 cycles /s.
● DP FLOP/s:
SandyBridge: 2 sockets × 8 cores × 2.7 GHz × 4
(SIMD) × 2 (ALUs) = 345.6 GFLOP/s
Xeon Phi: 60 cores × 1 GHz × 8 (SIMD) × 2 (FMA) =
960 GFLOP/s → Factor 2.7
The Intel MIC Architecture: VPU
Vector Processing Unit (VPU)
● The VPU includes the EMU (Extended Math Unit)
and executes 16 single-precision floating point, 16
32-bit integer operations or 8 double-precision
floating point operations per cycle. Each operation
can be a FMA, giving 32 single-precision or 16
double-precision floating-point operations per cycle.
● Contains the vector register file: 32 512-bit wide
registers per thread context, each register can hold
16 singles or 8 doubles.
● Most vector instructions have a 4-clock latency with a
1 clock throughput.
28.9.2016 Intel Xeon Phi Programming
SIMD Fused Multiply Add (FMA)
28.9.2016 Intel Xeon Phi Programming
vfmadd213ps source1, source2, source3
(the 213 form computes source1 = source1 × source2 + source3, element-wise on packed single-precision values)
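As an illustration (not from the slides), a simple multiply-add loop like the
following is typically compiled into packed vfmadd* instructions on the MIC:

/* each iteration is one multiply-add; the Intel compiler vectorises this
   into FMA vector instructions */
void saxpy(float a, const float *x, float *y, int n)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}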
Vectorisation
Vectorisation: Most important to get performance on Xeon Phi
● Use Intel options -vec-report, -vec-report2, -vec-report3 to
show information about vectorisation
● Use Intel option -guide-vec to get tips on improvement.
● Prefer SoA over AoS (AoS is good for encapsulation but bad
for vector processing) - see the sketch below
● Help the compiler with Intel Pragmas
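A minimal sketch of the two layouts (illustrative only):

#define N 1024
/* Array of Structures (AoS): the x, y, z of one point are adjacent, so a loop
   over all x values is a strided access - convenient, but bad for SIMD */
struct PointAoS { float x, y, z; };
struct PointAoS points_aos[N];
/* Structure of Arrays (SoA): all x values are contiguous, so loops over x
   vectorise with unit stride */
struct PointsSoA { float x[N], y[N], z[N]; };
struct PointsSoA points_soa;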
28.9.2016 Intel Xeon Phi Programming
Vectorisation
● Good news:
Compiler can vectorise code for you in many cases
● Bad news:
That sometimes doesn’t work perfectly and the
compiler may need your assistance
Other option: explicit vector programming
28.9.2016 Intel Xeon Phi Programming
Vectorisation: Approaches
● Let the compiler do the job: auto-vectorisation
Compiler might need your help: annotations/pragmas
● Explicit vector programming:
Use vector classes or array notation
Not fully general: restricted to C++ and Cilk Plus
● Intrinsics / assembly
Full control but low level
● Also important for successful vectorisation
Data alignment
Prefetching
28.9.2016 Intel Xeon Phi Programming
Auto-vectorisation (Intel compiler)
● The vectoriser for MIC works just like for the host
Enabled by default at optimisation level -O2 and above
Data alignment should be 64 bytes instead of 16
More loops can be vectorised, because of masked vector
instructions, gather/scatter and fused multiply-add (FMA)
Try to avoid 64 bit integers (except as addresses)
● Vectorised loops may be recognised by
Vectorisation and optimisation reports (recommended)
-qopt-report=2 -qopt-report-phase=vec
Unmasked vector instructions
Gather and scatter instructions
Math library calls to libsvml
28.9.2016 Intel Xeon Phi Programming
Vectorisation Report
By default, both host and target compilations may generate
messages for the same loop, e.g.:
host:~/> icc -qopt-report=2 -qopt-report-phase=vec test_vec.c
test_vec.c(10): (col. 1) remark: LOOP WAS VECTORIZED.
test_vec.c(10): (col. 1) remark: *MIC* LOOP WAS VECTORIZED.
To get a vectorisation report for the offload target compilation, but not for
the host compilation:
host:~/> icc -qopt-report=2 -qopt-report-phase=vec -qoffload-option,mic,compiler,"-qopt-report=2" test_vec.c
test_vec.c(10): (col. 1) remark: *MIC* LOOP WAS VECTORIZED.
test_vec.c(20): (col. 1) remark: *MIC* loop was not vectorized: existence of vector dependence.
test_vec.c(20): (col. 1) remark: *MIC* PARTIAL LOOP WAS VECTORIZED.
28.9.2016 Intel Xeon Phi Programming
Common Compiler Messages
● “Loop was not vectorized” because
“Low trip count”
“Existence of vector dependence”
Possible dependence of one loop iteration on another, e.g.:
for (j=n; j<MAX; j++) {
    a[j] = a[j] + c * a[j-n];
}
“vectorization possible but seems inefficient”
“Not inner loop”
● It may be possible to overcome these using switches, pragmas, or
source code changes
28.9.2016 Intel Xeon Phi Programming
Intel-Specific Vectorisation Pragmas
● #pragma ivdep: Instructs the compiler to ignore
assumed vector dependencies.
● #pragma loop_count: Specifies the iterations for the
for loop.
● #pragma novector: Specifies that the loop should
never be vectorized.
● #pragma omp simd: Transforms the loop into a loop
that will be executed concurrently using Single
Instruction Multiple Data (SIMD) instructions.
(OpenMP 4.0)
28.9.2016 Intel Xeon Phi Programming
#pragma vector
28.9.2016 Intel Xeon Phi Programming
always: instructs the compiler to override any efficiency heuristic during the decision to vectorize or
not, and to vectorize non-unit strides or very unaligned memory accesses; controls the
vectorization of the subsequent loop in the program; optionally takes the keyword assert
aligned: instructs the compiler to use aligned data movement instructions for all array references
when vectorizing
unaligned: instructs the compiler to use unaligned data movement instructions for all array references
when vectorizing
nontemporal: directs the compiler to use non-temporal (that is, streaming) stores on systems based on
all supported architectures, unless otherwise specified; optionally takes a comma-separated
list of variables.
On systems based on Intel® MIC Architecture, directs the compiler to generate clevict
(cache-line-evict) instructions after the stores based on the non-temporal pragma when the
compiler knows that the store addresses are aligned; optionally takes a comma-separated
list of variables
temporal: directs the compiler to use temporal (that is, non-streaming) stores on systems based on
all supported architectures, unless otherwise specified
vecremainder: instructs the compiler to vectorize the remainder loop when the original loop is vectorized
novecremainder: instructs the compiler not to vectorize the remainder loop when the original loop is
vectorized
Example for Vectorisation Pragmas
#pragma offload target(mic) in(a,b:length(n*n)) inout(c:length(n*n))
{
    #pragma omp parallel for
    for( i = 0; i < n; i++ ) {
        for( k = 0; k < n; k++ ) {
            #pragma vector aligned
            #pragma ivdep
            for( j = 0; j < n; j++ ) {
                // c[i][j] = c[i][j] + a[i][k]*b[k][j];
                c[i*n+j] = c[i*n+j] + a[i*n+k]*b[k*n+j];
            }
        }
    }
}
28.9.2016 Intel Xeon Phi Programming
#pragma simd
● The simd pragma is used to guide the compiler to
vectorize more loops. Vectorization using the simd
pragma complements (but does not replace) the fully
automatic approach.
● Without explicit vectorlength() and vectorlengthfor()
clauses, the compiler will choose a vector length using its
own cost model. Misclassification of variables into
private, firstprivate, lastprivate, linear, and reduction,
or lack of appropriate classification of variables, may
lead to unintended consequences such as runtime
failures and/or incorrect results.
28.9.2016 Intel Xeon Phi Programming
#pragma simd
void add_floats(float *a, float *b, float *c, float *d, float *e, int n)
{
int i;
#pragma simd
for (i=0; i<n; i++){
a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
}
}
28.9.2016 Intel Xeon Phi Programming
Note: the function uses too many unknown pointers for the compiler's
automatic runtime independence check optimization to kick in, hence
the explicit #pragma simd.
IMCI Instruction Set
● IMCI: Initial Many-Core instruction set
IMCI is not SSE/AVX!
28.9.2016 Intel Xeon Phi Programming
SSE2 intrinsics:
for (int i=0; i<n; i+=4) {
    __m128 Avec=_mm_load_ps(A+i);
    __m128 Bvec=_mm_load_ps(B+i);
    Avec=_mm_add_ps(Avec, Bvec);
    _mm_store_ps(A+i, Avec);
}
IMCI intrinsics:
for (int i=0; i<n; i+=16) {
    __m512 Avec=_mm512_load_ps(A+i);
    __m512 Bvec=_mm512_load_ps(B+i);
    Avec=_mm512_add_ps(Avec, Bvec);
    _mm512_store_ps(A+i, Avec);
}
Features of the IMCI Instruction Set
● Fused Multiply-Add (FMA) instruction support.
● Gather and Scatter instructions:
copy non-contiguous data from MEM to SIMD registers
(gather) or from SIMD registers to MEM (scatter).
● Swizzle and Permute instructions:
swizzle: rearranges elements within each 128-bit block
permute: rearranges the 128-bit blocks according to
patterns specified by the user.
● Bitmasked operations: control which of the elements
in the resulting vector are modified / preserved (see the sketch below).
● Reduction (sum, product), min/max operations.
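As an illustration (not from the slides), a bitmasked add with intrinsics;
lanes whose mask bit is 0 keep the value of the pass-through operand:

#include <immintrin.h>
/* add only the lower 8 of the 16 float lanes; the upper 8 lanes of the
   result are taken unchanged from 'a' */
__m512 masked_add(__m512 a, __m512 b)
{
    __mmask16 m = 0x00FF;
    return _mm512_mask_add_ps(a, m, a, b);
}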
28.9.2016 Intel Xeon Phi Programming
Thread Affinity
● Pinning Threads is important!
● export KMP_AFFINITY="granularity=thread,x"
x=compact, scatter, balanced
● See Intel compiler Documentation.
28.9.2016 Intel Xeon Phi Programming
KMP_AFFINITY=granularity=thread,scatter
KMP_AFFINITY=granularity=thread,compact
Data Alignment
● Prerequisite for successful use of the SIMD units.
● A pointer p is said to address a memory location
aligned on an n-byte boundary if ((size_t)p%n==0).
● The memory address should be a multiple of the vector
register width in bytes, i.e.
SSE2: 16-Byte alignment
AVX: 32-Byte alignment
MIC: 64-Byte alignment
28.9.2016 Intel Xeon Phi Programming
Data Alignment
● Data alignment on the stack:
__declspec(align(64)) double data[N]; (ICC)
double data[N] __attribute__((aligned(64))); (ICC, GCC)
● Data alignment on the heap (ICC)
_mm_malloc/free functions (ICC)
#include <malloc.h>
double*A = (double*)_mm_malloc(N*sizeof(double), 64);
_mm_free(A);
posix_memalign (ICC, GCC)
ok=posix_memalign((void**)&a, 64, n*n*sizeof(double));
28.9.2016 Intel Xeon Phi Programming
Memory Bandwidth
● Sandy-Bridge:
2 sockets× 4 memory channels × 6.4 GT/s × 2
bytes per channel = 102.4 GB/s
● Xeon Phi:
8 memory controllers × 2 channels/controller × 6.0
GT/s × 4 bytes per channel = 384 GB/s → Factor 3.8
● For complicated memory access patterns, memory
latency / cache performance is important. The Xeon Phi
caches are less powerful than the Xeon caches (e.g. no L1
hardware prefetcher).
28.9.2016 Intel Xeon Phi Programming
Memory / Host Access
● When is Xeon Phi expected to deliver better
performance than the host:
1. Bandwidth-bound code: If memory access patterns
are streamlined so that application is limited by
memory bandwidth and not memory-latency bound.
2. Compute-bound code: high arithmetic intensity (#
operations per byte of memory transferred). For example, a
DAXPY update y[i] = a*x[i] + y[i] performs only 2 FLOPs per
24 bytes moved (≈ 0.08 FLOP/Byte) and is therefore
bandwidth-bound, whereas large GEMMs reuse data and reach a
much higher ratio.
3. Code should not be dominated by Host <-> MIC
communication limited by the slow PCIe v2 bandwidth of
6 GB/s.
28.9.2016 Intel Xeon Phi Programming
Prefetching
● Hardware Prefetching:
Intel Xeon processors: L1 and L2 hardware
prefetchers
Intel Xeon Phi: only L2 hardware prefetcher
● Software Prefetching:
Instructions that request that a cache line is fetched
from memory into the cache (see the sketch below).
Does not stall execution.
Prefetch distance = time between the prefetch
instruction and the instruction using the data
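An illustrative sketch (not from the slides) of compiler-directed software
prefetching with the Intel-specific prefetch pragma; the distance of 16
iterations ahead is just an example value:

/* request prefetches of a[] into the L1 cache (hint 0), 16 iterations ahead */
void scale(float *a, float s, float c, int n)
{
    int i;
    #pragma prefetch a:0:16
    for (i = 0; i < n; i++)
        a[i] = a[i] * s + c;
}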
28.9.2016 Intel Xeon Phi Programming
Prefetching
● If software prefetches are doing a good job, then
hardware prefetching does not kick in.
● In several workloads (such as stream), maximal
software prefetching gives the best performance.
● Any references not prefetched by compiler may get
prefetched by hardware.
● Details: Rakesh Krishnaiyer,
http://software.intel.com/sites/default/files/article/326703/5.3-
prefetching-on-mic-update.pdf
28.9.2016 Intel Xeon Phi Programming
Summary
● Concerning ease of use and programmability, Intel Xeon Phi is a
promising hardware architecture compared to other accelerators like
GPGPUs, FPGAs, the former CELL processors or ClearSpeed cards.
● Codes using MPI, OpenMP or MKL etc. can be ported quickly. Some
MKL routines have been highly optimised for the MIC.
● Due to the large SIMD width of 64 Bytes, vectorisation is even more
important for the MIC architecture than for Intel Xeon based systems.
● It is extremely simple to get a code running on Intel Xeon Phi, but
getting performance out of the chip in most cases requires manual tuning
of the code, e.g. when auto-vectorisation fails.
● MIC programming forces the programmer to think about SIMD
vectorisation
→ Performance on current and future Xeon based systems is also much
better with MIC-optimised code.
28.9.2016 Intel Xeon Phi Programming
Xeon Phi References
● Books:
James Reinders, James Jeffers, Intel Xeon Phi Coprocessor High
Performance Programming, Morgan Kaufmann Publishers Inc., 2013,
http://lotsofcores.com ; new KNL edition in July 2016
Rezaur Rahman: Intel Xeon Phi Coprocessor Architecture and
Tools: The Guide for Application Developers, Apress, 2013.
Parallel Programming and Optimization with Intel Xeon Phi
Coprocessors, Colfax 2013,
http://www.colfaxintl.com/nd/xeonphi/book.aspx
● Intel Xeon Phi Programming, Training material, CAPS
● Intel Training Material and Webinars
● V. Weinberg (Editor) et al., Best Practice Guide - Intel Xeon Phi,
http://www.prace-project.eu/Best-Practice-Guide-Intel-Xeon-Phi-
HTML and references therein
28.9.2016 Intel Xeon Phi Programming
Acknowledgements
● IT4Innovations, Ostrava
● Partnership for Advanced Computing in Europe (PRACE)
● Intel
● BMBF (Federal Ministry of Education and Research)
● Dr. Karl Fürlinger (LMU)
28.9.2016 Intel Xeon Phi Programming