parallel programming & intel hpc round table visitjanjust/presentations/pdp-groepsoverleg... ·...

J.J. KeijserNikhefAmsterdamCT/PDP

Parallel Programming & Intel HPC Round Table visit

Jan Just Keijser23 April 2014


PPCES 2014:◦ Parallel Programming for Computational

Engineering & Science◦ 10-14 March◦ RWTH Aachen

Intel EMEA HPC Roundtable◦ 8 – 9 April◦ Cambridge, UK◦ I survived sitting in a British car with Tristan

S. driving


Parallel Programming Course1.Introduction, Parallel Computing Architectures,

Serial Tuning.

2.Message Passing with MPI.

3.Shared Memory Programming with OpenMP

4.OpenMP Programming for the MIC Architecture.

5.GPGPU Programming with OpenACC


Day 1: serial tuningVery instructiveFound a bug in the Intel C++ 2014 compiler:

Good

10 times slower:


Day 2&3: MPI & OpenMPMPI a bit boring, although very similar to

OpenMPOpenMP 4.0 looks promising:

◦ Annotate C/C++ or FORTRAN with #pragma's to help parallellisation

◦ New #pragma's “target” and “simd” to explicitly select an offload target and vectorisation

Portland Group (PG) compiler can compile OpenMP 4.0 code to multiple targets:◦ CPU, Xeon Phi, CUDA (Nvidia GPU's),

OpenCL


Day 4: OpenMP on MICMIC = Intel Xeon PhiFirst time that I have seen code that runs

faster on a single Xeon Phi vs a dual Xeon E5 2697v2

Can use either Intel compiler or PG compiler

Again a bug discovered:◦ Code that is offloaded to a Xeon Phi runs

slower than code that is natively compiled for a Xeon Phi by a factor > 2


Day 5: OpenACCSimilar to OpenMPMeant as a open standard to be able to

annotate code to compile on GPU'sThus far, only commercial compilers support

this (SGI, Portland Group)


Intel HPC Round TableNDA meeting, so Intel folks could talk freelyLatest news from Intel on CPUs, Xeon Phi's,

storage, networkingSessions on success stories with Intel stuff The real reason to go is the ability to directly

talk to Intel developers◦ Talked to C++ compiler developer◦ Talked to Xeon Phi developer


News on next-gen Xeon PhiAlso in socket format – replaces a regular

XeonWill always use special memory for best

performance, so caching data locally remains vital

Based on Atom cores instead of current Pentium-M cores: x86_64 compatible


GPUs are excellent for performing the Same operation (Instruction) on

Multiple Data elements (SIMD)Vector processing, anyone?

Remember this slide?

From: Massively Parallel Computing with CUDA


Available GPUs @ NikhefArun: 2 x NVIDIA Tesla M2070

Pleedo: 2 x Intel Xeon Phi 5110P

Plofkip: 2 x NVIDIA Tesla M1060, 1 x NVIDIA GTX580, 1 x AMD Radeon HD7970 GHz

Nesse: Jos Vermaseren's Xeon Phi 7120P

Some engineering workstations now have AMD Firepro W5000's, which are actually quite nice


Which direction to take?OpenMP 4.0 looks promising

NVIDIA vs Intel vs AMD?◦ Most HPC centres use NVIDIA (cartesius)◦ However, look at raw performance:

parallel programming & intel hpc round table visitjanjust/presentations/pdp-groepsoverleg... ·...

Documents