
Page 1:

Parallel Computing and Intel® Xeon Phi™ coprocessors

Warren He Ph.D (何万青) [email protected]

Technical Computing Solution Architect

Datacenter & Connected System Group

Page 2:

Transition from Multicore to Manycore

Page 3:

Agenda

• Why Many-Core?

• Intel Xeon Phi Family & ecosystem

• Intel Xeon Phi HW and SW architecture

• HPC design and development considerations for Xeon Phi

• Case Studies & Demo

Page 4:

Scaling workloads for Intel® Xeon Phi™ coprocessors

• Parallelize

• Vectorize

• Scale to manycore

[Figure: performance as a function of the fraction of parallel code and the % of vectorized code]

* Theoretical acceleration of a highly parallel processor over an Intel® Xeon® parallel processor

Page 5:

Intel® Xeon Phi™ Coprocessor

All SKUs, pricing and features are PRELIMINARY and are subject to change without notice

[Block diagram: many VECTOR IA CORES, each with a COHERENT CACHE, connected by an INTERPROCESSOR NETWORK, plus FIXED FUNCTION LOGIC, MEMORY and I/O INTERFACES, PCIe, and GDDR5]

Notice: This document contains information on products in the design phase of development. The information here is subject to change without notice. Do not finalize a design with this information. Contact your local Intel sales office or your distributor to obtain the latest specification before placing your product order. Knights Corner and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user. All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

Silicon cores: 57-61

Silicon max frequency: 1.05-1.25 GHz

Double precision peak performance: 1003-1220 GFLOPS

GDDR5 devices: 24-32

Memory capacity: 6-8 GB

Memory channels: 12-16

Memory BW peak: 240-352 GB/s

Total cache: 28.5-30.5 MB

Board TDP: 225-300 Watts

Form factors: refer to picture: passive, active, dense form factor (DFF), no thermal solution (NTS)

PCIe: Gen 2 x16, I/O speeds up to: KNC→Host 7.0 GB/s; Host→KNC 6.7 GB/s

Misc.: product codename Knights Corner; based on 22nm process; ECC enabled; Turbo not currently available

Power/Mgmt: future node manager support; IPMI, including Intel Coprocessor Communications Link

Tools: Intel C++/C/Fortran compilers, Intel Math Kernel Library; debuggers, performance and correctness analysis tools; OpenMP, MPI, OFED messaging infrastructure (Linux only), OpenCL; programming models: offload, native, and mixed offload+native

OS: RHEL 6.x, SuSE 11.1, Microsoft Windows 8 Server, Win 7, Win 8

Page 6:

Intel® Many Integrated Core Architecture: Processor Numbering Overview

Example: Intel® Xeon Phi™ coprocessor 7110P

• Brand: Intel® Xeon Phi™ coprocessor (designates the family of products)

• Performance shelf (first digit): indication of relative pricing and performance within a generation
  7 Best performance
  5 Best performance/watt
  3 Best value

• Generation (second digit):
  1 Knights Corner
  2 Knights Landing

• SKU number (last two digits)

• Suffix (where applicable, may denote form factor, thermal solution, usage model):
  P Passive thermal solution
  A Active thermal solution
  X No thermal solution
  D Dense form factor


Page 7:

Page 8:

Software Development Ecosystem¹ for Intel® Xeon Phi™ Coprocessor

• Compilers, run environments. Open source: gcc (kernel build only, not for applications), python*. Commercial: Intel® C++ Compiler, Intel® Fortran Compiler, MYO, CAPS* HMPP* 3.2.5 (Beta) compiler, PGI*, PGAS GPI (Fraunhofer ITWM*), ISPC

• Debugger. Open source: gdb. Commercial: Intel Debugger, RogueWave* TotalView* 8.9, Allinea* DDT 3.3

• Libraries. Open source: TBB², MPICH2 1.5, FFTW, NetCDF. Commercial: NAG*, Intel® Math Kernel Library, Intel® MPI Library, Intel® OpenMP*, Intel® Cilk™ Plus (in Intel compilers), MAGMA*, Accelereyes* ArrayFire 2.0 (Beta), Boost C++ Libraries 1.47+

• Profiling & analysis tools: Intel® VTune™ Amplifier XE, Intel® Trace Analyzer & Collector, Intel® Inspector XE, Intel® Advisor XE, TAU* – ParaTools* 2.21.4

• Virtualization: ScaleMP* vSMP Foundation 5.0, Xen* 4.1.2+

• Cluster, workload management, and manageability tools: Altair* PBS Professional 11.2, Adaptive* Computing Moab 7.2, Bright* Cluster Manager 6.1 (Beta), ET International* SWARM (Beta), IBM* Platform Computing* {LSF 8.3, HPC 3.2 and PCM 3.2}, MPICH2, Univa* Grid Engine 8.1.3

¹These are all announced as of November 2012. Intel has said more are actively being developed but not yet announced. ²Commercial support of Intel TBB available from Intel.


Page 9:

Technical Compute Target Market Segments & Applications (mix of ISV and end-user development)

Sub-segment and its applications/workloads that are Intel® Xeon Phi™ candidates:

• Public sector: HPL, HPCC, NPB, LAMMPS, QCD

• Energy (including oil & gas): RTM (Reverse Time Migration), WEM (Wave Equation Migration)

• Climate modeling & weather simulation: WRF, HOMME

• Financial analysis: Monte Carlo, Black-Scholes, binomial model, Heston model

• Life sciences (molecular dynamics, gene sequencing, biochemistry): LAMMPS, NAMD, AMBER, HMMER, BLAST, QCD

• Manufacturing (CAD/CAM/CAE/CFD/EDA): implicit, explicit solvers

• Digital content creation: ray tracing, animation, effects

• Software ecosystem: tools, middleware

Page 10:

Agenda

• Why Many-Core?

• Intel Xeon Phi Family & ecosystem

• Intel Xeon Phi HW and SW architecture

• HPC design and development considerations for Xeon Phi

• Case Studies & Demo

Page 11:

Intel® Xeon Phi™ coprocessor codenamed “Knights Corner” Block Diagram

• Cores: lightweight, power efficient, in-order, support 4 hardware threads

• Vector Processing Unit (VPU) on each core, 32 512-bit SIMD registers

• High-speed bi-directional ring interconnect

• Distributed tag directory enables faster snooping

Page 12:

Speeds and Feeds

Per-core caches reference info:

Type | Size | Ways | Set conflict | Location
L1I (instruction) | 32KB | 4 | 8KB | on-core
L1D (data) | 32KB | 8 | 4KB | on-core
L2 (unified) | 512KB | 8 | 64KB | connected via core/ring interface

Memory:

• 8 memory controllers, each supporting 2 32-bit channels

• GDDR5 channels theoretical peak of 5.5 GT/s

• Practical peak BW limited by ring and other factors

Page 13:

Speeds and Feeds, Part 2

Per-core TLBs reference info:

Type | Entries | Page size | Coverage
L1 Instruction | 32 | 4KB | 128KB
L1 Data | 64 | 4KB | 256KB
L1 Data | 32 | 64KB | 2MB
L1 Data | 8 | 2MB | 16MB
L2 | 64 | 4KB, 64KB, or 2MB | up to 128MB

• Linux support for 64K pages on Intel64 not yet available

Page 14:

Knights Corner Core

X86 specific logic < 2% of core + L2 area


Page 15:

Knights Corner Silicon Core Architecture

[Block diagram: Instruction Decode feeding a Scalar Unit with Scalar Registers and a Vector Unit with Vector Registers; L1 Cache (data and instruction); L2 Cache; connection to the Interprocessor Network]

Annotations:

• In-order, 64-bit, 4 threads per core, fully coherent

• 32 KB L1 (data and instruction) per core

• 512-wide vector unit; specialized instructions (new encoding)

• VPU: integer, SP, DP; 3-operand, 16-wide instructions; HW transcendentals

• L2: 512 KB slice per core with fast access to the local copy; L2 hardware prefetching

• Improved ring interconnect delivering 3x bandwidth over the Aubrey Isle silicon ring

Page 16:

Vector Processing Unit

[Pipeline diagram: scalar pipeline stages PPF PF D0 D1 D2 E WB; vector pipeline stages VC1 VC2 V1-V4 WB]

• Vector ALUs: 16-wide x 32-bit and 8-wide x 64-bit

• Fused multiply-add

• Decode feeds the VPU register file (3 reads, 1 write) and a mask register file

• Load/store with scatter/gather support, plus the EMU (extended math unit)

Page 17:

Peak performance = clock frequency * number of cores * 8 lanes (DP) * 2 (FMA) FLOPS/cycle

e.g. 1.091 GHz * 61 cores * 8 lanes * 2 = 1064.8 GFLOPS

Page 18:

Wider vectorization opportunity, illustrated by the number of 32-bit elements processed by one "packed" instruction:

• Intel® Pentium® processor (1993): 1

• MMX™ (1997): 2

• Intel® Streaming SIMD Extensions (Intel® SSE in 1999 to Intel® SSE4.2 in 2008): 4

• Intel® Advanced Vector Extensions (Intel® AVX in 2011 and Intel® AVX2 in 2013): 8

• Intel® Xeon Phi™ coprocessor (2012): 16

Effective use of the ever-increasing number of vector elements and new instruction sets is vital to Intel's success, and multi-core and many-core parallelism comes on top of this.

Page 19:

Knights Corner PCI Express Card Block Diagram


Page 20:

Intel® Xeon Phi™ Coprocessors Software Architecture (MPSS)

Page 21:

Intel® Xeon Phi™ coprocessors: they're so much more. General-purpose IA hardware leads to less idle time for your investment.

Intel® Xeon Phi™ coprocessor:

• Operate as a compute node

• Run a full OS

• Program to MPI

• Run x86 code

• Run offloaded code

• It's a supercomputer on a chip

Custom HW acceleration (GPU, ASIC, FPGA):

• Run restricted code

• Restrictive architectures limit the ability of applications to use arbitrary nested parallelism, function calls and threading models

Source: Intel estimates

Page 22:

MIC Software Stack

[Diagram: on both the HOST and the MIC, user mode holds the offload tools/applications and the offload library (MYO, COI) layered on SCIF; kernel mode holds the Linux SCIF driver; the two sides communicate over the PCI-E communications HW]

Page 23:

Spectrum of Execution Models of Heterogeneous Computing on Xeon and Xeon Phi Coprocessor

From CPU-centric to Intel® MIC-centric, productive programming models across the spectrum:

• Multi-core hosted (Intel® Xeon processor only): general-purpose serial and parallel computing

• Offload (main() on the host, foo() offloaded over PCIe): codes with highly parallel phases

• Symmetric (MPI ranks on both multi-core and many-core): codes with balanced needs

• Reverse offload (main() on the coprocessor, parts offloaded back to the host): codes with serial phases; not currently supported by the Language Extensions for Offload in the Intel tools

• Many-core hosted (Intel® MIC only): highly parallel codes

All of the other models are supported with the Intel tools.
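To make the symmetric model concrete, here is a minimal sketch (added for illustration, not part of the original deck) of an MPI program whose ranks can run on both the Xeon host and the Xeon Phi card; with Intel MPI it is typically built twice, once for the host and once with -mmic, and ranks are launched on both sides.

/* hello_sym.c: each rank reports where it runs (host node or micN). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("rank %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

The exact launch arguments for placing ranks on the coprocessor depend on the MPI library and the cluster setup.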

Page 24:

MPSS Setup Step by Step

Page 25:


Tools for both OEMs and end users:

• micsmc: a control panel application for system management and viewing configuration status information specific to SMC operations on one or more MIC cards in a single platform.

• micflash: a command line utility to update both the MIC SPI flash and the SMC firmware, as well as grab various information about the flash from either the ROM on the MIC card or a ROM file.

• micinfo: a command line utility that allows the OEM and the end user to get various information (version data, memory info, bus speeds, etc.) from one or more MIC cards in a single platform.

• hwcfg: a command line utility that can be used to view and configure various hardware aspects of one or more MIC cards in a single platform.

Tools for OEMs only:

• micdiag: a command line utility that performs various diagnostics on one or more MIC cards in a single platform.

• micerrinj: a command line utility that allows OEMs to inject one or more ECC errors (correctable, uncorrectable or fatal) into the GDDR on a MIC card to test platform error handling capabilities.

• KncPtuGen: a power/thermal command line utility that can be used to put various loads on one or more individual cores on a MIC card to simulate compromised performance on specific cores.

• micprun (OEM workloads): a command line utility that provides an environment for executing various performance workloads (provided by Intel), allowing the OEM to test their platform's performance over time and compare it to Intel's "Gold" configuration in the factory.

Page 26:

‘micsmc’ – MIC SMC Control Panel

• What does it do?

– A graphical display of SMC status items, such as temperature, memory usage and core utilization for one or multiple MIC cards on a single platform.

– Command Line Parameters available for scripting purposes.

• What does it look like?


Page 27:

‘micsmc’ Card View Windows


Utilization View

Core Histogram View

Historical Utilization View

Page 28:

‘micprun’ – MIC Performance Tool

• What can it do?

– This is actually a wrapper that can be used to execute the OEM workloads.

– Neatly wraps up many of the parameters specific to the workload in addition to setting up workload comparisons to previous executions and even to Intel “Gold” system configurations.

• What does it look like?

– ./micprun [-k workload] [-p workloadParams] [-c category] [-x method] [-v level] [-o output] [-t tag] [-r pickle]

– Workload (-k for kernel) specifies which workload to run... SGEMM, DGEMM, etc.

– Categories (-c) include “optimal”, “scaling”, “test”.

– Methods (-x) include “native”, “scif”, “coi” and “myo”.

– Levels (-v) specify the level of detail in the output.

– Output (-o) specifies to create a “pickle” file for future comparisons (use Tag (-t) to specify custom name).

– Pickle (-r) is used to compare the current workload execution to a previously saved workload execution and execute the current run exactly the same as the specified pickle file.


Page 29:

‘micprun’ continued…

• Show me some examples!

– Run all of the native workloads with optimal parameters:

– ./micprun

– ./micprun -k all -c optimal -x native -v 0

– Run SGEMM in scaling mode, and save to a pickle file:

– ./micprun -k sgemm -c scaling -o . -t mySgemm

– Run a comparison to the above example and create a graph:

– ./micprun -r micp_run_mySgemm -v 2

– View the SGEMM workload parameters:

– ./micprun -k sgemm --help

– Run SGEMM on only 4 cores:

– ./micprun -k sgemm -p '--num_core 4'


Page 30:

Agenda

• Why Many-Core?

• Intel Xeon Phi Family & ecosystem

• Intel Xeon Phi HW and SW architecture

• HPC design and development considerations for Xeon Phi

• Case Studies & Demo

Page 31:

Considerations for Application Development on MIC

Scientific parallel application considerations: dataset size, grid shape, workload, algorithm implementation, data access patterns, SPMD, loop structure, fork/join, load balancing, communication.

Page 32:

Considerations in Selecting Workload Size

Applicability of Amdahl's law:

• Speedup = (s + p) / (s + p/N)   ...(1), where s is the serial fraction, p the parallel fraction, and N the number of parallel processors

• Normalizing by the single-processor runtime, s + p = 1, so Speedup = 1 / (s + p/N)

• This indicates that the serial portion s will dominate on massively parallel hardware like Intel® Xeon Phi™, limiting the advantage of the many-core hardware

Ref: "Scientific Parallel Computing", L. Ridgway Scott et al.
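Worked example (added for illustration, not from the original slide): with a serial fraction s = 0.05 on N = 61 cores, Speedup = 1/(0.05 + 0.95/61) ≈ 15.2, and no matter how many cores are added the speedup stays below 1/s = 20.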

Page 33:

Scaled vs. Fixed Size Speedup

Amdahl’s Law Implication

Amdahl’s Law considered harmful in designing parallel architecture.

Page 34:

Gustafson's [1] observation:

• For scientific computing problems, problem size increases with parallel computing power

• The parallel component of the problem scales with problem size

• Ask "how long would a given parallel program have taken on a serial processor?"

• Scaled Speedup = (s' + p'N) / (s' + p'), where s' and p' are the serial and parallel time fractions spent on the parallel system, with s' + p' = 1

• Scaled Speedup = s' + (1 - s')N = N + (1 - N)s'   ...(2)

When speedup is measured by scaling the problem size, the scalar fraction tends to shrink as more processors are added.
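Worked example (added for illustration, not from the original slide): with the same serial fraction s' = 0.05 measured on the parallel system and N = 61 processors, Scaled Speedup = 61 + (1 - 61)(0.05) = 58, close to linear, because the parallel part of the problem has grown with the machine.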

Page 35:

Gustafson’s Law

Scaled Speedup = s' + (1 - s')N = N + (1 - N)s'. When speedup is measured by scaling the problem size, the scalar fraction tends to shrink as more processors are added.

Fig: John L Gustafson, “Reevaluating Amdahl’s Law”

Page 36:

Intel® Xeon Phi™ Coprocessors Workload Suitability

Representative example workload for illustration purposes only.

[Figure: performance vs. number of threads, showing the multi-core peak and the much higher many-core peak an application can reach if it keeps scaling]

Decision flow:

• Can your workload scale to over 100 threads?

• Can your workload benefit from large vectors?

• Can your workload benefit from more memory bandwidth?

If the application scales with threads and with vectors or memory BW, it is a candidate for Intel® Xeon Phi™ coprocessors. (An animation video on exploiting parallelism on Intel® Xeon Phi™ coprocessors accompanies this slide.)

Page 37:

A picture speaks a thousand words

Page 38:

The Keys to Take Advantages of MIC

• Choose the right Xeon-centric or MIC-centric model for your application

• Vectorize your application

– Use Intel’s vectorizing compiler

• Parallelize your application

– With MPI (or other multiprocess model)

– With threads (via Pthreads, TBB, Cilk Plus, OpenMP, etc.)

• Consider standard optimization techniques such as tiling, overlapping computation and communication, etc.
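As a minimal sketch of these keys combined (added for illustration; it assumes the Intel compiler's offload pragmas and OpenMP, and the array names are illustrative, not from the deck):

/* saxpy_offload.c: offload a threaded, vectorizable loop to the coprocessor. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1000000;
    float *a = malloc(n * sizeof(float));
    float *b = malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) a[i] = (float)i;

    /* Offload the loop to MIC card 0; in/out clauses move the data over PCIe. */
    #pragma offload target(mic:0) in(a:length(n)) out(b:length(n))
    {
        /* Threads across the ~60 cores; the inner loop is left to the
           compiler's auto-vectorizer. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            b[i] = 2.0f * a[i] + 1.0f;
    }

    printf("b[10] = %f\n", b[10]);
    free(a); free(b);
    return 0;
}

Behavior without a card depends on compiler options, so treat this purely as a sketch of the offload model.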

Page 39:

Intel® Xeon Phi™ SW Imperatives

• For the Intel® Xeon Phi™ program to succeed, think differently and guide programmers to think differently.

[Figure: two speed-over-time curves. A naïve port followed by a long optimization phase is the losing strategy; an intelligent port followed by a shorter optimization phase is the winning strategy.]

Intelligent application porting to Intel® Xeon Phi™ platforms requires parallelization and vectorization.

Page 40:

Algorithm and Data Structure

One needs to consider

• Type of Parallel Algorithm to use for the problem

– Look carefully at every sequential part of the algorithm

• Memory Access Patterns – Input and output data set

• Communication Between Threads

• Work division between threads/processes – Load Balancing

– Granularity

Page 41:

Parallelism Pattern For Xeon Phi™

Page 42:

Memory Access Patterns

• The algorithm determines the data access patterns.

• Data access patterns can be categorized into:

  – Arrays/linear data structures: conducive to low latency and high bandwidth; look out for unaligned access

  – Random access: try to avoid; needs prefetching if possible; there are not as many threads to hide latency as in a Tesla implementation

  – Recursive (tree)

• Data decomposition:

  – Geometric decomposition: regular or irregular

  – Recursive decomposition

Page 43:

Vectorization methodologies

• Auto-vectorization may be improved by additional pragmas like #pragma ivdep

• A more aggressive pragma is #pragma simd: it uses less strict heuristics and may produce wrong code, so results must be checked carefully

• Experts may use compiler intrinsics. The caveat is the loss of portability; a portable layer between the intrinsics and the application may help

Page 44:

The compiler has to assume the worst case that the language/flags allow.

• float *A; void vectorize() { int i; for (i = 0; i < 102400; i++) { A[i] *= 2.0f; } }

• Loop body: load of A, load of A[i], multiply with 2.0f, store of A[i]

Q: Will the store modify A?

A: Maybe. "NO" is needed to make vectorization legal. Ways to tell the compiler so:

3) Recompile with -ansi-alias

• icc -vec-report1 -ansi-alias test1.c

– test1.c(4): (col. 3) remark: LOOP WAS VECTORIZED.

4) Change "float *A" to "float *restrict A".

• icc -vec-report1 test1a.c

– test1a.c(4): (col. 3) remark: LOOP WAS VECTORIZED.

5) Add "#pragma ivdep" to the loop.

• icc -vec-report1 test1b.c

– test1b.c(5): (col. 3) remark: LOOP WAS VECTORIZED.
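A self-contained version of option 4 (a hypothetical test1a.c, added for illustration) that can be compiled as shown above:

/* test1a.c: restrict promises the compiler that A is not aliased,
   so the store to A[i] cannot modify the pointer A itself. */
#include <stdlib.h>

float *restrict A;

void vectorize(void)
{
    for (int i = 0; i < 102400; i++)
        A[i] *= 2.0f;            /* now legal to vectorize */
}

int main(void)
{
    A = malloc(102400 * sizeof(float));
    vectorize();
    free(A);
    return 0;
}

With icc, the restrict keyword requires C99 mode or the -restrict switch.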

Page 45:

Writing Explicit Vector Code with Intel® Cilk™ Plus

• Baseline: float *A; void vectorize() { for (int i = 0; i < 102400; i++) { A[i] *= 2.0f; } }

8) Using the SIMD pragma

• float *A; void vectorize() { #pragma simd vectorlength(4) for (int i = 0; i < 102400; i++) { A[i] *= 2.0f; } }

9) Using array notation

• float *A; void vectorize() { A[0:102400] *= 2.0f; }

10) Using an elemental function

• float *A; __declspec(noinline) __declspec(vector(uniform(p), linear(i))) void mul(float *p, int i) { p[i] *= 2.0f; } void vectorize() { for (int i = 0; i < 102400; i++) { mul(A, i); } }

Page 46:

Memory Accesses and Alignment

• Memory access patterns:

  – Unit-stride: A[i], A[i+1], A[i+2], A[i+3], ...

  – Strided (a special form of gather/scatter): A[2*i], A[2*(i+1)], A[2*(i+2)], ...

  – Gather/scatter: A[B[i]], A[B[i+1]], A[B[i+2]], ...

  If you write in Cilk™ Plus array notation, access patterns are obvious at a glance: unit-stride means A[:] or A[lb:#elems], which helps you think more clearly.

• Alignment optimization (SIZE is 64B for Xeon Phi™, 32B for AVX1/2, 16B for SSE4.2 and below):

  – Aligned unit-stride: Addr % SIZE == 0

  – Unit-stride, alignment unknown: Addr % SIZE == ???

  – Misaligned unit-stride: Addr % SIZE != 0

Page 47:

Memory Access Patterns

• float A[1000]; struct BB { float x, y, z; } B[1000]; int C[1000];

• for(i=0; i<n; i++){ a) A[i] // unit-stride b) A[2*i] // strided c) B[i].x // strided d) j = C[i] // unit-stride; A[j] // gather/scatter e) A[C[i]] // same as d) f) B[i].x + B[i].y + B[i].z // 3 strided loads }

• Unit-stride

  – Good, with one exception: out-of-order cores won't store-forward a masked (unit-stride) store.

  – Avoid storing under a mask (unit-stride or not), but unnecessary stores can also create a bandwidth problem. Ouch.

• Strided, gather/scatter

  – Performance varies with the micro-architecture and the actual index patterns.

  – Big latency is exposed if these are on the critical path.

  – F90 arrays tend to be strided if not used carefully (see the article below). Use the F2008 "contiguous" attribute.

http://software.intel.com/en-us/articles/fortran-array-data-and-arguments-and-vectorization

Page 48:

Data Alignment Optimization

• float *A, *B, *C, *D; A[0:n] = B[0:n] * C[0:n] + D[0:n];

• Explain compiler’s assumption on data alignment and how that relates to the generated code.

• Identify different ways to optimize the data alignment of this code.

1) Understand the baseline assumption
   icc -xSSE4.2 -vec-report1 -restrict test2.c
   ifort -xSSE4.2 -vec-report1 ftest2.f
   icc -mmic -vec-report6 test2.c
   ifort -mmic -vec-report6 ftest2.f
   Examine the ASM code.

2) Add "vector aligned"
   icc -xSSE4.2 -vec-report1 test2a.c
   ifort -xSSE4.2 -vec-report1 ftest2a.f
   icc -mmic -vec-report6 test2a.c
   ifort -mmic -vec-report6 ftest2a.f
   Examine the ASM code.

3) Compile with -opt-assume-safe-padding
   icc -mmic -vec-report6 -opt-assume-safe-padding test2.c
   ifort -mmic -vec-report6 -opt-assume-safe-padding ftest2.f
   Examine the ASM code.

Page 49:

Align the Data at Allocation and Tell the Compiler at the Use Sites

• C/C++

  – Align at data allocation:
    __declspec(align(n, [off]))
    _mm_malloc(size, n)

  – Inform the compiler at use sites:
    #pragma vector aligned
    __assume_aligned(ptr, n);
    __assume(var % x == 0);

• Fortran

  – Align at data allocation:
    !DIR$ ATTRIBUTES ALIGN:n :: var
    -align arrayNbyte (N = 16, 32, 64)

  – Inform the compiler at use sites:
    !DIR$ vector aligned
    !DIR$ ASSUME_ALIGNED ptr:n

Reading Material: http://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization
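A minimal sketch (added for illustration, assuming the Intel C compiler) that puts the allocation-side and use-side pieces together; the array names are illustrative:

/* align_example.c: allocate on a 64-byte boundary (one Xeon Phi cache line /
   vector register width) and tell the compiler the loop accesses are aligned. */
#include <stdio.h>
#include <xmmintrin.h>              /* _mm_malloc / _mm_free */

#define N 102400

int main(void)
{
    float *A = _mm_malloc(N * sizeof(float), 64);
    float *B = _mm_malloc(N * sizeof(float), 64);
    for (int i = 0; i < N; i++) { A[i] = 1.0f; B[i] = 2.0f; }

    __assume_aligned(A, 64);        /* Intel-compiler hints at the use site */
    __assume_aligned(B, 64);
    #pragma vector aligned
    for (int i = 0; i < N; i++)
        A[i] = A[i] * 2.0f + B[i];

    printf("A[0] = %f\n", A[0]);
    _mm_free(A);
    _mm_free(B);
    return 0;
}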

Page 50:

Array of Struct is BAD

• struct node { float x, y, z; }; struct node NODES[1024]; float dist[1024]; for(i=0;i<1024;i+=16){ float x[16],y[16],z[16],d[16]; x[:] = NODES[i:16].x; y[:] = NODES[i:16].y; z[:] = NODES[i:16].z; d[:] = sqrtf(x[:]*x[:] + y[:]*y[:] + z[:]*z[:]); dist[i:16] = d[:]; }

• Use Struct of Arrays: struct node1 { float x[1024], y[1024], z[1024]; }; struct node1 NODES1;

• Use Tiled Array of Structs: struct node2 { float x[16], y[16], z[16]; }; struct node2 NODES2[1024/16];
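For comparison, a plain-C sketch of the same kernel on the struct-of-arrays layout (added for illustration, using #pragma simd as on the earlier slide): every load is unit-stride, so no gathers are generated.

#include <math.h>

#define N 1024

struct node1 { float x[N], y[N], z[N]; };   /* struct of arrays */

void distances(const struct node1 *nodes, float *dist)
{
    #pragma simd
    for (int i = 0; i < N; i++)
        dist[i] = sqrtf(nodes->x[i] * nodes->x[i] +
                        nodes->y[i] * nodes->y[i] +
                        nodes->z[i] * nodes->z[i]);
}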

Page 51:

Example in Optimizing LBM for MIC

• Updates the particle distribution function f_i(x, t) of particles at position x at time t traveling in direction e_i:

  f_i(x + e_i*dt, t + dt) = f_i(x, t) + Omega_i(f(x, t)),  where  Omega_i(f(x, t)) = (1/tau) * (f_i^0 - f_i)

• D3Q19 model

  – Each cell has |e_i| = 19 values

  – Uses 19 values at the current node, does ~12 flops/direction (220 flops total) and writes 19 neighbors

  – Has special-case computation at boundaries (a flag array is used here)

Page 52:

DO k = 1, NZ

DO j = 1, NY

!DIR$ IVDEP

DO i = 1, NX

m = i + NXG*( j + NYG*k )

phin = phi(m)

rhon = g0(m+now) + g1(m+now) + g2(m+now) + g3(m+now) + g4(m+now) &

+ g5(m+now) + g6(m+now) + g7(m+now) + g8(m+now) + g9(m+now) &

+ g10(m+now) + g11(m+now) + g12(m+now) + g13(m+now) + g14(m+now) &

+ g15(m+now) + g16(m+now) + g17(m+now) + g18(m+now)

sFx = muPhin*gradPhiX(m)

sFy = muPhin*gradPhiY(m)

sFz = muPhin*gradPhiZ(m)

ux = ( g1(m+now) - g2(m+now) + g7(m+now) - g8(m+now) + g9(m+now) &

- g10(m+now) + g11(m+now) - g12(m+now) + g13(m+now) - g14(m+now) &

+ 0.5D0*sFx )*invRhon

uy = ( g3(m+now) - g4(m+now) + g7(m+now) - g8(m+now) - g9(m+now) &

+ g10(m+now) + g15(m+now) - g16(m+now) + g17(m+now) - g18(m+now) &

+ 0.5D0*sFy )*invRhon

uz = ( g5(m+now) - g6(m+now) + g11(m+now) - g12(m+now) - g13(m+now) &

+ g14(m+now) + g15(m+now) - g16(m+now) - g17(m+now) + g18(m+now) &

+ 0.5D0*sFz )*invRhon

f0(m+nxt) = invTauPhi1*f0(m+now) + Af0 + invTauPhi*phin

f1(m+now) = invTauPhi1*f1(m+now) + Af1 + Cf1*ux

f2(m+now) = invTauPhi1*f2(m+now) + Af1 - Cf1*ux

f3(m+now) = invTauPhi1*f3(m+now) + Af1 + Cf1*uy

f4(m+now) = invTauPhi1*f4(m+now) + Af1 - Cf1*uy

f5(m+now) = invTauPhi1*f5(m+now) + Af1 + Cf1*uz

f6(m+now) = invTauPhi1*f6(m+now) + Af1 - Cf1*uz

g0(m+nxt) = invTauRhoOne*g0(m+now) + ... !! g0'nxt(i , j , k ) = g0(i,j,k) 1

g1(m+nxt + 1) = invTauRhoOne*g1(m+now) + ... !! g1'nxt(i+1, j , k ) = g1(i,j,k) 1

g2(m+nxt - 1) = invTauRhoOne*g2(m+now) + ... !! g2'nxt(i-1, j , k ) = g2(i,j,k) 1 (left)

g3(m+nxt + NXG) = invTauRhoOne*g3(m+now) + ... !! g3'nxt(i , j+1, k ) = g3(i,j,k) 2

g4(m+nxt - NXG) = invTauRhoOne*g4(m+now) + ... !! g4'nxt(i , j-1, k ) = g4(i,j,k) 3 (left)

g5(m+nxt + NXG*NYG) = invTauRhoOne*g5(m+now) + ... !! g5'nxt(i , j , k+1) = g5(i,j,k) 4

g6(m+nxt - NXG*NYG) = invTauRhoOne*g6(m+now) + ... !! g6'nxt(i , j , k-1) = g6(i,j,k) (left)

g7(m+nxt + 1 + NXG) = invTauRhoOne*g7(m+now) + ... !! g7'nxt(i+1, j+1, k ) = g7(i,j,k)

g8(m+nxt - 1 - NXG) = invTauRhoOne*g8(m+now) + ... !! g8'nxt(i-1, j-1, k ) = g8(i,j,k) (left)

g9(m+nxt + 1 - NXG) = invTauRhoOne*g9(m+now) + ... !! g9'nxt(i+1, j-1, k ) = g9(i,j,k) (left)

g10(m+nxt - 1 + NXG) = invTauRhoOne*g10(m+now) + ... !! g10'nxt(i-1, j+1, k ) = g10(i,j,k)

g11(m+nxt + 1 + NXG*NYG) = invTauRhoOne*g11(m+now) + ... !! g11'nxt(i+1, j , k+1) = g11(i,j,k)

g12(m+nxt - 1 - NXG*NYG) = invTauRhoOne*g12(m+now) + ... !! g12'nxt(i-1, j , k-1) = g12(i,j,k) (left)

g13(m+nxt + 1 - NXG*NYG) = invTauRhoOne*g13(m+now) + ... !! g13'nxt(i+1, j , k-1) = g13(i,j,k) (left)

g14(m+nxt - 1 + NXG*NYG) = invTauRhoOne*g14(m+now) + ... !! g14'nxt(i-1, j , k+1) = g14(i,j,k)

g15(m+nxt + NXG + NXG*NYG) = invTauRhoOne*g15(m+now) + ... !! g15'nxt(i , j+1, k+1) = g15(i,j,k)

g16(m+nxt - NXG - NXG*NYG) = invTauRhoOne*g16(m+now) + ... !! g16'nxt(i , j-1, k-1) = g16(i,j,k) (left)

g17(m+nxt + NXG - NXG*NYG) = invTauRhoOne*g17(m+now) + ... !! g17'nxt(i , j+1, k-1) = g17(i,j,k) (left)

g18(m+nxt - NXG + NXG*NYG) = invTauRhoOne*g18(m+now) + ... !! g18'nxt(i , j-1, k+1) = g18(i,j,k)

END DO

END DO

END DO

Key challenges:

- 49 streams of data (some unaligned)

– 19 read streams g* (blue)

– 19 write streams g* (orange)

– 7 read/write streams f* (green)

– 4 read streams (red) (phi, gradPhiX, gradPhiY, gradPhiZ)

- Prefetch / TLB associative issues

- Compiler not vectorizing some hot loops.

NX = NY = NZ = 128

Page 53:

[Figure: speedup relative to the SNB-EP reference code. Reference code: SNB-EP 1, KNC 0.11. Moderately optimized code (pragmas, compiler, etc.): SNB-EP 8, KNC 16.]

Initial Results + Basic Optimizations

• Straight porting:

– KNC was 1/10 the performance of a 2-socket SNB-EP system

• Basic optimizations improve both CPU and MIC performance:

– CPU Perf improved by 8x

– KNC Perf improved by 150x

– Optimizations:

– compiler vector directives

– 2M pages

– loop fusion

– change in OMP work distribution

Data measured by Tanmay Gangwani, Shreeniwas Sapre and Sangeeta Bhattacharya on 2x8 SNB-EP 2.2GHz, KNC-ES2

Page 54:

Advanced Optimizations

• Include modifications to data structure and tiling to make better use of vector units, caches and available memory bandwidth

– AOS to SOA transformation helps improve SIMD utilization by avoiding gather / scatter operations

– 3.5D blocking to take advantage of large coherent caches

– SW Prefetching for MIC

• Results:

– CPU and MIC performance both further improve by 9x

Ref: Anthony Nguyen et al., "3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs," SC 2010.

Similarity in Intel® architectures makes these optimizations (basic + advanced) work for both the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor.
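As a rough illustration of the blocking idea (a generic spatial tiling sketch added here, not the paper's 3.5-D scheme; tile sizes are illustrative), the grid is swept in tiles small enough that the data a core touches stays in its 512 KB L2 slice:

#define NX 128
#define NY 128
#define NZ 128
#define TY 16          /* tile sizes chosen for illustration only */
#define TZ 16
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* A simple 7-point stencil stands in for the real LBM update. */
void sweep_blocked(const float *in, float *out)
{
    for (int kk = 1; kk < NZ - 1; kk += TZ)
        for (int jj = 1; jj < NY - 1; jj += TY)
            for (int k = kk; k < MIN(kk + TZ, NZ - 1); k++)
                for (int j = jj; j < MIN(jj + TY, NY - 1); j++)
                    for (int i = 1; i < NX - 1; i++) {
                        int m = i + NX * (j + NY * k);
                        out[m] = in[m] + 0.1f * (in[m - 1] + in[m + 1]
                               + in[m - NX] + in[m + NX]
                               + in[m - NX * NY] + in[m + NX * NY]
                               - 6.0f * in[m]);
                    }
}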

Page 55:

Agenda

• Why Many-Core?

• Intel Xeon Phi Family & ecosystem

• Intel Xeon Phi HW and SW architecture

• HPC design and development considerations for Xeon Phi

• Case Studies & Demo

Page 56:

ScaleMP

• vSMP Foundation aggregates multiple x86 systems into a single, virtual x86 system

– SCALE their application compute, memory, and I/O resources

• Simple to move an application to Intel® Xeon Phi™ coprocessor based systems using ScaleMP software

• Zero porting effort

Video: ScaleMP to Support Intel Xeon Phi Coprocessors ... ScaleMP CEO Shai Fultheim describes how vSMP Foundation supports the new Intel Xeon Phi coprocessor. (From ISC’12 – Hamburg)

Page 57:

Intel demo booth #: ScaleMP – SMP VM for Xeon & Xeon Phi

Page 58:

Programming Intel® MIC-based Systems

[Diagram: clusters of CPU and MIC nodes connected by a network, with MPI ranks and data placed differently in each model. From CPU-centric to Intel® MIC-centric: CPU-hosted, Offload, Symmetric, Reverse Offload, MIC-hosted.]

Page 59:

LBM running in Xeon cluster with Xeon Phi co-processor

[Diagram: homogeneous nodes, each with CPUs and a MIC card. MPI carries data between nodes over the network; within a node the CPU offloads work to the MIC.]

Page 60:

LES_MIC: Load balance between CPU & MIC


[Figures: relative speedup with different problem sizes; LES CPU+MIC simultaneous computing design]

Page 61:

Demo#: LES_MIC

Page 62:

Recommended reading: 4 books

Page 63:

Intel® Xeon Phi™ Coprocessor Developer site: http://software.intel.com/mic-developer

• Architecture, setup and programming resources

• Self-guided training

• Case studies

• Information on tools and ecosystem

• Support through the community forum

Page 64:
