TRANSCRIPT
Parallel Computing and Intel® Xeon Phi™ coprocessors
Warren He Ph.D (何万青) [email protected]
Technical Computing Solution Architect
Datacenter & Connected System Group
Transition from Multicore to Manycore
Agenda
• Why Many-Core?
• Intel Xeon Phi Family & ecosystem
• Intel Xeon Phi HW and SW architecture
• HPC design and development considerations for Xeon Phi
• Case Studies & Demo
Figure: theoretical performance as a function of fraction parallel and % vector: to benefit, parallelize, vectorize, and scale to many-core.
* Theoretical acceleration of a highly parallel processor over an Intel® Xeon® processor
Scaling workloads for Intel® Xeon Phi™ coprocessors
Intel® Xeon Phi™ Coprocessor
All SKUs, pricing and features are PRELIMINARY and are subject to change without notice
Figure: many-core block diagram: multiple vector IA cores, each with a coherent cache, connected by an interprocessor network, together with fixed function logic, memory and I/O interfaces, PCIe, and GDDR5.
Notice: This document contains information on products in the design phase of development. The information here is subject to change without notice. Do not finalize a design with this information. Contact your local Intel sales office or your distributor to obtain the latest specification before placing your product order. Knights Corner and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user. All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
Silicon cores: 57-61
Silicon max frequency: 1.05-1.25 GHz
Double precision peak performance: 1003-1220 GFLOPS
GDDR5 devices: 24-32
Memory capacity: 6-8 GB
Memory channels: 12-16
Form factors: passive, active, dense form factor (DFF), no thermal solution (NTS); refer to picture
Peak memory bandwidth: 240-352 GB/s
Total cache: 28.5-30.5 MB
Board TDP: 225-300 Watts
PCIe: Gen 2 x16, I/O speeds up to 7.0 GB/s (KNC to host) and 6.7 GB/s (host to KNC)
Misc.: product codename Knights Corner; based on 22nm process; ECC enabled; Turbo not currently available
Power/management: future node manager support; IPMI, including Intel Coprocessor Communications Link
Tools: Intel C++/C/Fortran compilers, Intel Math Kernel Library; debuggers, performance and correctness analysis tools; OpenMP, MPI, OFED messaging infrastructure (Linux only), OpenCL; programming models: offload, native, and mixed offload+native
OS: RHEL 6.x, SuSE 11.1, Microsoft Windows 8 Server, Windows 7, Windows 8
Intel® Many Integrated Core Architecture: Processor Numbering Overview

Example: Intel® Xeon Phi™ coprocessor 7110P
• Brand: designates the family of products
• Performance shelf (first digit): indication of relative pricing and performance within a generation. 7 = best performance, 5 = best performance/watt, 3 = best value
• Generation (second digit): 1 = Knights Corner, 2 = Knights Landing
• SKU number (remaining digits)
• Suffix: where applicable, may denote form factor, thermal solution, or usage model. P = passive thermal solution, A = active thermal solution, X = no thermal solution, D = dense form factor
Software Development Ecosystem1 for Intel® Xeon Phi™ Coprocessor

Compilers, run environments:
• Open source: gcc (kernel build only, not for applications), python*
• Commercial: Intel® C++ Compiler, Intel® Fortran Compiler, MYO, CAPS* HMPP* 3.2.5 (Beta) compiler, PGI*, PGAS GPI (Fraunhofer ITWM*), ISPC

Debuggers:
• Open source: gdb
• Commercial: Intel Debugger, RogueWave* TotalView* 8.9, Allinea* DDT 3.3

Libraries:
• Open source: TBB2, MPICH2 1.5, FFTW, NetCDF
• Commercial: NAG*, Intel® Math Kernel Library, Intel® MPI Library, Intel® OpenMP*, Intel® Cilk™ Plus (in Intel compilers), MAGMA*, Accelereyes* ArrayFire 2.0 (Beta), Boost C++ Libraries 1.47+

Profiling & analysis tools: Intel® VTune™ Amplifier XE, Intel® Trace Analyzer & Collector, Intel® Inspector XE, Intel® Advisor XE, TAU* ParaTools* 2.21.4

Virtualization: ScaleMP* vSMP Foundation 5.0, Xen* 4.1.2+

Cluster, workload management, and manageability tools: Altair* PBS Professional 11.2, Adaptive* Computing Moab 7.2, Bright* Cluster Manager 6.1 (Beta), ET International* SWARM (Beta), IBM* Platform Computing* {LSF 8.3, HPC 3.2 and PCM 3.2}, MPICH2, Univa* Grid Engine 8.1.3

1These are all announced as of November 2012. Intel has said there are more actively being developed but not yet announced. 2Commercial support of Intel TBB available from Intel.
Technical Compute Target Market Segments & Applications

Sub-segments and Intel® Xeon Phi™ candidate applications/workloads:
• Public sector: HPL, HPCC, NPB, LAMMPS, QCD
• Energy (including oil & gas): RTM (Reverse Time Migration), WEM (Wave Equation Migration)
• Climate modeling & weather simulation: WRF, HOMME
• Financial analysis: Monte Carlo, Black-Scholes, binomial model, Heston model
• Life sciences (molecular dynamics, gene sequencing, bio-chemistry): LAMMPS, NAMD, AMBER, HMMER, BLAST, QCD
• Manufacturing: CAD/CAM/CAE/CFD/EDA implicit and explicit solvers
• Digital content creation: ray tracing, animation, effects
• Software ecosystem: tools, middleware

Mix of ISV and end-user development.
Agenda
• Why Many-Core?
• Intel Xeon Phi Family & ecosystem
• Intel Xeon Phi HW and SW architecture
• HPC design and development considerations for Xeon Phi
• Case Studies & Demo
Intel® Xeon Phi™ coprocessor codenamed “Knights Corner” Block Diagram
• Cores: lightweight, power efficient, in-order, support 4 hardware threads
• Vector Processing Unit (VPU) on each core, with 32 512-bit SIMD registers
• High-speed bidirectional ring interconnect
• Distributed tag directory enables faster snooping
Speeds and Feeds

Per-core caches reference info:

Type          | Size  | Ways | Set conflict | Location
L1I (instr)   | 32KB  | 4    | 8KB          | on-core
L1D (data)    | 32KB  | 8    | 4KB          | on-core
L2 (unified)  | 512KB | 8    | 64KB         | connected via core/ring interface

Memory:
• 8 memory controllers, each supporting two 32-bit channels
• GDDR5 channels with a theoretical peak of 5.5 GT/s
• Practical peak bandwidth limited by the ring and other factors
Speeds and Feeds, Part 2

Per-core TLBs reference info:

Type           | Entries | Page Size          | Coverage
L1 Instruction | 32      | 4KB                | 128KB
L1 Data        | 64      | 4KB                | 256KB
L1 Data        | 32      | 64KB               | 2MB
L1 Data        | 8       | 2MB                | 16MB
L2             | 64      | 4KB, 64KB, or 2MB  | up to 128MB

• Linux support for 64KB pages on Intel64 not yet available
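The Coverage column is simply entries multiplied by page size; a quick C sketch (illustrative only) makes the arithmetic explicit:

```c
/* TLB reach (coverage) in KB = number of entries * page size in KB */
static long tlb_coverage_kb(long entries, long page_kb) {
    return entries * page_kb;
}
```

For example, 64 data-TLB entries at 4KB pages cover 256KB, while 64 L2 entries at 2MB pages cover 128MB.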
Knights Corner Core
X86 specific logic < 2% of core + L2 area
Intel Confidential
Knights Corner Silicon Core Architecture

Figure: core block diagram: instruction decode feeding a scalar unit (with scalar registers) and a vector unit (with vector registers), an L1 cache (data and instruction), an L2 cache, and the interprocessor network.

• In-order, 64-bit, 4 threads per core, fully coherent
• 512-bit-wide vector unit with specialized instructions (new encoding)
• 32 KB L1 cache per core
• L2 hardware prefetching; 512 KB L2 slice per core with fast access to the local copy
• Improved ring interconnect delivering 3x bandwidth over the Aubrey Isle silicon ring
• VPU: integer, SP, DP; 3-operand, 16-wide instructions; HW transcendentals
Vector Processing Unit

Figure: VPU pipeline (PPF, PF, D0, D1, D2, E, WB, with vector stages VC1, VC2, V1-V4) and block diagram: decode, vector register file (3 reads, 1 write), mask register file, scatter/gather, load/store, EMU, and vector ALUs (16-wide x 32-bit, 8-wide x 64-bit) with fused multiply-add.
Peak Performance = clock frequency * number of cores * 8 lanes (DP) * 2 (FMA) FLOPS/cycle
e.g. 1.091 GHz * 61 cores * 8 lanes * 2 = 1064.8 GFLOPS
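The peak number can be reproduced with a few lines of C (a sketch; the 8 DP lanes and the factor of 2 for FMA are taken from the formula above):

```c
/* Peak DP GFLOPS = clock (GHz) * cores * DP SIMD lanes * 2 (FMA) */
static double peak_dp_gflops(double clock_ghz, int cores, int dp_lanes) {
    return clock_ghz * cores * dp_lanes * 2.0;
}
```

Plugging in the 61-core, 1.091 GHz part gives approximately 1064.8 GFLOPS, matching the slide.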
Wider vectorization opportunity

Illustrated with the number of 32-bit data elements processed by one "packed" instruction:
• Intel® Pentium® processor (1993)
• MMX™ (1997)
• Intel® Streaming SIMD Extensions (Intel® SSE in 1999 to Intel® SSE4.2 in 2008)
• Intel® Advanced Vector Extensions (Intel® AVX in 2011 and Intel® AVX2 in 2013)
• Intel® Xeon Phi™ coprocessor (in 2012)

Effective use of an ever increasing number of vector elements and new instruction sets is vital to Intel's success. Multi-cores and many-cores on top of this!

1/7/2013
Knights Corner PCI Express Card Block Diagram
Intel® Xeon Phi™ Coprocessors Software Architecture (MPSS)
Intel® Xeon Phi™ Coprocessors: They're So Much More

General purpose IA hardware leads to less idle time for your investment. The coprocessor can operate as a compute node: run a full OS, program to MPI, run x86 code. It's a supercomputer on a chip.

By contrast, custom HW acceleration (GPU, ASIC, FPGA) runs restricted or offloaded code. Restrictive architectures limit the ability of applications to use arbitrary nested parallelism, function calls, and threading models.

Source: Intel estimates
MIC Software Stack

Figure: host and MIC software stacks connected by PCI-E communications HW. On each side, offload tools/applications run in user mode on top of an offload library, MYO, COI, and SCIF, over the Linux kernel in kernel mode.
Spectrum of Execution Models of Heterogeneous Computing on Xeon and Xeon Phi Coprocessor

From CPU-centric to Intel® MIC-centric, with productive programming models across the spectrum:
• Multi-core hosted (Intel® Xeon processor): general purpose serial and parallel computing
• Offload: codes with highly-parallel phases
• Symmetric: codes with balanced needs
• Reverse offload: codes with serial phases (not currently supported by Language Ext. Offload for Intel tools)
• Many-core hosted (Intel® Many Integrated Core): highly-parallel codes

In each model, main(), foo(), and MPI_*() run on the multi-core host, on the many-core coprocessor, or on both, connected over PCIe. All models except reverse offload are supported with Intel tools.
MPSS Setup Step by Step
Intel® Many Integrated Core Architecture Marketing Document ID: 487731
Tools for both OEMs and end users:
• micsmc: a control panel application for system management and viewing configuration status information specific to SMC operations on one or more MIC cards in a single platform.
• micflash: a command line utility to update both the MIC SPI flash and the SMC firmware, as well as grab various information about the flash from either the ROM on the MIC card or a ROM file.
• micinfo: a command line utility that allows the OEM and the end user to get various information (version data, memory info, bus speeds, etc.) from one or more MIC cards in a single platform.
• hwcfg: a command line utility that can be used to view and configure various hardware aspects of one or more MIC cards in a single platform.

Tools for OEMs only:
• micdiag: a command line utility that performs various diagnostics on one or more MIC cards in a single platform.
• micerrinj: a command line utility that allows OEMs to inject one or more ECC errors (correctable, uncorrectable, or fatal) into the GDDR on a MIC card to test platform error handling capabilities.
• KncPtuGen: a power/thermal command line utility that can be used to put various loads on one or more individual cores on a MIC card to simulate compromised performance on specific cores.
• micprun (OEM workloads): a command line utility that provides an environment for executing various performance workloads (provided by Intel) that allow the OEM to test their platform's performance over time and compare to Intel's "Gold" configuration in the factory.
‘micsmc’ – MIC SMC Control Panel
• What does it do?
– A graphical display of SMC status items, such as temperature, memory usage and core utilization for one or multiple MIC cards on a single platform.
– Command Line Parameters available for scripting purposes.
• What does it look like?
‘micsmc’ Card View Windows
Utilization View
Core Histogram View
Historical Utilization View
‘micprun’ – MIC Performance Tool
• What can it do?
– This is actually a wrapper that can be used to execute the OEM workloads.
– Neatly wraps up many of the parameters specific to the workload in addition to setting up workload comparisons to previous executions and even to Intel “Gold” system configurations.
• What does it look like?
– ./micprun [-k workload] [-p workloadParams] [-c category] [-x method] [-v level] [-o output] [-t tag] [-r pickle]
– Workload (-k for kernel) specifies which workload to run: SGEMM, DGEMM, etc.
– Categories (-c) include "optimal", "scaling", "test".
– Methods (-x) include "native", "scif", "coi" and "myo".
– Levels (-v) specify the level of detail in the output.
– Output (-o) specifies to create a "pickle" file for future comparisons (use tag (-t) to specify a custom name).
– Pickle (-r) is used to compare the current workload execution to a previously saved workload execution and execute the current run exactly the same as the specified pickle file.
‘micprun’ continued…
• Show me some examples!
– Run all of the native workloads with optimal parameters:
  ./micprun
  ./micprun -k all -c optimal -x native -v 0
– Run SGEMM in scaling mode, and save to a pickle file:
  ./micprun -k sgemm -c scaling -o . -t mySgemm
– Run a comparison to the above example and create a graph:
  ./micprun -r micp_run_mySgemm -v 2
– View the SGEMM workload parameters:
  ./micprun -k sgemm --help
– Run SGEMM on only 4 cores:
  ./micprun -k sgemm -p '--num_core 4'
Agenda
• Why Many-Core?
• Intel Xeon Phi Family & ecosystem
• Intel Xeon Phi HW and SW architecture
• HPC design and development considerations for Xeon Phi
• Case Studies & Demo
Considerations for Application Development on MIC

Scientific parallel application considerations span the workload and the algorithm implementation: dataset size, grid shape, data access patterns, SPMD loop structure, fork/join structure, load balancing, and communication.
Considerations in Selecting Workload Size

Applicability of Amdahl's law:
• Speedup = (s + p)/(s + p/N) (1), where s is the serial fraction, p the parallel fraction, and N the number of parallel processors
• Using the single-processor runtime to normalize, s + p = 1, so Speedup = 1/(s + p/N)
• This indicates the serial portion s will dominate on massively parallel hardware like Intel® Xeon Phi™, limiting the advantage of the many-core hardware

Ref: "Scientific Parallel Computing", L. Ridgway Scott et al.
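Equation (1) is easy to evaluate numerically; a small C sketch shows how harshly even a tiny serial fraction caps speedup on a 61-core part:

```c
/* Amdahl's law, eq. (1) with s + p = 1: speedup = 1 / (s + (1 - s)/N) */
static double amdahl_speedup(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}
```

With s = 0 the speedup is the full N = 61, but a mere 5% serial fraction drops it to about 15.25x.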
Scaled vs. Fixed Size Speedup: Amdahl's Law Implication

Amdahl's law has been considered harmful in designing parallel architectures. Gustafson's [1] observation:
• For scientific computing problems, problem size increases with parallel computing power
• The parallel component of the problem scales with problem size
• Ask "how long would a given parallel program have taken on a serial processor?"
• Scaled Speedup = (s' + p'N)/(s' + p'), where s' and p' represent the serial and parallel time spent on the parallel system, with s' + p' = 1
• Scaled Speedup = s' + (1 - s')N = N + (1 - N)s' (2)

Gustafson's Law: when speedup is measured by scaling the problem size, the serial fraction tends to shrink as more processors are added.

Fig: John L. Gustafson, "Reevaluating Amdahl's Law"
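Equation (2) can be sketched in C the same way; the contrast with the fixed-size formula (1) is striking for the same 5% serial time:

```c
/* Gustafson's scaled speedup, eq. (2): s' + (1 - s')*N = N + (1 - N)*s' */
static double gustafson_scaled_speedup(double s_prime, int n) {
    return s_prime + (1.0 - s_prime) * n;
}
```

With s' = 0.05 and N = 61 the scaled speedup is 58.0, versus roughly 15x from the fixed-size Amdahl formula.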
Intel® Xeon Phi™ Coprocessors Workload Suitability

Figure: suitability flowchart and performance-vs-threads chart (representative example workload for illustration purposes only). The chart shows performance rising past the multi-core peak toward the many-core peak as thread count grows. The flowchart asks:
• Can your workload scale to over 100 threads?
• Can your workload benefit from large vectors?
• Can your workload benefit from more memory bandwidth?
If the application scales with threads and vectors or memory bandwidth, it is a fit for Intel® Xeon Phi™ coprocessors.

Click to see the animation video on exploiting parallelism on Intel® Xeon Phi™ coprocessors.
A picture speaks a thousand words.
The Keys to Take Advantages of MIC
• Choose the right Xeon-centric or MIC-centric model for your application
• Vectorize your application
– Use Intel’s vectorizing compiler
• Parallelize your application
– With MPI (or other multiprocess model)
– With threads (via Pthreads, TBB, Cilk Plus, OpenMP, etc.)
• Consider standard optimization techniques such as tiling, overlapping computation and communication, etc.
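The tiling technique mentioned above can be sketched in plain C; a matrix transpose is a classic case where processing TILE x TILE blocks keeps the destination cache-resident (TILE is an illustrative value that would be tuned to cache size, and n is assumed to be a multiple of TILE):

```c
#define TILE 16

/* Tiled matrix transpose: iterate over TILE x TILE blocks so each
 * block of b stays in cache while it is written. */
static void transpose_tiled(const float *a, float *b, int n) {
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    b[(long)j * n + i] = a[(long)i * n + j];
}
```

The same blocking idea underlies the cache optimizations discussed for the LBM case study later in the talk.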
Intel® Xeon Phi™ SW Imperatives

• For an Intel® Xeon Phi™ program to succeed, think differently and guide the programmers to think differently.

Figure: speed over time for two porting strategies: a naive port followed by a long optimization phase is a losing strategy; an intelligent port followed by a shorter optimization phase is the winning strategy.

Intelligent application porting to Intel® Xeon Phi™ platforms requires parallelization and vectorization.
Algorithm and Data Structure

One needs to consider:
• The type of parallel algorithm to use for the problem: look carefully at every sequential part of the algorithm
• Memory access patterns: input and output data sets
• Communication between threads
• Work division between threads/processes: load balancing and granularity
Parallelism Pattern For Xeon Phi™
Memory Access Patterns
• The algorithm determines the data access patterns.
• Data access patterns can be categorized into:
  – Arrays/linear data structures: conducive to low latency and high bandwidth; look out for unaligned access
  – Random access: try to avoid; needs prefetching if possible (fewer threads to hide latency compared to a Tesla implementation)
  – Recursive (tree)
• Data decomposition:
  – Geometric decomposition: regular or irregular
  – Recursive decomposition
• Auto-vectorization may be improved by additional pragmas like #pragma ivdep
• A more aggressive pragma is #pragma simd: it uses less strict heuristics and may produce wrong code, so results must be carefully checked
• Experts may use compiler intrinsics. The caveat is the loss of portability; a portable layer between the intrinsics and the application may help
Vectorization methodologies
The compiler has to assume the worst case the language/flags allow.

  float *A;
  void vectorize() {
    int i;
    for (i = 0; i < 102400; i++) {
      A[i] *= 2.0f;
    }
  }

Loop body:
• Load of A
• Load of A[i]
• Multiply with 2.0f
• Store of A[i]
3) Recompile with -ansi-alias
• icc -vec-report1 -ansi-alias test1.c
  – test1.c(4): (col. 3) remark: LOOP WAS VECTORIZED.
4) Change "float *A" to "float *restrict A".
• icc -vec-report1 test1a.c
  – test1a.c(4): (col. 3) remark: LOOP WAS VECTORIZED.
5) Add "#pragma ivdep" to the loop.
• icc -vec-report1 test1b.c
  – test1b.c(5): (col. 3) remark: LOOP WAS VECTORIZED.
Q: Will the store modify A?
A: Maybe
“NO” is needed to make vectorization legal.
Writing Explicit Vector Code with Intel® Cilk™ Plus

  float *A;
  void vectorize() {
    for (int i = 0; i < 102400; i++) {
      A[i] *= 2.0f;
    }
  }

8) Using the SIMD pragma:

  float *A;
  void vectorize() {
    #pragma simd vectorlength(4)
    for (int i = 0; i < 102400; i++) {
      A[i] *= 2.0f;
    }
  }

9) Using array notation:

  float *A;
  void vectorize() {
    A[0:102400] *= 2.0f;
  }

10) Using an elemental function (note the pointer must be passed as an argument):

  float *A;
  __declspec(noinline)
  __declspec(vector(uniform(p), linear(i)))
  void mul(float *p, int i) {
    p[i] *= 2.0f;
  }
  void vectorize() {
    for (int i = 0; i < 102400; i++) {
      mul(A, i);
    }
  }
Memory Accesses and Alignment

• Memory access patterns
• Alignment optimization
Memory access patterns:
• Unit-stride: A[i], A[i+1], A[i+2], A[i+3], ...
• Strided (a special form of gather/scatter): A[2*i], A[2*(i+1)], A[2*(i+2)], ...
• Gather/scatter: A[B[i]], A[B[i+1]], A[B[i+2]], ...

If you write in Cilk™ Plus array notation, access patterns are obvious at a glance: unit-stride means A[:] or A[lb:#elems], which helps you think more clearly.

Alignment of unit-stride accesses (SIZE is 64B for Xeon Phi™, 32B for AVX1/2, 16B for SSE4.2 and below):
• Aligned unit-stride: Addr % SIZE == 0
• Alignment unknown: Addr % SIZE == ???
• Misaligned unit-stride: Addr % SIZE != 0
Memory Access Patterns

  float A[1000];
  struct BB { float x, y, z; } B[1000];
  int C[1000];

  for (i = 0; i < n; i++) {
    a) A[i]                       // unit-stride
    b) A[2*i]                     // strided
    c) B[i].x                     // strided
    d) j = C[i]                   // unit-stride
       A[j]                       // gather/scatter
    e) A[C[i]]                    // same as d)
    f) B[i].x + B[i].y + B[i].z   // 3 strided loads
  }
• Unit-Stride
– Good --- with one exception.
– Out-of-order cores won’t store-forward masked (unit-stride) store.
– Avoid storing under the mask (unit-stride or not), but unnecessary stores can also create bandwidth problem. Ouch.
• Strided, Gather/Scatter
– Perf varies on micro-arch and the actual index patterns.
– Big latency is exposed if you have these on the critical path
– F90 Arrays tend to be strided if not carefully used. See this article. Use F2008 “contiguous” attribute.
http://software.intel.com/en-us/articles/fortran-array-data-and-arguments-and-vectorization
Data Alignment Optimization

  float *A, *B, *C, *D;
  A[0:n] = B[0:n] * C[0:n] + D[0:n];

• Explain the compiler's assumptions on data alignment and how they relate to the generated code.
• Identify different ways to optimize the data alignment of this code.

1) Understand the baseline assumption:
   icc -xSSE4.2 -vec-report1 -restrict test2.c
   ifort -xSSE4.2 -vec-report1 ftest2.f
   icc -mmic -vec-report6 test2.c
   ifort -mmic -vec-report6 ftest2.f
   Examine the ASM code.
2) Add "vector aligned":
   icc -xSSE4.2 -vec-report1 test2a.c
   ifort -xSSE4.2 -vec-report1 ftest2a.f
   icc -mmic -vec-report6 test2a.c
   ifort -mmic -vec-report6 ftest2a.f
   Examine the ASM code.
3) Compile with -opt-assume-safe-padding:
   icc -mmic -vec-report6 -opt-assume-safe-padding test2.c
   ifort -mmic -vec-report6 -opt-assume-safe-padding ftest2.f
   Examine the ASM code.
Align the Data at Allocation and Tell the Compiler at the Use

C/C++:
• Align at data allocation:
  __declspec(align(n, [off]))
  _mm_malloc(size, n)
• Inform the compiler at use sites:
  #pragma vector aligned
  __assume_aligned(ptr, n);
  __assume(var % x == 0);

Fortran:
• Align at data allocation:
  !DIR$ ATTRIBUTES ALIGN: n :: var
  -align arrayNbyte (N = 16, 32, 64)
• Inform the compiler at use sites:
  !DIR$ VECTOR ALIGNED
  !DIR$ ASSUME_ALIGNED var: n
Reading Material: http://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization
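A minimal sketch of aligned allocation, using C11 aligned_alloc as a portable stand-in for Intel's _mm_malloc (the helper name is illustrative; 64 bytes matches the Xeon Phi vector width noted earlier):

```c
#include <stdlib.h>

/* Allocate a 64-byte-aligned float array. aligned_alloc requires
 * the byte count to be a multiple of the alignment, so round up. */
static float *alloc_floats_aligned64(size_t n) {
    size_t bytes = (n * sizeof(float) + 63) & ~(size_t)63;
    return aligned_alloc(64, bytes);
}
```

A pointer returned this way satisfies the Addr % 64 == 0 condition from the alignment discussion, so a "vector aligned" hint on loops over it is safe.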
Array of Struct is BAD

  struct node { float x, y, z; };
  struct node NODES[1024];
  float dist[1024];
  for (i = 0; i < 1024; i += 16) {
    float x[16], y[16], z[16], d[16];
    x[:] = NODES[i:16].x;
    y[:] = NODES[i:16].y;
    z[:] = NODES[i:16].z;
    d[:] = sqrtf(x[:]*x[:] + y[:]*y[:] + z[:]*z[:]);
    dist[i:16] = d[:];
  }

• Use a struct of arrays:

  struct node1 { float x[1024], y[1024], z[1024]; };
  struct node1 NODES1;

• Or use a tiled array of structs:

  struct node2 { float x[16], y[16], z[16]; };
  struct node2 NODES2[64];
Example in Optimizing LBM for MIC

• Updates f_i(x, t), the particle distribution function of particles at position x at time t traveling in direction e_i
• D3Q19 model:
  – Each cell has |e_i| = 19 values
  – Uses 19 values at the current node, does ~12 flops/direction (220 flops total), and writes 19 neighbors
  – Has special-case computation at boundaries (a flag array is used here)

  f_i(x + e_i*dt, t + dt) = f_i(x, t) + Omega_i(f(x, t))
  Omega_i(f(x, t)) = (1/tau) * (f_i^(0) - f_i)
DO k = 1, NZ
DO j = 1, NY
!DIR$ IVDEP
DO i = 1, NX
m = i + NXG*( j + NYG*k )
phin = phi(m)
rhon = g0(m+now) + g1(m+now) + g2(m+now) + g3(m+now) + g4(m+now) &
+ g5(m+now) + g6(m+now) + g7(m+now) + g8(m+now) + g9(m+now) &
+ g10(m+now) + g11(m+now) + g12(m+now) + g13(m+now) + g14(m+now) &
+ g15(m+now) + g16(m+now) + g17(m+now) + g18(m+now)
sFx = muPhin*gradPhiX(m)
sFy = muPhin*gradPhiY(m)
sFz = muPhin*gradPhiZ(m)
ux = ( g1(m+now) - g2(m+now) + g7(m+now) - g8(m+now) + g9(m+now) &
- g10(m+now) + g11(m+now) - g12(m+now) + g13(m+now) - g14(m+now) &
+ 0.5D0*sFx )*invRhon
uy = ( g3(m+now) - g4(m+now) + g7(m+now) - g8(m+now) - g9(m+now) &
+ g10(m+now) + g15(m+now) - g16(m+now) + g17(m+now) - g18(m+now) &
+ 0.5D0*sFy )*invRhon
uz = ( g5(m+now) - g6(m+now) + g11(m+now) - g12(m+now) - g13(m+now) &
+ g14(m+now) + g15(m+now) - g16(m+now) - g17(m+now) + g18(m+now) &
+ 0.5D0*sFz )*invRhon
f0(m+nxt) = invTauPhi1*f0(m+now) + Af0 + invTauPhi*phin
f1(m+now) = invTauPhi1*f1(m+now) + Af1 + Cf1*ux
f2(m+now) = invTauPhi1*f2(m+now) + Af1 - Cf1*ux
f3(m+now) = invTauPhi1*f3(m+now) + Af1 + Cf1*uy
f4(m+now) = invTauPhi1*f4(m+now) + Af1 - Cf1*uy
f5(m+now) = invTauPhi1*f5(m+now) + Af1 + Cf1*uz
f6(m+now) = invTauPhi1*f6(m+now) + Af1 - Cf1*uz
g0(m+nxt) = invTauRhoOne*g0(m+now) + ... !! g0'nxt(i , j , k ) = g0(i,j,k) 1
g1(m+nxt + 1) = invTauRhoOne*g1(m+now) + ... !! g1'nxt(i+1, j , k ) = g1(i,j,k) 1
g2(m+nxt - 1) = invTauRhoOne*g2(m+now) + ... !! g2'nxt(i-1, j , k ) = g2(i,j,k) 1 (left)
g3(m+nxt + NXG) = invTauRhoOne*g3(m+now) + ... !! g3'nxt(i , j+1, k ) = g3(i,j,k) 2
g4(m+nxt - NXG) = invTauRhoOne*g4(m+now) + ... !! g4'nxt(i , j-1, k ) = g4(i,j,k) 3 (left)
g5(m+nxt + NXG*NYG) = invTauRhoOne*g5(m+now) + ... !! g5'nxt(i , j , k+1) = g5(i,j,k) 4
g6(m+nxt - NXG*NYG) = invTauRhoOne*g6(m+now) + ... !! g6'nxt(i , j , k-1) = g6(i,j,k) (left)
g7(m+nxt + 1 + NXG) = invTauRhoOne*g7(m+now) + ... !! g7'nxt(i+1, j+1, k ) = g7(i,j,k)
g8(m+nxt - 1 - NXG) = invTauRhoOne*g8(m+now) + ... !! g8'nxt(i-1, j-1, k ) = g8(i,j,k) (left)
g9(m+nxt + 1 - NXG) = invTauRhoOne*g9(m+now) + ... !! g9'nxt(i+1, j-1, k ) = g9(i,j,k) (left)
g10(m+nxt - 1 + NXG) = invTauRhoOne*g10(m+now) + ... !! g10'nxt(i-1, j+1, k ) = g10(i,j,k)
g11(m+nxt + 1 + NXG*NYG) = invTauRhoOne*g11(m+now) + ... !! g11'nxt(i+1, j , k+1) = g11(i,j,k)
g12(m+nxt - 1 - NXG*NYG) = invTauRhoOne*g12(m+now) + ... !! g12'nxt(i-1, j , k-1) = g12(i,j,k) (left)
g13(m+nxt + 1 - NXG*NYG) = invTauRhoOne*g13(m+now) + ... !! g13'nxt(i+1, j , k-1) = g13(i,j,k) (left)
g14(m+nxt - 1 + NXG*NYG) = invTauRhoOne*g14(m+now) + ... !! g14'nxt(i-1, j , k+1) = g14(i,j,k)
g15(m+nxt + NXG + NXG*NYG) = invTauRhoOne*g15(m+now) + ... !! g15'nxt(i , j+1, k+1) = g15(i,j,k)
g16(m+nxt - NXG - NXG*NYG) = invTauRhoOne*g16(m+now) + ... !! g16'nxt(i , j-1, k-1) = g16(i,j,k) (left)
g17(m+nxt + NXG - NXG*NYG) = invTauRhoOne*g17(m+now) + ... !! g17'nxt(i , j+1, k-1) = g17(i,j,k) (left)
g18(m+nxt - NXG + NXG*NYG) = invTauRhoOne*g18(m+now) + ... !! g18'nxt(i , j-1, k+1) = g18(i,j,k)
END DO
END DO
END DO
Key challenges:
- 49 streams of data (some unaligned)
– 19 read streams g* (blue)
– 19 write streams g* (orange)
– 7 read/write streams f* (green)
– 4 read streams (red) (phi, gradPhiX, gradPhiY, gradPhiZ)
- Prefetch / TLB associative issues
- Compiler not vectorizing some hot loops.
NX = NY = NZ = 128
Initial Results + Basic Optimizations

Figure: speedup chart for reference code vs. moderately optimized code (pragmas, compiler flags, etc.) on SNB-EP and KNC: the reference code runs at 1x on SNB-EP and 0.11x on KNC; moderate optimization brings SNB-EP to 8x and KNC to 16x.
• Straight porting:
  – KNC was 1/10 the performance of a 2-socket SNB-EP system
• Basic optimizations improve both CPU and MIC performance:
  – CPU perf improved by 8x
  – KNC perf improved by 150x
  – Optimizations: compiler vector directives, 2M pages, loop fusion, change in OMP work distribution

Data measured by Tanmay Gangwani, Shreeniwas Sapre and Sangeeta Bhattacharya on 2x8 SNB-EP 2.2GHz, KNC-ES2
Advanced Optimizations

• Include modifications to data structures and tiling to make better use of vector units, caches, and available memory bandwidth:
  – AOS-to-SOA transformation helps improve SIMD utilization by avoiding gather/scatter operations
  – 3.5D blocking takes advantage of large coherent caches
  – SW prefetching for MIC
• Results: CPU and MIC both further improve by 9x

Ref: Anthony Nguyen et al., "3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs," SC 2010.

The similarity of Intel® architectures makes these optimizations (basic + advanced) work for both the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor.
Agenda
• Why Many-Core?
• Intel Xeon Phi Family & ecosystem
• Intel Xeon Phi HW and SW architecture
• HPC design and development considerations for Xeon Phi
• Case Studies & Demo
ScaleMP

• vSMP Foundation aggregates multiple x86 systems into a single, virtual x86 system, letting applications SCALE their compute, memory, and I/O resources
• Simple to move an application to Intel® Xeon Phi™ coprocessor based systems using ScaleMP software: zero porting effort

Video: ScaleMP to Support Intel Xeon Phi Coprocessors. ScaleMP CEO Shai Fultheim describes how vSMP Foundation supports the new Intel Xeon Phi coprocessor. (From ISC'12, Hamburg)

Intel demo booth #: ScaleMP, SMP VM for Xeon & Xeon Phi
Programming Intel® MIC-based Systems

Figure: cluster diagrams spanning the spectrum from CPU-centric to Intel® MIC-centric models: CPU-hosted, offload, symmetric, reverse offload, and MIC-hosted. In each, data is distributed across the CPUs and MICs, with MPI communication over the network.
LBM running in a Xeon cluster with Xeon Phi coprocessors

Figure: homogeneous nodes, each pairing CPUs with a MIC card; data is exchanged via MPI across the network and via offload within each node. LES_MIC: load balance between CPU & MIC.
Copyright © 2012 Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Sponsors of Tomorrow., and the Intel Sponsors of Tomorrow. logo are trademarks of Intel Corporation in the U.S. and other countries.
Relative speedup with different problem sizes: LES CPU+MIC simultaneous computing design
Demo#: LES_MIC
Recommended reading: 4 books

Intel® Xeon Phi™ Coprocessor developer site: http://software.intel.com/mic-developer
• Architecture, setup, and programming resources
• Self-guided training
• Case studies
• Information on tools and ecosystem
• Support through the community forum