TRANSCRIPT
Parallel Computing and Intel® Xeon Phi™ coprocessors
Warren He Ph.D (何万青) [email protected]
Technical Computing Solution Architect
Datacenter & Connected System Group
Transition from Multicore to Manycore
Agenda
• Why Many-Core?
• Intel Xeon Phi Family & ecosystem
• Intel Xeon Phi HW and SW architecture
• HPC design and development considerations for Xeon Phi
• Case Studies & Demo
Figure: theoretical performance as a function of fraction parallel and % vector: to benefit, parallelize, vectorize, and scale to many-core.
* Theoretical acceleration of a highly parallel processor over an Intel® Xeon® processor
Scaling workloads for Intel® Xeon Phi™ coprocessors
Intel® Xeon Phi™ Coprocessor
All SKUs, pricing and features are PRELIMINARY and are subject to change without notice
Figure: many-core block diagram: multiple vector IA cores, each with a coherent cache, connected by an interprocessor network, together with fixed function logic, memory and I/O interfaces, PCIe, and GDDR5.
Notice: This document contains information on products in the design phase of development. The information here is subject to change without notice. Do not finalize a design with this information. Contact your local Intel sales office or your distributor to obtain the latest specification before placing your product order. Knights Corner and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user. All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
Silicon cores: 57-61
Silicon max frequency: 1.05-1.25 GHz
Double precision peak performance: 1003-1220 GFLOPS
GDDR5 devices: 24-32
Memory capacity: 6-8 GB
Memory channels: 12-16
Form factors: passive, active, dense form factor (DFF), no thermal solution (NTS); refer to picture
Peak memory bandwidth: 240-352 GB/s
Total cache: 28.5-30.5 MB
Board TDP: 225-300 Watts
PCIe: Gen 2 x16, I/O speeds up to 7.0 GB/s (KNC to host) and 6.7 GB/s (host to KNC)
Misc.: product codename Knights Corner; based on 22nm process; ECC enabled; Turbo not currently available
Power/management: future node manager support; IPMI, including Intel Coprocessor Communications Link
Tools: Intel C++/C/Fortran compilers, Intel Math Kernel Library; debuggers, performance and correctness analysis tools; OpenMP, MPI, OFED messaging infrastructure (Linux only), OpenCL; programming models: offload, native, and mixed offload+native
OS: RHEL 6.x, SuSE 11.1, Microsoft Windows 8 Server, Windows 7, Windows 8
Intel® Many Integrated Core Architecture: Processor Numbering Overview

Example: Intel® Xeon Phi™ coprocessor 7110P
• Brand: designates the family of products
• Performance shelf (first digit): indication of relative pricing and performance within a generation. 7 = best performance, 5 = best performance/watt, 3 = best value
• Generation (second digit): 1 = Knights Corner, 2 = Knights Landing
• SKU number (remaining digits)
• Suffix: where applicable, may denote form factor, thermal solution, or usage model. P = passive thermal solution, A = active thermal solution, X = no thermal solution, D = dense form factor
Software Development Ecosystem1 for Intel® Xeon Phi™ Coprocessor

Compilers, run environments:
• Open source: gcc (kernel build only, not for applications), python*
• Commercial: Intel® C++ Compiler, Intel® Fortran Compiler, MYO, CAPS* HMPP* 3.2.5 (Beta) compiler, PGI*, PGAS GPI (Fraunhofer ITWM*), ISPC

Debuggers:
• Open source: gdb
• Commercial: Intel Debugger, RogueWave* TotalView* 8.9, Allinea* DDT 3.3

Libraries:
• Open source: TBB2, MPICH2 1.5, FFTW, NetCDF
• Commercial: NAG*, Intel® Math Kernel Library, Intel® MPI Library, Intel® OpenMP*, Intel® Cilk™ Plus (in Intel compilers), MAGMA*, Accelereyes* ArrayFire 2.0 (Beta), Boost C++ Libraries 1.47+

Profiling & analysis tools: Intel® VTune™ Amplifier XE, Intel® Trace Analyzer & Collector, Intel® Inspector XE, Intel® Advisor XE, TAU* ParaTools* 2.21.4

Virtualization: ScaleMP* vSMP Foundation 5.0, Xen* 4.1.2+

Cluster, workload management, and manageability tools: Altair* PBS Professional 11.2, Adaptive* Computing Moab 7.2, Bright* Cluster Manager 6.1 (Beta), ET International* SWARM (Beta), IBM* Platform Computing* {LSF 8.3, HPC 3.2 and PCM 3.2}, MPICH2, Univa* Grid Engine 8.1.3

1These are all announced as of November 2012. Intel has said there are more actively being developed but not yet announced. 2Commercial support of Intel TBB available from Intel.
Technical Compute Target Market Segments & Applications

Sub-segments and Intel® Xeon Phi™ candidate applications/workloads:
• Public sector: HPL, HPCC, NPB, LAMMPS, QCD
• Energy (including oil & gas): RTM (Reverse Time Migration), WEM (Wave Equation Migration)
• Climate modeling & weather simulation: WRF, HOMME
• Financial analysis: Monte Carlo, Black-Scholes, binomial model, Heston model
• Life sciences (molecular dynamics, gene sequencing, bio-chemistry): LAMMPS, NAMD, AMBER, HMMER, BLAST, QCD
• Manufacturing: CAD/CAM/CAE/CFD/EDA implicit and explicit solvers
• Digital content creation: ray tracing, animation, effects
• Software ecosystem: tools, middleware

Mix of ISV and end-user development.
Agenda
• Why Many-Core?
• Intel Xeon Phi Family & ecosystem
• Intel Xeon Phi HW and SW architecture
• HPC design and development considerations for Xeon Phi
• Case Studies & Demo
Intel® Xeon Phi™ coprocessor codenamed “Knights Corner” Block Diagram
• Cores: lightweight, power efficient, in-order, support 4 hardware threads
• Vector Processing Unit (VPU) on each core, with 32 512-bit SIMD registers
• High-speed bidirectional ring interconnect
• Distributed tag directory enables faster snooping
Speeds and Feeds

Per-core caches reference info:

Type          | Size  | Ways | Set conflict | Location
L1I (instr)   | 32KB  | 4    | 8KB          | on-core
L1D (data)    | 32KB  | 8    | 4KB          | on-core
L2 (unified)  | 512KB | 8    | 64KB         | connected via core/ring interface

Memory:
• 8 memory controllers, each supporting two 32-bit channels
• GDDR5 channels with a theoretical peak of 5.5 GT/s
• Practical peak bandwidth limited by the ring and other factors
Speeds and Feeds, Part 2

Per-core TLBs reference info:

Type           | Entries | Page Size          | Coverage
L1 Instruction | 32      | 4KB                | 128KB
L1 Data        | 64      | 4KB                | 256KB
L1 Data        | 32      | 64KB               | 2MB
L1 Data        | 8       | 2MB                | 16MB
L2             | 64      | 4KB, 64KB, or 2MB  | up to 128MB

• Linux support for 64KB pages on Intel64 not yet available
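The Coverage column is simply entries multiplied by page size; a quick C sketch (illustrative only) makes the arithmetic explicit:

```c
/* TLB reach (coverage) in KB = number of entries * page size in KB */
static long tlb_coverage_kb(long entries, long page_kb) {
    return entries * page_kb;
}
```

For example, 64 data-TLB entries at 4KB pages cover 256KB, while 64 L2 entries at 2MB pages cover 128MB.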
Knights Corner Core
X86 specific logic < 2% of core + L2 area
Intel Confidential
Knights Corner Silicon Core Architecture

Figure: core block diagram: instruction decode feeding a scalar unit (with scalar registers) and a vector unit (with vector registers), an L1 cache (data and instruction), an L2 cache, and the interprocessor network.

• In-order, 64-bit, 4 threads per core, fully coherent
• 512-bit-wide vector unit with specialized instructions (new encoding)
• 32 KB L1 cache per core
• L2 hardware prefetching; 512 KB L2 slice per core with fast access to the local copy
• Improved ring interconnect delivering 3x bandwidth over the Aubrey Isle silicon ring
• VPU: integer, SP, DP; 3-operand, 16-wide instructions; HW transcendentals
Vector Processing Unit

Figure: VPU pipeline (PPF, PF, D0, D1, D2, E, WB, with vector stages VC1, VC2, V1-V4) and block diagram: decode, vector register file (3 reads, 1 write), mask register file, scatter/gather, load/store, EMU, and vector ALUs (16-wide x 32-bit, 8-wide x 64-bit) with fused multiply-add.
Peak Performance = clock frequency * number of cores * 8 lanes (DP) * 2 (FMA) FLOPS/cycle
e.g. 1.091 GHz * 61 cores * 8 lanes * 2 = 1064.8 GFLOPS
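The peak number can be reproduced with a few lines of C (a sketch; the 8 DP lanes and the factor of 2 for FMA are taken from the formula above):

```c
/* Peak DP GFLOPS = clock (GHz) * cores * DP SIMD lanes * 2 (FMA) */
static double peak_dp_gflops(double clock_ghz, int cores, int dp_lanes) {
    return clock_ghz * cores * dp_lanes * 2.0;
}
```

Plugging in the 61-core, 1.091 GHz part gives approximately 1064.8 GFLOPS, matching the slide.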
Wider vectorization opportunity

Illustrated with the number of 32-bit data elements processed by one "packed" instruction:
• Intel® Pentium® processor (1993)
• MMX™ (1997)
• Intel® Streaming SIMD Extensions (Intel® SSE in 1999 to Intel® SSE4.2 in 2008)
• Intel® Advanced Vector Extensions (Intel® AVX in 2011 and Intel® AVX2 in 2013)
• Intel® Xeon Phi™ coprocessor (in 2012)

Effective use of an ever increasing number of vector elements and new instruction sets is vital to Intel's success. Multi-cores and many-cores on top of this!

1/7/2013
Knights Corner PCI Express Card Block Diagram
Intel® Xeon Phi™ Coprocessors Software Architecture (MPSS)
Intel® Xeon Phi™ Coprocessors: They're So Much More

General purpose IA hardware leads to less idle time for your investment. The coprocessor can operate as a compute node: run a full OS, program to MPI, run x86 code. It's a supercomputer on a chip.

By contrast, custom HW acceleration (GPU, ASIC, FPGA) runs restricted or offloaded code. Restrictive architectures limit the ability of applications to use arbitrary nested parallelism, function calls, and threading models.

Source: Intel estimates
MIC Software Stack

Figure: host and MIC software stacks connected by PCI-E communications HW. On each side, offload tools/applications run in user mode on top of an offload library, MYO, COI, and SCIF, over the Linux kernel in kernel mode.
Spectrum of Execution Models of Heterogeneous Computing on Xeon and Xeon Phi Coprocessor

From CPU-centric to Intel® MIC-centric, with productive programming models across the spectrum:
• Multi-core hosted (Intel® Xeon processor): general purpose serial and parallel computing
• Offload: codes with highly-parallel phases
• Symmetric: codes with balanced needs
• Reverse offload: codes with serial phases (not currently supported by Language Ext. Offload for Intel tools)
• Many-core hosted (Intel® Many Integrated Core): highly-parallel codes

In each model, main(), foo(), and MPI_*() run on the multi-core host, on the many-core coprocessor, or on both, connected over PCIe. All models except reverse offload are supported with Intel tools.
MPSS Setup Step by Step
Intel® Many Integrated Core Architecture Marketing Document ID: 487731
Tools for both OEMs and end users:
• micsmc: a control panel application for system management and viewing configuration status information specific to SMC operations on one or more MIC cards in a single platform.
• micflash: a command line utility to update both the MIC SPI flash and the SMC firmware, as well as grab various information about the flash from either the ROM on the MIC card or a ROM file.
• micinfo: a command line utility that allows the OEM and the end user to get various information (version data, memory info, bus speeds, etc.) from one or more MIC cards in a single platform.
• hwcfg: a command line utility that can be used to view and configure various hardware aspects of one or more MIC cards in a single platform.

Tools for OEMs only:
• micdiag: a command line utility that performs various diagnostics on one or more MIC cards in a single platform.
• micerrinj: a command line utility that allows OEMs to inject one or more ECC errors (correctable, uncorrectable, or fatal) into the GDDR on a MIC card to test platform error handling capabilities.
• KncPtuGen: a power/thermal command line utility that can be used to put various loads on one or more individual cores on a MIC card to simulate compromised performance on specific cores.
• micprun (OEM workloads): a command line utility that provides an environment for executing various performance workloads (provided by Intel) that allow the OEM to test their platform's performance over time and compare to Intel's "Gold" configuration in the factory.
‘micsmc’ – MIC SMC Control Panel
• What does it do?
– A graphical display of SMC status items, such as temperature, memory usage and core utilization for one or multiple MIC cards on a single platform.
– Command Line Parameters available for scripting purposes.
• What does it look like?
‘micsmc’ Card View Windows
Utilization View
Core Histogram View
Historical Utilization View
‘micprun’ – MIC Performance Tool
• What can it do?
– This is actually a wrapper that can be used to execute the OEM workloads.
– Neatly wraps up many of the parameters specific to the workload in addition to setting up workload comparisons to previous executions and even to Intel “Gold” system configurations.
• What does it look like?
– ./micprun [-k workload] [-p workloadParams] [-c category] [-x method] [-v level] [-o output] [-t tag] [-r pickle]
– Workload (-k for kernel) specifies which workload to run: SGEMM, DGEMM, etc.
– Categories (-c) include "optimal", "scaling", "test".
– Methods (-x) include "native", "scif", "coi" and "myo".
– Levels (-v) specify the level of detail in the output.
– Output (-o) specifies to create a "pickle" file for future comparisons (use tag (-t) to specify a custom name).
– Pickle (-r) is used to compare the current workload execution to a previously saved workload execution and execute the current run exactly the same as the specified pickle file.
‘micprun’ continued…
• Show me some examples!
– Run all of the native workloads with optimal parameters:
  ./micprun
  ./micprun -k all -c optimal -x native -v 0
– Run SGEMM in scaling mode, and save to a pickle file:
  ./micprun -k sgemm -c scaling -o . -t mySgemm
– Run a comparison to the above example and create a graph:
  ./micprun -r micp_run_mySgemm -v 2
– View the SGEMM workload parameters:
  ./micprun -k sgemm --help
– Run SGEMM on only 4 cores:
  ./micprun -k sgemm -p '--num_core 4'
Agenda
• Why Many-Core?
• Intel Xeon Phi Family & ecosystem
• Intel Xeon Phi HW and SW architecture
• HPC design and development considerations for Xeon Phi
• Case Studies & Demo
Considerations for Application Development on MIC

Scientific parallel application considerations span the workload and the algorithm implementation: dataset size, grid shape, data access patterns, SPMD loop structure, fork/join structure, load balancing, and communication.
Considerations in Selecting Workload Size

Applicability of Amdahl's law:
• Speedup = (s + p)/(s + p/N) (1), where s is the serial fraction, p the parallel fraction, and N the number of parallel processors
• Using the single-processor runtime to normalize, s + p = 1, so Speedup = 1/(s + p/N)
• This indicates the serial portion s will dominate on massively parallel hardware like Intel® Xeon Phi™, limiting the advantage of the many-core hardware

Ref: "Scientific Parallel Computing", L. Ridgway Scott et al.
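Equation (1) is easy to evaluate numerically; a small C sketch shows how harshly even a tiny serial fraction caps speedup on a 61-core part:

```c
/* Amdahl's law, eq. (1) with s + p = 1: speedup = 1 / (s + (1 - s)/N) */
static double amdahl_speedup(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}
```

With s = 0 the speedup is the full N = 61, but a mere 5% serial fraction drops it to about 15.25x.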
Scaled vs. Fixed Size Speedup: Amdahl's Law Implication

Amdahl's law has been considered harmful in designing parallel architectures. Gustafson's [1] observation:
• For scientific computing problems, problem size increases with parallel computing power
• The parallel component of the problem scales with problem size
• Ask "how long would a given parallel program have taken on a serial processor?"
• Scaled Speedup = (s' + p'N)/(s' + p'), where s' and p' represent the serial and parallel time spent on the parallel system, with s' + p' = 1
• Scaled Speedup = s' + (1 - s')N = N + (1 - N)s' (2)

Gustafson's Law: when speedup is measured by scaling the problem size, the serial fraction tends to shrink as more processors are added.

Fig: John L. Gustafson, "Reevaluating Amdahl's Law"
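Equation (2) can be sketched in C the same way; the contrast with the fixed-size formula (1) is striking for the same 5% serial time:

```c
/* Gustafson's scaled speedup, eq. (2): s' + (1 - s')*N = N + (1 - N)*s' */
static double gustafson_scaled_speedup(double s_prime, int n) {
    return s_prime + (1.0 - s_prime) * n;
}
```

With s' = 0.05 and N = 61 the scaled speedup is 58.0, versus roughly 15x from the fixed-size Amdahl formula.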
Intel® Xeon Phi™ Coprocessors Workload Suitability

Figure: suitability flowchart and performance-vs-threads chart (representative example workload for illustration purposes only). The chart shows performance rising past the multi-core peak toward the many-core peak as thread count grows. The flowchart asks:
• Can your workload scale to over 100 threads?
• Can your workload benefit from large vectors?
• Can your workload benefit from more memory bandwidth?
If the application scales with threads and vectors or memory bandwidth, it is a fit for Intel® Xeon Phi™ coprocessors.

Click to see the animation video on exploiting parallelism on Intel® Xeon Phi™ coprocessors.
A picture speaks a thousand words.
The Keys to Take Advantages of MIC
• Choose the right Xeon-centric or MIC-centric model for your application
• Vectorize your application
– Use Intel’s vectorizing compiler
• Parallelize your application
– With MPI (or other multiprocess model)
– With threads (via Pthreads, TBB, Cilk Plus, OpenMP, etc.)
• Consider standard optimization techniques such as tiling, overlapping computation and communication, etc.
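The tiling technique mentioned above can be sketched in plain C; a matrix transpose is a classic case where processing TILE x TILE blocks keeps the destination cache-resident (TILE is an illustrative value that would be tuned to cache size, and n is assumed to be a multiple of TILE):

```c
#define TILE 16

/* Tiled matrix transpose: iterate over TILE x TILE blocks so each
 * block of b stays in cache while it is written. */
static void transpose_tiled(const float *a, float *b, int n) {
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    b[(long)j * n + i] = a[(long)i * n + j];
}
```

The same blocking idea underlies the cache optimizations discussed for the LBM case study later in the talk.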
Intel® Xeon Phi™ SW Imperatives

• For an Intel® Xeon Phi™ program to succeed, think differently and guide the programmers to think differently.

Figure: speed over time for two porting strategies: a naive port followed by a long optimization phase is a losing strategy; an intelligent port followed by a shorter optimization phase is the winning strategy.

Intelligent application porting to Intel® Xeon Phi™ platforms requires parallelization and vectorization.
Algorithm and Data Structure

One needs to consider:
• The type of parallel algorithm to use for the problem: look carefully at every sequential part of the algorithm
• Memory access patterns: input and output data sets
• Communication between threads
• Work division between threads/processes: load balancing and granularity
Parallelism Pattern For Xeon Phi™
Memory Access Patterns
• The algorithm determines the data access patterns.
• Data access patterns can be categorized into:
  – Arrays/linear data structures: conducive to low latency and high bandwidth; look out for unaligned access
  – Random access: try to avoid; needs prefetching if possible (fewer threads to hide latency compared to a Tesla implementation)
  – Recursive (tree)
• Data decomposition:
  – Geometric decomposition: regular or irregular
  – Recursive decomposition
• Auto-vectorization may be improved by additional pragmas like #pragma ivdep
• A more aggressive pragma is #pragma simd: it uses less strict heuristics and may produce wrong code, so results must be carefully checked
• Experts may use compiler intrinsics. The caveat is the loss of portability; a portable layer between the intrinsics and the application may help
Vectorization methodologies
The compiler has to assume the worst case the language/flags allow.

  float *A;
  void vectorize() {
    int i;
    for (i = 0; i < 102400; i++) {
      A[i] *= 2.0f;
    }
  }

Loop body:
• Load of A
• Load of A[i]
• Multiply with 2.0f
• Store of A[i]
3) Recompile with -ansi-alias
• icc -vec-report1 -ansi-alias test1.c
  – test1.c(4): (col. 3) remark: LOOP WAS VECTORIZED.
4) Change "float *A" to "float *restrict A".
• icc -vec-report1 test1a.c
  – test1a.c(4): (col. 3) remark: LOOP WAS VECTORIZED.
5) Add "#pragma ivdep" to the loop.
• icc -vec-report1 test1b.c
  – test1b.c(5): (col. 3) remark: LOOP WAS VECTORIZED.
Q: Will the store modify A?
A: Maybe
“NO” is needed to make vectorization legal.
Writing Explicit Vector Code with Intel® Cilk™ Plus

  float *A;
  void vectorize() {
    for (int i = 0; i < 102400; i++) {
      A[i] *= 2.0f;
    }
  }

8) Using the SIMD pragma:

  float *A;
  void vectorize() {
    #pragma simd vectorlength(4)
    for (int i = 0; i < 102400; i++) {
      A[i] *= 2.0f;
    }
  }

9) Using array notation:

  float *A;
  void vectorize() {
    A[0:102400] *= 2.0f;
  }

10) Using an elemental function (note the pointer must be passed as an argument):

  float *A;
  __declspec(noinline)
  __declspec(vector(uniform(p), linear(i)))
  void mul(float *p, int i) {
    p[i] *= 2.0f;
  }
  void vectorize() {
    for (int i = 0; i < 102400; i++) {
      mul(A, i);
    }
  }
Memory Accesses and Alignment

• Memory access patterns
• Alignment optimization
Memory access patterns:
• Unit-stride: A[i], A[i+1], A[i+2], A[i+3], ...
• Strided (a special form of gather/scatter): A[2*i], A[2*(i+1)], A[2*(i+2)], ...
• Gather/scatter: A[B[i]], A[B[i+1]], A[B[i+2]], ...

If you write in Cilk™ Plus array notation, access patterns are obvious at a glance: unit-stride means A[:] or A[lb:#elems], which helps you think more clearly.

Alignment of unit-stride accesses (SIZE is 64B for Xeon Phi™, 32B for AVX1/2, 16B for SSE4.2 and below):
• Aligned unit-stride: Addr % SIZE == 0
• Alignment unknown: Addr % SIZE == ???
• Misaligned unit-stride: Addr % SIZE != 0
Memory Access Patterns

  float A[1000];
  struct BB { float x, y, z; } B[1000];
  int C[1000];

  for (i = 0; i < n; i++) {
    a) A[i]                       // unit-stride
    b) A[2*i]                     // strided
    c) B[i].x                     // strided
    d) j = C[i]                   // unit-stride
       A[j]                       // gather/scatter
    e) A[C[i]]                    // same as d)
    f) B[i].x + B[i].y + B[i].z   // 3 strided loads
  }
• Unit-Stride
– Good --- with one exception.
– Out-of-order cores won’t store-forward masked (unit-stride) store.
– Avoid storing under the mask (unit-stride or not), but unnecessary stores can also create bandwidth problem. Ouch.
• Strided, Gather/Scatter
– Perf varies on micro-arch and the actual index patterns.
– Big latency is exposed if you have these on the critical path
– F90 Arrays tend to be strided if not carefully used. See this article. Use F2008 “contiguous” attribute.
http://software.intel.com/en-us/articles/fortran-array-data-and-arguments-and-vectorization
Data Alignment Optimization

  float *A, *B, *C, *D;
  A[0:n] = B[0:n] * C[0:n] + D[0:n];

• Explain the compiler's assumptions on data alignment and how they relate to the generated code.
• Identify different ways to optimize the data alignment of this code.

1) Understand the baseline assumption:
   icc -xSSE4.2 -vec-report1 -restrict test2.c
   ifort -xSSE4.2 -vec-report1 ftest2.f
   icc -mmic -vec-report6 test2.c
   ifort -mmic -vec-report6 ftest2.f
   Examine the ASM code.
2) Add "vector aligned":
   icc -xSSE4.2 -vec-report1 test2a.c
   ifort -xSSE4.2 -vec-report1 ftest2a.f
   icc -mmic -vec-report6 test2a.c
   ifort -mmic -vec-report6 ftest2a.f
   Examine the ASM code.
3) Compile with -opt-assume-safe-padding:
   icc -mmic -vec-report6 -opt-assume-safe-padding test2.c
   ifort -mmic -vec-report6 -opt-assume-safe-padding ftest2.f
   Examine the ASM code.
Align the Data at Allocation and Tell the Compiler at the Use

C/C++:
• Align at data allocation:
  __declspec(align(n, [off]))
  _mm_malloc(size, n)
• Inform the compiler at use sites:
  #pragma vector aligned
  __assume_aligned(ptr, n);
  __assume(var % x == 0);

Fortran:
• Align at data allocation:
  !DIR$ ATTRIBUTES ALIGN: n :: var
  -align arrayNbyte (N = 16, 32, 64)
• Inform the compiler at use sites:
  !DIR$ VECTOR ALIGNED
  !DIR$ ASSUME_ALIGNED var: n
Reading Material: http://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization
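A minimal sketch of aligned allocation, using C11 aligned_alloc as a portable stand-in for Intel's _mm_malloc (the helper name is illustrative; 64 bytes matches the Xeon Phi vector width noted earlier):

```c
#include <stdlib.h>

/* Allocate a 64-byte-aligned float array. aligned_alloc requires
 * the byte count to be a multiple of the alignment, so round up. */
static float *alloc_floats_aligned64(size_t n) {
    size_t bytes = (n * sizeof(float) + 63) & ~(size_t)63;
    return aligned_alloc(64, bytes);
}
```

A pointer returned this way satisfies the Addr % 64 == 0 condition from the alignment discussion, so a "vector aligned" hint on loops over it is safe.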
Array of Struct is BAD

  struct node { float x, y, z; };
  struct node NODES[1024];
  float dist[1024];
  for (i = 0; i < 1024; i += 16) {
    float x[16], y[16], z[16], d[16];
    x[:] = NODES[i:16].x;
    y[:] = NODES[i:16].y;
    z[:] = NODES[i:16].z;
    d[:] = sqrtf(x[:]*x[:] + y[:]*y[:] + z[:]*z[:]);
    dist[i:16] = d[:];
  }

• Use a struct of arrays:

  struct node1 { float x[1024], y[1024], z[1024]; };
  struct node1 NODES1;

• Or use a tiled array of structs:

  struct node2 { float x[16], y[16], z[16]; };
  struct node2 NODES2[64];
Example in Optimizing LBM for MIC

• Updates f_i(x, t), the particle distribution function of particles at position x at time t traveling in direction e_i
• D3Q19 model:
  – Each cell has |e_i| = 19 values
  – Uses 19 values at the current node, does ~12 flops/direction (220 flops total), and writes 19 neighbors
  – Has special-case computation at boundaries (a flag array is used here)

  f_i(x + e_i*dt, t + dt) = f_i(x, t) + Omega_i(f(x, t))
  Omega_i(f(x, t)) = (1/tau) * (f_i^(0) - f_i)
DO k = 1, NZ
DO j = 1, NY
!DIR$ IVDEP
DO i = 1, NX
m = i + NXG*( j + NYG*k )
phin = phi(m)
rhon = g0(m+now) + g1(m+now) + g2(m+now) + g3(m+now) + g4(m+now) &
+ g5(m+now) + g6(m+now) + g7(m+now) + g8(m+now) + g9(m+now) &
+ g10(m+now) + g11(m+now) + g12(m+now) + g13(m+now) + g14(m+now) &
+ g15(m+now) + g16(m+now) + g17(m+now) + g18(m+now)
sFx = muPhin*gradPhiX(m)
sFy = muPhin*gradPhiY(m)
sFz = muPhin*gradPhiZ(m)
ux = ( g1(m+now) - g2(m+now) + g7(m+now) - g8(m+now) + g9(m+now) &
- g10(m+now) + g11(m+now) - g12(m+now) + g13(m+now) - g14(m+now) &
+ 0.5D0*sFx )*invRhon
uy = ( g3(m+now) - g4(m+now) + g7(m+now) - g8(m+now) - g9(m+now) &
+ g10(m+now) + g15(m+now) - g16(m+now) + g17(m+now) - g18(m+now) &
+ 0.5D0*sFy )*invRhon
uz = ( g5(m+now) - g6(m+now) + g11(m+now) - g12(m+now) - g13(m+now) &
+ g14(m+now) + g15(m+now) - g16(m+now) - g17(m+now) + g18(m+now) &
+ 0.5D0*sFz )*invRhon
f0(m+nxt) = invTauPhi1*f0(m+now) + Af0 + invTauPhi*phin
f1(m+now) = invTauPhi1*f1(m+now) + Af1 + Cf1*ux
f2(m+now) = invTauPhi1*f2(m+now) + Af1 - Cf1*ux
f3(m+now) = invTauPhi1*f3(m+now) + Af1 + Cf1*uy
f4(m+now) = invTauPhi1*f4(m+now) + Af1 - Cf1*uy
f5(m+now) = invTauPhi1*f5(m+now) + Af1 + Cf1*uz
f6(m+now) = invTauPhi1*f6(m+now) + Af1 - Cf1*uz
g0(m+nxt) = invTauRhoOne*g0(m+now) + ... !! g0'nxt(i , j , k ) = g0(i,j,k) 1
g1(m+nxt + 1) = invTauRhoOne*g1(m+now) + ... !! g1'nxt(i+1, j , k ) = g1(i,j,k) 1
g2(m+nxt - 1) = invTauRhoOne*g2(m+now) + ... !! g2'nxt(i-1, j , k ) = g2(i,j,k) 1 (left)
g3(m+nxt + NXG) = invTauRhoOne*g3(m+now) + ... !! g3'nxt(i , j+1, k ) = g3(i,j,k) 2
g4(m+nxt - NXG) = invTauRhoOne*g4(m+now) + ... !! g4'nxt(i , j-1, k ) = g4(i,j,k) 3 (left)
g5(m+nxt + NXG*NYG) = invTauRhoOne*g5(m+now) + ... !! g5'nxt(i , j , k+1) = g5(i,j,k) 4
g6(m+nxt - NXG*NYG) = invTauRhoOne*g6(m+now) + ... !! g6'nxt(i , j , k-1) = g6(i,j,k) (left)
g7(m+nxt + 1 + NXG) = invTauRhoOne*g7(m+now) + ... !! g7'nxt(i+1, j+1, k ) = g7(i,j,k)
g8(m+nxt - 1 - NXG) = invTauRhoOne*g8(m+now) + ... !! g8'nxt(i-1, j-1, k ) = g8(i,j,k) (left)
g9(m+nxt + 1 - NXG) = invTauRhoOne*g9(m+now) + ... !! g9'nxt(i+1, j-1, k ) = g9(i,j,k) (left)
g10(m+nxt - 1 + NXG) = invTauRhoOne*g10(m+now) + ... !! g10'nxt(i-1, j+1, k ) = g10(i,j,k)
g11(m+nxt + 1 + NXG*NYG) = invTauRhoOne*g11(m+now) + ... !! g11'nxt(i+1, j , k+1) = g11(i,j,k)
g12(m+nxt - 1 - NXG*NYG) = invTauRhoOne*g12(m+now) + ... !! g12'nxt(i-1, j , k-1) = g12(i,j,k) (left)
g13(m+nxt + 1 - NXG*NYG) = invTauRhoOne*g13(m+now) + ... !! g13'nxt(i+1, j , k-1) = g13(i,j,k) (left)
g14(m+nxt - 1 + NXG*NYG) = invTauRhoOne*g14(m+now) + ... !! g14'nxt(i-1, j , k+1) = g14(i,j,k)
g15(m+nxt + NXG + NXG*NYG) = invTauRhoOne*g15(m+now) + ... !! g15'nxt(i , j+1, k+1) = g15(i,j,k)
g16(m+nxt - NXG - NXG*NYG) = invTauRhoOne*g16(m+now) + ... !! g16'nxt(i , j-1, k-1) = g16(i,j,k) (left)
g17(m+nxt + NXG - NXG*NYG) = invTauRhoOne*g17(m+now) + ... !! g17'nxt(i , j+1, k-1) = g17(i,j,k) (left)
g18(m+nxt - NXG + NXG*NYG) = invTauRhoOne*g18(m+now) + ... !! g18'nxt(i , j-1, k+1) = g18(i,j,k)
END DO
END DO
END DO
Key challenges:
- 49 streams of data (some unaligned)
– 19 read streams g* (blue)
– 19 write streams g* (orange)
– 7 read/write streams f* (green)
– 4 read streams (red) (phi, gradPhiX, gradPhiY, gradPhiZ)
- Prefetch / TLB associative issues
- Compiler not vectorizing some hot loops.
NX = NY = NZ = 128
Initial Results + Basic Optimizations

Figure: speedup chart for reference code vs. moderately optimized code (pragmas, compiler flags, etc.) on SNB-EP and KNC: the reference code runs at 1x on SNB-EP and 0.11x on KNC; moderate optimization brings SNB-EP to 8x and KNC to 16x.
• Straight porting:
  – KNC was 1/10 the performance of a 2-socket SNB-EP system
• Basic optimizations improve both CPU and MIC performance:
  – CPU perf improved by 8x
  – KNC perf improved by 150x
  – Optimizations: compiler vector directives, 2M pages, loop fusion, change in OMP work distribution

Data measured by Tanmay Gangwani, Shreeniwas Sapre and Sangeeta Bhattacharya on 2x8 SNB-EP 2.2GHz, KNC-ES2
Advanced Optimizations

• Include modifications to data structures and tiling to make better use of vector units, caches, and available memory bandwidth:
  – AOS-to-SOA transformation helps improve SIMD utilization by avoiding gather/scatter operations
  – 3.5D blocking takes advantage of large coherent caches
  – SW prefetching for MIC
• Results: CPU and MIC both further improve by 9x

Ref: Anthony Nguyen et al., "3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs," SC 2010.

The similarity of Intel® architectures makes these optimizations (basic + advanced) work for both the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor.
Agenda
• Why Many-Core?
• Intel Xeon Phi Family & ecosystem
• Intel Xeon Phi HW and SW architecture
• HPC design and development considerations for Xeon Phi
• Case Studies & Demo
ScaleMP

• vSMP Foundation aggregates multiple x86 systems into a single, virtual x86 system, letting applications SCALE their compute, memory, and I/O resources
• Simple to move an application to Intel® Xeon Phi™ coprocessor based systems using ScaleMP software: zero porting effort

Video: ScaleMP to Support Intel Xeon Phi Coprocessors. ScaleMP CEO Shai Fultheim describes how vSMP Foundation supports the new Intel Xeon Phi coprocessor. (From ISC'12, Hamburg)

Intel demo booth #: ScaleMP, SMP VM for Xeon & Xeon Phi
Programming Intel® MIC-based Systems

Figure: cluster diagrams spanning the spectrum from CPU-centric to Intel® MIC-centric models: CPU-hosted, offload, symmetric, reverse offload, and MIC-hosted. In each, data is distributed across the CPUs and MICs, with MPI communication over the network.
LBM running in a Xeon cluster with Xeon Phi coprocessors

Figure: homogeneous nodes, each pairing CPUs with a MIC card; data is exchanged via MPI across the network and via offload within each node. LES_MIC: load balance between CPU & MIC.
Copyright © 2012 Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Sponsors of Tomorrow., and the Intel Sponsors of Tomorrow. logo are trademarks of Intel Corporation in the U.S. and other countries.
Relative speedup with different problem sizes: LES CPU+MIC simultaneous computing design
Demo#: LES_MIC
Recommended reading: 4 books

Intel® Xeon Phi™ Coprocessor developer site: http://software.intel.com/mic-developer
• Architecture, setup, and programming resources
• Self-guided training
• Case studies
• Information on tools and ecosystem
• Support through the community forum