Vectorization
Shuo Li
Financial Services Engineering
Software and Services Group
Intel Corporation
iXPTC 2013 Intel® Xeon Phi™ Coprocessor
Legal Notices
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
This document contains information on products in the design phase of development.
Cilk, Core Inside, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Itanium, Itanium Inside, MCS, MMX, Pentium, Pentium Inside, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, XMM, are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.
Copyright © 2012 Intel Corporation. All rights reserved.
Agenda
• Vectorization Overview
• Compiler-Based Autovectorization
• Lab Step 3 Vectorization
• Intel Cilk Plus for Vectorization
• Intel C/C++ Vector Classes
• Summary
Vectorization Overview
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2®, SSE3, and SSSE3 instruction sets and
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for
use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding
the specific instruction sets covered by this notice.
Notice revision #20110804
Vectorization and SIMD Execution
• SIMD
  – Flynn's Taxonomy: Single Instruction, Multiple Data
  – The CPU performs the same operation on multiple data elements
• SISD
  – Single Instruction, Single Data
• Vectorization
  – In the context of Intel® Architecture processors, the process of transforming a scalar operation (SISD) that acts on a single data element into a vector operation that acts on multiple data elements at once (SIMD)
  – Assuming that setup code does not tip the balance, this can result in more compact and efficient generated code
  – In "normal" or "unvectorized" loop code, each assembly instruction deals with the data from only a single loop iteration
v5 = 0 4 7 8 3 9 2 0 6 3 8 9 4 5 0 1
v6 = 9 4 8 2 0 9 4 5 5 3 4 6 9 1 3 0
vcmppi_lt k7, v5, v6
k7 = 1 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0
v3 = 5 6 7 8 5 6 7 8 5 6 7 8 5 6 7 8
v1 = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
vaddpi v1{k7}, v1, v3
v1 = 6 1 8 1 1 1 8 9 1 1 1 1 6 1 8 1
SIMD Abstraction – Vectorization/SIMD
for (i = 0; i < 16; i++)
    if (v5[i] < v6[i])
        v1[i] += v3[i];

SIMD can simplify your code and reduce the jumps and breaks in program flow control. Note the lack of jumps or conditional branches in the masked vector sequence above.
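The masked compare-and-add sequence above can be reproduced element by element in scalar C, which makes it easy to check the lane values shown on the slide. A minimal sketch; the function name is invented for illustration:

```cpp
#include <cassert>

// Scalar equivalent of the vcmppi/vaddpi sequence above: for each lane,
// compare v5[i] < v6[i]; where the comparison is true (the mask k7),
// add v3[i] into v1[i]. All other lanes are left untouched.
void masked_add(const int* v5, const int* v6, const int* v3, int* v1, int n) {
    for (int i = 0; i < n; i++)
        if (v5[i] < v6[i])      // this comparison becomes the mask register k7
            v1[i] += v3[i];     // the add happens only in masked lanes
}
```

Running it on the slide's 16-lane inputs reproduces the v1 result shown (6 1 8 1 1 1 8 9 1 1 1 1 6 1 8 1).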
Software Behind the Vectorization
float *restrict A, *B, *C;
for (i = 0; i < n; i++) {
    A[i] = B[i] + C[i];
}
• [SSE2] 4 elements at a time: addps xmm1, xmm2
• [AVX] 8 elements at a time: vaddps ymm1, ymm2, ymm3
• [IMCI] 16 elements at a time: vaddps zmm1, zmm2, zmm3

Vector (or SIMD) code computes more than one element at a time.
[Figure: SIMD lanes computing Xi op Yi in parallel - x87 (scalar), SSE2 (bits 0-127: X0..X3 op Y0..Y3), AVX (bits 0-255: X0..X7 op Y0..Y7), IMCI (bits 0-511: X0..X15 op Y0..Y15)]
Hardware Resources Behind the Vectorization
• The CPU has a lot of computation power in the form of its SIMD units.
• XMM (128-bit) can operate on:
  – 16x chars
  – 8x shorts
  – 4x dwords/floats
  – 2x qwords/doubles/float complex
• YMM (256-bit) can operate on:
  – 32x chars
  – 16x shorts
  – 8x dwords/floats
  – 4x qwords/doubles/float complex
  – 2x double complex
• Intel® Xeon Phi™ Coprocessor (512-bit) can operate on:
  – 16x chars/shorts (converted to int)
  – 16x dwords/floats
  – 8x qwords/doubles/float complex
  – 4x double complex
SIMD Abstraction – Options Compared

[Figure: vectorization options ranked along two axes - programmer control vs. ease of use / code maintainability (depends on the problem)]
Compiler-based Autovectorization
Compiler-Based Autovectorization
• The compiler recreates vector instructions from the serial program
• The compiler makes decisions based on certain assumptions
• The programmer reassures the compiler about those assumptions
  – The compiler takes the directives and compares them with its analysis of the code

• The compiler checks:
  – Is "*p" loop invariant?
  – Are a, b, and c loop invariant?
  – Does a[] overlap with b[], c[], and/or sum?
  – Is the "+" operator associative? (Does the order of the "add"s matter?)
  – Is vector computation on the target expected to be faster than scalar code?
for (i = 0; i < *p; i++) {
    a[i] = b[i] * c[i];
    sum = sum + a[i];
}

#pragma simd reduction(+:sum)
for (i = 0; i < *p; i++) {
    a[i] = b[i] * c[i];
    sum = sum + a[i];
}
• With the pragma, the compiler takes the following as confirmed for this loop:
  – "*p" is loop invariant
  – a[] is not aliased with b[], c[], or sum
  – sum is not aliased with b[] and c[]
  – The "+" operation on sum is associative (the compiler can reorder the "add"s on sum)
  – Vector code is to be generated even if it could be slower than scalar code
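A self-contained version of the reduction loop above can satisfy the compiler's checklist without per-loop pragmas. This is a sketch, not the deck's code: mul_sum is an invented name, the trip count is passed by value so it is clearly loop invariant, __restrict asserts the arrays do not overlap, and the OpenMP spelling of the SIMD pragma is used so the sketch also builds with non-Intel compilers:

```cpp
#include <cassert>
#include <cstddef>

// Reduction loop with aliasing and invariance made explicit.
float mul_sum(float* __restrict a, const float* __restrict b,
              const float* __restrict c, size_t n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)   // portable cousin of #pragma simd
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] * c[i];             // elementwise product, stored
        sum = sum + a[i];               // reduction combined across lanes
    }
    return sum;
}
```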
Hints to Compiler for Vectorization Opportunities

#pragma ivdep - Ignore vector dependences unless they are proven by the compiler
#pragma vector always [assert] - If the loop is vectorizable, ignore any benefit analysis; with assert, give a compile-time error message if the loop did not vectorize
#pragma novector - Specifies that a loop should never be vectorized, even if it is legal to do so; useful when vectorization results in a performance regression
#pragma vector aligned / unaligned - Instructs the compiler to use aligned (unaligned) data movement instructions for all array references when vectorizing
#pragma vector temporal / nontemporal - Directs the compiler to use temporal / non-temporal (that is, streaming) stores on systems based on IA-32 and Intel® 64 architectures; optionally takes a comma-separated list of variables
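As a sketch of when the ivdep hint from the table applies: the compiler cannot prove a[i] and a[i + k] never overlap when k is a runtime value, so it must assume a dependence and stay scalar. The function below (an invented example, not from the deck) is safe to vectorize only when the caller guarantees the read and write ranges are disjoint, e.g. k >= n, which is exactly the promise `#pragma ivdep` makes under the Intel compiler:

```cpp
#include <cassert>

// Shifted add: a[0..n-1] = a[k..k+n-1] + 1. The caller must ensure the
// two ranges do not overlap (e.g. k >= n); with that guarantee,
// "#pragma ivdep" placed before the loop lets the Intel compiler
// vectorize despite the unproven dependence.
void shift_add(float* a, int n, int k) {
    // #pragma ivdep would go here under the Intel compiler
    for (int i = 0; i < n; i++)
        a[i] = a[i + k] + 1.0f;
}
```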
Demand Vectorization by Annotation - #pragma simd
• Syntax: #pragma simd [<clause-list>]
  – Mechanism to force vectorization of a loop
  – Programmer: asserts a loop ought to be vectorized
  – Compiler: vectorizes the loop or gives an error

Clause - Semantics
No clause - Enforce vectorization of innermost loops; ignore dependencies, etc.
vectorlength(n1[, n2]…) - Select one or more vector lengths (range: 2, 4, 8, 16) for the vectorizer to use
private(var1, var2, …, varN) - Scalars private to each iteration; initial value broadcast to all instances; last value copied out from the last loop iteration instance
linear(var1:step1, …, varN:stepN) - Declare induction variables and corresponding positive integer step sizes (in multiples of vector length)
reduction(operator:var1, var2, …, varN) - Declare the private scalars to be combined at the end of the loop using the specified reduction operator
[no]assert - Direct the compiler to assert when vectorization fails; the default is to assert for the SIMD pragma
Annotate Black-Scholes for Vectorization
#pragma simd vectorlength(64)
#pragma vector aligned
#pragma vector nontemporal (CallResult, PutResult)
for (int opt = 0; opt < OptPerThread; opt++)
{
    float CNDD1;
    float CNDD2;
    float T = OptionYears[opt];
    float X = OptionStrike[opt];
    float S = StockPrice[opt];
    float sqrtT = sqrtf(T);
    float d1 = log2f(S/X)/(VLOG2E*sqrtT) + RVV*sqrtT;
    float d2 = d1 - VOLATILITY * sqrtT;
    CNDD1 = HALF + HALF*erff(M_SQRT1_2*d1);
    CNDD2 = HALF + HALF*erff(M_SQRT1_2*d2);
    float XexpRT = X*exp2f(RLOG2E * T);
    float CallVal = S * CNDD1 - XexpRT * CNDD2;
    float PutVal = CallVal + XexpRT - S;
    CallResult[opt] = CallVal;
    PutResult[opt] = PutVal;
}
Compiler Invocation Options:
-fno-alias - Assume no pointer aliasing in the program
-[no-]restrict -std=c99 - Enable/disable the restrict keyword for pointer disambiguation
-vec-report[n], -opt-report-phase hpo - Turn on the vectorization report
bs_sp.c(174): (col. 2) remark: loop was not vectorized: existence of vector dependence.
bs_sp.c(196): (col. 3) remark: pragma supersedes previous setting.
bs_sp.c(196): (col. 3) remark: SIMD LOOP WAS VECTORIZED.
bs_sp.c(190): (col. 6) remark: loop was not vectorized: not inner loop.
bs_sp.c(189): (col. 2) remark: loop was not vectorized: not inner loop.
bs_sp.c(235): (col. 3) remark: LOOP WAS VECTORIZED.
Get Your Code Vectorized by Intel Compiler
• Data layout: AoS -> SoA
• Data alignment (next slide)
• Make the loop innermost
• Function call treatment
  – Inline it yourself
  – Inline! Use __forceinline
  – Define your own vector version
  – Call a vector math library (SVML)
• Adopt jumpless algorithms
• Reads/writes are fine if they are contiguous
• Watch for loop-carried dependencies
for (int i = TIMESTEPS; i > 0; i--)
#pragma simd
#pragma unroll(4)
    for (int j = 0; j <= i - 1; j++)
        cell[j] = puXDf*cell[j+1] + pdXDf*cell[j];
CallResult[opt] = (Basetype)cell[0];

for (j = 1; j < MAX; j++)
    a[j] = a[j] + c * a[j-n];

The first loop is not a true dependency: cell[j+1] is read before that element is overwritten. The second is a true dependency: a[j-n] may have been written by an earlier iteration.
Array of Structures:
S0 X0 T0
S1 X1 T1
…  …  …

Structure of Arrays:
S0 S1 …
X0 X1 …
T0 T1 …
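The AoS-to-SoA layout change above can be sketched in a few lines. The struct and field names (S, X, T for stock price, strike, and years, following the Black-Scholes example) are assumptions for illustration, not the deck's actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct OptionAoS { float S, X, T; };   // one struct per option (AoS)

struct OptionsSoA {                     // one contiguous array per field (SoA)
    std::vector<float> S, X, T;
};

// Copy AoS data into SoA arrays; each field then has unit-stride layout,
// which is what the vectorizer wants for contiguous loads/stores.
OptionsSoA to_soa(const OptionAoS* aos, size_t n) {
    OptionsSoA soa;
    soa.S.resize(n); soa.X.resize(n); soa.T.resize(n);
    for (size_t i = 0; i < n; i++) {
        soa.S[i] = aos[i].S;
        soa.X[i] = aos[i].X;
        soa.T[i] = aos[i].T;
    }
    return soa;
}
```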
Memory Alignment
• Allocate memory on the heap:
  – _mm_malloc(size_t size, size_t align)
  – scalable_aligned_malloc(size_t size, size_t align)
• Align declarations:
  – __attribute__((aligned(n))) float v1[];
  – __declspec(align(n)) float v2[];
• Use this to notify the compiler:
  – __assume_aligned(array, n);
• Natural boundary
  – Unaligned access can fault the processor
• Cacheline boundary
  – Frequently accessed data should fit within a 64-byte cacheline
• 4K boundary
  – Sequentially accessed large data should be on a 4K boundary
Instruction Set / Vector Length / Alignment
SSE: 128 bits, 16 bytes
AVX: 256 bits, 32 bytes
IMCI: 512 bits, 64 bytes
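A hedged sketch of the aligned-allocation step above, using posix_memalign as a portable stand-in for _mm_malloc / scalable_aligned_malloc on POSIX systems (the helper name is invented; alignment must be a power of two and a multiple of sizeof(void*)):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Allocate n floats aligned to 'align' bytes (e.g. 64 for an IMCI
// cacheline). Returns nullptr on failure; release with free().
float* alloc_aligned_floats(size_t n, size_t align) {
    void* p = nullptr;
    if (posix_memalign(&p, align, n * sizeof(float)) != 0)
        return nullptr;
    return static_cast<float*>(p);
}
```

With the pointer guaranteed 64-byte aligned, hints such as `#pragma vector aligned` or `__assume_aligned(array, 64)` become safe to use on loops over it.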
Vectorized C/C++ Runtime Functions
• The Intel Compiler provides a set of vectorized runtime functions
  – It's free: you call them serially, and the compiler can still vectorize the code
• Multiple accuracy versions exist: high, medium, and low
• Choose the right version with -fimf-precision=low
• Compile with the -S switch to inspect the generated assembly
• If any of these functions can be inlined, ask for it
• Use the advanced compiler switches
  -fimf-precision=low -fimf-domain-exclusion=31
  or -fp-model fast=2
acos ceil fabs round
acosh cos floor sin
asin cosh fmax sinh
asinh erf fmin sqrt
atan erfc log tan
atan2 erfinv log10 tanh
atanh exp log2 trunc
cbrt exp2 pow
vmovaps %zmm21, %zmm0
call __svml_erff16_ep
Lab Step 3 Vectorization
Vectorization of Monte Carlo European Options
• Identify the loop to be vectorized - remember, the innermost loop
• Ensure alignment of dynamically allocated memory
  – Driver.cpp: malloc(size) -> _mm_malloc(size, align)
  – Driver.cpp: free() -> _mm_free()
• Self-inline the simple macro max
  – float callValue = max(0.0, Sval*expf(MuByT+VBySqrtT*random[pos])-Xval);
  becomes
  – float callValue = Sval*expf(MuByT+VBySqrtT*random[pos])-Xval;
  – callValue = (callValue > 0) ? callValue : 0;
• Add annotations
  – #pragma vector aligned
  – #pragma simd reduction(+:val) reduction(+:val2)
  – #pragma unroll(4)
• Makefile
  – Add -xAVX -vec-report6 to your compiler invocation line
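The self-inlined max rewrite in the lab step can be sketched as a small helper. Names follow the slide, but the function itself is invented, and 'growth' stands in for the expf(MuByT + VBySqrtT*random[pos]) factor, assumed precomputed:

```cpp
#include <cassert>

// Branchless payoff: the ternary compiles to a max/blend rather than a
// conditional jump, which keeps the loop body vectorizable.
inline float call_payoff(float Sval, float Xval, float growth) {
    float callValue = Sval * growth - Xval;
    return (callValue > 0.0f) ? callValue : 0.0f;
}
```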
Intel® Cilk™ Plus for Vectorization
Intel® Cilk™ Plus Technology - Elemental Functions
• Allow you to define data operations using scalar syntax
• The compiler applies the operation to data arrays in parallel, utilizing both SIMD parallelism and core parallelism
__declspec (vector)
double BlackScholesCall(double S, double K, double T)
{
    double d1, d2, sqrtT = sqrt(T);
    d1 = (log(S/K)+R*T)/(V*sqrtT) + 0.5*V*sqrtT;
    d2 = d1 - (V*sqrtT);
    return S*CND(d1) - K*exp(-R*T)*CND(d2);
}

cilk_for (int i = 0; i < NUM_OPTIONS; i++)
    call[i] = BlackScholesCall(SList[i], KList[i], TList[i]);
Programmer:
1. Writes standard C/C++ scalar syntax
2. Annotates it with __declspec(vector)
3. Uses one of the parallel syntax choices to invoke the function

Intel Compiler with Cilk Plus Technology:
1. Generates vector code with SIMD instructions
2. Invokes the function iteratively, until all elements are processed
3. Executes on a single core, or, using the task scheduler, executes on multiple cores
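The elemental function above can be checked with plain scalar code, since its body is ordinary C. A sketch compilable without __declspec(vector): R (risk-free rate) and V (volatility) are assumed global constants, as the deck's snippet implies, with values chosen here only for the test, and CND is the cumulative normal expressed through erf():

```cpp
#include <cassert>
#include <cmath>

static const double R = 0.05, V = 0.2;   // assumed rate and volatility

// Cumulative normal distribution via the error function.
static double CND(double x) { return 0.5 * (1.0 + erf(x / sqrt(2.0))); }

// Same body as the __declspec(vector) elemental function on the slide.
double BlackScholesCall(double S, double K, double T) {
    double d1, d2, sqrtT = sqrt(T);
    d1 = (log(S/K) + R*T) / (V*sqrtT) + 0.5*V*sqrtT;
    d2 = d1 - (V*sqrtT);
    return S*CND(d1) - K*exp(-R*T)*CND(d2);
}
```

With S = K = 100, T = 1, R = 0.05, V = 0.2, this yields the textbook value of about 10.45 for a European call.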
Intel® Cilk™ Plus Array Notation
• C/C++ language extension supported by the Intel® Compiler
• Based on the concept of array-section notation:
  <array>[<low_bound> : <len> : <stride>] [<low_bound> : <len> : <stride>]…
• C/C++ operators / function calls:
  – d[:] = a[:] + (b[:] * c[:])
  – b[:] = exp(a[:]); // Call exp() on each element of a[]
• Reductions combine array-section elements to generate a scalar result
  – Nine built-in reduction functions supporting basic C data types: add, mul, max, max_ind, min, min_ind, all_zero, all_non_zero, any_nonzero
  – Supports user-defined reduction functions
  – Built-in reductions provide the best performance
float a[10]; .. = a[:];      // all elements, indices 0..9
float a[10]; .. = a[2:6];    // 6 elements starting at index 2
float a[10]; .. = c[][5];
float a[10]; .. = d[0:3:2];  // 3 elements with stride 2: indices 0, 2, 4
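The section forms above have straightforward scalar equivalents, which is roughly what the compiler generates underneath. A sketch with an invented helper name (section_copy is not Cilk Plus API):

```cpp
#include <cassert>

// Scalar equivalent of reading the section a[low : len : stride]:
// gathers len elements starting at index low, stepping by stride.
void section_copy(const float* a, float* out, int low, int len, int stride) {
    for (int i = 0; i < len; i++)
        out[i] = a[low + i * stride];
}
```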
Intel® C/C++ Vector Classes
Vector Classes: Boxed Intrinsic Data Types
• The Intel® C/C++ Compiler provides C++ classes that wrap vector registers and vector intrinsics
  – Class interface for native vector data types such as __m512
  – Class constructors use broadcast intrinsic functions
  – Overloaded operators for basic arithmetic and bitwise operations: +, -, *, /, &, |, ^
  – Provide a transcendental function interface: exp(a) wraps _mm512_exp_ps(a)
  – Defined reduction operations such as reduce_add(), reduce_and(), reduce_min()
• Classes
  – Intel® Xeon® processors with the SSE4.2 ISA: F32vec4, F64vec2
  – Intel® Xeon® processors that support AVX: F32vec8, F64vec4
  – Intel® Xeon Phi™ Coprocessors: F32vec16, F64vec8, I32vec16, Is32vec16, Iu32vec16, I64vec8
Generic Computing with Vector Classes
• The Intel Compiler provides vector classes for all SIMD lengths
  – Intel® SSE2 and later 128-bit SIMD: F32vec4, F64vec2
  – Intel® AVX 256-bit SIMD: F32vec8, F64vec4
  – Intel® IMCI 512-bit SIMD: F32vec16, F64vec8
• Template method definitions can abstract out the SIMD class and length
  – Create a template that takes a vector class and a fundamental type as inputs
    • Instead of F32vec16 foo(F32vec16 a), which runs only on Intel MIC architecture,
    • try the generic SIMDType foo_t<SIMDType, BaseType>(SIMDType a)
  – The compiler creates a version of the template for each class the user instantiates
    • int laneN = sizeof(SIMDType)/sizeof(BaseType); // the number of SIMD lanes
    • int alignN = sizeof(SIMDType)/sizeof(char);    // minimum SIMD alignment
    • SIMDType Tvec = *(SIMDType*)&Tmem[0];   // read a SIMD-full of data from Tmem
    • *(SIMDType*)&Tmem[0] = Tvec;            // write a SIMD-full of data to Tmem, which points to BaseType
• Benefits
  – The same code template can create different binaries on different architectures
  – The same code template works for single precision and double precision
  – Uses vector class constructors/methods for intrinsic function calls
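The template pattern above can be illustrated with a portable miniature of an F32vec-style boxed type, so it builds without Intel headers. Vec4f and add_scalar_t are invented names, and the plain loops stand in for real SIMD intrinsics:

```cpp
#include <cassert>
#include <cstddef>

// Miniature boxed vector type: 4 float lanes, broadcast constructor,
// and an overloaded + that operates lane by lane.
struct Vec4f {
    float v[4];
    Vec4f() {}
    explicit Vec4f(float x) { for (int i = 0; i < 4; i++) v[i] = x; }  // broadcast
    Vec4f operator+(const Vec4f& o) const {
        Vec4f r;
        for (int i = 0; i < 4; i++) r.v[i] = v[i] + o.v[i];
        return r;
    }
};

// Generic over the SIMD class and its base type, as on the slide:
// lane count falls out of sizeof, and the broadcast ctor replaces
// an explicit broadcast intrinsic.
template <typename SIMDType, typename BaseType>
SIMDType add_scalar_t(SIMDType a, BaseType b) {
    int laneN = (int)(sizeof(SIMDType) / sizeof(BaseType)); // SIMD lanes
    (void)laneN;                                            // 4 for Vec4f/float
    return a + SIMDType(b);
}
```

Instantiating the same template with a wider class (say, an 8-lane type) would produce a different binary for a different architecture, which is the benefit the slide lists.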
Summary
• Fill all the SIMD lanes by using compiler-based vectorization technology