
Vectorization

Shuo Li
Financial Services Engineering
Software and Services Group
Intel Corporation

iXPTC 2013 Intel® Xeon Phi™ Coprocessor

Legal Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

This document contains information on products in the design phase of development.

Cilk, Core Inside, Intel, the Intel logo, Intel AppUp, Intel Atom, Intel Atom Inside, Intel Core, the Intel Sponsors of Tomorrow. logo, Intel StrataFlash, Intel vPro, Itanium, Itanium Inside, MCS, MMX, Pentium, Pentium Inside, Ultrabook, vPro Inside, VTune, Xeon, Xeon Inside, XMM, are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

Copyright © 2012 Intel Corporation. All rights reserved.



Agenda

• Vectorization Overview
• Compiler-based Autovectorization
• Step 3 Vectorization
• Intel Cilk Plus for Vectorization
• Intel C/C++ Vector Classes
• Summary

Vectorization Overview



Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2®, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

http://software.intel.com/en-us/articles/optimization-notice/


Vectorization and SIMD Execution

• SIMD
  – Flynn’s Taxonomy: Single Instruction, Multiple Data
  – The CPU performs the same operation on multiple data elements

• SISD
  – Single Instruction, Single Data

• Vectorization
  – In the context of Intel® Architecture processors, the process of transforming a scalar operation (SISD), which acts on a single data element, into a vector operation (SIMD), which acts on multiple data elements at once
  – Assuming that setup code does not tip the balance, this can result in more compact and efficient generated code
  – For loops in ”normal” or ”unvectorized” code, each assembly instruction deals with the data from only a single loop iteration


SIMD Abstraction – Vectorization/SIMD

for (i = 0; i < 16; i++)
    if (v5[i] < v6[i])
        v1[i] += v3[i];

v5 = 0 4 7 8 3 9 2 0 6 3 8 9 4 5 0 1
v6 = 9 4 8 2 0 9 4 5 5 3 4 6 9 1 3 0
vcmppi_lt k7, v5, v6
k7 = 1 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0

v3 = 5 6 7 8 5 6 7 8 5 6 7 8 5 6 7 8
v1 = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
vaddpi v1{k7}, v1, v3
v1 = 6 1 8 1 1 1 8 9 1 1 1 1 6 1 8 1

SIMD can simplify your code and reduce the jumps and breaks in program flow control.

Note the lack of jumps or conditional code branches: the compare writes a 16-bit mask, and the masked add updates only the lanes whose mask bit is set.
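The two-instruction sequence on this slide can be emulated in portable C to check the lane values by hand. This is only a sketch: the function names compare_lt_mask and masked_add are hypothetical, lane i is assumed to map to mask bit i, and the loops stand in for what a single vector instruction does across all 16 lanes.

```c
#include <stdint.h>

#define LANES 16  /* one 512-bit register holds sixteen 32-bit integers */

/* Emulates vcmppi_lt: bit i of the result is set when a[i] < b[i]. */
uint16_t compare_lt_mask(const int a[LANES], const int b[LANES])
{
    uint16_t mask = 0;
    for (int i = 0; i < LANES; i++)
        mask |= (uint16_t)((a[i] < b[i]) << i);
    return mask;
}

/* Emulates vaddpi with a write mask: dst[i] += src[i] only where bit i is set.
   The mask bit is widened to 0 or all-ones, so the loop body has no branch. */
void masked_add(int dst[LANES], const int src[LANES], uint16_t mask)
{
    for (int i = 0; i < LANES; i++)
        dst[i] += src[i] & -(int)((mask >> i) & 1);
}
```

Feeding it the v5, v6, v3 and v1 values from the slide reproduces the k7 and final v1 rows shown there.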


Software Behind the Vectorization

float *restrict A, *B, *C;

for (i = 0; i < n; i++) {
    A[i] = B[i] + C[i];
}

• [SSE2] 4 elems at a time: addps xmm1, xmm2
• [AVX] 8 elems at a time: vaddps ymm1, ymm2, ymm3
• [IMCI] 16 elems at a time: vaddps zmm1, zmm2, zmm3

Vector (or SIMD) code computes more than one element at a time.

[Figure: a vector register split into lanes, each computing Xi op Yi. Bits 0 to 127 hold lanes 0 to 3 (SSE2), bits 128 to 255 hold lanes 4 to 7 (AVX), and bits 256 to 511 hold lanes 8 to 15 (IMCI). Width labels: x87, SSE2, AVX, IMCI.]
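As a rough sketch of what the compiler produces for the A[i] = B[i] + C[i] loop, the C below strip-mines it by hand. The name vec_add and the VLEN constant are assumptions for illustration; the inner lane loop stands for one vector add, and the trailing loop handles the elements left over when n is not a multiple of the vector length.

```c
#include <stddef.h>

#define VLEN 16  /* floats per 512-bit register; 8 for AVX, 4 for SSE2 */

/* restrict tells the compiler that A does not overlap B or C. */
void vec_add(float *restrict A, const float *restrict B,
             const float *restrict C, size_t n)
{
    size_t i = 0;

    /* Main loop: each pass corresponds to one vector instruction over VLEN lanes. */
    for (; i + VLEN <= n; i += VLEN)
        for (size_t lane = 0; lane < VLEN; lane++)
            A[i + lane] = B[i + lane] + C[i + lane];

    /* Remainder loop: scalar cleanup for the last n % VLEN elements. */
    for (; i < n; i++)
        A[i] = B[i] + C[i];
}
```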

Hardware resources behind the vectorization

• The CPU has a lot of computing power in the form of SIMD units.

• XMM (128 bit) can operate on
  – 16x chars
  – 8x shorts
  – 4x dwords/floats
  – 2x qwords/doubles/float complex

• YMM (256 bit) can operate on
  – 32x chars
  – 16x shorts
  – 8x dwords/floats
  – 4x qwords/doubles/float complex
  – 2x double complex

• Intel® Xeon Phi™ Coprocessor (512 bit) can operate on
  – 16x chars/shorts (converted to int)
  – 16x dwords/floats
  – 8x qwords/doubles/float complex
  – 4x double complex
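Most of the counts above follow from dividing the register width by the element width. A minimal sketch, with the helper name simd_lanes chosen only for illustration:

```c
#include <stddef.h>

/* Number of elements of a given size that fit in one SIMD register. */
size_t simd_lanes(size_t register_bits, size_t element_bytes)
{
    return register_bits / (8 * element_bytes);
}
```

For example, simd_lanes(128, sizeof(float)) gives 4 and simd_lanes(512, sizeof(double)) gives 8 on platforms with 4-byte float and 8-byte double. The coprocessor's chars and shorts are the exception: they are converted to int, so they occupy 16 lanes rather than 64 or 32.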


SIMD Abstraction – Options Compared

[Figure: the vectorization options compared along two axes: programmer control, and ease of use / code maintainability (depends on problem).]

Compiler-based Autovectorization


Compiler-Based Autovectorization

• The compiler creates vector instructions from the serial program

• The compiler makes decisions based on some assumptions

• The programmer reassures the compiler on those assumptions
  – The compiler takes the directives and compares them with its analysis of the code

for (i = 0; i < *p; i++) {
    a[i] = b[i] * c[i];
    sum = sum + a[i];
}

• For the loop above, the compiler checks:
  – Is “*p” loop invariant?
  – Are a, b, and c loop invariant?
  – Does a[] overlap with b[], c[], and/or sum?
  – Is the “+” operator associative? (Does the order of the “add”s matter?)
  – Is vector computation on the target expected to be faster than scalar code?

#pragma simd reduction(+:sum)
for (i = 0; i < *p; i++) {
    a[i] = b[i] * c[i];
    sum = sum + a[i];
}

• With #pragma simd, the compiler takes the following as confirmed for this loop:
  – “*p” is loop invariant
  – a[] is not aliased with b[], c[], and sum
  – sum is not aliased with b[] and c[]
  – The “+” operation on sum is associative (the compiler can reorder the “add”s on sum)
  – Vector code is to be generated even if it could be slower than scalar code
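The associativity question matters because a vectorized reduction does not add in source order. The sketch below is a plain-C emulation and not compiler output. It contrasts the scalar order with a lane-wise order that keeps VLEN partial sums and combines them after the loop. The function names and the lane count of 4 are assumptions for illustration.

```c
#include <stddef.h>

#define VLEN 4  /* illustrative lane count */

/* Scalar order: sum = ((sum + a[0]) + a[1]) + ... */
float reduce_scalar(float *a, const float *b, const float *c, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] * c[i];
        sum = sum + a[i];
    }
    return sum;
}

/* Vector order: each lane accumulates its own partial sum; the partial
   sums are combined once, after the main loop. */
float reduce_lanes(float *a, const float *b, const float *c, size_t n)
{
    float partial[VLEN] = {0};
    size_t i = 0;

    for (; i + VLEN <= n; i += VLEN)
        for (size_t lane = 0; lane < VLEN; lane++) {
            a[i + lane] = b[i + lane] * c[i + lane];
            partial[lane] += a[i + lane];
        }

    float sum = 0.0f;
    for (size_t lane = 0; lane < VLEN; lane++)
        sum += partial[lane];

    for (; i < n; i++) {  /* remainder elements, scalar */
        a[i] = b[i] * c[i];
        sum += a[i];
    }
    return sum;
}
```

Both orders agree exactly when every intermediate sum is representable, as with small integer values. For general floating-point data the two results can differ in the last bits, which is why the compiler needs the reduction clause before it will reorder the adds.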

Hints to Compiler for Vectorization Opportunities

#pragma and its semantics:

#pragma ivdep
    Ignore vector dependences unless they are proven by the compiler.

#pragma vector always [assert]
    If the loop is vectorizable, ignore any benefit analysis. If the loop did not vectorize, give a compile-time error message via assert.

#pragma novector
    Specifies that a loop should never be vectorized, even if it is legal to do so, when avoiding vectorization of a loop is desirable (when vectorization results in a performance regression).

#pragma vector aligned / unaligned
    Instructs the compiler to use aligned (unaligned) data movement instructions for all array references when vectorizing.

#pragma vector temporal / nontemporal
    Directs the compiler to use temporal/non-temporal (that is, streaming) stores on systems based on IA-32 and Intel® 64 architectures; optionally takes a comma-separated list of variables.

Demand vectorization by annotation - #pragma simd

• Syntax: #pragma simd [<clause-list>]
  – Mechanism to force vectorization of a loop
  – Programmer: asserts a loop ought to be vectorized
  – Compiler: vectorizes the loop or gives an error

Clause and its semantics:

No clause
    Enforce vectorization of innermost loops; ignore dependencies etc.

vectorlength (n1[, n2]…)
    Select one or more vector lengths (range: 2, 4, 8, 16) for the vectorizer to use.

private (var1, var2, …, varN)
    Scalars private to each iteration. Initial value broadcast to all instances. Last value copied out from the last loop iteration instance.

linear (var1:step1, …, varN:stepN)
    Declare induction variables and corresponding positive integer step sizes (in multiples of vector length).

reduction (operator:var1, var2,…, varN)
    Declare the private scalars to be combined at the end of the loop using the specified reduction operator.

[no]assert
    Direct the compiler to assert when the vectorization fails. Default is to assert for SIMD pragma.


Annotate Black-Scholes for Vectorization


#pragma simd vectorlength(64)
#pragma vector aligned
#pragma vector nontemporal (CallResult, PutResult)
for (int opt = 0; opt < OptPerThread; opt++)
{
    float CNDD1;
    float CNDD2;
    float T = OptionYears[opt];
    float X = OptionStrike[opt];
    float S = StockPrice[opt];
    float sqrtT = sqrtf(T);
    float d1 = log2f(S/X)/(VLOG2E*sqrtT) + RVV*sqrtT;
    float d2 = d1 - VOLATILITY * sqrtT;
    CNDD1 = HALF + HALF*erff(M_SQRT1_2*d1);
    CNDD2 = HALF + HALF*erff(M_SQRT1_2*d2);
    float XexpRT = X*exp2f(RLOG2E * T);
    float CallVal = S * CNDD1 - XexpRT * CNDD2;
    float PutVal = CallVal + XexpRT - S;
    CallResult[opt] = CallVal;
    PutResult[opt] = PutVal;
}

Compiler Invocation Options:

-fno-alias
    No pointer aliasing in the program.

-[no-]restrict -std=c99
    Enable/disable restrict keyword for pointer disambiguation.

-vec-report[n], -opt-report-phase hpo
    Turn on the vectorization report.

bs_sp.c(174): (col. 2) remark: loop was not vectorized: existence of vector dependence.
bs_sp.c(196): (col. 3) remark: pragma supersedes previous setting.
bs_sp.c(196): (col. 3) remark: SIMD LOOP WAS VECTORIZED.
bs_sp.c(190): (col. 6) remark: loop was not vectorized: not inner loop.
bs_sp.c(189): (col. 2) remark: loop was not vectorized: not inner loop.
bs_sp.c(235): (col. 3) remark: LOOP WAS VECTORIZED.


Get Your Code Vectorized by Intel Compiler

• Data layout: AOS -> SOA

• Data alignment (next slide)

• Make the loop innermost

• Function call treatment
  – Inline it yourself
  – inline! Use __forceinline
  – Define your own vector version
  – Call the vector math library, SVML

• Adopt jumpless algorithms

• Read/write is OK if it’s contiguous

• Loop-carried dependency

Not a true dependency:

for (int i = TIMESTEPS; i > 0; i--)
    #pragma simd
    #pragma unroll(4)
    for (int j = 0; j <= i - 1; j++)
        cell[j] = puXDf*cell[j+1] + pdXDf*cell[j];
CallResult[opt] = (Basetype)cell[0];

A true dependency:

for (j = 1; j < MAX; j++)
    a[j] = a[j] + c * a[j-n];
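The difference between the two loops can be checked with a small emulation, sketched here under the assumption n = 1 for the second loop. The "chunked" variants mimic a vector unit by loading VLEN inputs before storing any result. All function names and the VLEN value are hypothetical.

```c
#include <stddef.h>

#define VLEN 4  /* illustrative lane count */

/* Binomial step in scalar order: reads cell[j+1] before anything writes it. */
void step_scalar(float *cell, size_t n, float pu, float pd)
{
    for (size_t j = 0; j < n; j++)
        cell[j] = pu * cell[j + 1] + pd * cell[j];
}

/* Same step in vector style: all loads of a chunk happen before its stores. */
void step_chunked(float *cell, size_t n, float pu, float pd)
{
    for (size_t j = 0; j < n; j += VLEN) {
        size_t len = (n - j < VLEN) ? n - j : VLEN;
        float cur[VLEN], next[VLEN];
        for (size_t k = 0; k < len; k++) {
            cur[k] = cell[j + k];
            next[k] = cell[j + k + 1];
        }
        for (size_t k = 0; k < len; k++)
            cell[j + k] = pu * next[k] + pd * cur[k];
    }
}

/* Recurrence in scalar order: a[j] needs the a[j-1] just computed. */
void recur_scalar(float *a, size_t max, float c)
{
    for (size_t j = 1; j < max; j++)
        a[j] = a[j] + c * a[j - 1];
}

/* Recurrence in vector style: lanes read stale a[j-1] values. */
void recur_chunked(float *a, size_t max, float c)
{
    for (size_t j = 1; j < max; j += VLEN) {
        size_t len = (max - j < VLEN) ? max - j : VLEN;
        float cur[VLEN], prev[VLEN];
        for (size_t k = 0; k < len; k++) {
            cur[k] = a[j + k];
            prev[k] = a[j + k - 1];
        }
        for (size_t k = 0; k < len; k++)
            a[j + k] = cur[k] + c * prev[k];
    }
}
```

step_chunked matches step_scalar because each iteration only reads elements that no earlier iteration has written. recur_chunked diverges from recur_scalar because each iteration consumes the previous iteration's result, which is what makes the dependency a true one.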

Array of Structures

S0 X0 T0
S1 X1 T1
…  …  …

Structure of Arrays

S0 S1 …
X0 X1 …
T0 T1 …
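A minimal sketch of the AOS -> SOA conversion, using S, X and T as the field names from the diagram. The type and function names are hypothetical, and the caller is assumed to have allocated the three output arrays.

```c
#include <stddef.h>

/* Array of Structures: S, X and T of one option sit next to each other. */
typedef struct { float S, X, T; } OptionAoS;

/* Structure of Arrays: all S values are contiguous, then all X, then all T. */
typedef struct { float *S, *X, *T; } OptionsSoA;

/* Copy n options from the AoS layout into caller-allocated SoA arrays. */
void aos_to_soa(const OptionAoS *in, OptionsSoA *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out->S[i] = in[i].S;
        out->X[i] = in[i].X;
        out->T[i] = in[i].T;
    }
}
```

With the SoA layout, a loop over S[i] reads consecutive floats, so one vector load fills all lanes. With AoS the same loop would stride over three floats per element.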


Memory Alignment

• Allocate aligned memory on the heap
  – _mm_malloc(size_t size, size_t align)
  – scalable_aligned_malloc(size_t size, size_t alignment)

• Declare aligned memory
  – __attribute__((aligned(n))) float v1[];
  – __declspec(align(n)) float v2[];

• Use this to notify the compiler
  – __assume_aligned(array, n);

• Natural boundary
  – Unaligned access can fault the processor

• Cache line boundary
  – Frequently accessed data should be aligned to a 64-byte cache line

• 4K boundary
  – Sequentially accessed large data should be aligned to a 4K boundary

Instruction   Length     Alignment
SSE           128 bits   16 bytes
AVX           256 bits   32 bytes
IMCI          512 bits   64 bytes
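For readers without the Intel headers, the same effect can be sketched with standard C11 aligned_alloc instead of _mm_malloc. This is a substitution, not the slide's API; the helper names and the CACHE_LINE constant are assumptions for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64  /* bytes; matches the IMCI alignment in the table */

/* Allocate room for n floats on a 64-byte boundary. aligned_alloc expects
   the size to be a multiple of the alignment, so round the byte count up. */
float *alloc_floats_aligned(size_t n)
{
    size_t bytes = n * sizeof(float);
    bytes = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return aligned_alloc(CACHE_LINE, bytes);
}

/* Returns nonzero when p sits on the requested boundary. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

Memory obtained this way is released with free(), whereas memory from _mm_malloc must go to _mm_free().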


Vectorized C/C++ Runtime Functions

• The Intel compiler provides a set of vectorized runtime functions
  – It’s free: you call them serially, and the compiler can still vectorize the code

• Multiple accuracy versions exist: high, medium, and low

• Choose the right version by using -fimf-precision=low

• Compile with the -S switch to inspect the generated assembly

• If any of these functions can be inlined, you should ask for it

• Use an advanced compiler switch:
  -fimf-precision=low -fimf-domain-exclusion=31
  or -fp-model fast=2

acos    ceil    fabs    round
acosh   cos     floor   sin
asin    cosh    fmax    sinh
asinh   erf     fmin    sqrt
atan    erfc    log     tan
atan2   erfinv  log10   tanh
atanh   exp     log2    trunc
cbrt    exp2    pow

vmovaps %zmm21, %zmm0
call __svml_erff16_ep

Lab Step 3 Vectorization


Vectorization of Monte Carlo European Options

• Identify the loop to be vectorized; remember, it must be the innermost loop

• Ensure alignment of dynamically allocated memory
  – Driver.cpp: malloc(int size) -> _mm_malloc(size, align)
  – Driver.cpp: free() -> _mm_free()

• Self-inline the simple macro max
  – float callValue=max(0.0,Sval*expf(MuByT+VBySqrtT*random[pos])-Xval);
  Move to
  – float callValue=Sval*expf(MuByT+VBySqrtT*random[pos])-Xval;
  – callValue = (callValue > 0) ? callValue : 0;

• Add annotation
  – #pragma vector aligned
  – #pragma simd reduction(+:val) reduction(+:val2)
  – #pragma unroll(4)

• Makefile
  – Add -xAVX -vec-report6 to your compiler invocation line
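The payoff accumulation from the steps above might look like the sketch below. It is not the lab's source: the function name accumulate_payoffs and its price[] input are hypothetical (the lab computes each price as Sval*expf(MuByT+VBySqrtT*random[pos])). The Intel-specific pragma is guarded so that other compilers skip it.

```c
#include <stddef.h>

/* Sum the call payoffs max(price - Xval, 0) and their squares.
   price[] is a hypothetical array of simulated terminal prices. */
void accumulate_payoffs(const float *price, size_t n, float Xval,
                        float *sum, float *sum_sq)
{
    float val = 0.0f, val2 = 0.0f;

#ifdef __INTEL_COMPILER  /* Intel-specific annotation; ignored elsewhere */
#pragma simd reduction(+:val) reduction(+:val2)
#endif
    for (size_t pos = 0; pos < n; pos++) {
        float callValue = price[pos] - Xval;
        /* Self-inlined max: a select rather than a call to a macro or function. */
        callValue = (callValue > 0) ? callValue : 0;
        val += callValue;
        val2 += callValue * callValue;
    }

    *sum = val;
    *sum_sq = val2;
}
```

The two running sums are what the reduction clauses name, so the vectorizer may keep per-lane partial sums for each and combine them after the loop.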

Intel® Cilk™ Plus for Vectorization


Intel® Cilk™ Plus Technology - Elemental Function

• Allows you to define data operations using scalar syntax

• The compiler applies the operation to data arrays in parallel, utilizing both SIMD parallelism and core parallelism

__declspec (vector) double BlackScholesCall(double S, double K, double T)
{
    double d1, d2, sqrtT = sqrt(T);
    d1 = (log(S/K)+R*T)/(V*sqrtT)+0.5*V*sqrtT;
    d2 = d1-(V*sqrtT);
    return S*CND(d1) - K*exp(-R*T)*CND(d2);
}

cilk_for (int i = 0; i < NUM_OPTIONS; i++)
    call[i] = BlackScholesCall(SList[i], KList[i], TList[i]);

Programmer:
1. Writes a standard C/C++ function in scalar syntax
2. Annotates it with __declspec(vector)
3. Uses one of the parallel syntax choices to invoke the function

Intel Compiler with Cilk Plus Technology:
1. Generates vector code with SIMD instructions
2. Invokes the function iteratively, until all elements are processed
3. Executes on a single core, or uses the task scheduler to execute on multiple cores


Intel® Cilk™ Plus Array Notation

• C/C++ language extension supported by the Intel® Compiler

• Based on the concept of array-section notation:
  <array>[<low_bound> : <len> : <stride>] [<low_bound> : <len> : <stride>]…

• C/C++ operators / function calls
  – d[:] = a[:] + (b[:] * c[:])
  – b[:] = exp(a[:]); // Call exp() on each element of a[]

• Reductions combine array section elements to generate a scalar result
  – Nine built-in reduction functions supporting basic C data types:
    • add, mul, max, max_ind, min, min_ind, all_zero, all_non_zero, any_nonzero
  – Supports user-defined reduction functions
    • Built-in reductions provide the best performance

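Array section examples, for arrays indexed 0 1 2 3 4 5 6 7 8 9:

• float a[10]; .. = a[:];       all ten elements, 0 through 9
• float a[10]; .. = a[2:6];     six elements starting at index 2 (indices 2 through 7)
• .. = c[:][5];                 column 5 of a two-dimensional array c
• .. = d[0:3:2];                three elements starting at index 0 with stride 2 (indices 0, 2, 4)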
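Since the second field of a section is a length and not an upper bound, it can help to spell the sections out as ordinary loops. The sketch below emulates the notation in plain C; section_copy and section_reduce_add are hypothetical helper names, not Cilk Plus functions.

```c
#include <stddef.h>

/* dst[0 .. len-1] = src[low : len : stride] */
void section_copy(float *dst, const float *src,
                  size_t low, size_t len, size_t stride)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src[low + i * stride];
}

/* Scalar result of an add reduction over src[low : len : stride]. */
float section_reduce_add(const float *src,
                         size_t low, size_t len, size_t stride)
{
    float sum = 0.0f;
    for (size_t i = 0; i < len; i++)
        sum += src[low + i * stride];
    return sum;
}
```

With a[i] = i, section_copy(dst, a, 2, 6, 1) picks indices 2 through 7, and section_copy(dst, a, 0, 3, 2) picks indices 0, 2 and 4.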

Intel® C/C++ Vector Classes


Vector Classes: Boxed Intrinsic Data Types

• The Intel® C/C++ Compiler provides C++ classes that wrap vector registers and vector intrinsics
  – Class interface for native vector data types such as __m512
  – Class constructors use broadcast intrinsic functions
  – Overloaded operators for basic arithmetic and bitwise operations: +, -, *, /, &, |, ^
  – Transcendental function interface: exp(a) wraps _mm512_exp_ps(a)
  – Defined reduction operations such as reduce_add(), reduce_and(), reduce_min()

• Classes
  – Intel® Xeon® processors with SSE4.2 ISA: F32vec4, F64vec2
  – Intel® Xeon® processors that support AVX: F32vec8, F64vec4
  – Intel® Xeon Phi™ Coprocessors: F32vec16, F64vec8, I32vec16, Is32vec16, Iu32vec16, I64vec8


Generic Computing with Vector Classes

• The Intel compiler provides vector classes for all SIMD lengths
  – Intel® SSE2 and later 128-bit SIMD: F32vec4, F64vec2
  – Intel® AVX-based 256-bit SIMD: F32vec8, F64vec4
  – Intel® IMCI 512-bit SIMD: F32vec16, F64vec8

• Template method definitions can abstract out the SIMD class and length
  – Create a template that takes a vector class and a fundamental type as inputs
    • Instead of F32vec16 foo(F32vec16 a), which works only on Intel MIC architecture
    • Try the generic SIMDType foo_t<SIMDType, BaseType>(SIMDType a)
  – The compiler creates a version of the template for each class the user instantiates
    • int laneN = sizeof(SIMDType)/sizeof(BaseType); // the number of SIMD lanes
    • int alignN = sizeof(SIMDType)/sizeof(char); // minimum SIMD alignment
    • SIMDType Tvec = *(SIMDType*)&Tmem[0]; // read a SIMD-full of data from Tmem
    • *(SIMDType *)&(Tmem[0]) = Tvec; // write a SIMD-full of data to Tmem, which points to BaseType

• Benefits
  – The same code template can create different binaries on different architectures
  – The same code template serves single precision and double precision
  – Uses vector class constructors/methods for intrinsic function calls


Summary

• Fill all the SIMD lanes by using compiler-based vectorization technology
