Parallelization

Shuo Li
Financial Services Engineering
Software and Service Group
Intel Corporation

iXPTC 2013 Intel® Xeon Phi™ Coprocessor

Agenda

• Parallelism on Intel® Architecture
• Challenges in Parallelization
• Options for Parallelization
• Summary


Parallelism on Intel® Architecture



Processor                                  Core(s)            Threads
Intel® Xeon® processor 64-bit              1                  2
Intel Xeon processor 5100 series           2                  2
Intel Xeon processor 5500 series           4                  8
Intel Xeon processor 5600 series           6                  12
Intel Xeon processor E5 Product Family     8                  16
Intel Xeon processor code name Ivy Bridge  10                 20
Intel Xeon processor code name Haswell     To be Announced
Intel® Xeon Phi™ Coprocessor               61                 244

Images do not reflect actual die sizes. Actual production die may differ from images.

Intel® Xeon Phi™ Coprocessor extends established CPU architecture and programming paradigm to highly parallel applications

Options for Parallelism


Options for Parallelism on Intel® Architecture

From more control to greater ease of use and maintainability:

• pthreads*: time-tested industry standard for Unix-like systems; common denominator of other high-level threading libraries
• OpenMP*: well-known industry standard; best suited when resource utilization is known at design time
• Intel® Cilk™ Plus: keyword extension of C/C++; serial equivalence via the compiler; load balancing via work stealing
• Intel® TBB: C++ template library of parallel algorithms and containers; load balancing via work stealing

• What is available on the Intel® host processor is also available on the Intel® target coprocessor
• Many others (boost, zthreads) are ported to the coprocessor
• Choose the threading model that best fits your problem


Options for Parallelism – pthreads*

• POSIX* standard for a thread API, with a 20-year history
• Foundation for other high-level threading libraries
• Exists independently on the host and on Intel® MIC
• No extension needed to go from the host to Intel® MIC
• Advantage: programmer has explicit control
  – From workload partition to thread creation, synchronization, load balance, affinity settings, etc.
• Disadvantage: programmer has too much control
  – Code longevity
  – Maintainability
  – Scalability


Black-Scholes using pthreads*

// Main thread: create the workers, then wait for all of them.
// (intptr_t needs <stdint.h>.)
pthread_attr_init(&attr);
clock_gettime(CLOCK_REALTIME, &t0);
for (int i = 0; i < nThreads; i++)
{
    int t = 4 * (i / SMT) + (i % SMT);
    set_thread_affinity_attr(t, &attr);
    // pass &attr, assuming set_thread_affinity_attr() stores the CPU mask in attr
    pthread_create(&threads[i], &attr, bs_thread, (void *)(intptr_t) i);
}
for (int i = 0; i < nThreads; i++)
{
    void *ret;
    pthread_join(threads[i], &ret);
}
clock_gettime(CLOCK_REALTIME, &t1);

// Kernel: price one European option
__forceinline
void BlkSchlsEqEuroNoDiv_C(fptype *OptionPrice, fptype *OptionPrice2,
                           fptype *sptprice, fptype *strike, fptype *time)
{
    fptype sqrtT;
    fptype d1, d2;
    fptype NofXd1, NofXd2;

    sqrtT = SQRT(*time);
    d1 = LOG2(*sptprice / *strike) / (Vlog2E * sqrtT) + RVV * sqrtT;
    d2 = d1 - VOLATILITY * sqrtT;
    CNDF_C(&NofXd1, &d1);
    CNDF_C(&NofXd2, &d2);
    fptype expRT = EXP2(ZR * (*time));
    *OptionPrice  = ((*sptprice) * NofXd1) - ((*strike) * expRT * NofXd2);
    *OptionPrice2 = *OptionPrice + expRT - (*sptprice);
    return;
}

// Worker thread: price a contiguous slice of the options
void *bs_thread(void *arg1)
{
    int i, j;
    int tid = (int)(intptr_t) arg1;

    // static partition; any remainder of numOptions / nThreads is left unprocessed
    int start = tid * (numOptions / nThreads);
    int end = start + (numOptions / nThreads);

    for (j = 0; j < numRuns; j++)
    {
        #pragma ivdep
        #pragma vector aligned
        for (i = start; i < end; i++)
            BlkSchlsEqEuroNoDiv_C(&(gprice[i]), &(gprice2[i]), &(sptprice[i]),
                                  &(strike[i]), &(otime[i]));
    }
    barrier(tid);
    return (NULL);
}


Thread Affinity using pthreads*

• Partition the workload to avoid load imbalance
  – Understand static vs. dynamic workload partition
• Use the pthread API: define, initialize, set, destroy
  – Set CPU affinity with pthread_setaffinity_np()
  – Know the thread enumeration and avoid core 0
  – Core 0 boots the coprocessor and runs the job scheduler and service interrupts

[Figure: OS thread enumeration on the coprocessor. Core 0 (busy): 0, 241, 242, 243. Core 1: 1, 2, 3, 4. Core 2: 5, 6, 7, 8. ... Core 60: 237, 238, 239, 240.]
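The enumeration above can be turned into a small helper. The sketch below is an illustration only: the names mic_os_cpu and worker_cpu and the two constants are hypothetical, and it assumes the 61-core layout shown on the slide, where core 0 owns OS CPUs 0, 241, 242 and 243 and core c (c >= 1) owns CPUs 4(c-1)+1 through 4c. The returned number is what one would put into a cpu_set_t before calling pthread_setaffinity_np().

```c
#include <assert.h>

enum { MIC_CORES = 61, MIC_SMT = 4 };   /* assumed: 61 cores, 4 hardware threads each */

/* Map (core, hardware thread) to the OS CPU number in the enumeration above.
 * Hypothetical helper, not part of any Intel or POSIX API. */
int mic_os_cpu(int core, int smt)
{
    assert(core >= 0 && core < MIC_CORES);
    assert(smt >= 0 && smt < MIC_SMT);
    if (core == 0)                            /* core 0: CPUs 0, 241, 242, 243 */
        return smt == 0 ? 0 : MIC_SMT * (MIC_CORES - 1) + smt;
    return MIC_SMT * (core - 1) + 1 + smt;    /* core c: CPUs 4(c-1)+1 .. 4c */
}

/* Place worker i on cores 1..60 only, filling one core before moving to
 * the next, so that the busy core 0 is never used. Valid for i in 0..239. */
int worker_cpu(int i)
{
    assert(i >= 0 && i < MIC_SMT * (MIC_CORES - 1));
    return mic_os_cpu(1 + i / MIC_SMT, i % MIC_SMT);
}
```

Filling a core before moving on is only one possible policy; spreading workers one per core first would need a different index calculation in worker_cpu.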


Options for Parallelism – OpenMP*

• Compiler directive/pragma-based threading constructs
  – Utility library functions and environment variables
• Specify blocks of code executing in parallel
• Fork-join parallelism:
  – Master thread spawns a team of worker threads as needed
  – Parallelism grows incrementally

[Figure: master thread forking into parallel regions]

#pragma omp parallel sections
{
    #pragma omp section
    task1();
    #pragma omp section
    task2();
    #pragma omp section
    task3();
}


OpenMP* Pragmas and Extensions

• OpenMP* pragmas in C/C++:

  #pragma omp construct [clause [clause]…]

• Large, robust specification that includes
  – Parallel sections and tasks
  – Parallel loops
  – Synchronization points
    • critical sections
    • barriers
  – Atomic and ordered updates
  – Serial sections within the parallel code
• Extension to support offloading
  – OpenMP* 4.0 RC2
  – Use #pragma omp target or #pragma offload from Intel LEO
  – Either syntax works, no performance differences

#pragma omp parallel sections
{
    #pragma omp section
    {
        BinomialTemplate<F32vec8, float>(callResult, S, X, T, R, V, N, NUM_STEPS);
    }
    #pragma omp section
    #pragma offload target(mic:0) in(S, X, T:length(N)) out(MICExpected, MICConfidence:length(N))
    {
        montecarlo(S, X, T, MICExpected, MICConfidence);
    }
}


OpenMP* Worksharing Construct

Sequential code:

for (i = 0; i < N; i++)
    a[i] = a[i] + b[i];

OpenMP* parallel region:

#pragma omp parallel
{
    int id, i, Nthrds, istart, iend;

    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id + 1) * N / Nthrds;

    for (i = istart; i < iend; i++)
        a[i] = a[i] + b[i];
}

OpenMP* worksharing construct:

#pragma omp parallel
#pragma omp for
for (i = 0; i < N; i++)
    a[i] = a[i] + b[i];
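The hand-written partition in the parallel region above relies on integer division to split N iterations among Nthrds threads. A minimal sketch of that arithmetic follows; the helper names chunk_start and chunks_cover_range are hypothetical. Because the end of thread id's chunk is computed by the same expression as the start of thread id+1's chunk, the chunks tile [0, N) without gaps or overlaps even when N is not a multiple of the thread count.

```c
#include <assert.h>

/* First index owned by thread `id` out of `nthrds`, using the same formula
 * as the parallel region above; the chunk ends at chunk_start(id + 1, ...). */
int chunk_start(int id, int nthrds, int n)
{
    assert(nthrds > 0 && id >= 0 && id <= nthrds && n >= 0);
    /* long long avoids overflow of id * n for large n */
    return (int)((long long)id * n / nthrds);
}

/* Serial check that the chunks cover every index in [0, n) exactly once. */
int chunks_cover_range(int nthrds, int n)
{
    int next = 0;
    for (int id = 0; id < nthrds; id++) {
        int s = chunk_start(id, nthrds, n);
        int e = chunk_start(id + 1, nthrds, n);
        if (s != next || e < s)
            return 0;
        next = e;
    }
    return next == n;
}
```

With 4 threads and N = 10 the chunk sizes come out as 2, 3, 2 and 3, which is the kind of near-even split that `#pragma omp for` with a static schedule would otherwise provide automatically.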


OpenMP*: Shared, Private and Reduction Variables

• Default rules
  – Variables defined outside the parallel region are shared
  – Variables defined inside the parallel region are private
• Override the defaults
  – The private(var) clause creates a local copy of var for each thread
  – Loop indices in a parallel for are private by default
• The reduction(op:list) clause is a special case of shared
  – Variables in “list” must be shared in the enclosing parallel region
    • A local copy of each reduction variable is created and initialized based on the op (0 for “+”)
    • Compiler finds reduction expressions containing op and uses them to update the local copy
    • Local copies are reduced to a single value and combined with the original global value

#pragma omp parallel reduction(+ : sum_delta) reduction(+ : sum_ref)
{
    float local_sum_delta = 0.0f;
    for (int i = 0; i < OptPerThread; i++)
    {
        // declared inside the region so each thread has its own copy (type float assumed)
        float ref = callReference;
        float delta = fabs(callReference - CallResult[i]);
        local_sum_delta += delta;
        sum_ref += fabs(ref);
    }
    sum_delta += local_sum_delta;
}
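The three reduction steps listed above can also be written out by hand. The sketch below is a hypothetical illustration (the function names and the integer data are not from the slides): sum_with_reduction lets the reduction clause do the work, while sum_by_hand spells out the private copy initialized to 0 and the final combination with the shared original. Both also compile and give the same answer without OpenMP, where the pragmas are simply ignored.

```c
#include <assert.h>

/* Sum with the reduction clause: each thread gets a private `sum`
 * initialized to 0, and the copies are added into the original at the end. */
long sum_with_reduction(const int *a, int n)
{
    assert(n >= 0);
    long sum = 0;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* The same steps written out by hand. */
long sum_by_hand(const int *a, int n)
{
    assert(n >= 0);
    long sum = 0;
    #pragma omp parallel
    {
        long local = 0;             /* private copy, initialized for "+" */
        #pragma omp for
        for (int i = 0; i < n; i++)
            local += a[i];          /* update the local copy only */
        #pragma omp critical
        sum += local;               /* combine with the shared original */
    }
    return sum;
}
```

Integer data is used here so that the result does not depend on the order in which the partial sums are combined; with floating-point data the two versions may differ in the last bits.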


OpenMP* Performance and Scalability Issues

• Manage thread creation cost
  – Create threads as early as possible and maximize the work for worker threads
  – IA threads take some time to create, but once they are up, they last until the end
• Take advantage of memory locality; use a NUMA memory manager
  – Allocate memory on the thread that will access it later
  – Try not to allocate in the main thread the memory that worker threads use
• Ensure your OpenMP* program works serially and compiles without OpenMP*
  – Protect OpenMP* API calls with #ifdef _OPENMP
  – Make sure the serial version works before enabling OpenMP* (e.g. compiling with -openmp)
• Minimize thread synchronization
  – Use local variables to reduce the need to access global variables

#ifdef _OPENMP
int ThreadNum = omp_get_max_threads();
omp_set_num_threads(ThreadNum);
#else
int ThreadNum = 1;
#endif

#pragma omp parallel
{
#ifdef _OPENMP
    int threadID = omp_get_thread_num();
#else
    int threadID = 0;
#endif

    float *CallResult = (float *) scalable_aligned_malloc(mem_size, SIMDALIGN);
    float *PutResult = (float *) scalable_aligned_malloc(mem_size, SIMDALIGN);
}

#pragma omp parallel for
for (int k = 0; k < RAND_N; k++)
    h_Random[k] = cdfnorminv((k + 1.0) / (RAND_N + 1.0));

#pragma omp parallel for
for (int opt = 0; opt < OPT_N; opt++)
{
    CallResultList[opt] = 0;
    CallConfidenceList[opt] = 0;
}


OpenMP* Offload Environment Variables

• Set/get the number of coprocessor threads from the host
  – Note that omp_get_max_threads_target() returns 4*(ncore-1)
  – Use omp_set_num_threads_target() and omp_get_num_threads_target()
  – Protect these calls with #ifdef __INTEL_OFFLOAD
• Access coprocessor environment variables from the host processor
  – First define MIC_ENV_PREFIX=MIC
  – Issue export MIC_OMP_NUM_THREADS=240 on the host
  – OpenMP sets the coprocessor max threads to 240
  – Host OpenMP threads still take their cues from OMP_NUM_THREADS
• The initial stack size on the device defaults to 12 MB
  – Use MIC_STACKSIZE to override the default size for main threads on the coprocessor
  – Use MIC_OMP_STACKSIZE to change the default stack size for worker threads

Step 4: Parallelization

• Add #pragma omp parallel for to the outer loop
• Add -openmp to the C/C++ compiler invocation option CCFLAGS
• Rerun the program
  – ./MonteCarlo
• Record the performance again

export KMP_AFFINITY="compact,granularity=fine"
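A minimal sketch of this step, assuming a workload shaped like a Monte Carlo pricer: an outer loop over options and an inner loop over pre-computed samples. The function and parameter names are hypothetical and are not the lab's source, and the payoff calculation is reduced to the bare essentials. Each outer iteration writes only its own result[opt], so the pragma can go on the outer loop without any further synchronization.

```c
#include <assert.h>

/* Average call payoff for each option over n_paths pre-computed growth
 * factors. Hypothetical stand-in for a Monte Carlo kernel. */
void price_options(const float *S, const float *X, int n_opt,
                   const float *growth, int n_paths, float *result)
{
    assert(n_paths > 0);
    #pragma omp parallel for          /* Step 4: parallelize the outer loop */
    for (int opt = 0; opt < n_opt; opt++) {
        float sum = 0.0f;             /* declared inside the loop, so private */
        for (int k = 0; k < n_paths; k++) {
            float payoff = S[opt] * growth[k] - X[opt];
            sum += payoff > 0.0f ? payoff : 0.0f;
        }
        result[opt] = sum / (float)n_paths;
    }
}
```

Parallelizing the outer loop rather than the inner one keeps the work per thread large and leaves the inner loop free for the compiler to vectorize.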


Other Options for Parallelism – Intel® Cilk™ Plus

• C/C++ extension for fine-grained task parallelism
• Three keywords

_Cilk_spawn
  • Function call may be run in parallel with caller – up to the runtime

_Cilk_sync
  • Caller waits for all children to complete

_Cilk_for
  • Iterations are structured into a work queue
  • Busy cores do not execute the loop
  • Idle cores steal work items from the queue
  • Countable loop; granularity is N/2, N/4, N/8, ... for a trip count of N
  • Intended use:
    – When iterations are not balanced, or
    – When overall load is not known at design time

Offload Using Intel® Cilk™ Plus

• Intel® C/C++ Compiler extension with new offloading keywords
• Provides the appearance of shared memory using virtual shared-memory technology

Feature: Offloading a function call
Example: x = _Cilk_shared _Cilk_offload func(y);
Description: func can execute on Intel MIC

Feature: Offloading asynchronously
Example: x = _Cilk_spawn _Cilk_offload func(y);
Description: Non-blocking offload

Feature: Data available on both sides
Example: _Cilk_shared int x = 0;
Description: Allocated in the shared memory area, can be synchronized

Feature: Function available on both sides
Example: int _Cilk_shared f(int x) { return x+1; }
Description: The function can execute on either side

Feature: Offload a parallel for loop (requires Cilk on Intel MIC)
Example: _Cilk_offload _Cilk_for (i = 0; i < N; i++) { a[i] = b[i] + c[i]; }
Description: Loop executes in parallel on Intel MIC. The loop is implicitly outlined as a function call. (borrow inside the loop disallowed)

Feature: Offload array expressions
Example: _Offload a[:] = b[:] <op> c[:]; _Offload a[:] = elemental_func(b[:]);
Description: Array operations execute in parallel on Intel MIC.


Black-Scholes – Using Intel® C/C++ Compiler with Cilk™ Plus Technology

Serial version:

double option_price_call_black_scholes(
    double S,      // spot (underlying) price
    double K,      // strike (exercise) price
    double r,      // interest rate
    double sigma,  // volatility
    double time)   // time to maturity
{
    double time_sqrt = sqrt(time);
    double d1 = (log(S/K) + r*time) / (sigma*time_sqrt) + 0.5*sigma*time_sqrt;
    double d2 = d1 - (sigma*time_sqrt);
    return S*N(d1) - K*exp(-r*time)*N(d2);
}

// invoke calculations for call-options
for (int i = 0; i < NUM_OPTIONS; i++) {
    call[i] = option_price_call_black_scholes(S[i], K[i], r, sigma, time[i]);
}

With Cilk™ Plus:

__declspec (vector) double option_price_call_black_scholes(
    double S,      // spot (underlying) price
    double K,      // strike (exercise) price
    double r,      // interest rate
    double sigma,  // volatility
    double time)   // time to maturity
{
    double time_sqrt = sqrt(time);
    double d1 = (log(S/K) + r*time) / (sigma*time_sqrt) + 0.5*sigma*time_sqrt;
    double d2 = d1 - (sigma*time_sqrt);
    return S*N(d1) - K*exp(-r*time)*N(d2);
}

// invoke calculations for call-options
_Cilk_for (int i = 0; i < NUM_OPTIONS; i++) {
    call[i] = option_price_call_black_scholes(S[i], K[i], r, sigma, time[i]);
}

Elemental functions utilize both core and vector parallelism
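For reference, the quantity the function above evaluates can be written out as follows, with T standing for the code's time argument and N for the standard normal cumulative distribution function, as in the code. This is a transcription of the code's arithmetic, not an additional result.

```latex
\begin{aligned}
d_1 &= \frac{\ln(S/K) + rT}{\sigma\sqrt{T}} + \tfrac{1}{2}\,\sigma\sqrt{T}, \\
d_2 &= d_1 - \sigma\sqrt{T}, \\
C   &= S\,N(d_1) - K\,e^{-rT}\,N(d_2).
\end{aligned}
```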


Black-Scholes – Using Intel® C/C++ Compiler Keyword Extension for Offload

Before:

__declspec (vector) double option_price_call_black_scholes(
    double S,      // spot (underlying) price
    double K,      // strike (exercise) price
    double r,      // interest rate
    double sigma,  // volatility
    double time)   // time to maturity
{
    double time_sqrt = sqrt(time);
    double d1 = (log(S/K) + r*time) / (sigma*time_sqrt) + 0.5*sigma*time_sqrt;
    double d2 = d1 - (sigma*time_sqrt);
    return S*N(d1) - K*exp(-r*time)*N(d2);
}

After:

_Shared __declspec (vector) double option_price_call_black_scholes(
    double S,      // spot (underlying) price
    double K,      // strike (exercise) price
    double r,      // interest rate
    double sigma,  // volatility
    double time)   // time to maturity
{
    double time_sqrt = sqrt(time);
    double d1 = (log(S/K) + r*time) / (sigma*time_sqrt) + 0.5*sigma*time_sqrt;
    double d2 = d1 - (sigma*time_sqrt);
    return S*N(d1) - K*exp(-r*time)*N(d2);
}

// invoke calculations for call-options: first invocation on MIC, second on Xeon
_Offload _Cilk_for (int i = 0; i < NUM_OPTIONS; i++) {
    call[i] = option_price_call_black_scholes(S[i], K[i], r, sigma, time[i]);
}
_Cilk_for (int i = 0; i < NUM_OPTIONS; i++) {
    call[i] = option_price_call_black_scholes(S[i], K[i], r, sigma, time[i]);
}


Running Black-Scholes Concurrently on Intel® Xeon® Processor and Intel® Xeon Phi™ Coprocessor

// invoke calculations for call-options: first invocation on MIC, second on Xeon
…
_Cilk_spawn _Offload wrapper();
wrapper();
_Cilk_sync;
…

_Shared __declspec (vector) double option_price_call_black_scholes(
    double S,      // spot (underlying) price
    double K,      // strike (exercise) price
    double r,      // interest rate
    double sigma,  // volatility
    double time)   // time to maturity
{
    double time_sqrt = sqrt(time);
    double d1 = (log(S/K) + r*time) / (sigma*time_sqrt) + 0.5*sigma*time_sqrt;
    double d2 = d1 - (sigma*time_sqrt);
    return S*N(d1) - K*exp(-r*time)*N(d2);
}

_Shared void wrapper()
{
    _Cilk_for (int i = 0; i < NUM_OPTIONS; i++) {
        call[i] = option_price_call_black_scholes(S[i], K[i], r, sigma, time[i]);
    }
}


Another Option for Parallelism – Intel® Threading Building Blocks (TBB)

• C++ classes and templates that implement task-based parallelism
  – As opposed to “threads”
  – Makes use of “work stealing” to evenly distribute work across threads and ensure good cache behavior
• Provides a wide range of template classes to implement efficient parallel algorithms
  – Generic parallel patterns
  – Concurrent containers
  – Synchronization primitives
  – Memory allocation
  – Task scheduling
  – Thread local storage
  – Etc.

Generic Parallel Algorithms: parallel_for(range); parallel_reduce; parallel_for_each(begin, end); parallel_do; parallel_invoke; pipeline, parallel_pipeline; parallel_sort; parallel_scan

Concurrent Containers: concurrent_hash_map; concurrent_queue; concurrent_bounded_queue; concurrent_vector; concurrent_unordered_map

Task Scheduler: task_group; structured_task_group; task_scheduler_init; task_scheduler_observer

Synchronization Primitives: atomic; mutex; recursive_mutex; spin_mutex; spin_rw_mutex; queuing_mutex; queuing_rw_mutex; reader_writer_lock; critical_section; condition_variable; lock_guard; unique_lock; null_mutex; null_rw_mutex

Memory Allocation: tbb_allocator; cache_aligned_allocator; scalable_allocator; zero_allocator

Threads: tbb_thread, thread

Thread Local Storage: enumerable_thread_specific; combinable

Miscellaneous: tick_count


Options for Parallelism – Comparison

                                  Pthreads*   OpenMP*   Intel® Cilk™ Plus   Intel® TBB
Code rewrite required to use      Lots        Little    Little              Moderate
Serial code source compatibility  No          Yes       Likely              No
Compiler dependency               No          No        Yes                 No
Supports Fortran                  Yes         Yes       No                  No
Supports C                        Yes         Yes       Yes                 No
Supports C++                      Yes         Yes       Yes                 Yes
