Parallelization
Shuo Li, Financial Services Engineering, Software and Service Group, Intel Corporation
iXPTC 2013 Intel® Xeon Phi™ Coprocessor
Agenda
• Parallelism on Intel® Architecture
• Challenges in Parallelization
• Options for Parallelization
• Summary
Parallelism on Intel® Architecture
Processor                                        Core(s)            Threads
Intel® Xeon® processor 64-bit                    1                  2
Intel Xeon processor 5100 series                 2                  2
Intel Xeon processor 5500 series                 4                  8
Intel Xeon processor 5600 series                 6                  12
Intel Xeon processor E5 Product Family           8                  16
Intel Xeon processor code name Ivy Bridge        10                 20
Intel Xeon processor code name Haswell           To be announced    To be announced
Intel® Xeon Phi™ Coprocessor                     61                 244
Images do not reflect actual die sizes. Actual production die may differ from images.
Intel® Xeon Phi™ Coprocessor extends established CPU architecture and programming paradigm to highly parallel applications
Options for Parallelism
Options for Parallelism on Intel® Architecture
The options span a range from more control to greater ease of use and maintainability:
• pthreads*: Time-tested industry standard for Unix-like systems. Common denominator for other high-level threading libraries.
• OpenMP*: Well-known industry standard. Best suited when resource utilization is known at design time.
• Intel® Cilk™ Plus: Keyword extension of C/C++. Serial equivalence via the compiler. Load balancing via work stealing.
• Intel® TBB: C++ template library of parallel algorithms and containers. Load balancing via work stealing.

• What is available on the Intel® host processor is also available on the Intel® target coprocessor
• Many others (boost, zthreads) are ported to the coprocessor
• Choose the threading model that best fits your problem
Options for Parallelism – pthreads*
• POSIX* standard thread API with a 20-year history
• Foundation for other high-level threading libraries
• Exists independently on the host and on Intel® MIC
• No extension needed to go from the host to Intel® MIC
• Advantage: the programmer has explicit control
– From workload partitioning to thread creation, synchronization, load balancing, affinity settings, etc.
• Disadvantage: the programmer has too much control, which affects
– Code longevity
– Maintainability
– Scalability
Black-Scholes using pthreads*

Thread creation and join (main thread):

    pthread_attr_init(&attr);
    clock_gettime(CLOCK_REALTIME, &t0);
    for (int i = 0; i < nThreads; i++) {
        /* Place SMT threads on each core; a core has 4 hardware threads. */
        int t = 4 * (i / SMT) + (i % SMT);
        set_thread_affinity_attr(t, &attr);
        /* Pass &attr so the affinity set above is actually applied. */
        pthread_create(&threads[i], &attr, bs_thread, (void *)(intptr_t) i);
    }
    for (int i = 0; i < nThreads; i++) {
        pthread_join(threads[i], NULL);   /* bs_thread returns NULL */
    }
    clock_gettime(CLOCK_REALTIME, &t1);

Pricing kernel:

    __forceinline void BlkSchlsEqEuroNoDiv_C(fptype *OptionPrice, fptype *OptionPrice2,
                                             fptype *sptprice, fptype *strike, fptype *time)
    {
        fptype sqrtT = SQRT(*time);
        fptype d1 = LOG2(*sptprice / *strike) / (Vlog2E * sqrtT) + RVV * sqrtT;
        fptype d2 = d1 - VOLATILITY * sqrtT;
        fptype NofXd1, NofXd2;
        CNDF_C(&NofXd1, &d1);
        CNDF_C(&NofXd2, &d2);
        fptype expRT = EXP2(ZR * (*time));   /* discount factor */
        /* Call price. */
        *OptionPrice = ((*sptprice) * NofXd1) - ((*strike) * expRT * NofXd2);
        /* Put price from put-call parity: P = C + K * exp(-rT) - S. */
        *OptionPrice2 = *OptionPrice + (*strike) * expRT - (*sptprice);
    }

Worker thread:

    void *bs_thread(void *arg1)
    {
        int tid = (int)(intptr_t) arg1;
        /* Static partition; assumes numOptions is a multiple of nThreads. */
        int start = tid * (numOptions / nThreads);
        int end = start + (numOptions / nThreads);

        for (int j = 0; j < numRuns; j++) {
    #pragma ivdep
    #pragma vector aligned
            for (int i = start; i < end; i++)
                BlkSchlsEqEuroNoDiv_C(&gprice[i], &gprice2[i], &sptprice[i],
                                      &strike[i], &otime[i]);
        }
        barrier(tid);
        return NULL;
    }
Thread Affinity using pthreads*
• Partition the workload to avoid load imbalance
– Understand static vs. dynamic workload partitioning
• Use the pthread API: define, initialize, set, destroy
– Set CPU affinity with pthread_setaffinity_np()
– Know the thread enumeration and avoid core 0
– Core 0 boots the coprocessor and runs the job scheduler and service interrupts
Figure: logical CPU enumeration by core. Core 0 (busy): 0, 241, 242, 243. Core 1: 1, 2, 3, 4. Core 2: 5, 6, 7, 8. ... Core 60: 237, 238, 239, 240.
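The enumeration above can be turned into a small mapping helper. The sketch below is an illustration only: it assumes the layout shown in the figure (61 cores, 4 hardware threads each, logical CPUs 1 to 240 on cores 1 to 60, and logical CPUs 0, 241, 242, 243 on core 0). The names logical_cpu_for and core_of_cpu are hypothetical and do not come from the slides.

```cpp
// Assumed layout, taken from the figure: 4 hardware threads per core,
// cores 1..60 own logical CPUs 1..240, core 0 owns 0, 241, 242, 243.
constexpr int kThreadsPerCore = 4;
constexpr int kUsableCores = 60;  // cores 1..60; core 0 is left to the OS

// Hypothetical helper: logical CPU for a worker when `smt` (1..4) workers
// are placed on each core. Worker 0 lands on core 1, never on core 0.
int logical_cpu_for(int worker, int smt) {
    return kThreadsPerCore * (worker / smt) + (worker % smt) + 1;
}

// Hypothetical helper: physical core that owns a given logical CPU.
int core_of_cpu(int cpu) {
    if (cpu == 0 || cpu > kThreadsPerCore * kUsableCores) {
        return 0;  // logical CPUs 0, 241, 242, 243 belong to core 0
    }
    return (cpu - 1) / kThreadsPerCore + 1;
}
```

The value returned by logical_cpu_for is what one would add to a cpu_set_t with CPU_SET before calling pthread_setaffinity_np(). With 4 workers per core, 240 workers fill cores 1 to 60 and leave core 0 free.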
Options for Parallelism – OpenMP*
• Threading constructs based on compiler directives/pragmas
– Utility library functions and environment variables
• Specify blocks of code that execute in parallel
• Fork-join parallelism:
– Master thread spawns a team of worker threads as needed
– Parallelism grows incrementally

Figure: a master thread forking into parallel regions.

    #pragma omp parallel sections
    {
        #pragma omp section
        task1();
        #pragma omp section
        task2();
        #pragma omp section
        task3();
    }
OpenMP* Pragmas and Extensions
• OpenMP* pragmas in C/C++:
    #pragma omp construct [clause [clause]…]
• Large, robust specification that includes
– Parallel sections and tasks
– Parallel loops
– Synchronization points: critical sections, barriers
– Atomic and ordered updates
– Serial sections within the parallel code
• Extension to support offloading
– OpenMP* 4.0 RC2
– Use #pragma omp target or #pragma offload from Intel LEO
– Either syntax works, with no performance difference

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            BinomialTemplate<F32vec8, float>(callResult, S, X, T, R, V, N, NUM_STEPS);
        }
        #pragma omp section
        #pragma offload target(mic:0) in(S, X, T:length(N)) out(MICExpected, MICConfidence:length(N))
        {
            montecarlo(S, X, T, MICExpected, MICConfidence);
        }
    }
OpenMP* Worksharing Construct

Sequential code:

    for (i = 0; i < N; i++)
        a[i] = a[i] + b[i];

OpenMP* parallel region:

    #pragma omp parallel
    {
        int id, i, Nthrds, istart, iend;

        id = omp_get_thread_num();
        Nthrds = omp_get_num_threads();
        istart = id * N / Nthrds;
        iend = (id + 1) * N / Nthrds;

        for (i = istart; i < iend; i++)
            a[i] = a[i] + b[i];
    }

OpenMP* worksharing construct:

    #pragma omp parallel
    #pragma omp for
    for (i = 0; i < N; i++)
        a[i] = a[i] + b[i];
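The manual partition in the parallel region is worth a closer look, because the exact form of the arithmetic decides whether every iteration is covered. The sketch below is an illustration, not code from the slides: chunk_for reuses the istart/iend formula, while covered_truncating models the simpler start = tid * (N / nthreads) split used in the pthreads example, which drops the remainder when N is not a multiple of the thread count. All helper names are hypothetical.

```cpp
#include <cstddef>

struct Range {
    std::size_t begin;  // first index of the chunk
    std::size_t end;    // one past the last index (half-open range)
};

// Same arithmetic as istart/iend in the OpenMP parallel region.
Range chunk_for(std::size_t id, std::size_t nthreads, std::size_t n) {
    return { id * n / nthreads, (id + 1) * n / nthreads };
}

// Total iterations covered when every thread takes chunk_for(...).
std::size_t covered(std::size_t nthreads, std::size_t n) {
    std::size_t total = 0;
    for (std::size_t id = 0; id < nthreads; id++) {
        Range r = chunk_for(id, nthreads, n);
        total += r.end - r.begin;
    }
    return total;
}

// Total iterations covered when every thread takes exactly n / nthreads.
std::size_t covered_truncating(std::size_t nthreads, std::size_t n) {
    return nthreads * (n / nthreads);
}
```

For 10 iterations on 4 threads, the first form yields chunks of 2, 3, 2 and 3 iterations, while the truncating form covers only 8. The #pragma omp for construct makes this bookkeeping unnecessary.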
OpenMP*: Shared, Private and Reduction Variables
• Default rules
– Variables defined outside the parallel region are shared
– Variables defined inside the parallel region are private
• Override the defaults
– The private(var) clause creates a local copy of var for each thread
– Loop indices in a parallel for are private by default
• The reduction(op:list) clause is a special case of shared
– Variables in "list" must be shared in the enclosing parallel region
– A local copy of each reduction variable is created and initialized based on the op (0 for "+")
– The compiler finds reduction expressions containing op and uses them to update the local copy
– Local copies are reduced to a single value and combined with the original global value

    #pragma omp parallel reduction(+ : sum_delta) reduction(+ : sum_ref)
    {
        float local_sum_delta = 0.0f;
        for (int i = 0; i < OptPerThread; i++) {
            /* Declared inside the region so each thread has its own copy
               (float is assumed here; the slide omits the declarations). */
            float ref = callReference;
            float delta = fabs(callReference - CallResult[i]);
            local_sum_delta += delta;
            sum_ref += fabs(ref);
        }
        sum_delta += local_sum_delta;
    }
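As a compact illustration of the reduction clause, the sketch below sums absolute errors against a single reference value. It is an assumption-laden example, not the author's code: the function name sum_abs_error and its parameters are hypothetical. Because the pragma is ignored when OpenMP is disabled, the same source also builds and runs serially.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: sum of |reference - result[i]| over all entries.
float sum_abs_error(const std::vector<float>& result, float reference) {
    float sum_delta = 0.0f;
    const int n = static_cast<int>(result.size());
    // Each thread accumulates into a private copy of sum_delta that starts
    // at 0; the copies are added to the original variable after the loop.
    #pragma omp parallel for reduction(+ : sum_delta)
    for (int i = 0; i < n; i++) {
        const float delta = std::fabs(reference - result[i]);  // private per iteration
        sum_delta += delta;
    }
    return sum_delta;
}
```

Note that delta is declared inside the loop body, so it is private without any extra clause, in line with the default rules listed above.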
OpenMP* Performance, Scalability Related Issues
• Manage thread creation cost
– Create threads as early as possible and maximize the work for worker threads
– IA threads take some time to create, but once they are up, they last until the end
• Take advantage of memory locality; use a NUMA memory manager
– Allocate memory on the thread that will access it later
– Try not to allocate in the main thread the memory that the worker threads use
• Ensure your OpenMP* program works serially and compiles without OpenMP*
– Protect OpenMP* API calls with _OPENMP
– Make sure the serial version works before enabling OpenMP* (e.g. compiling with -openmp)
• Minimize thread synchronization
– Use local variables to reduce the need to access global variables

    #ifdef _OPENMP
    int ThreadNum = omp_get_max_threads();
    omp_set_num_threads(ThreadNum);
    #else
    int ThreadNum = 1;
    #endif

    #pragma omp parallel
    {
    #ifdef _OPENMP
        int threadID = omp_get_thread_num();
    #else
        int threadID = 0;
    #endif

        /* Allocate on the thread that will use the memory. */
        float *CallResult = (float *) scalable_aligned_malloc(mem_size, SIMDALIGN);
        float *PutResult = (float *) scalable_aligned_malloc(mem_size, SIMDALIGN);
    }

    #pragma omp parallel for
    for (int k = 0; k < RAND_N; k++)
        h_Random[k] = cdfnorminv((k + 1.0) / (RAND_N + 1.0));

    #pragma omp parallel for
    for (int opt = 0; opt < OPT_N; opt++) {
        CallResultList[opt] = 0;
        CallConfidenceList[opt] = 0;
    }
OpenMP* Offload Environment Variables
• Set/get the number of coprocessor threads from the host
– Notice that omp_get_max_threads_target() returns 4*(ncore-1)
– Use omp_set_num_threads_target() and omp_get_num_threads_target()
– Protect these calls under #ifdef __INTEL_OFFLOAD
• Access coprocessor environment variables from the host processor
– First define MIC_ENV_PREFIX=MIC
– Issue export MIC_OMP_NUM_THREADS=240 on the host
– OpenMP sets the coprocessor max threads to 240
– Host OpenMP threads still take their cues from OMP_NUM_THREADS
• The initial stack size on the device defaults to 12 MB
– Use MIC_STACKSIZE to override the default size for the main thread on the coprocessor
– Use MIC_OMP_STACKSIZE to change the default stack size for worker threads
Step 4: Parallelization
• Add #pragma omp parallel for to the outer loop
• Add -openmp to the C/C++ compiler invocation options in CCFLAGS
• Rerun the program
– ./MonteCarlo
• Record the performance again
export KMP_AFFINITY="compact,granularity=fine"
Other Options for Parallelism – Intel® Cilk™ Plus
• C/C++ extension for fine-grained task parallelism
• Three keywords
_Cilk_spawn
– Function call may run in parallel with the caller, at the runtime's discretion
_Cilk_sync
– Caller waits for all children to complete
_Cilk_for
– Iterations are structured into a work queue
– Busy cores do not execute the loop
– Idle cores steal work items from the queue
– Countable loop; granularity is N/2, N/4, N/8, ... for a trip count of N
– Intended use: when iterations are not balanced, or when the overall load is not known at design time
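The N/2, N/4, N/8 granularity can be pictured as recursive halving of the iteration range. The sketch below is a serial model written for illustration only: it does not use the Cilk Plus runtime, the grain parameter is an assumed stopping threshold, and split_range and leaf_chunks are hypothetical names. In a work-stealing runtime, each half produced by a split is a unit of work that an idle core could steal.

```cpp
#include <utility>
#include <vector>

using Chunk = std::pair<int, int>;  // half-open range [first, second)

// Halve [begin, end) until a piece is no larger than `grain`, recording
// the resulting leaf ranges in order.
void split_range(int begin, int end, int grain, std::vector<Chunk>& leaves) {
    if (end - begin <= grain) {
        leaves.emplace_back(begin, end);
        return;
    }
    const int mid = begin + (end - begin) / 2;
    split_range(begin, mid, grain, leaves);  // this half could run elsewhere
    split_range(mid, end, grain, leaves);
}

// Leaf ranges for a loop of n iterations; empty when n is not positive.
std::vector<Chunk> leaf_chunks(int n, int grain) {
    std::vector<Chunk> leaves;
    if (n > 0) {
        split_range(0, n, grain < 1 ? 1 : grain, leaves);
    }
    return leaves;
}
```

For a trip count of 16 and a grain of 4, the range is split into pieces of 8 and then 4 iterations, giving four leaf ranges.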
Offload Using Intel® Cilk™ Plus

Feature: Offloading a function call
Example: x = _Cilk_shared _Cilk_offload func(y);
Description: func can execute on Intel MIC

Feature: Offloading asynchronously
Example: x = _Cilk_spawn _Cilk_offload func(y);
Description: Non-blocking offload

Feature: Data available on both sides
Example: _Cilk_shared int x = 0;
Description: Allocated in the shared memory area, can be synchronized

Feature: Function available on both sides
Example: int _Cilk_shared f(int x) { return x+1; }
Description: The function can execute on either side

Feature: Offload a parallel for loop (requires Cilk on Intel MIC)
Example: _Cilk_offload _Cilk_for (i = 0; i < N; i++) { a[i] = b[i] + c[i]; }
Description: Loop executes in parallel on Intel MIC. The loop is implicitly outlined as a function call. (borrow inside the loop disallowed)

Feature: Offload array expressions
Example: _Offload a[:] = b[:] <op> c[:]; _Offload a[:] = elemental_func(b[:]);
Description: Array operations execute in parallel on Intel MIC.

• Intel® C/C++ Compiler extension with new offloading keywords
• Provides the appearance of shared memory using virtual shared-memory technology
Black-Scholes – Using Intel® C/C++ Compiler with Cilk™ Plus Technology

Serial function:

    double option_price_call_black_scholes(
        double S,      // spot (underlying) price
        double K,      // strike (exercise) price
        double r,      // interest rate
        double sigma,  // volatility
        double time)   // time to maturity
    {
        double time_sqrt = sqrt(time);
        double d1 = (log(S/K) + r*time) / (sigma*time_sqrt) + 0.5*sigma*time_sqrt;
        double d2 = d1 - (sigma*time_sqrt);
        return S*N(d1) - K*exp(-r*time)*N(d2);
    }

Elemental (vector) function:

    __declspec(vector) double option_price_call_black_scholes(
        double S,      // spot (underlying) price
        double K,      // strike (exercise) price
        double r,      // interest rate
        double sigma,  // volatility
        double time)   // time to maturity
    {
        double time_sqrt = sqrt(time);
        double d1 = (log(S/K) + r*time) / (sigma*time_sqrt) + 0.5*sigma*time_sqrt;
        double d2 = d1 - (sigma*time_sqrt);
        return S*N(d1) - K*exp(-r*time)*N(d2);
    }

Serial loop:

    // invoke calculations for call-options
    for (int i = 0; i < NUM_OPTIONS; i++) {
        call[i] = option_price_call_black_scholes(S[i], K[i], r, sigma, time[i]);
    }

Parallel loop:

    // invoke calculations for call-options
    _Cilk_for (int i = 0; i < NUM_OPTIONS; i++) {
        call[i] = option_price_call_black_scholes(S[i], K[i], r, sigma, time[i]);
    }

Elemental functions utilize both core and vector parallelism
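The slide code calls N() without defining it. The sketch below fills that gap so the formula can be run as is: it assumes N is the standard normal cumulative distribution function and implements it with std::erfc. It also swaps _Cilk_for for #pragma omp parallel for so that it builds with an OpenMP-capable or a plain C++ compiler; price_calls is a hypothetical wrapper, not part of the original deck.

```cpp
#include <cmath>
#include <vector>

// Standard normal CDF (assumed meaning of the slide's N()).
double N(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Black-Scholes price of a European call without dividends.
double option_price_call_black_scholes(double S, double K, double r,
                                       double sigma, double time) {
    const double time_sqrt = std::sqrt(time);
    const double d1 = (std::log(S / K) + r * time) / (sigma * time_sqrt)
                      + 0.5 * sigma * time_sqrt;
    const double d2 = d1 - sigma * time_sqrt;
    return S * N(d1) - K * std::exp(-r * time) * N(d2);
}

// Hypothetical wrapper: price a batch of options, one loop iteration each.
// S, K and time are assumed to have the same length.
std::vector<double> price_calls(const std::vector<double>& S,
                                const std::vector<double>& K,
                                double r, double sigma,
                                const std::vector<double>& time) {
    std::vector<double> call(S.size());
    const int n = static_cast<int>(S.size());
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        call[i] = option_price_call_black_scholes(S[i], K[i], r, sigma, time[i]);
    }
    return call;
}
```

As a sanity check, an at-the-money call with S = K = 100, r = 0.05, sigma = 0.2 and one year to maturity prices at about 10.45.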
Black-Scholes – Using Intel® C/C++ Compiler Keyword Extension for Offload

Before: the __declspec(vector) elemental function from the previous slide. After: the same function marked _Shared so that it is available on both sides:

    _Shared __declspec(vector) double option_price_call_black_scholes(
        double S,      // spot (underlying) price
        double K,      // strike (exercise) price
        double r,      // interest rate
        double sigma,  // volatility
        double time)   // time to maturity
    {
        double time_sqrt = sqrt(time);
        double d1 = (log(S/K) + r*time) / (sigma*time_sqrt) + 0.5*sigma*time_sqrt;
        double d2 = d1 - (sigma*time_sqrt);
        return S*N(d1) - K*exp(-r*time)*N(d2);
    }

    // invoke calculations for call-options: first invocation on MIC, second on Xeon
    _Offload _Cilk_for (int i = 0; i < NUM_OPTIONS; i++) {
        call[i] = option_price_call_black_scholes(S[i], K[i], r, sigma, time[i]);
    }
    _Cilk_for (int i = 0; i < NUM_OPTIONS; i++) {
        call[i] = option_price_call_black_scholes(S[i], K[i], r, sigma, time[i]);
    }
Running Black-Scholes on Intel® Xeon® Processor and Intel® Xeon Phi™ Coprocessor Concurrently

The _Shared __declspec(vector) elemental function is unchanged from the previous slide.

    _Shared void wrapper()
    {
        _Cilk_for (int i = 0; i < NUM_OPTIONS; i++) {
            call[i] = option_price_call_black_scholes(S[i], K[i], r, sigma, time[i]);
        }
    }

    // invoke calculations for call-options: first invocation on MIC, second on Xeon
    …
    _Cilk_spawn _Offload wrapper();
    wrapper();
    _Cilk_sync;
    …
Another Option for Parallelism – Intel® Threading Building Blocks (TBB)
• C++ classes and templates that implement task-based parallelism
– As opposed to "threads"
– Makes use of "work stealing" to evenly distribute work across threads and ensure good cache behavior
• Provides a wide range of template classes to implement efficient parallel algorithms
– Generic parallel patterns
– Concurrent containers
– Synchronization primitives
– Memory allocation
– Task scheduling
– Thread local storage
– Etc.

Components by category:
• Generic Parallel Algorithms: parallel_for(range), parallel_reduce, parallel_for_each(begin, end), parallel_do, parallel_invoke, pipeline, parallel_pipeline, parallel_sort, parallel_scan
• Concurrent Containers: concurrent_hash_map, concurrent_queue, concurrent_bounded_queue, concurrent_vector, concurrent_unordered_map
• Task scheduler: task_group, structured_task_group, task_scheduler_init, task_scheduler_observer
• Synchronization Primitives: atomic, mutex, recursive_mutex, spin_mutex, spin_rw_mutex, queuing_mutex, queuing_rw_mutex, reader_writer_lock, critical_section, condition_variable, lock_guard, unique_lock, null_mutex, null_rw_mutex
• Memory Allocation: tbb_allocator, cache_aligned_allocator, scalable_allocator, zero_allocator
• Threads: tbb_thread, thread
• Thread Local Storage: enumerable_thread_specific, combinable
• Miscellaneous: tick_count
Options for Parallelism - comparison

                                   Pthreads*   OpenMP*   Intel® Cilk™ Plus   Intel® TBB
Code rewrite required to use       Lots        Little    Little              Moderate
Serial code source compatibility   No          Yes       Likely              No
Compiler dependency                No          No        Yes                 No
Supports Fortran                   Yes         Yes       No                  No
Supports C                         Yes         Yes       Yes                 No
Supports C++                       Yes         Yes       Yes                 Yes