
Choosing the Right Parallel Model
Stephen Blair-Chappell, Intel Compiler Labs
Parallel Processors Need Parallel Applications
Intel® Parallel Building Blocks provides the tools you need to build parallel applications
What is a good parallel programming model?
• Easy to use
Agenda
Different Kinds of Programmer
• "Stupid compiler!! The inc sets the Z flag, so what's the compare doing here? Wasted cycle! Yuck!"
• "Hey honey! I found this piece of code, made two simple changes and now it works!! So what are you doing tonight?"
• "What CPU does it run on? Huh? What's the difference? I just need to reformat the output here and we can run that experiment again."

All programmers are not equal.
Family of Parallel Models
• Libraries
• Different levels of abstraction

Intel Parallel Programming Model: Tasking Fundamental Concepts
– Large teams that develop components independently
– Calling into libraries
• Utilize HW resources
Family of Parallel Models: Intel® Cilk Plus
• Easy to learn; works in both C and C++
• Serial semantics; just 3 keywords
• Tasks, not threads
• Array operations
• Guaranteed vector implementation by the compiler (#pragma simd)
Anatomy of a spawn

    void f() {
        cilk_spawn g();   // g() can run in parallel with the rest of f()
        work;
        work;
        work;
        cilk_sync;        // wait here until the spawned call has finished
        work;
    }

Work stealing occurs when another worker is available.
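As a self-contained illustration of spawn and sync, here is the classic Cilk Fibonacci sketch (not from the slides; it assumes the Intel Cilk Plus header):

    #include <cilk/cilk.h>

    int fib(int n) {
        if (n < 2) return n;
        int x = cilk_spawn fib(n - 1);   // child may execute in parallel
        int y = fib(n - 2);              // continuation runs meanwhile
        cilk_sync;                       // join: wait for the spawned child
        return x + y;
    }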
cilk_for and reducer

    #include <cilk/reducer_opadd.h>

    cilk::reducer_opadd<int> gInt;

    {
        cilk_for (int i = 0; i < 8; i++) {
            gInt++;
        }
    }

The Cilk scheduler automatically gives each worker its own view of the reducer and merges the views when the loop ends, so the increments are race-free.
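A complete, runnable sketch of the same idea (the loop bound and the printing are illustrative, not from the slide):

    #include <cilk/cilk.h>
    #include <cilk/reducer_opadd.h>
    #include <cstdio>

    int main() {
        cilk::reducer_opadd<long> sum;
        cilk_for (long i = 1; i <= 1000; i++) {
            sum += i;                            // each worker updates a private view
        }
        std::printf("%ld\n", sum.get_value());   // merged result: prints 500500
        return 0;
    }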
Array Notations for C/C++
• Data-parallel operations on array sections
• Vectorization is always semantically correct
• Syntax: <array base>[<lower bound>:<length>[:<stride>]]

    B[2:6]       // elements 2 to 7 of vector B
    C[:][5]      // column 5 of matrix C
    D[0:3:2]     // elements 0, 2, 4 of vector D
    A[:] = B[:]  // guaranteed vector copy
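A small sketch of the notation in use (the function and names are illustrative; this is the Intel compiler's Cilk Plus extension):

    // a[0:n] selects n elements of a starting at index 0
    void saxpy(int n, float s, float *a, const float *b) {
        a[0:n] = s * a[0:n] + b[0:n];   // one data-parallel, vectorized statement
    }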
Elemental Functions
• Use scalar syntax to describe an operation on a single element
• Apply the operation to arrays in parallel
• Utilize both vector parallelism and core parallelism

    __declspec(vector)
    double my_ef(double d1) {
        ...
        double d2 = d1 - (sigma * time_sqrt);
        ...
    }

    // the same scalar call is applied to many elements in parallel:
    a[j] = my_ef(b[j]);
    a[j] = my_ef(b[j]);
    a[j] = my_ef(b[j]);

[Concurrency not yet …]
Pragma SIMD
• Write a standard C/C++/Fortran loop and add a pragma to get the compiler to vectorize it
• The compiler does not prove equivalence to the sequential loop and applies no performance heuristics
• The programmer may need to provide additional clauses for correct code generation – private, reduction, scalar
• Elemental functions can be called from the loop:

    #pragma simd
    for (int j = 0; j < N; j++)
        a[j] = my_ef(b[j]);
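A runnable sketch combining the pragma with an elemental function (assumes the Intel C/C++ compiler; the arithmetic and names are illustrative):

    #include <stdio.h>
    #define N 1024

    __declspec(vector)            /* elemental: the compiler also emits a vector version */
    float my_ef(float x) {
        return 2.0f * x + 1.0f;   /* scalar body, applied element-wise */
    }

    int main(void) {
        float a[N], b[N];
        for (int j = 0; j < N; j++) b[j] = (float)j;

        #pragma simd              /* vectorize; no sequential-equivalence proof */
        for (int j = 0; j < N; j++)
            a[j] = my_ef(b[j]);

        printf("%f\n", a[N - 1]);
        return 0;
    }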
Family of Parallel Models: Intel® Threading Building Blocks (TBB)
• C++ library based on generic programming
• Tasks, not threads
• Common parallel patterns
• Scalable memory allocator
Threads

    const int N = 100000;
    for (int i = 0; i < N; i++) {
        array[i] *= 2;
    }
An Example using parallel_for

• Include and initialize the library

    #include "tbb/task_scheduler_init.h"
    #include "tbb/blocked_range.h"
    #include "tbb/parallel_for.h"

    using namespace tbb;

    task_scheduler_init init;   // initializes the TBB task scheduler

• Use the parallel_for pattern. The serial loop

    for (int i = 0; i < M; i++) {
        array[i] *= 2;
    }

becomes

    class ChangeArrayBody {
        float *array;
    public:
        ChangeArrayBody(float *a) : array(a) {}
        void operator()(const blocked_range<int> &r) const {
            for (int i = r.begin(); i != r.end(); i++)
                array[i] *= 2;
        }
    };

    parallel_for(blocked_range<int>(0, M, IdealGrainSize),
                 ChangeArrayBody(array));

(Slide colour key: blue = original code, green = provided by TBB, red = library boilerplate.)
Work stealing makes spawned tasks available to thieves (idle workers).

    float Example() {
        ...              // reduction body invoked via functor::operator()
        return sum;
    }
Lambda Syntax
• The parameter list and return type may be omitted when the return type is void or the body is just "return expr;"
• [&] ⇒ capture by reference
• [=] ⇒ capture by value

    []{ return rand(); }                                             // no parameters, return type deduced
    [](int x, int y) -> int { if (x < y) return x; else return y; }  // explicit return type
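A minimal sketch of the earlier parallel_for example rewritten with a lambda, using TBB's compact overload (function and names are illustrative):

    #include <tbb/parallel_for.h>

    void ChangeArray(float *array, int M) {
        // [=] captures array and M by value; TBB generates the functor for us
        tbb::parallel_for(0, M, [=](int i) {
            array[i] *= 2;
        });
    }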
    #include <tbb/tbb.h>
    #include <vector>

    void RunWhileLoop()
    {
        ...
    }
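A minimal sketch of parallelizing a while-style traversal with tbb::parallel_do (assumed here; the slide's own loop body is not preserved, and the per-item work is illustrative):

    #include <tbb/tbb.h>
    #include <vector>

    void RunWhileLoop(std::vector<int> &items) {
        // parallel_do pulls items from the range and processes them in parallel
        tbb::parallel_do(items.begin(), items.end(), [](int &x) {
            x *= 2;    // illustrative per-item work
        });
    }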
Family of Parallel Models
• Libraries
• OpenMP
What is OpenMP* ?
• Portable, shared-memory multiprocessing API
  – Fortran 77, Fortran 90, C, and C++
  – Multi-vendor support, for both Unix and Windows
• Standardizes loop-level parallelism
• Supports coarse-grained parallelism
• Combines serial and parallel code in a single source
  – No need for a separate source-code revision
• See www.openmp.org for standard documents, tutorials, and sample code
• Intel is a premier member of the OpenMP Architecture Review Board
Parallel APIs: OpenMP*

OpenMP: An API for Writing Multithreaded Applications
• A set of compiler directives and library routines for parallel application programmers
• Makes it easy to create multithreaded (MT) programs in Fortran, C, and C++

Example directives and routines:

    #pragma omp critical
    C$OMP PARALLEL REDUCTION (+: A, B)
    call OMP_INIT_LOCK (ilok)
    C$OMP THREADPRIVATE(/ABC/)
OpenMP Architecture
• Fork-join model
• Worksharing constructs
• Synchronization constructs
• Directive/pragma-based parallelism
OpenMP Programming Model
• Fork-join parallelism: the master thread spawns a team of threads as needed.
• Parallelism is added incrementally until performance goals are met; i.e. the sequential program evolves into a parallel program.

[Figure: parallel regions, master thread shown in red, including a nested parallel region.]
Hello World
• This program runs on three threads:

    #include <stdio.h>
    #include <omp.h>

    int main() {
        #pragma omp parallel
        {
            printf("Hello World\n");
            #pragma omp for
            for (int i = 1; i <= 4; i++)
                printf("Iter: %d\n", i);
            printf("Goodbye World\n");
        }
        return 0;
    }

• Prints this:

    Hello World
    Hello World
    Hello World
    Iter: 1
    Iter: 2
    Iter: 3
    Iter: 4
    Goodbye World
    Goodbye World
    Goodbye World
The Private Clause
• Variables are uninitialized; a C++ object is default-constructed
• Any value external to the parallel region is undefined

    void work(float *a, float *b, float *c, int N) {
        float x, y;
        int i;
        #pragma omp parallel for private(x, y)
        for (i = 0; i < N; i++) {
            x = a[i];
            y = b[i];
            c[i] = x + y;
        }
    }
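Because a private copy starts uninitialized, reading it before writing is an error; OpenMP's firstprivate clause copies in the original value instead. A minimal sketch (the loop body is illustrative):

    float x = 10.0f;
    #pragma omp parallel for firstprivate(x)
    for (int i = 0; i < N; i++) {
        x += 1.0f;    // each thread's copy starts at 10.0f; the original x is untouched
    }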
OpenMP* Critical Construct

    float R1, R2;
    #pragma omp parallel
    {
        float A, B;
        #pragma omp for
        for (int i = 0; i < niters; i++) {
            B = big_job(i);
            #pragma omp critical (R1_lock)
            consum(B, &R1);
            A = big_job(i);
            #pragma omp critical (R2_lock)
            consum(A, &R2);
        }
    }

Threads wait their turn: at any one time, only one calls consum(), thereby protecting R1 and R2 from race conditions. Naming the critical constructs (R1_lock, R2_lock) lets the two updates proceed independently, which helps performance.
Parallel Sections
• Independent sections of code can execute concurrently

The OpenMP Task
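A minimal sketch of both constructs (do_x and do_y are illustrative placeholders):

    #pragma omp parallel sections
    {
        #pragma omp section
        do_x();              /* runs concurrently with do_y() */
        #pragma omp section
        do_y();
    }

    #pragma omp parallel
    {
        #pragma omp single
        {
            #pragma omp task
            do_x();          /* deferred; any thread in the team may run it */
            #pragma omp task
            do_y();
            #pragma omp taskwait
        }
    }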
Mixing-and-Matching

Array Building Blocks

Why Mix?
• Using third-party libraries, or code developed by other developers
• Supplementing one parallel model with bits 'borrowed' from another model
Different Parallel Constructs
    {
        gInt = 0;
        cilk_for (int i = 0; i < 8; i++) {
            Hello(i + 1);
            gInt++;
        }
        scalable_free(pA);    // memory released through the TBB scalable allocator
    }
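A self-contained sketch of this kind of mixing, pairing a Cilk Plus loop with TBB's scalable allocator (the allocation size and the Hello body are illustrative):

    #include <cilk/cilk.h>
    #include <tbb/scalable_allocator.h>
    #include <stdio.h>

    void Hello(int n) { printf("Hello %d\n", n); }

    int main() {
        float *pA = (float *)scalable_malloc(8 * sizeof(float));  // TBB allocator
        cilk_for (int i = 0; i < 8; i++) {                        // Cilk Plus loop
            Hello(i + 1);
            pA[i] = (float)i;
        }
        scalable_free(pA);                                        // matching free
        return 0;
    }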
OpenMP
• Good for monolithic applications
• But a SW architect needs to:
  – break the application's work into chunks,
  – determine which thread does what,
  – make threads do an equal amount of work.
• Good performance when it works, but some applications are too complex to design with a global view.
• Hard to use when the application is composed of libraries, or of independently developed modules.
Making the Right Choice
• Use word-bucket counting to approximate the amount of editing needed (table: parallel model vs. number of words changed).
Factors Influencing Your Choice

1. Language
   • C/C++
     – Cilk Plus
   • Fortran
     – OpenMP
     – Coarrays
2. Operating System
   – OpenMP (supported by GCC)
3. How many developers? Exclusive control of the machine?
   • Multiple developers / third-party libraries
4. Type of Parallelism / Type of Work
5. What Compiler?
   • Microsoft: no array notation
6. Standards
   • Emerging standards
   • Open source
7. Productised?
8. CPU
   – OpenMP
9. Open Source?
What is a good parallel programming model?
• Easy to use
Optimization Notice
Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel® Compiler User and Reference Guides” under “Compiler Options." Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.
Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel®
Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.
Notice revision #20101101
Backup
Comparison of Parallel Models

Model              Languages¹   Learning     Serial Semantics   Performance   Distributed   Availability
Intel Parallel Building Blocks
  Intel Cilk Plus  C++, C       Easy         Yes                Very Good     Stay Tuned    Yes⁴
  Intel TBB        C++          Medium       Yes                Very Good     n/a           Yes (Open Source)
Other Standards
  Posix            C            Easy/Hard²   No                 Difficult     n/a           Yes (Many Vendors)
  OpenMP           n/a          n/a          n/a                Good          n/a           Yes (Many Vendors); Linux, Windows, Apple
    #include <cilk/cilk.h>
    #include <cilk/reducer_opadd.h>

    {
        cilk_spawn []{
            Hello(1);
            Hello(2);
            Hello(3);
            Hello(4);
        }();
        Hello(5);
        Hello(6);
        Hello(7);
        Hello(8);
        cilk_sync;
    }
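Note the trailing () on the lambda: it is defined and immediately invoked, and cilk_spawn runs that call asynchronously. Hello(1) through Hello(4) can therefore execute in parallel with Hello(5) through Hello(8), and cilk_sync joins the two strands.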