OpenMP: Small and Easy Motivation
#include <stdio.h>
#include <stdlib.h>

int main() {
    // Do this part in parallel
    printf("Hello, World!\n");
    return 0;
}
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main() {
    omp_set_num_threads(16);
    // Do this part in parallel
    #pragma omp parallel
    {
        printf("Hello, World!\n");
    }
    return 0;
}
![Page 3: OpenMP: Small and Easy Motivation](https://reader033.vdocuments.mx/reader033/viewer/2022060603/6296b125105f68469c63efe5/html5/thumbnails/3.jpg)
Simple!
OpenMP can parallelize many serial programs with relatively few annotations that specify parallelism and independence
OpenMP is a small API that hides cumbersome threading calls with simpler directives
OpenMP
• An API for shared-memory parallel programming.
• Designed for systems in which each thread can potentially have access to all available memory.
• The system is viewed as a collection of cores or CPUs, all of which have access to main memory: a shared-memory architecture.
A shared memory system
Copyright © 2010, Elsevier Inc. All rights Reserved
Pragmas
• Special preprocessor instructions.
• Typically added to a system to allow behaviors that aren't part of the basic C specification.
• Compilers that don't support the pragmas ignore them.
#pragma
gcc -fopenmp -o omp_hello omp_hello.c
./omp_hello 4

Running with 4 threads, possible outcomes:

Hello from thread 0 of 4
Hello from thread 1 of 4
Hello from thread 2 of 4
Hello from thread 3 of 4

Hello from thread 1 of 4
Hello from thread 2 of 4
Hello from thread 0 of 4
Hello from thread 3 of 4

Hello from thread 3 of 4
Hello from thread 1 of 4
Hello from thread 2 of 4
Hello from thread 0 of 4
OpenMP pragmas
• #pragma omp parallel
  – Most basic parallel directive.
  – The number of threads that run the following structured block of code is determined by the run-time system.
A process forking and joining two threads
Clause
• Text that modifies a directive.
• The num_threads clause can be added to a parallel directive.
• It allows the programmer to specify the number of threads that should execute the following block.

# pragma omp parallel num_threads(thread_count)
Of note…
• There may be system-defined limits on the number of threads that a program can start.
• The OpenMP standard doesn't guarantee that this will actually start thread_count threads.
• Unless we're trying to start a lot of threads, we will almost always get the desired number of threads.
Some terminology
• In OpenMP parlance, the collection of threads executing the parallel block (the original thread plus the new threads) is called a team. The original thread is called the master, and the additional threads are called slaves. (Newer OpenMP specifications use the terms primary thread and worker threads.)
Consider the trapezoidal rule
Serial algorithm
A First OpenMP Version
1) We identify two types of tasks:
   a) computing the areas of individual trapezoids, and
   b) adding the areas of the trapezoids.
2) There is no communication among the tasks in the first collection, but each task in the first collection communicates with task 1b.
A First OpenMP Version
3) We assume that there will be many more trapezoids than cores.
   • So we aggregate tasks by assigning a contiguous block of trapezoids to each thread.
Assignment of trapezoids to threads
Unpredictable results when two (or more) threads attempt to simultaneously execute:
global_result += my_result ;
![Page 20: OpenMP: Small and Easy Motivation](https://reader033.vdocuments.mx/reader033/viewer/2022060603/6296b125105f68469c63efe5/html5/thumbnails/20.jpg)
Mutual exclusion
# pragma omp critical
global_result += my_result;

Only one thread can execute the following structured block at a time.
Another Example
• Problem: Count the number of times each ASCII character occurs on a page of text.
• Input: ASCII text stored as an array of characters.
• Output: A histogram with 128 buckets, one for each ASCII character.
source: http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html
void compute_histogram(char *page, int page_size, int *histogram) {
    for (int i = 0; i < page_size; i++) {
        char read_character = page[i];
        histogram[read_character]++;
    }
}
Sequential Version
Speed on Quad Core: 10.4 seconds
We need to parallelize this.
void compute_histogram(char *page, int page_size, int *histogram) {
    #pragma omp parallel for
    for (int i = 0; i < page_size; i++) {
        char read_character = page[i];
        histogram[read_character]++;   // data race: threads update shared counters
    }
}
The above code does not work!! Why?
void compute_histogram(char *page, int page_size, int *histogram) {
    #pragma omp parallel for
    for (int i = 0; i < page_size; i++) {
        char read_character = page[i];
        #pragma omp atomic
        histogram[read_character]++;
    }
}
Speed on Quad Core: 114.9 seconds, more than 10x slower than the single-threaded version (10.4 seconds).
void compute_histogram(char *page, int page_size,
                       int *histogram, int num_buckets) {
    #pragma omp parallel
    {
        int local_histogram[128][num_buckets];   // private: each thread has its own copy
        int tid = omp_get_thread_num();
        for (int i = 0; i < num_buckets; i++)
            local_histogram[tid][i] = 0;         // must be zeroed before counting
        #pragma omp for nowait
        for (int i = 0; i < page_size; i++) {
            char read_character = page[i];
            local_histogram[tid][read_character]++;
        }
        for (int i = 0; i < num_buckets; i++) {
            #pragma omp atomic
            histogram[i] += local_histogram[tid][i];
        }
    }
}
Runs in 3.8 seconds. Why is the speedup not 4 yet?
void compute_histogram(char *page, int page_size,
                       int *histogram, int num_buckets) {
    int num_threads = omp_get_max_threads();
    int local_histogram[num_threads][num_buckets];
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for (int i = 0; i < num_buckets; i++)
            local_histogram[tid][i] = 0;   // zero this thread's row before counting
        #pragma omp for
        for (int i = 0; i < page_size; i++) {
            char read_character = page[i];
            local_histogram[tid][read_character]++;
        }
        #pragma omp barrier
        #pragma omp single
        for (int t = 0; t < num_threads; t++)
            for (int i = 0; i < num_buckets; i++)
                histogram[i] += local_histogram[t][i];
    }
}
Speed is 4.4 seconds, slower than the previous version.
void compute_histogram(char *page, int page_size,
                       int *histogram, int num_buckets) {
    int num_threads = omp_get_max_threads();
    int local_histogram[num_threads][num_buckets];
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for (int i = 0; i < num_buckets; i++)
            local_histogram[tid][i] = 0;   // zero this thread's row before counting
        #pragma omp for
        for (int i = 0; i < page_size; i++) {
            char read_character = page[i];
            local_histogram[tid][read_character]++;
        }
        /* observe the new order of the following two loops */
        #pragma omp for
        for (int i = 0; i < num_buckets; i++)
            for (int t = 0; t < num_threads; t++)
                histogram[i] += local_histogram[t][i];
    }
}
Speed is 3.6 seconds.
What Can We Learn from the Previous Example?
• Atomic operations
  – They are expensive.
  – Yet, they are fundamental building blocks.
• Synchronization:
  – correctness vs. performance loss
  – Rich interaction of hardware-software tradeoffs
  – Must evaluate hardware primitives and software algorithms together
OpenMP Parallel Programming
1. Start with a parallelizable algorithm
   • loop-level parallelism is necessary
2. Implement serially
3. Test and debug
4. Annotate the code with parallelization (and synchronization) directives
   • Hope for linear speedup
5. Test and debug
OpenMP uses the fork-join model of parallel execution.
• All OpenMP programs begin with a single thread: the master thread (ID = 0).
• FORK: the master thread then creates a team of parallel threads.
• JOIN: when the team threads complete the statements in the parallel region construct, they synchronize and terminate.
Programming Model - Threading
int main() {
    // serial region
    printf("Hello...");

    // parallel region: fork
    #pragma omp parallel
    {
        printf("World");
    }
    // join

    // serial again
    printf("!");
}
We didn't use omp_set_num_threads(), so what will be the output? "World" will be printed once per thread; the default team size is given by omp_get_max_threads(), which is typically the number of cores on the machine.
Isn’t Nested Parallelism Interesting?
(figure: a fork/join nested inside an outer fork/join)
Important!
• The following are implementation-dependent:
  – Nested parallelism
  – Dynamically altering the number of threads
• It is entirely up to the programmer to ensure that I/O is conducted correctly within the context of a multithreaded program.
What we learned so far
• #include <omp.h>
• gcc -fopenmp …
• omp_set_num_threads(x);
• omp_get_thread_num();
• omp_get_num_threads();
• #pragma omp parallel [num_threads(x)]
• #pragma omp parallel for [nowait]
• #pragma omp atomic
• #pragma omp barrier
• #pragma omp single
• #pragma omp critical
Functional Parallelism Example
v = alpha();
w = beta();
x = gamma(v, w);
y = delta();
z = epsilon(x, y);

(figure: dependence graph: alpha and beta feed gamma; gamma and delta feed epsilon)
Functional Parallelism with OpenMP, 1: parallel sections, section pragmas

#pragma omp parallel sections
{
    #pragma omp section   // optional for the first section
    v = alpha();
    #pragma omp section
    w = beta();
    #pragma omp section
    y = delta();
}
x = gamma(v, w);
z = epsilon(x, y);
Functional Parallelism with OpenMP, 2: separating parallel from sections

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section   // optional
        v = alpha();
        #pragma omp section
        w = beta();
    }
    #pragma omp sections
    {
        #pragma omp section   // optional
        x = gamma(v, w);
        #pragma omp section
        y = delta();
    }
}
z = epsilon(x, y);
Conclusions
• OpenMP is a standard for programming shared-memory systems.
• OpenMP uses both special functions and preprocessor directives called pragmas.
• OpenMP programs start multiple threads rather than multiple processes.
• Many OpenMP directives can be modified by clauses.