OpenMP: Small and Easy Motivation


Page 1: OpenMP: Small and Easy Motivation

OpenMP: Small and Easy Motivation

#include <stdio.h>
#include <stdlib.h>

int main() {
    // Do this part in parallel
    printf( "Hello, World!\n" );
    return 0;
}

Page 2: OpenMP: Small and Easy Motivation

OpenMP: Small and Easy Motivation

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main() {
    omp_set_num_threads(16);

    // Do this part in parallel
    #pragma omp parallel
    {
        printf( "Hello, World!\n" );
    }
    return 0;
}

Page 3: OpenMP: Small and Easy Motivation

Simple!

OpenMP can parallelize many serial programs with relatively few annotations that specify parallelism and independence

OpenMP is a small API that hides cumbersome threading calls with simpler directives

Page 4: OpenMP: Small and Easy Motivation

OpenMP

• An API for shared-memory parallel programming.

• Designed for systems in which each thread can potentially have access to all available memory.

• The system is viewed as a collection of cores or CPUs, all of which have access to main memory (a shared-memory architecture).

Page 5: OpenMP: Small and Easy Motivation

A shared memory system


Page 6: OpenMP: Small and Easy Motivation

Pragmas

• Special preprocessor instructions.

• Typically added to a system to allow behaviors that aren’t part of the basic C specification.

• Compilers that don’t support the pragmas ignore them.


#pragma
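Because unsupported pragmas are simply ignored, an OpenMP program can be written so that it also compiles and runs correctly as a plain serial program. A minimal sketch (the _OPENMP macro is predefined by OpenMP-aware compilers; this example is illustrative, not from the slides):

#include <stdio.h>
#ifdef _OPENMP          /* defined only when compiling with OpenMP support */
#include <omp.h>
#endif

int main() {
    /* Ignored by a compiler without OpenMP: the block then runs once, serially. */
    #pragma omp parallel
    {
        int my_rank = 0, thread_count = 1;
#ifdef _OPENMP
        my_rank = omp_get_thread_num();
        thread_count = omp_get_num_threads();
#endif
        printf("Hello from thread %d of %d\n", my_rank, thread_count);
    }
    return 0;
}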

Page 7: OpenMP: Small and Easy Motivation

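The code listing for this slide is not reproduced in the transcript. The next page compiles and runs omp_hello.c with the thread count given on the command line, so a sketch consistent with that output (the actual source may differ) would be:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    /* Thread count comes from the command line, e.g.  ./omp_hello 4  */
    int thread_count = (argc > 1) ? (int) strtol(argv[1], NULL, 10) : 1;

    printf("running with %d threads\n", thread_count);

    #pragma omp parallel num_threads(thread_count)
    {
        int my_rank = omp_get_thread_num();
        int n = omp_get_num_threads();
        printf("Hello from thread %d of %d\n", my_rank, n);
    }

    return 0;
}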

Page 8: OpenMP: Small and Easy Motivation

gcc -fopenmp -o omp_hello omp_hello.c

./omp_hello 4

running with 4 threads

Hello from thread 0 of 4

Hello from thread 1 of 4

Hello from thread 2 of 4

Hello from thread 3 of 4

Hello from thread 1 of 4

Hello from thread 2 of 4

Hello from thread 0 of 4

Hello from thread 3 of 4

Hello from thread 3 of 4

Hello from thread 1 of 4

Hello from thread 2 of 4

Hello from thread 0 of 4

Possible outcomes: the threads’ output can appear in any order.

Page 9: OpenMP: Small and Easy Motivation

OpenMP pragmas


• #pragma omp parallel

– Most basic parallel directive.

– The number of threads that run the following structured block of code is determined by the run-time system.

Page 10: OpenMP: Small and Easy Motivation

A process forking and joining two threads


Page 11: OpenMP: Small and Easy Motivation

Clause

• Text that modifies a directive.

• The num_threads clause can be added to a parallel directive.

• It allows the programmer to specify the number of threads that should execute the following block.


#pragma omp parallel num_threads(thread_count)

Page 12: OpenMP: Small and Easy Motivation

Of note…

• There may be system-defined limitations on the number of threads that a program can start.

• The OpenMP standard doesn’t guarantee that this will actually start thread_count threads.

• Unless we’re trying to start a lot of threads, we will almost always get the desired number of threads.


Page 13: OpenMP: Small and Easy Motivation

Some terminology

• In OpenMP parlance, the collection of threads executing the parallel block (the original thread and the new threads) is called a team; the original thread is called the master, and the additional threads are called slaves.

Page 14: OpenMP: Small and Easy Motivation

Consider the trapezoidal rule

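The figure is not reproduced here. For reference, the composite trapezoidal rule approximates a definite integral by summing the areas of n trapezoids of width h:

\int_a^b f(x)\,dx \;\approx\; h\left[\frac{f(a)}{2} + f(a+h) + f(a+2h) + \cdots + f(a+(n-1)h) + \frac{f(b)}{2}\right], \qquad h = \frac{b-a}{n}.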

Page 15: OpenMP: Small and Easy Motivation

Serial algorithm

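The serial listing itself is not in the transcript; a typical serial implementation of the rule above looks roughly like the sketch below (the names f and Trap are illustrative, not necessarily the slide’s exact code):

#include <stdio.h>

double f(double x) { return x * x; }     /* example integrand */

/* Serial trapezoidal rule: integrate f over [a, b] using n trapezoids. */
double Trap(double a, double b, int n) {
    double h = (b - a) / n;
    double approx = (f(a) + f(b)) / 2.0;
    for (int i = 1; i <= n - 1; i++)
        approx += f(a + i * h);
    return h * approx;
}

int main() {
    printf("Estimate of the integral: %f\n", Trap(0.0, 3.0, 1024));
    return 0;
}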

Page 16: OpenMP: Small and Easy Motivation

A First OpenMP Version

1) We identify two types of tasks:

a) computing the areas of individual trapezoids, and

b) adding the areas of trapezoids.

2) There is no communication among the tasks in the first collection, but each task in the first collection communicates with task 1b.


Page 17: OpenMP: Small and Easy Motivation

A First OpenMP Version

3) We assume that there will be many more trapezoids than cores.

• So we aggregate tasks by assigning a contiguous block of trapezoids to each thread.


Page 18: OpenMP: Small and Easy Motivation

Assignment of trapezoids to threads


Page 19: OpenMP: Small and Easy Motivation


Unpredictable results when two (or more) threads attempt to simultaneously execute:

global_result += my_result;

(The update is not atomic: each thread loads global_result, adds to it, and stores it back, so concurrent updates can interleave and one thread’s contribution can be lost.)

Page 20: OpenMP: Small and Easy Motivation

Mutual exclusion


#pragma omp critical
global_result += my_result;

The critical directive ensures that only one thread can execute the following structured block at a time.

Page 21: OpenMP: Small and Easy Motivation

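The code for this slide is not in the transcript. A sketch of the approach described on the preceding pages, with each thread handling a contiguous block of trapezoids and adding its partial sum to the global result inside a critical section (names such as Local_trap are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

double f(double x) { return x * x; }     /* example integrand */

/* Each thread integrates its own contiguous sub-interval of [a, b]. */
double Local_trap(double a, double b, int n) {
    double h = (b - a) / n;
    int my_rank = omp_get_thread_num();
    int thread_count = omp_get_num_threads();
    int local_n = n / thread_count;              /* trapezoids per thread     */
    double local_a = a + my_rank * local_n * h;  /* this thread's block start */
    double local_b = local_a + local_n * h;

    double my_result = (f(local_a) + f(local_b)) / 2.0;
    for (int i = 1; i <= local_n - 1; i++)
        my_result += f(local_a + i * h);
    return my_result * h;
}

int main(int argc, char *argv[]) {
    int thread_count = (argc > 1) ? atoi(argv[1]) : 4;
    double a = 0.0, b = 3.0;
    int n = 1 << 20;                    /* assume n is divisible by thread_count */
    double global_result = 0.0;

    #pragma omp parallel num_threads(thread_count)
    {
        double my_result = Local_trap(a, b, n);
        #pragma omp critical
        global_result += my_result;     /* only one thread updates at a time */
    }

    printf("Estimate of the integral: %f\n", global_result);
    return 0;
}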


Page 23: OpenMP: Small and Easy Motivation

Another Example

• Problem: Count the number of times each ASCII character occurs on a page of text.

• Input: ASCII text stored as an array of characters.

• Output: A histogram with 128 buckets, one for each ASCII character.

source: http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html

Page 24: OpenMP: Small and Easy Motivation

Another Example

source: http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html

void compute_histogram(char *page, int page_size, int *histogram){
    for(int i = 0; i < page_size; i++){
        char read_character = page[i];
        histogram[read_character]++;
    }
}

Sequential Version

Speed on Quad Core: 10.4 seconds

Page 25: OpenMP: Small and Easy Motivation

Another Example

source: http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html

We need to parallelize this.

Page 26: OpenMP: Small and Easy Motivation

Another Example

source: http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html

void compute_histogram(char *page, int page_size, int *histogram){
    #pragma omp parallel for
    for(int i = 0; i < page_size; i++){
        char read_character = page[i];
        histogram[read_character]++;
    }
}

The above code does not work correctly! Why? (Hint: multiple threads update the same histogram[] entries concurrently.)

Page 27: OpenMP: Small and Easy Motivation

Another Example

source: http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html

void compute_histogram(char *page, int page_size, int *histogram){
    #pragma omp parallel for
    for(int i = 0; i < page_size; i++){
        char read_character = page[i];
        #pragma omp atomic
        histogram[read_character]++;
    }
}

Speed on Quad Core: 114.9 seconds, more than 10x slower than the single-threaded version (10.4 seconds).

Page 28: OpenMP: Small and Easy Motivation

Another Example

source: http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html

void compute_histogram(char *page, int page_size, int *histogram, int num_buckets){
    #pragma omp parallel
    {
        int local_histogram[128][num_buckets];   /* each thread uses only its own row */
        int tid = omp_get_thread_num();
        for(int i = 0; i < num_buckets; i++)     /* local counts must start at zero */
            local_histogram[tid][i] = 0;
        #pragma omp for nowait
        for(int i = 0; i < page_size; i++){
            char read_character = page[i];
            local_histogram[tid][read_character]++;
        }
        for(int i = 0; i < num_buckets; i++){
            #pragma omp atomic
            histogram[i] += local_histogram[tid][i];
        }
    }
}

Runs in 3.8 seconds. Why isn’t the speedup 4x yet?

Page 29: OpenMP: Small and Easy Motivation

Another Example

source: http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html

void compute_histogram(char *page, int page_size, int *histogram, int num_buckets){
    int num_threads = omp_get_max_threads();
    int local_histogram[num_threads][num_buckets];
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for(int i = 0; i < num_buckets; i++)     /* local counts must start at zero */
            local_histogram[tid][i] = 0;
        #pragma omp for
        for(int i = 0; i < page_size; i++){
            char read_character = page[i];
            local_histogram[tid][read_character]++;
        }
        #pragma omp barrier
        #pragma omp single
        for(int t = 0; t < num_threads; t++)
            for(int i = 0; i < num_buckets; i++)
                histogram[i] += local_histogram[t][i];
    }
}

Speed is 4.4 seconds. Slower than the previous version.

Page 30: OpenMP: Small and Easy Motivation

Another Example

source: http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html

void compute_histogram(char *page, int page_size, int *histogram, int num_buckets){
    int num_threads = omp_get_max_threads();
    int local_histogram[num_threads][num_buckets];
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for(int i = 0; i < num_buckets; i++)     /* local counts must start at zero */
            local_histogram[tid][i] = 0;
        #pragma omp for
        for(int i = 0; i < page_size; i++){
            char read_character = page[i];
            local_histogram[tid][read_character]++;
        }
        #pragma omp for   /* observe the new order of the following two loops */
        for(int i = 0; i < num_buckets; i++)
            for(int t = 0; t < num_threads; t++)
                histogram[i] += local_histogram[t][i];
    }
}

Speed is 3.6 seconds.
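A small driver for the compute_histogram() routine above (purely illustrative, not from the slides; it links against the version of the function shown on this page):

#include <stdio.h>
#include <string.h>

#define NUM_BUCKETS 128

/* Defined as on the slide above. */
void compute_histogram(char *page, int page_size, int *histogram, int num_buckets);

int main() {
    char page[] = "hello, histogram world";
    int histogram[NUM_BUCKETS];
    memset(histogram, 0, sizeof histogram);      /* global counts start at zero */

    compute_histogram(page, (int) strlen(page), histogram, NUM_BUCKETS);

    for (int i = 0; i < NUM_BUCKETS; i++)
        if (histogram[i] > 0)
            printf("'%c': %d\n", i, histogram[i]);
    return 0;
}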

Page 31: OpenMP: Small and Easy Motivation

What Can We Learn from the Previous Example?

• Atomic operations
– They are expensive.
– Yet, they are fundamental building blocks.

• Synchronization:
– correctness vs. performance loss
– Rich interaction of hardware-software tradeoffs
– Must evaluate hardware primitives and software algorithms together

Page 32: OpenMP: Small and Easy Motivation

OpenMP Parallel Programming

1. Start with a parallelizable algorithm
• loop-level parallelism is necessary

2. Implement serially

3. Test and Debug

4. Annotate the code with parallelization (and synchronization) directives
• Hope for linear speedup

5. Test and Debug

Page 33: OpenMP: Small and Easy Motivation

All OpenMP programs begin with a single thread: the master thread (ID = 0).

FORK: the master thread then creates a team of parallel threads.

JOIN: when the team threads complete the statements in the parallel region construct, they synchronize and terminate.

OpenMP uses the fork-join model of parallel execution.

Page 34: OpenMP: Small and Easy Motivation

Programming Model - Threading

int main() {
    // serial region
    printf("Hello...");

    // parallel region (Fork)
    #pragma omp parallel
    {
        printf("World");
    }

    // serial again (Join)
    printf("!");
}

We didn’t use omp_set_num_threads(), so what will be the output? "World" will be printed once per thread; the default team size (what omp_get_max_threads() reports) is typically the number of cores on the machine.
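One way to check that default on a given machine (a small sketch, not part of the original slides):

#include <stdio.h>
#include <omp.h>

int main() {
    /* Upper bound on the team size for the next parallel region */
    printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
    /* Number of processors the runtime sees */
    printf("omp_get_num_procs()   = %d\n", omp_get_num_procs());

    #pragma omp parallel
    {
        #pragma omp single      /* print the actual team size once */
        printf("team size = %d\n", omp_get_num_threads());
    }
    return 0;
}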

Page 35: OpenMP: Small and Easy Motivation

Isn’t Nested Parallelism Interesting?

(Figure: a second fork/join nested inside an outer fork/join.)

Page 36: OpenMP: Small and Easy Motivation

Important!

• The following are implementation dependent:
– Nested parallelism (see the sketch below)
– Dynamically altering the number of threads

• It is entirely up to the programmer to ensure that I/O is conducted correctly within the context of a multithreaded program.
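A minimal sketch of nested parallelism (whether the inner regions actually receive extra threads is implementation dependent, as noted above):

#include <stdio.h>
#include <omp.h>

int main() {
    /* Request up to two levels of active (nested) parallelism. */
    omp_set_max_active_levels(2);

    #pragma omp parallel num_threads(2)
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel num_threads(2)    /* nested fork inside the outer team */
        {
            printf("outer thread %d, inner thread %d\n",
                   outer, omp_get_thread_num());
        }
    }
    return 0;
}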

Page 37: OpenMP: Small and Easy Motivation

What we learned so far

• #include <omp.h>
• gcc -fopenmp …
• omp_set_num_threads(x);
• omp_get_thread_num();
• omp_get_num_threads();
• #pragma omp parallel [num_threads(x)]
• #pragma omp parallel for
• #pragma omp for [nowait]
• #pragma omp atomic
• #pragma omp barrier
• #pragma omp single
• #pragma omp critical

Page 38: OpenMP: Small and Easy Motivation

Functional Parallelism Example

v = alpha();
w = beta();
x = gamma(v, w);
y = delta();
z = epsilon(x, y);

(Task graph: α and β feed γ; γ and δ feed ε.)

Page 39: OpenMP: Small and Easy Motivation

Functional Parallelism with OpenMP, 1: parallel sections and section pragmas

#pragma omp parallel sections
{
    #pragma omp section  // optional
    v = alpha();
    #pragma omp section
    w = beta();
    #pragma omp section
    y = delta();
}
x = gamma(v, w);
z = epsilon(x, y);

Page 40: OpenMP: Small and Easy Motivation

Functional Parallelism with OpenMP, 2: separating parallel from sections

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section  // optional
        v = alpha();
        #pragma omp section
        w = beta();
    }
    #pragma omp sections
    {
        #pragma omp section  // optional
        x = gamma(v, w);
        #pragma omp section
        y = delta();
    }
}
z = epsilon(x, y);

Page 41: OpenMP: Small and Easy Motivation

Conclusions

• OpenMP is a standard for programming shared-memory systems.

• OpenMP uses both special functions and preprocessor directives called pragmas.

• OpenMP programs start multiple threads rather than multiple processes.

• Many OpenMP directives can be modified by clauses.