
Concurrent Programming with OpenMP

Parallel and Distributed Computing

Department of Computer Science and Engineering (DEI), Instituto Superior Técnico

October 3, 2011


Outline

Shared Memory Concurrent Programming

Review of Operating Systems: PThreads

OpenMP

Parallel Clauses

Private / Shared Variables


Shared-Memory Systems

Uniform Memory Access (UMA) architecture, also known as Symmetric Shared-Memory Multiprocessors (SMP)

[Figure: four processors (P), each with its own cache, sharing main memory and I/O over a common interconnect.]


Fork/Join Parallelism

“Cheap” creation/termination of tasks invites Incremental Parallelization: the process of converting a sequential program into a parallel program a little bit at a time.

Initially only the master thread is active

The master thread executes sequential code

Fork: the master thread creates or awakens additional threads to execute parallel code

Join: at the end of the parallel code, the created threads die or are suspended

[Figure: timeline, flowing downward, of the master thread forking a team of other threads, joining them, then forking and joining again.]


Fork/Join Parallelism

read(A, B);
x = initX(A, B);
y = initY(A, B);
z = initZ(A, B);

for (i = 0; i < N_ENTRIES; i++)
    x[i] = compX(y[i], z[i]);

for (i = 1; i < N_ENTRIES; i++) {
    x[i] = solveX(x[i-1]);
    z[i] = x[i] + y[i];
}

finalize1(&x, &y, &z);
finalize2(&x, &y, &z);
finalize3(&x, &y, &z);
...


Processes and Threads

[Figure: Process A holds global data / shared code, system resources, interprocess communication channels, and the environment, shared by Threads 1..n; each thread has its own private data and stack. A separate Process B is shown alongside.]


POSIX Threads (PThreads): Creation

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);

Example:

pthread_t pt_worker;

void *thread_function(void *args) { /* thread code */ }

pthread_create(&pt_worker, NULL,
               thread_function, (void *) thread_args);


PThreads: Termination and Synchronization

void pthread_exit(void *value_ptr);

int pthread_join(pthread_t thread, void **value_ptr);

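A minimal sketch (not on the slides) of how the two calls pair up: the pointer handed to pthread_exit() is what pthread_join() retrieves.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical worker: returns the square of its int argument. */
void *square(void *arg) {
    int *result = malloc(sizeof(int));
    *result = (*(int *) arg) * (*(int *) arg);
    pthread_exit(result);            /* equivalent to: return result; */
}

int main(void) {
    pthread_t t;
    int n = 7;
    void *ret;

    pthread_create(&t, NULL, square, &n);
    pthread_join(t, &ret);           /* blocks until t terminates */
    printf("%d\n", *(int *) ret);    /* prints 49 */
    free(ret);
    return 0;
}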

PThread Example: Summing the Values in Matrix Rows

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

/* N and SIZE are left undefined on the slide; example values: */
#define N 4
#define SIZE 8

int buffer[N][SIZE];

void *sum_row(void *ptr) {
    int index = 0, sum = 0;
    int *b = (int *) ptr;
    while (index < SIZE - 1)
        sum += b[index++];   /* sum row */
    b[index] = sum;          /* store sum in last col. */
    pthread_exit(NULL);
}

int main(void) {
    int i, j;
    pthread_t tid[N];

    for (i = 0; i < N; i++)
        for (j = 0; j < SIZE - 1; j++)
            buffer[i][j] = rand() % 10;

    for (i = 0; i < N; i++)
        if (pthread_create(&tid[i], NULL, sum_row,
                           (void *) &(buffer[i])) != 0) {
            printf("Error creating thread, id=%d\n", i);
            exit(-1);
        }
        else
            printf("Created thread w/ id %d\n", i);

    for (i = 0; i < N; i++)
        pthread_join(tid[i], NULL);
    printf("All threads have concluded\n");

    for (i = 0; i < N; i++) {
        for (j = 0; j < SIZE; j++)
            printf(" %d ", buffer[i][j]);
        printf("Row %d \n", i);
    }
    exit(0);
}


PThreads: Synchronization

int pthread_mutex_init(pthread_mutex_t *mutex,
                       const pthread_mutexattr_t *attr);

int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);

Example:

pthread_mutex_t count_lock;

pthread_mutex_init(&count_lock, NULL);

pthread_mutex_lock(&count_lock);
atomic_function();
pthread_mutex_unlock(&count_lock);


PThreads: Synchronization Example

int count;

void *sum_row(void *ptr) {
    int index = 0, sum = 0;
    int *b = (int *) ptr;
    while (index < SIZE - 1)
        sum += b[index++];   /* sum row */
    b[index] = sum;          /* store sum in last col. */
    count++;
    pthread_exit(NULL);
}

Problem? count is shared by all threads, and count++ is a non-atomic read-modify-write: two threads can read the same value and both store the same incremented result, losing an update.


PThreads: Synchronization Example

int count;

pthread_mutex_t count_lock;

void *sum_row(void *ptr) {
    int index = 0, sum = 0;
    int *b = (int *) ptr;
    while (index < SIZE - 1)
        sum += b[index++];   /* sum row */
    b[index] = sum;          /* store sum in last col. */
    pthread_mutex_lock(&count_lock);
    count++;
    pthread_mutex_unlock(&count_lock);
    pthread_exit(NULL);
}

main() { /* ... */
    pthread_mutex_init(&count_lock, NULL);
}


OpenMP

What is OpenMP?

Open specification for Multi-Threaded, Shared Memory Parallelism

Standard Application Programming Interface (API):

Preprocessor (compiler) directives

Library calls

Environment variables
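One example of each mechanism (work() stands in for user code; gcc enables OpenMP with -fopenmp):

#pragma omp parallel        /* 1. compiler directive          */
{ work(); }

omp_set_num_threads(4);     /* 2. library call, from <omp.h>  */

/* 3. environment variable, set before launching the program:
       OMP_NUM_THREADS=8 ./a.out                              */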

More info at www.openmp.org


OpenMP vs Threads

(Supposedly) Better than threads:

Simpler programming model

Separate a program into serial and parallel regions, rather than T concurrently-executing threads

Similar to threads:

Programmer must detect dependencies

Programmer must prevent data races


Parallel Programming Recipes

Threads:

1. Start with a parallel algorithm
2. Implement, keeping in mind:
   - Data races
   - Synchronization
   - Threading syntax
3. Test & Debug
4. Goto step 2

OpenMP:

1. Start with some algorithm
2. Implement serially, ignoring:
   - Data races
   - Synchronization
   - Threading syntax
3. Test & Debug
4. Automagically parallelize, with relatively few annotations that specify parallelism and synchronization



OpenMP Directives

Parallelization directives:

parallel region

parallel for

parallel sections

task

Data environment directives:

shared, private, threadprivate, reduction, etc.

Synchronization directives:

barrier, critical


C / C++ Directives Format

#pragma omp directive-name [clause,...] \n

Case sensitive

Long directive lines may be continued on succeeding lines by escapingthe newline character with a “\” at the end of the directive line
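For example (private(j) and shared(a) are just placeholder clauses):

#pragma omp parallel for \
        private(j) shared(a)
for (j = 0; j < n; j++) { ... }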

Always apply to the next statement, which must be a structured block. Examples:

#pragma omp ...
statement

#pragma omp ...
{ statement1; statement2; statement3; }


Parallel Region

#pragma omp parallel [clauses]

Creates N parallel threads

All execute subsequent block

All wait for each other at the end of executing the block

Barrier synchronization


How Many Threads?

The number of threads created is determined by, in order of precedence:

Use of the omp_set_num_threads() library function

Setting of the OMP_NUM_THREADS environment variable

Implementation default - usually the number of CPUs

Possible to query number of CPUs:

int omp_get_num_procs (void)
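A small sketch (not on the slide) combining the query with an explicit request:

#include <omp.h>
#include <stdio.h>

int main(void) {
    printf("%d CPUs\n", omp_get_num_procs());

    omp_set_num_threads(2);   /* takes precedence over OMP_NUM_THREADS */

    #pragma omp parallel
    printf("hello\n");        /* printed once per thread: twice here */

    return 0;
}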


Parallel Region Example

main() {
    printf("Serial Region 1\n");

    omp_set_num_threads(4);

    #pragma omp parallel
    {
        printf("Parallel Region\n");
    }

    printf("Serial Region 2\n");
}

Output?
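With the four threads requested above, the expected output is (the four parallel lines are identical, so their interleaving is invisible):

Serial Region 1
Parallel Region
Parallel Region
Parallel Region
Parallel Region
Serial Region 2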


Thread Count and Id API

#include <omp.h>

int omp_get_thread_num()

int omp_get_num_threads()

void omp_set_num_threads(int num)

Example:

#pragma omp parallel
{
    if (!omp_get_thread_num())
        master();
    else
        slave();
}


Work Sharing Directives

Always occur within a parallel region

Divide the execution of the enclosed code region among the members of the team

Do not create new threads

Two main directives are:

parallel for

parallel sections


Parallel For

#pragma omp parallel
#pragma omp for [clauses]
for ( ; ; ) { ... }

Each thread executes a subset of the iterations

All threads synchronize at the end of parallel for

Restrictions

No data dependencies between iterations

Program correctness must not depend upon which thread executes a particular iteration

Paradigm of Data Parallelism.
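For contrast, the second loop of the earlier fork/join example violates this restriction and must stay sequential:

for (i = 1; i < N_ENTRIES; i++) {   /* NOT a valid parallel for:       */
    x[i] = solveX(x[i-1]);          /* iteration i reads x[i-1], which */
    z[i] = x[i] + y[i];             /* iteration i-1 writes            */
}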


Handy Shortcut

#pragma omp parallel
#pragma omp for
for ( ; ; ) { ... }

is equivalent to

#pragma omp parallel for
for ( ; ; ) { ... }


PThread Example Revisited

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

/* N and SIZE are left undefined on the slide; example values: */
#define N 4
#define SIZE 8

int buffer[N][SIZE];

void *sum_row(void *ptr) {
    int index = 0, sum = 0;
    int *b = (int *) ptr;
    while (index < SIZE - 1)
        sum += b[index++];   /* sum row */
    b[index] = sum;          /* store sum in last col. */
    pthread_exit(NULL);
}

int main(void) {
    int i, j;
    pthread_t tid[N];

    for (i = 0; i < N; i++)
        for (j = 0; j < SIZE - 1; j++)
            buffer[i][j] = rand() % 10;

    for (i = 0; i < N; i++)
        if (pthread_create(&tid[i], NULL, sum_row,
                           (void *) &(buffer[i])) != 0) {
            printf("Error creating thread, id=%d\n", i);
            exit(-1);
        }
        else
            printf("Created thread w/ id %d\n", i);

    for (i = 0; i < N; i++)
        pthread_join(tid[i], NULL);
    printf("All threads have concluded\n");

    for (i = 0; i < N; i++) {
        for (j = 0; j < SIZE; j++)
            printf(" %d ", buffer[i][j]);
        printf("Row %d \n", i);
    }
    exit(0);
}


PThread Example Revisited

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

/* N and SIZE are left undefined on the slide; example values: */
#define N 4
#define SIZE 8

int buffer[N][SIZE];

void *sum_row(void *ptr) {
    int index = 0, sum = 0;
    int *b = (int *) ptr;
    while (index < SIZE - 1)
        sum += b[index++];   /* sum row */
    b[index] = sum;          /* store sum in last col. */
    return NULL;             /* no pthread_exit() needed */
}

int main(void) {
    int i, j;

    for (i = 0; i < N; i++)
        for (j = 0; j < SIZE - 1; j++)
            buffer[i][j] = rand() % 10;

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        sum_row(buffer[i]);

    printf("All threads have concluded\n");

    for (i = 0; i < N; i++) {
        for (j = 0; j < SIZE; j++)
            printf(" %d ", buffer[i][j]);
        printf("Row %d \n", i);
    }
    exit(0);
}


Multiple Work Sharing Directives

May occur within the same parallel region:

#pragma omp parallel
{
    #pragma omp for
    for ( ; ; ) { ... }

    #pragma omp for
    for ( ; ; ) { ... }
}

Implicit barrier at the end of each for.


Parallel Sections

Functional Parallelism: several blocks are executed in parallel, each section being executed once by a single thread of the team

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        { a = ...;
          b = ...; }

        #pragma omp section   /* <- delimiter! */
        { c = ...;
          d = ...; }

        #pragma omp section
        { e = ...;
          f = ...; }

        #pragma omp section
        { g = ...;
          h = ...; }
    } /* omp end sections */
} /* omp end parallel */


Handy Shortcut

#pragma omp parallel
#pragma omp sections
{ ... }

is equivalent to

#pragma omp parallel sections
{ ... }


OpenMP Memory Model

Concurrent programs access two types of data

Shared data, visible to all threads

Private data, visible to a single thread (often stack-allocated)

Threads:

Global variables are shared

Local variables are private

OpenMP:

All variables are by default shared.

Some exceptions:

the loop variable of a parallel for is private

stack (local) variables in called subroutines are private

By using data directives, some variables can be made private or given other special characteristics.



Private Variables

#pragma omp parallel for private( list )

Makes a private, per-thread copy of each variable in the list.

No storage association with original object

All references are to the local object

Values are undefined on entry and exit

Also applies to other region and work-sharing directives.


Shared Variables

#pragma omp parallel for shared ( list )

Similarly, there is a shared data directive.

Shared variables exist in a single location and all threads can read and write them

It is the programmer's responsibility to ensure that multiple threads properly access shared variables (synchronization will be discussed next)


Example PThread vs OpenMP

PThreads:

// shared, globals
int n, *x, *y;

void loop() {
    int i;   // private, stack
    for (i = 0; i < n; i++)
        x[i] += y[i];
}

OpenMP:

#pragma omp parallel \
        shared(n, x, y) private(i)
{
    #pragma omp for
    for (i = 0; i < n; i++)
        x[i] += y[i];
}
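What the omp for hides: in the PThreads column, each thread would have to compute its own iteration range by hand. A minimal sketch of that bookkeeping, assuming the globals above and that the programmer passes each thread its id tid and the team size nthreads (both hypothetical parameters):

void loop_chunk(int tid, int nthreads) {
    int chunk = (n + nthreads - 1) / nthreads;    /* ceiling division  */
    int lo = tid * chunk;                         /* first iteration   */
    int hi = (lo + chunk < n) ? lo + chunk : n;   /* one past the last */
    for (int i = lo; i < hi; i++)
        x[i] += y[i];
}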


Example PThread vs OpenMP

PThreads:

// shared, globals
int n, *x, *y;

void loop() {
    int i;   // private, stack
    for (i = 0; i < n; i++)
        x[i] += y[i];
}

OpenMP (defaults: the globals n, x, y are shared, and the loop variable is private):

#pragma omp parallel for
for (i = 0; i < n; i++)
    x[i] += y[i];


Example of private Clause

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[i][j] = b[i][j] + c[i][j];

Make the outer loop parallel, to reduce the number of forks/joins. Give each thread its own private copy of variable j.


#pragma omp parallel for private(j)
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[i][j] = b[i][j] + c[i][j];


firstprivate / lastprivate Clauses

As mentioned, values of private variables are undefined on entry and exit.

⇒ A private variable within a region has no storage association with the same variable outside of the region

firstprivate (list)

Variables in list are initialized with the value the original variable had before entering the parallel construct

lastprivate (list)

The thread that executes the sequentially last iteration or section updates the value of the variables in list


Example of firstprivate / lastprivate Clauses

main()
{
    a = 1;

    #pragma omp parallel for private(i), firstprivate(a), lastprivate(b)
    for (i = 0; i < n; i++) {
        ...
        b = a + i;   /* a undefined, unless declared firstprivate */
        ...
    }

    a = b;           /* b undefined, unless declared lastprivate */
}


threadprivate Variables

Private variables are private on a parallel region basis.

threadprivate variables are global variables that are private throughout the execution of the program.

#pragma omp threadprivate(x)

Initial data is undefined, unless copyin is used

copyin (list)

The data of the master thread is copied to the threadprivate copies
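A minimal sketch of copyin, reusing a threadprivate declaration like the one above:

float x;
#pragma omp threadprivate(x)

int main(void) {
    x = 3.14f;                      /* set by the master thread        */
    #pragma omp parallel copyin(x)  /* each thread's x starts at 3.14f */
    { /* ... */ }
    return 0;
}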


Example of threadprivate Clause

#include <omp.h>

int a, b, i, tid;
float x;

#pragma omp threadprivate(a, x)

main() {
    printf("1st Parallel Region:\n");
    #pragma omp parallel private(b, tid)
    {
        tid = omp_get_thread_num();
        a = tid;
        b = tid;
        x = 1.1 * tid + 1.0;
        printf("Thread %d: a,b,x= %d %d %f\n", tid, a, b, x);
    } /* end of parallel section */

    printf("2nd Parallel Region:\n");
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("Thread %d: a,b,x = %d %d %f\n", tid, a, b, x);
    } /* end of parallel section */
}
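Expected behavior (assuming both regions run with the same number of threads and dynamic thread adjustment is off): in the second region each thread still prints the a and x values it stored in the first region, since threadprivate data persists across parallel regions, while b, an ordinary private variable, is undefined there.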


Review

Shared Memory Concurrent Programming

Review of Operating Systems: PThreads

OpenMP

Parallel Clauses

Private / Shared Variables


Next Class

More on OpenMP:

Synchronization

Conditional Parallelism

Reduction Clause

Scheduling Options
