Pthreads

Topics: Introduction to Pthreads; Data Parallelism; Task Parallelism: Pipeline; Task Parallelism: Task queue; Examples.

– 2 –

Goal of next lectures

Introduction to programming with Pthreads.

Standard patterns of parallel programs: data parallelism, task parallelism.

Examples of each.

– 3 –

Intro to Pthreads for Shared Memory

(Figure: proc1, proc2, proc3, ..., procN all attached to a single Shared Memory Address Space.)

All threads access the same shared memory data space.

– 4 –

Intro to Pthreads (continued)

Concretely, it means that a variable x, a pointer p, or an array a[] refer to the same object, no matter what processor the reference originates from.

We have more or less implicitly assumed this to be the case in earlier examples.

– 5 –

Multithreading

User has explicit control over threads.

Good: control can be used to performance benefit.

Bad: user has to deal with it.

– 6 –

Pthreads

POSIX standard shared-memory multithreading interface.

Provides primitives for thread management and synchronization.
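
A practical note, not on the original slide: a program using these primitives includes the pthread.h header and is typically compiled and linked with the -pthread flag, e.g.:

  #include <pthread.h>

  /* build:  gcc -pthread program.c -o program */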

– 7 –

What does the user have to do?

Decide how to decompose the computation into parallel parts.

Create (and destroy) threads to support that decomposition.

Add synchronization to make sure dependences are covered.

– 8 –

General Thread Structure

Typically, a thread is a concurrent execution of a function or a procedure.

So, your program needs to be restructured such that parallel parts form separate procedures or functions.

– 9 –

Example of Thread Creation (contd.)

(Figure: main() calls pthread_create(func), and a new thread begins executing func() concurrently with main().)

– 10 –

Thread Joining Example

void *func(void *arg) { ... }

pthread_t id;
int X;

pthread_create(&id, NULL, func, &X);
...
pthread_join(id, NULL);
...

– 11 –

Example of Thread Creation (contd.)

(Figure: main() calls pthread_create(func); the new thread runs func() and finishes with pthread_exit(); main() waits for it with pthread_join(id).)

– 12 –

Example: Matrix Multiply

for( i=0; i<n; i++ )
  for( j=0; j<n; j++ ) {
    c[i][j] = 0.0;
    for( k=0; k<n; k++ )
      c[i][j] += a[i][k]*b[k][j];
  }

– 13 –

Parallel Matrix Multiply

All i- or j-iterations can be run in parallel.

If we have p processors, n/p rows to each processor.

Corresponds to partitioning the i-loop.

– 14 –

Matrix Multiply: Parallel Part

void *mmult(void *s) {
  int slice = (int) s;
  int from = (slice*n)/p;
  int to = ((slice+1)*n)/p;
  for( i=from; i<to; i++ )
    for( j=0; j<n; j++ ) {
      c[i][j] = 0.0;
      for( k=0; k<n; k++ )
        c[i][j] += a[i][k]*b[k][j];
    }
  return NULL;
}

– 15 –

Matrix Multiply: Main

int main() {
  pthread_t thrd[p];
  for( i=0; i<p; i++ )
    pthread_create(&thrd[i], NULL, mmult, (void*) i);
  for( i=0; i<p; i++ )
    pthread_join(thrd[i], NULL);
}
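
For reference, a self-contained version of this example could look as follows; the matrix size N, the thread count P, the static arrays, and the intptr_t casts (which make passing the slice index through the void* argument portable) are my additions, not part of the original slides:

  #include <pthread.h>
  #include <stdint.h>
  #include <stdio.h>

  #define N 512                              /* matrix size (assumed) */
  #define P 4                                /* number of worker threads (assumed) */

  static double a[N][N], b[N][N], c[N][N];

  static void *mmult(void *s) {
    int slice = (int)(intptr_t) s;           /* which block of rows this thread owns */
    int from = (slice * N) / P;
    int to   = ((slice + 1) * N) / P;
    for (int i = from; i < to; i++)
      for (int j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (int k = 0; k < N; k++)
          c[i][j] += a[i][k] * b[k][j];
      }
    return NULL;
  }

  int main(void) {
    pthread_t thrd[P];
    for (int i = 0; i < P; i++)
      pthread_create(&thrd[i], NULL, mmult, (void *)(intptr_t) i);
    for (int i = 0; i < P; i++)
      pthread_join(thrd[i], NULL);
    printf("c[0][0] = %f\n", c[0][0]);
    return 0;
  }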

– 16 –

Summary: Thread Management

pthread_create(): creates a parallel thread executing a given function (and arguments), returns thread identifier.

pthread_exit(): terminates thread.

pthread_join(): waits for thread with particular thread identifier to terminate.
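
For reference (not on the original slide), the POSIX prototypes of these three calls are:

  int  pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                      void *(*start_routine)(void *), void *arg);
  void pthread_exit(void *retval);
  int  pthread_join(pthread_t thread, void **retval);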

– 17 –

Summary: Program Structure

Encapsulate parallel parts in functions.

Use function arguments to parameterize what a particular thread does.

Call pthread_create() with the function and arguments, save thread identifier returned.

Call pthread_join() with that thread identifier.

– 18 –

Private Data in Pthreads

To make a variable private in Pthreads, you need to make an array out of it.

Index the array by a thread identifier, which you keep yourself (e.g., the slice number passed as the thread's argument).

Can also get the thread id maintained by the system by calling pthread_self().

Not very elegant or efficient.
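
A minimal sketch of this pattern (the names NTHREADS, my_private, and worker are illustrative, not from the slides): each thread indexes a global array with the slice id it was given as its argument.

  #include <pthread.h>
  #include <stdint.h>

  #define NTHREADS 4

  static double my_private[NTHREADS];     /* one "private" slot per thread */

  static void *worker(void *arg) {
    int id = (int)(intptr_t) arg;         /* slice id passed at pthread_create time */
    my_private[id] = 3.14 * id;           /* only thread id ever touches slot id */
    return NULL;
  }

One reason this is not very efficient is that neighboring slots of such an array can fall on the same cache line, so threads may interfere with each other even though they never logically share data (false sharing).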

– 19 –

Pthreads Synchronization

Need for fine-grain synchronization: mutex locks, condition variables.

– 20 –

Use of Mutex Locks

To implement critical sections.

Pthreads mutexes are exclusive locks (newer POSIX revisions also added read-write locks, pthread_rwlock_t).

Some other systems allow shared-read, exclusive-write locks.
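
A minimal sketch of a critical section built with a Pthreads mutex (the shared counter is an illustrative example, not from the slides):

  #include <pthread.h>

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static long counter = 0;                /* shared data */

  static void *increment(void *arg) {
    pthread_mutex_lock(&lock);            /* enter critical section */
    counter++;
    pthread_mutex_unlock(&lock);          /* leave critical section */
    return NULL;
  }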

– 21 –

Condition Variables (1 of 5)

pthread_cond_init(pthread_cond_t *cond,
                  pthread_condattr_t *attr)

Creates a new condition variable cond.

Attribute: ignore for now.

– 22 –

Condition Variables (2 of 5)

pthread_cond_destroy(pthread_cond_t *cond)

Destroys the condition variable cond.

– 23 –

Condition Variables (3 of 5)

pthread_cond_wait(pthread_cond_t *cond,
                  pthread_mutex_t *mutex)

Blocks the calling thread, waiting on cond.

Unlocks the mutex while waiting, and re-acquires it before the call returns.

– 24 –

Condition Variables (4 of 5)

pthread_cond_signal(pthread_cond_t *cond)

Unblocks one thread waiting on cond.

Which one is unblocked is determined by the scheduler.

If no thread waiting, then signal is a no-op.

– 25 –

Condition Variables (5 of 5)

pthread_cond_broadcast(pthread_cond_t *cond)

Unblocks all threads waiting on cond.

If no thread waiting, then broadcast is a no-op.

– 26 –

Use of Condition Variables

To implement signal-wait synchronization discussed in earlier examples.

Important note: a signal is “forgotten” if no corresponding wait has already happened, i.e., if no thread is waiting when the signal occurs.

– 27 –

Example (from a few lectures ago)

for( i=1; i<100; i++ ) {
  a[i] = …;
  …;
  … = a[i-1];
}

Loop-carried dependence, not parallelizable as is.

– 28 –

Example (continued)

for( i=...; i<...; i++ ) {
  a[i] = …;
  signal(e_a[i]);
  …;
  wait(e_a[i-1]);
  … = a[i-1];
}

– 29 –

How to Remember a Signal (1 of 2)

void semaphore_signal(int i) {
  pthread_mutex_lock(&mutex_rem[i]);
  arrived[i] = 1;
  pthread_cond_signal(&cond[i]);
  pthread_mutex_unlock(&mutex_rem[i]);
}

– 30 –

How to Remember a Signal (2 of 2)

void semaphore_wait(int i) {
  pthread_mutex_lock(&mutex_rem[i]);
  if( arrived[i] == 0 ) {
    pthread_cond_wait(&cond[i], &mutex_rem[i]);
  }
  arrived[i] = 0;
  pthread_mutex_unlock(&mutex_rem[i]);
}
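
These two helpers assume per-event shared state that the slides never declare; a minimal sketch of the missing declarations and their initialization (the array size NEVENTS is an assumption of mine):

  #include <pthread.h>

  #define NEVENTS 100                        /* number of distinct events (assumed) */

  static pthread_mutex_t mutex_rem[NEVENTS];
  static pthread_cond_t  cond[NEVENTS];
  static int             arrived[NEVENTS];   /* 1 = signal already delivered */

  void semaphore_init(void) {
    for (int i = 0; i < NEVENTS; i++) {
      pthread_mutex_init(&mutex_rem[i], NULL);
      pthread_cond_init(&cond[i], NULL);
      arrived[i] = 0;
    }
  }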

– 31 –

Example (continued)

for( i=...; i<...; i++ ) {
  a[i] = …;
  semaphore_signal(e_a[i]);
  …;
  semaphore_wait(e_a[i-1]);
  … = a[i-1];
}

– 32 –

More Examples: SOR

SOR (successive over-relaxation) implements a mathematical model for many natural phenomena, e.g., heat dissipation in a metal sheet.

Model is a partial differential equation.

Focus is on algorithm, not on derivation.

– 33 –

Problem Statement

(Figure: a rectangle in the x-y plane with boundary conditions F = 1 on one edge and F = 0 on the other three edges; in the interior, ∇²F(x,y) = 0.)

– 34 –

Discretization

Represent F in continuous rectangle by a 2-dimensional discrete grid (array).

The boundary conditions on the rectangle are the boundary values of the array.

The internal values are found by the relaxation algorithm.

– 35 –

Discretized Problem Statement

(Figure: the discrete grid, with row index i and column index j.)

– 36 –

Relaxation Algorithm

For some number of iterations:
  for each internal grid point, compute the average of its four neighbors.

Termination condition: values at grid points change very little.

(We will ignore this part in our example.)

– 37 –

Discretized Problem Statement

for some number of timesteps/iterations {
  for( i=1; i<n; i++ )
    for( j=1; j<n; j++ )
      temp[i][j] = 0.25 *
        ( grid[i-1][j] + grid[i+1][j] +
          grid[i][j-1] + grid[i][j+1] );
  for( i=1; i<n; i++ )
    for( j=1; j<n; j++ )
      grid[i][j] = temp[i][j];
}

– 38 –

Parallel SOR

No dependences between iterations of first (i,j) loop nest.

No dependences between iterations of second (i,j) loop nest.

True dependence between first and second loop nest in the same timestep.

True dependence between second loop nest and first loop nest of next timestep.

– 39 –

Parallel SOR (continued)

First (i,j) loop nest can be parallelized.

Second (i,j) loop nest can be parallelized.

We must make processors wait at the end of each (i,j) loop nest.

Natural synchronization: fork-join.

– 40 –

Parallel SOR (continued)

If we have P processors, we can give n/P rows or columns to each processor.

Or, we can divide the array into P squares, and give each processor a square to compute.

– 41 –

Pthreads SOR: main

for some number of timesteps {
  for( i=0; i<p; i++ )
    pthread_create(&thrd[i], NULL, sor_1, (void *)i);
  for( i=0; i<p; i++ )
    pthread_join(thrd[i], NULL);
  for( i=0; i<p; i++ )
    pthread_create(&thrd[i], NULL, sor_2, (void *)i);
  for( i=0; i<p; i++ )
    pthread_join(thrd[i], NULL);
}

– 42 –

Pthreads SOR: Parallel parts (1)

void* sor_1(void *s)
{
  int slice = (int) s;
  int from = (slice*n)/p;
  int to = ((slice+1)*n)/p;
  for( i=from; i<to; i++ )
    for( j=0; j<n; j++ )
      temp[i][j] = 0.25*(grid[i-1][j] + grid[i+1][j] +
                         grid[i][j-1] + grid[i][j+1]);
  return NULL;
}

– 43 –

Pthreads SOR: Parallel parts (2)

void* sor_2(void *s)
{
  int slice = (int) s;
  int from = (slice*n)/p;
  int to = ((slice+1)*n)/p;
  for( i=from; i<to; i++ )
    for( j=0; j<n; j++ )
      grid[i][j] = temp[i][j];
  return NULL;
}

– 44 –

Reality bites ...

Create/exit/join is not so cheap.

It would be more efficient if we could come up with a parallel program in which create/exit/join happens rarely (once!) and cheaper synchronization is used.

We need something that makes all threads wait until all have arrived -- a barrier.

– 45 –

Barrier Synchronization

A wait at a barrier causes a thread to wait until all threads have performed a wait at the barrier.

At that point, they all proceed.

– 46 –

Implementing Barriers in Pthreads

Count the number of arrivals at the barrier.

Wait if this is not the last arrival.

Make everyone unblock if this is the last arrival.

Since the arrival count is a shared variable, enclose the whole operation in a mutex lock-unlock.

– 47 –

Implementing Barriers in Pthreads

void barrier()
{
  pthread_mutex_lock(&mutex_arr);
  arrived++;
  if (arrived<N) {
    pthread_cond_wait(&cond, &mutex_arr);
  }
  else {
    pthread_cond_broadcast(&cond);
    arrived = 0; /* be prepared for next barrier */
  }
  pthread_mutex_unlock(&mutex_arr);
}
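
This routine relies on shared state the slide leaves implicit; a minimal sketch of those declarations (N, the number of participating threads, is an assumed constant):

  #include <pthread.h>

  #define N 4   /* number of threads that must reach the barrier (assumed) */

  static pthread_mutex_t mutex_arr = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  cond      = PTHREAD_COND_INITIALIZER;
  static int             arrived   = 0;   /* arrivals so far at the current barrier */

Note that this version is simplified: pthread_cond_wait is allowed to wake up spuriously, so robust implementations re-check a condition (typically a generation/episode counter) in a loop before leaving the barrier.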

– 48 –

Parallel SOR with Barriers (1 of 2)

void* sor (void* arg)
{
  int slice = (int)arg;
  int from = (slice * (n-1))/p + 1;
  int to = ((slice+1) * (n-1))/p + 1;

  for some number of iterations { … }
}

– 49 –

Parallel SOR with Barriers (2 of 2)

for (i=from; i<to; i++)
  for (j=1; j<n; j++)
    temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                         grid[i][j-1] + grid[i][j+1]);

barrier();

for (i=from; i<to; i++)
  for (j=1; j<n; j++)
    grid[i][j] = temp[i][j];

barrier();

– 50 –

Parallel SOR with Barriers: main

int main(int argc, char *argv[])
{
  pthread_t thrd[p];

  /* Initialize mutex and condition variables */

  for (i=0; i<p; i++)
    pthread_create(&thrd[i], &attr, sor, (void*)i);
  for (i=0; i<p; i++)
    pthread_join(thrd[i], NULL);

  /* Destroy mutex and condition variables */
}

– 51 –

Note again

Many shared memory programming systems (other than Pthreads) have barriers as basic primitive.

If they do, you should use it, not construct it yourself.

Implementation may be more efficient than what you can do yourself.
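
In fact, newer versions of POSIX added a barrier type to Pthreads itself; a minimal sketch of using it in place of the hand-written barrier() above (the setup/teardown function names are mine):

  #include <pthread.h>

  static pthread_barrier_t bar;

  void barrier_setup(int nthreads) {
    pthread_barrier_init(&bar, NULL, nthreads);  /* nthreads threads take part */
  }

  void barrier(void) {                 /* drop-in replacement for the version above */
    pthread_barrier_wait(&bar);
  }

  void barrier_teardown(void) {
    pthread_barrier_destroy(&bar);
  }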

– 52 –

Molecular Dynamics (MD)

Simulation of a set of bodies under the influence of physical laws.

Atoms, molecules, celestial bodies, ...

Have same basic structure.

– 53 –

Molecular Dynamics (Skeleton)

for some number of timesteps {
  for all molecules i
    for all other molecules j
      force[i] += f( loc[i], loc[j] );
  for all molecules i
    loc[i] = g( loc[i], force[i] );
}

– 54 –

Molecular Dynamics (continued)

To reduce amount of computation, account for interaction only with nearby molecules.

– 55 –

Molecular Dynamics (continued)

for some number of timesteps {
  for all molecules i
    for all nearby molecules j
      force[i] += f( loc[i], loc[j] );
  for all molecules i
    loc[i] = g( loc[i], force[i] );
}

– 56 –

Molecular Dynamics (continued)

For each molecule i:
  count[i]  -- number of nearby molecules
  index[j]  -- array of indices of the nearby molecules (0 <= j < count[i])

– 57 –

Molecular Dynamics (continued)

for some number of timesteps {
  for( i=0; i<num_mol; i++ )
    for( j=0; j<count[i]; j++ )
      force[i] += f(loc[i],loc[index[j]]);
  for( i=0; i<num_mol; i++ )
    loc[i] = g( loc[i], force[i] );
}

– 58 –

Molecular Dynamics (continued)

No loop-carried dependence in first i-loop.

Loop-carried dependence (reduction) in j-loop.

No loop-carried dependence in second i-loop.

True dependence between first and second i-loop.

– 59 –

Molecular Dynamics (continued)

First i-loop can be parallelized.

Second i-loop can be parallelized.

Must make processors wait between loops.

Natural synchronization: fork-join.

– 60 –

Molecular Dynamics (continued)

for some number of timesteps {
  for( i=0; i<num_mol; i++ )
    for( j=0; j<count[i]; j++ )
      force[i] += f(loc[i],loc[index[j]]);
  for( i=0; i<num_mol; i++ )
    loc[i] = g( loc[i], force[i] );
}

Parallelize the two for loops (assume fork-join parallelism).

I will use the notation “Parallel for” to denote fork-join parallelism for each for loop.

– 61 –

Molecular Dynamics (simple)

for some number of timesteps {
  Parallel for
    for( i=0; i<num_mol; i++ )
      for( j=0; j<count[i]; j++ )
        force[i] += f(loc[i],loc[index[j]]);
  Parallel for
    for( i=0; i<num_mol; i++ )
      loc[i] = g( loc[i], force[i] );
}
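
As a sketch of what a “Parallel for” could expand to with Pthreads (this generic helper, its names, and the fixed thread count P are mine, not the slides'; the partitioning into contiguous slices mirrors the matrix-multiply example):

  #include <pthread.h>

  #define P 4                                 /* number of worker threads (assumed) */

  typedef void (*loop_body_t)(int i);         /* body applied to one iteration */

  struct slice { loop_body_t body; int n; int id; };

  static void *run_slice(void *arg) {
    struct slice *s = arg;
    int from = (s->id * s->n) / P;
    int to   = ((s->id + 1) * s->n) / P;
    for (int i = from; i < to; i++)
      s->body(i);                             /* execute this thread's share of iterations */
    return NULL;
  }

  /* "Parallel for (i = 0; i < n; i++) body(i);" becomes: */
  static void parallel_for(loop_body_t body, int n) {
    pthread_t thrd[P];
    struct slice sl[P];
    for (int i = 0; i < P; i++) {
      sl[i].body = body; sl[i].n = n; sl[i].id = i;
      pthread_create(&thrd[i], NULL, run_slice, &sl[i]);
    }
    for (int i = 0; i < P; i++)
      pthread_join(thrd[i], NULL);
  }

Each parallel loop then costs one round of create/join, which is exactly the overhead the fork-join assumption accepts.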

– 62 –

Irregular vs. regular data parallel

In SOR, all arrays are accessed through linear expressions of the loop indices, known at compile time [regular].

In MD, some arrays are accessed through non-linear expressions of the loop indices, some known only at runtime [irregular].

– 63 –

Irregular vs. regular data parallel

No real differences in terms of parallelization (based on dependences).

Will lead to fundamental differences in expressions of parallelism: irregular is difficult for parallelism based on data distribution, but not difficult for parallelism based on iteration distribution.

– 64 –

Molecular Dynamics (continued)

Parallelization of the first loop has a load balancing issue: some molecules have few neighbors, others many, so more sophisticated loop partitioning is necessary.

– 65 –

Flavors of Parallelism

Data parallelism: all processors do the same thing on different data. Regular or irregular.

Task parallelism: processors do different tasks. Task queue or pipelines.

– 66 –

Task Parallelism

Each process performs a different task.

Two principal flavors: pipelines, task queues.

Program examples: PIPE (pipeline), TSP (task queue).

– 67 –

Pipeline

Often occurs with image processing applications, where a number of images undergo a sequence of transformations.

E.g., rendering, clipping, compression, etc.

– 68 –

Sequential Program

for( i=0; i<num_pic, read(in_pic[i]); i++ ) {
  int_pic_1[i] = trans1( in_pic[i] );
  int_pic_2[i] = trans2( int_pic_1[i] );
  int_pic_3[i] = trans3( int_pic_2[i] );
  out_pic[i] = trans4( int_pic_3[i] );
}

– 69 –

Parallelizing a Pipeline

For simplicity, assume we have 4 processors (i.e., equal to the number of transformations).

Furthermore, assume we have a very large number of pictures (>> 4).

– 70 –

Parallelizing a Pipeline (part 1)

Processor 1:

for( i=0; i<num_pics, read(in_pic[i]); i++ ) {
  int_pic_1[i] = trans1( in_pic[i] );
  signal(event_1_2[i]);
}

– 71 –

Parallelizing a Pipeline (part 2)

Processor 2:

for( i=0; i<num_pics; i++ ) {
  wait( event_1_2[i] );
  int_pic_2[i] = trans2( int_pic_1[i] );
  signal( event_2_3[i] );
}

Same for processor 3.

– 72 –

Parallelizing a Pipeline (part 3)

Processor 4:

for( i=0; i<num_pics; i++ ) {
  wait( event_3_4[i] );
  out_pic[i] = trans4( int_pic_3[i] );
}

– 73 –

Sequential vs. Parallel Execution

(Figure: execution timeline of the sequential version. Pattern -- picture; horiz. line -- processor.)

– 74 –

Another Sequential Program

for( i=0; i<num_pic, read(in_pic); i++ ) {
  int_pic_1 = trans1( in_pic );
  int_pic_2 = trans2( int_pic_1 );
  int_pic_3 = trans3( int_pic_2 );
  out_pic = trans4( int_pic_3 );
}

– 75 –

Can we use same parallelization?

Processor 2:

for( i=0; i<num_pics; i++ ) {
  wait( event_1_2[i] );
  int_pic_2 = trans2( int_pic_1 );
  signal( event_2_3[i] );
}

Same for processor 3.

– 76 –

Can we use same parallelization?

No: because of the anti-dependence between stages, there is no parallelism.

We used privatization (the per-picture arrays) to enable pipeline parallelism.

Used often to avoid dependences (not only with pipelines).

Costly in terms of memory.

– 77 –

In-between Solution

Use n>1 buffers between stages.

Block when buffers are full or empty.

– 78 –

Perfect Pipeline

(Figure: sequential vs. parallel execution with a perfect pipeline. Pattern -- picture; horiz. line -- processor.)

– 79 –

Things are often not that perfect

One stage takes more time than others.

Stages take a variable amount of time.

Extra buffers provide some cushion against variability.

– 80 –

PIPE Using Pthreads

Remember: replacing the original wait/signal by a Pthreads condition variable wait/signal will not work; signals that arrive before a wait are forgotten. We need to remember a signal (semaphore wait and signal).

– 81 –

PIPE with Pthreads

P1: for( i=0; i<num_pics, read(in_pic); i++ ) {
      int_pic_1[i] = trans1( in_pic );
      semaphore_signal( event_1_2[i] );
    }

P2: for( i=0; i<num_pics; i++ ) {
      semaphore_wait( event_1_2[i] );
      int_pic_2[i] = trans2( int_pic_1[i] );
      semaphore_signal( event_2_3[i] );
    }

– 82 –

Note

Many shared memory programming systems (other than Pthreads) have semaphores as basic primitive.

If they do, you should use it, not construct it yourself.

Implementation may be more efficient than what you can do yourself.
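
On POSIX systems, counting semaphores are in fact available alongside Pthreads (in <semaphore.h>); a minimal sketch of using them for the per-picture events, where NUM_PICS and init_events are my own illustrative names:

  #include <semaphore.h>

  #define NUM_PICS 1000                     /* number of pictures (assumed) */

  static sem_t event_1_2[NUM_PICS];         /* one event per picture, stage 1 -> stage 2 */

  void init_events(void) {
    for (int i = 0; i < NUM_PICS; i++)
      sem_init(&event_1_2[i], 0, 0);        /* pshared = 0: shared among threads; initial value 0 */
  }

  /* Stage 1 would then call sem_post(&event_1_2[i]); stage 2 calls sem_wait(&event_1_2[i]). */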

– 83 –

TSP (Traveling Salesman)

Goal: given a list of cities, a matrix of distances between them, and a starting city, find the shortest tour in which all cities are visited exactly once.

Example of an NP-hard search problem.

Algorithm: branch-and-bound.

– 84 –

Branching

– 85 –

Branching

Initialization: go from the starting city to each of the remaining cities; put each resulting partial path into a priority queue, ordered by its current length.

Further (repeatedly): take the head element out of the priority queue, expand it by each one of the remaining cities, and put each resulting partial path into the priority queue.

– 86 –

Finding the Solution

Eventually, a complete path will be found.

Remember its length as the current shortest path.

Every time a complete path is found, check if we need to update current best path.

When priority queue becomes empty, best path is found.

– 87 –

Using a Simple Bound

Once a complete path is found, we have a bound on the length of the shortest path.

No use in exploring a partial path that is already longer than the current bound.

– 88 –

Sequential TSP: Data Structures

Priority queue of partial paths.

Current best solution and its length.

For simplicity, we will ignore bounding.

– 89 –

Sequential TSP: Code Outline

init_q(); init_best();
while( (p=de_queue()) != NULL ) {
  for each expansion by one city {
    q = add_city(p);
    if( complete(q) ) { update_best(q); }
    else { en_queue(q); }
  }
}

– 90 –

Parallel TSP: Possibilities

Have each process do one expansion.

Have each process do expansion of one partial path.

Have each process do expansion of multiple partial paths.

Issue of granularity/performance, not an issue of correctness.

Assume: process expands one partial path.

– 91 –

Parallel TSP: Synchronization

True dependence between process that puts partial path in queue and the one that takes it out.

Dependences arise dynamically.

Required synchronization: need to make process wait if q is empty.

– 92 –

Parallel TSP: First cut (part 1)

process i:

while( (p=de_queue()) != NULL ) {
  for each expansion by one city {
    q = add_city(p);
    if complete(q) { update_best(q); }
    else en_queue(q);
  }
}

– 93 –

Parallel TSP: First cut (part 2)

In de_queue: wait if q is empty.

In en_queue: signal that q is no longer empty.

– 94 –

Parallel TSP: More synchronization

All processes operate, potentially at the same time, on q and best.

This must not be allowed to happen.

Critical section: only one process can execute in critical section at once.

– 95 –

Parallel TSP: Critical Sections

All shared data must be protected by critical section.

Update_best must be protected by a critical section.

En_queue and de_queue must be protected by the same critical section.

– 96 –

Termination condition

How do we know when we are done?

All processes are waiting inside de_queue.

Count the number of waiting processes before waiting.

If equal to total number of processes, we are done.

– 97 –

Parallel TSP

Complete parallel program will be provided on the Web.

Includes wait/signal on empty q.

Includes critical sections.

Includes termination condition.

– 98 –

Parallel TSP

process i:

while( (p=de_queue()) != NULL ) {
  for each expansion by one city {
    q = add_city(p);
    if complete(q) { update_best(q); }
    else en_queue(q);
  }
}

– 99 –

Parallel TSP

Need critical section: in update_best, in en_queue/de_queue.

In de_queue: wait if q is empty; terminate if all processes are waiting.

In en_queue: signal q is no longer empty.

– 100 –

Parallel TSP: Mutual Exclusion

en_queue() / de_queue() {
  pthread_mutex_lock(&queue);
  …;
  pthread_mutex_unlock(&queue);
}

update_best() {
  pthread_mutex_lock(&best);
  …;
  pthread_mutex_unlock(&best);
}

– 101 –

Parallel TSP: Condition Synchronization

de_queue() {
  while( (q is empty) and (not done) ) {
    waiting++;
    if( waiting == p ) {
      done = true;
      pthread_cond_broadcast(&empty);
    }
    else {
      pthread_cond_wait(&empty, &queue);
      waiting--;
    }
  }
  if( done )
    return NULL;
  else
    remove and return head of the queue;
}
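
The matching en_queue is not shown on the slides; a sketch in the same style, using the slide's queue mutex and empty condition variable, might be:

en_queue(q_elem) {
  pthread_mutex_lock(&queue);
  insert q_elem into the priority queue;
  pthread_cond_signal(&empty);   /* a process blocked in de_queue can now proceed */
  pthread_mutex_unlock(&queue);
}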

– 102 –

Other Primitives in Pthreads

Set the attributes of a thread.

Set the attributes of a mutex lock.

Set scheduling parameters.

– 103 –

Busy Waiting

Not an explicit part of the API.

Available in a general shared memory programming environment.

– 104 –

Busy Waiting

initially: flag = 0;

P1: produce data;
    flag = 1;

P2: while( !flag ) ;
    consume data;

– 105 –

Use of Busy Waiting

On the surface, simple and efficient.

In general, not a recommended practice.

Often leads to messy and unreadable code (blurs data/synchronization distinction).

May be inefficient.