
PARALLEL PROGRAMMING
Joachim Nitschke
Project Seminar “Parallel Programming”, Summer Semester 2011

CONTENT
Introduction
Parallel program design
Patterns for parallel programming
  A: Algorithm structure
  B: Supporting structures

INTRODUCTION
Context around parallel programming

PARALLEL PROGRAMMING MODELS
Many different models, reflecting the various parallel hardware architectures
2, or rather 3, most common models:
  Shared memory
  Distributed memory
  Hybrid models (combining shared and distributed memory)

PARALLEL PROGRAMMING MODELS
[Figure: shared memory vs. distributed memory architecture diagrams]

PROGRAMMING CHALLENGES
Shared memory: synchronize memory access; locking vs. potential race conditions
Distributed memory: communication bandwidth and resulting latency; manage message passing; synchronous vs. asynchronous communication

PARALLEL PROGRAMMING STANDARDS
2 common standards as examples for the 2 parallel programming models:
  Open Multi-Processing (OpenMP)
  Message Passing Interface (MPI)

OPENMP
Collection of libraries and compiler directives for parallel programming on shared memory computers

Programmers have to explicitly designate blocks that are to run in parallel by adding directives like:
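For instance, a minimal C sketch with a parallel-for directive (the loop body and array names are illustrative assumptions):

```c
#include <stdio.h>

int main(void) {
    int a[100], b[100];
    for (int i = 0; i < 100; i++) b[i] = i;

    /* The directive marks the loop as a parallel region: OpenMP
       splits the iterations among a team of threads. */
    #pragma omp parallel for
    for (int i = 0; i < 100; i++)
        a[i] = 2 * b[i];

    printf("a[42] = %d\n", a[42]);
    return 0;
}
```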

OpenMP then creates a number of threads that execute the designated code block

MPI
Library with routines to manage message passing for programming on distributed memory computers
Messages are sent from one process to another
Routines for synchronization, broadcasts, and blocking and non-blocking communication

MPI EXAMPLE
Distribute data among processes with MPI.Scatter and collect the results with MPI.Gather
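A minimal sketch in the C bindings, where the calls are spelled MPI_Scatter and MPI_Gather (the chunk size and the doubling step are illustrative):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 4   /* elements per process, illustrative */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *data = NULL, *result = NULL;
    int local[N];
    if (rank == 0) {                 /* only the root owns the full arrays */
        data = malloc(N * size * sizeof(int));
        result = malloc(N * size * sizeof(int));
        for (int i = 0; i < N * size; i++) data[i] = i;
    }

    /* Scatter N elements to every process ... */
    MPI_Scatter(data, N, MPI_INT, local, N, MPI_INT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < N; i++) local[i] *= 2;   /* local work */

    /* ... and gather the partial results back on the root. */
    MPI_Gather(local, N, MPI_INT, result, N, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("result[1] = %d\n", result[1]);
        free(data);
        free(result);
    }
    MPI_Finalize();
    return 0;
}
```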

PARALLEL PROGRAM DESIGN
General strategies for finding concurrency

FINDING CONCURRENCY
General approach: Analyze a problem to identify exploitable concurrency
Main concept is decomposition: Divide a computation into smaller parts, all or some of which can run concurrently

SOME TERMINOLOGY
Tasks: Programmer-defined units into which the main computation is decomposed
Unit of execution (UE): Generalization of processes and threads

TASK DECOMPOSITION
Decompose a problem into tasks that can run concurrently
Few large tasks vs. many small tasks
Minimize dependencies among tasks

GROUP TASKS
Group tasks to simplify managing their dependencies
Tasks within a group run at the same time
Based on decomposition: Group tasks that belong to the same high-level operations
Based on constraints: Group tasks with the same constraints

ORDER TASKS
Order task groups to satisfy the constraints among them
The order must be:
  Restrictive enough to satisfy the constraints
  Not so restrictive that it sacrifices flexibility and hence efficiency
Identify dependencies, e.g.: group A requires data from group B
Important: Also identify the independent groups
Identify potential deadlocks

DATA DECOMPOSITION
Decompose a problem's data into units that can be operated on relatively independently
Look at the problem's central data structures
Decomposition is already implied by, or the basis for, the task decomposition
Again: few large chunks vs. many small chunks
Improve flexibility: configurable granularity

DATA SHARING
Share decomposed data among tasks
Identify task-local and shared data
Classify shared data: read/write or read-only?
Identify potential race conditions
Note: Sometimes data sharing implies communication

PATTERNS FOR PARALLEL PROGRAMMING
Typical parallel program structures

A: ALGORITHM STRUCTURE
How can the identified concurrency be used to build a program?
3 examples of typical parallel algorithm structures:
  Organize by tasks: Divide & conquer
  Organize by data decomposition: Geometric/domain decomposition
  Organize by data flow: Pipeline

DIVIDE & CONQUER
Principle: Split a problem recursively into smaller solvable sub-problems and merge their results
Potential concurrency: Sub-problems can be solved simultaneously

DIVIDE & CONQUER
Precondition: Sub-problems can be solved independently
Efficiency constraint: Split and merge should be trivial compared to solving the sub-problems
Challenge: The standard base case can lead to too many, too small tasks
End the recursion earlier? (see the sketch below)
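A minimal C/OpenMP sketch: a recursive parallel sum whose cutoff ends the recursion earlier (the summing task and the threshold of 1024 are illustrative assumptions to be tuned):

```c
#include <stdio.h>

/* Split a[lo..hi) recursively, solve the halves concurrently
   as OpenMP tasks, and merge by adding the partial sums. */
long sum_rec(const int *a, int lo, int hi) {
    if (hi - lo < 1024) {              /* cutoff: end recursion earlier */
        long s = 0;
        for (int i = lo; i < hi; i++) s += a[i];
        return s;
    }
    int mid = lo + (hi - lo) / 2;
    long left, right;
    #pragma omp task shared(left)      /* fork: left half runs concurrently */
    left = sum_rec(a, lo, mid);
    right = sum_rec(a, mid, hi);
    #pragma omp taskwait               /* wait before merging */
    return left + right;               /* trivial merge */
}

int main(void) {
    static int a[100000];
    for (int i = 0; i < 100000; i++) a[i] = 1;
    long s;
    #pragma omp parallel
    #pragma omp single                 /* one thread starts the recursion */
    s = sum_rec(a, 0, 100000);
    printf("sum = %ld\n", s);
    return 0;
}
```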


GEOMETRIC/DOMAIN DECOMPOSITION
Principle: Organize an algorithm around a linear data structure that was decomposed into concurrently updatable chunks
Potential concurrency: Chunks can be updated simultaneously

GEOMETRIC/DOMAIN DECOMPOSITION
Example: A simple blur filter where every pixel is set to the average value of its surrounding pixels
The image can be split into squares
Each square is updated by one task
To update a square's border, information from other squares is required (see the sketch below)
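A minimal C/OpenMP sketch; for brevity it decomposes the image into row blocks rather than squares, and the image size and 3x3 neighborhood are illustrative assumptions:

```c
#include <stdio.h>

#define W 256
#define H 256

/* One blur step: each output pixel is the average of its 3x3
   neighborhood in the input. Reads of `in` near a block border
   touch rows owned by other tasks: the required nonlocal data. */
void blur(float in[H][W], float out[H][W]) {
    #pragma omp parallel for   /* each thread updates a block of rows */
    for (int y = 1; y < H - 1; y++)
        for (int x = 1; x < W - 1; x++) {
            float s = 0.0f;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    s += in[y + dy][x + dx];
            out[y][x] = s / 9.0f;
        }
}

int main(void) {
    static float in[H][W], out[H][W];
    in[10][10] = 9.0f;
    blur(in, out);
    printf("%.2f\n", out[10][10]);   /* 1.00: average of the 3x3 patch */
    return 0;
}
```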


GEOMETRIC/DOMAIN DECOMPOSITION
Again: granularity of decomposition?
Choose square/cubic chunks to minimize surface and thus nonlocal data
Replicating nonlocal data can reduce communication → “ghost boundaries”
Optimization: Overlap the update and the exchange of nonlocal data
Number of tasks > number of UEs for better load balance

PIPELINE
Principle, based on the assembly line analogy: Data flows through a set of stages
Potential concurrency: Operations can be performed simultaneously on different data items
[Figure: timing diagram of items C1-C6 passing through pipeline stages 1-3 over time]

PIPELINE
Example: The instruction pipeline in CPUs
  Fetch (instruction)
  Decode
  Execute
  ...

PIPELINE
Precondition: Dependencies among tasks allow an appropriate ordering
Efficiency constraint: Number of stages << number of processed items
The pipeline can also be nonlinear (a sketch of a linear one follows below)
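A minimal C/OpenMP sketch of a linear three-stage pipeline built from task dependencies (requires OpenMP 4.0; the stage functions and six items mirror the earlier timing figure but are otherwise illustrative):

```c
#include <stdio.h>

#define N 6                                  /* items C1..C6 */

int  stage1(int c) { return c + 1; }         /* illustrative stages */
int  stage2(int c) { return c * 2; }
void stage3(int c) { printf("%d\n", c); }

int main(void) {
    int a[N], b[N];
    #pragma omp parallel
    #pragma omp single                       /* one thread creates the tasks */
    for (int i = 0; i < N; i++) {
        /* The depend clauses chain the stages per item; independent
           items may occupy different stages at the same time. */
        #pragma omp task depend(out: a[i])
        a[i] = stage1(i);
        #pragma omp task depend(in: a[i]) depend(out: b[i])
        b[i] = stage2(a[i]);
        #pragma omp task depend(in: b[i])
        stage3(b[i]);
    }
    return 0;
}
```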


B: SUPPORTING STRUCTURES
Intermediate stage between the problem-oriented algorithm structure patterns and their realization in a programming environment
Structures that “support” the realization of parallel algorithms
4 examples:
  Single program, multiple data (SPMD)
  Task farming/Master & Worker
  Fork & Join
  Shared data

SINGLE PROGRAM, MULTIPLE DATA
Principle: The same code runs on every UE, processing different data
The most common technique to write parallel programs!

SINGLE PROGRAM, MULTIPLE DATA
Program stages:
1. Initialize and obtain a unique ID for each UE
2. Run the same program on every UE: differences in the instructions are driven by the ID
3. Distribute data by decomposing or sharing/copying global data (see the sketch below)
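A minimal C/MPI sketch of the three stages (the cyclic work split is an illustrative decomposition):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* Stage 1: initialize and obtain a unique ID (the rank). */
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Stage 2: the same program runs on every UE; branching
       is driven by the ID. */
    if (rank == 0)
        printf("running on %d UEs\n", size);

    /* Stage 3: decompose global data by the ID; each rank
       handles its own slice of the iteration space. */
    const int n = 1000;
    long local = 0;
    for (int i = rank; i < n; i += size)
        local += i;

    long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum 0..%d = %ld\n", n - 1, total);

    MPI_Finalize();
    return 0;
}
```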

Risk: Complex branching and data decomposition can make the code awful to understand and maintain

TASK FARMING/MASTER & WORKER
Principle: A master task (“farmer”) dispatches tasks to many worker UEs and collects (“farms”) the results


TASK FARMING/MASTER & WORKER
Precondition: Tasks are relatively independent
Master:
  Initiates the computation
  Creates a bag of tasks and stores them, e.g. in a shared queue
  Launches the worker tasks and waits
  Collects the results and shuts down the computation
Workers:
  While the bag of tasks is not empty: pop a task and solve it
Flexible through indirect scheduling
Optimization: The master can become a worker too (a sketch follows below)
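A minimal C/OpenMP sketch of the worker loop over a shared bag of tasks (the task payload is illustrative; here the “bag” is just a shared counter popped atomically):

```c
#include <stdio.h>

#define NTASKS 100

void solve(int task) {               /* illustrative task payload */
    printf("solved task %d\n", task);
}

int main(void) {
    int next = 0;                    /* the bag: tasks next..NTASKS-1 */

    #pragma omp parallel             /* every thread acts as a worker */
    for (;;) {
        int t;
        #pragma omp atomic capture   /* pop one task from the bag */
        t = next++;
        if (t >= NTASKS) break;      /* bag empty: this worker stops */
        solve(t);                    /* indirect scheduling: whichever
                                        worker is free takes the next task */
    }
    return 0;
}
```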


FORK & JOIN
Principle: Tasks create (“fork”) and terminate (“join”) other tasks dynamically
Example: An algorithm designed after the Divide & Conquer pattern

FORK & JOIN
Mapping the tasks to UEs can be done directly or indirectly
Direct: Each subtask is mapped to a new UE
  Disadvantage: UE creation and destruction is expensive
  Standard programming model in OpenMP
Indirect: Subtasks are stored inside a shared queue and handled by a static number of UEs
  Concept behind OpenMP
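A minimal C sketch of the direct variant using POSIX threads, with one new UE per subtask (the work function is illustrative):

```c
#include <pthread.h>
#include <stdio.h>

#define NSUBTASKS 4

void *subtask(void *arg) {
    int id = *(int *)arg;
    printf("subtask %d running\n", id);      /* illustrative work */
    return NULL;
}

int main(void) {
    pthread_t ue[NSUBTASKS];
    int ids[NSUBTASKS];

    for (int i = 0; i < NSUBTASKS; i++) {    /* fork: a new UE per subtask */
        ids[i] = i;
        pthread_create(&ue[i], NULL, subtask, &ids[i]);
    }
    for (int i = 0; i < NSUBTASKS; i++)      /* join: wait for termination */
        pthread_join(ue[i], NULL);
    return 0;
}
```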


SHARED DATA
Problem: Manage access to shared data
Principle: Define an access protocol that ensures the results of a computation are correct for any ordering of the operations on the data

SHARED DATA
Model shared data as an (abstract) data type with a fixed set of operations
Operations can be seen as transactions (→ ACID properties)
Start with a simple solution and improve performance step by step:
  Only one operation can be executed at any point in time
  Improve performance by separating operations into noninterfering sets
  Separate operations into read and write operations
Many different lock strategies… (a sketch follows below)
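As one example of separating read and write operations, a minimal C sketch with a POSIX read-write lock (the counter is an illustrative shared data type):

```c
#include <pthread.h>

/* Shared data modeled as a data type with a fixed set of operations. */
typedef struct {
    long value;
    pthread_rwlock_t lock;   /* initialize with pthread_rwlock_init() */
} Counter;

/* Read operation: many readers may hold the lock simultaneously. */
long counter_get(Counter *c) {
    pthread_rwlock_rdlock(&c->lock);
    long v = c->value;
    pthread_rwlock_unlock(&c->lock);
    return v;
}

/* Write operation: a writer gets exclusive access. */
void counter_add(Counter *c, long delta) {
    pthread_rwlock_wrlock(&c->lock);
    c->value += delta;
    pthread_rwlock_unlock(&c->lock);
}
```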



QUESTIONS?

REFERENCES
T. Mattson, B. Sanders and B. Massingill. Patterns for Parallel Programming. Addison-Wesley, 2004.
A. Grama, A. Gupta, G. Karypis and V. Kumar. Introduction to Parallel Computing. 2nd edition. Addison-Wesley, 2003.
P. S. Pacheco. An Introduction to Parallel Programming. Morgan Kaufmann, 2011.
Images from Mattson et al., 2004.