TRANSCRIPT
ME964: High Performance Computing for Engineering Applications
Parallel Programming Patterns
Oct. 16, 2008
Before we get started…
Last Time:
- CUDA arithmetic support: performance implications, accuracy implications
- A software design exercise: prefix scan
- Preamble to the topic of software design patterns

Today:
- Parallel programming patterns

Other issues:
- HW due on Saturday, 11:59 PM.
- I will be in Milwaukee next Th; Mik will be the lecturer, tracing the evolution of the hardware support for the graphics pipeline.
“If you build it, they will come.”
“And so we built them. Multiprocessor workstations, massively parallel supercomputers, a cluster in every department … and they haven’t come. Programmers haven’t come to program these wonderful machines. … The computer industry is ready to flood the market with hardware that will only run at full speed with parallel programs. But who will write these programs?”
- Mattson, Sanders, Massingill (2005)
Objective
Outline a framework that draws on techniques and best practices used by experienced parallel programmers in order to enable you to:
- Think about the problem of parallel programming
- Address functionality and performance issues in your parallel program design
- Discuss your design decisions with others
- Select an appropriate implementation platform that will facilitate application development
Fundamentals of Parallel Computing
Parallel computing requires that:
- The problem can be decomposed into sub-problems that can be safely solved at the same time
- The programmer structures the code and data to solve these sub-problems concurrently

The goals of parallel computing are:
- To solve problems in less time, and/or
- To solve bigger problems

The problems must be large enough to justify parallel computing and to exhibit exploitable concurrency.
Recommended Reading
Mattson, Sanders, Massingill, Patterns for Parallel Programming, Addison Wesley, 2005, ISBN 0-321-22811-1.
Used for this presentation
A good overview of challenges, best practices, and common techniques in all aspects of parallel programming
Book is on reserve at Wendt Library
Parallel Programming: Key Steps
1) Find the concurrency in the problem
2) Structure the algorithm so that concurrency can be exploited
3) Implement the algorithm in a suitable programming environment
4) Execute and tune the performance of the code on a parallel system
NOTE: The reality is that these steps have not been separated into levels of abstraction that can be dealt with independently.
What’s Next?
Finding Concurrency ← focus on this for a while
Algorithm Structure
Supporting Structures
Implementation Mechanisms
From “Patterns for Parallel Programming”
Challenges of Parallel Programming
Finding and exploiting concurrency often requires looking at the problem from a non-obvious angle
Dependences need to be identified and managed
The order of task execution may change the answers:
- Obvious: one step feeds its result to the next step
- Subtle: numeric accuracy may be affected by the ordering of steps that are logically parallel with each other
Challenges of Parallel Programming (cont.)
Performance can be drastically reduced by many factors
Overhead of parallel processing
Load imbalance among processor elements
Inefficient data sharing patterns
Saturation of critical resources such as memory bandwidth
Shared Memory vs. Message Passing
We will focus on shared memory parallel programming
This is what CUDA is based on
Future massively parallel microprocessors are expected to support shared memory at the chip level
The programming considerations of the message passing model are quite different!
Finding Concurrency in Problems
Identify a decomposition of the problem into sub-problems that can be solved simultaneously:
- A task decomposition that identifies tasks that can execute concurrently
- A data decomposition that identifies data local to each task
- A way of grouping tasks and ordering the groups to satisfy temporal constraints
- An analysis of the data sharing patterns among the concurrent tasks
- A design evaluation that assesses the quality of the choices made in all the steps
Finding Concurrency – The Process
Decomposition: Task Decomposition, Data Decomposition
Dependence Analysis: Group Tasks, Order Tasks, Data Sharing
Design Evaluation

This is typically an iterative process. Opportunities exist for dependence analysis to play an earlier role in decomposition.
Find Concr. 1: Decomp. Stage: Task Decomposition
Many large problems have natural independent tasks:
- The number of tasks used should be adjustable to the execution resources available.
- Each task must include sufficient work to compensate for the overhead of managing its parallel execution.
- Tasks should maximize reuse of sequential program code to minimize design/implementation effort.

“In an ideal world, the compiler would find tasks for the programmer. Unfortunately, this almost never happens.” - Mattson, Sanders, Massingill
Example: Task Decomposition (Square Matrix Multiplication)
- P = M * N, where all matrices are WIDTH × WIDTH
- One natural task (sub-problem) produces one element of P
- All tasks can execute in parallel (see the sketch below)
[Figure: WIDTH × WIDTH matrices M, N, and P]
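A minimal CUDA sketch of this decomposition, assuming row-major WIDTH × WIDTH float arrays (the kernel and parameter names are illustrative, not from the lecture): one thread is one task and computes one element of P.

    __global__ void matMulElement(const float* M, const float* N, float* P, int width)
    {
        // One task = one element of P: a single dot product of a row of M
        // with a column of N.
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < width && col < width) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k)
                sum += M[row * width + k] * N[k * width + col];
            P[row * width + col] = sum;
        }
    }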
Task Decomposition Example: Molecular Dynamics
Simulation of the motions of a large molecular system.
A natural task is to perform the calculations for individual atoms in parallel. For each atom, there are several tasks to complete:
- Covalent forces acting on the atom
- Neighbors that must be considered in non-bonded forces
- Non-bonded forces
- Update position and velocity
- Misc. physical properties based on motions

Some of these can go in parallel for an atom.
It is common that there are multiple ways to decompose any given problem.
Find Concr. 2: Decomp. Stage: Data Decomposition
The most compute-intensive parts of many large problems manipulate a large data structure:
- Similar operations are being applied to different parts of the data structure, in a mostly independent manner.
- This is what CUDA is optimized for.

The data decomposition should lead to:
- Efficient data usage by each UE (unit of execution) within the partition
- Few dependencies between the UEs that work on different partitions
- Adjustable partitions that can be varied according to the hardware characteristics
Data Decomposition Example: Square Matrix Multiplication
Row blocks of P (adjacent full rows):
- Computing each partition requires access to the entire N array
Square sub-blocks of P:
- Only bands of M and N are needed
[Figure: WIDTH × WIDTH matrices M, N, and P]
Think about how the data is stored in the Matrix type: computing things row-wise or column-wise might actually lead to different performance…
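A hedged illustration of why orientation matters, assuming row-major storage (the helper name is made up for this example):

    // Row-major storage: element (row, col) of a WIDTH x WIDTH matrix.
    // If consecutive threads vary 'col', they read consecutive addresses
    // and global loads coalesce; walking down a column instead strides
    // by 'width' floats per step, which is far slower on the GPU.
    __device__ float elem(const float* A, int row, int col, int width)
    {
        return A[row * width + col];
    }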
Find Concr. 3: Dependence Analysis - Task Grouping
Sometimes the natural tasks of a problem can be grouped together to improve efficiency:
- Reduced synchronization overhead: tasks grouped together often need little or no synchronization among themselves
- All tasks in the group efficiently share data loaded into a common on-chip storage (Shared Memory)
Task Grouping Example: Square Matrix Multiplication
Tasks calculating a P sub-block:
- Extensive input data sharing; reduced memory bandwidth demand when using Shared Memory
- All synched in execution
[Figure: WIDTH × WIDTH matrices M, N, and P]
Find Concr. 4: Dependence Analysis - Task Ordering
Identify the data required by a group of tasks before they can execute:
- Find the task group that creates it (look upwind)
- Determine a temporal order that satisfies all data constraints
Task Ordering Example: Molecular Dynamics
Neighbor List → Non-bonded Forces
Covalent Forces
Both force computations → Update atomic positions and velocities → Next Time Step
Find Concr. 5: Dependence Analysis - Data Sharing
Data sharing can be a double-edged sword:
- An algorithm that calls for excessive data sharing can drastically reduce the advantage of parallel execution
- Localized sharing can improve memory bandwidth efficiency
- Use the execution of task groups to interleave with (mask) global memory data accesses
- Read-only sharing can usually be done at much higher efficiency than read-write sharing, which often requires a higher level of synchronization
Data Sharing Example: Matrix Multiplication
- Each task group will finish using each sub-block of N and M before moving on
- N and M sub-blocks are loaded into Shared Memory for use by all threads of a P sub-block
- The amount of on-chip Shared Memory strictly limits the number of threads working on a P sub-block
- Read-only shared data can be efficiently accessed as Constant or Texture data (on the GPU), which frees up the shared memory for other uses; see the sketch below
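A small sketch of the constant-memory option just mentioned; the symbol name and size are illustrative, not from the lecture:

    // Read-only table placed in constant memory: it is cached and
    // broadcast efficiently when all threads in a warp read the same
    // word, leaving shared memory free for the M and N sub-blocks.
    __constant__ float coeffs[256];

    // Host side, once before the kernel launches:
    //   cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(coeffs));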
Data Sharing Example: Molecular Dynamics
The atomic coordinates:
- Read-only access by the neighbor list, bonded force, and non-bonded force task groups
- Read-write access for the position update task group
The force array:
- Read-only access by the position update group
- Accumulate access by the bonded and non-bonded task groups
The neighbor list:
- Read-only access by the non-bonded force task groups
- Generated by the neighbor list task group
Find Concr. 6: Design Evaluation
Key questions to ask:
- How many UEs can be used?
- How are the data structures shared?
- Is there a lot of data dependency that leads to excessive synchronization needs?
- Is there enough work in each UE between synchronizations to make parallel execution worthwhile?
Key Parallel Programming Steps
1) Find the concurrency in the problem
2) Structure the algorithm so that concurrency can be exploited
3) Implement the algorithm in a suitable programming environment
4) Execute and tune the performance of the code on a parallel system
What’s Next?
Finding Concurrency
Algorithm Structure ← focus on this for a while
Supporting Structures
Implementation Mechanisms
From “Patterns for Parallel Programming”
Algorithm: Definition
A step-by-step procedure that is guaranteed to terminate, such that each step is precisely stated and can be carried out by a computer:
- Definiteness: each step is precisely stated
- Effective computability: each step can be carried out by a computer
- Finiteness: the procedure terminates

Multiple algorithms can be used to solve the same problem; some require fewer steps and some exhibit more parallelism.
Algorithm Structure - Strategies
Organize by Tasks: Task Parallelism, Divide and Conquer
Organize by Data: Geometric Decomposition, Recursive Data Decomposition
Organize by Flow: Pipeline, Event-Based Coordination
Choosing Algorithm Structure
Algorithm Design:
- Organize by Task: Linear → Task Parallelism; Recursive → Divide and Conquer
- Organize by Data: Linear → Geometric Decomposition; Recursive → Recursive Data
- Organize by Data Flow: Regular → Pipeline; Irregular → Event Driven
Alg. Struct. 1: Organize by Tasks ~ Linear Parallel Tasks ~
- Common in the context of distributed memory models
- These are the algorithms where the needed work can be represented as a collection of decoupled or loosely coupled tasks that can be executed in parallel
- The tasks don’t even have to be identical
- Load balancing between UEs can be an issue (dynamic load balancing?)
- If there is no data dependency involved, there is no need for synchronization: the so-called “embarrassingly parallel” scenario
- Examples: ray tracing, a good portion of the N-body problem, Monte Carlo type simulations (see the sketch below)
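A sketch of the embarrassingly parallel case: estimating π by Monte Carlo. Wholly illustrative (the kernel name, sample counts, and the use of curand are assumptions, not from the lecture):

    #include <curand_kernel.h>

    // Each thread is an independent task: it draws its own random points
    // and counts hits inside the unit quarter-circle. No synchronization
    // is needed until the final tally.
    __global__ void montePi(unsigned long long seed, int samplesPerThread,
                            unsigned int* hits)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        curandState st;
        curand_init(seed, id, 0, &st);
        unsigned int local = 0;
        for (int s = 0; s < samplesPerThread; ++s) {
            float x = curand_uniform(&st);
            float y = curand_uniform(&st);
            if (x * x + y * y <= 1.0f) ++local;
        }
        atomicAdd(hits, local);  // single reconciliation step at the end
    }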
Alg. Struct. 2: Organize by Tasks ~ Recursive Parallel Tasks (Divide & Conquer) ~
- Valid when you can break a problem into a set of decoupled or loosely coupled smaller problems, which in turn…
- This pattern is applicable when you can solve the small problems (the leaves) concurrently and with little synchronization
- Chances are you need synchronization when you have this balanced-tree type of algorithm; it is often required by the merging step (assembling the result from the “subresults”)
- Examples: FFT, linear algebra problems (see the FLAME project), the vector reduce operation
Computational ordering can have major effects on memory bank conflicts and control flow divergence.
[Figure: tree-based reduction over array elements 0 … 15, combined pairwise over three levels]
Alg. Struct. 3: Organize by Data ~ Linear (Geometric Decomposition) ~
- This is the case where the UEs gather around and work on a big chunk of data with little or no synchronization
- Imagine a car that needs to be painted: one robot paints the front left door, another one the rear left door, another one the hood, etc. The car is the “data,” and it’s parceled up, with a collection of UEs acting on these chunks of data
- NOTE: this is exactly the algorithmic approach enabled best by the GPU & CUDA
- Examples: matrix multiplication, matrix convolution
[Figure: tiled matrix multiplication: each thread block (bx, by) computes a BLOCK_WIDTH × BLOCK_WIDTH sub-block Psub of P; threads (tx, ty) walk bands of M and N of width BLOCK_WIDTH]
Tiled (Stenciled) Algorithms are Important for G80
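A minimal sketch of such a tiled kernel, in the spirit of the figure; BLOCK_WIDTH, the names, and the assumption that width is a multiple of BLOCK_WIDTH are all illustrative:

    #define BLOCK_WIDTH 16

    __global__ void tiledMatMul(const float* M, const float* N, float* P, int width)
    {
        // Each block computes one BLOCK_WIDTH x BLOCK_WIDTH sub-block Psub.
        __shared__ float Ms[BLOCK_WIDTH][BLOCK_WIDTH];
        __shared__ float Ns[BLOCK_WIDTH][BLOCK_WIDTH];

        int row = blockIdx.y * BLOCK_WIDTH + threadIdx.y;
        int col = blockIdx.x * BLOCK_WIDTH + threadIdx.x;
        float sum = 0.0f;

        // March along the bands of M and N, one tile at a time.
        for (int t = 0; t < width / BLOCK_WIDTH; ++t) {
            Ms[threadIdx.y][threadIdx.x] = M[row * width + t * BLOCK_WIDTH + threadIdx.x];
            Ns[threadIdx.y][threadIdx.x] = N[(t * BLOCK_WIDTH + threadIdx.y) * width + col];
            __syncthreads();   // tile fully loaded by the whole block
            for (int k = 0; k < BLOCK_WIDTH; ++k)
                sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
            __syncthreads();   // everyone done before the tile is overwritten
        }
        P[row * width + col] = sum;
    }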
Alg. Struct. 4: Organize by Data ~ Recursive Data Scenario ~
- This scenario comes into play when the data you operate on is structured in a recursive fashion: balanced trees, graphs, lists
- Sometimes you don’t even have to have the data represented as such; see the example with prefix scan, where you choose to look at the data as having a balanced tree structure
- Many problems that seem inherently sequential can be approached in this framework
- This is typically associated with a net increase in the amount of work you have to do; work goes up from O(n) to O(n log(n)) (see for instance the Hillis and Steele algorithm)
- The key question is whether the parallelism gained brings you ahead of the sequential alternative
Example: The prefix scan ~ Reduction Step ~

d=0:  x0   x1          x2   x3          x4   x5          x6   x7
d=1:  x0   S(x0..x1)   x2   S(x2..x3)   x4   S(x4..x5)   x6   S(x6..x7)
d=2:  x0   S(x0..x1)   x2   S(x0..x3)   x4   S(x4..x5)   x6   S(x4..x7)
d=3:  x0   S(x0..x1)   x2   S(x0..x3)   x4   S(x4..x5)   x6   S(x0..x7)

For an array of n = 2^M elements:

for k = 0 to M-1
    offset = 2^k
    for j = 1 to 2^(M-k-1) in parallel do
        x[j·2^(k+1) - 1] = x[j·2^(k+1) - 1] + x[j·2^(k+1) - 2^k - 1]
    endfor
endfor
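A sketch of this reduction step as a CUDA kernel over one block’s worth of data staged in shared memory; n is assumed to be a power of two equal to the block size, and the names are illustrative:

    __global__ void scanReduceStep(float* x, int n)
    {
        extern __shared__ float s[];   // n floats, sized at launch time
        int tid = threadIdx.x;
        s[tid] = x[tid];               // assumes blockDim.x == n
        __syncthreads();

        // Pass k uses stride = 2^k: the stride doubles each time and
        // half as many threads do useful work.
        for (int stride = 1; stride < n; stride *= 2) {
            int idx = (tid + 1) * 2 * stride - 1;   // j*2^(k+1) - 1
            if (idx < n)
                s[idx] += s[idx - stride];          // add the element 2^k to the left
            __syncthreads();
        }
        x[tid] = s[tid];
    }

    // Launch sketch: scanReduceStep<<<1, n, n * sizeof(float)>>>(d_x, n);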
Example: The array reduction (the bad choice)
[Figure: interleaved reduction of an array: iteration 1 forms 0+1, 2+3, 4+5, …; iteration 2 forms 0..3, 4..7, 8..11, …; iteration 3 forms 0..7, 8..15, …]
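A sketch of why this ordering is the bad choice on the GPU; the kernel is illustrative (single block, power-of-two n assumed):

    __global__ void reduceInterleaved(float* x, int n)
    {
        extern __shared__ float s[];
        int tid = threadIdx.x;
        s[tid] = x[tid];
        __syncthreads();

        for (int stride = 1; stride < n; stride *= 2) {
            // Only threads whose index is a multiple of 2*stride add,
            // so the active threads stay scattered across every warp:
            // heavy branch divergence, plus strided shared-memory
            // accesses that invite bank conflicts. Compacting the active
            // threads (a stride that halves from n/2) avoids both.
            if (tid % (2 * stride) == 0)
                s[tid] += s[tid + stride];
            __syncthreads();
        }
        if (tid == 0) x[0] = s[0];
    }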
Alg. Struct. 5: Organize by Data Flow ~ Regular: Pipeline ~
- Tasks are parallel, but there is a high degree of synchronization (coordination) between the UEs
- The input of one UE is the output of an upwind UE (the “pipeline”)
- The time elapsed between two pipeline ticks is dictated by the slowest stage of the pipe (the bottleneck)
- Commonly employed by sequential chips
- For complex problems, having a deep pipeline is attractive: at each pipeline tick you output one set of results
- Examples: in a CPU, instruction fetch, decode, data fetch, instruction execution, and data write are pipelined; the Cray vector architecture drew heavily on this algorithmic structure when performing linear algebra operations
Alg. Struct. 6: Organize by Data Flow ~ Irregular: Event-Driven Scenarios ~
- The furthest away from what the GPU can support today
- Well supported by MIMD architectures
- You coordinate UEs through asynchronous events
- Critical aspects: load balancing (should be dynamic) and communication overhead, particularly in real-time applications
- Suitable for action-reaction type simulations
- Examples: computer games, traffic control algorithms
Key Parallel Programming Steps
1) Find the concurrency in the problem
2) Structure the algorithm so that concurrency can be exploited
3) Implement the algorithm in a suitable programming environment
4) Execute and tune the performance of the code on a parallel system
What’s Next?
Finding Concurrency
Algorithm Structure
Supporting Structures ← focus on this for a while
Implementation Mechanisms
Objective
To understand SPMD and the role of the CUDA supporting structures:
- How it relates to other common parallel programming structures
- What the supporting structures entail for application development
- How the supporting structures could be used to implement parameterized applications for different levels of efficiency and complexity
Parallel Programming Coding Styles - Program and Data Models
Program Models: SPMD, Master/Worker, Loop Parallelism, Fork/Join
Data Models: Shared Data, Shared Queue, Distributed Array
These are not necessarily mutually exclusive.
Program Models
SPMD (Single Program, Multiple Data):
- All PEs (Processing Elements) execute the same program in parallel
- Each PE has its own data
- Each PE uses a unique ID to access its portion of data
- Different PEs can follow different paths through the same code
- This is essentially the CUDA Grid model (see the sketch below)
- SIMD is a special case: the warp
Master/Worker
Loop Parallelism
Fork/Join
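In CUDA the unique ID comes from the built-in index variables; a minimal sketch (the kernel name and per-element work are illustrative):

    __global__ void spmdKernel(float* data, int n)
    {
        // Every thread runs this same program; the unique grid-wide ID
        // selects each thread's portion of the data.
        int my_id = blockIdx.x * blockDim.x + threadIdx.x;
        if (my_id < n)
            data[my_id] *= 2.0f;   // placeholder per-element work
    }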
Program Models
SPMD (Single Program, Multiple Data)
Master/Worker:
- A master thread sets up a pool of worker threads and a bag of tasks
- Workers execute concurrently, removing tasks until done
Loop Parallelism:
- Loop iterations execute in parallel
- FORTRAN do-all (truly parallel), do-across (with dependence)
Fork/Join:
- The most general, generic way of creating threads
Review: Algorithm Structure
Algorithm Design:
- Organize by Task: Linear → Task Parallelism; Recursive → Divide and Conquer
- Organize by Data: Linear → Geometric Decomposition; Recursive → Recursive Data
- Organize by Data Flow: Regular → Pipeline; Irregular → Event Driven
Algorithm Structures vs. Coding Styles (four smiles is best)

                Task Parallel.  Divide/Conquer  Geometric Decomp.  Recursive Data  Pipeline  Event-based
SPMD            ☺☺☺☺            ☺☺☺             ☺☺☺☺               ☺☺              ☺☺☺       ☺☺
Loop Parallel   ☺☺☺☺            ☺☺              ☺☺☺                -               -         -
Master/Worker   ☺☺☺☺            ☺☺              ☺                  ☺               ☺         ☺
Fork/Join       ☺☺              ☺☺☺☺            ☺☺                 -               ☺☺☺☺      ☺☺☺☺

Source: Mattson, et al.
Supporting Structures vs. Program Models

                OpenMP  MPI     CUDA
SPMD            ☺☺☺     ☺☺☺☺    ☺☺☺☺
Loop Parallel   ☺☺☺☺    ☺       -
Master/Slave    ☺☺      ☺☺☺     -
Fork/Join       ☺☺☺     -       -
More on SPMD
- Dominant coding style of scalable parallel computing
- MPI code is mostly developed in SPMD style
- Much OpenMP code is also written in SPMD style (next to loop parallelism)
- Particularly suitable for algorithms based on data parallelism, geometric decomposition, and divide and conquer
- Main advantage: tasks and their interactions are visible in one piece of source code; no need to correlate multiple sources

SPMD is by far the most commonly used pattern for structuring parallel programs.
Typical SPMD Program Phases
- Initialize: establish localized data structures and communication channels
- Obtain a unique identifier: each thread acquires a unique identifier, typically ranging from 0 to N-1, where N is the number of threads. Both OpenMP and CUDA have built-in support for this.
- Distribute data: decompose global data into chunks and localize them, or share/replicate the major data structures, using the thread ID to associate a subset of the data with each thread
- Run the core computation: more details on the next slide…
- Finalize: reconcile global data structures, prepare for the next major iteration
Core Computation Phase
Thread IDs are used to differentiate the behavior of threads:
- Use the thread ID in loop index calculations to split loop iterations among threads
- Use the thread ID, or conditions based on the thread ID, to branch to thread-specific actions

The two approaches can have very different performance results and code complexity depending on the way they are done.
A Simple Example
Assume the computation being parallelized has 1,000,000 iterations.

Sequential code:

num_steps = 1000000;
for (i = 0; i < num_steps; i++) { … }
SPMD Code Version 1
Assign a chunk of iterations to each thread; the last thread also finishes up the remainder iterations.

num_steps = 1000000;
i_start = my_id * (num_steps / num_threads);
i_end = i_start + (num_steps / num_threads);
if (my_id == (num_threads - 1)) i_end = num_steps;

for (i = i_start; i < i_end; i++) { … }

Reconciliation of results across threads, if necessary.
Problems with Version 1
- The last thread executes more iterations than the others
- The number of extra iterations is up to the total number of threads - 1; this is not a big problem when the number of threads is small, but when there are thousands of threads, it can create serious load imbalance problems
- Also, the extra if statement is a typical source of “branch divergence” in CUDA programs
SPMD Code Version 2
Assign one more iteration to some of the threads:

int rem = num_steps % num_threads;
i_start = my_id * (num_steps / num_threads);
i_end = i_start + (num_steps / num_threads);

if (rem != 0) {
    if (my_id < rem) {
        i_start += my_id;
        i_end += (my_id + 1);
    } else {
        i_start += rem;
        i_end += rem;
    }
}

Less load imbalance.
More branch divergence.
SPMD Code Version 3
Use a cyclic distribution of the iterations:

num_steps = 1000000;
for (i = my_id; i < num_steps; i += num_threads) { … }

Less load imbalance.
Loop branch divergence only in the last warp.
Data padding further eliminates divergence (see the sketch below).
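A sketch of the padding idea; the assumption that data beyond num_steps is padded with neutral values is mine, not from the lecture:

// Round the iteration count up to a multiple of the thread count so
// every thread runs the same number of iterations: no divergent last warp.
int padded_steps = ((num_steps + num_threads - 1) / num_threads) * num_threads;
// Elements in [num_steps, padded_steps) hold neutral values (e.g., zeros),
// so the extra iterations are harmless.
for (i = my_id; i < padded_steps; i += num_threads) { … }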
Comparing the Three Versions
[Figure: iteration-to-thread mappings for threads ID=0..3 under Versions 1, 2, and 3]
A padded Version 1 may be best for some data access patterns.
Data Styles
Shared Data:
- All threads share a major data structure
- This is what CUDA supports
Shared Queue:
- All threads see a “thread-safe” queue that maintains the ordering of data communication
Distributed Array:
- Decomposed and distributed among threads
- Limited support in CUDA Shared Memory