
1

TEXT: Structured Parallel Programming: Patterns for Efficient Computation

M. McCool, A. Robison, J. Reinders

Chapter 1

CMPS 5433 – Parallel Algorithms
Dr. Ranette Halverson

2

• Parallel? Sequential?
• In general, what does Sequential mean? Parallel?
• Is Parallel always better than Sequential?
• Examples??
• Parallel resources?
• How Efficient is Parallel?
• How can you measure Efficiency?

Introduction

3

Automatic Parallelization

vs. Explicit Parallel Programming

• Vector instructions
• Multithreaded cores
• Multicore processors
• Multiple processors
• Graphics engines
• Parallel co-processors
• Pipelining
• Multi-tasking

Parallel Computing Features

4

• All programmers must be able to program for parallel computers!!
• Design & Implement
• Efficient, Reliable, Maintainable
• Scalable

•Will try to avoid hardware issues…

All Modern Computers are Parallel

5

• Patterns: valuable algorithmic structures commonly seen in efficient parallel programs
• AKA: algorithm skeletons

• Examples of Patterns in Sequential Programming?
• Linear search
• Recursion ~~ Divide-and-Conquer
• Greedy Algorithm

Patterns for Efficient Computation

6

• Capture Algorithmic Intent
• Avoid mapping algorithms to particular hardware
• Focus on
• Performance
• Low-overhead implementation
• Achieving efficiency & scalability

•Think Parallel

Goals of This Book

7

• Scalable: ability to efficiently apply an algorithm to ‘any’ number of processors
• Considerations:
• Bottleneck: situation in which productive work is delayed due to limited resources relative to the amount of work; congestion occurring when work arrives more quickly than it can be processed
• Overhead: cost of implementation of a (parallel) algorithm that is not part of the solution itself – often due to the distribution or collection of data & results

Scalable

8

• Act of putting a set of operations into a specific order
• Serial Semantics & Serial Illusion
• But were computers really serial?
• Have we become overly dependent on a sequential strategy?

•Pipelining, Multi-tasking

Serialization (1.1)

9

• Improved Performance
• The Past
• The Future
• Why is my new faster computer not faster??

• Deterministic algorithms
• End result vs. Intermediate results
• Timing issues – Relative vs. Absolute

Can we ignore parallel strategies?

10

• Structured Patterns for parallelism
• Eliminate non-determinism as much as possible

Approach of THIS Book

11

Consider:
  A = B + 7
  Read Z
  C = X * 2

Does order matter??
  A = B + 7
  Read A
  B = X * 2

• Programmers think serial (sequential)
• List of instructions, executed in serial order
• Standard programming training
• Serial Trap: Assumption of serial ordering

Serial Traps

12

What does parallel_for imply? Are both correct? Will the results differ? What about time?

for (i = 0; i < num_web_sites; ++i)
  search(search_phrase, website[i]);

parallel_for (i = 0; i < num_web_sites; ++i)
  search(search_phrase, website[i]);

Example: Search web for a particular search phrase
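As a concrete illustration of the parallel_for above, here is a minimal sketch using Intel TBB (the library covered in Appendix C of the text). The search() function and the website list are hypothetical stand-ins invented for this example; the point is that each iteration is independent, so the library is free to run iterations in any order and on any number of threads.

#include <tbb/parallel_for.h>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for the slide's search() routine.
static void search(const std::string& phrase, const std::string& site) {
    std::printf("searching %s for \"%s\"\n", site.c_str(), phrase.c_str());
}

int main() {
    std::vector<std::string> website = {"siteA", "siteB", "siteC"};
    std::string search_phrase = "parallel patterns";

    // Iterations are independent, so TBB may execute them concurrently;
    // the order of the printed lines is therefore not fixed.
    tbb::parallel_for(std::size_t(0), website.size(), [&](std::size_t i) {
        search(search_phrase, website[i]);
    });
}

The results (the set of sites searched) match the serial for loop; only the ordering of the work and the elapsed time may differ.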

13

parallel_do {
  A = B + 7
  Read A
  B = X * 2
}

parallel_do {
  A = B + 7
  Read Z
  C = X * 2
}

Can we apply “parallel” to previous example?? Results?
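The first parallel_do block above (the one containing Read A) is unsafe: “Read A” races with “A = B + 7”, and “B = X * 2” races with the read of B. A minimal sketch of that nondeterminism using std::thread is shown below; atomics are used only so the illustration stays well defined, they do not remove the ordering problem.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> A{0}, B{1};
int X = 3;

int main() {
    // The three "parallel" statements of the first block above.
    std::thread t1([] { A = B + 7; });                              // writes A, reads B
    std::thread t2([] { std::printf("Read A: %d\n", A.load()); });  // reads A
    std::thread t3([] { B = X * 2; });                              // writes B
    t1.join(); t2.join(); t3.join();
    // Depending on scheduling, the printed value may be 0, 8, or 13.
}

The second block has no such conflicts (it only reads B, X, and Z and writes distinct variables), so it can safely run in parallel and always produces the same results.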

14

Complexity of a Parallel Algorithm
• Time
• How can we compare?
• Work
• How do we split problems across processors?

Complexity of a Sequential Algorithm
• Time complexity
• Big Oh
• What do we mean?
• Why do we measure performance?

Performance (1.2)

15

• Doesn’t really mean time
• Assumptions about measuring “time”
• All instructions take same amount of time
• Computers are same speed
• Are these assumptions true?
• So, what is Time Complexity, really?

Time Complexity

16

Suppose we want to calculate & print 1000 payroll checks.
Each employee’s pay is independent of the others.

parallel_for (i = 0; i < num_employees; ++i)
  Calc_Pay(empl[i]);

Any problems with this?

Example: Payroll Checks

17

• What if all employee data is stored in an array? Can it be accessed in parallel?
• What about printing* the checks?
• What if the computer has 2000 processors?
• Is this application a good candidate for a parallel program?
• How much faster can we do the job?

Payroll Checks (cont’d)

18

• How can you possibly sum a single list of integers in parallel?
• What is the optimal number of processors to use?
• Consider number of operations.

Example: Sum 1000 Integers
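One way to express this (a sketch, not the book’s code) is with the C++17 parallel algorithms: std::reduce splits the range across worker threads, sums each piece, and combines the partial sums, so the programmer only states that a reduction is wanted.

#include <cstdio>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> values(1000, 1);              // 1000 integers to sum
    int total = std::reduce(std::execution::par,   // parallel reduction
                            values.begin(), values.end(), 0);
    std::printf("total = %d\n", total);            // prints 1000
}

Note that a parallel sum performs essentially the same number of additions as the serial loop (plus a few to combine partial sums), so more processors reduce time but not the operation count.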

19

Considerations of parallel processing
• Communication
• How is data distributed? Results compiled?
• Shared memory vs. Local memory
• One computer vs. Network (e.g. SETI Project)
• Work

Actual Time vs. Number of Instructions

20

• The time required to perform the longest chain of tasks that must be performed sequentially
• SPAN limits the speed-up that can be accomplished via parallelism
• See Figures 1.1 & 1.2
• Provides a basis for optimization

SPAN = Critical Path

21

• Shared memory model – book assumption
• Communication is simplified
• Causes bottlenecks ~~ Why? Locality
• Locality
• Memory (data) accesses tend to be close together
• Accesses near each other (in time & space) tend to be cheaper
• Communication – best when none
• More processors is not always better

Shared Memory & Locality (of reference)

22

• Proposed in 1967
• Provides an upper bound on the speed-up attainable via parallel processing
• Sequential vs. Parallel parts of a program
• Sequential portion limits parallelism

Amdahl’s Law (Argument)
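In its standard form (a worked statement, not quoted from the slide): if a fraction f of a program is inherently serial and the remaining 1 - f parallelizes perfectly across P processors, the speed-up is bounded by

S(P) = \frac{1}{f + \frac{1-f}{P}} \le \frac{1}{f}

For example, with f = 0.1 (10% serial) and P = 100 processors, S \approx 9.2, and even with unlimited processors the speed-up can never exceed 10.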

23

Example: Consider program with 1000 operations
• Are ops independent?
• OFTEN, parallelizing adds to the total number of operations to be performed

3 components to consider when parallelizing
• Total Work
• Span
• Communication

Work-Span Model
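The standard work-span bounds (developed further later in the text) make these components concrete: with T_1 the total work, T_\infty the span (critical path), and T_P the running time on P processors,

T_P \ge \max\left(\frac{T_1}{P}, T_\infty\right), \qquad \text{speed-up} = \frac{T_1}{T_P} \le \min\left(P, \frac{T_1}{T_\infty}\right)

So even ignoring communication, adding processors stops helping once T_1 / P falls below the span.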

24

• Communication strategy for “reducing” information held by n workers down to 1 result
• Takes O(log p) steps, where p is the number of workers/processors
• Example: 16 processors hold 1 integer each & the integers need to be added together.
• How can we effectively do this?

Reduction
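A minimal sketch of the 16-processor example (serial code standing in for what the processors would do concurrently): in each round, pairs of neighbours combine their values, so only log2(16) = 4 rounds are needed instead of 15 serial additions.

#include <cstdio>
#include <vector>

int main() {
    const int p = 16;
    std::vector<int> v(p);
    for (int i = 0; i < p; ++i) v[i] = i + 1;   // each worker holds one integer

    // Pairwise tree reduction: all additions within one round are
    // independent, so real workers could perform them in parallel.
    for (int step = 1; step < p; step *= 2)     // 4 rounds for p = 16
        for (int i = 0; i + step < p; i += 2 * step)
            v[i] += v[i + step];

    std::printf("sum = %d\n", v[0]);            // 1 + 2 + ... + 16 = 136
}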

25

• The “opposite” of Reduction
• Communication strategy for distributing data from 1 worker to n workers
• Example: 1 processor holds 1 integer & the integer needs to be distributed to the 15 others
• How can we effectively do this?

Broadcast
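A minimal sketch of one common approach (a doubling scheme, analogous to the reduction above, not taken from the book): in each round every worker that already holds the value passes it to one that does not, so the number of holders doubles and all 16 are reached in log2(16) = 4 rounds.

#include <cstdio>
#include <vector>

int main() {
    const int p = 16;
    std::vector<int> value(p, 0);
    std::vector<bool> has(p, false);
    value[0] = 42;                               // worker 0 starts with the integer
    has[0] = true;

    for (int step = 1; step < p; step *= 2) {    // 4 rounds for p = 16
        std::vector<bool> had = has;             // who held the value at round start
        for (int id = 0; id < p; ++id)
            if (had[id] && id + step < p) {      // sends within a round are independent
                value[id + step] = value[id];
                has[id + step] = true;
            }
    }

    for (int id = 0; id < p; ++id)
        std::printf("worker %2d holds %d\n", id, value[id]);
}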

26

• A process that ensures all processors have approximately the same amount of work/instructions
• To ensure processors are not idle

Load Balancing

27

• A measure of the degree to which a problem is broken down into smaller parts, particularly for assigning each part to a separate processor in a parallel computer
• Coarse granularity: Few parts, each relatively large
• Fine granularity: Many parts, each relatively small

Granularity

28

Moore’s Law – 1965, Gordon Moore of Intel
• Number of transistors on a silicon chip doubles every 2 years (approximately)
• Until 2004: increased transistor switching speeds → increased clock rate → increased performance
• Note Figures 1.3 & 1.4 in text
• BUT around 2003, no more increase in clock rate – about 3 GHz

Pervasive Parallelism & Hardware Trends (1.3.1)

29

• Power wall: power consumption grows non-linearly with clock speed
• Instruction-level parallelism wall: no new low-level (automatic) parallelism – only constant-factor increases
• Superscalar instructions
• Very Large Instruction Word (by compiler)
• Pipelining – max 10 stages
• Memory wall: difference between processor & memory speeds
• Latency
• Bandwidth

Walls limiting single processor performance

30

• Virtual memory
• Prefetching
• Cache

• Benchmarks
• Standard programs used to compare performance on different computers or to demonstrate a computer’s capabilities

• Early Parallelism – WW2
• Newer HW features
• Larger word size
• Superscalar
• Vector
• Multithreading
• Parallel ALUs
• Pipelines
• GPUs

Historical Trends (1.3.2)

31

• Serial Traps: unnecessary assumptions deriving from serial programming & execution
• Programmer assumes serial execution, so gives no consideration to possible parallelism
• Later, it is not possible to parallelize

• Automatic Parallelism ~ Different strategies
• Fully automatic – no help from the programmer (compiler only)
• Parallel constructs – “optional”

Explicit Parallel Programming (1.3.3)

32

Rule 1: Parallelize a block of sequential instructions with no repeated variables.
Rule 2: Parallelize any block of sequential instructions if the repeated variables don’t change (never appear on the left-hand side).

What about instruction 5? What can we do?

Consider the following simple program:
1. A = B + C
2. D = 5 * X + Y
3. E = P + Z / 7
4. F = B + X
5. G = A * 10

Examples of Automatic Parallelism
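A hedged illustration (my sketch, not the book’s code) of how these rules could be exploited, written here with std::async: statements 1–4 write different variables and only read B, C, X, Y, P, and Z, so they may run concurrently, while statement 5 reads A and therefore must wait for statement 1.

#include <cstdio>
#include <future>

int main() {
    int B = 2, C = 3, X = 4, Y = 5, P = 6, Z = 7;
    int A = 0, D = 0, E = 0, F = 0, G = 0;

    auto s1 = std::async(std::launch::async, [&] { A = B + C;     });  // 1
    auto s2 = std::async(std::launch::async, [&] { D = 5 * X + Y; });  // 2
    auto s3 = std::async(std::launch::async, [&] { E = P + Z / 7; });  // 3
    auto s4 = std::async(std::launch::async, [&] { F = B + X;     });  // 4

    s1.wait();                 // statement 5 depends on A, so wait for statement 1
    G = A * 10;                // 5
    s2.wait(); s3.wait(); s4.wait();

    std::printf("A=%d D=%d E=%d F=%d G=%d\n", A, D, E, F, G);
}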

33

• Pointers allow a data structure to be distributed across memory
• Parallel analysis is very difficult
• Loops can accommodate, restrict, or hide possible parallelization

void addme(int n, double a[n], double b[n], double c[n])
{
    int i;
    for (i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}

Parallelization Problems

34

double a[10];
a[0] = 1;
addme(9, a+1, a, a);   /* aliasing: inside addme, a, b, and c now overlap,
                          so each iteration computes a[i+1] = a[i] + a[i]
                          using the value just written – the iterations can
                          no longer run in parallel */

void addme(int n, double a[n], double b[n], double c[n])
{
    int i;
    for (i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}

Examples

35

Mandatory Parallelism VS.

Optional Parallelism

“Explicit parallel programming constructs allow algorithms to be expressed without specifying unintended & unnecessary serial constraints”

void callme()
{
    foo();
    bar();
}

void callme()
{
    cilk_spawn foo();   /* foo() may run in parallel with bar() */
    bar();
}

Examples

36

• Patterns: commonly recurring strategies for dealing with particular problems
• Tools for parallel problems
• Goals: parallel scalability, good data locality, reduced overhead
• Common in CS: OOP, natural language processing, data structures, SW engineering

Structured Pattern-based Programming (1.4)

37

• Abstractions: strategies or approaches which help to hide certain details & simplify problem solving
• Implementation Patterns – low-level, system-specific
• Design Patterns – high-level abstraction
• Algorithm Strategy Patterns – algorithm skeletons
• Semantics – the pattern as a building block: task arrangement, data dependencies; abstract
• Implementation – for a real machine: granularity, cache
• Ideally, treat the two separately, but that is not always possible

Patterns

38

Figure 1.1 – Overview of Parallel Patterns

39

• Contemporary, popular languages were not designed for parallelism
• Need a transition to parallel

• Desired properties of a parallel language (p. 21)
• Performance – achievable, scalable, predictable, tunable
• Productivity – expressive, composable, debuggable, maintainable
• Portability – functionality & performance across systems (compilers & OS)

Parallel Programming Models (1.5)
Desired Properties (1.5.1)

40

Textbook – Focus on
• C++ & parallel support
• Intel support
• Intel Threading Building Blocks (TBB) – Appendix C
• C++ template library, open source & commercial
• Intel Cilk Plus – Appendix B
• Compiler extensions for C & C++, open source & commercial

• Other products available – Figure 1.2

Programming – C, C++

41

• Avoid HW Mechanisms
• Particularly vectors & threads
• Focus on
• TASKS – opportunities for parallelism
• DECOMPOSITION – breaking the problem down
• DESIGN of ALGORITHM – overall strategy

Abstractions vs. Mechanisms (1.5.2)

42

“(Parallel) programming should focus on decomposition of problem & design of algorithm rather than specific mechanisms by which it will be parallelized.” p. 23

Reasons to avoid HW specifics (mechanisms)
• Reduced portability
• Difficult to manage nested parallelism
• Mechanisms vary by machine

Abstractions, not Mechanisms

43

• Key to scalability – Data Parallelism
• Divide up DATA, not CODE!
• Data Parallelism: any form of parallelism in which the amount of work grows with the size of the problem
• Regular Data Parallelism: subcategory of D.P. which maps efficiently to vector instructions
• Parallel languages contain constructs for D.P.

Regular Data Parallelism (1.5.3)
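A minimal sketch of a regular data-parallel loop (my example, not the book’s): the work grows with n, every iteration applies the same operation to adjacent elements, and there are no loop-carried dependencies, which is exactly the shape a compiler can map onto vector (SIMD) instructions.

#include <cstddef>
#include <vector>

// y[i] = a * x[i] + y[i]: identical, independent work per element.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = a * x[i] + y[i];
}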

44

The ability to use a feature in a program without regard to other features being used elsewhere
• Issues:
• Incompatibility
• Inability to support hierarchical composition (nesting)

• Oversubscription: situation in nested parallelism in which a very large number of threads are created
• Can lead to failure, inefficiency, inconsistency

Composability (1.5.4)

45

[Diagram: one process with shared Data; threads Thr1, Thr2, and Thr3 each have their own Stack]

• Thread: smallest sequence of program instructions that can be managed independently
• Program → Processes → Threads
• “Cheap” context switch
• Multiple processors

Thread (p. 387) / Process

46

• Portable: the ability to run on a variety of HW with little adjustment
• Very desirable; C, C++, Java are portable languages

• Performance Portability: the ability to maintain performance levels when run on a variety of HW
• Trade-off: General/Portable vs. Specific/Performance

Portability & Performance (1.5.5 & 1.5.6)

47

• Determinism vs. Non-determinism
• Safety – ensuring only correct orderings occur
• Serially Consistent
• Maintainability

Issues (1.5.7)

48

Exam 1 on Chapter 1
