
Page 1:

©Jesper Larsson Träff WS16/17

Introduction to Parallel Computing: Introduction, problems, models, performance

Jesper Larsson Träff [email protected]

TU Wien Parallel Computing

Page 2:

Mobile phones: „dual core“, „quad core“ (2012?), …

June 2012: IBM BlueGene/Q, 1,572,864 cores, 16 PF

“Never mind that there's little-to-no software that can take advantage of four processing cores, Xtreme Notebooks has released the first quad-core laptop in the U.S.” (2007)

June 2016: Sunway TaihuLight, 10,649,600 cores, 93 PFLOPS (125 PFLOPS peak)

Parallelism everywhere! How to use all these resources? Limits?

…octa-core (2016, Samsung Galaxy S7)

Practical Parallel Computing

Page 3:

As per ca. 2010: No sequential computer systems anymore (almost): multi-core, GPU/accelerator enhanced, …

Why?

June 2011: Fujitsu K, 705024 cores, 11PF

Practical Parallel Computing

Page 4:

June 2011: Fujitsu K, 705024 cores, 11PF

As per 2010: either a) all applications must be parallelized, or b) there must be enough independent applications that can run at the same time

Challenge: How do I speed up windows…?

Practical Parallel Computing

Page 5:

“I spoke to five experts in the course of preparing this article, and they all emphasized the need for the developers who actually program the apps and games to code with multithreaded execution in mind”

7 myths about quad-core phones (Smartphones Unlocked) by Jessica Dolcourt April 8, 2012 12:00 PM PDT

The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. By Herb Sutter. The biggest sea change in software development since the OO revolution is knocking at the door, and its name is Concurrency. This article appeared in Dr. Dobb's Journal, 30(3), March 2005

…many similar quotations can (still) be found

Practical Parallel Computing

Page 6:

The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. By Herb Sutter. The biggest sea change in software development since the OO revolution is knocking at the door, and its name is Concurrency. This article appeared in Dr. Dobb's Journal, 30(3), March 2005

“Free lunch” (as in “There is no such thing as a free lunch”): exponential increase in single-core performance (18-24 month doubling rate, Moore's “Law”); no software changes were needed to exploit the faster processors

Page 7:

The “free lunch” was over… ca. 2005

• Clock speed limit around 2003
• Power consumption limit from 2000
• Instructions/clock limit in the late 1990s

But: The number of transistors per chip can continue to increase (>1 billion)

Solution(?): Put more cores on the chip, the “multi-core revolution”

Kunle Olukotun (Stanford), ca. 2010

Page 8:

Parallelism challenge: Solve problem p times faster on p (slower, more energy efficient) cores

Kunle Olukotun (Stanford), ca. 2010

The “free lunch” was over… ca. 2005

• Clock speed limit around 2003
• Power consumption limit from 2000
• Instructions/clock limit in the late 1990s

Page 9:

[Figure: microprocessor trend data; Kunle Olukotun, 2010; Karl Rupp (TU Wien), 2015]

Page 10:

Single-core performance does still increase, somewhat, but much more slowly… [Figure: Henk Poley, 2014, www.preshing.com]

Average, normalized, SPEC benchmark numbers, see www.spec.org

Page 11:

2006: Factor 10³-10⁴ increase in integer performance over a 1978 high-performance processor

Single-core processor development (the “free lunch”) made life very difficult for parallel computing in the 1990s

From Hennessy/Patterson, Computer Architecture

Page 12:

Architectural ideas driving sequential performance increase

• Deep pipelining
• Superscalar execution (>1 instruction/clock) through multiple functional units
• Data parallelism through SIMD units (old HPC idea: vector processing), same instruction on multiple data per clock
• Out-of-order execution, speculative execution
• Branch prediction
• Caches (pre-fetching)
• Simplified/better instruction sets (better for the compiler)
• SMT/Hyperthreading

Increase in clock frequency (“technological advances”, factor 20-200?) alone does not explain the performance increase.

Mostly fully transparent, at most the compiler needs to care: the “free lunch”

Page 13:

• Very diverse parallel architectures
• No single, commonly agreed upon abstract model (for designing and analyzing algorithms)

Many different programming paradigms (models), different programming frameworks (interfaces, languages)

• Has been so
• Is still so (largely)

Page 14:

Theoretical Parallel Computing

Parallelism was always there! What can be done? Limits?

Assume p processors instead of just one, reasonably connected (memory, network, …)
• How much faster can some given problem be solved? Can some problems be solved better?
• How? New algorithms? New techniques?
• Can all problems be solved faster? Are there problems that cannot be solved faster with more processors?
• Which assumptions are reasonable?
• Does parallelism give new insights into the nature of computation?

Page 15:

Sequential computing:
Algorithm in model → Concrete program (C, C++, Java, Haskell, Fortran, …) → Concrete architecture
…is difficult. Analysis in model (often) has some relation to concrete execution.

vs. Parallel computing:
Algorithm in model A, model B, …, model Z → Concrete program: different paradigms (MPI, OpenMP, Cilk, OpenCL, …) → Concrete architecture A, …, concrete architecture Z

Page 16:

Parallel computing:
Algorithm in model A, model B, …, model Z → Concrete program: different paradigms (MPI, OpenMP, Cilk, OpenCL, …) → Concrete architecture A, …, concrete architecture Z
…is extremely difficult. Analysis in model may have little relation to concrete execution.

Huge gap: no “standard” abstract model (e.g., RAM).

Page 17:

Parallel computing: Extremely difficult. Analysis in model may have little relation to concrete execution.
Algorithm in model A, model B, …, model Z → Concrete program: different paradigms (MPI, OpenMP, Cilk, OpenCL, …) → Concrete architecture A, …, concrete architecture Z

Challenges:
• Algorithmic: Not all problems seem to be easily parallelizable
• Portability: Support for the same language/interface on different architectures (e.g., MPI in HPC)
• Performance portability (?)

Page 18:

Parallel computing:
Algorithm in model A, model B, …, model Z → Concrete program: different paradigms (MPI, OpenMP, Cilk, OpenCL, …) → Concrete architecture A, …, concrete architecture Z

Elements of parallel computing:
• Algorithmic: Find the parallelism
• Linguistic: Express the parallelism
• Practical: Validate the parallelism (correctness, performance)
• Technical challenge: run-time/compiler support for the paradigm

Page 19:

Parallel computing: Accomplish something with a coordinated set of processors under control of a parallel algorithm

Why study parallel computing?
• It is inevitable: multi-core revolution, GPGPU paradigm, …
• It's interesting, challenging, highly non-trivial – full of surprises
• Key discipline of computer science (von Neumann; golden theory decade: 1980 to early 1990s)
• It's ubiquitous (gates, architecture: pipelines, ILP, TLP; systems: operating systems, software), not always transparent
• It's useful: large, extremely computationally intensive problems, Scientific Computing, HPC
• …

Page 20:

Parallel computing: The discipline of efficiently utilizing dedicated parallel resources (processors, memories, …) to solve a single, given computational problem.

Specifically: Parallel resources with significant inter-communication capabilities, for problems with non-trivial communication and computational demands

Buzz words: tightly coupled, dedicated parallel system; multi-core processor, GPGPU, High-Performance Computing (HPC), …

Focus on properties of solution (time, size, energy, …) to given, individual problem

Page 21:

Distributed computing: The discipline of making independent, non-dedicated resources available and cooperative toward solving specified problem complexes.

Buzz words: internet, grid, cloud, agents, autonomous computing, mobile computing, …

Typical concerns: correctness, availability, progress, security, integrity, privacy, robustness, fault tolerance, …

Page 22:

Concurrent computing: The discipline of managing and reasoning about interacting processes that may (or may not) progress simultaneously

Buzz words: operating systems, synchronization, interprocess communication, locks, semaphores, autonomous computing, process calculi, CSP, CCS, pi-calculus, …

Typical concerns: correctness (often formal), e.g. deadlock-freedom, starvation-freedom, mutual exclusion, fairness

Page 23:

Parallel vs. Concurrent computing

[Diagram: several processes, via a processor, access a shared resource: memory (locks, semaphores, data structures), a device, …]

Concurrent computing: Focus on coordination of access to/usage of shared resources (to solve a given, computational problem)

Given problem: specification, algorithm, data

Adapted from Madan Musuvathi

Page 24:

[Diagram: a problem (specification, algorithm, data) divided into subproblems, each solved by a dedicated processor]

Parallel computing: Focus on dividing a given problem (specification, algorithm, data) into subproblems that can be solved by dedicated processors (in coordination)

Coordination: synchronization, communication

Page 25:

The “problem” of parallelization

[Diagram: problem divided into subproblems]

How to divide a given problem into subproblems that can be solved in parallel?
• Specification
• Algorithm?
• Data?

• How is the computation divided? Coordination necessary? Does the sequential algorithm help/suffice?
• Where are the data? Is communication necessary?

Page 26:

Aspects of parallelization

• Algorithmic: How to divide the computation into independent parts that can be executed in parallel? What kinds of shared resources are necessary? Which kinds of coordination? How can overheads be minimized (redundancy, coordination, synchronization)?
• Scheduling/Mapping: How are independent parts of the computation assigned to processors?
• Load balancing: How can independent parts be assigned to processors such that all resources are utilized efficiently?
• Communication: When must processors communicate? How?
• Synchronization: When must processors agree/wait?

Page 27:

• Linguistic: How are the algorithmic aspects expressed? Concepts (programming model) and concrete expression (programming language, interface, library)
• Pragmatic/practical: How does the actual, parallel machine look? What is a reasonable, abstract model?
• Architectural: Which kinds of parallel machines can be realized? How do they look?

Page 28:

Levels of parallelization

• Architecture: gates, logical and functional units (not here)
• Computational: Instruction Level Parallelism (ILP), SIMD (not here)
• Functional, explicit: threads (possibly concurrent, sharing architecture resources), cores, processors: “Parallel computing” (this lecture)
• Large-scale: coarse-grained tasks, multi-level, coupled application parallelism (not here)

Page 29:

Can’t we just leave it all to the compiler/hardware?

Automatic parallelization(?)

Input:
• “High-level” problem specification, or
• Sequential program
Output: efficient code that can be executed on a parallel, multi-core processor

Successful only to a limited extent:
• Compilers cannot invent a different algorithm
• Hardware parallelism is not likely to go much further

Samuel P. Midkiff: Automatic Parallelization: An Overview of Fundamental Compiler Techniques. Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers 2012

Page 30:

Explicitly parallel programming (today):

• Explicitly parallel code in some parallel language
• Support from parallel libraries
• (Domain-specific languages)
• Compiler does as much as the compiler can do

Lots of interesting problems and tradeoffs, an active area of research

Page 31:

Some „given, individual problems“ for this lecture

Problem 1: Matrix-vector multiplication: given an n×m matrix A and an m-element (column) vector v, compute the n-element (column) vector u = Av

[Diagram: u = A·v]

u[i] = ∑_{0≤j<m} A[i,j] v[j]

Dimensions n,m large: obviously some parallelism

How to ∑ in parallel? Access to vector?
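One way the row-wise parallelism could be expressed (a minimal sketch, assuming C with OpenMP and row-major storage; not part of the original slides): each of the n rows is an independent subproblem, and only the summation within a row is sequential.

  /* Sketch: row-parallel matrix-vector multiplication u = A*v,
     A stored row-major as a[i*m + j]; hypothetical helper function */
  void matvec(int n, int m, const double *a, const double *v, double *u)
  {
      #pragma omp parallel for            /* rows are independent */
      for (int i = 0; i < n; i++) {
          double sum = 0.0;               /* per-row sum is local: no conflicts */
          for (int j = 0; j < m; j++)
              sum += a[i*m + j] * v[j];
          u[i] = sum;
      }
  }

All processors read the whole vector v; how v is accessed (shared or replicated) is exactly the question raised above.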

Page 32:

Problem 1a: Matrix-matrix multiplication: given an n×k matrix A and a k×m matrix B, compute the n×m matrix product C = AB

[Diagram: C = A·B]

C[i,j] = ∑_{0≤l<k} A[i,l] B[l,j]
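A similar sketch (again an assumption, C with OpenMP, not from the slides): all n·m entries of C are independent, so both outer loops can be parallelized.

  /* Sketch: parallel matrix-matrix multiplication C = A*B, row-major
     storage (a[i*k + l], b[l*m + j], c[i*m + j]); hypothetical helper */
  void matmul(int n, int k, int m, const double *a, const double *b, double *c)
  {
      #pragma omp parallel for collapse(2)   /* all C[i,j] are independent */
      for (int i = 0; i < n; i++)
          for (int j = 0; j < m; j++) {
              double sum = 0.0;
              for (int l = 0; l < k; l++)
                  sum += a[i*k + l] * b[l*m + j];
              c[i*m + j] = sum;
          }
  }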

Page 33:

Problem 1b: Solving sets of linear equations. Given matrix A and vector b, find x such that Ax = b

Preprocess A such that a solution to Ax = b can easily be found for any b (LU factorization, …)

Page 34:

Problem 2: Stencil computation: given an n×m matrix A, update as follows, and iterate until some convergence criterion is fulfilled

[Diagram: 5-point stencil around A[i,j]]

iterate {
    for all (i,j) {
        A[i,j] <- (A[i-1,j] + A[i+1,j] + A[i,j-1] + A[i,j+1])/4
    }
} until (convergence)

with suitable handling of the matrix border

Looks well-behaved, “embarrassingly parallel”? Data distribution? Conflicts on updates?

Use: discretization and solution of certain partial differential equations (PDEs); image processing; …

5-point stencil in 2d
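One common way to avoid conflicts on updates is a Jacobi-style iteration that reads the old matrix and writes a new one (a sketch under that assumption, C with OpenMP; the slide does not prescribe this):

  /* Sketch: one Jacobi-style sweep of the 5-point stencil; reading 'old'
     and writing 'nxt' avoids update conflicts (interior points only,
     border handled separately); hypothetical helper function */
  void stencil_sweep(int n, int m, const double *old, double *nxt)
  {
      #pragma omp parallel for collapse(2)
      for (int i = 1; i < n-1; i++)
          for (int j = 1; j < m-1; j++)
              nxt[i*m + j] = (old[(i-1)*m + j] + old[(i+1)*m + j] +
                              old[i*m + (j-1)] + old[i*m + (j+1)]) / 4.0;
      /* caller swaps old/nxt and checks the convergence criterion */
  }

The data distribution (e.g., by blocks of rows) determines which boundary rows must be exchanged between processors.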

Page 35:

Problem 3: Merging two sorted arrays of size n and m into a sorted array of size n+m

Problem 4: Computation of all prefix-sums: given an array A of size n of elements of some type S with an associative operation +, compute for all indices 0≤i<n the prefix sums B[i] = ∑_{0≤j<i} A[j] in array B

Easy to do sequentially, but sequential algorithm looks… sequential

Implies solution to problem of computing just ∑. Parallel?

A: 1 2 3 4 5 6 7 …

B: x 1 3 6 10 15 21 28 … All prefix-sums
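The straightforward sequential computation (a sketch of the exclusive prefix sums defined above, in C) shows why the problem “looks sequential”: each output depends on the previous one.

  /* Sketch: sequential (exclusive) prefix sums; B[0] is the empty sum
     (the "x" above). The loop-carried dependence B[i] = B[i-1] + A[i-1]
     is what a parallel algorithm must break. */
  void prefix_sums(int n, const double *A, double *B)
  {
      B[0] = 0.0;
      for (int i = 1; i < n; i++)
          B[i] = B[i-1] + A[i-1];
  }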

Page 36:

Problem 5: Sorting a sequence of objects (“reals”, integers, objects with an order relation) stored in an array

Hopefully the parallel merge solution can be of help. Other approaches? Quicksort? Integers (counting, bucket, radix sort)?

Page 37:

Parallel computing as a (theoretical) CS discipline

(Traditional) objective: Solve a given computational problem faster

• Faster than what?
• What is a reasonable “model” for parallel computing?
• How fast can a given problem be solved? How many resources can be productively exploited?
• Are there problems that cannot be solved in parallel? Fast? At all?
• …

Page 38:

Architecture model: Abstraction of the important modules of a computational system (processors, memories), their interconnection and interaction.
Computational/execution model: (Formal) framework for the design and analysis of algorithms for the computational system; cost model.

[Diagram: processor P with memory M]

Example: RAM (Random-Access Machine)

Processor (ALU, PC, registers) capable of executing instructions stored in memory on data in memory. Execution of an instruction, access to memory: first assumption, unit cost.

Realistic?

Useful?

Page 39:

[Diagram: processor P with memory M]

Example: RAM (Random-Access Machine)

• Memory: load, store
• Arithmetic (integer, real numbers)
• Logic: and, or, xor, shift
• Branch, compare, procedure call

Traditional RAM: Same unit cost for all operations


Page 40:

[Diagram: processor P with memory M]

Example: RAM (Random-Access Machine)

Aka von Neumann architecture or stored program computer

John von Neumann (1903-1957), Report on EDVAC, 1945; also Eckert & Mauchly, ENIAC


Page 41:

[Diagram: processor P with memory M]

Example: RAM (Random-Access Machine)

„von Neumann bottleneck“: Program and data separate from CPU, processing rate limited by memory rate.


Page 42:

[Diagram: processor P with memory M]

Example: RAM (Random-Access Machine)

John W. Backus: Can Programming Be Liberated From the von Neumann Style? A Functional Style and its Algebra of Programs. Commun. ACM 21(8): 613-641 (1978), Turing Award Lecture, 1977


Page 43:

[Diagram: processor P with cache and memory M]

Example: RAM (Random-Access Machine) with cache/memory hierarchy


Different (NOT unit) cost model: some memory accesses are cheaper than others

Page 44:

[Diagram: processor P connected to multiple memory banks M, M, M, M]

Increased memory rate, vector computer, ALU operates on vectors instead of scalars


Page 45:

[Diagram: processors P, P, P, P connected to a shared memory M]

Shared-memory model, bus based


Page 46:

[Diagram: processors P, P, P, P connected to a shared memory M]

Shared-memory model (network based, emulated)


Page 47:

[Diagram: processors P, P, P, P connected to a shared memory M]

Synchronous, shared-memory model

Processors in lock-step, uniform memory access time = instruction time: Parallel RAM (PRAM)

Realistic?

Useful?


Page 48:

[Diagram: processors P, P, P, P connected to a shared memory M]

Synchronous shared-memory model

The PRAM was the main theoretical model from the late 1970s throughout the 1980s; interest was lost around 1993. Very important analysis/idea tool


Page 49:

[Diagram: processors P, P, P, P connected via a network to memory banks M, M, M, …]

Shared-memory model, banked, network based, not synchronous


Page 50:

UMA (Uniform Memory Access): Access time to memory location is independent of location and accessing processor, e.g., O(1), O(log M), …

NUMA (Non-Uniform Memory Access): Access time dependent on processor and location. Locality: some locations can be accessed faster by a processor than others

[Diagrams: UMA with processors P, P, P, P sharing a single memory M; NUMA with processors P, P, P, P and multiple memory banks M, M, M, …]

Page 51:

Architecture model defines resources, describes:
• Composition of processor, functional units: ALU, FPU, registers, w-bit words vs. unlimited, vector unit (MMX, SSE, AVX)
• Types of instructions
• Memory system, caches
• …

Execution model/cost model specifies:
• How instructions are executed
• (Relative) cost of instructions, memory accesses
• …

Level of detail/formality dependent on purpose: What is to be studied (complexity theory, algorithms design, …), what counts (instructions, memory accesses, …)

Page 52:

[Diagram: processors P, P, P, P, each with a local memory M, connected by a communication network]

Distributed memory model


Page 53:

Parallel architecture model defines:
• Synchronization between processors
• Synchronization operations
• Atomic operations, shared resources (memory, registers)
• Communication mechanisms: network topology, properties
• Memory: shared, distributed, hierarchical, …
• …

Cost model defines:
• Cost of synchronization, atomic operations
• Cost of communication (latency, bandwidth, …)
• …

Page 54:

Different parallel architecture model: Cellular automaton, systolic array, … : simple processors without memory (finite state automata, FSA), operate in lock step on (potentially infinite) grid, local communication only

[John von Neumann, Arthur W. Burks: Theory of Self-Reproducing Automata, 1966] [H. T. Kung: Why systolic architectures? IEEE Computer 15(1): 37-46, 1982]

State of cell (i,j) in the next step is determined by:
• Its own state
• The state of its neighbors in some neighborhood, e.g., (i,j-1), (i+1,j), (i,j+1), (i-1,j)

Page 55:

Flynn's taxonomy: Orthogonal classification of (parallel) architectures/models along two axes, the instruction stream and the data stream:

• SISD: Single Instruction, Single Data
• SIMD: Single Instruction, Multiple Data
• MISD: Multiple Instruction, Single Data
• MIMD: Multiple Instruction, Multiple Data

[M. J. Flynn: Some computer organizations and their effectiveness. IEEE Trans. Comp. C-21(9):948-960, 1972]

Page 56:

• SISD: Single processor, single stream of instructions, operates on a single stream of data. Sequential architecture (e.g., RAM)
• SIMD: Single processor, single stream of operations, operates on multiple data per instruction. Example: traditional vector computer, PRAM (some variants)
• MISD: Multiple instructions operate on a single data stream. Example: pipelined architectures, streaming architectures(?), systolic arrays (1970s architectural idea). Some say this class is empty
• MIMD: Multiple instruction streams, multiple data streams

Page 57:

Typical instances:
• SISD: a single processor P with a memory M
• SIMD: a single processor P operating on multiple memory banks M, M, M, …
• MIMD: multiple processors P, P, P, P with a shared memory M, or with local memories M, M, M, … connected by a communication network

Page 58:

Programming model: Abstraction, close to programming language, defining parallel resources, management of parallel resources, parallelization paradigms, memory structure, memory model, synchronization and communication features, and their semantics

Parallel programming language, or library („interface“) is the concrete implementation of one (or more: multi-modal, hybrid) parallel programming model(s)

Cost of operations: May be specified with programming model; often by architecture/computational model

Execution model: when and how parallelism in programming model is effected

Page 59:

Parallel programming model defines, e.g.:
• Parallel resources, entities, units: processes, threads, tasks, …
• Expression of parallelism: explicit or implicit
• Level and granularity of parallelism
• Memory model: shared, distributed, hybrid
• Memory semantics (“when operations take effect/become visible”)
• Data structures, data distributions
• Methods of synchronization (implicit/explicit)
• Methods and modes of communication

Page 60:

Examples:

1. Threads, shared memory, block distributed arrays, fork-join parallelism
2. Distributed memory, explicit message passing, collective communication, one-sided communication (“RDMA”), PGAS(**)
3. Data parallel SIMD, SPMD(*)
4. …

(*) SPMD: Single Program, Multiple Data
(**) PGAS: Partitioned Global Address Space – not in this lecture

Concrete libraries/languages: pthreads, OpenMP, MPI, UPC, TBB, …

Page 61:

Same Program Multiple Data (SPMD)

[F. Darema et al.: A single-program-multiple-data computational model for EPEX/FORTRAN, 1988]

Restricted MIMD model: All processors execute the same program
• May do so asynchronously; different processors may be in different parts of the program at any given time
• The same objects (procedures, variables) exist for all processors; “remote procedure call”, “active messages”, “remote-memory access” make sense

Programming model concepts (not this lecture): active messages, remote procedure call, … (MPI: RMA/one-sided communication)
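A minimal SPMD illustration (an assumed example, not from the slides), here with MPI in C: every process runs the same program and uses its rank to select its part of the work.

  /* Sketch: SPMD style with MPI; the same program on every process,
     behavior differentiated only by the rank */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int rank, size, n = 1000000;            /* n: assumed problem size */
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I? */
      MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */
      int chunk = (n + size - 1) / size;      /* block of work per process */
      int lo = rank * chunk;
      int hi = (lo + chunk < n) ? lo + chunk : n;
      printf("process %d of %d handles indices [%d,%d)\n", rank, size, lo, hi);
      MPI_Finalize();
      return 0;
  }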

Page 62:

[Diagram: layered view]
Programming language/library/interface/paradigm (e.g., OpenMP, MPI, Cilk)
Programming model
Architecture model
“Real” hardware

Different architecture models can realize a given programming model. Closer fit: more efficient use of the architecture.

Challenge: A programming model that is useful and close to “realistic” architecture models, to enable realistic analysis/prediction.

Challenge: A language that conveniently realizes the programming model (algorithms support, “run-time”).

Page 63:

Examples:

OpenMP: programming interface/language for the shared-memory model, intended for shared-memory architectures; “data parallel”. Can be implemented with DSM (Distributed Shared Memory) on distributed memory architectures, but performance has usually not been good. Requires a DSM implementation/algorithms.

MPI: interface/library for the distributed memory model, can be used on shared-memory architectures, too. Needs algorithmic support (e.g., “collective operations”).

Cilk: language (extended C) for the shared-memory model, for shared-memory architectures; “task parallel”. Needs a run-time (e.g., “work-stealing”).

Page 64:

Performance: Basic observations and goals

Model: p dedicated parallel processors collaborate to solve given problem of input size n. We will be interested in worst-case complexities (execution time)

• Processors can work independently (local memory, program)
• Communication and coordination incur some overhead

Page 65:

Challenge and one main goal of parallel processing (… not to make life easy):

Be faster than a commonly used (good, best known, best possible…) sequential algorithm and implementation, utilizing the p processors efficiently.

Page 66:

Tseq(n): time for 1 processor to solve problem of size n

Tpar(p,n): time for p processors to solve problem of size n

Sp(n) = Tseq(n)/Tpar(p,n)

Speedup measures the gain in moving from sequential to parallel computation (note: parameters p and n)

Main goal: Speeding up computations by parallel processing

Goal: Achieve as large speed-up as possible
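A small worked example (numbers assumed for illustration): if Tseq(n) = 120 s and 8 processors need Tpar(8,n) = 20 s, then S8(n) = 120/20 = 6, i.e., a speedup of 6 on 8 processors, short of the perfect speedup of 8.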

Page 67:

What is “time” (number of instructions)? What exactly are Tseq(n) and Tpar(p,n)?

- Time for some algorithm for solving the problem?
- Time for a specific algorithm for solving the problem?
- Time for the best known algorithm for the problem?
- Time for the best possible algorithm for the problem?
- Time for a specific input of size n, average case, worst case, …?
- Asymptotic time, large n, large p?
- Do constants matter, e.g., O(f(p,n)) or 25n/p + 3 ln(4(p/n)) … ?

Page 68:

Tseq(n):
• Theory: Number of instructions (or other critical cost measure) to be executed in the worst case for inputs of size n
• The number of instructions carried out is often termed WORK
• Practice: Measured time (or other parameter) of execution over some inputs (experiment design)

Choose sequential algorithm (theory), choose an implementation of this algorithm (practice)

Theory and practice: Always state baseline sequential algorithm/implementation

Page 69:

Examples (theory):

Tseq(n) = O(n): finding the maximum of n numbers in an unsorted array; prefix sums
Tseq(n,m) = O(n+m): merging two sequences; BFS/DFS in a graph
Tseq(n) = O(n log n): comparison-based sorting
Tseq(n,m) = O(n log n + m): Single-Source Shortest Path (SSSP)
Tseq(n) = O(n³): matrix multiplication, input two n×n matrices
…

Standard, worst-case, asymptotic complexities

Cormen, Leiserson, Rivest, Stein: Introduction to Algorithms. 3rd ed., MIT Press, 2009

Can be solved in o(n³), Strassen etc.

Page 70:

New issue with modern processors: Is time always the same thing? Clock frequency may not be constant; e.g., it can depend on the load of the system, an energy cap, “turbo mode”, etc. Such factors are difficult to control.

Practice:
• Construct good input examples to measure running time Tseq(n); experimental methodology
• Worst case not always possible, not always interesting; best case?
• Experimental methods to get stable, accurate Tseq(n): repeat measurements many times (rule of thumb: average over at least 30 repetitions)

Experimental science: Always some assumptions about repeatability, regularity, determinism, …

Page 71:

Parallel performance in theory

Definition: Let Tseq(n) be the (worst-case) time of the best possible/best known sequential algorithm, and Tpar(p,n) the (worst-case) time of the parallel algorithm. The absolute speed-up of Tpar(p,n) with p processors over Tseq(n) is

Sp(n) = Tseq(n)/Tpar(p,n)

Observation (proof follows): Best-possible, absolute speed-up is linear in p

Goal: Obtain (linear) speedup for as large p as possible (as function of problem size n), for as many n as possible

Page 72:

Definition: T∞(n): the smallest possible running time of the parallel algorithm given arbitrarily many processors. By definition T∞(n) ≤ Tpar(p,n) for all p. Speedup is limited by

Sp(n) = Tseq(n)/Tpar(p,n) ≤ Tseq(n)/T∞(n)

Definition: Tseq(n)/T∞(n) is called the parallelism of the algorithm. The parallelism is the largest number of processors that can be employed and still give linear speedup (if Tseq(n)/T∞(n) < p, the bound above caps the speedup strictly below p).
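A standard illustration (not from this slide): summing n numbers with a balanced tree of additions has Tseq(n) = O(n) and T∞(n) = O(log n), so the parallelism is Tseq(n)/T∞(n) = O(n/log n); employing many more than n/log n processors cannot yield linear speedup.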

Page 73:

For speedup (and other complexity measures), distinguish:
• The problem G to be solved (mathematical specification)
• Some algorithm A to solve G
• The best possible (lower bound) algorithm A* for G, the best known algorithm A+ for G: the complexity of G
• An implementation of A on some machine M

Page 74:

[Diagram: the problem, the sum of two n-element vectors, divided into p subproblems, each the sum of n/p elements]

Example: “data parallel” (SIMD) computation

for (i=0; i<n; i++) { a[i] = b[i]+c[i]; }

Algorithm/ implementation:

Tpar(p,n) = Tseq(n)/p, speedup Sp(n) = p

Best possible parallelization: sequential work divided evenly across p processors

Complexity: Tseq(n) = Θ(n)
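A sketch of how this data-parallel loop could be divided over the p processors (assuming C with OpenMP for illustration; the slide does not prescribe a framework):

  /* Sketch: the loop from the slide, with each thread handling a
     contiguous block of about n/p iterations */
  void vec_add(int n, double *a, const double *b, const double *c)
  {
      #pragma omp parallel for schedule(static)
      for (int i = 0; i < n; i++)
          a[i] = b[i] + c[i];
  }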

Page 75:

[Diagram: the problem, the sum of two n-element vectors, divided into p subproblems, each the sum of n/p elements]

Example: “data parallel” (SIMD) computation

for (i=0; i<n; i++) { a[i] = b[i]+c[i]; }

• Perfect speedup
• “Embarrassingly parallel”
• “Pleasantly parallel”

Tpar(p,n) = c(n/p) for constant c≥1

Algorithm/ implementation:

Page 76:

Work

[Diagram: a single bar of length Tseq(n), from start to stop]

The work, measured in instructions and/or time, that has to be carried out for a problem of size n

Page 77:

[Diagram: the sequential work Tseq(n) divided evenly into p equal bars]

Perfect parallelization: Sequential work evenly divided between p processors, no overhead, so Tpar(p,n) = Tseq(n)/p

Perfect speedup Sp(n) = Tseq(n)/(Tseq(n)/p) = p

… and very rare in practice

Page 78:

[Diagram: p bars of different heights Ti(n); all start at the same time, the slowest determines the stop time]

Sequential work is unevenly divided between the p processors: load imbalance, Tpar(p,n) > Tseq(n)/p, even though ∑Ti(n) = Tseq(n)

Tpar is the time for the slowest processor to complete; all processors are assumed to start at the same time

Define Tpar(p,n) = max Ti(n) over all processors
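A small numerical illustration (numbers assumed): with Tseq(n) = 100 and p = 4, a division into Ti = 30, 25, 25, 20 still sums to 100, but Tpar(4,n) = max Ti = 30 > Tseq(n)/p = 25, so the speedup is only 100/30 ≈ 3.3 instead of 4.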

Page 79:

[Diagram: p bars Ti(n) inside a rectangle of area C(n) = p*Tpar(p,n)]

Wpar(n) = ∑Ti(n) is called the work of the parallel algorithm = total number of instructions performed by the p processors

The product C(n) = p*Tpar(p,n) is called the cost of the parallel algorithm: total time in which the p processors are reserved (and have to be paid for)

Page 80:

“Theorem:” Perfect speedup Sp(n) = p is best possible and cannot be exceeded

“Proof”: Assume Sp(n) > p for some n. Tseq(n)/Tpar(p,n) > p implies Tseq(n) > p*Tpar(p,n). A better sequential algorithm could be constructed by simulating the parallel algorithm on a single processor. The instructions of the p processors are carried out in some correct order, one after another on the sequential processor. This contradicts that Tseq(n) was best possible/known time.

Reminder: Speedup is calculated (measured) relative to “best” sequential implementation/algorithm

Page 81:

[Diagram: the p bars Ti(n), to be simulated on a single processor]

By assumption C(n) = p*Tpar(p,n) < Tseq(n)

Simulation A: one step of P1, one step of P2, …, one step of P(p-1), one step of P1, …, for C(n) iterations

Simulation B: steps of P1 until communication/synchronization, steps of P2 until communication/synchronization, …

Page 82:

[Diagram: the p work pieces Ti(n) executed one after another on 1 processor]

By assumption C(n) = p*Tpar(p,n) < Tseq(n)

Both simulations yield a new, sequential algorithm with time Tsim(n) < Tseq(n). This contradicts that Tseq(n) was the time of the best possible/best known sequential algorithm.

Page 83:

Lesson: Parallelism offers only „modest potential“, speed-up cannot be more than p on p processors

Lawrence Snyder: Type architecture, shared memory and the corollary of modest potential. Annual Review of Computer Science, 1986

The construction shows that the total parallel work must be at least as large as the sequential work Tseq; otherwise, a better sequential algorithm could be constructed.

Crucial assumptions: Sequential simulation possible (enough memory to hold problem and state of parallel processors), sequential memory behaves as parallel memory, … NOT TRUE for real systems and real problems

Page 84:

[Diagram: parallel execution simulated on 1 processor]

Aside: Such simulations are actually sometimes done, and can be very useful to understand (model) and debug parallel algorithms.

Some such simulation tools:
• SimGrid (INRIA)
• LogGOPSim (Hoefler et al.)
• …

Possible bachelor thesis subject

Page 85:

Wpar(n) = ∑Ti(n) is called the parallel work of the parallel algorithm = total number of instructions performed by some number of processors

The product C(n) = p*Tpar(p,n) is called the cost of the parallel algorithm: Total time in which the p processors are occupied

Definition: Parallel algorithm is called cost-optimal if C(n) = O(Tseq(n)). A cost-optimal algorithm has linear (perhaps perfect) speedup

Definition: Parallel algorithm is called work-optimal if Wpar(n) = O(Tseq(n)). A work-optimal algorithm has potential for linear speedup (for some number of processors)

Page 86:

Proof (linear-speed up of cost-optimal algorithm): Given cost-optimal parallel algorithm with p*Tpar(p,n) = c*Tseq(n) = O(Tseq(n))

This implies Tpar(p,n) = c*Tseq(n)/p, so Sp(n) = Tseq(n)/Tpar(p,n) = p/c

The constant factor c captures the overheads and load imbalance (see later) of the parallel algorithm relative to best sequential algorithm. The smaller c, the closer the speedup to perfect

Page 87:

[Diagram: p bars Ti(n)]

Given a work-optimal parallel algorithm, ∑Ti(n) = Tseq(n), with Tpar(n) = max Ti(n)

Page 88:

[Diagram: the same work items executed on a smaller number of processors]

Given a work-optimal parallel algorithm, ∑Ti(n) = Tseq(n), with Tpar(n) = max Ti(n)

Execute on a smaller number of processors, such that ∑Ti(n) = p*Tpar(p,n) = O(Tseq(n))

Page 89:

Proof idea (a work-optimal algorithm can have linear speed-up):
1. Work-optimal algorithm
2. Schedule work items Ti(n) on p processors, such that p*Tpar(p,n) = O(Tseq(n))
3. With this number of processors, the algorithm is cost-optimal
4. Cost-optimal algorithms have linear speed-up

Parallel algorithms’ design goal: Work-optimal parallel algorithm with as small Tpar(n) as possible (and therefore large parallelism: many processors can be utilized)

The scheduling in step 2 is possible in principle, but may not be trivial

Page 90:

Example: Non work-optimal algorithm DumbSort with T(n) = O(n²) that can be perfectly parallelized, Tpar(p,n) = O(n²/p)

Well-known that Tseq(n) = O(n log n), with many algorithms and good implementations

Sp(n) = (n log n)/(n²/p) = p (log n)/n

Break-even: When is the parallel algorithm faster than the sequential one?
Tpar(p,n) < Tseq(n): n²/p < n log n, i.e., n/p < log n, i.e., p > n/log n

(Small) linear speedup for fixed n (but not independent of n)

Non work-optimal algorithm: Speed-up decreases with n
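A quick numerical check (illustration only, logarithms base 2): for n = 1024, the break-even condition p > n/log n gives p > 1024/10 ≈ 102, so more than about a hundred processors are needed before DumbSort even matches a good sequential O(n log n) sort, and this threshold grows with n.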

Page 91:

The best known/best possible sequential algorithm is often difficult to parallelize:
- no redundant work (that could have been done in parallel)
- tight dependencies (that force things to be done one after another)

Lesson: Usually it does not make sense to parallelize an inferior algorithm (although it is sometimes much easier)

Lesson from much hard work in theory and practice: Parallel solution of a given problem often requires a new algorithmic idea!

But: Many algorithms often have a lot of potential for easy parallelization (loops, independent functions, …), so why not?

Also: a non-work-optimal algorithm can sometimes be useful, e.g., as a subroutine

Page 92:

Parallel performance in practice

Speedup is an empirical quantity, “measured time”, based on experiment (benchmark)

Tseq(n): Running time for “reasonable”, good, best available, sequential implementation, on “reasonable” inputs

Tpar(p,n): parallel running time measured for a number of experiments with different typical (worst-case?) inputs

Sp(n) = Tseq(n)/Tpar(p,n)

Speed-up is typically not independent of problem size n, and problem instance

Page 93:

[Figure: per-processor times Ti and overheads Oi across p processors; Tpar(p,n) vs. Tseq(n); note Tpar(1,n) ≥ Tseq(n)]

Parallelization most often incurs overheads:
•Algorithmic: the parallel algorithm may do more work
•Coordination: communication and synchronization
•…

Page 94:

The ratio

Tpar(1,n)/Tpar(p,n)

is often termed relative speedup, and may express how well the parallel algorithm utilizes p processors (scalability)

Relative speedup NOT to be confused with absolute speedup. Absolute speedup expresses how much can be gained over the best (known/possible) sequential implementation by parallelization. Absolute speed-up is what eventually matters

Note: Literature is not always clear about this distinction. It is easier to achieve and document good relative speedup. Reporting speed-up relative to an inferior, sequential implementation is incorrect!

Page 95:

Absolute vs. relative speedup and scalability

Good scalability and relative speedup, Tpar(1,n)/Tpar(p,n) = Θ(p), for example 0.1p ≤ Tpar(1,n)/Tpar(p,n) ≤ 0.5p, is sometimes reported. But what if Tpar(1,n) = 100 Tseq(n)? Or Tseq(n) = O(n) but Tpar(p,n) = O((n log n)/p + log n)?

Even when Tpar(1,n) = 100 Tseq(n) = O(Tseq(n)), it would take at least 200 processors to be as fast as the sequential algorithm
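The 200-processor figure follows directly from the two assumptions above (relative speedup at most 0.5p, and Tpar(1,n) = 100 Tseq(n)):

\[ T_{\mathrm{par}}(p,n) \ge \frac{T_{\mathrm{par}}(1,n)}{0.5\,p} = \frac{200\,T_{\mathrm{seq}}(n)}{p}, \qquad \frac{200\,T_{\mathrm{seq}}(n)}{p} \le T_{\mathrm{seq}}(n) \iff p \ge 200 \]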

Page 96:

David H. Bailey, "Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers", Supercomputing Review, Aug. 1991, pp. 54-55

Empirical, relative speedup without absolute performance baseline (and comparison to reasonable, sequential algorithm and implementation) is misleading

Page 97:

[Figure: per-processor times Ti and overheads Oi across p processors; Tpar(p,n) vs. Tseq(n); note Tpar(1,n) ≥ Tseq(n)]

If the algorithm is cost-optimal, p*Tpar(p,n) = k*Tseq(n), the speedup becomes imperfect, but is still linear: Sp(n) = p/k

Page 98:

[Figure: per-processor times Ti with overheads Oi and idle time across p processors; Tpar(p,n) vs. Tseq(n); note Tpar(1,n) ≥ Tseq(n)]

NB: This denotes cumulated time (a "profile") over the whole Tpar execution, not a trace. Computation, overhead, and idle time are spread over the whole execution

Page 99:

[Figure: as before, with overhead Oi and idle time marked on further processors]

Page 100:

[Figure: per-processor times Ti with communication/coordination overhead Oi and idle time; Tpar(p,n) vs. Tseq(n)]

Typical overhead is due to communication and coordination. The (smallest) time between coordination periods is called the granularity of the parallel computation

Page 101:

Granularity:
•"Coarse grained" parallel computation/algorithm: the time/number of instructions between coordination intervals (synchronization operations, communication operations) is large (relative to total time or work)
•"Fine grained" parallel computation/algorithm: the time/number of instructions between coordination intervals is small

Page 102:

[Figure: per-processor times Ti with overhead Oi and idle time; Tpar(p,n) vs. Tseq(n)]

Definition: The difference between max Ti(n) and min Ti(n) is the load imbalance. Achieving Ti(n) ≈ Tj(n) for all processors i,j is called load balancing
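A small, hypothetical helper (not part of the lecture code) that computes these quantities from measured per-processor times Ti:

#include <stdio.h>

/* Illustration only: given measured per-processor (busy) times T[0..p-1],
   report Tpar = max Ti, the load imbalance max Ti - min Ti, and the work sum Ti. */
void report_balance(const double *T, int p)
{
    double tmax = T[0], tmin = T[0], sum = 0.0;
    for (int i = 0; i < p; i++) {
        if (T[i] > tmax) tmax = T[i];
        if (T[i] < tmin) tmin = T[i];
        sum += T[i];
    }
    printf("Tpar = %.3f, imbalance = %.3f, work = %.3f\n", tmax, tmax - tmin, sum);
}

int main(void)
{
    double T[4] = { 1.0, 1.2, 0.9, 1.4 };   /* hypothetical measured times */
    report_balance(T, 4);
    return 0;
}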

Page 103:

[Figure: per-processor times Ti with overhead Oi; Tpar(p,n) close to Tseq(n)/p]

The best parallelization has no load imbalance (and no overhead), so Tpar(p,n) = Tseq(n)/p. This is the best possible parallel time

Page 104:

[Figure: Tseq(n) = (s+r)*Tseq(n), split into a sequential part s*Tseq(n) and a parallelizable part r*Tseq(n) over p processors]

Sequential overhead as a constant fraction: Tpar(p,n) ≥ s*Tseq(n) + r*Tseq(n)/p

The maximum speedup becomes severely limited, e.g., if overheads are sequential and a constant fraction

Page 105:

Amdahl's Law (parallel version): Let a program A contain a fraction r that can be "perfectly" parallelized, and a fraction s = (1-r) that is "purely sequential", i.e., cannot be parallelized at all. The maximum achievable speedup is 1/s, independently of n

Proof:

Tseq(n) = (s+r)*Tseq(n)

Tpar(p,n) = s*Tseq(n) + r*Tseq(n)/p

Sp(n) = Tseq(n)/(s*Tseq(n)+r*Tseq(n)/p) = 1/(s+r/p) -> 1/s, for p -> ∞

G. Amdahl: Validity of the single processor approach to achieving large scale computing capabilities. AFIPS 1967
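A minimal numeric check of the bound; the function and the chosen s = 0.1 are illustrative only, not from the slides:

#include <stdio.h>

/* Hypothetical helper (not lecture code): the Amdahl bound S(p) = 1/(s + (1-s)/p)
   for a sequential fraction s. */
static double amdahl_speedup(double s, int p)
{
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void)
{
    double s = 0.1;                          /* assumed sequential fraction */
    for (int p = 1; p <= 1024; p *= 4)
        printf("p=%4d  S(p)=%.2f\n", p, amdahl_speedup(s, p));
    /* S(p) approaches 1/s = 10 as p grows */
    return 0;
}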

Page 106:

Typical victims of Amdahl's law:

•Sequential input/output could be a constant fraction
•Sequential initialization of global data structures
•Sequential processing of "hard-to-parallelize" parts of the algorithm, e.g., shared data structures

Amdahl's law limits speed-up in such cases, if they are a constant fraction of the total time, independent of problem size

Page 107:

Example:

1. Processor 0: read input, some precomputation
2. Processor 0: split the problem into n/p parts, send part i to processor i
3. All processors i: solve part i
4. All processors i: send the partial solution back to processor 0

Typical Amdahl sequential bottleneck: a constant sequential fraction (steps 1 and 2 out of 4) limits the speedup

[Figure annotation: total work 10n, of which 9n is parallelizable]

Amdahl: s = 0.1, speedup at most 10

When interested in the parallel aspects, input/output and problem splitting are often explicitly not measured (see projects)

Page 108:

// Sequential initialization
x = (int*)calloc(n, sizeof(int));
…
// Parallelizable part
do {
  for (i=0; i<n; i++) {
    x[i] = f(i);
  }
  // check for convergence
  done = …;
} while (!done);

Example: K iterations before convergence, (parallel) convergence check cheap, f(i) fast, O(1)…

Tseq(n) = n+K+Kn, Tpar(p,n) = n+K+Kn/p

Sequential fraction ≈ 1/(1+K)

Sp(n) -> 1+K

Page 109:

// Sequential initialization
x = (int*)malloc(n*sizeof(int));
…
// Parallelizable part
do {
  for (i=0; i<n; i++) {
    x[i] = f(i);
  }
  // check for convergence
  done = …;
} while (!done);

Example: K iterations before convergence, (parallel) convergence check cheap, f(i) fast, O(1)…

Tseq(n) = 1+K+Kn, Tpar(p,n) = 1+K+Kn/p

Sequential part ≈ 1/(1+n)

Sp(n) -> p when n > p, n -> ∞

Note: If the sequential part is constant (but not a constant fraction), Amdahl's law does not limit the speedup

Page 110:

[Same code and analysis as the previous slide]

Tseq(n) = 1+K+Kn, Tpar(p,n) = 1+K+Kn/p, sequential part ≈ 1/(1+n), Sp(n) -> p when n > p, n -> ∞

Lesson: Be careful with system functions (calloc, malloc): calloc zero-initializes all n elements and thus adds a sequential Θ(n) part, whereas malloc does not

Page 111:

Avoiding Amdahl: Scaled speedup

The sequential, strictly non-parallelizable part is often not a constant fraction of the total execution time (number of instructions). The sequential part may be constant, or grow only slowly with the problem size n. Thus, to maintain good speedup, the problem size should be increased with p.

Assume Tseq(n) = t(n)+T(n) with sequential part t(n) and perfectly parallelizable part T(n), and assume t(n)/T(n) -> 0 for n -> ∞. Then Tpar(p,n) = t(n)+T(n)/p

Page 112:

Speed-up as a function of p and n:

Sp(n) = (t(n)+T(n))/(t(n)+T(n)/p) = (t(n)/T(n)+1)/(t(n)/T(n)+1/p) -> 1/(1/p) = p, for n -> ∞

Lesson: Depending on how fast t(n)/T(n) converges, linear speed-up can be achieved by increasing the problem size n

Definition: Speedup as a function of p and n, with sequential and parallelizable times t(n) and T(n), is called scaled speed-up
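An assumed instance (not from the slides): with t(n) = √n and T(n) = n,

\[ \frac{t(n)}{T(n)} = \frac{1}{\sqrt{n}} \to 0, \qquad S_p(n) = \frac{\sqrt{n}+n}{\sqrt{n}+n/p} \to p \quad \text{for } n \to \infty \]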

Page 113:

Definition: The efficiency of a parallel algorithm is the ratio of the best possible parallel time to the actual parallel time for given p and n: E(p,n) = (Tseq(n)/p)/Tpar(p,n) = Sp(n)/p = Tseq(n)/(p*Tpar(p,n)), where p*Tpar(p,n) is the cost

Remarks:
•E(p,n) ≤ 1, since Sp(n) = Tseq(n)/Tpar(p,n) ≤ p
•E(p,n) = constant: linear speedup
•Cost-optimal algorithms have constant efficiency

Page 114:

Scalability

Definition: A parallel algorithm/implementation is strongly scaling if Sp(n) = Θ(p) (linear, independent of (sufficiently large) n)

Definition: A parallel algorithm/implementation is weakly scaling if there is a slow-growing function f(p), such that for n = Ω(f(p)), E(p,n) remains constant. The function f is called the iso-efficiency function

"Maintain efficiency by increasing the problem size as f(p) or more"

J. Gustafson: Reevaluating Amdahl's Law. CACM 1988
Ananth Grama, Anshul Gupta, Vipin Kumar: Isoefficiency: measuring the scalability of parallel algorithms and architectures. IEEE Transactions Par. Dist. Computing, 1(3): 12-21 (1993)

Page 115:

[Same code as above: sequential malloc initialization followed by the parallelizable iteration loop]

Example: Assume the (parallel) convergence check takes O(log p) time

Tpar(p,n) = Kn/p + K log p

E(p,n) ≈ Kn/(Kn + pK log p)

For n ≥ p log p, E(p,n) ≥ 1/2

Weakly scalable: n has to increase as O(p log p) to maintain constant efficiency, i.e., O(log p) per processor (if balanced)
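Checking the stated threshold by substitution:

\[ n = p \log p: \qquad E(p,n) = \frac{K\,p\log p}{K\,p\log p + K\,p\log p} = \frac{1}{2} \]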

Page 116:

Examples (Tpar, speedup, optimality, efficiency), for a linear-time computation, Tseq(n) = n (constants ignored):

1. Tpar0(p,n) = n/p + 1 (embarrassingly "data parallel" computation, constant overhead)
2. Tpar1(p,n) = n/p + log p (logarithmic overhead, e.g., convergence check)
3. Tpar2(p,n) = n/p + log² p
4. Tpar3(p,n) = n/p + √p
5. Tpar4(p,n) = n/p + p (linear overhead, e.g., data exchange)

These are typical, good, work-optimal parallelizations
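A small C sketch (illustration only, not lecture code) that tabulates the corresponding speedups Sp(n) = Tseq(n)/Tpar(p,n) with Tseq(n) = n; n = 128 and base-2 logarithms are assumed, roughly what the plots on the following pages show:

#include <stdio.h>
#include <math.h>

/* Speedup Sp = Tseq/Tpar for the five model running times above,
   with Tseq(n) = n and constants ignored. */
int main(void)
{
    double n = 128.0;                        /* problem size assumed in the first plots */
    for (int p = 1; p <= 256; p *= 2) {
        double lg = log2((double)p);
        double tpar[5] = {
            n / p + 1.0,                     /* constant overhead */
            n / p + lg,                      /* logarithmic overhead */
            n / p + lg * lg,
            n / p + sqrt((double)p),
            n / p + (double)p                /* linear overhead */
        };
        printf("p=%3d", p);
        for (int i = 0; i < 5; i++)
            printf("  S%d=%6.1f", i, n / tpar[i]);
        printf("\n");
    }
    return 0;
}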

Page 117:

[Plot omitted: n=128]

Page 118:

Cost-optimality:

1. Tpar0(p,n) = n/p + 1: p*Tpar(p,n) = n + p = O(n) for p = O(n)
2. Tpar1(p,n) = n/p + log p: p*Tpar(p,n) = n + p log p = O(n) for p log p = O(n)
3. Tpar2(p,n) = n/p + log² p: p*Tpar(p,n) = n + p log² p = O(n) for p log² p = O(n)
4. Tpar3(p,n) = n/p + √p: p*Tpar(p,n) = n + p√p = O(n) for p√p = O(n)
5. Tpar4(p,n) = n/p + p: p*Tpar(p,n) = n + p² = O(n) for p² = O(n)

Page 119:

[Plot omitted: n=128]

Page 120:

[Plot omitted: n=128, but p up to 256]

Page 121:

[Plot omitted: n=16384 (=16K=128²)]

Page 122:

[Plot omitted: n=2097152 (=2M=128³)]

Page 123:

[Plot omitted: n=128]

Page 124:

[Plot omitted: n=16384 (=16K=128²)]

Page 125:

[Plot omitted: n=2097152 (=2M=128³)]

Page 126:

Efficiency, weak scaling

To maintain constant efficiency e = Tseq(n)/(p*Tpar(p,n)), n has to increase as:

1. Tpar0(p,n) = n/p + 1: f0(p) = e/(1-e) * p
2. Tpar1(p,n) = n/p + log p: f1(p) = e/(1-e) * p log p
3. Tpar2(p,n) = n/p + log² p: f2(p) = e/(1-e) * p log² p
4. Tpar3(p,n) = n/p + √p: f3(p) = e/(1-e) * p√p
5. Tpar4(p,n) = n/p + p: f4(p) = e/(1-e) * p²
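The iso-efficiency functions f_i(p) follow from one line of algebra; with Tpar(p,n) = n/p + o(p) for an overhead term o(p), and Tseq(n) = n:

\[ e = \frac{n}{n + p\,o(p)} \iff n = \frac{e}{1-e}\,p\,o(p), \qquad \text{e.g., } o(p) = \log p \;\Rightarrow\; f_1(p) = \frac{e}{1-e}\,p\log p \]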

Page 127:

[Plot omitted: maintained efficiency e = 0.9 (90%)]

Page 128:

Matrix-vector multiplication parallelizations

Tseq(n) = n²

Tpar0(p,n) = n²/p + n
Tpar1(p,n) = n²/p + n + log p
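A sketch of where the two terms come from (hypothetical code, not from the lecture): with a row-block distribution of y = A*x, each of the p processors performs about n²/p multiply-adds, and handling the length-n vector x (and collecting y) adds work linear in n:

#include <stdio.h>
#include <stddef.h>

/* Hypothetical per-processor share of y = A*x (row-block distribution):
   the processor owning rows lo..hi-1 does (hi-lo)*n, i.e. about n*n/p,
   multiply-adds; reading the length-n vector x and collecting y accounts
   for the additional term linear in n. */
void matvec_rows(const double *A, const double *x, double *y,
                 int n, int lo, int hi)
{
    for (int r = lo; r < hi; r++) {
        double sum = 0.0;
        for (int c = 0; c < n; c++)
            sum += A[(size_t)r * n + c] * x[c];
        y[r] = sum;
    }
}

int main(void)
{
    double A[2 * 2] = { 1, 2, 3, 4 }, x[2] = { 1, 1 }, y[2];
    matvec_rows(A, x, y, 2, 0, 2);           /* one "processor" doing all rows */
    printf("%g %g\n", y[0], y[1]);           /* prints 3 7 */
    return 0;
}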

Page 129:

[Plot omitted: n=100]

Page 130:

[Plot omitted: n=1000]

Page 131:

Some non work-optimal parallel algorithms

Amdahl case, linear sequential running time:

TparA(p,n) = 0.9n/p+0.1n

1. Tseq(n) = n log n, Tpar(p,n) = n²/p + 1

2. Tseq(n) = n, Tpar(p,n) = (n log n)/p + 1
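The resulting absolute speedups, ignoring the +1 terms:

\[ 1.\;\; S_p(n) = \frac{n\log n}{n^2/p} = \frac{p\log n}{n} \qquad\qquad 2.\;\; S_p(n) = \frac{n}{(n\log n)/p} = \frac{p}{\log n} \]

Both grow linearly in p for fixed n, but degrade as n grows, much faster in case 1 than in case 2.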

Page 132:

[Plot omitted: n1 = 128, n2 = 16384]

Page 133:

Limitations of empirical speedup measure

Empirical speedup assumes that Tseq(n) can be measured. For very large n and p, this may not be the case: a large HPC system has much more (distributed) main memory than any single-processor system. Scalability is then measured by other means:
•Stepwise speedup (1-1000 processors, 1000-100,000 processors)
•Other notions of efficiency

Page 134:

Lecture summary, checklist

•Models of parallel computation: architecture, programming, cost
•Flynn's taxonomy: MIMD, SIMD; SPMD
•Sequential baseline
•Speedup (in theory and practice)
•Work, cost optimality
•Amdahl's law
•Scaled speed-up, strong and weak scaling