
Page 1:

©Jesper Larsson Träff WS16/17

Introduction to Parallel Computing: Introduction, problems, models, performance

Jesper Larsson Träff [email protected]

TU Wien Parallel Computing

Page 2:

Mobile phones: „dual core“, „quad core“ (2012?), …

June 2012: IBM BlueGene/Q, 1,572,864 cores, 16 PF

“Never mind that there's little-to-no software that can take advantage of four processing cores, Xtreme Notebooks has released the first quad-core laptop in the U.S.” (2007)

June 2016: Sunway TaihuLight, 10,649,600 cores, 93 PFLOPS (125 PFLOPS peak)

Parallelism everywhere! How to use all these resources? Limits?

…octa-core (2016, Samsung Galaxy S7)

Practical Parallel Computing

Page 3:

As per ca. 2010: No sequential computer systems anymore (almost): multi-core, GPU/accelerator enhanced, …

Why?

June 2011: Fujitsu K, 705024 cores, 11PF

Practical Parallel Computing

Page 4:

June 2011: Fujitsu K, 705024 cores, 11PF

As per 2010: either a) all applications must be parallelized, or b) there must be enough independent applications that can run at the same time

Challenge: How do I speed up windows…?

Practical Parallel Computing

Page 5:

“I spoke to five experts in the course of preparing this article, and they all emphasized the need for the developers who actually program the apps and games to code with multithreaded execution in mind”

7 myths about quad-core phones (Smartphones Unlocked) by Jessica Dolcourt April 8, 2012 12:00 PM PDT

The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. By Herb Sutter. The biggest sea change in software development since the OO revolution is knocking at the door, and its name is Concurrency. This article appeared in Dr. Dobb's Journal, 30(3), March 2005

…many similar quotations can (still) be found

Practical Parallel Computing

Page 6:

The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. By Herb Sutter. The biggest sea change in software development since the OO revolution is knocking at the door, and its name is Concurrency. This article appeared in Dr. Dobb's Journal, 30(3), March 2005

“Free lunch” (as in “There is no such thing as a free lunch”): exponential increase in single-core performance (18-24 month doubling rate, Moore's “Law”); no software changes were needed to exploit the faster processors

Page 7:

The “free lunch” was over… ca. 2005

• Clock speed limit around 2003
• Power consumption limit from 2000
• Instructions/clock limit in the late 1990s

But: The number of transistors per chip can continue to increase (>1 billion)

Solution(?): Put more cores on the chip, the “multi-core revolution”

Kunle Olukotun (Stanford), ca. 2010

Page 8:

Parallelism challenge: Solve problem p times faster on p (slower, more energy efficient) cores

Kunle Olukotun (Stanford), ca. 2010

The “free lunch” was over… ca. 2005

• Clock speed limit around 2003
• Power consumption limit from 2000
• Instructions/clock limit in the late 1990s

Page 9:

[Figure: microprocessor trend data; Kunle Olukotun, 2010; Karl Rupp (TU Wien), 2015]

Page 10:

Single-core performance does still increase, somewhat, but much more slowly… [Figure: Henk Poley, 2014, www.preshing.com]

Average, normalized, SPEC benchmark numbers, see www.spec.org

Page 11:

2006: Factor 10³-10⁴ increase in integer performance over a 1978 high-performance processor

Single-core processor development (the “free lunch”) made life very difficult for parallel computing in the 1990s

From Hennessy/Patterson, Computer Architecture

Page 12:

Architectural ideas driving sequential performance increase

• Deep pipelining
• Superscalar execution (>1 instruction/clock) through multiple functional units
• Data parallelism through SIMD units (old HPC idea: vector processing), same instruction on multiple data per clock
• Out-of-order execution, speculative execution
• Branch prediction
• Caches (pre-fetching)
• Simplified/better instruction sets (better for the compiler)
• SMT/Hyperthreading

Increase in clock frequency (“technological advances”, factor 20-200?) alone does not explain the performance increase.

Mostly fully transparent, at most the compiler needs to care: the “free lunch”

Page 13:

• Very diverse parallel architectures
• No single, commonly agreed upon abstract model (for designing and analyzing algorithms)

Many different programming paradigms (models), different programming frameworks (interfaces, languages)

• Has been so
• Is still so (largely)

Page 14:

Theoretical Parallel Computing

Parallelism was always there! What can be done? Limits?

Assume p processors instead of just one, reasonably connected (memory, network, …)
• How much faster can some given problem be solved? Can some problems be solved better?
• How? New algorithms? New techniques?
• Can all problems be solved faster? Are there problems that cannot be solved faster with more processors?
• Which assumptions are reasonable?
• Does parallelism give new insights into the nature of computation?

Page 15:

Sequential computing:
Algorithm in model → Concrete program (C, C++, Java, Haskell, Fortran, …) → Concrete architecture
…is difficult. Analysis in model (often) has some relation to concrete execution.

vs. Parallel computing:
Algorithm in model A, model B, …, model Z → Concrete program: different paradigms (MPI, OpenMP, Cilk, OpenCL, …) → Concrete architecture A, …, concrete architecture Z

Page 16:

Parallel computing:
Algorithm in model A, model B, …, model Z → Concrete program: different paradigms (MPI, OpenMP, Cilk, OpenCL, …) → Concrete architecture A, …, concrete architecture Z
…is extremely difficult. Analysis in model may have little relation to concrete execution.

Huge gap: no “standard” abstract model (e.g., RAM).

Page 17:

Parallel computing: Extremely difficult. Analysis in model may have little relation to concrete execution.
Algorithm in model A, model B, …, model Z → Concrete program: different paradigms (MPI, OpenMP, Cilk, OpenCL, …) → Concrete architecture A, …, concrete architecture Z

Challenges:
• Algorithmic: Not all problems seem to be easily parallelizable
• Portability: Support for the same language/interface on different architectures (e.g., MPI in HPC)
• Performance portability (?)

Page 18:

Parallel computing:
Algorithm in model A, model B, …, model Z → Concrete program: different paradigms (MPI, OpenMP, Cilk, OpenCL, …) → Concrete architecture A, …, concrete architecture Z

Elements of parallel computing:
• Algorithmic: Find the parallelism
• Linguistic: Express the parallelism
• Practical: Validate the parallelism (correctness, performance)
• Technical challenge: run-time/compiler support for the paradigm

Page 19:

Parallel computing: Accomplish something with a coordinated set of processors under control of a parallel algorithm

Why study parallel computing?
• It is inevitable: multi-core revolution, GPGPU paradigm, …
• It's interesting, challenging, highly non-trivial – full of surprises
• Key discipline of computer science (von Neumann; golden theory decade: 1980 to early 1990s)
• It's ubiquitous (gates, architecture: pipelines, ILP, TLP; systems: operating systems, software), not always transparent
• It's useful: large, extremely computationally intensive problems, Scientific Computing, HPC
• …

Page 20:

Parallel computing: The discipline of efficiently utilizing dedicated parallel resources (processors, memories, …) to solve a single, given computational problem.

Specifically: Parallel resources with significant inter-communication capabilities, for problems with non-trivial communication and computational demands

Buzz words: tightly coupled, dedicated parallel system; multi-core processor, GPGPU, High-Performance Computing (HPC), …

Focus on properties of solution (time, size, energy, …) to given, individual problem

Page 21:

Distributed computing: The discipline of making independent, non-dedicated resources available and cooperative toward solving specified problem complexes.

Buzz words: internet, grid, cloud, agents, autonomous computing, mobile computing, …

Typical concerns: correctness, availability, progress, security, integrity, privacy, robustness, fault tolerance, …

Page 22:

Concurrent computing: The discipline of managing and reasoning about interacting processes that may (or may not) progress simultaneously

Buzz words: operating systems, synchronization, interprocess communication, locks, semaphores, autonomous computing, process calculi, CSP, CCS, pi-calculus, …

Typical concerns: correctness (often formal), e.g. deadlock-freedom, starvation-freedom, mutual exclusion, fairness

Page 23:

Parallel vs. Concurrent computing

[Diagram: several processes, via a processor, access a shared resource: memory (locks, semaphores, data structures), a device, …]

Concurrent computing: Focus on coordination of access to/usage of shared resources (to solve a given, computational problem)

Given problem: specification, algorithm, data

Adapted from Madan Musuvathi

Page 24:

[Diagram: a problem (specification, algorithm, data) divided into subproblems, each solved by a dedicated processor]

Parallel computing: Focus on dividing a given problem (specification, algorithm, data) into subproblems that can be solved by dedicated processors (in coordination)

Coordination: synchronization, communication

Page 25:

The “problem” of parallelization

[Diagram: problem divided into subproblems]

How to divide a given problem into subproblems that can be solved in parallel?
• Specification
• Algorithm?
• Data?

• How is the computation divided? Coordination necessary? Does the sequential algorithm help/suffice?
• Where are the data? Is communication necessary?

Page 26:

Aspects of parallelization

• Algorithmic: How to divide the computation into independent parts that can be executed in parallel? What kinds of shared resources are necessary? Which kinds of coordination? How can overheads be minimized (redundancy, coordination, synchronization)?
• Scheduling/Mapping: How are independent parts of the computation assigned to processors?
• Load balancing: How can independent parts be assigned to processors such that all resources are utilized efficiently?
• Communication: When must processors communicate? How?
• Synchronization: When must processors agree/wait?

Page 27:

• Linguistic: How are the algorithmic aspects expressed? Concepts (programming model) and concrete expression (programming language, interface, library)
• Pragmatic/practical: How does the actual, parallel machine look? What is a reasonable, abstract model?
• Architectural: Which kinds of parallel machines can be realized? How do they look?

Page 28:

Levels of parallelization

• Architecture: gates, logical and functional units (not here)
• Computational: Instruction Level Parallelism (ILP), SIMD (not here)
• Functional, explicit: threads (possibly concurrent, sharing architecture resources), cores, processors: “Parallel computing” (this lecture)
• Large-scale: coarse-grained tasks, multi-level, coupled application parallelism (not here)

Page 29:

Can’t we just leave it all to the compiler/hardware?

Automatic parallelization(?)

Input:
• “High-level” problem specification, or
• Sequential program
Output: efficient code that can be executed on a parallel, multi-core processor

Successful only to a limited extent:
• Compilers cannot invent a different algorithm
• Hardware parallelism is not likely to go much further

Samuel P. Midkiff: Automatic Parallelization: An Overview of Fundamental Compiler Techniques. Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers 2012

Page 30:

Explicitly parallel programming (today):

• Explicitly parallel code in some parallel language
• Support from parallel libraries
• (Domain-specific languages)
• Compiler does as much as the compiler can do

Lots of interesting problems and tradeoffs, an active area of research

Page 31:

Some „given, individual problems“ for this lecture

Problem 1: Matrix-vector multiplication: given an n×m matrix A and an m-element (column) vector v, compute the n-element (column) vector u = Av

[Diagram: u = A·v]

u[i] = ∑_{0≤j<m} A[i,j] v[j]

Dimensions n,m large: obviously some parallelism

How to ∑ in parallel? Access to vector?
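One way the row-wise parallelism could be expressed (a minimal sketch, assuming C with OpenMP and row-major storage; not part of the original slides): each of the n rows is an independent subproblem, and only the summation within a row is sequential.

  /* Sketch: row-parallel matrix-vector multiplication u = A*v,
     A stored row-major as a[i*m + j]; hypothetical helper function */
  void matvec(int n, int m, const double *a, const double *v, double *u)
  {
      #pragma omp parallel for            /* rows are independent */
      for (int i = 0; i < n; i++) {
          double sum = 0.0;               /* per-row sum is local: no conflicts */
          for (int j = 0; j < m; j++)
              sum += a[i*m + j] * v[j];
          u[i] = sum;
      }
  }

All processors read the whole vector v; how v is accessed (shared or replicated) is exactly the question raised above.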

Page 32:

Problem 1a: Matrix-matrix multiplication: given an n×k matrix A and a k×m matrix B, compute the n×m matrix product C = AB

[Diagram: C = A·B]

C[i,j] = ∑_{0≤l<k} A[i,l] B[l,j]
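A similar sketch (again an assumption, C with OpenMP, not from the slides): all n·m entries of C are independent, so both outer loops can be parallelized.

  /* Sketch: parallel matrix-matrix multiplication C = A*B, row-major
     storage (a[i*k + l], b[l*m + j], c[i*m + j]); hypothetical helper */
  void matmul(int n, int k, int m, const double *a, const double *b, double *c)
  {
      #pragma omp parallel for collapse(2)   /* all C[i,j] are independent */
      for (int i = 0; i < n; i++)
          for (int j = 0; j < m; j++) {
              double sum = 0.0;
              for (int l = 0; l < k; l++)
                  sum += a[i*k + l] * b[l*m + j];
              c[i*m + j] = sum;
          }
  }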

Page 33:

Problem 1b: Solving sets of linear equations. Given matrix A and vector b, find x such that Ax = b

Preprocess A such that a solution to Ax = b can easily be found for any b (LU factorization, …)

Page 34:

Problem 2: Stencil computation: given an n×m matrix A, update as follows, and iterate until some convergence criterion is fulfilled

[Diagram: 5-point stencil around A[i,j]]

iterate {
    for all (i,j) {
        A[i,j] <- (A[i-1,j] + A[i+1,j] + A[i,j-1] + A[i,j+1])/4
    }
} until (convergence)

with suitable handling of the matrix border

Looks well-behaved, “embarrassingly parallel”? Data distribution? Conflicts on updates?

Use: discretization and solution of certain partial differential equations (PDEs); image processing; …

5-point stencil in 2d
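One common way to avoid conflicts on updates is a Jacobi-style iteration that reads the old matrix and writes a new one (a sketch under that assumption, C with OpenMP; the slide does not prescribe this):

  /* Sketch: one Jacobi-style sweep of the 5-point stencil; reading 'old'
     and writing 'nxt' avoids update conflicts (interior points only,
     border handled separately); hypothetical helper function */
  void stencil_sweep(int n, int m, const double *old, double *nxt)
  {
      #pragma omp parallel for collapse(2)
      for (int i = 1; i < n-1; i++)
          for (int j = 1; j < m-1; j++)
              nxt[i*m + j] = (old[(i-1)*m + j] + old[(i+1)*m + j] +
                              old[i*m + (j-1)] + old[i*m + (j+1)]) / 4.0;
      /* caller swaps old/nxt and checks the convergence criterion */
  }

The data distribution (e.g., by blocks of rows) determines which boundary rows must be exchanged between processors.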

Page 35:

Problem 3: Merging two sorted arrays of size n and m into a sorted array of size n+m

Problem 4: Computation of all prefix-sums: given an array A of size n of elements of some type S with an associative operation +, compute for all indices 0≤i<n the prefix sums B[i] = ∑_{0≤j<i} A[j] in array B

Easy to do sequentially, but sequential algorithm looks… sequential

Implies solution to problem of computing just ∑. Parallel?

A: 1 2 3 4 5 6 7 …

B: x 1 3 6 10 15 21 28 … All prefix-sums
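The straightforward sequential computation (a sketch of the exclusive prefix sums defined above, in C) shows why the problem “looks sequential”: each output depends on the previous one.

  /* Sketch: sequential (exclusive) prefix sums; B[0] is the empty sum
     (the "x" above). The loop-carried dependence B[i] = B[i-1] + A[i-1]
     is what a parallel algorithm must break. */
  void prefix_sums(int n, const double *A, double *B)
  {
      B[0] = 0.0;
      for (int i = 1; i < n; i++)
          B[i] = B[i-1] + A[i-1];
  }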

Page 36:

Problem 5: Sorting a sequence of objects (“reals”, integers, objects with an order relation) stored in an array

Hopefully the parallel merge solution can be of help. Other approaches? Quicksort? Integers (counting, bucket, radix sort)?

Page 37:

Parallel computing as a (theoretical) CS discipline

(Traditional) objective: Solve a given computational problem faster

• Faster than what?
• What is a reasonable “model” for parallel computing?
• How fast can a given problem be solved? How many resources can be productively exploited?
• Are there problems that cannot be solved in parallel? Fast? At all?
• …

Page 38:

Architecture model: Abstraction of the important modules of a computational system (processors, memories), their interconnection and interaction.
Computational/execution model: (Formal) framework for the design and analysis of algorithms for the computational system; cost model.

[Diagram: processor P with memory M]

Example: RAM (Random-Access Machine)

Processor (ALU, PC, registers) capable of executing instructions stored in memory on data in memory. Execution of an instruction, access to memory: first assumption, unit cost.

Realistic?

Useful?

Page 39:

[Diagram: processor P with memory M]

Example: RAM (Random-Access Machine)

• Memory: load, store
• Arithmetic (integer, real numbers)
• Logic: and, or, xor, shift
• Branch, compare, procedure call

Traditional RAM: Same unit cost for all operations


Page 40:

[Diagram: processor P with memory M]

Example: RAM (Random-Access Machine)

Aka von Neumann architecture or stored program computer

John von Neumann (1903-1957), Report on EDVAC, 1945; also Eckert & Mauchly, ENIAC


Page 41:

[Diagram: processor P with memory M]

Example: RAM (Random-Access Machine)

„von Neumann bottleneck“: Program and data separate from CPU, processing rate limited by memory rate.


Page 42:

[Diagram: processor P with memory M]

Example: RAM (Random-Access Machine)

John W. Backus: Can Programming Be Liberated From the von Neumann Style? A Functional Style and its Algebra of Programs. Commun. ACM 21(8): 613-641 (1978), Turing Award Lecture, 1977


Page 43:

[Diagram: processor P with cache and memory M]

Example: RAM (Random-Access Machine) with cache/memory hierarchy


Different (NOT unit) cost model: some memory accesses are cheaper than others

Page 44:

[Diagram: processor P connected to multiple memory banks M, M, M, M]

Increased memory rate, vector computer, ALU operates on vectors instead of scalars


Page 45:

[Diagram: processors P, P, P, P connected to a shared memory M]

Shared-memory model, bus based


Page 46:

[Diagram: processors P, P, P, P connected to a shared memory M]

Shared-memory model (network based, emulated)


Page 47:

[Diagram: processors P, P, P, P connected to a shared memory M]

Synchronous, shared-memory model

Processors in lock-step, uniform memory access time = instruction time: Parallel RAM (PRAM)

Realistic?

Useful?


Page 48:

[Diagram: processors P, P, P, P connected to a shared memory M]

Synchronous shared-memory model

The PRAM was the main theoretical model from the late 1970s throughout the 1980s; interest was lost around 1993. Very important analysis/idea tool


Page 49:

[Diagram: processors P, P, P, P connected via a network to memory banks M, M, M, …]

Shared-memory model, banked, network based, not synchronous


Page 50:

UMA (Uniform Memory Access): Access time to memory location is independent of location and accessing processor, e.g., O(1), O(log M), …

NUMA (Non-Uniform Memory Access): Access time dependent on processor and location. Locality: some locations can be accessed faster by a processor than others

[Diagrams: UMA with processors P, P, P, P sharing a single memory M; NUMA with processors P, P, P, P and multiple memory banks M, M, M, …]

Page 51:

Architecture model defines resources, describes:
• Composition of processor, functional units: ALU, FPU, registers, w-bit words vs. unlimited, vector unit (MMX, SSE, AVX)
• Types of instructions
• Memory system, caches
• …

Execution model/cost model specifies:
• How instructions are executed
• (Relative) cost of instructions, memory accesses
• …

Level of detail/formality dependent on purpose: What is to be studied (complexity theory, algorithms design, …), what counts (instructions, memory accesses, …)

Page 52:

[Diagram: processors P, P, P, P, each with a local memory M, connected by a communication network]

Distributed memory model


Page 53:

Parallel architecture model defines:
• Synchronization between processors
• Synchronization operations
• Atomic operations, shared resources (memory, registers)
• Communication mechanisms: network topology, properties
• Memory: shared, distributed, hierarchical, …
• …

Cost model defines:
• Cost of synchronization, atomic operations
• Cost of communication (latency, bandwidth, …)
• …

Page 54:

Different parallel architecture model: Cellular automaton, systolic array, … : simple processors without memory (finite state automata, FSA), operate in lock step on (potentially infinite) grid, local communication only

[John von Neumann, Arthur W. Burks: Theory of Self-Reproducing Automata, 1966] [H. T. Kung: Why systolic architectures? IEEE Computer 15(1): 37-46, 1982]

State of cell (i,j) in the next step is determined by:
• Its own state
• The state of its neighbors in some neighborhood, e.g., (i,j-1), (i+1,j), (i,j+1), (i-1,j)

Page 55:

Flynn's taxonomy: Orthogonal classification of (parallel) architectures/models along two axes, the instruction stream and the data stream:

• SISD: Single Instruction, Single Data
• SIMD: Single Instruction, Multiple Data
• MISD: Multiple Instruction, Single Data
• MIMD: Multiple Instruction, Multiple Data

[M. J. Flynn: Some computer organizations and their effectiveness. IEEE Trans. Comp. C-21(9):948-960, 1972]

Page 56:

• SISD: Single processor, single stream of instructions, operates on a single stream of data. Sequential architecture (e.g., RAM)
• SIMD: Single processor, single stream of operations, operates on multiple data per instruction. Example: traditional vector computer, PRAM (some variants)
• MISD: Multiple instructions operate on a single data stream. Example: pipelined architectures, streaming architectures(?), systolic arrays (1970s architectural idea). Some say this class is empty
• MIMD: Multiple instruction streams, multiple data streams

Page 57:

Typical instances:
• SISD: a single processor P with a memory M
• SIMD: a single processor P operating on multiple memory banks M, M, M, …
• MIMD: multiple processors P, P, P, P with a shared memory M, or with local memories M, M, M, … connected by a communication network

Page 58:

Programming model: Abstraction, close to programming language, defining parallel resources, management of parallel resources, parallelization paradigms, memory structure, memory model, synchronization and communication features, and their semantics

Parallel programming language, or library („interface“) is the concrete implementation of one (or more: multi-modal, hybrid) parallel programming model(s)

Cost of operations: May be specified with programming model; often by architecture/computational model

Execution model: when and how parallelism in programming model is effected

Page 59:

Parallel programming model defines, e.g.:
• Parallel resources, entities, units: processes, threads, tasks, …
• Expression of parallelism: explicit or implicit
• Level and granularity of parallelism
• Memory model: shared, distributed, hybrid
• Memory semantics (“when operations take effect/become visible”)
• Data structures, data distributions
• Methods of synchronization (implicit/explicit)
• Methods and modes of communication

Page 60:

Examples:

1. Threads, shared memory, block distributed arrays, fork-join parallelism
2. Distributed memory, explicit message passing, collective communication, one-sided communication (“RDMA”), PGAS(**)
3. Data parallel SIMD, SPMD(*)
4. …

(*) SPMD: Single Program, Multiple Data
(**) PGAS: Partitioned Global Address Space – not in this lecture

Concrete libraries/languages: pthreads, OpenMP, MPI, UPC, TBB, …

Page 61:

Same Program Multiple Data (SPMD)

[F. Darema et al.: A single-program-multiple-data computational model for EPEX/FORTRAN, 1988]

Restricted MIMD model: All processors execute the same program
• May do so asynchronously; different processors may be in different parts of the program at any given time
• The same objects (procedures, variables) exist for all processors; “remote procedure call”, “active messages”, “remote-memory access” make sense

Programming model concepts (not this lecture): active messages, remote procedure call, … (MPI: RMA/one-sided communication)
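A minimal SPMD illustration (an assumed example, not from the slides), here with MPI in C: every process runs the same program and uses its rank to select its part of the work.

  /* Sketch: SPMD style with MPI; the same program on every process,
     behavior differentiated only by the rank */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int rank, size, n = 1000000;            /* n: assumed problem size */
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I? */
      MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */
      int chunk = (n + size - 1) / size;      /* block of work per process */
      int lo = rank * chunk;
      int hi = (lo + chunk < n) ? lo + chunk : n;
      printf("process %d of %d handles indices [%d,%d)\n", rank, size, lo, hi);
      MPI_Finalize();
      return 0;
  }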

Page 62:

[Diagram: layered view]
Programming language/library/interface/paradigm (e.g., OpenMP, MPI, Cilk)
Programming model
Architecture model
“Real” hardware

Different architecture models can realize a given programming model. Closer fit: more efficient use of the architecture.

Challenge: A programming model that is useful and close to “realistic” architecture models, to enable realistic analysis/prediction.

Challenge: A language that conveniently realizes the programming model (algorithms support, “run-time”).

Page 63:

Examples:

OpenMP: programming interface/language for the shared-memory model, intended for shared-memory architectures; “data parallel”. Can be implemented with DSM (Distributed Shared Memory) on distributed memory architectures, but performance has usually not been good. Requires a DSM implementation/algorithms.

MPI: interface/library for the distributed memory model, can be used on shared-memory architectures, too. Needs algorithmic support (e.g., “collective operations”).

Cilk: language (extended C) for the shared-memory model, for shared-memory architectures; “task parallel”. Needs a run-time (e.g., “work-stealing”).

Page 64:

Performance: Basic observations and goals

Model: p dedicated parallel processors collaborate to solve given problem of input size n. We will be interested in worst-case complexities (execution time)

• Processors can work independently (local memory, program)
• Communication and coordination incur some overhead

Page 65:

Challenge and one main goal of parallel processing (… not to make life easy):

Be faster than a commonly used (good, best known, best possible…) sequential algorithm and implementation, utilizing the p processors efficiently.

Page 66:

Tseq(n): time for 1 processor to solve problem of size n

Tpar(p,n): time for p processors to solve problem of size n

Sp(n) = Tseq(n)/Tpar(p,n)

Speedup measures the gain in moving from sequential to parallel computation (note: parameters p and n)

Main goal: Speeding up computations by parallel processing

Goal: Achieve as large speed-up as possible
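A small worked example (numbers assumed for illustration): if Tseq(n) = 120 s and 8 processors need Tpar(8,n) = 20 s, then S8(n) = 120/20 = 6, i.e., a speedup of 6 on 8 processors, short of the perfect speedup of 8.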

Page 67:

What is “time” (number of instructions)? What exactly are Tseq(n) and Tpar(p,n)?

- Time for some algorithm for solving the problem?
- Time for a specific algorithm for solving the problem?
- Time for the best known algorithm for the problem?
- Time for the best possible algorithm for the problem?
- Time for a specific input of size n, average case, worst case, …?
- Asymptotic time, large n, large p?
- Do constants matter, e.g., O(f(p,n)) or 25n/p + 3 ln(4(p/n)) … ?

Page 68:

Tseq(n):
• Theory: Number of instructions (or other critical cost measure) to be executed in the worst case for inputs of size n
• The number of instructions carried out is often termed WORK
• Practice: Measured time (or other parameter) of execution over some inputs (experiment design)

Choose sequential algorithm (theory), choose an implementation of this algorithm (practice)

Theory and practice: Always state baseline sequential algorithm/implementation

Page 69:

Examples (theory):

Tseq(n) = O(n): finding the maximum of n numbers in an unsorted array; prefix sums
Tseq(n,m) = O(n+m): merging two sequences; BFS/DFS in a graph
Tseq(n) = O(n log n): comparison-based sorting
Tseq(n,m) = O(n log n + m): Single-Source Shortest Path (SSSP)
Tseq(n) = O(n³): matrix multiplication, input two n×n matrices
…

Standard, worst-case, asymptotic complexities

Cormen, Leiserson, Rivest, Stein: Introduction to Algorithms. 3rd ed., MIT Press, 2009

Can be solved in o(n³), Strassen etc.

Page 70:

New issue with modern processors: Is time always the same thing? Clock frequency may not be constant; e.g., it can depend on the load of the system, an energy cap, “turbo mode”, etc. Such factors are difficult to control.

Practice:
• Construct good input examples to measure running time Tseq(n); experimental methodology
• Worst case not always possible, not always interesting; best case?
• Experimental methods to get stable, accurate Tseq(n): repeat measurements many times (rule of thumb: average over at least 30 repetitions)

Experimental science: Always some assumptions about repeatability, regularity, determinism, …

Page 71:

Parallel performance in theory

Definition: Let Tseq(n) be the (worst-case) time of the best possible/best known sequential algorithm, and Tpar(p,n) the (worst-case) time of the parallel algorithm. The absolute speed-up of Tpar(p,n) with p processors over Tseq(n) is

Sp(n) = Tseq(n)/Tpar(p,n)

Observation (proof follows): Best-possible, absolute speed-up is linear in p

Goal: Obtain (linear) speedup for as large p as possible (as function of problem size n), for as many n as possible

Page 72:

Definition: T∞(n): the smallest possible running time of the parallel algorithm given arbitrarily many processors. By definition T∞(n) ≤ Tpar(p,n) for all p. Speedup is limited by

Sp(n) = Tseq(n)/Tpar(p,n) ≤ Tseq(n)/T∞(n)

Definition: Tseq(n)/T∞(n) is called the parallelism of the algorithm. The parallelism is the largest number of processors that can be employed and still give linear speedup (if Tseq(n)/T∞(n) < p, the bound above caps the speedup strictly below p).
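A standard illustration (not from this slide): summing n numbers with a balanced tree of additions has Tseq(n) = O(n) and T∞(n) = O(log n), so the parallelism is Tseq(n)/T∞(n) = O(n/log n); employing many more than n/log n processors cannot yield linear speedup.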

Page 73:

For speedup (and other complexity measures), distinguish:
• The problem G to be solved (mathematical specification)
• Some algorithm A to solve G
• The best possible (lower bound) algorithm A* for G, the best known algorithm A+ for G: the complexity of G
• An implementation of A on some machine M

Page 74:

[Diagram: the problem, the sum of two n-element vectors, divided into p subproblems, each the sum of n/p elements]

Example: “data parallel” (SIMD) computation

for (i=0; i<n; i++) { a[i] = b[i]+c[i]; }

Algorithm/ implementation:

Tpar(p,n) = Tseq(n)/p, speedup Sp(n) = p

Best possible parallelization: sequential work divided evenly across p processors

Complexity: Tseq(n) = Θ(n)
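A sketch of how this data-parallel loop could be divided over the p processors (assuming C with OpenMP for illustration; the slide does not prescribe a framework):

  /* Sketch: the loop from the slide, with each thread handling a
     contiguous block of about n/p iterations */
  void vec_add(int n, double *a, const double *b, const double *c)
  {
      #pragma omp parallel for schedule(static)
      for (int i = 0; i < n; i++)
          a[i] = b[i] + c[i];
  }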

Page 75:

[Diagram: the problem, the sum of two n-element vectors, divided into p subproblems, each the sum of n/p elements]

Example: “data parallel” (SIMD) computation

for (i=0; i<n; i++) { a[i] = b[i]+c[i]; }

• Perfect speedup
• “Embarrassingly parallel”
• “Pleasantly parallel”

Tpar(p,n) = c(n/p) for constant c≥1

Algorithm/ implementation:

Page 76:

Work

[Diagram: a single bar of length Tseq(n), from start to stop]

The work, measured in instructions and/or time, that has to be carried out for a problem of size n

Page 77:

[Diagram: the sequential work Tseq(n) divided evenly into p equal bars]

Perfect parallelization: Sequential work evenly divided between p processors, no overhead, so Tpar(p,n) = Tseq(n)/p

Perfect speedup Sp(n) = Tseq(n)/(Tseq(n)/p) = p

… and very rare in practice

Page 78:

[Diagram: p bars of different heights Ti(n); all start at the same time, the slowest determines the stop time]

Sequential work is unevenly divided between the p processors: load imbalance, Tpar(p,n) > Tseq(n)/p, even though ∑Ti(n) = Tseq(n)

Tpar is the time for the slowest processor to complete; all processors are assumed to start at the same time

Define Tpar(p,n) = max Ti(n) over all processors
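A small numerical illustration (numbers assumed): with Tseq(n) = 100 and p = 4, a division into Ti = 30, 25, 25, 20 still sums to 100, but Tpar(4,n) = max Ti = 30 > Tseq(n)/p = 25, so the speedup is only 100/30 ≈ 3.3 instead of 4.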

Page 79:

[Diagram: p bars Ti(n) inside a rectangle of area C(n) = p*Tpar(p,n)]

Wpar(n) = ∑Ti(n) is called the work of the parallel algorithm = total number of instructions performed by the p processors

The product C(n) = p*Tpar(p,n) is called the cost of the parallel algorithm: total time in which the p processors are reserved (and have to be paid for)

Page 80:

“Theorem:” Perfect speedup Sp(n) = p is best possible and cannot be exceeded

“Proof”: Assume Sp(n) > p for some n. Tseq(n)/Tpar(p,n) > p implies Tseq(n) > p*Tpar(p,n). A better sequential algorithm could be constructed by simulating the parallel algorithm on a single processor. The instructions of the p processors are carried out in some correct order, one after another on the sequential processor. This contradicts that Tseq(n) was best possible/known time.

Reminder: Speedup is calculated (measured) relative to “best” sequential implementation/algorithm

Page 81:

[Diagram: the p bars Ti(n), to be simulated on a single processor]

By assumption C(n) = p*Tpar(p,n) < Tseq(n)

Simulation A: one step of P1, one step of P2, …, one step of P(p-1), one step of P1, …, for C(n) iterations

Simulation B: steps of P1 until communication/synchronization, steps of P2 until communication/synchronization, …

Page 82:

[Diagram: the p work pieces Ti(n) executed one after another on 1 processor]

By assumption C(n) = p*Tpar(p,n) < Tseq(n)

Both simulations yield a new, sequential algorithm with time Tsim(n) < Tseq(n). This contradicts that Tseq(n) was the time of the best possible/best known sequential algorithm.

Page 83:

Lesson: Parallelism offers only „modest potential“, speed-up cannot be more than p on p processors

Lawrence Snyder: Type architecture, shared memory and the corollary of modest potential. Annual Review of Computer Science, 1986

The construction shows that the total parallel work must be at least as large as the sequential work Tseq; otherwise, a better sequential algorithm could be constructed.

Crucial assumptions: Sequential simulation possible (enough memory to hold problem and state of parallel processors), sequential memory behaves as parallel memory, … NOT TRUE for real systems and real problems

Page 84:

[Diagram: parallel execution simulated on 1 processor]

Aside: Such simulations are actually sometimes done, and can be very useful to understand (model) and debug parallel algorithms.

Some such simulation tools:
• SimGrid (INRIA)
• LogGOPSim (Hoefler et al.)
• …

Possible bachelor thesis subject

Page 85:

Wpar(n) = ∑Ti(n) is called the parallel work of the parallel algorithm = total number of instructions performed by some number of processors

The product C(n) = p*Tpar(p,n) is called the cost of the parallel algorithm: Total time in which the p processors are occupied

Definition: Parallel algorithm is called cost-optimal if C(n) = O(Tseq(n)). A cost-optimal algorithm has linear (perhaps perfect) speedup

Definition: Parallel algorithm is called work-optimal if Wpar(n) = O(Tseq(n)). A work-optimal algorithm has potential for linear speedup (for some number of processors)

Page 86:

Proof (linear-speed up of cost-optimal algorithm): Given cost-optimal parallel algorithm with p*Tpar(p,n) = c*Tseq(n) = O(Tseq(n))

This implies Tpar(p,n) = c*Tseq(n)/p, so Sp(n) = Tseq(n)/Tpar(p,n) = p/c

The constant factor c captures the overheads and load imbalance (see later) of the parallel algorithm relative to best sequential algorithm. The smaller c, the closer the speedup to perfect

Page 87:

[Diagram: p bars Ti(n)]

Given a work-optimal parallel algorithm, ∑Ti(n) = Tseq(n), with Tpar(n) = max Ti(n)

Page 88:

[Diagram: the same work items executed on a smaller number of processors]

Given a work-optimal parallel algorithm, ∑Ti(n) = Tseq(n), with Tpar(n) = max Ti(n)

Execute on a smaller number of processors, such that ∑Ti(n) = p*Tpar(p,n) = O(Tseq(n))

Page 89:

Proof idea (a work-optimal algorithm can have linear speed-up):
1. Work-optimal algorithm
2. Schedule work items Ti(n) on p processors, such that p*Tpar(p,n) = O(Tseq(n))
3. With this number of processors, the algorithm is cost-optimal
4. Cost-optimal algorithms have linear speed-up

Parallel algorithms’ design goal: Work-optimal parallel algorithm with as small Tpar(n) as possible (and therefore large parallelism: many processors can be utilized)

The scheduling in step 2 is possible in principle, but may not be trivial

Page 90:

Example: Non work-optimal algorithm DumbSort with T(n) = O(n²) that can be perfectly parallelized, Tpar(p,n) = O(n²/p)

Well-known that Tseq(n) = O(n log n), with many algorithms and good implementations

Sp(n) = (n log n)/(n²/p) = p (log n)/n

Break-even: When is the parallel algorithm faster than the sequential one?
Tpar(p,n) < Tseq(n): n²/p < n log n, i.e., n/p < log n, i.e., p > n/log n

(Small) linear speedup for fixed n (but not independent of n)

Non work-optimal algorithm: Speed-up decreases with n
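A quick numerical check (illustration only, logarithms base 2): for n = 1024, the break-even condition p > n/log n gives p > 1024/10 ≈ 102, so more than about a hundred processors are needed before DumbSort even matches a good sequential O(n log n) sort, and this threshold grows with n.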

Page 91:

The best known/best possible sequential algorithm is often difficult to parallelize:
- no redundant work (that could have been done in parallel)
- tight dependencies (that force things to be done one after another)

Lesson: Usually it does not make sense to parallelize an inferior algorithm (although it is sometimes much easier)

Lesson from much hard work in theory and practice: Parallel solution of a given problem often requires a new algorithmic idea!

But: Many algorithms often have a lot of potential for easy parallelization (loops, independent functions, …), so why not?

Also: a non-work-optimal algorithm can sometimes be useful, e.g., as a subroutine

Page 92:

Parallel performance in practice

Speedup is an empirical quantity, “measured time”, based on experiment (benchmark)

Tseq(n): Running time for “reasonable”, good, best available, sequential implementation, on “reasonable” inputs

Tpar(p,n): parallel running time measured for a number of experiments with different typical (worst-case?) inputs

Sp(n) = Tseq(n)/Tpar(p,n)

Speed-up is typically not independent of problem size n, and problem instance

Page 93:

[Figure: per-processor times Ti and overheads Oi across p processors; Tpar(p,n) vs. Tseq(n); note Tpar(1,n) ≥ Tseq(n)]

Parallelization most often incurs overheads:
•Algorithmic: the parallel algorithm may do more work
•Coordination: communication and synchronization
•…

Page 94:

The ratio

Tpar(1,n)/Tpar(p,n)

is often termed relative speedup, and may express how well the parallel algorithm utilizes p processors (scalability)

Relative speedup NOT to be confused with absolute speedup. Absolute speedup expresses how much can be gained over the best (known/possible) sequential implementation by parallelization. Absolute speed-up is what eventually matters

Note: Literature is not always clear about this distinction. It is easier to achieve and document good relative speedup. Reporting speed-up relative to an inferior, sequential implementation is incorrect!

Page 95:

Absolute vs. relative speedup and scalability

Good scalability and relative speedup, Tpar(1,n)/Tpar(p,n) = Θ(p), for example 0.1p ≤ Tpar(1,n)/Tpar(p,n) ≤ 0.5p, is sometimes reported. But what if Tpar(1,n) = 100 Tseq(n)? Or Tseq(n) = O(n) but Tpar(p,n) = O((n log n)/p + log n)?

Even when Tpar(1,n) = 100 Tseq(n) = O(Tseq(n)), it would take at least 200 processors to be as fast as the sequential algorithm
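The 200-processor figure follows directly from the two assumptions above (relative speedup at most 0.5p, and Tpar(1,n) = 100 Tseq(n)):

\[ T_{\mathrm{par}}(p,n) \ge \frac{T_{\mathrm{par}}(1,n)}{0.5\,p} = \frac{200\,T_{\mathrm{seq}}(n)}{p}, \qquad \frac{200\,T_{\mathrm{seq}}(n)}{p} \le T_{\mathrm{seq}}(n) \iff p \ge 200 \]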

Page 96:

David H. Bailey, "Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers", Supercomputing Review, Aug. 1991, pp. 54-55

Empirical, relative speedup without absolute performance baseline (and comparison to reasonable, sequential algorithm and implementation) is misleading

Page 97:

[Figure: per-processor times Ti and overheads Oi across p processors; Tpar(p,n) vs. Tseq(n); note Tpar(1,n) ≥ Tseq(n)]

If the algorithm is cost-optimal, p*Tpar(p,n) = k*Tseq(n), the speedup becomes imperfect, but is still linear: Sp(n) = p/k

Page 98:

[Figure: per-processor times Ti with overheads Oi and idle time across p processors; Tpar(p,n) vs. Tseq(n); note Tpar(1,n) ≥ Tseq(n)]

NB: This denotes cumulated time (a "profile") over the whole Tpar execution, not a trace. Computation, overhead, and idle time are spread over the whole execution

Page 99:

[Figure: as before, with overhead Oi and idle time marked on further processors]

Page 100:

[Figure: per-processor times Ti with communication/coordination overhead Oi and idle time; Tpar(p,n) vs. Tseq(n)]

Typical overhead is due to communication and coordination. The (smallest) time between coordination periods is called the granularity of the parallel computation

Page 101:

Granularity:
•"Coarse grained" parallel computation/algorithm: the time/number of instructions between coordination intervals (synchronization operations, communication operations) is large (relative to total time or work)
•"Fine grained" parallel computation/algorithm: the time/number of instructions between coordination intervals is small

Page 102:

[Figure: per-processor times Ti with overhead Oi and idle time; Tpar(p,n) vs. Tseq(n)]

Definition: The difference between max Ti(n) and min Ti(n) is the load imbalance. Achieving Ti(n) ≈ Tj(n) for all processors i,j is called load balancing
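A small, hypothetical helper (not part of the lecture code) that computes these quantities from measured per-processor times Ti:

#include <stdio.h>

/* Illustration only: given measured per-processor (busy) times T[0..p-1],
   report Tpar = max Ti, the load imbalance max Ti - min Ti, and the work sum Ti. */
void report_balance(const double *T, int p)
{
    double tmax = T[0], tmin = T[0], sum = 0.0;
    for (int i = 0; i < p; i++) {
        if (T[i] > tmax) tmax = T[i];
        if (T[i] < tmin) tmin = T[i];
        sum += T[i];
    }
    printf("Tpar = %.3f, imbalance = %.3f, work = %.3f\n", tmax, tmax - tmin, sum);
}

int main(void)
{
    double T[4] = { 1.0, 1.2, 0.9, 1.4 };   /* hypothetical measured times */
    report_balance(T, 4);
    return 0;
}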

Page 103:

[Figure: per-processor times Ti with overhead Oi; Tpar(p,n) close to Tseq(n)/p]

The best parallelization has no load imbalance (and no overhead), so Tpar(p,n) = Tseq(n)/p. This is the best possible parallel time

Page 104:

[Figure: Tseq(n) = (s+r)*Tseq(n), split into a sequential part s*Tseq(n) and a parallelizable part r*Tseq(n) over p processors]

Sequential overhead as a constant fraction: Tpar(p,n) ≥ s*Tseq(n) + r*Tseq(n)/p

The maximum speedup becomes severely limited, e.g., if overheads are sequential and a constant fraction

Page 105:

Amdahl's Law (parallel version): Let a program A contain a fraction r that can be "perfectly" parallelized, and a fraction s = (1-r) that is "purely sequential", i.e., cannot be parallelized at all. The maximum achievable speedup is 1/s, independently of n

Proof:

Tseq(n) = (s+r)*Tseq(n)

Tpar(p,n) = s*Tseq(n) + r*Tseq(n)/p

Sp(n) = Tseq(n)/(s*Tseq(n)+r*Tseq(n)/p) = 1/(s+r/p) -> 1/s, for p -> ∞

G. Amdahl: Validity of the single processor approach to achieving large scale computing capabilities. AFIPS 1967
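A minimal numeric check of the bound; the function and the chosen s = 0.1 are illustrative only, not from the slides:

#include <stdio.h>

/* Hypothetical helper (not lecture code): the Amdahl bound S(p) = 1/(s + (1-s)/p)
   for a sequential fraction s. */
static double amdahl_speedup(double s, int p)
{
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void)
{
    double s = 0.1;                          /* assumed sequential fraction */
    for (int p = 1; p <= 1024; p *= 4)
        printf("p=%4d  S(p)=%.2f\n", p, amdahl_speedup(s, p));
    /* S(p) approaches 1/s = 10 as p grows */
    return 0;
}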

Page 106:

Typical victims of Amdahl's law:

•Sequential input/output could be a constant fraction
•Sequential initialization of global data structures
•Sequential processing of "hard-to-parallelize" parts of the algorithm, e.g., shared data structures

Amdahl's law limits speed-up in such cases, if they are a constant fraction of the total time, independent of problem size

Page 107:

Example:

1. Processor 0: read input, some precomputation
2. Processor 0: split the problem into n/p parts, send part i to processor i
3. All processors i: solve part i
4. All processors i: send the partial solution back to processor 0

Typical Amdahl sequential bottleneck: a constant sequential fraction (steps 1 and 2 out of 4) limits the speedup

[Figure annotation: total work 10n, of which 9n is parallelizable]

Amdahl: s = 0.1, speedup at most 10

When interested in the parallel aspects, input/output and problem splitting are often explicitly not measured (see projects)

Page 108:

// Sequential initialization
x = (int*)calloc(n, sizeof(int));
…
// Parallelizable part
do {
  for (i=0; i<n; i++) {
    x[i] = f(i);
  }
  // check for convergence
  done = …;
} while (!done);

Example: K iterations before convergence, (parallel) convergence check cheap, f(i) fast, O(1)…

Tseq(n) = n+K+Kn, Tpar(p,n) = n+K+Kn/p

Sequential fraction ≈ 1/(1+K)

Sp(n) -> 1+K

Page 109:

// Sequential initialization
x = (int*)malloc(n*sizeof(int));
…
// Parallelizable part
do {
  for (i=0; i<n; i++) {
    x[i] = f(i);
  }
  // check for convergence
  done = …;
} while (!done);

Example: K iterations before convergence, (parallel) convergence check cheap, f(i) fast, O(1)…

Tseq(n) = 1+K+Kn, Tpar(p,n) = 1+K+Kn/p

Sequential part ≈ 1/(1+n)

Sp(n) -> p when n > p, n -> ∞

Note: If the sequential part is constant (but not a constant fraction), Amdahl's law does not limit the speedup

Page 110:

[Same code and analysis as the previous slide]

Tseq(n) = 1+K+Kn, Tpar(p,n) = 1+K+Kn/p, sequential part ≈ 1/(1+n), Sp(n) -> p when n > p, n -> ∞

Lesson: Be careful with system functions (calloc, malloc): calloc zero-initializes all n elements and thus adds a sequential Θ(n) part, whereas malloc does not

Page 111:

Avoiding Amdahl: Scaled speedup

The sequential, strictly non-parallelizable part is often not a constant fraction of the total execution time (number of instructions). The sequential part may be constant, or grow only slowly with the problem size n. Thus, to maintain good speedup, the problem size should be increased with p.

Assume Tseq(n) = t(n)+T(n) with sequential part t(n) and perfectly parallelizable part T(n), and assume t(n)/T(n) -> 0 for n -> ∞. Then Tpar(p,n) = t(n)+T(n)/p

Page 112:

Speed-up as a function of p and n:

Sp(n) = (t(n)+T(n))/(t(n)+T(n)/p) = (t(n)/T(n)+1)/(t(n)/T(n)+1/p) -> 1/(1/p) = p, for n -> ∞

Lesson: Depending on how fast t(n)/T(n) converges, linear speed-up can be achieved by increasing the problem size n

Definition: Speedup as a function of p and n, with sequential and parallelizable times t(n) and T(n), is called scaled speed-up
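An assumed instance (not from the slides): with t(n) = √n and T(n) = n,

\[ \frac{t(n)}{T(n)} = \frac{1}{\sqrt{n}} \to 0, \qquad S_p(n) = \frac{\sqrt{n}+n}{\sqrt{n}+n/p} \to p \quad \text{for } n \to \infty \]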

Page 113:

Definition: The efficiency of a parallel algorithm is the ratio of the best possible parallel time to the actual parallel time for given p and n: E(p,n) = (Tseq(n)/p)/Tpar(p,n) = Sp(n)/p = Tseq(n)/(p*Tpar(p,n)), where p*Tpar(p,n) is the cost

Remarks:
•E(p,n) ≤ 1, since Sp(n) = Tseq(n)/Tpar(p,n) ≤ p
•E(p,n) = constant: linear speedup
•Cost-optimal algorithms have constant efficiency

Page 114:

Scalability

Definition: A parallel algorithm/implementation is strongly scaling if Sp(n) = Θ(p) (linear, independent of (sufficiently large) n)

Definition: A parallel algorithm/implementation is weakly scaling if there is a slow-growing function f(p), such that for n = Ω(f(p)), E(p,n) remains constant. The function f is called the iso-efficiency function

"Maintain efficiency by increasing the problem size as f(p) or more"

J. Gustafson: Reevaluating Amdahl's Law. CACM 1988
Ananth Grama, Anshul Gupta, Vipin Kumar: Isoefficiency: measuring the scalability of parallel algorithms and architectures. IEEE Transactions Par. Dist. Computing, 1(3): 12-21 (1993)

Page 115:

[Same code as above: sequential malloc initialization followed by the parallelizable iteration loop]

Example: Assume the (parallel) convergence check takes O(log p) time

Tpar(p,n) = Kn/p + K log p

E(p,n) ≈ Kn/(Kn + pK log p)

For n ≥ p log p, E(p,n) ≥ 1/2

Weakly scalable: n has to increase as O(p log p) to maintain constant efficiency, i.e., O(log p) per processor (if balanced)
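Checking the stated threshold by substitution:

\[ n = p \log p: \qquad E(p,n) = \frac{K\,p\log p}{K\,p\log p + K\,p\log p} = \frac{1}{2} \]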

Page 116:

Examples (Tpar, speedup, optimality, efficiency), for a linear-time computation, Tseq(n) = n (constants ignored):

1. Tpar0(p,n) = n/p + 1 (embarrassingly "data parallel" computation, constant overhead)
2. Tpar1(p,n) = n/p + log p (logarithmic overhead, e.g., convergence check)
3. Tpar2(p,n) = n/p + log² p
4. Tpar3(p,n) = n/p + √p
5. Tpar4(p,n) = n/p + p (linear overhead, e.g., data exchange)

These are typical, good, work-optimal parallelizations
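A small C sketch (illustration only, not lecture code) that tabulates the corresponding speedups Sp(n) = Tseq(n)/Tpar(p,n) with Tseq(n) = n; n = 128 and base-2 logarithms are assumed, roughly what the plots on the following pages show:

#include <stdio.h>
#include <math.h>

/* Speedup Sp = Tseq/Tpar for the five model running times above,
   with Tseq(n) = n and constants ignored. */
int main(void)
{
    double n = 128.0;                        /* problem size assumed in the first plots */
    for (int p = 1; p <= 256; p *= 2) {
        double lg = log2((double)p);
        double tpar[5] = {
            n / p + 1.0,                     /* constant overhead */
            n / p + lg,                      /* logarithmic overhead */
            n / p + lg * lg,
            n / p + sqrt((double)p),
            n / p + (double)p                /* linear overhead */
        };
        printf("p=%3d", p);
        for (int i = 0; i < 5; i++)
            printf("  S%d=%6.1f", i, n / tpar[i]);
        printf("\n");
    }
    return 0;
}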

Page 117:

[Plot omitted: n=128]

Page 118:

Cost-optimality:

1. Tpar0(p,n) = n/p + 1: p*Tpar(p,n) = n + p = O(n) for p = O(n)
2. Tpar1(p,n) = n/p + log p: p*Tpar(p,n) = n + p log p = O(n) for p log p = O(n)
3. Tpar2(p,n) = n/p + log² p: p*Tpar(p,n) = n + p log² p = O(n) for p log² p = O(n)
4. Tpar3(p,n) = n/p + √p: p*Tpar(p,n) = n + p√p = O(n) for p√p = O(n)
5. Tpar4(p,n) = n/p + p: p*Tpar(p,n) = n + p² = O(n) for p² = O(n)

Page 119:

[Plot omitted: n=128]

Page 120:

[Plot omitted: n=128, but p up to 256]

Page 121:

[Plot omitted: n=16384 (=16K=128²)]

Page 122:

[Plot omitted: n=2097152 (=2M=128³)]

Page 123:

[Plot omitted: n=128]

Page 124:

[Plot omitted: n=16384 (=16K=128²)]

Page 125:

[Plot omitted: n=2097152 (=2M=128³)]

Page 126:

Efficiency, weak scaling

To maintain constant efficiency e = Tseq(n)/(p*Tpar(p,n)), n has to increase as:

1. Tpar0(p,n) = n/p + 1: f0(p) = e/(1-e) * p
2. Tpar1(p,n) = n/p + log p: f1(p) = e/(1-e) * p log p
3. Tpar2(p,n) = n/p + log² p: f2(p) = e/(1-e) * p log² p
4. Tpar3(p,n) = n/p + √p: f3(p) = e/(1-e) * p√p
5. Tpar4(p,n) = n/p + p: f4(p) = e/(1-e) * p²
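The iso-efficiency functions f_i(p) follow from one line of algebra; with Tpar(p,n) = n/p + o(p) for an overhead term o(p), and Tseq(n) = n:

\[ e = \frac{n}{n + p\,o(p)} \iff n = \frac{e}{1-e}\,p\,o(p), \qquad \text{e.g., } o(p) = \log p \;\Rightarrow\; f_1(p) = \frac{e}{1-e}\,p\log p \]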

Page 127:

[Plot omitted: maintained efficiency e = 0.9 (90%)]

Page 128:

Matrix-vector multiplication parallelizations

Tseq(n) = n²

Tpar0(p,n) = n²/p + n
Tpar1(p,n) = n²/p + n + log p
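A sketch of where the two terms come from (hypothetical code, not from the lecture): with a row-block distribution of y = A*x, each of the p processors performs about n²/p multiply-adds, and handling the length-n vector x (and collecting y) adds work linear in n:

#include <stdio.h>
#include <stddef.h>

/* Hypothetical per-processor share of y = A*x (row-block distribution):
   the processor owning rows lo..hi-1 does (hi-lo)*n, i.e. about n*n/p,
   multiply-adds; reading the length-n vector x and collecting y accounts
   for the additional term linear in n. */
void matvec_rows(const double *A, const double *x, double *y,
                 int n, int lo, int hi)
{
    for (int r = lo; r < hi; r++) {
        double sum = 0.0;
        for (int c = 0; c < n; c++)
            sum += A[(size_t)r * n + c] * x[c];
        y[r] = sum;
    }
}

int main(void)
{
    double A[2 * 2] = { 1, 2, 3, 4 }, x[2] = { 1, 1 }, y[2];
    matvec_rows(A, x, y, 2, 0, 2);           /* one "processor" doing all rows */
    printf("%g %g\n", y[0], y[1]);           /* prints 3 7 */
    return 0;
}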

Page 129:

[Plot omitted: n=100]

Page 130:

[Plot omitted: n=1000]

Page 131:

Some non work-optimal parallel algorithms

Amdahl case, linear sequential running time:

TparA(p,n) = 0.9n/p+0.1n

1. Tseq(n) = n log n, Tpar(p,n) = n²/p + 1

2. Tseq(n) = n, Tpar(p,n) = (n log n)/p + 1
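The resulting absolute speedups, ignoring the +1 terms:

\[ 1.\;\; S_p(n) = \frac{n\log n}{n^2/p} = \frac{p\log n}{n} \qquad\qquad 2.\;\; S_p(n) = \frac{n}{(n\log n)/p} = \frac{p}{\log n} \]

Both grow linearly in p for fixed n, but degrade as n grows, much faster in case 1 than in case 2.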

Page 132:

[Plot omitted: n1 = 128, n2 = 16384]

Page 133:

Limitations of empirical speedup measure

Empirical speedup assumes that Tseq(n) can be measured. For very large n and p, this may not be the case: a large HPC system has much more (distributed) main memory than any single-processor system. Scalability is then measured by other means:
•Stepwise speedup (1-1000 processors, 1000-100,000 processors)
•Other notions of efficiency

Page 134:

Lecture summary, checklist

•Models of parallel computation: architecture, programming, cost
•Flynn's taxonomy: MIMD, SIMD; SPMD
•Sequential baseline
•Speedup (in theory and practice)
•Work, cost optimality
•Amdahl's law
•Scaled speed-up, strong and weak scaling