
Page 1:

COMP 322 Lecture 3 1 September 2009

COMP 322: Principles of Parallel Programming

Lecture 3: Reasoning about Performance (Chapter 3)

Fall 2009 http://www.cs.rice.edu/~vsarkar/comp322

Vivek Sarkar Department of Computer Science

Rice University [email protected]

Page 2:

Summary of Last Lecture

•  Parallel Algorithms for:
   — Prefix Sum: T1 = O(N), TN = O(log N)
   — Quicksort: T1 = O(N log N), TN = O(log² N)
•  Upper and lower bounds for greedy schedulers
   — max(T1/P, T∞) ≤ TP ≤ T1/P + T∞
•  Amdahl's Law
   — Speedup(P) = T1/TP ≤ P / (fPAR + P * fSEQ)
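Both bounds are easy to evaluate numerically. A minimal Python sketch (illustrative, not part of the original slides):

    def greedy_bounds(t1, t_inf, p):
        # Any greedy scheduler satisfies max(T1/P, Tinf) <= TP <= T1/P + Tinf.
        return max(t1 / p, t_inf), t1 / p + t_inf

    def amdahl_speedup(f_seq, p):
        # Speedup(P) = T1/TP <= P / (fPAR + P * fSEQ), with fPAR = 1 - fSEQ.
        f_par = 1.0 - f_seq
        return p / (f_par + p * f_seq)

    # Example: T1 = 1000, Tinf = 10, P = 16 => TP lies between 62.5 and 72.5.
    print(greedy_bounds(1000.0, 10.0, 16))
    # Example: a 10% sequential fraction caps speedup near 10 no matter how large P gets.
    print(amdahl_speedup(0.10, 1024))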

Page 3:

Acknowledgments for Today's Lecture

•  "Scaling to Petascale: Concepts & Beyond", Thomas Sterling, LSU, August 3, 2009
•  "CS380P: Parallel Systems", course lectures by Prof. Calvin Lin, UT Austin, Spring 2009
   — http://www.cs.utexas.edu/users/lin/cs380p/
•  Course text: "Principles of Parallel Programming", Calvin Lin & Lawrence Snyder
•  "Introduction to Parallel Computing", 2nd Edition, Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar, Addison-Wesley, 2003
•  COMP 422 lectures, Spring 2008 — http://www.cs.rice.edu/~vsarkar/comp422
•  COMP 515 lectures, Spring 2009 — http://www.cs.rice.edu/~vsarkar/comp515

Page 4:

Example 3: Parallelizing QuickSort

procedure QUICKSORT(S) {
  if S contains at most one element then
    return S
  else {
    choose an element a randomly from S;
    // Opportunity 1: Parallel Partition
    let S1, S2 and S3 be the sequences of elements in S
      less than, equal to, and greater than a, respectively;
    // Opportunity 2: Parallel Calls
    return (QUICKSORT(S1) followed by S2 followed by QUICKSORT(S3))
  } // else
} // procedure
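A runnable sketch of the same idea in Python (an illustration, not the course's programming model), exploiting only Opportunity 2 by running the two recursive calls as separate tasks; the partition step is kept sequential for brevity. Note that CPython threads will not speed up this CPU-bound code; the point is only to show where the parallelism lives.

    import random
    from concurrent.futures import ThreadPoolExecutor

    def quicksort(s, executor=None, depth=0):
        if len(s) <= 1:
            return list(s)
        a = random.choice(s)                  # random pivot
        s1 = [x for x in s if x < a]          # Opportunity 1 (left sequential here)
        s2 = [x for x in s if x == a]
        s3 = [x for x in s if x > a]
        # Opportunity 2: the two recursive calls are independent, so they can run
        # in parallel. The spawning depth is capped so the pool (16 workers below)
        # always has free workers and the blocking result() calls cannot deadlock.
        if executor is not None and depth < 3:
            f1 = executor.submit(quicksort, s1, executor, depth + 1)
            f3 = executor.submit(quicksort, s3, executor, depth + 1)
            return f1.result() + s2 + f3.result()
        return quicksort(s1) + s2 + quicksort(s3)

    if __name__ == "__main__":
        data = [random.randint(0, 999) for _ in range(10000)]
        with ThreadPoolExecutor(max_workers=16) as ex:
            assert quicksort(data, ex) == sorted(data)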

Page 5:

Approach 3: parallel partition, parallel calls

Depth = O(lg n), and each stage takes O(lg n) parallel time, so the overall span is O(lg² n)
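One way to see the bound (a sketch, not from the original slide): with a parallel partition, the work at each level of recursion has span O(lg n), and a random pivot keeps the recursion depth at O(lg n) with high probability, so the span satisfies roughly T∞(n) = T∞(n/2) + O(lg n), which solves to O(lg² n).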

Page 6:

Non-Parallelizable Code

•  Non-parallelizable code = code that is inherently sequential or limited to a small degree of parallelism
•  Sources of non-parallelizable code include
   — Fraction of sequential code (Amdahl's Law)
   — Dependences in the computation graph
   — Large critical path length, T∞
•  Mitigate by removing dependences
   — So as to reduce the critical path length, T∞

Page 7:

Data Dependences (pg. 68)

•  Simple example of a data dependence:
   S1  PI = 3.14
   S2  R = 5.0
   S3  AREA = PI * R ** 2
•  Statement S3 cannot be executed in parallel with either S1 or S2 without compromising correct results

Page 8:

Classification of Data Dependences

•  Formally: there is a data dependence from statement S1 to statement S2 (S2 depends on S1) if:
   1. Both statements access the same memory location and at least one of them stores onto it, and
   2. There is a feasible run-time execution path from S1 to S2
•  True dependence (read-after-write hazard)
   — S1 writes and S2 reads
•  Anti dependence (write-after-read hazard)
   — S1 reads and S2 writes
•  Output dependence (write-after-write hazard)
   — S1 writes and S2 writes
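A small Python fragment (hypothetical, added for illustration) showing one instance of each class:

    a, b = 1.0, 2.0
    x = a + 1      # S1 writes x
    y = x * 2      # S2 reads x  -> true dependence on S1 (read-after-write)
    z = y + b      # S3 reads y
    y = 0.0        # S4 writes y -> anti dependence from S3 to S4 (write-after-read)
    w = a * 2      # S5 writes w
    w = b * 3      # S6 writes w -> output dependence from S5 to S6 (write-after-write)

Reordering any such pair, or running it in parallel, can change the values observed, which is why these pairs must either be preserved or removed (e.g., by renaming, next slide).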

Page 9:

Removing False (Anti / Output) Dependences by Renaming

Before renaming:
  1.  sum = a + 1;
  2.  first_term = sum * scale1;
  3.  sum = b + 1;
  4.  second_term = sum * scale2;

After renaming sum to first_sum and second_sum:
  1.  first_sum = a + 1;
  2.  first_term = first_sum * scale1;
  3.  second_sum = b + 1;
  4.  second_term = second_sum * scale2;

After renaming, statements 1-2 can execute in parallel with statements 3-4, since only the true dependences remain.

Page 10:

Latency and Throughput (pp. 62 – 63)

•  Latency = amount of time it takes to complete a given unit of work
•  Throughput = amount of work that can be completed per unit time
   — Throughput is also referred to as bandwidth, especially when the work involves data transfer
•  Little's Law
   — A system must provide Parallelism ≥ Latency * Throughput to fully utilize the available throughput (bandwidth)
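For instance (hypothetical numbers, added for illustration): a memory system with 100 ns access latency and 10 GB/s of sustained bandwidth moving 64-byte cache lines needs roughly Latency * Throughput = 100 ns * (10 GB/s ÷ 64 B) ≈ 16 outstanding cache-line requests to keep the bandwidth fully utilized; with fewer concurrent requests the memory system sits idle part of the time.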

Page 11:

Bandwidth vs. Latency in a Pipeline

•  In this example:
   — Sequential execution takes 4 * 90 min = 6 hours
   — Pipelined execution takes 30 + 4*40 + 20 = 210 min = 3.5 hours
•  Bandwidth = loads/hour
   — BW = 4/6 loads/hour without pipelining
   — BW = 4/3.5 loads/hour with pipelining
   — BW → 1.5 loads/hour with pipelining, as total loads → ∞
•  Pipelining helps bandwidth but not latency (still 90 min per load)
•  Bandwidth is limited by the slowest pipeline stage (40 min)
•  Little's Law: need at least 3 stages (Parallelism ≥ 1.5 hours * 1.5 loads/hour = 2.25)

[Figure: Gantt chart of four laundry loads A, B, C, D starting at 6 PM, with task order on one axis and time (6 PM to 9 PM) on the other; the pipelined stage times are 30, 40, 40, 40, 40, 20 minutes]

Dave Patterson’s Laundry example: 4 people doing laundry

wash (30 min) + dry (40 min) + fold (20 min) = 90 min
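A small Python sketch (illustrative, not from the slides) that reproduces these numbers:

    def pipelined_minutes(stages, n_loads):
        # The first load flows through every stage; each additional load is
        # delayed by one bottleneck-stage time.
        return sum(stages) + (n_loads - 1) * max(stages)

    stages = [30, 40, 20]                      # wash, dry, fold (minutes)
    print(sum(stages) * 4 / 60)                # sequential: 6.0 hours
    print(pipelined_minutes(stages, 4) / 60)   # pipelined: 3.5 hours
    print(60 / max(stages))                    # steady-state bandwidth: 1.5 loads/hour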

Page 12:

Sources of Performance Loss in Real Parallel Machines (pp. 64 – 66)

•  Our discussion of computation graphs and parallel algorithms thus far assumed ideal abstract parallel machines
   — Reasoning about ideal execution time helps remove algorithmic dependences that can lead to insufficient parallelism and processor starvation
•  However, real parallel machines exhibit additional sources of performance loss due to latency, contention, and overhead
•  These can be mitigated by paying attention to the three dimensions in the "DIG" acronym when mapping from ideal parallelism to useful parallelism:
   1.  Increase Data locality in all computations
       — Ideal machine models assume that a memory access is a constant-time operation; however, the latency of a memory access can vary by multiple orders of magnitude in real machines. Increasing data locality helps bridge this gap.
   2.  Decrease load Imbalance across computation, memory, & communication resources
       — Ideal machine models ignore sources of contention in real machines, e.g., from two virtual channels mapped to the same physical link or two variables mapped to the same memory module. Redistributing requests across physical resources helps bridge this gap.
   3.  Increase Granularity of computation and communication
       — Ideal machine models ignore the large overheads involved in scheduling tasks and communicating data. Increasing the granularity of tasks and data transfers helps bridge this gap.

Page 13:

The Memory Wall

[Figure: memory access time and CPU time plotted over the years; the ratio between them keeps growing, hitting "THE WALL"]

Page 14:

SMP Node Diagram

[Diagram: an SMP node with two processor chips, each containing two microprocessors (MP) with private L1 and L2 caches and a shared L3; a controller connecting the chips to memory banks M1 … Mn-1; NICs for Ethernet and PCI-e; USB peripherals; and a JTAG port]

Legend: MP = MicroProcessor; L1, L2, L3 = caches; M1.. = memory banks; S = storage; NIC = Network Interface Card

Page 15:

Levels of the Memory Hierarchy

Capacity, access time, and cost at each level, with the staging transfer unit that moves data between levels:

•  CPU Registers: 100s of bytes, < 1 ns
   — staging unit: instruction operands, 1-8 bytes, managed by the program/compiler
•  Cache: 10s-100s KBytes, 1-10 ns, $10/MByte
   — staging unit: blocks, 8-128 bytes, managed by the cache controller
•  Main Memory: MBytes, 100 ns - 300 ns, $1/MByte
   — staging unit: pages, 512-4K bytes, managed by the OS
•  Disk: 10s of GBytes, 10 ms (10,000,000 ns), $0.0031/MByte
   — staging unit: files, MBytes, managed by the user/operator
•  Tape: infinite capacity, sec-min access time, $0.0014/MByte

Upper levels are faster; lower levels are larger.

Copyright 2001, UCB, David Patterson

Page 16:

Cache Performance

T = total execution time
Tcycle = time for a single processor cycle
Icount = total number of instructions
IALU = number of ALU instructions (e.g., register-register)
IMEM = number of memory access instructions (e.g., load, store)
CPI = average cycles per instruction
CPIALU = average cycles per ALU instruction
CPIMEM = average cycles per memory instruction
rmiss = cache miss rate
rhit = cache hit rate
CPIMEM-MISS = cycles per cache miss
CPIMEM-HIT = cycles per cache hit
MALU = instruction mix for ALU instructions
MMEM = instruction mix for memory access instructions
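These quantities combine in the standard way (a reconstruction in the slide's notation; the exact form on the original slide may differ in detail):

    T = Icount * CPI * Tcycle
    CPI = MALU * CPIALU + MMEM * CPIMEM
    CPIMEM = rhit * CPIMEM-HIT + rmiss * CPIMEM-MISS

where MALU = IALU / Icount and MMEM = IMEM / Icount, so MALU + MMEM = 1.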

Page 17:

Cache Performance: Example

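A worked instance of the formula above, in Python, with made-up numbers chosen purely for illustration (these are not the values from the original slide):

    icount  = 1.0e9                 # instructions
    m_alu, m_mem = 0.7, 0.3         # instruction mix
    cpi_alu = 1.0
    cpi_mem_hit, cpi_mem_miss = 1.0, 100.0
    r_miss  = 0.05
    t_cycle = 0.5e-9                # seconds per cycle (2 GHz clock)

    cpi_mem = (1.0 - r_miss) * cpi_mem_hit + r_miss * cpi_mem_miss   # 5.95
    cpi     = m_alu * cpi_alu + m_mem * cpi_mem                      # 2.485
    t       = icount * cpi * t_cycle                                 # ~1.24 seconds

Halving the miss rate to 0.025 drops CPIMEM to about 3.5 and the total time to roughly 0.87 seconds, which is the quantitative sense in which locality improvements pay off.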

Page 18:

Performance: Locality

•  Temporal Locality is the property that if a program accesses a memory location, there is a much higher than random probability that the same location will be accessed again.
•  Spatial Locality is the property that if a program accesses a memory location, there is a much higher than random probability that nearby locations will be accessed soon.
•  A couple of key factors affect the relationship between locality and scheduling:
   — Size of the dataset being processed by each processor
   — How much reuse is present in the code processing a chunk of iterations
   These factors were ignored in our idealized model of computation graph scheduling.
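A small sketch of the access-pattern difference (Python, illustrative; in a language with contiguous row-major arrays such as C, or with NumPy, the effect on cache behavior is much more pronounced):

    N = 1024
    a = [[1.0] * N for _ in range(N)]

    # Row-wise traversal: consecutive accesses touch neighbouring elements of the
    # same row, so data brought in for one access is reused (spatial locality).
    total = 0.0
    for i in range(N):
        for j in range(N):
            total += a[i][j]

    # Column-wise traversal: consecutive accesses jump from row to row, so each
    # access tends to touch a different region of memory (poor spatial locality).
    total = 0.0
    for j in range(N):
        for i in range(N):
            total += a[i][j]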

Page 19:

Idleness

•  Idleness = the state of a processor when it cannot find any useful work to execute
•  Sources of idleness include:
   — Load imbalance that prevents the workload from being "infinitely divisible"
      –  Mitigate by reducing load imbalance
   — Memory-bound computations
      –  Mitigate by increasing locality
      –  Mitigate by overlapping computation with memory latency

Page 20:

Contention

•  Contention = degradation of system performance caused by competition for a shared resource
   — Impact increases with the number of processors
   — The shared resource is often called a serialization bottleneck
•  Sources of contention include:
   — Acquiring and releasing a single lock (lock contention)
   — Acquiring and releasing a single cache line (cache contention)
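A minimal illustration in Python (hypothetical, not from the slides) of the lock-contention case and one way to reduce it:

    import threading

    counter = 0
    lock = threading.Lock()

    def contended(n):
        # Every thread competes for the same lock (and for the memory holding
        # `counter`): a serialization bottleneck that worsens as threads are added.
        global counter
        for _ in range(n):
            with lock:
                counter += 1

    def uncontended(partials, idx, n):
        # Each thread accumulates privately; the partial sums are combined once
        # at the end, so there is no shared resource on the hot path.
        total = 0
        for _ in range(n):
            total += 1
        partials[idx] = total

With the uncontended version, the main thread simply sums the partials after joining the worker threads.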

Page 21:

Cache Coherence on Bus-Based Machines (pp. 34 – 36)

Page 22:

Prefetching and Multithreading Approaches for Hiding Memory Latency

•  Consider the problem of browsing the web on a very slow network connection. We can deal with the problem in one of two possible ways:
   — we anticipate which pages we are going to browse ahead of time and issue requests for them in advance; or
   — we open multiple browsers and access different pages in each browser, so that while we are waiting for one page to load, we can be reading others.
•  The first approach is called prefetching; the second, multithreading.

Page 23:

Overhead

•  Overhead = any cost that gets added to a sequential computation so as to enable it to run in parallel
•  Sources of overhead include
   — Communication: can be explicit via messages, or implicit via a memory hierarchy (caches), e.g., transmission delay, data marshalling & demarshalling
   — Synchronization: extra processing to ensure that dependences in the computation graph are satisfied
   — Computation: extra work added to obtain a parallel algorithm
   — Memory: extra memory used to obtain a parallel algorithm
   — Task creation and termination: extra processing performed at the start and end of each task, e.g., a forall iteration or an async statement
•  For simplicity, we assume that all overhead can be executed in parallel

Page 24:

Overhead --- Mitigate by Increasing Task Granularity

[Figure: a unit of work w together with the overhead v paid to schedule it]

v = overhead
w = work unit
W = total work
Ti = execution time with i processors
P = # processors
S = speedup

Assumption: the workload is infinitely divisible

Implication: for a given P, task granularity wi = W/P > v is a necessary condition to obtain speedup S > P/2

Larger overhead v ⇒ increase the task granularity wi
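One way to derive the implication (a sketch in the slide's notation, assuming the work splits evenly and each task's overhead v is paid in parallel): TP = W/P + v, so S = T1/TP = W / (W/P + v) = P / (1 + P*v/W). Then S > P/2 exactly when P*v/W < 1, i.e., when the task granularity wi = W/P exceeds the overhead v.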

Page 25:

Scalable Performance: Key Terms and Concepts

•  Scalable Speedup: Relative reduction of execution time of a fixed size workload through parallel execution (ideally = N, but is < N in practice)

•  Scalable Efficiency : Ratio of the actual performance to the best possible performance (ideally = 1, but is < 1 in practice)

Efficiency = execution_time_on_one_processor / (execution_time_on_multiple_processors × number_of_processors)

Page 26:

Example with Overhead (pg. 82)

•  Consider a problem with sequential execution time TS that incurs 0.2 TS of overhead per processor when executed in parallel
•  Therefore
   — T1 = TS
   — T2 = TS/2 + 0.2 TS = 0.7 TS
   — T10 = TS/10 + 0.2 TS = 0.3 TS
   — T100 = TS/100 + 0.2 TS = 0.21 TS
•  Speedup(1) = 1, Efficiency(1) = 1
•  Speedup(2) = 1/0.7 = 1.43, Efficiency(2) = 1.43/2 = 0.71
•  Speedup(10) = 1/0.3 = 3.33, Efficiency(10) = 3.33/10 = 0.33
•  Speedup(100) = 1/0.21 = 4.76, Efficiency(100) = 4.76/100 = 0.047
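These numbers are easy to regenerate; a short Python sketch (illustrative, not from the slides):

    def parallel_time(ts, p, v=0.2):
        # TP = TS/P + v*TS for P > 1: each processor incurs overhead v*TS, and
        # the overheads execute in parallel, so only one v*TS term adds to TP.
        return ts if p == 1 else ts / p + v * ts

    for p in (1, 2, 10, 100):
        tp = parallel_time(1.0, p)          # measure time in units of TS
        speedup = 1.0 / tp
        print(p, tp, round(speedup, 2), round(speedup / p, 3))
    # Matches the slide: Speedup(2) = 1.43, Speedup(10) = 3.33, Speedup(100) = 4.76.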

Page 27:

Another Example with Overhead

•  Consider a problem with sequential execution time TS(N), a function of input size N, and assume a fixed overhead of TOVHD per processor when executed in parallel
•  Therefore
   — T1(N) = TS(N)
   — TP(N) = TS(N)/P + TOVHD, for P > 1
   — Speedup(P) = T1(N) / TP(N) = TS(N) / (TS(N)/P + TOVHD)
   — Efficiency(P) = Speedup(P) / P = TS(N) / (TS(N) + P*TOVHD)
•  Half-performance metric
   — N1/2 = input size that achieves Efficiency(P) = 0.5 for a given P
   — A larger value of N1/2 indicates that the problem is harder to parallelize efficiently
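To see where N1/2 comes from (a sketch, not from the original slide): setting Efficiency(P) = TS(N) / (TS(N) + P*TOVHD) = 0.5 gives TS(N1/2) = P * TOVHD. For example, if TS(N) = c*N for some constant c (a hypothetical workload), then N1/2 = P * TOVHD / c, which grows linearly with P: doubling the processor count doubles the problem size needed to stay at 50% efficiency.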

Page 28:

Strong Scaling, Weak Scaling

[Figure: total problem size and granularity (size / node) plotted against machine scale (# of nodes). Under strong scaling the total problem size stays fixed, so the granularity per node shrinks as nodes are added; under weak scaling the granularity per node stays fixed, so the total problem size grows with the machine.]

Page 29:

Summary of Today's Lecture

Key concepts:
•  Latency
•  Throughput / Bandwidth
•  Little's Law
•  DIG acronym
   1.  Increase Data locality in all computations
       –  Addresses idleness arising from large memory access latencies
   2.  Decrease load Imbalance across computation, memory, & communication resources
       –  Addresses contention in physical resources
   3.  Increase Granularity of computation and communication
       –  Addresses overheads in real machines

Page 30:

HOMEWORK #1 (Written Assignment)

1.  Exercise 6, Chapter 3, page 85
2.  Analyze the speedup, efficiency, and half-performance metric of Parallel Quicksort as a function of N and P.

Due Date: In class on Thursday, Sep 10th