
Page 1:

COMP 322 Lecture 3 1 September 2009

COMP 322: Principles of Parallel Programming

Lecture 3: Reasoning about Performance (Chapter 3)

Fall 2009 http://www.cs.rice.edu/~vsarkar/comp322

Vivek Sarkar Department of Computer Science

Rice University [email protected]

Page 2:

Summary of Last Lecture

•  Parallel Algorithms for:
   — Prefix Sum: T1 = O(N), TN = O(log N)
   — Quicksort: T1 = O(N log N), TN = O(log² N)
•  Upper and lower bounds for greedy schedulers
   — max(T1/P, T∞) ≤ TP ≤ T1/P + T∞
•  Amdahl's Law
   — Speedup(P) = T1/TP ≤ P / (fPAR + P * fSEQ)
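Both bounds are easy to evaluate numerically. A minimal Python sketch (illustrative, not part of the original slides):

    def greedy_bounds(t1, t_inf, p):
        # Any greedy scheduler satisfies max(T1/P, Tinf) <= TP <= T1/P + Tinf.
        return max(t1 / p, t_inf), t1 / p + t_inf

    def amdahl_speedup(f_seq, p):
        # Speedup(P) = T1/TP <= P / (fPAR + P * fSEQ), with fPAR = 1 - fSEQ.
        f_par = 1.0 - f_seq
        return p / (f_par + p * f_seq)

    # Example: T1 = 1000, Tinf = 10, P = 16 => TP lies between 62.5 and 72.5.
    print(greedy_bounds(1000.0, 10.0, 16))
    # Example: a 10% sequential fraction caps speedup near 10 no matter how large P gets.
    print(amdahl_speedup(0.10, 1024))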

Page 3:

Acknowledgments for Today's Lecture

•  "Scaling to Petascale: Concepts & Beyond", Thomas Sterling, LSU, August 3, 2009
•  "CS380P: Parallel Systems", course lectures by Prof. Calvin Lin, UT Austin, Spring 2009
   — http://www.cs.utexas.edu/users/lin/cs380p/
•  Course text: "Principles of Parallel Programming", Calvin Lin & Lawrence Snyder
•  "Introduction to Parallel Computing", 2nd Edition, Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar, Addison-Wesley, 2003
•  COMP 422 lectures, Spring 2008 — http://www.cs.rice.edu/~vsarkar/comp422
•  COMP 515 lectures, Spring 2009 — http://www.cs.rice.edu/~vsarkar/comp515

Page 4:

Example 3: Parallelizing QuickSort

procedure QUICKSORT(S) {
  if S contains at most one element then
    return S
  else {
    choose an element a randomly from S;
    // Opportunity 1: Parallel Partition
    let S1, S2 and S3 be the sequences of elements in S
      less than, equal to, and greater than a, respectively;
    // Opportunity 2: Parallel Calls
    return (QUICKSORT(S1) followed by S2 followed by QUICKSORT(S3))
  } // else
} // procedure
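A runnable sketch of the same idea in Python (an illustration, not the course's programming model), exploiting only Opportunity 2 by running the two recursive calls as separate tasks; the partition step is kept sequential for brevity. Note that CPython threads will not speed up this CPU-bound code; the point is only to show where the parallelism lives.

    import random
    from concurrent.futures import ThreadPoolExecutor

    def quicksort(s, executor=None, depth=0):
        if len(s) <= 1:
            return list(s)
        a = random.choice(s)                  # random pivot
        s1 = [x for x in s if x < a]          # Opportunity 1 (left sequential here)
        s2 = [x for x in s if x == a]
        s3 = [x for x in s if x > a]
        # Opportunity 2: the two recursive calls are independent, so they can run
        # in parallel. The spawning depth is capped so the pool (16 workers below)
        # always has free workers and the blocking result() calls cannot deadlock.
        if executor is not None and depth < 3:
            f1 = executor.submit(quicksort, s1, executor, depth + 1)
            f3 = executor.submit(quicksort, s3, executor, depth + 1)
            return f1.result() + s2 + f3.result()
        return quicksort(s1) + s2 + quicksort(s3)

    if __name__ == "__main__":
        data = [random.randint(0, 999) for _ in range(10000)]
        with ThreadPoolExecutor(max_workers=16) as ex:
            assert quicksort(data, ex) == sorted(data)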

Page 5:

Approach 3: parallel partition, parallel calls

Depth = O(lg n), and each stage takes O(lg n) parallel time, so the overall span is O(lg² n)
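One way to see the bound (a sketch, not from the original slide): with a parallel partition, the work at each level of recursion has span O(lg n), and a random pivot keeps the recursion depth at O(lg n) with high probability, so the span satisfies roughly T∞(n) = T∞(n/2) + O(lg n), which solves to O(lg² n).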

Page 6:

Non-Parallelizable Code

•  Non-parallelizable code = code that is inherently sequential or limited to a small degree of parallelism
•  Sources of non-parallelizable code include
   — Fraction of sequential code (Amdahl's Law)
   — Dependences in the computation graph
   — Large critical path length, T∞
•  Mitigate by removing dependences
   — So as to reduce the critical path length, T∞

Page 7:

Data Dependences (pg. 68)

•  Simple example of a data dependence:
   S1  PI = 3.14
   S2  R = 5.0
   S3  AREA = PI * R ** 2
•  Statement S3 cannot be executed in parallel with either S1 or S2 without compromising correct results

Page 8:

Classification of Data Dependences

•  Formally: there is a data dependence from statement S1 to statement S2 (S2 depends on S1) if:
   1. Both statements access the same memory location and at least one of them stores onto it, and
   2. There is a feasible run-time execution path from S1 to S2
•  True dependence (read-after-write hazard)
   — S1 writes and S2 reads
•  Anti dependence (write-after-read hazard)
   — S1 reads and S2 writes
•  Output dependence (write-after-write hazard)
   — S1 writes and S2 writes
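A small Python fragment (hypothetical, added for illustration) showing one instance of each class:

    a, b = 1.0, 2.0
    x = a + 1      # S1 writes x
    y = x * 2      # S2 reads x  -> true dependence on S1 (read-after-write)
    z = y + b      # S3 reads y
    y = 0.0        # S4 writes y -> anti dependence from S3 to S4 (write-after-read)
    w = a * 2      # S5 writes w
    w = b * 3      # S6 writes w -> output dependence from S5 to S6 (write-after-write)

Reordering any such pair, or running it in parallel, can change the values observed, which is why these pairs must either be preserved or removed (e.g., by renaming, next slide).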

Page 9:

Removing False (Anti / Output) Dependences by Renaming

Before renaming:
  1.  sum = a + 1;
  2.  first_term = sum * scale1;
  3.  sum = b + 1;
  4.  second_term = sum * scale2;

After renaming sum to first_sum and second_sum:
  1.  first_sum = a + 1;
  2.  first_term = first_sum * scale1;
  3.  second_sum = b + 1;
  4.  second_term = second_sum * scale2;

After renaming, statements 1-2 can execute in parallel with statements 3-4, since only the true dependences remain.

Page 10:

Latency and Throughput (pp. 62 – 63)

•  Latency = amount of time it takes to complete a given unit of work
•  Throughput = amount of work that can be completed per unit time
   — Throughput is also referred to as bandwidth, especially when the work involves data transfer
•  Little's Law
   — A system must provide Parallelism ≥ Latency * Throughput to fully utilize the available throughput (bandwidth)
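For instance (hypothetical numbers, added for illustration): a memory system with 100 ns access latency and 10 GB/s of sustained bandwidth moving 64-byte cache lines needs roughly Latency * Throughput = 100 ns * (10 GB/s ÷ 64 B) ≈ 16 outstanding cache-line requests to keep the bandwidth fully utilized; with fewer concurrent requests the memory system sits idle part of the time.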

Page 11:

Bandwidth vs. Latency in a Pipeline

•  In this example:
   — Sequential execution takes 4 * 90 min = 6 hours
   — Pipelined execution takes 30 + 4*40 + 20 = 210 min = 3.5 hours
•  Bandwidth = loads/hour
   — BW = 4/6 loads/hour without pipelining
   — BW = 4/3.5 loads/hour with pipelining
   — BW → 1.5 loads/hour with pipelining, as total loads → ∞
•  Pipelining helps bandwidth but not latency (still 90 min per load)
•  Bandwidth is limited by the slowest pipeline stage (40 min)
•  Little's Law: need at least 3 stages (Parallelism ≥ 1.5 hours * 1.5 loads/hour = 2.25)

[Figure: Gantt chart of four laundry loads A, B, C, D starting at 6 PM, with task order on one axis and time (6 PM to 9 PM) on the other; the pipelined stage times are 30, 40, 40, 40, 40, 20 minutes]

Dave Patterson’s Laundry example: 4 people doing laundry

wash (30 min) + dry (40 min) + fold (20 min) = 90 min
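A small Python sketch (illustrative, not from the slides) that reproduces these numbers:

    def pipelined_minutes(stages, n_loads):
        # The first load flows through every stage; each additional load is
        # delayed by one bottleneck-stage time.
        return sum(stages) + (n_loads - 1) * max(stages)

    stages = [30, 40, 20]                      # wash, dry, fold (minutes)
    print(sum(stages) * 4 / 60)                # sequential: 6.0 hours
    print(pipelined_minutes(stages, 4) / 60)   # pipelined: 3.5 hours
    print(60 / max(stages))                    # steady-state bandwidth: 1.5 loads/hour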

Page 12:

Sources of Performance Loss in Real Parallel Machines (pp. 64 – 66)

•  Our discussion of computation graphs and parallel algorithms thus far assumed ideal abstract parallel machines
   — Reasoning about ideal execution time helps remove algorithmic dependences that can lead to insufficient parallelism and processor starvation
•  However, real parallel machines exhibit additional sources of performance loss due to latency, contention, and overhead
•  These can be mitigated by paying attention to the three dimensions in the "DIG" acronym when mapping from ideal parallelism to useful parallelism:
   1.  Increase Data locality in all computations
       — Ideal machine models assume that a memory access is a constant-time operation; however, the latency of a memory access can vary by multiple orders of magnitude in real machines. Increasing data locality helps bridge this gap.
   2.  Decrease load Imbalance across computation, memory, & communication resources
       — Ideal machine models ignore sources of contention in real machines, e.g., from two virtual channels mapped to the same physical link or two variables mapped to the same memory module. Redistributing requests across physical resources helps bridge this gap.
   3.  Increase Granularity of computation and communication
       — Ideal machine models ignore the large overheads involved in scheduling tasks and communicating data. Increasing the granularity of tasks and data transfers helps bridge this gap.

Page 13:

The Memory Wall

[Figure: memory access time and CPU time plotted over the years; the ratio between them keeps growing, hitting "THE WALL"]

Page 14:

SMP Node Diagram

[Diagram: an SMP node with two processor chips, each containing two microprocessors (MP) with private L1 and L2 caches and a shared L3; a controller connecting the chips to memory banks M1 … Mn-1; NICs for Ethernet and PCI-e; USB peripherals; and a JTAG port]

Legend: MP = MicroProcessor; L1, L2, L3 = caches; M1.. = memory banks; S = storage; NIC = Network Interface Card

Page 15:

Levels of the Memory Hierarchy

Capacity, access time, and cost at each level, with the staging transfer unit that moves data between levels:

•  CPU Registers: 100s of bytes, < 1 ns
   — staging unit: instruction operands, 1-8 bytes, managed by the program/compiler
•  Cache: 10s-100s KBytes, 1-10 ns, $10/MByte
   — staging unit: blocks, 8-128 bytes, managed by the cache controller
•  Main Memory: MBytes, 100 ns - 300 ns, $1/MByte
   — staging unit: pages, 512-4K bytes, managed by the OS
•  Disk: 10s of GBytes, 10 ms (10,000,000 ns), $0.0031/MByte
   — staging unit: files, MBytes, managed by the user/operator
•  Tape: infinite capacity, sec-min access time, $0.0014/MByte

Upper levels are faster; lower levels are larger.

Copyright 2001, UCB, David Patterson

Page 16:

Cache Performance

T = total execution time
Tcycle = time for a single processor cycle
Icount = total number of instructions
IALU = number of ALU instructions (e.g., register-register)
IMEM = number of memory access instructions (e.g., load, store)
CPI = average cycles per instruction
CPIALU = average cycles per ALU instruction
CPIMEM = average cycles per memory instruction
rmiss = cache miss rate
rhit = cache hit rate
CPIMEM-MISS = cycles per cache miss
CPIMEM-HIT = cycles per cache hit
MALU = instruction mix for ALU instructions
MMEM = instruction mix for memory access instructions
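These quantities combine in the standard way (a reconstruction in the slide's notation; the exact form on the original slide may differ in detail):

    T = Icount * CPI * Tcycle
    CPI = MALU * CPIALU + MMEM * CPIMEM
    CPIMEM = rhit * CPIMEM-HIT + rmiss * CPIMEM-MISS

where MALU = IALU / Icount and MMEM = IMEM / Icount, so MALU + MMEM = 1.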

Page 17:

Cache Performance: Example

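A worked instance of the formula above, in Python, with made-up numbers chosen purely for illustration (these are not the values from the original slide):

    icount  = 1.0e9                 # instructions
    m_alu, m_mem = 0.7, 0.3         # instruction mix
    cpi_alu = 1.0
    cpi_mem_hit, cpi_mem_miss = 1.0, 100.0
    r_miss  = 0.05
    t_cycle = 0.5e-9                # seconds per cycle (2 GHz clock)

    cpi_mem = (1.0 - r_miss) * cpi_mem_hit + r_miss * cpi_mem_miss   # 5.95
    cpi     = m_alu * cpi_alu + m_mem * cpi_mem                      # 2.485
    t       = icount * cpi * t_cycle                                 # ~1.24 seconds

Halving the miss rate to 0.025 drops CPIMEM to about 3.5 and the total time to roughly 0.87 seconds, which is the quantitative sense in which locality improvements pay off.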

Page 18:

Performance: Locality

•  Temporal Locality is the property that if a program accesses a memory location, there is a much higher than random probability that the same location will be accessed again.
•  Spatial Locality is the property that if a program accesses a memory location, there is a much higher than random probability that nearby locations will be accessed soon.
•  A couple of key factors affect the relationship between locality and scheduling:
   — Size of the dataset being processed by each processor
   — How much reuse is present in the code processing a chunk of iterations
   These factors were ignored in our idealized model of computation graph scheduling.
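A small sketch of the access-pattern difference (Python, illustrative; in a language with contiguous row-major arrays such as C, or with NumPy, the effect on cache behavior is much more pronounced):

    N = 1024
    a = [[1.0] * N for _ in range(N)]

    # Row-wise traversal: consecutive accesses touch neighbouring elements of the
    # same row, so data brought in for one access is reused (spatial locality).
    total = 0.0
    for i in range(N):
        for j in range(N):
            total += a[i][j]

    # Column-wise traversal: consecutive accesses jump from row to row, so each
    # access tends to touch a different region of memory (poor spatial locality).
    total = 0.0
    for j in range(N):
        for i in range(N):
            total += a[i][j]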

Page 19:

Idleness

•  Idleness = the state of a processor when it cannot find any useful work to execute
•  Sources of idleness include:
   — Load imbalance that prevents the workload from being "infinitely divisible"
      –  Mitigate by reducing load imbalance
   — Memory-bound computations
      –  Mitigate by increasing locality
      –  Mitigate by overlapping computation with memory latency

Page 20:

Contention

•  Contention = degradation of system performance caused by competition for a shared resource
   — Impact increases with the number of processors
   — The shared resource is often called a serialization bottleneck
•  Sources of contention include:
   — Acquiring and releasing a single lock (lock contention)
   — Acquiring and releasing a single cache line (cache contention)
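A minimal illustration in Python (hypothetical, not from the slides) of the lock-contention case and one way to reduce it:

    import threading

    counter = 0
    lock = threading.Lock()

    def contended(n):
        # Every thread competes for the same lock (and for the memory holding
        # `counter`): a serialization bottleneck that worsens as threads are added.
        global counter
        for _ in range(n):
            with lock:
                counter += 1

    def uncontended(partials, idx, n):
        # Each thread accumulates privately; the partial sums are combined once
        # at the end, so there is no shared resource on the hot path.
        total = 0
        for _ in range(n):
            total += 1
        partials[idx] = total

With the uncontended version, the main thread simply sums the partials after joining the worker threads.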

Page 21:

Cache Coherence on Bus-Based Machines (pp. 34 – 36)

Page 22:

Prefetching and Multithreading Approaches for Hiding Memory Latency

•  Consider the problem of browsing the web on a very slow network connection. We can deal with the problem in one of two possible ways:
   — we anticipate which pages we are going to browse ahead of time and issue requests for them in advance; or
   — we open multiple browsers and access different pages in each browser, so that while we are waiting for one page to load, we can be reading others.
•  The first approach is called prefetching; the second, multithreading.

Page 23:

Overhead

•  Overhead = any cost that gets added to a sequential computation so as to enable it to run in parallel
•  Sources of overhead include
   — Communication: can be explicit via messages, or implicit via a memory hierarchy (caches), e.g., transmission delay, data marshalling & demarshalling
   — Synchronization: extra processing to ensure that dependences in the computation graph are satisfied
   — Computation: extra work added to obtain a parallel algorithm
   — Memory: extra memory used to obtain a parallel algorithm
   — Task creation and termination: extra processing performed at the start and end of each task, e.g., a forall iteration or an async statement
•  For simplicity, we assume that all overhead can be executed in parallel

Page 24:

Overhead --- Mitigate by Increasing Task Granularity

[Figure: a unit of work w together with the overhead v paid to schedule it]

v = overhead
w = work unit
W = total work
Ti = execution time with i processors
P = # processors
S = speedup

Assumption: the workload is infinitely divisible

Implication: for a given P, task granularity wi = W/P > v is a necessary condition to obtain speedup S > P/2

Larger overhead v ⇒ increase the task granularity wi
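One way to derive the implication (a sketch in the slide's notation, assuming the work splits evenly and each task's overhead v is paid in parallel): TP = W/P + v, so S = T1/TP = W / (W/P + v) = P / (1 + P*v/W). Then S > P/2 exactly when P*v/W < 1, i.e., when the task granularity wi = W/P exceeds the overhead v.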

Page 25:

Scalable Performance: Key Terms and Concepts

•  Scalable Speedup: Relative reduction of execution time of a fixed size workload through parallel execution (ideally = N, but is < N in practice)

•  Scalable Efficiency : Ratio of the actual performance to the best possible performance (ideally = 1, but is < 1 in practice)

Efficiency = execution_time_on_one_processor / (execution_time_on_multiple_processors × number_of_processors)

Page 26:

Example with Overhead (pg. 82)

•  Consider a problem with sequential execution time TS that incurs 0.2 TS of overhead per processor when executed in parallel
•  Therefore
   — T1 = TS
   — T2 = TS/2 + 0.2 TS = 0.7 TS
   — T10 = TS/10 + 0.2 TS = 0.3 TS
   — T100 = TS/100 + 0.2 TS = 0.21 TS
•  Speedup(1) = 1, Efficiency(1) = 1
•  Speedup(2) = 1/0.7 = 1.43, Efficiency(2) = 1.43/2 = 0.71
•  Speedup(10) = 1/0.3 = 3.33, Efficiency(10) = 3.33/10 = 0.33
•  Speedup(100) = 1/0.21 = 4.76, Efficiency(100) = 4.76/100 = 0.047
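These numbers are easy to regenerate; a short Python sketch (illustrative, not from the slides):

    def parallel_time(ts, p, v=0.2):
        # TP = TS/P + v*TS for P > 1: each processor incurs overhead v*TS, and
        # the overheads execute in parallel, so only one v*TS term adds to TP.
        return ts if p == 1 else ts / p + v * ts

    for p in (1, 2, 10, 100):
        tp = parallel_time(1.0, p)          # measure time in units of TS
        speedup = 1.0 / tp
        print(p, tp, round(speedup, 2), round(speedup / p, 3))
    # Matches the slide: Speedup(2) = 1.43, Speedup(10) = 3.33, Speedup(100) = 4.76.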

Page 27:

Another Example with Overhead

•  Consider a problem with sequential execution time TS(N), a function of input size N, and assume a fixed overhead of TOVHD per processor when executed in parallel
•  Therefore
   — T1(N) = TS(N)
   — TP(N) = TS(N)/P + TOVHD, for P > 1
   — Speedup(P) = T1(N) / TP(N) = TS(N) / (TS(N)/P + TOVHD)
   — Efficiency(P) = Speedup(P) / P = TS(N) / (TS(N) + P*TOVHD)
•  Half-performance metric
   — N1/2 = input size that achieves Efficiency(P) = 0.5 for a given P
   — A larger value of N1/2 indicates that the problem is harder to parallelize efficiently
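To see where N1/2 comes from (a sketch, not from the original slide): setting Efficiency(P) = TS(N) / (TS(N) + P*TOVHD) = 0.5 gives TS(N1/2) = P * TOVHD. For example, if TS(N) = c*N for some constant c (a hypothetical workload), then N1/2 = P * TOVHD / c, which grows linearly with P: doubling the processor count doubles the problem size needed to stay at 50% efficiency.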

Page 28:

Strong Scaling, Weak Scaling

[Figure: total problem size and granularity (size / node) plotted against machine scale (# of nodes). Under strong scaling the total problem size stays fixed, so the granularity per node shrinks as nodes are added; under weak scaling the granularity per node stays fixed, so the total problem size grows with the machine.]

Page 29:

Summary of Today's Lecture

Key concepts:
•  Latency
•  Throughput / Bandwidth
•  Little's Law
•  DIG acronym
   1.  Increase Data locality in all computations
       –  Addresses idleness arising from large memory access latencies
   2.  Decrease load Imbalance across computation, memory, & communication resources
       –  Addresses contention in physical resources
   3.  Increase Granularity of computation and communication
       –  Addresses overheads in real machines

Page 30:

HOMEWORK #1 (Written Assignment)

1.  Exercise 6, Chapter 3, page 85
2.  Analyze the speedup, efficiency, and half-performance metric of Parallel Quicksort as a function of N and P.

Due Date: In class on Thursday, Sep 10th