
Page 1: EECS 570 Lecture 1, Parallel Computer Architecture

Lecture 1, Slide 1

EECS 570

Lecture 1

Parallel Computer Architecture

Winter 2020

Prof. Satish Narayanasamy

http://www.eecs.umich.edu/courses/eecs570/

Slides developed in part by Profs. Austin, Adve, Falsafi, Martin, Narayanasamy, Nowatzyk, Peh, and Wenisch of CMU, EPFL, MIT, UPenn, U-M, UIUC.

Lecture 1, Slide 2

Announcements

No discussion this Friday.

Online quizzes (Canvas) on the first readings are due Monday at 1:30pm.

Sign up for Piazza.

Lecture 1, Slide 3

Readings

For Monday 1/13 (quizzes due by 1:30pm):

David Wood and Mark Hill, "Cost-Effective Parallel Computing," IEEE Computer, 1995.

Mark Hill et al., "21st Century Computer Architecture," CCC White Paper, 2012.

For Wednesday 1/15:

Christina Delimitrou and Christos Kozyrakis, "Amdahl's Law for Tail Latency," Communications of the ACM, July 2018.

H. Kim, R. Vuduc, S. Baghsorkhi, J. Choi, and Wen-mei Hwu, Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU), Ch. 1.

Lecture 1, Slide 4

EECS 570 Class Info

Instructor: Professor Satish Narayanasamy (http://www.eecs.umich.edu/~nsatish)

Research interests: multicore/multiprocessor architecture & programmability; data-center architecture and server energy-efficiency; accelerators for medical imaging and data analytics

GSI: Subarno Banerjee ([email protected])

Class info: http://www.eecs.umich.edu/courses/eecs570/ (Canvas for reading quizzes & reporting grades; Piazza for discussions & project coordination)

Lecture 1, Slide 5

Meeting Times

Lecture

MW 1:30pm – 2:50pm (1017 Dow)

Discussion

F 1:30pm – 2:20pm (1303 EECS)

Talk about programming assignments and projects

Make-up lectures

Keep the slot free, but we often won’t meet

Office Hours

Prof. Satish: M 3-4pm (4721 BBB) & by appt.

Subarno: Tue 9-10am, Thurs 4-5pm (BBB Learning Center); Fri 1:30-2:30pm (1303 EECS) when there is no discussion

Q&A: Use Piazza for all technical questions; use e-mail sparingly

Lecture 1, Slide 6

Who Should Take 570?

Graduate students (& seniors interested in research):

1. Computer architects to be
2. Computer system designers
3. Those interested in computer systems

Required background: Computer Architecture (e.g., EECS 470) and C/C++ programming

Lecture 1, Slide 7

Grading

2 Prog. Assignments: 5% & 10%

Reading Quizzes: 10%

Midterm exam: 25%

Final exam: 25%

Final Project: 25%

Attendance & participation count

(your goal is for me to know who you are)

Lecture 1, Slide 8

Grading (Cont.)

Group studies are encouraged

Group discussions are encouraged

All programming assignments must be the result of individual work

All reading quizzes must be done individually; questions and answers must not be posted publicly

There is no tolerance for academic dishonesty. Please refer to the University Policy on cheating and plagiarism. Discussion and group studies are encouraged, but all submitted material must be the student's individual work (or in case of the project, individual group work).

Lecture 1, Slide 9

Some Advice on Reading…

If you carefully read every paper start to finish…

…you will never finish

Learn to skim past details

Lecture 1, Slide 10

Reading Quizzes

• You must take an online quiz for every paper. Quizzes must be completed by class start via Canvas.

• There will be 2 multiple-choice questions, chosen randomly from a list

You have only 5 minutes: not enough time to find the answers if you haven't read the paper

You only get one attempt

• Some of the questions may be reused on the midterm/final

• The 4 lowest quiz grades (of about 40) will be dropped over the course of the semester (e.g., skip some if you are travelling). Retakes/retries/reschedules will not be given for any reason.

Lecture 1, Slide 11

Final Project

• Original research on a topic related to the course. Goal: a high-quality 6-page workshop paper by end of term.

25% of overall grade

Done in groups of 3-4

Poster session: April 22nd, 1:30pm-3:30pm (tentative)

• See course website for timeline

• Available infrastructure: FeS2 and M5 multiprocessor simulators

GPGPUsim

Pin

Xeon Phi accelerators

• A suggested topic list will be distributed in a few weeks. You may propose other topics if you convince me they are worthwhile.

Lecture 1, Slide 12

Course Outline

Unit I – Parallel Programming Models: message passing, shared memory (pthreads and GPU)

Unit II – Synchronization: locks, lock-free structures, transactional memory

Unit III – Coherency and Consistency: snooping bus-based systems, directory-based distributed shared memory, memory models

Unit IV – Interconnection Networks: on-chip and off-chip networks

Unit V – Applications & Architectures: scientific, commercial server, and data center applications; simultaneous & speculative threading

Lecture 1, Slide 13

Parallel Computer Architecture

The Multicore Revolution

Why did it happen?

Lecture 1, Slide 14

If you want to make your computer faster, there are only two options:

1. increase clock frequency

2. execute two or more things in parallel

Instruction-Level Parallelism (ILP)

Programmer-specified explicit parallelism

Lecture 1, Slide 15

The ILP Wall

• 6-issue has higher IPC than 2-issue, but not by 3x: memory (I & D) and dependence (pipeline) stalls limit IPC

Olukotun et al., ASPLOS '96

Lecture 1, Slide 16

Single-thread performance

Conclusion: Can’t scale MHz or issue width to keep selling chips

Hence, multicore!

[Chart: single-thread performance, 1985-2010, log scale: ~52%/yr. growth, slowing to ~15%/yr. in the mid-2000s. Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 4th ed.]

Lecture 1, Slide 17

The Power Wall

[Chart: Transistors (100,000's), Power (W), Performance (GOPS), Efficiency (GOPS/W), 1985-2020, log scale. Source: E2UDC ERC Vision]

Limits on heat extraction

Limits on energy-efficiency of operations

Lecture 1, Slide 18

The Power Wall (cont.)

[Chart: same data, annotated: the Era of High-Performance Computing gives way to the Era of Energy-Efficient Computing, c. 2000. Source: E2UDC ERC Vision]

Limits on heat extraction

Limits on energy-efficiency of operations

Stagnates performance growth

Lecture 1, Slide 19

Classic CMOS Dennard Scaling: the Science behind Moore's Law

P = C V² f

Scaling: Oxide: tOX/a; Voltage: V/a (all device dimensions and voltages shrink by 1/a)

Results: Power/ckt: 1/a²; Power Density: ~constant

Source: The Future of Computing Performance: Game Over or Next Level?, National Academies Press, 2011
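The "power density ~constant" result can be checked numerically. A minimal sketch in normalized units (not real device parameters; the per-generation scale factor a ≈ 1.4 is an assumption):

```python
def power(C, V, f):
    return C * V * V * f  # dynamic power: P = C * V^2 * f

def dennard_step(C, V, f, a):
    """One classic Dennard scaling step by factor a: capacitance and
    voltage shrink by 1/a while frequency rises by a."""
    return C / a, V / a, f * a

a = 1.4                      # ~one process generation (area shrinks by a^2 ~ 2)
P0 = power(1.0, 1.0, 1.0)
P1 = power(*dennard_step(1.0, 1.0, 1.0, a))

print(P1 / P0)               # power per circuit falls by 1/a^2 (~0.51)
print((P1 / P0) * a ** 2)    # a^2 more circuits per area -> power density ~1.0
```

The (1/a) from capacitance, (1/a²) from voltage squared, and (a) from frequency multiply out to exactly 1/a², which the a²-denser circuits cancel.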

Lecture 1, Slide 20

Post-classic CMOS Dennard Scaling

P = C V² f

Scaling: Oxide: tOX/a; Voltage: V (Vdd no longer scales)

Results: Power/ckt: ~1 (vs. 1/a² classically); Power Density: grows as a²

Post-Dennard CMOS scaling rule, TODO: chips with higher power (no), smaller chips, dark silicon (☺), or other (?)

Lecture 1, Slide 21

Leakage Killed Dennard Scaling

Leakage:

• Exponential in inverse of Vth

• Exponential in temperature

• Linear in device count

To switch well

• must keep Vdd/Vth > 3

➜ Vdd can’t go down

Lecture 1, Slide 22

Multicore: Solution to Power-Constrained Design?

Power = C V² f

Scale clock frequency (and voltage) to 80%: power per core drops to ~half (0.8³ ≈ 0.51)

Now add a second core

Same power budget, but 1.6x performance!

But: Must parallelize application. Remember Amdahl's Law!
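A quick numeric check of the slide's claim, assuming voltage scales along with frequency so dynamic power goes as the cube of the frequency ratio (normalized units):

```python
def dyn_power(C, V, f):
    return C * V * V * f           # dynamic power: P = C * V^2 * f

P_one_fast = dyn_power(1.0, 1.0, 1.0)   # baseline: one core at full speed
P_one_slow = dyn_power(1.0, 0.8, 0.8)   # 80% frequency and voltage: 0.8^3
P_two_slow = 2 * P_one_slow             # two scaled-down cores

print(P_two_slow)    # ~1.02: roughly the same power budget as one fast core
print(2 * 0.8)       # 1.6x throughput, if the application fully parallelizes
```

The 1.6x figure is the best case; Amdahl's Law (later slides) caps what a real application gets.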

Lecture 1, Slide 23

What Is a Parallel Computer?

“A collection of processing elements that communicate and cooperate to solve large problems fast.”

Almasi & Gottlieb, 1989

Lecture 1, Slide 24

Spectrum of Parallelism

Bit-level (EECS 370) → Pipelining, ILP (EECS 470) → Multithreading, Multiprocessing (EECS 570) → Distributed (EECS 591)

Why multiprocessing?

• Desire for performance

• Techniques from 370/470 difficult to scale further

Lecture 1, Slide 25

Why Parallelism Now?

• These arguments are no longer theoretical

• All major processor vendors are producing multicore chips: every machine will soon be a parallel machine

All programmers will be parallel programmers???

• New software model: want a new feature? Hide the “cost” by speeding up the code first

All programmers will be performance programmers???

• Some of this may eventually be hidden in libraries, compilers, and high-level languages, but a lot of work is needed to get there

• Big open questions: What will be the killer apps for multicore machines?

How should the chips, languages, OS be designed to make it easier for us to develop parallel programs?

Lecture 1, Slide 26

Multicore in Products

• “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing”

Paul Otellini, President, Intel (2005)

• All microprocessor companies switch to MP (2X cores / 2 yrs)

                    Intel's Nehalem-EX   Azul's Vega   nVidia's Tesla
Processors/System          4                 16              4
Cores/Processor            8                 48             448
Threads/Processor          2                  1
Threads/System            64                768            1792

Lecture 1, Slide 27

Revolution Continues..

Azul's Vega 3 7300: 54-core chip, 864 cores, 768 GB memory (May 2008)

Blue Gene/Q Sequoia: 16-core chip, 1.6 million cores, 1.6 PB memory (2012)

Sun's Modular Datacenter '08: 8-core chip, 8 threads/core, 816 cores / 160 sq. feet

Lakeside Datacenter (Chicago): 1.1 million sq. feet, ~45 million threads

Lecture 1, Slide 28

Multiprocessors Are Here To Stay

• Moore’s law is making the multiprocessor a commodity part: 1B transistors on a chip, but what to do with all of them? Not enough ILP to justify a huge uniprocessor. Really big caches? t_hit increases, with diminishing %miss returns.

• Chip multiprocessors (CMPs): every computing device (even your cell phone) is now a multiprocessor

Lecture 1, Slide 29

Parallel Programming Intro

Lecture 1, Slide 30

Motivation for MP Systems

• Classical reason for multiprocessing: more performance by using multiple processors in parallel

Divide computation among processors and allow them to work concurrently

Assumption 1: There is parallelism in the application

Assumption 2: We can exploit this parallelism

Lecture 1, Slide 31

Finding Parallelism

1. Functional parallelism: Car: {engine, brakes, entertain, nav, …}; Game: {physics, logic, UI, render, …}; Signal processing: {transform, filter, scaling, …}

2. Data parallelism: vector, matrix, db table, pixels, …

3. Request parallelism: web, shared database, telephony, …

Lecture 1, Slide 32

Computational Complexity of (Sequential) Algorithms

• Model: Each step takes a unit time

• Determine the time (/space) required by the algorithm as a function of input size

Lecture 1, Slide 33

Sequential Sorting Example

• Given an array of size n

• MergeSort takes O(n log n) time

• BubbleSort takes O(n2) time

• But, a BubbleSort implementation can sometimes be faster than a MergeSort implementation

• Why?

Lecture 1, Slide 34

Sequential Sorting Example

• Given an array of size n

• MergeSort takes O(n log n) time

• BubbleSort takes O(n2) time

• But, a BubbleSort implementation can sometimes be faster than a MergeSort implementation

• The model is still useful: it indicates the scalability of the algorithm for large inputs, and lets us prove things like: any comparison-based sorting algorithm requires Ω(n log n) comparisons
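The "why" is constant factors: a small timing experiment makes the point (an illustrative sketch; exact numbers vary by machine, and the input size of 8 is an arbitrary choice):

```python
import random
import timeit

def bubble_sort(a):
    """O(n^2), but tiny constants: no recursion, no allocation."""
    a = list(a)
    n = len(a)
    for i in range(n):
        for j in range(n - 1 - i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a

def merge_sort(a):
    """O(n log n), but pays for recursion and list allocation."""
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

small = [random.random() for _ in range(8)]
# On tiny inputs, BubbleSort's low overhead can beat MergeSort
# despite its worse asymptotic complexity.
print(timeit.timeit(lambda: bubble_sort(small), number=10000))
print(timeit.timeit(lambda: merge_sort(small), number=10000))
```

For large n the asymptotics win, which is exactly what the complexity model is designed to capture.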

Lecture 1, Slide 35

We need a similar model for parallel algorithms

Lecture 1, Slide 36

Sequential Merge Sort

16MB input (32-bit integers)

Recurse(left)

Recurse(right)

Copy back to input array

Merge to scratch array

Time

Sequential Execution

Lecture 1, Slide 37

Parallel Merge Sort (as Parallel Directed Acyclic Graph)

16MB input (32-bit integers)

Recurse(left) Recurse(right)

Copy back to input array

Merge to scratch array

Time

Parallel Execution

Lecture 1, Slide 38

Parallel DAG for Merge Sort (2-core)

Sequential Sort

Merge

Sequential Sort

Time

Lecture 1, Slide 39

Parallel DAG for Merge Sort (4-core)

Lecture 1, Slide 40

Parallel DAG for Merge Sort (8-core)

Lecture 1, Slide 41

The DAG Execution Model of a Parallel Computation

• Given an input, dynamically create a DAG

• Nodes represent sequential computation, weighted by the amount of work

• Edges represent dependencies: Node A → Node B means that B cannot be scheduled unless A is finished

Lecture 1, Slide 42

Sorting 16 elements in four cores

Lecture 1, Slide 43

Sorting 16 elements in four cores(4 element arrays sorted in constant time)

[Figure: execution DAG for the four-core sort: weight-1 nodes for the constant-time leaf sorts and spawn/join steps, weight-8 nodes for the two 8-element merges, and a weight-16 node for the final merge.]

Lecture 1, Slide 44

Performance Measures

• Given a graph G, a scheduler S, and P processors

• Tp(S) : Time on P processors using scheduler S

• Tp : Time on P processors using best scheduler

• T1 : Time on a single processor (sequential cost)

• T∞ : Time assuming infinite resources

Lecture 1, Slide 45

Work and Depth

• T1 = Work: the total number of operations executed by a computation

• T∞ = Depth: the longest chain of sequential dependencies (the critical path) in the parallel DAG

Lecture 1, Slide 46

T∞ (Depth): Critical Path Length(Sequential Bottleneck)

Lecture 1, Slide 47

T1 (work): Time to Run Sequentially

Lecture 1, Slide 48

Sorting 16 elements in four cores(4 element arrays sorted in constant time)

[Figure: the same execution DAG, with node weights 1, 8, and 16.]

Work = ?    Depth = ?

Lecture 1, Slide 49

Some Useful Theorems

Lecture 1, Slide 50

Work Law

• “You cannot avoid work by parallelizing”

T1 / P ≤ TP

Lecture 1, Slide 51

Work Law

• “You cannot avoid work by parallelizing”

T1 / P ≤ TP

Speedup = T1 / TP

Lecture 1, Slide 52

Work Law

• “You cannot avoid work by parallelizing”

• Can speedup be more than 2 when we go from 1-core to 2-core in practice?

T1 / P ≤ TP

Speedup = T1 / TP

Lecture 1, Slide 53

Depth Law

• More resources should make things faster

• You are limited by the sequential bottleneck

TP ≥ T∞

Lecture 1, Slide 54

Amount of Parallelism

Parallelism = T1 / T∞

Lecture 1, Slide 55

Maximum Speedup Possible

Parallelism

“speedup is bounded above by available parallelism”

Speedup = T1 / TP ≤ T1 / T∞
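Work, Depth, and the speedup bound are mechanical to compute on a concrete DAG. A sketch using a hypothetical four-node sort DAG (the `node -> (weight, predecessors)` encoding and the specific weights are my own, not from the slides):

```python
# Hypothetical DAG: a 1-unit split, two 8-unit recursive sorts, a 16-unit merge.
dag = {
    "split": (1, []),
    "left":  (8, ["split"]),
    "right": (8, ["split"]),
    "merge": (16, ["left", "right"]),
}

work = sum(w for w, _ in dag.values())      # T1: total of all node weights

def node_depth(node, _memo={}):
    """Longest weighted path ending at `node` (its critical path)."""
    if node not in _memo:
        w, preds = dag[node]
        _memo[node] = w + max((node_depth(p) for p in preds), default=0)
    return _memo[node]

t_inf = max(node_depth(n) for n in dag)     # T_inf: critical path of the DAG
print(work, t_inf, work / t_inf)            # 33 25 1.32: speedup caps at 1.32x
```

Here the 16-unit merge dominates the critical path, so even infinite processors give only a 1.32x speedup over one processor.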

Lecture 1, Slide 56

Greedy Scheduler

• If more than P nodes can be scheduled, pick any subset of size P

• If fewer than P nodes can be scheduled, schedule them all

Lecture 1, Slide 57

More Reading: http://www.cs.cmu.edu/afs/cs/academic/class/15492-f07/www/scribe/lec4/lecture4.pdf

Lecture 1, Slide 58

Performance of the Greedy Scheduler

Work law T1 / P ≤ TP

Depth law T∞ ≤ TP

TP(Greedy) ≤ T1 / P + T∞
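The greedy bound can be sanity-checked by simulating non-preemptive greedy (list) scheduling on a small weighted DAG. This is a sketch; the `node -> (weight, predecessors)` encoding and the example DAG are made up for illustration:

```python
def greedy_schedule(dag, P):
    """Greedily schedule a weighted DAG on P processors; return the makespan."""
    remaining = {n: w for n, (w, _) in dag.items()}
    done, running, t = set(), {}, 0
    while len(done) < len(dag):
        ready = [n for n, (_, preds) in dag.items()
                 if n not in done and n not in running
                 and all(p in done for p in preds)]
        for n in ready[:P - len(running)]:    # greedy: fill every free slot
            running[n] = remaining[n]
        step = min(running.values())          # advance to the next completion
        t += step
        for n in list(running):
            running[n] -= step
            if running[n] == 0:
                del running[n]
                done.add(n)
    return t

dag = {"split": (1, []), "left": (8, ["split"]),
       "right": (8, ["split"]), "merge": (16, ["left", "right"])}
T1, T_inf = 33, 25
TP = greedy_schedule(dag, 2)
print(TP, TP <= T1 / 2 + T_inf)   # 25 True: within the T1/P + T_inf bound
```

With P = 1 the same simulation returns T1 = 33, and adding more than 2 processors cannot beat T_inf = 25, matching the Work and Depth laws.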

Lecture 1, Slide 59

Greedy is optimal within a factor of 2

Work law T1 / P ≤ TP

Depth law T∞ ≤ TP

TP ≤ TP(Greedy) ≤ 2 TP

Lecture 1, Slide 60

Work/Depth of Merge Sort (Sequential Merge)

• Work T1 : O(n log n)

• Depth T∞ : O(n), since it takes O(n) time to merge n elements

• Parallelism: T1 / T∞ = O(log n) → really bad!
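These bounds follow from the recurrences W(n) = 2W(n/2) + n and D(n) = D(n/2) + n, where the +n in the depth term is the sequential merge. Evaluating them numerically (a sketch; the base case W(1) = D(1) = 1 is an assumption):

```python
def work(n):            # W(n) = 2 W(n/2) + n  =>  O(n log n)
    return 1 if n <= 1 else 2 * work(n // 2) + n

def depth(n):           # D(n) = D(n/2) + n    =>  O(n): the merge is sequential
    return 1 if n <= 1 else depth(n // 2) + n

for n in (2 ** 10, 2 ** 16, 2 ** 20):
    print(n, work(n) // depth(n))   # parallelism grows only ~ (log n) / 2
```

Even at a million elements the ratio is around 10, so throwing more than a handful of cores at this version is wasted; the fix is to parallelize the merge itself.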

Lecture 1, Slide 61

Main Message

• Analyze the Work and Depth of your algorithm

• Parallelism is Work/Depth

• Try to decrease Depth: the critical path is a sequential bottleneck

• If you increase Depth, better increase Work by a lot more!

Lecture 1, Slide 62

Amdahl’s law

• Sorting takes 70% of the execution time of a sequential program

• You replace the sorting algorithm with one that scales perfectly on multi-core hardware

• How many cores do you need to get a 4x speed-up on the program?

Lecture 1, Slide 63

Amdahl’s law, 𝑓 = 70%

f = the parallel portion of execution

1 - f = the sequential portion of execution

c = number of cores used

Speedup(f, c) = 1 / ((1 – f) + f / c)
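Plugging f = 70% into the formula shows why the 4x target from the previous slide is unreachable (a minimal sketch):

```python
def amdahl_speedup(f, c):
    """Amdahl's law: fraction f runs perfectly in parallel on c cores."""
    return 1.0 / ((1.0 - f) + f / c)

for c in (2, 4, 16, 1_000_000):
    print(c, round(amdahl_speedup(0.70, c), 2))
# 2 1.54 / 4 2.11 / 16 2.91 / 1000000 3.33: the limit is 1/(1-f) = 3.33x
```

No core count helps: the 30% sequential portion alone takes 0.3 of the original time, so speedup is capped below 1/0.3 ≈ 3.33x.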

Lecture 1, Slide 64

Amdahl’s law, 𝑓 = 70%

[Chart: speedup vs. #cores (1 to 16) for f = 70%. Legend: desired 4x speedup; speedup achieved (perfect scaling on 70%). The achieved curve flattens well below 4x.]

Lecture 1, Slide 65

Amdahl’s law, 𝑓 = 70%

[Chart: the same speedup vs. #cores data, annotated with the asymptote. Legend: desired 4x speedup; speedup achieved (perfect scaling on 70%).]

Limit as c→∞ = 1/(1-f) = 3.33

Lecture 1, Slide 66

Amdahl’s law, 𝑓 = 10%

[Chart: speedup vs. #cores (1 to 16) for f = 10%.]

Speedup achieved with perfect scaling

Amdahl’s law limit, just 1.11x

Lecture 1, Slide 67

Amdahl’s law, 𝑓 = 98%

[Chart: speedup vs. #cores (1 to 127) for f = 98%; speedup climbs toward the 1/(1-f) = 50x limit.]

Lecture 1, Slide 68

Lesson

• Speedup is limited by sequential code

• Even a small percentage of sequential code can greatly limit potential speedup