Parallel Computer Architecture and Programming
CMU 15-418/15-618, Fall 2018
Lecture 1: Why Parallelism? Why Efficiency?


Page 1: Lecture 1: Why Parallelism? Why Efficiency?

Parallel Computer Architecture and Programming CMU 15-418/15-618, Fall 2018

Lecture 1:

Why Parallelism? Why Efficiency?

Page 2: Lecture 1: Why Parallelism? Why Efficiency?

Hi!

Instructors: Prof. Mowry and Prof. Railing
TAs: Aayush, Haoran, Ting, Frank, Hao

Page 3: Lecture 1: Why Parallelism? Why Efficiency?

What will you be doing in this course?

Page 4: Lecture 1: Why Parallelism? Why Efficiency?

Assignments
▪ Four programming assignments
- First assignment is done individually; the rest will be done in pairs
- Each uses a different parallel programming environment

Assignment 1: ISPC programming on Intel quad-core CPU (and Xeon Phi)

Assignment 2: CUDA programming on NVIDIA GPUs

Assignment 3: Parallel Programming via a Shared-Address Space Model

Assignment 4: Parallel Programming via a Message Passing Model

Page 5: Lecture 1: Why Parallelism? Why Efficiency?

If you are on the Wait List
▪ We will hand out Assignment 1 later this week
▪ Our algorithm for filling the K remaining slots in the class:
- the first K students on the Wait List who hand in Assignment 1 and receive an A on it are enrolled in the class

Page 6: Lecture 1: Why Parallelism? Why Efficiency?

Exams
▪ We will have two midterm-style exams
- Each covers roughly half of the course material
- Closed-book, closed-notes
▪ No final exam
- We use the final exam slot for our project poster session

Page 7: Lecture 1: Why Parallelism? Why Efficiency?

Final project
▪ 6-week self-selected final project

▪ Performed in groups (by default, 2 people per group)

▪ Start thinking about your project ideas TODAY!

▪ Poster session during the final exam slot

▪ Check out last semester’s projects:

http://15418.courses.cs.cmu.edu/fall2017/article/10

Page 8: Lecture 1: Why Parallelism? Why Efficiency?

Participation Grade: Take-Home Quizzes
▪ We will have 2-4 take-home quizzes
- Goal: help you prepare for the exams
- You must complete the quiz on your own
- We will grade your work to give you feedback, but only a participation grade will go into the gradebook

Page 9: Lecture 1: Why Parallelism? Why Efficiency?

Participation Grade: In-Class Mini-Quizzes
▪ In most lectures, we will have a simple in-class (online) quiz
▪ The quizzes should be easy
- the goal is just to demonstrate that you are paying attention in class
▪ They also give us feedback on what the class is understanding
▪ Grace budget:
- Full credit for this portion of the participation grade if you complete 70% of the in-class quizzes

Page 10: Lecture 1: Why Parallelism? Why Efficiency?

Grades

40% Programming assignments (4)
30% Exams (2)
25% Final project
5% Participation (take-home and in-class quizzes)

Each student (or group) gets up to five late days on programming assignments (see syllabus for details)

Page 11: Lecture 1: Why Parallelism? Why Efficiency?

Getting started
▪ Pay attention to Piazza posts
- http://piazza.com/cmu/fall2018/1541815618
▪ Textbook
- There is no course textbook, but please see the web site for suggested references

Page 12: Lecture 1: Why Parallelism? Why Efficiency?

Regarding the class meeting times
▪ We meet 3 days a week (MWF) for the first 2/3 of the semester
▪ Same content as 2 days a week over a full semester, but two major advantages this way:
- you are better prepared to do an interesting project
- more time to focus on your project
[Timeline figure: Lectures during the first 2/3 of the semester, then Project]

Page 13: Lecture 1: Why Parallelism? Why Efficiency?

A Brief History of Parallel Computing
▪ Initial Focus (starting in 1970s): “Supercomputers” for Scientific Computing
- C.mmp at CMU (1971): 16 PDP-11 processors
- Cray X-MP (circa 1984): 4 vector processors
- Thinking Machines CM-2 (circa 1987): 65,536 1-bit processors + 2048 floating-point co-processors
- SGI UV 1000 cc-NUMA (today): 4096 processor cores (“Blacklight” at the Pittsburgh Supercomputing Center)

Page 14: Lecture 1: Why Parallelism? Why Efficiency?

A Brief History of Parallel Computing
▪ Initial Focus (starting in 1970s): “Supercomputers” for Scientific Computing
▪ Another Driving Application (starting in early ’90s): Databases
- Sun Enterprise 10000 (circa 1997): 16 UltraSPARC-II processors
- Oracle SuperCluster M6-32 (today): 32 SPARC M6 processors

Page 15: Lecture 1: Why Parallelism? Why Efficiency?

Setting Some Context
▪ Before we continue our multiprocessor story, let’s pause to consider:
- Q: what had been happening with single-processor performance?
▪ A: since forever, single processors had been getting exponentially faster
- Why?
Image credit: Olukotun and Hammond, ACM Queue 2005

Page 16: Lecture 1: Why Parallelism? Why Efficiency?

A Brief History of Processor Performance
▪ Wider data paths
- 4 bit → 8 bit → 16 bit → 32 bit → 64 bit
▪ More efficient pipelining
- e.g., 3.5 Cycles Per Instruction (CPI) → 1.1 CPI
▪ Exploiting instruction-level parallelism (ILP)
- “superscalar” processing: e.g., issue up to 4 instructions/cycle
▪ Faster clock rates
- e.g., 10 MHz → 200 MHz → 3 GHz
▪ During the 80s and 90s: large exponential performance gains
- and then…

Page 17: Lecture 1: Why Parallelism? Why Efficiency?

A Brief History of Parallel Computing
▪ Initial Focus (starting in 1970s): “Supercomputers” for Scientific Computing
▪ Another Driving Application (starting in early ’90s): Databases
▪ Inflection point in 2004: Intel hits the Power Density Wall
Image credit: Pat Gelsinger, ISSCC 2001

Page 18: Lecture 1: Why Parallelism? Why Efficiency?

From the New York Times: “Intel's Big Shift After Hitting Technical Wall”

The warning came first from a group of hobbyists that tests the speeds of computer chips. This year, the group discovered that the Intel Corporation's newest microprocessor was running slower and hotter than its predecessor. What they had stumbled upon was a major threat to Intel's longstanding approach to dominating the semiconductor industry - relentlessly raising the clock speed of its chips. Then two weeks ago, Intel, the world's largest chip maker, publicly acknowledged that it had hit a "thermal wall" on its microprocessor line. As a result, the company is changing its product strategy and disbanding one of its most advanced design groups. Intel also said that it would abandon two advanced chip development projects, code-named Tejas and Jayhawk. Now, Intel is embarked on a course already adopted by some of its major rivals: obtaining more computing power by stamping multiple processors on a single chip rather than straining to increase the speed of a single processor. … John Markoff, New York Times, May 17, 2004

Page 19: Lecture 1: Why Parallelism? Why Efficiency?

ILP tapped out + end of frequency scaling
- No further benefit from ILP
- Processor clock rate stops increasing
[Chart legend: transistor density, clock frequency, instruction-level parallelism (ILP), power]
Image credit: “The Free Lunch Is Over” by Herb Sutter, Dr. Dobb's Journal, 2005

Page 20: Lecture 1: Why Parallelism? Why Efficiency?

Programmer’s Perspective on Performance
Question: How do you make your program run faster?
Answer before 2004:
- Just wait 6 months, and buy a new machine!
- (Or if you’re really obsessed, you can learn about parallelism.)
Answer after 2004:
- You need to write parallel software.

Page 21: Lecture 1: Why Parallelism? Why Efficiency?

Parallel Machines Today
Examples from Apple’s product line (images from apple.com):
- Mac Pro: 12 Intel Xeon E5 cores
- iMac: 12 Intel Xeon E5 cores
- MacBook Pro Retina 15”: 4 Intel Core i7 cores
- iPad Retina: 2 Swift cores
- iPhone 6s: 2 A9 cores

Page 22: Lecture 1: Why Parallelism? Why Efficiency?

Intel Skylake (2015)
Quad-core CPU + multi-core GPU integrated on one chip (aka “6th generation Core i7”)
[Die photo: four CPU cores plus an integrated GPU]

Page 23: Lecture 1: Why Parallelism? Why Efficiency?

Intel Xeon Phi 7120A “coprocessor”
▪ 61 “simple” x86 cores (1.3 GHz, derived from Pentium)
▪ Targeted as an accelerator for supercomputing applications

Page 24: Lecture 1: Why Parallelism? Why Efficiency?

NVIDIA Maxwell GTX 980 GPU (2014)
Sixteen major processing blocks (but much, much more parallelism available... details coming next class)

Page 25: Lecture 1: Why Parallelism? Why Efficiency?

Mobile parallel processing
Power constraints heavily influence the design of mobile systems
- NVIDIA Tegra X1: quad-core ARM A57 CPU + 4 ARM A53 CPUs + NVIDIA GPU + image processor...
- Apple A9 (in iPhone 6s): dual-core CPU + GPU + image processor and more on one chip

Page 26: Lecture 1: Why Parallelism? Why Efficiency?

Supercomputing
▪ Today: clusters of multi-core CPUs + GPUs
▪ Oak Ridge National Laboratory: Titan (#2 supercomputer in world)
- 18,688 x 16-core AMD CPUs + 18,688 NVIDIA K20X GPUs

Page 27: Lecture 1: Why Parallelism? Why Efficiency?

What is a parallel computer?

Page 28: Lecture 1: Why Parallelism? Why Efficiency?

One common definition
A parallel computer is a collection of processing elements that cooperate to solve problems quickly
- We’re going to use multiple processors to get it
- We care about performance* and we care about efficiency
* Note: different motivation from “concurrent programming” using pthreads in 15-213

Page 29: Lecture 1: Why Parallelism? Why Efficiency?

DEMO 1 (This semester’s first parallel program)

Page 30: Lecture 1: Why Parallelism? Why Efficiency?

Speedup
One major motivation for using parallel processing: achieving a speedup
For a given problem:
speedup (using P processors) = execution time (using 1 processor) / execution time (using P processors)
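As a quick worked example (with illustrative numbers, not measurements from the course demos): if a program takes 10 seconds on 1 processor and the same problem takes 3.2 seconds on 4 processors, then

speedup (using 4 processors) = 10 s / 3.2 s ≈ 3.1

which is less than the ideal 4x; the demo observations that follow explain the kinds of overheads that cause the gap.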

Page 31: Lecture 1: Why Parallelism? Why Efficiency?

Class observations from demo 1
▪ Communication limited the maximum speedup achieved
- In the demo, the communication was telling each other the partial sums
▪ Minimizing the cost of communication improves speedup
- Moving students (“processors”) closer together (or letting them shout)
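To make the partial-sum idea concrete, here is a minimal C++ sketch (illustrative code using std::thread, not the course's actual demo program): each thread privately sums its chunk of the array, and the only communication is combining the P partial sums at the end.

    // Parallel sum via partial sums: P threads each sum a chunk privately,
    // then the partial results are combined (the "communication" step).
    #include <numeric>
    #include <thread>
    #include <vector>
    #include <cstdio>

    int main() {
        std::vector<int> data(1 << 20, 1);      // 2^20 elements, all ones
        const int P = 4;                        // number of "students" (threads)
        std::vector<long long> partial(P, 0);
        std::vector<std::thread> workers;
        const size_t chunk = data.size() / P;
        for (int p = 0; p < P; ++p) {
            workers.emplace_back([&, p] {
                size_t begin = p * chunk;
                size_t end = (p == P - 1) ? data.size() : begin + chunk;
                // Private partial sum: no interaction with other threads here
                partial[p] = std::accumulate(data.begin() + begin,
                                             data.begin() + end, 0LL);
            });
        }
        for (auto& w : workers) w.join();
        // Communication: one thread combines the P partial sums
        long long total = std::accumulate(partial.begin(), partial.end(), 0LL);
        std::printf("total = %lld\n", total);   // prints 1048576
    }

With only one combine step at the end, the communication cost stays small relative to the computation; demo 3 below shows what happens when that ratio flips.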

Page 32: Lecture 1: Why Parallelism? Why Efficiency?

DEMO 2 (scaling up to four “processors”)

Page 33: Lecture 1: Why Parallelism? Why Efficiency?

Class observations from demo 2
▪ Imbalance in work assignment limited speedup
- Some students (“processors”) ran out of work to do (went idle), while others were still working on their assigned task
▪ Improving the distribution of work improved speedup
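One standard remedy for imbalance, sketched below in C++ (an illustrative sketch, not the course's demo; fake_work is a made-up stand-in for tasks of varying cost): instead of statically splitting the tasks up front, each thread dynamically grabs the next unclaimed task from a shared counter, so no thread sits idle while work remains.

    // Dynamic work assignment: threads pull tasks from a shared atomic counter.
    #include <atomic>
    #include <thread>
    #include <vector>
    #include <cstdio>

    // Stand-in for tasks whose costs vary widely; a static split of task IDs
    // would leave some threads idle while others keep working.
    long long fake_work(int task) {
        long long x = 0;
        for (long long i = 0; i < (task % 7 + 1) * 1000000LL; ++i) x += i;
        return x;
    }

    int main() {
        const int num_tasks = 64, P = 4;
        std::atomic<int> next_task{0};
        std::vector<long long> result(num_tasks);
        std::vector<std::thread> workers;
        for (int p = 0; p < P; ++p) {
            workers.emplace_back([&] {
                // Each thread claims the next task until none remain
                for (int t; (t = next_task.fetch_add(1)) < num_tasks; )
                    result[t] = fake_work(t);
            });
        }
        for (auto& w : workers) w.join();
        std::printf("all %d tasks done\n", num_tasks);
    }

Note that the shared counter is itself a small communication cost: a first taste of the efficiency trade-offs this course is about.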

Page 34: Lecture 1: Why Parallelism? Why Efficiency?

DEMO 3 (massively parallel execution)

Page 35: Lecture 1: Why Parallelism? Why Efficiency?

Class observations from demo 3

▪ The problem I just gave you has a significant amount of communication compared to computation

▪ Communication costs can dominate a parallel computation, severely limiting speedup

Page 36: Lecture 1: Why Parallelism? Why Efficiency?

Course theme 1: Designing and writing parallel programs... that scale!
▪ Parallel thinking
1. Decomposing work into pieces that can safely be performed in parallel
2. Assigning work to processors
3. Managing communication/synchronization between the processors so that it does not limit speedup
▪ Abstractions/mechanisms for performing the above tasks
- Writing code in popular parallel programming languages

Page 37: Lecture 1: Why Parallelism? Why Efficiency?

Course theme 2: Parallel computer hardware implementation: how parallel computers work
▪ Mechanisms used to implement abstractions efficiently
- Performance characteristics of implementations
- Design trade-offs: performance vs. convenience vs. cost
▪ Why do I need to know about hardware?
- Because the characteristics of the machine really matter (recall speed of communication issues in earlier demos)
- Because you care about efficiency and performance (you are writing parallel programs, after all!)

Page 38: Lecture 1: Why Parallelism? Why Efficiency?

Course theme 3: Thinking about efficiency

▪ FAST != EFFICIENT

▪ Just because your program runs faster on a parallel computer does not mean it is using the hardware efficiently
- Is 2x speedup on a computer with 10 processors a good result? (see the worked example below)

▪ Programmer’s perspective: make use of provided machine capabilities

▪ HW designer’s perspective: choosing the right capabilities to put in system (performance/cost, cost = silicon area?, power?, etc.)
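To answer that question with a standard yardstick (this definition is not on the slide, but it is the usual one): efficiency is speedup divided by processor count.

efficiency (using P processors) = speedup (using P processors) / P
efficiency (using 10 processors) = 2 / 10 = 0.2

So a 2x speedup on 10 processors means each processor delivers only 20% of its ideal contribution: faster, but not efficient.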

Page 39: Lecture 1: Why Parallelism? Why Efficiency?

Fundamental Shift in CPU Design Philosophy
Before 2004:
- within the chip area budget, maximize performance
- increasingly aggressive speculative execution for ILP
After 2004:
- area within the chip matters (limits # of cores per chip): maximize performance per area
- power consumption is critical (battery life, data centers): maximize performance per Watt
- upshot: major focus on efficiency of cores

Page 40: Lecture 1: Why Parallelism? Why Efficiency?

Summary
▪ Today, single-thread performance is improving very slowly
- To run programs significantly faster, programs must utilize multiple processing elements
- Which means you need to know how to write parallel code
▪ Writing parallel programs can be challenging
- Requires problem partitioning, communication, synchronization
- Knowledge of machine characteristics is important

▪ I suspect you will find that modern computers have tremendously more processing power than you might realize, if you just use it!

▪ Welcome to 15-418!