
Page 1

Concurrent and Distributed Programming

Lecture 1: Introduction

References:
● “Intro to Parallel Computing” by Blaise Barney
● Lecture 1, CS149 (Stanford), by Aiken & Olukotun
● “Intro to Parallel Programming” by Gupta

Page 2

Administration
● In charge: Assaf Schuster
● Lectures: Mark Silberstein
– Reception hours upon request
– Office: Fishbach 453
● TA in charge: Nadav Amit
– Administrative questions
● TA: Uri Verner, Checker: Yossi Kuperman, Assistant: Adi Omri

● Syllabus on the site – slight changes possible

Page 3

Grading
● 50% exam, 50% home assignments (assignments count only if exam > 55)
● 4 assignments planned; see the syllabus for the deadlines
– Threads (Java) – decent, OpenMP (C) – easy, BSP (C) – easy
– MPI (C) – hard
● We use plagiarism-detection software to catch copying, and punish it severely
● Late submissions: 3 free late days per semester; 3 points off per day beyond that
● No postponements (don't bother asking)

Page 4

Prerequisites

● Formal
– 234123 Operating Systems
– 234267 Digital Computer Structure, or the equivalent EE course
● Informal
– Programming skills (makefile, Linux, ssh)
– Positive attitude

Page 5

Question

● I write a “hello world” program in plain C
● I have the latest and greatest computer, with the latest and greatest OS and compiler

Does my program run in parallel?

Page 6

Serial vs. parallel program

● Serial: one instruction at a time
● Parallel: multiple instructions at the same time (see the sketch below)
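
As an illustration only (a minimal sketch of my own, not part of the original slides), here is the same loop written serially and then parallelized with OpenMP in C, one of the models used in the course assignments. The array size and the loop body are arbitrary placeholders; compile with something like gcc -fopenmp.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000   /* arbitrary size for this sketch */

static double a[N], b[N];

int main(void) {
    /* Serial: a single stream of instructions, one iteration after another. */
    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;

    /* Parallel: OpenMP divides the iterations among threads, so instructions
       from different iterations execute at the same time on different cores. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] = 2.0 * a[i];

    printf("b[42] = %f\n", b[42]);
    return 0;
}
```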

Page 7

Why parallel programming?

● Most software will have to be written as parallel software
● Why?

Page 8

Free lunch...
● Early '90s:
– Wait 18 months and get a new CPU with 2x the speed
– Moore's law: the number of transistors per chip doubles roughly every 18-24 months

[Figure: transistor count per chip over time. Source: Wikipedia]

Page 9

... is over
● More transistors = more execution units
● BUT performance per unit does not improve
● Bad news: serial programs will not run faster
● Good news: parallelization has high performance potential
– May get faster on newer architectures

Page 10

Parallel programming is hard
● Need to optimize for performance
– Understand management of resources
– Identify bottlenecks
● No one technology fits all needs
– Zoo of programming models, languages, runtimes
● Hardware architecture is a moving target
● Parallel thinking is NOT intuitive
● Parallel debugging is not fun

But there is no detour

Page 11

Concurrent and Distributed Programming Course

● Learn:
– Basic principles
– Parallel and distributed architectures
– Parallel programming techniques
– Basics of programming for performance

● This is a practical course – you will fail unless you complete all home assignments

Page 12

Parallelizing the Game of Life (a cellular automaton)

● Given a 2D grid
● v_t(i,j) = F(v_{t-1} of all of (i,j)'s neighbors) – see the serial sketch below

[Figure: cell (i,j) and its neighbors at i-1, i+1, j-1, j+1]
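
For reference, a minimal serial sketch (my own, not from the slides) of one update step in C, with the classic Game of Life rule standing in for F. Two buffers keep generation t-1 intact while generation t is written; the grid size and boundary handling are placeholder assumptions.

```c
#define N 256   /* grid size, arbitrary for this sketch */

/* One generation: next(i,j) = F(previous values of (i,j)'s 8 neighbors). */
static void step(const unsigned char cur[N][N], unsigned char next[N][N]) {
    for (int i = 1; i < N - 1; i++) {
        for (int j = 1; j < N - 1; j++) {
            int live = 0;
            for (int di = -1; di <= 1; di++)        /* count live neighbors */
                for (int dj = -1; dj <= 1; dj++)
                    if (di != 0 || dj != 0)
                        live += cur[i + di][j + dj];
            /* Conway's rule as a concrete F: birth on 3, survival on 2 or 3. */
            next[i][j] = (live == 3) || (live == 2 && cur[i][j]);
        }
    }
}
```

Parallelizing this amounts to dividing the (i,j) iterations among processors, which is exactly the partitioning question on the next slides.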

Page 13

Problem partitioning
● Domain decomposition (SPMD)
– Input domain
– Output domain
– Both
● Functional decomposition (MPMD)
– Independent tasks
– Pipelining

Page 14

We choose: domain decomposition

● The field is split between processors

[Figure: the grid around cell (i,j), split down the middle between CPU 0 and CPU 1]
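
A small sketch (hypothetical, not from the slides) of how such a split is usually expressed in code: each of P processors computes which contiguous block of columns it owns. The helper name and the column-wise orientation are assumptions for illustration.

```c
/* Block distribution of N columns over P processors: processor p owns
   columns [*first, *first + *count). The first (N % P) processors get one
   extra column when N does not divide evenly. */
static void my_columns(int N, int P, int p, int *first, int *count) {
    int base  = N / P;
    int extra = N % P;
    *count = base + (p < extra ? 1 : 0);
    *first = p * base + (p < extra ? p : extra);
}
```

In the two-CPU picture above, CPU 0 would loop only over its own columns in the update step, and CPU 1 over the rest.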

Page 15

Issue 1: Memory

● Can we access v(i+1,j) from CPU 0 as in the serial program?

[Figure: the same split grid; the neighbor v(i+1,j) falls on CPU 1's side]

Page 16

It depends...

● NO: distributed-memory architecture – disjoint memory spaces; remote data must be communicated over the network (see the sketch below)
● YES: shared-memory architecture – all CPUs access the same memory space

[Figure: left – distributed memory: CPU 0 and CPU 1, each with its own memory, connected by a network; right – shared memory: CPU 0 and CPU 1 attached to a single memory]
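
In the distributed-memory ("NO") case the neighboring value must be sent explicitly. Below is a hypothetical sketch (not from the slides) in C using MPI, the message-passing library used in one of the assignments: each of two ranks sends its boundary column to the other and receives the other's boundary into a ghost buffer. Names and sizes are illustrative assumptions.

```c
#include <mpi.h>

#define N 256   /* number of rows, arbitrary for this sketch */

/* Exchange one boundary column with the other rank (assumes exactly 2 ranks).
   After this call, `ghost` holds the neighbor's boundary cells, so the value
   "v(i+1,j)" can be read locally even though it lives in the other memory. */
static void exchange_boundary(double my_boundary[N], double ghost[N], int rank) {
    int other = 1 - rank;
    MPI_Sendrecv(my_boundary, N, MPI_DOUBLE, other, 0,
                 ghost,       N, MPI_DOUBLE, other, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```

In the shared-memory ("YES") case no such call is needed: CPU 0 simply reads the array element, subject to the synchronization issues discussed later.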

Page 17

Someone has to pay

● Distributed memory: harder to program, easier to build the hardware
● Shared memory: easier to program, harder to build the hardware

[Figure: the same two architectures as on the previous slide]

Page 18

Memory architecture

● On CPU 0: is the time to access v(i+1,j) equal to the time to access v(i-1,j)?

[Figure: the split grid again; v(i-1,j) lies in CPU 0's part, v(i+1,j) in CPU 1's part]

Page 19

Hardware shared memory, flavor 1

● Uniform memory access (UMA)
– The cost of accessing any data is the same for all processors

Page 20

Hardware shared memory, flavor 2
● Non-uniform memory access (NUMA): the access cost depends on where the data resides

● Software DSM: emulation of NUMA in software over a distributed memory space

Page 21

Software Distributed Shared Memory (SDSM)

[Figure: two nodes (CPU 0 + memory, CPU 1 + memory) connected by a network; DSM daemons attached to Process 1 and Process 2 present the distributed memories as one shared space]

Page 22

Memory-optimized programming

● Most modern systems are NUMA or distributed
● Access time difference, local vs. remote data: roughly 100x-10000x
● Memory accesses are the main target of optimization in parallel and distributed programs (see the example below)
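
A small, self-contained illustration (mine, not from the slides) of what optimizing memory accesses means even on a single node: a C array is laid out row by row, so traversing it in row order reuses each cache line, while the column order keeps striding across distant memory. The same principle, keeping accesses close to where the data lives, is what NUMA- and distributed-memory optimization is about.

```c
#define N 4096

static double a[N][N];   /* row-major: a[i][j] and a[i][j+1] are adjacent */

double sum_row_order(void) {       /* cache-friendly traversal */
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

double sum_column_order(void) {    /* same result, but every access jumps  */
    double s = 0.0;                /* N*8 bytes ahead: far more cache misses */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```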

Page 23

Issue 2: Control

● Can we assign one cell v(i,j) per CPU?

[Figure: cell (i,j) and its neighbors at i-1, i+1, j-1, j+1]

Page 24

Task management overhead

● Each task has state that must be managed
– More tasks – more state to manage
● Who manages the tasks?
● How many tasks should be run?
– Does that depend on F?
– Reminder: v(i,j) = F(all of v's neighbors)

Page 25

Question

● Every process reads the data from its neighbors
● Will it produce correct results?

[Figure: cell (i,j) and its neighbors at i-1, i+1, j-1, j+1]

Page 26

Synchronization

● The order of reads and writes made in different tasks is non-deterministic

● Synchronization is required to enforce the order
– Locks, semaphores, barriers, condition variables (see the barrier sketch below)
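
A minimal sketch (my own, not from the slides) of the barrier idea applied to the Game of Life, using OpenMP in C: no thread may start computing generation t+1 until every thread has finished writing generation t. The update body is a placeholder; only the synchronization structure matters here.

```c
#include <omp.h>

#define N 256
#define STEPS 100

static unsigned char grid[2][N][N];   /* two buffers: generations t and t+1 */

void run(void) {
    #pragma omp parallel
    for (int t = 0; t < STEPS; t++) {
        int cur = t % 2, nxt = 1 - cur;

        /* Each thread updates its share of the rows of grid[nxt] from grid[cur]. */
        #pragma omp for
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                grid[nxt][i][j] = grid[cur][i][j];   /* placeholder for the real rule F */

        /* "omp for" ends with an implicit barrier: no thread moves on to the
           next generation until all threads have finished writing grid[nxt]. */
    }
}
```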

Page 27

Checkpoint

Fundamental hardware-related issues:
● Memory accesses
– Optimizing locality of accesses
● Control
– Overhead
● Synchronization

Page 28

Parallel programming issues

● We decide to split this 3x3 grid like this:

[Figure: the 3x3 grid of cells around (i,j) split into four uneven pieces among CPU 0, CPU 1, CPU 2, and CPU 4]

Problems?

Page 29

Issue 1: Load balancing

● Always waiting for the slowest task
● Solutions?

[Figure: the same uneven four-way split as on the previous slide]

Page 30

Issue 2: Granularity
● G = computation / communication (a small worked example follows)
● Fine-grain parallelism
– G is small
– Good load balancing
– Potentially high overhead
● Coarse-grain parallelism
– G is large
– Potentially bad load balancing
– Low overhead

Which granularity works for you?
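
To make the ratio concrete, a back-of-the-envelope example (my own numbers, not from the slides) for the Game of Life decomposition above: a task that owns a k x k block of cells performs on the order of k^2 cell updates per generation but exchanges only on the order of 4k boundary cells, so

  G ≈ k^2 / (4k) = k / 4

Bigger blocks (coarser grain) raise G and shrink the relative communication overhead, but leave fewer tasks to balance across the CPUs.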

Page 31

It depends...

● It is different for each combination of computing platform and parallel application
– The goal is to minimize overheads and maximize CPU utilization
● High performance requires:
– Enough parallelism to keep the CPUs busy
– Low overheads relative to computation (communication, synchronization, task control, memory accesses)
– Good load balancing

Page 32

Summary

● Parallelism does not come for free
– Overhead
– Memory-aware access
– Synchronization
– Granularity

Page 33

Flynn's HARDWARE Taxonomy

Page 34

SIMD

● Single instruction, multiple data: lock-step execution
● Example: vector operations (SSE) – see the sketch below
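
A minimal sketch (not from the slides) of lock-step execution in C with SSE intrinsics: a single vector add instruction operates on four pairs of floats at once. The data values are arbitrary.

```c
#include <xmmintrin.h>   /* SSE intrinsics */
#include <stdio.h>

int main(void) {
    float a[4] = {1, 2, 3, 4};
    float b[4] = {10, 20, 30, 40};
    float c[4];

    __m128 va = _mm_loadu_ps(a);      /* load 4 floats                    */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* ONE instruction performs 4 adds  */
    _mm_storeu_ps(c, vc);

    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```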

Page 35

MIMD

● Multiple instruction, multiple data
● Example: multicores – see the sketch below
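
A minimal sketch (my own, not from the slides) of the MIMD idea on a multicore machine, using POSIX threads in C: two threads run different instruction streams on different data at the same time. Function names and the work they do are illustrative.

```c
#include <pthread.h>
#include <stdio.h>

static void *count_up(void *arg) {        /* one instruction stream...      */
    (void)arg;
    long s = 0;
    for (long i = 0; i < 1000000; i++) s += i;
    printf("sum = %ld\n", s);
    return NULL;
}

static void *halve_down(void *arg) {      /* ...a completely different one  */
    (void)arg;
    double x = 1e9;
    while (x > 1.0) x /= 2.0;
    printf("x = %f\n", x);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, count_up, NULL);
    pthread_create(&t2, NULL, halve_down, NULL);   /* both run concurrently */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```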