
Page 1:

EECE571R -- Harnessing Massively Parallel Processors

http://www.ece.ubc.ca/~matei/EECE571/

Lecture 1: Introduction

Matei Ripeanu, matei at ece dot ubc dot ca

Acknowledgement: some slides borrowed from presentations by Jim Demel, Horst Simon, Mark D. Hill, David Patterson, Saman Amarasinghe, Richard Brunner, Luddy Harrison, Jack Dongarra

Page 2:

Outline

• Why powerful computers must be parallel processors

• Science problems require powerful computers

• Why writing (fast) parallel programs is hard

• Why hybrid architectures (e.g., GPU based platforms)?

• Structure of this course

[Slide annotations: "Other problems too"; "Including your laptops and handhelds"; "all"; "But things are improving"]

Page 3:

Units of Measure

• High Performance Computing (HPC) units are:
- Flop: floating point operation
- Flops/s: floating point operations per second
- Bytes: size of data (a double precision floating point number is 8 bytes)

• Typical sizes are millions, billions, trillions…
Mega   Mflop/s = 10^6 flop/sec    Mbyte = 2^20 = 1048576 ~ 10^6 bytes
Giga   Gflop/s = 10^9 flop/sec    Gbyte = 2^30 ~ 10^9 bytes
Tera   Tflop/s = 10^12 flop/sec   Tbyte = 2^40 ~ 10^12 bytes
Peta   Pflop/s = 10^15 flop/sec   Pbyte = 2^50 ~ 10^15 bytes
Exa    Eflop/s = 10^18 flop/sec   Ebyte = 2^60 ~ 10^18 bytes
Zetta  Zflop/s = 10^21 flop/sec   Zbyte = 2^70 ~ 10^21 bytes
Yotta  Yflop/s = 10^24 flop/sec   Ybyte = 2^80 ~ 10^24 bytes

• Current fastest (public) machine ~ 2.6 Pflop/s
- Up-to-date list at www.top500.org
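A minimal sketch (added for illustration; not part of the original slides) of the prefix table above, showing the decimal prefixes used for flop/s against the binary prefixes used for bytes; the helper names are made up:

# Decimal prefixes (flop/s) vs. binary prefixes (bytes); illustrative only
DECIMAL = {"M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18}
BINARY  = {"M": 2**20, "G": 2**30, "T": 2**40, "P": 2**50, "E": 2**60}

def flops(value, prefix):
    """e.g. flops(2.6, 'P') -> 2.6e15 flop/s, the ~2010 Top500 leader."""
    return value * DECIMAL[prefix]

print(flops(2.6, "P"))             # 2.6e+15
print(BINARY["M"], DECIMAL["M"])   # 1048576 vs 1000000 (~5% apart)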

Page 4:

Performance evolution – Top500 List

Page 5:

Why powerful computers are parallel

circa 1991-2006

all (2007)

Page 6:

Tunnel Vision by Experts

• “I think there is a world market for maybe five computers.”
- Thomas Watson, chairman of IBM, 1943.
• “There is no reason for any individual to have a computer in their home.”
- Ken Olson, president and founder of Digital Equipment Corporation, 1977.
• “640K [of memory] ought to be enough for anybody.”
- Bill Gates, chairman of Microsoft, 1981.
• “On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it.”
- Ken Kennedy, CRPC Director, 1994

Slide source: Warfield et al.

Page 7:

Technology Trends: Microprocessor Capacity

2X transistors/Chip Every 1.5 years

Called “Moore’s Law”

Moore’s Law

Microprocessors have become smaller, denser, and more powerful.

Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.

Slide source: Jack Dongarra

Page 8:

Microprocessor Transistors per Chip

• Growth in transistors per chip
• Increase in clock rate
[Charts: transistor count per chip, 1970–2005 (i4004, i8080, i8086, i80286, i80386, Pentium, R2000, R3000, R10000), and clock rate in MHz, 1970–2000]

Page 9:

Impact of Device Shrinkage

• What happens when the feature size (transistor size) shrinks by a factor of x?
• Clock rate goes up by x because wires are shorter and gate capacitance is smaller
- actually less than x, because of power consumption
• Transistors per unit area go up by x^2
• Die size also tends to increase
- typically by another factor of ~x
• Raw computing power of the chip goes up by ~x^4!
- typically x^3 of this is devoted to either on-chip
- parallelism: hidden parallelism such as ILP
- locality: caches
• So most programs run significantly faster, without changing them
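A back-of-the-envelope sketch (added here, not from the slides) of the scaling argument above, with an assumed shrink factor x = 2:

# Idealized scaling when the feature size shrinks by a factor x (ignores power limits)
x = 2.0                                   # assumed shrink factor
clock_gain     = x                        # shorter wires, smaller gate capacitance
density_gain   = x**2                     # transistors per unit area
die_area_gain  = x                        # dies also tend to grow by ~x
raw_power_gain = clock_gain * density_gain * die_area_gain   # ~x^4
print(raw_power_gain)                     # 16.0 for x = 2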

Page 10:

But there are limiting forces

• Cost
- Moore’s 2nd law (Rock’s law): costs go up
• Yield
- The percentage of the chips that are usable
- E.g., Cell processor (PS3) is sold with 7 out of 8 “on” to improve yield

Manufacturing costs and yield problems limit use of density

Page 11:

Power Density Limits Serial Performance

Page 12:

Revolution is Happening Now

• Chip density continues to increase ~2x every 2 years

- Clock speed is not!

- Number of processor cores may double instead

• There is little or no more hidden parallelism (ILP) to be found

• Parallelism must be exposed to and managed by software

Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

Page 13:

Parallelism in 2011?

• These arguments are no longer theoretical

• All major processor vendors are producing multicore chips
- Every machine will soon be a parallel machine
- To keep doubling performance, parallelism must double
• Which commercial applications can use this parallelism?
- Do they have to be rewritten from scratch?
• Will all programmers have to be parallel programmers?
- New software model needed
- Try to hide complexity from most programmers – eventually
- In the meantime, need to understand it
• Computer industry is betting on this change
- But it does not have all the answers

Page 14:

More Limits: How fast can a serial computer be?

• Consider the 1 Tflop/s sequential machine:
- Data must travel some distance, r, to get from memory to processor.
- To get 1 data element per cycle, this means 10^12 times per second at the speed of light, c = 3x10^8 m/s. Thus r < c/10^12 = 0.3 mm.
• Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:
- Each bit occupies about 1 square Angstrom, or the size of a small atom.

• No choice but parallelism

r = 0.3 mm

1 Tflop/s, 1 Tbyte sequential machine
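A quick numeric check of this argument (added for illustration; not part of the original slides):

# One operand must arrive per cycle at 10^12 cycles/s, so light limits the memory radius r.
c = 3e8                        # speed of light, m/s
rate = 1e12                    # 1 Tflop/s -> 10^12 accesses per second
r = c / rate                   # maximum distance, metres
print(r)                       # 3e-04 m = 0.3 mm

# Pack 1 Tbyte into an r x r square: area per bit in square Angstroms (1 A^2 = 1e-20 m^2)
bits = 8e12
area_per_bit = (r * r) / bits
print(area_per_bit / 1e-20)    # ~1.1 square Angstroms, roughly the size of a small atom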

Page 15:

Performance Development

[Chart: Top500 performance development, 1996–2020 (projected), log scale from 100 Mflop/s to 1 Eflop/s; trend lines for SUM, N=1, and N=500; the fastest machine reached 2.6 Pflop/s in 2010]

Page 16:

Concurrency Levels
[Chart: minimum, average, and maximum number of processors per Top500 system, on a log scale from 1 to 1,000,000; a notebook computer is marked for comparison]

Page 17:

Moore’s Law reinterpreted

• Number of cores per chip will double every two years

• Clock speed will not increase (possibly decrease)

• Need to deal with systems with a large number of concurrent threads
- Millions for HPC systems
- Thousands for servers
- Hundreds for workstations and notebooks

• Need to deal with inter-chip parallelism as well as intra-chip parallelism

Page 18:

Outline

• Why powerful computers must be parallel processors

• Science problems require powerful computers

• Why writing (fast) parallel programs is hard

• Why hybrid architectures (e.g., GPU based platforms)?

• Structure of the course

[Slide annotations: "Commercial problems too"; "Including your laptops and handhelds"; "all"; "But things are improving"]

Page 19:

Computational Science

Nature, March 23, 2006

“An important development in sciences is occurring at the intersection of computer science and the sciences that has the potential to have a profound impact on science. It is a leap from the application of computing … to the integration of computer science concepts, tools, and theorems into the very fabric of science.” -Science 2020 Report, March 2006

Page 20:

Drivers for Change

• Continued exponential increase in computational power → simulation is becoming the third pillar of science, complementing theory and experiment
• Continued exponential increase in experimental data → techniques and technology in data analysis, visualization, analytics, networking, and collaboration tools are becoming essential in all data-rich scientific applications

Page 21:

Simulation: The Third Pillar of Science

• Traditional scientific and engineering method:
(1) Do theory or paper design

(2) Perform experiments or build system

• Limitations:
- Too difficult: build large wind tunnels
- Too expensive: build a throw-away passenger jet
- Too slow: wait for climate or galactic evolution
- Too dangerous: weapons, drug design, climate experimentation

• Computational science and engineering paradigm:
(3) Use high performance computer systems to simulate and analyze the phenomenon

- Based on known physical laws and efficient numerical methods

- Analyze simulation results with computational tools and methods beyond what is used traditionally for experimental data analysis

[Diagram: the three pillars: Theory, Experiment, Simulation]

Page 22:

What Supercomputers Do

One example: simulation replacing an experiment that is too dangerous

Page 23:

Global Climate Modeling Problem

• Problem is to compute:
"weather" = f(latitude, longitude, elevation, time) = (temperature, pressure, humidity, wind velocity)
• Approach:
- Discretize the domain, e.g., a measurement point every 10 km
- Devise an algorithm to predict weather at time t+Δt given time t
• Uses:
- Predict major events, e.g., El Niño
- Use in setting air emissions standards
- Evaluate global warming scenarios

Source: http://www.epm.ornl.gov/chammp/chammp.html

Page 24:

Global Climate Modeling Computation

• One piece is modeling the fluid flow in the atmosphere
- Solve Navier-Stokes equations
- Roughly 100 flops per grid point with a 1 minute timestep
• Computational requirements:
- To match real time, need 5 x 10^11 flops in 60 seconds = 8 Gflop/s
- Weather prediction (7 days in 24 hours): 56 Gflop/s
- Climate prediction (50 years in 30 days): 4.8 Tflop/s
- To use in policy negotiations (50 years in 12 hours): 288 Tflop/s

• To double the grid resolution / time resolution, computation is 8x to 16x

• State of the art models require integration of atmosphere, clouds, ocean, sea-ice, land models, plus possibly carbon cycle, geochemistry and more

• Current models are coarser than this
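The rates above follow from scaling the real-time requirement; a small sketch of that arithmetic (added for illustration; the slide's figures round the real-time rate to 8 Gflop/s, so these come out slightly higher):

# Scale the real-time rate by how much faster than real time the simulation must run.
realtime = 5e11 / 60                       # ~8.3 Gflop/s to keep up with real time
def required(sim_days, wall_days):
    return realtime * (sim_days / wall_days)

print(required(7, 1) / 1e9)                # weather: 7 days in 24 h  -> ~58 Gflop/s
print(required(50 * 365, 30) / 1e12)       # climate: 50 yr in 30 d   -> ~5 Tflop/s
print(required(50 * 365, 0.5) / 1e12)      # policy:  50 yr in 12 h   -> ~300 Tflop/s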

Page 25:

High Resolution Climate Modeling on NERSC-3 – P. Duffy, et al., LLNL

Page 26:

Which commercial applications require parallelism?

[Table: 13 computational motifs ("dwarfs") vs. application areas (Embedded, SPEC, DB, Games, ML, HPC):
1. Finite State Machine, 2. Combinational, 3. Graph Traversal, 4. Structured Grid, 5. Dense Matrix, 6. Sparse Matrix, 7. Spectral (FFT), 8. Dynamic Programming, 9. N-Body, 10. MapReduce, 11. Backtrack / Branch & Bound, 12. Graphical Models, 13. Unstructured Grid]
• Claim: parallel architecture, language, compiler … must do at least these well to run future parallel apps well
• Note: MapReduce is embarrassingly parallel; FSM embarrassingly sequential?
Analyzed in detail in the "Berkeley View" report: www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html

Page 27:

Outline

• Why powerful computers must be parallel processors

• Science problems require powerful computers

• Why writing (fast) parallel programs is hard

• Principles of parallel computing performance

• Structure of the course

[Slide annotations: "Commercial problems too"; "Including your laptops and handhelds"; "all"; "But things are improving"]

Page 28:

Principles of Parallel Computing

• Finding enough parallelism (Amdahl’s Law)

• Granularity

• Locality

• Load balance

• Coordination and synchronization

• Performance modeling

All of these things make parallel programming even harder than sequential programming.

Page 29:

“Automatic” Parallelism in Modern Machines
• Bit-level parallelism
- within floating point operations, etc.
• Instruction-level parallelism (ILP)
- multiple instructions execute per clock cycle
• Memory system parallelism
- overlap of memory operations with computation
• OS parallelism
- multiple jobs run in parallel on commodity SMPs

Limits to all of these -- for very high performance, need user to identify, schedule and coordinate parallel tasks

Page 30:

Amdahl’s Law

• Simple software assumptions
- Fraction F of execution time is perfectly parallelizable
- No overhead for scheduling, synchronization, communication, etc.
- Fraction 1 – F is completely serial

• Time on 1 core = (1 – F) / 1 + F / 1 = 1

• Time on N cores = (1 – F) / 1 + F / N

Page 31:

Finding Enough Parallelism

• Implications:
- Attack the common case when introducing parallelizations: when F is small, optimizations will have little effect.
- Even if the parallel part speeds up perfectly, performance is limited by the sequential part.
- As N approaches infinity, speedup is bounded by 1/(1 – F).
- The aspects you ignore will limit speedup.
• Discussion:
- Can you ever obtain super-linear speedups?

Amdahl’s Speedup = 1 / ( (1 – F) + F / N )
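A minimal sketch of Amdahl's Law (added; not from the slides), showing how the 1/(1 – F) bound bites as N grows:

def amdahl_speedup(F, N):
    """Speedup on N cores when a fraction F of the work is perfectly parallel."""
    return 1.0 / ((1.0 - F) + F / N)

for F in (0.5, 0.9, 0.99):
    print(F, amdahl_speedup(F, 1024), 1.0 / (1.0 - F))   # speedup on 1024 cores vs. its bound
# F=0.5 -> ~2.0 (bound 2), F=0.9 -> ~9.9 (bound 10), F=0.99 -> ~91 (bound 100)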

Page 32:

Overhead of Parallelism

• Given enough parallel work, this is the biggest barrier to getting desired speedup

• Parallelism overheads include:
- cost of starting a thread or process

- cost of communicating shared data

- cost of synchronizing

- extra (redundant) computation

• Each of these can be in the range of milliseconds (=millions of flops) on some systems

• Tradeoff: Algorithm needs sufficiently large units of work to run fast in parallel (i.e. large granularity), but not so large that there is not enough parallel work
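A toy model of this granularity tradeoff (added for illustration; the flop rate and per-chunk overhead are assumptions, not measurements):

import math

def parallel_time(W, g, P, flop_rate=1e9, overhead_s=1e-3):
    """Time for W flops split into chunks of g flops on P workers, with a fixed per-chunk overhead."""
    chunks = math.ceil(W / g)
    per_chunk = g / flop_rate + overhead_s     # compute time + startup/sync overhead
    return math.ceil(chunks / P) * per_chunk   # executed in rounds of P chunks

W, P = 1e12, 64
for g in (1e6, 1e8, 1e10):
    print(g, parallel_time(W, g, P))
# g too small: overhead dominates (~31 s); g too large: too few chunks to balance (~20 s);
# a middle granularity approaches the ideal W / (P * flop_rate) ~ 15.6 s.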

Page 33:

Locality and Parallelism

• Large memories are slow, fast memories are small

• Storage hierarchies are large and fast on average

• Parallel processors, collectively, have a large, fast cache
- the slow accesses to “remote” data we call “communication”

• Algorithm should do most work on local data

[Diagram: conventional storage hierarchy: processor + L1 cache, L2 cache, L3 cache, memory, replicated per processor, with potential interconnects between them]

Page 34:

Processor-DRAM Gap (latency)

[Chart: processor vs. DRAM performance, 1980–2000 ("Moore's Law"); µProc improves ~60%/yr, DRAM ~7%/yr; the processor-memory performance gap grows ~50%/yr]

Goal: find algorithms that minimize communication, not necessarily arithmetic
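A rough cost model (added; the per-flop, per-word, and per-message costs below are illustrative assumptions) of why minimizing data movement can matter more than minimizing arithmetic:

# Simple model: time = flops * t_flop + words_moved * t_word + messages * t_msg
t_flop = 1e-9      # seconds per flop (assumed)
t_word = 1e-8      # seconds per word moved from slow/remote memory (assumed)
t_msg  = 1e-6      # seconds per message or miss (assumed)

def exec_time(flops, words, messages):
    return flops * t_flop + words * t_word + messages * t_msg

# Same arithmetic, different amounts of communication:
print(exec_time(1e9, 1e9, 1e3))   # movement-heavy version: ~11 s, dominated by data movement
print(exec_time(1e9, 1e7, 1e3))   # blocked, cache-friendly version: ~1.1 s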

Page 35:

Load Imbalance

• Load imbalance is the time that some processors in the system are idle due to

- insufficient parallelism (during that phase)

- unequal size tasks

• Examples of the latter
- adapting to “interesting parts of a domain”
- tree-structured computations
- fundamentally unstructured problems
• Algorithm needs to balance load
- Sometimes the work load can be determined and divided up evenly before starting: “static load balancing”
- Sometimes the work load changes dynamically and needs to be rebalanced on the fly: “dynamic load balancing”
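A toy comparison of static vs. dynamic assignment of unequal tasks (added for illustration; task sizes, counts, and the greedy policy are arbitrary assumptions):

import heapq, random

tasks = [random.expovariate(1.0) for _ in range(1000)]    # unequal task sizes
P = 8

# Static: pre-assign tasks round-robin; finish time is the most loaded worker
static_finish = max(sum(tasks[i::P]) for i in range(P))

# Dynamic: always hand the next task to the currently least loaded worker
loads = [0.0] * P
heapq.heapify(loads)
for t in tasks:
    heapq.heappush(loads, heapq.heappop(loads) + t)

print(static_finish, max(loads), sum(tasks) / P)   # dynamic is usually closer to the ideal sum/P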

Page 36:

Building Parallel Software

• 2 types of programmers → 2 layers
• Efficiency Layer (10% of today’s programmers)
- Expert programmers build libraries implementing motifs, “frameworks”, OS, …
- Highest fraction of peak performance possible
• Productivity Layer (90% of today’s programmers)
- Domain experts / naïve programmers productively build parallel applications by composing frameworks & libraries
- Hide as many details of the machine and parallelism as possible
- Willing to sacrifice some performance for productive programming
• You may want to work at either level
- Goal of this course: understand enough of the efficiency layer to use parallelism effectively

Page 37:

Improving Real Performance

[Chart: peak vs. real performance in Teraflops, 1996–2004; the performance gap between them widens]

• Peak performance grows exponentially, a la Moore’s Law
• But efficiency (the performance relative to the hardware peak) has declined:
- was 40-50% on the vector supercomputers of the 1990s
- can be as little as 5-10% on parallel supercomputers of today
• Close the gap through:
- mathematical methods and algorithms that achieve high performance on a single processor and scale to thousands of processors
- more efficient programming models and tools for massively parallel supercomputers


Page 38:

Outline

• Why powerful computers must be parallel processors

• Science problems require powerful computers

• Why writing (fast) parallel programs is hard

• Why hybrid architectures (e.g., GPU based platforms)?

• Structure of the course

[Slide annotations: "Commercial problems too"; "Including your laptops and handhelds"; "all"; "But things are improving"]

Page 39:

Hybrid architectures in Top 500 sites

Page 40:

Amdahl’s Law [1967]

Amdahl’s Speedup = 1 / ( (1 – F) + F / N )

Page 41:

Designing a multicore platform
• Designers must confront single-core design options
- Instruction fetch, wakeup, select
- Execution unit configuration & operand bypass
- Load/store queue(s) & data cache
- Checkpoint, log, runahead, commit
• As well as additional degrees of freedom
- How many cores? How big is each?

- Shared caches: How many levels? How many banks?

- On-chip interconnect: bus, switched?

Page 42:

Want Simple Multicore Hardware Model to Complement Amdahl’s Simple Software Model
(1) Chip hardware roughly partitioned into
- (I) Multiple cores (with L1 caches)
- (II) The rest (L2/L3 cache banks, interconnect, pads, etc.)
- Assume: changing core size/number does NOT change “the rest”
(2) Resources for multiple cores are bounded
- Bound of N resources per chip for cores
- Due to area, power, cost ($$$), or multiple factors
- Bound = power? (but the pictures here use area)

Page 43:

Simple Multicore Hardware Model, (cont)

(3) Architects can improve single-core performance using more of the bounded resource
• A simple base core
- Consumes 1 Base Core Equivalent (BCE) of resources
- Provides performance normalized to 1
• An enhanced core (in the same process generation)
- Consumes R BCEs
- Performance as a function of R: Perf(R)
• What does the function Perf(R) look like?

Page 44:

More on Enhanced Cores

• (Performance Perf(R) when consuming R BCEs of resources)
• If Perf(R) > R, always enhance the core
- It cost-effectively speeds up both sequential & parallel work
• Therefore, the equations assume Perf(R) < R
• Graphs assume Perf(R) = square root of R
- 2x performance for 4 BCEs, 3x for 9 BCEs, etc.
- Why? Models diminishing returns with “no coefficients”
• Two options: symmetric / asymmetric multicore chips

Page 45:

Q1: How Many (Symmetric) Cores per Chip?

• Each Chip Bounded to N BCEs (for all cores)

• Each Core consumes R BCEs

• Assume Symmetric Multicore = All Cores Identical

• Therefore, N/R Cores per Chip — (N/R)*R = N

• For an N = 16 BCE Chip:

[Figure: sixteen 1-BCE cores, four 4-BCE cores, or one 16-BCE core]

Page 46:

Performance of Symmetric Multicore Chips

• Serial Fraction 1-F uses 1 core at rate Perf(R)

• Serial time = (1 – F) / Perf(R)

• Parallel Fraction uses N/R cores at rate Perf(R) each

• Parallel time = F / (Perf(R) * (N/R)) = F*R / (Perf(R)*N)

• Therefore, w.r.t. one base core:

• Implications?

Symmetric Speedup = 1 / ( (1 – F) / Perf(R) + F * R / (Perf(R) * N) )
Enhanced cores speed up both the serial & parallel fractions
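A small sketch of this formula (added; it reproduces the example points quoted on the following slides, under the slides' Perf(R) = sqrt(R) assumption):

import math

def perf(R):                        # assumed Perf(R) = sqrt(R), as in the graphs
    return math.sqrt(R)

def symmetric_speedup(F, N, R):
    return 1.0 / ((1.0 - F) / perf(R) + F * R / (perf(R) * N))

print(symmetric_speedup(0.5, 16, 16))    # ~4.0 (F=0.5, one 16-BCE core)
print(symmetric_speedup(0.9, 16, 2))     # ~6.7 (F=0.9, eight 2-BCE cores)
print(symmetric_speedup(0.99, 256, 3))   # ~80  (F=0.99, 85 3-BCE cores)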

Page 47:

Symmetric Multicore Chip, N = 16 BCEs

F=0.5, Opt. Speedup S = 4 = 1/(0.5/4 + 0.5*16/(4*16))

Need more parallelism to have multicore optimal!

[Chart: symmetric speedup vs. R for N = 16 BCEs, with 16, 8, 4, 2, or 1 cores]
F=0.5: R=16, Cores=1, Speedup=4

Page 48:

Symmetric Multicore Chip, N = 16 BCEs

At F=0.9, Multicore optimal, but speedup limited

Need to obtain even more parallelism!

F=0.5: R=16, Cores=1, Speedup=4

F=0.9, R=2, Cores=8, Speedup=6.7

Page 49:

Symmetric Multicore Chip, N = 16 BCEs

F matters: Amdahl’s Law applies to multicore chips

Researchers should target parallelism F first

F=1: R=1, Cores=16, Speedup=16

Page 50:

Symmetric Multicore Chip, N = 16 BCEs

As Moore’s Law enables N to go from 16 to 256 BCEs,

More core enhancements? More cores? Or both?

Recall F=0.9, R=2, Cores=8, Speedup=6.7

Page 51:

Symmetric Multicore Chip, N = 256 BCEs

As Moore’s Law increases N, often need enhanced core designs

Researcher should target single-core performance too

F=0.9: R=28 (vs. 2), Cores=9 (vs. 8), Speedup=26.7 (vs. 6.7): CORE ENHANCEMENTS!
F=1: R=1 (vs. 1), Cores=256 (vs. 16), Speedup=204 (vs. 16): MORE CORES!
F=0.99: R=3 (vs. 1), Cores=85 (vs. 16), Speedup=80 (vs. 13.9): CORE ENHANCEMENTS & MORE CORES!

Page 52:

Outline

• Symmetric Multicore Chips

• Asymmetric Multicore Chips

Page 53:

Asymmetric (Heterogeneous) Multicore Chips

• Symmetric Multicore Required All Cores Equal

• Why Not Enhance Some (But Not All) Cores?

• For Amdahl’s simple software assumptions:
- One enhanced core
- Others are base cores

• How does this affect our hardware model?

Page 54:

How Many Cores per Asymmetric Chip?

• Each Chip Bounded to N BCEs (for all cores)

• One R-BCE Core leaves N-R BCEs

• Use N-R BCEs for N-R Base Cores

• Therefore, 1 + N - R Cores per Chip

• For an N = 16 BCE Chip:

[Figure: Symmetric: four 4-BCE cores; Asymmetric: one 4-BCE core & twelve 1-BCE base cores]

Page 55:

Performance of Asymmetric Multicore Chips

• Serial Fraction 1-F same, so time = (1 – F) / Perf(R)

• Parallel fraction F
- One core at rate Perf(R)

- N-R cores at rate 1

- Parallel time = F / (Perf(R) + N - R)

• Therefore, w.r.t. one base core:

Asymmetric Speedup = 1 / ( (1 – F) / Perf(R) + F / (Perf(R) + N – R) )
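A companion sketch (added; same Perf(R) = sqrt(R) assumption) comparing asymmetric and symmetric speedups for N = 256 BCEs:

import math

def perf(R):
    return math.sqrt(R)

def symmetric_speedup(F, N, R):
    return 1.0 / ((1.0 - F) / perf(R) + F * R / (perf(R) * N))

def asymmetric_speedup(F, N, R):
    # One R-BCE enhanced core plus (N - R) base cores
    return 1.0 / ((1.0 - F) / perf(R) + F / (perf(R) + N - R))

F, N = 0.99, 256
for R in (1, 4, 16, 64, 256):
    print(R, round(symmetric_speedup(F, N, R), 1), round(asymmetric_speedup(F, N, R), 1))
# The single enhanced core speeds the serial fraction without giving up most of the
# parallel throughput, so the asymmetric design wins over a wide range of R here.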

Page 56:

Asymmetric Multicore Chip, N = 256 BCEs

Number of Cores = 1 (Enhanced) + 256 – R (Base)

How do Asymmetric & Symmetric speedups compare?

[Chart: asymmetric speedup vs. R for N = 256 BCEs; configurations with 256, 253, 241, 193, and 1 cores]

Page 57:

Outline

• Why powerful computers must be parallel processors

• Science problems require powerful computers

• Why writing (fast) parallel programs is hard

• Why hybrid architectures (e.g., GPU based platforms)?

• Structure of the course

[Slide annotations: "Commercial problems too"; "Including your laptops and handhelds"; "all"; "But things are improving"]

Page 58:

Course Mechanics

Course page: http://www.ece.ubc.ca/~matei/EECE571/

Office hours: after each class or by appointment (email me)

Email: matei @ ece.ubc.ca
Office: KAIS 4033

Page 59:

EECE 571R: Course Goals

• Primary
- Gain understanding of fundamental issues that affect the design of parallel applications for massively multicore processors
- Survey main current research themes
• Secondary
- By studying a set of outstanding papers, build knowledge of how to do & present research

- Learn how to read papers & evaluate ideas

What I’ll Assume You Know
• You are familiar with
- C programming; programming for traditional multicore machines: threading, synchronization
- Traditional CPU architectures: caching, pipelines, …

• If there are things that don’t make sense, ask!

Page 60:

This class

• Weekly schedule (tentative)

• Note: dates are tentative
- But always on Mondays 5-8pm, KAIS 4018
Class structure:
• Lectures – 1h
- Technology
- Research topic

• Paper discussion – 1h

• Projects – 1h

Page 61:

Administrivia: Grading
• Paper reviews: 25%
• Class participation: 15%
• Discussion leading: 10%
• Project: 50%

Page 62:

Administrivia: Paper Reviewing (1)
• Goals:
- Think about what you read
- Expand your knowledge beyond the papers that are assigned
- Get used to writing paper reviews
• Reviews are due by midnight the day before the class
• Have an eye on the writing style / be professional in your writing:
- Clarity
- Beware of traps: learn to use them in writing and detect them in reading
- Detect (and stay away from) trivial claims.

Page 63:

Administrivia: Paper Reviewing (2)

Follow the format provided when relevant.

• Summarize the main contribution of the paper

• Critique the main contribution:
- Significance: rate the significance of the paper on a scale of 5 (breakthrough), 4 (significant contribution), 3 (modest contribution), 2 (incremental contribution), 1 (no contribution or negative contribution).

- More importantly: Explain your rating in a sentence or two.

- Rate how convincing the methodology is.

- Do the claims and conclusions follow from the experiments?

- Are the assumptions realistic?

- Are the experiments well designed?

- Are there different experiments that would be more convincing?

- Are there other alternatives the authors should have considered?

- (And, of course, is the paper free of methodological errors?)

- What is the most important limitation of the approach?

Page 64:

Administrivia: Paper Reviewing (3)

• What are the two strongest and/or most interesting ideas in the paper?

• What are the two most striking weaknesses in the paper?

• Name two questions that you would like to ask the authors.

• Detail an interesting extension to the work not mentioned in the future work section.

• Optional comments on the paper that you’d like to see discussed in class.

Page 65:

Administrivia: Discussion Leading

• Come prepared!
- Prepare a 3-5 minute summary of the paper

- Prepare discussion outline

- Prepare questions:

- “What if”s

- Unclear aspects of the solution proposed

- …

- Similar ideas in different contexts

- Initiate short brainstorming sessions

• Leaders do NOT need to submit paper reviews

• Main goals:
- Keep discussion flowing

- Keep discussion relevant

- Engage everybody (I’ll have an eye on this, too)

Page 66:

Administrivia: Projects
• It’s just a class, not your PhD.
• Aim high!
- Goal: with one or two extra months of work, your project should be ready to be submitted to a decent conference / publication
- It is doable!

• Combine with your research if relevant to this course

• Get informal approval from all instructors if you overlap final projects:

- Don’t sell the same piece of work twice

- You can get more than twice as many results with less than twice as much work

Page 67:

Administrivia: Project timeline (tentative)
• 3rd week – 3-slide idea presentation: “What is the (research) question you aim to answer?”
• 5th week: 2-page project proposal
--------------------- 1-week --- spring reading period
• 8th week: 4-page midterm project report due
- Have a clear image of what’s possible/doable
- Report preliminary results
- Includes related work

• Final week [see schedule]: in-class project presentation
- Presentation

- Demo, if appropriate

- 8-page write-up

Page 68:

Next Class (Mon, 17/01)

• Note room change: KAIS 4018

• Discussion of - Project ideas

- Papers

To do:

• Subscribe to mailing list

• Volunteers needed: discussion leaders for next week’s class

Page 69:

What you should get out of the course

In-depth understanding of:

• When is parallel computing useful?

• Overview of programming models (software) and tools for massively parallel processors

• Some important parallel applications and the algorithms

• Performance analysis and tuning

• Exposure to various open research questions

Page 70:

Questions?

Page 71:

Extra slides

Page 72:

Open research questions (Dongarra)

Page 73:

Open research questions (Hwu)

• Accelerate computations that have no good parallel structure today
- E.g., graph analysis
• Data distributions cause catastrophic load imbalances in parallel algorithms
- scale-free graphs, MRI spiral scan, astronomy
• Computations with no data reuse
- E.g., matrix-vector multiplication
• Algorithm optimizations that are hard to do today
- how do I extract locality and regularity of access?

Page 74:

Applications

• Applications:
1. What are the apps?
2. What are the kernels of the apps?
• Hardware:
3. What are the HW building blocks?
4. How to connect them?
• Programming Model / Systems Software:
5. How to describe apps and kernels?
6. How to program the HW?
• Evaluation:
7. How to measure success?

(Inspired by a view of the Golden Gate Bridge from Berkeley)

CS267 focus is here

Page 75:

[Diagram: Par Lab research overview: "Easy to write correct programs that run efficiently on manycore."
Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser; Motifs/Dwarfs.
Productivity layer: Composition & Coordination Language (C&CL), Parallel Libraries, Parallel Frameworks, Sketching.
Efficiency layer: C&CL Compiler/Interpreter, Efficiency Languages, Efficiency Language Compilers, Type Systems, Autotuners, Legacy Code, Schedulers, Communication & Synchronization Primitives.
Correctness: Static Verification, Dynamic Checking, Debugging with Replay, Directed Testing.
OS/Arch: Legacy OS, Multicore/GPGPU, OS Libraries & Services, Hypervisor, RAMP Manycore.]