
Page 1

Zvika Guz¹, Oved Itzhak¹, Idit Keidar¹, Avinoam Kolodny¹, Avi Mendelson², and Uri C. Weiser¹

Threads vs. Caches: Modeling the Behavior of Parallel Workloads

¹Technion – Israel Institute of Technology, ²Microsoft Corporation

Page 2

Chip-Multiprocessor Era

Challenges:
Single-core performance trends are gloomy
Exploit chip-multiprocessors with multithreaded applications
The memory gap is paramount: latency, bandwidth, power

[Figure: Hennessy and Patterson, Computer Architecture: A Quantitative Approach]

Two basic remedies:
Cache – reduce the number of off-die memory accesses
Multi-threading – hide memory accesses behind thread execution

How do they play together? How do we make the most of them?

Page 3

Outline

The many-core span: Cache-Machines ↔ MT-Machines

A high-level analytical model: performance curves study

A few examples

Summary

Page 4

Outline

The many-core span: Cache-Machines ↔ MT-Machines

A high-level analytical model: performance curves study

A few examples

Summary

Page 5

Cache-Machines vs. MT-Machines

Many-Core – CMP with many simple cores; tens to hundreds of Processing Elements (PEs)

[Figure: the many-core span, plotted as cache per thread vs. number of threads – from the Uni-Processor region through the Multi-Core and Cache Architecture regions to the MT Architecture region; example design points include Intel's Larrabee, Nvidia's GT200, and Nvidia's Fermi.]

What are the basic tradeoffs? How will workloads behave across the range?

Predicting performance

Page 6

Outline

The many-core span: Cache-Machines ↔ MT-Machines

A high-level analytical model: performance curves study

A few examples

Summary

Page 7

A Unified Machine Model

Use both cache and many threads to shield memory accesses
The uniform framework renders the comparison meaningful
We derive simple, parameterized equations for performance, power, BW, ...

[Figure: the unified machine model – an array of processing elements, each holding multiple thread architectural states, behind a shared cache connected to memory.]

Page 8

Cache Machines

Many cores (each may have its own private L1) behind a shared cache

[Figure: the cache-machine configuration – all PEs behind a large shared cache connected to memory, with a sketch of performance vs. number of threads marking the Cache Non-Effective point (CNE).]

Page 9

Multi-Thread Machines

Memory latency is shielded by executing multiple threads

[Figure: the MT-machine configuration – many thread architectural states per PE, connected directly to memory; thread execution and memory-access phases are interleaved. Performance vs. number of threads climbs toward the maximum performance but is eventually capped by bandwidth limitations.]

Page 10

Analysis (1/3)

Given a ratio of memory-access instructions $r_m$ ($0 \le r_m \le 1$):
Every $1/r_m$-th instruction accesses memory, so a thread executes $1/r_m$ instructions and then stalls for $t_{avg}$ cycles.

$t_{avg}$ = Average Memory Access Time (AMAT) [cycles]

[Figure: single-thread timeline – an execution burst of $\frac{1}{r_m} \cdot CPI_{exe}$ cycles ending in a load, followed by a stall of $t_{avg}$ cycles.]

Page 11

Analysis (2/3)

The PE stays idle during the stall unless it is filled with instructions from other threads; each thread occupies the PE for an additional $\frac{1}{r_m} \cdot CPI_{exe}$ cycles.

$$n_{threads} = 1 + \frac{t_{avg} \cdot r_m}{CPI_{exe}}$$

threads are needed to fully utilize each PE.

[Figure: timeline of several threads interleaved on one PE – each thread's execution burst of $\frac{1}{r_m} \cdot CPI_{exe}$ cycles overlaps the $t_{avg}$-cycle stalls of the others.]
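As a purely illustrative worked example (the numeric values below are assumptions, not figures from the slides), combining the relation above with the AMAT expression used later in the model:

```latex
% Assumed values for illustration only:
% CPI_exe = 1, r_m = 0.2, t_$ = 10 cycles, t_m = 200 cycles, P_hit = 0.8
\begin{align*}
t_{avg} &= P_{hit}\, t_{\$} + (1 - P_{hit})\, t_{m} = 0.8 \cdot 10 + 0.2 \cdot 200 = 48 \ \text{cycles} \\
n_{threads} &= 1 + \frac{t_{avg}\, r_m}{CPI_{exe}} = 1 + \frac{48 \cdot 0.2}{1} = 10.6 \approx 11 \ \text{threads per PE}
\end{align*}
```

Under these assumed parameters, roughly eleven ready threads per PE would suffice to hide the average memory latency.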

Page 12

Analysis (3/3)

Machine utilization, where $n$ is the number of available threads and $1 + \frac{r_m\, t_{avg}}{CPI_{exe}}$ is the number of threads needed to utilize a single PE:

$$\eta = \min\!\left(1,\; \frac{n}{N_{PE}\left(1 + \frac{r_m\, t_{avg}}{CPI_{exe}}\right)}\right)$$

Performance in Operations Per Second [OPS], where $N_{PE} \cdot \frac{f}{CPI_{exe}}$ is the peak performance:

$$Performance = \eta \cdot N_{PE} \cdot \frac{f}{CPI_{exe}} \;[\mathrm{OPS}]$$

Page 13

Performance Model

Overall performance is the minimum of the PE-utilization bound and the off-chip bandwidth bound:

$$Performance(n) = \min\!\left(\eta \cdot N_{PE} \cdot \frac{f}{CPI_{exe}},\; \frac{BW_{max}}{r_m \cdot b_{reg} \cdot \left(1 - P_{hit}(S_{\$}, n)\right)}\right) \;[\mathrm{OPS}]$$

Machine (PE) utilization:

$$\eta = \min\!\left(1,\; \frac{n}{N_{PE}\left(1 + \frac{r_m\, t_{avg}}{CPI_{exe}}\right)}\right)$$

Average memory access time:

$$t_{avg} = AMAT = t_{\$} \cdot P_{hit}(S_{\$}, n) + t_m \cdot \left(1 - P_{hit}(S_{\$}, n)\right) \;[\mathrm{cycles}]$$

Off-chip BW:

$$BW(n) = Performance(n) \cdot r_m \cdot b_{reg} \cdot \left(1 - P_{hit}(S_{\$}, n)\right)$$

Power:

$$Power = Performance(n) \cdot \left(e_{ex} + r_m \left(P_{hit}(S_{\$}, n)\, e_{\$} + \left(1 - P_{hit}(S_{\$}, n)\right) e_{mem}\right)\right) + Power_{leakage}$$
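To make the model concrete, here is a minimal Python sketch of the performance equations above; the hit-rate function and all numeric parameter values in it are illustrative assumptions rather than values from the deck.

```python
import math

def amat(p_hit, t_cache, t_mem):
    """Average memory access time t_avg [cycles]."""
    return t_cache * p_hit + t_mem * (1.0 - p_hit)

def utilization(n, n_pe, r_m, cpi_exe, t_avg):
    """Machine utilization eta = min(1, n / (N_PE * (1 + r_m*t_avg/CPI_exe)))."""
    threads_per_pe = 1.0 + r_m * t_avg / cpi_exe
    return min(1.0, n / (n_pe * threads_per_pe))

def performance_ops(n, n_pe, f, cpi_exe, r_m, b_reg, bw_max, p_hit, t_cache, t_mem):
    """Performance(n) = min(compute-bound, bandwidth-bound) [operations/sec]."""
    t_avg = amat(p_hit, t_cache, t_mem)
    eta = utilization(n, n_pe, r_m, cpi_exe, t_avg)
    compute_bound = eta * n_pe * f / cpi_exe
    miss_rate = 1.0 - p_hit
    bw_bound = math.inf if miss_rate == 0 else bw_max / (r_m * b_reg * miss_rate)
    return min(compute_bound, bw_bound)

if __name__ == "__main__":
    # Illustrative (assumed) machine and workload parameters.
    N_PE, F, CPI_EXE = 1024, 1e9, 1.0
    T_CACHE, T_MEM = 10.0, 200.0          # cycles
    BW_MAX, B_REG = 100e9, 4.0            # bytes/sec, bytes per off-chip access
    R_M = 0.2                             # fraction of memory instructions

    # Assumed hit-rate function: hit rate degrades as more threads share the cache.
    def p_hit(n_threads):
        return max(0.0, 0.9 - 0.0004 * n_threads)

    for n in (64, 512, 2048, 8192):
        perf = performance_ops(n, N_PE, F, CPI_EXE, R_M, B_REG,
                               BW_MAX, p_hit(n), T_CACHE, T_MEM)
        print(f"n={n:5d}  P_hit={p_hit(n):.2f}  perf={perf/1e9:.1f} GOPS")
```

Sweeping n with a hit-rate function that degrades with thread count reproduces the three-region shape discussed next: a cache region, a valley, and an MT region capped by bandwidth.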

Page 14

Outline

The many-core span: Cache-Machines ↔ MT-Machines

A high-level analytical model: performance curves study

A few examples

Summary

Page 15

Unified Machine Performance

3 regions: the cache-efficiency region, the Valley, and the MT-efficiency region

[Figure: performance vs. number of threads for the unified machine, spanning the cache region, the Valley, and the MT region.]

Page 16

Cache Size Impact

[Figure: performance for different cache sizes under limited BW – GOPS vs. number of threads (up to 20,000) for no cache, 16 MB, 32 MB, 64 MB, 128 MB, and a perfect cache.]

Increasing the cache size means the cache suffices for more in-flight threads, extending the $ region.

...AND it is also valuable in the MT region: caches reduce off-chip bandwidth and so delay the BW saturation point.
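This bandwidth effect follows directly from the BW-bound term of the model: a larger cache raises $P_{hit}(S_{\$}, n)$, which raises the ceiling $BW_{max} / (r_m\, b_{reg}\, (1 - P_{hit}))$. A small self-contained sketch (all numbers are assumptions chosen only for illustration):

```python
# Bandwidth-bound performance ceiling as the hit rate improves with cache size.
# All parameter values below are illustrative assumptions.
BW_MAX = 100e9   # bytes/sec of off-chip bandwidth
R_M = 0.2        # fraction of memory instructions
B_REG = 4.0      # bytes transferred per off-chip access

for p_hit in (0.0, 0.5, 0.8, 0.95):
    ceiling = BW_MAX / (R_M * B_REG * (1.0 - p_hit))
    print(f"P_hit={p_hit:.2f}  BW-bound ceiling = {ceiling/1e9:.0f} GOPS")
```

Halving the miss rate doubles the saturation ceiling, which is why larger caches remain valuable deep in the MT region.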

Page 17

Hit Rate Function Impact

Simulation results from the PARSEC workload suite. Swaptions: a perfect Valley.

[Figure: Swaptions – performance (GOPS) and cache hit rate (%) vs. number of threads (up to 2000), comparing the analytical model with simulation.]

Page 18

Hit Rate Function Impact

Simulation results from the PARSEC workload suite. Raytrace: monotonically increasing performance.

[Figure: Raytrace – performance (GOPS) and cache hit rate (%) vs. number of threads (up to 2000), comparing the analytical model with simulation.]

Page 19

Hit Rate Dependency – 3 Classes

Three application families, based on how the cache miss rate depends on the number of threads (illustrated in the sketch below):
A "strong" function of the number of threads – f(N^q) with q > 1
A "weak" function of the number of threads – f(N^q) with q ≤ 1
Not a function of the number of threads

[Figure: performance vs. number of threads for the three classes.]
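A minimal sketch of what the three classes can look like as miss-rate functions of the thread count; the coefficients and exponents are assumptions chosen only to illustrate the shapes:

```python
# Three illustrative miss-rate classes as a function of thread count N.
# The coefficients and exponents are assumptions for illustration only.
def miss_rate_strong(n, c=1e-6, q=1.5):
    """Miss rate grows as a 'strong' function of N (q > 1), capped at 1."""
    return min(1.0, c * n ** q)

def miss_rate_weak(n, c=1e-4, q=0.8):
    """Miss rate grows as a 'weak' function of N (q <= 1), capped at 1."""
    return min(1.0, c * n ** q)

def miss_rate_flat(n, m=0.05):
    """Miss rate independent of the number of threads."""
    return m

for n in (100, 1000, 10000):
    print(n, round(miss_rate_strong(n), 4),
          round(miss_rate_weak(n), 4), miss_rate_flat(n))
```

Feeding these miss-rate shapes into the performance model yields qualitatively different curves: the "strong" class falls into the valley quickly, while the flat class keeps climbing until the bandwidth limit.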

Page 20

Workload Parallelism Impact

Simulation results from the PARSEC workload suite. Canneal: not enough parallelism available.

[Figure: Canneal – performance (GOPS) and cache hit rate (%) vs. number of threads (up to 2000), comparing the analytical model with simulation.]

Page 21

Outline

The many-core span: Cache-Machines ↔ MT-Machines

A high-level analytical model: performance curves study

A few examples

Summary

Page 22

Summary

A high-level model for many-core engines: a unified framework for machines and workloads from across the range

A vehicle for deriving intuition:
A qualitative study of the tradeoffs
A tool for understanding the impact of parameters
Identifies new behaviors and the applications that exhibit them
Enables reasoning about complex phenomena

A first step towards escaping the Valley

Thank You
[email protected]

Page 23


Backup


Page 24


Model Parameters


Page 25

Model Parameters

Machine parameters:

Parameter – Description
N_PE – Number of PEs (in-order processing elements)
S_$ – Cache size [Bytes]
N_max – Maximal number of thread contexts in the register file
CPI_exe – Average number of cycles required to execute an instruction, assuming a perfect (zero-latency) memory system [cycles]
f – Processor frequency [Hz]
t_$ – Cache latency [cycles]
t_m – Memory latency [cycles]
BW_max – Maximal off-chip bandwidth [GB/sec]
b_reg – Operand size [Bytes]

Page 26

Model Parameters

Workload parameters:

Parameter – Description
n – Number of threads that execute or are in a ready state (not blocked) concurrently
r_m – Fraction of instructions accessing memory out of the total number of instructions [0 ≤ r_m ≤ 1]
P_hit(S, n) – Cache hit rate for each thread when n threads are using a cache of size S

Page 27

Model Parameters

Power parameters:

Parameter – Description
e_ex – Energy per operation [J]
e_$ – Energy per cache access [J]
e_mem – Energy per memory access [J]
Power_leakage – Leakage power [W]
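For experimenting with the model in code, a hedged sketch that groups the three parameter tables into Python dataclasses; the field names are mine and the default values are illustrative assumptions, not values from the slides:

```python
from dataclasses import dataclass

@dataclass
class MachineParams:
    n_pe: int = 1024          # N_PE: number of in-order processing elements
    s_cache: int = 64 << 20   # S_$: cache size [bytes]
    n_max: int = 8192         # N_max: maximal thread contexts in the register file
    cpi_exe: float = 1.0      # CPI_exe: cycles per instruction with perfect memory
    f: float = 1e9            # processor frequency [Hz]
    t_cache: float = 10.0     # t_$: cache latency [cycles]
    t_mem: float = 200.0      # t_m: memory latency [cycles]
    bw_max: float = 100e9     # BW_max: maximal off-chip bandwidth [bytes/sec]
    b_reg: float = 4.0        # b_reg: operand size [bytes]

@dataclass
class WorkloadParams:
    n: int = 1024             # number of ready/executing threads
    r_m: float = 0.2          # fraction of memory instructions (0 <= r_m <= 1)
    # P_hit(S, n) is workload-specific; supply it separately as a callable.

@dataclass
class PowerParams:
    e_ex: float = 1e-10         # energy per operation [J]
    e_cache: float = 1e-10      # e_$: energy per cache access [J]
    e_mem: float = 1e-9         # energy per memory access [J]
    power_leakage: float = 1.0  # leakage power [W]
```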

Page 28

PARSEC Workloads

Page 29

Model Validation, PARSEC Workloads

[Figure: model validation on six PARSEC workloads – Raytrace, Dedup, Canneal, Bodytrack, Swaptions, and Blackscholes. Each panel plots performance (GOPS) and cache hit rate (%) vs. number of threads, comparing the analytical model against simulation.]

Page 30

Related Work


Page 31

Related Work

Similar approaches using high-level models:
Morad et al., CA-Letters 2005
Hill and Marty, IEEE Computer 2008
Eyerman and Eeckhout, ISCA-2010

Agrawal, TPDS-1992

Saavedra-Barrera and Culler, Berkeley 1991

Sorin et al., ISCA-1998

Hong and Kim, ISCA-2009

Baghsorkhi et al., PPoPP-2010

[Figure: the many-core span diagram, as on Page 5.]