Threads vs. Caches: Modeling the Behavior of Parallel Workloads

Zvika Guz (1), Oved Itzhak (1), Idit Keidar (1), Avinoam Kolodny (1), Avi Mendelson (2), and Uri C. Weiser (1)

(1) Technion – Israel Institute of Technology, (2) Microsoft Corporation

Chip-Multiprocessor Era

Challenges:
Single-core performance trend is gloomy
Exploit chip-multiprocessors with multithreaded applications
The memory gap is paramount: latency, bandwidth, power

[Figure: Hennessy and Patterson, Computer Architecture: A Quantitative Approach]

Two basic remedies:
Cache: reduce the number of off-die memory accesses
Multi-threading: hide memory accesses behind the execution of other threads

How do they play together? How do we make the most out of them?

Outline

The many-core span: Cache-Machines ↔ MT-Machines
A high-level analytical model
Performance curves study
A few examples
Summary




Cache-Machines vs. MT-Machines

Many-Core: a CMP with many simple cores, tens to hundreds of Processing Elements (PEs)

[Figure: design space spanned by the number of threads (thread context) and the cache per thread. Regions: Uni-Processor, Multi-Core, Cache Architecture, MT Architecture. Examples shown: Intel's Larrabee, Nvidia's GT200, Nvidia's Fermi]

What are the basic tradeoffs? How will workloads behave across the range?

Predicting performance




A Unified Machine Model

Use both cache and many threads to shield memory access
The uniform framework renders the comparison meaningful
We derive simple, parameterized equations for performance, power, BW, ...

[Figure: unified machine: an array of cores (C) with threads' architectural states, a shared cache, and a path to memory]

Cache Machines

Many cores (each may have its private L1) behind a shared cache

[Figure: array of cores (C) behind a shared cache with a path to memory; plot of performance vs. number of threads, marking the Cache Non Effective point (CNE)]

Multi-Thread Machines

Memory latency shielded by multiple thread execution

[Figure: array of cores (C) with threads' architectural states and a path to memory; timeline of execution and memory access phases; plot of performance vs. number of threads, rising to a maximum set by bandwidth limitations]

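Analysis (1/3)

Given a ratio of memory access instructions \(r_m\) (\(0 \le r_m \le 1\)):
One in every \(1/r_m\) instructions accesses memory
A thread executes \(1/r_m\) instructions, then stalls for \(t_{avg}\) cycles
\(t_{avg}\) = Average Memory Access Time (AMAT) [cycles]

[Figure: thread timeline, t in cycles: \(\frac{1}{r_m} \cdot CPI_{exe}\) cycles of execution ending in a load (ld), followed by a stall of \(t_{avg}\) cycles]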

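Analysis (2/3)

The PE stays idle unless it is filled with instructions from other threads
Each additional thread occupies the PE for another \(\frac{1}{r_m} \cdot CPI_{exe}\) cycles
Threads needed to fully utilize each PE:

\[ 1 + \frac{t_{avg}}{CPI_{exe} / r_m} = 1 + t_{avg} \cdot \frac{r_m}{CPI_{exe}} \]

[Figure: PE timeline, t in cycles: while one thread waits \(t_{avg}\) cycles on its load (ld), other threads each execute for \(\frac{1}{r_m} \cdot CPI_{exe}\) cycles]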

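Analysis (3/3)

Machine utilization:

\[ \eta = \min\left(1,\ \frac{n_{threads}}{N_{PE} \cdot \left(1 + t_{avg} \cdot \frac{r_m}{CPI_{exe}}\right)}\right) \]

Here \(n_{threads}\) is the number of available threads, and \(1 + t_{avg} \cdot \frac{r_m}{CPI_{exe}}\) is the number of threads needed to utilize a single PE.

Performance in Operations Per Second [OPS]:

\[ Performance = N_{PE} \cdot \frac{f}{CPI_{exe}} \cdot \eta \quad [OPS] \]

Here \(N_{PE} \cdot \frac{f}{CPI_{exe}}\) is the peak performance.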
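The three analysis steps can be sketched in a few lines of Python. This is only an illustration of the formulas above: the function names are made up for the example, and any numeric values passed in are assumptions rather than figures from the talk.

```python
def threads_per_pe(r_m, t_avg, cpi_exe):
    """Threads needed to keep one PE busy: 1 + t_avg / (CPI_exe / r_m)."""
    return 1.0 + t_avg * r_m / cpi_exe


def utilization(n_threads, n_pe, r_m, t_avg, cpi_exe):
    """Machine utilization: min(1, available threads / threads needed)."""
    needed = n_pe * threads_per_pe(r_m, t_avg, cpi_exe)
    return min(1.0, n_threads / needed)


def performance_ops(n_threads, n_pe, f_hz, r_m, t_avg, cpi_exe):
    """Performance in operations per second: peak performance times utilization."""
    peak = n_pe * f_hz / cpi_exe
    return peak * utilization(n_threads, n_pe, r_m, t_avg, cpi_exe)
```

For instance, with an assumed r_m of 0.25, t_avg of 200 cycles and CPI_exe of 1, each PE needs 51 threads, so a 1024-PE machine reaches peak performance only once 52224 threads are available.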

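Performance Model

Performance, limited by PE utilization and by off-chip bandwidth:

\[ Performance = \min\left( N_{PE} \cdot \frac{f}{CPI_{exe}} \cdot \eta,\ \frac{BW_{max}}{r_m \cdot b_{reg} \cdot \left(1 - P_{hit}(S_{\$}, n_{threads})\right)} \right) \quad [OPS] \]

Machine utilization:

\[ \eta = \min\left(1,\ \frac{n_{threads}}{N_{PE} \cdot \left(1 + t_{avg} \cdot \frac{r_m}{CPI_{exe}}\right)}\right) \]

Average memory access time (AMAT):

\[ t_{avg} = P_{hit}(S_{\$}, n_{threads}) \cdot t_{\$} + \left(1 - P_{hit}(S_{\$}, n_{threads})\right) \cdot t_m \quad [cycles] \]

Power:

\[ Power = Performance \cdot \left[ e_{ex} + r_m \cdot \left( P_{hit}(S_{\$}, n_{threads}) \cdot e_{\$} + \left(1 - P_{hit}(S_{\$}, n_{threads})\right) \cdot e_{mem} \right) \right] + Power_{leakage} \]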
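A minimal sketch of the performance part of the model, assuming the hit rate is supplied by the caller as a function of cache size and thread count. The function and argument names are hypothetical, and the sketch takes the off-chip bandwidth in bytes per second so that units cancel.

```python
def amat(p_hit, t_cache, t_mem):
    """Average memory access time in cycles."""
    return p_hit * t_cache + (1.0 - p_hit) * t_mem


def model_performance(n_threads, n_pe, f_hz, cpi_exe, r_m,
                      cache_bytes, t_cache, t_mem,
                      bw_max_bytes_per_s, b_reg, hit_rate):
    """Performance [OPS]: utilization-limited rate capped by off-chip bandwidth.

    hit_rate is a caller-supplied function (cache_bytes, n_threads) -> [0, 1].
    """
    p_hit = hit_rate(cache_bytes, n_threads)
    t_avg = amat(p_hit, t_cache, t_mem)
    # Machine utilization: available threads over threads needed by all PEs.
    eta = min(1.0, n_threads / (n_pe * (1.0 + t_avg * r_m / cpi_exe)))
    compute_ops = n_pe * f_hz / cpi_exe * eta
    # Bytes that must cross the chip boundary per executed operation.
    miss_bytes_per_op = r_m * b_reg * (1.0 - p_hit)
    if miss_bytes_per_op == 0.0:
        return compute_ops  # no off-chip traffic, so no bandwidth cap
    return min(compute_ops, bw_max_bytes_per_s / miss_bytes_per_op)
```

Sweeping n_threads with a hit-rate function that drops as threads are added is one way to reproduce the shape of the performance curves discussed next.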




Unified Machine Performance

3 regions: Cache efficiency region, The Valley, MT efficiency region

[Figure: performance vs. number of threads, showing the cache region, the valley, and the MT region]

Cache Size Impact

Increase in cache size:
The cache suffices for more in-flight threads, which extends the $ region
It is also valuable in the MT region: caches reduce off-chip bandwidth and delay the BW saturation point

[Figure: Performance for Different Cache Sizes (Limited BW): performance (GOPS) vs. number of threads for no $, 16M, 32M, 64M, 128M, and perfect $]
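One way to see why a cache still pays off in the MT region is to look at the bandwidth-limited ceiling on its own. The sketch below assumes the ceiling is the off-chip bandwidth divided by the off-chip bytes per operation; the function name and the numbers are illustrative only.

```python
def bandwidth_ceiling_ops(bw_max_bytes_per_s, r_m, b_reg, p_hit):
    """Highest sustainable OPS before the off-chip bandwidth saturates."""
    off_chip_bytes_per_op = r_m * b_reg * (1.0 - p_hit)
    if off_chip_bytes_per_op == 0.0:
        return float("inf")  # every access hits, nothing goes off-chip
    return bw_max_bytes_per_s / off_chip_bytes_per_op


# Assumed numbers: 100 GB/s, 25% memory instructions, 4-byte operands.
for p_hit in (0.0, 0.5, 0.9):
    ceiling = bandwidth_ceiling_ops(100e9, 0.25, 4, p_hit)
    print(f"hit rate {p_hit:.0%}: ceiling {ceiling / 1e9:.0f} GOPS")
```

Under these assumptions, even a modest hit rate raises the ceiling: halving the miss rate doubles the performance at which bandwidth saturates.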

Hit Rate Function Impact

Simulation results from the PARSEC workloads kit
Swaptions: perfect valley

[Figure: Swaptions: performance (GOPS) and cache hit rate (%) vs. number of threads; curves: Analytical Model, Simulation, Cache Hit Rate]

Hit Rate Function Impact

Simulation results from the PARSEC workloads kit
Raytrace: monotonically increasing performance

[Figure: Raytrace: performance (GOPS) and cache hit rate (%) vs. number of threads; curves: Analytical Model, Simulation, Cache Hit Rate]

Hit Rate Dependency – 3 Classes

Three application families, based on how the cache miss rate depends on the number of threads:
A "strong" function of the number of threads: \(f(N^q)\) with \(q > 1\)
A "weak" function of the number of threads: \(f(N^q)\) with \(q \le 1\)
Not a function of the number of threads

[Figure: performance vs. number of threads for the three classes]
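To make the three classes concrete, here is a toy hit-rate function. The slides do not give the functional form of f, so the power law below, its name, and the `scale` constant are all assumptions chosen only to show how the exponent q separates the classes.

```python
def hit_rate(n_threads, q, scale=1e-3):
    """Assumed form: miss rate grows as scale * n_threads**q, clipped to [0, 1].

    q > 1 gives a "strong" dependency, 0 < q <= 1 a "weak" one,
    and q = 0 a hit rate that does not depend on the thread count.
    """
    miss = min(1.0, scale * n_threads ** q)
    return 1.0 - miss
```

With this stand-in, a strong dependency drives the hit rate to zero after a few dozen threads, while a weak one keeps the cache useful far longer.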

Workload Parallelism Impact

Simulation results from the PARSEC workloads kit
Canneal: not enough parallelism available

[Figure: Canneal: performance (GOPS) and cache hit rate (%) vs. number of threads; curves: Simulation, Analytical Model, Cache Hit Rate]




Summary

A high-level model for many-core engines
A unified framework for machines and workloads from across the range
A vehicle to derive intuition
Qualitative study of the tradeoffs
A tool to understand the impact of parameters
Identifies new behaviors and the applications that exhibit them
Enables reasoning about complex phenomena
First step towards escaping the valley

Thank you

Backup

Model Parameters

Machine parameters:

Parameter   Description
N_PE        Number of PEs (in-order processing elements)
S_$         Cache size [Bytes]
N_max       Maximal number of thread contexts in the register file
CPI_exe     Average number of cycles required to execute an instruction, assuming a perfect (zero-latency) memory system [cycles]
f           Processor frequency [Hz]
t_$         Cache latency [cycles]
t_m         Memory latency [cycles]
BW_max      Maximal off-chip bandwidth [GB/sec]
b_reg       Operand size [Bytes]

Workload parameters:

Parameter    Description
n            Number of threads that execute or are in ready state (not blocked) concurrently
r_m          Fraction of instructions accessing memory out of the total number of instructions [0 ≤ r_m ≤ 1]
P_hit(s, n)  Cache hit rate for each thread, when n threads are using a cache of size s

Power parameters:

Parameter      Description
e_ex           Energy per operation [J]
e_$            Energy per cache access [J]
e_mem          Energy per memory access [J]
Power_leakage  Leakage power [W]
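A sketch of one way these four parameters can be combined into a power estimate. It assumes that dynamic power is the energy per operation times the operation rate, that every memory instruction pays either the cache or the memory access energy, and that leakage is added on top. The function name and this exact combination are assumptions for illustration.

```python
def power_watts(perf_ops, r_m, p_hit, e_ex, e_cache, e_mem, p_leakage):
    """Assumed power model: energy per operation times OPS, plus leakage [W]."""
    # Each operation costs e_ex; a fraction r_m also accesses the memory system,
    # paying e_cache on a hit and e_mem on a miss.
    energy_per_op = e_ex + r_m * (p_hit * e_cache + (1.0 - p_hit) * e_mem)
    return perf_ops * energy_per_op + p_leakage
```

Because e_mem typically dwarfs e_cache, this form makes the power cost of a falling hit rate visible alongside its performance cost.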

Model Validation, PARSEC Workloads

[Figure: six panels (Raytrace, Dedup, Canneal, Bodytrack, Swaptions, Blackscholes), each plotting performance (GOPS) and cache hit rate (%) vs. number of threads; curves: Analytical Model, Simulation, Cache Hit Rate]

Related Work

Similar approach of using high-level models:
Morad et al., CA-Letters 2005
Hill and Marty, IEEE Computer 2008
Eyerman and Eeckhout, ISCA-2010

Agarwal, TPDS-1992
Saavedra-Barrera and Culler, Berkeley 1991
Sorin et al., ISCA-1998
Hong and Kim, ISCA-2009
Baghsorkhi et al., PPoPP-2010
