Post on 19-Dec-2015
Threads vs. Caches: Modeling the Behavior of Parallel Workloads
Zvika Guz (1), Oved Itzhak (1), Idit Keidar (1), Avinoam Kolodny (1), Avi Mendelson (2), and Uri C. Weiser (1)
(1) Technion – Israel Institute of Technology, (2) Microsoft Corporation
Chip-Multiprocessor Era
Challenges:
Single-core performance trend is gloomy
Exploit chip-multiprocessors with multithreaded applications
The memory gap is paramount: latency, bandwidth, power
[Figure: Hennessy and Patterson, Computer Architecture: A Quantitative Approach]
Two basic remedies:
Cache: reduce the number of out-of-die memory accesses
Multi-threading: hide memory accesses behind thread execution
How do they play together? How do we make the most out of them?
Outline
The many-core span: Cache-Machines ↔ MT-Machines
A high-level analytical model
Performance curves study
A few examples
Summary
Cache-Machines vs. MT-Machines
Many-core: a CMP with many simple cores, tens to hundreds of Processing Elements (PEs)
[Figure: design space plotted as cache per thread vs. number of threads, with regions labeled Uni-Processor, Multi-Core, Cache Architecture, and MT Architecture; example machines shown: Intel's Larrabee, Nvidia's GT200, Nvidia's Fermi]
What are the basic tradeoffs? How will workloads behave across the range?
Predicting performance
A Unified Machine Model
Use both cache and many threads to shield memory access
The uniform framework renders the comparison meaningful
We derive simple, parameterized equations for performance, power, BW, ...
[Figure: array of cores (C) with thread architectural states, sharing a cache that connects to memory]
Cache Machines
Many cores (each may have its private L1) behind a shared cache
[Figure: array of cores (C) sharing a cache that connects to memory; plot of performance vs. number of threads, marking the Cache Non Effective point (CNE)]
Multi-Thread Machines
Memory latency shielded by multiple thread execution
[Figure: array of cores (C) with thread architectural states, connected directly to memory; thread timelines alternating execution and memory access; plot of performance vs. number of threads, labeled "Max performance" and "Bandwidth Limitations"]
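Analysis (1/3)
Given a ratio of memory access instructions $r_m$ ($0 \le r_m \le 1$):
One in every $1/r_m$ instructions accesses memory
A thread executes $1/r_m$ instructions, then stalls for $t_{avg}$ cycles
$t_{avg}$ = Average Memory Access Time (AMAT) [cycles]
[Figure: timeline of one thread, t [cycles]: $\frac{1}{r_m} CPI_{exe}$ cycles of execution between loads (ld), each load followed by a stall of $t_{avg}$ cycles]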
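Analysis (2/3)
The PE stays idle unless filled with instructions from other threads
Each thread occupies the PE for an additional $\frac{1}{r_m} CPI_{exe}$ cycles
$1 + \frac{t_{avg}}{CPI_{exe} / r_m}$ threads are needed to fully utilize each PE
[Figure: timeline, t [cycles]: the $t_{avg}$ stall after a load (ld) is filled by other threads, each executing for $\frac{1}{r_m} CPI_{exe}$ cycles before its own load]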
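As a quick numeric check of the thread count above, the following Python sketch evaluates 1 + t_avg / (CPI_exe / r_m). The function name and the sample values are illustrative assumptions, not figures from the slides.

```python
def threads_per_pe(r_m: float, cpi_exe: float, t_avg: float) -> float:
    """Threads needed to keep one PE busy: 1 + t_avg / (CPI_exe / r_m)."""
    if not 0 < r_m <= 1:
        raise ValueError("r_m must be in (0, 1]")
    # Cycles a thread executes between two consecutive memory accesses.
    busy_cycles = cpi_exe / r_m
    # One running thread, plus enough others to cover the stall.
    return 1 + t_avg / busy_cycles


# Assumed values: one memory access per 4 instructions, CPI_exe = 1,
# and an average memory access time of 200 cycles.
print(threads_per_pe(r_m=0.25, cpi_exe=1.0, t_avg=200.0))
```

With these assumed values each thread runs for 4 cycles and then waits 200, so the stall has to be covered by many other threads per PE.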
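Analysis (3/3)
Machine utilization:
$\eta = \min\left(1, \frac{n_{threads}}{N_{PE}\left(1 + t_{avg}\frac{r_m}{CPI_{exe}}\right)}\right)$
where $n_{threads}$ is the number of available threads and $1 + t_{avg}\frac{r_m}{CPI_{exe}}$ is the number of threads needed to utilize a single PE
Performance in operations per second [OPS]:
$Performance = N_{PE}\,\frac{f}{CPI_{exe}}\,\eta$ [OPS]
where $N_{PE}\,\frac{f}{CPI_{exe}}$ is the peak performance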
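The utilization and performance expressions of this analysis can be sketched in a few lines of Python. This is a minimal illustration under assumed parameter values; the function names are hypothetical, and memory bandwidth is not modeled at this step.

```python
def machine_utilization(n_threads, n_pe, r_m, cpi_exe, t_avg):
    """Fraction of PE cycles doing useful work, capped at 1."""
    # Threads needed to keep a single PE busy.
    threads_per_pe = 1 + t_avg * r_m / cpi_exe
    return min(1.0, n_threads / (n_pe * threads_per_pe))


def performance_ops(n_threads, n_pe, f_hz, r_m, cpi_exe, t_avg):
    """Peak performance (N_PE * f / CPI_exe) scaled by machine utilization."""
    peak = n_pe * f_hz / cpi_exe
    return peak * machine_utilization(n_threads, n_pe, r_m, cpi_exe, t_avg)


# Assumed machine: 1024 PEs at 1 GHz, CPI_exe = 1, r_m = 0.25, t_avg = 200 cycles.
# 1024 PEs * 51 threads per PE = 52224 threads are needed for full utilization.
half = performance_ops(26112, 1024, 1e9, 0.25, 1.0, 200.0)
full = performance_ops(100000, 1024, 1e9, 0.25, 1.0, 200.0)
print(half, full)
```

Running with half the required threads yields half the peak, and any thread count beyond the requirement leaves performance at the peak.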
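Performance Model
Performance, limited by PE utilization and by off-chip BW:
$Performance = \min\left(N_{PE}\,\frac{f}{CPI_{exe}}\,\eta,\ \frac{BW_{max}}{r_m\, b_{reg}\,\left(1 - P_{hit}(S_\$, n_{threads})\right)}\right)$ [OPS]
Machine utilization:
$\eta = \min\left(1, \frac{n_{threads}}{N_{PE}\left(1 + t_{avg}\frac{r_m}{CPI_{exe}}\right)}\right)$
Average memory access time:
$t_{avg} = AMAT = P_{hit}(S_\$, n_{threads})\,t_\$ + \left(1 - P_{hit}(S_\$, n_{threads})\right)\,t_m$ [cycles]
Power (energy per operation terms):
$e_{ex} + r_m\left[P_{hit}(S_\$, n_{threads})\,e_\$ + \left(1 - P_{hit}(S_\$, n_{threads})\right)\,e_{mem}\right]$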
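To make the interplay of the two performance bounds concrete, here is a hedged Python sketch of the model: the compute bound (peak performance times utilization) and the off-chip bandwidth bound, with the hit rate passed in as a plain number. Function names, argument order, and the sample machine are assumptions; `bw_max` is taken in bytes per second, so a value quoted in GB/sec must be converted by the caller.

```python
def amat(p_hit, t_cache, t_mem):
    """Average memory access time in cycles."""
    return p_hit * t_cache + (1 - p_hit) * t_mem


def performance_ops(n_threads, n_pe, f_hz, cpi_exe, r_m,
                    p_hit, t_cache, t_mem, bw_max, b_reg):
    """min(compute bound, bandwidth bound) in operations per second."""
    t_avg = amat(p_hit, t_cache, t_mem)
    utilization = min(1.0, n_threads / (n_pe * (1 + t_avg * r_m / cpi_exe)))
    compute_bound = n_pe * f_hz / cpi_exe * utilization
    # Average off-chip traffic generated by one operation, in bytes.
    bytes_per_op = r_m * b_reg * (1 - p_hit)
    # With a perfect cache there is no off-chip traffic and no bandwidth bound.
    bw_bound = float("inf") if bytes_per_op == 0 else bw_max / bytes_per_op
    return min(compute_bound, bw_bound)


# Assumed machine: 1024 PEs, 1 GHz, 1-cycle cache, 200-cycle memory,
# 64e9 bytes/sec off-chip bandwidth, 4-byte operands.
limited = performance_ops(100000, 1024, 1e9, 1.0, 0.25, 0.5, 1.0, 200.0, 64e9, 4)
perfect = performance_ops(100000, 1024, 1e9, 1.0, 0.25, 1.0, 1.0, 200.0, 64e9, 4)
print(limited, perfect)
```

In this example the machine has enough threads to be fully utilized in both runs, so the 50% hit-rate case is capped by bandwidth while the perfect-cache case reaches the peak.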
Unified Machine Performance
3 regions: cache efficiency region, the valley, MT efficiency region
[Figure: performance vs. number of threads, with the cache region, the valley, and the MT region marked]
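The three regions can be reproduced qualitatively with a small Python sweep over the number of threads. The hit-rate curve below is a hypothetical stand-in (perfect while all per-thread working sets fit in the cache, then decaying as a power of the fit ratio), the machine values are assumed, and off-chip bandwidth is treated as unlimited, so this is a sketch of the shape rather than of any measured result.

```python
def hit_rate(cache_bytes, n_threads, ws_bytes=16 * 1024, q=2.0):
    """Hypothetical hit rate: 1 while all working sets fit, then (fit ratio)^q."""
    fit = cache_bytes / (n_threads * ws_bytes)
    return 1.0 if fit >= 1 else fit ** q


def performance_gops(n_threads, n_pe=1024, f_hz=1e9, cpi_exe=1.0, r_m=0.2,
                     t_cache=1.0, t_mem=200.0, cache_bytes=16 * 2**20):
    """Compute-bound performance in GOPS (no bandwidth limit modeled)."""
    p_hit = hit_rate(cache_bytes, n_threads)
    t_avg = p_hit * t_cache + (1 - p_hit) * t_mem
    utilization = min(1.0, n_threads / (n_pe * (1 + t_avg * r_m / cpi_exe)))
    return n_pe * f_hz / cpi_exe * utilization / 1e9


# Sweep from the cache region, through the valley, into the MT region.
for n in (256, 1024, 2048, 8192, 65536):
    print(n, round(performance_gops(n), 1))
```

Performance climbs while the cache still holds every working set, collapses once it no longer does, and recovers only when there are enough threads to hide the full memory latency.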
Cache Size Impact
Increase in cache size → cache suffices for more in-flight threads → extends the $ region
...and also valuable in the MT region: caches reduce off-chip bandwidth → delay the BW saturation point
[Figure: Performance for Different Cache Sizes (Limited BW); GOPS (0 to 1100) vs. number of threads (0 to 20000) for no $, 16M, 32M, 64M, 128M, and perfect $, with an arrow marking the increase in cache size]
Hit Rate Function Impact
Simulation results from the PARSEC workloads kit
Swaptions: perfect valley
[Figure: Swaptions; performance (GOPS) and cache hit rate (%) vs. number of threads (0 to 2000); curves: analytical model, simulation, cache hit rate]
Hit Rate Function Impact
Simulation results from the PARSEC workloads kit
Raytrace: monotonically increasing performance
[Figure: Raytrace; performance (GOPS) and cache hit rate (%) vs. number of threads (0 to 2000); curves: analytical model, simulation, cache hit rate]
Hit Rate Dependency – 3 Classes
Three application families, based on how the cache miss rate depends on the number of threads N:
A "strong" function of the number of threads: $f(N^q)$ with $q > 1$
A "weak" function of the number of threads: $f(N^q)$ with $q \le 1$
Not a function of the number of threads
[Figure: performance vs. number of threads for each class]
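The slides do not give the exact form of f, so the following Python sketch uses one hypothetical family to show how the exponent q separates the three classes: the hit rate stays at 1 while the threads fit in the cache and then falls as (fit_threads / n)^q. Both the functional form and the `fit_threads` value are assumptions made only for illustration.

```python
def hit_rate(n_threads: int, q: float, fit_threads: int = 1024) -> float:
    """Hypothetical hit rate: 1 up to fit_threads, then (fit_threads / n)^q.

    q > 1      : strong dependence on the number of threads
    0 < q <= 1 : weak dependence
    q == 0     : hit rate does not depend on the number of threads
    """
    if n_threads <= fit_threads:
        return 1.0
    return (fit_threads / n_threads) ** q


# Compare the three classes at four times the number of threads that fit.
for label, q in (("strong", 2.0), ("weak", 0.5), ("independent", 0.0)):
    print(label, hit_rate(4096, q))
```

At the same thread count, the strong class has lost most of its hits while the weak class still hits half the time, which is the kind of difference that separates the performance curves of the three families.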
Workload Parallelism Impact
Simulation results from the PARSEC workloads kit
Canneal: not enough parallelism available
[Figure: Canneal; performance (GOPS) and cache hit rate (%) vs. number of threads (0 to 2000); curves: simulation, analytical model, cache hit rate]
Summary
A high-level model for many-core engines
A unified framework for machines and workloads from across the range
A vehicle to derive intuition
Qualitative study of the tradeoffs
A tool to understand the impact of parameters
Identifies new behaviors and the applications that exhibit them
Enables reasoning about complex phenomena
First step towards escaping the valley
Thank you! [email protected]
Backup
Model Parameters
Machine parameters:
Parameter    Description
N_PE         Number of PEs (in-order processing elements)
S_$          Cache size [Bytes]
N_max        Maximal number of thread contexts in the register file
CPI_exe      Average number of cycles required to execute an instruction assuming a perfect (zero-latency) memory system [cycles]
f            Processor frequency [Hz]
t_$          Cache latency [cycles]
t_m          Memory latency [cycles]
BW_max       Maximal off-chip bandwidth [GB/sec]
b_reg        Operand size [Bytes]
Workload parameters:
Parameter    Description
n            Number of threads that execute or are in ready state (not blocked) concurrently
r_m          Fraction of instructions accessing memory out of the total number of instructions [0 ≤ r_m ≤ 1]
P_hit(s, n)  Cache hit rate for each thread, when n threads are using a cache of size s
Power parameters:
Parameter      Description
e_ex           Energy per operation [J]
e_$            Energy per cache access [J]
e_mem          Energy per memory access [J]
Power_leakage  Leakage power [W]
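One way to keep these parameters together when experimenting with the model is a pair of small Python dataclasses. This is only an organizational sketch: the class names, field names, and sample values are hypothetical, and the power parameters are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineParams:
    n_pe: int            # N_PE: number of in-order processing elements
    cache_bytes: int     # S_$: cache size [Bytes]
    n_max: int           # N_max: thread contexts in the register file
    cpi_exe: float       # CPI_exe: cycles per instruction with perfect memory
    f_hz: float          # f: processor frequency [Hz]
    t_cache: float       # t_$: cache latency [cycles]
    t_mem: float         # t_m: memory latency [cycles]
    bw_max_gb_s: float   # BW_max: maximal off-chip bandwidth [GB/sec]
    b_reg: int           # b_reg: operand size [Bytes]

    def peak_ops(self) -> float:
        """Peak performance N_PE * f / CPI_exe in operations per second."""
        return self.n_pe * self.f_hz / self.cpi_exe


@dataclass(frozen=True)
class WorkloadParams:
    n_threads: int       # n: threads executing or ready concurrently
    r_m: float           # fraction of instructions that access memory

    def __post_init__(self):
        # The model defines r_m as a fraction in [0, 1].
        if not 0 <= self.r_m <= 1:
            raise ValueError("r_m must be in [0, 1]")


# Assumed example values, not taken from the slides.
machine = MachineParams(1024, 16 * 2**20, 65536, 1.0, 1e9, 1.0, 200.0, 64.0, 4)
workload = WorkloadParams(n_threads=4096, r_m=0.25)
print(machine.peak_ops(), workload)
```

Freezing the dataclasses keeps a parameter set from being modified halfway through a sweep, and the range check catches an invalid memory-access ratio at construction time.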
Model Validation, PARSEC Workloads
[Figure: six panels (Raytrace, Dedup, Canneal, Bodytrack, Swaptions, Blackscholes), each plotting performance (GOPS) and cache hit rate (%) vs. number of threads for the analytical model and the simulation; the thread axis runs from 0 to 2000, except Bodytrack, which runs from 0 to 200]
Related Work
Similar approach of using high-level models:
Morad et al., CA-Letters 2005
Hill and Marty, IEEE Computer 2008
Eyerman and Eeckhout, ISCA 2010
Related Work (cont.)
Agrawal, TPDS 1992
Saavedra-Barrera and Culler, Berkeley 1991
Sorin et al., ISCA 1998
Hong and Kim, ISCA 2009
Baghsorkhi et al., PPoPP 2010